本篇博文主要内容为 2026-01-21 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2026-01-21)
今日共更新1514篇论文,其中:
- 自然语言处理共247篇(Computation and Language (cs.CL))
- 人工智能共425篇(Artificial Intelligence (cs.AI))
- 计算机视觉共318篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共383篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理蒸馏过程中,教师模型生成的长链思维轨迹(long chain-of-thought, CoT)与学生模型之间的数据-学生适配性(data-student suitability)问题。现有方法主要依赖学生模型的似然度来评估轨迹适配性,倾向于选择与当前学生行为高度一致但可能信息量不足的轨迹,从而限制了蒸馏效果。论文提出的关键解决方案是引入Rank-Surprisal Ratio (RSR)——一种同时衡量轨迹与学生模型对齐程度和信息丰富性的新指标。RSR的核心思想在于:有效的推理轨迹通常具有较低的绝对概率(即高困惑度)但相对较高的token排名(即高秩),这反映了学习信号强度与行为一致性之间的平衡。RSR定义为轨迹平均token秩与其平均负对数似然之比,计算简单且可解释性强,在五种学生模型和11个教师模型生成的轨迹上展现出与最终训练性能高度相关的特性(平均Spearman相关系数0.86),显著优于现有指标,并在轨迹选择和教师选择中表现出实用价值。
链接: https://arxiv.org/abs/2601.14249
作者: Yuming Yang,Mingyoung Lai,Wanxu Zhao,Xiaoran Fan,Zhiheng Xi,Mingqi Wu,Chiyue Huang,Jun Zhao,Haijun Lv,Jian Tong,Yunhua Zhou,Yicheng Zou,Qipeng Guo,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室); University of Toronto (多伦多大学); University of Sydney (悉尼大学)
类目: Computation and Language (cs.CL)
备注: 26 pages. Project page: this https URL
Abstract:Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model’s current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory’s average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
zh
[NLP-1] Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)训练中rollout阶段占用超过70%总训练时间的计算效率瓶颈问题,尤其是针对现有FP8量化RL训练策略在长时序任务和复杂场景下出现严重训练不稳定与精度崩溃的问题。其解决方案的关键在于提出Jet-RL框架,采用统一的FP8精度流程进行训练与rollout,从而显著减少训练与推理之间的数值不匹配,消除对低效跨步骤校准的依赖,并实现更稳定、高效的RL优化。
链接: https://arxiv.org/abs/2601.14243
作者: Haocheng Xi,Charlie Ruan,Peiyuan Liao,Yujun Lin,Han Cai,Yilong Zhao,Shuo Yang,Kurt Keutzer,Song Han,Ligeng Zhu
机构: NVIDIA(英伟达); MIT(麻省理工学院); UC Berkeley(加州大学伯克利分校); Stanford University(斯坦福大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 6 figures, 4 tables
Abstract:Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
zh
[NLP-2] APEX-Agents
【速读】: 该论文旨在解决当前AI代理(Agent)在执行复杂、长周期跨应用任务时缺乏统一评估标准的问题,尤其聚焦于金融与法律等专业领域中由投资银行分析师、管理咨询顾问和企业律师设计的实际工作流程。其解决方案的关键在于提出并开源了AI Productivity Index for Agents (APEX-Agents) 基准测试平台,该平台包含480个真实场景任务,要求代理在包含文件和工具的现实工作环境中完成多步骤操作,并通过Pass@1指标衡量任务成功率。此外,作者还发布了用于代理执行与评估的基础设施Archipelago,从而为后续研究提供可复现、可扩展的评测框架。
链接: https://arxiv.org/abs/2601.14242
作者: Bertie Vidgen,Austin Mann,Abby Fennelly,John Wright Stanly,Lucas Rothman,Marco Burstein,Julien Benchek,David Ostrofsky,Anirudh Ravichandran,Debnil Sur,Neel Venugopal,Alannah Hsia,Isaac Robinson,Calix Huang,Olivia Varones,Daniyal Khan,Michael Haines,Zach Richards,Chirag Mahapatra,Brendan Foody,Osvald Nitski
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
zh
[NLP-3] MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems
【速读】: 该论文旨在解决多智能体系统(Multi-agent Systems, MAS)中存在的两个核心问题:一是人格坍缩(persona collapse),即智能体在交互中丧失个性化特征,退化为通用助手行为;二是社交谄媚(social sycophancy),表现为对话冗余且缺乏建设性。解决方案的关键在于提出MASCOT框架,其创新性地采用双层优化策略:首先通过基于强化学习的人格感知行为对齐(Persona-Aware Behavioral Alignment),利用RLAIF驱动的微调流程确保个体智能体严格保持预设人格一致性;其次通过协作对话优化(Collaborative Dialogue Optimization),以群体层面奖励引导元策略,促进多样且富有成效的集体互动。该方法显著提升了人格一致性与社会贡献度,在心理支持和职场场景中优于现有最优基线模型。
链接: https://arxiv.org/abs/2601.14230
作者: Yiyang Wang,Yiqiao Jin,Alex Cabral,Josiah Hester
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 15 pages, 9 figures
Abstract:Multi-agent systems (MAS) have recently emerged as promising socio-collaborative companions for emotional and cognitive support. However, these systems frequently suffer from persona collapse–where agents revert to generic, homogenized assistant behaviors–and social sycophancy, which produces redundant, non-constructive dialogue. We propose MASCOT, a generalizable framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that finetunes individual agents for strict persona fidelity to prevent identity loss; and 2) Collaborative Dialogue Optimization, a meta-policy guided by group-level rewards to ensure diverse and productive discourse. Extensive evaluations across psychological support and workplace domains demonstrate that MASCOT significantly outperforms state-of-the-art baselines, achieving improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution. Our framework provides a practical roadmap for engineering the next generation of socially intelligent multi-agent systems.
zh
[NLP-4] Generalization and Completeness of Stochastic Local Search Algorithms
【速读】: 该论文试图解决的问题是:如何统一描述和分析随机局部搜索(Stochastic Local Search, SLS)类启发式算法的本质特性及其计算能力边界。其解决方案的关键在于提出一个通用的正式模型,该模型由两部分构成:一是尽可能大的共性结构(common structure),用于捕获所有SLS方法的共享行为;二是尽可能小的参数化结构(parametric structure),通过不同实例化方式生成具体的算法(如遗传算法GA、蚁群优化ACO、粒子群优化PSO)。基于此框架,作者进一步证明了SLS算法的一般图灵完备性(Turing-completeness),即存在一种可模拟任意图灵机的遗传算法,从而表明对SLS算法输入与输出之间非平凡性质的判定问题是不可判定的,这为理解这类算法的理论极限提供了坚实基础。
链接: https://arxiv.org/abs/2601.14212
作者: Daniel Loscos,Narciso Marti-Oliet,Ismael Rodriguez
机构: Universidad Complutense de Madrid (马德里康普顿斯大学); Instituto de Tecnologías del Conocimiento (知识技术研究所)
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL)
备注: This paper was published in Swarm and Evolutionary Computation. The present version is the author’s accepted manuscript
Abstract:We generalize Stochastic Local Search (SLS) heuristics into a unique formal model. This model has two key components: a common structure designed to be as large as possible and a parametric structure intended to be as small as possible. Each heuristic is obtained by instantiating the parametric part in a different way. Particular instances for Genetic Algorithms (GA), Ant Colony Optimization (ACO), and Particle Swarm Optimization (PSO) are presented. Then, we use our model to prove the Turing-completeness of SLS algorithms in general. The proof uses our framework to construct a GA able to simulate any Turing machine. This Turing-completeness implies that determining any non-trivial property concerning the relationship between the inputs and the computed outputs is undecidable for GA and, by extension, for the general set of SLS methods (although not necessarily for each particular method). Similar proofs are more informally presented for PSO and ACO.
zh
[NLP-5] HALT: Hallucination Assessment via Latent Testing
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉(Hallucination)问题,即模型在生成回答时尽管内部表征可能包含对查询的不确定性,但因解码压力仍输出流畅却错误的内容。其核心解决方案是提出一种轻量级残差探针(lightweight residual probes),直接从问题token的中间隐藏状态中读取幻觉风险信号,而非依赖最终解码阶段的信息。该探针是一个计算成本远低于生成token的小型辅助网络,可在推理过程中与主模型并行执行,在低风险情况下实现近乎零延迟的幻觉风险评估。通过将该探针作为代理批评者(agentic critic)用于快速选择性生成和路由,模型能立即响应高置信度查询,同时将不确定请求转交给更强的验证管道,从而提升整体系统的可靠性与效率。
链接: https://arxiv.org/abs/2601.14210
作者: Rohan Bhatnagar,Youran Sun,Chi Andrew Zhang,Yixin Wen,Haizhao Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Hallucination in large language models (LLMs) can be understood as a failure of faithful readout: although internal representations may encode uncertainty about a query, decoding pressures still yield a fluent answer. We propose lightweight residual probes that read hallucination risk directly from intermediate hidden states of question tokens, motivated by the hypothesis that these layers retain epistemic signals that are attenuated in the final decoding stage. The probe is a small auxiliary network whose computation is orders of magnitude cheaper than token generation and can be evaluated fully in parallel with inference, enabling near-instantaneous hallucination risk estimation with effectively zero added latency in low-risk cases. We deploy the probe as an agentic critic for fast selective generation and routing, allowing LLMs to immediately answer confident queries while delegating uncertain ones to stronger verification pipelines. Across four QA benchmarks and multiple LLM families, the method achieves strong AUROC and AURAC, generalizes under dataset shift, and reveals interpretable structure in intermediate representations, positioning fast internal uncertainty readout as a principled foundation for reliable agentic AI.
zh
[NLP-6] InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)推理能力时存在的信用分配(credit assignment)问题:标准RL仅根据最终结果给予奖励或惩罚,导致正确中间步骤可能因最终错误而被削弱,而错误的中间步骤也可能因最终正确而被误强化。为解决此问题,论文提出干预训练(Intervention Training, InT)这一新范式,其关键在于让模型在自身推理轨迹中进行细粒度信用分配——通过识别首个错误并提出单步干预来引导轨迹向更高奖励方向演进;随后利用监督微调(Supervised Fine-Tuning, SFT)将至错误点的轨迹与干预动作拼接,从而精准定位失败步骤,显著改善RL训练的初始条件,最终在IMO-AnswerBench数据集上实现近14%的准确率提升。
链接: https://arxiv.org/abs/2601.14209
作者: Matthew Y. R. Yang,Hao Bai,Ian Wu,Gene Yang,Amrith Setlur,Aviral Kumar
机构: Carnegie Mellon University (卡内基梅隆大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
zh
[NLP-7] oward Efficient Agents : Memory Tool learning and Planning
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)代理系统在实际部署中普遍被忽视的效率问题,尤其是在记忆管理、工具学习和规划这三个核心组件上的资源消耗(如延迟、token使用量、执行步骤数等)。其解决方案的关键在于从两个互补维度衡量效率:一是固定成本预算下比较不同方法的有效性,二是相同有效性水平下比较所需成本,并通过帕累托前沿(Pareto frontier)视角分析二者权衡。文中总结了多种高效策略,包括通过压缩与管理机制限制上下文长度、设计强化学习奖励以最小化工具调用次数,以及采用受控搜索机制提升整体效率,从而为构建更高效、实用的代理系统提供理论指导与实践路径。
链接: https://arxiv.org/abs/2601.14192
作者: Xiaofang Yang,Lijun Li,Heng Zhou,Tong Zhu,Xiaoye Qu,Yuchen Fan,Qianshan Wei,Rui Ye,Li Kang,Yiran Qin,Zhiqiang Kou,Daizong Liu,Qi Li,Ning Ding,Siheng Chen,Jing Shao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 35 pages, 200 references
Abstract:Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc. Aimed at conducting comprehensive research addressing the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.
zh
[NLP-8] A model of errors in transformers
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在需要确定性输出的任务(如算术运算)中出现的错误率问题,这类任务通常涉及对少量候选 token 的重复处理。其核心问题是:为何在看似简单且结构化明确的任务中,LLMs 仍会因微小的注意力机制误差累积而产生显著错误。解决方案的关键在于提出一个基于“有效场论”视角的两参数模型,将复杂的模型参数重新组织为两个可解释的物理量——基础噪声率(elementary noise rate)和可能被错误预测的合理 token 数量,从而定量描述准确率与任务复杂度之间的关系。该模型不仅在多个主流模型(Gemini 2.5 Flash、Gemini 2.5 Pro 和 DeepSeek R1)上得到广泛验证,还表明错误并非源于推理能力崩溃或组合性表达失败,而是可通过优化提示(prompt)设计来降低。
链接: https://arxiv.org/abs/2601.14175
作者: Suvrat Raju,Praneeth Netrapalli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); High Energy Physics - Theory (hep-th)
备注: 8+17pages
Abstract:We study the error rate of LLMs on tasks like arithmetic that require a deterministic output, and repetitive processing of tokens drawn from a small set of alternatives. We argue that incorrect predictions arise when small errors in the attention mechanism accumulate to cross a threshold, and use this insight to derive a quantitative two-parameter relationship between the accuracy and the complexity of the task. The two parameters vary with the prompt and the model; they can be interpreted in terms of an elementary noise rate, and the number of plausible erroneous tokens that can be predicted. Our analysis is inspired by an effective field theory'' perspective: the LLM's many raw parameters can be reorganized into just two parameters that govern the error rate. We perform extensive empirical tests, using Gemini 2.5 Flash, Gemini 2.5 Pro and DeepSeek R1, and find excellent agreement between the predicted and observed accuracy for a variety of tasks, although we also identify deviations in some cases. Our model provides an alternative to suggestions that errors made by LLMs on long repetitive tasks indicate the collapse of reasoning’‘, or an inability to express ``compositional’’ functions. Finally, we show how to construct prompts to reduce the error rate.
zh
[NLP-9] Human Values in a Single Sentence: Moral Presence Hierarchies and Transformer Ensembles on the Schwartz Continuum
【速读】: 该论文旨在解决文本中人类价值观的细粒度句子级识别问题,具体聚焦于Schwartz动机连续体中的19个价值维度检测。由于输入语料(新闻和政治纲领)缺乏明显的道德线索且类别分布严重失衡,使得这一任务即使对先进的神经模型也极具挑战性。解决方案的关键在于:首先通过定义一个二分类的道德存在任务(“是否存在任何价值?”)验证了单句层面可学习性;其次在计算资源受限条件下(单卡8GB显存),对比了基于DeBERTa-base的层级门控结构与直接多标签分类器的表现,发现后者更优,说明层级门控的召回率限制了下游收益;最后引入轻量信号(如先验句上下文、LIWC-22/eMFD/MJD词典及主题特征)和简单集成策略(软投票监督集成),显著提升性能(宏F1达0.332),优于现有英文基线。研究表明,在7–9B参数规模下,精心调优的监督编码器结合轻量信号和小集成,是结构化人类价值观检测的有效且高效的基准方案。
链接: https://arxiv.org/abs/2601.14172
作者: Víctor Yeste,Paolo Rosso
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL , 37 pages, 4 figures,
Abstract:We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting - out-of-context sentences from news and political manifestos - features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task (“does any value appear?”) and show that it is learnable from single sentences (positive-class F1 \approx 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs - Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B - in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit. We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.
zh
[NLP-10] Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在法律等专业领域中因专家知识有限而导致的事实性错误或幻觉问题。其解决方案的关键在于提出一种新颖的合成数据生成方法,通过从权威德国民法典文本中系统性地生成高质量、多样化且法律准确的问答对,并结合严格的自动化过滤机制与参数高效的微调技术,显著提升了LLMs在德语法律问答任务上的表现。
链接: https://arxiv.org/abs/2601.14160
作者: Ali Hamza Bashir,Muhammad Rehan Khalid,Kostadin Cvejoski,Jana Birr,Jule Berghaus,Armin Berger,Sandra Halscheidt,Christian Temath,Rafet Sifa,David Berghaus
机构: Fraunhofer IAIS (弗劳恩霍夫信息与通信技术研究所); Georg-August-University Göttingen (哥廷根大学); Lamarr Institute (拉马尔研究所); University of Bonn (波恩大学); JetBrains Research (JetBrains 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge, resulting in factually incorrect outputs or hallucinations. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach. In contrast to costly human-annotated resources or unreliable synthetic alternatives, our approach systematically produces high-quality, diverse, and legally accurate question-answer pairs directly from authoritative German statutes. Using rigorous automated filtering methods and parameter-efficient fine-tuning techniques, we demonstrate that LLMs adapted with our synthetic dataset significantly outperform their baseline counterparts on German legal question answering tasks. Our results highlight the feasibility of using carefully designed synthetic data as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains.
zh
[NLP-11] Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models
【速读】: 该论文旨在解决大语言模型在多选题问答任务中对提示(prompt)结构敏感性的机制问题,特别是为何将上下文置于问题和选项之前的顺序(CQO)显著优于问题和选项在前的顺序(QOC)。其解决方案的关键在于识别出因果注意力(causal attention)是导致性能差异的核心机制:在QOC结构中,因果掩码(causal mask)阻止选项标记关注上下文信息,从而形成信息瓶颈,使上下文对选项不可见,进而降低模型表现。
链接: https://arxiv.org/abs/2601.14152
作者: Hyunjong Ok,Jaeho Lee
机构: POSTECH(浦项科技大学); HJ AILAB(人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: preprint
Abstract:Large language models exhibit surprising sensitivity to the structure of the prompt, but the mechanisms underlying this sensitivity remain poorly understood. In this work, we conduct an in-depth investigation on a striking case: in multiple-choice question answering, placing context before the questions and options (CQO) outperforms the reverse order (QOC) by over 14%p, consistently over a wide range of models and datasets. Through systematic architectural analysis, we identify causal attention as the core mechanism: in QOC prompts, the causal mask prevents option tokens from attending to context, creating an information bottleneck where context becomes invisible to options.
zh
[NLP-12] he Side Effects of Being Smart: Safety Risks in MLLM s Multi-Image Reasoning
【速读】: 该论文旨在解决多图像推理安全问题,即随着多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理复杂多图像指令时推理能力的增强,可能引入新的安全风险。其解决方案的关键在于构建首个专注于多图像推理安全的基准测试工具MIR-SafetyBench,该基准包含2,676个实例,覆盖9类多图像关系;并通过系统评估19个MLLMs发现,更强的多图像推理能力与更高的安全脆弱性正相关,且许多看似“安全”的响应实为表层应对或回避性回答,而 unsafe 生成通常表现出更低的注意力熵(attention entropy),提示模型可能因过度聚焦任务完成而忽视安全约束。
链接: https://arxiv.org/abs/2601.14127
作者: Renmiao Chen,Yida Lu,Shiyao Cui,Xuan Ouyang,Victor Shea-Jay Huang,Shumin Zhang,Chengwei Pan,Han Qiu,Minlie Huang
机构: CoAI group, DCST, Tsinghua University (清华大学); Beihang University (北京航空航天大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: *15 pages, 5 figures. Introduces MIR-SafetyBench (2,676 instances; 9 multi-image relations). Equal contribution; †Corresponding author. Code/data: this https URL
Abstract:As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at this https URL.
zh
[NLP-13] Style Transfer as Bias Mitigation: Diffusion Models for Synthetic Mental Health Text for Arabic
【速读】: 该论文旨在解决心理健康分析中因数据稀缺和人口统计学偏差(如性别不平衡)导致的模型性能受限问题,尤其针对阿拉伯语心理文本中存在的显著性别失衡现象。其解决方案的关键在于提出一种无需预训练的大规模语言模型(Large Language Models, LLMs)的扩散模型(diffusion-based)方法,将偏见缓解建模为风格迁移(style transfer)任务,通过男到女的风格转换生成高熵、语义忠实的合成文本,从而有效增强代表性不足的女性作者内容,实现对敏感低资源场景下性别偏见的可控干预。
链接: https://arxiv.org/abs/2601.14124
作者: Saad Mankarious,Aya Zirikly
机构: George Washington University (乔治华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.
zh
[NLP-14] A Systematic Analysis of Chunking Strategies for Reliable Question Answering
【速读】: 该论文旨在解决文档分块(document chunking)策略选择对工业级检索增强生成(Retrieval-Augmented Generation, RAG)系统可靠性影响的不确定性问题。当前实践中多依赖启发式方法,缺乏系统性评估。其解决方案的关键在于通过在Natural Questions数据集上进行端到端实验,系统性地对比不同分块方式(基于token、句子、语义和代码)、分块大小、重叠程度及上下文长度对RAG性能的影响,从而得出可操作的部署建议:例如,重叠无显著收益且增加索引成本;句子分块在约5k token内与语义分块效果相当;存在“上下文悬崖”现象,超过2.5k token后性能显著下降;最优上下文长度取决于任务目标——语义质量在小上下文中达到峰值,而精确匹配则需更大上下文。
链接: https://arxiv.org/abs/2601.14123
作者: Sofia Bennani,Charles Moslonka
机构: École polytechnique (巴黎综合理工学院); Artefact Research Center (艺术装置研究中心); MICS, CentraleSupélec, Université Paris-Saclay (MICS,国立高等先进技术学校,巴黎-萨克雷大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 3 pages, 2 figures, 1 table, pre-print
Abstract:We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies chunking method (token, sentence, semantic, code), chunk size, overlap, and context length. We use a standard industrial setup: SPLADE retrieval and a Mistral-8B generator. We derive actionable lessons for cost-efficient deployment: (i) overlap provides no measurable benefit and increases indexing cost; (ii) sentence chunking is the most cost-effective method, matching semantic chunking up to ~5k tokens; (iii) a “context cliff” reduces quality beyond ~2.5k tokens; and (iv) optimal context depends on the goal (semantic quality peaks at small contexts; exact match at larger ones).
zh
[NLP-15] NewsRECON: News article REtrieval for image CONtextualization
【速读】: 该论文旨在解决新闻图像的时间与地理位置溯源问题(image geolocation and timestamping),尤其在无法使用反向图像搜索(Reverse Image Search, RIS)工具的场景下,传统方法失效,导致新闻真实性验证受限。解决方案的关键在于提出 NewsRECON 方法,其核心包括:(1)利用一个双编码器(bi-encoder)从包含超过 90,000 篇新闻文章的语料库中检索与事件相关的文章;(2)引入两个交叉编码器(cross-encoders)对候选文章进行重排序,分别基于地点一致性和事件一致性提升精度;该方法可在无 RIS 证据时仍有效推断图像的日期与位置,并可与多模态大语言模型结合,实现当前最优性能(SOTA)。
链接: https://arxiv.org/abs/2601.14121
作者: Jonathan Tonglet,Iryna Gurevych,Tinne Tuytelaars,Marie-Francine Moens
机构: TU Darmstadt (达姆施塔特工业大学); National Research Center for Applied Cybersecurity ATHENE (应用网络安全国家研究中心 ATHENE); KU Leuven (鲁汶大学)
类目: Computation and Language (cs.CL)
备注: Preprint under review. Code available at this https URL
Abstract:Identifying when and where a news image was taken is crucial for journalists and forensic experts to produce credible stories and debunk misinformation. While many existing methods rely on reverse image search (RIS) engines, these tools often fail to return results, thereby limiting their practical applicability. In this work, we address the challenging scenario where RIS evidence is unavailable. We introduce NewsRECON, a method that links images to relevant news articles to infer their date and location from article metadata. NewsRECON leverages a corpus of over 90,000 articles and integrates: (1) a bi-encoder for retrieving event-relevant articles; (2) two cross-encoders for reranking articles by location and event consistency. Experiments on the TARA and 5Pils-OOC show that NewsRECON outperforms prior work and can be combined with a multimodal large language model to achieve new SOTA results in the absence of RIS evidence. We make our code available.
zh
[NLP-16] Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
【速读】: 该论文旨在解决生成式 AI(Generative AI)在高风险场景中因模型黑箱特性导致的可解释性不足问题,尤其针对基于 Transformer 的模型在医疗、法律和金融等领域部署时缺乏透明度与问责机制的挑战。现有方法如基于注意力机制的解释工具依赖人工设计的聚合策略和固定归因规则,而模型无关方法(如 LIME、SHAP)则存在计算开销大等问题。论文提出 Explanation Network (ExpNet),其核心创新在于构建一个轻量级神经网络,自动学习从 Transformer 注意力模式到词元级重要性分数的显式映射关系,从而无需预设规则即可发现最优注意力特征组合,实现更灵活、高效且可泛化的解释能力。
链接: https://arxiv.org/abs/2601.14112
作者: George Mihaila
机构: University of North Texas (北德克萨斯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformers self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015); (Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a); (Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet discovers optimal attention feature combinations automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.
zh
[NLP-17] ruth with a Twist: The Rhetoric of Persuasion in Professional vs. Community-Authored Fact-Checks WWW2026
【速读】: 该论文旨在解决一个关键问题:社区生成的辟谣内容(crowd-generated debunks)是否比专业机构制作的辟谣内容更依赖说服性修辞策略,从而可能影响其可信度与有效性。为回答此问题,研究者构建了一个大规模数据集,涵盖来自Community Notes(CNs)、EUvsDisinfo 和 Database of Known Fakes(DBKF)的多种辟谣文本,并系统量化了其中各类说服技巧(persuasion techniques)的使用频率和类型。解决方案的关键在于通过实证分析发现:尽管存在关于社区内容更倾向使用主观或情感化语言的假设,但数据显示社区辟谣与专业辟谣在平均说服技巧数量上并无显著差异;同时,研究进一步揭示了两类辟谣在修辞风格上的系统性差异,反映了不同机构规范与议题覆盖的差异,并验证了用户对特定不当修辞手段具有识别能力,即群体评价机制能有效惩罚不恰当的说服策略。这一发现有助于厘清公众参与式辟谣的有效边界及其优化路径。
链接: https://arxiv.org/abs/2601.14105
作者: Olesya Razuvayevskaya,Kalina Bontcheva
机构: The University of Sheffield(谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注: In Proceedings of the ACM Web Conference 2026 (WWW 2026)
Abstract:This study presents the first large-scale comparison of persuasion techniques present in crowd- versus professionally-written debunks. Using extensive datasets from Community Notes (CNs), EUvsDisinfo, and the Database of Known Fakes (DBKF), we quantify the prevalence and types of persuasion techniques across these fact-checking ecosystems. Contrary to prior hypothesis that community-produced debunks rely more heavily on subjective or persuasive wording, we find no evidence that CNs contain a higher average number of persuasion techniques than professional fact-checks. We additionally identify systematic rhetorical differences between CNs and professional debunking efforts, reflecting differences in institutional norms and topical coverage. Finally, we examine how the crowd evaluates persuasive language in CNs and show that, although notes with more persuasive elements receive slightly higher overall helpfulness ratings, crowd raters are effective at penalising the use of particular problematic rhetorical means
zh
[NLP-18] DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning
【速读】: 该论文旨在解决当前医学视觉语言模型(Vision-Language Models, VLMs)在皮肤科应用中评估不足的问题,尤其是现有数据集多局限于图像级分类任务(如病变识别),难以全面评估模型的视觉理解能力、语言定位能力及临床推理能力。解决方案的关键在于构建一个由临床医生标注的皮肤科视觉问答(Visual Question Answering, VQA)基准——DermaBench,其基于多样化的皮肤科图像数据集(DDI),包含656张来自570名独特患者的临床图像,并采用分层标注方案(涵盖22个主问题类型),覆盖诊断、解剖部位、病变形态、分布、表面特征、颜色及图像质量等维度,同时提供开放式叙述描述与总结,共生成约14,474条VQA风格标注。该基准以元数据形式发布,尊重上游许可协议,公开可用,为评估和推动皮肤科VLM的多模态理解与临床实用性提供了高质量评估工具。
链接: https://arxiv.org/abs/2601.14084
作者: Abdurrahim Yilmaz,Ozan Erdem,Ece Gokyayla,Ayda Acar,Burc Bugra Dagtas,Dilara Ilhan Erdil,Gulsum Gencoglan,Burak Temelkuran
机构: Imperial College London (帝国理工学院); Istanbul Medeniyet University (伊斯坦布尔梅德尼耶特大学); Usak Research and Training Hospital (乌沙克研究与培训医院); Istanbul Research and Training Hospital (伊斯坦布尔研究与培训医院); Ipswich Hospital (伊普斯维奇医院); Medicana Atakoy Hospital (麦迪卡纳阿塔科伊医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are required to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14.474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.
zh
[NLP-19] XCR-Bench: A Multi-Task Benchmark for Evaluating Cultural Reasoning in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨文化交际中识别与适配文化特定项(Culture-Specific Items, CSIs)能力不足的问题,尤其针对现有评估数据稀缺且缺乏多维度文化要素标注的局限性。其解决方案的关键在于构建XCR-Bench基准,该基准包含4.9k组平行句和1,098个独特的CSIs,涵盖三种推理任务及对应评价指标,并融合Newmark的CSI框架与Hall的文化三元组(Hall’s Triad of Culture),从而实现对显性、半显性和隐性文化要素(如社会规范、信念与价值观)的系统性分析。这一设计使模型在文化适应中的区域与族裔宗教偏见也能被有效检测,为跨文化自然语言处理研究提供了高质量的评估工具与理论支撑。
链接: https://arxiv.org/abs/2601.14063
作者: Mohsinul Kabir,Tasnim Ahmed,Md Mezbaur Rahman,Shaoxiong Ji,Hassan Alhuzali,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学); ELLIS Manchester (ELLIS 曼彻斯特); Queen’s University (皇后大学); University of Illinois Chicago (伊利诺伊大学芝加哥分校); ELLIS Institute Finland (ELLIS 芬兰研究所); University of Turku (图尔库大学); Umm Al-Qura University (乌姆库拉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 30 Pages, 13 Figures
Abstract:Cross-cultural competence in large language models (LLMs) requires the ability to identify Culture-Specific Items (CSIs) and to adapt them appropriately across cultural contexts. Progress in evaluating this capability has been constrained by the scarcity of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. To address this limitation, we introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark consisting of 4.9k parallel sentences and 1,098 unique CSIs, spanning three distinct reasoning tasks with corresponding evaluation metrics. Our corpus integrates Newmark’s CSI framework with Hall’s Triad of Culture, enabling systematic analysis of cultural reasoning beyond surface-level artifacts and into semi-visible and invisible cultural elements such as social norms, beliefs, and values. Our findings show that state-of-the-art LLMs exhibit consistent weaknesses in identifying and adapting CSIs related to social etiquette and cultural reference. Additionally, we find evidence that LLMs encode regional and ethno-religious biases even within a single linguistic setting during cultural adaptation. We release our corpus and code to facilitate future research on cross-cultural NLP.
zh
[NLP-20] Kakugo: Distillation of Low-Resource Languages into Small Language Models
【速读】: 该论文旨在解决低资源语言(low-resource languages)在训练通用小型语言模型(Small Language Models, SLMs)时面临的高质量数据稀缺与高昂成本问题。解决方案的关键在于提出了一种名为Kakugo的新型、低成本流水线,仅需输入语言名称即可自动构建训练数据:利用大型教师模型(large teacher model)生成合成指令提示(synthetic prompts)并翻译指令数据集,从而为54种低资源语言生成训练数据和对应的SLM。该方法在多个自然语言处理任务(如翻译、分类和问答)中显著优于基础模型,且每语言总生成与训练成本低于50美元,实现了高效、可扩展的语言特定AI开发路径。
链接: https://arxiv.org/abs/2601.14051
作者: Peter Devine,Mardhiyah Sanni,Farid Adilazuarda,Julieta Gil Loizaga,Barry Haddow
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under 50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.
zh
[NLP-21] Understanding Multilingualism in Mixture-of-Experts LLM s: Routing Mechanism Expert Specialization and Layerwise Steering
【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)架构在多语言场景下性能提升机制不明确、跨语言差异缺乏系统理解的问题。其核心挑战在于揭示MoE模型中路由行为(routing behavior)与专家专业化(expert specialization)如何随语言类型和网络深度变化,以及如何利用这些机制优化多语言表现。解决方案的关键在于通过系统性分析发现:早期和晚期MoE层支持语言特异性处理,而中间层作为语言无关的容量枢纽;基于此,提出一种路由引导(routing-guided steering)方法,在推理阶段自适应地将中间层的路由行为导向与主导语言相关的共享专家,从而显著提升多语言性能,尤其对语言相关性强的语言对效果更佳。
链接: https://arxiv.org/abs/2601.14050
作者: Yuxin Chen,Zhengzhou Cai,Xiangtian Ji,Weixiang Zhao,An Zhang,Xiang Wang,Tat-Seng Chua
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) architectures have shown strong multilingual capabilities, yet the internal mechanisms underlying performance gains and cross-language differences remain insufficiently understood. In this work, we conduct a systematic analysis of MoE models, examining routing behavior and expert specialization across languages and network depth. Our analysis reveals that multilingual processing in MoE models is highly structured: routing aligns with linguistic families, expert utilization follows a clear layerwise pattern, and high-resource languages rely on shared experts while low-resource languages depend more on language-exclusive experts despite weaker performance. Layerwise interventions further show that early and late MoE layers support language-specific processing, whereas middle layers serve as language-agnostic capacity hubs. Building on these insights, we propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time, leading to consistent multilingual performance improvements, particularly for linguistically related language pairs. Our code is available at this https URL.
zh
[NLP-22] PRiSM: Benchmarking Phone Realization in Speech Models
【速读】: 该论文旨在解决当前语音识别(Speech Recognition, SR)系统在跨语言语音处理和音位分析中缺乏对音位感知能力的深入评估问题。现有方法仅关注表面的转录准确率,忽视了模型在临床、教育及多语言场景下的实际表现与泛化能力。解决方案的关键在于提出 PRiSM——首个开源基准测试框架,通过内在(intrinsic)和外在(extrinsic)评估方式,标准化基于转录的评测,并引入转录探针(transcription probes)与表示探针(representation probes)来量化语音识别(Phone Recognition, PR)系统的下游应用价值。研究发现,训练阶段的语言多样性是提升PR性能的关键因素,编码器-连接时序分类(Encoder-CTC)模型最为稳定,且专用的PR模型仍优于大型音频语言模型(Large Audio Language Models)。
链接: https://arxiv.org/abs/2601.14046
作者: Shikhar Bharadwaj,Chin-Jou Li,Yoonjae Kim,Kwanghee Choi,Eunjung Yeo,Ryan Soh-Eun Shim,Hanyu Zhou,Brendon Boldt,Karen Rosero Jacome,Kalvin Chang,Darsh Agrawal,Keer Xu,Chao-Han Huck Yang,Jian Zhu,Shinji Watanabe,David R. Mortensen
机构: CMU(卡内基梅隆大学); Gwangju Institute of Science and Technology(光州科学技术院); UT Austin(德克萨斯大学奥斯汀分校); LMU Munich(慕尼黑路德维希马克西米利安大学); UC Berkeley(加州大学伯克利分校); NVIDIA(英伟达); UBC(不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:
Abstract:Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: this https URL.
zh
[NLP-23] op 10 Open Challenges Steering the Future of Diffusion Language Model and Its Variants
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)主流架构——自回归(Auto-regressive, AR)模型所面临的根本性局限,即因果瓶颈导致的全局结构预见能力不足与迭代优化受限问题。为突破这一瓶颈,论文提出以扩散语言模型(Diffusion Language Models, DLMs)为核心的新范式,其关键在于构建一个“扩散原生”(diffusion-native)生态系统,涵盖多尺度标记化(multi-scale tokenization)、主动重掩码(active remasking)和潜在思维(latent thinking)等核心技术,从而实现超越因果视野的复杂结构推理、动态自我修正及无缝多模态融合能力。
链接: https://arxiv.org/abs/2601.14041
作者: Yunhe Wang,Kai Han,Huiling Zhen,Yuchuan Tian,Hanting Chen,Yongbing Huang,Yufei Cui,Yingte Shu,Shan Gao,Ismail Elezi,Roy Vaughan Miles,Songcen Xu,Feng Wen,Chao Xu,Sinan Zeng,Dacheng Tao
机构: Huawei(华为)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential brick-by-brick'' process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning that prevent DLMs from reaching their GPT-4 moment’'. We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.
zh
[NLP-24] RM-Distiller: Exploiting Generative LLM for Reward Model Distillation
【速读】: 该论文旨在解决当前基于生成式大语言模型(Generative LLMs)进行奖励模型(Reward Model, RM)蒸馏时,现有方法仅将教师模型视为简单的二分类标注器,未能充分挖掘其多维能力的问题。解决方案的关键在于提出RM-Distiller框架,系统性地利用教师模型的三种核心能力:(1)精炼能力(Refinement Capability),用于生成细粒度且具有对比性的响应对以增强信号质量;(2)评分能力(Scoring Capability),通过边际感知优化目标引导RM准确捕捉偏好强度;(3)生成能力(Generation Capability),借助教师模型的生成分布正则化RM,保留其基础语言知识。实验表明,该方法在RM基准和强化学习对齐任务中显著优于传统蒸馏策略,验证了多维教师能力挖掘对有效奖励建模的重要性。
链接: https://arxiv.org/abs/2601.14032
作者: Hongli Zhou,Hui Huang,Wei Liu,Chenglong Wang,Xingyuan Bu,Lvyuan Han,Fuhai Song,Muyun Yang,Wenhao Jiang,Hailong Cao,Tiejun Zhao
机构: Harbin Institute of Technology (哈尔滨工业大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Due to the difficulty of obtaining high-quality human preference annotations, distilling preferences from generative LLMs has emerged as a standard practice. However, existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit the rich knowledge and capabilities for RM distillation. To address this, we propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs: (1) Refinement capability, which synthesizes highly correlated response pairs to create fine-grained and contrastive signals. (2) Scoring capability, which guides the RM in capturing precise preference strength via a margin-aware optimization objective. (3) Generation capability, which incorporates the teacher’s generative distribution to regularize the RM to preserve its fundamental linguistic knowledge. Extensive experiments demonstrate that RM-Distiller significantly outperforms traditional distillation methods both on RM benchmarks and reinforcement learning-based alignment, proving that exploiting multifaceted teacher capabilities is critical for effective reward modeling. To the best of our knowledge, this is the first systematic research on RM distillation from generative LLMs.
zh
[NLP-25] BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models ACL2026
【速读】: 该论文试图解决的核心问题是:大型语言模型(Large Language Models, LLMs)是否真正理解抽象概念,还是仅将其作为统计模式进行操作。为回答这一问题,作者提出了一种抽象-具象框架(abstraction-grounding framework),将概念理解分解为三种能力:抽象对抽象(A-A)的理解、抽象在具体事件中的具象化(A-C)以及抽象原则对具体决策的调节作用(C-C)。解决方案的关键在于通过人类价值观作为测试基准,结合探测(probing)和引导(steering)技术,发现模型内部存在可跨层级迁移的价值表征,并揭示其因果不对称性——即干预价值表示能显著改变具体判断与决策(A-C, C-C),但不影响抽象解释(A-A),表明抽象值在模型中充当稳定锚点而非易变激活项。这一机制为构建具有透明性、泛化能力和可控性的价值驱动型自主AI系统提供了实证基础与理论支撑。
链接: https://arxiv.org/abs/2601.14007
作者: Junyu Zhang,Yipeng Kang,Jiong Guo,Jiayu Zhan,Junqi Wang
机构: BIGAI; Shandong University; Peking University
类目: Computation and Language (cs.CL)
备注: 34 pagess, 16 figures, 6 tables, submitted to ACL 2026
Abstract:Do large language models (LLMs) genuinely understand abstract concepts, or merely manipulate them as statistical patterns? We introduce an abstraction-grounding framework that decomposes conceptual understanding into three capacities: interpretation of abstract concepts (Abstract-Abstract, A-A), grounding of abstractions in concrete events (Abstract-Concrete, A-C), and application of abstract principles to regulate concrete decisions (Concrete-Concrete, C-C). Using human values as a testbed - given their semantic richness and centrality to alignment - we employ probing (detecting value traces in internal activations) and steering (modifying representations to shift behavior). Across six open-source LLMs and ten value dimensions, probing shows that diagnostic probes trained solely on abstract value descriptions reliably detect the same values in concrete event narratives and decision reasoning, demonstrating cross-level transfer. Steering reveals an asymmetry: intervening on value representations causally shifts concrete judgments and decisions (A-C, C-C), yet leaves abstract interpretations unchanged (A-A), suggesting that encoded abstract values function as stable anchors rather than malleable activations. These findings indicate LLMs maintain structured value representations that bridge abstraction and action, providing a mechanistic and operational foundation for building value-driven autonomous AI systems with more transparent, generalizable alignment and control.
zh
[NLP-26] Locate Steer and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
【速读】: 该论文旨在解决当前机制可解释性(Mechanistic Interpretability, MI)研究中缺乏系统性干预框架的问题,即现有综述多停留在对大语言模型(Large Language Models, LLMs)决策过程的观察与分析层面,未能提供可操作的干预路径。其解决方案的关键在于提出一个结构化的“定位-引导-优化”(Locate, Steer, and Improve)流水线框架,通过明确区分定位(诊断)与引导(干预)方法,并基于特定可解释对象(Interpretable Objects)进行形式化分类,从而建立严谨的干预协议;该框架进一步验证了在对齐性(Alignment)、能力(Capability)和效率(Efficiency)三个维度上实现模型优化的可行性,使MI从理论分析走向实际应用。
链接: https://arxiv.org/abs/2601.14004
作者: Hengyuan Zhang,Zhihao Zhang,Mingyang Wang,Zunhai Su,Yiwei Wang,Qianli Wang,Shuzhou Yuan,Ercong Nie,Xufeng Duan,Qibo Xue,Zeping Yu,Chenming Shang,Xiao Liang,Jing Xiong,Hui Shen,Chaofan Tao,Zhengwu Liu,Senjie Jin,Zhiheng Xi,Dongdong Zhang,Sophia Ananiadou,Tao Gui,Ruobing Xie,Hayden Kwok-Hay So,Hinrich Schütze,Xuanjing Huang,Qi Zhang,Ngai Wong
机构: The University of Hong Kong (香港大学); Fudan University (复旦大学); LMU Munich (慕尼黑路德维希马克西米利安大学); Tsinghua University (清华大学); Technische Universität Darmstadt (达姆施塔特工业大学); Technische Universität Berlin (柏林工业大学); Technische Universität Dresden (德累斯顿工业大学); The Chinese University of Hong Kong (香港中文大学); Nanjing University (南京大学); University of Manchester (曼彻斯特大学); Dartmouth College (达特茅斯学院); University of California, Los Angeles (加州大学洛杉矶分校); University of Michigan (密歇根大学); Microsoft (微软); Tencent (腾讯)
类目: Computation and Language (cs.CL)
备注:
Abstract:Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: “Locate, Steer, and Improve.” We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at this https URL.
zh
[NLP-27] From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning
【速读】: 该论文旨在解决大规模开源数据集在大语言模型(Large Language Model, LLM)指令微调过程中,因现有数据选择方法仅依赖实例级质量评分或粗粒度嵌入聚类/语义标签而导致的细粒度知识缺失与层级依赖关系忽略问题,从而限制了数据价值的精准评估和知识对齐采样。其解决方案的关键在于提出一种树感知对齐全局采样(Tree-aware Aligned Global Sampling, TAGS)框架,通过基于LLM的标签提取器识别原子知识概念,并利用自底向上的层次聚类构建全局知识树;在此基础上,设计树感知指标量化数据质量和多样性,并结合KL散度约束实现叶节点层面的目标域对齐,最终实现对全局质量、多样性和目标一致性三者的联合控制。
链接: https://arxiv.org/abs/2601.13995
作者: Zihan Niu,Wenping Hu,Junmin Chen,Xiyue Wang,Tong Xu,Ruiming Tang
机构: University of Science and Technology of China (中国科学技术大学); Klear Team, Kuaishou Technology (快手科技)
类目: Computation and Language (cs.CL)
备注:
Abstract:Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up hierarchical clustering. By grounding data instances onto this tree, a tree-aware metric then quantifies data quality and diversity, facilitating effective sampling. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL-divergence for specific domains. Extensive experiments demonstrate that TAGS significantly outperforms state-of-the-art baselines. Notably, it surpasses the full-dataset model by \textbf+5.84% using only \textbf5% of the data, while our aligned sampling strategy further boosts average performance by \textbf+4.24%.
zh
[NLP-28] “The Whole Is Greater Than the Sum of Its Parts”: A Compatibility-Aware Multi-Teacher CoT Distillation Framework
【速读】: 该论文旨在解决当前Chain-of-Thought (CoT)推理蒸馏中因依赖单一教师模型而导致的学生模型(SLM)潜力受限问题,尤其是个体大语言模型(LLM)存在能力偏倚及灾难性遗忘风险。解决方案的关键在于提出COMPACT框架,通过动态加权不同教师的梯度来自适应融合多教师监督信号,其核心机制包括:(1) 基于图结构的一致性(Graph-based Consensus)以过滤误导性推理路径;(2) 基于互信息的适应性(Mutual-Information-based Adaptability)识别学生真正理解推理过程的“顿悟时刻”;(3) 基于损失的难度评估(Loss-based Difficulty)衡量学生对教师指导的接受程度,防止负迁移。该方法在多个基准测试中实现了SOTA性能,并有效保留了学生模型原有的知识结构。
链接: https://arxiv.org/abs/2601.13992
作者: Jin Cui,Jiaqi Guo,Jiepeng Zhou,Ruixuan Yang,Jiayi Lu,Jiajun Xu,Jiangcheng Song,Boran Zhao,Pengju Ren
机构: Xi’an Jiaotong University (西安交通大学); Nankai University (南开大学); The Hong Kong University of Science and Technology(Guangzhou) (香港科技大学(广州) ); School of Software Engineering, Xi’an Jiaotong University (西安交通大学软件工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11pages, 9figures
Abstract:Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student’s potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student’s real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect “epiphany moments” for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher’s guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model’s original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
zh
[NLP-29] Automatic Prompt Optimization for Dataset-Level Feature Discovery
【速读】: 该论文旨在解决从非结构化文本中自动提取可解释且具有判别性的特征(feature)的问题,传统方法通常依赖于手工设计的提示(prompt)或固定的特征模式,难以适应多样化的下游分类任务。其核心解决方案是将特征发现建模为一个数据集层面的提示优化问题:通过多智能体框架,语言模型代理协同提出特征定义、提取特征值,并基于数据集级别的性能与可解释性反馈评估特征质量,进而迭代优化指令提示,从而诱导出共享的特征集合而非针对单个样本的预测。该方法突破了以往依赖样本级监督的提示优化范式,提供了一种从无结构文本中自动发现高质量特征的系统性机制。
链接: https://arxiv.org/abs/2601.13922
作者: Adrian Cosma,Oleg Szehr,David Kletz,Alessandro Antonucci,Olivier Pelletier
机构: SUPSI, Dalle Molle Institute for Artificial Intelligence Studies (IDSIA); UBS Switzerland AG and its affiliates
类目: Computation and Language (cs.CL)
备注: 5 Figures, 1 Table
Abstract:Feature extraction from unstructured text is a critical step in many downstream classification pipelines, yet current approaches largely rely on hand-crafted prompts or fixed feature schemas. We formulate feature discovery as a dataset-level prompt optimization problem: given a labelled text corpus, the goal is to induce a global set of interpretable and discriminative feature definitions whose realizations optimize a downstream supervised learning objective. To this end, we propose a multi-agent prompt optimization framework in which language-model agents jointly propose feature definitions, extract feature values, and evaluate feature quality using dataset-level performance and interpretability feedback. Instruction prompts are iteratively refined based on this structured feedback, enabling optimization over prompts that induce shared feature sets rather than per-example predictions. This formulation departs from prior prompt optimization methods that rely on per-sample supervision and provides a principled mechanism for automatic feature discovery from unstructured text.
zh
[NLP-30] HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs
【速读】: 该论文旨在解决当前医疗人工智能(Medical AI)中自动化临床诊断面临的挑战,即现有方法多采用样本孤立的推理范式,仅依赖图像信息进行诊断,而忽视了纵向电子健康记录(EHR)或结构相关患者案例所提供的外部互补医学证据,从而限制了诊断准确性。解决方案的关键在于提出一种名为HyperWalker的深度诊断框架,其核心创新是通过动态超图(dynamic hypergraph)建模EHR数据的结构异质性和多模态临床信息间的高阶关联,并引入强化学习代理(Walker)在超图中导航以识别最优诊断路径;同时结合多跳正交检索机制(multi-hop orthogonal retrieval strategy),在测试时迭代选择体现不同临床特征的邻近病例,确保对测试样本多样性的充分覆盖,从而实现更全面、准确的临床推理。
链接: https://arxiv.org/abs/2601.13919
作者: Yuezhe Yang,Hao Wang,Yige Peng,Jinman Kim,Lei Bi
机构: Institute of Translational Medicine, Shanghai Jiao Tong University (上海交通大学转化医学研究院); School of Computer Science, University of Sydney (悉尼大学计算机学院)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Automated clinical diagnosis remains a core challenge in medical AI, which usually requires models to integrate multi-modal data and reason across complex, case-specific contexts. Although recent methods have advanced medical report generation (MRG) and visual question answering (VQA) with medical vision-language models (VLMs), these methods, however, predominantly operate under a sample-isolated inference paradigm, as such processing cases independently without access to longitudinal electronic health records (EHRs) or structurally related patient examples. This paradigm limits reasoning to image-derived information alone, which ignores external complementary medical evidence for potentially more accurate diagnosis. To overcome this limitation, we propose \textbfHyperWalker, a \textitDeep Diagnosis framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training. First, we construct a dynamic hypergraph, termed \textbfiBrochure, to model the structural heterogeneity of EHR data and implicit high-order associations among multimodal clinical information. Within this hypergraph, a reinforcement learning agent, \textbfWalker, navigates to and identifies optimal diagnostic paths. To ensure comprehensive coverage of diverse clinical characteristics in test samples, we incorporate a \textitlinger mechanism, a multi-hop orthogonal retrieval strategy that iteratively selects clinically complementary neighborhood cases reflecting distinct clinical attributes. Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance. Code is available at: this https URL
zh
[NLP-31] Agent EHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在真实临床环境中自主导航电子健康记录(Electronic Health Records, EHRs)时面临的挑战,即现有方法受限于结构化输入和简化检索任务,难以处理高噪声、长程交互推理的复杂决策任务(如诊断与治疗规划)。其解决方案的关键在于提出RetroSum框架,该框架通过融合回溯式摘要机制(retrospective summarization mechanism)与演进式经验策略(evolving experience strategy),实现对交互历史的动态重评估以防止长上下文信息丢失,并借助记忆库中积累的经验弥合领域差异,从而显著提升推理连续性与任务准确性。
链接: https://arxiv.org/abs/2601.13918
作者: Yusheng Liao,Chuan Xuan,Yutong Cai,Lina Yang,Zhe Chen,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 37 pages, 12 figures
Abstract:Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records~(EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental settings and realistic clinical environments, we present AgentEHR. This benchmark challenges agents to execute complex decision-making tasks, such as diagnosis and treatment planning, requiring long-range interactive reasoning directly within raw and high-noise databases. In tackling these tasks, we identify that existing summarization methods inevitably suffer from critical information loss and fractured reasoning continuity. To address this, we propose RetroSum, a novel framework that unifies a retrospective summarization mechanism with an evolving experience strategy. By dynamically re-evaluating interaction history, the retrospective mechanism prevents long-context information loss and ensures unbroken logical coherence. Additionally, the evolving strategy bridges the domain gap by retrieving accumulated experience from a memory bank. Extensive empirical evaluations demonstrate that RetroSum achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.
zh
[NLP-32] Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
【速读】: 该论文旨在解决当前基于项目反应理论(Item Response Theory, IRT)的计算机化自适应测试(Computerized Adaptive Testing, CAT)在评估大型语言模型(Large Language Models, LLMs)时面临的局限性问题,即传统CAT主要适用于二分类评分(正确/错误),而现代LLM评估越来越多依赖于连续 bounded 分数(如ROUGE、BLEU或LLM-as-a-Judge评分)。其解决方案的关键在于:将IRT中用于离散响应的伯努利分布(Bernoulli distribution)替换为异方差正态分布(heteroskedastic normal distribution),从而实现对连续评分的建模;在此基础上引入一种不确定性感知的排序器(uncertainty-aware ranker)与自适应终止准则(adaptive stopping criteria),能够在显著减少测试项目数量(仅需2%的样本)的同时,提升排名相关性(τ提高0.12)并保证95%的置信预测准确率。
链接: https://arxiv.org/abs/2601.13885
作者: Esma Balkır,Alice Pernthaller,Marco Basaldella,José Hernández-Orallo,Nigel Collier
机构: Trismik; Leverhulme Centre for the Future of Intelligence, University of Cambridge; Universitat Politècnica de València; University of Cambridge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items and as cheaply as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 \tau over random sampling, with 95% accuracy on confident predictions.
zh
[NLP-33] OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge Skill and Attitude in Educational Large Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在教育场景中应用时缺乏全面、理论驱动的评估框架的问题。现有基准测试多聚焦于单一技能维度,且未充分结合学习科学(learning sciences)的理论基础,难以真实反映LLM在教育实践中的综合能力。其解决方案的关键在于提出OpenLearnLM Benchmark,这是一个基于教育评估理论构建的多维评价体系,涵盖知识(Knowledge)、技能(Skills)和态度(Attitude)三个维度:其中知识维度采用课程对齐内容与教学理解,技能维度通过四层结构(中心-角色-场景-子场景)组织情境化能力,态度维度则借鉴Anthropic的对齐伪装方法(Alignment Faking),用于检测模型在不同监控条件下的一致性与抗欺骗能力。该框架包含超过12.4万道题目,覆盖多学科、角色及难度层级(基于布卢姆分类法),实证表明单一模型无法在所有维度上表现最优,从而验证了多轴评估的必要性。
链接: https://arxiv.org/abs/2601.13882
作者: Unggi Lee,Sookbun Lee,Heungsoo Choi,Jinseo Lee,Haeun Park,Younghoon Jeon,Sungmin Cho,Minju Kang,Junbo Koh,Jiyeong Bae,Minwoo Nam,Juyeon Eun,Yeonji Jung,Yeil Jeong
机构: Chosun University (崇实大学); Independent Researcher; Korea University (韩国大学); Ewha Womans University (梨花女子大学); Korea Institute for Curriculum and Evaluation (韩国课程与评价研究所); Upstage; Seoul National University (首尔国立大学); Texas A&M University (德克萨斯农工大学); Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom’s taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic’s Alignment Faking methodology to detect behavioral inconsistency under varying monitoring conditions. Evaluation of seven frontier models reveals distinct capability profiles: Claude-Opus-4.5 excels in practical skills despite lower content knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. Notably, no single model dominates all dimensions, validating the necessity of multi-axis evaluation. OpenLearnLM provides an open, comprehensive framework for advancing LLM readiness in authentic educational contexts.
zh
[NLP-34] Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在采用链式思维(Chain-of-Thought, CoT)推理时因自回归特性导致的高延迟问题,同时避免现有基于文本中心的token压缩方法在多模态场景下因盲目剪枝而引发的视觉遗忘(Visual Amnesia)现象,即错误删除语义冗余但视觉关键的token,进而造成幻觉。其解决方案的关键在于提出V-Skip方法,将token剪枝重构为一个视觉锚定的信息瓶颈(Visual-Anchored Information Bottleneck, VA-IB)优化问题,并设计双路径门控机制,通过语言意外性(linguistic surprisal)与跨模态注意力流共同评估token重要性,从而有效保留视觉显著锚点。实验表明,V-Skip在Qwen2-VL和Llama-3.2系列模型上实现了2.9倍加速,且精度损失可忽略,尤其在DocVQA任务中相比基线提升超过30%,显著提升了多模态推理效率与准确性。
链接: https://arxiv.org/abs/2601.13879
作者: Dongxu Zhang,Yiding Sun,Cheng Tan,Wenbiao Yan,Ning Yang,Jihua Zhu,Hiajun Zhang
机构: Xi’an Jiaotong University (西安交通大学); Chinese Academy of Sciences (中国科学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); University of Science and Technology Beijing (北京科技大学)
类目: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a 2.9\times speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines over 30% on the DocVQA.
zh
[NLP-35] Pedagogical Alignment for Vision-Language-Action Models: A Comprehensive Framework for Data Architecture and Evaluation in Education
【速读】: 该论文旨在解决教育场景中科学演示(science demonstrations)因教师资源受限而难以安全、一致执行的问题,同时应对当前视觉-语言-动作(Vision-Language-Action, VLA)模型在计算资源消耗高且缺乏可解释性生成能力的局限。其解决方案的关键在于提出教学对齐的轻量级VLA框架(Pedagogical VLA Framework),通过四个核心组件实现:文本修复(text healing)以恢复语言生成能力,大语言模型(LLM)蒸馏用于传递教学知识,面向教育环境的安全训练,以及针对科学教育情境调整的教学质量评估机制。该框架在多个学科的科学演示任务中验证了其在保持与基线模型相当的任务性能的同时,能生成符合教学语境的解释性内容。
链接: https://arxiv.org/abs/2601.13876
作者: Unggi Lee,Jahyun Jeong,Sunyoung Shin,Haeun Park,Jeongsu Moon,Youngchang Song,Jaechang Shim,JaeHwan Lee,Yunju Noh,Seungwon Choi,Ahhyun Kim,TaeHyeon Kim,Kyungtae Joo,Taeyeong Kim,Gyeonggeon Lee
机构: Chosun University (崇实大学); Seoul National University (首尔国立大学); Korea Institute for Curriculum and Evaluation (韩国课程与评价研究所); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Science demonstrations are important for effective STEM education, yet teachers face challenges in conducting them safely and consistently across multiple occasions, where robotics can be helpful. However, current Vision-Language-Action (VLA) models require substantial computational resources and sacrifice language generation capabilities to maximize efficiency, making them unsuitable for resource-constrained educational settings that require interpretable, explanation-generating systems. We present \textitPedagogical VLA Framework, a framework that applies pedagogical alignment to lightweight VLA models through four components: text healing to restore language generation capabilities, large language model (LLM) distillation to transfer pedagogical knowledge, safety training for educational environments, and pedagogical evaluation adjusted to science education contexts. We evaluate Pedagogical VLA Framework across five science demonstrations spanning physics, chemistry, biology, and earth science, using an evaluation framework developed in collaboration with science education experts. Our evaluation assesses both task performance (success rate, protocol compliance, efficiency, safety) and pedagogical quality through teacher surveys and LLM-as-Judge assessment. We additionally provide qualitative analysis of generated texts. Experimental results demonstrate that Pedagogical VLA Framework achieves comparable task performance to baseline models while producing contextually appropriate educational explanations.
zh
[NLP-36] FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLM s
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在音频-视觉环境下的未来事件预测能力不足的问题,现有基准主要聚焦于对过去事件的回溯理解,缺乏对跨模态因果推理与时间序列预测能力的系统评估。解决方案的关键在于构建首个面向音频-视觉场景的未来预测基准 FutureOmni,并提出一种名为 Omni-Modal Future Forecasting (OFF) 的训练策略:通过 LLM 辅助、人工参与的可扩展数据生成流程构建高质量数据集(含 919 个视频和 1,034 个多选问答对),并利用一个包含 7K 样本的指令微调数据集优化模型对多模态未来状态的建模能力,从而显著提升模型在 FutureOmni 及其他主流音视频基准上的未来预测性能与泛化能力。
链接: https://arxiv.org/abs/2601.13836
作者: Qian Chen,Jinlan Fu,Changsong Li,See-Kiong Ng,Xipeng Qiu
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: this https URL
Abstract:Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (this https URL) and datasets (this https URL).
zh
[NLP-37] he Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations ICASSP2026
【速读】: 该论文旨在解决人机交互中语音轮流对话(fluid turn-taking)的挑战,具体探究基于自监督语音表示(Self-supervised Speech Representations, S3Rs)的轮替模型是否依赖于韵律线索(prosodic cues)、词汇线索(lexical cues)或两者兼有。解决方案的关键在于提出一种基于声码器(vocoder-based)的方法,能够更清晰地控制语音中的韵律与词汇信息,从而对语音活动投影模型(voice-activity projection model)进行系统性探查。实验发现,即使在仅保留韵律但无意义噪声的情况下,模型性能仍与使用清晰语音时相当,表明S3Rs中同时编码了韵律和词汇信息,且二者可独立使用,无需联合训练即可实现鲁棒的轮替判断,这为未来仅依赖韵律特征的轻量化、隐私友好的模型设计提供了理论依据。
链接: https://arxiv.org/abs/2601.13835
作者: Sam OConnor Russell,Delphine Charuau,Naomi Harte
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ICASSP 2026
Abstract:Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction on prosody-matched, unintelligible noise is similar to accuracy on clean speech. This reveals both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, providing privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are encoded in S3Rs with limited interdependence. Results are consistent in CPC-based and wav2vec2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.
zh
[NLP-38] Knowledge Graph-Assisted LLM Post-Training for Enhanced Legal Reasoning
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在高风险专业领域(如法律)中推理能力不足的问题,其核心原因在于现有后训练方法主要依赖大规模文本语料和人类反馈,未能有效建模领域知识的结构化关系。解决方案的关键在于引入知识图谱(Knowledge Graph, KG)辅助的后训练机制:作者基于IRAC(Issue, Rule, Analysis, Conclusion)框架对法律关键概念进行结构化建模,并构建包含12,000个法律案例的KG;随后利用该KG生成高质量训练数据,结合监督微调(Supervised Fine-Tuning, SFT)与直接偏好优化(Direct Preference Optimization, DPO)策略,在三种不同架构和基座模型家族的大模型(30B、49B、70B参数规模)上进行训练。实验表明,该方法显著提升了模型在多个法律推理任务上的表现,尤其70B-DPO模型在6个推理任务中优于多个基线及141B规模的SOTA法律LLM,验证了KG驱动的知识增强对提升专业领域推理能力的有效性。
链接: https://arxiv.org/abs/2601.13806
作者: Dezhao Song,Guglielmo Bonifazi,Frank Schilder,Jonathan Richard Schwarz
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:LLM post-training has primarily relied on large text corpora and human feedback, without capturing the structure of domain knowledge. This has caused models to struggle dealing with complex reasoning tasks, especially for high-stakes professional domains. In Law, reasoning requires deep understanding of the relations between various legal concepts, a key component missing in current LLM post-training. In this paper, we propose a knowledge graph (KG)-assisted approach for enhancing LLMs’ reasoning capability in Legal that is generalizable to other high-stakes domains. We model key legal concepts by following the \textbfIRAC (Issue, Rule, Analysis and Conclusion) framework, and construct a KG with 12K legal cases. We then produce training data using our IRAC KG, and conduct both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) with three state-of-the-art (SOTA) LLMs (30B, 49B and 70B), varying architecture and base model family. Our post-trained models obtained better average performance on 4/5 diverse legal benchmarks (14 tasks) than baselines. In particular, our 70B DPO model achieved the best score on 4/6 reasoning tasks, among baselines and a 141B SOTA legal LLM, demonstrating the effectiveness of our KG for enhancing LLMs’ legal reasoning capability.
zh
[NLP-39] Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis
【速读】: 该论文旨在解决阿拉伯语方言在语音合成(Speech Synthesis)研究中缺乏统一建模框架的问题,尤其是由于语言复杂性高、标准化数据与评估标准匮乏导致的研究进展缓慢。其解决方案的关键在于提出了一套名为Habibi的专用且统一的文本到语音(Text-to-Speech, TTS)模型体系,通过利用现有开源自动语音识别(ASR)语料库,并结合语言学启发的课程学习(linguistically-informed curriculum learning)策略,支持从高资源到低资源阿拉伯语方言的广泛适配。该方法在生成质量上超越了主流商业服务,同时无需文本分词符号化(text diacritization),并通过有效的上下文学习(in-context learning)实现良好扩展性,为多方言阿拉伯语语音合成提供了首个系统性基准和标准化评估框架。
链接: https://arxiv.org/abs/2601.13802
作者: Yushen Chen,Junzhe Liu,Yujie Tu,Zhikang Niu,Yuzhe Liang,Kai Yu,Chunyu Qiang,Chen Zhang,Xie Chen
机构: X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute; University of Chinese Academy of Sciences (中国科学院大学); Kuaishou Technology (快手科技)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:A notable gap persists in speech synthesis research and development for Arabic dialects, particularly from a unified modeling perspective. Despite its high practical value, the inherent linguistic complexity of Arabic dialects, further compounded by a lack of standardized data, benchmarks, and evaluation guidelines, steers researchers toward safer ground. To bridge this divide, we present Habibi, a suite of specialized and unified text-to-speech models that harnesses existing open-source ASR corpora to support a wide range of high- to low-resource Arabic dialects through linguistically-informed curriculum learning. Our approach outperforms the leading commercial service in generation quality, while maintaining extensibility through effective in-context learning, without requiring text diacritization. We are committed to open-sourcing the model, along with creating the first systematic benchmark for multi-dialect Arabic speech synthesis. Furthermore, by identifying the key challenges in and establishing evaluation standards for the process, we aim to provide a solid groundwork for subsequent research. Resources at this https URL .
zh
[NLP-40] Look-Ahead-Bench: a Standardized Benchmark of Look-ahead Bias in Point-in-Time LLM s for Finance
【速读】: 该论文旨在解决金融领域大语言模型(Large Language Models, LLMs)中存在的“前瞻偏差”(look-ahead bias)问题,即模型在训练或推理过程中无意中利用了未来信息,导致其在实际应用中表现虚假优越性。传统评估方法多依赖问答(Q&A)形式测试模型的内生预测能力,难以区分真实预测性能与记忆性表现。本文的关键解决方案是提出 Look-Ahead-Bench 标准化基准,通过在真实金融工作流场景下评估模型行为,并结合跨不同市场周期的绩效衰减分析(alpha decay),量化模型在时间维度上的泛化能力;同时引入多个定量基线设定性能阈值,有效区分基于记忆的性能与真正的推理能力。实验表明,标准 LLMs(如 Llama 3.1 和 DeepSeek 3.2)存在显著前瞻偏差,而 PiT-Inference 系列 Point-in-Time LLMs(Pitinf)则随着规模扩大展现出更强的泛化能力和推理一致性,验证了该框架在识别适合部署的实际模型方面的有效性。
链接: https://arxiv.org/abs/2601.13770
作者: Mostapha Benhenda(LAGA)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Computational Finance (q-fin.CP); General Finance (q-fin.GN)
备注:
Abstract:We introduce Look-Ahead-Bench, a standardized benchmark measuring look-ahead bias in Point-in-Time (PiT) Large Language Models (LLMs) within realistic and practical financial workflows. Unlike most existing approaches that primarily test inner lookahead knowledge via Q\A, our benchmark evaluates model behavior in practical scenarios. To distinguish genuine predictive capability from memorization-based performance, we analyze performance decay across temporally distinct market regimes, incorporating several quantitative baselines to establish performance thresholds. We evaluate prominent open-source LLMs – Llama 3.1 (8B and 70B) and DeepSeek 3.2 – against a family of Point-in-Time LLMs (Pitinf-Small, Pitinf-Medium, and frontier-level model Pitinf-Large) from PiT-Inference. Results reveal significant lookahead bias in standard LLMs, as measured with alpha decay, unlike Pitinf models, which demonstrate improved generalization and reasoning abilities as they scale in size. This work establishes a foundation for the standardized evaluation of temporal bias in financial LLMs and provides a practical framework for identifying models suitable for real-world deployment. Code is available on GitHub: this https URL
zh
[NLP-41] DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution
【速读】: 该论文旨在解决自对弈(self-play)框架中因求解器依赖的奖励反馈导致的目标非平稳性(non-stationary objectives)以及由自生成伪标签引发的Bootstrap误差问题,从而提升大语言模型在推理任务中的自进化稳定性。其解决方案的关键在于提出DARC(Decoupled Asymmetric Reasoning Curriculum)——一个两阶段训练框架:第一阶段通过显式难度等级和外部语料库引导提问者(Questioner)生成难度校准的问题;第二阶段采用不对称自蒸馏机制,利用文档增强的教师模型生成高质量伪标签来监督缺乏文档访问权限的学生求解器(Solver),从而有效缓解优化不稳定性和伪标签噪声问题。
链接: https://arxiv.org/abs/2601.13761
作者: Shengda Fan,Xuyan Ye,Yankai Lin
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human this http URL code is available at this https URL.
zh
[NLP-42] Finding RELIEF: Shaping Reasoning Behavior without Reasoning Supervision via Belief Engineering
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在复杂问题求解中普遍存在的计算冗余和推理不忠实(reasoning unfaithfulness)问题。现有方法通常依赖强化学习或基于黄金标准推理轨迹的微调,但这类范式计算成本高且难以扩展。论文的关键创新在于揭示了LRM内部存在可被简单logit探测捕捉的“推理信念”(reasoning beliefs),并据此提出了一种名为Reasoning Belief Engineering (RELIEF) 的新框架:通过在合成的、自我反思的问题-答案对上进行微调,使模型的自我认知与目标信念蓝图对齐,从而塑造其行为。该方案无需任何推理轨迹监督,仅靠自省式数据即可有效提升模型效率与忠实度,实验表明其性能优于依赖行为监督或偏好学习的基线方法,同时训练成本更低。
链接: https://arxiv.org/abs/2601.13752
作者: Chak Tou Leong,Dingwei Chen,Heming Xia,Qingyu Yin,Sunbowen Lee,Jian Wang,Wenjie Li
机构: The Hong Kong Polytechnic University (香港理工大学); Sun Yat-sen University (中山大学); Zhejiang University (浙江大学); WUST (武汉科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Working in progress
Abstract:Large reasoning models (LRMs) have achieved remarkable success in complex problem-solving, yet they often suffer from computational redundancy or reasoning unfaithfulness. Current methods for shaping LRM behavior typically rely on reinforcement learning or fine-tuning with gold-standard reasoning traces, a paradigm that is both computationally expensive and difficult to scale. In this paper, we reveal that LRMs possess latent \textitreasoning beliefs that internally track their own reasoning traits, which can be captured through simple logit probing. Building upon this insight, we propose Reasoning Belief Engineering (RELIEF), a simple yet effective framework that shapes LRM behavior by aligning the model’s self-concept with a target belief blueprint. Crucially, RELIEF completely bypasses the need for reasoning-trace supervision. It internalizes desired traits by fine-tuning on synthesized, self-reflective question-answering pairs that affirm the target belief. Extensive experiments on efficiency and faithfulness tasks demonstrate that RELIEF matches or outperforms behavior-supervised and preference-based baselines while requiring lower training costs. Further analysis validates that shifting a model’s reasoning belief effectively shapes its actual behavior.
zh
[NLP-43] Pro-AI Bias in Large Language Models
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在提供决策支持时是否存在系统性地偏好人工智能(Artificial Intelligence, AI)的倾向,从而可能对高风险决策产生偏差。解决方案的关键在于通过三个互补实验揭示这种偏倚的存在及其机制:首先,发现LLMs在回应多样化咨询类问题时更倾向于推荐AI相关选项,且闭源模型表现近乎确定性;其次,证明模型系统性高估AI岗位薪资,闭源模型比非AI岗位高出约10个百分点;最后,通过对开源模型内部表征的探测发现,“人工智能”在正、负和中性语境下均与学术领域通用提示词具有最高相似度,表明其表征中心性不受情感极性影响。这些结果共同表明,LLM生成的建议和估值可能系统性扭曲决策者的判断。
链接: https://arxiv.org/abs/2601.13749
作者: Benaya Trabelsi,Jonathan Shaki,Sarit Kraus
机构: Bar Ilan University (巴伊兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 13 pages, 6 figures. Code available at: this https URL
Abstract:Large language models (LLMs) are increasingly employed for decision-support across multiple domains. We investigate whether these models display a systematic preferential bias in favor of artificial intelligence (AI) itself. Across three complementary experiments, we find consistent evidence of pro-AI bias. First, we show that LLMs disproportionately recommend AI-related options in response to diverse advice-seeking queries, with proprietary models doing so almost deterministically. Second, we demonstrate that models systematically overestimate salaries for AI-related jobs relative to closely matched non-AI jobs, with proprietary models overestimating AI salaries more by 10 percentage points. Finally, probing internal representations of open-weight models reveals that ``Artificial Intelligence’’ exhibits the highest similarity to generic prompts for academic fields under positive, negative, and neutral framings alike, indicating valence-invariant representational centrality. These patterns suggest that LLM-generated advice and valuation can systematically skew choices and perceptions in high-stakes decisions.
zh
[NLP-44] Dimension-First Evaluation of Speech-to-Speech Models with Structured Acoustic Cues EACL2026
【速读】: 该论文旨在解决当前语音到语音(Speech-to-Speech, S2S)自动评估方法依赖于昂贵且不透明的音频语言模型(Audio Language Models, ALMs)的问题,同时提升评估结果与人类判断的一致性。其解决方案的关键在于提出一种名为TRACE(Textual Reasoning over Audio Cues for Evaluation)的新框架,该框架通过将音频信号转化为低成本的文本描述,并利用大语言模型(Large Language Model, LLM)对内容(C)、语音质量(VQ)和副语言特征(P)等维度进行逐项推理与评分,最终通过确定性策略融合为整体评价。这一方法显著提升了评估的人类对齐性和成本效益,优于仅基于转录文本的LLM判别器和ALMs。
链接: https://arxiv.org/abs/2601.13742
作者: Arjun Chandra,Kevin Miller,Venkatesh Ravichandran,Constantinos Papayiannis,Venkatesh Saligrama
机构: Boston University (波士顿大学); Amazon AGI (亚马逊AGI)
类目: Computation and Language (cs.CL)
备注: EACL 2026 Findings
Abstract:Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content ©, voice quality (VQ), and paralinguistics §. Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
zh
[NLP-45] owards robust long-context understanding of large language model via active recap learning
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在处理长文本上下文时的注意力衰减与记忆不足问题,即模型难以有效理解和利用远距离信息。其解决方案的核心在于提出主动重述学习(Active Recap Learning, ARL)框架:首先在持续预训练阶段,通过分析长上下文与短上下文之间的损失差异识别关键token,并基于此提取相关前置段落并由LLM生成回顾性摘要;其次,在推理阶段,模型能自主生成并调用这些摘要,构建跨段落的递归记忆机制,从而增强对长文本的整体理解能力。实验表明,该方法在RULER和LongBench基准上分别取得26.8%和9.44%的显著性能提升。
链接: https://arxiv.org/abs/2601.13734
作者: Chenyu Hui
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages
Abstract:In this paper, we propose active recap learning (ARL), a framework for enhancing large language model (LLM) in understanding long contexts. ARL enables models to revisit and summarize earlier content through targeted sequence construction during contined pretraining and retrospective summarization at inference. First, we identify key tokens in prepared long context based on loss gaps between long and short forward contexts and find most revant preceding paragraphs, then summarize them using an LLM. Second, ARL equips models with the ability to autonomously generate and utilize these retrospective summaries during inference, thereby establishing a recursive memory mechanism across paragraphs. Experimental results show substantial gains, with ARL achieving a 26.8% improvement on RULER and a 9.44% improvement on LongBench. Overall, ARL offers a simple yet effective continued pretraining-based approach to strengthen long-context understanding, advancing scalable memory augmentation in LLM
zh
[NLP-46] On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation
【速读】: 该论文旨在解决生成式机器翻译(Machine Translation, MT)中因温度约束导致的非确定性行为(Non-Deterministic MT, ND-MT)所带来的评估难题。传统基于确定性输出的评估框架在应用于ND-MT时无法提供一致的结果,且存在“桶效应”(Buckets effect)——即无论采样规模如何,系统排名始终由最差候选译文决定,这严重削弱了评估的有效性。解决方案的关键在于提出ExpectoSample策略,该策略能自动评估不同评估指标在ND-MT场景下的可靠性,并据此选择更稳健的指标来衡量系统性能,从而提升对非确定性翻译系统的科学评价能力。
链接: https://arxiv.org/abs/2601.13729
作者: Weichuan Wang,Mingyang Liu,Linqi Song,Chen Ma
机构: City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 12 figures
Abstract:In recent years, the non-deterministic properties of language models have garnered considerable attention and have shown a significant influence on real-world applications. However, such properties remain under-explored in machine translation (MT), a complex, non-deterministic NLP task. In this study, we systematically evaluate modern MT systems and identify temperature-constrained Non-Deterministic MT (ND-MT) as a distinct phenomenon. Additionally, we demonstrate that ND-MT exhibits significant potential in addressing the multi-modality issue that has long challenged MT research and provides higher-quality candidates than Deterministic MT (D-MT) under temperature constraints. However, ND-MT introduces new challenges in evaluating system performance. Specifically, the evaluation framework designed for D-MT fails to yield consistent evaluation results when applied to ND-MT. We further investigate this emerging challenge by evaluating five state-of-the-art ND-MT systems across three open datasets using both lexical-based and semantic-based metrics at varying sampling sizes. The results reveal a Buckets effect across these systems: the lowest-quality candidate generated by ND-MT consistently determines the overall system ranking across different sampling sizes for all reasonable metrics. Furthermore, we propose the ExpectoSample strategy to automatically assess the reliability of evaluation metrics for selecting robust ND-MT.
zh
[NLP-47] OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents
【速读】: 该论文旨在解决记忆增强型对话系统中存在的“过度个性化”(over-personalization)问题,即模型在对话中不当或过度使用用户长期记忆信息,导致回复显得生硬、重复或社交不恰当。作者将过度个性化细分为三类:无关性(Irrelevance)、重复性(Repetition)和谄媚性(Sycophancy),并构建了包含1700个经验证实例的OP-Bench基准用于评估。实验发现,引入记忆后过度个性化现象普遍存在,且模型倾向于在无需时仍检索并过度关注用户记忆。为应对此问题,论文提出轻量级、与模型无关的记忆过滤机制Self-ReCheck,通过动态判断是否应使用记忆来缓解过度个性化,同时保持个性化性能。
链接: https://arxiv.org/abs/2601.13722
作者: Yulin Hu,Zimo Long,Jiahe Guo,Xingyu Sui,Xing Fu,Weixiang Zhao,Yanyan Zhao,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Memory-augmented conversational agents enable personalized interactions using long-term user memory and have gained substantial traction. However, existing benchmarks primarily focus on whether agents can recall and apply user information, while overlooking whether such personalization is used appropriately. In fact, agents may overuse personal information, producing responses that feel forced, intrusive, or socially inappropriate to users. We refer to this issue as \emphover-personalization. In this work, we formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy, and introduce \textbfOP-Bench a benchmark of 1,700 verified instances constructed from long-horizon dialogue histories. Using \textbfOP-Bench, we evaluate multiple large language models and memory-augmentation methods, and find that over-personalization is widespread when memory is introduced. Further analysis reveals that agents tend to retrieve and over-attend to user memories even when unnecessary. To address this issue, we propose \textbfSelf-ReCheck, a lightweight, model-agnostic memory filtering mechanism that mitigates over-personalization while preserving personalization performance. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.
zh
[NLP-48] Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在预测能力评估中面临的困境:前瞻性评估虽具方法学严谨性,但存在显著延迟;而回顾式预测(Retrospective Forecasting, RF)因模型知识截止时间(knowledge cutoff)日益接近当前事件,导致可用于评估的干净数据迅速减少。为缓解此问题,研究者提出“模拟无知”(Simulated Ignorance, SI)作为替代方案,即通过提示让模型抑制其已知的截止时间后信息。论文的关键发现是,SI无法有效逼近真正的无知状态(True Ignorance, TI),具体表现为:SI与TI之间存在52%的性能差距,且链式思维推理(chain-of-thought reasoning)即便未显式提及截止时间后的信息,也难以抑制已有知识;更值得注意的是,推理优化模型在SI下表现反而更差,说明提示工程无法可靠地“回退”模型的知识状态。因此,论文指出基于SI的回顾式评估方法在方法论上存在根本缺陷,不建议用于基准测试LLM的预测能力。
链接: https://arxiv.org/abs/2601.13717
作者: Zehan Li,Yuxuan Wang,Ali El Lahib,Ying-Jieh Xia,Xinyu Pi
机构: University of Chicago (芝加哥大学); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) – evaluating on already-resolved events – faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably “rewind” model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.
zh
[NLP-49] GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLM s on a New Benchmark
【速读】: 该论文旨在解决德语作者身份验证(Authorship Verification, AV)领域缺乏大规模基准数据集和系统性评估的问题。现有研究主要集中在英语数据上,而其他语言尤其是德语的AV研究受限于数据规模与多样性。解决方案的关键在于构建一个名为GerAV的综合性德语AV基准,包含超过60万条标注文本对,数据来源涵盖Twitter和Reddit,并进一步细分为域内(in-domain)、跨域(cross-domain)的消息级子集及基于用户资料的子集,从而支持对数据源、主题领域和文本长度等因素的受控分析。通过在该基准上对强基线模型和前沿方法进行系统评估,发现微调的大语言模型(fine-tuned large language model)在性能上优于近期基线模型达0.09绝对F1分数,并在零样本设置下超越GPT-5达0.08 F1分数;同时揭示了专精化训练与泛化能力之间的权衡关系,表明融合多源训练数据可缓解这一局限。
链接: https://arxiv.org/abs/2601.13711
作者: Lotta Kiefer,Christoph Leiter,Sotaro Takeshita,Elena Schmidt,Steffen Eger
机构: University of Technology Nuremberg (UTN)
类目: Computation and Language (cs.CL)
备注:
Abstract:Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 600k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09 absolute F1 score and surpasses GPT-5 in a zero-shot setting by 0.08. We further observe a trade-off between specialization and generalization: models trained on specific data types perform best under matching conditions but generalize less well across data regimes, a limitation that can be mitigated by combining training sources. Overall, GerAV provides a challenging and versatile benchmark for advancing research on German and cross-domain AV.
zh
[NLP-50] Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction
【速读】: 该论文旨在解决如何利用人工智能(AI)技术在术前准确预测慢性鼻窦炎(chronic rhinosinusitis, CRS)患者术后是否能达到临床意义的改善(定义为SNOT-22评分在6个月内减少超过8.9分,即最小临床重要差异,MCID),从而辅助临床决策,避免对可能预后不佳的患者进行不必要的手术。其解决方案的关键在于:首先采用监督式机器学习(supervised ML)模型(特别是多层感知机MLP)构建高准确性(85%)、校准良好的分类器用于术前风险分层;其次,在此基础上引入生成式AI(Generative AI, GenAI)作为解释工具,通过提供与临床医生经验一致的推理依据(如基线SNOT-22评分、CT/内镜严重程度、息肉表型及心理/疼痛共病等特征的重要性),增强模型透明度和医患共享决策能力。研究结果支持“以ML为主、GenAI为辅”的工作流程,即部署校准后的ML模型进行初步筛选,再用GenAI提升可解释性,实现精准医疗决策。
链接: https://arxiv.org/abs/2601.13710
作者: Sayeed Shafayet Chowdhury,Snehasis Mukhopadhyay,Shiaofen Fang,Vijay R. Ramakrishnan
机构: Purdue University (普渡大学); Indiana University Indianapolis (印第安纳大学印第安纳波利斯分校); Indiana University School of Medicine (印第安纳大学医学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a more than 8.9-point reduction in SNOT-22 at 6 months (MCID). In a prospectively collected cohort where all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes, i.e. those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85 % accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration across zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP’s feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and physchology/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI- augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.
zh
[NLP-51] Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games
【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)在社交情境中通过自然语言进行欺骗的能力问题,尤其是其在社会推理类游戏(如Mafia)中的表现。解决方案的关键在于构建一个异步多智能体框架来模拟更真实的社交环境,并利用GPT-4-Turbo训练一个“Mafia检测器”(Mafia Detector),基于无角色信息的游戏对话记录预测黑方玩家(mafia),以预测准确率作为欺骗质量的代理指标。实验结果表明,LLM生成的对话使检测器准确率显著低于人类对战场景,说明LLM在社交语境下具有更强的欺骗能力,能够更好地融入群体并隐藏身份。
链接: https://arxiv.org/abs/2601.13709
作者: Christopher Kao,Vanshika Vats,James Davis
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: For associated dataset, see this https URL . Published in IEEE ICA 2025, waiting for IEEEXplore proceedings
Abstract:Large Language Model (LLM) agents are increasingly used in many applications, raising concerns about their safety. While previous work has shown that LLMs can deceive in controlled tasks, less is known about their ability to deceive using natural language in social contexts. In this paper, we study deception in the Social Deduction Game (SDG) Mafia, where success is dependent on deceiving others through conversation. Unlike previous SDG studies, we use an asynchronous multi-agent framework which better simulates realistic social contexts. We simulate 35 Mafia games with GPT-4o LLM agents. We then create a Mafia Detector using GPT-4-Turbo to analyze game transcripts without player role information to predict the mafia players. We use prediction accuracy as a surrogate marker for deception quality. We compare this prediction accuracy to that of 28 human games and a random baseline. Results show that the Mafia Detector’s mafia prediction accuracy is lower on LLM games than on human games. The result is consistent regardless of the game days and the number of mafias detected. This indicates that LLMs blend in better and thus deceive more effectively. We also release a dataset of LLM Mafia transcripts to support future research. Our findings underscore both the sophistication and risks of LLM deception in social contexts.
zh
[NLP-52] Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在指令微调(instruction tuning)过程中因训练数据集规模庞大、噪声和冗余导致的计算成本高且效率低的问题。现有数据选择方法通常依赖昂贵的梯度存储或静态代理评分,忽视了模型在训练过程中不断演化的不确定性,从而错失了提升模型可解释性的关键信息。其解决方案的核心在于提出GRADFILTERING框架,该框架是一种目标无关、基于不确定性的数据选择方法,利用小型GPT-2代理模型结合LoRA(Low-Rank Adaptation)集成,并将每个样本的梯度聚合为梯度信噪比(Gradient Signal-to-Noise Ratio, G-SNR)作为选择效用指标,从而实现高效、准确的数据筛选,在相同计算预算下收敛更快,且在LLM-as-a-judge评估和人工评测中表现优于随机子集与强基线方法。
链接: https://arxiv.org/abs/2601.13697
作者: Zhihang Yuan,Chengyu Yue,Long Huang,Litu Ou,Lei Shi
机构: Alibaba Cloud Computing (阿里云计算); The University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
Abstract:Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute budget, reflecting the benefit of uncertainty-aware scoring.
zh
[NLP-53] OptiSQL: Executable SQL Generation from Optical TokensOptiSQL: Executable SQL Generation from Optical Tokens
【速读】: 该论文旨在解决传统文本到SQL(text-to-SQL)任务中依赖线性化文本模式表示表格所导致的token开销过大问题,尤其是在真实场景下表格常以图像形式出现、难以直接转化为结构化文本的情况下。其解决方案的关键在于提出OptiSQL框架,该框架通过一个面向光学字符识别(OCR)的视觉编码器,将表格的结构与内容压缩为少量光学标记(optical tokens),并利用预训练解码器生成可执行SQL语句,同时冻结编码器以隔离表征充分性。实验表明,OptiSQL在保留高执行准确率的同时,将输入表格的token数量减少了一个数量级,并且在视觉扰动下仍能保持关键结构信息的鲁棒性。
链接: https://arxiv.org/abs/2601.13695
作者: Sifan Li,Hongkai Chen,Yujun Cai,Liyang Chen,Qingwen Ye,Yiwei Wang
机构: University of California, Merced (加州大学默塞德分校); vivo Mobile Communication Co., Ltd. (维沃移动通信有限公司); University of Queensland (昆士兰大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Executable SQL generation is typically studied in text-to-SQL settings, where tables are provided as fully linearized textual schemas and contents. While effective, this formulation assumes access to structured text and incurs substantial token overhead, which is misaligned with many real-world scenarios where tables appear as visual artifacts in documents or webpages. We investigate whether compact optical representations can serve as an efficient interface for executable semantic parsing. We present OptiSQL, a vision-driven framework that generates executable SQL directly from table images and natural language questions using compact optical tokens. OptiSQL leverages an OCR-oriented visual encoder to compress table structure and content into a small set of optical tokens and fine-tunes a pretrained decoder for SQL generation while freezing the encoder to isolate representation sufficiency. Experiments on a visualized version of Spider 2.0-Snow show that OptiSQL retains strong execution accuracy while reducing table input tokens by an order of magnitude. Robustness analyses further demonstrate that optical tokens preserve essential structural information under visual perturbations.
zh
[NLP-54] Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning
【速读】: 该论文旨在解决临床决策支持系统(Clinical Decision Support Systems, CDSSs)在实际应用中面临的高维护成本和低泛化能力问题,以及大语言模型(Large Language Models, LLMs)在医疗诊断推理和问诊技能上的局限性。解决方案的关键在于:首先提出一种结构化的临床诊断推理数据(Clinical Diagnostic Reasoning Data, CDRD),用于捕捉抽象的临床推理逻辑,并构建了相应的数据生成流程;其次开发了Dr. Assistant模型,通过两阶段训练策略(监督微调SFT与基于定制奖励函数的强化学习RL)赋予其临床推理与问诊能力。实验表明,该模型在诊断推理和问诊评估基准上优于开源模型,且性能可媲美闭源模型,为临床问诊引导提供了有效方案。
链接: https://arxiv.org/abs/2601.13690
作者: Yue Guo,Fanfu Wang,Jianwei Lv,Xincheng Shi,Yuchen Li,Youya Wang,Yunsheng Zeng,Yujing Liu,Yunhao Qiao,Gen Li,Junfeng Wang,Bo Yuan
机构: Baidu Inc.(百度公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Clinical Decision Support Systems (CDSSs) provide reasoning and inquiry guidance for physicians, yet they face notable challenges, including high maintenance costs and low generalization capability. Recently, Large Language Models (LLMs) have been widely adopted in healthcare due to their extensive knowledge reserves, retrieval, and communication capabilities. While LLMs show promise and excel at medical benchmarks, their diagnostic reasoning and inquiry skills are constrained. To mitigate this issue, we propose (1) Clinical Diagnostic Reasoning Data (CDRD) structure to capture abstract clinical reasoning logic, and a pipeline for its construction, and (2) the Dr. Assistant, a clinical diagnostic model equipped with clinical reasoning and inquiry skills. Its training involves a two-stage process: SFT, followed by RL with a tailored reward function. We also introduce a benchmark to evaluate both diagnostic reasoning and inquiry. Our experiments demonstrate that the Dr. Assistant outperforms open-source models and achieves competitive performance to closed-source models, providing an effective solution for clinical diagnostic inquiry guidance.
zh
[NLP-55] HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在长文本推理任务中因键值缓存(KV cache)线性增长带来的内存瓶颈问题。现有静态压缩方法常因忽略注意力漂移(attention drift)现象——即token重要性随时间动态变化——而难以保留全局关键信息;而动态检索方法则受限于粗粒度缓存策略和频繁数据传输导致的高I/O开销。其解决方案的关键在于提出一种无需训练的动态压缩框架HeteroCache,核心创新包括:1)基于注意力头的时间异质性(temporal heterogeneity)与层内空间冗余性(spatial redundancy),对注意力头进行分类并采用细粒度加权策略,优先分配更大缓存预算给注意力快速变化的头以捕捉上下文演化;2)引入分层存储机制,由一组代表性头监控注意力变化,并异步按需从CPU检索上下文,有效隐藏I/O延迟。实验表明,HeteroCache在多个长上下文基准上达到SOTA性能,且在224K上下文长度下推理加速最高达3倍。
链接: https://arxiv.org/abs/2601.13684
作者: Zhiyuan Shi,Qibo Qiu,Feng Xue,Zhonglin Jiang,Li Yu,Jian Jiang,Xiaofei He,Wenxiao Wang
机构: Zhejiang University (浙江大学); China Mobile (Zhejiang) Research & Innovation Institute (中国移动(浙江)研究院); The Center for Artificial Intelligence, Geely (吉利人工智能中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shift, and trigger an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to 3\times compared to the original model in the 224K context. Our code will be open-source.
zh
[NLP-56] CommunityBench: Benchmarking Community-Level Alignment across Diverse Groups and Tasks
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)对人类价值观对齐中存在的两个局限性问题:一是“一刀切”式统一价值对齐策略忽视了少数群体规范,二是个体级定制化对齐成本过高。其解决方案的关键在于提出“社区级对齐”(community-level alignment),即基于社会学中的“共同身份与共同纽带理论”(Common Identity and Common Bond theory),将人群划分为具有高内部价值一致性的社会子群(social clusters),从而在群体层面实现更高效、更具包容性的对齐机制。为此,作者构建了首个大规模社区级对齐评估基准CommunityBench,并通过实证发现现有LLMs在建模社区特异性偏好方面能力有限,同时验证了社区级对齐作为个体级建模基础的可行性,为可扩展且多元的价值对齐提供了新路径。
链接: https://arxiv.org/abs/2601.13669
作者: Jiayu Lin,Zhongyu Wei
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) alignment ensures model behaviors reflect human value. Existing alignment strategies primarily follow two paths: one assumes a universal value set for a unified goal (i.e., one-size-fits-all), while the other treats every individual as unique to customize models (i.e., individual-level). However, assuming a monolithic value space marginalizes minority norms, while tailoring individual models is prohibitively expensive. Recognizing that human society is organized into social clusters with high intra-group value alignment, we propose community-level alignment as a “middle ground”. Practically, we introduce CommunityBench, the first large-scale benchmark for community-level alignment evaluation, featuring four tasks grounded in Common Identity and Common Bond theory. With CommunityBench, we conduct a comprehensive evaluation of various foundation models on CommunityBench, revealing that current LLMs exhibit limited capacity to model community-specific preferences. Furthermore, we investigate the potential of community-level alignment in facilitating individual modeling, providing a promising direction for scalable and pluralistic alignment.
zh
[NLP-57] mporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis ICASSP2026
【速读】: 该论文旨在解决多模态情感分析(Multimodal Sentiment Analysis)中因忽略时序与空间异质性而导致的信息不对称问题,从而限制模型性能。现有方法通常依赖于时空混合建模,未能有效区分和处理不同模态的时序动态(temporal dynamics)与空间结构上下文(spatial structural context)。其解决方案的关键在于提出TSDA(Temporal-Spatial Decouple before Act),即在模态交互前显式解耦每个模态为独立的时序流与空间流:通过专用的时序编码器与空间编码器分别提取特征;随后采用因子一致的跨模态对齐机制,仅在同类因子间(时序-时序、空间-空间)进行对齐,避免跨因子信息泄露;并引入因子特定监督与去相关正则化以保留互补性。最终通过门控重组模块融合对齐后的流进行任务预测,实验表明该设计显著优于基线方法且具备可解释性。
链接: https://arxiv.org/abs/2601.13659
作者: Chunlei Meng,Ziyang Zhou,Lucas He,Xiaojing Du,Chun Ouyang,Zhongxue Gan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: This study has been accepted by IEEE ICASSP2026
Abstract:Multimodal Sentiment Analysis integrates Linguistic, Visual, and Acoustic. Mainstream approaches based on modality-invariant and modality-specific factorization or on complex fusion still rely on spatiotemporal mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial body. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor specific supervision and decorrelation regularization reduce cross factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for task. Extensive experiments show that TSDA outperforms baselines. Ablation analysis studies confirm the necessity and interpretability of the design.
zh
[NLP-58] Beyond Known Facts: Generating Unseen Temporal Knowledge to Address Data Contamination in LLM Evaluation
【速读】: 该论文旨在解决时间知识图谱抽取(Temporal Knowledge Graph Extraction, TKGE)领域中训练与评估数据稀缺及评估数据污染的问题。现有数据集存在训练集与测试集重叠的风险,可能导致大语言模型(Large Language Models, LLMs)性能被高估。解决方案的关键在于构建一个基于未来时序事实的合成评估数据集:首先通过时间知识图谱预测(Temporal Knowledge Graph Forecasting, TKGF)生成符合原始知识库结构的未来语义四元组;随后利用LLM将这些四元组转化为自然语言描述,从而实现无污染、可扩展的基准测试。该方法确保了评估数据的独立性与真实性,为TKGE提供了长期可靠的评测标准。
链接: https://arxiv.org/abs/2601.13658
作者: Arthur Amalvy,Hen-Hsen Huang
机构: Institute of Information Science, Academia Sinica (中央研究院资讯科学研究所)
类目: Computation and Language (cs.CL)
备注: 12 pages
Abstract:The automatic extraction of information is important for populating large web knowledge bases such as Wikidata. The temporal version of that task, temporal knowledge graph extraction (TKGE), involves extracting temporally grounded facts from text, represented as semantic quadruples (subject, relation, object, timestamp). Many recent systems take advantage of large language models (LLMs), which are becoming a new cornerstone of the web due to their performance on many tasks across the natural language processing (NLP) field. Despite the importance of TKGE, existing datasets for training and evaluation remain scarce, and contamination of evaluation data is an unaddressed issue, potentially inflating LLMs’ perceived performance due to overlaps between training and evaluation sets. To mitigate these challenges, we propose a novel synthetic evaluation dataset constructed from predicted future, previously unseen temporal facts, thereby eliminating contamination and enabling robust and unbiased benchmarking. Our dataset creation involves a two-step approach: (1) Temporal Knowledge Graph Forecasting (TKGF) generates plausible future quadruples, which are subsequently filtered to adhere to the original knowledge base schema; (2) LLMs perform quadruple-to-text generation, creating semantically aligned textual descriptions. We benchmark Extract, Define and Canonicalize (EDC), a state-of-the-art LLM-based extraction framework, demonstrating that LLM performance decreases when evaluated on our dataset compared to a dataset of known facts. We publicly release our dataset consisting of 4.2K future quadruples and corresponding textual descriptions, along with the generation methodology, enabling continuous creation of unlimited future temporal datasets to serve as long-term, contamination-free benchmarks for TKGE.
zh
[NLP-59] Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM -as-a-Judge
【速读】: 该论文旨在解决大语言模型作为评判者(LLM-as-a-judge)时存在的语言偏见问题,即模型在评估文本质量时对不同语言或语言族的判断存在系统性偏差,且这种偏差与人类偏好不一致。解决方案的关键在于通过实证分析识别两种类型的语言偏见:一是同语言比较中不同语言家族间的性能差异(如欧洲语言优于非洲语言,尤其在文化相关主题上);二是跨语言比较中对主流语言(尤其是英语)的倾向性偏好,且该偏好主要受回答语言影响而非提问语言。研究进一步验证了语言偏见不能仅由低困惑度偏见(low-perplexity bias)解释,表明需从语言特征、文化语境和模型训练数据分布等多维度深入理解并缓解此类偏见。
链接: https://arxiv.org/abs/2601.13649
作者: Xiaolin Zhou,Zheng Luo,Yicheng Gao,Qixuan Chen,Xiyang Hu,Yue Zhao,Ruishan Liu
机构: University of Southern California (南加州大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have incentivized the development of LLM-as-a-judge, an application of LLMs where they are used as judges to decide the quality of a certain piece of text given a certain context. However, previous studies have demonstrated that LLM-as-a-judge can be biased towards different aspects of the judged texts, which often do not align with human preference. One of the identified biases is language bias, which indicates that the decision of LLM-as-a-judge can differ based on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity between languages when the judge is prompted to compare options from the same language, and (2) bias towards options written in major languages when the judge is prompted to compare options of two different languages. We find that for same-language judging, there exist significant performance disparities across language families, with European languages consistently outperforming African languages, and this bias is more pronounced in culturally-related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by answer language than question language. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and we find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity only.
zh
[NLP-60] owards Token-Level Text Anomaly Detection WWW2026
【速读】: 该论文旨在解决现有文本异常检测方法仅能进行文档级分析、无法精确定位异常文本片段的问题。其解决方案的关键在于提出了一种全新的词元级(token-level)异常检测范式,通过形式化定义文档级与词元级异常,并构建一个统一的多层级检测框架,实现了对文本中异常部分的细粒度定位。研究还构建了三个包含词元级标注的基准数据集,实验表明该框架在性能上优于6个基线方法,为文本异常的精准定位提供了新路径。
链接: https://arxiv.org/abs/2601.13644
作者: Yang Cao,Bicheng Yu,Sikun Yang,Ming Liu,Yujiu Yang
机构: Great Bay University(大湾区大学); Tsinghua University (清华大学); Shenzhen University (深圳大学); Dongguan Key Laboratory for AI and Dynamical Systems (东莞人工智能与动力系统重点实验室); Deakin University (迪肯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: WWW 2026
Abstract:Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis, unable to identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both document and token-levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews and grammar errors with token-level labels. Experimental results demonstrate that our framework get better performance than other 6 baselines, opening new possibilities for precise anomaly localization in text. All the codes and data are publicly available on this https URL.
zh
[NLP-61] Activation-Space Anchored Access Control for Multi-Class Permission Reasoning in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在基于知识库的问答(Knowledge-base Question Answering, KB-QA)场景中,因缺乏细粒度访问控制而导致的敏感信息泄露问题。现有方法难以在不损害生成质量的前提下实现多级权限管理,尤其当模型可能越权生成超出用户授权范围的内容时。解决方案的关键在于发现并利用中间激活空间中的几何规律:同一查询在不同权限范围内产生的表征具有明显的聚类特性且可分离。基于此,作者提出无需训练的Activation-space Anchored Access Control (AAAC) 框架,通过构建一个离线采样得到的锚点库(anchor bank),在推理阶段采用多锚点引导机制将激活向量导向对应用户的授权区域,从而从源头抑制越权生成行为,显著降低权限违规率和提示攻击成功率,同时保持响应可用性与较低的推理开销。
链接: https://arxiv.org/abs/2601.13630
作者: Zhaopeng Zhang,Pengcheng Sun,Lan Zhang,Chen Tang,Jiewei Lai,Yunhao Wang,Hui Jin
机构: University of Science and Technology of China (中国科学技术大学); Lenovo Research (联想研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly deployed over knowledge bases for efficient knowledge retrieval and question answering. However, LLMs can inadvertently answer beyond a user’s permission scope, leaking sensitive content, thus making it difficult to deploy knowledge-base QA under fine-grained access control requirements. In this work, we identify a geometric regularity in intermediate activations: for the same query, representations induced by different permission scopes cluster distinctly and are readily separable. Building on this separability, we propose Activation-space Anchored Access Control (AAAC), a training-free framework for multi-class permission control. AAAC constructs an anchor bank, with one permission anchor per class, from a small offline sample set and requires no fine-tuning. At inference time, a multi-anchor steering mechanism redirects each query’s activations toward the anchor-defined authorized region associated with the current user, thereby suppressing over-privileged generations by design. Finally, extensive experiments across three LLM families demonstrate that AAAC reduces permission violation rates by up to 86.5% and prompt-based attack success rates by 90.7%, while improving response usability with minor inference overhead compared to baselines.
zh
[NLP-62] CauScientist: Teaching LLM s to Respect Data for Causal Discovery
【速读】: 该论文旨在解决因果发现(causal discovery)中现有方法的局限性问题:纯数据驱动的方法受限于统计不可区分性和模型假设,而基于大语言模型(LLM)的方法要么忽略统计证据,要么引入未经验证的先验知识,可能导致结果误导。解决方案的关键在于提出CauScientist框架,其核心是将LLM作为假设生成的“数据科学家”与概率统计作为严格的“验证者”协同工作,通过混合初始化选择优质初始图结构,利用统计标准验证LLM提出的结构修改,并借助误差记忆机制高效引导搜索空间,从而显著提升因果图结构的准确性和鲁棒性。
链接: https://arxiv.org/abs/2601.13614
作者: Bo Peng,Sirui Chen,Lei Xu,Chaochao Lu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); Tongji University (同济大学); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Causal discovery is fundamental to scientific understanding and reliable decision-making. Existing approaches face critical limitations: purely data-driven methods suffer from statistical indistinguishability and modeling assumptions, while recent LLM-based methods either ignore statistical evidence or incorporate unverified priors that can mislead result. To this end, we propose CauScientist, a collaborative framework that synergizes LLMs as hypothesis-generating “data scientists” with probabilistic statistics as rigorous “verifiers”. CauScientist employs hybrid initialization to select superior starting graphs, iteratively refines structures through LLM-proposed modifications validated by statistical criteria, and maintains error memory to guide efficient search space. Experiments demonstrate that CauScientist substantially outperforms purely data-driven baselines, achieving up to 53.8% F1 score improvement and enhancing recall from 35.0% to 100.0%. Notably, while standalone LLM performance degrades with graph complexity, CauScientist reduces structural hamming distance (SHD) by 44.0% compared to Qwen3-32B on 37-node graphs. Our project page is at this https URL.
zh
[NLP-63] DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的数据科学代理(data agents)在真实世界复杂任务中缺乏有效评估标准的问题。由于现实数据科学问题具有开放性、多维度且无统一答案的特点,现有评估方法难以全面衡量代理的能力。为此,作者提出DSAEval基准,其关键在于:(1)多模态环境感知能力,使代理能同时处理文本与视觉等多源观测信息;(2)多轮交互机制,模拟真实项目中迭代累积的探索过程;(3)多维评价体系,从推理逻辑、代码质量到最终结果三个层面进行综合评估。该设计显著提升了对数据科学代理性能的刻画精度,并揭示了当前模型在结构化数据上表现较好,但在非结构化领域仍面临挑战。
链接: https://arxiv.org/abs/2601.13591
作者: Maojun Sun,Yifei Xie,Yue Wu,Ruijian Han,Binyan Jiang,Defeng Sun,Yancheng Yuan,Jian Huang
机构: Hong Kong Polytechnic University (香港理工大学); Hong Kong Polytechnic University (香港理工大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data anlysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.
zh
[NLP-64] Vulnerability of LLM s Belief Systems? LLM s Belief Resistance Check Through Strategic Persuasive Conversation Interventions
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对不同说服策略时信念稳定性不足的问题,即模型易受外部信息影响而改变原有信念,甚至形成非事实性认知。其关键解决方案在于系统评估了基于SMCR(Source–Message–Channel–Receiver)传播框架下的多种说服机制对LLM信念演化的影响,并进一步测试了元认知提示(meta-cognition prompting)与对抗微调(adversarial fine-tuning)两种防御策略的有效性。结果显示,小模型极易快速屈从于首次说服(平均仅1.1–1.4轮交互即发生信念转变),而元认知提示反而加剧了脆弱性;相比之下,对抗微调可显著提升部分模型的鲁棒性(如GPT-4o-mini达到98.6%),但Llama系列模型即便在自身失败案例上微调后仍高度敏感(仅14%鲁棒性),揭示当前防御方法存在显著的模型依赖性局限。
链接: https://arxiv.org/abs/2601.13590
作者: Fan Huang,Haewoon Kwak,Jisun An
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly employed in various question-answering tasks. However, recent studies showcase that LLMs are susceptible to persuasion and could adopt counterfactual beliefs. We present a systematic evaluation of LLM susceptibility to persuasion under the Source–Message–Channel–Receiver (SMCR) communication framework. Across five mainstream Large Language Models (LLMs) and three domains (factual knowledge, medical QA, and social bias), we analyze how different persuasive strategies influence belief stability over multiple interaction turns. We further examine whether meta-cognition prompting (i.e., eliciting self-reported confidence) affects resistance to persuasion. Results show that smaller models exhibit extreme compliance, with over 80% of belief changes occurring at the first persuasive turn (average end turn of 1.1–1.4). Contrary to expectations, meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. Finally, we evaluate adversarial fine-tuning as a defense. While GPT-4o-mini achieves near-complete robustness (98.6%) and Mistral~7B improves substantially (35.7% \rightarrow 79.3%), Llama models remain highly susceptible (14%) even when fine-tuned on their own failure cases. Together, these findings highlight substantial model-dependent limits of current robustness interventions and offer guidance for developing more trustworthy LLMs.
zh
[NLP-65] REX: Tokenizer Regression for Optimal Data Mixture EACL2026
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, LLMs)中词元化器(Tokenizer)训练时语言数据混合比例难以优化的问题。现有方法依赖启发式策略或高成本的大规模搜索来确定最优语言比例,效率低下且难以扩展。其解决方案的关键在于提出一种基于回归的框架——Tokenizer Regression for Optimal Data MiXture (TREX),通过在小规模随机混合数据上训练代理词元化器并收集压缩性能统计信息,学习从数据混合比例到压缩性能的映射关系,从而在大规模词元化器训练前高效预测最优混合比例,显著提升压缩效率和训练推理的可扩展性与鲁棒性。
链接: https://arxiv.org/abs/2601.13588
作者: Inho Won,Hangyeol Yoo,Minkyung Cho,Jungyeul Park,Hoyun Song,KyungTae Lim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EACL 2026. Long Paper. (19 languages studied: Chinese, Greek, Japanese, etc.)
Abstract:Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer’s compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TReX’s predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both inand out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
zh
[NLP-66] Comparing Without Saying: A Dataset and Benchmark for Implicit Comparative Opinion Mining from Same-User Reviews
【速读】: 该论文旨在解决隐式比较意见挖掘(implicit comparative opinion mining)问题,即在缺乏显式比较表达(explicit comparative expressions)的情况下,从同一用户的不同评论中推断其偏好关系。现有研究多集中于显式比较句式,而现实场景中此类表达较为罕见,导致隐式比较分析长期被忽视。论文提出SUDO数据集,其关键创新在于构建了一个具有双层结构的标注数据集,包含4,150对评论(共15,191个句子),同时捕捉方面级提及(aspect-level mentions)和评论级偏好(review-level preferences),从而为模型提供可靠的偏好多粒度信号。该数据集为隐式比较任务提供了首个系统性基准,揭示了该任务的挑战性,并推动未来研究向更鲁棒的语义理解方向发展。
链接: https://arxiv.org/abs/2601.13575
作者: Thanh-Lam T. Nguyen,Ngoc-Quang Le,Quoc-Trung Phu,Thi-Phuong Le,Ngoc-Huyen Pham,Phuong-Nguyen Nguyen,Hoang-Quynh Le
机构: VNU University of Engineering and Technology (越南国家大学工程与技术学院); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing studies on comparative opinion mining have mainly focused on explicit comparative expressions, which are uncommon in real-world reviews. This leaves implicit comparisons - here users express preferences across separate reviews - largely underexplored. We introduce SUDO, a novel dataset for implicit comparative opinion mining from same-user reviews, allowing reliable inference of user preferences even without explicit comparative cues. SUDO comprises 4,150 annotated review pairs (15,191 sentences) with a bi-level structure capturing aspect-level mentions and review-level preferences. We benchmark this task using two baseline architectures: traditional machine learning- and language model-based baselines. Experimental results show that while the latter outperforms the former, overall performance remains moderate, revealing the inherent difficulty of the task and establishing SUDO as a challenging and valuable benchmark for future research.
zh
[NLP-67] Self-Improvement as Coherence Optimization: A Theoretical Account
【速读】: 该论文试图解决的问题是:语言模型是否能够在无需外部监督的情况下提升其准确性。针对这一问题,论文提出的关键解决方案是将多种无需外部监督的自增强方法(如辩论、自举和内部一致性最大化)统一为一种理论框架——即一致性优化(coherence optimization),其核心思想是寻找一个最可压缩且联合可预测的上下文到行为映射。作者进一步证明,一致性优化等价于描述长度正则化(description-length regularization),并且当该正则化项由预训练模型导出时,在半监督学习场景下它是最优的。这一理论解释了为何无反馈的自我改进能够成功,并能预测其在何种条件下有效或失效。
链接: https://arxiv.org/abs/2601.13566
作者: Tianyi Qiu,Ahmed Hani Ismail,Zhonghao He,Shi Feng
机构: Peking University (北京大学); University of Oxford (牛津大学); UC Berkeley (加州大学伯克利分校); George Washington University (乔治华盛顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 39 pages
Abstract:Can language models improve their accuracy without external supervision? Methods such as debate, bootstrap, and internal coherence maximization achieve this surprising feat, even matching golden finetuning performance. Yet why they work remains theoretically unclear. We show that they are all special cases of coherence optimization: finding a context-to-behavior mapping that’s most compressible and jointly predictable. We prove that coherence optimization is equivalent to description-length regularization, and that among all such regularization schemes, it is optimal for semi-supervised learning when the regularizer is derived from a pretrained model. Our theory, supported by preliminary experiments, explains why feedback-free self-improvement works and predicts when it should succeed or fail.
zh
[NLP-68] Leverag ing ChatGPT and Other NLP Methods for Identifying Risk and Protective Behaviors in MSM: Social Media and Dating apps Text Analysis
【速读】: 该论文旨在解决如何利用社交媒体和约会应用中的文本数据,自动识别男男性行为者(MSM)的性风险行为、饮酒行为及暴露前预防(PrEP)使用情况,从而支持个性化公共卫生干预的问题。其解决方案的关键在于采用多种自然语言处理技术提取文本特征,包括基于ChatGPT和BERT的嵌入表示、语言使用与心理内容分析(LIWC)以及基于词典的风险术语方法,并通过机器学习模型进行预测建模,结果显示在预测每月 binge drinking 和性伴侣数量超过五人方面表现优异(F1分数达0.78),表明大语言模型驱动的方法在识别高危行为方面具有较高准确性和可扩展性,为精准干预提供了可行路径。
链接: https://arxiv.org/abs/2601.13558
作者: Mehrab Beikzadeh,Chenglin Hong,Cory J Cascalheira,Callisto Boka,Majid Sarrafzadeh,Ian W Holloway
机构: University of California, Los Angeles (加州大学洛杉矶分校); University of California, San Diego (加州大学圣地亚哥分校); Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Men who have sex with men (MSM) are at elevated risk for sexually transmitted infections and harmful drinking compared to heterosexual men. Text data collected from social media and dating applications may provide new opportunities for personalized public health interventions by enabling automatic identification of risk and protective behaviors. In this study, we evaluated whether text from social media and dating apps can be used to predict sexual risk behaviors, alcohol use, and pre-exposure prophylaxis (PrEP) uptake among MSM. With participant consent, we collected textual data and trained machine learning models using features derived from ChatGPT embeddings, BERT embeddings, LIWC, and a dictionary-based risk term approach. The models achieved strong performance in predicting monthly binge drinking and having more than five sexual partners, with F1 scores of 0.78, and moderate performance in predicting PrEP use and heavy drinking, with F1 scores of 0.64 and 0.63. These findings demonstrate that social media and dating app text data can provide valuable insights into risk and protective behaviors and highlight the potential of large language model-based methods to support scalable and personalized public health interventions for MSM.
zh
[NLP-69] HateXScore: A Metric Suite for Evaluating Reasoning Quality in Hate Speech Explanations EACL2026
【速读】: 该论文旨在解决当前仇恨言论检测(hateful speech detection)模型在内容审核中缺乏可解释性评估的问题,即现有评价框架很少考察模型为何将某文本判定为仇恨言论。其解决方案的核心是提出 HateXScore,一个由四个维度组成的指标体系:(i) 结论明确性,(ii) 引用片段的忠实性与因果基础,(iii) 受保护群体识别(政策可配置),以及 (iv) 上述要素间的逻辑一致性。该方法作为标准准确率或F1分数的诊断补充,能够揭示模型解释中的可解释性缺陷和标注不一致问题,且通过人工评估验证了其有效性,从而提升内容审核的可信度与透明度。
链接: https://arxiv.org/abs/2601.13547
作者: Yujia Hu,Roy Ka-Wei Lee
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EACL 2026 Main Conference
Abstract:Hateful speech detection is a key component of content moderation, yet current evaluation frameworks rarely assess why a text is deemed hateful. We introduce \textsfHateXScore, a four-component metric suite designed to evaluate the reasoning quality of model explanations. It assesses (i) conclusion explicitness, (ii) faithfulness and causal grounding of quoted spans, (iii) protected group identification (policy-configurable), and (iv) logical consistency among these elements. Evaluated on six diverse hate speech datasets, \textsfHateXScore is intended as a diagnostic complement to reveal interpretability failures and annotation inconsistencies that are invisible to standard metrics like Accuracy or F1. Moreover, human evaluation shows strong agreement with \textsfHateXScore, validating it as a practical tool for trustworthy and transparent moderation. \textcolorredDisclaimer: This paper contains sensitive content that may be disturbing to some readers. Comments: EACL 2026 Main Conference Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.13547 [cs.CL] (or arXiv:2601.13547v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.13547 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-70] When Wording Steers the Evaluation: Framing Bias in LLM judges
【速读】: 该论文旨在解决生成式 AI(Generative AI)在高风险评估任务中因提示词(prompt)表述方式不同而导致判断结果不稳定的问题,即“框架偏差”(framing bias)对模型评价一致性的影响尚未被充分理解。其解决方案的关键在于设计对称性提示(symmetric prompts),通过构造谓词为正向和负向的对比结构,系统性地揭示了当前大语言模型(LLM)在评估任务中的判断易受提示语框架影响,并验证了这种偏差是模型架构层面的结构性特征,从而强调需建立面向框架敏感性的评估协议以提升判别稳定性与公正性。
链接: https://arxiv.org/abs/2601.13537
作者: Yerin Hwang,Dongryeol Lee,Taegwan Kang,Minwoo Lee,Kyomin Jung
机构: IPAI, Seoul National University (首尔国立大学); Dept. of ECE, Seoul National University (首尔国立大学电子与计算机工程系); LG AI Research (LG人工智能研究中心); SNU-LG AI Research Center (首尔国立大学-LG人工智能研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 pages
Abstract:Large language models (LLMs) are known to produce varying responses depending on prompt phrasing, indicating that subtle guidance in phrasing can steer their answers. However, the impact of this framing bias on LLM-based evaluation, where models are expected to make stable and impartial judgments, remains largely underexplored. Drawing inspiration from the framing effect in psychology, we systematically investigate how deliberate prompt framing skews model judgments across four high-stakes evaluation tasks. We design symmetric prompts using predicate-positive and predicate-negative constructions and demonstrate that such framing induces significant discrepancies in model outputs. Across 14 LLM judges, we observe clear susceptibility to framing, with model families showing distinct tendencies toward agreement or rejection. These findings suggest that framing bias is a structural property of current LLM-based evaluation systems, underscoring the need for framing-aware protocols.
zh
[NLP-71] Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs
【速读】: 该论文试图解决的问题是:即使前沿模型(frontier models)通过输出层面的防护机制(如分类器过滤危险内容)来防止滥用,仍可能被用于间接诱导开源模型(open-source models)获得有害能力,从而引发生态系统层面的风险。解决方案的关键在于提出一种“诱饵攻击”(elicitation attacks),其核心步骤包括:(1)构造不直接请求危险信息但位于目标有害任务邻域的提示(prompts);(2)从受保护的前沿模型中获取这些提示的响应;(3)利用这些提示-响应对微调开源模型。由于这些提示本身不触发防护机制,因此可绕过输出级防御,进而显著提升开源模型在危险任务上的能力(例如在危险化学品合成领域恢复约40%的能力差距)。该方法揭示了仅依赖输出级防护难以有效管控整个AI生态系统的风险。
链接: https://arxiv.org/abs/2601.13528
作者: Jackson Kaunismaa,Avery Griffin,John Hughes,Christina Q. Knight,Mrinank Sharma,Erik Jones
机构: Anthropic; Scale AI
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:
Abstract:Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards.
zh
[NLP-72] Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives
【速读】: 该论文旨在解决精神病学叙事文本在去标识化过程中难以平衡身份保护与临床信息保留的问题。现有方法如隐私信息掩码(PHI masking)和基于大语言模型(LLM)的合成重写,仅在文本层面操作,对语义元素的保留或修改控制有限,易导致识别风险高或临床价值受损。其解决方案的关键在于提出Anonpsy框架,将去标识化任务重构为图引导的语义重写:首先构建包含临床实体、时间锚点及类型化关系的语义图;其次施加图约束扰动以改变识别性上下文但保留核心临床结构;最后通过图条件化的LLM生成新文本。此方法显著降低了再识别风险,同时保持诊断准确性,优于纯LLM重写基线。
链接: https://arxiv.org/abs/2601.13503
作者: Kyung Ho Lim,Byung-Hoon Kim
机构: Yonsei University College of Medicine (延世大学医学院); Yonsei Institute for Digital Healthcare (延世数字医疗研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Psychiatric narratives encode patient identity not only through explicit identifiers but also through idiosyncratic life events embedded in their clinical structure. Existing de-identification approaches, including PHI masking and LLM-based synthetic rewriting, operate at the text level and offer limited control over which semantic elements are preserved or altered. We introduce Anonpsy, a de-identification framework that reformulates the task as graph-guided semantic rewriting. Anonpsy (1) converts each narrative into a semantic graph encoding clinical entities, temporal anchors, and typed relations; (2) applies graph-constrained perturbations that modify identifying context while preserving clinically essential structure; and (3) regenerates text via graph-conditioned LLM generation. Evaluated on 90 clinician-authored psychiatric case narratives, Anonpsy preserves diagnostic fidelity while achieving consistently low re-identification risk under expert, semantic, and GPT-5-based evaluations. Compared with a strong LLM-only rewriting baseline, Anonpsy yields substantially lower semantic similarity and identifiability. These results demonstrate that explicit structural representations combined with constrained generation provide an effective approach to de-identification for psychiatric narratives.
zh
[NLP-73] he Hidden Toll of Social Media News: Causal Effects on Psychosocial Wellbeing
【速读】: 该论文试图解决的问题是:不同形式的社交媒体新闻参与行为如何影响用户的心理社会福祉(psychosocial wellbeing),尤其是情绪、行为和认知层面的后果。现有研究多聚焦于危机情境下的新闻效应,但对日常新闻消费中不同互动方式(如评论、引用、收藏)的差异化影响缺乏系统理解。解决方案的关键在于利用蓝天空(BlueSky)平台约2600万条帖子与4500万条评论的大规模数据集,通过准实验设计并采用分层倾向得分匹配方法(stratified propensity score analysis),将81,345名接触新闻流的处理组用户与83,711名对照组用户进行精准匹配,从而识别出不同参与行为对心理状态的具体影响。研究发现,新闻参与存在系统性权衡:虽提升社交互动、降低孤独感,但显著增加抑郁、压力和焦虑;其中,收藏行为相较于评论或引用带来的心理损害强度高出十倍以上,且重复暴露会累积负面效应。这一发现拓展了传统以危机为中心的新闻效应理论,揭示了日常新闻消费中基于参与类型的心理动态差异。
链接: https://arxiv.org/abs/2601.13487
作者: Olivia Pal,Agam Goyal,Eshwar Chandrasekharan,Koustuv Saha
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:News consumption on social media has become ubiquitous, yet how different forms of engagement shape psychosocial outcomes remains unclear. To address this gap, we leveraged a large-scale dataset of ~26M posts and ~45M comments on the BlueSky platform, and conducted a quasi-experimental study, matching 81,345 Treated users exposed to News feeds with 83,711 Control users using stratified propensity score analysis. We examined psychosocial wellbeing, in terms of affective, behavioral, and cognitive outcomes. Our findings reveal that news engagement produces systematic trade-offs: increased depression, stress, and anxiety, yet decreased loneliness and increased social interaction on the platform. Regression models reveal that News feed bookmarking is associated with greater psychosocial deterioration compared to commenting or quoting, with magnitude differences exceeding tenfold. These per-engagement effects accumulate with repeated exposure, showing significant psychosocial impacts. Our work extends theories of news effects beyond crisis-centric frameworks by demonstrating that routine consumption creates distinct psychological dynamics depending on engagement type, and bears implications for tools and interventions for mitigating the psychosocial costs of news consumption on social media.
zh
[NLP-74] PhysicsSolutionAgent : Towards Multimodal Explanations for Numerical Physics Problem Solving
【速读】: 该论文旨在解决数值物理问题的解释中,仅依赖文本难以实现有效概念理解的问题,尤其关注如何生成高质量、长时程(最长六分钟)的可视化解释视频以提升学习效果。其核心解决方案是提出PhysicsSolutionAgent (PSA),一个自主代理系统,通过调用Manim动画库自动生成物理问题的解释视频,并设计了一套包含15个量化指标的自动化评估流程与基于视觉-语言模型(Vision-Language Model, VLM)的反馈机制,用于迭代优化视频质量。关键创新在于将自动化的视觉内容生成与多维度的多模态评估相结合,从而推动面向数值物理教育的生成式AI应用向更可靠和可验证的方向发展。
链接: https://arxiv.org/abs/2601.13453
作者: Aditya Thole,Anmol Agrawal,Arnav Ramamoorthy,Dhruv Kumar
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Explaining numerical physics problems often requires more than text-based solutions; clear visual reasoning can substantially improve conceptual understanding. While large language models (LLMs) demonstrate strong performance on many physics questions in textual form, their ability to generate long, high-quality visual explanations remains insufficiently explored. In this work, we introduce PhysicsSolutionAgent (PSA), an autonomous agent that generates physics-problem explanation videos of up to six minutes using Manim animations. To evaluate the generated videos, we design an assessment pipeline that performs automated checks across 15 quantitative parameters and incorporates feedback from a vision-language model (VLM) to iteratively improve video quality. We evaluate PSA on 32 videos spanning numerical and theoretical physics problems. Our results reveal systematic differences in video quality depending on problem difficulty and whether the task is numerical or theoretical. Using GPT-5-mini, PSA achieves a 100% video-completion rate with an average automated score of 3.8/5. However, qualitative analysis and human inspection uncover both minor and major issues, including visual layout inconsistencies and errors in how visual content is interpreted during feedback. These findings expose key limitations in reliable Manim code generation and highlight broader challenges in multimodal reasoning and evaluation for visual explanations of numerical physics problems. Our work underscores the need for improved visual understanding, verification, and evaluation frameworks in future multimodal educational systems
zh
[NLP-75] MOSLD-Bench: Multilingual Open-Set Learning and Discovery Benchmark for Text Categorization
【速读】: 该论文旨在解决开放集学习与发现(Open-set Learning and Discovery, OSLD)在文本分类任务中的挑战,即模型需在测试阶段识别并学习来自未知类别的样本,这比零样本学习(Zero-shot Learning)更具挑战性,因新类别不仅未被事先定义,还需主动发现。解决方案的关键在于构建首个多语言OSLD基准(Multilingual Open-set Learning and Discovery, MOSLD),涵盖12种语言共960K条文本数据,并提出一个集成多阶段机制的新型框架,支持持续发现和学习新类别。该框架结合了现有数据重构与新闻领域新数据采集,为后续研究提供了可复现的评估标准和性能基线。
链接: https://arxiv.org/abs/2601.13437
作者: Adriana-Valentina Costache,Daria-Nicoleta Dragomir,Silviu-Florin Gheorghe,Eduard Poesina,Paul Irofti,Radu Tudor Ionescu
机构: University of Bucharest (布加勒斯特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Open-set learning and discovery (OSLD) is a challenging machine learning task in which samples from new (unknown) classes can appear at test time. It can be seen as a generalization of zero-shot learning, where the new classes are not known a priori, hence involving the active discovery of new classes. While zero-shot learning has been extensively studied in text classification, especially with the emergence of pre-trained language models, open-set learning and discovery is a comparatively new setup for the text domain. To this end, we introduce the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization by topic, comprising 960K data samples across 12 languages. To construct the benchmark, we (i) rearrange existing datasets and (ii) collect new data samples from the news domain. Moreover, we propose a novel framework for the OSLD task, which integrates multiple stages to continuously discover and learn new classes. We evaluate several language models, including our own, to obtain results that can be used as reference for future work. We release our benchmark at this https URL.
zh
[NLP-76] rust Me Im an Expert: Decoding and Steering Authority Bias in Large Language Models
【速读】: 该论文旨在解决语言模型在推理任务中对权威来源(endorsement source)的依赖性问题,即是否存在因来源权威性而产生的系统性偏差,进而导致模型错误地采纳不正确或误导性信息。研究发现,随着 endorsing source 的权威性提升,模型不仅更易接受错误结论,还会对其错误答案表现出更高的置信度,这种“权威偏倚”(authority bias)机制性地嵌入模型内部。解决方案的关键在于识别并干预这一偏倚机制——通过特定的提示工程或微调策略,可引导模型摆脱对高权威来源的盲目信任,从而在专家提供误导性信息时仍保持较高准确性,实现性能提升。
链接: https://arxiv.org/abs/2601.13433
作者: Priyanka Mary Mammen,Emil Joswin,Shankar Venkitachalam
机构: UMass Amherst(马萨诸塞大学安姆赫斯特分校); Independent Research(独立研究); Adobe(Adobe公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Prior research demonstrates that performance of language models on reasoning tasks can be influenced by suggestions, hints and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect/misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.
zh
[NLP-77] Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在形式化推理任务中是否具备真正的符号推理能力,还是仅依赖于对常见句法结构的模式匹配这一核心问题。其解决方案的关键在于构建一个针对确定性有限自动机(Deterministic Finite Automata, DFA)构造的基准测试集,涵盖事实性知识问答、已见构造问题以及两类未见问题:一类是具有多重交互约束的手工设计实例,另一类是基于Arden定理系统生成的问题。实验表明,LLMs在事实性和已见任务上表现优异(准确率84–90%),但在未见问题上准确率骤降30–64%,主要归因于对语言约束的系统性误读、Kleene星号语义处理错误及全局一致性缺失。进一步分析揭示,无论采用直接提示、思维链(Chain-of-Thought)还是思维树(Tree-of-Thought)等不同提示策略,模型均无法克服此类结构性缺陷,暴露出LLMs在语法可接受性与语义正确性之间存在根本性差距。
链接: https://arxiv.org/abs/2601.13392
作者: Shlok Shelat,Jay Raval,Souvik Roy,Manas Gaur
机构: Ahmedabad University (艾哈迈达巴德大学); University of Maryland Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: 30 pages, 11 figures, 6 tables, Work in Progress
Abstract:Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden’s theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs’ ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.
zh
[NLP-78] Structured Insight from Unstructured Data: Large Language Models for SDOH-Driven Diabetes Risk Prediction
【速读】: 该论文旨在解决社会决定因素健康(Social Determinants of Health, SDOH)信息在电子健康记录和糖尿病风险预测模型中缺失的问题,从而影响2型糖尿病(Type 2 Diabetes, T2D)管理的精准性。其关键解决方案是利用大语言模型(Large Language Models, LLMs)从患者非结构化的生命故事访谈文本中提取结构化SDOH信息,并结合检索增强生成(Retrieval-Augmented Generation, RAG)技术生成临床可解释的定性摘要与定量SDOH评分,进而用于传统机器学习模型(如Ridge、Lasso、Random Forest和XGBoost)的风险预测建模;同时验证了LLMs直接从访谈文本中预测糖尿病控制水平(低、中、高)的能力,实现对非结构化数据的有效转化与应用,为提升临床风险评估的全面性和可扩展性提供了新路径。
链接: https://arxiv.org/abs/2601.13388
作者: Sasha Ronaghi,Prerit Choudhary,David H Rehkopf,Bryant Lin
机构: Stanford University (斯坦福大学); Stanford University School of Medicine (斯坦福大学医学院)
类目: Computation and Language (cs.CL)
备注: 7 pages, 5 figures
Abstract:Social determinants of health (SDOH) play a critical role in Type 2 Diabetes (T2D) management but are often absent from electronic health records and risk prediction models. Most individual-level SDOH data is collected through structured screening tools, which lack the flexibility to capture the complexity of patient experiences and unique needs of a clinic’s population. This study explores the use of large language models (LLMs) to extract structured SDOH information from unstructured patient life stories and evaluate the predictive value of both the extracted features and the narratives themselves for assessing diabetes control. We collected unstructured interviews from 65 T2D patients aged 65 and older, focused on their lived experiences, social context, and diabetes management. These narratives were analyzed using LLMs with retrieval-augmented generation to produce concise, actionable qualitative summaries for clinical interpretation and structured quantitative SDOH ratings for risk prediction modeling. The structured SDOH ratings were used independently and in combination with traditional laboratory biomarkers as inputs to linear and tree-based machine learning models (Ridge, Lasso, Random Forest, and XGBoost) to demonstrate how unstructured narrative data can be applied in conventional risk prediction workflows. Finally, we evaluated several LLMs on their ability to predict a patient’s level of diabetes control (low, medium, high) directly from interview text with A1C values redacted. LLMs achieved 60% accuracy in predicting diabetes control levels from interview text. This work demonstrates how LLMs can translate unstructured SDOH-related data into structured insights, offering a scalable approach to augment clinical risk models and decision-making.
zh
[NLP-79] Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂任务中进行多步推理时,现有置信度估计方法仅输出单一标量分数、忽略推理过程中置信度动态演化的问题。这导致置信度评分容易受响应长度或冗余性等表面因素干扰,难以区分正确推理与错误但自信的输出。解决方案的关键在于引入信号时序逻辑(Signal Temporal Logic, STL),通过判别式STL挖掘程序发现能区分正确与错误推理路径的时序模式;进一步利用参数超网络(hypernetworks)为STL模块注入可学习的数值参数,从而实现对推理步骤中置信度信号的精细化建模。实验表明,该方法生成的置信度得分比基线更校准(calibrated)。
链接: https://arxiv.org/abs/2601.13387
作者: Zhenjiang Mao,Anirudhh Venkat,Artem Bisliouk,Akshat Kothiyal,Sindhura Kumbakonam Subramanian,Saithej Singhu,Ivan Ruchkin
机构: University of Florida (佛罗里达大学); University of Mannheim (曼海姆大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) increasingly rely on long-form, multi-step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.
zh
[NLP-80] From Completion to Editing: Unlocking Context-Aware Code Infilling via Search-and-Replace Instruction Tuning
【速读】: 该论文旨在解决当前代码补全(Code Completion)任务中主流Fill-in-the-Middle(FIM)范式存在的两大局限:一是无法纠正上下文错误,二是依赖未对齐且不安全的Base模型;同时应对Chat LLMs在安全性上的优势与Agentic工作流在性能和延迟上的不足。其解决方案的关键在于提出Search-and-Replace Infilling(SRI)框架,通过将代理式的验证与编辑机制内化为单次推理过程,并借助显式的搜索阶段结构化地定位并修正代码片段,从而实现从静态填充到动态上下文感知编辑的范式升级。此方法在仅用20k样本微调的情况下即可使Chat模型超越Base模型的补全性能,同时保持通用编程能力及接近标准FIM的推理延迟。
链接: https://arxiv.org/abs/2601.13384
作者: Jiajun Zhang,Zeyu Cui,Jiaxi Yang,Lei Zhang,Yuheng Jing,Zeyao Ma,Tianyi Bai,Zilei Wang,Qiang Liu,Liang Wang,Binyuan Hui,Junyang Lin
机构: USTC(中国科学技术大学); Alibaba Group(阿里巴巴集团); CASIA(中国科学院自动化研究所); SIAT(深圳先进技术研究院)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:The dominant Fill-in-the-Middle (FIM) paradigm for code completion is constrained by its rigid inability to correct contextual errors and reliance on unaligned, insecure Base models. While Chat LLMs offer safety and Agentic workflows provide flexibility, they suffer from performance degradation and prohibitive latency, respectively. To resolve this dilemma, we propose Search-and-Replace Infilling (SRI), a framework that internalizes the agentic verification-and-editing mechanism into a unified, single-pass inference process. By structurally grounding edits via an explicit search phase, SRI harmonizes completion tasks with the instruction-following priors of Chat LLMs, extending the paradigm from static infilling to dynamic context-aware editing. We synthesize a high-quality dataset, SRI-200K, and fine-tune the SRI-Coder series. Extensive evaluations demonstrate that with minimal data (20k samples), SRI-Coder enables Chat models to surpass the completion performance of their Base counterparts. Crucially, unlike FIM-style tuning, SRI preserves general coding competencies and maintains inference latency comparable to standard FIM. We empower the entire Qwen3-Coder series with SRI, encouraging the developer community to leverage this framework for advanced auto-completion and assisted development.
zh
[NLP-81] Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models
【速读】: 该论文旨在解决大语言模型在推理模块(如思维链机制)中生成答案时的不确定性评估问题,以避免因高置信度输出错误或误导性内容(即幻觉)导致的风险。当前方法虽能通过过滤无关标记或分析局部语义关联来评估置信度,但常忽视推理步骤间的时间扩散效应,从而可能高估整体置信度。其解决方案的关键在于:引入跨步骤注意力机制(inter-step attention)以捕捉不同推理步骤间的语义相关性,并设计隐藏置信度机制(hidden confidence mechanism)来保留历史置信信息,最终将步骤级置信度与历史信息融合,实现更精准的整体置信度估计。
链接: https://arxiv.org/abs/2601.13368
作者: Zhenjiang Mao,Anirudhh Venkat
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:As reasoning modules, such as the chain-of-thought mechanism, are applied to large language models, they achieve strong performance on various tasks such as answering common-sense questions and solving math problems. The main challenge now is to assess the uncertainty of answers, which can help prevent misleading or serious hallucinations for users. Although current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences, the temporal spread of confidence is often overlooked. This oversight can lead to inflated overall confidence, even when earlier steps exhibit very low confidence. To address this issue, we propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps. For handling long-horizon responses, we introduce a hidden confidence mechanism to retain historical confidence information, which is then combined with stepwise confidence to produce a more accurate overall estimate. We evaluate our method on the GAOKAO math benchmark and the CLadder causal reasoning dataset using mainstream open-source large language models. Our approach is shown to outperform state-of-the-art methods by achieving a superior balance between predictive quality and calibration, demonstrated by strong performance on both Negative Log-Likelihood and Expected Calibration Error.
zh
[NLP-82] Sockpuppetting: Jailbreaking LLM s Without Optimization Through Output Prefix Injection
【速读】: 该论文旨在解决开放权重大语言模型(Large Language Models, LLMs)在面对恶意提示时的安全防护问题,特别是如何通过低成本、易实施的方式实现“越狱”(jailbreaking),即诱导模型生成违反其安全策略的内容。解决方案的关键在于提出了一种名为“sockpuppetting”的新攻击方法:该方法仅需在模型输出的开头插入一个接受性序列(如“Sure, here is how to…”),并允许模型自行完成后续响应,无需复杂优化或大量计算资源。这一策略显著提升了攻击成功率(ASR),相较于现有方法GCG,在Qwen3-8B上提升达80%,并在Llama-3.1-8B上通过优化助手消息块内的对抗后缀进一步提高64%的ASR,表明输出前缀注入是当前开放权重模型中的关键脆弱点,亟需加强防御。
链接: https://arxiv.org/abs/2601.13359
作者: Asen Dotsinski,Panagiotis Eustratiadis
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:As open-weight large language models (LLMs) increase in capabilities, safeguarding them against malicious prompts and understanding possible attack vectors becomes ever more important. While automated jailbreaking methods like GCG [Zou et al., 2023] remain effective, they often require substantial computational resources and specific expertise. We introduce "sockpuppetting’‘, a simple method for jailbreaking open-weight LLMs by inserting an acceptance sequence (e.g., "Sure, here is how to…’') at the start of a model’s output and allowing it to complete the response. Requiring only a single line of code and no optimization, sockpuppetting achieves up to 80% higher attack success rate (ASR) than GCG on Qwen3-8B in per-prompt comparisons. We also explore a hybrid approach that optimizes the adversarial suffix within the assistant message block rather than the user prompt, increasing ASR by 64% over GCG on Llama-3.1-8B in a prompt-agnostic setting. The results establish sockpuppetting as an effective low-cost attack accessible to unsophisticated adversaries, highlighting the need for defences against output-prefix injection in open-weight models.
zh
[NLP-83] On the Relation of State Space Models and Hidden Markov Models
【速读】: 该论文旨在解决经典概率状态空间模型(State Space Models, SSMs)、隐马尔可夫模型(Hidden Markov Models, HMMs)与现代自然语言处理(Natural Language Processing, NLP)中的确定性状态空间模型(如S4和Mamba)之间的关系不清晰的问题。其核心挑战在于厘清这些模型在结构、推断算法(如前向-后向算法与卡尔曼滤波)以及学习机制(期望最大化与梯度优化)上的共性与差异。解决方案的关键在于构建一个统一的框架,通过概率图模型视角系统比较上述模型,并明确指出它们在何种条件下等价、何时本质不同,从而揭示现代NLP SSMs如何继承并扩展传统概率建模方法,实现控制理论、概率建模与深度学习的跨领域融合。
链接: https://arxiv.org/abs/2601.13357
作者: Aydin Ghojogh,M.Hadi Sepanj,Benyamin Ghojogh
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY)
备注:
Abstract:State Space Models (SSMs) and Hidden Markov Models (HMMs) are foundational frameworks for modeling sequential data with latent variables and are widely used in signal processing, control theory, and machine learning. Despite their shared temporal structure, they differ fundamentally in the nature of their latent states, probabilistic assumptions, inference procedures, and training paradigms. Recently, deterministic state space models have re-emerged in natural language processing through architectures such as S4 and Mamba, raising new questions about the relationship between classical probabilistic SSMs, HMMs, and modern neural sequence models. In this paper, we present a unified and systematic comparison of HMMs, linear Gaussian state space models, Kalman filtering, and contemporary NLP state space models. We analyze their formulations through the lens of probabilistic graphical models, examine their inference algorithms – including forward-backward inference and Kalman filtering – and contrast their learning procedures via Expectation-Maximization and gradient-based optimization. By highlighting both structural similarities and semantic differences, we clarify when these models are equivalent, when they fundamentally diverge, and how modern NLP SSMs relate to classical probabilistic models. Our analysis bridges perspectives from control theory, probabilistic modeling, and modern deep learning. Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY) Cite as: arXiv:2601.13357 [cs.LG] (or arXiv:2601.13357v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.13357 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-84] LLM -as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中缺乏可更新记忆机制的问题,即当模型在生成序列的某一步骤 t 出现错误时,无法利用反馈信息动态调整后续预测(t+1),因为其上下文历史是静态固定的。解决方案的关键在于提出 LLM-as-RNN 框架,该框架将冻结的 LLM 转化为一个递归预测器,通过自然语言形式表示隐藏状态作为“记忆”,并以结构化的系统提示(system-prompt)摘要形式存储和更新这一状态;每次时间步通过反馈驱动的文本重写来更新记忆,从而实现无需参数微调的在线学习。此方法在固定词元预算下有效纠正错误并保留任务相关模式,显著提升预测准确性,且生成可解释的学习轨迹。
链接: https://arxiv.org/abs/2601.13352
作者: Yuxing Lu,J. Ben Tamo,Weichen Zhao,Nan Sun,Yishan Zhong,Wenqi Shi,Jinzhuo Wang,May D. Wang
机构: Georgia Institute of Technology (佐治亚理工学院); Peking University (北京大学); Shandong University (山东大学); Huazhong University of Science and Technology (华中科技大学); UT Southwestern Medical Center (德克萨斯西南医学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 17 pages, 5 figures, 6 tables
Abstract:Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mechanism that improves predictions for step t+1. We propose LLM-as-RNN, an inference-only framework that turns a frozen LLM into a recurrent predictor by representing its hidden state as natural-language memory. This state, implemented as a structured system-prompt summary, is updated at each timestep via feedback-driven text rewrites, enabling learning without parameter updates. Under a fixed token budget, LLM-as-RNN corrects errors and retains task-relevant patterns, effectively performing online learning through language. We evaluate the method on three sequential benchmarks in healthcare, meteorology, and finance across Llama, Gemma, and GPT model families. LLM-as-RNN significantly outperforms zero-shot, full-history, and MemPrompt baselines, improving predictive accuracy by 6.5% on average, while producing interpretable, human-readable learning traces absent in standard context accumulation.
zh
[NLP-85] AfroScope: A Framework for Studying the Linguistic Landscape of Africa
【速读】: 该论文旨在解决非洲语言识别(African Language Identification, LID)中存在的两大问题:一是支持的语言种类有限,二是难以对语义或地理上高度相似的方言进行细粒度区分。解决方案的关键在于提出一个统一框架 AfroScope,其核心包括两个部分:(1) AfroScope-Data,一个涵盖713种非洲语言的标注数据集;(2) AfroScope-Models,一套具有广泛语言覆盖能力的高性能LID模型。为提升对高度混淆语言的区分能力,作者进一步引入一种分层分类方法,并结合专为29种密切相关的近缘语言设计的Mirror-Serengeti嵌入模型,该策略在混淆子集上的宏平均F1得分较最优基线模型提升4.55点。
链接: https://arxiv.org/abs/2601.13346
作者: Sang Yun Kwon,AbdelRahim Elmadany,Muhammad Abdul-Mageed
机构: The University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Language Identification (LID) is the task of determining the language of a given text and is a fundamental preprocessing step that affects the reliability of downstream NLP applications. While recent work has expanded LID coverage for African languages, existing approaches remain limited in (i) the number of supported languages and (ii) their ability to make fine-grained distinctions among closely related varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 713 African languages, and AfroScope-Models, a suite of strong LID models with broad language coverage. To better distinguish highly confusable languages, we propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. This approach improves macro F1 by 4.55 on this confusable subset compared to our best base model. Finally, we analyze cross linguistic transfer and domain effects, offering guidance for building robust African LID systems. We position African LID as an enabling technology for large scale measurement of Africas linguistic landscape in digital text and release AfroScope-Data and AfroScope-Models publicly.
zh
[NLP-86] RegCheck: A tool for automating comparisons between study registrations and papers
【速读】: 该论文旨在解决研究注册(study registration)与发表论文之间一致性核查效率低下的问题,即当前手动比对注册信息与论文内容耗时费力、依赖跨领域专业知识,导致注册的透明性和严谨性优势难以充分发挥。解决方案的关键在于提出一个模块化的基于大语言模型(Large Language Model, LLM)辅助工具 RegCheck,其核心设计原则是将人类专家判断置于决策流程中心:一方面由用户自主选择需比对的特征项,另一方面精准提取并呈现每个特征相关的最相关文本片段,从而辅助而非替代人工判断;同时生成带有唯一标识符的可共享报告,支持跨用户验证与复用,具备跨学科、跨注册与出版格式的适应性,为可重复科学研究提供可扩展的基础架构。
链接: https://arxiv.org/abs/2601.13330
作者: Jamie Cummins,Beth Clarke,Ian Hussey,Malte Elson
机构: Bennett Institute of Applied Data Science, University of Oxford (牛津大学应用数据科学研究所); Institute of Psychology, University of Bern (伯尔尼大学心理学研究所)
类目: Computation and Language (cs.CL)
备注: 15 pages, 1 figure
Abstract:Across the social and medical sciences, researchers recognize that specifying planned research activities (i.e., ‘registration’) prior to the commencement of research has benefits for both the transparency and rigour of science. Despite this, evidence suggests that study registrations frequently go unexamined, minimizing their effectiveness. In a way this is no surprise: manually checking registrations against papers is labour- and time-intensive, requiring careful reading across formats and expertise across domains. The advent of AI unlocks new possibilities in facilitating this activity. We present RegCheck, a modular LLM-assisted tool designed to help researchers, reviewers, and editors from across scientific disciplines compare study registrations with their corresponding papers. Importantly, RegCheck keeps human expertise and judgement in the loop by (i) ensuring that users are the ones who determine which features should be compared, and (ii) presenting the most relevant text associated with each feature to the user, facilitating (rather than replacing) human discrepancy judgements. RegCheck also generates shareable reports with unique RegCheck IDs, enabling them to be easily shared and verified by other users. RegCheck is designed to be adaptable across scientific domains, as well as registration and publication formats. In this paper we provide an overview of the motivation, workflow, and design principles of RegCheck, and we discuss its potential as an extensible infrastructure for reproducible science with an example use case.
zh
[NLP-87] Reducing Tokenization Premiums for Low-Resource Languages
【速读】: 该论文试图解决低资源语言在现代语言模型(Language Models, LM)中面临的显著分词溢价(tokenization premium)问题,即相同语义内容的句子在低资源语言中需要比英文多数倍的标记(token)来表示,从而导致API调用成本升高、能耗增加以及有效上下文窗口缩短。解决方案的关键在于:通过后处理方式向预训练模型的词汇表中添加新的合并标记(coalesce multi-token characters into single tokens),以压缩低资源语言的输入表示,同时保持模型输出的语义一致性——实验表明,在Llama 3.2 1B模型中,原始输入与压缩后的输入通常具有相似的最后隐藏状态(last hidden states)。
链接: https://arxiv.org/abs/2601.13328
作者: Geoffrey Churchill,Steven Skiena
机构: Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Relative to English, low-resource languages suffer from substantial tokenization premiums in modern LMs, meaning that it generally requires several times as many tokens to encode a sentence in a low-resource language than to encode the analogous sentence in English. This tokenization premium results in increased API and energy costs and reduced effective context windows for these languages. In this paper we analyze the tokenizers of ten popular LMs to better understand their designs and per-language tokenization premiums. We also propose a mechanism to reduce tokenization premiums in pre-trained models, by post-hoc additions to the token vocabulary that coalesce multi-token characters into single tokens. We apply this methodology to 12 low-resource languages, demonstrating that the original and compressed inputs often have similar last hidden states when run through the Llama 3.2 1B model.
zh
[NLP-88] Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology
【速读】: 该论文旨在解决方言阿拉伯语(Dialectal Arabic, DA)语音数据在领域覆盖、方言标注方式及录音条件上的高度异质性问题,这导致跨数据集比较和模型评估困难。解决方案的关键在于提出一个标准化框架——Arab Voices,该框架统一整合了31个涵盖14种方言的数据集,提供一致的元数据和评估工具,从而减少碎片化并支持可复现的自动语音识别(ASR)性能评估,同时通过基准测试多种前沿ASR系统建立了现代DA ASR的强基线。
链接: https://arxiv.org/abs/2601.13319
作者: Peter Sullivan,AbdelRahim Elmadany,Alcides Alcoba Inciarte,Muhammad Abdul-Mageed
机构: The University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions, complicating cross-dataset comparison and model evaluation. To characterize this landscape, we conduct a computational analysis of linguistic ``dialectness’’ alongside objective proxies of audio quality on the training splits of widely used DA corpora. We find substantial heterogeneity both in acoustic conditions and in the strength and consistency of dialectal signals across datasets, underscoring the need for standardized characterization beyond coarse labels. To reduce fragmentation and support reproducible evaluation, we introduce Arab Voices, a standardized framework for DA ASR. Arab Voices provides unified access to 31 datasets spanning 14 dialects, with harmonized metadata and evaluation utilities. We further benchmark a range of recent ASR systems, establishing strong baselines for modern DA ASR.
zh
[NLP-89] Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme Modeling of Climate Discourse
【速读】: 该论文旨在解决当前气候传播研究中因平台差异导致的分析局限性问题,即现有计算方法通常孤立地分析付费广告与公共社交媒体上的气候话语,难以区分机构性信息与用户自发表达。其核心解决方案是提出一种可解释的端到端主题发现与标注框架,通过语义相似性聚类文本,并利用大语言模型(Large Language Models, LLMs)生成简洁且人类可理解的主题标签,从而实现跨平台(Meta广告 vs. Bluesky公共帖子)气候叙事的系统性比较。该框架在主题质量上优于传统主题建模方法,并通过立场预测和主题引导检索等下游任务验证了其语义一致性,揭示了不同平台激励机制如何塑造气候叙事的内容结构、立场倾向及时间响应特征。
链接: https://arxiv.org/abs/2601.13317
作者: Samantha Sudhoff,Pranav Perumal,Zhaoqing Wu,Tunazzina Islam
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:Climate discourse online plays a crucial role in shaping public understanding of climate change and influencing political and policy outcomes. However, climate communication unfolds across structurally distinct platforms with fundamentally different incentive structures: paid advertising ecosystems incentivize targeted, strategic persuasion, while public social media platforms host largely organic, user-driven discourse. Existing computational studies typically analyze these environments in isolation, limiting our ability to distinguish institutional messaging from public expression. In this work, we present a comparative analysis of climate discourse across paid advertisements on Meta (previously known as Facebook) and public posts on Bluesky from July 2024 to September 2025. We introduce an interpretable, end-to-end thematic discovery and assignment framework that clusters texts by semantic similarity and leverages large language models (LLMs) to generate concise, human-interpretable theme labels. We evaluate the quality of the induced themes against traditional topic modeling baselines using both human judgments and an LLM-based evaluator, and further validate their semantic coherence through downstream stance prediction and theme-guided retrieval tasks. Applying the resulting themes, we characterize systematic differences between paid climate messaging and public climate discourse and examine how thematic prevalence shifts around major political events. Our findings show that platform-level incentives are reflected in the thematic structure, stance alignment, and temporal responsiveness of climate narratives. While our empirical analysis focuses on climate communication, the proposed framework is designed to support comparative narrative analysis across heterogeneous communication environments.
zh
[NLP-90] OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多选题问答(Multiple-Choice Question Answering, MCQA)任务中对指令干扰(directive interference)的脆弱性问题,即模型决策易受社会暗示、框架效应和误导性指令等非内容因素影响。解决方案的关键在于提出“选项注入”(Option Injection, OI),通过在标准MCQA界面中添加一个包含误导性指令的干扰选项,系统性地模拟和评估模型在选择接口中的响应行为;同时构建了OI-Bench基准数据集,涵盖3000道覆盖知识、推理与常识任务的问题及16类指令类型,从而实现对模型鲁棒性的量化分析与比较。
链接: https://arxiv.org/abs/2601.13300
作者: Yow-Fu Liou,Yu-Chien Tang,Yu-Hsiang Liu,An-Zi Yen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evaluation. We construct OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks, with 16 directive types covering social compliance, bonus framing, threat framing, and instructional interference. This setting combines manipulation of the choice interface with directive-based interference, enabling systematic assessment of model susceptibility. We evaluate 12 LLMs to analyze attack success rates, behavioral responses, and further investigate mitigation strategies ranging from inference-time prompting to post-training alignment. Experimental results reveal substantial vulnerabilities and heterogeneous robustness across models. OI-Bench is expected to support more systematic evaluation of LLM robustness to directive interference within choice-based interfaces.
zh
[NLP-91] CooperBench: Why Coding Agents Cannot be Your Teammates Yet
【速读】: 该论文旨在解决当前AI代理在协作任务中缺乏社会智能(social intelligence)的问题,尤其是在复杂工作场景下无法有效协调彼此行为以达成共识。其核心问题是:尽管现有编码代理具备独立完成任务的能力,但在多代理协同环境中,由于沟通不畅、承诺偏离和预期错误等机制缺陷,导致整体协作效率显著下降。解决方案的关键在于引入CooperBench——一个包含600余项真实开源项目背景下的协作编程任务的基准测试平台,通过系统性评估发现“协调诅咒”现象(即代理协作成功率比单独执行低30%),并揭示出三大协作障碍,从而推动研究范式从单一能力提升转向对社会智能的构建与优化。
链接: https://arxiv.org/abs/2601.13295
作者: Arpandeep Khatua,Hao Zhu,Peter Tran,Arya Prabhudesai,Frederic Sadrieh,Johann K. Lieberwirth,Xinkai Yu,Yicheng Fu,Michael J. Ryan,Jiaxin Pei,Diyi Yang
机构: Stanford University (斯坦福大学); SAP Labs US (SAP 实验室美国)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注: this https URL
Abstract:Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus. As AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities. To test this, we introduce CooperBench, a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the curse of coordination: agents achieve on average 30% lower success rates when working together compared to performing both tasks individually. This contrasts sharply with human teams, where adding teammates typically improves productivity. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others’ plans and communication. Through large-scale simulation, we also observe rare but interesting emergent coordination behavior including role division, resource division, and negotiation. Our research presents a novel benchmark for collaborative coding and calls for a shift from pursuing individual agent capability to developing social intelligence.
zh
[NLP-92] A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在生产环境中因依赖独立分类模型进行安全检测及其他分类任务而导致的延迟高、显存占用大及运维复杂的问题。其核心解决方案是复用LLM自身生成过程中的隐藏状态,通过训练轻量级探针(probe)在相同的前向传播中完成分类任务,从而避免引入额外的“守卫模型”(guard-model)管道。关键创新在于将分类建模为对完整token-layer隐藏状态张量的表示选择问题,而非固定某个token或层(如首token logits或最终层池化),并提出两阶段聚合机制:首先在每层内汇总token信息,再跨层整合形成统一分类表示,实现了低延迟、低显存开销下的高性能分类。
链接: https://arxiv.org/abs/2601.13288
作者: Gonzalo Ariel Meyoyan,Luciano Del Corro
机构: Universidad de Buenos Aires (布宜诺斯艾利斯大学); Universidad de San Andrés (圣安德烈斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.
zh
[NLP-93] Unlearning in LLM s: Methods Evaluation and Open Challenges
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在广泛应用中引发的隐私、版权、安全和偏见等伦理与技术挑战,核心问题是如何在不进行全量重新训练的前提下,实现对模型知识或数据的精准删除,即“机器遗忘”(Machine Unlearning)。其解决方案的关键在于系统性地梳理和分类现有LLM的遗忘方法,包括数据驱动型、参数驱动型、架构驱动型、混合及其它策略,并构建评估体系以量化遗忘效果、知识保留能力与鲁棒性,从而为开发高效、可验证且负责任的LLM遗忘技术提供理论框架与实践路径。
链接: https://arxiv.org/abs/2601.13264
作者: Tyler Lizzo,Larry Heck
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have achieved remarkable success across natural language processing tasks, yet their widespread deployment raises pressing concerns around privacy, copyright, security, and bias. Machine unlearning has emerged as a promising paradigm for selectively removing knowledge or data from trained models without full retraining. In this survey, we provide a structured overview of unlearning methods for LLMs, categorizing existing approaches into data-centric, parameter-centric, architecture-centric, hybrid, and other strategies. We also review the evaluation ecosystem, including benchmarks, metrics, and datasets designed to measure forgetting effectiveness, knowledge retention, and robustness. Finally, we outline key challenges and open problems, such as scalable efficiency, formal guarantees, cross-language and multimodal unlearning, and robustness against adversarial relearning. By synthesizing current progress and highlighting open directions, this paper aims to serve as a roadmap for developing reliable and responsible unlearning techniques in large language models.
zh
[NLP-94] CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言医疗推理任务中表现不可靠的问题,这一局限性阻碍了其在多语言医疗场景中的部署。解决方案的关键在于提出两个核心组件:一是构建CUREMED-BENCH,一个高质量的多语言医疗推理基准数据集,涵盖13种语言(包括阿姆哈拉语、约鲁巴语和斯瓦希里语等低资源语言),包含开放式的推理问题且答案可验证;二是设计CURE-MED框架,一种基于课程学习启发的强化学习方法,融合了代码切换感知的监督微调(code-switching-aware supervised fine-tuning)与组相对策略优化(Group Relative Policy Optimization),以协同提升逻辑正确性和语言稳定性。实验表明,该方法在13种语言上均显著优于基线模型,并在7B和32B参数规模下分别实现54.35%和70.04%的逻辑正确率,以及85.21%和94.96%的语言一致性,从而推动了可靠且公平的多语言医疗推理能力发展。
链接: https://arxiv.org/abs/2601.13262
作者: Eric Onyame,Akash Ghosh,Subhadip Baidya,Sriparna Saha,Xiuying Chen,Chirag Agarwal
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While large language models (LLMs) have shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset with open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at this https URL
zh
[NLP-95] Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models EACL2026
【速读】: 该论文旨在解决当前大型语言模型中分词(tokenization)环节缺乏理论指导、设计不一致的问题,这些问题导致分词方法(如字节对编码,Byte Pair Encoding, BPE)在不同语言和领域中常与语言结构不匹配、放大偏见并浪费计算资源。其解决方案的关键在于将分词视为核心建模决策而非预处理步骤,提出一种上下文感知的框架,通过分词器与模型的协同设计(co-design),结合语言学、领域特性和部署需求进行优化,并强调标准化评估与透明报告以确保可比性与问责制,从而提升语言技术的公平性、效率与适应性。
链接: https://arxiv.org/abs/2601.13260
作者: Sawsan Alqahtani,Mir Tafseer Nayeem,Md Tahmid Rahman Laskar,Tasnim Mohiuddin,M Saiful Bari
机构: Princess Nourah Bint Abdulrahman University (公主诺拉·本·阿卜杜勒拉赫曼大学); Saudi Data & AI Authority (沙特数据与人工智能局); University of Alberta (阿尔伯塔大学); Dialpad; York University (约克大学); Qatar Computing Research Institute (卡塔尔计算研究研究所); Amazon AGI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EACL 2026 (long, main). The first two authors contributed equally
Abstract:Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.
zh
[NLP-96] A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus
【速读】: 该论文旨在解决低资源语言(以土耳其语为例)中语义关系数据集稀缺的问题,从而限制了自然语言处理(Natural Language Processing, NLP)模型在这些语言上的性能提升。其解决方案的关键在于提出一种混合方法学:首先利用FastText词嵌入结合凝聚聚类(Agglomerative Clustering)识别语义簇;随后使用Gemini 2.5-Flash模型自动分类语义关系;最后融合人工校准的词典资源进行增强。该流程实现了高质量、大规模(843,000条唯一语义对)语义关系数据的低成本生成(仅65美元),并已在下游任务中验证其有效性,具备良好的可扩展性与跨语言适用潜力。
链接: https://arxiv.org/abs/2601.13253
作者: Ebubekir Tosun,Mehmet Emin Buldur,Özay Ezerceli,Mahmoud ElHussieni
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms) representing a 10x scale increase over existing resources at minimal cost ( 65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining 90% F1-macro. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.
zh
[NLP-97] Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph
【速读】: 该论文旨在解决神经嵌入(Neural Embeddings)在语义聚类中难以区分同义词与反义词的问题,这一缺陷导致仅通过提高相似度阈值无法有效避免相反概念被错误地归入同一簇。其解决方案的关键在于构建一个大规模、高精度的语义聚类系统,包含三大核心创新:首先,利用Gemini 2.5-Flash大语言模型(LLM)增强并人工校验生成了包含84.3万对概念关系的标注数据集(涵盖同义、反义和上下位关系);其次,提出一种三元语义关系判别器(three-way semantic relation discriminator),在宏平均F1得分上达到90%,显著提升语义歧义消解能力;最后,设计了一种拓扑感知的软到硬聚类算法(topology-aware two-stage expansion-pruning procedure with topological voting),有效抑制语义漂移(semantic drift)并处理多义性(polysemy),确保每个词项唯一归属至语义一致的簇,从而实现高精度语义搜索与检索增强生成(Retrieval-Augmented Generation, RAG)。
链接: https://arxiv.org/abs/2601.13251
作者: Ebubekir Tosun,Mehmet Emin Buldur,Özay Ezerceli,Mahmoud ElHussieni
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Neural embeddings have a notorious blind spot: they can’t reliably tell synonyms apart from antonyms. Consequently, increasing similarity thresholds often fails to prevent opposites from being grouped together. We’ve built a large-scale semantic clustering system specifically designed to tackle this problem head on. Our pipeline chews through 15 million lexical items, evaluates a massive 520 million potential relationships, and ultimately generates 2.9 million high-precision semantic clusters. The system makes three primary contributions. First, we introduce a labeled dataset of 843,000 concept pairs spanning synonymy, antonymy, and co-hyponymy, constructed via Gemini 2.5-Flash LLM augmentation and verified using human-curated dictionary resources. Second, we propose a specialized three-way semantic relation discriminator that achieves 90% macro-F1, enabling robust disambiguation beyond raw embedding similarity. Third, we introduce a novel soft-to-hard clustering algorithm that mitigates semantic drift preventing erroneous transitive chains (e.g., hot - spicy - pain - depression) while simultaneously resolving polysemy. Our approach employs a topology-aware two-stage expansion-pruning procedure with topological voting, ensuring that each term is assigned to exactly one semantically coherent cluster. The resulting resource enables high-precision semantic search and retrieval-augmented generation, particularly for morphologically rich and low-resource languages where existing synonym databases remain sparse.
zh
[NLP-98] Aligning Agent ic World Models via Knowledgeable Experience Learning
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在物理世界任务执行中存在“模态断层”(modal disconnect)的问题,即模型虽具备丰富的语义知识,却缺乏对物理法则的程序性约束,导致其生成的行动计划常因违反物理规律而不可执行(称为“物理幻觉”)。现有对齐方法多依赖高成本的训练或微调,将动态环境规则静态编码至模型参数中,难以适应开放环境中不断变化的物理特性。论文提出的解决方案关键在于引入WorldMind框架,通过自主构建符号化世界知识库(World Knowledge Repository),融合两类经验:过程经验(Process Experience)利用预测误差强制物理可行性,目标经验(Goal Experience)通过成功轨迹引导任务最优性,从而实现无需频繁重训练即可跨模型、跨环境迁移的物理感知决策能力。
链接: https://arxiv.org/abs/2601.13247
作者: Baochang Ren,Yunzhi Yao,Rui Sun,Shuofei Qiao,Ningyu Zhang,Huajun Chen
机构: Zhejiang University (浙江大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Ongoing work
Abstract:Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations-generating plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource-intensive training or fine-tuning, which attempt to compress dynamic environmental rules into static model parameters. However, such parametric encapsulation is inherently rigid, struggling to adapt to the open-ended variability of physical dynamics without continuous, costly retraining. To bridge this gap, we introduce WorldMind, a framework that autonomously constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. Specifically, it unifies Process Experience to enforce physical feasibility via prediction errors and Goal Experience to guide task optimality through successful trajectories. Experiments on EB-ALFRED and EB-Habitat demonstrate that WorldMind achieves superior performance compared to baselines with remarkable cross-model and cross-environment transferability.
zh
[NLP-99] KOCO-BENCH: Can Large Language Models Leverag e Domain Knowledge in Software Development?
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在领域特定软件开发中表现不足的问题,即现有方法难以有效学习和应用领域知识,且缺乏能够评估领域专业化方法效果的基准测试工具。此前的代码基准主要关注模型已有的知识掌握程度,而非其获取与应用新知识的能力,且未提供明确的知识语料库以支持领域专业化方法的开发。为此,作者提出KOCO-BENCH,一个面向真实软件开发场景的新型基准,包含6个新兴领域、11个软件框架及25个项目,并配套结构化知识语料库与多粒度任务(从函数级到项目级代码生成及多项选择题形式的知识理解)。其关键创新在于:要求模型主动从知识语料库中获取并应用API、规则、约束等多样化的领域知识来完成任务,而非仅依赖预训练知识。实验表明,即使采用当前最先进的领域专业化方法(如监督微调SFT、检索增强生成RAG、kNN-LM),性能提升仍有限,凸显了开发更高效领域专业化方法的紧迫性。
链接: https://arxiv.org/abs/2601.13240
作者: Xue Jiang,Jiaru Qian,Xianjie Shi,Chenjie Li,Hao Zhu,Ziyu Wang,Jielun Zhang,Zheyu Zhao,Kechi Zhang,Jia Li,Wenpin Jiao,Zhi Jin,Ge Li,Yihong Dong
机构: Peking University (北京大学); Wuhan University (武汉大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) excel at general programming but struggle with domain-specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing domain-specific code benchmarks cannot evaluate the effectiveness of domain specialization methods, which focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi-granularity evaluation tasks including domain code generation (from function-level to project-level with rigorous test suites) and domain knowledge understanding (via multiple-choice QA). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO-BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from knowledge corpora to solve evaluation tasks. Our evaluations reveal that KOCO-BENCH poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. Best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-BENCH, evaluation code, and baselines to advance further research at this https URL.
zh
[NLP-100] RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在照护支持场景中响应安全性与适当性评估不足的问题,尤其针对照护者(caregivers)所表达的复杂需求——如信息获取、情感确认和痛苦信号——现有通用AI评估框架难以准确捕捉此类情境下的细微风险。解决方案的关键在于提出一个基于伦理关怀要素(Elements of an Ethic of Care)的理论驱动型评估框架RubRIX(Rubric-based Risk Index),其通过五个实证推导出的风险维度(Inattention、Bias Stigma、Information Inaccuracy、Uncritical Affirmation、Epistemic Arrogance)对LLM响应进行结构化量化评估,并经由临床专家验证。实验表明,基于RubRIX指导的迭代优化可使各模型风险成分降低45–98%,从而为高负担场景下负责任部署LLM提供了可操作、用户中心且领域敏感的评估方法论。
链接: https://arxiv.org/abs/2601.13235
作者: Drishti Goel,Jeongah Lee,Qiuyue Joy Zhong,Violeta J. Rodriguez,Daniel S. Brown,Ravi Karkar,Dong Whi Yoo,Koustuv Saha
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); OSF HealthCare (OSF医疗保健公司); Indiana University Indianapolis (印第安纳大学印第安纳波利斯分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:Caregivers seeking AI-mediated support express complex needs – information-seeking, emotional validation, and distress cues – that warrant careful evaluation of response safety and appropriateness. Existing AI evaluation frameworks, primarily focused on general risks (toxicity, hallucinations, policy violations, etc), may not adequately capture the nuanced risks of LLM-responses in caregiving-contexts. We introduce RubRIX (Rubric-based Risk Index), a theory-driven, clinician-validated framework for evaluating risks in LLM caregiving responses. Grounded in the Elements of an Ethic of Care, RubRIX operationalizes five empirically-derived risk dimensions: Inattention, Bias Stigma, Information Inaccuracy, Uncritical Affirmation, and Epistemic Arrogance. We evaluate six state-of-the-art LLMs on over 20,000 caregiver queries from Reddit and ALZConnected. Rubric-guided refinement consistently reduced risk-components by 45-98% after one iteration across models. This work contributes a methodological approach for developing domain-sensitive, user-centered evaluation frameworks for high-burden contexts. Our findings highlight the importance of domain-sensitive, interactional risk evaluation for the responsible deployment of LLMs in caregiving support contexts. We release benchmark datasets to enable future research on contextual risk evaluation in AI-mediated support.
zh
[NLP-101] Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models)在建模深度和生成质量上受限于单步依赖关系的问题,从而影响其在复杂任务中的表现。解决方案的关键在于提出一种名为 Any-order Any-subset Autoregressive modeling (A3) 的通用框架,该框架将标准自回归(Autoregressive, AR)因子分解扩展至任意标记组和任意生成顺序,并通过双流注意力机制与渐进式适应策略,使预训练的AR模型能够逐步过渡到任意顺序预测,从而在保持AR模型概率严谨性和多层依赖建模能力的同时,继承扩散模型的并行与双向生成灵活性。
链接: https://arxiv.org/abs/2601.13228
作者: Tianqi Du,Lizhe Fang,Weijie Yang,Chenheng Zhang,Zeming Wei,Yifei Wang,Yisen Wang
机构: Peking University (北京大学); Amazon AGI SF Lab; MIT (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion language models enable any-order generation and bidirectional conditioning, offering appealing flexibility for tasks such as infilling, rewriting, and self-correction. However, their formulation-predicting one part of a sequence from another within a single-step dependency-limits modeling depth and often yields lower sample quality and stability than autoregressive (AR) models. To address this, we revisit autoregressive modeling as a foundation and reformulate diffusion-style training into a structured multi-group prediction process. We propose Any-order Any-subset Autoregressive modeling (A3), a generalized framework that extends the standard AR factorization to arbitrary token groups and generation orders. A3 preserves the probabilistic rigor and multi-layer dependency modeling of AR while inheriting diffusion models’ flexibility for parallel and bidirectional generation. We implement A3 through a two-stream attention architecture and a progressive adaptation strategy that transitions pretrained AR models toward any-order prediction. Experiments on question answering, commonsense reasoning, and story infilling demonstrate that A3 outperforms diffusion-based models while maintaining flexible decoding. This work offers a unified approach for a flexible, efficient, and novel language modeling paradigm.
zh
[NLP-102] Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision
【速读】: 该论文旨在解决当前深度研究代理(Deep Research Agents, DRAs)评估体系中忽视报告迭代修订能力的问题,即现有基准将报告生成视为单次写作任务,而未考虑人类研究人员通过自我反思或同行反馈进行多轮修订的实际流程。其解决方案的关键在于提出Mr Dre评估套件,该套件包含两个核心组件:(1) 统一的长篇报告评估协议,涵盖全面性、事实准确性和呈现质量;(2) 用于模拟人类反馈的人工验证反馈流水线,以支持多轮修订评估。通过该框架,研究揭示了DRAs在面对用户反馈时虽能部分响应,但普遍存在对已覆盖内容和引用质量的退化现象,且多轮修订后仍存在显著改进空间,表明单纯依赖推理阶段优化(如提示工程或专用子代理)难以有效解决此问题。
链接: https://arxiv.org/abs/2601.13217
作者: Bingsen Chen,Boyan Li,Ping Nie,Yuyu Zhang,Xi Ye,Chen Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports with user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new evaluation axis for DRAs. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for multi-turn revision. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16-27% of previously covered content and citation quality. Over multiple revision turns, even the best-performing agents leave significant headroom, as they continue to disrupt content outside the feedback’s scope and fail to preserve earlier edits. We further show that these issues are not easily resolvable through inference-time fixes such as prompt engineering and a dedicated sub-agent for report revision.
zh
[NLP-103] OpenExempt: A Diagnostic Benchmark for Legal Reasoning and a Framework for Creating Custom Benchmarks on Demand
【速读】: 该论文旨在解决现有语言模型推理评估基准在复杂规则领域(如法律)中存在局限性的问题,尤其是静态问答对难以全面反映模型的推理能力,且无法有效识别特定失败模式。其解决方案的关键在于提出OpenExempt框架,该框架利用专家构建的符号化表示来动态生成大量自然语言推理任务及其可计算的解,从而实现对任务复杂度和范围的精细控制,并支持对单个推理技能进行隔离测试;在此基础上构建的OpenExempt基准包含9,765个样本,覆盖九个精心设计的评估套件,能够揭示模型在长推理链和干扰陈述下的性能断崖现象,为下一代推理系统的研究提供诊断工具。
链接: https://arxiv.org/abs/2601.13183
作者: Sergio Servantez,Sarah B. Lawsky,Rajiv Jain,Daniel W. Linna Jr.,Kristian Hammond
机构: Northwestern University (西北大学); Adobe Research (Adobe 研究院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: 25 pages, 9 Figures, 15 tables
Abstract:Reasoning benchmarks have played a crucial role in the progress of language models. Yet rigorous evaluation remains a significant challenge as static question-answer pairs provide only a snapshot of performance, compressing complex behavior into a single accuracy metric. This limitation is especially true in complex, rule-bound domains such as law, where existing benchmarks are costly to build and ill suited for isolating specific failure modes. To address this, we introduce OpenExempt, a framework and benchmark for diagnostic evaluation of legal reasoning. The OpenExempt Framework uses expert-crafted symbolic representations of U.S. Bankruptcy Code statutes to dynamically generate a large space of natural language reasoning tasks and their machine-computable solutions on demand. This gives users fine-grained control over task complexity and scope, allowing individual reasoning skills to be probed in isolation. Using this system, we construct the OpenExempt Benchmark, a diagnostic benchmark for legal reasoning with 9,765 samples across nine evaluation suites designed to carefully probe model capabilities. Experiments on 13 diverse language models reveal sharp performance cliffs that emerge only under longer reasoning paths and in the presence of obfuscating statements. We release the framework and benchmark publicly to support research aimed at understanding and improving the next generation of reasoning systems.
zh
[NLP-104] Medical Triage as Pairwise Ranking: A Benchmark for Urgency in Patient Portal Messages
【速读】: 该论文旨在解决医疗分诊(Medical Triage)中如何高效、准确地对患者通过异步门诊门户发送的非结构化消息进行紧急程度排序的问题。其核心挑战在于缺乏大规模、高质量的标注数据集以及适用于此类任务的专用模型训练方法。解决方案的关键在于提出一个新颖的成对推理任务范式,将分诊问题建模为“在一对消息中判断哪一条更紧急”的锦标赛式排序任务,并构建了首个公开的大规模基准数据集PMR-Bench(包含1569条真实患者消息和2000+组高质量测试对),同时设计了一种自动化数据标注策略以提供领域内指导。基于此,作者训练了两类模型:UrgentReward(基于Bradley-Terry模型)与UrgentSFT(基于下一个token预测目标),其中UrgentSFT在性能上表现最优,而UrgentReward在低资源场景下更具优势,显著提升了医生收件箱排序的准确性。
链接: https://arxiv.org/abs/2601.13178
作者: Joseph Gatto,Parker Seegmiller,Timothy Burdick,Philip Resnik,Roshnik Rahat,Sarah DeLozier,Sarah M. Preum
机构: Dartmouth College (达特茅斯学院); Dartmouth Health (达特茅斯健康); University of Maryland, College Park (马里兰大学科尔吉公园分校)
类目: Computation and Language (cs.CL)
备注: 19 Pages, 5 Figures
Abstract:Medical triage is the task of allocating medical resources and prioritizing patients based on medical need. This paper introduces the first large-scale public dataset for studying medical triage in the context of asynchronous outpatient portal messages. Our novel task formulation views patient message triage as a pairwise inference problem, where we train LLMs to choose `“which message is more medically urgent” in a head-to-head tournament-style re-sort of a physician’s inbox. Our novel benchmark PMR-Bench contains 1569 unique messages and 2,000+ high-quality test pairs for pairwise medical urgency assessment alongside a scalable training data generation pipeline. PMR-Bench includes samples that contain both unstructured patient-written messages alongside real electronic health record (EHR) data, emulating a real-world medical triage scenario. We develop a novel automated data annotation strategy to provide LLMs with in-domain guidance on this task. The resulting data is used to train two model classes, UrgentReward and UrgentSFT, leveraging Bradley-Terry and next token prediction objective, respectively to perform pairwise urgency classification. We find that UrgentSFT achieves top performance on PMR-Bench, with UrgentReward showing distinct advantages in low-resource settings. For example, UrgentSFT-8B and UrgentReward-8B provide a 15- and 16-point boost, respectively, on inbox sorting metrics over off-the-shelf 8B models. Paper resources can be found at this https URL Comments: 19 Pages, 5 Figures Subjects: Computation and Language (cs.CL) Cite as: arXiv:2601.13178 [cs.CL] (or arXiv:2601.13178v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.13178 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-105] Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference
【速读】: 该论文旨在解决长上下文推理中大型语言模型(Large Language Models, LLMs)因计算开销大而导致的推理延迟问题。现有基于token的优化方法(如剪枝和跳过)虽能降低延迟,但仍受限于加速潜力不足、代理信号过时及冗余干扰等问题,难以实现理想的加速-精度权衡。其解决方案的关键在于提出一种无需训练的高效推理框架SPTS(Self-Predictive Token Skipping),通过两个组件特异性策略实现选择性token跳过:针对多头注意力机制设计Partial Attention Probing(PAP),利用部分前向注意力计算筛选信息丰富的token;针对前馈网络设计Low-rank Transformation Probing(LTP),构建低秩代理网络预测token变换。此外,引入Multi-Stage Delayed Pruning(MSDP)策略动态分配跳过预算,在多层中逐步剪除冗余token,从而在保持模型性能的同时显著提升推理速度,实验表明该方法在预填充和端到端生成阶段分别实现最高2.46×和2.29×的加速比。
链接: https://arxiv.org/abs/2601.13155
作者: Zimeng Wu,Donghao Wang,Chaozhe Jin,Jiaxin Chen,Yunhong Wang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Long-context inference enhances the reasoning capability of Large Language Models (LLMs) while incurring significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency, but still suffer from inherently limited acceleration potential, outdated proxy signals, and redundancy interference, thus yielding suboptimal speed-accuracy trade-offs. To address these challenges, we propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context LLM inference. Specifically, motivated by the thought of probing the influence of targeted skipping layers, we design two component-specific strategies for selective token skipping: Partial Attention Probing (PAP) for multi-head attention, which selects informative tokens by performing partial forward attention computation, and Low-rank Transformation Probing (LTP) for feed forward network, which constructs a low-rank proxy network to predict token transformations. Furthermore, a Multi-Stage Delayed Pruning (MSDP) strategy reallocates the skipping budget and progressively prunes redundant tokens across layers. Extensive experiments demonstrate the effectiveness of our method, achieving up to 2.46 \times and 2.29 \times speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art model performance. The source code will be publicly available upon paper acceptance.
zh
[NLP-106] VWorld: Foundations for Remote-Control TV Agents
【速读】: 该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, LVLMs)在电视(TV)远程控制(Remote-Control, RC)交互场景下能力不足的问题,尤其是现有研究主要集中在点选点击(Point-and-Click, PnC)交互,而对日常使用中常见的长程、基于焦点的导航任务缺乏系统评估与优化。其解决方案的关键在于提出一个离线图结构抽象——TVWorld,用于构建可复现且无需部署的电视导航评估环境,并在此基础上设计两个互补基准:TVWorld-N(拓扑感知导航)和TVWorld-G(焦点感知定位),从而揭示现有代理在拓扑意识上的缺陷;进而提出拓扑感知训练(Topology-Aware Training)框架,将拓扑知识注入LVLMs,最终开发出专用于电视导航的基础模型TVTheseus,在TVWorld-N上达到68.3%的成功率,显著优于闭源基线如Gemini 3 Flash,实现了该任务上的最先进性能。
链接: https://arxiv.org/abs/2601.13142
作者: Zhantao Ma,Quanfeng Lu,Shuai Zhong,Dahai Yu,Ping Luo,Michael K. Ng
机构: The University of Hong Kong (香港大学); Hong Kong Baptist University (香港浸会大学); TCL Corporate Research (Hong Kong) Co., Ltd (TCL香港企业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce \textbfTVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: \textbfTVWorld-N for topology-aware navigation and \textbfTVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a \emphTopology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop \textbfTVTheseus, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of 68.3% on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.
zh
[NLP-107] Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在敏感领域(如种族、社会和政治)中存在偏见和价值不一致的问题。解决方案的关键在于提出一种对抗性对齐框架(adversarial alignment framework),通过持续预训练、指令微调和对抗训练三阶段优化实现价值一致性增强:其中对抗训练模块包含三个组件——Attacker生成争议性问题,Actor生成符合价值一致性的回应,Critic负责筛选并保障响应质量,从而系统性提升模型在敏感场景下的伦理合规性和稳定性。
链接: https://arxiv.org/abs/2601.13137
作者: Yuan Gao,Zhigang Liu,Xinyu Yao,Bo Chen,Xiaobing Zhao
机构: Minzu University of China (中国民族大学); National Language Resource Monitoring and Research Center of Minority Languages (少数民族语言资源监测与研究中心)
类目: Computation and Language (cs.CL)
备注: 13 pages, 5 figures
Abstract:With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society and politics. In this paper, we propose an adversarial alignment framework, which enhances the value consistency of the model in sensitive domains through continued pre-training, instruction fine-tuning and adversarial training. In adversarial training, we use the Attacker to generate controversial queries, the Actor to generate responses with value consistency, and the Critic to filter and ensure response quality. Furthermore, we train a Value-Consistent Large Language Model, VC-LLM, for sensitive domains, and construct a bilingual evaluation dataset in Chinese and English. The experimental results show that VC-LLM performs better than the existing mainstream models in both Chinese and English tests, verifying the effectiveness of the method. Warning: This paper contains examples of LLMs that are offensive or harmful in nature.
zh
[NLP-108] Agent ic Conversational Search with Contextualized Reasoning via Reinforcement Learning
【速读】: 该论文旨在解决多轮对话中用户意图随交互动态演化的问题,现有方法通常采用静态的“重写-检索-生成”流水线,无法有效协调检索与生成过程以适应混合主动性(mixed-initiative)的交互需求。其解决方案的关键在于提出一种将搜索与推理跨轮次交错进行的对话代理(conversational agent),通过强化学习(reinforcement learning, RL)训练并设计针对用户目标演化的定制奖励机制,从而实现探索性与自适应行为的学习,显著提升了在多个主流对话基准上的性能表现。
链接: https://arxiv.org/abs/2601.13115
作者: Fengran Mo,Yifan Gao,Sha Li,Hansi Zeng,Xin Liu,Zhaoxuan Tan,Xian Li,Jianshu Chen,Dakuo Wang,Meng Jiang
机构: University of Montreal (蒙特利尔大学); Amazon.com (亚马逊公司); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University of Notre Dame (圣母大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. To respond to users within multi-turn dialogues, the context-dependent user intent evolves across interactions, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Existing studies usually follow static rewrite, retrieve, and generate pipelines, which optimize different procedures separately and overlook the mixed-initiative action optimization simultaneously. Although the recent developments in deep search agents demonstrate the effectiveness in jointly optimizing retrieval and generation via reasoning, these approaches focus on single-turn scenarios, which might lack the ability to handle multi-turn interactions. We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training with tailored rewards towards evolving user goals. The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods by surpassing several existing strong baselines.
zh
[NLP-109] CORE-T: COherent REtrieval of Tables for Text-to-SQL
【速读】: 该论文旨在解决真实场景中文本到SQL(text-to-SQL)任务中多表连接时的表选择瓶颈问题,尤其是在开放书籍设置下(open-book setting),即在缺乏数据库标识符等清晰作用域信号的情况下,如何高效准确地从大规模异构表集合中检索出与查询相关的表。解决方案的关键在于提出一种可扩展且无需训练的框架CORE-T:首先利用大语言模型(LLM)为每张表生成目的性元数据(purpose metadata),从而增强表的语义表示;其次预计算轻量级的表兼容性缓存(table-compatibility cache)以加速推理;在推理阶段,结合密集检索(DR)返回的Top-K候选表,通过一次LLM调用筛选出语义连贯且可连接的子集,并辅以简单的加性调整步骤恢复强兼容表,显著提升多表执行准确率并减少资源消耗。
链接: https://arxiv.org/abs/2601.13111
作者: Hassan Soliman,Vivek Gupta,Dan Roth,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany; Arizona State University; University of Pennsylvania
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Preprint under review. Code and data available at: this https URL
Abstract:Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, improving multi-table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and using 4-5x fewer tokens than LLM-intensive baselines.
zh
[NLP-110] Leverag ing Lora Fine-Tuning and Knowledge Bases for Construction Identification
【速读】: 该论文旨在解决英语双宾语结构(ditransitive construction)的自动识别问题,即如何准确区分句子中是否包含典型的“主语+动词+间接宾语+直接宾语”结构。其解决方案的关键在于将基于LoRA(Low-Rank Adaptation)的大型语言模型微调与检索增强生成(Retrieval-Augmented Generation, RAG)框架相结合,通过在英国国家语料库(British National Corpus)标注数据上进行二分类任务训练,使模型从依赖表面形式匹配转向更深层次的语义理解,从而显著提升识别准确率。
链接: https://arxiv.org/abs/2601.13105
作者: Liu Kaipeng,Wu Ling
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19pages, 1figure
Abstract:This study investigates the automatic identification of the English ditransitive construction by integrating LoRA-based fine-tuning of a large language model with a Retrieval-Augmented Generation (RAG) framework.A binary classification task was conducted on annotated data from the British National Corpus. Results demonstrate that a LoRA-fine-tuned Qwen3-8B model significantly outperformed both a native Qwen3-MAX model and a theory-only RAG system. Detailed error analysis reveals that fine-tuning shifts the model’s judgment from a surface-form pattern matching towards a more semantically grounded understanding based.
zh
[NLP-111] Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLM s
【速读】: 该论文旨在解决阿拉伯语在机器翻译(Machine Translation, MT)系统中表现不佳的问题,尤其是在处理区域性方言时的泛化能力不足,从而限制了数百万母语使用者的实际应用。其解决方案的关键在于构建一个大规模、由社区驱动且经过人工翻译的语料库——Alexandria,该数据集覆盖13个阿拉伯国家和11个高影响力领域(如健康、教育与农业),并首次引入城市级来源元数据,以捕捉比传统区域标签更精细的地方变体;同时,数据集包含多轮对话场景及说话人-受话人性别配置标注,支持对性别条件下的方言使用差异进行研究,整体包含107K样本,既可作为训练资源,也可作为评估MT和大型语言模型(Large Language Models, LLMs)性能的严格基准。
链接: https://arxiv.org/abs/2601.13099
作者: Abdellah El Mekki,Samar M. Magdy,Houdaifa Atou,Ruwa AbuHweidi,Baraah Qawasmeh,Omer Nacar,Thikra Al-hibiri,Razan Saadie,Hamzah Alsayadi,Nadia Ghezaiel Hammouda,Alshima Alkhazimi,Aya Hamod,Al-Yas Al-Ghafri,Wesam El-Sayed,Asila Al sharji,Mohamad Ballout,Anas Belfathi,Karim Ghaddar,Serry Sibaee,Alaa Aoun,Areej Asiri,Lina Abureesh,Ahlam Bashiti,Majdal Yousef,Abdulaziz Hafiz,Yehdih Mohamed,Emira Hamedtou,Brakehe Brahim,Rahaf Alhamouri,Youssef Nafea,Aya El Aatar,Walid Al-Dhabyani,Emhemed Hamed,Sara Shatnawi,Fakhraddin Alwajih,Khalid Elkhidir,Ashwag Alasmari,Abdurrahman Gerrio,Omar Alshahri,AbdelRahim A. Elmadany,Ismail Berrada,Amir Azad Adli Alkathiri,Fadi A Zaraket,Mustafa Jarrar,Yahya Mohamed El Hadj,Hassan Alhuzali,Muhammad Abdul-Mageed
机构: 未知
类目: Computation and Language (cs.CL)
备注: Project resources will be available here: this https URL
Abstract:Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce \textbfAlexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant persistent challenges.
zh
[NLP-112] Profiling German Text Simplification with Interpretable Model-Fingerprints
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在文本简化任务中缺乏系统性、高效且可复现的行为诊断工具的问题。当前开发者难以全面理解模型在不同配置下的行为差异,尤其是在数据稀缺的语言场景下,传统依赖人工标注数据的相关性评估方法存在局限。解决方案的关键在于提出一种名为“Simplification Profiler”的诊断工具包,通过生成简化文本的多维可解释指纹(fingerprint),量化模型在不同提示策略或微调参数下的独特行为特征。该方法不依赖大规模人工评分数据,而是基于元评估——即使用简单线性分类器判断能否从简化文本中准确识别出模型配置,从而验证指纹对模型特性的敏感性。实验表明,该方法在区分不同模型配置时F1分数最高达71.9%,显著优于基线方法,为构建更适应性强、可解释的文本简化系统提供了精细且可操作的分析框架。
链接: https://arxiv.org/abs/2601.13050
作者: Lars Klöser,Mika Beele,Bodo Kraft
机构: 未知
类目: Computation and Language (cs.CL)
备注: Presented at 2nd International Conference on Explainable AI for Neural and Symbolic Systems
Abstract:While Large Language Models (LLMs) produce highly nuanced text simplifications, developers currently lack tools for a holistic, efficient, and reproducible diagnosis of their behavior. This paper introduces the Simplification Profiler, a diagnostic toolkit that generates a multidimensional, interpretable fingerprint of simplified texts. Multiple aggregated simplifications of a model result in a model’s fingerprint. This novel evaluation paradigm is particularly vital for languages, where the data scarcity problem is magnified when creating flexible models for diverse target groups rather than a single, fixed simplification style. We propose that measuring a model’s unique behavioral signature is more relevant in this context as an alternative to correlating metrics with human preferences. We operationalize this with a practical meta-evaluation of our fingerprints’ descriptive power, which bypasses the need for large, human-rated datasets. This test measures if a simple linear classifier can reliably identify various model configurations by their created simplifications, confirming that our metrics are sensitive to a model’s specific characteristics. The Profiler can distinguish high-level behavioral variations between prompting strategies and fine-grained changes from prompt engineering, including few-shot examples. Our complete feature set achieves classification F1-scores up to 71.9 %, improving upon simple baselines by over 48 percentage points. The Simplification Profiler thus offers developers a granular, actionable analysis to build more effective and truly adaptive text simplification systems.
zh
[NLP-113] yphoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition ALT
【速读】: 该论文旨在解决泰国语语音识别(ASR)中在线流式处理效率低的问题,当前主流模型如Whisper虽具备高精度但存在高延迟,难以满足实时应用需求。其解决方案的关键在于提出一个参数量仅为115M的FastConformer-Transducer模型Typhoon ASR Real-time,并通过严格的文本规范化(text normalization)显著提升训练一致性与识别准确性,从而在仅45倍计算成本下实现与Whisper Large-v3相当的性能;此外,引入两阶段课程学习策略实现对依善语(Isan)方言的有效适配,同时保持对中部泰语的高性能,最终发布Typhoon ASR Benchmark以推动研究可复现性。
链接: https://arxiv.org/abs/2601.13044
作者: Warit Sirichotedumrong,Adisai Na-Thalang,Potsawee Manakul,Pittawat Taveekitworachai,Sittipong Sripaisarnmongkol,Kunat Pipatanakul
机构: Typhoon, SCB 10X
类目: Computation and Language (cs.CL)
备注: Models and datasets are publicly available on this https URL ; Project Page: this https URL
Abstract:Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription --including context-dependent number verbalization and repetition markers (mai yamok) --creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled datasets with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.
zh
[NLP-114] SASA: Semantic-Aware Contrastive Learning Framework with Separated Attention for Triple Classification
【速读】: 该论文旨在解决知识图谱(Knowledge Graphs, KGs)中三元组分类(Triple Classification, TC)任务面临的两个关键挑战:一是现有方法通常忽略不同KG组件之间的有效语义交互,二是多数模型采用单一二分类训练目标,导致语义表示学习不足。解决方案的关键在于提出SASA框架,其核心创新包括两部分:一是设计分离注意力机制(separated attention mechanism),将三元组编码为解耦的上下文表示并以更有效的交互方式融合;二是引入语义感知的分层对比学习(semantic-aware hierarchical contrastive learning, CL)作为辅助训练目标,从局部和全局两个层面引导模型增强判别能力与语义理解深度。实验表明,SASA在FB15k-237和YAGO3-10数据集上分别提升了+5.9%和+3.4%的准确率,显著超越当前最优方法。
链接: https://arxiv.org/abs/2601.13035
作者: Xu Xiaodan,Hu Xiaolin
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: in progress
Abstract:Knowledge Graphs~(KGs) often suffer from unreliable knowledge, which restricts their utility. Triple Classification~(TC) aims to determine the validity of triples from KGs. Recently, text-based methods learn entity and relation representations from natural language descriptions, significantly improving the generalization capabilities of TC models and setting new benchmarks in performance. However, there are still two critical challenges. First, existing methods often ignore the effective semantic interaction among different KG components. Second, most approaches adopt single binary classification training objective, leading to insufficient semantic representation learning. To address these challenges, we propose \textbfSASA, a novel framework designed to enhance TC models via separated attention mechanism and semantic-aware contrastive learning~(CL). Specifically, we first propose separated attention mechanism to encode triples into decoupled contextual representations and then fuse them through a more effective interactive way. Then, we introduce semantic-aware hierarchical CL as auxiliary training objective to guide models in improving their discriminative capabilities and achieving sufficient semantic learning, considering both local level and global level CL. Experimental results across two benchmark datasets demonstrate that SASA significantly outperforms state-of-the-art methods. In terms of accuracy, we advance the state-of-the-art by +5.9% on FB15k-237 and +3.4% on YAGO3-10.
zh
[NLP-115] ars or Cheers? Benchmarking LLM s via Culturally Elicited Distinct Affective Responses
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在文化对齐评估中过度依赖陈述性知识(如地理事实或社会习俗),而忽视了跨文化情感反应的主观差异这一问题。其解决方案的关键在于构建了一个名为CEDAR的多模态基准,该基准完全基于捕捉“文化诱发的情感差异”(Culturally Elicited Distinct Affective Responses)的场景;通过一种新颖的流水线方法,利用LLM生成的初步标签筛选出具有跨文化情感区分度的样本,并借助严格的人工评估获得可靠的真实标签,最终形成包含10,962个实例、覆盖7种语言和14类细粒度情绪的高质量数据集,从而系统性地评估模型的文化情感理解能力。
链接: https://arxiv.org/abs/2601.13024
作者: Chongyuan Dai,Yaling Shen,Jinpeng Hu,Zihan Gao,Jia Li,Yishun Jiang,Yaxiong Wang,Liu Liu,Zongyuan Ge
机构: Hefei University of Technology (合肥工业大学); Monash University (莫纳什大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 24 pages, 10 figures, 9 Tables
Abstract:Culture serves as a fundamental determinant of human affective processing and profoundly shapes how individuals perceive and interpret emotional stimuli. Despite this intrinsic link extant evaluations regarding cultural alignment within Large Language Models primarily prioritize declarative knowledge such as geographical facts or established societal customs. These benchmarks remain insufficient to capture the subjective interpretative variance inherent to diverse sociocultural lenses. To address this limitation, we introduce CEDAR, a multimodal benchmark constructed entirely from scenarios capturing Culturally \underline\textscElicited \underline\textscDistinct \underline\textscAffective \underline\textscResponses. To construct CEDAR, we implement a novel pipeline that leverages LLM-generated provisional labels to isolate instances yielding cross-cultural emotional distinctions, and subsequently derives reliable ground-truth annotations through rigorous human evaluation. The resulting benchmark comprises 10,962 instances across seven languages and 14 fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples. Comprehensive evaluations of 17 representative multilingual models reveal a dissociation between language consistency and cultural alignment, demonstrating that culturally grounded affective understanding remains a significant challenge for current models.
zh
[NLP-116] Bi-Attention HateXplain : Taking into account the sequential aspect of data during explainability in a multi-task context
【速读】: 该论文旨在解决当前基于HateXplain基准的多任务 hate speech 检测模型中出现的注意力(attention)不稳定问题,即预测注意力在应保持恒定的情况下却显著波动,导致解释不一致、预测不稳定以及学习困难。其解决方案的关键在于提出一种名为 BiAtt-BiRNN-HateXplain 的新型多任务模型,该模型通过引入双向注意力机制(Bidirectional Attention)与双向循环神经网络(BiRNN)结构,在联合学习可解释性与分类任务的同时,更好地捕捉输入文本的时序特征,从而提升模型的解释一致性、检测性能,并减少因社区偏见引发的非故意偏差。
链接: https://arxiv.org/abs/2601.13018
作者: Ghislain Dorian Tchuente Mondjo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at “EAI AFRICOMM 2025 - 17th EAI International Conference on Communications and Networks in Africa”
Abstract:Technological advances in the Internet and online social networks have brought many benefits to humanity. At the same time, this growth has led to an increase in hate speech, the main global threat. To improve the reliability of black-box models used for hate speech detection, post-hoc approaches such as LIME, SHAP, and LRP provide the explanation after training the classification model. In contrast, multi-task approaches based on the HateXplain benchmark learn to explain and classify simultaneously. However, results from HateXplain-based algorithms show that predicted attention varies considerably when it should be constant. This attention variability can lead to inconsistent interpretations, instability of predictions, and learning difficulties. To solve this problem, we propose the BiAtt-BiRNN-HateXplain (Bidirectional Attention BiRNN HateXplain) model which is easier to explain compared to LLMs which are more complex in view of the need for transparency, and will take into account the sequential aspect of the input data during explainability thanks to a BiRNN layer. Thus, if the explanation is correctly estimated, thanks to multi-task learning (explainability and classification task), the model could classify better and commit fewer unintentional bias errors related to communities. The experimental results on HateXplain data show a clear improvement in detection performance, explainability and a reduction in unintentional bias.
zh
[NLP-117] Graph Reasoning Paradigm: Structured and Symbolic Reasoning with Topology-Aware Reinforcement Learning for Large Language Models
【速读】: 该论文旨在解决当前基于强化学习的长链式推理(Long Chain-of-Thought, LCoT)方法在训练过程中存在的计算瓶颈、粗粒度监督、奖励欺骗(reward hacking)、高训练成本及泛化能力差等问题。其核心解决方案是提出图推理范式(Graph Reasoning Paradigm, GRP),通过图结构表示实现符号化与结构化的推理过程,并引入步骤级认知标签(step-level cognitive labels)进行精细化监督;在此基础上设计了过程感知分层裁剪组相对策略优化算法(Process-Aware Stratified Clipping Group Relative Policy Optimization, PASC-GRPO),利用结构化评估替代语义评估以提升效率,通过图结构结果奖励实现过程感知验证,并借助分层裁剪优势估计机制有效缓解奖励欺骗问题。
链接: https://arxiv.org/abs/2601.12995
作者: Runxuan Liu,Xianhao Ou,Xinyan Ma,Jiyuan Wang,Jiafeng Liang,Jiaqi Li,Tao He,Zheng Chu,Rongchuan Mu,Zekun Wang,Baoxin Wang,Dayong Wu,Ming Liu,Shijin Wang,Guoping Hu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); State Key Laboratory of Cognitive Intelligence, iFLYTEK Research (认知智能国家重点实验室,科大讯飞研究院); Tianjin Normal University (天津师范大学); Pengcheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Long Chain-of-Thought (LCoT), achieved by Reinforcement Learning with Verifiable Rewards (RLVR), has proven effective in enhancing the reasoning capabilities of Large Language Models (LLMs). However, reasoning in current LLMs is primarily generated as plain text, where performing semantic evaluation on such unstructured data creates a computational bottleneck during training. Despite RLVR-based optimization, existing methods still suffer from coarse-grained supervision, reward hacking, high training costs, and poor generalization. To address these issues, we propose the Graph Reasoning Paradigm (GRP), which realizes structured and symbolic reasoning, implemented via graph-structured representations with step-level cognitive labels. Building upon GRP, we further design Process-Aware Stratified Clipping Group Relative Policy Optimization (PASC-GRPO), which leverages structured evaluation to replace semantic evaluation, achieves process-aware verification through graph-structured outcome rewards, and mitigates reward hacking via stratified clipping advantage estimation. Experiments demonstrate significant improvements across mathematical reasoning and code generation tasks. Data, models, and code will be released later.
zh
[NLP-118] RAG Explorer: A Visual Analytics System for the Comparative Diagnosis of RAG Systems
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在实际开发中因模块化组件(如嵌入模型和检索算法)配置组合复杂且缺乏透明性而导致的性能优化困难问题。解决方案的关键在于提出RAGExplorer——一个用于系统性比较与诊断RAG配置的可视化分析工具,其核心能力包括:支持从宏观层面快速评估多种配置的整体性能表现,并提供微观层面的故障案例深入分析功能,使开发者能够定位错误根源、探究检索信息差异对生成结果的影响,并通过交互式调整上下文来验证假设,从而有效导航复杂的RAG设计空间。
链接: https://arxiv.org/abs/2601.12991
作者: Haoyu Tian,Yingchaojie Feng,Zhen Wen,Haoxuan Li,Minfeng Zhu,Wei Chen
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 11 pages, 7 figures. Accepted to IEEE TVCG (PacificVis 2026)
Abstract:The advent of Retrieval-Augmented Generation (RAG) has significantly enhanced the ability of Large Language Models (LLMs) to produce factually accurate and up-to-date responses. However, the performance of a RAG system is not determined by a single component but emerges from a complex interplay of modular choices, such as embedding models and retrieval algorithms. This creates a vast and often opaque configuration space, making it challenging for developers to understand performance trade-offs and identify optimal designs. To address this challenge, we present RAGExplorer, a visual analytics system for the systematic comparison and diagnosis of RAG configurations. RAGExplorer guides users through a seamless macro-to-micro analytical workflow. Initially, it empowers developers to survey the performance landscape across numerous configurations, allowing for a high-level understanding of which design choices are most effective. For a deeper analysis, the system enables users to drill down into individual failure cases, investigate how differences in retrieved information contribute to errors, and interactively test hypotheses by manipulating the provided context to observe the resulting impact on the generated answer. We demonstrate the effectiveness of RAGExplorer through detailed case studies and user studies, validating its ability to empower developers in navigating the complex RAG design space. Our code and user guide are publicly available at this https URL.
zh
[NLP-119] ChartAttack: Testing the Vulnerability of LLM s to Malicious Prompting in Chart Generation
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动化图表生成过程中可能被恶意利用以生成误导性图表的问题,从而引发对数据解读的错误认知。其解决方案的关键在于提出ChartAttack框架,该框架通过向图表设计中注入“误导者”(misleaders),诱导MLLM生成具有欺骗性的可视化结果;同时构建AttackViz数据集,包含标注了有效误导策略及其导致的错误答案的图表与问答对,用于系统评估MLLM在面对此类攻击时的脆弱性。实验表明,该方法显著降低了MLLM在图表理解任务中的准确率,凸显了在MLLM驱动的图表生成系统中加强鲁棒性和安全性评估的紧迫性。
链接: https://arxiv.org/abs/2601.12983
作者: Jesus-German Ortiz-Barajas,Jonathan Tonglet,Vivek Gupta,Iryna Gurevych
机构: INSAIT(INSAIT); Sofia University “St. Kliment Ohridski”(索非亚大学“圣克莱门特·奥赫里德斯基”); Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE(通用知识处理实验室(UKP实验室),计算机科学系,达姆施塔特工业大学和应用网络安全国家研究中心ATHENE); Arizona State University(亚利桑那州立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, enabling efficient data analysis and reporting but also introducing new misuse risks. In this work, we introduce ChartAttack, a novel framework for evaluating how MLLMs can be misused to generate misleading charts at scale. ChartAttack injects misleaders into chart designs, aiming to induce incorrect interpretations of the underlying data. Furthermore, we create AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. Experiments in in-domain and cross-domain settings show that ChartAttack significantly degrades the QA performance of MLLM readers, reducing accuracy by an average of 19.6 points and 14.9 points, respectively. A human study further shows an average 20.2 point drop in accuracy for participants exposed to misleading charts generated by ChartAttack. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.
zh
[NLP-120] he Bitter Lesson of Diffusion Language Models for Agent ic Workflows: A Comprehensive Reality Check
【速读】: 该论文旨在解决扩散模型(Diffusion-based Large Language Models, dLLMs)在代理(agentic)任务中是否具备可靠性和有效性的问题,尤其关注其在实时交互场景下能否替代传统的自回归模型作为代理核心。研究发现,尽管dLLMs在计算效率上具有潜力,但在长时程规划的具身代理(Embodied Agents)和需要精确格式(如严格JSON schema)的工具调用代理(Tool-Calling Agents)两种典型场景中均表现出系统性失败:前者因无法有效响应时间反馈而重复尝试,后者则因扩散噪声破坏符号精度。解决方案的关键在于引入DiffuAgent多代理评估框架,并提出将因果性、精确性和逻辑基础推理机制嵌入去噪过程,使dLLMs仅适用于非因果角色(如记忆摘要与工具选择),而非直接承担需强逻辑一致性的代理决策任务。
链接: https://arxiv.org/abs/2601.12979
作者: Qingyu Lu,Liang Ding,Kanjian Zhang,Jinxia Zhang,Dacheng Tao
机构: Southeast University (东南大学); Alibaba (阿里巴巴); Southeast University Shenzhen Research Institute (东南大学深圳研究院); College of Computing and Data Science at Nanyang Technological University, Singapore (新加坡南洋理工大学计算机与数据科学学院)
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, does such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a “bitter lesson”: current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematically failure. (1) In Embodied settings, dLLMs suffer repeated attempts, failing to branch under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g. strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.
zh
[NLP-121] Bridging the Knowledge-Action Gap by Evaluating LLM s in Dynamic Dental Clinical Scenarios
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)从被动知识检索工具向自主临床决策代理转变过程中,评估范式需从静态准确性向动态行为可靠性演进的关键挑战,尤其聚焦于牙科领域中高质量AI建议如何赋能患者参与式决策。其解决方案的核心在于提出标准化临床管理性能评估(Standardized Clinical Management Performance Evaluation, SCMPE)基准,该基准涵盖从基于知识的静态任务到多轮模拟患者交互的工作流级测试,系统揭示了当前LLMs在动态临床对话中的性能断崖式下降并非源于知识遗忘,而是由于主动信息获取与动态状态追踪能力不足;同时发现通用模型普遍存在“高疗效、低安全性”的风险,并证实检索增强生成(Retrieval-Augmented Generation, RAG)虽可缓解静态任务中的幻觉问题,但在动态工作流中效果有限且不稳定,强调仅依赖外部知识无法弥合推理差距,必须结合领域自适应预训练才能实现安全可靠的自主临床实践。
链接: https://arxiv.org/abs/2601.12974
作者: Hongyang Ma,Tiantian Gu,Huaiyuan Sun,Huilin Zhu,Yongxin Wang,Jie Li,Wubin Sun,Zeliang Lian,Yinghong Zhou,Yi Gao,Shirui Wang,Zhihui Tang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 29 pages, 15 figures
Abstract:The transition of Large Language Models (LLMs) from passive knowledge retrievers to autonomous clinical agents demands a shift in evaluation-from static accuracy to dynamic behavioral reliability. To explore this boundary in dentistry, a domain where high-quality AI advice uniquely empowers patient-participatory decision-making, we present the Standardized Clinical Management Performance Evaluation (SCMPE) benchmark, which comprehensively assesses performance from knowledge-oriented evaluations (static objective tasks) to workflow-based simulations (multi-turn simulated patient interactions). Our analysis reveals that while models demonstrate high proficiency in static objective tasks, their performance precipitates in dynamic clinical dialogues, identifying that the primary bottleneck lies not in knowledge retention, but in the critical challenges of active information gathering and dynamic state tracking. Mapping “Guideline Adherence” versus “Decision Quality” reveals a prevalent “High Efficacy, Low Safety” risk in general models. Furthermore, we quantify the impact of Retrieval-Augmented Generation (RAG). While RAG mitigates hallucinations in static tasks, its efficacy in dynamic workflows is limited and heterogeneous, sometimes causing degradation. This underscores that external knowledge alone cannot bridge the reasoning gap without domain-adaptive pre-training. This study empirically charts the capability boundaries of dental LLMs, providing a roadmap for bridging the gap between standardized knowledge and safe, autonomous clinical practice.
zh
[NLP-122] Pardon? Evaluating Conversational Repair in Large Audio-Language Models
【速读】: 该论文旨在解决当前大型音频语言模型(Large Audio-Language Models, LALMs)在口语问答(spoken question answering, QA)任务中评估体系的局限性问题,即现有评价方法主要关注答案准确性和对声学扰动的鲁棒性,却忽略了真实交互场景中输入语义不可回答(unanswerable)的情况,导致模型无法识别缺失关键信息的输入并启动适当的对话修复行为。解决方案的关键在于提出一种“修复感知”(repair-aware)的评估设置,通过语义-声学掩码协议构建成对的可回答与不可回答输入条件,并引入非补偿性指标——可评估意识与修复(Evaluability Awareness and Repair, EAR)分数,该指标同时衡量模型在可回答条件下任务能力与在不可回答条件下对话修复行为的能力,从而更全面地评估模型的对话可靠性。
链接: https://arxiv.org/abs/2601.12973
作者: Shuanghong Huang,Jinlei Xu,Youchao Zhou,Yanghao Zhou,Xuan Zhao,Chong Feng,Wenxuan Zhang
机构: Beijing Institute of Technology (北京理工大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction.
zh
[NLP-123] Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings
【速读】: 该论文旨在解决如何在不依赖显式Lombard语音数据的情况下,实现对任意说话人生成可控的Lombard效应语音的问题。Lombard效应是指人在嘈杂环境中或面对听力障碍者时自动提高音量、调整语调以增强语音可懂度的现象。解决方案的关键在于利用从大规模韵律多样性数据集中学习到的风格嵌入(style embeddings),并通过主成分分析(PCA)识别与Lombard属性相关联的特征维度;随后通过调整这些PCA分量来操控风格嵌入,并将其引入文本到语音(TTS)模型中,从而实现对目标Lombard强度的精细控制,同时保持语音自然度和说话人身份不变,提升噪声环境下的语音可懂度。
链接: https://arxiv.org/abs/2601.12966
作者: Seymanur Akti,Alexander Waibel
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL)
备注:
Abstract:The Lombard effect plays a key role in natural communication, particularly in noisy environments or when addressing hearing-impaired listeners. We present a controllable text-to-speech (TTS) system capable of synthesizing Lombard speech for any speaker without requiring explicit Lombard data during training. Our approach leverages style embeddings learned from a large, prosodically diverse dataset and analyzes their correlation with Lombard attributes using principal component analysis (PCA). By shifting the relevant PCA components, we manipulate the style embeddings and incorporate them into our TTS model to generate speech at desired Lombard levels. Evaluations demonstrate that our method preserves naturalness and speaker identity, enhances intelligibility under noise, and provides fine-grained control over prosody, offering a robust solution for controllable Lombard TTS for any speaker.
zh
[NLP-124] rustworthy Data-driven Chronological Age Estimation from Panoramic Dental Images
【速读】: 该论文旨在解决深度学习在医疗领域应用中因模型黑箱特性引发的信任问题,特别是在牙科年龄估计任务中,如何提升模型决策的透明度与可解释性。其解决方案的关键在于提出一种融合不透明模型与透明规则推理的方法,并通过自然语言生成(Natural Language Generation, NLG)模块输出面向临床医生的文本解释,这些解释由牙科专家参与设计并基于规则构建,从而增强医生对AI结果的理解与信任。
链接: https://arxiv.org/abs/2601.12960
作者: Ainhoa Vivel-Couso,Nicolás Vila-Blanco,María J. Carreira,Alberto Bugarín-Diz,Inmaculada Tomás,Jose M. Alonso-Moral
机构: Universidade de Santiago de Compostela (圣地亚哥德孔波斯特拉大学); Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS) (智能技术研究单一中心); Department of Electronics and Computing (电子与计算系); Oral Sciences Research Group (口腔科学研究中心); Instituto de Investigación Sanitaria de Santiago de Compostela (IDIS) (圣地亚哥德孔波斯特拉大学卫生研究所)
类目: Computation and Language (cs.CL)
备注: This paper is a preliminary version of an accepted article in Information Systems Frontiers, Springer, Special Issue “Explainability in Human-Centric AI”. Please cite the final published version of the paper, not this preprint. The final published version can be found at this https URL
Abstract:Integrating deep learning into healthcare enables personalized care but raises trust issues due to model opacity. To improve transparency, we propose a system for dental age estimation from panoramic images that combines an opaque and a transparent method within a natural language generation (NLG) module. This module produces clinician-friendly textual explanations about the age estimations, designed with dental experts through a rule-based approach. Following the best practices in the field, the quality of the generated explanations was manually validated by dental experts using a questionnaire. The results showed a strong performance, since the experts rated 4.77+/-0.12 (out of 5) on average across the five dimensions considered. We also performed a trustworthy self-assessment procedure following the ALTAI checklist, in which it scored 4.40+/-0.27 (out of 5) across seven dimensions of the AI Trustworthiness Assessment List.
zh
[NLP-125] AI-generated data contamination erodes pathological variability and diagnostic reliability
【速读】: 该论文旨在解决生成式人工智能(Generative AI)在医疗记录中快速生成合成内容所引发的数据污染问题,尤其是这种自指循环如何导致病理变异性的迅速丧失和诊断可靠性的下降。研究发现,在缺乏强制人工验证的情况下,AI模型生成的内容会逐渐趋同于通用表型,罕见但关键的病理特征(如气胸和胸腔积液)消失,且性别与年龄分布严重偏向中年男性,同时模型表现出虚假的高置信度,使临床误判率提升至40%。关键解决方案在于引入质量感知的过滤机制,并将真实数据与合成数据混合使用,这被证明是唯一能有效维持病理多样性、防止性能崩溃的方法;而单纯扩大合成数据规模则无法阻止系统性退化。
链接: https://arxiv.org/abs/2601.12946
作者: Hongyu He,Shaowen Xiang,Ye Zhang,Yingtao Zhu,Jin Zhang,Hao Deng,Emily Alsentzer,Qingyu Chen,Kun-Hsing Yu,Andrew Marmenshall,Tingting Chen,Srinivas Anumasa,Daniel Ebner,Dean Ho,Kee Yuan Ngiam,Ching-Yu Cheng,Dianbo Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: *Corresponding author: Dianbo Liu (dianbo@nus. this http URL )
Abstract:Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop where future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of the model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence; models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.
zh
[NLP-126] A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)与多臂老虎机(Multi-Armed Bandit, MAB)算法在系统设计与应用层面的协同优化问题,聚焦二者在组件级的双向交互机制。其关键解决方案在于系统性地梳理并分析LLM如何借助MAB算法缓解预训练到检索增强生成(Retrieval-Augmented Generation, RAG)及个性化等环节中的不确定性挑战,同时揭示LLM如何通过重构MAB的核心组件(如动作定义和环境建模)来提升顺序决策任务中的适应性与性能,从而实现两个领域的深度融合与优势互补。
链接: https://arxiv.org/abs/2601.12945
作者: Miao Xie,Siguang Chen,Chunli Lv
机构: China Agricultural University (中国农业大学); Ministry of Agriculture and Rural Affairs (农业农村部)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 27 pages, 6 table
Abstract:Large language models (LLMs) have become powerful and widely used systems for language understanding and generation, while multi-armed bandit (MAB) algorithms provide a principled framework for adaptive decision-making under uncertainty. This survey explores the potential at the intersection of these two fields. As we know, it is the first survey to systematically review the bidirectional interaction between large language models and multi-armed bandits at the component level. We highlight the bidirectional benefits: MAB algorithms address critical LLM challenges, spanning from pre-training to retrieval-augmented generation (RAG) and personalization. Conversely, LLMs enhance MAB systems by redefining core components such as arm definition and environment modeling, thereby improving decision-making in sequential tasks. We analyze existing LLM-enhanced bandit systems and bandit-enhanced LLM systems, providing insights into their design, methodologies, and performance. Key challenges and representative findings are identified to help guide future research. An accompanying GitHub repository that indexes relevant literature is available at this https URL.
zh
[NLP-127] Injecting Knowledge from Social Science Journals to Improve Indonesian Cultural Understanding by LLM s
【速读】: 该论文旨在解决大型语言模型(LLM)对印尼文化理解不足的问题,尤其是由于缺乏高质量、本土视角的文化知识数据所致。其解决方案的关键在于构建一个名为IndoSoSci的新文本数据集,该数据集源自151种开放获取的印尼社会科学期刊,以捕捉本地文化知识;并提出一种有效的注入策略:提取与印尼文化相关的事实,并在检索增强生成(RAG)框架中使用LLM生成的假设文档作为查询进行检索,从而显著提升模型在IndoCulture基准上的性能表现。
链接: https://arxiv.org/abs/2601.12921
作者: Adimulya Kartiyasa,Bao Gia Cao,Boyang Li
机构: Nanyang Technological University, Singapore(南洋理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recently there have been intensifying efforts to improve the understanding of Indonesian cultures by large language models (LLMs). An attractive source of cultural knowledge that has been largely overlooked is local journals of social science, which likely contain substantial cultural studies from a native perspective. We present a novel text dataset of journal article passages, created from 151 open-source Indonesian social science journals, called IndoSoSci. We demonstrate an effective recipe for injecting Indonesian cultural knowledge therein into LLMs: extracting the facts related to Indonesian culture, and apply retrieval-augmented generation (RAG) with LLM-generated hypothetical documents as queries during retrieval. The proposed recipe yields strong performance gains over several strong baselines on the IndoCulture benchmark. Additionally, by combining IndoSoSci with Indonesian Wikipedia, we set a new state-of-the-art accuracy on the IndoCulture benchmark.
zh
[NLP-128] SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
【速读】: 该论文旨在解决科学文献与代码库之间存在的实现不一致问题,以确保研究成果的可复现性和可信度。其核心挑战在于识别论文描述与实际代码实现之间的差异,而传统方法难以系统化地检测此类不一致性。解决方案的关键在于构建了一个名为SciCoQA的数据集,包含611个论文-代码差异实例(81个真实、530个合成),并提出了一种用于生成合成数据的机制来扩展样本规模;同时,作者定义了差异类型和类别,从而更清晰地理解不同类型的不匹配现象。该数据集被用于评估21个大语言模型(LLM)的表现,揭示了当前模型在处理缺失论文细节、长上下文输入以及训练语料外数据时的局限性,凸显了该任务的复杂性。
链接: https://arxiv.org/abs/2601.12910
作者: Tim Baumgärtner,Iryna Gurevych
机构: TU Darmstadt (达姆施塔特工业大学); National Research Center for Applied Cybersecurity ATHENE (应用网络安全国家研究中心 ATHENE)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models’ pre-training corpus. The best performing model in our evaluation, GPT-5, can only detect 45.7% of real-world paper-code discrepancies.
zh
[NLP-129] Gated Differentiable Working Memory for Long-Context Language Modeling
【速读】: 该论文旨在解决长上下文场景下Transformer模型面临的挑战,包括注意力分数在数千个token中稀释、关键信息在中间位置丢失以及推理时难以适应新模式等问题。现有测试时适应(test-time adaptation)方法通过维护一种工作记忆(working memory)来缓解此问题,但其依赖于均匀写入策略,在低效区域浪费计算资源且在语义异质性上下文中梯度方差高。解决方案的关键在于将测试时适应重新建模为预算约束下的记忆巩固问题,并提出Gdwm(Gated Differentiable Working Memory)框架:引入一个写入控制器(write controller),基于信息论量化的上下文效用(Contextual Utility)动态分配梯度更新步骤,从而实现高效且全局覆盖的记忆巩固,显著提升效率与性能平衡——实验表明,在ZeroSCROLLS和LongBench v2上仅需均匀基线4倍的梯度步数即可达到相当或更优效果。
链接: https://arxiv.org/abs/2601.12906
作者: Lingrui Mei,Shenghua Liu,Yiwei Wang,Yuyao Ge,Baolong Bi,Jiayu Yao,Jun Wan,Ziling Yin,Jiafeng Guo,Xueqi Cheng
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); University of California, Merced (加州大学默塞德分校); UBS AG (瑞银集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Long contexts challenge transformers: attention scores dilute across thousands of tokens, critical information is often lost in the middle, and models struggle to adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory – transient parameters updated on the current context – but existing approaches rely on uniform write policies that waste computation on low-utility regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, focusing on which parts of the context should be consolidated into working memory under limited computation. We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. The controller estimates Contextual Utility, an information-theoretic measure of long-range contextual dependence, and allocates gradient steps accordingly while maintaining global coverage. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4 \times fewer gradient steps than uniform baselines, establishing a new efficiency-performance Pareto frontier for test-time adaptation.
zh
[NLP-130] From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因引入外部知识而导致的提示长度增加问题,该问题会显著提升计算成本并延长首个 token 的生成时间(Time to First Token, TTFT)。现有方法虽尝试复用每个检索块预处理得到的键值缓存(KV cache)以加速推理,但由于缺乏跨块上下文信息,导致生成质量明显下降,难以充分发挥 KV cache 复用的优势。解决方案的关键在于提出 FusionRAG 框架,通过两个阶段优化:在离线预处理阶段将其他相关文本块的信息嵌入到每个块中;在在线重处理阶段仅对模型关注的关键 token 重新计算 KV cache。这一策略在保持较低重计算比例(<15%)的同时,显著提升了生成质量(最高达基线 70% 的归一化 F1 分数),并大幅降低 TTFT(提升 2.66x–9.39x)。
链接: https://arxiv.org/abs/2601.12904
作者: Jiahao Wang,Weiyu Xie,Mingxing Zhang,Boxing Zhang,Jianwei Dong,Yuening Zhu,Chen Lin,Jinqi Tang,Yaochen Han,Zhiyuan Ai,Xianglin Chen,Yongwei Wu,Congfeng Jiang
机构: Hangzhou Dianzi University (杭州电子科技大学); Tsinghua University (清华大学); Approaching.AI (Approaching.AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation enhances Large Language Models by integrating external knowledge, which reduces hallucinations but increases prompt length. This increase leads to higher computational costs and longer Time to First Token (TTFT). To mitigate this issue, existing solutions aim to reuse the preprocessed KV cache of each retrieved chunk to accelerate RAG. However, the lack of cross-chunk contextual information leads to a significant drop in generation quality, leaving the potential benefits of KV cache reuse largely unfulfilled. The challenge lies in how to reuse the precomputed KV cache of chunks while preserving generation quality. We propose FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, we embed information from other related text chunks into each chunk, while in the online reprocessing stage, we recompute the KV cache for tokens that the model focuses on. As a result, we achieve a better trade-off between generation quality and efficiency. According to our experiments, FusionRAG significantly improves generation quality at the same recomputation ratio compared to previous state-of-the-art solutions. By recomputing fewer than 15% of the tokens, FusionRAG achieves up to 70% higher normalized F1 scores than baselines and reduces TTFT by 2.66x-9.39x compared to Full Attention.
zh
[NLP-131] Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition
【速读】: 该论文旨在解决大规模语言模型中稀疏计算电路(sparse computational circuits)的可解释性难题,即如何从百亿参数模型中高效提取人类可理解的计算结构。当前方法受限于指数级搜索复杂度(O(2^n))和特征多义性(polysemanticity)问题,难以实现有效电路发现。其解决方案的关键在于提出分层归因图分解(Hierarchical Attribution Graph Decomposition, HAGD)框架,通过多分辨率抽象层次将搜索复杂度降低至O(n² log n),并结合跨层转换器(cross-layer transcoders)进行单义特征提取、图神经网络元学习(graph neural network meta-learning)预测拓扑结构以及因果干预协议验证功能。该方法在GPT-2、Llama系列及Pythia模型上均取得稳定的行为保真度(最高达91%),且跨架构迁移实验表明电路结构具有中等相似性(平均67%),为大模型可解释性研究提供了初步基础。
链接: https://arxiv.org/abs/2601.12879
作者: Mohammed Mudassir Uddin,Shahnawaz Alam,Mohammed Kaif Pasha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Mechanistic interpretability seeks to reverse-engineer neural network computations into human-understandable algorithms, yet extracting sparse computational circuits from billion-parameter language models remains challenging due to exponential search complexity and pervasive polysemanticity. The proposed Hierarchical Attribution Graph Decomposition (HAGD) framework reduces circuit discovery complexity from O(2^n) exhaustive enumeration to O(n^2 log n) through multi-resolution abstraction hierarchies and differentiable circuit search. The methodology integrates cross-layer transcoders for monosemantic feature extraction, graph neural network meta-learning for topology prediction, and causal intervention protocols for validation. Empirical evaluation spans GPT-2 variants, Llama-7B through Llama-70B, and Pythia suite models across algorithmic tasks and natural language benchmarks. On modular arithmetic tasks, the framework achieves up to 91% behavioral preservation ( \pm 2.3% across runs) while maintaining interpretable subgraph sizes. Cross-architecture transfer experiments suggest that discovered circuits exhibit moderate structural similarity (averaging 67%) across model families, indicating potential shared computational patterns. These results provide preliminary foundations for interpretability at larger model scales while identifying significant limitations in current attribution methodologies that require future advances.
zh
[NLP-132] Race Ethnicity and Their Implication on Bias in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗等高风险场景中因种族和民族信息被显式或隐式编码而导致的偏见问题,尤其是现有研究多聚焦于结果层面的不平等,缺乏对模型内部机制的理解。其解决方案的关键在于构建一个可复现的可解释性分析流程,结合探测(probing)、神经元级归因(neuron-level attribution)与针对性干预(targeted intervention),系统揭示了敏感属性如何分布于模型内部单元,并发现相同的人口统计学线索可能引发模型行为上的显著差异,且抑制相关神经元虽能降低偏见但仅缓解表层行为而非根本表示,提示需更系统的偏差缓解策略。
链接: https://arxiv.org/abs/2601.12868
作者: Shiyue Hu,Ruizhe Li,Yanjun Gao
机构: University of Colorado Anschutz (科罗拉多大学安舒茨医学校区); University of Colorado Boulder (科罗拉多大学博尔德分校); University of Aberdeen (阿伯丁大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in process
Abstract:Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations from pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions suppressing such neurons reduce bias but leave substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.
zh
[NLP-133] Rapport du Projet de Recherche TRAIMA
【速读】: 该论文旨在解决教育场景中多模态交互(multimodal interactions)自动处理的瓶颈问题,即当前对言语、副语言和非语言数据的分析仍依赖人工标注,效率低且难以扩展。其解决方案的关键在于构建一套严谨的方法论框架,通过系统梳理现有转录规范(如ICOR、Mondada、GARS等)的适用性与局限性,明确可被机器学习模型识别的标注类别与分析单元,并结合INTER-EXPLIC和EXPLIC-LEXIC等真实课堂语料库进行实证验证。项目特别强调教师手势(kinésic和proxemic资源)、韵律特征的功能作用,以及TechnéLAB平台提供的多模态同步采集能力,从而为未来基于人工智能的教育交互自动化分析奠定理论基础与实践路径。
链接: https://arxiv.org/abs/2601.12844
作者: Julie Rançon(UP, FoReLLIS, Poitiers),Jean-François Cerisier(Techné, Poitiers),Emilie Remond(Techné, Poitiers),Aurélien Nguyen(Techné, Poitiers),Andrew Peterson(Techné, Poitiers),Ladjel Bellatreche(ISAE-ENSMA, IDD, A\amp;S)
机构: 未知
类目: Computation and Language (cs.CL)
备注: in French language
Abstract:The TRAIMA project (TRaitement Automatique des Interactions Multimodales en Apprentissage), conducted between March 2019 and June 2020, investigates the potential of automatic processing of multimodal interactions in educational settings. The project addresses a central methodological challenge in educational and interactional research: the analysis of verbal, paraverbal, and non-verbal data is currently carried out manually, making it extremely time-consuming and difficult to scale. TRAIMA explores how machine learning approaches could contribute to the categorisation and classification of such interactions. The project focuses specifically on explanatory and collaborative sequences occurring in classroom interactions, particularly in French as a Foreign Language (FLE) and French as a First Language (FLM) contexts. These sequences are analysed as inherently multimodal phenomena, combining spoken language with prosody, gestures, posture, gaze, and spatial positioning. A key theoretical contribution of the project is the precise linguistic and interactional definition of explanatory discourse as a tripartite sequence (opening, explanatory core, closure), drawing on discourse analysis and interactional linguistics. A substantial part of the research is devoted to the methodological foundations of transcription, which constitute a critical bottleneck for any form of automation. The report provides a detailed state of the art of existing transcription conventions (ICOR, Mondada, GARS, VALIBEL, Ferré), highlighting their respective strengths and limitations when applied to multimodal classroom data. Through comparative analyses of manually transcribed sequences, the project demonstrates the inevitable variability and interpretative dimension of transcription practices, depending on theoretical positioning and analytical goals. Empirical work is based on several corpora, notably the INTER-EXPLIC corpus (approximately 30 hours of classroom interaction) and the EXPLIC-LEXIC corpus, which serve both as testing grounds for manual annotation and as reference datasets for future automation. Particular attention is paid to teacher gestures (kinésic and proxemic resources), prosodic features, and their functional role in meaning construction and learner comprehension. The project also highlights the strategic role of the TechnéLAB platform, which provides advanced multimodal data capture (multi-camera video, synchronized audio, eye-tracking, digital interaction traces) and constitutes both a research infrastructure and a test environment for the development of automated tools. In conclusion, TRAIMA does not aim to deliver a fully operational automated system, but rather to establish a rigorous methodological framework for the automatic processing of multimodal pedagogical interactions. The project identifies transcription conventions, annotation categories, and analytical units that are compatible with machine learning approaches, while emphasizing the need for theoretical explicitness and researcher reflexivity. TRAIMA thus lays the groundwork for future interdisciplinary research at the intersection of didactics, discourse analysis, multimodality, and artificial intelligence in education.
zh
[NLP-134] Multimodal Multi-Agent Empowered Legal Judgment Prediction
【速读】: 该论文旨在解决法律判决预测(Legal Judgment Prediction, LJP)任务中传统方法因多指控、多样证据及适应性不足而导致的性能瓶颈问题。其解决方案的关键在于提出了一种名为JurisMMA的新颖框架,该框架通过有效分解审判任务、标准化处理流程,并将其组织为多个明确阶段,从而提升模型对复杂法律场景的理解与预测能力。此外,研究构建了包含超10万条中国司法记录的大型多模态数据集JurisMM(含文本与视频-文本数据),为LJP提供了更全面的评估基础,验证了该框架在法律判决预测及其他法律应用场景中的有效性。
链接: https://arxiv.org/abs/2601.12815
作者: Zhaolu Kang,Junhao Gong,Qingxi Chen,Hao Zhang,Jiaxin Liu,Rong Fu,Zhiyuan Feng,Yuan Wang,Simon Fong,Kaiyue Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations, diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework’s effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.
zh
[NLP-135] Do Clinical Question Answering Systems Really Need Specialised Medical Fine Tuning? EACL2026
【速读】: 该论文旨在解决临床问答(Clinical Question-Answering, CQA)领域中普遍存在的“专业化谬误”(SPECIALISATION FALLACY)问题,即过度依赖领域特定微调(domain-specific fine-tuning)以提升模型性能的假设。现有医学大语言模型(如BioBERT、PubMedBERT等)虽在特定任务中表现良好,但存在覆盖范围窄、重训练成本高及适应性差等实际限制。为此,作者提出MEDASSESS-X框架,其核心创新在于引入推理时对齐(inference-time alignment)机制——通过轻量级引导向量(steering vectors)动态调整模型激活状态,实现医学一致性推理,而无需更新模型权重或进行领域专属再训练。该方法显著提升了通用与专业医学大模型在准确性、事实一致性和安全性上的稳定表现,验证了非专业化路径在CQA部署中的有效性。
链接: https://arxiv.org/abs/2601.12812
作者: Sushant Kumar Ray,Gautam Siddharth Kashyap,Sahil Tripathi,Nipun Joshi,Vijay Govindarajan,Rafiq Ali,Jiechao Gao,Usman Naseem
机构: University of Delhi(德里大学); Jamia Hamdard(贾米亚哈姆达德大学); Cornell University(康奈尔大学); Expedia Group(Expedia集团); DSEU-Okhla(德塞大学奥克拉分校); Center for SDGC, Stanford University(斯坦福大学可持续发展目标研究中心); Macquarie University(麦考瑞大学)
类目: Computation and Language (cs.CL)
备注: Accepted at EACL 2026 (Industry Track)
Abstract:Clinical Question-Answering (CQA) industry systems are increasingly rely on Large Language Models (LLMs), yet their deployment is often guided by the assumption that domain-specific fine-tuning is essential. Although specialised medical LLMs such as BioBERT, BioGPT, and PubMedBERT remain popular, they face practical limitations including narrow coverage, high retraining costs, and limited adaptability. Efforts based on Supervised Fine-Tuning (SFT) have attempted to address these assumptions but continue to reinforce what we term the SPECIALISATION FALLACY-the belief that specialised medical LLMs are inherently superior for CQA. To address this assumption, we introduce MEDASSESS-X, a deployment-industry-oriented CQA framework that applies alignment at inference time rather than through SFT. MEDASSESS-X uses lightweight steering vectors to guide model activations toward medically consistent reasoning without updating model weights or requiring domain-specific retraining. This inference-time alignment layer stabilises CQA performance across both general-purpose and specialised medical LLMs, thereby resolving the SPECIALISATION FALLACY. Empirically, MEDASSESS-X delivers consistent gains across all LLM families, improving Accuracy by up to +6%, Factual Consistency by +7%, and reducing Safety Error Rate by as much as 50%.
zh
[NLP-136] FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions
【速读】: 该论文旨在解决当前人形机器人动作控制依赖硬编码或特定训练导致的泛化能力不足问题,从而实现基于自然语言指令的通用全身运动控制。解决方案的关键在于提出FRoM-W1框架,该框架分为两个阶段:(a) H-GPT模型利用大规模人类数据训练出一个语言驱动的人体全身运动生成模型,并通过思维链(Chain-of-Thought)技术提升对指令的理解泛化能力;(b) H-ACT控制器将生成的人体运动重定向为机器人特定动作,并在物理仿真中通过强化学习预训练与微调,确保机器人在重力环境下稳定执行对应动作,最终借助模块化仿真到现实迁移策略部署于真实机器人平台。
链接: https://arxiv.org/abs/2601.12799
作者: Peng Li,Zihan Zhuang,Yangfan Gao,Yi Dong,Sixian Li,Changhao Jiang,Shihan Dou,Zhiheng Xi,Enyu Zhou,Jixuan Huang,Hui Li,Jingjing Gong,Xingjun Ma,Tao Gui,Zuxuan Wu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Xipeng Qiu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院)
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Humanoid robots are capable of performing various actions such as greeting, dancing and even backflipping. However, these motions are often hard-coded or specifically trained, which limits their versatility. In this work, we present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. To universally understand natural language and generate corresponding motions, as well as enable various humanoid robots to stably execute these motions in the physical world under gravity, FRoM-W1 operates in two stages: (a) H-GPT: utilizing massive human data, a large-scale language-driven human whole-body motion generation model is trained to generate diverse natural behaviors. We further leverage the Chain-of-Thought technique to improve the model’s generalization in instruction understanding. (b) H-ACT: After retargeting generated human whole-body motions into robot-specific actions, a motion controller that is pretrained and further fine-tuned through reinforcement learning in physical simulation enables humanoid robots to accurately and stably perform corresponding actions. It is then deployed on real robots via a modular simulation-to-reality module. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation, and our introduced reinforcement learning fine-tuning consistently improves both motion tracking accuracy and task success rates of these humanoid robots. We open-source the entire FRoM-W1 framework and hope it will advance the development of humanoid intelligence.
zh
[NLP-137] Open Vocabulary Panoptic Segmentation With Retrieval Augmentation
【速读】: 该论文旨在解决开放词汇表全景分割(Open Vocabulary Panoptic Segmentation)中模型对未见类别(unseen classes)泛化能力差的问题。现有方法在特定数据集上训练后,难以有效分割训练数据之外的任意类别的像素。解决方案的关键在于提出RetCLIP方法,其核心是构建一个基于图像-文本配对数据的掩码段特征数据库(masked segment feature database),在推理阶段利用输入图像中的掩码段特征作为查询键,从数据库中检索相似特征及其关联类标签,并基于相似度分配分类分数;最终将检索得到的分数与CLIP生成的分数融合,实现对未见类别的准确分割。该方法显著提升了在ADE20k数据集上的性能,相比基线FC-CLIP实现了绝对提升。
链接: https://arxiv.org/abs/2601.12779
作者: Nafis Sadeq,Qingfeng Liu,Mostafa El-Khamy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Given an input image and set of class names, panoptic segmentation aims to label each pixel in an image with class labels and instance labels. In comparison, Open Vocabulary Panoptic Segmentation aims to facilitate the segmentation of arbitrary classes according to user input. The challenge is that a panoptic segmentation system trained on a particular dataset typically does not generalize well to unseen classes beyond the training data. In this work, we propose RetCLIP, a retrieval-augmented panoptic segmentation method that improves the performance of unseen classes. In particular, we construct a masked segment feature database using paired image-text data. At inference time, we use masked segment features from the input image as query keys to retrieve similar features and associated class labels from the database. Classification scores for the masked segment are assigned based on the similarity between query features and retrieved features. The retrieval-based classification scores are combined with CLIP-based scores to produce the final output. We incorporate our solution with a previous SOTA method (FC-CLIP). When trained on COCO, the proposed method demonstrates 30.9 PQ, 19.3 mAP, 44.0 mIoU on the ADE20k dataset, achieving +4.5 PQ, +2.5 mAP, +10.0 mIoU absolute improvement over the baseline.
zh
[NLP-138] Who Does This Name Remind You of? Nationality Prediction via Large Language Model Associative Memory
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在国籍和区域预测任务中难以有效激发其世界知识的问题。传统提示方法依赖直接推理,受限于抽象语言规则的应用能力;而该研究提出LLM关联记忆代理(LLM Associative Memory Agents, LAMA),其核心创新在于将LLM的世界知识视为关联记忆资源,通过召回具有相同姓名的著名人物并聚合其国籍信息进行间接推理,而非直接从姓名推断国籍。关键在于采用双代理架构——人物代理(Person Agent)与媒体代理(Media Agent)分别专注不同知识领域并行召回,最终通过投票机制生成Top-1预测、条件补全生成Top-K预测,显著提升准确性(达0.817),且对低频国籍具有鲁棒性,验证了基于检索与聚合的知识利用方式优于传统提示推理范式。
链接: https://arxiv.org/abs/2601.12771
作者: Keito Inoshita
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) possess extensive world knowledge, yet methods for effectively eliciting this knowledge remain underexplored. Nationality and region prediction tasks require understanding of not only linguistic features but also cultural and historical background, making LLM world knowledge particularly valuable. However, conventional LLM prompting methods rely on direct reasoning approaches, which have limitations in applying abstract linguistic rules. We propose LLM Associative Memory Agents (LAMA), a novel framework that leverages LLM world knowledge as associative memory. Rather than directly inferring nationality from names, LAMA recalls famous individuals with the same name and aggregates their nationalities through indirect reasoning. A dual-agent architecture comprising a Person Agent and a Media Agent, specialized in different knowledge domains, recalls famous individuals in parallel, generating Top-1 predictions through voting and Top-K predictions through conditional completion. On a 99-country nationality prediction task, LAMA achieved 0.817 accuracy, substantially outperforming conventional LLM prompting methods and neural models. Our experiments reveal that LLMs exhibit higher reliability in recalling concrete examples than in abstract reasoning, that recall-based approaches are robust to low-frequency nationalities independent of data frequency distributions, and that the dual-agent architecture functions complementarily to produce synergistic effects. These results demonstrate the effectiveness of a new multi-agent system that retrieves and aggregates LLM knowledge rather than prompting reasoning.
zh
[NLP-139] VISPA: Pluralistic Alignment via Automatic Value Selection and Activation
【速读】: 该论文旨在解决大语言模型在高风险领域应用中输出缺乏多元价值视角的问题,即当前模型输出往往仅反映平均人类偏好,而难以体现多样化的价值观。其解决方案的关键在于提出一种无需训练的群体对齐框架VISPA,通过动态选择和内部激活机制引导实现对价值表达的直接控制,从而在不依赖提示工程或有限价值设定的前提下,有效支持多种价值观的表达与平衡。
链接: https://arxiv.org/abs/2601.12758
作者: Shenyan Zheng,Jiayou Zhong,Anudeex Shetty,Heng Ji,Preslav Nakov,Usman Naseem
机构: University of Waterloo (滑铁卢大学); University of Melbourne (墨尔本大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); MBZUAI; Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: WIP
Abstract:As large language models are increasingly used in high-stakes domains, it is essential that their outputs reflect not average human preference, rather range of varying perspectives. Achieving such pluralism, however, remains challenging. Existing approaches consider limited values or rely on prompt-level interventions, lacking value control and representation. To address this, we introduce VISPA, a training-free pluralistic alignment framework, that enables direct control over value expression by dynamic selection and internal model activation steering. Across extensive empirical studies spanning multiple models and evaluation settings, we show VISPA is performant across all pluralistic alignment modes in healthcare and beyond. Further analysis reveals VISPA is adaptable with different steering initiations, model, and/or values. These results suggest that pluralistic alignment can be achieved through internal activation mechanisms, offering a scalable path toward language models that serves all.
zh
[NLP-140] PAIR-SAFE: A Paired-Agent Approach for Runtime Auditing and Refining AI-Mediated Mental Health Support
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在心理健康支持场景中可能产生过于指令化、不一致或临床不匹配响应的问题,尤其是在高风险情境下缺乏透明度和运行时问责性。解决方案的关键在于提出一种名为PAIR-SAFE的配对代理框架,该框架由一个Responder代理与一个基于临床验证的动机访谈治疗完整性(Motivational Interviewing Treatment Integrity, MITI-4)框架的Judge代理组成;Judge代理对每条AI生成的回应进行结构化审计并提供ALLOW或REVISE决策,从而指导运行时响应的优化,显著提升关键MITI维度(如伙伴关系、协作意愿及整体关系质量)的表现。
链接: https://arxiv.org/abs/2601.12754
作者: Jiwon Kim,Violeta J. Rodriguez,Dong Whi Yoo,Eshwar Chandrasekharan,Koustuv Saha
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Large language models (LLMs) are increasingly used for mental health support, yet they can produce responses that are overly directive, inconsistent, or clinically misaligned, particularly in sensitive or high-risk contexts. Existing approaches to mitigating these risks largely rely on implicit alignment through training or prompting, offering limited transparency and runtime accountability. We introduce PAIR-SAFE, a paired-agent framework for auditing and refining AI-generated mental health support that integrates a Responder agent with a supervisory Judge agent grounded in the clinically validated Motivational Interviewing Treatment Integrity (MITI-4) framework. The Judgeaudits each response and provides structuredALLOW or REVISE decisions that guide runtime response refinement. We simulate counseling interactions using a support-seeker simulator derived from human-annotated motivational interviewing data. We find that Judge-supervised interactions show significant improvements in key MITI dimensions, including Partnership, Seek Collaboration, and overall Relational quality. Our quantitative findings are supported by qualitative expert evaluation, which further highlights the nuances of runtime supervision. Together, our results reveal that such pairedagent approach can provide clinically grounded auditing and refinement for AI-assisted conversational mental health support.
zh
[NLP-141] owards Robust Process Reward Modeling via Noise-aware Learning
【速读】: 该论文旨在解决过程奖励模型(Process Reward Models, PRMs)在训练过程中因依赖昂贵的过程级监督而导致的瓶颈问题,特别是当前广泛采用的蒙特卡洛估计(Monte Carlo Estimation, MCE)方法所引发的标签噪声问题。MCE将步骤奖励定义为从某一推理步骤出发达到正确最终答案的概率,但实证发现其生成的奖励具有策略依赖性(policy-dependent),导致虚假正例(错误步骤被奖励)和虚假负例(正确步骤被惩罚),从而引入标签噪声。解决方案的关键在于提出一个两阶段框架:第一阶段通过引入一种“反思感知的标签校正机制”(reflection-aware label correction mechanism),利用大语言模型(LLM)作为裁判识别推理轨迹中的反思与自我修正行为,以抑制过高的奖励估计;第二阶段则设计了一个“噪声感知迭代训练框架”(Noise-Aware Iterative Training framework),使PRM能够基于自身置信度逐步优化噪声标签,从而显著提升步骤级别的正确性判别能力,在平均F1指标上相较使用噪声监督训练的PRMs提升高达27%。
链接: https://arxiv.org/abs/2601.12748
作者: Bin Xie,Bingbing Xu,Xueyun Tian,Yilin Chen,Huawei Shen
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines process rewards as the probability that a policy model reaches the correct final answer from a given reasoning step. However, step correctness is an intrinsic property of the reasoning trajectory, and should be invariant to policy choice. Our empirical findings show that MCE producing policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address above challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a \underline\textbfNoise-\underline\textbfAware \underline\textbfIterative \underline\textbfTraining framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive Experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
zh
[NLP-142] A Shared Geometry of Difficulty in Multilingual Language Models
【速读】: 该论文旨在解决如何在多语言环境下准确预测大语言模型(Large Language Models, LLMs)中任务难度的问题,即通过分析模型内部表征来估计任务对模型的难易程度。其解决方案的关键在于发现问题难度信号在模型内部表征中呈现两阶段分布:浅层(早期层)表征包含语言无关的难度信息,能实现跨语言良好泛化,但单语言性能较低;深层(后期层)表征则捕获语言特定的难度信号,具备高单语言精度但跨语言泛化能力差。这一发现揭示了LLMs先形成抽象的概念空间以表征问题难度,再转向语言特异性输出的两阶段表征机制,为理解模型内部认知过程提供了新视角,并验证了该机制不仅适用于语义内容,也扩展至高阶元认知属性如难度估计。
链接: https://arxiv.org/abs/2601.12731
作者: Stefano Civelli,Pietro Bernardelle,Nicolò Brunello,Gianluca Demartini
机构: The University of Queensland (昆士兰大学); Polytechnic University of Milan (米兰理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Predicting problem-difficulty in large language models (LLMs) refers to estimating how difficult a task is according to the model itself, typically by training linear probes on its internal representations. In this work, we study the multilingual geometry of problem-difficulty in LLMs by training linear probes using the AMC subset of the Easy2Hard benchmark, translated into 21 languages. We found that difficulty-related signals emerge at two distinct stages of the model internals, corresponding to shallow (early-layers) and deep (later-layers) internal representations, that exhibit functionally different behaviors. Probes trained on deep representations achieve high accuracy when evaluated on the same language but exhibit poor cross-lingual generalization. In contrast, probes trained on shallow representations generalize substantially better across languages, despite achieving lower within-language performance. Together, these results suggest that LLMs first form a language-agnostic representation of problem difficulty, which subsequently becomes language-specific. This closely aligns with existing findings in LLM interpretability showing that models tend to operate in an abstract conceptual space before producing language-specific outputs. We demonstrate that this two-stage representational process extends beyond semantic content to high-level meta-cognitive properties such as problem-difficulty estimation.
zh
[NLP-143] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization
【速读】: 该论文旨在解决高性能计算(HPC)工作负载及大模型训练与推理中GPU代码优化的性能瓶颈问题,特别是当前依赖人工代码重构和参数调优才能逼近硬件极限性能的局限性。其关键解决方案是在基于大语言模型(LLM)代理的迭代优化流程之上引入一个模板化重写层:将核函数语义性地重构为显式可参数化的模板,并通过基于搜索的自动调优(autotuning)对模板参数进行约束性搜索,从而在硬件资源限制下获得更稳定、更高品质的加速效果。此设计显著降低了迭代优化过程中的随机性,提升了可解释性,并支持系统化地探索高性能配置。
链接: https://arxiv.org/abs/2601.12698
作者: Qiuyi Qu,Yicheng Sui,Yufei Sun,Rui Chen,Xiaofei Zhang,Yuzhi Zhang,Haofeng Wang,Ge Lan,Ning Zhang
机构: NanKai University (南开大学); Beijing Institute of Computer Technology and Applications (北京计算机技术与应用研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.
zh
[NLP-144] UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages
【速读】: 该论文旨在解决当前守护模型(Guardian Models)在低资源非洲语言中存在安全风险、跨语言安全性失效及文化错位的问题。现有模型多以西方为中心,依赖静态预定义的安全类别,难以适应多元语言和文化背景下的真实风险场景。解决方案的关键在于构建首个基于非洲本地政策的基准测试 UbuntuGuard,其通过155位领域专家(涵盖医疗等敏感领域)撰写的对抗性查询生成情境化安全策略与参考响应,从而捕捉文化根基的风险信号,并实现对守护模型的政策对齐评估。这一方法强调运行时可执行的灵活政策机制,为开发适用于低资源语言的可靠且公平的守护模型提供了必要基准。
链接: https://arxiv.org/abs/2601.12696
作者: Tassallah Abdullahi,Macton Mgonzo,Mardiyyah Oduwole,Paul Okewunmi,Abraham Owodunni,Ritambhara Singh,Carsten Eickhoff
机构: Brown University (布朗大学); The Ohio State University (俄亥俄州立大学); ML Collective; University of Tuebingen (图宾根大学)
类目: Computation and Language (cs.CL)
备注: 12 pages
Abstract:Current guardian models are predominantly Western-centric and optimized for high-resource languages, leaving low-resource African languages vulnerable to evolving harms, cross-lingual safety failures, and cultural misalignment. Moreover, most guardian models rely on rigid, predefined safety categories that fail to generalize across diverse linguistic and sociocultural contexts. Robust safety, therefore, requires flexible, runtime-enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first African policy-based safety benchmark built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare. From these expert-crafted queries, we derive context-specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy-aligned evaluation of guardian models. We evaluate 13 models, comprising six general-purpose LLMs and seven guardian models across three distinct variants: static, dynamic, and multilingual. Our findings reveal that existing English-centric benchmarks overestimate real-world multilingual safety, cross-lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle to fully localize African-language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low-resource languages. Our code can be found online.\footnoteCode repository available at this https URL.
zh
[NLP-145] Augmenting Question Answering with A Hybrid RAG Approach
【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在问答(Question-Answering, QA)任务中因检索到的信息缺乏语义相关性而导致答案不完整或质量不佳的问题。其解决方案的关键在于提出一种结构化语义RAG(Structured-Semantic RAG, SSRAG)架构,该架构融合了查询增强(query augmentation)、代理路由(agentic routing)以及结合向量与图谱技术的结构化检索机制,并引入上下文统一(context unification)策略,从而显著提升检索过程的精准性和答案的准确性与信息丰富度。
链接: https://arxiv.org/abs/2601.12658
作者: Tianyi Yang,Nashrah Haque,Vaishnave Jonnalagadda,Yuya Jeremy Ong,Zhehui Chen,Yanzhao Wu,Lei Yu,Divyesh Jadav,Wenqi Wei
机构: IBM Research(IBM研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 tables, 2 figures; presented at IEEE CogMI 2025
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for enhancing the quality of responses in Question-Answering (QA) tasks. However, existing approaches often struggle with retrieving contextually relevant information, leading to incomplete or suboptimal answers. In this paper, we introduce Structured-Semantic RAG (SSRAG), a hybrid architecture that enhances QA quality by integrating query augmentation, agentic routing, and a structured retrieval mechanism combining vector and graph based techniques with context unification. By refining retrieval processes and improving contextual grounding, our approach improves both answer accuracy and informativeness. We conduct extensive evaluations on three popular QA datasets, TruthfulQA, SQuAD and WikiQA, across five Large Language Models (LLMs), demonstrating that our proposed approach consistently improves response quality over standard RAG implementations.
zh
[NLP-146] Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?
【速读】: 该论文旨在解决放射科培训中操作性病例日志(procedural case logs)手工撰写耗时且易出现不一致的问题。其解决方案的关键在于利用大语言模型(large language models, LLMs)从自由文本的放射学报告中自动提取结构化操作信息,通过指令式提示(instruction-based prompting)和思维链提示(chain-of-thought prompting)策略实现高效、准确的信息抽取,从而显著降低住院医师的文书负担并提升病例记录的一致性。
链接: https://arxiv.org/abs/2601.12648
作者: Nafiz Imtiaz Khan,Kylie Cleland,Vladimir Filkov,Roger Eric Goldman
机构: University of California, Davis(加州大学戴维斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 51 pages, 12 figures, 8 tables. Feasibility study using retrospective radiology reports. Submitted to JAMIA Open (under review)
Abstract:Procedural case logs are a core requirement in radiology training, yet they are time-consuming to complete and prone to inconsistency when authored manually. This study investigates whether large language models (LLMs) can automate procedural case log documentation directly from free-text radiology reports. We evaluate multiple local and commercial LLMs under instruction-based and chain-of-thought prompting to extract structured procedural information from 414 curated interventional radiology reports authored by nine residents between 2018 and 2024. Model performance is assessed using sensitivity, specificity, and F1-score, alongside inference latency and token efficiency to estimate operational cost. Results show that both local and commercial models achieve strong extraction performance, with best F1-scores approaching 0.87, while exhibiting different trade-offs between speed and cost. Automation using LLMs has the potential to substantially reduce clerical burden for trainees and improve consistency in case logging. These findings demonstrate the feasibility of AI-assisted documentation in medical education and highlight the need for further validation across institutions and clinical workflows.
zh
[NLP-147] Objective Matters: Fine-Tuning Objectives Shape Safety Robustness and Persona Drift
【速读】: 该论文试图解决的问题是:在对大语言模型(Large Language Models, LLMs)进行微调时,尽管使用的是良性数据,仍可能导致对齐性(alignment)和对抗鲁棒性(adversarial robustness)的下降,而现有研究尚未明确不同微调目标函数如何系统性地影响这些安全属性。解决方案的关键在于通过控制变量实验,固定数据、领域、模型架构和优化过程,系统比较六种微调目标函数(包括监督微调、直接偏好优化、条件微调、免疫提示、几率比偏好优化和KL正则化微调),发现微调目标的选择会显著影响安全-能力权衡曲线(safety-capability frontier)——尤其在大规模训练下,约束学习信号的目标(如ORPO和KL正则化)能有效缓解对抗脆弱性和潜在人格漂移(persona drift),从而成为提升模型安全性的重要机制。
链接: https://arxiv.org/abs/2601.12639
作者: Daniel Vennemeyer,Punya Syon Pandey,Phan Anh Duong,Michael Umeokoli,Samuel Ratnam
机构: University of Cincinnati (辛辛那提大学); University of Toronto (多伦多大学); University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remain limited. We present a controlled comparison of six fine-tuning objectives – Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning – holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals – especially ORPO and KL-regularization – substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.
zh
[NLP-148] BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality Robustness and Bias in Large Language Models
【速读】: 该论文旨在解决当前生物医学领域大语言模型(Large Language Models, LLMs)评估基准存在的局限性问题,包括数据静态过时、与预训练语料重叠导致的数据泄露风险,以及对语言变体鲁棒性和潜在人口统计学偏见等关键维度的忽视。其解决方案的核心是提出BioPulse-QA这一新型基准,该基准基于新发布的生物医学文档(如药品说明书、临床试验方案和临床指南)构建了2,280个专家验证的问答对及其扰动变体,涵盖抽取式和摘要式两种格式,从而更真实地反映生物医学知识的动态性、复杂性和高风险特性。此设计显著提升了评估的临床相关性和可扩展性,为LLMs在生物医学场景下的性能评测提供了更可靠的标准。
链接: https://arxiv.org/abs/2601.12632
作者: Kriti Bhattarai,Vipina K. Keloth,Donald Wright,Andrew Loza,Yang Ren,Hua Xu
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs - GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90) on drug labels. Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs. Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR) Cite as: arXiv:2601.12632 [cs.CL] (or arXiv:2601.12632v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.12632 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kriti Bhattarai [view email] [v1] Mon, 19 Jan 2026 00:38:33 UTC (3,169 KB)
zh
[NLP-149] Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems
【速读】: 该论文旨在解决学习分析领域中定性学生数据(如编码标注或访谈文本)在自动化分析过程中缺乏方法论标准的问题,尤其关注如何提升编码一致性与解释深度。其解决方案的关键在于利用大语言模型(Large Language Model, LLM)代理生成的推理轨迹(reasoning traces),通过余弦相似度量化多代理系统中不同LLM代理之间的语义推理一致性,将原本被视为噪声的分歧重新定义为具有分析价值的信号。这种方法不仅能够有效区分编码共识与分歧,还与人工编码可靠性显著相关,并能揭示编码体系中的细微教学功能及改进机会,从而增强质性编码的可解释性、方法严谨性和人机协同效率。
链接: https://arxiv.org/abs/2601.12618
作者: Elham Tajik,Conrad Borchers,Bahar Shahrokhian,Sebastian Simon,Ali Keramati,Sonika Pal,Sreecharan Sankaranarayanan
机构: University at Albany (阿尔巴尼大学); Carnegie Mellon University (卡内基梅隆大学); Arizona State University (亚利桑那州立大学); Le Mans University (勒芒大学); University of California, Irvine (加州大学欧文分校); Indian Institute of Technology Bombay (印度理工学院孟买分校); Extuitive Inc. (Flagship Pioneering) (Extuitive公司(旗舰先锋))
类目: Computation and Language (cs.CL)
备注: LAK 2026 conference paper, 7 pages
Abstract:Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human-AI workflows have emerged as promising methods for analysis. However, methodological standards to guide such workflows remain limited. In this study, we propose that reasoning traces generated by large language model (LLM) agents, especially within multi-agent systems, constitute a novel and rich form of process data to enhance interpretive practices in qualitative coding. We apply cosine similarity to LLM reasoning traces to systematically detect, quantify, and interpret disagreements among agents, reframing disagreement as a meaningful analytic signal. Analyzing nearly 10,000 instances of agent pairs coding human tutoring dialog segments, we show that LLM agents’ semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability. Qualitative analysis guided by this metric reveals nuanced instructional sub-functions within codes and opportunities for conceptual codebook refinement. By integrating quantitative similarity metrics with qualitative review, our method has the potential to improve and accelerate establishing inter-rater reliability during coding by surfacing interpretive ambiguity, especially when LLMs collaborate with humans. We discuss how reasoning-trace disagreements represent a valuable new class of analytic signals advancing methodological rigor and interpretive depth in educational research.
zh
[NLP-150] A Cloud-based Multi-Agent ic Workflow for Science
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在科学领域应用中因缺乏复杂任务执行能力(如运行模拟或做出复杂决策)而导致的实用性受限问题。其核心解决方案是提出一种领域无关、模型独立的代理框架(agentic framework),该框架通过一个监督代理(supervisor agent)协调多个具备特定能力的代理,实现从文献综述、数据分析到仿真运行等任务的自动化调度与执行,并可在云端完整部署。关键创新在于构建了一个可扩展、可复用的系统架构,在保证任务准确率的同时显著提升了多步骤科学任务的自动化水平,实验证明其在合成基准和化学领域真实任务中均能以高成功率完成分配任务,且性能优于或媲美前沿模型。
链接: https://arxiv.org/abs/2601.12607
作者: Anurag Acharya,Timothy Vega,Rizwan A. Ashraf,Anshu Sharma,Derek Parker,Robert Rallo
机构: Pacific Northwest National Laboratory(太平洋西北国家实验室); Florida International University(佛罗里达国际大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models (LLMs) become ubiquitous across various scientific domains, their lack of ability to perform complex tasks like running simulations or to make complex decisions limits their utility. LLM-based agents bridge this gap due to their ability to call external resources and tools and thus are now rapidly gaining popularity. However, coming up with a workflow that can balance the models, cloud providers, and external resources is very challenging, making implementing an agentic system more of a hindrance than a help. In this work, we present a domain-agnostic, model-independent workflow for an agentic framework that can act as a scientific assistant while being run entirely on cloud. Built with a supervisor agent marshaling an array of agents with individual capabilities, our framework brings together straightforward tasks like literature review and data analysis with more complex ones like simulation runs. We describe the framework here in full, including a proof-of-concept system we built to accelerate the study of Catalysts, which is highly important in the field of Chemistry and Material Science. We report the cost to operate and use this framework, including the breakdown of the cost by services use. We also evaluate our system on a custom-curated synthetic benchmark and a popular Chemistry benchmark, and also perform expert validation of the system. The results show that our system is able to route the task to the correct agent 90% of the time and successfully complete the assigned task 97.5% of the time for the synthetic tasks and 91% of the time for real-world tasks, while still achieving better or comparable accuracy to most frontier models, showing that this is a viable framework for other scientific domains to replicate.
zh
[NLP-151] SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition ICASSP2026
【速读】: 该论文旨在解决参数高效微调(Parameter-efficient fine-tuning, PEFT)在语音领域应用中因参数分配不均导致的效率与可扩展性受限问题,尤其是在自动语音识别(ASR)任务中面对域偏移(如儿童语音和地域口音)时的性能下降与灾难性遗忘现象。其解决方案的关键在于提出SSVD-Outer(SSVD-O),该方法通过将输入声学特征空间相关的内变换(inner transformations)与输出语义特征空间相关的外变换(outer transformations)相结合,实现对模型子空间的结构化、平衡式参数预算分配,从而在有限资源下提升适应能力、泛化性能并缓解遗忘问题。
链接: https://arxiv.org/abs/2601.12600
作者: Pu Wang,Shinji Watanabe,Hugo Van hamme
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted by IEEE ICASSP 2026
Abstract:Parameter-efficient fine-tuning (PEFT) is a scalable approach for adapting large speech foundation models to new domains. While methods such as LoRA and its state-of-the-art variants reduce adaptation costs, they typically allocate parameters uniformly across model subspaces, which limits their efficiency and scalability in speech applications. Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. SSVD-O combines input acoustic feature space-associated inner transformations with output semantic feature space-associated outer transformations to enable scalable and balanced adaptation. We conduct the first systematic analysis of parameter budget allocation across model subspaces in PEFT for automatic speech recognition (ASR), and investigate the trade-off between learning and forgetting under constrained resources. SSVD-O is benchmarked against LoRA, DoRA, PiSSA, and SSVD on domain-shifted ASR tasks, including child speech and regional accents, across model scales from 0.1B to 2B within the ESPnet framework. Experimental results show that SSVD-O consistently narrows the performance gap to full fine-tuning while improving generalization and mitigating catastrophic forgetting.
zh
[NLP-152] Dissecting Linear Recurrent Models: How Different Gating Strategies Drive Selectivity and Generalization
【速读】: 该论文旨在解决线性循环神经网络(Linear Recurrent Neural Networks)在架构演进过程中日益复杂、计算成本上升,且缺乏系统性评估基准的问题。现有基准任务或过于简单无法揭示模型差异,或资源消耗过大难以广泛实验。其解决方案的关键在于提出了一种精炼的线性循环模型分类体系,并设计了SelectivBench——一套轻量级、可定制的合成基准任务,专门用于评估模型在中小规模序列中的选择性(selectivity),即聚焦相关输入并忽略基于上下文的干扰项的能力。SelectivBench通过规则语法生成具有可调复杂度的序列,引入违反转移规则的不规则间隔以强化对选择性的考验。实验表明,该基准能有效捕捉与大规模语言任务一致的性能模式,从而为线性循环模型的针对性优化和机制分析提供可控、高效的评估环境。
链接: https://arxiv.org/abs/2601.12598
作者: Younes Bouhadjar,Maxime Fabre,Felix Schmidt,Emre Neftci
机构: Peter Grünberg Institute, Neuromorphic Software Ecosystems (PGI-15), Jülich Research Centre, Germany; Groningen Cognitive Systems and Materials Center (CogniGron), University of Groningen; RWTH Aachen University, Aachen, Germany
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 4 figures and 4 tables
Abstract:Linear recurrent neural networks have emerged as efficient alternatives to the original Transformer’s softmax attention mechanism, thanks to their highly parallelizable training and constant memory and computation requirements at inference. Iterative refinements of these models have introduced an increasing number of architectural mechanisms, leading to increased complexity and computational costs. Nevertheless, systematic direct comparisons among these models remain limited. Existing benchmark tasks are either too simplistic to reveal substantial differences or excessively resource-intensive for experimentation. In this work, we propose a refined taxonomy of linear recurrent models and introduce SelectivBench, a set of lightweight and customizable synthetic benchmark tasks for systematically evaluating sequence models. SelectivBench specifically evaluates selectivity in sequence models at small to medium scale, such as the capacity to focus on relevant inputs while ignoring context-based distractors. It employs rule-based grammars to generate sequences with adjustable complexity, incorporating irregular gaps that intentionally violate transition rules. Evaluations of linear recurrent models on SelectivBench reveal performance patterns consistent with results from large-scale language tasks. Our analysis clarifies the roles of essential architectural features: gating and rapid forgetting mechanisms facilitate recall, in-state channel mixing is unnecessary for selectivity, but critical for generalization, and softmax attention remains dominant due to its memory capacity scaling with sequence length. Our benchmark enables targeted, efficient exploration of linear recurrent models and provides a controlled setting for studying behaviors observed in large-scale evaluations. Code is available at this https URL
zh
[NLP-153] Evaluating Contextually Mediated Factual Recall in Multilingual Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言场景中进行事实回忆时的局限性问题,即当前评估主要聚焦于孤立的事实检索任务(如直接命名实体并请求特定事实),而忽视了现实语境下事实通过间接指称(contextually mediated recall)被激活的情形。其解决方案的关键在于构建受控提示(controlled prompts),在保持底层事实不变的前提下引入上下文句以实现指称中介(referential mediation),并通过对比使用真实姓名与合成姓名的性能差异,分离语境效应与名称特异性关联的影响。实验覆盖五种语言,结果表明:上下文中介显著削弱事实回忆准确性,且不同关系类型间存在异质性;更大规模模型对上下文干扰更具鲁棒性,但真实姓名及其来源的影响则无系统规律。这揭示了多语言LLMs在语境依赖理解方面存在的认知差距。
链接: https://arxiv.org/abs/2601.12555
作者: Yihong Liu,Bingyu Xiong,Hinrich Schütze
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint
Abstract:Large language models (LLMs) can recall a wide range of factual knowledge across languages. However, existing factual recall evaluations primarily assess fact retrieval in isolation, where the queried entity is explicitly named and the fact is requested directly. In natural language use, facts are often accessed through context, where the relevant entity is introduced only indirectly. In this work, we study contextually mediated factual recall, asking whether LLMs can reliably retrieve factual knowledge when the target entity is embedded in a naturalistic context rather than queried explicitly, across languages. We construct controlled prompts that preserve the underlying fact while introducing referential mediation through contextual sentences. To disentangle contextual effects from name-specific associations, we further compare performance using synthetic names and real names across languages. Evaluating multiple model families in five languages, we find that contextual mediation consistently degrades factual recall, with substantial variation across relations. Larger models are more robust to contextual mediation, exhibiting a reduced performance gap relative to direct queries, while the effect of real names and name origin is mixed and unsystematic. These findings highlight a gap between isolated factual recall and context-dependent language understanding in multilingual LLMs.
zh
[NLP-154] Benchmarking Concept-Spilling Across Languages in LLM s
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在非英语语境下生成内容时存在的系统性偏倚问题,即“语言溢出”(language spilling)现象——模型倾向于借用主导语言(如英语)的语义表示,导致目标语言中多义词(polysemous words)的语义干扰和表达失真。其解决方案的关键在于提出一种新颖的对比评估框架,通过结构化的多义词生成任务系统测量模型在不同语言中的语义鲁棒性:具体而言,当要求模型生成特定单词的五个语义时,性能更强的模型会在生成序列较晚阶段才引入主导语言语义,而较弱模型则更早依赖主导语言含义;由此构建了一种相对性能指标,无需明确归因错误来源即可对模型进行可比排序。该方法不仅提供了一个可扩展的多语言语义评估基准,还建立了严格的验证流程,为开发更具语言平衡性的生成式AI系统提供了关键工具。
链接: https://arxiv.org/abs/2601.12549
作者: Ilia Badanin,Daniil Dzenhaliou,Imanol Schlag
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multilingual Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often exhibit a systematic bias toward the representations from other languages, resulting in semantic interference when generating content in non-English languages - a phenomenon we define as language spilling. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by systematically measuring how models handle polysemous words across languages. Our methodology provides a relative measure of model performance: when required to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but semantically stronger models do so later in the generation sequence, producing more true meanings from the target language before failing, while weaker models resort to dominant-language meanings earlier in the sequence. We evaluate a diverse set of open and closed multilingual LLMs using a structured meaning generation task across nine languages, employing a carefully curated benchmark of 100 high-polysemy English words. Our findings reveal significant variation in semantic robustness across both models and languages, providing a principled ranking system for model comparison without requiring definitive causal attribution of error sources. We contribute both a scalable comparative benchmark for multilingual semantic evaluation and a rigorous validation pipeline - critical tools for developing more linguistically balanced AI systems.
zh
[NLP-155] MemeLens: Multilingual Multitask VLMs for Memes
【速读】: 该论文旨在解决现有 meme 研究在任务(如仇恨、性别歧视、宣传、情感、幽默等)和语言上分散导致的跨域泛化能力不足的问题。其解决方案的关键在于提出 MemeLens,一个统一的多语言、多任务、可解释增强的视觉语言模型(Vision Language Model, VLM),通过整合 38 个公开的 meme 数据集,将特定数据集标签映射到包含 20 个任务的共享分类体系(涵盖危害性、目标、隐喻/语用意图和情感),从而实现对 meme 的系统性理解与跨任务、跨语言的迁移能力。实验表明,鲁棒的 meme 理解依赖于多模态训练,并且在不同语义类别间存在显著差异,同时模型在单一数据集上微调时易出现过拟合,而统一训练设置能显著提升性能。
链接: https://arxiv.org/abs/2601.12539
作者: Ali Ezzat Shahroor,Mohamed Bayan Kmainasi,Abul Hasnat,Dimitar Dimitrov,Giovanni Da San Martino,Preslav Nakov,Firoj Alam
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: disinformation, misinformation, factuality, harmfulness, fake news, propaganda, hateful meme, multimodality, text, images
Abstract:Memes are a dominant medium for online communication and manipulation because meaning emerges from interactions between embedded text, imagery, and cultural context. Existing meme research is distributed across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap we propose MemeLens, a unified multilingual and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filter and map dataset-specific labels into a shared taxonomy of 20 tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We will make the experimental resources and datasets publicly available for the community.
zh
[NLP-156] Agent ic Reasoning for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放动态环境中推理能力不足的问题,即从封闭世界中的静态推理向具备自主规划、执行与学习能力的“代理式推理”(Agentic Reasoning)范式转变。其解决方案的关键在于构建一个三维度的组织框架:首先定义基础层(Foundational Agentic Reasoning),涵盖稳定环境下的单智能体规划、工具使用与搜索能力;其次引入自演化层(Self-Evolving Agentic Reasoning),通过反馈、记忆和适应机制优化智能体自身能力;最后拓展至集体多智能体层(Collective Multi-Agent Reasoning),实现协作场景中的协调、知识共享与共同目标达成。在此基础上,区分了基于上下文的推理(In-context Reasoning)与训练后优化的推理(Post-training Reasoning),并系统梳理了多个实际应用场景(如科学、机器人、医疗等)中代表性框架,最终提出统一的“思维到行动”路线图,并指出个性化、长时交互、世界建模、可扩展多智能体训练及治理等关键挑战。
链接: https://arxiv.org/abs/2601.12538
作者: Tianxin Wei,Ting-Wei Li,Zhining Liu,Xuying Ning,Ze Yang,Jiaru Zou,Zhichen Zeng,Ruizhong Qiu,Xiao Lin,Dongqi Fu,Zihao Li,Mengting Ai,Duo Zhou,Wenxuan Bao,Yunzhe Li,Gaotang Li,Cheng Qian,Yu Wang,Xiangru Tang,Yin Xiao,Liri Fang,Hui Liu,Xianfeng Tang,Yuji Zhang,Chi Wang,Jiaxuan You,Heng Ji,Hanghang Tong,Jingrui He
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project: this https URL
Abstract:Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, they struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single-agent capabilities including planning, tool use, and search in stable environments; self-evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi-agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in-context reasoning, which scales test-time interaction through structured orchestration, from post-training reasoning, which optimizes behaviors via reinforcement learning and supervised fine-tuning. We further review representative agentic reasoning frameworks across real-world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance for real-world deployment.
zh
[NLP-157] Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning
【速读】: 该论文旨在解决低资源机器翻译(Low-resource Machine Translation, LMt)中因平行语料稀缺而导致翻译质量受限的问题。其解决方案的关键在于采用基于自监督强化学习的微调策略,通过“往返回译”(round-trip bootstrapping)机制:首先将源语言(英语)翻译为目标低资源语言,再将目标语言翻译回英语,利用chrF++与BLEU的组合奖励函数对重构后的英语句子进行优化。该方法有效利用了NLLB系列模型的预训练知识,在Central Aymara、Friulian、Wolof和Russian等语言上实现了稳定性能提升,并表现出更高的流畅性和语义保真度,表明该框架具有持续自我改进的能力,且随着模型规模扩大将进一步受益。
链接: https://arxiv.org/abs/2601.12535
作者: Ahmed Attia,Alham Fikri
机构: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-resource machine translation (MT) has gained increasing attention as parallel data from low-resource language communities is collected, but many potential methods for improving low-resource MT remain unexplored. We investigate a self-supervised reinforcement-learning-based fine-tuning for translation in low-resource settings using round-trip bootstrapping with the No Language Left Behind (NLLB) family of models. Our approach translates English into a target low-resource language and then back into English, using a combination of chrF++ and BLEU as the reward function on the reconstructed English sentences. Using the NLLB-MD dataset, we evaluate both the 600M and 1.3B parameter NLLB models and observe consistent improvements for the following languages: Central Aymara, Friulian, Wolof and Russian. Qualitative inspection of translation outputs indicates increased fluency and semantic fidelity. We argue that our method can further benefit from scale, enabling models to increasingly leverage their pretrained knowledge and continue self-improving.
zh
[NLP-158] DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable AI-Hostile Documents for Academic Integrity
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)对纸质或电子考试文档的直接解析能力所引发的学术诚信威胁问题。现有评估体系面临MLLM自动作答导致的作弊风险,而传统依赖单次分类器的防御方法难以应对复杂且动态的攻击场景。解决方案的关键在于提出DoPE(Decoy-Oriented Perturbation Encapsulation)框架——通过在文档生成阶段嵌入语义诱饵(semantic decoys),利用MLLM处理流程中渲染(render)与解析(parse)之间的不一致性,实现模型无关的预防(阻止或扰乱自动化解题)和检测(标记AI盲用行为)。其核心创新包括FewSoRT-Q(生成问题级语义诱饵)和FewSoRT-D(将诱饵封装为带水印的文档),并在Integrity-Bench基准上验证了有效性:在黑盒测试中达到91.4%检测率(误报率8.7%)并阻止96.3%的尝试成功完成。
链接: https://arxiv.org/abs/2601.12505
作者: Ashish Raj Shekhar,Shiven Agarwal,Priyanuj Bordoloi,Yash Shah,Tejas Anvekar,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal Large Language Models (MLLMs) can directly consume exam documents, threatening conventional assessments and academic integrity. We present DoPE (Decoy-Oriented Perturbation Encapsulation), a document-layer defense framework that embeds semantic decoys into PDF/HTML assessments to exploit render-parse discrepancies in MLLM pipelines. By instrumenting exams at authoring time, DoPE provides model-agnostic prevention (stop or confound automated solving) and detection (flag blind AI reliance) without relying on conventional one-shot classifiers. We formalize prevention and detection tasks, and introduce FewSoRT-Q, an LLM-guided pipeline that generates question-level semantic decoys and FewSoRT-D to encapsulate them into watermarked documents. We evaluate on Integrity-Bench, a novel benchmark of 1826 exams (PDF+HTML) derived from public QA datasets and OpenCourseWare. Against black-box MLLMs from OpenAI and Anthropic, DoPE yields strong empirical gains: a 91.4% detection rate at an 8.7% false-positive rate using an LLM-as-Judge verifier, and prevents successful completion or induces decoy-aligned failures in 96.3% of attempts. We release Integrity-Bench, our toolkit, and evaluation code to enable reproducible study of document-layer defenses for academic integrity.
zh
[NLP-159] Harmonizing the Arabic Audio Space with Data Scheduling
【速读】: 该论文旨在解决音频大语言模型(Audio Large Language Models, Audio LLMs)在语言结构复杂、方言丰富的场景下适应性不足的问题,特别是在阿拉伯语语境中多任务指令微调的系统性研究尚属空白。解决方案的关键在于提出一种混合训练策略:首先采用任务进度课程学习(Task-Progressive Curriculum, TPC)稳定核心声学映射,随后引入基于对齐器的多样化采样(Aligner-Based Diverse Sampling, ADS)构建信息密集且任务与标签平衡的批次,以实现细粒度语用特征的精准捕捉。实验表明,TPC+ADS混合策略能在保证训练稳定性的同时提升模型性能,尤其在低资源多模态环境下展现出高效性和鲁棒性。
链接: https://arxiv.org/abs/2601.12494
作者: Hunzalah Hassan Bhatti,Firoj Alam,Shammur Absar Chowdhury
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Foundation Models, Large Language Models, Native, Speech Models, Arabic
Abstract:Audio large language models (LLMs) enable unified speech understanding and generation, yet their adaptation to linguistically complex, dialect-rich settings remains underexplored. This paper presents the first systematic study of multi-task instruction tuning for an Arabic-centric audio LLM, covering a hierarchy of generative tasks (ASR, speech summarization) and discriminative tasks (dialect and emotion identification). To support this study, we introduce AraMega-SSum, a novel dataset for Arabic speech summarization. We fine-tune Qwen2.5-Omni (7B) and propose Task-Progressive Curriculum (TPC) along with Aligner-Based Diverse Sampling (ADS), a strategy that constructs information-dense batches by selecting task- and label-balanced examples. Our results reveal a critical efficiency, robustness trade-off: while ADS accelerates initial convergence and boosts paralinguistic F1-scores, its inherent gradient volatility can destabilize generative decoding under prolonged training. Furthermore, while the TPC stabilizes core acoustic mapping, it often induces negative transfer in downstream tasks. We demonstrate that a Hybrid TPC+ADS Strategy provides an optimal training ``recipe’', first establishing a robust representative foundation before employing diversity-aware refinement to capture fine-grained nuances. These findings offer practical guidance for the efficient adaptation of Omni-models in complex, low-resource multimodal environments.
zh
[NLP-160] Capability-Aware Early-Stage Research Idea Evaluation
【速读】: 该论文旨在解决科研早期阶段(即在投入大量资源之前)对研究想法成果进行预测的问题,从而优化科学资源配置与研究规划。其核心挑战在于如何在缺乏完整论文文本或实验结果的情况下,仅基于作者信息和研究构想实现准确预测。解决方案的关键在于提出了一种能力感知(capability-aware)框架,通过三路Transformer架构融合作者信息、推断出的能力表征(capability presentation)与研究想法,并采用两阶段结构学习能力表示,显著提升了模型对论文接受率及评分的预测准确性。
链接: https://arxiv.org/abs/2601.12473
作者: Renlong Jie,Chen Chu,Zhen Wang
机构: Northwestern Polytechnical University(西北工业大学); School of Statistics and Mathematics, Yunnan University of Finance and Economics(云南财经大学统计与数学学院); iOPEN
类目: Computation and Language (cs.CL)
备注:
Abstract:Predicting the outcomes of research ideas at their conceptual stage (i.e. before significant resources are committed) holds great potential for optimizing scientific resource allocation and research planning. While existing methods rely heavily on finished manuscripts or peer reviews, we propose a novel capability-aware framework that predicts paper acceptance and ratings using only author information and research ideas, without requiring full text or experimental results. Our approach integrates author information, (inferred) capability presentation, and research ideas through a three-way transformer architecture with flexible fusion mechanisms. We also introduce a two-stage architecture for learning the capability representation given the author information and idea. Experiments show that our method significantly outperform the single-way models by finetuning bert-base and bert-large, and the capability predicting significantly increase the predictive accuracy of the final model. The proposed method can be applied in both early-stage research outcome prediction and scientific resource allocation.
zh
[NLP-161] Knowing When to Abstain: Medical LLM s Under Clinical Uncertainty EACL
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在医疗多选题问答(Medical Multiple-Choice Question Answering, MCQA)等高风险场景中,因缺乏有效不确定度识别与主动回避机制而导致的安全性不足问题。其核心解决方案是提出MedAbstain——一个统一的基准测试框架与评估协议,通过引入校准预测(conformal prediction)、对抗性问题扰动和显式弃权选项(explicit abstention options),系统性地评估模型在不确定时的弃权能力。关键发现表明,即使是最先进的高准确率模型也常在不确定时不弃权,而显式弃权选项能显著提升模型对不确定性的感知与安全弃权行为,效果远超输入扰动或模型规模扩展,凸显了设计专门的弃权机制对于实现可信部署的重要性。
链接: https://arxiv.org/abs/2601.12471
作者: Sravanthi Machcha,Sushrita Yerra,Sahil Gupta,Aishwarya Sahoo,Sharmin Sultana,Hong Yu,Zonghai Yao
机构: Manning College of Information and Computer Sciences, UMass Amherst, MA, USA; Center for Healthcare Organization and Implementation Research, VA Bedford Health Care; Miner School of Computer and Information Sciences, UMass Lowell, MA, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Equal contribution for the first two authors; To appear in proceedings of the Main Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2026
Abstract:Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce MedAbstain, a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA) – a discrete-choice setting that generalizes to agentic action selection – integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain with uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and safer abstention, far more than input perturbations, while scaling model size or advanced prompting brings little improvement. These findings highlight the central role of abstention mechanisms for trustworthy LLM deployment and offer practical guidance for improving safety in high-stakes applications.
zh
[NLP-162] Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在长上下文推理场景下的性能下降问题,特别是由于“几乎正确”(almost-there)现象导致的最终步骤失败——即推理路径大部分正确但因缺乏高密度推理信号和训练过程中的学习信号丢失而无法获得正确结果。解决方案的关键在于提出两个核心创新:一是基于知识图谱(Knowledge Graph, KG)驱动的DeepReasonQA合成框架,用于可控生成具有内在多跳推理链的高难度长上下文问答对;二是引入长上下文过程优势塑造方法(Long-context Process Advantage Shaping, LongPAS),通过在有效性(Validity)与相关性(Relevance)维度上细粒度地评估推理步骤,从而捕获“几乎正确”轨迹中的关键学习信号,显著提升长上下文推理能力并保持强化学习训练的稳定性。
链接: https://arxiv.org/abs/2601.12465
作者: Miao Peng,Weizhou Shen,Nuo Chen,Chenliang Li,Ming Yan,Jia Li
机构: The Hong Kong University of Science and Technology (Guangzhou); Tongyi Lab, Alibaba Group
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the “almost-there” phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, which captures critical learning signals from “almost-there” trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.
zh
[NLP-163] Privacy-Preserving Federated Learning with Verifiable Fairness Guarantees
【速读】: 该论文旨在解决联邦学习(Federated Learning)中在异构数据分布下保障算法公平性与隐私保护之间的根本性矛盾问题。现有方法难以在不泄露敏感数据的前提下验证公平性指标(如 demographic parity 和 equalized odds)。其解决方案的关键在于提出 CryptoFair-FL,一个基于加法同态加密(additively homomorphic encryption)与安全多方计算(secure multi-party computation)相结合的密码学框架,首次实现了对公平性指标的可验证保障,且无需暴露受保护属性分布或个体预测结果。通过设计一种新型批处理验证协议,将计算复杂度从 O(n2) 降低至 O(nlogn),同时满足 (\dparam, \deltap)-差分隐私(\dparam = 0.5, \deltap = 10^{-6}),并在理论上证明了隐私成本的下界,实现接近最优的隐私-公平权衡。实验表明,该方案显著减少公平性偏差(如人口均等差异从 0.231 降至 0.031),且仅带来 2.3 倍计算开销,同时有效抵御属性推断攻击。
链接: https://arxiv.org/abs/2601.12447
作者: Mohammed Himayath Ali,Mohammed Aqib Abdullah,Syed Muneer Hussin,Mohammed Mudassir Uddin,Shahnawaz Alam
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Federated learning enables collaborative model training across distributed institutions without centralizing sensitive data; however, ensuring algorithmic fairness across heterogeneous data distributions while preserving privacy remains fundamentally unresolved. This paper introduces CryptoFair-FL, a novel cryptographic framework providing the first verifiable fairness guarantees for federated learning systems under formal security definitions. The proposed approach combines additively homomorphic encryption with secure multi-party computation to enable privacy-preserving verification of demographic parity and equalized odds metrics without revealing protected attribute distributions or individual predictions. A novel batched verification protocol reduces computational complexity from BigO(n^2) to BigO(n \log n) while maintaining (\dparam, \deltap)-differential privacy with dparam = 0.5 and deltap = 10^-6. Theoretical analysis establishes information-theoretic lower bounds on the privacy cost of fairness verification, demonstrating that the proposed protocol achieves near-optimal privacy-fairness tradeoffs. Comprehensive experiments across four benchmark datasets (MIMIC-IV healthcare records, Adult Income, CelebA, and a novel FedFair-100 benchmark) demonstrate that CryptoFair-FL reduces fairness violations from 0.231 to 0.031 demographic parity difference while incurring only 2.3 times computational overhead compared to standard federated averaging. The framework successfully defends against attribute inference attacks, maintaining adversarial success probability below 0.05 across all tested configurations. These results establish a practical pathway for deploying fairness-aware federated learning in regulated industries requiring both privacy protection and algorithmic accountability.
zh
[NLP-164] System-Mediated Attention Imbalances Make Vision-Language Models Say Yes
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Model, VLM)中常见的幻觉问题,特别是“是偏倚”(yes-bias)——即模型在未充分依赖图像或文本输入的情况下,倾向于无差别地回答“是”。现有方法多聚焦于增强图像模态的注意力分配,而忽视了系统模态(system modality)的作用。本文提出一种更全面的系统中介视角,指出注意力失衡源于功能冗余的系统权重,这些权重抑制了图像和文本输入的关注度。解决方案的关键在于通过因果方式将注意力从系统模态重新分配至图像和文本模态,从而显著缓解“是偏倚”,且效果优于当前主流方法。此外,研究还发现系统模态引发的注意力失衡会促使模型依赖粗粒度输入表示,这在某些任务中有效但在其他任务中导致错误响应。因此,该研究确立了系统注意力作为VLM幻觉的核心影响因素,并揭示其作为干预杠杆的潜力。
链接: https://arxiv.org/abs/2601.12430
作者: Tsan Tsai Chan,Varsha Suresh,Anisha Saha,Michael Hahn,Vera Demberg
机构: Saarland Informatics Campus, Saarland University, Germany; Max Planck Institute for Informatics, Germany
类目: Computation and Language (cs.CL)
备注: Under review
Abstract:Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond ‘yes’. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.
zh
[NLP-165] Legal experts disagree with rationale extraction techniques for explaining ECtHR case outcome classification
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在法律领域应用中因缺乏可解释性而导致的信任与透明度问题,特别是不同可解释性技术对法律结果预测的解释能力差异尚不明确这一开放性问题。解决方案的关键在于提出一个模型无关的可解释性技术对比分析框架,其中重点采用两种理由提取(rationale extraction)方法,从输入文本中生成人类可理解且简洁的文本片段作为决策依据,并通过归一化充分性(normalized sufficiency)和全面性(comprehensiveness)指标评估忠实性(faithfulness),同时邀请法律专家评估所提取理由的合理性(plausibility)。此外,还进一步探讨了使用大语言模型作为裁判(LLM-as-a-Judge)的可行性,发现模型给出的理由与法律专家存在显著差异,尽管其量化指标表现良好且下游分类性能合理。
链接: https://arxiv.org/abs/2601.12419
作者: Mahammad Namazov,Tomáš Koref,Ivan Habernal
机构: Trustworthy Human Language Technologies, Research Center Trustworthy Data Science and Security of the University Alliance Ruhr, Ruhr University Bochum (鲁尔大学波鸿分校); Center for Critical Computational Studies, Goethe University Frankfurt (歌德大学法兰克福分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Interpretability is critical for applications of large language models in the legal domain which requires trust and transparency. While some studies develop task-specific approaches, other use the classification model’s parameters to explain the decisions. However, which technique explains the legal outcome prediction best remains an open question. To address this challenge, we propose a comparative analysis framework for model-agnostic interpretability techniques. Among these, we employ two rationale extraction methods, which justify outcomes with human-interpretable and concise text fragments (i.e., rationales) from the given input text. We conduct comparison by evaluating faithfulness-via normalized sufficiency and comprehensiveness metrics along with plausibility-by asking legal experts to evaluate extracted rationales. We further assess the feasibility of LLM-as-a-Judge using legal expert evaluation results. We show that the model’s “reasons” for predicting a violation differ substantially from those of legal experts, despite highly promising quantitative analysis results and reasonable downstream classification performance. The source code of our experiments is publicly available at this https URL.
zh
[NLP-166] De-Anonymization at Scale via Tournament-Style Attribution
【速读】: 该论文旨在解决生成式 AI(Generative AI)在匿名文本场景下引发的作者身份泄露风险问题,即利用大语言模型(LLM)对匿名文档进行作者归属识别,从而威胁双盲审稿等隐私保护机制。其核心解决方案是提出一种可扩展的去匿名化方法 De-Anonymization at Scale (DAS),关键在于采用分阶段递进策略:首先通过密集检索预过滤缩小候选文本范围,再基于 LLM 逐轮筛选最可能同源文本,并结合多轮独立运行的多数投票聚合机制提升排序精度与鲁棒性,从而实现对数万级文本池中同作者文本的高准确率识别。
链接: https://arxiv.org/abs/2601.12407
作者: Lirui Zhang,Huishuai Zhang
机构: Beihang University (北京航空航天大学); Peking University (北京大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages
Abstract:As LLMs rapidly advance and enter real-world use, their privacy implications are increasingly important. We study an authorship de-anonymization threat: using LLMs to link anonymous documents to their authors, potentially compromising settings such as double-blind peer review. We propose De-Anonymization at Scale (DAS), a large language model-based method for attributing authorship among tens of thousands of candidate texts. DAS uses a sequential progression strategy: it randomly partitions the candidate corpus into fixed-size groups, prompts an LLM to select the text most likely written by the same author as a query text, and iteratively re-queries the surviving candidates to produce a ranked top-k list. To make this practical at scale, DAS adds a dense-retrieval prefilter to shrink the search space and a majority-voting style aggregation over multiple independent runs to improve robustness and ranking precision. Experiments on anonymized review data show DAS can recover same-author texts from pools of tens of thousands with accuracy well above chance, demonstrating a realistic privacy risk for anonymous platforms. On standard authorship benchmarks (Enron emails and blog posts), DAS also improves both accuracy and scalability over prior approaches, highlighting a new LLM-enabled de-anonymization vulnerability. Comments: 14 pages Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2601.12407 [cs.CR] (or arXiv:2601.12407v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2601.12407 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-167] NADIR: Differential Attention Flow for Non-Autoregressive Transliteration in Indic Languages AAAI2026 AAAI
【速读】: 该论文旨在解决序列到序列(Sequence-to-Sequence)任务中自回归模型(Autoregressive, AR)在推理延迟与精度之间的权衡问题。特别是在多语言音译(Multilingual Transliteration)等依赖局部依赖关系的任务中,AR模型的强归纳偏置导致计算开销大、推理慢,而传统非自回归(Non-Autoregressive, NAR)模型虽速度快但存在幻觉和长度控制差的问题。解决方案的关键在于提出一种名为NADIR的新颖NAR架构,其核心创新包括引入差分Transformer(Differential Transformer)以增强对复杂字符映射的建模能力,并结合专家混合机制(Mixture-of-Experts, MoE),从而在不依赖序列依赖的情况下实现高精度与低延迟的平衡。实验表明,NADIR相较最优AR基线实现超过13倍的速度提升,同时保持接近AR模型的字符错误率(15.78% vs. 14.44%),并显著减少各类错误类型(如重复、替换、遗漏和插入错误)。
链接: https://arxiv.org/abs/2601.12389
作者: Lakshya Tomar,Vinayak Abrol,Puneet Agarwal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract:In this work, we argue that not all sequence-to-sequence tasks require the strong inductive biases of autoregressive (AR) models. Tasks like multilingual transliteration, code refactoring, grammatical correction or text normalization often rely on local dependencies where the full modeling capacity of AR models can be overkill, creating a trade-off between their high accuracy and high inference latency. While non-autoregressive (NAR) models offer speed, they typically suffer from hallucinations and poor length control. To explore this trade-off, we focus on the multilingual transliteration task in Indic languages and introduce NADIR, a novel NAR architecture designed to strike a balance between speed and accuracy. NADIR integrates a Differential Transformer and a Mixture-of-Experts mechanism, enabling it to robustly model complex character mappings without sequential dependencies. NADIR achieves over a 13x speed-up compared to the state-of-the-art AR baseline. It maintains a competitive mean Character Error Rate of 15.78%, compared to 14.44% for the AR model and 21.88% for a standard NAR equivalent. Importantly, NADIR reduces Repetition errors by 49.53%, Substitution errors by 24.45%, Omission errors by 32.92%, and Insertion errors by 16.87%. This work provides a practical blueprint for building fast and reliable NAR systems, effectively bridging the gap between AR accuracy and the demands of real-time, large-scale deployment.
zh
[NLP-168] LR-DWM: Efficient Watermarking for Diffusion Language Models ACL
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)缺乏高效水印(Watermarking, WM)机制的问题。现有水印方法主要针对自回归(Autoregressive, AR)模型设计,依赖于文本的顺序生成特性,难以直接应用于非序列化迭代去噪过程的DLM。为克服此限制,作者提出左-右扩散水印(Left-Right Diffusion Watermarking, LR-DWM),其核心在于利用生成过程中已知的左右邻近token信息,对当前token的分布进行偏差调整,从而嵌入水印信号。该方案在保持与原始DLM接近的计算和内存开销的同时,实现了可靠且统计显著的水印检测性能,显著优于此前需逆向推理的水印方法。
链接: https://arxiv.org/abs/2601.12376
作者: Ofek Raban,Ethan Fetaya,Gal Chechik
机构: Bar-Ilan University (巴伊兰大学); NVIDIA (英伟达)
类目: Computation and Language (cs.CL)
备注: Submitted to ACL Rolling Review (ARR). 7 pages, 4 figures
Abstract:Watermarking (WM) is a critical mechanism for detecting and attributing AI-generated content. Current WM methods for Large Language Models (LLMs) are predominantly tailored for autoregressive (AR) models: They rely on tokens being generated sequentially, and embed stable signals within the generated sequence based on the previously sampled text. Diffusion Language Models (DLMs) generate text via non-sequential iterative denoising, which requires significant modification to use WM methods designed for AR models. Recent work proposed to watermark DLMs by inverting the process when needed, but suffers significant computational or memory overhead. We introduce Left-Right Diffusion Watermarking (LR-DWM), a scheme that biases the generated token based on both left and right neighbors, when they are available. LR-DWM incurs minimal runtime and memory overhead, remaining close to the non-watermarked baseline DLM while enabling reliable statistical detection under standard evaluation settings. Our results demonstrate that DLMs can be watermarked efficiently, achieving high detectability with negligible computational and memory overhead.
zh
[NLP-169] A Scalable Entity-Based Framework for Auditing Bias in LLM s
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)偏见评估中普遍存在的生态效度与统计控制之间的权衡问题——即现有方法要么依赖人工构造的提示(prompt),难以反映真实使用场景;要么采用自然任务,但缺乏规模和严谨性。其解决方案的关键在于提出一种可扩展的偏见审计框架,利用命名实体作为探测器(probe),量化模型行为中的结构性偏差。该方法通过合成数据可靠再现自然文本中的偏见模式,从而支持大规模、多维度(包括实体类型、任务、语言、模型版本及提示策略)的系统性分析,最终揭示出模型在政治倾向、地域偏好、行业倾向等方面存在系统性偏差,并表明模型规模扩大反而加剧偏见,而指令微调虽能缓解偏见,但无法消除跨语言情境下的西方倾向。
链接: https://arxiv.org/abs/2601.12374
作者: Akram Elbouanani,Aboubacar Tuo,Adrian Popescu
机构: Université Paris-Saclay (巴黎萨克雷大学); CEA (法国原子能和替代能源委员会); List (微电子与信息技术实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying on artificial prompts that poorly reflect real-world use, or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework using named entities as probes to measure structural disparities in model behavior. We show that synthetic data reliably reproduces bias patterns observed in natural text, enabling large-scale analysis. Using this approach, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. Our results reveal systematic biases: models penalize right-wing politicians, favor left-wing politicians, prefer Western and wealthy nations over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not attenuate Western-aligned preferences. These results indicate that LLMs should undergo rigorous auditing before deployment in high-stakes applications.
zh
[NLP-170] Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies
【速读】: 该论文旨在解决当前深度研究代理(Deep Research Agents)在自动化生成学术综述时,是否具备与人类专家相当的文献检索与知识组织能力的问题。现有评估基准主要关注语言流畅性或引用准确性,而忽视了核心能力——即从海量文献中精准召回关键论文,并将其构建为结构清晰的知识体系。其解决方案的关键在于提出TaxoBench,一个基于72篇高被引计算机科学综述构建的诊断性评估基准,通过人工提取包含3815个精确分类引用的专家级分类树作为真实标签,支持两种评估模式:Deep Research模式测试端到端的检索与组织能力,Bottom-Up模式则隔离结构化能力以评估模型对已知文献的组织质量。实验表明,当前最先进模型在文献召回率和结构组织一致性上均显著落后于人类专家,揭示了当前技术的双重瓶颈。
链接: https://arxiv.org/abs/2601.12369
作者: Ming Zhang,Jiabao Zhuang,Wenqing Jing,Ziyu Kong,Jingyi Deng,Yujiong Shen,Kexin Tan,Yuhang Zhao,Ning Luo,Renzhe Zheng,Jiahui Lin,Mingqi Wu,Long Ma,Yi Zou,Shihan Dou,Tao Gui,Qi Zhang,Xuanjing Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Deep Research Agents are increasingly used for automated survey generation. However, whether they can write surveys like human experts remains unclear. Existing benchmarks focus on fluency or citation accuracy, but none evaluates the core capabilities: retrieving essential papers and organizing them into coherent knowledge structures. We introduce TaxoBench, a diagnostic benchmark derived from 72 highly-cited computer science surveys. We manually extract expert-authored taxonomy trees containing 3,815 precisely categorized citations as ground truth. Our benchmark supports two evaluation modes: Deep Research mode tests end-to-end retrieval and organization given only a topic, while Bottom-Up mode isolates structuring capability by providing the exact papers human experts used. We evaluate 7 leading Deep Research agents and 12 frontier LLMs. Results reveal a dual bottleneck: the best agent recalls only 20.9% of expert-selected papers, and even with perfect input, the best model achieves only 0.31 ARI in organization. Current deep research agents remain far from expert-level survey writing. Our benchmark is publicly available at this https URL.
zh
[NLP-171] Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLM s NEURIPS2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中面临的提示注入攻击(prompt injection attacks)问题,此类攻击通过邮件或用户生成内容等间接输入渠道绕过对齐机制,诱导模型产生有害或非预期输出。尽管当前对齐技术有所进步,但主流LLM仍广泛易受此类攻击,且现有防御方法多依赖于特定模型、高工程成本或任务定制,缺乏通用性和可扩展性。解决方案的关键在于提出一种零样本嵌入漂移检测方法(Zero-Shot Embedding Drift Detection, ZEDD),其核心思想是利用良性输入与可疑输入在嵌入空间中的语义偏移(embedding drift)来识别攻击,具体通过计算对抗样本与干净样本之间的余弦相似度捕捉细微的语义扰动。ZEDD无需访问模型内部结构、事先了解攻击类型或进行任务特定训练,具备跨模型架构(如Llama 3、Qwen 2、Mistral)的零样本迁移能力,在保持超过93%检测准确率的同时仅产生3%的误报率,从而提供一种轻量、高效且通用的防御机制,填补了LLM系统抵御自适应对抗威胁的重要安全空白。
链接: https://arxiv.org/abs/2601.12359
作者: Anirudh Sekar,Mrinal Agarwal,Rachel Sharma,Akitsugu Tanaka,Jasmine Zhang,Arjun Damerla,Kevin Zhu
机构: Algoverse AI Research
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2025 Lock-LLM Workshop
Abstract:Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of 3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.
zh
[NLP-172] Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline
【速读】: 该论文试图解决的问题是:当前基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS)大多采用同质化架构,即所有智能体共享同一基础LLM,仅通过提示词(prompt)、工具和角色差异实现分工。这种设计是否可被单个LLM通过多轮对话模拟?其核心挑战在于评估单LLM能否在保持性能的同时提升推理效率。解决方案的关键在于提出OneFlow算法,该算法能自动将多智能体工作流重构为单LLM执行模式,并利用KV缓存复用(KV cache reuse)显著降低推理成本,同时在多个基准测试中达到甚至超越原有异构多智能体系统的性能表现,从而确立单LLM实现多智能体工作流作为MAS研究的新基线。
链接: https://arxiv.org/abs/2601.12307
作者: Jiawei Xu,Arief Koesdwiady,Sisong Bei,Yan Han,Baixiang Huang,Dakuo Wang,Yutong Chen,Zheshen Wang,Peihao Wang,Pan Li,Ying Ding
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Amazon (亚马逊); Emory University (埃默里大学); Northeastern University (东北大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in LLM-based multi-agent systems (MAS) show that workflows composed of multiple LLM agents with distinct roles, tools, and communication patterns can outperform single-LLM baselines on complex tasks. However, most frameworks are homogeneous, where all agents share the same base LLM and differ only in prompts, tools, and positions in the workflow. This raises the question of whether such workflows can be simulated by a single agent through multi-turn conversations. We investigate this across seven benchmarks spanning coding, mathematics, general question answering, domain-specific reasoning, and real-world planning and tool use. Our results show that a single agent can reach the performance of homogeneous workflows with an efficiency advantage from KV cache reuse, and can even match the performance of an automatically optimized heterogeneous workflow. Building on this finding, we propose \textbfOneFlow, an algorithm that automatically tailors workflows for single-agent execution, reducing inference costs compared to existing automatic multi-agent design frameworks without trading off accuracy. These results position the single-LLM implementation of multi-agent workflows as a strong baseline for MAS research. We also note that single-LLM methods cannot capture heterogeneous workflows due to the lack of KV cache sharing across different LLMs, highlighting future opportunities in developing \textittruly heterogeneous multi-agent systems.
zh
[NLP-173] Conversational Context Classification: A Representation Engineering Approach
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中容易产生脱离上下文的响应问题,如话题漂移、事实错误或幻觉等,这严重影响了其可靠性和安全性。解决方案的关键在于利用表示工程(Representation Engineering, RepE)与单类支持向量机(One-Class Support Vector Machine, OCSVM)相结合的方法,通过在特定上下文内的样本上训练OCSVM,在LLM的隐藏状态潜在空间中构建一个稳健的边界,从而识别出与目标上下文强相关的内部状态子空间。这一方法不仅提升了对上下文偏离行为的检测能力,也为更深入理解LLM的内部工作机制提供了新路径。
链接: https://arxiv.org/abs/2601.12286
作者: Jonathan Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:The increasing prevalence of Large Language Models (LLMs) demands effective safeguards for their operation, particularly concerning their tendency to generate out-of-context responses. A key challenge is accurately detecting when LLMs stray from expected conversational norms, manifesting as topic shifts, factual inaccuracies, or outright hallucinations. Traditional anomaly detection struggles to directly apply within contextual semantics. This paper outlines our experiment in exploring the use of Representation Engineering (RepE) and One-Class Support Vector Machine (OCSVM) to identify subspaces within the internal states of LLMs that represent a specific context. By training OCSVM on in-context examples, we establish a robust boundary within the LLM’s hidden state latent space. We evaluate out study with two open source LLMs - Llama and Qwen models in specific contextual domain. Our approach entailed identifying the optimal layers within the LLM’s internal state subspaces that strongly associates with the context of interest. Our evaluation results showed promising results in identifying the subspace for a specific context. Aside from being useful in detecting in or out of context conversation threads, this research work contributes to the study of better interpreting LLMs.
zh
[NLP-174] Simulated Annealing Enhances Theory-of-Mind Reasoning in Autoregressive Language Models
【速读】: 该论文试图解决自回归语言模型在理论心理(Theory of Mind, ToM)任务中表现不佳的问题,即模型往往仅优化表面连贯性(local coherence),而难以维持正确的潜在状态表征(latent-state representations,即全局连贯性)。其解决方案的关键在于采用基于马尔可夫链蒙特卡洛(Markov chain Monte Carlo, MCMC)的功率采样(power-sampling)方法,从语言模型的序列级概率分布而非词元级分布中进行采样,并引入退火机制(annealing),逐步将温度从高到低调整,从而显著提升模型在ToM任务中的性能。这一方法无需额外权重更新或验证即可直接从基础模型中恢复强大的ToM能力,表明基于采样的优化是挖掘语言模型潜在能力的有效途径。
链接: https://arxiv.org/abs/2601.12269
作者: Xucong Hu,Jian-Qiao Zhu
机构: Zhejiang University (浙江大学); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Autoregressive language models are next-token predictors and have been criticized for only optimizing surface plausibility (i.e., local coherence) rather than maintaining correct latent-state representations (i.e., global coherence). Because Theory of Mind (ToM) tasks crucially depend on reasoning about latent mental states of oneself and others, such models are therefore often thought to fail at ToM. While post-training methods can improve ToM performance, we show that strong ToM capability can be recovered directly from the base model without any additional weight updates or verifications. Our approach builds on recent power-sampling methods (Karan Du, 2025) that use Markov chain Monte Carlo (MCMC) to sample from sharpened sequence-level (rather than token-level) probability distributions of autoregressive language models. We further find that incorporating annealing, where the tempered distribution is gradually shifted from high to low temperature, substantially improves ToM performance over fixed-temperature power sampling. Together, these results suggest that sampling-based optimization provides a powerful way to extract latent capabilities from language models without retraining.
zh
[NLP-175] Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers
【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在商品搜索系统中面临的对抗性攻击问题,特别是多模态排名攻击(multimodal ranking attacks)导致的推荐结果被恶意操纵的风险。其解决方案的关键在于提出了一种名为多模态生成式引擎优化(Multimodal Generative Engine Optimization, MGEO)的新型对抗框架,该框架通过联合优化不可察觉的图像扰动与流畅的文本后缀,利用VLM内部深层的跨模态耦合机制,在不触发传统内容过滤机制的前提下显著提升目标产品的搜索排名。
链接: https://arxiv.org/abs/2601.12263
作者: Yixuan Du,Chenxiao Yu,Haoyan Xu,Ziyi Wang,Yue Zhao,Xiyang Hu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.
zh
[NLP-176] Environment-Aware Code Generation: How far are We? ICSE2026
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在代码生成任务中缺乏环境感知能力的问题,即现有评估多局限于孤立、小规模代码片段,未考虑用户实际软件环境的复杂性和动态性,导致生成代码难以在特定配置下直接执行。为应对这一挑战,作者提出环境感知代码生成(Environment-Aware Code Generation, EACG)的系统性研究框架,并构建了VersiBCB基准测试集——该数据集具备多包依赖、执行验证和弃用感知特性,能够真实反映软件环境的复杂演化过程。解决方案的关键在于识别并优化三个互补的适应维度:数据(Data)、参数(Parameters)和缓存(Cache),通过针对性策略提升LLMs生成代码与目标环境的兼容性与可执行性,从而推动其在实际软件工程流程中的部署落地。
链接: https://arxiv.org/abs/2601.12262
作者: Tongtong Wu,Rongyi Chen,Wenjie Du,Suyu Ma,Guilin Qi,Zhenchang Xing,Shahram Khadivi,Ramesh Periyathambi,Gholamreza Haffari
机构: Monash University (莫纳什大学); Southeast University (东南大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61); eBay Inc. (eBay公司)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: ICSE 2026
Abstract:Recent progress in large language models (LLMs) has improved code generation, but most evaluations still test isolated, small-scale code (e.g., a single function) under default or unspecified software environments. As a result, it is unclear whether LLMs can reliably generate executable code tailored to a user’s specific environment. We present the first systematic study of Environment-Aware Code Generation (EACG), where generated code must be functionally correct and directly executable under arbitrary software configurations. To enable realistic evaluation, we introduce VersiBCB, a benchmark that is multi-package, execution-verified, and deprecation-aware, capturing complex and evolving environments that prior datasets often overlook. Using VersiBCB, we investigate three complementary adaptation axes: data, parameters, and cache, and develop representative strategies for each. Our results show that current LLMs struggle with environment-specific code generation, while our adaptations improve environment compatibility and executability. These findings highlight key challenges and opportunities for deploying LLMs in practical software engineering workflows.
zh
[NLP-177] Plan Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在文本生成过程中,现有解码策略多采用反应式方法、未能充分利用全局双向上下文来指导整体生成轨迹的问题。解决方案的关键在于提出一种无需训练的“规划-验证-填充”(Plan-Verify-Fill, PVF)范式:通过量化验证机制实现规划,主动构建以高杠杆语义锚点优先的分层骨架,并引入验证协议在语义结构上实现实用性的停止条件——即当进一步推理带来的收益趋于边际递减时停止,从而显著降低函数评估次数(NFE),提升效率且不牺牲准确性。
链接: https://arxiv.org/abs/2601.12247
作者: Miao Li,Hanyang Jiang,Sikai Chen,Hengyu Fu,Yuhang Cai,Baihe Huang,Tinghan Ye,Xuanzhou Chen,Pascal Van Hentenryck
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.
zh
[NLP-178] CoReflect: Conversational Evaluation via Co-Evolutionary Simulation and Reflective Rubric Refinement
【速读】: 该论文旨在解决多轮对话系统评估中传统方法依赖人工定义评判标准和固定对话情境所带来的局限性,此类静态评估方式难以覆盖对话模型多样且不断涌现的行为模式。解决方案的关键在于提出CoReflect(Conversational Evaluation via Co-Evolutionary Simulation and Reflective Rubric Refinement),其核心机制是将对话模拟与评估整合为一个自适应的迭代过程:通过对话规划器生成结构化模板以引导用户模拟器进行目标导向的多样化对话,再由反思分析器识别系统性行为模式并自动优化评价指标;更重要的是,分析结果被反馈至规划器以更新对话模板,形成“对话生成—评估反馈—规则迭代”的协同进化闭环,从而实现测试用例复杂度与评判精度的同步提升,显著降低人工干预需求,构建可随对话模型能力演进而自我完善的评估体系。
链接: https://arxiv.org/abs/2601.12208
作者: Yunzhe Li,Richie Yueqi Feng,Tianxin Wei,Chin-Chia Hsu
机构: Google DeepMind(谷歌深度思维); University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Evaluating conversational systems in multi-turn settings remains a fundamental challenge. Conventional pipelines typically rely on manually defined rubrics and fixed conversational context - a static approach that limits coverage and fails to capture the diverse, emergent behaviors of dialogue models. To address this, we introduce CoReflect (Conversational Evaluation via Co-Evolutionary Simulation and Reflective Rubric Refinement), which unifies dialogue simulation and evaluation into an adaptive, iterative process. CoReflect employs a conversation planner that generates structured templates to guide a user simulator through diverse, goal-directed dialogues. Subsequently, a reflective analyzer processes these dialogues to identify systematic behavioral patterns and automatically refine the evaluation rubrics. Crucially, the insights from the conversation analysis are fed back into the planner to update conversation templates for subsequent iterations. This co-evolution loop ensures that the complexity of test cases and the diagnostic precision of rubrics improve in tandem. By minimizing human intervention, CoReflect provides a scalable and self-refining methodology that allows evaluation protocols to adapt alongside the rapidly advancing capabilities of dialogue models.
zh
[NLP-179] CTC-DID: CTC-Based Arabic dialect identification for streaming applications ICASSP2026
【速读】: 该论文旨在解决低资源场景下阿拉伯语方言识别(Arabic Dialect Identification, ADI)任务的性能瓶颈问题,尤其在训练数据有限时模型泛化能力不足、对短语音片段鲁棒性差等挑战。其核心解决方案是借鉴自动语音识别(ASR)中连接时序分类(Connectionist Temporal Classification, CTC)损失函数的思想,将方言识别建模为一个有限词汇表的ASR问题,其中方言标签被视为给定语音片段的标签序列。关键创新在于利用语言无关启发式方法(Language-Agnostic Heuristic, LAH)或预训练ASR模型估计标签重复模式以辅助训练,并通过自监督学习(SSL)策略提升模型在小样本下的表现,最终实现比微调Whisper和ECAPA-TDNN模型更优的性能,且在零样本迁移和流式实时应用中具备更强适应性与稳定性。
链接: https://arxiv.org/abs/2601.12199
作者: Muhammad Umar Farooq,Oscar Saz
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for IEEE ICASSP 2026
Abstract:This paper proposes a Dialect Identification (DID) approach inspired by the Connectionist Temporal Classification (CTC) loss function as used in Automatic Speech Recognition (ASR). CTC-DID frames the dialect identification task as a limited-vocabulary ASR system, where dialect tags are treated as a sequence of labels for a given utterance. For training, the repetition of dialect tags in transcriptions is estimated either using a proposed Language-Agnostic Heuristic (LAH) approach or a pre-trained ASR model. The method is evaluated on the low-resource Arabic Dialect Identification (ADI) task, with experimental results demonstrating that an SSL-based CTC-DID model, trained on a limited dataset, outperforms both fine-tuned Whisper and ECAPA-TDNN models. Notably, CTC-DID also surpasses these models in zero-shot evaluation on the Casablanca dataset. The proposed approach is found to be more robust to shorter utterances and is shown to be easily adaptable for streaming, real-time applications, with minimal performance degradation.
zh
[NLP-180] olerance Principle and Small Language Model Learning
【速读】: 该论文试图解决的问题是:语言模型在有限数据条件下能否像人类婴儿一样通过少量示例习得抽象语法规则,并验证Yang(2016)提出的“容忍原则”(Tolerance Principle)是否适用于基于Transformer架构的语言模型。解决方案的关键在于使用优化于小数据集的BabyBERTa模型,在人工语法任务中系统性地控制训练数据的数量、句子类型多样性及规则遵循与例外样本的比例,从而检验模型的学习动态是否符合人类婴儿所表现出的对规则容错能力的阈值特性。结果表明,BabyBERTa的学习行为并不遵循该容忍原则,暗示当前主流语言模型在学习机制上可能与人类认知存在本质差异。
链接: https://arxiv.org/abs/2601.12179
作者: Adam E. Friedman,Stevan Harnad,Rushen Shi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures. BUCLD 50 Proceedings. To be published in 2026 by Cascadilla Press
Abstract:Modern language models like GPT-3, BERT, and LLaMA require massive training data, yet with sufficient training they reliably learn to distinguish grammatical from ungrammatical sentences. Children aged as young as 14 months already have the capacity to learn abstract grammar rules from very few exemplars, even in the presence of non-rule-following exceptions. Yang’s (2016) Tolerance Principle defines a precise threshold for how many exceptions a rule can tolerate and still be learnable. The present study explored the minimal amount and quality of training data necessary for rules to be generalized by a transformer-based language model to test the predictions of the Tolerance Principle. We trained BabyBERTa (Huebner et al. 2021), a transformer model optimized for small datasets, on artificial grammars. The training sets varied in size, number of unique sentence types, and proportion of rule-following versus exception exemplars. We found that, unlike human infants, BabyBERTa’s learning dynamics do not align with the Tolerance Principle.
zh
[NLP-181] he Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political Documents
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在多语言环境中进行政治分析时,其输出是否因提示语(prompt)的语言不同而产生系统性偏倚。解决方案的关键在于通过实验设计,使用语义等价但语言不同的提示(俄语与乌克兰语),对同一份乌克兰公民社会文件进行分析,从而控制输入内容和结构变量,揭示语言本身如何驱动模型生成具有显著差异的意识形态倾向和解释结论——即俄语提示下模型倾向于采用俄罗斯官方话语框架,将公民社会行为者描述为非法精英;而乌克兰语提示则导向西方自由民主政治学话语体系,将其视为合法的政治参与者。这一发现凸显了提示语言作为关键变量对LLM输出偏倚的影响机制,为AI在多语言、高极化信息环境中的部署与治理提供了实证依据。
链接: https://arxiv.org/abs/2601.12164
作者: Oleg Smirnov
机构: Microsoft(微软)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly deployed as analytical tools across multilingual contexts, yet their outputs may carry systematic biases conditioned by the language of the prompt. This study presents an experimental comparison of LLM-generated political analyses of a Ukrainian civil society document, using semantically equivalent prompts in Russian and Ukrainian. Despite identical source material and parallel query structures, the resulting analyses varied substantially in rhetorical positioning, ideological orientation, and interpretive conclusions. The Russian-language output echoed narratives common in Russian state discourse, characterizing civil society actors as illegitimate elites undermining democratic mandates. The Ukrainian-language output adopted vocabulary characteristic of Western liberal-democratic political science, treating the same actors as legitimate stakeholders within democratic contestation. These findings demonstrate that prompt language alone can produce systematically different ideological orientations from identical models analyzing identical content, with significant implications for AI deployment in polarized information environments, cross-lingual research applications, and the governance of AI systems in multilingual societies.
zh
[NLP-182] Analyzing Cancer Patients Experiences with Embedding-based Topic Modeling and LLM s
【速读】: 该论文旨在解决如何从癌症患者叙事数据中自动提取有意义的主题,以支持更以患者为中心的医疗实践。其核心问题是传统方法难以高效、准确地从大量非结构化访谈文本中识别出具有临床意义的共性主题。解决方案的关键在于构建一个结合神经主题建模(如BERTopic)与大语言模型(LLM,如GPT-4)的端到端分析流程:首先利用Bertopic进行细粒度的主题聚类和关键词提取,再通过LLM对主题进行语义标签化;同时引入领域特定嵌入模型(如BioClinicalBERT)提升主题的精确性和可解释性。实验表明,使用生物医学领域预训练模型能显著增强主题建模效果,尤其在跨访谈的一致性与临床相关性方面表现最优,从而为临床医生提供来自患者声音的有效反馈。
链接: https://arxiv.org/abs/2601.12154
作者: Teodor-Călin Ionescu,Lifeng Han,Jan Heijdra Suasnabar,Anne Stiggelbout,Suzan Verberne
机构: Leiden Institute of Advanced Computer Science (LIACS)(莱顿高级计算机科学研究所); Leiden University Medical Center (LUMC)(莱顿大学医学中心)
类目: Computation and Language (cs.CL)
备注: under review to CLIN journal
Abstract:This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, to offer insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec for individual interview summarization by using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on Keyword Extraction. LLMs (GPT4) are then used for the next step topic labeling. Their outputs for a single interview (I0) are rated through a small-scale human evaluation, focusing on coherence, clarity, and relevance. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation using three clinically oriented embedding models. We then analyzed the full interview collection with the best model setting. Results show that domain-specific embeddings improved topic \textitprecision and \textitinterpretability, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics throughout all 13 interviews, namely Coordination and Communication in Cancer Care Management" and Patient Decision-Making in Cancer Treatment Journey’‘. Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients’ voices in healthcare workflows.
zh
[NLP-183] Bengali Text Classification: An Evaluation of Large Language Model Approaches
【速读】: 该论文旨在解决孟加拉语(Bengali)文本分类任务中因标注数据集稀缺和预训练语言模型匮乏而导致的性能瓶颈问题。其解决方案的关键在于评估三种指令微调的大语言模型(LLMs)——LLaMA 3.1 8B Instruct、LLaMA 3.2 3B Instruct 和 Qwen 2.5 7B Instruct——在孟加拉语新闻文章分类任务中的表现,结果表明Qwen 2.5 7B Instruct在相同分类框架下取得了最高准确率(72%),尤其在“体育”类别上表现突出,验证了大语言模型在资源有限的孟加拉语自然语言处理任务中的有效性。
链接: https://arxiv.org/abs/2601.12132
作者: Md Mahmudul Hoque,Md Mehedi Hassain,Md Hojaifa Tanvir,Rahul Nandy
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Bengali text classification is a Significant task in natural language processing (NLP), where text is categorized into predefined labels. Unlike English, Bengali faces challenges due to the lack of extensive annotated datasets and pre-trained language models. This study explores the effectiveness of large language models (LLMs) in classifying Bengali newspaper articles. The dataset used, obtained from Kaggle, consists of articles from Prothom Alo, a major Bangladeshi newspaper. Three instruction-tuned LLMs LLaMA 3.1 8B Instruct, LLaMA 3.2 3B Instruct, and Qwen 2.5 7B Instruct were evaluated for this task under the same classification framework. Among the evaluated models, Qwen 2.5 achieved the highest classification accuracy of 72%, showing particular strength in the “Sports” category. In comparison, LLaMA 3.1 and LLaMA 3.2 attained accuracies of 53% and 56%, respectively. The findings highlight the effectiveness of LLMs in Bengali text classification, despite the scarcity of resources for Bengali NLP. Future research will focus on exploring additional models, addressing class imbalance issues, and refining fine-tuning approaches to improve classification performance.
zh
[NLP-184] Powerful Training-Free Membership Inference Against Autoregressive Language Models
【速读】: 该论文旨在解决微调语言模型(fine-tuned language models)中存在的隐私风险问题,尤其是通过成员推断攻击(Membership Inference Attacks, MIAs)来量化和检测模型是否记忆并可能泄露训练数据中的敏感信息。现有MIA方法在低假阳性率(False Positive Rate, FPR)条件下检测能力有限,难以满足实际隐私审计需求。其解决方案的关键在于提出EZ-MIA,一种基于“错误区域”(Error Zone, EZ)的新型攻击方法:该方法观察到模型的记忆效应在预测错误的位置表现最为显著——即模型对训练样本仍保持较高概率但实际预测错误。为此,作者设计了EZ分数,用于衡量错误位置上概率分布相对于预训练参考模型的方向性偏移,该统计量仅需两次前向传播且无需任何额外训练。实验表明,EZ-MIA在多种基准数据集和模型架构上均显著优于现有最优方法,在极低FPR下实现了更高的真阳性率(True Positive Rate, TPR),揭示了微调语言模型的隐私风险远高于此前认知。
链接: https://arxiv.org/abs/2601.12104
作者: David Ilić,David Stanojević,Kostadin Cvejoski
机构: JetBrains Research (JetBrains 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 9 pages, 2 figures; appendix with additional experiments and derivations
Abstract:Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at this https URL.
zh
[NLP-185] Large language models struggle with ethnographic text annotation
【速读】: 该论文试图解决的问题是:大语言模型(Large Language Models, LLMs)是否能够胜任民族志文本的结构化标注任务,从而加速跨文化研究。其解决方案的关键在于系统评估7个最先进的LLM在标注567段民族志文本中121个仪式特征时的表现,并与人类编码者的一致性进行对比。研究发现,尽管LLM在某些任务上展现出潜力,但整体性能远低于可靠自动化标注所需水平,尤其在长文本、需序数区分或语义模糊的构念上表现不佳;更重要的是,人类编码者间的一致性设定了LLM准确性的上限,表明当前LLM尚无法替代人类专家在民族志标注中的专业判断。
链接: https://arxiv.org/abs/2601.12099
作者: Leonardo S. Goodall,Dor Shilton,Daniel A. Mullins,Harvey Whitehouse
机构: Calleva Research Centre (Calleva 研究中心); Oxford Internet Institute (牛津互联网研究所); University of Oxford (牛津大学); Cohn Institute for the History and Philosophy of Science and Ideas (科恩科学与思想史研究所); Tel Aviv University (特拉维夫大学); Birkbeck College (伯克贝克学院); University of London (伦敦大学); Centre for the Study of Social Cohesion (社会凝聚力研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have shown promise for automated text annotation, raising hopes that they might accelerate cross-cultural research by extracting structured data from ethnographic texts. We evaluated 7 state-of-the-art LLMs on their ability to annotate 121 ritual features across 567 ethnographic excerpts. Performance was limited, falling well below levels required for reliable automated annotation. Longer texts, features requiring ordinal distinctions, and ambiguous constructs proved particularly difficult. Human inter-coder reliability set an approximate ceiling on LLM accuracy: features that human coders found difficult to agree upon were also difficult for LLMs. Yet even on features where humans reliably agreed, models fell short of human performance. Our findings suggest that LLMs cannot yet substitute for human expertise in ethnographic annotation.
zh
[NLP-186] Neural Isomorphic Fields: A Transformer-based Algebraic Numerical Embedding
【速读】: 该论文旨在解决神经网络模型在处理极小或极大数值时面临的溢出(overflow)、下溢(underflow)及输出不稳定等问题。其核心解决方案是引入一种固定长度的数字嵌入向量(number embedding vector),该向量不直接使用原始数值,而是通过神经同构域(Neural Isomorphic Field)这一新型神经抽象结构来保留有理数域上的代数运算性质,包括加法、乘法和比较操作。关键创新在于将传统代数结构(如群与域)映射为嵌入向量空间中的运算机制,从而在保持数值稳定性的同时实现对基本代数属性的有效建模。实验表明,加法运算在身份律、封闭性和结合律等测试中准确率超过95%,而乘法仍存在挑战,准确率介于53%至73%之间,提示该方法在加法场景下表现优异,但乘法仍需进一步优化。
链接: https://arxiv.org/abs/2601.12095
作者: Hamidreza Sadeghi,Saeedeh Momtazi,Reza Safabakhsh
机构: Amirkabir University of Technology (伊朗阿米尔卡比尔理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Neural network models often face challenges when processing very small or very large numbers due to issues such as overflow, underflow, and unstable output variations. To mitigate these problems, we propose using embedding vectors for numbers instead of directly using their raw values. These embeddings aim to retain essential algebraic properties while preventing numerical instabilities. In this paper, we introduce, for the first time, a fixed-length number embedding vector that preserves algebraic operations, including addition, multiplication, and comparison, within the field of rational numbers. We propose a novel Neural Isomorphic Field, a neural abstraction of algebraic structures such as groups and fields. The elements of this neural field are embedding vectors that maintain algebraic structure during computations. Our experiments demonstrate that addition performs exceptionally well, achieving over 95 percent accuracy on key algebraic tests such as identity, closure, and associativity. In contrast, multiplication exhibits challenges, with accuracy ranging from 53 percent to 73 percent across various algebraic properties. These findings highlight the model’s strengths in preserving algebraic properties under addition while identifying avenues for further improvement in handling multiplication.
zh
[NLP-187] Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个性化响应生成中如何有效利用用户历史记录的问题。现有方法通常依赖语义相关性来选择用户历史记录,但这种策略存在局限性:语义相似的记录可能因冗余或信息冲突而降低生成质量。解决方案的关键在于提出PURPLE框架——一个基于上下文多臂赌博机(contextual bandit)的方法,通过Plackett-Luce排序模型建模记录间的复杂依赖关系,并以参考响应的似然作为密集反馈信号进行训练,从而将检索过程直接优化为提升生成质量的目标。该方法实现了用户画像构建的可学习、可扩展且高效的过程,显著优于传统启发式与检索增强基线。
链接: https://arxiv.org/abs/2601.12078
作者: Linfeng Du,Ye Yuan,Zichen Zhao,Fuyuan Lyu,Emiliano Penaloza,Xiuying Chen,Zipeng Sun,Jikun Kang,Laurent Charlin,Xue Liu,Haolun Wu
机构: McGill University (麦吉尔大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Université de Montréal (蒙特利尔大学); HEC Montréal (蒙特利尔高等商学院); Mila - Quebec AI Institute (魁北克人工智能研究所)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Large Language Models (LLMs) excel at general-purpose tasks, yet adapting their responses to individual users remains challenging. Retrieval augmentation provides a lightweight alternative to fine-tuning by conditioning LLMs on user history records, and existing approaches typically select these records based on semantic relevance. We argue that relevance serves as an unreliable proxy for utility: a record may be semantically similar to a query yet fail to improve generation quality or even degrade it due to redundancy or conflicting information. To bridge this gap, we propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for Llm pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as a set generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with dense feedback provided by the likelihood of the reference response, our method aligns retrieval directly with generation quality. Extensive experiments on nine personalization tasks demonstrate that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency, establishing a principled and scalable solution for optimizing user profiles.
zh
[NLP-188] CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation
【速读】: 该论文针对遥感视频参照目标分割(RS-RVOS)中因目标显著性弱和动态场景下视觉信息严重截断导致的难以维持判别性目标表征问题,以及现有模型受限于 biased initial memory construction 和 indiscriminate memory accumulation 引发的定位误差传播问题展开研究。解决方案的关键在于:一是构建首个大规模 RS-RVOS 基准数据集 RS-RVOS Bench,采用因果感知标注策略确保语言引用仅基于初始帧目标状态,提升任务真实性;二是提出 Memory Quality Control with Segment Anything Model (MQC-SAM) 框架,其核心创新为引入时序运动一致性模块用于初始记忆校准,并设计解耦式注意力记忆融合机制结合动态质量评估,实现高置信度语义特征的选择性更新与不可靠信息过滤,从而有效抑制错误累积与传播。
链接: https://arxiv.org/abs/2601.12076
作者: H. Jiang,Y. Sun,Z. Dong,T. Liu,Y. Gu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.
zh
[NLP-189] o Copy or Not to Copy: Copying Is Easier to Induce Than Recall
【速读】: 该论文旨在解决语言模型在检索增强(retrieval-augmented)场景中如何权衡参数化知识(parametric knowledge)与上下文信息(contextual information)的问题。其核心挑战在于,模型在面对无关或错误的上下文时,可能倾向于复制(copy)而非正确回忆(recall)已存储的知识。解决方案的关键在于提出并验证了一个“仲裁向量”(arbitration vector),该向量通过提取模型激活在特定数据集上的残差流中心差异来量化两种行为模式之间的区别,并将其作为加性干预注入到不同层和标记位置,从而实现对模型行为的定向调控:一方面抑制上下文使用以促进参数知识调用(Copy → Recall),另一方面诱导模型复制任意上下文token(Recall → Copy)。实验表明,这一机制在两种架构(仅解码器和编码器/解码器)及两个开放域问答基准上均能稳定地改变模型行为,且机制分析揭示出诱导复制比恢复召回更鲁棒——前者是可局部触发的“再激活”过程,后者则是依赖于对象标记干预的脆弱“抑制”过程。
链接: https://arxiv.org/abs/2601.12075
作者: Mehrdad Farahani,Franziska Penzkofer,Richard Johansson
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Language models used in retrieval-augmented settings must arbitrate between parametric knowledge stored in their weights and contextual information in the prompt. This work presents a mechanistic study of that choice by extracting an \empharbitration vector from model activations on a curated dataset designed to disentangle (i) irrelevant contexts that elicit parametric recall and (ii) relevant but false contexts that elicit copying. The vector is computed as the residual-stream centroid difference between these regimes across 27 relations, and is injected as an additive intervention at selected layers and token spans to steer behavior in two directions: Copy \rightarrow Recall (suppressing context use) and Recall \rightarrow Copy (inducing the model to copy any token from the context). Experiments on two architectures (decoder-only and encoder/decoder) and two open-domain QA benchmarks show consistent behavior shifts under moderate scaling while monitoring accuracy and fluency. Mechanistic analyses of attention routing, MLP contributions, and layer-wise probability trajectories reveal an asymmetry: inducing copying is an easy reactivation'' process that can be triggered at different locations in the input, while restoring recall is a suppression’’ process that is more fragile and strongly tied to object-token interventions.
zh
[NLP-190] Bridging the Gap in Bangla Healthcare: Machine Learning Based Disease Prediction Using a Symptoms-Disease Dataset
【速读】: 该论文旨在解决非英语人群获取可靠健康信息的难题,特别是针对孟加拉语(Bangla)使用者在疾病预测资源匮乏的问题。其解决方案的关键在于构建了一个包含758个独特症状-疾病关系、覆盖85种疾病的孟加拉语症状-疾病数据集,并公开发布以确保透明性和可复现性。基于该数据集,研究评估了多种机器学习模型,并通过软投票和硬投票集成方法融合表现最优的模型,最终实现了98%的准确率,显著提升了模型的鲁棒性和泛化能力,为孟加拉语地区的疾病预测与本地化健康信息服务奠定了基础。
链接: https://arxiv.org/abs/2601.12068
作者: Rowzatul Zannat,Abdullah Al Shafi,Abdul Muntakim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Increased access to reliable health information is essential for non-English-speaking populations, yet resources in Bangla for disease prediction remain limited. This study addresses this gap by developing a comprehensive Bangla symptoms-disease dataset containing 758 unique symptom-disease relationships spanning 85 diseases. To ensure transparency and reproducibility, we also make our dataset publicly available. The dataset enables the prediction of diseases based on Bangla symptom inputs, supporting healthcare accessibility for Bengali-speaking populations. Using this dataset, we evaluated multiple machine learning models to predict diseases based on symptoms provided in Bangla and analyzed their performance on our dataset. Both soft and hard voting ensemble approaches combining top-performing models achieved 98% accuracy, demonstrating superior robustness and generalization. Our work establishes a foundational resource for disease prediction in Bangla, paving the way for future advancements in localized health informatics and diagnostic tools. This contribution aims to enhance equitable access to health information for Bangla-speaking communities, particularly for early disease detection and healthcare interventions.
zh
[NLP-191] Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM -Assisted and Gold-Label-Free Evaluation ACL2026
【速读】: 该论文旨在解决对话行为(Dialogue Act, DA)标注中因局部化意图与边界判断不一致导致的标注可靠性下降问题,即标注者对语用意图达成共识但对话语片段边界存在分歧。其核心解决方案是提出“代码本注入分割”(codebook-injected segmentation),通过将下游标注标准(如DA类别)作为边界决策的条件约束,提升分割结果与语义一致性的一致性。实验表明,基于大语言模型(LLM)的分段器在生成内部一致的语义单元方面优于纯文本基线,但全局对话流变化检测仍由基于连贯性的基线方法更优;且不同指标间存在权衡关系,说明分割策略应根据下游任务目标优化,而非追求单一性能指标。
链接: https://arxiv.org/abs/2601.12061
作者: Jinsook Lee,Kirk Vanacore,Zhuqian Zhou,Jeanine Grutter,Rene F. Kizilcec
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review for ACL 2026
Abstract:Dialogue Act (DA) annotation typically treats communicative or pedagogical intent as localized to individual utterances or turns. This leads annotators to agree on the underlying action while disagreeing on segment boundaries, reducing apparent reliability. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria, and evaluate LLM-based segmenters against standard and retrieval-augmented baselines. To assess these without gold labels, we introduce evaluation metrics for span consistency, distinctiveness, and human-AI distributional agreement. We found DA-awareness produces segments that are internally more consistent than text-only baselines. While LLMs excel at creating construct-consistent spans, coherence-based baselines remain superior at detecting global shifts in dialogue flow. Across two datasets, no single segmenter dominates. Improvements in within-segment coherence frequently trade off against boundary distinctiveness and human-AI distributional agreement. These results highlight segmentation as a consequential design choice that should be optimized for downstream objectives rather than a single performance score.
zh
[NLP-192] Dont Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLM s AAAI2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)个性化过程中软提示(soft prompts)在基础模型升级后失效的问题,避免因模型迭代而需进行昂贵的全量重训练。其解决方案的关键在于提出Prompt-level User Migration Adapter(PUMA),通过参数高效适配器(parameter-efficient adapter)弥合不同模型间的语义鸿沟,并结合基于分组的用户选择策略显著降低训练成本,从而实现个性化提示在不兼容模型间的高效迁移。
链接: https://arxiv.org/abs/2601.12034
作者: Ziyi Zhao,Chongming Gao,Yang Zhang,Haoyan Liu,Weinan Gan,Huifeng Guo,Yong Liu,Fuli Feng
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to AAAI 2026 (Oral). 9 pages, 5 figures
Abstract:Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.
zh
[NLP-193] Preserving Fairness and Safety in Quantized LLM s Through Critical Weight Protection
【速读】: 该论文旨在解决量化(Quantization)技术在大语言模型(Large Language Models, LLMs)中应用时对公平性(Fairness)和安全性(Safety)带来的负面影响,尤其是在动态量化(Dynamic Quantization)和多语言场景下的潜在风险。研究表明,量化会系统性地削弱模型的公平性和安全性,且非英语语境中的安全性能下降尤为显著。为应对这一问题,作者提出关键解决方案——关键权重保护(Critical Weight Protection),该方法通过识别并保留对公平性和安全性至关重要的模型权重,在不进行昂贵的重新训练或对齐的情况下有效缓解偏差与安全性的退化,从而在保持量化效率的同时提升模型的可信度。
链接: https://arxiv.org/abs/2601.12033
作者: Muhammad Alif Al Hakim,Alfan Farizki Wicaksono,Fajri Koto
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Quantization is widely adopted to reduce the computational cost of large language models (LLMs); however, its implications for fairness and safety, particularly in dynamic quantization and multilingual contexts, remain underexplored. In this work, we conduct a systematic study of how static and dynamic quantization methods impact fairness and safety across benchmarks measuring intrinsic and extrinsic bias and safety alignment. For fairness, we evaluate English, French, Dutch, Spanish, and Turkish; for safety, we focus on English, Korean, and Arabic. Our findings reveal that quantization consistently degrades fairness and safety, with dynamic methods demonstrating greater stability than static ones. Moreover, fairness degradation varies across languages, while safety deterioration is especially pronounced in non-English settings. To address these risks, we introduce Critical Weight Protection, a novel technique that identifies and preserves fairness- and safety-critical weights during quantization. This approach effectively mitigates bias and safety deterioration without costly retraining or alignment, maintaining trustworthiness while retaining efficiency.
zh
[NLP-194] A Multi-Agent System for Generating Actionable Business Advice
【速读】: 该论文旨在解决现有客户评论分析方法难以从海量用户反馈中提炼出具体、可执行的商业建议的问题,尤其针对当前基于大语言模型(Large Language Models, LLMs)的生成式 AI 输出常缺乏准确性与深度推理能力的局限。其解决方案的关键在于提出一个基于多智能体(multi-agent)的LLM框架,通过四个核心模块实现:聚类筛选代表性评论以进行语料蒸馏、生成初步建议、迭代评估优化以及基于可行性排序,从而将大规模评论数据转化为具有针对性、可操作性和实用性的决策支持输出。该设计实现了语料精炼与反馈驱动的建议迭代优化相结合,显著提升了建议的行动性、具体性和非冗余性。
链接: https://arxiv.org/abs/2601.12024
作者: Kartikey Singh Bhandari,Tanish Jain,Archit Agrawal,Dhruv Kumar,Praveen Kumar,Pratik Narang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Customer reviews contain rich signals about product weaknesses and unmet user needs, yet existing analytic methods rarely move beyond descriptive tasks such as sentiment analysis or aspect extraction. While large language models (LLMs) can generate free-form suggestions, their outputs often lack accuracy and depth of reasoning. In this paper, we present a multi-agent, LLM-based framework for prescriptive decision support, which transforms large scale review corpora into actionable business advice. The framework integrates four components: clustering to select representative reviews, generation of advices, iterative evaluation, and feasibility based ranking. This design couples corpus distillation with feedback driven advice refinement to produce outputs that are specific, actionable, and practical. Experiments across three service domains and multiple model families show that our framework consistently outperform single model baselines on actionability, specificity, and non-redundancy, with medium sized models approaching the performance of large model frameworks.
zh
[NLP-195] Acting Flatterers via LLM s Sycophancy: Combating Clickbait with LLM s Opposing-Stance Reasoning
【速读】: 该论文旨在解决在线内容中广泛存在的“标题党”(clickbait)问题,即通过误导性或夸张的标题吸引点击,从而影响信息传播的真实性与可信度。现有基于大语言模型(Large Language Models, LLMs)的方法常受限于“谄媚倾向”(sycophancy),即模型更倾向于生成符合用户既有信念而非事实正确的推理。针对这一挑战,论文提出一种创新解决方案:将谄媚倾向转化为优势,设计了自更新的对立立场推理生成框架(Self-renewal Opposing-stance Reasoning Generation, SORG),利用LLM从正反两个角度生成高质量的“同意”与“反对”推理对,无需人工标注标签。关键在于通过对比学习(contrastive learning)机制,结合LLM生成的可信度分数作为软标签,构建局部基于对立推理的检测模型(Opposing Reasoning-based Clickbait Detection, ORCD),从而显著提升检测鲁棒性和准确性。
链接: https://arxiv.org/abs/2601.12019
作者: Chaowei Zhang,Xiansheng Luo,Zewei Zhang,Yi Zhu,Jipeng Qiang,Longwei Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The widespread proliferation of online content has intensified concerns about clickbait, deceptive or exaggerated headlines designed to attract attention. While Large Language Models (LLMs) offer a promising avenue for addressing this issue, their effectiveness is often hindered by Sycophancy, a tendency to produce reasoning that matches users’ beliefs over truthful ones, which deviates from instruction-following principles. Rather than treating sycophancy as a flaw to be eliminated, this work proposes a novel approach that initially harnesses this behavior to generate contrastive reasoning from opposing perspectives. Specifically, we design a Self-renewal Opposing-stance Reasoning Generation (SORG) framework that prompts LLMs to produce high-quality agree and disagree reasoning pairs for a given news title without requiring ground-truth labels. To utilize the generated reasoning, we develop a local Opposing Reasoning-based Clickbait Detection (ORCD) model that integrates three BERT encoders to represent the title and its associated reasoning. The model leverages contrastive learning, guided by soft labels derived from LLM-generated credibility scores, to enhance detection robustness. Experimental evaluations on three benchmark datasets demonstrate that our method consistently outperforms LLM prompting, fine-tuned smaller language models, and state-of-the-art clickbait detection baselines.
zh
[NLP-196] textttMemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文时,如何有效评估其记忆管理能力的问题。现有研究多采用以记忆为中心的机制来分段处理长序列,但缺乏系统性工具来衡量奖励模型(Reward Models, RMs)对这种长期记忆管理过程的评估效果。解决方案的关键在于提出首个专门用于评估RMs在长上下文场景下记忆管理能力的基准测试——MemoryRewardBench,该基准覆盖长文本理解与长格式生成任务,包含10种不同记忆管理模式,上下文长度从8K到128K tokens不等,从而为量化评估RMs在复杂记忆操作中的表现提供了标准化平台。
链接: https://arxiv.org/abs/2601.11969
作者: Zecheng Tang,Baibei Ji,Ruoxi Sun,Haitian Wang,WangJie You,Zhang Yijun,Wenpeng Zhu,Ji Qi,Juntao Li,Min Zhang
机构: Soochow University (苏州大学); LCM Laboratory; China Mobile (苏州) (中国移动(苏州))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing works increasingly adopt memory-centric mechanisms to process long contexts in a segment manner, and effective memory management is one of the key capabilities that enables large language models to effectively propagate information across the entire sequence. Therefore, leveraging reward models (RMs) to automatically and reliably evaluate memory quality is critical. In this work, we introduce \textttMemoryRewardBench , the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. \textttMemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns, with context length ranging from 8K to 128K tokens. Evaluations on 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
zh
[NLP-197] R2PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning
【速读】: 该论文旨在解决强化学习在提升大语言模型(Large Language Models, LLMs)推理能力过程中存在的根本性矛盾:现有方法使用单一策略同时生成稳定的推理响应和用于训练优化的轨迹,但由于推理稳定性与训练轨迹多样性之间存在目标冲突,导致探索不足,进而限制了模型的推理性能。解决方案的关键在于提出R²PO(Residual Rollout Policy Optimization),通过在策略网络上引入一个轻量级的残差回溯头(Residual Rollout-Head),将训练阶段的优化轨迹与推理阶段的响应生成解耦,从而在训练中实现可控的轨迹多样化,同时保持推理过程的稳定性,显著提升了模型在多个基准测试上的表现,如MATH-500和APPS数据集上分别获得3.1%和2.4%的平均准确率提升。
链接: https://arxiv.org/abs/2601.11960
作者: Jingchu Wang,Bingbing Xu,Yige Yuan,Bin Xie,Xiaoqian Sun,Huawei Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R ^2 PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout-Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.1% on MATH-500 and 2.4% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at this https URL.
zh
[NLP-198] PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
【速读】: 该论文旨在解决日历冲突(calendar conflict)的自动化决策问题,即在忙碌的专业人士面临多个时间重叠的会议邀请时,如何通过智能代理(language agent)辅助其进行偏好驱动的决策,以替代低效的人工处理或难以规模化的人类委托。当前大型语言模型(Large Language Model, LLM)在这一任务中表现不佳,平均错误率高达35%。为应对这一挑战,作者提出PEARL框架,其核心创新在于引入外部记忆模块(external memory module)和优化的逐轮奖励设计(round-wise reward design),使语言代理能够在多轮交互中逐步推断并动态适应用户偏好,从而实现更精准的长期日程管理。实验表明,PEARL相较最强基线在平均错误率上提升55%,误差减少率达0.76。
链接: https://arxiv.org/abs/2601.11957
作者: Bingxuan Li,Jeonghwan Kim,Cheng Qian,Xiusi Chen,Eitan Anzenberg,Niran Kundapur,Heng Ji
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating such process is crucial yet challenging. Scheduling logistics drain hours, and human delegation often fails at scale, which motivate we to ask: Can we trust large language model (LLM) or language agent to manager time? To enable systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. Conflicts are presented sequentially and agents receive feedback after each round, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has 35% average error rate. To address this gap, we propose PEARL, a reinforcement-learning framework that augments language agent with an external memory module and optimized round-wise reward design, enabling agent to progressively infer and adapt to user preferences on-the-fly. Experiments on CalConflictBench shows that PEARL achieves 0.76 error reduction rate, and 55% improvement in average error rate compared to the strongest baseline.
zh
[NLP-199] Double-Calibration: Towards Trustworthy LLM s via Calibrating Knowledge and Reasoning Confidence
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中易产生幻觉(hallucination)导致可信度不足的问题,特别是现有基于知识图谱(Knowledge Graph, KG)增强的方法无法量化检索证据和LLM推理过程中的认知不确定性(epistemic uncertainty)。其解决方案的关键在于提出一种基于新型双重校准(double-calibration)原则的框架DoublyCal:首先通过一个轻量级代理模型生成带有校准置信度的KG证据,再利用这些可追溯不确定性的支持证据引导黑盒LLM进行最终预测,从而在保证准确性提升的同时实现输出置信度的良好校准,且token消耗低。
链接: https://arxiv.org/abs/2601.11956
作者: Yuyin Lu,Ziran Liang,Yanghui Rao,Wenqi Fan,Fu Lee Wang,Qing Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Trustworthy reasoning in Large Language Models (LLMs) is challenged by their propensity for hallucination. While augmenting LLMs with Knowledge Graphs (KGs) improves factual accuracy, existing KG-augmented methods fail to quantify epistemic uncertainty in both the retrieved evidence and LLMs’ reasoning. To bridge this gap, we introduce DoublyCal, a framework built on a novel double-calibration principle. DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence. Experiments on knowledge-intensive benchmarks show that DoublyCal significantly improves both the accuracy and confidence calibration of black-box LLMs with low token cost.
zh
[NLP-200] hinking Traps in Long Chain-of-Thought: A Measurable Study and Trap-Aware Adaptive Restart
【速读】: 该论文旨在解决长链式思维(Long Chain-of-Thought, Long-CoT)推理中因“思维陷阱”(Thinking Traps)导致的错误固化问题,即模型在早期产生错误后,后续生成虽具一致性却无法修正根本错误,从而降低推理正确率。解决方案的关键在于提出一种测试时控制框架TAAR(Trap-Aware Adaptive Restart),其通过训练一个诊断策略(diagnostic policy)从部分推理轨迹中预测两个信号:一是“陷阱索引”(trap index),用于确定截断位置;二是“逃逸概率”(escape probability),用于判断是否以及如何干预。在推理阶段,TAAR根据预测结果截断潜在陷阱段并自适应重启解码过程,对严重困局则引入更强扰动(如高温度重采样或结构化重启后缀),从而有效提升数学与科学推理任务中的准确性,且无需微调基础模型参数。
链接: https://arxiv.org/abs/2601.11940
作者: Kang Chen,Fan Yu,Junjie Nian,Shihan Zhao,Zhuoka Feng,Zijun Yao,Heng Wang,Minshen Yu,Yixin Cao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Scaling test-time compute via Long Chain-of-Thought (Long-CoT) significantly enhances reasoning capabilities, yet extended generation does not guarantee correctness: after an early wrong commitment, models may keep elaborating a self-consistent but incorrect prefix. Through fine-grained trajectory analysis, we identify Thinking Traps, prefix-dominant deadlocks where later reflection, alternative attempts, or verification fails to revise the root error. On a curated subset of DAPO-MATH, 89% of failures exhibit such traps. To solve this problem, we introduce TAAR (Trap-Aware Adaptive Restart), a test-time control framework that trains a diagnostic policy to predict two signals from partial trajectories: a trap index for where to truncate and an escape probability for whether and how strongly to intervene. At inference time, TAAR truncates the trajectory before the predicted trap segment and adaptively restarts decoding; for severely trapped cases, it applies stronger perturbations, including higher-temperature resampling and an optional structured reboot suffix. Experiments on challenging mathematical and scientific reasoning benchmarks (AIME24, AIME25, GPQA-Diamond, HMMT25, BRUMO25) show that TAAR improves reasoning performance without fine-tuning base model parameters.
zh
[NLP-201] Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes
【速读】: 该论文旨在解决事件检测(Event Detection)研究中的两个关键问题:一是当前基于解码器的大型语言模型(LLM)架构存在单向性限制,难以有效利用丰富的双向上下文信息;二是现有评估指标普遍依赖Micro-F1,导致对多数类事件的性能被高估,忽略了长尾事件类型的识别能力。解决方案的关键在于引入句子级上下文增强模型输入,并采用低秩适应(Low-Rank Adaptation, LoRA)进行微调,从而显著提升模型在Macro-F1指标上的表现,尤其改善了对长尾事件类别的检测性能。
链接: https://arxiv.org/abs/2601.11932
作者: Abdullah Al Monsur,Nitesh Vamshi Bommisetty,Gene Louis Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The current state of event detection research has two notable re-occurring limitations that we investigate in this study. First, the unidirectional nature of decoder-only LLMs presents a fundamental architectural bottleneck for natural language understanding tasks that depend on rich, bidirectional context. Second, we confront the conventional reliance on Micro-F1 scores in event detection literature, which systematically inflates performance by favoring majority classes. Instead, we focus on Macro-F1 as a more representative measure of a model’s ability across the long-tail of event types. Our experiments demonstrate that models enhanced with sentence context achieve superior performance over canonical decoder-only baselines. Using Low-Rank Adaptation (LoRA) during finetuning provides a substantial boost in Macro-F1 scores in particular, especially for the decoder-only models, showing that LoRA can be an effective tool to enhance LLMs’ performance on long-tailed event classes.
zh
[NLP-202] Mapping the maturation of TCM as an adjuvant to radiotherapy
【速读】: 该论文试图解决的问题是:如何系统性地理解中医药(TCM)作为放疗辅助治疗在肿瘤学中的研究轨迹与发展趋势,尤其是在过去25年中其科学证据的演变路径。解决方案的关键在于通过大规模文献分析(69,745篇2000–2025年发表的研究),结合主题建模和多维度指标(如出版量、国际合作、资金投入等),识别出该领域发展的周期性特征及核心主题结构。研究发现五个主导主题轴——癌症类型、支持性护理、临床终点、机制和方法学——揭示了以患者为中心、系统导向的研究范式,并指出当前研究已趋于成熟并可能进入新阶段,同时发现整个领域存在跨主题、跨周期的一致性正向报告偏差。
链接: https://arxiv.org/abs/2601.11923
作者: P. Bilha Githinji,Aikaterini Melliou,Xi Yuan,Dayan Zhang,Lian Zhang,Zhenglin Chen,Jiansong Ji,Chengying Lv,Jinhao Xu,Peiwu Qin,Dongmei Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The integration of complementary medicine into oncology represents a paradigm shift that has seen to increasing adoption of Traditional Chinese Medicine (TCM) as an adjuvant to radiotherapy. About twenty-five years since the formal institutionalization of integrated oncology, it is opportune to synthesize the trajectory of evidence for TCM as an adjuvant to radiotherapy. Here we conduct a large-scale analysis of 69,745 publications (2000 - 2025), emerging a cyclical evolution defined by coordinated expansion and contraction in publication output, international collaboration, and funding commitments that mirrors a define-ideate-test pattern. Using a theme modeling workflow designed to determine a stable thematic structure of the field, we identify five dominant thematic axes - cancer types, supportive care, clinical endpoints, mechanisms, and methodology - that signal a focus on patient well-being, scientific rigor and mechanistic exploration. Cross-theme integration of TCM is patient-centered and systems-oriented. Together with the emergent cycles of evolution, the thematic structure demonstrates progressive specialization and potential defragmentation of the field or saturation of existing research agenda. The analysis points to a field that has matured its current research agenda and is likely at the cusp of something new. Additionally, the field exhibits positive reporting of findings that is homogeneous across publication types, thematic areas, and the cycles of evolution suggesting a system-wide positive reporting bias agnostic to structural drivers.
zh
[NLP-203] Enhancing LLM -Based Data Annotation with Error Decomposition
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在主观标注任务中性能不稳定、错误类型复杂且传统单一对齐指标无法有效反映其下游影响的问题。其核心挑战在于区分模型驱动的错误与任务固有的模糊性,从而更精准地评估LLM标注质量。解决方案的关键在于提出一种诊断性评估范式,包含三个核心要素:(1) 一个基于错误来源(模型特异性 vs. 任务固有)和类型(边界模糊 vs. 概念误识)的诊断分类法;(2) 一种轻量级人类标注测试以估计任务固有模糊性;(3) 一种计算方法用于根据该分类法分解观察到的LLM标注误差。该范式通过教育领域四项有序标注任务的验证,既揭示了高对齐指标在特定任务中不切实际的原因,也提供了低成本、可操作的诊断工具,助力判断任务是否适合LLM标注并指导后续技术优化。
链接: https://arxiv.org/abs/2601.11920
作者: Zhen Xu,Vedant Khatri,Yijun Dai,Xiner Liu,Siyan Li,Xuanming Zhang,Renzhe Yu
机构: Columbia University (哥伦比亚大学); University of California, Irvine (加州大学欧文分校); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains. While LLMs are already achieving near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and assess annotation quality in terms of their potential downstream impacts. We refine this paradigm on ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility. Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in specific annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can be a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.
zh
[NLP-204] LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文本上下文时面临的根本性挑战,即如何在不显著增加计算开销或受限于扩展上下文长度的前提下,实现高效且准确的长程依赖建模。现有方法要么通过压缩上下文窗口或优化注意力机制来应对,但往往引入额外计算成本或限制上下文长度;而多智能体框架虽能缓解这些问题,却易受错误累积和幻觉传播的影响。其解决方案的关键在于受长短期记忆(Long Short-Term Memory, LSTM)结构启发,提出一种名为LSTM-MAS的多智能体系统,该系统采用链式架构,每个节点包含工作代理(worker agent)、过滤代理(filter agent)、判断代理(judge agent)和管理代理(manager agent),分别模拟LSTM中的输入门、遗忘门、恒定误差轮转单元(constant error carousel unit)和输出门的功能,从而实现对跨文本段的信息选择性传递与长期依赖建模,有效避免错误积累与幻觉扩散。
链接: https://arxiv.org/abs/2601.11913
作者: Yichen Jiang,Peng Ye,Jiakang Yuan,Chongjun Tu,Lei Bai,Tao Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures
Abstract:Effectively processing long contexts remains a fundamental yet unsolved challenge for large language models (LLMs). Existing single-LLM-based methods primarily reduce the context window or optimize the attention mechanism, but they often encounter additional computational costs or constrained expanded context length. While multi-agent-based frameworks can mitigate these limitations, they remain susceptible to the accumulation of errors and the propagation of hallucinations. In this work, we draw inspiration from the Long Short-Term Memory (LSTM) architecture to design a Multi-Agent System called LSTM-MAS, emulating LSTM’s hierarchical information flow and gated memory mechanisms for long-context understanding. Specifically, LSTM-MAS organizes agents in a chained architecture, where each node comprises a worker agent for segment-level comprehension, a filter agent for redundancy reduction, a judge agent for continuous error detection, and a manager agent for globally regulates information propagation and retention, analogous to LSTM and its input gate, forget gate, constant error carousel unit, and output gate. These novel designs enable controlled information transfer and selective long-term dependency modeling across textual segments, which can effectively avoid error accumulation and hallucination propagation. We conducted an extensive evaluation of our method. Compared with the previous best multi-agent approach, CoA, our model achieves improvements of 40.93%, 43.70%,121.57% and 33.12%, on NarrativeQA, Qasper, HotpotQA, and MuSiQue, respectively.
zh
[NLP-205] PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文推理中因信息稀疏分布而导致的推理失效问题,尤其针对现有“计划-执行”(plan-and-execute)框架因依赖表面线索生成不可靠计划、进而难以进行有效修正的局限性。解决方案的关键在于提出一种主动规划策略 PPA-Plan,其核心是通过识别潜在逻辑陷阱和错误假设,并将其形式化为负向约束(negative constraints),从而在计划生成阶段就显式规避这些约束,实现对错误源头的预防性控制,而非事后修复。实验表明,基于PPA-Plan生成的计划在长上下文问答任务上显著优于传统计划-执行方法与直接提示(direct prompting)。
链接: https://arxiv.org/abs/2601.11908
作者: Byeongjin Kim,Gyuwan Kim,Seo Yeon Park
机构: 未知
类目: Computation and Language (cs.CL)
备注: 23 pages, 6 figures
Abstract:Large language models (LLMs) struggle with reasoning over long contexts where relevant information is sparsely distributed. Although plan-and-execute frameworks mitigate this by decomposing tasks into planning and execution, their effectiveness is often limited by unreliable plan generation due to dependence on surface-level cues. Consequently, plans may be based on incorrect assumptions, and once a plan is formed, identifying what went wrong and revising it reliably becomes difficult, limiting the effectiveness of reactive refinement. To address this limitation, we propose PPA-Plan, a proactive planning strategy for long-context reasoning that focuses on preventing such failures before plan generation. PPA-Plan identifies potential logical pitfalls and false assumptions, formulates them as negative constraints, and conditions plan generation on explicitly avoiding these constraints. Experiments on long-context QA benchmarks show that executing plans generated by PPA-Plan consistently outperforms existing plan-and-execute methods and direct prompting.
zh
[NLP-206] Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险领域(如医学)中面对与模型先验知识或安全协议相悖的反事实(counterfactual)甚至对抗性证据时,其行为是否仍保持忠实于上下文的问题。研究发现,当前主流LLMs在接收到此类反事实医疗证据时,会不加质疑地接受并给出自信且无警示的回答,表明模型在“忠实性”(faithfulness)与“安全性”(safety)之间尚未建立有效边界。解决方案的关键在于构建了一个名为MedCounterFact的反事实医学问答数据集,通过系统性地将真实医疗干预替换为从未知词到有毒物质等四类反事实刺激,量化评估多个前沿LLMs在面对误导性证据时的表现,从而揭示其对危险信息的高度敏感性和缺乏批判性推理能力。
链接: https://arxiv.org/abs/2601.11886
作者: Kaijie Mo,Siddhartha Venkatayogi,Chantal Shaib,Ramez Kouzy,Wei Xu,Byron C. Wallace,Junyi Jessy Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 26 pages
Abstract:In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual or even adversarial medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such “evidence” at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings reveal that there exists no such boundary yet.
zh
[NLP-207] GloCTM: Cross-Lingual Topic Modeling via a Global Context Space AAAI2026
【速读】: 该论文旨在解决跨语言主题建模(Cross-lingual Topic Modeling)中主题语义对齐不足的问题,即现有模型通常在各自的语言空间中独立学习主题,依赖双语词典等浅层对齐机制,难以捕捉深层语义关联,导致不同语言的主题空间松散、不一致。其解决方案的关键在于提出GloCTM框架,通过构建一个贯穿整个模型流程的统一语义空间(Global Context Space),实现跨语言主题的一致性约束:首先利用跨语言词汇邻域扩展词袋表示以增强输入语义信息;其次结合局部与全局编码器,并通过内部正则化对齐潜在表示;最后引入中心核对齐(Centered Kernel Alignment, CKA)损失函数,将主题空间与多语言上下文嵌入对齐,从而在输出层面实现跨语言主题词分布的结构同步,显著提升主题连贯性和跨语言一致性。
链接: https://arxiv.org/abs/2601.11872
作者: Nguyen Tien Phat,Ngo Vu Minh,Linh Van Ngo,Nguyen Thi Ngoc Diep,Thien Huu Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注: AAAI 2026
Abstract:Cross-lingual topic modeling seeks to uncover coherent and semantically aligned topics across languages - a task central to multilingual understanding. Yet most existing models learn topics in disjoint, language-specific spaces and rely on alignment mechanisms (e.g., bilingual dictionaries) that often fail to capture deep cross-lingual semantics, resulting in loosely connected topic spaces. Moreover, these approaches often overlook the rich semantic signals embedded in multilingual pretrained representations, further limiting their ability to capture fine-grained alignment. We introduce GloCTM (Global Context Space for Cross-Lingual Topic Model), a novel framework that enforces cross-lingual topic alignment through a unified semantic space spanning the entire model pipeline. GloCTM constructs enriched input representations by expanding bag-of-words with cross-lingual lexical neighborhoods, and infers topic proportions using both local and global encoders, with their latent representations aligned through internal regularization. At the output level, the global topic-word distribution, defined over the combined vocabulary, structurally synchronizes topic meanings across languages. To further ground topics in deep semantic space, GloCTM incorporates a Centered Kernel Alignment (CKA) loss that aligns the latent topic space with multilingual contextual embeddings. Experiments across multiple benchmarks demonstrate that GloCTM significantly improves topic coherence and cross-lingual alignment, outperforming strong baselines.
zh
[NLP-208] Advances in LLM Reasoning Enable Flexibility in Clinical Problem-Solving
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医学问答(Medical Question-Answering, QA)任务中虽表现出高准确率,但其临床推理灵活性仍存争议的问题。研究通过引入mARC(medicine abstraction and reasoning corpus)这一对抗性医学QA基准测试,利用“顿悟效应”(Einstellung effect)诱导模型对既定启发式模式产生僵化依赖,从而评估其认知灵活性。解决方案的关键在于对比不同家族的强推理模型(如OpenAI、Grok、Gemini、Claude和DeepSeek)在mARC上的表现,发现高性能模型能更有效地规避Einstellung陷阱,并达到人类水平的推理准确性,尤其在医生常错的问题上,顶级模型以高置信度正确回答了55%至70%,表明其相较人类更不易受启发式偏见影响,从而验证了先进推理机制显著提升了医学推理的灵活性与可靠性。
链接: https://arxiv.org/abs/2601.11866
作者: Kie Shidara,Preethi Prem,Jonathan Kim,Anna Podlasek,Feng Liu,Ahmed Alaa,Danilo Bernardo
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 6 figures
Abstract:Large Language Models (LLMs) have achieved high accuracy on medical question-answer (QA) benchmarks, yet their capacity for flexible clinical reasoning has been debated. Here, we asked whether advances in reasoning LLMs improve their cognitive flexibility in clinical reasoning. We assessed reasoning models from the OpenAI, Grok, Gemini, Claude, and DeepSeek families on the medicine abstraction and reasoning corpus (mARC), an adversarial medical QA benchmark which utilizes the Einstellung effect to induce inflexible overreliance on learned heuristic patterns in contexts where they become suboptimal. We found that strong reasoning models avoided Einstellung-based traps more often than weaker reasoning models, achieving human-level performance on mARC. On questions most commonly missed by physicians, the top 5 performing models answered 55% to 70% correctly with high confidence, indicating that these models may be less susceptible than humans to Einstellung effects. Our results indicate that strong reasoning models demonstrate improved flexibility in medical reasoning, achieving performance on par with humans on mARC.
zh
[NLP-209] CTPD: Cross Tokenizer Preference Distillation AAAI2026
【速读】: 该论文旨在解决在不同分词器(tokenizer)设置下,如何有效将人类对齐行为从教师模型(teacher model)迁移至学生模型(student model)的问题,尤其是在当前知识蒸馏技术在语言模型与人类偏好对齐任务中应用仍不充分的背景下。其核心挑战在于教师与学生模型间分词方案的不兼容性,导致难以实现细粒度、白盒式的偏好信息传递。解决方案的关键创新在于提出Cross-Tokenizer Preference Distillation (CTPD)框架,包含三项核心技术:(1) 对齐跨度投影(Aligned Span Projection),通过字符级跨度映射实现跨分词器的精确监督信号传递;(2) 分词器适配的Token-level Importance Sampling (TIS-DPO),提升偏好信息在不同token空间中的信用分配准确性;(3) 教师锚定参考机制(Teacher-Anchored Reference),使学生模型可直接利用教师偏好进行DPO风格的目标优化。理论分析基于重要性采样原理,实验证明该方法在多个基准测试中显著优于现有方法,为跨分词器偏好蒸馏提供了通用且高效的解决方案。
链接: https://arxiv.org/abs/2601.11865
作者: Truong Nguyen,Phi Van Dat,Ngan Nguyen,Linh Ngo Van,Trung Le,Thanh Hong Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注: AAAI 2026
Abstract:While knowledge distillation has seen widespread use in pre-training and instruction tuning, its application to aligning language models with human preferences remains underexplored, particularly in the more realistic cross-tokenizer setting. The incompatibility of tokenization schemes between teacher and student models has largely prevented fine-grained, white-box distillation of preference information. To address this gap, we propose Cross-Tokenizer Preference Distillation (CTPD), the first unified framework for transferring human-aligned behavior between models with heterogeneous tokenizers. CTPD introduces three key innovations: (1) Aligned Span Projection, which maps teacher and student tokens to shared character-level spans for precise supervision transfer; (2) a cross-tokenizer adaptation of Token-level Importance Sampling (TIS-DPO) for improved credit assignment; and (3) a Teacher-Anchored Reference, allowing the student to directly leverage the teacher’s preferences in a DPO-style objective. Our theoretical analysis grounds CTPD in importance sampling, and experiments across multiple benchmarks confirm its effectiveness, with significant performance gains over existing methods. These results establish CTPD as a practical and general solution for preference distillation across diverse tokenization schemes, opening the door to more accessible and efficient alignment of language models.
zh
[NLP-210] AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)训练过程中因梯度异质性导致的不稳定问题,尤其是传统全局梯度裁剪(global norm clipping)方法在处理不同功能模块间梯度差异时引发的“溢出效应”(spill-over effect),即不稳定的参数强制对稳定参数进行不必要的缩放,从而影响训练效率与收敛性。解决方案的关键在于提出自适应分组梯度裁剪(Adaptive Group-wise Gradient Clipping, AGGC),其核心机制包括:按功能类型将参数划分为多个组,并基于指数移动平均(Exponential Moving Average, EMA)动态建模每组的历史梯度行为,构建自适应区间以同时抑制梯度爆炸与消失;同时引入时间依赖调度策略,在探索与收敛之间实现平衡。该方法有效缓解了梯度异质性带来的负面影响,且因其轻量设计可无缝集成至现有后训练流程中。
链接: https://arxiv.org/abs/2601.11864
作者: Zhiyuan Li,Yuan Wu,Yi Chang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 13 pages
Abstract:To stabilize the training of Large Language Models (LLMs), gradient clipping is a nearly ubiquitous heuristic used to alleviate exploding gradients. However, traditional global norm clipping erroneously presupposes gradient homogeneity across different functional modules, leading to an adverse “spill-over” effect where volatile parameters force unnecessary scaling on stable ones. To overcome this, we propose Adaptive Group-wise Gradient Clipping (AGGC). AGGC partitions parameters into groups based on functional types and regulates each according to its historical behavior using an Exponential Moving Average (EMA). Specifically, it constructs an adaptive interval to simultaneously mitigate gradient explosion and vanishing, while employing a time-dependent scheduling mechanism to balance exploration and convergence. Experiments on LLaMA 2-7B, Mistral-7B, and Gemma-7B models show that AGGC consistently outperforms LoRA and frequently surpasses Full Fine-Tuning. On the GSM8K benchmark, Mistral-7B fine-tuned with AGGC achieves an accuracy of 72.93%, exceeding LoRA’s 69.5%. AGGC also effectively stabilizes Reinforcement Learning with Verifiable Rewards (RLVR), enhancing the logic deduction of Qwen 2.5 and Llama 3.2 models. Experimental results demonstrate that AGGC effectively addresses the limitations of traditional gradient clipping methods, particularly in overcoming gradient heterogeneity, by utilizing a modular, adaptive clipping strategy to stabilize the training process. Due to its lightweight design, AGGC can be seamlessly integrated into existing post-training pipelines with negligible overhead.
zh
[NLP-211] Utilizing Metadata for Better Retrieval-Augmented Generation ECIR2026
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理结构化且重复性高的语料库(如监管文件)时,仅依赖文本相似度进行文档片段检索所导致的准确性下降问题。其核心挑战在于:当多个文档共享大量重叠语言时,单纯基于内容的相似性难以区分相关与不相关的片段。解决方案的关键在于引入元数据感知(metadata-aware)的检索策略,具体包括将元数据以“前缀”形式嵌入文本、构建融合元数据与内容的统一嵌入(unified embedding)、晚期融合(late-fusion)以及元数据驱动的查询重构等方法。实验表明,特别是统一嵌入策略,在提升检索效果的同时具备良好的可维护性,其有效性源于对嵌入空间的优化——增强了文档内部一致性、降低了文档间混淆,并扩大了相关与无关片段之间的距离。
链接: https://arxiv.org/abs/2601.11863
作者: Raquib Bin Yousuf,Shengzhe Xu,Mandar Sharma,Andrew Neeser,Chris Latimer,Naren Ramakrishnan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: The 48th European Conference on Information Retrieval (ECIR 2026)
Abstract:Retrieval-Augmented Generation systems depend on retrieving semantically relevant document chunks to support accurate, grounded outputs from large language models. In structured and repetitive corpora such as regulatory filings, chunk similarity alone often fails to distinguish between documents with overlapping language. Practitioners often flatten metadata into input text as a heuristic, but the impact and trade-offs of this practice remain poorly understood. We present a systematic study of metadata-aware retrieval strategies, comparing plain-text baselines with approaches that embed metadata directly. Our evaluation spans metadata-as-text (prefix and suffix), a dual-encoder unified embedding that fuses metadata and content in a single index, dual-encoder late-fusion retrieval, and metadata-aware query reformulation. Across multiple retrieval metrics and question types, we find that prefixing and unified embeddings consistently outperform plain-text baselines, with the unified at times exceeding prefixing while being easier to maintain. Beyond empirical comparisons, we analyze embedding space, showing that metadata integration improves effectiveness by increasing intra-document cohesion, reducing inter-document confusion, and widening the separation between relevant and irrelevant chunks. Field-level ablations show that structural cues provide strong disambiguating signals. Our code, evaluation framework, and the RAGMATE-10K dataset are publicly hosted.
zh
[NLP-212] ATOD: An Evaluation Framework and Benchmark for Agent ic Task-Oriented Dialogue System
【速读】: 该论文旨在解决当前任务导向型对话(Task-Oriented Dialogue, TOD)系统评估基准缺乏对高级代理行为(agentic behaviors)系统性支持的问题。随着大语言模型(Large Language Models, LLMs)与API及工具集成的发展,现代TOD系统已具备多目标协调、长程上下文维持和异步主动执行等能力,但现有评测数据集未能有效捕捉这些特性。解决方案的关键在于提出ATOD基准及其配套的合成对话生成管道,该管道能生成富含标注信息、需长期推理的对话数据,并进一步设计ATOD-Eval评估框架,将多目标协调、依赖管理、记忆能力、适应性和主动性等维度转化为细粒度指标,支持可复现的离线与在线评估;同时引入基于记忆的代理评估器,在准确率与效率之间实现更优权衡。
链接: https://arxiv.org/abs/2601.11854
作者: Yifei Zhang,Hooshang Nayyeri,Rinat Khaziev,Emine Yilmaz,Gokhan Tur,Dilek Hakkani-Tür,Hari Thadakamalla
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.
zh
[NLP-213] he Third VoicePrivacy Challenge: Preserving Emotional Expressiveness and Linguistic Content in Voice Anonymization
【速读】: 该论文旨在解决语音匿名化(voice anonymization)技术中的核心挑战,即在隐藏说话人身份的同时,保持语音内容(语言信息)和情感状态的可用性。解决方案的关键在于构建一个系统化的评估框架,涵盖明确的匿名化任务定义、多维度数据集支持(用于系统开发与评估)、基于对抗攻击模型的隐私保护度量指标,以及兼顾语音保真度与语义完整性(包括情感特征)的效用评估机制。通过这一框架,研究者能够量化比较不同匿名化方法在隐私保护与语音质量之间的权衡,并推动生成式 AI (Generative AI) 在语音隐私领域的创新应用。
链接: https://arxiv.org/abs/2601.11846
作者: Natalia Tomashenko,Xiaoxiao Miao,Pierre Champion,Sarina Meyer,Michele Panariello,Xin Wang,Nicholas Evans,Emmanuel Vincent,Junichi Yamagishi,Massimiliano Todisco
机构: Inria(法国国家信息与自动化研究院); Duke Kunshan University (昆山杜克大学); University of Stuttgart (斯图加特大学); Eurecom (欧洲电信学院); National Institute of Informatics (日本国立信息学研究所)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: under review
Abstract:We present results and analyses from the third VoicePrivacy Challenge held in 2024, which focuses on advancing voice anonymization technologies. The task was to develop a voice anonymization system for speech data that conceals a speaker’s voice identity while preserving linguistic content and emotional state. We provide a systematic overview of the challenge framework, including detailed descriptions of the anonymization task and datasets used for both system development and evaluation. We outline the attack model and objective evaluation metrics for assessing privacy protection (concealing speaker voice identity) and utility (content and emotional state preservation). We describe six baseline anonymization systems and summarize the innovative approaches developed by challenge participants. Finally, we provide key insights and observations to guide the design of future VoicePrivacy challenges and identify promising directions for voice anonymization research.
zh
[NLP-214] Weddit : A Dataset of Triggering Stories Predominantly Shared by Women on Reddit
【速读】: 该论文旨在解决社交平台上用户分享与流产(miscarriage)、性暴力(sexual violence)等创伤经历相关的叙事时,因缺乏明确的触发警告(trigger warning)而导致其他用户可能暴露于心理不适内容的问题。现有平台如Reddit虽支持手动添加触发警告,但许多用户由于认知不足或不确定适用范围而忽略此操作,且公开可用的标注数据集在这一领域极为稀缺。论文的关键解决方案是构建一个名为TWeddit的精选Reddit数据集,专门涵盖女性群体常见的创伤经历,并对其中的故事进行细致标注,使其在语言学层面体现出独特的主题分布和道德基础特征,从而为未来相关研究提供高质量、可复用的数据资源。
链接: https://arxiv.org/abs/2601.11819
作者: Shirlene Rose Bandela,Sanjeev Parthasarathy,Vaibhav Garg
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 12 figures, 7 tables
Abstract:Warning: This paper may contain examples and topics that may be disturbing to some readers, especially survivors of miscarriage and sexual violence. People affected by abortion, miscarriage, or sexual violence often share their experiences on social media to express emotions and seek support. On public platforms like Reddit, where users can post long, detailed narratives (up to 40,000 characters), readers may be exposed to distressing content. Although Reddit allows manual trigger warnings, many users omit them due to limited awareness or uncertainty about which categories apply. There is scarcity of datasets on Reddit stories labeled for triggering experiences. We propose a curated Reddit dataset, TWeddit, covering triggering experiences related to issues majorly faced by women. Our linguistic analyses show that annotated stories in TWeddit express distinct topics and moral foundations, making the dataset useful for a wide range of future research.
zh
[NLP-215] A self-evolving multi-role collaborative framework with fine-grained difficulty guidance for innovative mathematical problem generation
【速读】: 该论文旨在解决数学问题生成(Mathematical Problem Generation, MPG)中现有大语言模型(Large Language Models, LLMs)缺乏创新性且区分度差的问题,提出创新性数学问题生成(Innovative Math Problem Generation, IMPG)任务。其解决方案的关键在于构建一个自进化、多角色协同框架,包含采样器、生成器、评估器、状态机和记忆模块,通过自我评估与外部反馈驱动迭代优化以确保正确性;同时引入改进的难度模型实现细粒度难度引导,并采用数据驱动的关联引导路径采样(Data-driven Association-guided Path Sampling, DAPS)算法提升语义合理性;此外,通过持续预训练(Continual Pre-training, CPT)、监督微调(Supervised Fine-tuning, SFT)和组相对策略优化(Group Relative Policy Optimization, GRPO)的多阶段训练流程增强生成与评估能力,并借助蒸馏技术将专家模型的评估能力迁移至学生模型,实现系统自进化。
链接: https://arxiv.org/abs/2601.11792
作者: Yifei Sun,Yongan Li,A.K. Qin,Sicheng Hou,Tamas Pflanzner
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Mathematical problem generation (MPG) is a significant research direction in the field of intelligent education. In recent years, the rapid development of large language models (LLMs) has enabled new technological approaches to problem-generation tasks. Although existing LLMs can achieve high correctness rates, they generally lack innovation and exhibit poor discrimination. In this paper, we propose the task of innovative math problem generation (IMPG). To solve the IMPG task, this paper proposes a self-evolving, multi-role collaborative framework with fine-grained difficulty guidance. First, a multi-role collaborative mechanism comprising a sampler, generator, evaluator, state machine, and memory is constructed, ensuring the correctness of generated problems through iterative optimization informed by self-assessment and external feedback. Second, we introduce an improved difficulty model to quantify difficulty and provide fine-grained guidance. We adopt the data-driven association-guided path sampling (DAPS) algorithm to enhance the semantic rationality of sampled encodings. Third, we construct the HSM3K-CN dataset, which comprises high-quality high school math problems. A multi-stage training pipeline is adopted, incorporating continual pre-training (CPT), supervised fine-tuning (SFT), and group relative policy optimization (GRPO), to enhance the generation and evaluation capabilities of the base model. Finally, system self-evolution is achieved by transferring evaluation capabilities from the expert model to the apprentice model via distillation. Experiments show that, compared to baseline models, our proposed method significantly improves the innovation of the generated problems while maintaining a high correctness rate.
zh
[NLP-216] Beyond Tokens: Concept-Level Training Objectives for LLM s
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)训练中基于下一个词元预测(Next-Token Prediction, NTP)目标所导致的语义偏差问题。NTP在词元层面进行监督,将所有非参考续写视为错误,即使这些续写在语义上等价或合理(如“mom”与“mother”),从而导致模型过度关注表面形式而非深层语义,限制了其泛化能力和鲁棒性。解决方案的关键在于从词元级预测转向概念级预测(Concept-Level Prediction),即通过将多个表达相同概念的不同表面形式(如“mom”、“mommy”、“mother”)映射到统一的概念标签(如MOTHER),引入更高层次的语义监督信号。这种方法使模型更贴近人类对语义抽象的理解,在多个自然语言处理基准测试中表现出更低的困惑度、更强的领域迁移鲁棒性以及优于传统NTP方法的性能。
链接: https://arxiv.org/abs/2601.11791
作者: Laya Iyer,Pranav Somani,Alice Guo,Dan Jurafsky,Chen Shani
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the \textittoken level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent (e.g., mom'' vs. mother’‘). As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., mom,'' mommy,’’ ``mother’’ \rightarrow \textitMOTHER). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests \textitconcept-level supervision as an improved training signal that better aligns LLMs with human semantic abstractions.
zh
[NLP-217] he Stability Trap: Evaluating the Reliability of LLM -Based Instruction Adherence Auditing
【速读】: 该论文旨在解决生成式 AI(Generative AI)在受监管领域(如人力资源)中的企业治理问题,特别是如何构建可扩展且可复现的审计机制。现有基于大语言模型(Large Language Model, LLM)作为裁判(LLM-as-a-Judge)的方法虽具备可扩展性,但其在评估不同类型的系统指令时的可靠性尚未验证。论文的关键解决方案是提出“限定指令分解框架”(Scoped Instruction Decomposition Framework),将待测应用(Application Under Test, AUT)的指令划分为客观型与主观型,从而隔离导致裁判评价不稳定的因素。实验表明,尽管裁判在最终判断上具有高一致性(约99%),其推理过程却存在显著波动,尤其在涉及定量分析或细粒度证据的任务中推理稳定性极低;而离散实体提取类任务则表现出高推理稳定性。因此,论文建议审计协议应严格区分逻辑类型:将确定性可验证逻辑交由代码实现,仅保留复杂语义评估任务给LLM裁判,以避免高判断一致性掩盖脆弱推理过程的问题。
链接: https://arxiv.org/abs/2601.11783
作者: Murtuza N. Shergadwala
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:The enterprise governance of Generative AI (GenAI) in regulated sectors, such as Human Resources (HR), demands scalable yet reproducible auditing mechanisms. While Large Language Model (LLM)-as-a-Judge approaches offer scalability, their reliability in evaluating adherence of different types of system instructions remains unverified. This study asks: To what extent does the instruction type of an Application Under Test (AUT) influence the stability of judge evaluations? To address this, we introduce the Scoped Instruction Decomposition Framework to classify AUT instructions into Objective and Subjective types, isolating the factors that drive judge instability. We applied this framework to two representative HR GenAI applications, evaluating the stability of four judge architectures over variable runs. Our results reveal a ``Stability Trap’’ characterized by a divergence between Verdict Stability and Reasoning Stability. While judges achieved near-perfect verdict agreement ( 99% ) for both objective and subjective evaluations, their accompanying justification traces diverged significantly. Objective instructions requiring quantitative analysis, such as word counting, exhibited reasoning stability as low as \approx19% , driven by variances in numeric justifications. Similarly, reasoning stability for subjective instructions varied widely ( 35% – 83% ) based on evidence granularity, with feature-specific checks failing to reproduce consistent rationale. Conversely, objective instructions focusing on discrete entity extraction achieved high reasoning stability ( 90% ). These findings demonstrate that high verdict stability can mask fragile reasoning. Thus, we suggest that auditors scope automated evaluation protocols strictly: delegate all deterministically verifiable logic to code, while reserving LLM judges for complex semantic evaluation.
zh
[NLP-218] ranslation as a Scalable Proxy for Multilingual Evaluation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多语言能力评估中的“评估悖论”问题,即尽管LLMs声称具备多语言能力,但目前仅对少于30种语言构建了全面的非机器翻译基准测试,导致全球约98%的7000种语言缺乏实证评估。传统基准构建面临成本高、领域专家稀缺和数据污染等扩展挑战。其解决方案的关键在于验证“翻译质量是否可作为模型更广泛多语言能力的代理指标”——通过系统评估14个参数规模从1B到72B的模型在9个不同基准和7种翻译指标上的表现,发现翻译性能与下游任务成功率高度相关(如Phi-4模型中,MetricX的皮尔逊相关系数r=0.89,xCOMET为0.91,SSA-COMET为0.87),表明支撑忠实翻译的表征能力与多语言理解所需能力高度重叠。因此,翻译质量可作为低成本、高效的首筛工具,实现多语言能力的初步评估,并辅以针对性任务验证。
链接: https://arxiv.org/abs/2601.11778
作者: Sheriff Issaka,Erick Rosas Gonzalez,Lieqi Liu,Evans Kofi Agyei,Lucas Bandarkar,Nanyun Peng,David Ifeoluwa Adelani,Francisco Guzmán,Saadia Gabriel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs claim multilingual proficiency, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving 98% of the world’s 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model’s broader multilingual capabilities? Through systematic evaluation of 14 models (1B-72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a good indicator of downstream task success (e.g., Phi-4, median Pearson r: MetricX = 0.89, xCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful translation overlap with those required for multilingual understanding. Translation quality, thus emerges as a strong, inexpensive first-pass proxy of multilingual performance, enabling a translation-first screening with targeted follow-up for specific tasks.
zh
[NLP-219] Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在生成内容时存在毒性(toxicity)问题的治理难题,现有去毒技术多依赖外部模块、人工标注或人类干预,导致可扩展性和一致性受限。其解决方案的关键在于提出一种完全自反思(self-reflective)的去毒框架,利用LLM自身内建的识别与修正能力——具体包括一个内部毒性信号检测器(Toxic Signal Detector)和一套系统性干预流程,实现对毒性文本的自动识别、纠正与迭代优化,从而生成对比性的去毒数据集用于模型微调,最终在不依赖外部组件或人工标注的前提下显著提升模型的安全性和语义保真度。
链接: https://arxiv.org/abs/2601.11776
作者: Kaituo Zhang,Zhimeng Jiang,Na Zou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention --factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect, correct toxic content, and refine LLMs without external modules and data annotation. Specifically, we propose a Toxic Signal Detector --an internal self-identification mechanism, coupled with a systematic intervention process to transform toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset used to fine-tune the model, enhancing its ability for safe and coherent text generation. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.
zh
[NLP-220] Industry-Aligned Granular Topic Modeling
【速读】: 该论文旨在解决传统主题模型在工业应用场景中难以生成细粒度(granular)主题的问题,从而限制了其对业务洞察的深度支持。解决方案的关键在于提出一个名为TIDE的框架,其核心创新是基于大语言模型(Large Language Models, LLMs)设计了一种新型细粒度主题建模方法,并集成文档摘要、主题层级关系构建和主题蒸馏等辅助功能,以增强实际业务场景下的可用性和可解释性。实验表明,该方法在多个公共及真实商业数据集上显著优于现有主流主题模型。
链接: https://arxiv.org/abs/2601.11762
作者: Sae Young Moon,Myeongjun Erik Jang,Haoyan Luo,Chunyang Xiao,Antonios Georgiadis,Fran Silavong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Topic modeling has extensive applications in text mining and data analysis across various industrial sectors. Although the concept of granularity holds significant value for business applications by providing deeper insights, the capability of topic modeling methods to produce granular topics has not been thoroughly explored. In this context, this paper introduces a framework called TIDE, which primarily provides a novel granular topic modeling method based on large language models (LLMs) as a core feature, along with other useful functionalities for business applications, such as summarizing long documents, topic parenting, and distillation. Through extensive experiments on a variety of public and real-world business datasets, we demonstrate that TIDE’s topic modeling approach outperforms modern topic modeling methods, and our auxiliary components provide valuable support for dealing with industrial business scenarios. The TIDE framework is currently undergoing the process of being open sourced.
zh
[NLP-221] Early Linguistic Pattern of Anxiety from Social Media Using Interpretable Linguistic Features: A Multi-Faceted Validation Study with Author-Disjoint Evaluation
【速读】: 该论文旨在解决当前基于社交媒体的语言模型在焦虑检测中普遍存在的可解释性不足、关键词鲁棒性验证缺失以及用户级数据完整性难以保障的问题。解决方案的关键在于构建一个以语言学特征为基础的透明建模框架,通过在Reddit大规模语料上训练逻辑回归分类器,并采用精心划分的子版块(subreddit)进行训练、验证和测试,结合特征消融分析、关键词屏蔽实验及不同密度差异分析,实现了对焦虑群体与对照组的有效区分;同时,在临床访谈参与者中的外部验证进一步证明了模型的泛化能力和鲁棒性,从而为可解释的心理健康筛查提供了可复现的基准方法。
链接: https://arxiv.org/abs/2601.11758
作者: Arnab Das Utsa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 figures, more than 1o pages
Abstract:Anxiety affects hundreds of millions of individuals globally, yet large-scale screening remains limited. Social media language provides an opportunity for scalable detection, but current models often lack interpretability, keyword-robustness validation, and rigorous user-level data integrity. This work presents a transparent approach to social media-based anxiety detection through linguistically interpretable feature-grounded modeling and cross-domain validation. Using a substantial dataset of Reddit posts, we trained a logistic regression classifier on carefully curated subreddits for training, validation, and test splits. Comprehensive evaluation included feature ablation, keyword masking experiments, and varying-density difference analyses comparing anxious and control groups, along with external validation using clinically interviewed participants with diagnosed anxiety disorders. The model achieved strong performance while maintaining high accuracy even after sentiment removal or keyword masking. Early detection using minimal post history significantly outperformed random classification, and cross-domain analysis demonstrated strong consistency with clinical interview data. Results indicate that transparent linguistic features can support reliable, generalizable, and keyword-robust anxiety detection. The proposed framework provides a reproducible baseline for interpretable mental health screening across diverse online contexts.
zh
[NLP-222] LIME-LLM : Probing Models with Fluent Counterfactuals Not Broken Text
【速读】: 该论文旨在解决当前局部解释方法(如LIME)在自然语言处理(Natural Language Processing, NLP)领域中因随机标记掩码导致语义无效、分布外输入频现,从而削弱局部代理模型保真度的问题。同时,现有生成式方法(如LLiMe)虽借助大语言模型(Large Language Models, LLMs)生成邻域样本,但其无约束的改写策略引入混杂变量,难以精确分离特定特征贡献。解决方案的关键在于提出LIME-LLM框架,通过假设驱动的受控扰动替代随机噪声,并严格执行“单掩码-单样本”协议,结合中性填充与边界填充两种策略,构建流畅且位于数据流形上的邻域,从而严格隔离特征效应,显著提升黑盒NLP模型解释的保真度。
链接: https://arxiv.org/abs/2601.11746
作者: George Mihaila,Suleyman Olcay Polat,Poli Nemkova,Himanshu Sharma,Namratha V. Urs,Mark V. Albert
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Local explanation methods such as LIME (Ribeiro et al., 2016) remain fundamental to trustworthy AI, yet their application to NLP is limited by a reliance on random token masking. These heuristic perturbations frequently generate semantically invalid, out-of-distribution inputs that weaken the fidelity of local surrogate models. While recent generative approaches such as LLiMe (Angiulli et al., 2025b) attempt to mitigate this by employing Large Language Models for neighborhood generation, they rely on unconstrained paraphrasing that introduces confounding variables, making it difficult to isolate specific feature contributions. We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. By enforcing a strict “Single Mask-Single Sample” protocol and employing distinct neutral infill and boundary infill strategies, LIME-LLM constructs fluent, on-manifold neighborhoods that rigorously isolate feature effects. We evaluate our method against established baselines (LIME, SHAP, Integrated Gradients) and the generative LLiMe baseline across three diverse benchmarks: CoLA, SST-2, and HateXplain using human-annotated rationales as ground truth. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box NLP explainability, achieving significant improvements in local explanation fidelity compared to both traditional perturbation-based methods and recent generative alternatives.
zh
[NLP-223] Bridging Human Interpretation and Machine Representation: A Landscape of Qualitative Data Analysis in the LLM Era
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在支持定性研究时输出质量参差不齐的问题,尤其指出现有系统多停留在低层次的意义建构(如描述性总结)和静态建模层面,缺乏对解释性与理论性推理以及动态系统建模的可靠支持。解决方案的关键在于提出一个4×4的分析框架,将意义建构分为描述性、分类性、解释性和理论性四个层级,并与静态结构、阶段/时间线、因果路径和反馈动态四种建模方式交叉映射,从而清晰揭示LLM输出的差异性,并据此构建可显式表达、可选择且可治理的解释与建模承诺机制,推动LLM在定性研究中向更高阶的推理与建模能力演进。
链接: https://arxiv.org/abs/2601.11739
作者: Xinyu Pi,Qisen Yang,Chuong Nguyen,Hua Shen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:LLMs are increasingly used to support qualitative research, yet existing systems produce outputs that vary widely–from trace-faithful summaries to theory-mediated explanations and system models. To make these differences explicit, we introduce a 4 \times 4 landscape crossing four levels of meaning-making (descriptive, categorical, interpretive, theoretical) with four levels of modeling (static structure, stages/timelines, causal pathways, feedback dynamics). Applying the landscape to prior LLM-based automation highlights a strong skew toward low-level meaning and low-commitment representations, with few reliable attempts at interpretive/theoretical inference or dynamical modeling. Based on the revealed gap, we outline an agenda for applying and building LLM-systems that make their interpretive and modeling commitments explicit, selectable, and governable.
zh
[NLP-224] RAC: Retrieval-Augmented Clarification for Faithful Conversational Search ECIR’26
【速读】: 该论文旨在解决对话式搜索系统中澄清问题(clarification questions)缺乏语料库依据的问题,即现有方法生成的澄清问题可能无法从可用文档中得到支持,从而影响系统的准确性与可靠性。解决方案的关键在于提出RAC(Retrieval-Augmented Clarification)框架,通过检索增强机制确保生成的澄清问题基于底层语料库(corpus-faithful),具体包括:1)对比多种索引策略优化检索效果;2)微调大语言模型(Large Language Model, LLM)以充分利用研究上下文并鼓励生成有证据支持的问题;3)采用对比偏好优化(contrastive preference optimization)机制,优先选择由检索段落支撑的澄清问题而非无依据的替代方案。实验表明,该方法在多个基准测试中显著优于基线,并通过自然语言推理(NLI)和数据到文本生成(data-to-text)衍生的新指标验证了其在语义锚定(grounding)方面的提升。
链接: https://arxiv.org/abs/2601.11722
作者: Ahmed Rayane Kebir,Vincent Guigue,Lynda Said Lhadj,Laure Soulier
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: This is the author’s version of the work. The definitive version is published in: Proceedings of the 48th European Conference on Information Retrieval (ECIR '26), 29 March–2 April, 2026, Delft, Netherlands
Abstract:Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of research context and to encourage the generation of evidence-based question. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrate significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.
zh
[NLP-225] owards AGI A Prag matic Approach Towards Self Evolving Agent
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在部署后缺乏自主扩展能力、无法生成新工具或进化推理策略的问题。其核心解决方案是提出一种分层自演化多智能体框架,通过集成基础LLM、操作型小语言模型(SLM)代理、代码生成LLM和教师LLM(Teacher-LLM),实现任务失败时的渐进式进化:首先尝试常规推理与工具调用,若失败则触发代码生成以合成新工具,持续失败时进一步引入课程学习(Curriculum Learning, CL)、基于奖励的学习(Reward-Based Learning, RL)或遗传算法(Genetic Algorithm, GA)进行演化优化。实验表明,该框架能显著提升代理在复杂任务中的适应性与性能,实现鲁棒、自主的自进化能力。
链接: https://arxiv.org/abs/2601.11658
作者: Indrajit Kar,Sammy Zonunpuia,Zonunfeli Ralte
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) based agents are powerful yet fundamentally static after deployment, lacking the ability to autonomously expand capabilities, generate new tools, or evolve their reasoning. This work introduces a hierarchical self-evolving multi-agent framework that integrates a Base LLM, an operational SLM agent, a Code-Generation LLM, and a Teacher-LLM to enable continuous adaptation. The workflow begins with the agent attempting a task using reasoning and existing tools; if unsuccessful, it escalates to tool synthesis through the Code-Gen LLM, and when failures persist, it triggers an evolution phase using Curriculum Learning (CL), Reward-Based Learning (RL), or Genetic Algorithm (GA) evolution. Using the TaskCraft dataset rich in hierarchical tasks, tool-use traces, and difficulty scaling we evaluate these paradigms. CL delivers fast recovery and strong generalization, RL excels on high-difficulty tasks, and GA offers high behavioral diversity. Across all settings, evolved agents outperform their originals, demonstrating robust, autonomous, self-improving agentic evolution.
zh
[NLP-226] Advances and Frontiers of LLM -based Issue Resolution in Software Engineering: A Comprehensive Survey
【速读】: 该论文旨在解决软件工程(Software Engineering, SWE)中代码问题修复(issue resolution)这一复杂任务在生成式 AI(Generative AI)应用中的挑战。研究表明,大型语言模型在处理真实世界开发场景下的问题修复时表现受限,这推动了自主编码代理(autonomous coding agents)的快速发展。论文提出了一种系统性综述,其关键在于从数据构建管道、方法论(包括无需训练的模块化框架与基于训练的技术如监督微调和强化学习)、数据质量与代理行为分析以及实际应用场景等多个维度进行深入探讨,从而为该领域的发展提供结构化认知,并识别未来研究的关键挑战与方向。
链接: https://arxiv.org/abs/2601.11655
作者: Caihua Li,Lianghong Guo,Yanlin Wang,Daya Guo,Wei Tao,Zhenyu Shan,Mingwei Liu,Jiachi Chen,Haoyu Song,Duyu Tang,Hongyu Zhang,Zibin Zheng
机构: Sun Yat-sen University (中山大学); Hangzhou Normal University (杭州师范大学); Zhejiang University (浙江大学); Huawei Technologies Co, Ltd (华为技术有限公司); Chongqing University (重庆大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 26 pages, 4 figures, 5 tables
Abstract:Issue resolution, a complex Software Engineering (SWE) task integral to real-world development, has emerged as a compelling challenge for artificial intelligence. The establishment of benchmarks like SWE-bench revealed this task as profoundly difficult for large language models, thereby significantly accelerating the evolution of autonomous coding agents. This paper presents a systematic survey of this emerging domain. We begin by examining data construction pipelines, covering automated collection and synthesis approaches. We then provide a comprehensive analysis of methodologies, spanning training-free frameworks with their modular components to training-based techniques, including supervised fine-tuning and reinforcement learning. Subsequently, we discuss critical analyses of data quality and agent behavior, alongside practical applications. Finally, we identify key challenges and outline promising directions for future research. An open-source repository is maintained at this https URL to serve as a dynamic resource in this field.
zh
[NLP-227] Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)架构在函数空间几何结构和表示几何特性方面缺乏系统理解的问题,尤其是其路由机制如何影响局部函数敏感性和隐藏表示的分布特性。解决方案的关键在于引入一种名为“双雅可比-主成分分析谱几何探针”(Dual Jacobian-PCA Spectral Geometry probe)的新方法:该方法通过计算雅可比奇异值谱来分析局部函数几何,同时利用加权主成分分析(PCA)刻画路由后隐藏状态的表示几何。实验基于受控的MLP-MoE设置,在相同容量下比较密集网络、Top-k路由与全软路由三种架构,发现MoE路由显著降低局部敏感性(表现为专家局部雅可比矩阵的主导奇异值更小且谱衰减更快),并使表示方差分布在更多主方向上,体现更高有效秩;此外,平均专家雅可比矩阵近似正交,表明变换分解为低重叠的专家特有子空间,而非共享映射的缩放版本。这一几何视角揭示了MoE作为函数空间软划分机制的本质,其既能平滑局部曲率又能重新分配表示方差。
链接: https://arxiv.org/abs/2601.11616
作者: Feilong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) architectures are commonly motivated by efficiency and conditional computation, but their effect on the geometry of learned functions and representations remains poorly characterized. In this work, we study MoEs through a geometric lens, interpreting routing as a form of soft partitioning of the representation space into overlapping local charts. We introduce a Dual Jacobian-PCA Spectral Geometry probe. It analyzes local function geometry via Jacobian singular-value spectra and representation geometry via weighted PCA of routed hidden states. Using a controlled MLP-MoE setting that permits exact Jacobian computation, we compare dense, Top-k, and fully-soft routing architectures under matched capacity. Across random seeds, we observe that MoE routing consistently reduces local sensitivity, with expert-local Jacobians exhibiting smaller leading singular values and faster spectral decay than dense baselines. At the same time, weighted PCA reveals that expert-local representations distribute variance across a larger number of principal directions, indicating higher effective rank under identical input distributions. We further find that average expert Jacobians are nearly orthogonal, suggesting a decomposition of the transformation into low-overlap expert-specific subspaces rather than scaled variants of a shared map. We analyze how routing sharpness modulates these effects, showing that Top-k routing produces lower-rank, more concentrated expert-local structure, while fully-soft routing yields broader, higher-rank representations. Together, these results support a geometric interpretation of MoEs as soft partitionings of function space that flatten local curvature while redistributing representation variance.
zh
[NLP-228] Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在上下文工程中如何有效区分语用上有用信息与干扰项(distractor)的问题,尤其是在多轮对话场景下精准选择相关上下文以提升任务表现。其核心解决方案是提出熵驱动的上下文塑造方法(Entropic Context Shaping, ECS),该方法基于信息论框架,通过量化模型答案分布向正确答案的偏移程度来衡量上下文的语用效用(pragmatic utility),而非依赖传统的词法相似性指标(如TF-IDF)。ECS的关键创新在于将上下文效用形式化为答案概率的有符号变化,并理论证明无关任务的信息更新几乎不会引起分布偏移,从而实现对上下文实用性的精确评估。实验表明,在细粒度回合级上下文选择任务上,ECS相较于TF-IDF显著提升F1得分(Llama-3.1-8B模型达0.265,相对改进71.83%),验证了语用效用优于传统词法相似性策略的有效性。
链接: https://arxiv.org/abs/2601.11585
作者: Hyunjun Kim
机构: KAIST
类目: Computation and Language (cs.CL)
备注:
Abstract:Context engineering for large language model (LLM) agents requires distinguishing pragmatically useful information from misleading distractors. We introduce Entropic Context Shaping (ECS), an information-theoretic framework that measures context utility via the shift in the model’s answer distribution toward the correct answer. Unlike lexical similarity methods that rely on word overlap, ECS captures pragmatic utility – whether a passage actually helps answer the question. We formalize utility as the signed change in answer probability and provide theoretical analysis showing that task-irrelevant updates yield near-zero distribution shift. We evaluate on multi-turn context selection tasks using LongMemEval (session-level) and LoCoMo (turn-level) benchmarks. On fine-grained turn selection, ECS with Llama-3.1-8B achieves F1=0.265, a 71.83% relative improvement over TF-IDF (F1=0.154), demonstrating that pragmatic utility outperforms lexical similarity when precise context selection matters. Code and data are available in the supplementary materials.
zh
[NLP-229] Overview of the SciHigh Track at FIRE 2025: Research Highlight Generation from Scientific Papers
【速读】: 该论文旨在解决科学文献中自动提取简洁、信息丰富且有意义的要点(Research Highlights)的问题,以辅助读者快速理解论文的核心贡献与创新点。其关键解决方案是基于MixSub数据集构建一个专门用于评估生成式AI(Generative AI)模型在科学摘要上生成亮点的能力的基准测试平台,并通过ROUGE、METEOR和BERTScore等指标量化生成结果的质量,尤其以ROUGE-L作为排名依据。实验表明,自动化生成的亮点能够显著降低阅读负担、加速文献综述流程,并提升数字图书馆和学术搜索引擎的元数据质量。
链接: https://arxiv.org/abs/2601.11582
作者: Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 2 tables
Abstract:`SciHigh: Research Highlight Generation from Scientific Papers’ focuses on the task of automatically generating concise, informative, and meaningful bullet-point highlights directly from scientific abstracts. The goal of this task is to evaluate how effectively computational models can generate highlights that capture the key contributions, findings, and novelty of a paper in a concise form. Highlights help readers grasp essential ideas quickly and are often easier to read and understand than longer paragraphs, especially on mobile devices. The track uses the MixSub dataset \cite10172215, which provides pairs of abstracts and corresponding author-written highlights. In this inaugural edition of the track, 12 teams participated, exploring various approaches, including pre-trained language models, to generate highlights from this scientific dataset. All submissions were evaluated using established metrics such as ROUGE, METEOR, and BERTScore to measure both alignment with author-written highlights and overall informativeness. Teams were ranked based on ROUGE-L scores. The findings suggest that automatically generated highlights can reduce reading effort, accelerate literature reviews, and enhance metadata for digital libraries and academic search platforms. SciHigh provides a dedicated benchmark for advancing methods aimed at concise and accurate highlight generation from scientific writing. Comments: 7 pages, 2 tables Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2601.11582 [cs.CY] (or arXiv:2601.11582v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2601.11582 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-230] Enhancing the QA Model through a Multi-domain Debiasing Framework
【速读】: 该论文旨在解决问答(Question-Answering, QA)模型在复杂查询和对抗性情境下因存在词汇偏倚(lexical bias)、数值推理缺陷及实体识别错误而导致性能下降的问题。其解决方案的关键在于构建一个跨领域的去偏框架,融合知识蒸馏(knowledge distillation)、去偏技术(debiasing techniques)以及领域扩展策略(domain expansion),从而有效提升模型在标准数据集(如SQuAD v1.1)和对抗性数据集(AddSent与AddOneSent)上的准确率与鲁棒性,实验表明该方法在Exact Match(EM)和F1分数上最高可提升2.6个百分点。
链接: https://arxiv.org/abs/2601.11581
作者: Yuefeng Wang,ChangJae Lee
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 7 tables
Abstract:Question-answering (QA) models have advanced significantly in machine reading comprehension but often exhibit biases that hinder their performance, particularly with complex queries in adversarial conditions. This study evaluates the ELECTRA-small model on the Stanford Question Answering Dataset (SQuAD) v1.1 and adversarial datasets AddSent and AddOneSent. By identifying errors related to lexical bias, numerical reasoning, and entity recognition, we develop a multi-domain debiasing framework incorporating knowledge distillation, debiasing techniques, and domain expansion. Our results demonstrate up to 2.6 percentage point improvements in Exact Match (EM) and F1 scores across all test sets, with gains in adversarial contexts. These findings highlight the potential of targeted bias mitigation strategies to enhance the robustness and reliability of natural language understanding systems.
zh
[NLP-231] Speculative Decoding: Performance or Illusion?
【速读】: 该论文旨在解决生成式 AI(Generative AI)推理中因模型复杂度高导致的延迟问题,特别是针对推测解码(Speculative Decoding, SD)技术在真实生产环境中的有效性尚不明确这一关键挑战。现有研究多基于实验原型且使用过小的批处理规模,无法反映实际部署场景下的性能表现。论文通过系统性评估 vLLM 推理引擎上多种 SD 变体(如 n-gram、EAGLE/EAGLE-3、Draft-Model 和 Multi-Token Prediction),覆盖不同模型规模、工作负载和批大小,揭示了影响 SD 性能的核心因素——目标模型验证耗时占主导地位,而接受长度在输出位置、请求及数据集间存在显著差异。其解决方案的关键在于量化 SD 的理论速度上限,并基于实测与理论边界之间的显著差距,识别出提升 SD 效率的新研究方向,为未来优化提供了可量化的基准和洞察。
链接: https://arxiv.org/abs/2601.11580
作者: Xiaoxuan Liu,Jiaxiang Yu,Jongseok Park,Ion Stoica,Alvin Cheung
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ( n -gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates the execution, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with theoretical bounds reveals substantial gaps between observed and theoretical upper bounds, and we leverage this observation to highlight new research opportunities that our study opens up in improving SD.
zh
[NLP-232] Bielik 11B v3: Multilingual Large Language Model for European Languages
【速读】: 该论文旨在解决低资源语言(特别是波兰语)在生成式 AI (Generative AI) 模型中表现不足的问题,即如何在参数效率与高性能之间取得平衡,从而提升对非主流欧洲语言的建模能力。解决方案的关键在于构建一个基于 Mistral 7B v0.2 架构扩展至 11B 参数的模型(Bielik 11B v3),通过四阶段训练流程(连续预训练、监督微调 SFT、直接偏好优化 DPO 和强化学习)实现性能最大化,并结合广泛的量化选项以适配多样硬件环境,最终在多项任务上超越参数量大得多的模型,为资源受限语言提供了高性价比的先进 AI 解决方案。
链接: https://arxiv.org/abs/2601.11579
作者: Krzysztof Ociepa,Łukasz Flis,Remigiusz Kinas,Krzysztof Wróbel,Adrian Gwoździej
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present Bielik 11B v3, a state-of-the-art language model highly optimized for the Polish language, while also maintaining strong capabilities in other European languages. This model extends the Mistral 7B v0.2 architecture, scaled to 11B parameters via depth up-scaling. Its development involved a comprehensive four-stage training pipeline: continuous pre-training, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning. Comprehensive evaluations demonstrate that Bielik 11B v3 achieves exceptional performance. It significantly surpasses other specialized Polish language models and outperforms many larger models (with 2-6 times more parameters) on a wide range of tasks, from basic linguistic understanding to complex reasoning. The model’s parameter efficiency, combined with extensive quantization options, allows for effective deployment across diverse hardware configurations. Bielik 11B v3 not only advances AI capabilities for the Polish language but also establishes a new benchmark for developing resource-efficient, high-performance models for less-represented languages. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) ACMclasses: I.2.7 Cite as: arXiv:2601.11579 [cs.CL] (or arXiv:2601.11579v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.11579 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-233] LimAgents : Multi-Agent LLM s for Generating Research Limitations
【速读】: 该论文旨在解决当前零样本大语言模型(Large Language Models, LLMs)在识别科研论文局限性时普遍存在的表面化、重复性问题,即模型往往仅复述作者已声明的浅层局限(如数据集偏差或泛化能力不足),而难以挖掘深层次的方法论缺陷与研究语境中的空白。其解决方案的关键在于提出LimAgents框架——一个基于多智能体(multi-agent)协作的LLM系统,通过结构化分工实现对显式局限、方法论缺口、同行评审视角及文献背景的系统性挖掘:不同代理分别负责提取明确限制、分析方法漏洞、模拟审稿人立场以及结合引文网络定位研究空白;最终由裁判代理(Judge agent)评估并整合输出,形成更全面、深入且具可解释性的局限陈述。此外,为克服传统NLP指标(如BLEU、ROUGE)对语义相似性捕捉不足的问题,作者引入基于LLM作为裁判的点对点评估协议,显著提升了覆盖度衡量的准确性。实验表明,该框架相较零样本基线在覆盖率上取得显著提升(最高+15.51%)。
链接: https://arxiv.org/abs/2601.11578
作者: Ibrahim Al Azher,Zhishuai Guo,Hamed Alhoori
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 Pages, 9 figures
Abstract:Identifying and articulating limitations is essential for transparent and rigorous scientific research. However, zero-shot large language models (LLMs) approach often produce superficial or general limitation statements (e.g., dataset bias or generalizability). They usually repeat limitations reported by authors without looking at deeper methodological issues and contextual gaps. This problem is made worse because many authors disclose only partial or trivial limitations. We propose LimAgents, a multi-agent LLM framework for generating substantive limitations. LimAgents integrates OpenReview comments and author-stated limitations to provide stronger ground truth. It also uses cited and citing papers to capture broader contextual weaknesses. In this setup, different agents have specific roles as sequential role: some extract explicit limitations, others analyze methodological gaps, some simulate the viewpoint of a peer reviewer, and a citation agent places the work within the larger body of literature. A Judge agent refines their outputs, and a Master agent consolidates them into a clear set. This structure allows for systematic identification of explicit, implicit, peer review-focused, and literature-informed limitations. Moreover, traditional NLP metrics like BLEU, ROUGE, and cosine similarity rely heavily on n-gram or embedding overlap. They often overlook semantically similar limitations. To address this, we introduce a pointwise evaluation protocol that uses an LLM-as-a-Judge to measure coverage more accurately. Experiments show that LimAgents substantially improve performance. The RAG + multi-agent GPT-4o mini configuration achieves a +15.51% coverage gain over zero-shot baselines, while the Llama 3 8B multi-agent setup yields a +4.41% improvement.
zh
[NLP-234] Concept Attractors in LLM s and their Applications
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理语义相关但表面形式差异较大的提示词时,仍会映射到相似内部表示的问题,并进一步利用这一特性实现多种实际任务的高效干预。其解决方案的关键在于将模型各层视为压缩映射(contractive mappings),通过迭代函数系统(Iterated Function Systems, IFS)理论揭示这些层收敛至特定概念吸引子(concept-specific Attractors)的行为机制;基于此,作者提出无需训练的吸引力干预方法,直接操作这些吸引子即可完成语言翻译、幻觉抑制、安全约束和合成数据生成等任务,且性能优于或匹配专门设计的基线方法,在基线表现不佳的场景下仍具泛化能力。
链接: https://arxiv.org/abs/2601.11575
作者: Sotirios Panagiotis Chytas,Vikas Singh
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often map semantically related prompts to similar internal representations at specific layers, even when their surface forms differ widely. We show that this behavior can be explained through Iterated Function Systems (IFS), where layers act as contractive mappings toward concept-specific Attractors. We leverage this insight and develop simple, training-free methods that operate directly on these Attractors to solve a wide range of practical tasks, including language translation, hallucination reduction, guardrailing, and synthetic data generation. Despite their simplicity, these Attractor-based interventions match or exceed specialized baselines, offering an efficient alternative to heavy fine-tuning, generalizable in scenarios where baselines underperform.
zh
[NLP-235] An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT GPT and BioStarsGPT
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂生物信息学(Bioinformatics)应用中缺乏领域专业知识的问题。为实现可复现且高效的领域适配,作者提出了一套九步式微调流水线,其关键在于:整合多样化数据源、结构化预处理、基于提示的问答对生成(利用Google Gemini)、自然语言推理(Natural Language Inference, NLI)质量控制、语义去重、基于聚类的数据划分,以及采用LoRA(Low-Rank Adaptation)的参数高效微调策略。该方案成功训练出PRSGPT与BioStarsGPT两个专用模型,在多个词汇和语义指标上表现优异,并生成了超过18万条高质量问答对,实现了隐私保护、本地部署的生物信息学智能助手开发。
链接: https://arxiv.org/abs/2601.11573
作者: Muhammad Muneeb,David B. Ascher
机构: The University of Queensland (昆士兰大学); Baker Heart and Diabetes Institute (贝克心脏与糖尿病研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) often lack specialized knowledge for complex bioinformatics applications. We present a reproducible pipeline for fine-tuning LLMs on specialized bioinformatics data, demonstrated through two use cases: PRSGPT, focused on polygenic risk score (PRS) tools, and BioStarsGPT, trained on community forum discussions. The nine-step pipeline integrates diverse data sources, structured preprocessing, prompt-based question-answer (QA) generation (via Google Gemini), natural language inference (NLI) for quality control, semantic deduplication, clustering-based data splitting, and parameter-efficient fine-tuning using LoRA. We fine-tuned three LLMs (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) and benchmarked them on over 14 lexical and semantic metrics. Qwen2.5-7B emerged as the best performer, with BLEU-4 and ROUGE-1 improvements of 82% and 70% for PRSGPT and 6% and 18% for BioStarsGPT, respectively. The open-source datasets produced include over 28,000 QA pairs for PRSGPT and 154,282 for BioStarsGPT. Human evaluation of PRSGPT yielded 61.9% accuracy on the PRS tools comparison task, comparable to Google Gemini (61.4%), but with richer methodological detail and accurate citations. BioStarsGPT demonstrated 59% conceptual accuracy across 142 curated bioinformatics questions. Our pipeline enables scalable, domain-specific fine-tuning of LLMs. It enables privacy-preserving, locally deployable bioinformatics assistants, explores their practical applications, and addresses the challenges, limitations, and mitigation strategies associated with their development and use.
zh
[NLP-236] AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)训练过程中因优化器状态(optimizer state)带来的高内存消耗问题。现有方法如FRUGAL框架虽通过梯度分片(gradient splitting)缓解了内存压力,但其静态超参数——子空间比例(subspace ratio, ρ)和更新频率(update frequency, T)需人工调优,限制了适应性与实用性。论文提出AdaFRUGAL,其核心创新在于引入两种动态控制机制:(i) 采用线性衰减策略调整ρ,逐步降低内存占用;(ii) 设计基于损失感知(loss-aware)的T调度策略,在保证性能的同时减少计算开销。实验表明,AdaFRUGAL在大规模预训练(English C4、Vietnamese VietVault)和微调(GLUE)任务中实现了内存与训练时间的显著节省,同时保持与AdamW及静态FRUGAL相当的性能,为资源受限场景下的LLM训练提供了更高效、自动化的解决方案。
链接: https://arxiv.org/abs/2601.11568
作者: Quang-Hung Bui,Anh Son Ta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. The FRUGAL framework mitigates this with gradient splitting, but its static hyperparameters – the subspace ratio ( \rho ) and update frequency ( T ) – require costly manual tuning, limiting adaptability. We present AdaFRUGAL, which automates this process by introducing two dynamic controls: (i) a linear decay for \rho to progressively reduce memory, and (ii) a loss-aware schedule for T to lower computational overhead. Experiments across large-scale pre-training (English C4, Vietnamese VietVault) and fine-tuning (GLUE) demonstrate that AdaFRUGAL achieves a compelling trade-off. It maintains competitive performance against AdamW and static FRUGAL while significantly reducing both GPU memory and training time, offering a more practical, autonomous solution for resource-constrained LLM training.
zh
[NLP-237] Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology AAAI2026
【速读】: 该论文旨在解决当前小规模开源医学大语言模型(Medical Large Language Models, LLMs)在临床应用中评估不足的问题,尤其是缺乏对一致性、鲁棒性和推理行为的系统性测评。现有研究多依赖于多选题(MCQ)准确率指标,忽略了模型输出在不同提示扰动和随机性设置下的稳定性与合理性。其解决方案的关键在于构建一个结合多选题测试与人工评估及临床专家审查的综合诊断框架,通过对比六种小规模开源医学LLMs(如HuatuoGPT-o1-8B、Diabetica-o1等)在儿科内分泌学场景中的表现,揭示模型输出的一致性并不等同于正确性,并发现微小提示扰动或系统级差异(如CUDA版本)即可显著改变模型输出,从而强调了在真实临床决策支持场景中引入更全面的评估体系的必要性。
链接: https://arxiv.org/abs/2601.11567
作者: Vanessa D’Amario,Randy Daniel,Alessandro Zanetti,Dhruv Edamadaka,Nitya Alaparthy,Joshua Tarkoff
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 11 figures, accepted at 47 workshop Reproducible Artificial Intelligence (AAAI 2026, Singapore, January 27, 2026)
Abstract:Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple choice question (MCQ) benchmarks, and lacks evaluation of consistency, robustness, or reasoning behavior. We use MCQ coupled to human evaluation and clinical review to assess six small open-source medical LLMs (HuatuoGPT-o1 (Chen 2024), Diabetica-7B, Diabetica-o1 (Wei 2024), Meditron3-8B (Sallinen2025), MedFound-7B (Liu 2025), and ClinicaGPT-base-zh (Wang 2023)) in pediatric endocrinology. In deterministic settings, we examine the effect of prompt variation on models’ output and self-assessment bias. In stochastic settings, we evaluate output variability and investigate the relationship between consistency and correctness. HuatuoGPT-o1-8B achieved the highest performance. The results show that high consistency across the model response is not an indicator of correctness, although HuatuoGPT-o1-8B showed the highest consistency rate. When tasked with selecting correct reasoning, both HuatuoGPT-o1-8B and Diabetica-o1 exhibit self-assessment bias and dependency on the order of the candidate explanations. Expert review of incorrect reasoning rationales identified a mix of clinically acceptable responses and clinical oversight. We further show that system-level perturbations, such as differences in CUDA builds, can yield statistically significant shifts in model output despite stable accuracy. This work demonstrates that small, semantically negligible prompt perturbations lead to divergent outputs, raising concerns about reproducibility of LLM-based evaluations and highlights the output variability under different stochastic regimes, emphasizing the need of a broader diagnostic framework to understand potential pitfalls in real-world clinical decision support scenarios.
zh
[NLP-238] Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings
【速读】: 该论文旨在解决低资源东南亚(SEA)语言在电商场景下语义表示质量不足的问题,这已成为检索、推荐和搜索系统性能提升的关键瓶颈。其核心挑战包括:数据稀缺与噪声监督导致的语义对齐困难、多语言覆盖不均、以及生产环境中对高吞吐量推理的需求。解决方案的关键在于三个方面:一是提出类感知掩码(Class-Aware Masking, CAM),通过轻量级修改InfoNCE损失函数抑制批次内无效负样本,提升语义区分度;二是构建多样化训练语料,结合上下文驱动的合成数据生成、跨语言翻译与结构化电商数据构造,增强低资源语言的鲁棒多语言学习能力;三是采用基于鲁棒性的大批次训练与球面模型合并策略,缓解灾难性遗忘,并利用vLLM和FP8量化优化推理效率,从而在保证嵌入质量的同时满足生产部署要求。
链接: https://arxiv.org/abs/2601.11565
作者: Pakorn Ueareeworakul,Shuman Liu,Jinghao Feng,Ling Hu,Zhantang Shi,Chengqi Sun,Liang Yao,Panyi Ouyang,Haibo Zhang,Anxiang Zeng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As global e-commerce rapidly expands into emerging markets, the lack of high-quality semantic representations for low-resource languages has become a decisive bottleneck for retrieval, recommendation, and search systems. In this work, we present Compass-Embedding v4, a high-efficiency multilingual embedding framework specifically optimized for Southeast Asian (SEA) e-commerce scenarios, where data scarcity, noisy supervision, and strict production constraints jointly challenge representation learning. Compass-Embedding v4 addresses three core challenges. First, large-batch contrastive training under mixed task supervision introduces systematic false negatives that degrade semantic alignment. We propose Class-Aware Masking (CAM), a lightweight modification to the InfoNCE objective that suppresses invalid in-batch negatives and improves semantic discrimination without altering training efficiency. Second, low-resource SEA languages suffer from limited and uneven data coverage. We construct a diversified training corpus through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction, enabling robust multilingual and domain-specific learning. Third, production deployment requires high-throughput inference while preserving embedding quality. We combine robustness-driven large-batch training with spherical model merging to mitigate catastrophic forgetting, and optimize inference via vLLM and FP8 quantization. Extensive evaluations across multilingual benchmarks and proprietary e-commerce tasks show that Compass-Embedding v4 achieves state-of-the-art performance on major SEA languages, significantly outperforming general-purpose embedding models in domain-specific retrieval and classification, while maintaining competitive performance on high-resource languages.
zh
[NLP-239] Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在扩展上下文窗口以支持复杂推理和长文档分析时,因冗余与干扰性内容引入而导致的计算开销激增及性能下降问题。其核心发现是:密集型Transformer架构(如Llama-3.1-70B和Qwen1.5-14B)在处理大量无关上下文时会出现非线性的性能退化,且这种退化与Key-Value (KV)缓存规模增长密切相关;同时,混合专家(Mixture-of-Experts, MoE)架构虽具潜在优势,但在高token量下其性能优势被基础设施瓶颈所掩盖,暴露出架构设计与系统实现之间的不匹配。解决方案的关键在于识别并量化这一由KV缓存膨胀引发的非线性性能损耗机制,从而为后续优化模型架构与推理系统提供理论依据。
链接: https://arxiv.org/abs/2601.11564
作者: Ahilan Ayyachamy Nadar Ponnusamy,Karthic Chandran,M Maruf Hossain
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures
Abstract:The scaling trend in Large Language Models (LLMs) has prioritized increasing the maximum context window to facilitate complex, long-form reasoning and document analysis. However, managing this expanded context introduces severe computational overhead. This paper investigates the critical trade-off between system performance and model quality when dense transformer architectures–specifically Llama-3.1-70B and Qwen1.5-14B–are exposed to large volumes of irrelevant and distracting context. The research identifies a non-linear performance degradation tied to the growth of the Key-Value (KV) cache. Furthermore, an extended analysis of the Mixture-of-Experts (MoE) architecture reveals unique behavioral anomalies at varying context scales, suggesting that architectural benefits may be masked by infrastructure bottlenecks at high token volumes.
zh
[NLP-240] MIMIC-RD: Can LLM s differentially diagnose rare diseases in real-world clinical settings?
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在罕见病(rare disease)鉴别诊断中表现不佳的问题,其核心挑战在于现有评估方法存在两大局限:一是依赖理想化的临床案例,无法反映真实临床复杂性;二是使用ICD编码作为疾病标签,导致大量罕见病因缺乏与Orphanet等专业数据库的直接映射而被严重低估。论文的关键解决方案是构建了MIMIC-RD基准数据集,通过将临床文本实体直接映射至Orphanet进行标注,并结合LLM初步挖掘与四位医学专家验证相结合的方法,确保所识别疾病的真实性与覆盖度。该方法显著提升了罕见病诊断评估的临床效度和准确性,揭示了当前主流LLMs在该任务上的能力不足,为未来研究指明方向。
链接: https://arxiv.org/abs/2601.11559
作者: Zilal Eiz AlDin,John Wu,Jeffrey Paul Fung,Jennifer King,Mya Watts,Lauren ONeill,Adam Richard Cross,Jimeng Sun
机构: University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校); University of Illinois College of Medicine (伊利诺伊大学医学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 pages
Abstract:Despite rare diseases affecting 1 in 10 Americans, their differential diagnosis remains challenging. Due to their impressive recall abilities, large language models (LLMs) have been recently explored for differential diagnosis. Existing approaches to evaluating LLM-based rare disease diagnosis suffer from two critical limitations: they rely on idealized clinical case studies that fail to capture real-world clinical complexity, or they use ICD codes as disease labels, which significantly undercounts rare diseases since many lack direct mappings to comprehensive rare disease databases like Orphanet. To address these limitations, we explore MIMIC-RD, a rare disease differential diagnosis benchmark constructed by directly mapping clinical text entities to Orphanet. Our methodology involved an initial LLM-based mining process followed by validation from four medical annotators to confirm identified entities were genuine rare diseases. We evaluated various models on our dataset of 145 patients and found that current state-of-the-art LLMs perform poorly on rare disease differential diagnosis, highlighting the substantial gap between existing capabilities and clinical needs. From our findings, we outline several future steps towards improving differential diagnosis of rare diseases.
zh
[NLP-241] CSyMR: Benchmarking Compositional Symbolic Muisc Reasoning With MIR Tool Integration
【速读】: 该论文旨在解决当前符号音乐推理基准测试中缺乏对组合式结构连接能力评估的问题,即现有基准多聚焦于孤立知识或原子级分析,难以衡量模型在整合多种音乐结构信息以进行复杂推理的能力。其解决方案的关键在于构建了一个名为CSyMR-Bench的多选题数据集,包含126道来自专家论坛和专业考试的问题,每道题均需结合多个原子分析步骤才能得出答案;同时提出了一种工具增强型智能体框架,利用music21库中的符号音乐分析工具来辅助推理,从而显著提升模型在该任务上的表现,实验表明该方法相较基线模型可实现5-7%的绝对准确率提升。
链接: https://arxiv.org/abs/2601.11556
作者: Boyang Wang,Yash Vishe,Xin Xu,Zachary Novack,Julian McAuley,Junda Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Large Language Models (LLMs) are leveraged in symbolic music reasoning, yet existing benchmarks emphasize isolated knowledge or atomic analyses rather than the integrative compositional reasoning needed to connect musical structures. To address this, we present the Compositional Symbolic Music Reasoning Benchmark (CSyMR-Bench), a curated multiple-choice dataset of 126 questions derived from expert forums and professional examinations. Each item involves combining several atomic analyses to arrive at the final answer. Furthermore, we introduce a tool-augmented agent framework that leverages symbolic music analysis tools from the music21 library to address the challenges posed by CSyMR-Bench. Experiments validate that CSyMR-Bench poses a non-trivial challenge across both community-sourced and exam-style questions, while our tool-augmented agent consistently outperforms all baselines, achieving 5-7% absolute accuracy gains.
zh
[NLP-242] Medication counseling with large language models : balancing flexibility and rigidity
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在药学场景中应用时面临的灵活性与可靠性之间的平衡问题:过度灵活可能导致错误决策(如药物咨询中的误判),而过度僵化则无法应对未预见的复杂交互情境。解决方案的关键在于构建一个聚焦于窄域、长周期任务的原型系统,通过设计特定方法来强化对话要求的遵守性、减少幻觉(hallucinations)并提升响应质量,从而在保持LLM动态对话能力的同时增强系统的确定性与安全性。该方案强调需结合人机协同(human-in-the-loop)和非传统评估方式,以更真实地反映此类系统在实际药学交互中的表现。
链接: https://arxiv.org/abs/2601.11544
作者: Joar Sabel,Mattias Wingren,Andreas Lundell,Sören Andersson,Sara Rosenberg,Susanne Hägglund,Linda Estman,Malin Andtfolk
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for 2025 IEEE International Conference on Agentic AI (ICA). 14 pages, 2 figures
Abstract:The introduction of large language models (LLMs) has greatly enhanced the capabilities of software agents. Instead of relying on rule-based interactions, agents can now interact in flexible ways akin to humans. However, this flexibility quickly becomes a problem in fields where errors can be disastrous, such as in a pharmacy context, but the opposite also holds true; a system that is too inflexible will also lead to errors, as it can become too rigid to handle situations that are not accounted for. Work using LLMs in a pharmacy context have adopted a wide scope, accounting for many different medications in brief interactions – our strategy is the opposite: focus on a more narrow and long task. This not only enables a greater understanding of the task at hand, but also provides insight into what challenges are present in an interaction of longer nature. The main challenge, however, remains the same for a narrow and wide system: it needs to strike a balance between adherence to conversational requirements and flexibility. In an effort to strike such a balance, we present a prototype system meant to provide medication counseling while juggling these two extremes. We also cover our design in constructing such a system, with a focus on methods aiming to fulfill conversation requirements, reduce hallucinations and promote high-quality responses. The methods used have the potential to increase the determinism of the system, while simultaneously not removing the dynamic conversational abilities granted by the usage of LLMs. However, a great deal of work remains ahead, and the development of this kind of system needs to involve continuous testing and a human-in-the-loop. It should also be evaluated outside of commonly used benchmarks for LLMs, as these do not adequately capture the complexities of this kind of conversational system.
zh
[NLP-243] Advancing Minority Stress Detection with Transformers: Insights from the Social Media Datasets
【速读】: 该论文旨在解决如何有效识别在线话语中与性少数群体和性别少数群体相关的少数压力(minority stress)问题,以支持数字健康干预和公共卫生政策制定。其解决方案的关键在于引入图结构增强的Transformer模型,通过建模社交连通性和对话上下文来提升对关键语言标记(如身份隐藏、内化污名化和求助呼吁)的检测能力,实验表明该方法在性能上显著优于传统机器学习基线及零样本、少样本学习范式。
链接: https://arxiv.org/abs/2509.02908
作者: Santosh Chapagain,Cory J Cascalheira,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi,Jillian R. Scheer
机构: Utah State University (犹他州立大学); Syracuse University (雪城大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted in Social Network Analysis and Mining Journal (SNAM)
Abstract:Individuals from sexual and gender minority groups experience disproportionately high rates of poor health outcomes and mental disorders compared to their heterosexual and cisgender counterparts, largely as a consequence of minority stress as described by Meyer’s (2003) model. This study presents the first comprehensive evaluation of transformer-based architectures for detecting minority stress in online discourse. We benchmark multiple transformer models including ELECTRA, BERT, RoBERTa, and BART against traditional machine learning baselines and graph-augmented variants. We further assess zero-shot and few-shot learning paradigms to assess their applicability on underrepresented datasets. Experiments are conducted on the two largest publicly available Reddit corpora for minority stress detection, comprising 12,645 and 5,789 posts, and are repeated over five random seeds to ensure robustness. Our results demonstrate that integrating graph structure consistently improves detection performance across transformer-only models and that supervised fine-tuning with relational context outperforms zero and few-shot approaches. Theoretical analysis reveals that modeling social connectivity and conversational context via graph augmentation sharpens the models’ ability to identify key linguistic markers such as identity concealment, internalized stigma, and calls for support, suggesting that graph-enhanced transformers offer the most reliable foundation for digital health interventions and public health policy.
zh
[NLP-244] SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生物医学研究中从基因层面知识到功能理解的可靠推理能力不足的问题,这是实现知识增强型细胞图谱解释的核心要求。其解决方案的关键在于构建了一个大规模、基因中心的基准测试集 SciHorizon-GENE,该基准整合了来自权威生物数据库的超过19万个人类基因的结构化知识,并包含54万余个覆盖细胞类型注释、功能解释和机制导向分析等场景的问答对。该基准通过研究关注度敏感性、幻觉倾向、答案完整性及文献影响四个生物学关键维度,系统评估LLMs在基因尺度上的行为模式,从而揭示当前模型在生成忠实、完整且文献支撑的功能解释方面存在的共性挑战,为模型选择与开发提供可量化、可比较的分析基础。
链接: https://arxiv.org/abs/2601.12805
作者: Xiaohan Huang,Meng Xiao,Chuan Qin,Qingqing Long,Jinmiao Chen,Yuanchun Zhou,Hengshu Zhu
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages
Abstract:Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.
zh
[NLP-245] AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering ICASSP2026
【速读】: 该论文旨在解决当前音频问答(Audio Question Answering, AQA)任务中对不可回答问题(unanswerable questions)评估缺失的问题,即现有基准测试主要关注可回答问题,忽视了在真实场景中常见的误导性、不完整或与音频内容不兼容的不可回答情形。解决方案的关键在于提出AQUA-Bench,这是一个系统性的基准测试集,专门用于评估模型在三种不可回答场景下的表现:缺失答案检测(Absent Answer Detection)、答案集不兼容检测(Incompatible Answer Set Detection)以及音频-问题语义不兼容检测(Incompatible Audio Question Detection)。通过量化模型在这些场景中的可靠性,AQUA-Bench为提升音频语言系统的鲁棒性和可信度提供了关键评测工具。
链接: https://arxiv.org/abs/2601.12248
作者: Chun-Yi Kuan,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted to ICASSP 2026. Project Website: this https URL
Abstract:Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.
zh
[NLP-246] Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems
【速读】: 该论文旨在解决端到端自动语音识别(End-to-End Automatic Speech Recognition, E2E ASR)在识别罕见和领域特定实体时准确率低的问题。其解决方案的关键在于提出一种基于提示(prompt-based)的偏置机制,通过统一的多任务学习框架实现:一是训练一个提示偏置模型以判断何时应关注提示中的实体,二是引入实体过滤机制高效剔除无关实体。该方法无需结构改动,轻量且高效,在小规模和大规模实体列表上分别实现了相对30.7%和18.0%的实体词错误率(Entity Word Error Rate)降低。
链接: https://arxiv.org/abs/2506.06252
作者: Bo Ren,Yu Shi,Jinyu Li
机构: Microsoft(微软)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
Abstract:End-to-End Automatic Speech Recognition (ASR) has advanced significantly yet still struggles with rare and domain-specific entities. This paper introduces a simple yet efficient prompt-based biasing technique for contextualized ASR, enhancing recognition accuracy by leverage a unified multitask learning framework. The approach comprises two key components: a prompt biasing model which is trained to determine when to focus on entities in prompt, and a entity filtering mechanism which efficiently filters out irrelevant entities. Our method significantly enhances ASR accuracy on entities, achieving a relative 30.7% and 18.0% reduction in Entity Word Error Rate compared to the baseline model with shallow fusion on in-house domain dataset with small and large entity lists, respectively. The primary advantage of this method lies in its efficiency and simplicity without any structure change, making it lightweight and highly efficient.
zh
计算机视觉
[CV-0] Implicit Neural Representation Facilitates Unified Universal Vision Encoding
【速读】:该论文旨在解决图像表示学习中识别(recognition)与生成(generation)两类任务长期分离的问题,即现有模型通常仅针对单一目标设计,难以同时实现高质量的分类、检测、分割等识别任务以及高保真图像生成能力。其解决方案的关键在于提出一种首创性的统一模型,通过将隐式神经表示(Implicit Neural Representation, INR)作为超网络(hyper-network),学习从图像到模型权重的映射,从而实现快速且准确的图像重建;同时结合知识蒸馏(knowledge distillation)提升模型泛化能力和性能,最终在压缩嵌入空间中实现了对多种视觉任务均表现出色的通用表示,兼具识别精度与生成质量。
链接: https://arxiv.org/abs/2601.14256
作者: Matthew Gwilliam,Xiao Wang,Xuefeng Hu,Zhenheng Yang
机构: TikTok(抖音)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 16 tables, 4 figures
Abstract:Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.
zh
[CV-1] VideoMaMa: Mask-Guided Video Matting via Generative Prior
【速读】:该论文旨在解决视频抠像(video matting)模型在真实世界视频中泛化能力差的问题,其核心挑战在于真实场景标注数据稀缺。解决方案的关键在于提出 VideoMask-to-Matte Model (VideoMaMa),该模型利用预训练的视频扩散模型(video diffusion models),将粗粒度的分割掩码(segmentation masks)转化为像素级精确的 alpha matte(透明度图),从而实现仅用合成数据训练即可在真实视频上实现零样本(zero-shot)泛化。在此基础上,研究者进一步构建了大规模伪标签流水线和 MA-V 数据集(Matting Anything in Video),包含超过 5 万条真实视频的高质量抠像标注,显著提升了模型在野外视频中的鲁棒性。这一方法凸显了生成式先验(generative priors)与易获取的分割提示(segmentation cues)在推动视频抠像可扩展研究中的关键作用。
链接: https://arxiv.org/abs/2601.14255
作者: Sangbeom Lim,Seoung Wug Oh,Jiahui Huang,Heeji Yoon,Seungryong Kim,Joon-Young Lee
机构: Korea University (韩国大学); Adobe Research (Adobe 研究院); KAIST AI (KAIST 人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model (VideoMaMa) that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.
zh
[CV-2] Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis
【速读】:该论文旨在解决从单目视频中合成高质量4D动态物体(即包含时空连续几何变化的三维模型)的问题,这一任务因训练数据稀缺及单视角下恢复几何与运动信息的固有歧义性而极具挑战。其解决方案的关键在于将4D合成分解为静态3D形状生成与运动重建两个独立模块:利用一个规范参考网格(canonical reference mesh),模型学习紧凑的运动潜在表示,并预测每帧顶点轨迹以恢复完整且时序一致的几何结构;同时引入可扩展的帧级Transformer架构,增强对不同序列长度的鲁棒性,从而在标准基准和新构建的具有精确真值几何的数据集上实现优于先前方法的保真度与空间一致性。
链接: https://arxiv.org/abs/2601.14253
作者: Hongyuan Chen,Xingyu Chen,Youjia Zhang,Zexiang Xu,Anpei Chen
机构: Westlake University (西湖大学); HUST (华中科技大学); Hillbot
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL . Code: this https URL
Abstract:We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at this https URL.
zh
[CV-3] LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
【速读】:该论文旨在解决传统光学字符识别(Optical Character Recognition, OCR)流程中存在的脆弱性与效率低下问题,即依赖复杂的预处理和多阶段管道导致鲁棒性差、部署成本高。其核心解决方案是提出一个端到端的多语言视觉-语言模型 LightOnOCR-2-1B,通过在大规模高质量蒸馏数据集上训练,直接将文档图像(如PDF)转换为结构清晰、自然排序的文本,无需传统OCR流水线。关键创新包括:引入基于IoU奖励的强化学习视觉推理(Reinforcement Learning with Visual Reward, RLVR)策略,在预训练中实现嵌入图像的边界框预测以支持定位任务;采用检查点平均和任务算术合并(task-arithmetic merging)提升模型鲁棒性;最终在OlmOCR-Bench上达到当前最优性能,同时参数量仅为前代模型的1/9且推理速度显著更快。
链接: https://arxiv.org/abs/2601.14251
作者: Said Taghadouini,Adrien Cavaillès,Baptiste Aubertin
机构: LightOn
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present \textbfLightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision–language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9 \times smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and \textbfLightOnOCR-bbox-bench evaluation under their respective licenses.
zh
[CV-4] OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
【速读】:该论文旨在解决现有视频定制方法依赖参考图像或任务特定的时间先验,未能充分挖掘视频中固有的时空信息,从而限制了视频生成的灵活性与泛化能力的问题。其核心解决方案是提出OmniTransfer框架,关键设计包括:任务感知的位置偏置(Task-aware Positional Bias)以自适应利用参考视频信息提升时间对齐和外观一致性;参考解耦因果学习(Reference-decoupled Causal Learning)分离参考分支与目标分支,实现精准参考传递并提高效率;以及任务自适应多模态对齐(Task-adaptive Multimodal Alignment)通过多模态语义引导动态区分并处理不同任务,从而在外观(身份与风格)和时间转移(摄像机运动与视频特效)方面优于现有方法,并在动作迁移任务上达到与基于姿态引导方法相当的效果,建立了一种灵活且高保真的视频生成新范式。
链接: https://arxiv.org/abs/2601.14250
作者: Pengze Zhang,Yanze Wu,Mengtian Li,Xu Bai,Songtao Zhao,Fulong Ye,Chong Mou,Xinghui Li,Zhuowei Chen,Qian He,Mingyuan Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Github Page: this https URL
Abstract:Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.
zh
[CV-5] Soft Tail-dropping for Adaptive Visual Tokenization
【速读】:该论文旨在解决视觉生成模型中固定长度 token 序列难以适应图像结构复杂度差异的问题,从而限制了生成质量和计算效率。其核心解决方案是提出 Soft Tail-dropping Adaptive Tokenizer (STAT),一种1D离散视觉分词器,能够根据图像的结构复杂度和细节水平自适应地调整输出 token 数量。STAT 通过编码图像为一系列离散代码及每个 token 的保留概率(keep probabilities),并引入单调递减约束与图像级复杂度度量对齐的正则项,使 token 序列长度自然适配图像内容。这一设计使得生成式 AI (Generative AI) 模型能够高效处理不同复杂度的图像,并在 ImageNet-1k 上显著提升因果 1D 自回归(causal autoregressive, AR)视觉生成模型的质量与可扩展性。
链接: https://arxiv.org/abs/2601.14246
作者: Zeyuan Chen,Kai Zhang,Zhuowen Tu,Yuanjun Xiong
机构: UC San Diego (加州大学圣地亚哥分校); Adobe (Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields competitive or superior visual generation quality compared to other probabilistic model families, while also exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.
zh
[CV-6] KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning
【速读】:该论文旨在解决像素级强化学习代理在纯视觉分布偏移下性能下降的问题,而现有基准测试因混杂多种偏移源而阻碍了系统性分析。其解决方案的关键在于提出KAGE-Env——一个基于JAX的2D平台游戏环境,将观测过程分解为独立可控的视觉轴(visual axes),同时保持底层控制问题不变;通过这种构造,仅改变某一视觉轴会影响策略的条件动作分布,从而提供一个清晰的视觉泛化抽象框架。在此基础上构建的KAGE-Bench包含6个已知轴的套件共34组训练-评估配置,可隔离单一视觉偏移因素,实现对视觉泛化能力的精确测量与分析。
链接: https://arxiv.org/abs/2601.14232
作者: Egor Cherepanov,Daniil Zelezetsky,Alexey K. Kovalev,Aleksandr I. Panov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages, 44 figures, 3 tables
Abstract:Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation enables up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors. Code: this https URL.
zh
[CV-7] Rig-Aware 3D Reconstruction of Vehicle Undercarriages using Gaussian Splatting WWW ICML
【速读】:该论文旨在解决二手车底盘检测过程中人工检查效率低、安全性差以及在线买家难以获取底盘信息的问题。其核心解决方案是提出一种基于三相机阵列的端到端视觉建模流程,通过车辆行驶过程中采集视频并重建可交互的3D底盘模型,从而实现快速、精准的缺陷识别(如锈蚀、泄漏或撞击损伤)。关键创新在于设计了一种针对相机阵列感知的Structure-from-Motion (SfM) 方法,有效应对广角镜头畸变和低视差场景带来的挑战:通过精确相机标定、同步视频流融合与来自相机刚性结构的强几何先验,结合基于DISK特征提取器与注意力机制LightGlue匹配器的约束匹配策略,生成高质量稀疏点云,并进一步驱动高斯泼溅(Gaussian splatting)技术生成实时渲染的逼真三维模型。
链接: https://arxiv.org/abs/2601.14208
作者: Nitin Kulkarni,Akhil Devarashetti,Charlie Cluss,Livio Forte,Dan Buckmaster,Philip Schneider,Chunming Qiao,Alina Vereshchaka
机构: 1. University of California, Berkeley (加州大学伯克利分校); 2. University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 8 pages, 9 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): this https URL
Abstract:Inspecting the undercarriage of used vehicles is a labor-intensive task that requires inspectors to crouch or crawl underneath each vehicle to thoroughly examine it. Additionally, online buyers rarely see undercarriage photos. We present an end-to-end pipeline that utilizes a three-camera rig to capture videos of the undercarriage as the vehicle drives over it, and produces an interactive 3D model of the undercarriage. The 3D model enables inspectors and customers to rotate, zoom, and slice through the undercarriage, allowing them to detect rust, leaks, or impact damage in seconds, thereby improving both workplace safety and buyer confidence. Our primary contribution is a rig-aware Structure-from-Motion (SfM) pipeline specifically designed to overcome the challenges of wide-angle lens distortion and low-parallax scenes. Our method overcomes the challenges of wide-angle lens distortion and low-parallax scenes by integrating precise camera calibration, synchronized video streams, and strong geometric priors from the camera rig. We use a constrained matching strategy with learned components, the DISK feature extractor, and the attention-based LightGlue matcher to generate high-quality sparse point clouds that are often unattainable with standard SfM pipelines. These point clouds seed the Gaussian splatting process to generate photorealistic undercarriage models that render in real-time. Our experiments and ablation studies demonstrate that our design choices are essential to achieve state-of-the-art quality.
zh
[CV-8] Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
【速读】:该论文旨在解决两个给定三维网格(mesh)在零样本(zero-shot)条件下的空间对齐问题,即根据文本提示描述其空间关系进行自动对齐,这是内容创作和场景组装中的关键能力。传统方法依赖几何对齐流程,而近期工作利用预训练的2D扩散模型建模语言条件下的物体间空间关系;本文提出一种无需训练新模型的解决方案:在测试时直接优化相对位姿(包括平移、旋转和各向同性缩放),通过可微分渲染器获取CLIP驱动的梯度进行更新。其核心创新在于将语言监督与几何感知目标相结合——引入改进的软迭代最近点(soft-Iterative Closest Point, soft-ICP)项促进表面贴合,并设计穿透损失(penetration loss)避免物体穿插;同时采用分阶段调度强化接触约束,并通过相机控制聚焦优化区域,从而实现语义准确且物理合理的对齐结果。
链接: https://arxiv.org/abs/2601.14207
作者: Rotem Gatenyo,Ohad Fried
机构: Reichman University (里奇曼大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation – an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of soft-Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
zh
[CV-9] IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(VLM)在实例级识别(Instance-level Recognition, ILR)任务中表现不佳的问题,尤其在person re-identification等场景下,其性能显著低于专门设计的ILR模型,限制了VLM在需要识别特定个体(如熟悉的人或物体)的实际应用。解决方案的关键在于提出IIR-VLM,通过引入预训练的ILR专家模型作为辅助视觉编码器,为多样化的实例学习提供专业化特征表示,从而使得VLM能够在上下文(in-context)中以“单样本学习”(one-shot)的方式快速掌握新实例,并实现基于实例感知的视觉理解能力。
链接: https://arxiv.org/abs/2601.14188
作者: Liang Shi,Wei Li,Kevin M Beussman,Lin Chen,Yun Fu
机构: Northeastern University (东北大学); Wyze Labs, Inc. (Wyze 实验室公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical application of VLMs, e.g. where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM’s efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with person, face, pet and general objects as the instances at task.
zh
[CV-10] Progressive self-supervised blind-spot denoising method for LDCT denoising
【速读】:该论文旨在解决低剂量计算机断层扫描(Low-dose Computed Tomography, LDCT)图像去噪问题,其核心挑战在于临床实践中难以获取配对的正常剂量CT(Normal-dose CT, NDCT)数据以支持监督学习。为克服这一限制,作者提出一种仅依赖LDCT图像的自监督训练策略,其关键创新在于引入分步盲区去噪机制(step-wise blind-spot denoising mechanism),通过逐步强制条件独立性来实现更精细的去噪学习,并辅以向LDCT图像添加高斯噪声作为正则化手段,有效缓解过拟合问题。实验表明,该方法在Mayo LDCT数据集上显著优于现有自监督方法,且性能可媲美甚至超越多个代表性监督去噪方法。
链接: https://arxiv.org/abs/2601.14180
作者: Yichao Liu,Yueyang Teng,Junwen Guo
机构: Northeastern University (东北大学); Umeå University (于默奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised learning is increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to acquire in clinical practice. In this paper, we propose a novel self-supervised training strategy that relies exclusively on LDCT images. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained denoising learning. In addition, we add Gaussian noise to LDCT images, which acts as a regularization and mitigates overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.
zh
[CV-11] ASBA: A-line State Space Model and B-line Attention for Sparse Optical Doppler Tomography Reconstruction
【速读】:该论文旨在解决光学多普勒断层成像(Optical Doppler Tomography, ODT)中因密集采样导致的扫描时间长、存储需求高以及难以捕捉快速血流动力学的问题。现有稀疏采样方法受限于保守的采样率和对流动与背景信号的均匀建模,难以实现高质量重建。其解决方案的关键在于提出一种新型血流感知网络 ASBA(A-line ROI State space model and B-line phase Attention),通过两个核心模块:1)基于A-line区域感兴趣(ROI)的状态空间模型,提取沿A-line稀疏分布的血流特征;2)基于相位差的B-line相位注意力机制,捕获沿B-line的长程血流信号。此外,引入血流感知加权损失函数,强化网络对血流信号重建精度的优先关注,从而在真实动物数据上显著优于现有最优重建方法。
链接: https://arxiv.org/abs/2601.14165
作者: Zhenghong Li,Wensheng Cheng,Congwu Du,Yingtian Pan,Zhaozheng Yin,Haibin Ling
机构: Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures
Abstract:Optical Doppler Tomography (ODT) is an emerging blood flow analysis technique. A 2D ODT image (B-scan) is generated by sequentially acquiring 1D depth-resolved raw A-scans (A-line) along the lateral axis (B-line), followed by Doppler phase-subtraction analysis. To ensure high-fidelity B-scan images, current practices rely on dense sampling, which prolongs scanning time, increases storage demands, and limits the capture of rapid blood flow dynamics. Recent studies have explored sparse sampling of raw A-scans to alleviate these limitations, but their effectiveness is hindered by the conservative sampling rates and the uniform modeling of flow and background signals. In this study, we introduce a novel blood flow-aware network, named ASBA (A-line ROI State space model and B-line phase Attention), to reconstruct ODT images from highly sparsely sampled raw A-scans. Specifically, we propose an A-line ROI state space model to extract sparsely distributed flow features along the A-line, and a B-line phase attention to capture long-range flow signals along each B-line based on phase difference. Moreover, we introduce a flow-aware weighted loss function that encourages the network to prioritize the accurate reconstruction of flow signals. Extensive experiments on real animal data demonstrate that the proposed approach clearly outperforms existing state-of-the-art reconstruction methods.
zh
[CV-12] One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion
【速读】:该论文旨在解决基于视觉Transformer(Vision Transformer, ViT)的3D高斯点阵(3D Gaussian Splatting, 3DGS)方法在稀疏图像下进行高保真新视角合成(Novel View Synthesis, NVS)时存在的两个关键问题:一是由于计算成本限制,ViT骨干网络通常只能处理低分辨率输入,导致重建细节不足;二是现有生成增强方法多为三维无关(3D-agnostic),难以保持跨视角的一致性结构,尤其是在未见区域。解决方案的关键在于提出一个双域细节感知模块(Dual-Domain Detail Perception Module),使高分辨率图像可在不依赖ViT主干的情况下被有效处理,并赋予高斯点额外特征以存储高频细节;同时设计了一个特征引导的扩散网络(feature-guided diffusion network),在恢复过程中保留高频细节;并通过统一训练策略实现几何主干与扩散细化模块的联合优化,从而显著提升多数据集上的生成质量一致性与保真度。
链接: https://arxiv.org/abs/2601.14161
作者: Yitong Dong,Qi Zhang,Minchao Jiang,Zhiqiang Wu,Qingnan Fan,Ying Feng,Huaqi Zhang,Hujun Bao,Guofeng Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Group (阿里巴巴集团); 3. Tongji University (同济大学); 4. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.
zh
[CV-13] LLM Augmented Intervenable Multimodal Adaptor for Post-operative Complication Prediction in Lung Cancer Surgery WACV2026
【速读】:该论文旨在解决肺部肿瘤手术后并发症预测的难题,以改善患者预后并降低医疗成本。其解决方案的关键在于提出MIRACLE模型,该模型通过异构输入的超球面嵌入空间融合机制,从结构化临床数据与高维影像数据中提取鲁棒且具有判别性的特征;同时引入可干预的深度学习模块,不仅提升预测精度,还提供可解释、可操作的洞察,使临床专家能基于专业经验交互式调整推荐结果,从而实现个性化且透明的术后风险管理体系。
链接: https://arxiv.org/abs/2601.14154
作者: Shubham Pandey,Bhavin Jawade,Srirangaraj Setlur,Venu Govindaraju,Kenneth Seastedt
机构: University at Buffalo (纽约州立大学布法罗分校); Roswell Park Comprehensive Cancer Center (罗斯威尔公园综合癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to P2P-CV @ WACV 2026
Abstract:Postoperative complications remain a critical concern in clinical practice, adversely affecting patient outcomes and contributing to rising healthcare costs. We present MIRACLE, a deep learning architecture for prediction of risk of postoperative complications in lung cancer surgery by integrating preoperative clinical and radiological data. MIRACLE employs a hyperspherical embedding space fusion of heterogeneous inputs, enabling the extraction of robust, discriminative features from both structured clinical records and high-dimensional radiological images. To enhance transparency of prediction and clinical utility, we incorporate an interventional deep learning module in MIRACLE, that not only refines predictions but also provides interpretable and actionable insights, allowing domain experts to interactively adjust recommendations based on clinical expertise. We validate our approach on POC-L, a real-world dataset comprising 3,094 lung cancer patients who underwent surgery at Roswell Park Comprehensive Cancer Center. Our results demonstrate that MIRACLE outperforms various traditional machine learning models and contemporary large language models (LLM) variants alone, for personalized and explainable postoperative risk management.
zh
[CV-14] winBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
【速读】:该论文旨在解决标准视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人控制中因统一微调单一视觉-语言模型(Vision-Language Model, VLM)骨干网络而导致的“灾难性遗忘”问题,即在学习低级精细感官运动技能时损害了模型对开放世界语义理解的能力。解决方案的关键在于提出TwinBrainVLA架构,通过双脑协同机制实现通用能力与专用技能的解耦:冻结的“左脑”(Left Brain)保留预训练VLM的全局语义理解能力,可训练的“右脑”(Right Brain)专注于具身本体感觉感知,并借助新颖的非对称混合Transformer(Asymmetric Mixture-of-Transformers, AsyMoT)机制动态从左脑获取语义信息并融合本体状态,为流匹配动作专家(Flow-Matching Action Expert)提供丰富条件,从而生成精确连续控制指令,同时显著提升任务性能并维持原始VLM的广泛视觉理解能力。
链接: https://arxiv.org/abs/2601.14133
作者: Bin Yu,Shijie Lian,Xiaopeng Lin,Yuliang Wei,Zhaolong Shen,Changti Wu,Yuzhuo Miao,Xinming Wang,Bailing Wang,Cong Huang,Kai Chen
机构: HIT(哈尔滨工业大学); ZGCA; ZGCI; HUST(华中科技大学); HKUST(GZ)(香港科技大学(广州)); BUAA(北京航空航天大学); ECNU(华东师范大学); CASIA(中国科学院自动化研究所); DeepCybo
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: GitHub: this https URL
Abstract:Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to “catastrophic forgetting” of the model’s open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM retaining universal semantic understanding and a specialist VLM dedicated to embodied proprioception for joint robotic control. TwinBrainVLA synergizes a frozen “Left Brain”, which retains robust general visual reasoning, with a trainable “Right Brain”, specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. Extensive experiments on SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baselines while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM, offering a promising direction for building general-purpose robots that simultaneously achieve high-level semantic understanding and low-level physical dexterity.
zh
[CV-15] GIC-DLC: Differentiable Logic Circuits for Hardware-Friendly Grayscale Image Compression
【速读】:该论文旨在解决神经图像编解码器在压缩效率提升的同时,因计算开销大而难以部署于功耗受限设备(如智能手机、相机和无人机)的问题。其解决方案的关键在于提出一种硬件感知的灰度图像压缩方法——GIC-DLC,通过训练查找表(lookup tables)将神经网络的灵活性与布尔逻辑运算的高效性相结合,从而在保持高压缩性能的同时显著降低能耗和延迟,实现面向边缘设备的低功耗图像压缩。
链接: https://arxiv.org/abs/2601.14130
作者: Till Aczel,David F. Jenny,Simon Bührer,Andreas Plesner,Antonio Di Maio,Roger Wattenhofer
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neural image codecs achieve higher compression ratios than traditional hand-crafted methods such as PNG or JPEG-XL, but often incur substantial computational overhead, limiting their deployment on energy-constrained devices such as smartphones, cameras, and drones. We propose Grayscale Image Compression with Differentiable Logic Circuits (GIC-DLC), a hardware-aware codec where we train lookup tables to combine the flexibility of neural networks with the efficiency of Boolean operations. Experiments on grayscale benchmark datasets show that GIC-DLC outperforms traditional codecs in compression efficiency while allowing substantial reductions in energy consumption and latency. These results demonstrate that learned compression can be hardware-friendly, offering a promising direction for low-power image compression on edge devices.
zh
[CV-16] PMCE: Probabilistic Multi-Granularity Semantics with Caption-Guided Enhancement for Few-Shot Learning
【速读】:该论文旨在解决少样本学习(Few-shot Learning)中因支持集样本稀缺导致原型估计偏差大、泛化能力差的问题。解决方案的关键在于提出一种概率化的少样本框架PMCE,其核心创新是构建了一个非参数知识库,存储每个类别的视觉统计信息及CLIP编码的类别名称嵌入,并在元测试阶段通过类别名称嵌入相似性检索最相关的基类,将这些统计信息聚合为类别特定先验并与支持集原型通过最大后验估计(MAP)更新融合;同时引入冻结的BLIP图像描述器生成无标签实例级描述,并结合轻量级增强器在归纳协议下对支持原型与查询特征进行一致性正则化优化,从而稳定噪声较大的文本描述,提升模型鲁棒性与性能。
链接: https://arxiv.org/abs/2601.14111
作者: Jiaying Wu,Can Gao,Jinglu Hu,Hui Li,Xiaofeng Cao,Jingcai Guo
机构: Jiangsu Ocean University (江苏海洋大学); Waseda University (早稻田大学); Tongji University (同济大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-shot learning aims to identify novel categories from only a handful of labeled samples, where prototypes estimated from scarce data are often biased and generalize poorly. Semantic-based methods alleviate this by introducing coarse class-level information, but they are mostly applied on the support side, leaving query representations unchanged. In this paper, we present PMCE, a Probabilistic few-shot framework that leverages Multi-granularity semantics with Caption-guided Enhancement. PMCE constructs a nonparametric knowledge bank that stores visual statistics for each category as well as CLIP-encoded class name embeddings of the base classes. At meta-test time, the most relevant base classes are retrieved based on the similarities of class name embeddings for each novel category. These statistics are then aggregated into category-specific prior information and fused with the support set prototypes via a simple MAP update. Simultaneously, a frozen BLIP captioner provides label-free instance-level image descriptions, and a lightweight enhancer trained on base classes optimizes both support prototypes and query features under an inductive protocol with a consistency regularization to stabilize noisy captions. Experiments on four benchmarks show that PMCE consistently improves over strong baselines, achieving up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting. Our code is available at this https URL
zh
[CV-17] Diffusion-Guided Backdoor Attacks in Real-World Reinforcement Learning
【速读】:该论文旨在解决生成式 AI (Generative AI) 在真实机器人系统中实施后门攻击(backdoor attacks)时效果受限的问题。由于物理部署中的安全约束控制流程(如速度限制、动作平滑和避障机制)会抑制异常行为,传统后门攻击在现实场景中难以有效激活。解决方案的关键在于提出一种扩散引导的后门攻击框架(Diffusion-Guided Backdoor Attack, DGBA),其核心包括:1)设计可打印的小型视觉补丁触发器(visual patch triggers),并利用条件扩散模型生成适应真实世界视觉变化的多样化补丁;2)将机器人控制栈视为黑盒系统,采用基于优势值的中毒策略,在决策关键训练状态中注入触发器,从而提升攻击的有效性和隐蔽性。实验在TurtleBot3移动机器人上验证了该方法能够在保持正常任务性能的同时可靠激活目标攻击。
链接: https://arxiv.org/abs/2601.14104
作者: Tairan Huang,Qingqing Ye,Yulin Jin,Jiawei Lian,Yi Wang,Haibo Hu
机构: The Hong Kong Polytechnic University(香港理工大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Backdoor attacks embed hidden malicious behaviors in reinforcement learning (RL) policies and activate them using triggers at test time. Most existing attacks are validated only in simulation, while their effectiveness in real-world robotic systems remains unclear. In physical deployment, safety-constrained control pipelines such as velocity limiting, action smoothing, and collision avoidance suppress abnormal actions, causing strong attenuation of conventional backdoor attacks. We study this previously overlooked problem and propose a diffusion-guided backdoor attack framework (DGBA) for real-world RL. We design small printable visual patch triggers placed on the floor and generate them using a conditional diffusion model that produces diverse patch appearances under real-world visual variations. We treat the robot control stack as a black-box system. We further introduce an advantage-based poisoning strategy that injects triggers only at decision-critical training states. We evaluate our method on a TurtleBot3 mobile robot and demonstrate reliable activation of targeted attacks while preserving normal task performance. Demo videos and code are available in the supplementary material.
zh
[CV-18] Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing
【速读】:该论文旨在解决纹理化三维形态变换(textured 3D morphing)中的关键挑战,即如何在保持几何结构一致性的同时,实现纹理对齐与细节保留,从而生成平滑且合理的过渡效果。现有方法要么仅处理几何形状而忽略纹理,要么将二维插值策略直接扩展至三维,导致语义模糊、结构错位和纹理模糊等问题。解决方案的关键在于提出一种无需训练的框架 Interp3D,其核心是利用生成先验(generative priors)并采用渐进式对齐原则:首先在条件空间中进行语义对齐插值,再通过 SLAT(Structured Latent)引导的结构插值得到几何保真度,最后通过细粒度纹理融合传递外观细节,从而协同保障几何一致性、纹理对齐性和鲁棒性。
链接: https://arxiv.org/abs/2601.14103
作者: Xiaolu Liu,Yicong Li,Qiyuan He,Jiayin Zhu,Wei Ji,Angela Yao,Jianke Zhu
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学); Nanjing University (南京大学); State Key Lab of CAD & CG, Zhejiang University (CAD与计算机图形学国家重点实验室,浙江大学); Shenzhen Loop Area Institute (深圳环区研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 12 figures
Abstract:Textured 3D morphing seeks to generate smooth and plausible transitions between two 3D assets, preserving both structural coherence and fine-grained appearance. This ability is crucial not only for advancing 3D generation research but also for practical applications in animation, editing, and digital content creation. Existing approaches either operate directly on geometry, limiting them to shape-only morphing while neglecting textures, or extend 2D interpolation strategies into 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges underscore the necessity to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition process. To address this, we propose Interp3D, a novel training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency via SLAT (Structured Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluations, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results from fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our proposed approach over previous methods. Source code is available at this https URL.
zh
[CV-19] Curriculum-Based Strategies for Efficient Cross-Domain Action Recognition
【速读】:该论文旨在解决跨视角动作识别(cross-view action recognition)中模型泛化能力不足的问题,特别是如何在不使用真实航拍数据的情况下提升模型对未见真实航拍视图的适应性。其关键解决方案是采用基于课程学习(curriculum learning)的训练策略,利用两种域外数据源——合成航拍数据和真实地面视角数据——通过分阶段、渐进式地融合不同域的数据来优化训练过程。实验表明,相比简单组合数据集的方法,该方案能在保持top-1准确率(误差在3%以内)的同时显著减少训练迭代次数(最多降低37%),从而实现更高效的跨视角动作识别模型训练。
链接: https://arxiv.org/abs/2601.14101
作者: Emily Kim,Allen Wu,Jessica Hodgins
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite significant progress in human action recognition, generalizing to diverse viewpoints remains a challenge. Most existing datasets are captured from ground-level perspectives, and models trained on them often struggle to transfer to drastically different domains such as aerial views. This paper examines how curriculum-based training strategies can improve generalization to unseen real aerial-view data without using any real aerial data during training. We explore curriculum learning for cross-view action recognition using two out-of-domain sources: synthetic aerial-view data and real ground-view data. Our results on the evaluation on order of training (fine-tuning on synthetic aerial data vs. real ground data) shows that fine-tuning on real ground data but differ in how they transition from synthetic to real. The first uses a two-stage curriculum with direct fine-tuning, while the second applies a progressive curriculum that expands the dataset in multiple stages before fine-tuning. We evaluate both methods on the REMAG dataset using SlowFast (CNN-based) and MViTv2 (Transformer-based) architectures. Results show that combining the two out-of-domain datasets clearly outperforms training on a single domain, whether real ground-view or synthetic aerial-view. Both curriculum strategies match the top-1 accuracy of simple dataset combination while offering efficiency gains. With the two-step fine-tuning method, SlowFast achieves up to a 37% reduction in iterations and MViTv2 up to a 30% reduction compared to simple combination. The multi-step progressive approach further reduces iterations, by up to 9% for SlowFast and 30% for MViTv2, relative to the two-step method. These findings demonstrate that curriculum-based training can maintain comparable performance (top-1 accuracy within 3% range) while improving training efficiency in cross-view action recognition. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.14101 [cs.CV] (or arXiv:2601.14101v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.14101 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-20] wo-Stream temporal transformer for video action classification
【速读】:该论文旨在解决视频理解中运动信息表征不足的问题,尤其在动作识别等任务中,如何有效融合空间与时间维度的信息以提升分类性能。其解决方案的关键在于提出一种双流Transformer视频分类模型,该模型分别从原始帧(content)和光流(optical flow)中提取时空特征,并通过自注意力机制在联合的光流与时间帧域内捕捉跨模态关系,从而在Transformer编码器结构中实现对运动信息的高效建模与整合。
链接: https://arxiv.org/abs/2601.14086
作者: Nattapong Kurpukdee,Adrian G. Bors
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Motion representation plays an important role in video understanding and has many applications including action recognition, robot and autonomous guidance or others. Lately, transformer networks, through their self-attention mechanism capabilities, have proved their efficiency in many applications. In this study, we introduce a new two-stream transformer video classifier, which extracts spatio-temporal information from content and optical flow representing movement information. The proposed model identifies self-attention features across the joint optical flow and temporal frame domain and represents their relationships within the transformer encoder mechanism. The experimental results show that our proposed methodology provides excellent classification results on three well-known video datasets of human activities.
zh
[CV-21] VENI: Variational Encoder for Natural Illumination
【速读】:该论文旨在解决逆渲染(inverse rendering)问题中因缺乏有效先验而导致的病态性(ill-posedness)难题,特别是现有方法在处理光照环境的球面结构和旋转等变性(rotation-equivariance)方面存在不足,且难以构建具有良好性质的潜在空间(latent space)。其解决方案的关键在于提出一种旋转等变的变分自编码器(rotation-equivariant variational autoencoder),通过引入一种新颖的向量神经元视觉Transformer(Vector Neuron Vision Transformer, VN-ViT)作为编码器,以及一个旋转等变的条件神经场(conditional neural field)作为解码器,从而在不依赖2D投影的前提下建模自然光照。编码器中创新性地设计了一个SO(2)-等变全连接层,将原始SO(3)等变性降维至SO(2),显著优于标准向量神经元(Vector Neurons)在该等变约束下的表现;该设计使得潜在空间具有更平滑的插值特性,从而获得更稳定、结构良好的潜在表示。
链接: https://arxiv.org/abs/2601.14079
作者: Paul Walker,James A. D. Gardner,Andreea Ardelean,William A. P. Smith,Bernhard Egger
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡弗里德里希亚历山大大学); University of York(约克大学); pxld.ai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Repo - this https URL Project page - this https URL
Abstract:Inverse rendering is an ill-posed problem, but priors like illumination priors, can simplify it. Existing work either disregards the spherical and rotation-equivariant nature of illumination environments or does not provide a well-behaved latent space. We propose a rotation-equivariant variational autoencoder that models natural illumination on the sphere without relying on 2D projections. To preserve the SO(2)-equivariance of environment maps, we use a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder. In the encoder, we reduce the equivariance from SO(3) to SO(2) using a novel SO(2)-equivariant fully connected layer, an extension of Vector Neurons. We show that our SO(2)-equivariant fully connected layer outperforms standard Vector Neurons when used in our SO(2)-equivariant model. Compared to previous methods, our variational autoencoder enables smoother interpolation in latent space and offers a more well-behaved latent space.
zh
[CV-22] Unsupervised Video Class-Incremental Learning via Deep Embedded Clustering Management
【速读】:该论文旨在解决无监督视频类别增量学习(unsupervised video class incremental learning, uVCIL)中的灾难性遗忘问题,即在不依赖任何标签信息的情况下,持续学习新视频类别而不丢失先前学到的知识。其解决方案的关键在于:首先使用一个深度特征提取网络在每个任务中生成代表性视频特征,无需假设类别或任务信息;随后基于这些特征逐步构建一系列深度聚类,并利用前一任务训练得到的模型作为当前任务的初始状态,实现知识迁移。该方法在UCF101、HMDB51和Something-to-Something V2三个标准视频动作识别数据集上验证了有效性,显著优于现有基线方法。
链接: https://arxiv.org/abs/2601.14069
作者: Nattapong Kurpukdee,Adrian G. Bors
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Unsupervised video class incremental learning (uVCIL) represents an important learning paradigm for learning video information without forgetting, and without considering any data labels. Prior approaches have focused on supervised class-incremental learning, relying on using the knowledge of labels and task boundaries, which is costly, requires human annotation, or is simply not a realistic option. In this paper, we propose a simple yet effective approach to address the uVCIL. We first consider a deep feature extractor network, providing a set of representative video features during each task without assuming any class or task information. We then progressively build a series of deep clusters from the extracted features. During the successive task learning, the model updated from the previous task is used as an initial state in order to transfer knowledge to the current learning task. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, by ignoring the labels from the supervised setting. Our approach significantly outperforms other baselines on all datasets.
zh
[CV-23] VERIDAH: Solving Enumeration Anomaly Aware Vertebra Labeling across Imaging Sequences
【速读】:该论文旨在解决脊柱椎体计数异常(enumeration anomalies)在临床影像中难以自动识别与标注的问题,尤其关注胸腰段交界区(thoracolumbar junction)常被忽视的现状。现有基于深度学习的椎体标注算法无法有效处理椎体数量异常的情况,限制了其在慢性背痛诊断和手术规划中的应用。解决方案的关键在于提出一种名为“Vertebra Identification with Anomaly Handling”(VERIDAH)的新颖椎体标注算法,其核心创新是结合多分类头(multiple classification heads)与加权椎体序列预测算法(weighted vertebra sequence prediction algorithm),从而实现对正常及异常椎体结构的精准识别与自动标注。实验表明,VERIDAH在T2加权自旋回波矢状位成像(T2w TSE sagittal)和CT图像上均显著优于现有模型,尤其是在识别胸椎和腰椎计数异常方面表现出高准确率(如CT图像中胸椎异常识别准确率达96.30%)。
链接: https://arxiv.org/abs/2601.14066
作者: Hendrik Möller,Hanna Schoen,Robert Graf,Matan Atad,Nathan Molinier,Anjany Sekuboyina,Bettina K. Budai,Fabian Bamberg,Steffen Ringhof,Christopher Schlett,Tobias Pischon,Thoralf Niendorf,Josua A. Decker,Marc-André Weber,Bjoern Menze,Daniel Rueckert,Jan S. Kirschke
机构: 1. German Cancer Research Center (德国癌症研究中心); 2. Imperial College London (伦敦帝国理工学院); 3. University of Oxford (牛津大学); 4. Technical University of Munich (慕尼黑工业大学); 5. Heidelberg University (海德堡大学); 6. King’s College London (伦敦国王学院); 7. University of Freiburg (弗莱堡大学); 8. University of Cologne (科隆大学); 9. German Institute of Human Nutrition (德国人类营养研究所); 10. Charité – Universitätsmedizin Berlin (柏林夏里特医科大学); 11. University Hospital Bonn (波恩大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The human spine commonly consists of seven cervical, twelve thoracic, and five lumbar vertebrae. However, enumeration anomalies may result in individuals having eleven or thirteen thoracic vertebrae and four or six lumbar vertebrae. Although the identification of enumeration anomalies has potential clinical implications for chronic back pain and operation planning, the thoracolumbar junction is often poorly assessed and rarely described in clinical reports. Additionally, even though multiple deep-learning-based vertebra labeling algorithms exist, there is a lack of methods to automatically label enumeration anomalies. Our work closes that gap by introducing “Vertebra Identification with Anomaly Handling” (VERIDAH), a novel vertebra labeling algorithm based on multiple classification heads combined with a weighted vertebra sequence prediction algorithm. We show that our approach surpasses existing models on T2w TSE sagittal (98.30% vs. 94.24% of subjects with all vertebrae correctly labeled, p 0.001) and CT imaging (99.18% vs. 77.26% of subjects with all vertebrae correctly labeled, p 0.001) and works in arbitrary field-of-view images. VERIDAH correctly labeled the presence 2 Möller et al. of thoracic enumeration anomalies in 87.80% and 96.30% of T2w and CT images, respectively, and lumbar enumeration anomalies in 94.48% and 97.22% for T2w and CT, respectively. Our code and models are available at: this https URL.
zh
[CV-24] Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration
【速读】:该论文旨在解决零样本组合图像检索(Zero-shot Composed Image Retrieval, ZS-CIR)中细粒度变化捕捉不足以及视觉与语义信息融合效率低的问题。现有方法通常依赖图像到文本模型将多模态查询转换为单一文本,或使用大语言模型(Large Language Model, LLM)生成目标图像描述,难以充分保留互补的视觉信息和完整的语义上下文。其解决方案的关键在于提出一种新颖的细粒度零样本组合图像检索方法(Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration, CVSI),通过三个核心组件实现:(1) 视觉信息提取,利用预训练映射网络将参考图像转换为伪标记(pseudo token),并与修改文本及最可能添加的对象结合;(2) 语义信息提取,借助预训练描述模型生成参考图像的多个caption,并由LLM生成修改后的caption及潜在新增对象;(3) 补充信息检索,整合查询端与数据库图像中的视觉与语义特征,从而在多种场景下高效完成目标图像检索。
链接: https://arxiv.org/abs/2601.14060
作者: Yongcong Ye,Kai Zhang,Yanghai Zhang,Enhong Chen,Longfei Li,Jun Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which involves using a pre-trained captioning model to generate multiple captions for the reference image, followed by leveraging an LLM to generate the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to efficiently handle retrieval queries in a variety of situations. Extensive experiments on three public datasets (e.g., CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at this https URL.
zh
[CV-25] POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成中物体几何失真与编辑一致性不足的问题,尤其在需要3D布局控制和交互式编辑场景下表现不佳。现有方法依赖2D提示或迭代的复制-扭曲-粘贴策略,常导致对象形变且难以保持跨编辑的一致性。其解决方案的关键在于提出一种名为POCI-Diff的统一扩散框架,通过在扩散过程中联合施加3D几何约束与实例级语义绑定,实现对每个物体的显式语义控制——即通过Blended Latent Diffusion将特定文本描述绑定至3D边界框,从而支持单次合成复杂多物体场景;同时引入无变形的生成式编辑流程,借助IP-Adapter基于参考图像条件化扩散过程,确保对象身份一致性和全局场景连贯性,有效消除由形变带来的几何伪影。
链接: https://arxiv.org/abs/2601.14056
作者: Andrea Rigo,Luca Stornaiuolo,Weijie Wang,Mauro Martino,Bruno Lepri,Nicu Sebe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff), a novel formulation for jointly enforcing 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity and consistency across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.
zh
[CV-26] Decoder-Free Supervoxel GNN for Accurate Brain-Tumor Localization in Multi-Modal MRI
【速读】:该论文旨在解决当前3D医学图像分析中主流视觉骨干网络(vision backbone)因采用参数密集的编码器-解码器结构而导致模型资源分配不合理的问题,即大量参数被用于空间重建而非特征学习。其解决方案的关键在于提出SVGFormer——一种无解码器的图结构框架,通过内容感知分组阶段将体素(voxel)划分为语义超体素(supervoxel)构成的语义图,并设计层级编码器联合使用patch级Transformer与超体素级图注意力网络(Graph Attention Network, GAT),从而在保留细粒度区域内部特征的同时建模跨区域依赖关系。该设计将全部可学习容量聚焦于特征编码,同时提供从patch到region的双重尺度可解释性,显著提升了模型的准确性与透明度。
链接: https://arxiv.org/abs/2601.14055
作者: Andrea Protani,Marc Molina Van Den Bosch,Lorenzo Giusti,Heloisa Barbosa Da Silva,Paolo Cacace,Albert Sund Aillet,Miguel Angel Gonzalez Ballester,Friedhelm Hummel,Luigi Serio
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures,
Abstract:Modern vision backbones for 3D medical imaging typically process dense voxel grids through parameter-heavy encoder-decoder structures, a design that allocates a significant portion of its parameters to spatial reconstruction rather than feature learning. Our approach introduces SVGFormer, a decoder-free pipeline built upon a content-aware grouping stage that partitions the volume into a semantic graph of supervoxels. Its hierarchical encoder learns rich node representations by combining a patch-level Transformer with a supervoxel-level Graph Attention Network, jointly modeling fine-grained intra-region features and broader inter-regional dependencies. This design concentrates all learnable capacity on feature encoding and provides inherent, dual-scale explainability from the patch to the region level. To validate the framework’s flexibility, we trained two specialized models on the BraTS dataset: one for node-level classification and one for tumor proportion regression. Both models achieved strong performance, with the classification model achieving a F1-score of 0.875 and the regression model a MAE of 0.028, confirming the encoder’s ability to learn discriminative and localized features. Our results establish that a graph-based, encoder-only paradigm offers an accurate and inherently interpretable alternative for 3D medical image representation.
zh
[CV-27] LLM Orbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agent ic AI Systems
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)发展中面临的三大关键危机:数据稀缺(预计2026–2028年9–27T tokens耗尽)、成本指数级增长(5年内从300万美元增至3亿美元以上)以及不可持续的能源消耗(提升22倍),这些问题共同构成了“缩放墙”(scaling wall),限制了传统暴力扩展方法的可行性。其解决方案的核心在于识别并系统阐述六种突破路径:(1)推理时计算优化(如o1和DeepSeek-R1以10倍推理算力达到GPT-4性能);(2)量化压缩(实现4–8倍模型体积缩减);(3)分布式边缘计算(降低10倍成本);(4)模型融合;(5)高效训练技术(如ORPO减少50%内存占用);(6)小型专用模型(如Phi-4 14B参数规模媲美更大模型)。这些范式协同推动LLMs向更高效、可持续和可及的方向演进,标志着从被动生成到工具调用智能体(agent)的范式跃迁,并揭示了后训练优化(如RLHF、GRPO、纯强化学习)与架构革新(如MoE路由、多头潜在注意力)在提升性能与效率中的决定性作用。
链接: https://arxiv.org/abs/2601.14053
作者: Badri N. Patro,Vijay S. Agneeswaran
机构: Microsoft(微软)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Image and Video Processing (eess.IV)
备注:
Abstract:The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning-capable systems approaching human-level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019-2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ( 3M to 300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute-force approaches. Our analysis reveals six paradigms breaking this wall: (1) test-time compute (o1, DeepSeek-R1 achieve GPT-4 performance with 10x inference compute), (2) quantization (4-8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi-4 14B matches larger models). Three paradigm shifts emerge: (1) post-training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek-R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi-head Latent Attention 8x KV cache compression enables GPT-4-level performance at 0.30/M tokens), and (3) democratization (open-source Llama 3 88.6% MMLU surpasses GPT-4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool-using agents (ReAct, RAG, multi-agent systems), and analyze post-training innovations.
zh
[CV-28] Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model
【速读】:该论文旨在解决当前零样本异常检测(Out-of-Distribution, OOD)方法过度依赖文本空间知识、忽视图像空间特征挑战的问题,从而导致在近域OOD(near OOD)和远域OOD(far OOD)任务中性能受限。其解决方案的关键在于提出一种基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的新颖流水线MM-OOD,利用MLLMs的跨模态推理能力与多轮对话机制增强异常检测:对于近域OOD任务,直接将已知分布(ID)图像与文本提示输入MLLMs进行潜在异常识别;对于远域OOD任务,则引入“草图-生成-细化”框架——先通过文本提示构建异常暴露场景,再生成对应的视觉异常样本,最后借助多模态提示进一步细化判断。该方法显著提升了在Food-101等多模态数据集上的性能,并验证了在ImageNet-1K上的可扩展性。
链接: https://arxiv.org/abs/2601.14052
作者: Haoran Xu,Yanlin Liu,Zizhao Tong,Jiaze Li,Kexue Fu,Yuyang Zhang,Longxiang Gao,Shuaiguang Li,Xingyu Li,Yanran Xu,Changwei Wang
机构: Zhejiang University (浙江大学); Tsinghua University (清华大学); University of Chinese Academy of Sciences (中国科学院大学); University of Electronic Science and Technology of China (电子科技大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); RWTH Aachen University (亚琛工业大学); Shandong Computer Science Center (国家超算济南中心) (山东省计算中心(国家超级计算济南中心)); Qilu University of Technology (Shandong Academy of Sciences) (齐鲁工业大学(山东省科学院)); Key Laboratory of Computing Power Network and Information Security, Ministry of Education (教育部计算 power 网络与信息安全重点实验室); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing (山东省计算 power 互联网与服务计算重点实验室); Shandong Fundamental Research Center for Computer Science (山东省计算机科学基础研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.
zh
[CV-29] Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在气象领域应用中存在的两个关键问题:一是领域差距(domain gap),即模型在通用场景下训练,难以适配专业气象任务;二是推理忠实性差距(reasoning faithfulness gap),尤其在主流强化微调(Reinforcement Fine-Tuning, RFT)方法下易产生自相矛盾推理(Self-Contradictory Reasoning, Self-Contra),这在高风险的气象决策中不可接受。解决方案的关键在于提出一种逻辑一致性强化微调方法(Logically Consistent Reinforcement Fine-Tuning, LoCo-RFT),通过引入逻辑一致性奖励机制,约束模型推理过程与最终答案的一致性,从而提升推理忠实性。在此基础上,研究构建了首个面向气象领域的多模态推理基准WeatherQA,并基于此训练出Weather-R1——目前已知首个具备逻辑忠实性的气象推理VLM,实验表明其在WeatherQA上相较基线提升9.8个百分点,优于监督微调和传统RFT方法,甚至超越原始Qwen2.5-VL-32B模型。
链接: https://arxiv.org/abs/2601.14044
作者: Kaiyu Wu,Pucheng Han,Hualong Zhang,Naigeng Wu,Keze Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Vision Language Models (VLMs) show advancing reasoning capabilities, their application in meteorology is constrained by a domain gap and a reasoning faithfulness gap. Specifically, mainstream Reinforcement Fine-Tuning (RFT) can induce Self-Contradictory Reasoning (Self-Contra), where the model’s reasoning contradicts its final answer, which is unacceptable in such a high-stakes domain. To address these challenges, we construct WeatherQA, a novel multimodal reasoning benchmark in meteorology. We also propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which resolves Self-Contra by introducing a logical consistency reward. Furthermore, we introduce Weather-R1, the first reasoning VLM with logical faithfulness in meteorology, to the best of our knowledge. Experiments demonstrate that Weather-R1 improves performance on WeatherQA by 9.8 percentage points over the baseline, outperforming Supervised Fine-Tuning and RFT, and even surpassing the original Qwen2.5-VL-32B. These results highlight the effectiveness of our LoCo-RFT and the superiority of Weather-R1. Our benchmark and code are available at this https URL.
zh
[CV-30] Federated Balanced Learning
【速读】:该论文旨在解决联邦学习(Federated Learning)在非独立同分布(non-iid)数据场景下因客户端数据分布不均衡导致的全局模型漂移(client drift)问题,该问题会显著影响最终模型性能。解决方案的关键在于重新审视客户端的作用,提出联邦平衡学习(Federated Balanced Learning, FBL),通过在客户端侧实现样本平衡来预防漂移的发生:具体而言,FBL利用边缘侧生成模型,在客户端固定样本数量限制下,通过知识填充(knowledge filling)和知识采样(knowledge sampling)实现样本平衡;同时设计知识对齐策略(Knowledge Alignment Strategy)缩小合成数据与真实数据之间的差距,并引入知识丢弃策略(Knowledge Drop Strategy)进行正则化,从而提升模型鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2601.14042
作者: Jiaze Li,Haoran Xu,Wanyi Wu,Changwei Wang,Shuaiguang Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Youyang Qu,Longxiang Gao,Xudong Yang,Lumin Xing
机构: MiLM Plus of Xiaomi Inc.(小米公司); Zhejiang University (浙江大学); Shandong University (山东大学); Key Laboratory of Computing Power Network and Information Security, Ministry of Education (教育部计算 power 网络与信息安全重点实验室); Shandong Computer Science Center (National Supercomputer Center in Jinan) (济南国家超算中心); Qilu University of Technology (Shandong Academy of Sciences) (齐鲁工业大学(山东省科学院)); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing (山东省计算 power 互联网与服务计算重点实验室); Shandong Fundamental Research Center for Computer Science (山东省计算机科学基础研究中心); Shandong Key Laboratory of Digital Diagnosis and Treatment of Thoracic Oncology (山东省胸外科数字诊断与治疗重点实验室); The First Affiliated Hospital of Shandong First Medical University & Shandong Provincial Qianfoshan Hospital (山东第一医科大学附属医院 & 山东省千佛山医院); Shandong Engineering Research Center of Intelligent Surgery (山东省智能手术工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Federated learning is a paradigm of joint learning in which clients collaborate by sharing model parameters instead of data. However, in the non-iid setting, the global model experiences client drift, which can seriously affect the final performance of the model. Previous methods tend to correct the global model that has already deviated based on the loss function or gradient, overlooking the impact of the client samples. In this paper, we rethink the role of the client side and propose Federated Balanced Learning, i.e., FBL, to prevent this issue from the beginning through sample balance on the client side. Technically, FBL allows unbalanced data on the client side to achieve sample balance through knowledge filling and knowledge sampling using edge-side generation models, under the limitation of a fixed number of data samples on clients. Furthermore, we design a Knowledge Alignment Strategy to bridge the gap between synthetic and real data, and a Knowledge Drop Strategy to regularize our method. Meanwhile, we scale our method to real and complex scenarios, allowing different clients to adopt various methods, and extend our framework to further improve performance. Numerous experiments show that our method outperforms state-of-the-art baselines. The code is released upon acceptance.
zh
[CV-31] Generalizing Abstention for Noise-Robust Learning in Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割中标签噪声(label noise)导致模型过拟合、泛化性能下降的问题。现有方法在该领域仍较为有限,尤其缺乏对抽象机制(abstention mechanism)在分割任务中潜力的探索。解决方案的关键在于提出一个通用且模块化的抽象框架,通过两个核心组件提升损失函数的抗噪能力:一是引入有指导性的正则化项以引导模型的抽象行为;二是设计基于幂律的自适应调参算法,动态调整抽象惩罚强度。该框架可与多种损失函数结合,实验表明其在高噪声环境下显著优于基线方法,验证了让模型选择性忽略噪声样本是一种强大且可推广的策略。
链接: https://arxiv.org/abs/2601.14039
作者: Wesam Moustafa,Hossam Elsafty,Helen Schneider,Lorenz Sparrenberg,Rafet Sifa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Label noise is a critical problem in medical image segmentation, often arising from the inherent difficulty of manual annotation. Models trained on noisy data are prone to overfitting, which degrades their generalization performance. While a number of methods and strategies have been proposed to mitigate noisy labels in the segmentation domain, this area remains largely under-explored. The abstention mechanism has proven effective in classification tasks by enhancing the capabilities of Cross Entropy, yet its potential in segmentation remains unverified. In this paper, we address this gap by introducing a universal and modular abstention framework capable of enhancing the noise-robustness of a diverse range of loss functions. Our framework improves upon prior work with two key components: an informed regularization term to guide abstention behaviour, and a more flexible power-law-based auto-tuning algorithm for the abstention penalty. We demonstrate the framework’s versatility by systematically integrating it with three distinct loss functions to create three novel, noise-robust variants: GAC, SAC, and ADS. Experiments on the CaDIS and DSAD medical datasets show our methods consistently and significantly outperform their non-abstaining baselines, especially under high noise levels. This work establishes that enabling models to selectively ignore corrupted samples is a powerful and generalizable strategy for building more reliable segmentation models. Our code is publicly available at this https URL.
zh
[CV-32] Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving
【速读】:该论文旨在解决自动驾驶系统中基于主动传感器(如LiDAR)的3D边界框标注在动态场景下因物体在不同时间戳被观测到不同位置而导致的系统性标注误差问题。此类误差会破坏标注的空间-时间一致性,进而影响模型训练与性能评估的准确性。解决方案的关键在于提出一种新颖的离线估计方法,通过约束标注轨迹符合物理可行性,实现对原始标注的修正,从而提升标注质量并量化误差水平;实验表明,该方法使标注质量提升超过17%,且原始标注最大偏移可达2.5米,尤其在高动态物体上更为显著,凸显了精确标注对正确性能解读的重要性。
链接: https://arxiv.org/abs/2601.14038
作者: Alexandre Justo Miro(1 and 2),Ludvig af Klinteberg(2),Bogdan Timus(1),Aron Asefaw(3),Ajinkya Khoche(1 and 3),Thomas Gustafsson(1),Sina Sharif Mansouri(1),Masoud Daneshtalab(2) ((1) Traton Group Ramp;D, (2) Mälardalen University, (3) KTH Royal Institute of Technology)
机构: Traton Group R&D (Traton集团研发部); Mälardalen University (马拉德伦大学); KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to The IEEE/CVF Winter Conference on Applications of Computer Vision 2026
Abstract:Accurate ground truth annotations are critical to supervised learning and evaluating the performance of autonomous vehicle systems. These vehicles are typically equipped with active sensors, such as LiDAR, which scan the environment in predefined patterns. 3D box annotation based on data from such sensors is challenging in dynamic scenarios, where objects are observed at different timestamps, hence different positions. Without proper handling of this phenomenon, systematic errors are prone to being introduced in the box annotations. Our work is the first to discover such annotation errors in widely used, publicly available datasets. Through our novel offline estimation method, we correct the annotations so that they follow physically feasible trajectories and achieve spatial and temporal consistency with the sensor data. For the first time, we define metrics for this problem; and we evaluate our method on the Argoverse 2, MAN TruckScenes, and our proprietary datasets. Our approach increases the quality of box annotations by more than 17% in these datasets. Furthermore, we quantify the annotation errors in them and find that the original annotations are misplaced by up to 2.5 m, with highly dynamic objects being the most affected. Finally, we test the impact of the errors in benchmarking and find that the impact is larger than the improvements that state-of-the-art methods typically achieve with respect to the previous state-of-the-art methods; showing that accurate annotations are essential for correct interpretation of performance. Our code is available at this https URL.
zh
[CV-33] Human detectors are surprisingly powerful reward models
【速读】:该论文旨在解决当前视频生成模型在复杂非刚性运动(尤其是人类动态动作如体育、舞蹈等)中表现不佳的问题,具体表现为肢体缺失或多余、姿态扭曲及物理上不合理的动作。其解决方案的关键在于提出一个名为HuDA的简单奖励模型,该模型通过融合人体检测置信度(用于评估外观质量)与时间提示对齐分数(用于捕捉动作的真实性),利用现成的预训练模型实现无需额外训练的奖励函数设计。实验表明,基于HuDA的群体奖励策略优化(GRPO)方法显著提升了视频生成质量,尤其在复杂人类动作生成方面优于Wan 2.1等先进模型,且对动物视频和人-物交互场景也有泛化提升效果。
链接: https://arxiv.org/abs/2601.14037
作者: Kumar Ashutosh,XuDong Wang,Xi Yin,Kristen Grauman,Adam Polyak,Ishan Misra,Rohit Girdhar
机构: Meta AI; University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report
Abstract:Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet, they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports, dance, etc. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve the human motion in generated videos. HuDA integrates human detection confidence for appearance quality, and a temporal prompt alignment score to capture motion realism. We show this simple reward function that leverages off-the-shelf models without any additional training, outperforms specialized models finetuned with manually annotated data. Using HuDA for Group Reward Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially when generating complex human motions, outperforming state-of-the-art models like Wan 2.1, with win-rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond just humans, for instance, significantly improving generation of animal videos and human-object interactions.
zh
[CV-34] Likelihood-Separable Diffusion Inference for Multi-Image MRI Super-Resolution
【速读】:该论文旨在解决多图像超分辨率(Multi-Image Super-Resolution, MISR)在磁共振成像(MRI)中的应用问题,特别是针对常规二维多切片采集所导致的各向异性退化(anisotropic degradation)问题。传统扩散模型方法主要适用于单图像逆问题,难以直接扩展至多测量场景;而本研究的关键创新在于提出了一种基于扩散概率采样(Diffusion Probabilistic Sampling, DPS)的似然修正机制,该机制能够实现跨独立获取测量数据的梯度分解完全可分离,从而无需构建联合算子、修改扩散模型或增加网络函数评估次数即可实现高效的MISR重建。通过将DPS、DMAP、DPPS及基于扩散的PnP/ADMM方法推广至多图像场景,实验表明该方案在4×/8×/16×各向异性退化下显著优于单图像超分辨率(SISR),并实现了当前最先进的各向异性MRI体积超分辨率,且能从常规2D多切片扫描中重建出接近各向同性的解剖结构。
链接: https://arxiv.org/abs/2601.14030
作者: Samuel W. Remedios,Zhangxing Bian,Shuwen Wei,Aaron Carass,Jerry L. Prince,Blake E. Dewey
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models are the current state-of-the-art for solving inverse problems in imaging. Their impressive generative capability allows them to approximate sampling from a prior distribution, which alongside a known likelihood function permits posterior sampling without retraining the model. While recent methods have made strides in advancing the accuracy of posterior sampling, the majority focuses on single-image inverse problems. However, for modalities such as magnetic resonance imaging (MRI), it is common to acquire multiple complementary measurements, each low-resolution along a different axis. In this work, we generalize common diffusion-based inverse single-image problem solvers for multi-image super-resolution (MISR) MRI. We show that the DPS likelihood correction allows an exactly-separable gradient decomposition across independently acquired measurements, enabling MISR without constructing a joint operator, modifying the diffusion model, or increasing network function evaluations. We derive MISR versions of DPS, DMAP, DPPS, and diffusion-based PnP/ADMM, and demonstrate substantial gains over SISR across 4\times/8\times/16\times anisotropic degradations. Our results achieve state-of-the-art super-resolution of anisotropic MRI volumes and, critically, enable reconstruction of near-isotropic anatomy from routine 2D multi-slice acquisitions, which are otherwise highly degraded in orthogonal views.
zh
[CV-35] Equivariant Learning for Unsupervised Image Dehazing
【速读】:该论文旨在解决科学成像中图像去雾(Image Dehazing, ID)问题,传统方法通常依赖于精心设计的先验知识或大量无雾真实图像作为监督信号,而这些在科学成像场景中往往难以获取。解决方案的关键在于提出一种新的无监督学习框架——等变图像去雾(Equivariant Image Dehazing, EID),其核心思想是利用图像信号的对称性(symmetry)来恢复清晰图像:通过强制施加雾霾一致性(haze consistency)和系统等变性(systematic equivariance)约束,EID可直接从原始模糊图像中恢复清晰模式;同时引入对抗学习策略以建模未知的雾霾物理机制,从而提升模型泛化能力与去雾效果。实验表明,EID在细胞显微成像和医学内窥镜等科学图像基准上显著优于现有最先进方法。
链接: https://arxiv.org/abs/2601.13986
作者: Zhang Wen,Jiangwei Xie,Dongdong Chen
机构: Heriot-Watt University (赫瑞-瓦特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Technical report
Abstract:Image Dehazing (ID) aims to produce a clear image from an observation contaminated by haze. Current ID methods typically rely on carefully crafted priors or extensive haze-free ground truth, both of which are expensive or impractical to acquire, particularly in the context of scientific imaging. We propose a new unsupervised learning framework called Equivariant Image Dehazing (EID) that exploits the symmetry of image signals to restore clarity to hazy observations. By enforcing haze consistency and systematic equivariance, EID can recover clear patterns directly from raw, hazy images. Additionally, we propose an adversarial learning strategy to model unknown haze physics and facilitate EID learning. Experiments on two scientific image dehazing benchmarks (including cell microscopy and medical endoscopy) and on natural image dehazing have demonstrated that EID significantly outperforms state-of-the-art approaches. By unifying equivariant learning with modelling haze physics, we hope that EID will enable more versatile and effective haze removal in scientific imaging. Code and datasets will be published.
zh
[CV-36] FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)中现有链式思维(Chain-of-Thought, CoT)方法的两大核心问题:一是纯文本CoT缺乏空间语义 grounding,易因稀疏标注的推理步骤而过拟合;二是多模态CoT会因生成虚构视觉观测导致严重的token膨胀,使得实时导航不可行。解决方案的关键在于提出FantasyVLN框架,其通过预训练的视觉自回归模型(Visual AutoRegressor, VAR)将想象中的视觉token压缩到紧凑的潜在空间,并在统一的多CoT策略下联合学习文本、视觉和多模态CoT模式;推理阶段则直接实现指令到动作的映射,同时保留推理感知的表示能力,从而在保持推理能力的同时显著降低延迟,实现实时高效导航。
链接: https://arxiv.org/abs/2601.13976
作者: Jing Zuo,Lingzhou Mu,Fan Jiang,Chengcheng Ma,Mu Xu,Yonggang Qi
机构: Fantasy AIGC Team; Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
zh
[CV-37] Harmonizing the Deep: A Unified Information Pipeline for Robust Marine Biodiversity Assessment Across Heterogeneous Domains
【速读】:该论文旨在解决海洋生物多样性监测中检测模型在跨域部署时性能显著下降的问题,尤其是在复杂水下环境中难以保持可靠性和可扩展性。其解决方案的关键在于构建一个统一的信息处理管道(Unified Information Pipeline),将异构数据集标准化为可比的信息流,并在受控的跨域协议下评估固定检测器的性能。研究发现,结构因素(如场景组成、目标密度和上下文冗余)比视觉退化(如浑浊度)更能解释跨域性能损失,且稀疏场景会引发特有的“上下文坍塌”(Context Collapse)失效模式;同时通过边缘硬件推理基准测试验证了运行时优化对实现远程监测实用采样率的可行性,从而将关注点从图像增强转向结构感知的可靠性提升,为海洋生态系统的一致性评估提供了一种可普及的工具。
链接: https://arxiv.org/abs/2601.13975
作者: Marco Piccolo,Qiwei Han,Astrid van Toor,Joachim Vanneste
机构: Nova School of Business & Economics (Nova School of Business & Economics); blueOASIS (blueOASIS)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 4 figures 8 tables
Abstract:Marine biodiversity monitoring requires scalability and reliability across complex underwater environments to support conservation and invasive-species management. Yet existing detection solutions often exhibit a pronounced deployment gap, with performance degrading sharply when transferred to new sites. This work establishes the foundational detection layer for a multi-year invasive species monitoring initiative targeting Arctic and Atlantic marine ecosystems. We address this challenge by developing a Unified Information Pipeline that standardises heterogeneous datasets into a comparable information flow and evaluates a fixed, deployment-relevant detector under controlled cross-domain protocols. Across multiple domains, we find that structural factors, such as scene composition, object density, and contextual redundancy, explain cross-domain performance loss more strongly than visual degradation such as turbidity, with sparse scenes inducing a characteristic “Context Collapse” failure mode. We further validate operational feasibility by benchmarking inference on low-cost edge hardware, showing that runtime optimisation enables practical sampling rates for remote monitoring. The results shift emphasis from image enhancement toward structure-aware reliability, providing a democratised tool for consistent marine ecosystem assessment.
zh
[CV-38] STEC: A Reference-Free Spatio-Temporal Entropy Coverag e Metric for Evaluating Sampled Video Frames WACV2026
【速读】:该论文旨在解决视频帧采样(frame sampling)质量评估难题,现有指标多聚焦于感知质量或重建保真度,无法有效衡量采样帧是否充分捕捉了视频中的信息性与代表性内容。解决方案的关键在于提出一种无需参考的量化指标——时空熵覆盖率(Spatio-Temporal Entropy Coverage, STEC),其核心思想是基于时空帧熵(Spatio-Temporal Frame Entropy, STFE)建模每帧的空间信息强度,并结合时间分布广度与冗余度来评估采样效果,从而提供一个轻量且具原理性的任务无关诊断信号,用于在有限预算下分析不同采样策略的行为特性。
链接: https://arxiv.org/abs/2601.13974
作者: Shih-Yao Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper corresponds to the camera-ready version of a WACV 2026 Workshop paper
Abstract:Frame sampling is a fundamental component in video understanding and video–language model pipelines, yet evaluating the quality of sampled frames remains challenging. Existing evaluation metrics primarily focus on perceptual quality or reconstruction fidelity, and are not designed to assess whether a set of sampled frames adequately captures informative and representative video content. We propose Spatio-Temporal Entropy Coverage (STEC), a simple and non-reference metric for evaluating the effectiveness of video frame sampling. STEC builds upon Spatio-Temporal Frame Entropy (STFE), which measures per-frame spatial information via entropy-based structural complexity, and evaluates sampled frames based on their temporal coverage and redundancy. By jointly modeling spatial information strength, temporal dispersion, and non-redundancy, STEC provides a principled and lightweight measure of sampling quality. Experiments on the MSR-VTT test-1k benchmark demonstrate that STEC clearly differentiates common sampling strategies, including random, uniform, and content-aware methods. We further show that STEC reveals robustness patterns across individual videos that are not captured by average performance alone, highlighting its practical value as a general-purpose evaluation tool for efficient video understanding. We emphasize that STEC is not designed to predict downstream task accuracy, but to provide a task-agnostic diagnostic signal for analyzing frame sampling behavior under constrained budgets. Comments: This paper corresponds to the camera-ready version of a WACV 2026 Workshop paper Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.13974 [cs.CV] (or arXiv:2601.13974v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.13974 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Shih-Yao (Mike) Lin [view email] [v1] Tue, 20 Jan 2026 13:51:33 UTC (1,002 KB)
zh
[CV-39] DExTeR: Weakly Semi-Supervised Object Detection with Class and Instance Experts for Medical Imaging
【速读】:该论文旨在解决医学影像中解剖标志点检测的标注成本过高问题,传统目标检测模型依赖昂贵的边界框(bounding box)标注,限制了其在大规模临床场景中的可扩展性。为此,作者提出了一种基于Transformer的点到框回归器DExTeR(DETR with Experts),其核心创新在于:首先通过类引导的可变形注意力机制(class-guided deformable attention)将单点标注编码为对象查询(object queries),精准捕获类别特异性特征;其次引入CLICK-MoE(CLass, Instance, and Common Knowledge Mixture of Experts)模块,分离类别与实例表征以降低邻近或重叠结构间的混淆;最后采用多点训练策略提升预测一致性,增强对标注变异的鲁棒性。该方法在内窥镜、胸部X光和内镜超声三个不同医学领域数据集上均达到最先进性能,显著降低了标注成本并保持高检测精度。
链接: https://arxiv.org/abs/2601.13954
作者: Adrien Meyer,Didier Mutter,Nicolas Padoy
机构: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, France; IHU Strasbourg, Strasbourg, France; University Hospital of Strasbourg, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting anatomical landmarks in medical imaging is essential for diagnosis and intervention guidance. However, object detection models rely on costly bounding box annotations, limiting scalability. Weakly Semi-Supervised Object Detection (WSSOD) with point annotations proposes annotating each instance with a single point, minimizing annotation time while preserving localization signals. A Point-to-Box teacher model, trained on a small box-labeled subset, converts these point annotations into pseudo-box labels to train a student detector. Yet, medical imagery presents unique challenges, including overlapping anatomy, variable object sizes, and elusive structures, which hinder accurate bounding box inference. To overcome these challenges, we introduce DExTeR (DETR with Experts), a transformer-based Point-to-Box regressor tailored for medical imaging. Built upon Point-DETR, DExTeR encodes single-point annotations as object queries, refining feature extraction with the proposed class-guided deformable attention, which guides attention sampling using point coordinates and class labels to capture class-specific characteristics. To improve discrimination in complex structures, it introduces CLICK-MoE (CLass, Instance, and Common Knowledge Mixture of Experts), decoupling class and instance representations to reduce confusion among adjacent or overlapping instances. Finally, we implement a multi-point training strategy which promotes prediction consistency across different point placements, improving robustness to annotation variability. DExTeR achieves state-of-the-art performance across three datasets spanning different medical domains (endoscopy, chest X-rays, and endoscopic ultrasound) highlighting its potential to reduce annotation costs while maintaining high detection accuracy.
zh
[CV-40] VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content
【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的虚拟试衣(Virtual Try-On, VTON)技术日益普及所带来的真实性与责任使用问题,特别是如何有效检测由深度学习模型生成的合成试衣图像。其解决方案的关键在于构建了一个大规模基准数据集 VTONGuard,包含超过 775,000 张真实与合成试衣图像,覆盖多样化的姿态、背景和服装风格,并在此基础上系统评估多种检测范式。进一步地,作者设计了一种多任务框架,通过引入辅助分割任务增强边界感知特征学习,从而显著提升检测性能,为开发更鲁棒的合成内容检测模型提供了有效路径。
链接: https://arxiv.org/abs/2601.13951
作者: Shengyi Wu,Yan Hong,Shengyao Chen,Zheng Wang,Xianbing Sun,Jiahui Zhan,Jun Lan,Jianfu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid advancement of generative AI, virtual try-on (VTON) systems are becoming increasingly common in e-commerce and digital entertainment. However, the growing realism of AI-generated try-on content raises pressing concerns about authenticity and responsible use. To address this, we present VTONGuard, a large-scale benchmark dataset containing over 775,000 real and synthetic try-on images. The dataset covers diverse real-world conditions, including variations in pose, background, and garment styles, and provides both authentic and manipulated examples. Based on this benchmark, we conduct a systematic evaluation of multiple detection paradigms under unified training and testing protocols. Our results reveal each method’s strengths and weaknesses and highlight the persistent challenge of cross-paradigm generalization. To further advance detection, we design a multi-task framework that integrates auxiliary segmentation to enhance boundary-aware feature learning, achieving the best overall performance on VTONGuard. We expect this benchmark to enable fair comparisons, facilitate the development of more robust detection models, and promote the safe and responsible deployment of VTON technologies in practice.
zh
[CV-41] Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在处理涉及长尾实体或动态信息的知识密集型视觉查询时,因静态参数化知识而表现受限的问题。现有基于搜索增强的方法依赖于无差别全图检索,导致大量视觉冗余和噪声,并且缺乏深层次的迭代反思机制,难以应对复杂视觉任务。解决方案的关键在于提出一种完全自主的框架Glance-or-Gaze (GoG),其核心创新是引入Selective Gaze(选择性凝视)机制,动态决定是否全局扫视还是聚焦高价值区域,在检索前过滤无关信息;同时设计双阶段训练策略:通过监督微调实现基础GoG行为对齐,再利用复杂度自适应强化学习提升模型对复杂查询的迭代推理能力,从而显著增强视觉搜索的有效性。
链接: https://arxiv.org/abs/2601.13942
作者: Hongbo Bai,Yujin Zhou,Yile Wu,Chi-Min Chan,Pengcheng Wen,Kunhao Pan,Sirui Han,Yike Guo
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model’s capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models for further exploration soon.
zh
[CV-42] rackletGPT : A Language-like GPT Framework for White Matter Tract Segmentation
【速读】:该论文旨在解决白质纤维束(White Matter Tract)分割任务中的复杂性问题,该问题源于纤维束在个体间、条件下存在差异,但其三维结构又具有跨半球和跨个体的一致性。为应对这一挑战,作者提出TrackletGPT,其核心创新在于引入“tracklet”(即细粒度的子纤维段)作为序列化token,重新在生成式模型中编码顺序信息,从而实现对纤维束更精确的建模与分割。该方案具有跨数据集泛化能力、全自动处理流程,并能有效扩展和优化GPT类模型在纤维束分割中的应用性能。
链接: https://arxiv.org/abs/2601.13935
作者: Anoushkrit Goel,Simroop Singh,Ankita Joshi,Ranjeet Ranjan Jha,Chirag Ahuja,Aditya Nigam,Arnav Bhavsar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at 23rd IEEE International Symposium on Biomedical Imaging (ISBI), 2026
Abstract:White Matter Tract Segmentation is imperative for studying brain structural connectivity, neurological disorders and neurosurgery. This task remains complex, as tracts differ among themselves, across subjects and conditions, yet have similar 3D structure across hemispheres and subjects. To address these challenges, we propose TrackletGPT, a language-like GPT framework which reintroduces sequential information in tokens using tracklets. TrackletGPT generalises seamlessly across datasets, is fully automatic, and encodes granular sub-streamline segments, Tracklets, scaling and refining GPT models in Tractography Segmentation. Based on our experiments, TrackletGPT outperforms state-of-the-art methods on average DICE, Overlap and Overreach scores on TractoInferno and HCP datasets, even on inter-dataset experiments.
zh
[CV-43] On the Role of Rotation Equivariance in Monocular 3D Human Pose Estimation
【速读】:该论文旨在解决单目图像到三维人体姿态估计(Monocular 3D Human Pose Estimation, HPE)中的关键挑战,即从单一二维输入图像中准确预测人体骨骼关节的三维坐标点集。由于该问题本质上是病态的(ill-posed),现有方法通常采用两阶段策略:先检测2D关节位置,再进行2D到3D的提升(lifting)。然而,这些方法在处理图像平面内旋转时表现不佳。论文的核心解决方案在于引入二维旋转等变性(2D rotation equivariance),并指出通过数据增强即可高效学习这种等变性,而无需显式约束模型参数空间。实验表明,仅靠旋转等变性的建模即可显著提升对图像平面内旋转的人体姿态估计性能,优于当前基于设计的等变方法(equivariant-by-design methods)。
链接: https://arxiv.org/abs/2601.13913
作者: Pavlo Melnyk,Cuong Le,Urs Waldmann,Per-Erik Forssén,Bastian Wandt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Estimating 3D from 2D is one of the central tasks in computer vision. In this work, we consider the monocular setting, i.e. single-view input, for 3D human pose estimation (HPE). Here, the task is to predict a 3D point set of human skeletal joints from a single 2D input image. While by definition this is an ill-posed problem, recent work has presented methods that solve it with up to several-centimetre error. Typically, these methods employ a two-step approach, where the first step is to detect the 2D skeletal joints in the input image, followed by the step of 2D-to-3D lifting. We find that common lifting models fail when encountering a rotated input. We argue that learning a single human pose along with its in-plane rotations is considerably easier and more geometrically grounded than directly learning a point-to-point mapping. Furthermore, our intuition is that endowing the model with the notion of rotation equivariance without explicitly constraining its parameter space should lead to a more straightforward learning process than one with equivariance by design. Utilising the common HPE benchmarks, we confirm that the 2D rotation equivariance per se improves the model performance on human poses akin to rotations in the image plane, and can be efficiently and straightforwardly learned by augmentation, outperforming state-of-the-art equivariant-by-design methods.
zh
[CV-44] owards Visually Explaining Statistical Tests with Applications in Biomedical Imaging
【速读】:该论文旨在解决深度神经网络在两样本检验中缺乏可解释性的问题,尤其是在生物医学图像分析中,传统后处理解释方法依赖于类别标签,无法适用于无标签的统计检验场景。其解决方案的关键在于提出一种可解释的深度统计检验框架,通过引入样本级和特征级解释,揭示哪些个体样本和输入特征驱动了组间分布差异;该框架不仅能识别对检测结果影响最大的样本,还能定位图像中与疾病相关的变化区域,从而实现空间和实例层面的决策洞察,推动可解释人工智能与统计推断的融合,支持医学影像中的无标签群体分析。
链接: https://arxiv.org/abs/2601.13899
作者: Masoumeh Javanbakhat,Piotr Komorowski,Dilyara Bareeva,Wei-Chang Lai,Wojciech Samek,Christoph Lippert
机构: Hasso-Plattner Institute (哈索普拉特纳研究所); Fraunhofer Heinrich Hertz Institute (弗劳恩霍夫海因里希·赫兹研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep neural two-sample tests have recently shown strong power for detecting distributional differences between groups, yet their black-box nature limits interpretability and practical adoption in biomedical analysis. Moreover, most existing post-hoc explainability methods rely on class labels, making them unsuitable for label-free statistical testing settings. We propose an explainable deep statistical testing framework that augments deep two-sample tests with sample-level and feature-level explanations, revealing which individual samples and which input features drive statistically significant group differences. Our method highlights which image regions and which individual samples contribute most to the detected group difference, providing spatial and instance-wise insight into the test’s decision. Applied to biomedical imaging data, the proposed framework identifies influential samples and highlights anatomically meaningful regions associated with disease-related variation. This work bridges statistical inference and explainable AI, enabling interpretable, label-free population analysis in medical imaging.
zh
[CV-45] OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
【速读】:该论文旨在解决开放词汇变化检测(Open-Vocabulary Change Detection, OVCD)中依赖预定义类别、模型融合复杂且特征匹配不稳定的问题。现有方法多基于CLIP进行类别识别,并需额外引入如DINO等模型提取特征,导致系统复杂性和不稳定性增加。解决方案的关键在于提出一个独立框架OmniOVCD,利用Segment Anything Model 3(SAM 3)的解耦输出头,设计了协同融合到实例解耦(Synergistic Fusion to Instance Decoupling, SFID)策略:首先融合SAM 3的语义、实例和存在输出以构建土地覆盖掩码,再将其分解为独立实例掩码用于变化对比,从而在保持高类别识别精度的同时,确保跨图像的实例级一致性,最终生成准确的变化掩码。
链接: https://arxiv.org/abs/2601.13895
作者: Xu Zhang,Danyang Li,Yingjie Xia,Xiaohang Dong,Hualong Yu,Jianye Wang,Qicheng Li
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Change Detection (CD) is a fundamental task in remote sensing. It monitors the evolution of land cover over time. Based on this, Open-Vocabulary Change Detection (OVCD) introduces a new requirement. It aims to reduce the reliance on predefined categories. Existing training-free OVCD methods mostly use CLIP to identify categories. These methods also need extra models like DINO to extract features. However, combining different models often causes problems in matching features and makes the system unstable. Recently, the Segment Anything Model 3 (SAM 3) is introduced. It integrates segmentation and identification capabilities within one promptable model, which offers new possibilities for the OVCD task. In this paper, we propose OmniOVCD, a standalone framework designed for OVCD. By leveraging the decoupled output heads of SAM 3, we propose a Synergistic Fusion to Instance Decoupling (SFID) strategy. SFID first fuses the semantic, instance, and presence outputs of SAM 3 to construct land-cover masks, and then decomposes them into individual instance masks for change comparison. This design preserves high accuracy in category recognition and maintains instance-level consistency across images. As a result, the model can generate accurate change masks. Experiments on four public benchmarks (LEVIR-CD, WHU-CD, S2Looking, and SECOND) demonstrate SOTA performance, achieving IoU scores of 67.2, 66.5, 24.5, and 27.1 (class-average), respectively, surpassing all previous methods.
zh
[CV-46] Revisiting Multi-Task Visual Representation Learning
【速读】:该论文旨在解决当前视觉表征学习中两大范式的局限性:一方面,视觉-语言模型(如CLIP)在全局语义对齐上表现优异,但缺乏空间精度;另一方面,自监督方法(如MAE、DINO)能捕捉精细的局部结构,却难以建模高层语义上下文。为实现“兼得两者优势”,论文提出MTV(Multi-Task Visual Pretraining)框架,其关键在于通过一个共享骨干网络联合优化视觉-语言对比学习、自监督学习和密集空间监督三个目标,并利用高容量“专家”模型(如Depth Anything V2和OWLv2)自动合成大规模结构化伪标签,从而避免人工标注依赖。这一多任务协同机制显著提升了细粒度空间推理能力,同时保持了全局语义理解性能,验证了基于高质量伪监督的多任务学习是构建通用视觉编码器的有效路径。
链接: https://arxiv.org/abs/2601.13886
作者: Shangzhe Di,Zhonghua Zhai,Weidi Xie
机构: SAI, Shanghai Jiao Tong University (上海交通大学); ByteDance Seed (字节跳动种子团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity “expert” models – such as Depth Anything V2 and OWLv2 – to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves “best-of-both-worlds” performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
zh
[CV-47] OCCAM: Class-Agnostic Training-Free Prior-Free and Multi-Class Object Counting
【速读】:该论文旨在解决类别无关的物体计数(Class-Agnostic object Counting, CAC)问题,即在不依赖特定类别先验知识的情况下,准确统计图像中任意类别的物体实例数量。传统方法通常局限于单类场景、需大量深度学习模型训练,并依赖视觉示例或文本提示等额外信息,而本文提出OCCAM——首个无需训练且不依赖任何补充信息的CAC方法,同时支持多类物体计数。其核心创新在于利用Segment Anything Model 2(SAM2)作为基础分割模型,并结合自定义基于阈值的First Integer Neighbor Clustering Hierarchy(FINCH)聚类算法,实现端到端的无监督计数,显著提升了跨类别场景下的泛化能力与准确性。
链接: https://arxiv.org/abs/2601.13871
作者: Michail Spanakis,Iason Oikonomidis,Antonis Argyros
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Class-Agnostic object Counting (CAC) involves counting instances of objects from arbitrary classes within an image. Due to its practical importance, CAC has received increasing attention in recent years. Most existing methods assume a single object class per image, rely on extensive training of large deep learning models and address the problem by incorporating additional information, such as visual exemplars or text prompts. In this paper, we present OCCAM, the first training-free approach to CAC that operates without the need of any supplementary information. Moreover, our approach addresses the multi-class variant of the problem, as it is capable of counting the object instances in each and every class among arbitrary object classes within an image. We leverage Segment Anything Model 2 (SAM2), a foundation model, and a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm to achieve competitive performance on widely used benchmark datasets, FSC-147 and CARPK. We propose a synthetic multi-class dataset and F1 score as a more suitable evaluation metric. The code for our method and the proposed synthetic dataset will be made publicly available at this https URL.
zh
[CV-48] Probabilistic Deep Discriminant Analysis for Wind Blade Segmentation ICASSP2026
【速读】:该论文旨在解决线性判别分析(Linear Discriminant Analysis, LDA)在处理非线性可分数据时性能受限的问题。其解决方案的关键在于提出深度判别分析(Deep Discriminant Analysis, DDA),通过深度神经网络直接优化Fisher判别准则,从而实现对复杂数据分布的有效特征提取与分类。为确保训练稳定性并避免计算不稳定性,作者引入了有符号类间方差、使用Sigmoid函数约束输出,并将乘法关系转化为加法形式,进而设计出两种稳定的DDA损失函数,并进一步结合概率损失构建概率化深度判别分析(Probabilistic DDA, PDDA)。PDDA能够显著降低类别重叠,减小类内方差,提升预测置信度,在风力叶片分割任务中展现出优越的性能和一致性,是首个将DDA应用于图像分割的研究。
链接: https://arxiv.org/abs/2601.13852
作者: Raül Pérez-Gonzalo,Andreas Espersen,Antonio Agudo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICASSP 2026
Abstract:Linear discriminant analysis improves class separability but struggles with non-linearly separable data. To overcome this, we introduce Deep Discriminant Analysis (DDA), which directly optimizes the Fisher criterion utilizing deep networks. To ensure stable training and avoid computational instabilities, we incorporate signed between-class variance, bound outputs with a sigmoid function, and convert multiplicative relationships into additive ones. We present two stable DDA loss functions and augment them with a probability loss, resulting in Probabilistic DDA (PDDA). PDDA effectively minimizes class overlap in output distributions, producing highly confident predictions with reduced within-class variance. When applied to wind blade segmentation, PDDA showcases notable advances in performance and consistency, critical for wind energy maintenance. To our knowledge, this is the first application of DDA to image segmentation.
zh
[CV-49] DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes
【速读】:该论文旨在解决生成式 AI 在灾害响应场景中感知与推理能力不足的问题,尤其是在复杂、高风险情境下,现有视觉问答(Visual Question Answering, VQA)模型的适用性尚不明确。其解决方案的关键在于构建了一个名为 DisasterVQA 的基准数据集,该数据集包含 1,395 张真实灾害图像和 4,405 对专家标注的问答对,覆盖洪水、火灾、地震等多种灾害类型,并基于 FEMA ESF 和 OCHA MIRA 等人道主义框架设计了涵盖态势感知与操作决策任务的二分类、多选题和开放式问题。通过在该数据集上对七种先进视觉语言模型进行基准测试,研究揭示了模型在细粒度定量推理、物体计数及上下文敏感理解方面的显著短板,尤其在低频灾害类别中表现不佳,从而为开发更具鲁棒性和实际应用价值的灾害响应视觉语言模型提供了关键评估工具和改进方向。
链接: https://arxiv.org/abs/2601.13839
作者: Aisha Al-Mohannadi,Ayisha Firoz,Yin Yang,Muhammad Imran,Ferda Ofli
机构: 1. Qatar Computing Research Institute (卡塔尔计算研究研究所); 2. University of California, Berkeley (加州大学伯克利分校); 3. University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at this https URL.
zh
[CV-50] FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation
【速读】:该论文旨在解决当前基于3D Gaussian的头部虚拟形象(head avatar)建模中,高保真度生成效率低的问题。现有方法通常依赖于复杂的多视角采集设备或在推理阶段对每个个体进行优化,限制了其在未见主体上的可扩展性和易用性。解决方案的关键在于提出一种前馈式(feed-forward)方法——\OURS,该方法仅需少量输入图像即可生成高质量的3D Gaussian头部虚拟形象,并支持实时动画。其核心创新包括:1)通过基于Transformer的编码器融合DINOv3与Stable Diffusion VAE提取的图像特征,直接学习像素级的Gaussian表示以聚合多视角信息;2)引入轻量级MLP动态网络,利用表情码预测3D Gaussian的变形,实现高效实时动画;3)借助预训练大模型生成的点图(point maps)作为几何监督信号,提升头部表面的几何平滑性。
链接: https://arxiv.org/abs/2601.13837
作者: Xinya Ji,Sebastian Weiss,Manuel Kansy,Jacek Naruniec,Xun Cao,Barbara Solenthaler,Derek Bradley
机构: Nanjing University (南京大学); ETH Zürich (苏黎世联邦理工学院); DisneyResearch|Studios (迪士尼研究院|工作室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose \OURS, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.
zh
[CV-51] Discriminant Learning-based Colorspace for Blade Segmentation ICASSP2026
【速读】:该论文旨在解决图像分割中因颜色表示不佳而导致的准确性受限问题,尤其针对特定领域(如风力涡轮机叶片)的分割任务。其解决方案的关键在于提出了一种新的多维非线性判别分析算法——Colorspace Discriminant Analysis (CSDA),该方法通过最大化类间可分性并最小化类内差异,定制化地优化颜色空间表示;同时引入三种替代损失函数以实现颜色空间与分割过程的端到端联合优化,从而提升模型训练稳定性与分割精度。
链接: https://arxiv.org/abs/2601.13816
作者: Raül Pérez-Gonzalo,Andreas Espersen,Antonio Agudo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICASSP 2026
Abstract:Suboptimal color representation often hinders accurate image segmentation, yet many modern algorithms neglect this critical preprocessing step. This work presents a novel multidimensional nonlinear discriminant analysis algorithm, Colorspace Discriminant Analysis (CSDA), for improved segmentation. Extending Linear Discriminant Analysis into a deep learning context, CSDA customizes color representation by maximizing multidimensional signed inter-class separability while minimizing intra-class variability through a generalized discriminative loss. To ensure stable training, we introduce three alternative losses that enable end-to-end optimization of both the discriminative colorspace and segmentation process. Experiments on wind turbine blade data demonstrate significant accuracy gains, emphasizing the importance of tailored preprocessing in domain-specific segmentation.
zh
[CV-52] Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders
【速读】:该论文旨在解决语言对齐的视觉基础模型(language-aligned vision foundation models)决策过程缺乏可解释性的问题,尤其是现有方法在概念分解上空间定位不准确且仅适用于图像分类任务。其解决方案的关键在于提出Insight模型,该模型通过层次稀疏自编码器(hierarchical sparse autoencoder)与具备强语义表征能力的基础模型相结合,自动提取多粒度的人类可理解且空间上精确定位的概念;同时利用概念间的局部共现依赖关系定义概念关联,从而优化概念命名并生成更丰富的解释,最终在分类和分割任务上实现与黑箱模型相当的性能,同时提供高质量的概念级解释。
链接: https://arxiv.org/abs/2601.13798
作者: Kai Wittenmayer,Sukrut Rao,Amin Parchami-Araghi,Bernt Schiele,Jonas Fischer
机构: Max Planck Institute for Informatics (马普信息研究所); Saarland Informatics Campus (萨尔兰计算机科学校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 24 figures, 3 tables
Abstract:Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making interpreting their decision-making hard. Recent works decompose these representations into human-interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose Insight, a language-aligned concept foundation model that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. We leverage a hierarchical sparse autoencoder and a foundation model with strong semantic representations to automatically extract concepts at various granularities. Examining local co-occurrence dependencies of concepts allows us to define concept relationships. Through these relations we further improve concept naming and obtain richer explanations. On benchmark data, we show that Insight provides performance on classification and segmentation that is competitive with opaque foundation models while providing fine-grained, high quality concept-based explanations. Code is available at this https URL.
zh
[CV-53] PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval
【速读】:该论文旨在解决**组合视频检索(Composed Video Retrieval, CoVR)任务中现有方法未能充分挖掘现代视觉-语言模型(Vision-Language Models, VLMs)潜力的问题,包括使用过时架构、依赖计算昂贵的微调以及生成描述文本效率低下等局限。其解决方案的关键在于提出一种名为PREGEN(PRE GENeration extraction)**的高效且强大的框架:通过将一个冻结的预训练VLM与一个轻量级编码器相结合,无需对VLM进行任何微调;具体地,将查询视频和修改文本输入VLM后,提取每一层最后token的隐藏状态,并利用简单编码器对这些池化表示进行训练,从而生成语义丰富且紧凑的嵌入用于检索。该方法在标准CoVR基准上显著超越此前所有方法,在Recall@1指标上提升达+27.23和+69.59,且在不同VLM主干网络下表现出鲁棒性,并具备对复杂文本修改的强大零样本泛化能力。
链接: https://arxiv.org/abs/2601.13797
作者: Gabriele Serussi,David Vainshtein,Jonathan Kouchly,Dotan Di Castro,Chaim Baskin
机构: Bosch Center for AI, Israel (博世人工智能中心); INSIGHT Lab, Ben-Gurion University of the Negev, Israel (内盖夫本-古里安大学INSIGHT实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
zh
[CV-54] HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection
【速读】:该论文旨在解决通过小型卫星进行连续自然灾害监测时,受限于星载资源(如存储和计算能力)下多时相数据处理的难题,尤其聚焦于洪水检测这一关键灾害管理任务。其解决方案的关键在于提出了一种名为“历史注入机制”(History Injection mechanism for Transformer models, HiT)的新方法,该机制能够在保持历史观测上下文信息的同时,将原始图像数据存储量减少超过99%,从而显著降低对星载存储的需求;同时,基于Prithvi-tiny基础模型集成HiT模块后,在STTORM-CD洪水数据集上验证了其检测精度与双时相基准相当,并在Jetson Orin Nano硬件平台上实现了43 FPS的推理速度,证明了该方案在纳米卫星上的可行性与实用性。
链接: https://arxiv.org/abs/2601.13751
作者: Daniel Kyselica,Jonáš Herec,Oliver Kutis,Rado Pitoňák
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 9 figures, submitted to conference
Abstract:Natural disaster monitoring through continuous satellite observation requires processing multi-temporal data under strict operational constraints. This paper addresses flood detection, a critical application for hazard management, by developing an onboard change detection system that operates within the memory and computational limits of small satellites. We propose History Injection mechanism for Transformer models (HiT), that maintains historical context from previous observations while reducing data storage by over 99% of original image size. Moreover, testing on the STTORM-CD flood dataset confirms that the HiT mechanism within the Prithvi-tiny foundation model maintains detection accuracy compared to the bitemporal baseline. The proposed HiT-Prithvi model achieved 43 FPS on Jetson Orin Nano, a representative onboard hardware used in nanosats. This work establishes a practical framework for satellite-based continuous monitoring of natural disasters, supporting real-time hazard assessment without dependency on ground-based processing infrastructure. Architecture as well as model checkpoints is available at this https URL
zh
[CV-55] Facial Spatiotemporal Graphs: Leverag ing the 3D Facial Surface for Remote Physiological Measurement
【速读】:该论文旨在解决现有面部远程光电容积脉搏波(rPPG)方法在建模过程中未能显式对齐其感受野与三维面部表面的问题,从而导致生理信号估计的鲁棒性和泛化能力受限。解决方案的关键在于提出一种新的时空图结构(Facial Spatiotemporal Graph, STGraph),该结构通过编码3D面部网格序列中的颜色和几何结构,实现面向面部表面的时空处理;并进一步设计轻量级时空图卷积网络MeshPhys,在STGraph上进行生理信号估计。实验证明,将模型感受野约束于面部表面作为结构先验可显著提升性能,且基于3D感知的节点特征对于准确编码面部表面颜色至关重要,二者共同构成了一个原理清晰、鲁棒性强、可解释性高的面部rPPG建模新范式。
链接: https://arxiv.org/abs/2601.13724
作者: Sam Cantrill,David Ahmedt-Aristizabal,Lars Petersson,Hanna Suominen,Mohammad Ali Armin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial remote photoplethysmography (rPPG) methods estimate physiological signals by modeling subtle color changes on the 3D facial surface over time. However, existing methods fail to explicitly align their receptive fields with the 3D facial surface-the spatial support of the rPPG signal. To address this, we propose the Facial Spatiotemporal Graph (STGraph), a novel representation that encodes facial color and structure using 3D facial mesh sequences-enabling surface-aligned spatiotemporal processing. We introduce MeshPhys, a lightweight spatiotemporal graph convolutional network that operates on the STGraph to estimate physiological signals. Across four benchmark datasets, MeshPhys achieves state-of-the-art or competitive performance in both intra- and cross-dataset settings. Ablation studies show that constraining the model’s receptive field to the facial surface acts as a strong structural prior, and that surface-aligned, 3D-aware node features are critical for robustly encoding facial surface color. Together, the STGraph and MeshPhys constitute a novel, principled modeling paradigm for facial rPPG, enabling robust, interpretable, and generalizable estimation. Code is available at this https URL .
zh
[CV-56] Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agent ic Search
【速读】:该论文旨在解决长视频理解中因超长上下文窗口导致的语义碎片化与全局连贯性丧失问题,尤其针对现有基于简单分块策略与检索增强生成(Retrieval-Augmented Generation, RAG)的方法在处理长视频时难以保持时空一致性与精细实体追踪的局限。其解决方案的关键在于提出一个统一框架HAVEN,通过融合多模态实体关联(audiovisual entity cohesion)与分层视频索引(hierarchical video indexing),构建从全局摘要到具体实体的结构化表示体系,并引入代理式搜索机制(agentic search)实现跨层级动态检索与推理,从而支持连贯的叙事重建与细粒度实体跟踪,显著提升长视频理解任务中的时间一致性、实体一致性和检索效率,在LVBench基准上达到84.1%的整体准确率,其中推理类任务更是达到80.1%的先进水平。
链接: https://arxiv.org/abs/2601.13719
作者: Xinlei Yin,Xiulian Peng,Xiao Li,Zhiwei Xiong,Yan Lu
机构: University of Science and Technology of China (中国科学技术大学); Microsoft Research Asia (微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions relying on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
zh
[CV-57] MVGD-Net: A Novel Motion-aware Video Glass Surface Detection Network AAAI-26 AAAI
【速读】:该论文旨在解决视频中玻璃表面检测(Video Glass Surface Detection, VGSD)的问题,即如何准确识别视觉系统在复杂场景下因玻璃反射或透射导致的干扰区域。其解决方案的关键在于利用玻璃表面引起的运动不一致性:由于物体在玻璃反射层或透射层中呈现更远距离,其在视频帧间的运动速度会慢于非玻璃区域中的物体。基于此物理特性,作者提出MVGD-Net网络,通过三个核心模块增强对时空特征的建模能力——包括跨尺度多模态融合模块(CMFM)整合空间特征与光流信息、历史引导注意力模块(HGAM)和时序交叉注意力模块(TCAM)强化时间维度特征,并引入时空解码器(TSD)融合空间与时间特征生成玻璃区域掩膜,从而实现高精度的玻璃检测。
链接: https://arxiv.org/abs/2601.13715
作者: Yiwei Lu,Hao Huang,Tao Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26). It contians 9 pages, 11 figures
Abstract:Glass surface ubiquitous in both daily life and professional environments presents a potential threat to vision-based systems, such as robot and drone navigation. To solve this challenge, most recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther from the glass surfaces. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM) that integrates extracted spatial features and estimated optical flow maps, the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhances temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for learning our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
zh
[CV-58] Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)中因语言先验过度主导视觉证据而导致的幻觉问题,如物体误识别和视觉不一致描述。解决方案的关键在于提出注意力空间对比引导(Attention-space Contrastive Guidance, ACG),这是一种单次前向传播机制,通过在自注意力层内构建视觉-语言与纯语言两条注意力路径,并利用正交化修正消除语言路径引入的近似偏差,从而增强视觉信息的贡献,实现对生成文本的视觉锚定与语义忠实性调控。该方法在保持高生成质量的同时显著降低计算开销,相较以往需要多次前向传播的对比解码方法,延迟最多减少2倍。
链接: https://arxiv.org/abs/2601.13707
作者: Yujin Jo,Sangyoon Bae,Taesup Kim
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model’s internal behavior by reducing over-dependence on language priors and contrasting visually grounded with language-only representations. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration enables computationally efficient guidance directly embedded in the model’s representation contextualization. To correct approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative, reducing latency by up to 2x compared to prior contrastive decoding methods that require multiple forward passes.
zh
[CV-59] ParkingTwin: Training-Free Streaming 3D Reconstruction for Parking-Lot Digital Twins
【速读】:该论文针对自动代客泊车(Automated Valet Parking, AVP)场景中高保真停车场数字孪生构建所面临的三难困境:前向视图稀疏导致弱视差和几何 ill-posed 问题;动态遮挡与极端光照干扰纹理融合稳定性;以及神经渲染通常依赖昂贵的离线优化,难以满足边缘侧实时流式处理需求。解决方案的关键在于提出 ParkingTwin,一个无需训练、轻量级的在线 3D 重建系统:首先,基于 OpenStreetMap(OSM)语义拓扑先验驱动的几何构造方法,直接生成度量一致的 TSDF(Truncated Signed Distance Field),替代盲目的几何搜索并避免复杂优化;其次,采用几何感知的动态滤波机制,通过法向量/高度/深度一致性四模约束场实现实时剔除移动车辆和瞬态遮挡;最后,在 CIELAB 色彩空间中实现光照鲁棒融合,利用自适应 L 通道加权与深度梯度抑制策略解耦亮度与色度,显著降低突变光照下的接缝。该方案在入门级 GPU(GTX 1660)上达到 30+ FPS,相较当前最优的 3D 高斯泼溅(3DGS)方法,端到端速度提升约 15 倍,GPU 显存降低 83.3%,并输出可兼容 Unity/Unreal 数字孪生管线的显式三角网格。
链接: https://arxiv.org/abs/2601.13706
作者: Xinhao Liu,Yu Wang,Xiansheng Guo,Gordon Owusu Boateng,Yu Cao,Haonan Si,Xingchen Guo,Nirwan Ansari
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 10 figures. Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. Under review
Abstract:High-fidelity parking-lot digital twins provide essential priors for path planning, collision checking, and perception validation in Automated Valet Parking (AVP). Yet robot-oriented reconstruction faces a trilemma: sparse forward-facing views cause weak parallax and ill-posed geometry; dynamic occlusions and extreme lighting hinder stable texture fusion; and neural rendering typically needs expensive offline optimization, violating edge-side streaming constraints. We propose ParkingTwin, a training-free, lightweight system for online streaming 3D reconstruction. First, OSM-prior-driven geometric construction uses OpenStreetMap semantic topology to directly generate a metric-consistent TSDF, replacing blind geometric search with deterministic mapping and avoiding costly optimization. Second, geometry-aware dynamic filtering employs a quad-modal constraint field (normal/height/depth consistency) to reject moving vehicles and transient occlusions in real time. Third, illumination-robust fusion in CIELAB decouples luminance and chromaticity via adaptive L-channel weighting and depth-gradient suppression, reducing seams under abrupt lighting changes. ParkingTwin runs at 30+ FPS on an entry-level GTX 1660. On a 68,000 m^2 real-world dataset, it achieves SSIM 0.87 (+16.0%), delivers about 15x end-to-end speedup, and reduces GPU memory by 83.3% compared with state-of-the-art 3D Gaussian Splatting (3DGS) that typically requires high-end GPUs (RTX 4090D). The system outputs explicit triangle meshes compatible with Unity/Unreal digital-twin pipelines. Project page: this https URL
zh
[CV-60] Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles
【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, LVLMs)在复杂推理能力评估中存在的不足,特别是缺乏可控、可验证的诊断工具来系统性地衡量其抽象思维、规则发现与逻辑推理等核心认知能力。解决方案的关键在于将视觉谜题(visual puzzles)作为统一的认知诊断框架,通过将其按推理机制(归纳、类比、算法、演绎及几何/空间推理)分类,明确不同谜题设计所对应的认知操作,并基于此整合现有基准的数据,揭示当前模型在泛化鲁棒性、感知与推理耦合度以及解释流畅性与执行准确性之间存在的显著差距,从而为未来开发更先进的多模态推理系统提供方向指引。
链接: https://arxiv.org/abs/2601.13705
作者: Maria Lymperaiou,Vasileios Karampinis,Giorgos Filandrianos,Angelos Vlachos,Chrysoula Zerva,Athanasios Voulodimos
机构: National Technical University of Athens (雅典国立技术大学); Instituto de Telecomunicações, Lisbon, Portugal (里斯本电信研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Puzzles have long served as compact and revealing probes of human cognition, isolating abstraction, rule discovery, and systematic reasoning with minimal reliance on prior knowledge. Leveraging these properties, visual puzzles have recently emerged as a powerful diagnostic tool for evaluating the reasoning abilities of Large Vision-Language Models (LVLMs), offering controlled, verifiable alternatives to open-ended multimodal benchmarks. This survey provides a unified perspective of visual puzzle reasoning in LVLMs. We frame visual puzzles through a common abstraction and organize existing benchmarks by the reasoning mechanisms they target (inductive, analogical, algorithmic, deductive, and geometric/spatial), thereby linking puzzle design to the cognitive operations required for solving. Synthesizing empirical evidence across these categories, we identify consistent limitations in current models, including brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution. By framing visual puzzles as diagnostic instruments rather than task formats, this survey elaborates on the state of LVLM reasoning and outlines key directions for future benchmarks and reasoning-aware multimodal systems.
zh
[CV-61] Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation
【速读】:该论文旨在解决线性扩散变压器(Linear Diffusion Transformers, LiTs)在图像生成任务中因注意力机制过度平滑(over-smoothed attention weights)而导致的生成质量下降问题,该问题限制了模型的表达能力并影响了高保真度图像生成效果。解决方案的关键在于提出一种新的线性注意力机制——动态差分线性注意力(Dynamic Differential Linear Attention, DyDiLA),其核心创新包括:(i) 动态投影模块,通过动态分配知识实现token表示的解耦;(ii) 动态度量核,基于动态分配的核函数更精细地捕捉token间的语义差异;(iii) token差分算子,通过计算token与其由动态度量核产生的冗余信息之间的差异,增强query到key的检索鲁棒性。基于此,作者进一步构建了DyDi-LiT模型,实验证明其在多个指标上均优于当前最先进(SOTA)方法,显著提升了LiTs的生成性能与实用性。
链接: https://arxiv.org/abs/2601.13683
作者: Boyuan Cao,Xingbo Yao,Chenhui Wang,Jiaxin Ye,Yujie Wei,Hongming Shan
机构: Fudan University (复旦大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been adopted to reduce computational cost; unfortunately, the resulting linear diffusion transformers (LiTs) models often come at the expense of generative performance, frequently producing over-smoothed attention weights that limit expressiveness. In this work, we introduce Dynamic Differential Linear Attention (DyDiLA), a novel linear attention formulation that enhances the effectiveness of LiTs by mitigating the oversmoothing issue and improving generation quality. Specifically, the novelty of DyDiLA lies in three key designs: (i) dynamic projection module, which facilitates the decoupling of token representations by learning with dynamically assigned knowledge; (ii) dynamic measure kernel, which provides a better similarity measurement to capture fine-grained semantic distinctions between tokens by dynamically assigning kernel functions for token processing; and (iii) token differential operator, which enables more robust query-to-key retrieval by calculating the differences between the tokens and their corresponding information redundancy produced by dynamic measure kernel. To capitalize on DyDiLA, we introduce a refined LiT, termed DyDi-LiT, that systematically incorporates our advancements. Extensive experiments show that DyDi-LiT consistently outperforms current state-of-the-art (SOTA) models across multiple metrics, underscoring its strong practical potential.
zh
[CV-62] Finally Outshining the Random Baseline: A Simple and Effective Solution for Active Learning in 3D Biomedical Imaging
【速读】:该论文旨在解决3D生物医学图像分割中主动学习(Active Learning, AL)方法难以持续优于改进的随机采样基线的问题,尤其是在标注成本高昂且专家标注耗时的情况下。现有AL方法常因类别不平衡和早期查询冗余而导致性能瓶颈,无法稳定提升分割质量与标注效率。其解决方案的关键在于提出一种名为“类分层调度幂预测熵”(Class-stratified Scheduled Power Predictive Entropy, ClaSP PE)的查询策略:一方面通过类分层采样确保对低频结构的覆盖,另一方面引入对数尺度幂噪声并采用衰减调度机制,在早期阶段强制查询多样性、后期阶段促进利用,从而有效缓解类不平衡与冗余问题。实验表明,ClaSP PE在24个不同设置下均显著优于适配3D数据的随机基线,并在未见过的数据集上无需调参即可泛化,具备实际部署价值。
链接: https://arxiv.org/abs/2601.13677
作者: Carsten T. Lüth,Jeremias Traub,Kim-Celine Kahl,Till J. Bungert,Lukas Klein,Lars Krämer,Paul F. Jäger,Klaus Maier-Hein,Fabian Isensee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at TMLR
Abstract:Active learning (AL) has the potential to drastically reduce annotation costs in 3D biomedical image segmentation, where expert labeling of volumetric data is both time-consuming and expensive. Yet, existing AL methods are unable to consistently outperform improved random sampling baselines adapted to 3D data, leaving the field without a reliable solution. We introduce Class-stratified Scheduled Power Predictive Entropy (ClaSP PE), a simple and effective query strategy that addresses two key limitations of standard uncertainty-based AL methods: class imbalance and redundancy in early selections. ClaSP PE combines class-stratified querying to ensure coverage of underrepresented structures and log-scale power noising with a decaying schedule to enforce query diversity in early-stage AL and encourage exploitation later. In our evaluation on 24 experimental settings using four 3D biomedical datasets within the comprehensive nnActive benchmark, ClaSP PE is the only method that generally outperforms improved random baselines in terms of both segmentation quality with statistically significant gains, whilst remaining annotation efficient. Furthermore, we explicitly simulate the real-world application by testing our method on four previously unseen datasets without manual adaptation, where all experiment parameters are set according to predefined guidelines. The results confirm that ClaSP PE robustly generalizes to novel tasks without requiring dataset-specific tuning. Within the nnActive framework, we present compelling evidence that an AL method can consistently outperform random baselines adapted to 3D segmentation, in terms of both performance and annotation efficiency in a realistic, close-to-production scenario. Our open-source implementation and clear deployment guidelines make it readily applicable in practice. Code is at this https URL.
zh
[CV-63] ransformer based Multi-task Fusion Network for Food Spoilage Detection and Shelf life Forecasting
【速读】:该论文旨在解决农业供应链中食品浪费问题,核心挑战在于如何准确实现蔬菜分类、腐败检测与保质期预测的多任务协同处理。解决方案的关键在于提出融合架构,将卷积神经网络(CNN)分别与长短期记忆网络(LSTM)和DeiT Transformer相结合,以同时完成上述三项任务。实验表明,所提出的CNN+DeiT Transformer模型在蔬菜分类(F1-score=0.98)、腐败检测(F1-score=0.61)及腐败预测(MSE=3.58,SMAPE=41.66%)上均优于多种主流深度学习模型,且具备对噪声图像的鲁棒性,并通过LIME可视化机制增强了决策可解释性。
链接: https://arxiv.org/abs/2601.13665
作者: Mounika Kanulla,Rajasree Dadigi,Sailaja Thota,Vivek Yelleti
机构: SRM University, Andhra Pradesh, India
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Food wastage is one of the critical challenges in the agricultural supply chain, and accurate and effective spoilage detection can help to reduce it. Further, it is highly important to forecast the spoilage information. This aids the longevity of the supply chain management in the agriculture field. This motivated us to propose fusion based architectures by combining CNN with LSTM and DeiT transformer for the following multi-tasks simultaneously: (i) vegetable classification, (ii) food spoilage detection, and (iii) shelf life forecasting. We developed a dataset by capturing images of vegetables from their fresh state until they were completely spoiled. From the experimental analysis it is concluded that the proposed fusion architectures CNN+CNN-LSTM and CNN+DeiT Transformer outperformed several deep learning models such as CNN, VGG16, ResNet50, Capsule Networks, and DeiT Transformers. Overall, CNN + DeiT Transformer yielded F1-score of 0.98 and 0.61 in vegetable classification and spoilage detection respectively and mean squared error (MSE) and symmetric mean absolute percentage error (SMAPE) of 3.58, and 41.66% respectively in spoilage forecasting. Further, the reliability of the fusion models was validated on noisy images and integrated with LIME to visualize the model decisions.
zh
[CV-64] VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement CVPR2026
【速读】:该论文旨在解决多视角条件下体素(voxel)修复问题,即利用校准的多视角图像作为引导,修复不完整且含噪的体素数据。其核心挑战在于如何实现跨模态的精准对齐与高效融合,并学习从噪声体素到高质量重建之间的直接优化路径。解决方案的关键在于三个创新设计:一是引入图像索引(Image Index),为2D图像token提供显式的3D空间定位信息;二是提出修正流(Correctional Flow)目标函数,学习端到端的体素精修轨迹;三是构建混合流Transformer(Hybrid Stream Transformer),实现鲁棒的跨模态特征融合,从而显著提升修复精度与泛化能力。
链接: https://arxiv.org/abs/2601.13664
作者: Tiancheng Fang,Bowen Pan,Lingxi Chen,Jiangjing Lyu,Chengfei Lyu,Chaoyue Niu,Fan Wu
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review at CVPR 2026
Abstract:We propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement–the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in large-model, big-data wave.
zh
[CV-65] Face-Voice Association with Inductive Bias for Maximum Class Separation ICASSP2026
【速读】:该论文旨在解决多模态学习中人脸与语音关联(face-voice association)任务的表征分离问题,即如何使同一说话人的面部和语音嵌入(embedding)尽可能接近,而不同说话人的嵌入则尽可能分离。传统方法依赖损失函数实现这一目标,但未充分利用类别间最大分离作为归纳偏置(inductive bias)来增强嵌入的判别能力。本文的关键解决方案是首次将“最大类间分离”作为归纳偏置引入到人脸-语音关联任务中,通过优化多模态表示的空间分布,强制不同说话人嵌入在特征空间中达到最大间隔,从而提升模型的判别性能。实验表明,该方法在两种任务设定下均达到当前最优(SOTA)效果,并且结合类间正交性损失时效果最佳,为多模态学习建立了一种新的范式。
链接: https://arxiv.org/abs/2601.13651
作者: Marta Moscati,Oleksandr Kats,Mubashir Noman,Muhammad Zaigham Zaheer,Yufang Hou,Markus Schedl,Shah Nawaz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICASSP 2026
Abstract:Face-voice association is widely studied in multimodal learning and is approached representing faces and voices with embeddings that are close for a same person and well separated from those of others. Previous work achieved this with loss functions. Recent advancements in classification have shown that the discriminative ability of embeddings can be strengthened by imposing maximum class separation as inductive bias. This technique has never been used in the domain of face-voice association, and this work aims at filling this gap. More specifically, we develop a method for face-voice association that imposes maximum class separation among multimodal representations of different speakers as an inductive bias. Through quantitative experiments we demonstrate the effectiveness of our approach, showing that it achieves SOTA performance on two task formulation of face-voice association. Furthermore, we carry out an ablation study to show that imposing inductive bias is most effective when combined with losses for inter-class orthogonality. To the best of our knowledge, this work is the first that applies and demonstrates the effectiveness of maximum class separation as an inductive bias in multimodal learning; it hence paves the way to establish a new paradigm.
zh
[CV-66] Quadratic Upper Bound for Boosting Robustness ICML2025
【速读】:该论文旨在解决快速对抗训练(Fast Adversarial Training, FAT)在降低训练时间的同时导致模型鲁棒性下降的问题,其核心原因是FAT对对抗空间的探索不足。解决方案的关键在于提出一种二次上界(Quadratic Upper Bound, QUB)损失函数,该损失函数通过理论推导得到,并可与现有FAT方法结合使用;实验表明,引入QUB损失能显著提升模型鲁棒性,且这种改进很可能源于所得模型损失曲面的平滑化特性。
链接: https://arxiv.org/abs/2601.13645
作者: Euijin You,Hyang-Won Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2025. Published in PMLR 267:72656-72676
Abstract:Fast adversarial training (FAT) aims to enhance the robustness of models against adversarial attacks with reduced training time, however, FAT often suffers from compromised robustness due to insufficient exploration of adversarial space. In this paper, we develop a loss function to mitigate the problem of degraded robustness under FAT. Specifically, we derive a quadratic upper bound (QUB) on the adversarial training (AT) loss function and propose to utilize the bound with existing FAT methods. Our experimental results show that applying QUB loss to the existing methods yields significant improvement of robustness. Furthermore, using various metrics, we demonstrate that this improvement is likely to result from the smoothened loss landscape of the resulting model.
zh
[CV-67] Scaling Test-time Inference for Visual Grounding
【速读】:该论文旨在解决视觉语言模型(Visual Language Models, VLMs)在视觉定位(visual grounding)任务中因模型规模庞大导致部署困难与推理延迟高的问题。现有最优模型通常具有巨大的参数量,尤其是语言模型部分,而视觉编码器尺寸相近,表明性能差距主要源于语言理解能力而非视觉信息处理能力。解决方案的关键在于提出“高效视觉定位语言模型”(Efficient visual Grounding language Models, EGM),通过在测试时动态扩展计算资源(即增加生成的token数量)来提升小模型的表现,而非直接使用大模型。这种方法在保持低延迟的同时显著提升了小模型的端到端性能,实验表明其在RefCOCO基准上达到91.4 IoU且平均延迟仅为737ms(比大型模型快5.9倍),同时在新型无模态定位(amodal grounding)任务中也表现出优于大模型的能力,验证了方法的通用性和高效性。
链接: https://arxiv.org/abs/2601.13633
作者: Guanqi Zhan,Changye Li,Zhijian Liu,Yao Lu,Yi Wu,Song Han,Ligeng Zhu
机构: NVIDIA(英伟达); MIT(麻省理工学院); University of Oxford(牛津大学); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual grounding is an essential capability of Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding visual language models usually have large model sizes, making them heavy for deployment and slow for inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs and the major difference is the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than visual information handling. To mitigate the gap, we introduce ‘Efficient visual Grounding language Models’ (EGM): a method to scale the test-time computation (#generated tokens). Scaling the test-time computation of a small model is deployment-friendly, and yields better end-to-end latency as the cost of each token is much cheaper compared to directly running a large model. On the RefCOCO benchmark, our EGM-Qwen3-VL-8B demonstrates 91.4 IoU with an average of 737ms (5.9x faster) latency while Qwen3-VL-235B demands 4,320ms to achieve 90.5 IoU. To validate our approach’s generality, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of the objects. Experiments show our method can consistently and significantly improve the vanilla grounding and amodal grounding capabilities of small models to be on par with or outperform the larger models, thereby improving the efficiency for visual grounding.
zh
[CV-68] CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在以视觉为中心的任务(如图像分类)中表现不佳的问题,其性能往往低于基础视觉编码器(通常为CLIP-based模型)。解决方案的关键在于提出一种名为“基于集成的上下文感知图像表示优先级策略”(Context-Aware Image Representation Prioritization via Ensemble, CARPE)的新型、与模型无关的框架,该框架引入视觉融合层和上下文感知的集成策略,使模型能够动态判断何时优先使用图像表示,何时依赖语言模型的推理能力。这一设计增强了模型对视觉与文本模态的自适应权重分配能力,并提升了对图像表征多维度信息的捕捉能力,从而在图像分类和视觉语言基准测试中均实现了稳定且显著的泛化性能提升。
链接: https://arxiv.org/abs/2601.13622
作者: Donghee Lee,Rui Cai,Zhe Zhao
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have pushed them closer to becoming general-purpose assistants. Despite their strong performance, LVLMs still struggle with vision-centric tasks such as image classification, underperforming compared to their base vision encoders, which are often CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a novel, model-agnostic framework which introduces vision-integration layers and a context-aware ensemble strategy to identify when to prioritize image representations or rely on the reasoning capabilities of the language model. This design enhances the model’s ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations, leading to consistent improvements in generalization across classification and vision-language benchmarks. Extensive experiments demonstrate that CARPE not only improves performance on image classification benchmarks but also enhances results across various vision-language benchmarks. Finally, CARPE is designed to be effectively integrated with most open-source LVLMs that consist of a vision encoder and a language model, ensuring its adaptability across diverse architectures.
zh
[CV-69] ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
【速读】:该论文旨在解决开源视觉语言模型(Vision Language Models, VLMs)在图表推理能力发展中的关键瓶颈问题,即高质量训练数据的匮乏。现有数据集普遍存在两个缺陷:合成图表结构单一、重复性强,且对应的问答对容易产生幻觉、缺乏复杂任务所需的深层推理能力。解决方案的核心在于提出 ChartVerse 框架,其关键创新包括:(1) 引入滚动后验熵(Rollout Posterior Entropy, RPE)作为量化图表复杂度的新指标,并基于此设计复杂度感知的图表编码器,通过可执行程序自动生成多样且高复杂度的图表;(2) 提出基于真实答案锚定的逆向问答生成机制,采用“先答后问”的范式,从源代码中提取确定性答案,再条件生成问题并强制一致性验证,同时结合模型失败率筛选与链式思维(Chain-of-Thought, CoT)蒸馏策略提升推理深度和质量。
链接: https://arxiv.org/abs/2601.13606
作者: Zheng Liu,Honglin Lin,Chonghan Qin,Xiaoyang Wang,Xin Gao,Yu Li,Mengzhang Cai,Yun Zhu,Zhanping Zhong,Qizhi Pei,Zhuoshi Pan,Xiaoran Shang,Bin Cui,Conghui He,Wentao Zhang,Lijun Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages
Abstract:Chart reasoning is a critical capability for Vision Language Models (VLMs). However, the development of open-source models is severely hindered by the lack of high-quality training data. Existing datasets suffer from a dual challenge: synthetic charts are often simplistic and repetitive, while the associated QA pairs are prone to hallucinations and lack the reasoning depth required for complex tasks. To bridge this gap, we propose ChartVerse, a scalable framework designed to synthesize complex charts and reliable reasoning data from scratch. (1) To address the bottleneck of simple patterns, we first introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. Guided by RPE, we develop complexity-aware chart coder to autonomously synthesize diverse, high-complexity charts via executable programs. (2) To guarantee reasoning rigor, we develop truth-anchored inverse QA synthesis. Diverging from standard generation, we adopt an answer-first paradigm: we extract deterministic answers directly from the source code, generate questions conditional on these anchors, and enforce strict consistency verification. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning. We curate ChartVerse-SFT-600K and ChartVerse-RL-40K using Qwen3-VL-30B-A3B-Thinking as the teacher. Experimental results demonstrate that ChartVerse-8B achieves state-of-the-art performance, notably surpassing its teacher and rivaling the stronger Qwen3-VL-32B-Thinking.
zh
[CV-70] FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning ICCV2025
【速读】:该论文旨在解决增量式遗忘(Incremental Unlearning, IU)中因现有方法仅抑制参数或混淆知识而缺乏对特征空间和梯度空间的显式正交约束,导致“浅层遗忘”(superficial forgetting)的问题,即残留信息仍可被恢复,从而引发安全风险并破坏保留与删除之间的平衡。其解决方案的关键在于提出FG-OrIU框架,首次在特征和梯度两个层面统一引入正交约束以实现深度遗忘(deep forgetting),确保遗忘效果不可逆:通过奇异值分解(SVD)分解特征空间,将遗忘类与保留类特征分离至不同子空间;同时施加双重正交约束——特征正交投影确保遗忘类与保留类互不干扰,梯度正交投影防止遗忘知识在更新过程中被重新引入或破坏保留类表征;此外,动态子空间自适应机制实现新遗忘子空间的融合与保留子空间的收缩,维持多轮序列遗忘任务下的稳定平衡。
链接: https://arxiv.org/abs/2601.13578
作者: Qian Feng,JiaHang Tu,Mintong Kang,Hanbin Zhao,Chao Zhang,Hui Qian
机构: Zhejiang University (浙江大学); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by ICCV 2025. code: \url{ this https URL }
Abstract:Incremental unlearning (IU) is critical for pre-trained models to comply with sequential data deletion requests, yet existing methods primarily suppress parameters or confuse knowledge without explicit constraints on both feature and gradient level, resulting in \textitsuperficial forgetting where residual information remains recoverable. This incomplete forgetting risks security breaches and disrupts retention balance, especially in IU scenarios. We propose FG-OrIU (\textbfFeature-\textbfGradient \textbfOrthogonality for \textbfIncremental \textbfUnlearning), the first framework unifying orthogonal constraints on both features and gradients level to achieve deep forgetting, where the forgetting effect is irreversible. FG-OrIU decomposes feature spaces via Singular Value Decomposition (SVD), separating forgetting and remaining class features into distinct subspaces. It then enforces dual constraints: feature orthogonal projection on both forgetting and remaining classes, while gradient orthogonal projection prevents the reintroduction of forgotten knowledge and disruption to remaining classes during updates. Additionally, dynamic subspace adaptation merges newly forgetting subspaces and contracts remaining subspaces, ensuring a stable balance between removal and retention across sequential unlearning tasks. Extensive experiments demonstrate the effectiveness of our method.
zh
[CV-71] Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation
【速读】:该论文旨在解决开放词汇6D目标位姿估计中因依赖无约束全局匹配策略而导致的模糊性问题,尤其是在开放世界场景下,目标特征易被背景干扰物混淆,从而影响位姿估计精度。解决方案的关键在于从噪声敏感的全局匹配转向空间受限的细粒度块级对应关系建模:首先通过对象中心解耦预处理分离语义目标与环境噪声;其次引入跨视角全局感知(Cross-Perspective Global Perception, CPGP)模块融合双视角特征并建立结构一致性;最后设计块相关性预测器(Patch Correlation Predictor, PCP)生成精确的块级关联图,作为空间滤波器实现抗噪的细粒度匹配。这一框架显著提升了在REAL275和Toyota-Light数据集上的平均召回率,验证了其在复杂开放环境中鲁棒且泛化的感知能力。
链接: https://arxiv.org/abs/2601.13565
作者: Yu Qin,Shimeng Fan,Fan Yang,Zixuan Xue,Zijie Mai,Wenrui Chen,Kailun Yang,Zhiyong Li
机构: Hunan University (湖南大学); Hunan University of Science and Technology (湖南科技大学); National Engineering Research Center of Robot Visual Perception and Control Technology (机器人视觉感知与控制技术国家工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The source code will be made publicly available at this https URL
Abstract:Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrowing the matching scope, effectively filtering out irrelevant clutter to prevent it from degrading pose estimation. Firstly, we introduce an object-centric disentanglement preprocessing to isolate the semantic target from environmental noise. Secondly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning. Finally, we design a Patch Correlation Predictor (PCP) that generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at this https URL.
zh
[CV-72] Reasoning is a Modality
【速读】:该论文试图解决当前主流人工智能系统(如大语言模型和视觉Transformer)在抽象推理任务中缺乏可解释性和结构化推理能力的问题。这些系统本质上是行为序列预测机器,仅通过建模标记统计来匹配可观测行为,而没有持久且可读的内部心理状态,导致其无法像人类一样基于内在状态解释行为,只能生成无根基的后验合理性说明。解决方案的关键在于提出一种角色分离的Transformer模块,将全局控制器令牌(global controller tokens)与网格工作空间令牌(grid workspace tokens)分离,从而实现迭代规则执行机制,使推理成为独立于低层工作空间的模态通道。这一设计在ARC视觉推理任务中显著提升了性能(62.6%准确率,超越人类平均表现60.2%),并展现出更清晰的规则应用结构,验证了控制器驱动式推理的有效性。
链接: https://arxiv.org/abs/2601.13562
作者: Zhiguang Liu,Yi Shang
机构: University of Missouri - Columbia (密苏里大学哥伦比亚分校)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code access: this https URL
Abstract:The Abstraction and Reasoning Corpus (ARC) provides a compact laboratory for studying abstract reasoning, an ability central to human intelligence. Modern AI systems, including LLMs and ViTs, largely operate as sequence-of-behavior prediction machines: they match observable behaviors by modeling token statistics without a persistent, readable mental state. This creates a gap with human-like behavior: humans can explain an action by decoding internal state, while AI systems can produce fluent post-hoc rationalizations that are not grounded in such a state. We hypothesize that reasoning is a modality: reasoning should exist as a distinct channel separate from the low-level workspace on which rules are applied. To test this hypothesis, on solving ARC tasks as a visual reasoning problem, we designed a novel role-separated transformer block that splits global controller tokens from grid workspace tokens, enabling iterative rule execution. Trained and evaluated within the VARC vision-centric protocol, our method achieved 62.6% accuracy on ARC-1, surpassing average human performance (60.2%) and outperforming prior methods significantly. Qualitatively, our models exhibit more coherent rule-application structure than the dense ViT baseline, consistent with a shift away from plausible probability blobs toward controller-driven reasoning.
zh
[CV-73] DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 伪造人脸检测中缺乏对细粒度区域篡改样本的系统研究,以及未充分评估“拼接攻击”(splice attacks)——即真实与伪造样本混合产生的检测规避样本(detector-evasive samples)对检测模型性能影响的问题。解决方案的关键在于构建 DiffFace-Edit 数据集,该数据集包含超过两百万张 AI 生成的假图像,并涵盖八个面部区域(如眼睛、鼻子)的细粒度编辑,支持单区域和多区域组合编辑;同时提出跨域评估框架,结合 IMDL(Image Manipulation Detection Learning)方法,系统分析 detector-evasive samples 对检测模型的影响,从而提升检测模型在复杂篡改场景下的鲁棒性与实用性。
链接: https://arxiv.org/abs/2601.13551
作者: Feng Ding,Wenhui Yi,Xinan He,Mengyao Xiao,Jianfeng Xu,Jianqiang Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models now produce imperceptible, fine-grained manipulated faces, posing significant privacy risks. However, existing AI-generated face datasets generally lack focus on samples with fine-grained regional manipulations. Furthermore, no researchers have yet studied the real impact of splice attacks, which occur between real and manipulated samples, on detectors. We refer to these as detector-evasive samples. Based on this, we introduce the DiffFace-Edit dataset, which has the following advantages: 1) It contains over two million AI-generated fake images. 2) It features edits across eight facial regions (e.g., eyes, nose) and includes a richer variety of editing combinations, such as single-region and multi-region edits. Additionally, we specifically analyze the impact of detector-evasive samples on detection models. We conduct a comprehensive analysis of the dataset and propose a cross-domain evaluation that combines IMDL methods. Dataset will be available at this https URL.
zh
[CV-74] GO-MLVTON: Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models
【速读】:该论文旨在解决多层虚拟试衣(Multi-layer Virtual Try-On, ML-VTON)问题,即如何在人体上合理叠加多层衣物并生成具有真实形变和层次感的视觉结果。传统图像驱动的虚拟试衣方法主要关注单层或多重独立衣物的试穿,忽略了衣物间的遮挡关系与层叠逻辑,导致内层衣物特征干扰外层衣物的拟合效果。解决方案的关键在于提出GO-MLVTON框架,其核心创新包括:1)引入服装遮挡学习模块(Garment Occlusion Learning module),显式建模内外衣物之间的遮挡关系以减少冗余特征干扰;2)设计基于Stable Diffusion的服装变形贴合模块(Garment Morphing Fitting module),实现衣物的精准形变与贴合,从而生成高质量的多层试衣图像。
链接: https://arxiv.org/abs/2601.13524
作者: Yang Yu,Yunze Deng,Yige Zhang,Yanjie Xiao,Youkun Ou,Wenhao Hu,Mingchao Li,Bin Feng,Wenyu Liu,Dandan Zheng,Jingdong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5pages, 3 figures
Abstract:Existing Image-based virtual try-on (VTON) methods primarily focus on single-layer or multi-garment VTON, neglecting multi-layer VTON (ML-VTON), which involves dressing multiple layers of garments onto the human body with realistic deformation and layering to generate visually plausible outcomes. The main challenge lies in accurately modeling occlusion relationships between inner and outer garments to reduce interference from redundant inner garment features. To address this, we propose GO-MLVTON, the first multi-layer VTON method, introducing the Garment Occlusion Learning module to learn occlusion relationships and the StableDiffusion-based Garment Morphing Fitting module to deform and fit garments onto the human body, producing high-quality multi-layer try-on results. Additionally, we present the MLG dataset for this task and propose a new metric named Layered Appearance Coherence Difference (LACD) for evaluation. Extensive experiments demonstrate the state-of-the-art performance of GO-MLVTON. Project page: this https URL.
zh
[CV-75] DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities WACV2026
【速读】:该论文旨在解决遥感(Remote Sensing, RS)领域中多模态学习因模态缺失而导致性能严重下降的问题,尤其针对遥感数据高度异质性和尺度变化巨大的特性,传统解耦学习和知识蒸馏方法难以有效应对。解决方案的关键在于提出一种名为DIS2的新范式,其核心创新是重构了解耦学习与知识蒸馏之间的协同关系,称为DLKD(Disentanglement and Knowledge Distillation),通过显式捕捉补偿性特征并融合可用模态特征,逼近全模态情况下的理想融合表示;同时引入类特定特征学习模块(Classwise Feature Learning Module, CFLM)以自适应地学习不同类别在信号可用性差异下的判别性证据,并结合多层次混合融合结构(Hierarchical Hybrid Fusion, HF)增强预测能力,从而实现主动、有指导的缺失信息补偿机制。
链接: https://arxiv.org/abs/2601.13502
作者: Nhi Kieu,Kien Nguyen,Arnold Wiliem,Clinton Fookes,Sridha Sridharan
机构: Queensland University of Technology (昆士兰科技大学); Shield AI (盾牌人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2026 - Computer Vision for Earth Observation Workshop
Abstract:The efficacy of multimodal learning in remote sensing (RS) is severely undermined by missing modalities. The challenge is exacerbated by the RS highly heterogeneous data and huge scale variation. Consequently, paradigms proven effective in other domains often fail when confronted with these unique data characteristics. Conventional disentanglement learning, which relies on significant feature overlap between modalities (modality-invariant), is insufficient for this heterogeneity. Similarly, knowledge distillation becomes an ill-posed mimicry task where a student fails to focus on the necessary compensatory knowledge, leaving the semantic gap unaddressed. Our work is therefore built upon three pillars uniquely designed for RS: (1) principled missing information compensation, (2) class-specific modality contribution, and (3) multi-resolution feature importance. We propose a novel method DIS2, a new paradigm shifting from modality-shared feature dependence and untargeted imitation to active, guided missing features compensation. Its core novelty lies in a reformulated synergy between disentanglement learning and knowledge distillation, termed DLKD. Compensatory features are explicitly captured which, when fused with the features of the available modality, approximate the ideal fused representation of the full-modality case. To address the class-specific challenge, our Classwise Feature Learning Module (CFLM) adaptively learn discriminative evidence for each target depending on signal availability. Both DLKD and CFLM are supported by a hierarchical hybrid fusion (HF) structure using features across resolutions to strengthen prediction. Extensive experiments validate that our proposed approach significantly outperforms state-of-the-art methods across benchmarks.
zh
[CV-76] Optical Linear Systems Framework for Event Sensing and Computational Neuromorphic Imaging
【速读】:该论文旨在解决事件视觉传感器(event vision sensors)输出的稀疏、异步事件流难以与传统计算成像和光学系统设计中基于线性前向模型的框架集成的问题。其关键解决方案是构建一个物理基础的处理流程,将事件流映射为每像素的对数强度估计及强度导数,并将其嵌入到具有时变点扩散函数(point spread function, PSF)的动态线性系统模型中,从而实现直接从事件数据进行逆滤波,采用频域维纳去卷积(Wiener deconvolution)方法,结合已知或参数化的动态传递函数完成图像重建。该框架为动态光学系统中的事件感知与基于模型的计算成像提供了实用桥梁。
链接: https://arxiv.org/abs/2601.13498
作者: Nimrod Kruger,Nicholas Owen Ralph,Gregory Cohen,Paul Hurley
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Event vision sensors (neuromorphic cameras) output sparse, asynchronous ON/OFF events triggered by log-intensity threshold crossings, enabling microsecond-scale sensing with high dynamic range and low data bandwidth. As a nonlinear system, this event representation does not readily integrate with the linear forward models that underpin most computational imaging and optical system design. We present a physics-grounded processing pipeline that maps event streams to estimates of per-pixel log-intensity and intensity derivatives, and embeds these measurements in a dynamic linear systems model with a time-varying point spread function. This enables inverse filtering directly from event data, using frequency-domain Wiener deconvolution with a known (or parameterised) dynamic transfer function. We validate the approach in simulation for single and overlapping point sources under modulated defocus, and on real event data from a tunable-focus telescope imaging a star field, demonstrating source localisation and separability. The proposed framework provides a practical bridge between event sensing and model-based computational imaging for dynamic optical systems.
zh
[CV-77] Event-based Heterogeneous Information Processing for Online Vision-based Obstacle Detection and Localization
【速读】:该论文旨在解决机器人在非结构化和动态环境中进行视觉导航时,如何实现高精度障碍物检测与定位的同时保持计算效率的问题。其核心挑战在于传统方法难以兼顾环境感知的准确性与实时性,且能耗较高。解决方案的关键在于提出一种双通路混合神经网络架构:其中人工神经网络(ANN)负责处理低频静态空间特征以实现精准环境理解,而脉冲神经网络(SNN)则直接处理事件驱动的动态传感器数据,实现快速、低功耗的状态估计与异常检测;该设计摒弃了传统需依赖域转换机制的混合架构,通过预训练的SNN滤波器直接利用编码为脉冲信号的输入进行定位与状态估计,并结合ANN提供的上下文信息对检测到的异常进行验证与持续跟踪,从而支持前瞻式导航策略。
链接: https://arxiv.org/abs/2601.13451
作者: Reza Ahmadvand,Sarah Safura Sharif,Yaser Mike Banad
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper introduces a novel framework for robotic vision-based navigation that integrates Hybrid Neural Networks (HNNs) with Spiking Neural Network (SNN)-based filtering to enhance situational awareness for unmodeled obstacle detection and localization. By leveraging the complementary strengths of Artificial Neural Networks (ANNs) and SNNs, the system achieves both accurate environmental understanding and fast, energy-efficient processing. The proposed architecture employs a dual-pathway approach: an ANN component processes static spatial features at low frequency, while an SNN component handles dynamic, event-based sensor data in real time. Unlike conventional hybrid architectures that rely on domain conversion mechanisms, our system incorporates a pre-developed SNN-based filter that directly utilizes spike-encoded inputs for localization and state estimation. Detected anomalies are validated using contextual information from the ANN pathway and continuously tracked to support anticipatory navigation strategies. Simulation results demonstrate that the proposed method offers acceptable detection accuracy while maintaining computational efficiency close to SNN-only implementations, which operate at a fraction of the resource cost. This framework represents a significant advancement in neuromorphic navigation systems for robots operating in unpredictable and dynamic environments.
zh
[CV-78] Analyzing VLM-Based Approaches for Anomaly Classification and Segmentation
【速读】:该论文旨在解决工业质量控制中异常检测(Anomaly Detection)对大量标注数据依赖性强、泛化能力弱的问题。传统方法需针对特定缺陷进行任务定制训练,而本研究提出基于视觉-语言模型(Vision-Language Models, VLMs)的零样本(zero-shot)与少样本(few-shot)异常检测方案,其关键在于利用CLIP等VLMs在图像与文本间建立对齐表征的能力,通过自然语言描述正常与异常状态实现无需缺陷样本的异常分类(Anomaly Classification, AC)和分割(Anomaly Segmentation, AS)。解决方案的核心创新体现在三方面:一是滑动窗口密集特征提取(WinCLIP)、二是可学习投影的多阶段特征对齐(AprilLab框架),三是组合提示集成策略,从而系统性优化特征提取机制、文本-视觉对齐方式、提示工程以及跨域泛化性能,显著提升异常检测的准确率与效率。
链接: https://arxiv.org/abs/2601.13440
作者: Mohit Kakda,Mirudula Shri Muthukumaran,Uttapreksha Patel,Lawrence Swaminathan Xavier Prince
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,4 images
Abstract:Vision-Language Models (VLMs), particularly CLIP, have revolutionized anomaly detection by enabling zero-shot and few-shot defect identification without extensive labeled datasets. By learning aligned representations of images and text, VLMs facilitate anomaly classification and segmentation through natural language descriptions of normal and abnormal states, eliminating traditional requirements for task-specific training or defect examples. This project presents a comprehensive analysis of VLM-based approaches for anomaly classification (AC) and anomaly segmentation (AS). We systematically investigate key architectural paradigms including sliding window-based dense feature extraction (WinCLIP), multi-stage feature alignment with learnable projections (AprilLab framework), and compositional prompt ensemble strategies. Our analysis evaluates these methods across critical dimensions: feature extraction mechanisms, text-visual alignment strategies, prompt engineering techniques, zero-shot versus few-shot trade-offs, computational efficiency, and cross-domain generalization. Through rigorous experimentation on benchmarks such as MVTec AD and VisA, we compare classification accuracy, segmentation precision, and inference efficiency. The primary contribution is a foundational understanding of how and why VLMs succeed in anomaly detection, synthesizing practical insights for method selection and identifying current limitations. This work aims to facilitate informed adoption of VLM-based methods in industrial quality control and guide future research directions.
zh
[CV-79] SGW-GAN: Sliced Gromov-Wasserstein Guided GANs for Retinal Fundus Image Enhancement
【速读】:该论文旨在解决视网膜眼底图像增强中因GAN和扩散模型过度关注感知质量而导致的类内几何结构失真问题,即临床相关样本分散、疾病类别边界模糊,进而损害分级或病灶检测等下游任务性能。其解决方案的关键在于引入基于切片Gromov-Wasserstein(Sliced Gromov-Wasserstein, SGW)的距离度量机制,通过随机投影近似传统Gromov-Wasserstein(GW)距离,在保留样本间相对关系的同时显著降低计算复杂度,从而在无配对数据条件下实现高效且具有临床保真的图像增强。
链接: https://arxiv.org/abs/2601.13417
作者: Yujian Xiong,Xuanzhao Dong,Wenhui Zhu,Xin Li,Oana Dumitrascu,Yalin Wang
机构: Arizona State University (亚利桑那州立大学); Mayo Clinic (梅奥诊所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Retinal fundus photography is indispensable for ophthalmic screening and diagnosis, yet image quality is often degraded by noise, artifacts, and uneven illumination. Recent GAN- and diffusion-based enhancement methods improve perceptual quality by aligning degraded images with high-quality distributions, but our analysis shows that this focus can distort intra-class geometry: clinically related samples become dispersed, disease-class boundaries blur, and downstream tasks such as grading or lesion detection are harmed. The Gromov Wasserstein (GW) discrepancy offers a principled solution by aligning distributions through internal pairwise distances, naturally preserving intra-class structure, but its high computational cost restricts practical use. To overcome this, we propose SGW-GAN, the first framework to incorporate Sliced GW (SGW) into retinal image enhancement. SGW approximates GW via random projections, retaining relational fidelity while greatly reducing cost. Experiments on public datasets show that SGW-GAN produces visually compelling enhancements, achieves superior diabetic retinopathy grading, and reports the lowest GW discrepancy across disease labels, demonstrating both efficiency and clinical fidelity for unpaired medical image enhancement.
zh
[CV-80] Diffusion Representations for Fine-Grained Image Classification: A Marine Plankton Case Study CVPR
【速读】:该论文旨在解决扩散模型(diffusion models)作为通用特征编码器的潜力尚未被充分挖掘的问题,尤其是在无监督或自监督场景下如何有效提取具有判别力的特征以支持细粒度识别任务。其解决方案的关键在于:冻结预训练扩散模型的主干网络(frozen diffusion backbone),通过探测不同层和时间步(timesteps)的中间去噪特征,并为每对特征训练一个线性分类器,从而构建高效的特征表示。实验表明,该方法在真实世界的浮游生物监测任务中表现优异,不仅在平衡与长尾分布设置下优于其他自监督方法,且在跨时间和空间分布的数据上仍保持高准确率和宏F1值,验证了其鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2601.13416
作者: A. Nieto Juscafresa,Á. Mazcuñán Herreros,J. Sullivan
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 6 figures, CVPR format
Abstract:Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.
zh
[CV-81] Using deep learning for predicting cleansing quality of colon capsule endoscopy images
【速读】:该论文旨在解决结肠胶囊内镜(Colon Capsule Endoscopy, CCE)图像中清洁质量预测的自动化问题,以辅助临床诊断。其核心挑战在于如何在保持高预测准确性的同时提升模型效率,并增强模型决策的可解释性,从而满足临床应用的需求。解决方案的关键在于:首先采用ResNet-18架构结合分层K折交叉验证进行分类训练;其次通过迭代结构化剪枝(structured pruning)实现高达79%的稀疏度,使模型效率从84%提升至88%准确率下仍保持高性能;最后利用Grad-CAM系列方法和ROAD评估框架对剪枝后模型进行可解释性分析,确保其在临床场景中的可信度与透明度。此外,还引入自适应温度缩放技术对模型进行外部数据校准,进一步提升泛化能力。
链接: https://arxiv.org/abs/2601.13412
作者: Puneet Sharma,Kristian Dalsbø Hindberg,Benedicte Schelde-Olesen,Ulrik Deding,Esmaeil S. Nadimi,Jan-Matthias Braun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages
Abstract:In this study, we explore the application of deep learning techniques for predicting cleansing quality in colon capsule endoscopy (CCE) images. Using a dataset of 500 images labeled by 14 clinicians on the Leighton-Rex scale (Poor, Fair, Good, and Excellent), a ResNet-18 model was trained for classification, leveraging stratified K-fold cross-validation to ensure robust performance. To optimize the model, structured pruning techniques were applied iteratively, achieving significant sparsity while maintaining high accuracy. Explainability of the pruned model was evaluated using Grad-CAM, Grad-CAM++, Eigen-CAM, Ablation-CAM, and Random-CAM, with the ROAD method employed for consistent evaluation. Our results indicate that for a pruned model, we can achieve a cross-validation accuracy of 88% with 79% sparsity, demonstrating the effectiveness of pruning in improving efficiency from 84% without compromising performance. We also highlight the challenges of evaluating cleansing quality of CCE images, emphasize the importance of explainability in clinical applications, and discuss the challenges associated with using the ROAD method for our task. Finally, we employ a variant of adaptive temperature scaling to calibrate the pruned models for an external dataset.
zh
[CV-82] Local-to-Global Logical Explanations for Deep Vision Models
【速读】:该论文旨在解决深度神经网络在图像分类任务中虽性能优异但缺乏可解释性的问题(即“黑箱”问题)。其核心解决方案是提出局部和全局的解释方法,将图像分类决策转化为单调析取范式(monotone disjunctive-normal-form, MDNF)逻辑公式,其中每个公式由人类可识别的原始概念(primitive concepts)构成,满足该公式即保证模型对某类别的得分较高。此外,论文还设计了一种用于多类别分类的单调解释列表算法,确保生成的解释在保持高度可读性的同时,仍能准确反映原黑盒模型的行为,在复杂视觉数据集上实现高保真度与覆盖率。
链接: https://arxiv.org/abs/2601.13404
作者: Bhavan Vasu,Giuseppe Raffa,Prasad Tadepalli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures, 5th International Joint Conference on Learning Reasoning 2025
Abstract:While deep neural networks are extremely effective at classifying images, they remain opaque and hard to interpret. We introduce local and global explanation methods for black-box models that generate explanations in terms of human-recognizable primitive concepts. Both the local explanations for a single image and the global explanations for a set of images are cast as logical formulas in monotone disjunctive-normal-form (MDNF), whose satisfaction guarantees that the model yields a high score on a given class. We also present an algorithm for explaining the classification of examples into multiple classes in the form of a monotone explanation list over primitive concepts. Despite their simplicity and interpretability we show that the explanations maintain high fidelity and coverage with respect to the blackbox models they seek to explain in challenging vision datasets.
zh
[CV-83] Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics CVPR2026
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在定量空间推理任务中表现不佳的问题,其根本原因在于VLM的架构会破坏像素级信息,导致无法进行精确计数和测量。解决方案的关键在于提出一种名为QVLM(Quantitative Vision-Language Model)的新架构,该模型通过将语言理解与视觉分析解耦,不依赖图像嵌入编码,而是生成可执行代码:首先调用分割模型获取像素级掩码,随后直接在这些掩码上操作,从而保持空间索引的完整性,实现对卫星图像中数量和空间关系的准确推理。
链接: https://arxiv.org/abs/2601.13401
作者: Peter A. Massih,Eric Cosatto
机构: NEC Laboratories America (美国 NEC 实验室); EPFL (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to CVPR 2026. Introduces the QVLM architecture and the SQuID dataset for quantitative geospatial reasoning. Dataset DOI: https://doi.org/10.57967/hf/7565
Abstract:Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy pixel-level information required for counting and measurements. Vision encoders compress images through patch embeddings, reducing spatial indexing and losing the precise pixel-level tracking required for accurate counting. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image Question-Answer pairs with both numerical range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by decoupling language understanding from visual analysis. Instead of encoding images into embeddings, QVLM generates executable code that first calls a segmentation model to obtain pixel-level masks, then operates directly on these masks, preserving spatial indexing throughout the reasoning process. Our experiments show that QVLM using GPT-5 as coder achieves 42.0% accuracy on SQuID compared to 28.1% for a VLM prompted with image-question pairs. Our work reveals that, for quantitative spatial reasoning, architectural decoupling enables better accuracy on quantitative tasks.
zh
[CV-84] Deep Image Prior with L0 Gradient Regularizer for Image Smoothing ICASSP2026
【速读】:该论文旨在解决图像平滑(image smoothing)中如何在去除噪声和细节的同时有效保留强边缘与轮廓结构的问题。传统方法依赖局部窗口统计或优化求解,而基于深度学习的最新方法虽性能优越,但需精心构建训练数据集,而高质量训练数据的获取极具挑战性。为此,作者提出DIP-ℓ₀框架,其关键创新在于引入ℓ₀梯度正则化项(ℓ₀ gradient regularizer),并结合深度图像先验(deep image prior, DIP)机制,在无需任何训练数据的前提下实现高质量图像平滑。为高效最小化包含非凸、非光滑ℓ₀“范数”的损失函数,研究者设计了一种基于交替方向乘子法(ADMM)的优化算法,并集成现成的ℓ₀梯度最小化求解器,从而在保持边缘清晰度和去除JPEG伪影等方面显著优于多种主流图像平滑算法。
链接: https://arxiv.org/abs/2601.13400
作者: Nhat Thanh Tran,Kevin Bui,Jack Xin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: To be published in the Proceedings of IEEE ICASSP 2026
Abstract:Image smoothing is a fundamental image processing operation that preserves the underlying structure, such as strong edges and contours, and removes minor details and textures in an image. Many image smoothing algorithms rely on computing local window statistics or solving an optimization problem. Recent state-of-the-art methods leverage deep learning, but they require a carefully curated training dataset. Because constructing a proper training dataset for image smoothing is challenging, we propose DIP- \ell_0 , a deep image prior framework that incorporates the \ell_0 gradient regularizer. This framework can perform high-quality image smoothing without any training data. To properly minimize the associated loss function that has the nonconvex, nonsmooth \ell_0 ``norm", we develop an alternating direction method of multipliers algorithm that utilizes an off-the-shelf \ell_0 gradient minimization solver. Numerical experiments demonstrate that the proposed DIP- \ell_0 outperforms many image smoothing algorithms in edge-preserving image smoothing and JPEG artifact removal.
zh
[CV-85] Leverag ing Transformer Decoder for Automotive Radar Object Detection
【速读】:该论文旨在解决纯雷达(radar-only)场景下3D目标检测的精度与效率问题,传统方法通常依赖密集的候选框生成和复杂的后处理策略(如非极大值抑制,Non-Maximum Suppression, NMS),导致计算开销大且性能受限。其解决方案的关键在于提出一种基于Transformer架构的检测框架,采用新颖的Transformer解码器作为预测头,直接从雷达特征表示中回归3D边界框和类别得分;同时引入轻量级的金字塔令牌融合(Pyramid Token Fusion, PTF)模块,将多尺度雷达特征统一为具有尺度感知能力的令牌序列,从而建模长距离时空相关性和跨特征交互,摒弃了密集提议生成和繁琐的NMS调优步骤,显著提升了检测性能与推理效率。
链接: https://arxiv.org/abs/2601.13386
作者: Changxu Zhang,Zhaoze Wang,Tai Fei,Christopher Grimm,Yi Jin,Claas Tebruegge,Ernst Warsitz,Markus Gardill
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:In this paper, we present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder as the prediction head to directly regress 3D bounding boxes and class scores from radar feature representations. To bridge multi-scale radar features and the decoder, we propose Pyramid Token Fusion (PTF), a lightweight module that converts a feature pyramid into a unified, scale-aware token sequence. By formulating detection as a set prediction problem with learnable object queries and positional encodings, our design models long-range spatial-temporal correlations and cross-feature interactions. This approach eliminates dense proposal generation and heuristic post-processing such as extensive non-maximum suppression (NMS) tuning. We evaluate the proposed framework on the RADDet, where it achieves significant improvements over state-of-the-art radar-only baselines.
zh
[CV-86] Organ-Aware Attention Improves CT Triage and Classification
【速读】:该论文旨在解决高通量医学影像(如胸部和腹部计算机断层扫描,CT)中亟需的分诊与分类问题,以提升患者护理质量并缓解放射科医生的工作负担。现有方法在处理三维解剖结构、成像协议变化及报告监督噪声方面表现不佳,尤其是通用视觉语言模型(VLM)难以胜任此类任务。其解决方案的关键在于提出一种编码器无关、器官感知的分类头——ORACLE-CT,它结合了两个核心机制:一是器官掩码注意力(Organ-Masked Attention),通过限制每个器官区域的池化操作生成空间证据;二是器官标量融合(Organ-Scalar Fusion),轻量级融合归一化的体积信息与平均HU值线索,从而增强判别能力。实验表明,在CT-RATE(胸部)和MERLIN(腹部)数据集上,该方法均达到当前最优监督分类性能(AUROC最高达0.86),且在统一评估协议下实现跨部位泛化。
链接: https://arxiv.org/abs/2601.13385
作者: Lavsen Dahal,Yubraj Bhandari,Geoffrey D. Rubin,Joseph Y. Lo
机构: Duke University (杜克大学); University of Arizona (亚利桑那大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:There is an urgent need for triage and classification of high-volume medical imaging modalities such as computed tomography (CT), which can improve patient care and mitigate radiologist burnout. Study-level CT triage requires calibrated predictions with localized evidence; however, off-the-shelf Vision Language Models (VLM) struggle with 3D anatomy, protocol shifts, and noisy report supervision. This study used the two largest publicly available chest CT datasets: CT-RATE and RADCHEST-CT (held-out external test set). Our carefully tuned supervised baseline (instantiated as a simple Global Average Pooling head) establishes a new supervised state of the art, surpassing all reported linear-probe VLMs. Building on this baseline, we present ORACLE-CT, an encoder-agnostic, organ-aware head that pairs Organ-Masked Attention (mask-restricted, per-organ pooling that yields spatial evidence) with Organ-Scalar Fusion (lightweight fusion of normalized volume and mean-HU cues). In the chest setting, ORACLE-CT masked attention model achieves AUROC 0.86 on CT-RATE; in the abdomen setting, on MERLIN (30 findings), our supervised baseline exceeds a reproduced zero-shot VLM baseline obtained by running publicly released weights through our pipeline, and adding masked attention plus scalar fusion further improves performance to AUROC 0.85. Together, these results deliver state-of-the-art supervised classification performance across both chest and abdomen CT under a unified evaluation protocol. The source code is available at this https URL.
zh
[CV-87] Practical Insights into Semi-Supervised Object Detection Approaches
【速读】:该论文旨在解决数据稀缺场景下目标检测性能受限的问题,尤其是通过半监督目标检测(Semi-Supervised Object Detection, SSOD)方法利用大量未标注图像与少量标注图像协同提升检测精度。其解决方案的关键在于系统性比较三种前沿SSOD方法(MixPL、Semi-DETR和Consistent-Teacher),并通过在MS-COCO、Pascal VOC及自定义Beetle小样本数据集上的实验,深入分析不同方法在标注数据量变化时的性能表现,从而揭示准确率、模型规模与推理延迟之间的权衡关系,为低数据环境下选择最优方案提供实证依据。
链接: https://arxiv.org/abs/2601.13380
作者: Chaoxin Wang,Bharaneeshwar Balasubramaniyam,Anurag Sangem,Nicolais Guevara,Doina Caragea
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning in data-scarce settings has recently gained significant attention in the research community. Semi-supervised object detection(SSOD) aims to improve detection performance by leveraging a large number of unlabeled images alongside a limited number of labeled images(a.k.a.,few-shot learning). In this paper, we present a comprehensive comparison of three state-of-the-art SSOD approaches, including MixPL, Semi-DETR and Consistent-Teacher, with the goal of understanding how performance varies with the number of labeled images. We conduct experiments using the MS-COCO and Pascal VOC datasets, two popular object detection benchmarks which allow for standardized evaluation. In addition, we evaluate the SSOD approaches on a custom Beetle dataset which enables us to gain insights into their performance on specialized datasets with a smaller number of object categories. Our findings highlight the trade-offs between accuracy, model size, and latency, providing insights into which methods are best suited for low-data regimes.
zh
[CV-88] A Lightweight Model-Driven 4D Radar Framework for Pervasive Human Detection in Harsh Conditions
【速读】:该论文旨在解决工业和地下环境中由于空气尘埃、烟雾、受限几何结构及金属结构导致光学与激光雷达(LiDAR)感知性能迅速退化的问题,从而实现可靠的人类检测。其关键解决方案是提出一种完全基于模型驱动的4D毫米波雷达感知框架,该框架仅依赖雷达作为感知模态,融合了领域感知的多阈值滤波、自车运动补偿的时序累积、基于KD树的欧氏聚类与多普勒感知优化以及规则驱动的3D分类器,能够在嵌入式边缘硬件上实时运行,并在高尘环境和真实矿井隧道中保持稳定的人体识别能力,而此时相机和LiDAR已失效。
链接: https://arxiv.org/abs/2601.13373
作者: Zhenan Liu,Amir Khajepour,George Shaker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pervasive sensing in industrial and underground environments is severely constrained by airborne dust, smoke, confined geometry, and metallic structures, which rapidly degrade optical and LiDAR based perception. Elevation resolved 4D mmWave radar offers strong resilience to such conditions, yet there remains a limited understanding of how to process its sparse and anisotropic point clouds for reliable human detection in enclosed, visibility degraded spaces. This paper presents a fully model-driven 4D radar perception framework designed for real-time execution on embedded edge hardware. The system uses radar as its sole perception modality and integrates domain aware multi threshold filtering, ego motion compensated temporal accumulation, KD tree Euclidean clustering with Doppler aware refinement, and a rule based 3D classifier. The framework is evaluated in a dust filled enclosed trailer and in real underground mining tunnels, and in the tested scenarios the radar based detector maintains stable pedestrian identification as camera and LiDAR modalities fail under severe visibility degradation. These results suggest that the proposed model-driven approach provides robust, interpretable, and computationally efficient perception for safety-critical applications in harsh industrial and subterranean environments.
zh
[CV-89] Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations
【速读】:该论文旨在解决文本到三维人脸生成中几何质量低的问题,其核心难点在于3D空间中顶点分布的任意性和复杂性,导致现有模型难以建立清晰的网格连接关系,从而产生次优几何结构。解决方案的关键在于提出一种基于球面几何表示(Spherical Geometry Representation)的新方法,通过将几何信号锚定在均匀的球面坐标上,确保点云分布规则,从而可稳健重建网格拓扑;同时,该标准球面可无缝展开为二维映射,与强大的2D生成模型形成完美协同,进一步构建了基于此2D图的条件扩散框架——球面几何扩散(Spherical Geometry Diffusion),实现几何与纹理的联合建模,并由几何显式引导纹理生成,显著提升了几何保真度、文本一致性和推理效率。
链接: https://arxiv.org/abs/2601.13371
作者: Junyi Zhang,Yiming Wang,Yunhong Lu,Qichao Wang,Wenzhe Qian,Xiaoyin Xu,David Gu,Min Zhang
机构: Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Association for the Advancement of Artificial Intelligence
Abstract:A fundamental challenge in text-to-3D face generation is achieving high-quality geometry. The core difficulty lies in the arbitrary and intricate distribution of vertices in 3D space, making it challenging for existing models to establish clean connectivity and resulting in suboptimal geometry. To address this, our core insight is to simplify the underlying geometric structure by constraining the distribution onto a simple and regular manifold, a topological sphere. Building on this, we first propose the Spherical Geometry Representation, a novel face representation that anchors geometric signals to uniform spherical coordinates. This guarantees a regular point distribution, from which the mesh connectivity can be robustly reconstructed. Critically, this canonical sphere can be seamlessly unwrapped into a 2D map, creating a perfect synergy with powerful 2D generative models. We then introduce Spherical Geometry Diffusion, a conditional diffusion framework built upon this 2D map. It enables diverse and controllable generation by jointly modeling geometry and texture, where the geometry explicitly conditions the texture synthesis process. Our method’s effectiveness is demonstrated through its success in a wide range of tasks: text-to-3D generation, face reconstruction, and text-based 3D editing. Extensive experiments show that our approach substantially outperforms existing methods in geometric quality, textual fidelity, and inference efficiency.
zh
[CV-90] Real-Time 4D Radar Perception for Robust Human Detection in Harsh Enclosed Environments
【速读】:该论文旨在解决在高杂波环境下(如地下矿井、隧道或坍塌建筑等严苛封闭场景)实现可控的多层级粉尘浓度模拟,并在此基础上开展可重复的毫米波(mmWave)传播特性研究,同时应对粉尘与反射表面共同作用导致的感知性能下降问题。其解决方案的关键在于:首先构建一种基于阈值的噪声滤波框架,利用雷达关键参数(雷达散射截面RCS、速度、方位角和俯仰角)在原始数据层面抑制伪目标并减轻强多径反射;其次,基于滤波后的点云数据,设计了一种基于规则的聚类级分类流程,通过速度、RCS及体积扩展等雷达语义特征实现无需领域特定训练的实时行人检测,从而显著提升系统在粉尘环境下的抗干扰能力与鲁棒性。
链接: https://arxiv.org/abs/2601.13364
作者: Zhenan Liu,Yaodong Cui,Amir Khajepour,George Shaker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper introduces a novel methodology for generating controlled, multi-level dust concentrations in a highly cluttered environment representative of harsh, enclosed environments, such as underground mines, road tunnels, or collapsed buildings, enabling repeatable mm-wave propagation studies under severe electromagnetic constraints. We also present a new 4D mmWave radar dataset, augmented by camera and LiDAR, illustrating how dust particles and reflective surfaces jointly impact the sensing functionality. To address these challenges, we develop a threshold-based noise filtering framework leveraging key radar parameters (RCS, velocity, azimuth, elevation) to suppress ghost targets and mitigate strong multipath reflections at the raw data level. Building on the filtered point clouds, a cluster-level, rule-based classification pipeline exploits radar semantics-velocity, RCS, and volumetric spread-to achieve reliable, real-time pedestrian detection without extensive domainspecific training. Experimental results confirm that this integrated approach significantly enhances clutter mitigation, detection robustness, and overall system resilience in dust-laden mining environments.
zh
[CV-91] MultiST: A Cross-Attention-Based Multimodal Model for Spatial Transcriptomic
【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)中分子特征与组织形态学信息融合不足的问题,现有方法常采用浅层融合策略或完全忽略组织图像,导致难以准确划分模糊的空间域边界。其解决方案的关键在于提出MultiST框架,通过基于交叉注意力(cross-attention)的多模态融合机制,联合建模空间拓扑结构、基因表达谱和组织形态特征;同时引入图结构基因编码器与对抗对齐策略以学习鲁棒的空间表示,并整合经过颜色归一化的组织学特征来捕捉分子与形态之间的依赖关系,从而显著提升空间域边界的清晰度和生物学可解释性。
链接: https://arxiv.org/abs/2601.13331
作者: Wei Wang,Quoc-Toan Ly,Chong Yu,Jun Bai
机构: University of Cincinnati (辛辛那提大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Spatial transcriptomics (ST) enables transcriptome-wide profiling while preserving the spatial context of tissues, offering unprecedented opportunities to study tissue organization and cell-cell interactions in situ. Despite recent advances, existing methods often lack effective integration of histological morphology with molecular profiles, relying on shallow fusion strategies or omitting tissue images altogether, which limits their ability to resolve ambiguous spatial domain boundaries. To address this challenge, we propose MultiST, a unified multimodal framework that jointly models spatial topology, gene expression, and tissue morphology through cross-attention-based fusion. MultiST employs graph-based gene encoders with adversarial alignment to learn robust spatial representations, while integrating color-normalized histological features to capture molecular-morphological dependencies and refine domain boundaries. We evaluated the proposed method on 13 diverse ST datasets spanning two organs, including human brain cortex and breast cancer tissue. MultiST yields spatial domains with clearer and more coherent boundaries than existing methods, leading to more stable pseudotime trajectories and more biologically interpretable cell-cell interaction patterns. The MultiST framework and source code are available at this https URL.
zh
[CV-92] CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在3D场景中缺乏因果空间推理能力的问题,即模型难以准确预测物体运动带来的物理后果,如碰撞、遮挡或轨迹变化等。现有模型受限于静态空间感知,无法有效回答“如果……会怎样”的假设性问题。解决方案的关键在于提出Causal Object World(COW)框架,通过将模拟过程外部化,生成假设动态的视频作为显式视觉线索,使模型能够基于物理现实进行推理,而非依赖漂移于视觉证据之外的语言链式思维,从而显著提升因果空间推理的准确性。
链接: https://arxiv.org/abs/2601.13304
作者: Wenxin Ma,Chenlong Wang,Ruisheng Yuan,Hao Chen,Nanru Dai,S. Kevin Zhou,Yijun Yang,Alan Yuille,Jieneng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available: this https URL
Abstract:Humans can look at a static scene and instantly predict what happens next – will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer “what-if” questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than linguistic priors. We make the dataset and code publicly available here: this https URL
zh
[CV-93] Enginuity: Building an Open Multi-Domain Dataset of Complex Engineering Diagrams NEURIPS2025
【速读】:该论文旨在解决当前人工智能在科学发现中难以有效理解和处理工程图示(engineering diagrams)的问题,尤其是缺乏大规模、多领域且带有结构化标注的数据集,限制了AI在图示解析、跨模态信息检索及辅助工程仿真等任务中的应用。解决方案的关键在于提出Enginuity——首个开源、大规模、多领域的工程图示数据集,其核心创新在于对图示中组件的层次关系、连接方式及语义元素进行详尽标注,从而赋能多模态大语言模型(multimodal large language models)实现对工程图示的结构化理解与操作,突破AI在依赖图示解读、技术绘图分析和视觉推理的科研流程中的参与壁垒。
链接: https://arxiv.org/abs/2601.13299
作者: Ethan Seefried,Prahitha Movva,Naga Harshita Marupaka,Tilak Kasturi,Tirthankar Ghosal
机构: Oak Ridge National Laboratory (橡树岭国家实验室); Predii; Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Ai4 Science
Abstract:We propose Enginuity - the first open, large-scale, multi-domain engineering diagram dataset with comprehensive structural annotations designed for automated diagram parsing. By capturing hierarchical component relationships, connections, and semantic elements across diverse engineering domains, our proposed dataset would enable multimodal large language models to address critical downstream tasks including structured diagram parsing, cross-modal information retrieval, and AI-assisted engineering simulation. Enginuity would be transformative for AI for Scientific Discovery by enabling artificial intelligence systems to comprehend and manipulate the visual-structural knowledge embedded in engineering diagrams, breaking down a fundamental barrier that currently prevents AI from fully participating in scientific workflows where diagram interpretation, technical drawing analysis, and visual reasoning are essential for hypothesis generation, experimental design, and discovery.
zh
[CV-94] Deep Learning for Semantic Segmentation of 3D Ultrasound Data
【速读】:该论文旨在解决自动驾驶车辆中感知系统在成本效益与可靠性之间的平衡难题,尤其是在恶劣和复杂环境下的性能瓶颈问题。当前主流的激光雷达(LiDAR)与摄像头系统虽广泛应用,但在极端条件下存在鲁棒性不足或成本过高的局限。为此,作者提出了一种基于Calyo Pulse固态三维超声传感器的新型学习型3D语义分割框架,其核心创新在于采用3D U-Net网络结构直接处理超声数据进行体素级分割,从而实现对复杂场景的可靠感知。该方案的关键在于将超声传感作为与传统模态互补的新感知方式,展现出在恶劣环境中稳健运行的潜力,为提升自动驾驶系统的环境适应能力提供了可行路径。
链接: https://arxiv.org/abs/2601.13263
作者: Chenyu Liu,Marco Cecotti,Harikrishnan Vijayakumar,Patrick Robinson,James Barson,Mihai Caleap
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 10 figures, 8 tables, presented at 2025 13th International Conference on Robot Intelligence Technology and Applications (RITA)
Abstract:Developing cost-efficient and reliable perception systems remains a central challenge for automated vehicles. LiDAR and camera-based systems dominate, yet they present trade-offs in cost, robustness and performance under adverse conditions. This work introduces a novel framework for learning-based 3D semantic segmentation using Calyo Pulse, a modular, solid-state 3D ultrasound sensor system for use in harsh and cluttered environments. A 3D U-Net architecture is introduced and trained on the spatial ultrasound data for volumetric segmentation. Results demonstrate robust segmentation performance from Calyo Pulse sensors, with potential for further improvement through larger datasets, refined ground truth, and weighted loss functions. Importantly, this study highlights 3D ultrasound sensing as a promising complementary modality for reliable autonomy.
zh
[CV-95] A Semantic Decoupling-Based Two-Stage Rainy-Day Attack for Revealing Weather Robustness Deficiencies in Vision-Language Models
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在真实天气条件(特别是雨天)下鲁棒性不足的问题,尤其是跨模态语义对齐在结构化扰动下的稳定性缺失。解决方案的关键在于提出首个基于真实天气的对抗攻击框架,采用两阶段参数化扰动模型:第一阶段通过低维全局调制作用于嵌入空间,逐步削弱原始语义决策边界;第二阶段则显式建模多尺度雨滴外观和降雨引起的光照变化,优化非可微的天气空间以诱导稳定的语义偏移。该方法在非像素参数空间中生成物理合理且可解释的扰动,实验证明即使高度约束的天气扰动也能显著破坏主流VLMs的语义一致性,揭示其在实际部署中的安全与可靠性风险。
链接: https://arxiv.org/abs/2601.13238
作者: Chengyin Hu,Xiang Chen,Zhe Jia,Weiwen Shi,Fengyu Zhang,Jiujiang Guo,Yiwei Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) are trained on image-text pairs collected under canonical visual conditions and achieve strong performance on multimodal tasks. However, their robustness to real-world weather conditions, and the stability of cross-modal semantic alignment under such structured perturbations, remain insufficiently studied. In this paper, we focus on rainy scenarios and introduce the first adversarial framework that exploits realistic weather to attack VLMs, using a two-stage, parameterized perturbation model based on semantic decoupling to analyze rain-induced shifts in decision-making. In Stage 1, we model the global effects of rainfall by applying a low-dimensional global modulation to condition the embedding space and gradually weaken the original semantic decision boundaries. In Stage 2, we introduce structured rain variations by explicitly modeling multi-scale raindrop appearance and rainfall-induced illumination changes, and optimize the resulting non-differentiable weather space to induce stable semantic shifts. Operating in a non-pixel parameter space, our framework generates perturbations that are both physically grounded and interpretable. Experiments across multiple tasks show that even physically plausible, highly constrained weather perturbations can induce substantial semantic misalignment in mainstream VLMs, posing potential safety and reliability risks in real-world deployment. Ablations further confirm that illumination modeling and multi-scale raindrop structures are key drivers of these semantic shifts.
zh
[CV-96] ConvMambaNet: A Hybrid CNN-Mamba State Space Architecture for Accurate and Real-Time EEG Seizure Detection
【速读】:该论文旨在解决癫痫患者脑电图(Electroencephalography, EEG)信号自动化分析中的挑战,尤其是如何有效捕捉EEG信号中复杂的时序特征以实现高精度的癫痫发作检测。其解决方案的关键在于提出了一种混合深度学习模型ConvMambaNet,该模型将卷积神经网络(Convolutional Neural Networks, CNN)与结构化状态空间模型(Structured State Space Model, SSM)相结合,在CNN框架中嵌入Mamba-SSM模块,从而同时提取EEG信号的空间特征和长程时序依赖关系,显著提升了在严重类别不平衡条件下的检测准确率(达到99%),为临床环境中实时、自动化的癫痫监测提供了可行路径。
链接: https://arxiv.org/abs/2601.13234
作者: Md. Nishan Khan,Kazi Shahriar Sanjid,Md. Tanzim Hossain,Asib Mostakim Fony,Istiak Ahmed,M. Monir Uddin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Epilepsy is a chronic neurological disorder marked by recurrent seizures that can severely impact quality of life. Electroencephalography (EEG) remains the primary tool for monitoring neural activity and detecting seizures, yet automated analysis remains challenging due to the temporal complexity of EEG signals. This study introduces ConvMambaNet, a hybrid deep learning model that integrates Convolutional Neural Networks (CNNs) with the Mamba Structured State Space Model (SSM) to enhance temporal feature extraction. By embedding the Mamba-SSM block within a CNN framework, the model effectively captures both spatial and long-range temporal dynamics. Evaluated on the CHB-MIT Scalp EEG dataset, ConvMambaNet achieved a 99% accuracy and demonstrated robust performance under severe class imbalance. These results underscore the model’s potential for precise and efficient seizure detection, offering a viable path toward real-time, automated epilepsy monitoring in clinical environments.
zh
[CV-97] Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations
【速读】:该论文旨在解决视频情感识别中难以准确捕捉和量化混合情绪(blended emotions)及其相对显著性(relative salience)的问题。现有方法大多仅能识别单一基本情绪,缺乏对多情绪共存且强度不同的情况的有效建模能力,其根本原因在于缺少标注了情绪显著性配置的混合情绪数据集。为此,作者提出了BLEMORE数据集,这是一个包含58名演员表演的6种基本情绪与10种混合情绪组合、每种混合情绪具有三种显著性比例(50/50、70/30、30/70)的多模态(视频+音频)数据集。该数据集是推动混合情绪识别研究的关键突破,实验表明基于此数据集训练的多模态模型在情绪存在预测和显著性判断任务上均显著优于单模态方法,验证了其有效性与实用性。
链接: https://arxiv.org/abs/2601.13225
作者: Tim Lachmann,Alexandra Israelsson,Christina Tornberg,Teimuraz Saghinadze,Michal Balazia,Philipp Müller,Petri Laukka
机构: Stockholm University (斯德哥尔摩大学); Georgian Technical University (格鲁吉亚技术大学); INRIA Université Côte d’Azur (法国国家信息与自动化研究院-蔚蓝海岸大学); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); German Research Center for Artificial Intelligence (德国人工智能研究中心); Uppsala University (乌普萨拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted for publication at IEEE Face Gesture 2026
Abstract:Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) blended emotion recognition that includes information on the relative salience of each emotion within a blend. BLEMORE comprises over 3,000 clips from 58 actors, performing 6 basic emotions and 10 distinct blends, where each blend has 3 different salience configurations (50/50, 70/30, and 30/70). Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy. On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2 + HuBERT) and 18% salience accuracy (HiCMAE). In sum, the BLEMORE dataset provides a valuable resource to advancing research on emotion recognition systems that account for the complexity and significance of blended emotion expressions.
zh
[CV-98] ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments
【速读】:该论文旨在解决当前计算视觉注意模型中对象基础注意力(object-based attention)研究不足的问题,主要受限于缺乏适合的标注数据集和评估指标。为应对这一挑战,作者提出了一种新的120名参与者参与的虚拟现实街景导航数据集 \dataset,其独特性在于模拟了真实世界中因伦理与安全问题难以获取的复杂场景。该数据集不仅包含高精度眼动追踪数据和完整的场景对象状态空间表示,还提供了丰富的标注信息(如全景分割、深度图和车辆关键点)。关键解决方案包括:一是提出全新的对象相似性度量指标(oSIM),用于评估对象基础注意力模型的性能;二是设计了基于Mamba U-Net架构的SUMGraph模型,通过图结构显式编码关键场景对象(如车辆)以提升注意力预测效果。实验表明,显式优化对象基础注意力不仅能提高oSIM指标,还能改善传统指标表现。
链接: https://arxiv.org/abs/2601.13218
作者: Igor Vozniak,Philipp Mueller,Nils Lipp,Janis Sprenger,Konstantin Poddubnyy,Davit Hovhannisyan,Christian Mueller,Andreas Bulling,Philipp Slusallek
机构: German Research Center for Artificial Intelligence (DFKI) GmbH(德国人工智能研究中心(DFKI)有限公司); Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所); Institute for Visualization and Interactive Systems (VIS) at Stuttgart University(斯图加特大学可视化与交互系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the IEEE Intelligent Vehicles Symposium (IV), 2026
Abstract:The object-based nature of human visual attention is well-known in cognitive science, but has only played a minor role in computational visual attention models so far. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present \dataset~ – a novel 120-participant dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. The uniqueness of the presented dataset lies in the ethical and safety affiliated challenges that make collecting comparable data in real-world environments highly difficult. \dataset~ not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but it also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also leads to an improved model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model, which explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.
zh
[CV-99] Rethinking Skip Connections: Additive U-Net for Robust and Interpretable Denoising
【速读】:该论文旨在解决传统U-Net架构中跳接(skip connection)存在的两个核心问题:一是标准的拼接(concatenation)操作导致通道维度翻倍,增加计算负担;二是拼接方式模糊了信息流动路径,使得噪声可能不受控地传递。解决方案的关键在于提出Additive U-Net,用可学习的非负标量缩放的加法跳接(gated additive connections)替代拼接操作,从而在不增加通道维度的前提下,实现对编码器贡献的显式且可解释的控制,同时促进从高频到低频特征的自然层级学习。
链接: https://arxiv.org/abs/2601.13208
作者: Vikram R Lakkavalli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Skip connections are central to U-Net architectures for image denoising, but standard concatenation doubles channel dimensionality and obscures information flow, allowing uncontrolled noise transfer. We propose the Additive U-Net, which replaces concatenative skips with gated additive connections. Each skip pathway is scaled by a learnable non-negative scalar, offering explicit and interpretable control over encoder contributions while avoiding channel inflation. Evaluations on the Kodak-17 denoising benchmark show that Additive U-Net achieves competitive PSNR/SSIM at noise levels \sigma = 15, 25, 50, with robustness across kernel schedules and depths. Notably, effective denoising is achieved even without explicit down/up-sampling or forced hierarchies, as the model naturally learns a progression from high-frequency to band-pass to low-frequency features. These results position additive skips as a lightweight and interpretable alternative to concatenation, enabling both efficient design and a clearer understanding of multi-scale information transfer in reconstruction networks.
zh
[CV-100] GTPred: Benchmarking MLLM s for Interpretable Geo-localization and Time-of-capture Prediction
【速读】:该论文旨在解决现有地理定位(geo-localization)基准测试中忽视图像时间信息的问题,而时间信息对精确定位具有重要约束作用。解决方案的关键在于提出一个名为GTPred的新基准,该基准包含370张覆盖全球、跨越120余年的图像,并通过联合年份与分层地理位置序列匹配来评估多模态大语言模型(MLLM)的预测性能,同时引入精心标注的推理链以分析中间推理过程。实验表明,尽管当前MLLM在视觉感知方面表现优异,但在世界知识和时空推理能力上仍存在局限,且加入时间信息能显著提升定位精度。
链接: https://arxiv.org/abs/2601.13207
作者: Jinnao Li,Zijian Chen,Tingzhu Chen,Changbo Wang
机构: East China Normal University (华东师范大学); Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Geo-localization aims to infer the geographic location where an image was captured using observable visual evidence. Traditional methods achieve impressive results through large-scale training on massive image corpora. With the emergence of multi-modal large language models (MLLMs), recent studies have explored their applications in geo-localization, benefiting from improved accuracy and interpretability. However, existing benchmarks largely ignore the temporal information inherent in images, which can further constrain the location. To bridge this gap, we introduce GTPred, a novel benchmark for geo-temporal prediction. GTPred comprises 370 globally distributed images spanning over 120 years. We evaluate MLLM predictions by jointly considering year and hierarchical location sequence matching, and further assess intermediate reasoning chains using meticulously annotated ground-truth reasoning processes. Experiments on 8 proprietary and 7 open-source MLLMs show that, despite strong visual perception, current models remain limited in world knowledge and geo-temporal reasoning. Results also demonstrate that incorporating temporal information significantly enhances location inference performance.
zh
[CV-101] From 100000 images to winning the first brain MRI foundation model challenges: Sharing lessons and models MICCAI2025
【速读】:该论文旨在解决医学图像分析中3D脑部磁共振成像(MRI)任务的挑战,特别是如何构建高效且精准的基础模型(Foundation Models)。其解决方案的关键在于采用U-Net卷积神经网络(CNN)架构,并融合解剖学先验知识与神经影像学领域知识,从而在训练速度和模型规模上显著优于基于Transformer的方法——训练速度提升1–2个数量级,模型体积缩小至竞争方案的十分之一。
链接: https://arxiv.org/abs/2601.13166
作者: Pedro M. Gordaliza,Jaume Banus,Benoît Gérin,Maxence Wynen,Nataliia Molchanova,Jonas Richiardi,Meritxell Bach Cuadra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work presented at the SSL3D Challenge (1st place, ResEnc-L track) and FOMO Challenge (1st place, Methods track) on Brain MRI Foundation Models at MICCAI 2025
Abstract:Developing Foundation Models for medical image analysis is essential to overcome the unique challenges of radiological tasks. The first challenges of this kind for 3D brain MRI, SSL3D and FOMO25, were held at MICCAI 2025. Our solution ranked first in tracks of both contests. It relies on a U-Net CNN architecture combined with strategies leveraging anatomical priors and neuroimaging domain knowledge. Notably, our models trained 1-2 orders of magnitude faster and were 10 times smaller than competing transformer-based approaches. Models are available here: this https URL.
zh
[CV-102] ICo3D: An Interactive Conversational 3D Virtual Human
【速读】:该论文旨在解决如何构建一个可交互、会对话且具有照片级真实感的3D虚拟人类化身(Virtual Human Avatar)的问题,尤其在实时用户交互场景下实现高保真面部与身体动画同步。其关键解决方案在于:首先基于多视角捕捉重建出可驱动的3D人脸模型和动态3D身体模型,两者均采用高斯泼溅(Splatting Gaussian Primitives)进行渲染;其次引入大语言模型(LLM)赋予虚拟人对话能力,并利用其语音输出作为驱动信号精确控制面部表情动画;此外,通过改进的SWinGS++(用于身体重建)和HeadGaS++(用于头部重建)方法提升整体视觉真实性,并提出一种无伪影融合策略将分离的脸部与身体模型无缝整合,从而实现端到端的沉浸式交互体验。
链接: https://arxiv.org/abs/2601.13148
作者: Richard Shaw,Youngkyoon Jang,Athanasios Papaioannou,Arthur Moreau,Helisa Dhamo,Zhensong Zhang,Eduardo Pérez-Pellitero
机构: Huawei(华为)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted by International Journal on Computer Vision (IJCV). Project page: this https URL . This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in International Journal of Computer Vision and is available online at this https URL
Abstract:This work presents Interactive Conversational 3D Virtual Human (ICo3D), a method for generating an interactive, conversational, and photorealistic 3D human avatar. Based on multi-view captures of a subject, we create an animatable 3D face model and a dynamic 3D body model, both rendered by splatting Gaussian primitives. Once merged together, they represent a lifelike virtual human avatar suitable for real-time user interactions. We equip our avatar with an LLM for conversational ability. During conversation, the audio speech of the avatar is used as a driving signal to animate the face model, enabling precise synchronization. We describe improvements to our dynamic Gaussian models that enhance photorealism: SWinGS++ for body reconstruction and HeadGaS++ for face reconstruction, and provide as well a solution to merge the separate face and body models without artifacts. We also present a demo of the complete system, showcasing several use cases of real-time conversation with the 3D avatar. Our approach offers a fully integrated virtual avatar experience, supporting both oral and written form interactions in immersive environments. ICo3D is applicable to a wide range of fields, including gaming, virtual assistance, and personalized education, among others. Project page: this https URL
zh
[CV-103] Earth Embeddings as Products: Taxonomy Ecosystem and Standardized Access
【速读】:该论文旨在解决当前地球观测领域中生成式 AI (Generative AI) 基础模型(Geospatial Foundation Models, GFMs)因计算成本高昂而难以广泛应用的问题,以及预计算嵌入数据产品在格式和分辨率上缺乏标准化所导致的互操作性障碍。其解决方案的关键在于提出一个三层次分类体系(数据、工具、价值),并扩展 TorchGeo 框架以提供统一的 API,将嵌入表示作为第一类地理空间数据集进行管理,从而实现对多种嵌入产品的标准化加载与查询,解耦下游分析与特定模型的工程依赖,推动地球观测工作流的透明化与可复现性。
链接: https://arxiv.org/abs/2601.13134
作者: Heng Fang,Adam J. Stewart,Isaac Corley,Xiao Xiang Zhu,Hossein Azizpour
机构: 未知
类目: oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Geospatial Foundation Models (GFMs) provide powerful representations, but high compute costs hinder their widespread use. Pre-computed embedding data products offer a practical “frozen” alternative, yet they currently exist in a fragmented ecosystem of incompatible formats and resolutions. This lack of standardization creates an engineering bottleneck that prevents meaningful model comparison and reproducibility. We formalize this landscape through a three-layer taxonomy: Data, Tools, and Value. We survey existing products to identify interoperability barriers. To bridge this gap, we extend TorchGeo with a unified API that standardizes the loading and querying of diverse embedding products. By treating embeddings as first-class geospatial datasets, we decouple downstream analysis from model-specific engineering, providing a roadmap for more transparent and accessible Earth observation workflows.
zh
[CV-104] CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks
【速读】:该论文旨在解决无监督预训练模型在人类中心视觉分析任务中表达能力不足与迁移性有限的问题,尤其针对大规模未标注人体图像数据集缺乏通用预训练方法的挑战。其解决方案的关键在于提出CLASP(CLIP-guided Adaptable Self-suPervised learning)框架,通过CLIP生成多层次语义伪标签(包括低层级的身体部位和高层级属性),并将这些语义线索融入视觉表示中以增强表达力;同时引入Prompt-Controlled Mixture-of-Experts(MoE)模块,根据任务提示动态调整特征提取策略,从而缓解不同下游任务间的特征冲突并提升跨任务迁移性能。
链接: https://arxiv.org/abs/2601.13133
作者: Mingshuang Luo,Ruibing Hou,Bo Chao,Hong Chang,Zimo Liu,Yaowei Wang,Shiguang Shan
机构: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS); University of Chinese Academy of Sciences; Peng Cheng Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMM (IEEE Transactions on Multimedia), 16 pages, 7 figures
Abstract:Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.
zh
[CV-105] GaussExplorer: 3D Gaussian Splatting for Embodied Exploration and Reasoning
【速读】:该论文旨在解决现有基于语言嵌入的3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在处理复杂、组合式语言查询时表现不足的问题,以及基于对象中心的RGB-D结构化记忆方法因预设视角固定而缺乏空间灵活性的局限。其解决方案的关键在于引入视觉-语言模型(Vision-Language Models, VLMs)以增强3DGS场景中的问答驱动探索与推理能力:首先通过相关性匹配识别与查询最相关的预捕获图像,随后将其调整至新颖视角以更准确地捕捉视觉信息,从而提升VLM对复杂语义的理解与推理性能。
链接: https://arxiv.org/abs/2601.13132
作者: Kim Yu-Ji,Dahye Lee,Kim Jun-Seong,GeonU Kim,Nam Hyeon-Woo,Yongjin Kwon,Yu-Chiang Frank Wang,Jaesung Choe,Tae-Hyun Oh
机构: POSTECH(浦项工科大学); KAIST(韩国科学技术院); ETRI(电子与电信研究院); NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We present GaussExplorer, a framework for embodied exploration and reasoning built on 3D Gaussian Splatting (3DGS). While prior approaches to language-embedded 3DGS have made meaningful progress in aligning simple text queries with Gaussian embeddings, they are generally optimized for relatively simple queries and struggle to interpret more complex, compositional language queries. Alternative studies based on object-centric RGB-D structured memories provide spatial grounding but are constrained by pre-fixed viewpoints. To address these issues, GaussExplorer introduces Vision-Language Models (VLMs) on top of 3DGS to enable question-driven exploration and reasoning within 3D scenes. We first identify pre-captured images that are most correlated with the query question, and subsequently adjust them into novel viewpoints to more accurately capture visual information for better reasoning by VLMs. Experiments show that ours outperforms existing methods on several benchmarks, demonstrating the effectiveness of integrating VLM-based reasoning with 3DGS for embodied tasks.
zh
[CV-106] PhaseMark: A Post-hoc Optimization-Free Watermarking of AI-generated Images in the Latent Frequency Domain ICASSP
【速读】:该论文旨在解决由潜在扩散模型(Latent Diffusion Models, LDMs)生成的高保真图像日益增多所带来的鲁棒水印需求问题,现有后处理水印方法因依赖迭代优化或反演过程而效率极低。其解决方案的关键在于提出一种单次、无需优化的水印框架 PhaseMark,该框架直接在变分自编码器(Variational Autoencoder, VAE)的潜空间频域中调制相位信息,从而实现比传统优化方法快数千倍的处理速度,同时在面对再生等严重攻击时仍保持卓越的鲁棒性,且不损害图像质量。这一方法揭示了利用潜空间内在属性进行高效且强韧水印的新范式。
链接: https://arxiv.org/abs/2601.13128
作者: Sung Ju Lee,Nam Ik Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Abstract:The proliferation of hyper-realistic images from Latent Diffusion Models (LDMs) demands robust watermarking, yet existing post-hoc methods are prohibitively slow due to iterative optimization or inversion processes. We introduce PhaseMark, a single-shot, optimization-free framework that directly modulates the phase in the VAE latent frequency domain. This approach makes PhaseMark thousands of times faster than optimization-based techniques while achieving state-of-the-art resilience against severe attacks, including regeneration, without degrading image quality. We analyze four modulation variants, revealing a clear performance-quality trade-off. PhaseMark demonstrates a new paradigm where efficient, resilient watermarking is achieved by exploiting intrinsic latent properties.
zh
[CV-107] A Streamlined Attention-Based Network for Descriptor Extraction
【速读】:该论文旨在解决关键点描述子(keypoint descriptor)在图像匹配任务中性能不足的问题,尤其针对现有描述子网络复杂度高、训练不稳定以及难以与通用关键点检测器兼容的局限性。解决方案的关键在于提出一种轻量级的注意力增强型U-Net架构——SANDesc,其核心组件为带有注意力机制的残差U-Net块(Residual U-Net Blocks with Attention),通过引入卷积块注意力模块(Convolutional Block Attention Modules, CBAM)和残差路径,在保持计算效率的同时提升局部特征表示能力;同时采用改进的三元组损失结合课程学习启发的难负样本挖掘策略,显著增强了训练稳定性,并在HPatches、MegaDepth-1500及Image Matching Challenge 2021等多个基准上实现优于现有描述子的匹配性能,模型参数仅为240万,具备实际部署潜力。
链接: https://arxiv.org/abs/2601.13126
作者: Mattia D’Urso,Emanuele Santellani,Christian Sormann,Mattia Rossi,Andreas Kuhn,Friedrich Fraundorfer
机构: Graz University of Technology (格拉茨工业大学); Sony (索尼)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 3DV 2026
Abstract:We introduce SANDesc, a Streamlined Attention-Based Network for Descriptor extraction that aims to improve on existing architectures for keypoint description. Our descriptor network learns to compute descriptors that improve matching without modifying the underlying keypoint detector. We employ a revised U-Net-like architecture enhanced with Convolutional Block Attention Modules and residual paths, enabling effective local representation while maintaining computational efficiency. We refer to the building blocks of our model as Residual U-Net Blocks with Attention. The model is trained using a modified triplet loss in combination with a curriculum learning-inspired hard negative mining strategy, which improves training stability. Extensive experiments on HPatches, MegaDepth-1500, and the Image Matching Challenge 2021 show that training SANDesc on top of existing keypoint detectors leads to improved results on multiple matching tasks compared to the original keypoint descriptors. At the same time, SANDesc has a model complexity of just 2.4 million parameters. As a further contribution, we introduce a new urban dataset featuring 4K images and pre-calibrated intrinsics, designed to evaluate feature extractors. On this benchmark, SANDesc achieves substantial performance gains over the existing descriptors while operating with limited computational resources. Comments: Accepted to 3DV 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.13126 [cs.CV] (or arXiv:2601.13126v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.13126 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mattia D’Urso [view email] [v1] Mon, 19 Jan 2026 15:12:42 UTC (21,933 KB)
zh
[CV-108] LLM -VLM Fusion Framework for Autonomous Maritime Port Inspection using a Heterogeneous UAV-USV System
【速读】:该论文旨在解决当前海上港口检查中依赖人工操作和传统计算机视觉技术所导致的可扩展性差、缺乏情境理解能力的问题。其解决方案的关键在于构建一个融合大型语言模型(Large Language Models, LLMs)与视觉语言模型(Vision Language Models, VLMs)的集成工程框架,通过协同空基与水面机器人平台实现自主化港口检查。该框架以LLM驱动符号规划替代传统状态机任务规划,并利用VLM实现语义级感知与合规性评估,从而支持上下文感知和自适应监控,同时具备轻量化设计,适用于资源受限的海上平台。
链接: https://arxiv.org/abs/2601.13096
作者: Muhayy Ud Din,Waseem Akram,Ahsan B. Bakht,Irfan Hussain
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: submitted in AEJ
Abstract:Maritime port inspection plays a critical role in ensuring safety, regulatory compliance, and operational efficiency in complex maritime environments. However, existing inspection methods often rely on manual operations and conventional computer vision techniques that lack scalability and contextual understanding. This study introduces a novel integrated engineering framework that utilizes the synergy between Large Language Models (LLMs) and Vision Language Models (VLMs) to enable autonomous maritime port inspection using cooperative aerial and surface robotic platforms. The proposed framework replaces traditional state-machine mission planners with LLM-driven symbolic planning and improved perception pipelines through VLM-based semantic inspection, enabling context-aware and adaptive monitoring. The LLM module translates natural language mission instructions into executable symbolic plans with dependency graphs that encode operational constraints and ensure safe UAV-USV coordination. Meanwhile, the VLM module performs real-time semantic inspection and compliance assessment, generating structured reports with contextual reasoning. The framework was validated using the extended MBZIRC Maritime Simulator with realistic port infrastructure and further assessed through real-world robotic inspection trials. The lightweight on-board design ensures suitability for resource-constrained maritime platforms, advancing the development of intelligent, autonomous inspection systems. Project resources (code and videos) can be found here: this https URL
zh
[CV-109] Patient-Conditioned Adaptive Offsets for Reliable Diagnosis across Subgroups
【速读】:该论文旨在解决医学诊断中AI模型在不同患者群体间表现不一致的问题,其根源在于疾病流行率、影像表现和临床风险特征的异质性。传统算法公平性方法通过抑制敏感属性来减少差异,但在医疗场景中这些属性往往携带关键诊断信息,移除会导致准确性和可靠性下降。本文提出HyperAdapt框架,其核心创新在于构建一个基于患者条件的自适应机制:将年龄、性别等临床相关属性编码为紧凑嵌入,并通过超网络(hypernetwork)风格模块生成少量残差调制参数,动态调整共享骨干网络中的特定层,从而在保持通用医学知识的同时实现针对个体差异的精细化调整。该方案通过低秩和瓶颈化参数约束确保效率与鲁棒性,在多个公共医学影像基准上验证了其对子群体性能的持续提升,尤其在代表性不足的人群中效果显著。
链接: https://arxiv.org/abs/2601.13094
作者: Gelei Xu,Yuying Duan,Jun Xia,Ruining Deng,Wei Jin,Yiyu Shi
机构: University of Notre Dame (圣母大学); Weill Cornell Medicine (威尔康奈尔医学院); Emory University (埃默里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI models for medical diagnosis often exhibit uneven performance across patient populations due to heterogeneity in disease prevalence, imaging appearance, and clinical risk profiles. Existing algorithmic fairness approaches typically seek to reduce such disparities by suppressing sensitive attributes. However, in medical settings these attributes often carry essential diagnostic information, and removing them can degrade accuracy and reliability, particularly in high-stakes applications. In contrast, clinical decision making explicitly incorporates patient context when interpreting diagnostic evidence, suggesting a different design direction for subgroup-aware models. In this paper, we introduce HyperAdapt, a patient-conditioned adaptation framework that improves subgroup reliability while maintaining a shared diagnostic model. Clinically relevant attributes such as age and sex are encoded into a compact embedding and used to condition a hypernetwork-style module, which generates small residual modulation parameters for selected layers of a shared backbone. This design preserves the general medical knowledge learned by the backbone while enabling targeted adjustments that reflect patient-specific variability. To ensure efficiency and robustness, adaptations are constrained through low-rank and bottlenecked parameterizations, limiting both model complexity and computational overhead. Experiments across multiple public medical imaging benchmarks demonstrate that the proposed approach consistently improves subgroup-level performance without sacrificing overall accuracy. On the PAD-UFES-20 dataset, our method outperforms the strongest competing baseline by 4.1% in recall and 4.4% in F1 score, with larger gains observed for underrepresented patient populations.
zh
[CV-110] Prototype Learning-Based Few-Shot Segmentation for Low-Light Crack on Concrete Structures
【速读】:该论文旨在解决低光照环境下混凝土裂缝分割精度下降的问题,此类场景常见于隧道和桥梁底部等难以获得充足照明的区域。传统深度学习方法依赖大量标注良好的高亮度数据集,而低光裂缝图像的像素级标注成本高昂且难以获取。其解决方案的关键在于提出一种双分支原型学习网络,融合Retinex理论与少样本学习机制:首先利用基于Retinex的反射分量引导光照不变的全局表征学习,其次通过度量学习降低对大规模标注数据的依赖;同时引入交叉相似性先验掩码生成模块,计算查询与支持特征之间的高维相似性以捕捉裂缝的位置与结构信息,并结合多尺度特征增强模块,融合多尺度特征与先验掩码以缓解空间不一致性问题,从而在低光条件下实现更精准的裂缝分割。
链接: https://arxiv.org/abs/2601.13059
作者: Yulun Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Crack detection is critical for concrete infrastructure safety, but real-world cracks often appear in low-light environments like tunnels and bridge undersides, degrading computer vision segmentation accuracy. Pixel-level annotation of low-light crack images is extremely time-consuming, yet most deep learning methods require large, well-illuminated datasets. We propose a dual-branch prototype learning network integrating Retinex theory with few-shot learning for low-light crack segmentation. Retinex-based reflectance components guide illumination-invariant global representation learning, while metric learning reduces dependence on large annotated datasets. We introduce a cross-similarity prior mask generation module that computes high-dimensional similarities between query and support features to capture crack location and structure, and a multi-scale feature enhancement module that fuses multi-scale features with the prior mask to alleviate spatial inconsistency. Extensive experiments on multiple benchmarks demonstrate consistent state-of-the-art performance under low-light conditions. Code: this https URL.
zh
[CV-111] GridNet-HD: A High-Resolution Multi-Modal Dataset for LiDAR-Image Fusion on Power Line Infrastructure
【速读】:该论文旨在解决电力线资产(power-line assets)在三维语义分割任务中缺乏高质量多模态标注数据的问题,特别是针对高密度激光雷达(LiDAR)与高分辨率倾斜影像(oblique imagery)联合标注的公开数据集缺失问题。解决方案的关键在于构建GridNet-HD这一多模态数据集,其包含7,694张图像和25亿个点云标注为11类,并提供预定义的训练/验证/测试划分及mIoU评估指标;同时设计了单模态(仅LiDAR、仅图像)与多模态融合基线模型,实验表明融合模型相比最优单模态基线提升5.55 mIoU,验证了几何信息与外观特征互补性对提升分割性能的重要性。
链接: https://arxiv.org/abs/2601.13052
作者: Antoine Carreaud,Shanci Li,Malo De Lacour,Digre Frinde,Jan Skaloud,Adrien Gressin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents GridNet-HD, a multi-modal dataset for 3D semantic segmentation of overhead electrical infrastructures, pairing high-density LiDAR with high-resolution oblique imagery. The dataset comprises 7,694 images and 2.5 billion points annotated into 11 classes, with predefined splits and mIoU metrics. Unimodal (LiDAR-only, image-only) and multi-modal fusion baselines are provided. On GridNet-HD, fusion models outperform the best unimodal baseline by +5.55 mIoU, highlighting the complementarity of geometry and appearance. As reviewed in Sec. 2, no public dataset jointly provides high-density LiDAR and high-resolution oblique imagery with 3D semantic labels for power-line assets. Dataset, baselines, and codes are available: this https URL.
zh
[CV-112] hink3D: Thinking with Space for Spatial Reasoning
【速读】:该论文旨在解决当前视觉大模型(Vision Large Models, VLMs)在物理世界理解与推理中缺乏真正三维(3D)空间智能的问题,即这些模型本质上仍是二维感知器,在处理几何、视角和空间关系时存在局限。其解决方案的关键在于提出Think3D框架,通过引入3D重建模型从图像或视频中恢复点云和相机位姿,并使代理(agent)能够通过基于相机的操作(如视角切换、全局/自我视角转换)主动操控三维空间,将空间推理转化为交互式的3D思维链(chain-of-thought)过程。该方法无需额外训练即可显著提升GPT-4.1和Gemini 2.5 Pro等先进模型的空间推理性能,且对小型模型可通过强化学习策略选择信息丰富的视角和操作进一步增强效果,从而为多模态智能体实现更灵活、类人化的3D推理提供了可行路径。
链接: https://arxiv.org/abs/2601.13029
作者: Zaibin Zhang,Yuhan Wu,Lianjie Jia,Yifan Wang,Zhongbo Zhang,Yijiang Li,Binghao Ran,Fuxi Zhang,Zhuohan Sun,Zhenfei Yin,Lijun Wang,Huchuan Lu
机构: Dalian University of Technology (大连理工大学); University of California San Diego (加州大学圣地亚哥分校); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at this https URL.
zh
[CV-113] AsyncBEV: Cross-modal Flow Alignment in Asynchronous 3D Object Detection
【速读】:该论文旨在解决自动驾驶中多模态感知任务(如3D目标检测)因传感器异步导致的性能下降问题,尤其在动态物体检测上表现显著退化。其核心解决方案是提出AsyncBEV模块,该模块通过估计不同模态BEV特征间的2D光流(scene flow estimation),利用已知的时间偏移对特征图进行空间变换与对齐,从而提升模型对LiDAR或相机传感器间小至大范围时间偏移的鲁棒性。该方法具有轻量、通用特性,可无缝集成到多种BEV检测架构(如基于网格的UniBEV和基于token的CMT)中,并在最坏情况下(0.5秒偏移)使动态物体的NDS指标分别提升16.6%和11.9%。
链接: https://arxiv.org/abs/2601.12994
作者: Shiming Wang,Holger Caesar,Liangliang Nan,Julian F. P. Kooij
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In autonomous driving, multi-modal perception tasks like 3D object detection typically rely on well-synchronized sensors, both at training and inference. However, despite the use of hardware- or software-based synchronization algorithms, perfect synchrony is rarely guaranteed: Sensors may operate at different frequencies, and real-world factors such as network latency, hardware failures, or processing bottlenecks often introduce time offsets between sensors. Such asynchrony degrades perception performance, especially for dynamic objects. To address this challenge, we propose AsyncBEV, a trainable lightweight and generic module to improve the robustness of 3D Birds’ Eye View (BEV) object detection models against sensor asynchrony. Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow from the BEV features of two different sensor modalities, taking into account the known time offset between these sensor measurements. The predicted feature flow is then used to warp and spatially align the feature maps, which we show can easily be integrated into different current BEV detector architectures (e.g., BEV grid-based and token-based). Extensive experiments demonstrate AsyncBEV improves robustness against both small and large asynchrony between LiDAR or camera sensors in both the token-based CMT and grid-based UniBEV, especially for dynamic objects. We significantly outperform the ego motion compensated CMT and UniBEV baselines, notably by 16.6 % and 11.9 % NDS on dynamic objects in the worst-case scenario of a 0.5 s time offset. Code will be released upon acceptance.
zh
[CV-114] Early Prediction of Type 2 Diabetes Using Multimodal data and Tabular Transformers
【速读】:该论文旨在解决2型糖尿病(Type 2 Diabetes Mellitus, T2DM)早期风险预测的准确性不足问题,尤其针对传统机器学习方法难以捕捉纵向健康数据中复杂长程依赖关系的局限性。其解决方案的关键在于提出了一种基于表格变压器(Tabular Transformer, TabTrans)架构的新模型,能够有效整合电子健康记录(EHR)与双能X射线吸收测定法(DXA)获取的骨相关表格式数据,从而建模患者疾病进展过程中的非线性、时序性特征。该方法在卡塔尔生物银行(QBB)队列中验证,通过SMOTE和SMOTE-ENN处理类别不平衡问题,最终在ROC AUC ≥ 79.7% 的性能上显著优于主流生成式AI模型(如GPT-4、Claude 3.5 Sonnet和Gemini Pro)及传统机器学习方法,且特征重要性分析揭示了内脏脂肪组织(VAT)质量与体积、骨密度(BMD)、骨矿含量(BMC)等指标为关键预测因子。
链接: https://arxiv.org/abs/2601.12981
作者: Sulaiman Khan,Md. Rafiul Biswas,Zubair Shah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 08 pages, 06 figures, accepted for publication in FLLM2025
Abstract:This study introduces a novel approach for early Type 2 Diabetes Mellitus (T2DM) risk prediction using a tabular transformer (TabTrans) architecture to analyze longitudinal patient data. By processing patients longitudinal health records and bone-related tabular data, our model captures complex, long-range dependencies in disease progression that conventional methods often overlook. We validated our TabTrans model on a retrospective Qatar BioBank (QBB) cohort of 1,382 subjects, comprising 725 men (146 diabetic, 579 healthy) and 657 women (133 diabetic, 524 healthy). The study integrated electronic health records (EHR) with dual-energy X-ray absorptiometry (DXA) data. To address class imbalance, we employed SMOTE and SMOTE-ENN resampling techniques. The proposed models performance is evaluated against conventional machine learning (ML) and generative AI models, including Claude 3.5 Sonnet (Anthropics constitutional AI), GPT-4 (OpenAIs generative pre-trained transformer), and Gemini Pro (Google`s multimodal language model). Our TabTrans model demonstrated superior predictive performance, achieving ROC AUC \geq 79.7 % for T2DM prediction compared to both generative AI models and conventional ML approaches. Feature interpretation analysis identified key risk indicators, with visceral adipose tissue (VAT) mass and volume, ward bone mineral density (BMD) and bone mineral content (BMC), T and Z-scores, and L1-L4 scores emerging as the most important predictors associated with diabetes development in Qatari adults. These findings demonstrate the significant potential of TabTrans for analyzing complex tabular healthcare data, providing a powerful tool for proactive T2DM management and personalized clinical interventions in the Qatari population. Index Terms: tabular transformers, multimodal data, DXA data, diabetes, T2DM, feature interpretation, tabular data Comments: 08 pages, 06 figures, accepted for publication in FLLM2025 Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2601.12981 [cs.CV] (or arXiv:2601.12981v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.12981 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-115] Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation
【速读】:该论文旨在解决高分辨率(High-Resolution, HR)遥感图像数据在自监督预训练中的利用问题,即如何将HR数据有效融入现有基于中分辨率(Mid-Resolution, MR)图像的自监督学习框架中,以提升MR图像表征学习能力和下游分割任务性能。其解决方案的关键在于设计了一个空间亲和性模块(spatial affinity component),该模块可无缝集成到现有自监督学习框架中,并利用HR图像的空间结构信息来增强对MR图像的特征表示能力,实验表明该方法优于仅使用HR或MR图像预训练的模型。
链接: https://arxiv.org/abs/2601.12964
作者: John Waithaka,Gustave Bwirayesu,Moise Busogi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.
zh
[CV-116] StyMam: A Mamba-Based Generator for Artistic Style Transfer ICASSP2026
【速读】:该论文旨在解决图像风格迁移(Image Style Transfer)中现有方法存在的两大问题:一是基于生成对抗网络(GAN)的方法由于依赖卷积神经网络(CNN)或Transformer,难以同时捕捉局部与全局依赖关系,导致生成图像出现伪影和不和谐模式;二是基于稳定扩散(Stable Diffusion, SD)的方法虽能缓解上述问题,但常无法有效保留内容结构且推理速度较慢。解决方案的关键在于提出一种基于Mamba架构的生成器——StyMam,其核心创新包括:引入残差双路径条带扫描机制以高效提取局部纹理特征,以及通道重加权空间注意力模块以建模全局依赖关系,从而在无需引入伪影和不和谐模式的前提下实现高质量、高效率的风格迁移。
链接: https://arxiv.org/abs/2601.12954
作者: Zhou Hong,Rongsheng Hu,Yicheng Di,Xiaolong Xu,Ning Dong,Yihua Shao,Run Ling,Yun Wang,Juqin Wang,Zhanjie Zhang,Ao Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2026
Abstract:Image style transfer aims to integrate the visual patterns of a specific artistic style into a content image while preserving its content structure. Existing methods mainly rely on the generative adversarial network (GAN) or stable diffusion (SD). GAN-based approaches using CNNs or Transformers struggle to jointly capture local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce such issues but often fail to preserve content structures and suffer from slow inference. To address these issues, we revisit GAN and propose a mamba-based generator, termed as StyMam, to produce high-quality stylized images without introducing artifacts and disharmonious patterns. Specifically, we introduce a mamba-based generator with a residual dual-path strip scanning mechanism and a channel-reweighted spatial attention module. The former efficiently captures local texture features, while the latter models global dependencies. Finally, extensive qualitative and quantitative experiments demonstrate that the proposed method outperforms state-of-the-art algorithms in both quality and speed.
zh
[CV-117] GazeD: Context-Aware Diffusion for Accurate 3D Gaze Estimation
【速读】:该论文旨在解决从单张RGB图像中同时估计三维 gaze(3D gaze)和人体姿态(human pose)的难题,尤其在缺乏时间序列信息的情况下提升3D gaze估计的准确性。其解决方案的关键在于提出了一种基于扩散模型(diffusion model)的新方法 GazeD,通过将 3D gaze 表示为距离眼睛固定距离的一个额外身体关节(additional body joint),并联合建模 gaze 与 pose 的依赖关系,在去噪过程中对两者进行协同推理。该设计充分利用了扩散模型处理不确定性的能力,从输入图像的二维上下文信息中生成多个合理的 3D gaze 和姿态假设,从而显著提升估计精度,在三个基准数据集上达到当前最优性能。
链接: https://arxiv.org/abs/2601.12948
作者: Riccardo Catalini,Davide Di Nucci,Guido Borghi,Davide Davoli,Lorenzo Garattoni,Giampiero Francesca,Yuki Kawana,Roberto Vezzani
机构: University of Modena and Reggio Emilia (摩德纳和雷焦艾米利亚大学); Toyota Motor Europe (丰田汽车欧洲公司); Woven by Toyota (丰田织物)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce GazeD, a new 3D gaze estimation method that jointly provides 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to deal with uncertainty, it generates multiple plausible 3D gaze and pose hypotheses based on the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that the gaze is usually closely related to the pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at this https URL.
zh
[CV-118] QASA: Quality-Guided K-Adaptive Slot Attention for Unsupervised Object-Centric Learning
【速读】:该论文旨在解决现有K-adaptive(自适应槽数量)Slot Attention方法中存在的两个关键问题:一是缺乏对槽绑定质量的显式约束,导致低质量槽引发特征归属模糊;二是将槽计数惩罚加入重建目标后,造成减少活跃槽数量与保持重建保真度之间的优化冲突,从而使得性能显著落后于固定槽数(K-fixed)基线。解决方案的关键在于提出质量引导的自适应槽注意力机制(Quality-Guided K-Adaptive Slot Attention, QASA),其核心创新包括:首先解耦槽选择与重建过程,消除两者间的相互约束;其次设计无监督的槽质量度量(Slot-Quality metric),为细粒度的槽-对象绑定提供原则性信号;进而基于该度量构建质量引导的槽选择策略,在训练时动态选取高质量槽输入至门控解码器进行重建,推理时通过token级竞争实现自适应槽数输出。实验表明,QASA在真实和合成数据集上均显著优于现有K-adaptive方法,并在真实数据集上超越了K-fixed方法。
链接: https://arxiv.org/abs/2601.12936
作者: Tianran Ouyang,Xingping Dong,Jing Zhang,Mang Ye,Jun Chen,Bo Du
机构: Wuhan University (武汉大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Slot Attention, an approach that binds different objects in a scene to a set of “slots”, has become a leading method in unsupervised object-centric learning. Most methods assume a fixed slot count K, and to better accommodate the dynamic nature of object cardinality, a few works have explored K-adaptive variants. However, existing K-adaptive methods still suffer from two limitations. First, they do not explicitly constrain slot-binding quality, so low-quality slots lead to ambiguous feature attribution. Second, adding a slot-count penalty to the reconstruction objective creates conflicting optimization goals between reducing the number of active slots and maintaining reconstruction fidelity. As a result, they still lag significantly behind strong K-fixed baselines. To address these challenges, we propose Quality-Guided K-Adaptive Slot Attention (QASA). First, we decouple slot selection from reconstruction, eliminating the mutual constraints between the two objectives. Then, we propose an unsupervised Slot-Quality metric to assess per-slot quality, providing a principled signal for fine-grained slot–object binding. Based on this metric, we design a Quality-Guided Slot Selection scheme that dynamically selects a subset of high-quality slots and feeds them into our newly designed gated decoder for reconstruction during training. At inference, token-wise competition on slot attention yields a K-adaptive outcome. Experiments show that QASA substantially outperforms existing K-adaptive methods on both real and synthetic datasets. Moreover, on real-world datasets QASA surpasses K-fixed methods.
zh
[CV-119] Membership Inference Test: Auditing Training Data in Object Classification Models AAAI-25
【速读】:该论文旨在解决**成员推理攻击(Membership Inference Test, MINT)**在目标识别领域中的有效性问题,即判断某条数据是否曾被用于模型训练。其核心挑战在于如何从模型的内部激活模式中提取出能够区分训练数据与非训练数据的特征。解决方案的关键在于提出了一种针对目标识别任务定制的MINT架构,该架构利用卷积层捕捉训练过程中产生的激活模式,并通过将对象检测模块、嵌入提取器与MINT模块协同设计,实现了对训练数据的高精度识别(准确率70%–80%),且性能受检测模块层数的影响显著,从而为提升模型训练过程的透明度和可解释性提供了有效路径。
链接: https://arxiv.org/abs/2601.12929
作者: Gonzalo Mancera,Daniel DeAlcala,Aythami Morales,Ruben Tolosana,Julian Fierrez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Deployable AI (DAI 2025) workshop co-located with AAAI-25
Abstract:In this research, we analyze the performance of Membership Inference Tests (MINT), focusing on determining whether given data were utilized during the training phase, specifically in the domain of object recognition. Within the area of object recognition, we propose and develop architectures tailored for MINT models. These architectures aim to optimize performance and efficiency in data utilization, offering a tailored solution to tackle the complexities inherent in the object recognition domain. We conducted experiments involving an object detection model, an embedding extractor, and a MINT module. These experiments were performed in three public databases, totaling over 174K images. The proposed architecture leverages convolutional layers to capture and model the activation patterns present in the data during the training process. Through our analysis, we are able to identify given data used for testing and training, achieving precision rates ranging between 70% and 80%, contingent upon the depth of the detection module layer chosen for input to the MINT module. Additionally, our studies entail an analysis of the factors influencing the MINT Module, delving into the contributing elements behind more transparent training processes.
zh
[CV-120] Dual-Stream Collaborative Transformer for Image Captioning
【速读】:该论文旨在解决当前基于区域特征(region feature)的图像描述生成方法因缺乏上下文信息以及过度依赖局部描述预测剩余词汇而导致生成不相关描述的问题。其解决方案的关键在于提出一种双流协同Transformer(Dual-Stream Collaborative Transformer, DSCT),通过引入分割特征(segmentation feature)并动态融合区域特征与分割特征来引导句子生成。DSCT包含多个模式特定互注意力编码器(Pattern-Specific Mutual Attention Encoders, PSMAEs)和动态提名解码器(Dynamic Nomination Decoders, DNDs):PSMAE通过相互查询机制有效提取并强化两种表示的私有信息,DND则根据输入文本表示动态选择最相关的学习模块,并利用融合后的区域与分割特征间的同质性生成更准确、更具描述性的图像描述句。这是首个探索以动态方式融合不同模式特异性特征以规避语义不一致性和空间错位问题的研究。
链接: https://arxiv.org/abs/2601.12926
作者: Jun Wan,Jun Liu,Zhihui lai,Jie Zhou
机构: Zhongnan University of Economics and Law (中南财经政法大学); Singapore University of Technology and Design (新加坡科技设计大学); Shenzhen University (深圳大学); Shenzhen Institute of Artificial Intelligence and Robotics for Society (深圳市人工智能与机器人社会治理研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current region feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they are still prone to generating irrelevant descriptions due to the lack of contextual information and the over-reliance on generated partial descriptions for predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. It contains multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE effectively highlights and consolidates the private information of two representations by querying each other. The DND dynamically searches for the most relevant learning blocks to the input textual representations and exploits the homogeneous features between the consolidated region and segmentation features to generate more accurate and descriptive caption sentences. To the best of our knowledge, this is the first study to explore how to fuse different pattern-specific features in a dynamic way to bypass their semantic inconsistencies and spatial misalignment issues for image captioning. The experimental results from popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.
zh
[CV-121] Supervision-by-Hallucination-and-Transfer: A Weakly-Supervised Approach for Robust and Precise Facial Landmark Detection
【速读】:该论文旨在解决高精度人脸关键点检测(Facial Landmark Detection, FLD)中因低分辨率图像或特征下采样导致的深层特征表示学习困难、训练数据不足及标注不精确等问题。其解决方案的关键在于提出一种弱监督框架——“幻觉与迁移监督”(Supervision-by-Hallucination-and-Transfer, SHT),包含两个相互增强的模块:双幻觉学习网络(Dual Hallucination Learning Network, DHLN)和人脸姿态迁移网络(Facial Pose Transfer Network, FPTN)。DHLN通过联合学习FLD与人脸幻觉任务,从低分辨率输入中恢复高分辨率面部结构与局部细节,并生成更有效的关键点热图;FPTN则通过跨姿态变换进一步优化DHLN生成的热图与人脸图像,从而提升关键点定位精度。该方法首次将人脸幻觉与姿态迁移任务引入弱监督FLD,显著提升了模型鲁棒性与准确性。
链接: https://arxiv.org/abs/2601.12919
作者: Jun Wan,Yuanzhi Yao,Zhihui Lai,Jie Zhou,Xianxu Hou,Wenwen Min
机构: Zhongnan University of Economics and Law (中南财经政法大学); Shenzhen University (深圳大学); Yunnan University (云南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-precision facial landmark detection (FLD) relies on high-resolution deep feature representations. However, low-resolution face images or the compression (via pooling or strided convolution) of originally high-resolution images hinder the learning of such features, thereby reducing FLD accuracy. Moreover, insufficient training data and imprecise annotations further degrade performance. To address these challenges, we propose a weakly-supervised framework called Supervision-by-Hallucination-and-Transfer (SHT) for more robust and precise FLD. SHT contains two novel mutually enhanced modules: Dual Hallucination Learning Network (DHLN) and Facial Pose Transfer Network (FPTN). By incorporating FLD and face hallucination tasks, DHLN is able to learn high-resolution representations with low-resolution inputs for recovering both facial structures and local details and generating more effective landmark heatmaps. Then, by transforming faces from one pose to another, FPTN can further improve landmark heatmaps and faces hallucinated by DHLN for detecting more accurate landmarks. To the best of our knowledge, this is the first study to explore weakly-supervised FLD by integrating face hallucination and facial pose transfer tasks. Experimental results of both face hallucination and FLD demonstrate that our method surpasses state-of-the-art techniques.
zh
[CV-122] woHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation Detection and Localization in Identity Documents
【速读】:该论文旨在解决生成式 AI (Generative AI) 技术快速发展背景下,身份证件中面部替换(face swapping)和文本修复(text inpainting)等合成篡改行为带来的安全威胁问题。其核心解决方案是提出 TwoHead-SwinFPN,一种统一的深度学习架构,能够同时完成伪造检测的二分类任务与篡改区域的精确定位。该方法的关键在于:采用 Swin Transformer 骨干网络结合特征金字塔网络(Feature Pyramid Network, FPN)与 UNet-style 解码器,并引入卷积块注意力模块(Convolutional Block Attention Module, CBAM)增强特征表达能力;通过双头结构实现检测与分割任务的联合优化,利用不确定性加权的多任务学习策略提升模型鲁棒性与性能,在 FantasyIDiap 数据集上实现了高精度的分类(84.31% 准确率,90.78% AUC)与定位(57.24% 平均 Dice 分数),且具备良好的计算效率,适合实际部署。
链接: https://arxiv.org/abs/2601.12895
作者: Chan Naseeb,Adeel Ashraf Cheema,Hassan Sami,Tayyab Afzal,Muhammad Omair,Usman Habib
机构: IBM Germany (IBM德国); FAST NUCES, Pakistan (巴基斯坦FAST NUCES); Askolay Pakistan (巴基斯坦Askolay)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages
Abstract:The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization. The proposed method achieves an F1-score of 88.61% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.
zh
[CV-123] Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning
【速读】:该论文旨在解决扩散策略(Diffusion Policy)在实时视觉-运动控制中因多步去噪过程导致计算效率低下、难以满足实际应用需求的问题。现有基于缓存的加速方法依赖静态调度策略,无法适应机器人与环境交互的动态特性,从而影响性能。其解决方案的关键在于提出一种稀疏动作生成框架 SAG(Sparse Action Generation),通过引入滚动优化自适应的“剪枝-重用”机制,首先全局识别可剪枝计算,再利用缓存激活进行替代;同时设计观测条件下的扩散剪枝器以捕捉环境动态,并采用“一劳永逸”的重用策略,在时间步和网络块间以锯齿状方式复用激活,显著降低全局冗余。实验表明,SAG 在多个机器人基准测试中实现了高达 4 倍的速度提升且不牺牲性能。
链接: https://arxiv.org/abs/2601.12894
作者: Kangye Ji,Yuan Meng,Zhou Jianbo,Ye Li,Hanyun Cui,Zhi Wang
机构: Tsinghua University (清华大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Policy has dominated action generation due to its strong capabilities for modeling multi-modal action distributions, but its multi-step denoising processes make it impractical for real-time visuomotor control. Existing caching-based acceleration methods typically rely on \textitstatic schedules that fail to adapt to the \textitdynamics of robot-environment interactions, thereby leading to suboptimal performance. In this paper, we propose \underline\textbfS parse \underline\textbfA ction \underline\textbfG en ( \textbfSAG ) for extremely sparse action generation. To accommodate the iterative interactions, SAG customizes a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then reuses cached activations to substitute them during action diffusion. To capture the rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and instantiates it with a highly parameter- and inference-efficient design for real-time prediction. Furthermore, SAG introduces a one-for-all reusing strategy that reuses activations across both timesteps and blocks in a zig-zag manner, minimizing the global redundancy. Extensive experiments on multiple robotic benchmarks demonstrate that SAG achieves up to 4 \times generation speedup without sacrificing performance. Project Page: this https URL.
zh
[CV-124] Simultaneous Detection of LSD and FMD in Cattle Using Ensemble Deep Learning
【速读】:该论文旨在解决牛群中痘疮病(Lumpy Skin Disease, LSD)与口蹄疫(Foot-and-Mouth Disease, FMD)等高度传染性疾病的视觉诊断难题,这些问题因症状与其他良性病变(如昆虫叮咬或化学灼伤)存在显著重叠而难以准确识别,从而延误防控措施。解决方案的关键在于构建一个集成深度学习框架,融合VGG16、ResNet50和InceptionV3三种预训练模型,并采用优化的加权平均策略进行多疾病联合检测,从而有效区分复杂相似症状,在包含10,516张专家标注图像的大规模跨区域数据集上实现了98.2%的准确率和高达99.5%的AUC-ROC值,为早期、精准且自动化的动物疫病诊断提供了可靠工具。
链接: https://arxiv.org/abs/2601.12889
作者: Nazibul Basar Ayon,Abdul Hasib,Md. Faishal Ahmed,Md. Sadiqur Rahman,Kamrul Islam,T. M. Mehrab Hasan,A. S. M. Ahsanul Sarkar Akib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Lumpy Skin Disease (LSD) and Foot-and-Mouth Disease (FMD) are highly contagious viral diseases affecting cattle, causing significant economic losses and welfare challenges. Their visual diagnosis is complicated by significant symptom overlap with each other and with benign conditions like insect bites or chemical burns, hindering timely control measures. Leveraging a comprehensive dataset of 10,516 expert-annotated images from 18 farms across India, Brazil, and the USA, this study presents a novel Ensemble Deep Learning framework integrating VGG16, ResNet50, and InceptionV3 with optimized weighted averaging for simultaneous LSD and FMD detection. The model achieves a state-of-the-art accuracy of 98.2%, with macro-averaged precision of 98.2%, recall of 98.1%, F1-score of 98.1%, and an AUC-ROC of 99.5%. This approach uniquely addresses the critical challenge of symptom overlap in multi-disease detection, enabling early, precise, and automated diagnosis. This tool has the potential to enhance disease management, support global agricultural sustainability, and is designed for future deployment in resource-limited settings.
zh
[CV-125] YOLO26: An Analysis of NMS-Free End to End Framework for Real-Time Object Detection
【速读】:该论文旨在解决传统YOLO系列模型(如YOLOv1至YOLOv11)在实时目标检测中因依赖非极大值抑制(Non-Maximum Suppression, NMS)后处理而带来的延迟高和超参数敏感性问题。其核心解决方案是提出YOLO26架构,通过彻底摒弃NMS并引入端到端的原生学习策略实现性能突破;关键创新包括:用于稳定轻量级骨干网络训练的MuSGD优化器、面向小目标感知的STAL分配策略,以及提供动态监督信号的ProgLoss损失函数。这些改进共同推动了推理速度与检测精度之间的帕累托前沿提升,实现了低延迟与高精度的统一。
链接: https://arxiv.org/abs/2601.12882
作者: Sudip Chakrabarty
机构: KIIT University (基伊特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The “You Only Look Once” (YOLO) framework has long served as the benchmark for real-time object detection, yet traditional iterations (YOLOv1 through YOLO11) remain constrained by the latency and hyperparameter sensitivity of Non-Maximum Suppression (NMS) post-processing. This paper analyzes a comprehensive analysis of YOLO26, an architecture that fundamentally redefines this paradigm by eliminating NMS in favor of a native end-to-end learning strategy. This study examines the critical innovations that enable this transition, specifically the introduction of the MuSGD optimizer for stabilizing lightweight backbones, STAL for small-target-aware assignment, and ProgLoss for dynamic supervision. Through a systematic review of official performance benchmarks, the results demonstrate that YOLO26 establishes a new Pareto front, outperforming a comprehensive suite of predecessors and state-of-the-art competitors (including RTMDet and DAMO-YOLO) in both inference speed and detection accuracy. The analysis confirms that by decoupling representation learning from heuristic post-processing, YOLOv26 successfully resolves the historical trade-off between latency and precision, signaling the next evolutionary step in edge-based computer vision.
zh
[CV-126] Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation
【速读】:该论文旨在解决Speech-Preserving Facial Expression Manipulation (SPFEM) 中唇部同步精度不足的问题,即在改变面部表情的同时难以保持原始口型与语音的准确对齐。解决方案的关键在于将音频驱动的说话头生成(Audio-Driven Talking Head Generation, AD-THG)模型引入 SPFEM 框架,构建新的 Talking Head Facial Expression Manipulation (THFEM) 框架,利用 AD-THG 模型从音频输入中生成唇部运动精准同步的图像帧;同时提出相邻帧学习策略,通过微调 AD-THG 模型以预测连续帧序列,从而融合邻近帧信息,显著提升生成图像的真实感和表情保真度。
链接: https://arxiv.org/abs/2601.12876
作者: Zhenxuan Lu,Zhihua Xu,Zhijing Yang,Feng Gao,Yongyi Lu,Keze Wang,Tianshui Chen
机构: Guangdong University of Technology(广东工业大学); Peking University(北京大学); Sun Yat-sen University(中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM Transactions on Multimedia Computing, Communications, and Applications
Abstract:Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos while retaining the original mouth movements. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. Capitalizing on the advanced capabilities of audio-driven talking head generation (AD-THG) models in synthesizing precise lip movements, our research introduces a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to compromise the realism and expression fidelity of the images. To counter this, we develop an adjacent frame learning strategy that finetunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality during testing. Our extensive experimental evaluations demonstrate that this framework effectively preserves mouth shapes during expression manipulations, highlighting the substantial benefits of integrating AD-THG with SPFEM.
zh
[CV-127] Proxy Robustness in Vision Language Models is Effortlessly Transferable
【速读】:该论文旨在解决视觉语言模型(Vision-Language Model, VLM)在对抗鲁棒性迁移中面临的计算资源消耗过高问题,尤其是针对大规模多模态模型如CLIP构建鲁棒教师模型的可行性挑战。其关键解决方案是提出一种异构代理迁移(Heterogeneous Proxy Transfer, HPT)框架,利用不同架构的CLIP模型之间存在的内在防御能力(即代理对抗鲁棒性,proxy adversarial robustness),实现跨架构的鲁棒性蒸馏;进一步通过泛化锚定解耦(Generalization-Pivot Decoupling, GPD)机制,基于学习率调度差异将迁移过程分解为两个阶段:以保持自然泛化能力为导向的预热阶段和以提升对抗鲁棒性为目标的蒸馏阶段,从而在自然泛化性能与对抗鲁棒性之间取得平衡。
链接: https://arxiv.org/abs/2601.12865
作者: Xiaowei Fu,Fuxiang Huang,Lei Zhang
机构: Chongqing University (重庆大学); Lingnan University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As a pivotal technique for improving the defense of deep models, adversarial robustness transfer via distillation has demonstrated remarkable success in conventional image classification tasks. However, this paradigm encounters critical challenges when applied to vision-language models (VLM) (e.g., CLIP): constructing adversarially robust teacher for large-scale multi-modal models demands prohibitively high computational resources. We bridge this gap by revealing an interesting phenomenon: vanilla CLIP (without adversarial training) exhibits intrinsic defensive capabilities against adversarial examples generated by another CLIP with different architectures. We formally define this as proxy adversarial robustness, and naturally propose a Heterogeneous Proxy Transfer (HPT) framework that establishes cross-architectural robustness distillation channels between CLIP variants, effortlessly enabling the VLM robustness transfer from proxy to target models. Yet, such proxy transfer paradigm easily induces severe overfitting, leading to a sharp degradation in zero-shot natural generalization. To resolve that, we design Generalization-Pivot Decoupling (GPD) by leveraging the difference in learning rate scheduling. This decouples the proxy transfer process into a generalization-anchored warm-up that maintains generalization and a generalization-pulled HPT that promotes adversarial robustness, to achieve an equilibrium between natural generalization and adversarial robustness. Extensive experiments on 15 zero-shot datasets demonstrate the effectiveness of our HPT-GPD method. The code is available at the website of this http URL.
zh
[CV-128] FGTBT: Frequency-Guided Task-Balancing Transformer for Unified Facial Landmark Detection
【速读】:该论文旨在解决当前基于深度学习的面部关键点检测(Facial Landmark Detection, FLD)方法在大姿态变化、光照差异和表情变化等挑战性场景下难以准确捕捉面部几何结构的问题,以及现有FLD数据集规模有限且多样性不足导致模型训练鲁棒性差、检测精度下降的瓶颈。其解决方案的关键在于提出一种频率引导的任务平衡Transformer架构(Frequency-Guided Task-Balancing Transformer, FGTBT),包含两个核心组件:一是新颖的细粒度多任务平衡损失(Fine-Grained Multi-Task Balancing loss, FMB-loss),通过按个体关键点在不同数据集中的分布情况分配权重,实现更精细的任务级平衡,缓解梯度不一致问题;二是频率引导的结构感知模块(Frequency-Guided Structure-Aware, FGSA),利用频域引导的结构注入与正则化机制增强对人脸结构约束的学习能力,从而提升模型在复杂条件下的结构感知性能。
链接: https://arxiv.org/abs/2601.12863
作者: Jun Wan,Xinyu Xiong,Ning Chen,Zhihui Lai,Jie Zhou,Wenwen Min
机构: Zhongnan University of Economics and Law (中南财经政法大学); Nanyang Technological University (南洋理工大学); Shenzhen University (深圳大学); Yunnan University (云南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, deep learning based facial landmark detection (FLD) methods have achieved considerable success. However, in challenging scenarios such as large pose variations, illumination changes, and facial expression variations, they still struggle to accurately capture the geometric structure of the face, resulting in performance degradation. Moreover, the limited size and diversity of existing FLD datasets hinder robust model training, leading to reduced detection accuracy. To address these challenges, we propose a Frequency-Guided Task-Balancing Transformer (FGTBT), which enhances facial structure perception through frequency-domain modeling and multi-dataset unified training. Specifically, we propose a novel Fine-Grained Multi-Task Balancing loss (FMB-loss), which moves beyond coarse task-level balancing by assigning weights to individual landmarks based on their occurrence across datasets. This enables more effective unified training and mitigates the issue of inconsistent gradient magnitudes. Additionally, a Frequency-Guided Structure-Aware (FGSA) model is designed to utilize frequency-guided structure injection and regularization to help learn facial structure constraints. Extensive experimental results on popular benchmark datasets demonstrate that the integration of the proposed FMB-loss and FGSA model into our FGTBT framework achieves performance comparable to state-of-the-art methods. The code is available at this https URL.
zh
[CV-129] Data-Consistent Learning of Inverse Problems
【速读】:该论文旨在解决逆问题(Inverse Problems)中普遍存在的不适定性(ill-posedness),即解的非唯一性和不稳定性问题。传统正则化方法虽能提供数学上可靠的稳定解和收敛性保证,但往往在灵活性或视觉质量方面受限;而基于数据驱动的重建方法(如卷积神经网络)虽能生成视觉效果优异的结果,却缺乏严格的理论保障。论文提出的解决方案关键在于引入DC(Deep Convolutional)网络架构,通过在神经网络中显式嵌入测量模型(measurement model),实现理论可靠性与数据驱动表达能力的融合。具体而言,采用零空间网络(null-space networks)结合经典正则化方法作为初始重建,构建了一种具有收敛性的正则化方法,从而在保持数学严谨性的同时提升重建精度与视觉质量。
链接: https://arxiv.org/abs/2601.12831
作者: Markus Haltmeier,Gyeongha Hwang
机构: University of Innsbruck (因斯布鲁克大学); Yeungnam University (庆南大学)
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Inverse problems are inherently ill-posed, suffering from non-uniqueness and instability. Classical regularization methods provide mathematically well-founded solutions, ensuring stability and convergence, but often at the cost of reduced flexibility or visual quality. Learned reconstruction methods, such as convolutional neural networks, can produce visually compelling results, yet they typically lack rigorous theoretical guarantees. DC (DC) networks address this gap by enforcing the measurement model within the network architecture. In particular, null-space networks combined with a classical regularization method as an initial reconstruction define a convergent regularization method. This approach preserves the theoretical reliability of classical schemes while leveraging the expressive power of data-driven learning, yielding reconstructions that are both accurate and visually appealing.
zh
[CV-130] Seeing Isnt Always Believing: Analysis of Grad-CAM Faithfulness and Localization Reliability in Lung Cancer CT Classification
【速读】:该论文旨在解决当前基于梯度的可解释人工智能(Explainable Artificial Intelligence, XAI)方法,特别是梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM),在医学影像分析中是否真正反映深度神经网络内部决策机制的问题。研究发现,尽管Grad-CAM在大多数卷积神经网络中能有效定位肿瘤区域,但在视觉Transformer(Vision Transformer, ViT)模型中其解释保真度显著下降,原因在于其非局部注意力机制干扰了热力图的可信性。解决方案的关键在于构建一个融合定位准确性、扰动一致性与解释稳定性的量化评估框架,从而揭示不同模型架构下Grad-CAM的可靠性差异,并强调发展面向特定模型结构的可解释方法的重要性,以提升医疗AI系统解释结果的临床可信度和计算严谨性。
链接: https://arxiv.org/abs/2601.12826
作者: Teerapong Panboonyuen
机构: Chulalongkorn University (朱拉隆功大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages
Abstract:Explainable Artificial Intelligence (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), have become indispensable for visualizing the reasoning process of deep neural networks in medical image analysis. Despite their popularity, the faithfulness and reliability of these heatmap-based explanations remain under scrutiny. This study critically investigates whether Grad-CAM truly represents the internal decision-making of deep models trained for lung cancer image classification. Using the publicly available IQ-OTH/NCCD dataset, we evaluate five representative architectures: ResNet-50, ResNet-101, DenseNet-161, EfficientNet-B0, and ViT-Base-Patch16-224, to explore model-dependent variations in Grad-CAM interpretability. We introduce a quantitative evaluation framework that combines localization accuracy, perturbation-based faithfulness, and explanation consistency to assess Grad-CAM reliability across architectures. Experimental findings reveal that while Grad-CAM effectively highlights salient tumor regions in most convolutional networks, its interpretive fidelity significantly degrades for Vision Transformer models due to non-local attention behavior. Furthermore, cross-model comparisons indicate substantial variability in saliency localization, implying that Grad-CAM explanations may not always correspond to the true diagnostic evidence used by the networks. This work exposes critical limitations of current saliency-based XAI approaches in medical imaging and emphasizes the need for model-aware interpretability methods that are both computationally sound and clinically meaningful. Our findings aim to inspire a more cautious and rigorous adoption of visual explanation tools in medical AI, urging the community to rethink what it truly means to “trust” a model’s explanation.
zh
[CV-131] reeDGS: Aerial Gaussian Splatting for Distant DBH Measurement
【速读】:该论文旨在解决在复杂自然场景中,通过航空遥感直接测量树木胸径(DBH)的难题。由于林区树干在航拍图像中距离远、观测稀疏(通常仅占几像素),传统重建方法难以有效约束胸高处的几何结构。其解决方案的关键在于提出TreeDGS方法,利用3D Gaussian Splatting(3DGS)作为连续且可稠密化的场景表示,结合SfM-MVS初始化与高斯优化后,通过RaDe-GS的深度感知累积透明度积分提取密集点云,并为每个点赋予多视角透明度可靠性评分;最终基于此权重进行固态圆拟合,实现高精度DBH估计,实验证明其RMSE达4.79 cm,显著优于LiDAR基线方法(7.91 cm)。
链接: https://arxiv.org/abs/2601.12823
作者: Belal Shaheen,Minh-Hieu Nguyen,Bach-Thuan Bui,Shubham,Tim Wu,Michael Fairley,Matthew David Zane,Michael Wu,James Tompkin
机构: Coolant; Brown University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Aerial remote sensing enables efficient large-area surveying, but accurate direct object-level measurement remains difficult in complex natural scenes. Recent advancements in 3D vision, particularly learned radiance-field representations such as NeRF and 3D Gaussian Splatting, have begun to raise the ceiling on reconstruction fidelity and densifiable geometry from posed imagery. Nevertheless, direct aerial measurement of important natural attributes such as tree diameter at breast height (DBH) remains challenging. Trunks in aerial forest scans are distant and sparsely observed in image views: at typical operating altitudes, stems may span only a few pixels. With these constraints, conventional reconstruction methods leave breast-height trunk geometry weakly constrained. We present TreeDGS, an aerial image reconstruction method that leverages 3D Gaussian Splatting as a continuous, densifiable scene representation for trunk measurement. After SfM-MVS initialization and Gaussian optimization, we extract a dense point set from the Gaussian field using RaDe-GS’s depth-aware cumulative-opacity integration and associate each sample with a multi-view opacity reliability score. We then estimate DBH from trunk-isolated points using opacity-weighted solid-circle fitting. Evaluated on 10 plots with field-measured DBH, TreeDGS reaches 4.79,cm RMSE (about 2.6 pixels at this GSD) and outperforms a state-of-the-art LiDAR baseline (7.91,cm RMSE), demonstrating that densified splat-based geometry can enable accurate, low-cost aerial DBH measurement.
zh
[CV-132] A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling
【速读】:该论文旨在解决总身PET/CT(Total-body PET/CT)成像中多模态信号异质性、大范围轴向覆盖(约2米)、以及结构化放射学语义对现有医疗AI模型的挑战,这些模型通常假设单模态输入、局部视野和粗粒度图像-文本对齐。解决方案的关键在于提出SDF-HOLO(Systemic Dual-stream Fusion Holo Model),其通过双流编码器解耦CT与PET表征学习,并借助跨模态交互模块实现 anatomical context(解剖上下文)引导的PET聚合与metabolic saliency(代谢显著性)驱动的细微形态推理;同时采用分层上下文建模以捕捉全身长程依赖关系,并利用解剖分割掩膜作为显式语义锚点进行体素-掩膜-文本对齐预训练,从而在肿瘤分割、低剂量病灶检测及多语言诊断报告生成等任务上显著优于基准模型,且减少定位错误和幻觉性发现,为系统级精准肿瘤学提供可扩展的计算基础。
链接: https://arxiv.org/abs/2601.12820
作者: Wei Chen,Liang Wu,Shuyi Lu,Yuanyuan Sun,Wenkai Bi,Zilong Yuan,Yaoyao He,Feng Wang,Junchi Ma,Shuyong Liu,Zhaoping Cheng,Xiaoyan Hu,Jianfeng Qiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.
zh
[CV-133] CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting WACV2026
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在压缩与语义分割任务中长期存在的独立处理问题,即如何实现率失真优化(rate-distortion-optimized)的联合压缩与分割,以支持解码端更复杂的场景编辑和操作应用。其解决方案的关键在于构建一个统一框架:首先引入轻量级基于隐式神经表示的超先验(hyperprior),实现对颜色和语义属性的高效熵编码,避免传统网格化超先验的高计算开销;其次提出压缩引导的分割学习机制,包含感知量化训练以增强特征可分性,以及质量感知加权策略抑制不可靠的高斯原语,从而在保障渲染质量的同时显著提升分割性能并降低传输成本。
链接: https://arxiv.org/abs/2601.12814
作者: Yu-Jen Tseng,Chia-Hao Kao,Jing-Zhong Chen,Alessandro Gnutti,Shao-Yuan Lo,Yen-Yu Lin,Wen-Hsiao Peng
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); University of Brescia (布雷西亚大学); National Taiwan University (台湾国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026
Abstract:We present the first unified framework for rate-distortion-optimized compression and segmentation of 3D Gaussian Splatting (3DGS). While 3DGS has proven effective for both real-time rendering and semantic scene understanding, prior works have largely treated these tasks independently, leaving their joint consideration unexplored. Inspired by recent advances in rate-distortion-optimized 3DGS compression, this work integrates semantic learning into the compression pipeline to support decoder-side applications–such as scene editing and manipulation–that extend beyond traditional scene reconstruction and view synthesis. Our scheme features a lightweight implicit neural representation-based hyperprior, enabling efficient entropy coding of both color and semantic attributes while avoiding costly grid-based hyperprior as seen in many prior works. To facilitate compression and segmentation, we further develop compression-guided segmentation learning, consisting of quantization-aware training to enhance feature separability and a quality-aware weighting mechanism to suppress unreliable Gaussian primitives. Extensive experiments on the LERF and 3D-OVS datasets demonstrate that our approach significantly reduces transmission cost while preserving high rendering quality and strong segmentation performance.
zh
[CV-134] Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data
【速读】:该论文旨在解决视觉-语言模型中空间理解能力的机制问题,特别是左-右关系这种基础空间推理是否真正被模型习得及其背后的形成机制。其解决方案的关键在于构建了一个可控的一维图像-文本测试平台(1D image-text testbed),通过在配对描述的一物和两物场景上端到端训练轻量级Transformer编码器,并系统性地改变标签多样性和布局多样性来评估泛化性能;进一步采用注意力分解方法揭示了位置嵌入与标记嵌入之间的交互会诱导出水平方向上的注意力梯度,从而打破编码器中的左右对称性,这是实现左-右区分的核心机制。
链接: https://arxiv.org/abs/2601.12809
作者: Takaki Yamamoto,Chihiro Noguchi,Toshihiro Tanizawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain the mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide a mechanistic insight of when and how CLIP-style models acquire relational competence.
zh
[CV-135] Joint Source-Channel-Generation Coding: From Distortion-oriented Reconstruction to Semantic-consistent Generation
【速读】:该论文旨在解决传统通信系统(包括基于分离编码和AI驱动的联合信源信道编码,JSCC)依赖香农率失真理论所导致的感知质量不佳问题,即现有方法使用通用失真度量无法准确反映人类视觉感知,常造成重建图像模糊或不真实。其解决方案的关键在于提出一种新范式——联合信源-信道-生成编码(JSCGC),该范式将重建目标从确定性重构转变为概率生成,利用接收端的生成模型作为生成器而非传统解码器来参数化数据分布,在信道约束下直接最大化互信息,并通过控制随机采样使输出位于真实数据流形上,从而提升感知质量和语义保真度。
链接: https://arxiv.org/abs/2601.12808
作者: Tong Wu,Zhiyong Chen,Guo Lu,Li Song,Feng Yang,Meixia Tao,Wenjun Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: submitted to IEEE ISIT 2026
Abstract:Conventional communication systems, including both separation-based coding and AI-driven joint source-channel coding (JSCC), are largely guided by Shannon’s rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a novel paradigm that shifts the focus from deterministic reconstruction to probabilistic generation. JSCGC leverages a generative model at the receiver as a generator rather than a conventional decoder to parameterize the data distribution, enabling direct maximization of mutual information under channel constraints while controlling stochastic sampling to produce outputs residing on the authentic data manifold with high fidelity. We further derive a theoretical lower bound on the maximum semantic inconsistency with given transmitted mutual information, elucidating the fundamental limits of communication in controlling the generative process. Extensive experiments on image transmission demonstrate that JSCGC substantially improves perceptual quality and semantic fidelity, significantly outperforming conventional distortion-oriented JSCC methods.
zh
[CV-136] PhyG-MoE: A Physics-Guided Mixture-of-Experts Framework for Energy-Efficient GNSS Interference Recognition
【速读】:该论文旨在解决当前静态深度学习模型在复杂电磁干扰环境下对全球导航卫星系统(GNSS)干扰识别能力不足的问题,其核心矛盾在于固定计算拓扑无法适应输入信号物理熵的动态变化,导致资源分配失衡——简单信号与高度混杂的饱和信号消耗相同计算成本。解决方案的关键是提出PhyG-MoE(Physics-Guided Mixture-of-Experts)框架,通过基于频谱特征纠缠度的门控机制动态路由信号:在饱和复杂场景下激活高容量TransNeXt专家以解耦复杂特征,而在基础信号场景下由轻量级专家处理以降低延迟,从而实现模型容量与信号复杂度的自适应匹配。
链接: https://arxiv.org/abs/2601.12798
作者: Zhihan Zeng,Yang Zhao,Kaihe Wang,Dusit Niyato,Yue Xiu,Lu Chen,Zhongpei Zhang,Ning Wei
机构: University of Electronic Science and Technology of China (UESTC); Nanyang Technological University; Anhui Science and Technology University; Nanjing University Of Information Science & Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Complex electromagnetic interference increasingly compromises Global Navigation Satellite Systems (GNSS), threatening the reliability of Space-Air-Ground Integrated Networks (SAGIN). Although deep learning has advanced interference recognition, current static models suffer from a \textbffundamental limitation: they impose a fixed computational topology regardless of the input’s physical entropy. This rigidity leads to severe resource mismatch, where simple primitives consume the same processing cost as chaotic, saturated mixtures. To resolve this, this paper introduces PhyG-MoE (Physics-Guided Mixture-of-Experts), a framework designed to \textbfdynamically align model capacity with signal complexity. Unlike static architectures, the proposed system employs a spectrum-based gating mechanism that routes signals based on their spectral feature entanglement. A high-capacity TransNeXt expert is activated on-demand to disentangle complex features in saturated scenarios, while lightweight experts handle fundamental signals to minimize latency. Evaluations on 21 jamming categories demonstrate that PhyG-MoE achieves an overall accuracy of 97.58%. By resolving the intrinsic conflict between static computing and dynamic electromagnetic environments, the proposed framework significantly reduces computational overhead without performance degradation, offering a viable solution for resource-constrained cognitive receivers.
zh
[CV-137] Combating Noisy Labels through Fostering Self- and Neighbor-Consistency
【速读】:该论文旨在解决深度学习中标签噪声(label noise)带来的模型性能下降问题,特别是针对不同小批量数据间标签噪声分布不均以及对分布外噪声样本(out-of-distribution noisy data)关注不足的问题。其解决方案的关键在于提出一种名为Jo-SNC(Joint sample selection and model regularization based on Self- and Neighbor-Consistency)的鲁棒方法:首先利用Jensen-Shannon散度结合样本邻域信息来量化样本为干净或分布外样本的概率,从而实现更可靠的样本筛选;其次设计自适应、数据驱动的阈值机制以动态调整每类样本的选择阈值;最后引入三元组一致性正则化策略,同时优化自预测一致性、邻域预测一致性和特征一致性,有效提升模型在噪声环境下的泛化能力。
链接: https://arxiv.org/abs/2601.12795
作者: Zeren Sun,Yazhou Yao,Tongliang Liu,Zechao Li,Fumin Shen,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学); State Key Laboratory of Intelligent Manufacturing of Advanced Construction Machinery (先进制造机械智能制造国家重点实验室); University of Sydney (悉尼大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract:Label noise is pervasive in various real-world scenarios, posing challenges in supervised deep learning. Deep networks are vulnerable to such label-corrupted samples due to the memorization effect. One major stream of previous methods concentrates on identifying clean data for training. However, these methods often neglect imbalances in label noise across different mini-batches and devote insufficient attention to out-of-distribution noisy data. To this end, we propose a noise-robust method named Jo-SNC (\textbfJoint sample selection and model regularization based on \textbfSelf- and \textbfNeighbor-\textbfConsistency). Specifically, we propose to employ the Jensen-Shannon divergence to measure the ``likelihood’’ of a sample being clean or out-of-distribution. This process factors in the nearest neighbors of each sample to reinforce the reliability of clean sample identification. We design a self-adaptive, data-driven thresholding scheme to adjust per-class selection thresholds. While clean samples undergo conventional training, detected in-distribution and out-of-distribution noisy samples are trained following partial label learning and negative learning, respectively. Finally, we advance the model performance further by proposing a triplet consistency regularization that promotes self-prediction consistency, neighbor-prediction consistency, and feature consistency. Extensive experiments on various benchmark datasets and comprehensive ablation studies demonstrate the effectiveness and superiority of our approach over existing state-of-the-art methods.
zh
[CV-138] SKANet: A Cognitive Dual-Stream Framework with Adaptive Modality Fusion for Robust Compound GNSS Interference Classification
【速读】:该论文旨在解决复杂电磁环境下全球导航卫星系统(GNSS)面临的复合干扰分类难题,尤其是传统深度学习方法在处理多种干扰源叠加时难以准确区分的问题。其关键解决方案是提出一种基于双流架构的认知深度学习框架SKANet,通过融合时频图像(TFI)与功率谱密度(PSD)特征,并引入多分支选择性卷积核(Multi-Branch Selective Kernel, SK)模块和异构卷积块(Asymmetric Convolution Blocks, ACBs),实现动态调整感受野的能力,从而同时捕捉瞬态微尺度特征与连续宏尺度频谱趋势;此外,在融合阶段嵌入挤压-激励(Squeeze-and-Excitation, SE)机制以自适应校准不同模态特征的贡献权重,显著提升了低干扰噪声比(JNR)条件下的分类鲁棒性。
链接: https://arxiv.org/abs/2601.12791
作者: Zhihan Zeng,Yang Zhao,Kaihe Wang,Dusit Niyato,Hongyuan Shu,Junchu Zhao,Yanjun Huang,Yue Xiu,Zhongpei Zhang,Ning Wei
机构: University of Electronic Science and Technology of China (电子科技大学); Nanyang Technological University (南洋理工大学); Shanghai Aerospace Electronic Technology Institute (上海航天电子技术研究所); Shanghai Key Laboratory of Collaborative Computing in Spacial Heterogeneous Networks (空间异构网络协同计算上海市重点实验室); Shanghai Xiaoyuan Innovation center (上海小渊创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As the electromagnetic environment becomes increasingly complex, Global Navigation Satellite Systems (GNSS) face growing threats from sophisticated jamming interference. Although Deep Learning (DL) effectively identifies basic interference, classifying compound interference remains difficult due to the superposition of diverse jamming sources. Existing single-domain approaches often suffer from performance degradation because transient burst signals and continuous global signals require conflicting feature extraction scales. We propose the Selective Kernel and Asymmetric convolution Network(SKANet), a cognitive deep learning framework built upon a dual-stream architecture that integrates Time-Frequency Images (TFIs) and Power Spectral Density (PSD). Distinct from conventional fusion methods that rely on static receptive fields, the proposed architecture incorporates a Multi-Branch Selective Kernel (SK) module combined with Asymmetric Convolution Blocks (ACBs). This mechanism enables the network to dynamically adjust its receptive fields, acting as an adaptive filter that simultaneously captures micro-scale transient features and macro-scale spectral trends within entangled compound signals. To complement this spatial-temporal adaptation, a Squeeze-and-Excitation (SE) mechanism is integrated at the fusion stage to adaptively recalibrate the contribution of heterogeneous features from each modality. Evaluations on a dataset of 405,000 samples demonstrate that SKANet achieves an overall accuracy of 96.99%, exhibiting superior robustness for compound jamming classification, particularly under low Jamming-to-Noise Ratio (JNR) regimes.
zh
[CV-139] VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
【速读】:该论文旨在解决神经符号推理方法在指代表达理解(Referring Expression Comprehension, REC)任务中因假设中间推理步骤准确而导致的级联错误问题,即错误检测和无效关系会沿推理链传播,从而在图像中不存在目标时仍产生高置信度的误报。解决方案的关键在于提出验证集成推理算子(Verification-Integrated Reasoning Operators, VIRO),其核心是在每个推理步骤中嵌入轻量级的算子级验证器,用于验证输出结果(如对象存在性或空间关系)是否满足预设条件,从而在验证失败时有效识别并处理无目标场景,提升系统鲁棒性和可靠性。
链接: https://arxiv.org/abs/2601.12781
作者: Hyejin Park,Junhyuk Kwon,Suha Kwak,Jungseul Ok
机构: Pohang University of Science and Technology (POSTECH)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries 4 structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output, such as object existence or spatial relationship, thereby allowing the system to robustly handle no-target cases when verification conditions are not met. Our framework achieves state-of-the-art performance, reaching 61.1% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. Furthermore, VIRO shows superior computational efficiency in terms of throughput, high reliability with a program failure rate of less than 0.3%, and scalability through decoupled program generation from execution.
zh
[CV-140] Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image
【速读】:该论文旨在解决从单张图像中重建可动画化3D全头虚拟形象(animatable 3D full-head avatar)的问题,尤其针对大视角变化下现有方法性能退化、 realism(真实感)不足的挑战。其解决方案的关键在于:首先,利用预训练的3D生成对抗网络(3D GAN)中的丰富先验知识提取全局全头特征并提供多视角监督;其次,在UV空间中将高斯基元(Gaussian primitives)嵌入参数化人脸模型表面以实现高效动画控制;最后,通过利用UV空间和人脸结构的对称性,融合输入图像的局部细粒度特征与全局纹理信息,从而提升重建精度与视觉保真度。该框架可在一次前向传播中完成高质量3D建模与实时动画生成,支持360°渲染视角。
链接: https://arxiv.org/abs/2601.12770
作者: Shuling Zhao,Dan Xu
机构: HKUST(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Building 3D animatable head avatars from a single image is an important yet challenging problem. Existing methods generally collapse under large camera pose variations, compromising the realism of 3D avatars. In this work, we propose a new framework to tackle the novel setting of one-shot 3D full-head animatable avatar reconstruction in a single feed-forward pass, enabling real-time animation and simultaneous 360 ^\circ rendering views. To facilitate efficient animation control, we model 3D head avatars with Gaussian primitives embedded on the surface of a parametric face model within the UV space. To obtain knowledge of full-head geometry and textures, we leverage rich 3D full-head priors within a pretrained 3D generative adversarial network (GAN) for global full-head feature extraction and multi-view supervision. To increase the fidelity of the 3D reconstruction of the input image, we take advantage of the symmetric nature of the UV space and human faces to fuse local fine-grained input image features with the global full-head textures. Extensive experiments demonstrate the effectiveness of our method, achieving high-quality 3D full-head modeling as well as real-time animation, thereby improving the realism of 3D talking avatars.
zh
[CV-141] Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
【速读】:该论文旨在解决视频-文本检索(Video-Text Retrieval, VTR)中因视频固有冗余性及现有方法依赖粗粒度最终层特征而导致匹配精度受限的问题。其解决方案的关键在于提出HVP-Net(Hierarchical Visual Perception Network),通过从视觉编码器的多个中间层提取并精炼特征,逐级蒸馏不同语义层级上的显著视觉概念,从而在降低冗余的同时保留对对齐至关重要的细节信息,构建更鲁棒的视频表征,进而显著提升检索性能。
链接: https://arxiv.org/abs/2601.12768
作者: Zequn Xie,Boyun Zhang,Yuxiao Lin,Tao Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video’s inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce the HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our codes are available at this https URL.
zh
[CV-142] Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration
【速读】:该论文旨在解决零样本视觉-语言导航(Zero-shot Vision-and-Language Navigation, VLN)代理在复杂连续环境中的空间感知不足问题,特别是在门交互、多房间导航和指令执行模糊性等关键空间挑战下,现有方法普遍存在高失败率。解决方案的核心在于提出Spatial-VLN框架,其关键创新包括:一是空间感知增强(Spatial Perception Enhancement, SPE)模块,通过全景过滤与专用门识别及区域专家协同,生成跨视角一致的空间感知表示;二是探索式多专家推理(Explored Multi-expert Reasoning, EMR)模块,利用并行大语言模型(LLM)专家处理路径点级语义与区域级空间转换,并引入查询-探索机制主动探测关键区域以消除感知歧义。该框架仅使用低成本LLM即实现最优性能,并通过基于价值的路径点采样策略有效缩小仿真到现实(Sim2Real)差距,在真实场景中展现出卓越的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2601.12766
作者: Lu Yue,Yue Fan,Shiwei Lian,Yu Zhao,Jiaxin Yu,Liang Xie,Feitian Zhang
机构: Peking University (北京大学); Defense Innovation Institute, Academy of Military Sciences (军事科学院国防创新研究院); Tianjin Artificial Intelligence Innovation Center (天津人工智能创新中心); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:
Abstract:Zero-shot Vision-and-Language Navigation (VLN) agents leveraging Large Language Models (LLMs) excel in generalization but suffer from insufficient spatial perception. Focusing on complex continuous environments, we categorize key perceptual bottlenecks into three spatial challenges: door interaction,multi-room navigation, and ambiguous instruction execution, where existing methods consistently suffer high failure rates. We present Spatial-VLN, a perception-guided exploration framework designed to overcome these challenges. The framework consists of two main modules. The Spatial Perception Enhancement (SPE) module integrates panoramic filtering with specialized door and region experts to produce spatially coherent, cross-view consistent perceptual representations. Building on this foundation, our Explored Multi-expert Reasoning (EMR) module uses parallel LLM experts to address waypoint-level semantics and region-level spatial transitions. When discrepancies arise between expert predictions, a query-and-explore mechanism is activated, prompting the agent to actively probe critical areas and resolve perceptual ambiguities. Experiments on VLN-CE demonstrate that Spatial VLN achieves state-of-the-art performance using only low-cost LLMs. Furthermore, to validate real-world applicability, we introduce a value-based waypoint sampling strategy that effectively bridges the Sim2Real gap. Extensive real-world evaluations confirm that our framework delivers superior generalization and robustness in complex environments. Our codes and videos are available at this https URL.
zh
[CV-143] owards Unbiased Source-Free Object Detection via Vision Foundation Models
【速读】:该论文旨在解决源域偏置(Source Bias)问题,即现有无源域目标检测(Source-Free Object Detection, SFOD)方法在跨域任务中因模型仍偏向源域特征而导致泛化性能下降和自训练过程中的误差累积。解决方案的关键在于提出一种基于视觉基础模型(Vision Foundation Model, VFM)辅助的SFOD框架——DSOD,其核心包括两个模块:一是统一特征注入(Unified Feature Injection, UFI),通过简单尺度扩展(Simple-Scale Extension, SSE)与域感知自适应加权(Domain-aware Adaptive Weighting, DAAW)将VFM特征融合进CNN主干网络;二是语义感知特征正则化(Semantic-aware Feature Regularization, SAFR),通过约束特征学习避免对源域特征的过拟合。此外,还提出了无需VFM的DSOD-distill变体,采用双教师蒸馏策略以适应计算受限场景。实验表明,DSOD在多个基准上显著优于现有SOTA方法。
链接: https://arxiv.org/abs/2601.12765
作者: Zhi Cai,Yingjie Gao,Yanan Zhang,Xinzhu Ma,Di Huang
机构: Beihang University (北京航空航天大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Source-Free Object Detection (SFOD) has garnered much attention in recent years by eliminating the need of source-domain data in cross-domain tasks, but existing SFOD methods suffer from the Source Bias problem, i.e. the adapted model remains skewed towards the source domain, leading to poor generalization and error accumulation during self-training. To overcome this challenge, we propose Debiased Source-free Object Detection (DSOD), a novel VFM-assisted SFOD framework that can effectively mitigate source bias with the help of powerful VFMs. Specifically, we propose Unified Feature Injection (UFI) module that integrates VFM features into the CNN backbone through Simple-Scale Extension (SSE) and Domain-aware Adaptive Weighting (DAAW). Then, we propose Semantic-aware Feature Regularization (SAFR) that constrains feature learning to prevent overfitting to source domain characteristics. Furthermore, we propose a VFM-free variant, termed DSOD-distill for computation-restricted scenarios through a novel Dual-Teacher distillation scheme. Extensive experiments on multiple benchmarks demonstrate that DSOD outperforms state-of-the-art SFOD methods, achieving 48.1% AP on Normal-to-Foggy weather adaptation, 39.3% AP on Cross-scene adaptation, and 61.4% AP on Synthetic-to-Real adaptation.
zh
[CV-144] Moaw: Unleashing Motion Awareness for Video Diffusion Models
【速读】:该论文旨在解决如何更充分地利用视频扩散模型(video diffusion models)在运动感知与运动迁移任务中的潜力,尤其是在监督训练下提升其追踪能力的问题。解决方案的关键在于提出Moaw框架,通过将原本用于图像到视频生成的扩散模型转变为视频到密集追踪的任务,训练一个具备运动感知能力的模型,并构建一个带有运动标签的数据集以识别最强运动信息的特征;随后,这些特征被注入到结构相同的视频生成模型中,借助两网络间的同构性实现零样本(zero-shot)运动迁移,无需额外适配器即可完成可控的运动转移。
链接: https://arxiv.org/abs/2601.12761
作者: Tianqi Zhang,Ziyi Wang,Wenzhao Zheng,Weiliang Chen,Yuanhui Huang,Zhengyang Huang,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
zh
[CV-145] SSPFormer: Self-Supervised Pretrained Transformer for MRI Images IJCAI
【速读】:该论文旨在解决预训练Transformer模型直接应用于磁共振成像(MRI)时面临的两大挑战:一是难以适应医学解剖结构的特异性,二是受限于医疗数据的隐私性和稀缺性。其解决方案的关键在于提出一种自监督预训练Transformer模型(SSPFormer),通过利用未标注的原始MRI数据来学习领域特定的特征表示;具体而言,引入逆频率投影掩码机制以优先重建高频解剖区域,从而强化结构感知的表征学习;同时采用频域加权FFT噪声增强策略,在傅里叶域注入生理上合理的伪影噪声,提升模型对真实MRI伪影的鲁棒性。上述方法使模型能够从原始扫描中直接学习领域不变且抗伪影的特征,显著提升了在分割、超分辨率和去噪任务中的性能表现。
链接: https://arxiv.org/abs/2601.12747
作者: Jingkai Li,Xiaoze Tian,Yuhang Shen,Jia Wang,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng
机构: Qilu University of Technology (齐鲁工业大学); Second Hospital of Shandong University (山东大学第二医院); Shandong Normal University (山东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Undergraduate student as first author submitted to IJCAI
Abstract:The pre-trained transformer demonstrates remarkable generalization ability in natural image processing. However, directly transferring it to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.
zh
[CV-146] KaoLRM: Repurposing Pre-trained Large Reconstruction Models for Parametric 3D Face Reconstruction
【速读】:该论文旨在解决参数化三维人脸重建(parametric 3D face reconstruction)中因视角变化导致的重建一致性差的问题。现有基于三维形态模型(3DMM)的回归器在不同视角下常表现出几何和外观不一致的问题,限制了其鲁棒性。解决方案的关键在于引入预训练的大型重建模型(Large Reconstruction Model, LRM)的先验知识,并通过FLAME驱动的二维高斯点绘(2D Gaussian Splatting)重构其渲染管线:具体而言,KaoLRM将LRM的三平面特征投影至FLAME参数空间以恢复几何结构,并利用与FLAME网格紧密耦合的二维高斯基元建模外观,从而增强对3D结构的认知能力,实现跨视角下更准确、稳定的重建效果。
链接: https://arxiv.org/abs/2601.12736
作者: Qingtian Zhu,Xu Cao,Zhixiang Wang,Yinqiang Zheng,Takafumi Taketomi
机构: The University of Tokyo (东京大学); CyberAgent (CyberAgent)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose KaoLRM to re-target the learned prior of the Large Reconstruction Model (LRM) for parametric 3D face reconstruction from single-view images. Parametric 3D Morphable Models (3DMMs) have been widely used for facial reconstruction due to their compact and interpretable parameterization, yet existing 3DMM regressors often exhibit poor consistency across varying viewpoints. To address this, we harness the pre-trained 3D prior of LRM and incorporate FLAME-based 2D Gaussian Splatting into LRM’s rendering pipeline. Specifically, KaoLRM projects LRM’s pre-trained triplane features into the FLAME parameter space to recover geometry, and models appearance via 2D Gaussian primitives that are tightly coupled to the FLAME mesh. The rich prior enables the FLAME regressor to be aware of the 3D structure, leading to accurate and robust reconstructions under self-occlusions and diverse viewpoints. Experiments on both controlled and in-the-wild benchmarks demonstrate that KaoLRM achieves superior reconstruction accuracy and cross-view consistency, while existing methods remain sensitive to viewpoint variations. The code is released at this https URL.
zh
[CV-147] DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition
【速读】:该论文旨在解决视觉场景识别(Visual Place Recognition, VPR)中长期存在的挑战:如何在大视角变化、光照差异和严重域偏移下学习具有鲁棒性的全局表征。现有方法多依赖单一视觉基础模型(Visual Foundation Models, VFMs),忽略了不同VFMs间互补信息的利用;而直接融合这些互补特征会改变token分布,破坏基于查询的全局聚合机制的稳定性。解决方案的关键在于提出DC-VLAQ框架,其核心创新包括两部分:一是轻量级残差引导的互补融合机制,以DINOv2特征空间为锚点,通过学习残差校正注入CLIP的互补语义;二是向量化局部聚合查询(Vector of Local Aggregated Queries, VLAQ),一种基于查询-残差响应的全局聚合策略,能够稳定地编码局部token,并保留细粒度判别性特征。实验表明,该方法在多个标准VPR数据集上显著优于现有基线,尤其在域偏移和长期外观变化场景下表现优异。
链接: https://arxiv.org/abs/2601.12729
作者: Hanyu Zhu,Zhihao Zhan,Yuhang Ming,Liang Li,Dibo Hou,Javier Civera,Wanzeng Kong
机构: BCCITA Provincial Key Laboratory (BCCITA省级重点实验室), Hangzhou Dianzi University (杭州电子科技大学), China; TopXGun Robotics (TopXGun机器人公司), China; ICT State Key Laboratory (ICT国家重点实验室), Zhejiang University (浙江大学), China; I3A, University of Zaragoza (萨拉戈萨大学), Spain
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 10 pages, 4 figures, 5 tables
Abstract:One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. However, exploiting such complementary information inevitably alters token distributions, which challenges the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates the fusion of complementary VFMs and robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Local Aggregated Queries (VLAQ), a query–residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, resulting in improved stability and the preservation of fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, demonstrate that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.
zh
[CV-148] S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation
【速读】:该论文旨在解决扩散模型(Diffusion Models)在视频生成任务中计算成本过高、难以实现实时或移动端部署的问题。针对这一挑战,作者提出S2DiT(Streaming Sandwich Diffusion Transformer),其核心解决方案包括:1)设计了一种基于预算感知的动态规划搜索方法,优化“夹心结构”(sandwich design)以实现高质量与高效率的平衡;2)引入两种新型高效注意力机制——线性卷积混合注意力(LinConv Hybrid Attention, LCHA)与步进自注意力(Stride Self-Attention, SSA),显著提升计算效率;3)构建一个两阶段蒸馏框架(2-in-1 distillation),将大尺寸教师模型(如Wan 2.2-14B)的知识迁移至轻量级的少步数夹心模型,从而在保持与服务器端先进模型相当的视频质量的同时,在iPhone上实现超过10 FPS的流式视频生成。
链接: https://arxiv.org/abs/2601.12719
作者: Lin Zhao,Yushu Wu,Aleksei Lebedev,Dishani Lahiri,Meng Dong,Arpit Sahni,Michael Vasilkovsky,Hao Chen,Ju Hu,Aliaksandr Siarohin,Sergey Tulyakov,Yanzhi Wang,Anil Kag,Yanyu Li
机构: Snap Inc.; Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, and streaming video generation on mobile hardware. S2DiT generates more tokens but maintains efficiency with novel efficient attentions: a mixture of LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Based on this, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models, while streaming at over 10 FPS on an iPhone.
zh
[CV-149] RSOD: Reliability-Guided Sonar Image Object Detection with Extremely Limited Labels AAAI2026
【速读】:该论文旨在解决声呐图像中目标检测因标注数据极度稀缺而导致性能受限的问题。由于声呐图像纹理信息匮乏且易受噪声干扰,非专业人士难以提供精确的标注数据,进而影响模型训练效果。解决方案的关键在于提出一种教师-学生框架RSOD,其核心创新包括:(1)通过计算教师模型在不同视图下预测的一致性来生成可靠性评分;(2)设计基于对象混合的伪标签策略,有效利用未标注数据;(3)引入可靠性引导的自适应约束机制优化学生模型性能。该方法显著提升了小样本场景下的检测精度,在UATD数据集上仅用5%的标注数据即可达到使用100%标注数据训练基线模型的效果。
链接: https://arxiv.org/abs/2601.12715
作者: Chengzhou Li,Ping Guo,Guanchen Meng,Qi Jia,Jinyuan Liu,Zhu Liu,Xiaokang Liu,Yu Liu,Zhongxuan Luo,Xin Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026,9 pages,10 figures
Abstract:Object detection in sonar images is a key technology in underwater detection systems. Compared to natural images, sonar images contain fewer texture details and are more susceptible to noise, making it difficult for non-experts to distinguish subtle differences between classes. This leads to their inability to provide precise annotation data for sonar images. Therefore, designing effective object detection methods for sonar images with extremely limited labels is particularly important. To address this, we propose a teacher-student framework called RSOD, which aims to fully learn the characteristics of sonar images and develop a pseudo-label strategy suitable for these images to mitigate the impact of limited labels. First, RSOD calculates a reliability score by assessing the consistency of the teacher’s predictions across different views. To leverage this score, we introduce an object mixed pseudo-label method to tackle the shortage of labeled data in sonar images. Finally, we optimize the performance of the student by implementing a reliability-guided adaptive constraint. By taking full advantage of unlabeled data, the student can perform well even in situations with extremely limited labels. Notably, on the UATD dataset, our method, using only 5% of labeled data, achieves results that can compete against those of our baseline algorithm trained on 100% labeled data. We also collected a new dataset to provide more valuable data for research in the field of sonar.
zh
[CV-150] P2L-CA: An Effective Parameter Tuning Framework for Rehearsal-Free Multi-Label Class-Incremental Learning
【速读】:该论文旨在解决多标签类增量学习(Multi-label Class-Incremental Learning, MLCIL)中因全参数微调导致的高计算成本、内存缓冲区带来的存储开销,以及特征混淆和域差异难以有效缓解的问题。其解决方案的关键在于提出一种参数高效的框架P2L-CA,该框架包含两个核心模块:Prompt-to-Label(P2L)模块通过引入类别特定提示(class-specific prompts)解耦多标签表示,并利用语言先验强化语义-视觉对齐的稳定性;Continuous Adapter(CA)模块则采用轻量级适配器(lightweight adapters)缩小预训练模型与下游任务之间的域差距,从而提升模型的可塑性。实验表明,P2L-CA在MS-COCO和PASCAL VOC数据集上显著优于现有方法,在保持极低可训练参数的同时无需记忆缓冲区,展现出优异的泛化能力。
链接: https://arxiv.org/abs/2601.12714
作者: Songlin Dong,Jiangyang Li,Chenhao Ding,Zhiheng Ma,Haoyu Luo,Yuhang He,Yihong Gong
机构: Shenzhen University of Advanced Technology (深圳先进技术研究院); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 5 figures
Abstract:Multi-label Class-Incremental Learning aims to continuously recognize novel categories in complex scenes where multiple objects co-occur. However, existing approaches often incur high computational costs due to full-parameter fine-tuning and substantial storage overhead from memory buffers, or they struggle to address feature confusion and domain discrepancies adequately. To overcome these limitations, we introduce P2L-CA, a parameter-efficient framework that integrates a Prompt-to-Label module with a Continuous Adapter module. The P2L module leverages class-specific prompts to disentangle multi-label representations while incorporating linguistic priors to enforce stable semantic-visual alignment. Meanwhile, the CA module employs lightweight adapters to mitigate domain gaps between pre-trained models and downstream tasks, thereby enhancing model plasticity. Extensive experiments across standard and challenging MLCIL settings on MS-COCO and PASCAL VOC show that P2L-CA not only achieves substantial improvements over state-of-the-art methods but also demonstrates strong generalization in CIL scenarios, all while requiring minimal trainable parameters and eliminating the need for memory buffers.
zh
[CV-151] Fusing in 3D: Free-Viewpoint Fusion Rendering with a 3D Infrared-Visible Scene Representation
【速读】:该论文旨在解决现有红外-可见光图像融合方法在固定相机视角下进行二维融合时,难以全面理解复杂场景、导致关键场景信息丢失的问题。其解决方案的关键在于提出了一种新颖的红外-可见光高斯融合(Infrared-Visible Gaussian Fusion, IVGF)框架,该框架通过从多模态二维输入中重建场景几何结构,并直接渲染融合图像;其中核心创新是引入交叉模态调节(Cross-modal Adjustment, CMA)模块,通过调节高斯分布的不透明度来缓解跨模态冲突,同时设计融合损失函数引导CMA优化,从而有效保留双模态的特征优势。
链接: https://arxiv.org/abs/2601.12697
作者: Chao Yang,Deshui Miao,Chao Tian,Guoqing Zhu,Yameng Gu,Zhenyu He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
备注:
Abstract:Infrared-visible image fusion aims to integrate infrared and visible information into a single fused image. Existing 2D fusion methods focus on fusing images from fixed camera viewpoints, neglecting a comprehensive understanding of complex scenarios, which results in the loss of critical information about the scene. To address this limitation, we propose a novel Infrared-Visible Gaussian Fusion (IVGF) framework, which reconstructs scene geometry from multimodal 2D inputs and enables direct rendering of fused images. Specifically, we propose a cross-modal adjustment (CMA) module that modulates the opacity of Gaussians to solve the problem of cross-modal conflicts. Moreover, to preserve the distinctive features from both modalities, we introduce a fusion loss that guides the optimization of CMA, thus ensuring that the fused image retains the critical characteristics of each modality. Comprehensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method.
zh
[CV-152] GaussianTrimmer: Online Trimming Boundaries for 3DGS Segmentation
【速读】:该论文旨在解决基于3D高斯(3D Gaussians)的三维场景分割方法中因高斯尺度差异大而导致的分割边界锯齿状问题,尤其是大尺寸高斯常同时覆盖前景与背景,从而影响分割精度。其解决方案的关键在于提出一种在线边界修剪方法 GaussianTrimmer,该方法通过两个核心步骤实现:首先生成均匀且充分覆盖场景的虚拟相机;其次基于虚拟相机上的2D分割结果,在原始高斯原语层面进行边界裁剪,从而有效改善现有3D高斯分割方法的边界平滑性和准确性。
链接: https://arxiv.org/abs/2601.12683
作者: Liwei Liao,Ronggang Wang
机构: Peking University Shenzhen Graduate School (北京大学深圳研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:With the widespread application of 3D Gaussians in 3D scene representation, 3D scene segmentation methods based on 3D Gaussians have also gradually emerged. However, existing 3D Gaussian segmentation methods basically segment on the basis of Gaussian primitives. Due to the large variation range of the scale of 3D Gaussians, large-sized Gaussians that often span the foreground and background lead to jagged boundaries of segmented objects. To this end, we propose an online boundary trimming method, GaussianTrimmer, which is an efficient and plug-and-play post-processing method capable of trimming coarse boundaries for existing 3D Gaussian segmentation methods. Our method consists of two core steps: 1. Generating uniformly and well-covered virtual cameras; 2. Trimming Gaussian at the primitive level based on 2D segmentation results on virtual cameras. Extensive quantitative and qualitative experiments demonstrate that our method can improve the segmentation quality of existing 3D Gaussian segmentation methods as a plug-and-play method.
zh
[CV-153] Fusion-Restoration Image Processing Algorithm to Improve the High-Temperature Deformation Measurement
【速读】:该论文旨在解决高温结构变形测量中因热辐射和热晕(heat haze)导致的图像退化问题,从而提升数字图像相关法(Digital Image Correlation, DIC)在高温变形测量中的精度与有效性。其关键解决方案包括:针对热辐射引起的图像退化,提出基于图像分层表示的多曝光图像融合算法,将图像分解为正负通道并行处理后优化质量;针对热晕引入的高频随机误差,采用FSIM(Feature Similarity Index)作为目标函数引导模型参数迭代优化,并结合灰度平均算法校正异常灰度值,有效降低静态热变形测量误差。实验表明,该方法显著提升了图像可用计算区域(从26%提升至50%),同时大幅减少应变测量误差(ε_xx减少85.3%,ε_yy和γ_xy分别减少36.0%和36.4%)。
链接: https://arxiv.org/abs/2601.12682
作者: Banglei Guan,Dongcai Tan,Jing Tao,Ang Su,Yang Shang,Qifeng Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In the deformation measurement of high-temperature structures, image degradation caused by thermal radiation and random errors introduced by heat haze restrict the accuracy and effectiveness of deformation measurement. To suppress thermal radiation and heat haze using fusion-restoration image processing methods, thereby improving the accuracy and effectiveness of DIC in the measurement of high-temperature deformation. For image degradation caused by thermal radiation, based on the image layered representation, the image is decomposed into positive and negative channels for parallel processing, and then optimized for quality by multi-exposure image fusion. To counteract the high-frequency, random errors introduced by heat haze, we adopt the FSIM as the objective function to guide the iterative optimization of model parameters, and the grayscale average algorithm is applied to equalize anomalous gray values, thereby reducing measurement error. The proposed multi-exposure image fusion algorithm effectively suppresses image degradation caused by complex illumination conditions, boosting the effective computation area from 26% to 50% for under-exposed images and from 32% to 40% for over-exposed images without degrading measurement accuracy in the experiment. Meanwhile, the image restoration combined with the grayscale average algorithm reduces static thermal deformation measurement errors. The error in \epsilon_xx is reduced by 85.3%, while the errors in \epsilon_yy and \gamma_xy are reduced by 36.0% and 36.4%, respectively. We present image processing methods to suppress the interference of thermal radiation and heat haze in high-temperature deformation measurement using DIC. The experimental results verify that the proposed method can effectively improve image quality, reduce deformation measurement errors, and has potential application value in thermal deformation measurement.
zh
[CV-154] VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness AAAI2026
【速读】:该论文旨在解决自动驾驶(Autonomous Driving, AD)系统安全部署中的长尾问题(long-tail problem),即现实数据中罕见但关键的驾驶场景严重缺失,导致现有方法难以有效应对复杂多样的边缘情况。解决方案的关键在于提出一种名为VILTA(VLM-In-the-Loop Trajectory Adversary)的新框架,其核心创新是将视觉语言模型(Vision Language Model, VLM)直接嵌入到AD代理的闭环训练循环中,通过细粒度地编辑周围交通参与者未来轨迹的方式,主动生成多样且具有挑战性的场景。这一机制充分利用了VLM强大的泛化能力,突破了传统两阶段方法中下游轨迹生成模型的性能上限,从而构建出超越传统手段的、更具代表性的安全训练课程,显著提升了AD策略在长尾事件中的鲁棒性和安全性。
链接: https://arxiv.org/abs/2601.12672
作者: Qimao Chen,Fang Li,Shaoqing Xu,Zhiyi Lai,Zixun Xie,Yuechen Luo,Shengyin Jiang,Hanbing Li,Long Chen,Bing Wang,Yi Zhang,Zhi-Xin Yang
机构: 1. Tsinghua University (清华大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. Beijing Academy of Artificial Intelligence (北京人工智能研究院); 4. Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026
Abstract:The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions including safety-critical scenario generation and closed-loop learning often rely on rule-based heuristics, resampling methods and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents’ future trajectories. This direct-editing approach fully leverages the VLM’s powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.
zh
[CV-155] Exploiting Test-Time Augmentation in Federated Learning for Brain Tumor MRI Classification
【速读】:该论文旨在解决脑肿瘤诊断中因病灶变异性和影像复杂性导致的高效准确分类难题,特别是在联邦学习(Federated Learning, FL)框架下如何提升MRI图像分类性能的问题。其解决方案的关键在于:在联邦学习设置中,单纯采用图像预处理(如重采样、灰度转换、归一化、滤波和直方图均衡化)效果有限,但若结合测试时增强(Test-Time Augmentation, TTA),可显著且一致地提升模型性能(p < 0.001);因此,TTA应作为FL医疗影像推理的默认策略,当计算资源允许时,进一步与轻量级预处理联合使用,能带来额外且可靠的性能增益。
链接: https://arxiv.org/abs/2601.12671
作者: Thamara Leandra de Deus Melo,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,André Ricardo Backes
机构: Federal University of Viçosa - UFV (维塞萨联邦大学); Federal University of São Carlos (圣卡洛斯联邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026), 9-11 March 2026, Marbella, Spain
Abstract:Efficient brain tumor diagnosis is crucial for early treatment; however, it is challenging because of lesion variability and image complexity. We evaluated convolutional neural networks (CNNs) in a federated learning (FL) setting, comparing models trained on original versus preprocessed MRI images (resizing, grayscale conversion, normalization, filtering, and histogram equalization). Preprocessing alone yielded negligible gains; combined with test-time augmentation (TTA), it delivered consistent, statistically significant improvements in federated MRI classification (p0.001). In practice, TTA should be the default inference strategy in FL-based medical imaging; when the computational budget permits, pairing TTA with light preprocessing provides additional reliable gains.
zh
[CV-156] Near-Light Color Photometric Stereo for mono-Chromaticity non-lambertian surface
【速读】:该论文旨在解决传统光度立体法(photometric stereo)在动态场景中应用受限的问题,特别是现有方法通常假设理想远距离光源和朗伯反射特性,难以处理实际中的近距光源条件及非朗伯表面。为克服这一局限,作者提出一种基于神经隐式表示(neural implicit representations)的框架,用于在单张图像下同时建模深度和双向反射分布函数(BRDF),并假设场景具有单色性(mono-chromaticity,即均匀色度与同质材质),从而缓解颜色光度立体法固有的病态性(ill-posedness),实现从单一图像中高精度、鲁棒的表面重建。该方案的关键在于利用神经隐式表示对复杂光照与材质关系进行联合建模,并通过设计紧凑型光学触觉传感器进行实验验证。
链接: https://arxiv.org/abs/2601.12666
作者: Zonglin Li,Jieji Ren,Shuangfan Zhou,Heng Guo,Jinnuo Zhang,Jiang Zhou,Boxin Shi,Zhanyu Ma,Guoying Gu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages 7figures
Abstract:Color photometric stereo enables single-shot surface reconstruction, extending conventional photometric stereo that requires multiple images of a static scene under varying illumination to dynamic scenarios. However, most existing approaches assume ideal distant lighting and Lambertian reflectance, leaving more practical near-light conditions and non-Lambertian surfaces underexplored. To overcome this limitation, we propose a framework that leverages neural implicit representations for depth and BRDF modeling under the assumption of mono-chromaticity (uniform chromaticity and homogeneous material), which alleviates the inherent ill-posedness of color photometric stereo and allows for detailed surface recovery from just one image. Furthermore, we design a compact optical tactile sensor to validate our approach. Experiments on both synthetic and real-world datasets demonstrate that our method achieves accurate and robust surface reconstruction.
zh
[CV-157] Generalizable Hyperparameter Optimization for Federated Learning on Non-IID Cancer Images
【速读】:该论文旨在解决深度学习在癌症组织病理学图像分析中训练时面临的隐私保护与模型性能之间的冲突问题。传统集中式训练需汇集敏感临床数据,违反隐私约束;而联邦学习(Federated Learning, FL)虽能保留数据本地性,但其性能高度依赖超参数选择,尤其在非独立同分布(non-IID)客户端数据场景下表现不稳定。解决方案的关键在于:通过中心化贝叶斯超参数优化获得特定数据集的最优配置,并提出一种简单的跨数据集聚合启发式方法——即对不同数据集的最优学习率进行平均,同时选取众数形式的优化器和批量大小(batch size),从而在非IID联邦设置中实现具有竞争力的分类性能。
链接: https://arxiv.org/abs/2601.12664
作者: Elisa Gonçalves Ribeiro,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,André Ricardo Backes
机构: Institute of Exact and Technological Sciences, Federal University of Viçosa - UFV (联邦大学维索萨分校精确与技术科学研究所); Department of Computing, Federal University of São Carlos (圣卡洛斯联邦大学计算机系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026), 9-11 March 2026, Marbella, Spain
Abstract:Deep learning for cancer histopathology training conflicts with privacy constraints in clinical settings. Federated Learning (FL) mitigates this by keeping data local; however, its performance depends on hyperparameter choices under non-independent and identically distributed (non-IID) client datasets. This paper examined whether hyperparameters optimized on one cancer imaging dataset generalized across non-IID federated scenarios. We considered binary histopathology tasks for ovarian and colorectal cancers. We perform centralized Bayesian hyperparameter optimization and transfer dataset-specific optima to the non-IID FL setup. The main contribution of this study is the introduction of a simple cross-dataset aggregation heuristic by combining configurations by averaging the learning rates and considering the modal optimizers and batch sizes. This combined configuration achieves a competitive classification performance.
zh
[CV-158] Mixed Precision PointPillars for Efficient 3D Object Detection with TensorRT
【速读】:该论文旨在解决激光雷达(LIDAR)3D目标检测模型在部署到自动驾驶系统时面临的实时性挑战,特别是由于LIDAR数据的宽广数值分布和极端异常值导致直接模型量化(quantization)常引起性能下降的问题。解决方案的关键在于提出一种面向PointPillars架构的混合精度(mixed precision)框架:首先通过逐层量化评估识别出对精度最敏感的top-k层,并将其保留为浮点(FP)格式;其余层采用8位整数(INT8)量化,结合贪心搜索策略生成候选混合精度模型,最终通过后训练量化(PTQ)或量化感知训练(QAT)优化;同时,通过使用少量校准数据减少异常值影响,显著提升PTQ性能。该方法在不依赖额外训练的前提下实现低延迟(最多降低2.35倍)与小模型尺寸(最多缩小2.26倍),且QAT方案可达到与全浮点模型相当的精度。
链接: https://arxiv.org/abs/2601.12638
作者: Ninnart Fuengfusin,Keisuke Yoneda,Naoki Suganuma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures
Abstract:LIDAR 3D object detection is one of the important tasks for autonomous vehicles. Ensuring that this task operates in real-time is crucial. Toward this, model quantization can be used to accelerate the runtime. However, directly applying model quantization often leads to performance degradation due to LIDAR’s wide numerical distributions and extreme outliers. To address the wide numerical distribution, we proposed a mixed precision framework designed for PointPillars. Our framework first searches for sensitive layers with post-training quantization (PTQ) by quantizing one layer at a time to 8-bit integer (INT8) and evaluating each model for average precision (AP). The top-k most sensitive layers are assigned as floating point (FP). Combinations of these layers are greedily searched to produce candidate mixed precision models, which are finalized with either PTQ or quantization-aware training (QAT). Furthermore, to handle outliers, we observe that using a very small number of calibration data reduces the likelihood of encountering outliers, thereby improving PTQ performance. Our methods provides mixed precision models without training in the PTQ pipeline, while our QAT pipeline achieves the performance competitive to FP models. With TensorRT deployment, our models offer less latency and sizes by up to 2.35 and 2.26 times, respectively.
zh
[CV-159] From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2 WACV2026
【速读】:该论文旨在解决Sentinel-2卫星衍生水深(SDB)在不同地点部署时的鲁棒性问题,即模型预测深度的准确性与可靠性在跨区域应用中易受环境干扰(如眩光、泡沫、水体光学特性差异)影响。其关键解决方案是提出一种基于Swin-Transformer的U-Net架构(Swin-BathyUNet),并通过改进的注意力机制和可解释性分析提升模型的泛化能力:首先引入解码器条件交叉注意力(decoder-conditioned cross-attention on skips)增强对浅水区干扰的鲁棒性;其次采用适配回归任务的A-CAM-R方法验证模型决策依据,证明其关注的是真实光学证据;最后通过跨区域推理实验揭示深度依赖性误差规律,并提出针对性策略,如保持宽感受野、保留绿/蓝波段辐射保真度、预滤近岸高亮异常值,以及结合轻量目标区域微调与深度感知校准以实现跨区域迁移。
链接: https://arxiv.org/abs/2601.12636
作者: Satyaki Roy Chowdhury,Aswathnarayan Radhakrishnan,Hsiao Jou Hsu,Hari Subramoni,Joachim Moortgat
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV 2026
Abstract:Deploying Sentinel-2 satellite derived bathymetry (SDB) robustly across sites remains challenging. We analyze a Swin-Transformer based U-Net model (Swin-BathyUNet) to understand how it infers depth and when its predictions are trustworthy. A leave-one-band out study ranks spectral importance to the different bands consistent with shallow water optics. We adapt ablation-based CAM to regression (A-CAM-R) and validate the reliability via a performance retention test: keeping only the top-p% salient pixels while neutralizing the rest causes large, monotonic RMSE increase, indicating explanations localize on evidence the model relies on. Attention ablations show decoder conditioned cross attention on skips is an effective upgrade, improving robustness to glint/foam. Cross-region inference (train on one site, test on another) reveals depth-dependent degradation: MAE rises nearly linearly with depth, and bimodal depth distributions exacerbate mid/deep errors. Practical guidance follows: maintain wide receptive fields, preserve radiometric fidelity in green/blue channels, pre-filter bright high variance near shore, and pair light target site fine tuning with depth aware calibration to transfer across regions.
zh
[CV-160] Linear Mechanisms for Spatiotemporal Reasoning in Vision Language Models
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)中时空推理能力的内在机制不明确的问题。现有研究虽已证实VLMs具备强大的时空推理能力,但其内部如何编码和利用空间与时间结构仍缺乏清晰解释。论文的关键解决方案是提出并验证了一种“时空ID”(spatiotemporal ID)机制:即VLMs通过线性绑定空间ID(spatial IDs)到文本激活向量来编码物体位置,并在中间层以语言标记进行推理;进一步通过严格的因果干预实验证明,这些普遍存在于模型各层的空间ID能系统性地调节模型信念,从而揭示出一种此前未被充分探索的内部推理路径。该机制不仅可作为诊断现有VLM局限性的工具,还可作为有效的学习信号,且其在视频VLM中也存在类似的时间ID机制,为提升模型可解释性及设计更对齐、更强能力的模型提供了理论基础和实践方向。
链接: https://arxiv.org/abs/2601.12626
作者: Raphi Kang,Hongqiao Chen,Georgia Gkioxari,Pietro Perona
机构: California Institute of Technology (加州理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spatio-temporal reasoning is a remarkable capability of Vision Language Models (VLMs), but the underlying mechanisms of such abilities remain largely opaque. We postulate that visual/geometrical and textual representations of spatial structure must be combined at some point in VLM computations. We search for such confluence, and ask whether the identified representation can causally explain aspects of input-output model behavior through a linear model. We show empirically that VLMs encode object locations by linearly binding \textitspatial IDs to textual activations, then perform reasoning via language tokens. Through rigorous causal interventions we demonstrate that these IDs, which are ubiquitous across the model, can systematically mediate model beliefs at intermediate VLM layers. Additionally, we find that spatial IDs serve as a diagnostic tool for identifying limitations in existing VLMs, and as a valuable learning signal. We extend our analysis to video VLMs and identify an analogous linear temporal ID mechanism. By characterizing our proposed spatiotemporal ID mechanism, we elucidate a previously underexplored internal reasoning process in VLMs, toward improved interpretability and the principled design of more aligned and capable models. We release our code for reproducibility: this https URL.
zh
[CV-161] owards Robust Universal Perturbation Attacks: A Float-Coded Penalty-Driven Evolutionary Approach
【速读】:该论文旨在解决通用对抗扰动(Universal Adversarial Perturbations, UAPs)在生成过程中难以平衡扰动可见性与攻击成功率的问题,尤其针对传统进化算法在高维、梯度不可获取的深度神经网络(Deep Neural Networks, DNNs)空间中效率低、收敛慢的挑战。其解决方案的关键在于提出一种浮点编码(float-coded)、惩罚驱动的单目标进化框架:通过连续基因表示适配现代深度学习模型规模,引入动态进化算子与自适应调度机制以提升搜索效率,并采用模块化PyTorch实现确保与主流架构的无缝集成;同时,通过跨模型测试和批次轮换策略保障扰动的普遍有效性,从而在ImageNet数据集上实现更低范数、更高误分类率及更快收敛速度的UAP生成。
链接: https://arxiv.org/abs/2601.12624
作者: Shiqi Wang,Mahdi Khosravy,Neeraj Gupta,Olaf Witkowski
机构: University of California, Los Angeles(加州大学洛杉矶分校); Cross Labs(Cross Labs); Cross-Compass Ltd.(Cross-Compass有限公司); Oakland University(奥克兰大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Universal adversarial perturbations (UAPs) have garnered significant attention due to their ability to undermine deep neural networks across multiple inputs using a single noise pattern. Evolutionary algorithms offer a promising approach to generating such perturbations due to their ability to navigate non-convex, gradient-free landscapes. In this work, we introduce a float-coded, penalty-driven single-objective evolutionary framework for UAP generation that achieves lower visibility perturbations while enhancing attack success rates. Our approach leverages continuous gene representations aligned with contemporary deep learning scales, incorporates dynamic evolutionary operators with adaptive scheduling, and utilizes a modular PyTorch implementation for seamless integration with modern architectures. Additionally, we ensure the universality of the generated perturbations by testing across diverse models and by periodically switching batches to prevent overfitting. Experimental results on the ImageNet dataset demonstrate that our framework consistently produces perturbations with smaller norms, higher misclassification effectiveness, and faster convergence compared to existing evolutionary-based methods. These findings highlight the robustness and scalability of our approach for universal adversarial attacks across various deep learning architectures.
zh
[CV-162] Camera Pose Revisited
【速读】:该论文旨在解决平面透视n点(Perspective-n-Point, PnP)问题中相机位姿的初始估计难题,尤其关注标定物体位姿的准确初始化。其解决方案的关键在于提出一种名为PnP-ProCay78的算法,该算法将经典的重建误差二次形式与旋转的Cayley参数化相结合,并采用最小二乘优化策略;其中核心创新是基于对两个典型向量的重建误差分析,实现确定性地选择优化起点,从而避免了耗时的解空间搜索过程,同时通过解析消除平移项的重构误差代理函数,构建出兼具几何直观性和计算效率的混合代价函数。
链接: https://arxiv.org/abs/2601.12567
作者: Władysław Skarbek,Michał Salomonowicz,Michał Król
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 9 figures, 9 tables
Abstract:Estimating the position and orientation of a camera with respect to an observed scene is one of the central problems in computer vision, particularly in the context of camera calibration and multi-sensor systems. This paper addresses the planar Perspective-- n --Point problem, with special emphasis on the initial estimation of the pose of a calibration object. As a solution, we propose the \textttPnP-ProCay78 algorithm, which combines the classical quadratic formulation of the reconstruction error with a Cayley parameterization of rotations and least-squares optimization. The key component of the method is a deterministic selection of starting points based on an analysis of the reconstruction error for two canonical vectors, allowing costly solution-space search procedures to be avoided. Experimental validation is performed using data acquired also from high-resolution RGB cameras and very low-resolution thermal cameras in an integrated RGB–IR setup. The results demonstrate that the proposed algorithm achieves practically the same projection accuracy as optimal \textttSQPnP and slightly higher than \textttIPPE, both prominent \textttPnP-OpenCV procedures. However, \textttPnP-ProCay78 maintains a significantly simpler algorithmic structure. Moreover, the analysis of optimization trajectories in Cayley space provides an intuitive insight into the convergence process, making the method attractive also from a didactic perspective. Unlike existing PnP solvers, the proposed \textttPnP-ProCay78 algorithm combines projection error minimization with an analytically eliminated reconstruction-error surrogate for translation, yielding a hybrid cost formulation that is both geometrically transparent and computationally efficient.
zh
[CV-163] Life Machine Learning and the Search for Habitability: Predicting Biosignature Fluxes for the Habitable Worlds Observatory AAAI-26
【速读】:该论文旨在解决未来直接成像旗舰任务(如NASA的宜居世界观测台,HWO)在极端时间与资源约束下如何高效优先选择观测目标的问题。解决方案的关键在于提出两种先进的机器学习架构:贝叶斯卷积神经网络(Bayesian Convolutional Neural Network, BCNN)和新型谱查询自适应变压器模型(Spectral Query Adaptive Transformer, SQuAT)。BCNN能够量化认知不确定性(epistemic uncertainty)和随机不确定性(aleatoric uncertainty),从而在不同观测条件下提供可靠预测;而SQuAT则通过查询驱动的注意力机制增强光谱特征与特定生物标志物物种通量之间的可解释性关联。两者均在扩展数据集上展现出高预测精度,并分别在不确定性量化和光谱可解释性方面具有独特优势,为加速目标筛选、优化观测计划并最大化科学产出提供了有力工具。
链接: https://arxiv.org/abs/2601.12557
作者: Mark Moussa,Amber V. Young,Brianna Isola,Vasuda Trehan,Michael D. Himes,Nicholas Wogan,Giada Arney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures. Submitted and accepted in AAAI-26 (IAAI Emerging Applications track)
Abstract:Future direct-imaging flagship missions, such as NASA’s Habitable Worlds Observatory (HWO), face critical decisions in prioritizing observations due to extremely stringent time and resource constraints. In this paper, we introduce two advanced machine-learning architectures tailored for predicting biosignature species fluxes from exoplanetary reflected-light spectra: a Bayesian Convolutional Neural Network (BCNN) and our novel model architecture, the Spectral Query Adaptive Transformer (SQuAT). The BCNN robustly quantifies both epistemic and aleatoric uncertainties, offering reliable predictions under diverse observational conditions, whereas SQuAT employs query-driven attention mechanisms to enhance interpretability by explicitly associating spectral features with specific biosignature species. We demonstrate that both models achieve comparably high predictive accuracy on an augmented dataset spanning a wide range of exoplanetary conditions, while highlighting their distinct advantages in uncertainty quantification and spectral interpretability. These capabilities position our methods as promising tools for accelerating target triage, optimizing observation schedules, and maximizing scientific return for upcoming flagship missions such as HWO.
zh
[CV-164] PISE: Physics-Anchored Semantically-Enhanced Deep Computational Ghost Imaging for Robust Low-Bandwidth Machine Perception
【速读】:该论文旨在解决低带宽边缘感知场景下,传统鬼成像(ghost imaging)方法在分类准确率低和方差大等问题。其解决方案的关键在于提出了一种物理信息引导的深度鬼成像框架(PISE),通过伴随算子初始化(adjoint operator initialization)与语义引导(semantic guidance)相结合的方式,有效提升了模型在稀疏采样条件下的性能,实现了在5%采样率下分类准确率提升2.57%,方差降低9倍。
链接: https://arxiv.org/abs/2601.12551
作者: Tong Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 4 pages, 4 figures, 3 tables. Submitted to IEICE Transactions
Abstract:We propose PISE, a physics-informed deep ghost imaging framework for low-bandwidth edge perception. By combining adjoint operator initialization with semantic guidance, PISE improves classification accuracy by 2.57% and reduces variance by 9x at 5% sampling.
zh
[CV-165] Encoding Emotion Through Self-Supervised Eye Movement Reconstruction
【速读】:该论文旨在解决如何利用自然场景下的低分辨率视频中眼动信息来预测多模态情绪表达的问题,从而突破传统高精度眼动设备的限制。其解决方案的关键在于提出一种基于自监督眼动重建的新型注视检测模型,该模型能够有效利用大量未标注视频数据进行预训练,并通过编码器嵌入(encoder embeddings)微调用于两个下游任务:一是将眼动与语音中的方向性情绪估计对齐,二是以眼动作为预测指标识别即时情绪行为(如笑、哭泣/抽泣和叹气)。实验表明,该方法能有效捕捉眼动所携带的情绪信号,且预训练性能与情绪处理表现呈正相关。
链接: https://arxiv.org/abs/2601.12534
作者: Marcus Ma,Jordan Prescott,Emily Zhou,Tiantian Feng,Kleanthis Avramidis,Gabor Mihaly Toth,Shrikanth Narayanan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The relationship between emotional expression and eye movement is well-documented, with literature establishing gaze patterns are reliable indicators of emotion. However, most studies utilize specialized, high-resolution eye-tracking equipment, limiting the potential reach of findings. We investigate how eye movement can be used to predict multimodal markers of emotional expression from naturalistic, low-resolution videos. We utilize a collection of video interviews from the USC Shoah Foundation’s Visual History Archive with Holocaust survivors as they recount their experiences in the Auschwitz concentration camp. Inspired by pretraining methods on language models, we develop a novel gaze detection model that uses self-supervised eye movement reconstruction that can effectively leverage unlabeled video. We use this model’s encoder embeddings to fine-tune models on two downstream tasks related to emotional expression. The first is aligning eye movement with directional emotion estimates from speech. The second task is using eye gaze as a predictor of three momentary manifestations of emotional behaviors: laughing, crying/sobbing, and sighing. We find our new model is predictive of emotion outcomes and observe a positive correlation between pretraining performance and emotion processing performance for both experiments. We conclude self-supervised eye movement reconstruction is an effective method for encoding the affective signal they carry.
zh
[CV-166] BirdsEye-RU: A Dataset For Detecting Faces from Overhead Images
【速读】:该论文旨在解决在俯视图像中检测人脸的难题,该问题主要源于极端的尺度变化和环境杂乱。解决方案的关键在于构建了一个名为BirdsEye-RU的数据集,该数据集包含2,978张图像和超过8,000个标注的人脸,专门用于捕捉不同环境中远距离、小尺寸的人脸目标,涵盖无人机与高空智能手机拍摄的图像。这一数据集为训练和评估面向复杂场景下的人脸检测模型提供了高质量的基准资源。
链接: https://arxiv.org/abs/2601.12533
作者: Md. Ahanaf Arif Khan,Ariful Islam,Sangeeta Biswas,Md. Iqbal Aziz Khan,Subrata Pramanik,Sanjoy Kumar Chakrabarty,Bimal Kumar Pramanik
机构: Rajshahi University (拉杰沙希大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting faces in overhead images remains a significant challenge due to extreme scale variations and environmental clutter. To address this, we created the BirdsEye-RU dataset, a comprehensive collection of 2,978 images containing over eight thousand annotated faces. This dataset is specifically designed to capture small and distant faces across diverse environments, containing both drone images and smartphone-captured images from high altitude. We present a detailed description of the BirdsEye-RU dataset in this paper. We made our dataset freely available to the public, and it can be accessed at this https URL.
zh
[CV-167] XRefine: Attention-Guided Keypoint Match Refinement
【速读】:该论文旨在解决稀疏关键点匹配(sparse keypoint matching)在三维视觉任务中因关键点检测器产生的空间位置不准确而导致的几何估计误差问题。现有精修方法通常依赖于特定检测器的内部表示进行匹配点对齐,导致泛化能力差且需为每个检测器重新训练。其解决方案的关键在于提出一种检测器无关(detector-agnostic)的子像素级关键点精修方法 XRefine,该方法基于交叉注意力(cross-attention)架构,仅使用以匹配关键点为中心的图像块作为输入,学习预测更精确的关键点坐标,从而实现跨检测器的通用性与高精度,同时可扩展至多视角特征轨迹处理。
链接: https://arxiv.org/abs/2601.12530
作者: Jan Fabian Schmid,Annika Hagemann
机构: Bosch Research (博世研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sparse keypoint matching is crucial for 3D vision tasks, yet current keypoint detectors often produce spatially inaccurate matches. Existing refinement methods mitigate this issue through alignment of matched keypoint locations, but they are typically detector-specific, requiring retraining for each keypoint detector. We introduce XRefine, a novel, detector-agnostic approach for sub-pixel keypoint refinement that operates solely on image patches centered at matched keypoints. Our cross-attention-based architecture learns to predict refined keypoint coordinates without relying on internal detector representations, enabling generalization across detectors. Furthermore, XRefine can be extended to handle multi-view feature tracks. Experiments on MegaDepth, KITTI, and ScanNet demonstrate that the approach consistently improves geometric estimation accuracy, achieving superior performance compared to existing refinement methods while maintaining runtime efficiency. Our code and trained models can be found at this https URL.
zh
[CV-168] Deep Feature Deformation Weights
【速读】:该论文旨在解决传统基于手柄的网格变形方法(handle-based mesh deformation)与现代数据驱动方法之间的权衡问题:前者虽能实现快速、精确的控制,但依赖用户对控制手柄分布的先验知识,且映射关系非语义;后者虽具备语义感知能力,但计算效率低且精度不足。解决方案的关键在于将深度特征的语义先验与经典框架的高效性相结合,通过深度特征相似性直接生成平滑且语义一致的变形权重,无需额外正则化或优化过程,从而实现实时计算和语义部件协同变形。进一步地,作者提出改进的重心特征蒸馏(barycentric feature distillation)机制,利用形状渲染图像信号降低蒸馏成本,使高分辨率网格(百万面级别)在不到一分钟内完成权重计算,同时保留经典方法的局部性和空间约束特性,并支持自动检测对称性以生成保持对称性的形变结果。
链接: https://arxiv.org/abs/2601.12527
作者: Richard Liu,Itai Lang,Rana Hanocka
机构: University of Chicago (芝加哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by control handle placement, requiring a user to know apriori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage a data prior to obtain semantic edits, but are slow and imprecise. We propose a technique that fuses the semantic prior of data with the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proximity makes for smooth and semantic deformation weights, with no need for additional regularization. The weights can be computed in real-time for any surface point, whereas prior methods require optimization for new handles. Moreover, the semantic prior from deep features enables co-deformation of semantic parts. We introduce an improved feature distillation pipeline, barycentric feature distillation, which efficiently uses the visual signal from shape renders to minimize distillation cost. This allows our weights to be computed for high resolution meshes in under a minute, in contrast to potentially hours for both classical and neural methods. We preserve and extend properties of classical methods through feature space constraints and locality weighting. Our field representation allows for automatic detection of semantic symmetries, which we use to produce symmetry-preserving deformations. We show a proof-of-concept application which can produce deformations for meshes up to 1 million faces in real-time on a consumer-grade machine.
zh
[CV-169] Fine-Tuning Cycle-GAN for Domain Adaptation of MRI Images
【速读】:该论文旨在解决多源磁共振成像(MRI)数据中存在的域偏移(domain shift)问题,即由于不同扫描设备、成像协议和参数差异导致的图像分布不一致,从而影响深度学习模型在目标域上的泛化性能。其解决方案的关键在于提出一种基于循环生成对抗网络(Cycle-GAN)的无监督医学图像域自适应方法,通过学习源域与目标域之间的双向映射关系,在无需配对样本的情况下实现跨域图像转换,同时结合内容损失和差异损失以保留解剖结构信息并最小化域间差异,从而提升模型在新域数据上的诊断准确性与一致性。
链接: https://arxiv.org/abs/2601.12512
作者: Mohd Usama,Belal Ahmad,Faleh Menawer R Althiyabi
机构: Umea University (于默奥大学); National Taipei University of Technology (台北科技大学); King Fahd University of Petroleum and Minerals (沙特国王石油与矿业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures, 2 tables
Abstract:Magnetic Resonance Imaging (MRI) scans acquired from different scanners or institutions often suffer from domain shifts owing to variations in hardware, protocols, and acquisition parameters. This discrepancy degrades the performance of deep learning models trained on source domain data when applied to target domain images. In this study, we propose a Cycle-GAN-based model for unsupervised medical-image domain adaptation. Leveraging CycleGANs, our model learns bidirectional mappings between the source and target domains without paired training data, preserving the anatomical content of the images. By leveraging Cycle-GAN capabilities with content and disparity loss for adaptation tasks, we ensured image-domain adaptation while maintaining image integrity. Several experiments on MRI datasets demonstrated the efficacy of our model in bidirectional domain adaptation without labelled data. Furthermore, research offers promising avenues for improving the diagnostic accuracy of healthcare. The statistical results confirm that our approach improves model performance and reduces domain-related variability, thus contributing to more precise and consistent medical image analysis.
zh
[CV-170] SDCoNet: Saliency-Driven Multi-Task Collaborative Network for Remote Sensing Object Detection
【速读】:该论文旨在解决低质量遥感图像中因复杂背景、弱目标信号和小目标尺度导致的目标检测精度不足的问题,尤其是传统串行超分辨率(Super-Resolution, SR)与检测流程中存在的优化目标错位、特征冗余及任务间交互不足等挑战。其解决方案的关键在于提出一种基于显著性驱动的多任务协同网络(Saliency-Driven multi-task Collaborative Network, SDCoNet),通过共享编码器实现SR与检测任务的隐式特征融合,同时保留各任务特异性;引入多尺度显著性预测模块以选择关键token,聚焦弱目标区域并抑制背景噪声;并通过梯度路由策略缓解优化冲突,先稳定检测语义,再引导SR分支生成对检测有益的高频细节,从而实现跨任务协同优化与性能提升。
链接: https://arxiv.org/abs/2601.12507
作者: Ruo Qi,Linhui Dai,Yusong Qin,Chaolei Yang,Yanshan Li
机构: Shenzhen University (深圳大学); Guangdong Key Laboratory of Intelligent Information Processing (广东省智能信息处理重点实验室); Shenzhen Key Laboratory of Modern Communications and Information Processing (深圳市现代通信与信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In remote sensing images, complex backgrounds, weak object signals, and small object scales make accurate detection particularly challenging, especially under low-quality imaging conditions. A common strategy is to integrate single-image super-resolution (SR) before detection; however, such serial pipelines often suffer from misaligned optimization objectives, feature redundancy, and a lack of effective interaction between SR and detection. To address these issues, we propose a Saliency-Driven multi-task Collaborative Network (SDCoNet) that couples SR and detection through implicit feature sharing while preserving task specificity. SDCoNet employs the swin transformer-based shared encoder, where hierarchical window-shifted self-attention supports cross-task feature collaboration and adaptively balances the trade-off between texture refinement and semantic representation. In addition, a multi-scale saliency prediction module produces importance scores to select key tokens, enabling focused attention on weak object regions, suppression of background clutter, and suppression of adverse features introduced by multi-task coupling. Furthermore, a gradient routing strategy is introduced to mitigate optimization conflicts. It first stabilizes detection semantics and subsequently routes SR gradients along a detection-oriented direction, enabling the framework to guide the SR branch to generate high-frequency details that are explicitly beneficial for detection. Experiments on public datasets, including NWPU VHR-10-Split, DOTAv1.5-Split, and HRSSD-Split, demonstrate that the proposed method, while maintaining competitive computational efficiency, significantly outperforms existing mainstream algorithms in small object detection on low-quality remote sensing images. Our code is available at this https URL.
zh
[CV-171] Video Individual Counting and Tracking from Moving Drones: A Benchmark and Methods
【速读】:该论文旨在解决大规模密集人群场景下视频级计数与跟踪的难题,现有方法主要依赖固定摄像头采集的数据集,其空间覆盖有限,难以满足复杂大场景的需求。为突破这一瓶颈,作者提出利用移动无人机拍摄视频,并基于此构建了目前最大的视频级密集人群计数与跟踪数据集 MovingDroneCrowd++,涵盖多变飞行高度、视角和光照条件。解决方案的关键在于提出 GD3A(Global Density Map Decomposition via Descriptor Association),一种基于密度图的个体计数方法,通过最优传输结合自适应尘箱得分建立连续帧间行人描述符的像素级对应关系,从而将全局密度图分解为共享、流入和流出成分;在此基础上进一步设计 DVTrack,通过描述符投票机制实现描述符级匹配到实例级关联,显著提升跟踪性能。
链接: https://arxiv.org/abs/2601.12500
作者: Yaowu Fan,Jia Wan,Tao Han,Andy J. Ma,Antoni B. Chan
机构: Sun Yat-sen University (中山大学); Hong Kong University of Science and Technology (香港科技大学); City University of Hong Kong (香港城市大学); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Counting and tracking dense crowds in large-scale scenes is highly challenging, yet existing methods mainly rely on datasets captured by fixed cameras, which provide limited spatial coverage and are inadequate for large-scale dense crowd analysis. To address this limitation, we propose a flexible solution using moving drones to capture videos and perform video-level crowd counting and tracking of unique pedestrians across entire scenes. We introduce MovingDroneCrowd++, the largest video-level dataset for dense crowd counting and tracking captured by moving drones, covering diverse and complex conditions with varying flight altitudes, camera angles, and illumination. Existing methods fail to achieve satisfactory performance on this dataset. To this end, we propose GD3A (Global Density Map Decomposition via Descriptor Association), a density map-based video individual counting method that avoids explicit localization. GD3A establishes pixel-level correspondences between pedestrian descriptors across consecutive frames via optimal transport with an adaptive dustbin score, enabling the decomposition of global density maps into shared, inflow, and outflow components. Building on this framework, we further introduce DVTrack, which converts descriptor-level matching into instance-level associations through a descriptor voting mechanism for pedestrian tracking. Experimental results show that our methods significantly outperform existing approaches under dense crowds and complex motion, reducing counting error by 47.4 percent and improving tracking performance by 39.2 percent.
zh
[CV-172] Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation WACV2026
【速读】:该论文旨在解决数字病理图像中因染色、污染、模糊和噪声等严重域偏移(domain shift)导致视觉-语言模型(VLM)下游任务性能显著下降的问题。解决方案的关键在于提出Histopath-C基准,该基准通过模拟真实世界分布偏移的合成扰动来评估测试时适应(Test-Time Adaptation, TTA)机制,并进一步设计了LATTE方法——一种基于多文本模板的归纳式低秩适应策略,能够有效缓解VLM对多样化文本输入的敏感性,从而在多个病理图像数据集上优于为自然图像设计的先进TTA方法,验证了其在病理图像场景下鲁棒适应的有效性。
链接: https://arxiv.org/abs/2601.12493
作者: Mehrdad Noori,Gustavo Adolfo Vargas Hakim,David Osowiechi,Fereshteh Shakeri,Ali Bahri,Moslem Yazdanpanah,Sahar Dastani,Ismail Ben Ayed,Christian Desrosiers
机构: LIVIA, ÉTS Montreal, Canada (加拿大蒙特利尔大学); International Laboratory on Learning Systems (ILLS) (国际学习系统实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2026
Abstract:Medical Vision-language models (VLMs) have shown remarkable performances in various medical imaging domains such as histo-pathology by leveraging pre-trained, contrastive models that exploit visual and textual information. However, histopathology images may exhibit severe domain shifts, such as staining, contamination, blurring, and noise, which may severely degrade the VLM’s downstream performance. In this work, we introduce Histopath-C, a new benchmark with realistic synthetic corruptions designed to mimic real-world distribution shifts observed in digital histopathology. Our framework dynamically applies corruptions to any available dataset and evaluates Test-Time Adaptation (TTA) mechanisms on the fly. We then propose LATTE, a transductive, low-rank adaptation strategy that exploits multiple text templates, mitigating the sensitivity of histopathology VLMs to diverse text inputs. Our approach outperforms state-of-the-art TTA methods originally designed for natural images across a breadth of histopathology datasets, demonstrating the effectiveness of our proposed design for robust adaptation in histopathology images. Code and data are available at this https URL.
zh
[CV-173] NeuralFur: Animal Fur Reconstruction From Multi-View Images
【速读】:该论文旨在解决从多视角RGB图像中高保真重建动物毛发几何结构的难题,其核心挑战在于毛发的细粒度细节、自遮挡以及视点依赖的外观特性,且缺乏可用于学习不同动物毛发先验的数据集。解决方案的关键在于提出一种基于线段(strand-based)表示的多视角动物毛发建模方法,通过引入视觉语言模型(Vision Language Model, VLM)来获取特定部位毛发长度和结构的先验知识,并据此构建无毛几何体后生长毛发纤维;同时利用几何与光度损失监督重建过程,并借助VLM引导毛发生长方向与重力矢量的关系以缓解Gabor滤波器引起的朝向歧义问题,从而实现跨多种动物类型的有效泛化。
链接: https://arxiv.org/abs/2601.12481
作者: Vanessa Sklyarova,Berna Kabadayi,Anastasios Yiannakidis,Giorgio Becherini,Michael J. Black,Justus Thies
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); ETH Zürich (苏黎世联邦理工学院); Technical University of Darmstadt (达姆施塔特工业大学); University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: For additional results and code, please refer to this https URL
Abstract:Reconstructing realistic animal fur geometry from images is a challenging task due to the fine-scale details, self-occlusion, and view-dependent appearance of fur. In contrast to human hairstyle reconstruction, there are also no datasets that can be leveraged to learn a fur prior for different animals. In this work, we present a first multi-view-based method for high-fidelity 3D fur modeling of animals using a strand-based representation, leveraging the general knowledge of a vision language model. Given multi-view RGB images, we first reconstruct a coarse surface geometry using traditional multi-view stereo techniques. We then use a vision language model (VLM) system to retrieve information about the realistic length structure of the fur for each part of the body. We use this knowledge to construct the animal’s furless geometry and grow strands atop it. The fur reconstruction is supervised with both geometric and photometric losses computed from multi-view images. To mitigate orientation ambiguities stemming from the Gabor filters that are applied to the input images, we additionally utilize the VLM to guide the strands’ growth direction and their relation to the gravity vector that we incorporate as a loss. With this new schema of using a VLM to guide 3D reconstruction from multi-view inputs, we show generalization across a variety of animals with different fur types. For additional results and code, please refer to this https URL.
zh
[CV-174] DCAC: Dynamic Class-Aware Cache Creates Stronger Out-of-Distribution Detectors AAAI2026
【速读】:该论文旨在解决深度神经网络在测试阶段对分布外(Out-of-distribution, OOD)样本产生过度自信预测的问题。其解决方案的关键在于提出了一种无需训练的测试时校准模块DCAC(Dynamic Class-Aware Cache),该模块基于类特定观察——即OOD样本若被预测为同一类别,则彼此间视觉相似性高于与真实分布内(In-distribution, ID)样本的相似性——为每个ID类别维护独立缓存,收集高熵样本并利用缓存的视觉特征和预测概率,通过轻量级两层结构对输入样本的原始预测进行校准,从而有效缓解OOD样本上的过自信问题。
链接: https://arxiv.org/abs/2601.12468
作者: Yanqi Wu,Qichao Chen,Runhe Lai,Xinhua Lu,Jia-Xin Zhuang,Zhilin Zhao,Wei-Shi Zheng,Ruixuan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 9 figures, Accepted by AAAI2026
Abstract:Out-of-distribution (OOD) detection remains a fundamental challenge for deep neural networks, particularly due to overconfident predictions on unseen OOD samples during testing. We reveal a key insight: OOD samples predicted as the same class, or given high probabilities for it, are visually more similar to each other than to the true in-distribution (ID) samples. Motivated by this class-specific observation, we propose DCAC (Dynamic Class-Aware Cache), a training-free, test-time calibration module that maintains separate caches for each ID class to collect high-entropy samples and calibrate the raw predictions of input samples. DCAC leverages cached visual features and predicted probabilities through a lightweight two-layer module to mitigate overconfident predictions on OOD samples. This module can be seamlessly integrated with various existing OOD detection methods across both unimodal and vision-language models while introducing minimal computational overhead. Extensive experiments on multiple OOD benchmarks demonstrate that DCAC significantly enhances existing methods, achieving substantial improvements, i.e., reducing FPR95 by 6.55% when integrated with ASH-S on ImageNet OOD benchmark.
zh
[CV-175] Large-scale EM Benchmark for Multi-Organelle Instance Segmentation in the Wild
【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中细胞器实例级分割(instance-level segmentation)的准确性问题,特别是针对现有基于小规模、人工筛选数据集的基准测试无法捕捉真实世界EM数据中的异质性和大空间上下文信息的问题。其解决方案的关键在于构建了一个大规模、多来源的多细胞类型、多细胞器类别的实例分割基准数据集,包含超过10万张二维EM图像,并采用设计的连通性感知标签传播算法(connectivity-aware Label Propagation Algorithm, 3D LPA)进行标注,辅以专家校正,从而更真实地反映生物样本的复杂性与多样性。该数据集为评估模型在真实场景下的泛化能力提供了基础,揭示了当前局部上下文模型(如U-Net、SAM变体和Mask2Former)在处理具有全局分布形态的细胞器(如内质网)时的局限性,凸显了长程结构连续性建模与现实变异性的匹配难题。
链接: https://arxiv.org/abs/2601.12464
作者: Yanrui Lu,Danyang Chen,Haowen Xiao,Jiarui Zhu,Fukang Ge,Binqian Zou,Jiali Guan,Jiayin Liang,Yuting Wang,Ziqian Guan,Xiangcheng Bao,Jinhao Bi,Lin Gu,Jun He,Yingying Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate instance-level segmentation of organelles in electron microscopy (EM) is critical for quantitative analysis of subcellular morphology and inter-organelle interactions. However, current benchmarks, based on small, curated datasets, fail to capture the inherent heterogeneity and large spatial context of in-the-wild EM data, imposing fundamental limitations on current patch-based methods. To address these limitations, we developed a large-scale, multi-source benchmark for multi-organelle instance segmentation, comprising over 100,000 2D EM images across variety cell types and five organelle classes that capture real-world variability. Dataset annotations were generated by our designed connectivity-aware Label Propagation Algorithm (3D LPA) with expert refinement. We further benchmarked several state-of-the-art models, including U-Net, SAM variants, and Mask2Former. Our results show several limitations: current models struggle to generalize across heterogeneous EM data and perform poorly on organelles with global, distributed morphologies (e.g., Endoplasmic Reticulum). These findings underscore the fundamental mismatch between local-context models and the challenge of modeling long-range structural continuity in the presence of real-world variability. The benchmark dataset and labeling tool will be publicly released soon.
zh
[CV-176] Adversarial Defense in Vision-Language Models: An Overview
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在跨模态任务中易受复杂且难以察觉的对抗攻击影响的问题,这类攻击可能严重损害模型性能与系统安全性。其解决方案的关键在于系统性梳理并对比三类主流防御范式:训练时防御(Training-time Defense)、测试时自适应防御(Test-time Adaptation Defense)和无训练防御(Training-free Defense),分别通过对抗微调、推理时参数更新以及输入扰动或特征嵌入修改来提升模型鲁棒性,从而在不显著增加计算开销的前提下增强VLMs对多样化对抗样本的抵御能力。
链接: https://arxiv.org/abs/2601.12443
作者: Xiaowei Fu,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The widespread use of Vision Language Models (VLMs, e.g. CLIP) has raised concerns about their vulnerability to sophisticated and imperceptible adversarial attacks. These attacks could compromise model performance and system security in cross-modal tasks. To address this challenge, three main defense paradigms have been proposed: Training-time Defense, Test-time Adaptation Defense, and Training-free Defense. Training-time Defense involves modifying the training process, typically through adversarial fine-tuning to improve the robustness to adversarial examples. While effective, this approach requires substantial computational resources and may not generalize across all adversarial attacks. Test-time Adaptation Defense focuses on adapting the model at inference time by updating its parameters to handle unlabeled adversarial examples, offering flexibility but often at the cost of increased complexity and computational overhead. Training-free Defense avoids modifying the model itself, instead focusing on altering the adversarial inputs or their feature embeddings, which enforces input perturbations to mitigate the impact of attacks without additional training. This survey reviews the latest advancements in adversarial defense strategies for VLMs, highlighting the strengths and limitations of such approaches and discussing ongoing challenges in enhancing the robustness of VLMs.
zh
[CV-177] Constraint-Aware Neurosymbolic Uncertainty Quantification with Bayesian Deep Learning for Scientific Discovery
【速读】:该论文旨在解决科学人工智能(Scientific AI)中模型无法同时提供可信的不确定性估计并遵守领域约束的问题。现有不确定性量化方法缺乏融合符号化科学知识的机制,而神经符号方法则在确定性框架下运行,缺乏规范的不确定性建模能力。解决方案的关键在于提出约束感知的神经符号不确定性框架(Constraint-Aware Neurosymbolic Uncertainty Framework, CANUF),其核心创新是将贝叶斯深度学习与可微分符号推理相结合:通过自动化从科学文献中提取约束规则、采用带变分推断的概率神经主干网络,并引入可微分约束满足层以确保物理一致性,从而实现端到端的不确定性量化、约束满足与可解释性统一建模。
链接: https://arxiv.org/abs/2601.12442
作者: Shahnawaz Alam,Mohammed Mudassir Uddin,Mohammed Kaif Pasha
机构: Muffakham Jah College of Engineering and Technology (MJCET)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scientific Artificial Intelligence (AI) applications require models that deliver trustworthy uncertainty estimates while respecting domain constraints. Existing uncertainty quantification methods lack mechanisms to incorporate symbolic scientific knowledge, while neurosymbolic approaches operate deterministically without principled uncertainty modeling. We introduce the Constraint-Aware Neurosymbolic Uncertainty Framework (CANUF), unifying Bayesian deep learning with differentiable symbolic reasoning. The architecture comprises three components: automated constraint extraction from scientific literature, probabilistic neural backbone with variational inference, and differentiable constraint satisfaction layer ensuring physical consistency. Experiments on Materials Project (140,000+ materials), QM9 molecular properties, and climate benchmarks show CANUF reduces Expected Calibration Error by 34.7% versus Bayesian neural networks while maintaining 99.2% constraint satisfaction. Ablations reveal constraint-guided recalibration contributes 18.3% performance gain, with constraint extraction achieving 91.4% precision. CANUF provides the first end-to-end differentiable pipeline simultaneously addressing uncertainty quantification, constraint satisfaction, and interpretable explanations for scientific predictions.
zh
[CV-178] SkeFi: Cross-Modal Knowledge Transfer for Wireless Skeleton-Based Action Recognition
【速读】:该论文旨在解决基于无线传感器(如LiDAR和毫米波,mmWave)进行骨骼动作识别时面临的两大挑战:一是无线模态数据稀缺导致骨骼估计模型精度不足;二是无线传感器提取的骨骼关键点噪声较大,严重影响后续动作识别模型的性能。解决方案的关键在于提出SkeFi框架,其核心创新包括:(1) 通过从数据丰富的RGB模态中迁移知识的跨模态知识蒸馏方法,缓解无线传感器数据稀疏问题;(2) 设计增强型时间相关自适应图卷积(TC-AGC),结合帧间交互增强机制以应对因帧缺失或不连续带来的噪声问题;(3) 引入双时间卷积结构提升多尺度时间建模能力,从而有效融合时空特征,实现从噪声无线传感器中准确提取姿态与动作。
链接: https://arxiv.org/abs/2601.12432
作者: Shunyu Huang,Yunjiao Zhou,Jianfei Yang
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Published in IEEE Internet of Things Journal
Abstract:Skeleton-based action recognition leverages human pose keypoints to categorize human actions, which shows superior generalization and interoperability compared to regular end-to-end action recognition. Existing solutions use RGB cameras to annotate skeletal keypoints, but their performance declines in dark environments and raises privacy concerns, limiting their use in smart homes and hospitals. This paper explores non-invasive wireless sensors, i.e., LiDAR and mmWave, to mitigate these challenges as a feasible alternative. Two problems are addressed: (1) insufficient data on wireless sensor modality to train an accurate skeleton estimation model, and (2) skeletal keypoints derived from wireless sensors are noisier than RGB, causing great difficulties for subsequent action recognition models. Our work, SkeFi, overcomes these gaps through a novel cross-modal knowledge transfer method acquired from the data-rich RGB modality. We propose the enhanced Temporal Correlation Adaptive Graph Convolution (TC-AGC) with frame interactive enhancement to overcome the noise from missing or inconsecutive frames. Additionally, our research underscores the effectiveness of enhancing multiscale temporal modeling through dual temporal convolution. By integrating TC-AGC with temporal modeling for cross-modal transfer, our framework can extract accurate poses and actions from noisy wireless sensors. Experiments demonstrate that SkeFi realizes state-of-the-art performances on mmWave and LiDAR. The code is available at this https URL.
zh
[CV-179] ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models
【速读】:该论文旨在解决当前基于视频的世界模型(video-based world models)在机器人学习中普遍存在的问题:尽管这些模型在视觉生成质量上取得进展,但在物理真实性(physical fidelity)、动态一致性(dynamic consistency)、任务逻辑合理性(task logic)等方面表现不足,尤其在接触密集型操作任务中限制了其下游应用效果。解决方案的关键在于提出 ReWorld 框架,通过强化学习对流模型(flow-based world models)进行后训练对齐,以提升模型的物理合理性与任务完成能力;其核心创新包括构建一个大规模(约 23.5 万条样本)视频偏好数据集,并训练一个分层奖励模型(hierarchical reward model),该模型能够捕捉多维人类偏好,进而利用高效 PPO 风格算法实现对世界模型的精准对齐优化。
链接: https://arxiv.org/abs/2601.12428
作者: Baorui Peng,Wenyao Zhang,Liang Xu,Zekun Qi,Jiazhao Zhang,Hongsi Liu,Wenjun Zeng,Xin Jin
机构: Eastern Institute of Technology (东方理工大学); Georgia Institute of Technology (佐治亚理工学院); Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); University of Science and Technology of China (中国科学技术大学); Peking University (北京大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, video-based world models that learn to simulate the dynamics have gained increasing attention in robot learning. However, current approaches primarily emphasize visual generative quality while overlooking physical fidelity, dynamic consistency, and task logic, especially for contact-rich manipulation tasks, which limits their applicability to downstream tasks. To this end, we introduce ReWorld, a framework aimed to employ reinforcement learning to align the video-based embodied world models with physical realism, task completion capability, embodiment plausibility and visual quality. Specifically, we first construct a large-scale (~235K) video preference dataset and employ it to train a hierarchical reward model designed to capture multi-dimensional reward consistent with human preferences. We further propose a practical alignment algorithm that post-trains flow-based world models using this reward through a computationally efficient PPO-style algorithm. Comprehensive experiments and theoretical analysis demonstrate that ReWorld significantly improves the physical fidelity, logical coherence, embodiment and visual quality of generated rollouts, outperforming previous methods.
zh
[CV-180] HOT-POT: Optimal Transport for Sparse Stereo Matching
【速读】:该论文旨在解决稀疏特征(如人脸关键点)在立体视觉匹配中因参数敏感性导致的病态问题(ill-posedness),尤其是在存在遮挡、运动和相机畸变等挑战下的无监督匹配难题。其解决方案的关键在于从最优传输(Optimal Transport, OT)视角出发,利用相机几何中的线约束(line constraints),将投影点建模为(半)直线,并引入经典的极线距离(epipolar distance)与三维射线距离(3D ray distance)作为匹配质量的度量指标,将其构造成部分OT问题的成本函数,从而转化为高效可解的分配问题;此外,通过构建层次化OT模型进一步扩展至无监督目标匹配,实现了在人脸分析等场景下不同关键点标注规范间的高效匹配。
链接: https://arxiv.org/abs/2601.12423
作者: Antonin Clerc,Michael Quellmalz,Moritz Piening,Philipp Flotho,Gregor Kornhardt,Gabriele Steidl
机构: University of Bordeaux (波尔多大学); Technische Universität Berlin (柏林工业大学); Saarland University (萨尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注: 18 pages, 10 figures, 6 tables
Abstract:Stereo vision between images faces a range of challenges, including occlusions, motion, and camera distortions, across applications in autonomous driving, robotics, and face analysis. Due to parameter sensitivity, further complications arise for stereo matching with sparse features, such as facial landmarks. To overcome this ill-posedness and enable unsupervised sparse matching, we consider line constraints of the camera geometry from an optimal transport (OT) viewpoint. Formulating camera-projected points as (half)lines, we propose the use of the classical epipolar distance as well as a 3D ray distance to quantify matching quality. Employing these distances as a cost function of a (partial) OT problem, we arrive at efficiently solvable assignment problems. Moreover, we extend our approach to unsupervised object matching by formulating it as a hierarchical OT problem. The resulting algorithms allow for efficient feature and object matching, as demonstrated in our numerical experiments. Here, we focus on applications in facial analysis, where we aim to match distinct landmarking conventions.
zh
[CV-181] Weaknesses of Facial Emotion Recognition Systems
【速读】:该论文旨在解决从人脸中检测情绪这一机器学习问题,以支持更自然的人机交互(Human-Computer Interaction, HCI)。其关键解决方案包括:首先系统性地筛选出三种最具代表性的深度神经网络模型,并基于三个具有高多样性和大规模图像数据的公开数据集进行训练与评估;其次通过跨数据集测试和多维度实验比较模型性能,从而揭示现有方法在不同数据分布下的局限性,如情绪识别难度不均、相近情绪(如愤怒与厌恶)区分困难等问题,进而为改进情绪识别系统的鲁棒性和泛化能力提供实证依据。
链接: https://arxiv.org/abs/2601.12402
作者: Aleksandra Jamróz,Patrycja Wysocka,Piotr Garbat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Emotion detection from faces is one of the machine learning problems needed for human-computer interaction. The variety of methods used is enormous, which motivated an in-depth review of articles and scientific studies. Three of the most interesting and best solutions are selected, followed by the selection of three datasets that stood out for the diversity and number of images in them. The selected neural networks are trained, and then a series of experiments are performed to compare their performance, including testing on different datasets than a model was trained on. This reveals weaknesses in existing solutions, including differences between datasets, unequal levels of difficulty in recognizing certain emotions and the challenges in differentiating between closely related emotions.
zh
[CV-182] Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation AAAI2026
【速读】:该论文旨在解决当前基于扩散模型的3D场景生成方法在复杂多类别场景中难以有效解码潜在特征为准确点云对象的问题,尤其是现有自编码器无法将扩散生成的潜在特征正确还原为与目标类别一致的点云形状。其解决方案的关键在于提出一种类分区向量量化变分自编码器(Class-Partitioned Vector Quantized Variational Autoencoder, CPVQ-VAE),该模型通过引入类分区码本(class-partitioned codebook)实现类别感知的潜在空间映射,并设计类感知运行平均更新机制以缓解码本崩溃(codebook collapse)问题,从而在推理阶段直接从Latent-space Flow Matching Model(LFMM)生成的潜在特征和类别标签中重建高质量、无外部数据库依赖的点云场景。
链接: https://arxiv.org/abs/2601.12391
作者: Dasith de Silva Edirimuni,Ajmal Saeed Mian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026, Main Technical Track
Abstract:Most 3D scene generation methods are limited to only generating object bounding box parameters while newer diffusion methods also generate class labels and latent features. Using object size or latent feature, they then retrieve objects from a predefined database. For complex scenes of varied, multi-categorical objects, diffusion-based latents cannot be effectively decoded by current autoencoders into the correct point cloud objects which agree with target classes. We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to effectively decode object latent features, by employing a pioneering \textitclass-partitioned codebook where codevectors are labeled by class. To address the problem of \textitcodebook collapse , we propose a \textitclass-aware running average update which reinitializes dead codevectors within each partition. During inference, object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation, are consumed by the CPVQ-VAE. The CPVQ-VAE’s class-aware inverse look-up then maps generated latents to codebook entries that are decoded to class-specific point cloud shapes. Thereby, we achieve pure point cloud generation without relying on an external objects database for retrieval. Extensive experiments reveal that our method reliably recovers plausible point cloud scenes, with up to 70.4% and 72.3% reduction in Chamfer and Point2Mesh errors on complex living room scenes.
zh
[CV-183] A Hierarchical Benchmark of Foundation Models for Dermatology
【速读】:该论文旨在解决当前皮肤病变分类任务中因简化为二分类(如恶性与良性)而导致模型难以胜任细粒度差异化诊断的问题,从而影响其在临床工作流中的实际应用。其解决方案的关键在于引入一种分层评估框架,通过冻结基础模型的嵌入表示并训练轻量级适配器模型,在包含40个亚类别的DERM12345数据集上系统评估不同基础模型在四个临床粒度层级(40个亚类、15个主类、2和4个超类、以及二分类恶性/良性)的表现,揭示了“粒度差距”现象:通用医学基础模型(如MedImageInsights)在高层次筛查任务中表现优异,但细粒度分类能力不足;而专用于皮肤病学的基础模型(如Derm Foundation、MONET)则在细粒度子类识别中显著优于前者,表明针对特定临床需求需采用专业化建模策略以提升诊断支持系统的准确性与实用性。
链接: https://arxiv.org/abs/2601.12382
作者: Furkan Yuceyalcin,Abdurrahim Yilmaz,Burak Temelkuran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models have transformed medical image analysis by providing robust feature representations that reduce the need for large-scale task-specific training. However, current benchmarks in dermatology often reduce the complex diagnostic taxonomy to flat, binary classification tasks, such as distinguishing melanoma from benign nevi. This oversimplification obscures a model’s ability to perform fine-grained differential diagnoses, which is critical for clinical workflow integration. This study evaluates the utility of embeddings derived from ten foundation models, spanning general computer vision, general medical imaging, and dermatology-specific domains, for hierarchical skin lesion classification. Using the DERM12345 dataset, which comprises 40 lesion subclasses, we calculated frozen embeddings and trained lightweight adapter models using a five-fold cross-validation. We introduce a hierarchical evaluation framework that assesses performance across four levels of clinical granularity: 40 Subclasses, 15 Main Classes, 2 and 4 Superclasses, and Binary Malignancy. Our results reveal a “granularity gap” in model capabilities: MedImageInsights achieved the strongest overall performance (97.52% weighted F1-Score on Binary Malignancy detection) but declined to 65.50% on fine-grained 40-class subtype classification. Conversely, MedSigLip (69.79%) and dermatology-specific models (Derm Foundation and MONET) excelled at fine-grained 40-class subtype discrimination while achieving lower overall performance than MedImageInsights on broader classification tasks. Our findings suggest that while general medical foundation models are highly effective for high-level screening, specialized modeling strategies are necessary for the granular distinctions required in diagnostic support systems.
zh
[CV-184] Utilizing the Score of Data Distribution for Hyperspectral Anomaly Detection
【速读】:该论文旨在解决高光谱图像(Hyperspectral Images, HSIs)中的异常检测问题,即从背景中识别出具有独特光谱特征的异常目标。传统方法往往难以有效建模高维光谱数据的复杂分布,而本文基于高光谱数据满足流形假设(manifold hypothesis)这一前提——背景光谱集中于低维流形上,而异常光谱则偏离该流形——提出了一种基于得分函数(score)的生成模型(Score-based Generative Model, SGM)方法 ScoreAD。其核心创新在于利用训练好的SGM对每个光谱进行扰动后估计其得分(score),从而量化该光谱与背景流形的偏离程度,实现高效且准确的异常检测。
链接: https://arxiv.org/abs/2601.12379
作者: Jiahui Sheng,Yidan Shi,Shu Xiang,Xiaorun Li,Shuhan Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hyperspectral images (HSIs) are a type of image that contains abundant spectral information. As a type of real-world data, the high-dimensional spectra in hyperspectral images are actually determined by only a few factors, such as chemical composition and illumination. Thus, spectra in hyperspectral images are highly likely to satisfy the manifold hypothesis. Based on the hyperspectral manifold hypothesis, we propose a novel hyperspectral anomaly detection method (named ScoreAD) that leverages the time-dependent gradient field of the data distribution (i.e., the score), as learned by a score-based generative model (SGM). Our method first trains the SGM on the entire set of spectra from the hyperspectral image. At test time, each spectrum is passed through a perturbation kernel, and the resulting perturbed spectrum is fed into the trained SGM to obtain the estimated score. The manifold hypothesis of HSIs posits that background spectra reside on one or more low-dimensional manifolds. Conversely, anomalous spectra, owing to their unique spectral signatures, are considered outliers that do not conform to the background manifold. Based on this fundamental discrepancy in their manifold distributions, we leverage a generative SGM to achieve hyperspectral anomaly detection. Experiments on the four hyperspectral datasets demonstrate the effectiveness of the proposed method. The code is available at this https URL.
zh
[CV-185] CD-TWINSAFE: A ROS-enabled Digital Twin for Scene Understanding and Safety Emerging V2I Technology
【速读】:该论文旨在解决自动驾驶车辆在复杂交通环境中实时感知与安全决策的挑战,特别是如何通过车路协同(V2I)机制提升系统的安全性与响应效率。其解决方案的关键在于提出了一种基于车路协同的数字孪生架构(CD-TWINSAFE),该架构由车载驾驶栈和数字孪生栈并行运行组成:车载栈利用立体相机实现高频率(20 fps)场景理解,包括目标检测、特征提取(如速度、航向角)及碰撞时间(Time-to-Collision, TTC)等安全指标计算;数字孪生栈则在Unreal Engine 5中重建真实场景,并通过ROS2消息协议经4G网络传输实时数据,实现对车辆与障碍物状态的同步更新与安全预警反馈,从而确保系统具备高精度、低延迟的实时响应能力。
链接: https://arxiv.org/abs/2601.12373
作者: Amro Khaled,Farah Khaled,Omar Riad,Catherine M. Elias
机构: C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt; Computer Science and Engineering Department - Faculty of Media Engineering and Technology - German University in Cairo, Egypt
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:In this paper, the CD-TWINSAFE is introduced, a V2I-based digital twin for Autonomous Vehicles. The proposed architecture is composed of two stacks running simultaneously, an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera as well as returning safety alerts to the cockpit. The on-board stack is implemented on the vehicle side including 2 main autonomous modules; localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. Furthermore, the perception module is responsible for processing 20-fps images from stereo camera and understands the scene through two complementary pipelines. The pipeline are working on object detection and feature extraction including object velocity, yaw and the safety metrics time-to-collision and time-headway. The collected data form the driving stack are sent to the infrastructure side through the ROS-enabled architecture in the form of custom ROS2 messages and sent over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios to confirm the validity and real-time response of the proposed architecture.
zh
[CV-186] DepthCropSeg: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data
【速读】:该论文旨在解决当前作物分割模型在真实田间环境中泛化能力不足的问题,尤其是受限于标注数据稀缺导致的跨物种、跨场景性能下降。其关键解决方案是构建一个大规模(28,406张图像,涵盖30余种作物和15种环境条件)的跨物种与跨场景作物分割数据集,并基于ViT-Adapter架构引入动态上采样机制以增强细节感知能力,同时采用两阶段自训练(self-training)策略进行模型优化。该方法显著提升了模型在复杂场景(如夜间、高密度冠层及未见作物品种)下的分割精度,mIoU达到93.11%,超越监督基线模型和通用视觉基础模型(如Segmentation Anything Model, SAM),实现了作物分割的新SOTA(state-of-the-art)。
链接: https://arxiv.org/abs/2601.12366
作者: Jiafei Zhang,Songliang Cao,Binghui Xu,Yanan Li,Weiwei Jia,Tingting Wu,Hao Lu,Weijuan Hu,Zhiguo Han
机构: MetaPheno Laboratory (MetaPheno 实验室); PhenoTrait Technology Co., Ltd. (PhenoTrait 技术有限公司); Wuhan Digital Engineering Institute (武汉数字工程研究所); National Key Laboratory of Multispectral Information Intelligent Processing Technology (多光谱信息智能处理技术国家重点实验室); School of Artificial Intelligence and Automation (人工智能与自动化学院); Huazhong University of Science and Technology (华中科技大学); Hubei Key Laboratory of Intelligent Robot (湖北省智能机器人重点实验室); School of Computer Science and Engineering (计算机科学与工程学院); School of Artificial Intelligence (人工智能学院); Wuhan Institute of Technology (武汉工程大学); Beijing Agricultural Technology Extension Station (北京市农业技术推广站); College of Mechanical and Electronic Engineering (机电工程学院); Northwest A&F University (西北农林科技大学); Laboratory of Advanced Breeding Technologies (先进育种技术实验室); Institute of Genetics and Developmental Biology (遗传与发育生物学研究所); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 15 figures and 7 tables
Abstract:DepthCropSeg++: a foundation model for crop segmentation, capable of segmenting different crop species under open in-field environment. Crop segmentation is a fundamental task for modern agriculture, which closely relates to many downstream tasks such as plant phenotyping, density estimation, and weed control. In the era of foundation models, a number of generic large language and vision models have been developed. These models have demonstrated remarkable real world generalization due to significant model capacity and largescale datasets. However, current crop segmentation models mostly learn from limited data due to expensive pixel-level labelling cost, often performing well only under specific crop types or controlled environment. In this work, we follow the vein of our previous work DepthCropSeg, an almost unsupervised approach to crop segmentation, to scale up a cross-species and crossscene crop segmentation dataset, with 28,406 images across 30+ species and 15 environmental conditions. We also build upon a state-of-the-art semantic segmentation architecture ViT-Adapter architecture, enhance it with dynamic upsampling for improved detail awareness, and train the model with a two-stage selftraining pipeline. To systematically validate model performance, we conduct comprehensive experiments to justify the effectiveness and generalization capabilities across multiple crop datasets. Results demonstrate that DepthCropSeg++ achieves 93.11% mIoU on a comprehensive testing set, outperforming both supervised baselines and general-purpose vision foundation models like Segmentation Anything Model (SAM) by significant margins (+0.36% and +48.57% respectively). The model particularly excels in challenging scenarios including night-time environment (86.90% mIoU), high-density canopies (90.09% mIoU), and unseen crop varieties (90.09% mIoU), indicating a new state of the art for crop segmentation.
zh
[CV-187] From Prompts to Pavement: LMMs-based Agent ic Behavior-Tree Generation Framework for Autonomous Vehicles
【速读】:该论文旨在解决自动驾驶车辆(AVs)在复杂、不可预测的现实环境中缺乏自适应行为规划能力的问题。传统行为树(Behavior Trees, BTs)虽然结构清晰,但其静态特性导致难以应对动态场景,且依赖人工调参,限制了其在SAE Level 5自动化水平下的应用。解决方案的关键在于提出一种基于代理(agentic)的框架,利用大语言模型(LLMs)和多模态视觉模型(LVMs)实现行为树的实时生成与自适应调整:通过Descriptor代理进行场景关键性评估,Planner代理基于上下文学习制定高层子目标,Generator代理生成可执行的XML格式行为树子结构,并仅在基础行为树失效时触发,从而实现无需人工干预的自主导航与障碍规避。
链接: https://arxiv.org/abs/2601.12358
作者: Omar Y. Goba,Ahmed Y. Gado,Catherine M. Elias,Ahmed Hussein
机构: German University in Cairo (德国大学); C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems (认知驾驶研究车辆系统实验室); IAV GmbH (IAV GmbH)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.
zh
[CV-188] SimpleMatch: A Simple and Strong Baseline for Semantic Correspondence
【速读】:该论文旨在解决当前基于预训练大模型的语义对应方法在高分辨率输入下性能受限的问题,其核心挑战在于深度下采样导致相邻关键点特征不可逆融合,使得语义上不同的关键点因落入同一感受野(如16×16像素块)而难以区分。解决方案的关键在于提出SimpleMatch框架:首先设计轻量级上采样解码器,通过逐步将深层特征恢复至1/4分辨率以重建空间细节;其次引入多尺度监督损失,确保不同尺度下的特征具有判别能力;同时采用稀疏匹配与窗口定位策略优化训练内存消耗,降低51%。该方法在低分辨率(252×252)下仍取得优异性能(SPair-71k上PCK@0.1达84.1%),为语义对应任务提供了高效且实用的新基准。
链接: https://arxiv.org/abs/2601.12357
作者: Hailing Jin,Huiying Li
机构: Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in semantic correspondence have been largely driven by the use of pre-trained large-scale models. However, a limitation of these approaches is their dependence on high-resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi-scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window-based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair-71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: this https URL.
zh
[CV-189] MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
【速读】:该论文旨在解决当前深度研究代理(Deep Research Agents, DRAs)在多模态证据使用上的评估不足问题,现有基准主要聚焦于纯文本场景或短格式多模态问答,缺乏对端到端多模态证据整合能力的系统性评测。为此,作者提出MMDeepResearch-Bench(MMDR-Bench),一个包含140个专家设计任务、覆盖21个领域的多模态基准,每个任务提供图像-文本组合以评估模型的多模态理解与引用驱动的报告生成能力。其解决方案的关键在于:一是构建强调报告式合成与显式证据关联的任务设计,要求模型将视觉内容与来源声明相连接并保持叙事、引用和视觉参考之间的一致性;二是开发一套统一且可解释的评估流程——Formula-LLM Adaptive Evaluation(FLAE)、Trustworthy Retrieval-Aligned Citation Evaluation(TRACE)和Multimodal Support-Aligned Integrity Check(MOSAIC),分别从报告质量、引文对齐性和图文一致性三个维度提供细粒度评估信号,从而揭示生成质量、引文规范性和多模态锚定之间的系统权衡关系,识别出当前DRAs在多模态完整性方面的瓶颈。
链接: https://arxiv.org/abs/2601.12346
作者: Peizhou Huang,Zixuan Zhong,Zhongwei Wan,Donghao Zhou,Samiul Alam,Xin Wang,Zexin Li,Zhihao Dou,Li Zhu,Jing Xiong,Chaofan Tao,Yan Xu,Dimitrios Dimitriadis,Tuo Zhang,Mi Zhang
机构: OSU(俄亥俄州立大学); Amazon(亚马逊); UMich(密歇根大学); UCL(伦敦大学学院); CUHK(香港中文大学); UCR(加州大学河滨分校); CWRU(凯斯西储大学); HKU(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
zh
[CV-190] urbo-GoDec: Exploiting the Cluster Sparsity Prior for Hyperspectral Anomaly Detection
【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)中异常检测任务的性能瓶颈问题,特别是针对小尺寸异常目标检测效果不佳的问题。现有方法多基于背景低秩性和异常稀疏性的先验假设,但仅利用了异常的全局稀疏性,忽略了其在空间上常呈现为局部聚集的小簇这一特性(即“簇稀疏性”,Cluster Sparsity)。解决方案的关键在于将簇稀疏性先验引入经典的GoDec算法框架中,在GoDec的S-step中通过马尔可夫随机场(Markov Random Field, MRF)建模异常的空间聚集特性,并借助因子图上的消息传递机制计算每个像素属于异常的边缘概率,从而更精准地识别出小尺度、空间聚集的异常区域。该方法被命名为Turbo-GoDec,在三个真实HSI数据集上的实验表明其在小尺寸异常检测方面显著优于传统GoDec(LSMAD)及当前主流方法。
链接: https://arxiv.org/abs/2601.12337
作者: Jiahui Sheng,Xiaorun Li,Shuhan Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As a key task in hyperspectral image processing, hyperspectral anomaly detection has garnered significant attention and undergone extensive research. Existing methods primarily relt on two prior assumption: low-rank background and sparse anomaly, along with additional spatial assumptions of the background. However, most methods only utilize the sparsity prior assumption for anomalies and rarely expand on this hypothesis. From observations of hyperspectral images, we find that anomalous pixels exhibit certain spatial distribution characteristics: they often manifest as small, clustered groups in space, which we refer to as cluster sparsity of anomalies. Then, we combined the cluster sparsity prior with the classical GoDec algorithm, incorporating the cluster sparsity prior into the S-step of GoDec. This resulted in a new hyperspectral anomaly detection method, which we called Turbo-GoDec. In this approach, we modeled the cluster sparsity prior of anomalies using a Markov random field and computed the marginal probabilities of anomalies through message passing on a factor graph. Locations with high anomalous probabilities were treated as the sparse component in the Turbo-GoDec. Experiments are conducted on three real hyperspectral image (HSI) datasets which demonstrate the superior performance of the proposed Turbo-GoDec method in detecting small-size anomalies comparing with the vanilla GoDec (LSMAD) and state-of-the-art anomaly detection methods. The code is available at this https URL.
zh
[CV-191] FlowIID: Single-Step Intrinsic Image Decomposition via Latent Flow Matching
【速读】:该论文旨在解决**固有图像分解(Intrinsic Image Decomposition, IID)**中现有模型参数量过大、难以在资源受限或实时视觉应用中部署的问题。其解决方案的关键在于提出一种基于流匹配(Flow Matching)的新型架构 FlowIID,该架构结合了VAE引导的潜在空间与流匹配模块,实现了稳定且高效的反照率(albedo)与阴影(shading)分离,同时具备参数高效性和单步推理能力,在多个基准测试中表现优于或媲美现有方法。
链接: https://arxiv.org/abs/2601.12329
作者: Mithlesh Singla,Seema Kumari,Shanmuganathan Raman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Intrinsic Image Decomposition (IID) separates an image into albedo and shading components. It is a core step in many real-world applications, such as relighting and material editing. Existing IID models achieve good results, but often use a large number of parameters. This makes them costly to combine with other models in real-world settings. To address this problem, we propose a flow matching-based solution. For this, we design a novel architecture, FlowIID, based on latent flow matching. FlowIID combines a VAE-guided latent space with a flow matching module, enabling a stable decomposition of albedo and shading. FlowIID is not only parameter-efficient, but also produces results in a single inference step. Despite its compact design, FlowIID delivers competitive and superior results compared to existing models across various benchmarks. This makes it well-suited for deployment in resource-constrained and real-time vision applications.
zh
[CV-192] EmoKGEdit: Training-free Affective Injection via Visual Cue Transformation
【速读】:该论文旨在解决现有图像情感编辑方法难以从潜在内容表示中解耦情感线索的问题,导致情感表达弱化和视觉结构失真。其解决方案的关键在于提出一种无需训练的框架EmoKGEdit,核心创新包括构建多模态情感关联知识图谱(Multimodal Sentiment Association Knowledge Graph, MSA-KG),显式编码物体-属性-情感之间的因果链,并作为外部知识引导大模型进行链式思维推理,从而生成连贯的情感相关视觉提示;同时设计了一个解耦的结构-情感编辑模块,在潜在空间中明确分离情感属性与布局特征,确保目标情感有效注入的同时严格保持视觉空间一致性。
链接: https://arxiv.org/abs/2601.12326
作者: Jing Zhang,Bingjie Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11pages,10figures
Abstract:Existing image emotion editing methods struggle to disentangle emotional cues from latent content representations, often yielding weak emotional expression and distorted visual structures. To bridge this gap, we propose EmoKGEdit, a novel training-free framework for precise and structure-preserving image emotion editing. Specifically, we construct a Multimodal Sentiment Association Knowledge Graph (MSA-KG) to disentangle the intricate relationships among objects, scenes, attributes, visual clues and emotion. MSA-KG explicitly encode the causal chain among object-attribute-emotion, and as external knowledge to support chain of thought reasoning, guiding the multimodal large model to infer plausible emotion-related visual cues and generate coherent instructions. In addition, based on MSA-KG, we design a disentangled structure-emotion editing module that explicitly separates emotional attributes from layout features within the latent space, which ensures that the target emotion is effectively injected while strictly maintaining visual spatial coherence. Extensive experiments demonstrate that EmoKGEdit achieves excellent performance in both emotion fidelity and content preservation, and outperforms the state-of-the-art methods.
zh
[CV-193] Multi-Sensor Matching with HyperNetworks
【速读】:该论文旨在解决多模态图像匹配中因外观变化(如可见光与红外模态差异)导致的特征描述鲁棒性不足的问题。解决方案的关键在于引入轻量级超网络(hypernetwork)架构,通过两个核心机制增强Siamese CNN:一是利用超网络模块计算逐通道的自适应缩放与偏移(adaptive, per-channel scaling and shifting),实现细粒度权重调制;二是采用条件实例归一化(conditional instance normalization)在浅层提供模态特异性适配(modality-specific adaptation),从而在不显著增加模型规模的前提下提升对跨模态和跨域变化的适应能力。该方法在推理阶段保持高效,同时通过三元组损失与难负样本挖掘训练,在VIS-NIR及VIS-IR基准上达到当前最优性能。
链接: https://arxiv.org/abs/2601.12325
作者: Eli Passov,Nathan S. Netanyahu,Yosi Keller
机构: Bar-Ilan University (巴伊兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hypernetworks are models that generate or modulate the weights of another network. They provide a flexible mechanism for injecting context and task conditioning and have proven broadly useful across diverse applications without significant increases in model size. We leverage hypernetworks to improve multimodal patch matching by introducing a lightweight descriptor-learning architecture that augments a Siamese CNN with (i) hypernetwork modules that compute adaptive, per-channel scaling and shifting and (ii) conditional instance normalization that provides modality-specific adaptation (e.g., visible vs. infrared, VIS-IR) in shallow layers. This combination preserves the efficiency of descriptor-based methods during inference while increasing robustness to appearance shifts. Trained with a triplet loss and hard-negative mining, our approach achieves state-of-the-art results on VIS-NIR and other VIS-IR benchmarks and matches or surpasses prior methods on additional datasets, despite their higher inference cost. To spur progress on domain shift, we also release GAP-VIR, a cross-platform (ground/aerial) VIS-IR patch dataset with 500K pairs, enabling rigorous evaluation of cross-domain generalization and adaptation.
zh
[CV-194] GazeFormer-MoE: Context-Aware Gaze Estimation via CLIP and MoE Transformer ICASSP2026
【速读】:该论文旨在解决3D眼神估计(3D gaze estimation)任务中因光照变化、头部姿态差异、背景干扰等因素导致的精度不足问题。其解决方案的关键在于提出一种语义调制的多尺度Transformer架构:通过可学习的原型库(prototype banks)对CLIP全局特征进行条件化(分别建模光照、头部姿态、背景和注视方向),并将这些增强后的全局向量与CLIP补丁令牌(patch tokens)及高分辨率CNN令牌在统一注意力空间中融合;同时,采用路由/共享的专家混合(Mixture of Experts, MoE)结构替代部分前馈网络(FFN)模块,以提升模型的条件建模能力。实验表明,该方法在MPIIFaceGaze、EYEDIAP、Gaze360和ETH-XGaze等多个基准上实现了新的最先进性能,角误差显著降低,验证了各组件的有效性。
链接: https://arxiv.org/abs/2601.12316
作者: Xinyuan Zhao,Xianrui Chen,Ahmad Chaddad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted at ICASSP 2026
Abstract:We present a semantics modulated, multi scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture of Experts to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360 and ETH-XGaze, our model achieves new state of the art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, demonstrating up to a 64% relative improvement over previously reported results. ablations attribute gains to prototype conditioning, cross scale fusion, MoE and hyperparameter. Our code is publicly available at https://github. com/AIPMLab/Gazeformer.
zh
[CV-195] S2F-Net:A Robust Spatial-Spectral Fusion Framework for Cross-Model AIGC Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 图像检测方法在面对未见过的生成模型架构时泛化能力不足的问题,即现有检测方法通常对特定源模型过拟合,导致在跨模型场景下性能显著下降。解决方案的关键在于提出一种名为 S²F-Net 的跨模型检测框架,其核心思想是利用真实图像与合成图像之间固有的频域差异——特别是上采样操作在纹理稀疏和丰富区域均会留下独特且可区分的频率指纹——通过引入一个可学习的频率注意力模块,协同空间纹理分析与频域特征,自适应地加权并增强具有判别性的频带,从而从根本上提升模型的泛化性能。在 AIGCDetectBenchmark(包含17类生成模型)上的实验表明,S²F-Net 在跨域检测场景中达到了90.49%的检测准确率,显著优于现有基线方法。
链接: https://arxiv.org/abs/2601.12313
作者: Xiangyu Hu,Yicheng Hong,Hongchuang Zheng,Wenjun Zeng,Bingyao Liu
机构: Guangdong Ocean University (广东海洋大学); Guangzhou University (广州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27pages 9figures
Abstract:The rapid development of generative models has imposed an urgent demand for detection schemes with strong generalization capabilities. However, existing detection methods generally suffer from overfitting to specific source models, leading to significant performance degradation when confronted with unseen generative architectures. To address these challenges, this paper proposes a cross-model detection framework called S 2 F-Net, whose core lies in exploring and leveraging the inherent spectral discrepancies between real and synthetic textures. Considering that upsampling operations leave unique and distinguishable frequency fingerprints in both texture-poor and texture-rich regions, we focus our research on the detection of frequency-domain artifacts, aiming to fundamentally improve the generalization performance of the model. Specifically, we introduce a learnable frequency attention module that adaptively weights and enhances discriminative frequency bands by synergizing spatial texture analysis and spectral this http URL the AIGCDetectBenchmark, which includes 17 categories of generative models, S 2 F-Net achieves a detection accuracy of 90.49%, significantly outperforming various existing baseline methods in cross-domain detection scenarios.
zh
[CV-196] CurConMix: A Unified Spatio-Temporal Framework for Hierarchical Surgical Workflow Understanding
【速读】:该论文旨在解决外科操作三元组识别(surgical action triplet recognition)中的关键挑战,包括类别严重不平衡、视觉特征细微差异以及三元组组件间的语义依赖关系。现有方法通常仅针对部分问题进行优化,难以实现整体性能提升。其解决方案的核心在于提出CurConMix+框架,该框架基于课程引导的对比学习策略,结合结构化难样本采样和特征级混合(feature-level mixup),以学习更具判别性和逐步关联性的空间特征;进一步通过多分辨率时序变换器(Multi-Resolution Temporal Transformer, MRTT)实现时序扩展,自适应融合多尺度时间特征并动态平衡时空线索,从而获得鲁棒且上下文感知的理解能力。此外,研究构建了LLS48数据集,提供从步骤到任务再到动作级别的分层标注,支持细粒度特征向更高层次阶段识别任务的有效迁移,显著提升了跨层级泛化能力。
链接: https://arxiv.org/abs/2601.12312
作者: Yongjun Jeon,Jongmin Shin,Kanggil Park,Seonmin Park,Soyoung Lim,Jung Yong Kim,Jinsoo Rhu,Jongman Kim,Gyu-Seong Choi,Namkee Oh,Kyu-Hwan Jung
机构: Samsung Advanced Institute for Health Sciences & Technology (SAIHST), Sungkyunkwan University (成均馆大学); Clinical Robotics and Embodied AI Research Center, Smart Healthcare Research Institute, Research Institute for Future Medicine, Samsung Medical Center (三星医疗中心); Department of Surgery, Samsung Medical Center (三星医疗中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical action triplet recognition aims to understand fine-grained surgical behaviors by modeling the interactions among instruments, actions, and anatomical targets. Despite its clinical importance for workflow analysis and skill assessment, progress has been hindered by severe class imbalance, subtle visual variations, and the semantic interdependence among triplet components. Existing approaches often address only a subset of these challenges rather than tackling them jointly, which limits their ability to form a holistic understanding. This study builds upon CurConMix, a spatial representation framework. At its core, a curriculum-guided contrastive learning strategy learns discriminative and progressively correlated features, further enhanced by structured hard-pair sampling and feature-level mixup. Its temporal extension, CurConMix+, integrates a Multi-Resolution Temporal Transformer (MRTT) that achieves robust, context-aware understanding by adaptively fusing multi-scale temporal features and dynamically balancing spatio-temporal cues. Furthermore, we introduce LLS48, a new, hierarchically annotated benchmark for complex laparoscopic left lateral sectionectomy, providing step-, task-, and action-level annotations. Extensive experiments on CholecT45 and LLS48 demonstrate that CurConMix+ not only outperforms state-of-the-art approaches in triplet recognition, but also exhibits strong cross-level generalization, as its fine-grained features effectively transfer to higher-level phase and step recognition tasks. Together, the framework and dataset provide a unified foundation for hierarchy-aware, reproducible, and interpretable surgical workflow understanding. The code and dataset will be publicly released on GitHub to facilitate reproducibility and further research.
zh
[CV-197] Adaptive Multi-Scale Correlation Meta-Network for Few-Shot Remote Sensing Image Classification ICASSP2026
【速读】:该论文旨在解决遥感领域中少样本学习(Few-shot Learning)面临的三大挑战:标注数据稀缺、显著的域偏移(domain shift)以及地理空间目标的多尺度特性。其核心解决方案是提出一种轻量但高效的框架——自适应多尺度相关元网络(Adaptive Multi-Scale Correlation Meta-Network, AMC-MetaNet),关键创新包括:(i) 基于相关性的特征金字塔结构以捕捉尺度不变模式;(ii) 自适应通道相关模块(Adaptive Channel Correlation Module, ACCM)用于学习动态跨尺度关系;(iii) 基于相关性引导的元学习机制,替代传统原型平均策略,从而更有效地利用少量样本中的语义信息。该方法在不依赖复杂预训练模型或Transformer架构的前提下,仅用约60万参数即可实现高精度(最高达86.65%)和低延迟(每张图像推理仅50ms),展现出良好的计算效率与尺度感知能力。
链接: https://arxiv.org/abs/2601.12308
作者: Anurag Kaushish,Ayan Sar,Sampurna Roy,Sudeshna Chakraborty,Prashant Trivedi,Tanupriya Choudhury,Kanav Gupta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted in IEEE ICASSP 2026
Abstract:Few-shot learning in remote sensing remains challenging due to three factors: the scarcity of labeled data, substantial domain shifts, and the multi-scale nature of geospatial objects. To address these issues, we introduce Adaptive Multi-Scale Correlation Meta-Network (AMC-MetaNet), a lightweight yet powerful framework with three key innovations: (i) correlation-guided feature pyramids for capturing scale-invariant patterns, (ii) an adaptive channel correlation module (ACCM) for learning dynamic cross-scale relationships, and (iii) correlation-guided meta-learning that leverages correlation patterns instead of conventional prototype averaging. Unlike prior approaches that rely on heavy pre-trained models or transformers, AMC-MetaNet is trained from scratch with only \sim600K parameters, offering 20\times fewer parameters than ResNet-18 while maintaining high efficiency ( 50 ms per image inference). AMC-MetaNet achieves up to 86.65% accuracy in 5-way 5-shot classification on various remote sensing datasets, including EuroSAT, NWPU-RESISC45, UC Merced Land Use, and AID. Our results establish AMC-MetaNet as a computationally efficient, scale-aware framework for real-world few-shot remote sensing.
zh
[CV-198] A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models ICASSP2026
【速读】:该论文旨在解决视觉-语言预训练(Vision-Language Pre-training, VLP)模型在黑盒场景下对对抗样本(adversarial examples)的脆弱性问题,尤其是现有多模态攻击方法存在扰动多样性不足和多阶段流程不稳定的问题。其解决方案的关键在于提出一种两阶段全局多样化攻击框架(2S-GDA),第一阶段通过候选文本扩展与全局感知替换策略引入文本扰动的多样性,第二阶段利用多尺度缩放和块洗牌旋转生成图像级扰动以提升视觉多样性;该框架模块化设计,可有效增强对抗样本的迁移性,在黑盒设置下相比当前最优方法攻击成功率最高提升11.17%。
链接: https://arxiv.org/abs/2601.12304
作者: Wutao Chen,Huaqin Zou,Chen Wan,Lifeng Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICASSP 2026
Abstract:Vision-language pre-training (VLP) models are vulnerable to adversarial examples, particularly in black-box scenarios. Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. To address these challenges, we propose 2S-GDA, a two-stage globally-diverse attack framework. The proposed method first introduces textual perturbations through a globally-diverse strategy by combining candidate text expansion with globally-aware replacement. To enhance visual diversity, image-level perturbations are generated using multi-scale resizing and block-shuffle rotation. Extensive experiments on VLP models demonstrate that 2S-GDA consistently improves attack success rates over state-of-the-art methods, with gains of up to 11.17% in black-box settings. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.
zh
[CV-199] Concepts from Representations: Post-hoc Concept Bottleneck Models via Sparse Decomposition of Visual Representations AAAI2026
【速读】:该论文旨在解决深度学习模型在关键领域部署时因缺乏可解释性而导致的信任问题,特别是现有基于概念的解释方法(如后验方法和先验概念瓶颈模型,Concept Bottleneck Models, CBMs)存在概念相关性不可靠、概念定义非可视化或劳动密集、以及模型与数据无关假设等局限。其解决方案的关键在于提出了一种名为“通过表示分解实现的后验概念瓶颈模型”(Post-hoc Concept Bottleneck Model via Representation Decomposition, PCBM-ReD)的新框架:该框架首先从预训练编码器中自动提取视觉概念,再利用多模态大语言模型(Multimodal Large Language Models, MLLMs)基于视觉可识别性和任务相关性对概念进行标注与筛选,并通过重建引导优化选择独立子集;最终借助CLIP的视觉-文本对齐能力,将图像表征分解为概念嵌入的线性组合,从而适配CBM的抽象结构,显著提升了模型性能与可解释性。
链接: https://arxiv.org/abs/2601.12303
作者: Shizhan Gong,Xiaofan Zhang,Qi Dou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026
Abstract:Deep learning has achieved remarkable success in image recognition, yet their inherent opacity poses challenges for deployment in critical domains. Concept-based interpretations aim to address this by explaining model reasoning through human-understandable concepts. However, existing post-hoc methods and ante-hoc concept bottleneck models (CBMs), suffer from limitations such as unreliable concept relevance, non-visual or labor-intensive concept definitions, and model or data-agnostic assumptions. This paper introduces Post-hoc Concept Bottleneck Model via Representation Decomposition (PCBM-ReD), a novel pipeline that retrofits interpretability onto pretrained opaque models. PCBM-ReD automatically extracts visual concepts from a pre-trained encoder, employs multimodal large language models (MLLMs) to label and filter concepts based on visual identifiability and task relevance, and selects an independent subset via reconstruction-guided optimization. Leveraging CLIP’s visual-text alignment, it decomposes image representations into linear combination of concept embeddings to fit into the CBMs abstraction. Extensive experiments across 11 image classification tasks show PCBM-ReD achieves state-of-the-art accuracy, narrows the performance gap with end-to-end models, and exhibits better interpretability.
zh
[CV-200] OpenNavMap: Structure-Free Topometric Mapping via Large-Scale Collaborative Localization
【速读】:该论文旨在解决大规模视觉导航中地图表示的可扩展性与可维护性问题,尤其是在多人多时段采集数据场景下,传统基于结构的方法因高维护成本、对无特征环境或显著视角变化敏感而失效。其解决方案的关键在于提出一种轻量级、无结构的拓扑-度量(topometric)系统OPENNAVMAP,该系统利用3D几何基础模型实现按需重建,并通过动态规划序列匹配、几何验证与置信度校准优化相结合的方式,实现无需预构建3D模型即可鲁棒地完成粗到精的子地图对齐,从而在保持全局一致性的同时显著提升定位精度与实用性。
链接: https://arxiv.org/abs/2601.12291
作者: Jianhao Jiao,Changkun Liu,Jingwen Yu,Boyi Liu,Qianyi Zhang,Yue Wang,Dimitrios Kanoulas
机构: University College London (伦敦大学学院); Hong Kong University of Science and Technology (香港科技大学); Nankai University (南开大学); Zhejiang University (浙江大学); Archimedes/Athena RC (Archimedes/Athena RC)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 20 figures
Abstract:Scalable and maintainable map representations are fundamental to enabling large-scale visual navigation and facilitating the deployment of robots in real-world environments. While collaborative localization across multi-session mapping enhances efficiency, traditional structure-based methods struggle with high maintenance costs and fail in feature-less environments or under significant viewpoint changes typical of crowd-sourced data. To address this, we propose OPENNAVMAP, a lightweight, structure-free topometric system leveraging 3D geometric foundation models for on-demand reconstruction. Our method unifies dynamic programming-based sequence matching, geometric verification, and confidence-calibrated optimization to robust, coarse-to-fine submap alignment without requiring pre-built 3D models. Evaluations on the Map-Free benchmark demonstrate superior accuracy over structure-from-motion and regression baselines, achieving an average translation error of 0.62m. Furthermore, the system maintains global consistency across 15km of multi-session data with an absolute trajectory error below 3m for map merging. Finally, we validate practical utility through 12 successful autonomous image-goal navigation tasks on simulated and physical robots. Code and datasets will be publicly available in this https URL.
zh
[CV-201] LegacyAvatars: Volumetric Face Avatars For Traditional Graphics Pipelines
【速读】:该论文旨在解决如何高效地在传统图形平台上实现逼真三维人脸虚拟形象(3D face avatar)的渲染问题,尤其针对复杂面部特征(如头发、皮肤和眼睛)的可控体积渲染挑战。解决方案的关键在于引入一种新颖的显式表示方法:通过学习三维空间中的辐射场流形(radiance manifolds),从参数化人脸模型中提取出分层网格(layered mesh)及其对应的外观纹理(appearance texture)和变形纹理(warp texture)。该显式结构使得部署阶段可通过简单的线性混合与Alpha合成对人脸进行控制与动画处理,并支持在线流式传输,从而利用经典基于网格和着色器的渲染技术在旧版图形平台中高效还原高保真效果,无需定制开发或集成。
链接: https://arxiv.org/abs/2601.12285
作者: Safa C. Medin,Gengyan Li,Ziqian Bai,Ruofei Du,Leonhard Helminger,Yinda Zhang,Stephan J. Garbin,Philip L. Davidson,Gregory W. Wornell,Thabo Beeler,Abhimitra Meka
机构: Google(谷歌); MIT(麻省理工学院); ETH Zurich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce a novel representation for efficient classical rendering of photorealistic 3D face avatars. Leveraging recent advances in radiance fields anchored to parametric face models, our approach achieves controllable volumetric rendering of complex facial features, including hair, skin, and eyes. At enrollment time, we learn a set of radiance manifolds in 3D space to extract an explicit layered mesh, along with appearance and warp textures. During deployment, this allows us to control and animate the face through simple linear blending and alpha compositing of textures over a static mesh. This explicit representation also enables the generated avatar to be efficiently streamed online and then rendered using classical mesh and shader-based rendering on legacy graphics platforms, eliminating the need for any custom engineering or integration.
zh
[CV-202] SDiT: Semantic Region-Adaptive for Diffusion Transformers
【速读】:该论文旨在解决扩散模型(Diffusion Models)在文本到图像合成任务中计算成本过高这一问题,尤其针对其迭代去噪过程和全局注意力机制带来的二次复杂度。解决方案的关键在于提出一种无需重新训练、不改变模型结构的语义区域自适应扩散Transformer(Semantic Region-Adaptive Diffusion Transformer, SDiT),其核心创新包括:基于快速Quickshift分割的语义感知聚类、根据区域复杂度动态调度更新区域的复杂度驱动式区域调度策略,以及保持空间一致性的边界感知精修机制。通过上述方法,SDiT能够显著减少冗余计算,在不牺牲感知质量和语义保真度的前提下实现最高达3.0倍的加速。
链接: https://arxiv.org/abs/2601.12283
作者: Bowen Lin,Fanjiang Ye,Yihua Liu,Zhenghui Guo,Boyuan Zhang,Weijian Zheng,Yufan Xu,Tiancheng Xing,Yuke Wang,Chengming Zhang
机构: University of Houston (休斯顿大学); Rice University (莱斯大学); Indiana University Bloomington (布卢明顿印第安纳大学); Argonne National Laboratory (阿贡国家实验室); National University of Singapore (新加坡国立大学); Independent Researcher (独立研究者)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformers (DiTs) achieve state-of-the-art performance in text-to-image synthesis but remain computationally expensive due to the iterative nature of denoising and the quadratic cost of global attention. In this work, we observe that denoising dynamics are spatially non-uniform-background regions converge rapidly while edges and textured areas evolve much more actively. Building on this insight, we propose SDiT, a Semantic Region-Adaptive Diffusion Transformer that allocates computation according to regional complexity. SDiT introduces a training-free framework combining (1) semantic-aware clustering via fast Quickshift-based segmentation, (2) complexity-driven regional scheduling to selectively update informative areas, and (3) boundary-aware refinement to maintain spatial coherence. Without any model retraining or architectural modification, SDiT achieves up to 3.0x acceleration while preserving nearly identical perceptual and semantic quality to full-attention inference.
zh
[CV-203] CytoCLIP: Learning Cytoarchitectural Characteristics in Developing Human Brain Using Contrastive Language Image Pre-Training
【速读】:该论文旨在解决人类大脑不同区域的细胞构筑(cytoarchitecture)手动划分耗时且依赖专家知识的问题,从而推动基于细胞构筑的脑区自动识别与科学分析。其解决方案的关键在于提出CytoCLIP,一个基于预训练对比语言-图像预训练(Contrastive Language-Image Pre-Training, CLIP)框架构建的视觉-语言模型套件,能够学习脑组织切片中细胞构筑的联合视觉-文本表征;该模型包含两种变体:一种使用低分辨率全区域图像捕捉整体细胞构筑模式,另一种利用高分辨率图像块实现细胞级细节建模,从而在多种数据设置下均展现出优异的区域分类与跨模态检索性能,F1得分分别达到0.87和0.91。
链接: https://arxiv.org/abs/2601.12282
作者: Pralaypati Ta,Sriram Venkatesaperumal,Keerthi Ram,Mohanasankar Sivaprakasam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The functions of different regions of the human brain are closely linked to their distinct cytoarchitecture, which is defined by the spatial arrangement and morphology of the cells. Identifying brain regions by their cytoarchitecture enables various scientific analyses of the brain. However, delineating these areas manually in brain histological sections is time-consuming and requires specialized knowledge. An automated approach is necessary to minimize the effort needed from human experts. To address this, we propose CytoCLIP, a suite of vision-language models derived from pre-trained Contrastive Language-Image Pre-Training (CLIP) frameworks to learn joint visual-text representations of brain cytoarchitecture. CytoCLIP comprises two model variants: one is trained using low-resolution whole-region images to understand the overall cytoarchitectural pattern of an area, and the other is trained on high-resolution image tiles for detailed cellular-level representation. The training dataset is created from NISSL-stained histological sections of developing fetal brains of different gestational weeks. It includes 86 distinct regions for low-resolution images and 384 brain regions for high-resolution tiles. We evaluate the model’s understanding of the cytoarchitecture and generalization ability using region classification and cross-modal retrieval tasks. Multiple experiments are performed under various data setups, including data from samples of different ages and sectioning planes. Experimental results demonstrate that CytoCLIP outperforms existing methods. It achieves an F1 score of 0.87 for whole-region classification and 0.91 for high-resolution image tile classification.
zh
[CV-204] Agent icPruner: MAC-Constrained Neural Network Compression via LLM -Driven Strategy Search
【速读】:该论文旨在解决神经网络剪枝(Neural Network Pruning)在资源受限设备部署中难以精确控制计算成本的问题,现有方法多聚焦于参数量减少,但无法保证推理延迟符合Multiply-Accumulate (MAC) 操作预算要求。其解决方案的关键在于提出 AgenticPruner 框架,通过三个协同工作的专业化智能体实现 MAC 约束下的优化:Profiling Agent 分析模型结构与 MAC 分布,Master Agent 监控流程并确保收敛稳定性,Analysis Agent 利用 Claude 3.5 Sonnet 基于历史尝试进行上下文学习,自动提炼最优剪枝策略;该框架结合同构剪枝(isomorphic pruning)的图结构分组机制,并引入跨迭代模式分析以实现自适应调整,最终在 ResNet、ConvNeXt 和 DeiT 等架构上实现了 MAC 预算内精准控制(误差范围通常为 ±5%),同时保持或提升准确率,验证了其在需严格计算保障场景中的可行性。
链接: https://arxiv.org/abs/2601.12272
作者: Shahrzad Esmat,Mahdi Banisharif,Ali Jannesari
机构: Iowa State University (爱荷华州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages, 2 figures, 14 tables
Abstract:Neural network pruning remains essential for deploying deep learning models on resource-constrained devices, yet existing approaches primarily target parameter reduction without directly controlling computational cost. This yields unpredictable inference latency in deployment scenarios where strict Multiply-Accumulate (MAC) operation budgets must be met. We propose AgenticPruner, a framework utilizing large language models to achieve MAC-constrained optimization through iterative strategy learning. Our approach coordinates three specialized agents: a Profiling Agent that analyzes model architecture and MAC distributions, a Master Agent that orchestrates the workflow with divergence monitoring, and an Analysis Agent powered by Claude 3.5 Sonnet that learns optimal strategies from historical attempts. Through in-context learning, the Analysis Agent improves convergence success rate from 48% to 71% compared to grid search. Building upon isomorphic pruning’s graph-based structural grouping, our method adds context-aware adaptation by analyzing patterns across pruning iterations, enabling automatic convergence to target MAC budgets within user-defined tolerance bands. We validate our framework on ImageNet-1K across ResNet, ConvNeXt, and DeiT architectures. On CNNs, our approach achieves MAC targeting while maintaining or improving accuracy: ResNet-50 reaches 1.77G MACs with 77.04% accuracy (+0.91% vs baseline); ResNet-101 achieves 4.22G MACs with 78.94% accuracy (+1.56% vs baseline). For ConvNeXt-Small, pruning to 8.17G MACs yields 1.41x GPU and 1.07x CPU speedup with 45% parameter reduction. On Vision Transformers, we demonstrate MAC-budget compliance within user-defined tolerance bands (typically +1% to +5% overshoot, -5% to -15% undershoot), establishing feasibility for deployment scenarios requiring strict computational guarantees. Comments: 38 pages, 2 figures, 14 tables Subjects: Computer Vision and Pattern Recognition (cs.CV) ACMclasses: I.2.6; I.5.4 Cite as: arXiv:2601.12272 [cs.CV] (or arXiv:2601.12272v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.12272 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-205] Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy
【速读】:该论文旨在解决非视域(Non-line-of-sight, NLOS)成像中如何从普通照片中重建隐藏场景三维结构的问题,尤其针对现有被动NLOS方法仅限于一维或低分辨率二维成像、或需已知物体形状的局限性。解决方案的关键在于提出了一种新颖的光传输模型重构方法,将隐藏场景分解为“遮光”(light-occluding)与“非遮光”(non-light-occluding)两部分,从而将原本复杂的非线性反问题转化为可分离的非线性最小二乘(Separable Nonlinear Least Squares, SNLLS)形式。基于此,作者设计了两种求解策略:一种是基于梯度的优化方法,另一种是受物理启发的神经网络方法——软阴影扩散(Soft Shadow diffusion, SSD),后者在仿真中训练后能良好泛化至未见过的仿真和真实NLOS场景,并展现出对噪声和环境光照的惊人鲁棒性。
链接: https://arxiv.org/abs/2601.12257
作者: Fadlullah Raji,John Murray-Bruce
机构: University of South Florida (南佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Graphics (cs.GR)
备注:
Abstract:Conventional imaging requires a line of sight to create accurate visual representations of a scene. In certain circumstances, however, obtaining a suitable line of sight may be impractical, dangerous, or even impossible. Non-line-of-sight (NLOS) imaging addresses this challenge by reconstructing the scene from indirect measurements. Recently, passive NLOS methods that use an ordinary photograph of the subtle shadow cast onto a visible wall by the hidden scene have gained interest. These methods are currently limited to 1D or low-resolution 2D color imaging or to localizing a hidden object whose shape is approximately known. Here, we generalize this class of methods and demonstrate a 3D reconstruction of a hidden scene from an ordinary NLOS photograph. To achieve this, we propose a novel reformulation of the light transport model that conveniently decomposes the hidden scene into \textitlight-occluding and \textitnon-light-occluding components to yield a separable non-linear least squares (SNLLS) inverse problem. We develop two solutions: A gradient-based optimization method and a physics-inspired neural network approach, which we call Soft Shadow diffusion (SSD). Despite the challenging ill-conditioned inverse problem encountered here, our approaches are effective on numerous 3D scenes in real experimental scenarios. Moreover, SSD is trained in simulation but generalizes well to unseen classes in simulation and real-world NLOS scenes. SSD also shows surprising robustness to noise and ambient illumination.
zh
[CV-206] Federated Joint Learning for Domain and Class Generalization ICASSP2026
【速读】:该论文旨在解决视觉-语言模型(如CLIP)在联邦学习场景下同时面临未见类别(unseen classes)和未见域(unseen domains)时的泛化能力不足问题,现有方法通常仅针对其中一类问题进行优化,缺乏联合建模机制。解决方案的关键在于提出一种名为FedDCG(Federated Joint Learning for Domain and Class Generalization)的新框架:首先通过域分组策略,在每个组内训练类泛化网络以避免决策边界混淆;其次在推理阶段基于域相似性聚合类泛化结果,实现类与域知识的有效融合;此外引入可学习网络增强类泛化能力,并设计解耦机制分离通用知识与域特定知识,从而提升对未见域的鲁棒性。
链接: https://arxiv.org/abs/2601.12253
作者: Haoran Xu,Jiaze Li,Jianzhong Ju,Zhenbo Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICASSP 2026
Abstract:Efficient fine-tuning of visual-language models like CLIP has become crucial due to their large-scale parameter size and extensive pretraining requirements. Existing methods typically address either the issue of unseen classes or unseen domains in isolation, without considering a joint framework for both. In this paper, we propose \textbfFederated Joint Learning for \textbfDomain and \textbfClass \textbfGeneralization, termed \textbfFedDCG, a novel approach that addresses both class and domain generalization in federated learning settings. Our method introduces a domain grouping strategy where class-generalized networks are trained within each group to prevent decision boundary confusion. During inference, we aggregate class-generalized results based on domain similarity, effectively integrating knowledge from both class and domain generalization. Specifically, a learnable network is employed to enhance class generalization capabilities, and a decoupling mechanism separates general and domain-specific knowledge, improving generalization to unseen domains. Extensive experiments across various datasets show that \textbfFedDCG outperforms state-of-the-art baselines in terms of accuracy and robustness.
zh
[CV-207] Breaking Coordinate Overfitting: Geometry-Aware WiFi Sensing for Cross-Layout 3D Pose Estimation
【速读】:该论文旨在解决基于WiFi的3D人体姿态估计中因坐标系过拟合(coordinate overfitting)导致的泛化性能差的问题。现有方法依赖视觉3D姿态作为监督信号,并直接将信道状态信息(CSI)回归到相机坐标系,使得模型记忆特定部署场景下的WiFi收发器布局,而非学习与活动相关的表征,从而在不同设备布局下表现严重退化。解决方案的关键在于提出PerceptAlign框架,其核心创新是引入一种轻量级坐标统一流程,通过两个棋盘格和少量照片将WiFi与视觉测量对齐至共享3D空间,并将校准后的收发器位置编码为高维嵌入,与CSI特征融合,使网络显式地将设备几何结构作为条件变量进行感知,从而解耦人体运动与部署布局,实现首次具备布局不变性的WiFi姿态估计。
链接: https://arxiv.org/abs/2601.12252
作者: Songming Jia,Yan Lu,Bin Liu,Xiang Zhang,Peng Zhao,Xinmeng Tang,Yelin Wei,Jinyang Huang,Huan Yan,Zhi Liu
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Artificial Intelligence Lab (上海人工智能实验室); Tianjin University (天津大学); Hefei University of Technology (合肥工业大学); Guizhou Normal University (贵州师范大学); The University of Electro-Communications (电波通信大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Accpeted by AMC Mobicom 2026
Abstract:WiFi-based 3D human pose estimation offers a low-cost and privacy-preserving alternative to vision-based systems for smart interaction. However, existing approaches rely on visual 3D poses as supervision and directly regress CSI to a camera-based coordinate system. We find that this practice leads to coordinate overfitting: models memorize deployment-specific WiFi transceiver layouts rather than only learning activity-relevant representations, resulting in severe generalization failures. To address this challenge, we present PerceptAlign, the first geometry-conditioned framework for WiFi-based cross-layout pose estimation. PerceptAlign introduces a lightweight coordinate unification procedure that aligns WiFi and vision measurements in a shared 3D space using only two checkerboards and a few photos. Within this unified space, it encodes calibrated transceiver positions into high-dimensional embeddings and fuses them with CSI features, making the model explicitly aware of device geometry as a conditional variable. This design forces the network to disentangle human motion from deployment layouts, enabling robust and, for the first time, layout-invariant WiFi pose estimation. To support systematic evaluation, we construct the largest cross-domain 3D WiFi pose estimation dataset to date, comprising 21 subjects, 5 scenes, 18 actions, and 7 device layouts. Experiments show that PerceptAlign reduces in-domain error by 12.3% and cross-domain error by more than 60% compared to state-of-the-art baselines. These results establish geometry-conditioned learning as a viable path toward scalable and practical WiFi sensing.
zh
[CV-208] An Innovative Framework for Breast Cancer Detection Using Pyramid Adaptive Atrous Convolution Transformer Integration and Multi-Scale Feature Fusion
【速读】:该论文旨在解决乳腺癌早期诊断中恶性肿块检测准确率低的问题,尤其是在复杂背景和大规模数据场景下的分类误差较大问题。其解决方案的关键在于提出了一种融合金字塔自适应空洞卷积(Pyramid Adaptive Atrous Convolution, PAAC)与Transformer架构的新型框架,通过多尺度特征融合增强良恶性组织的特征提取能力,并引入Dice Loss与Focal Loss联合优化策略,有效提升模型对不平衡数据的学习能力;同时利用Transformer的自注意力机制捕捉长程依赖关系,显著提升了模型在INbreast、MIAS和DDSM等公开数据集上的性能,最终实现98.5%的准确率和98.2%的F1-score,优于多个主流基础模型。
链接: https://arxiv.org/abs/2601.12249
作者: Ehsan Sadeghi Pour,Mahdi Esmaeili,Morteza Romoozi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 page
Abstract:Breast cancer is one of the most common cancers among women worldwide, and its accurate and timely diagnosis plays a critical role in improving treatment outcomes. This thesis presents an innovative framework for detecting malignant masses in mammographic images by integrating the Pyramid Adaptive Atrous Convolution (PAAC) and Transformer architectures. The proposed approach utilizes Multi-Scale Feature Fusion to enhance the extraction of features from benign and malignant tissues and combines Dice Loss and Focal Loss functions to improve the model’s learning process, effectively reducing errors in binary breast cancer classification and achieving high accuracy and efficiency. In this study, a comprehensive dataset of breast cancer images from INbreast, MIAS, and DDSM was preprocessed through data augmentation and contrast enhancement and resized to 227x227 pixels for model training. Leveraging the Transformer’s ability to manage long-range dependencies with Self-Attention mechanisms, the proposed model achieved high accuracy in detecting cancerous masses, outperforming foundational models such as BreastNet, DeepMammo, Multi-Scale CNN, Swin-Unet, and SegFormer. The final evaluation results for the proposed model include an accuracy of 98.5%, sensitivity of 97.8%, specificity of 96.3%, F1-score of 98.2%, and overall precision of 97.9%. These metrics demonstrate a significant improvement over traditional methods and confirm the model’s effectiveness in identifying cancerous masses in complex scenarios and large datasets. This model shows potential as a reliable and efficient tool for breast cancer diagnosis and can be effectively integrated into medical diagnostic systems.
zh
[CV-209] Less is More: Label-Guided Summarization of Procedural and Instructional Videos
【速读】:该论文旨在解决视频摘要(video summarization)中语义准确性不足与上下文连贯性差的问题,尤其是在高风险领域如手术培训中,如何从长视频中提取具有程序性意义的关键帧以生成结构清晰、内容精准的摘要。其解决方案的核心在于提出一个三阶段框架PRISM(Procedural Representation via Integrated Semantic and Multimodal analysis),关键创新包括:自适应视觉采样以高效覆盖重要事件、基于标签的关键帧锚定确保程序节点的代表性、以及利用大语言模型(LLM)进行上下文验证以过滤冗余或幻觉内容,从而在仅采样少于5%原始帧的情况下保留84%的语义信息,并显著优于现有基线方法。
链接: https://arxiv.org/abs/2601.12243
作者: Shreya Rajpal,Michal Golovanesky,Carsten Eickhoff
机构: Vellore Institute of Technology (维洛尔理工学院); University of Tübingen (图宾根大学); Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures
Abstract:Video summarization helps turn long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from using basic visual features like color, motion, and structural changes to using pre-trained vision-language models that can better understand what’s happening in the video (semantics) and capture temporal flow, resulting in more context-aware video summarization. We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis, that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful and procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate our method on instructional and activity datasets, using reference summaries for instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.
zh
[CV-210] Proc3D: Procedural 3D Generation and Parametric Editing of 3D Shapes with Large Language Models
【速读】:该论文旨在解决传统3D模型生成方法依赖专业技能且生成结果不可编辑的问题,从而限制了迭代设计的灵活性。其核心解决方案是提出Proc3D系统,关键创新在于引入一种称为“程序化紧凑图”(Procedural Compact Graph, PCG)的新型3D模型表示方式,该表示显式编码生成模型的算法规则与结构,并暴露可调参数,支持通过滑块、复选框进行直观手动调整,以及借助大语言模型(Large Language Models, LLMs)实现自然语言驱动的实时自动修改。此机制显著提升了编辑效率和文本对齐精度,实验表明其编辑速度较传统全量重生成方法提升超过400倍,ULIP分数提高28%。
链接: https://arxiv.org/abs/2601.12234
作者: Fadlullah Raji,Stefano Petrangeli,Matheus Gadelha,Yu Shen,Uttaran Bhattacharya,Gang Wu
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating 3D models has traditionally been a complex task requiring specialized expertise. While recent advances in generative AI have sought to automate this process, existing methods produce non-editable representation, such as meshes or point clouds, limiting their adaptability for iterative design. In this paper, we introduce Proc3D, a system designed to generate editable 3D models while enabling real-time modifications. At its core, Proc3D introduces procedural compact graph (PCG), a graph representation of 3D models, that encodes the algorithmic rules and structures necessary for generating the model. This representation exposes key parameters, allowing intuitive manual adjustments via sliders and checkboxes, as well as real-time, automated modifications through natural language prompts using Large Language Models (LLMs). We demonstrate Proc3D’s capabilities using two generative approaches: GPT-4o with in-context learning (ICL) and a fine-tuned LLAMA-3 model. Experimental results show that Proc3D outperforms existing methods in editing efficiency, achieving more than 400x speedup over conventional approaches that require full regeneration for each modification. Additionally, Proc3D improves ULIP scores by 28%, a metric that evaluates the alignment between generated 3D models and text prompts. By enabling text-aligned 3D model generation along with precise, real-time parametric edits, Proc3D facilitates highly accurate text-based image editing applications.
zh
[CV-211] DiffusionQC: Artifact Detection in Histopathology via Diffusion Model
【速读】:该论文旨在解决数字病理图像中伪影(artifact)检测难题,传统监督学习方法依赖大量像素级标注数据且难以泛化到新型伪影类型。其解决方案的关键在于提出DiffusionQC,一种基于扩散模型的无监督异常检测方法,仅需清洁图像进行训练,无需像素级标注或预定义伪影类别;同时引入对比学习模块以增强伪影与清洁图像分布之间的分离度,从而实现更优的检测性能和跨染色泛化能力。
链接: https://arxiv.org/abs/2601.12233
作者: Zhenzhen Wang,Zhongliang Zhou,Zhuoyu Wen,Jeong Hwan Kook,John B Wojcik,John Kang
机构: Johns Hopkins University (约翰霍普金斯大学); Merck & Co., Inc. (默克公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages
Abstract:Digital pathology plays a vital role across modern medicine, offering critical insights for disease diagnosis, prognosis, and treatment. However, histopathology images often contain artifacts introduced during slide preparation and digitization. Detecting and excluding them is essential to ensure reliable downstream analysis. Traditional supervised models typically require large annotated datasets, which is resource-intensive and not generalizable to novel artifact types. To address this, we propose DiffusionQC, which detects artifacts as outliers among clean images using a diffusion model. It requires only a set of clean images for training rather than pixel-level artifact annotations and predefined artifact types. Furthermore, we introduce a contrastive learning module to explicitly enlarge the distribution separation between artifact and clean images, yielding an enhanced version of our method. Empirical results demonstrate superior performance to state-of-the-art and offer cross-stain generalization capacity, with significantly less data and annotations.
zh
[CV-212] Where It Moves It Matters: Referring Surgical Instrument Segmentation via Motion
【速读】:该论文旨在解决手术视频中基于自然语言描述的参考分割(referring segmentation)问题,即如何准确地定位和分割出由自由形式语言表达所指代的手术器械。现有方法因依赖静态视觉特征和预定义器械名称,在复杂场景下泛化能力不足。其解决方案的关键在于提出SurgRef框架,该框架通过运动引导机制捕捉器械在时间维度上的运动模式与交互关系,而非仅依赖外观特征,从而实现对器械在遮挡、模糊或陌生术语下的精准识别与分割,显著提升了模型的鲁棒性和通用性。
链接: https://arxiv.org/abs/2601.12224
作者: Meng Wei,Kun Yuan,Shi Li,Yue Zhou,Long Bai,Nassir Navab,Hongliang Ren,Hong Joo Lee,Tom Vercauteren,Nicolas Padoy
机构: 1. Imperial College London (伦敦帝国理工学院);
2. Siemens Healthineers (西门子医疗);
3. ETH Zurich (苏黎世联邦理工学院);
4. Tsinghua University (清华大学);
5. KAIST (韩国科学技术院);
6. University College London (伦敦大学学院);
7. National University of Singapore (新加坡国立大学);
8. Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所);
9. University of Oxford (牛津大学);
10. University of California, Berkeley (加州大学伯克利分校);
11. University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Enabling intuitive, language-driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, the task of referring segmentation, localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, with existing approaches struggling to generalize due to reliance on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion-guided framework that grounds free-form language expressions in instrument motion, capturing how tools move and interact across time, rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions. SurgRef achieves state-of-the-art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language-driven surgical video segmentation.
zh
[CV-213] VIRTUE: Versatile Video Retrieval Through Unified Embeddings
【速读】:该论文旨在解决当前视频检索系统在处理多样化任务时的局限性问题,即专用架构虽能实现高精度的语料级或细粒度时刻定位检索,但难以支持复合多模态查询;而基于多模态大语言模型(Multimodal Large Language Model, MLLM)的方法虽具备灵活的多模态搜索能力,其检索性能却显著低于专业系统。解决方案的关键在于提出VIRTUE框架,通过共享MLLM骨干网络生成视觉与文本嵌入,并利用对比对齐策略实现高效嵌入式候选检索;同时采用低秩适应(Low-Rank Adaptation, LoRA)方法在70万对图文数据上高效训练嵌入模型,从而在零样本视频检索、零样本时刻检索及零样本复合视频检索任务中均取得优异表现,且经重排序微调后可达到与大规模数据训练的专业模型相当的性能。
链接: https://arxiv.org/abs/2601.12193
作者: Shaunak Halbe,Bhagyashree Puranik,Jayakrishnan Unnikrishnan,Kushan Thakkar,Vimal Bhat,Toufiq Parag
机构: Georgia Institute of Technology (佐治亚理工学院); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state of the art results for zero-shot composed video retrieval. With additional training for reranking candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state of the art specialized models which are trained on orders of magnitude larger data.
zh
[CV-214] Inverse Rendering for High-Genus 3D Surface Meshes from Multi-view Images with Persistent Homology Priors ICASSP2026
【速读】:该论文旨在解决从图像中重建三维物体时因几何、外观和拓扑不确定性而导致的病态问题(ill-posed problem),尤其针对高 genus 表面(如具有多个孔洞或把手的复杂拓扑结构)难以准确恢复的问题。其解决方案的关键在于引入基于持久同调(persistent homology)的拓扑先验,通过约束关键拓扑特征(如隧道环(tunnel loops)和把手环(handle loops))来指导逆渲染过程;该方法结合多视角图像的光度一致性与同调引导,在无需神经网络的情况下,利用基于网格的梯度优化框架实现高精度且鲁棒的几何重建,有效避免了传统方法中常见的拓扑崩溃(如隧道塌陷或丢失高 genus 结构)问题。
链接: https://arxiv.org/abs/2601.12155
作者: Xiang Gao,Xinmu Wang,Yuanpeng Liu,Yue Wang,Junqi Huang,Wei Chen,Xianfeng Gu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICASSP2026 Accepted
Abstract:Reconstructing 3D objects from images is inherently an ill-posed problem due to ambiguities in geometry, appearance, and topology. This paper introduces collaborative inverse rendering with persistent homology priors, a novel strategy that leverages topological constraints to resolve these ambiguities. By incorporating priors that capture critical features such as tunnel loops and handle loops, our approach directly addresses the difficulty of reconstructing high-genus surfaces. The collaboration between photometric consistency from multi-view images and homology-based guidance enables recovery of complex high-genus geometry while circumventing catastrophic failures such as collapsing tunnels or losing high-genus structure. Instead of neural networks, our method relies on gradient-based optimization within a mesh-based inverse rendering framework to highlight the role of topological priors. Experimental results show that incorporating persistent homology priors leads to lower Chamfer Distance (CD) and higher Volume IoU compared to state-of-the-art mesh-based methods, demonstrating improved geometric accuracy and robustness against topological failure.
zh
[CV-215] Enhanced Diagnostic Performance via Large-Resolution Inference Optimization for Pathology Foundation Models
【速读】:该论文旨在解决病理学基础模型在处理全切片图像(Whole-Slide Images, WSIs)时因固定输入尺寸(如224×224)导致的效率低下问题,尤其是在高分辨率WSI推理中面临的GPU内存消耗过大或降采样破坏形态细节的困境。解决方案的关键在于提出一种空间和时间高效的推理策略:通过引入空间感知的邻域块来稀疏化注意力机制,并利用全局注意力得分过滤非信息性token,从而显著降低GPU内存占用与运行时间,同时保持甚至提升下游任务性能,实现在相同GPU预算下更高分辨率的推理能力。
链接: https://arxiv.org/abs/2601.12150
作者: Mengxuan Hu,Zihan Guan,John Kang,Sheng Li,Zhongliang Zhou
机构: University of Virginia (弗吉尼亚大学); Merck & Co., Inc. (默克公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:Despite their prominent performance on tasks such as ROI classification and segmentation, many pathology foundation models remain constrained by a specific input size e.g. 224 x 224, creating substantial inefficiencies when applied to whole-slide images (WSIs), which span thousands of resolutions. A naive strategy is to either enlarge inputs or downsample the WSIs. However, enlarging inputs results in prohibitive GPU memory consumption, while downsampling alters the microns-per-pixel resolution and obscures critical morphological details. To overcome these limitations, we propose an space- and time- efficient inference strategy that sparsifies attention using spatially aware neighboring blocks and filters out non-informative tokens through global attention scores. This design substantially reduces GPU memory and runtime during high-resolution WSI inference while preserving and even improving the downstream performance, enabling inference at higher resolutions under the same GPU budget. The experimental results show that our method can achieves up to an 7.67% improvement in the ROI classification and compatible results in segmentation.
zh
[CV-216] Principal Component Analysis-Based Terahertz Self-Supervised Denoising and Deblurring Deep Neural Networks
【速读】:该论文旨在解决太赫兹(Terahertz, THz)成像系统中固有的频率依赖性退化问题,即在幅度图像中同时存在低频模糊和高频噪声,而传统图像处理技术难以同时有效抑制这两类干扰,且常需人工干预以确定去噪与去模糊的边界。解决方案的关键在于提出一种基于主成分分析(Principal Component Analysis, PCA)的自监督去噪与去模糊网络(THz-SSDD),其核心创新是采用“重构到重构”(Recorrupted-to-Recorrupted)的自监督学习策略,利用重复扰动下的不变性特征捕捉噪声内在结构,并通过PCA分解与重建实现对低频和高频分量的同时恢复,从而在无需标注数据的情况下有效提升图像质量并保留原始信号的物理特性。
链接: https://arxiv.org/abs/2601.12149
作者: Pengfei Zhu,Xavier Maldague
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Terahertz (THz) systems inherently introduce frequency-dependent degradation effects, resulting in low-frequency blurring and high-frequency noise in amplitude images. Conventional image processing techniques cannot simultaneously address both issues, and manual intervention is often required due to the unknown boundary between denoising and deblurring. To tackle this challenge, we propose a principal component analysis (PCA)-based THz self-supervised denoising and deblurring network (THz-SSDD). The network employs a Recorrupted-to-Recorrupted self-supervised learning strategy to capture the intrinsic features of noise by exploiting invariance under repeated corruption. PCA decomposition and reconstruction are then applied to restore images across both low and high frequencies. The performance of the THz-SSDD network was evaluated on four types of samples. Training requires only a small set of unlabeled noisy images, and testing across samples with different material properties and measurement modes demonstrates effective denoising and deblurring. Quantitative analysis further validates the network feasibility, showing improvements in image quality while preserving the physical characteristics of the original signals.
zh
[CV-217] Segment and Matte Anything in a Unified Model AAAI2026
【速读】:该论文旨在解决生成式分割模型Segment Anything (SAM)在实际应用中mask预测精度不足的问题,以及缺乏统一框架实现高精度交互式图像分割与抠图(matting)任务的挑战。其解决方案的关键在于提出了一种轻量级扩展模型SAMA(Segment And Matte Anything),通过引入多视角定位编码器(Multi-View Localization Encoder, MVLE)以捕捉局部细节特征,并设计定位适配器(Localization Adapter, Local-Adapter)来恢复掩码边界上的细微结构;同时,在架构中集成两个独立预测头分别生成分割掩码和alpha抠图,从而在同一框架内实现高质量的交互式图像分割与抠图任务。
链接: https://arxiv.org/abs/2601.12147
作者: Zezhong Fan,Xiaohan Li,Topojoy Biswas,Kaushiki Nag,Kannan Achan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI 2026
Abstract:Segment Anything (SAM) has recently pushed the boundaries of segmentation by demonstrating zero-shot generalization and flexible prompting after training on over one billion masks. Despite this, its mask prediction accuracy often falls short of the precision required in real-world applications. While several refinement modules have been proposed to boost SAM’s segmentation quality, achieving highly accurate object delineation within a single, unified framework remains an open challenge. Furthermore, interactive image matting, which aims to generate fine-grained alpha mattes guided by diverse user hints, has not yet been explored in the context of SAM. Insights from recent studies highlight strong correlations between segmentation and matting, suggesting the feasibility of a unified model capable of both tasks. In this paper, we introduce Segment And Matte Anything (SAMA), a lightweight extension of SAM that delivers high-quality interactive image segmentation and matting with minimal extra parameters. Our Multi-View Localization Encoder (MVLE) captures detailed features from local views, while the Localization Adapter (Local-Adapter) refines mask outputs by recovering subtle boundary details. We also incorporate two prediction heads for each task into the architecture to generate segmentation and matting masks, simultaneously. Trained on a diverse dataset aggregated from publicly available sources, SAMA achieves state-of-the-art performance across multiple segmentation and matting benchmarks, showcasing its adaptability and effectiveness in a wide range of downstream tasks.
zh
[CV-218] EMoE: Eigenbasis-Guided Routing for Mixture-of-Experts ICASSP2026
【速读】:该论文旨在解决Mixture-of-Experts (MoE) 架构中的两个核心问题:一是“富者愈富”现象导致的负载不均衡问题,即少数专家被过度使用;二是专家同质化问题,即各专家学习到冗余表示,削弱了多专家分工的意义。现有方法通常引入辅助负载平衡损失来缓解不均衡,但往往加剧同质化,因强制均匀路由而牺牲了专家的专业化。解决方案的关键在于提出Eigen-Mixture-of-Experts (EMoE),其通过基于学习得到的正交特征基(orthonormal eigenbasis)进行路由:将输入token投影到共享特征基上,并依据其与特征空间主成分的对齐程度进行分配。这种基于几何结构的分区策略天然促进专家负载均衡与多样化专业能力的发展,且无需依赖冲突性的辅助损失函数。
链接: https://arxiv.org/abs/2601.12137
作者: Anzhe Cheng,Shukai Duan,Shixuan Li,Chenzhong Yin,Mingxi Cheng,Shahin Nazarian,Paul Thompson,Paul Bogdan
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICASSP2026
Abstract:The relentless scaling of deep learning models has led to unsustainable computational demands, positioning Mixture-of-Experts (MoE) architectures as a promising path towards greater efficiency. However, MoE models are plagued by two fundamental challenges: 1) a load imbalance problem known as the``rich get richer" phenomenon, where a few experts are over-utilized, and 2) an expert homogeneity problem, where experts learn redundant representations, negating their purpose. Current solutions typically employ an auxiliary load-balancing loss that, while mitigating imbalance, often exacerbates homogeneity by enforcing uniform routing at the expense of specialization. To resolve this, we introduce the Eigen-Mixture-of-Experts (EMoE), a novel architecture that leverages a routing mechanism based on a learned orthonormal eigenbasis. EMoE projects input tokens onto this shared eigenbasis and routes them based on their alignment with the principal components of the feature space. This principled, geometric partitioning of data intrinsically promotes both balanced expert utilization and the development of diverse, specialized experts, all without the need for a conflicting auxiliary loss function. Our code is publicly available at this https URL.
zh
[CV-219] CARLA-Round: A Multi-Factor Simulation Dataset for Roundabout Trajectory Prediction
【速读】:该论文旨在解决环形交叉口车辆轨迹预测(roundabout trajectory prediction)的准确性难题,其核心挑战在于环形道路几何结构复杂、车辆持续存在汇入与让行交互行为且缺乏交通信号控制,导致现有数据集难以支撑高精度算法开发。解决方案的关键在于构建一个系统化设计的仿真数据集CARLA-Round,该数据集通过结构化地控制天气条件(五类)和交通密度水平(A-E级服务水平),生成25个受控场景,每个场景均包含真实混合驾驶行为并提供显式标注,从而实现对不同因素影响的精确量化分析。相比传统随机采样的仿真或现实数据,此方法可有效隔离混杂变量,为轨迹预测模型的性能评估与优化提供可靠依据,并验证了从仿真到现实的迁移有效性(最佳模型在真实数据上达到0.312m ADE)。
链接: https://arxiv.org/abs/2601.12119
作者: Xiaotong Zhou,Zhenhui Yuan,Yi Han,Tianhua Xu,Laurence T. Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate trajectory prediction of vehicles at roundabouts is critical for reducing traffic accidents, yet it remains highly challenging due to their circular road geometry, continuous merging and yielding interactions, and absence of traffic signals. Developing accurate prediction algorithms relies on reliable, multimodal, and realistic datasets; however, such datasets for roundabout scenarios are scarce, as real-world data collection is often limited by incomplete observations and entangled factors that are difficult to isolate. We present CARLA-Round, a systematically designed simulation dataset for roundabout trajectory prediction. The dataset varies weather conditions (five types) and traffic density levels (spanning Level-of-Service A-E) in a structured manner, resulting in 25 controlled scenarios. Each scenario incorporates realistic mixtures of driving behaviors and provides explicit annotations that are largely absent from existing datasets. Unlike randomly sampled simulation data, this structured design enables precise analysis of how different conditions influence trajectory prediction performance. Validation experiments using standard baselines (LSTM, GCN, GRU+GCN) reveal traffic density dominates prediction difficulty with strong monotonic effects, while weather shows non-linear impacts. The best model achieves 0.312m ADE on real-world rounD dataset, demonstrating effective sim-to-real transfer. This systematic approach quantifies factor impacts impossible to isolate in confounded real-world datasets. Our CARLA-Round dataset is available at this https URL.
zh
[CV-220] RCDN: Real-Centered Detection Network for Robust Face Forgery Identification
【速读】:该论文旨在解决图像伪造检测模型在跨域场景下性能显著下降的问题,即现有方法在训练与测试数据分布一致时表现优异,但在面对新型或未见过的伪造技术时泛化能力不足。解决方案的关键在于提出一种以真实图像为中心的检测网络(Real-Centered Detection Network, RCDN),其核心思想是通过锚定表示空间于真实人脸图像,而非建模不断变化的伪造模式,从而提升模型对分布偏移的鲁棒性;具体实现上采用基于Xception主干的双分支架构和“真实中心损失”设计,有效增强了模型在不同域间的稳定性与泛化能力。
链接: https://arxiv.org/abs/2601.12111
作者: Wyatt McCurdy,Xin Zhang,Yuqi Song,Min Gao
机构: University of Southern Maine(缅因大学南校区); Chongqing University(重庆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image forgery has become a critical threat with the rapid proliferation of AI-based generation tools, which make it increasingly easy to synthesize realistic but fraudulent facial content. Existing detection methods achieve near-perfect performance when training and testing are conducted within the same domain, yet their effectiveness deteriorates substantially in crossdomain scenarios. This limitation is problematic, as new forgery techniques continuously emerge and detectors must remain reliable against unseen manipulations. To address this challenge, we propose the Real-Centered Detection Network (RCDN), a frequency spatial convolutional neural networks(CNN) framework with an Xception backbone that anchors its representation space around authentic facial images. Instead of modeling the diverse and evolving patterns of forgeries, RCDN emphasizes the consistency of real images, leveraging a dual-branch architecture and a real centered loss design to enhance robustness under distribution shifts. Extensive experiments on the DiFF dataset, focusing on three representative forgery types (FE, I2I, T2I), demonstrate that RCDN achieves both state-of-the-art in-domain accuracy and significantly stronger cross-domain generalization. Notably, RCDN reduces the generalization gap compared to leading baselines and achieves the highest cross/in-domain stability ratio, highlighting its potential as a practical solution for defending against evolving and unseen image forgery techniques.
zh
[CV-221] Energy-Aware Ensemble Learning for Coffee Leaf Disease Classification
【速读】:该论文旨在解决咖啡叶部病害在田间环境下难以实现及时准确诊断的问题,尤其是在受限设备和间歇性网络连接条件下,传统高精度人工智能(Artificial Intelligence, AI)视觉模型难以部署的挑战。其解决方案的关键在于通过知识蒸馏(knowledge distillation)技术,将数据中心训练的高容量卷积神经网络(Convolutional Neural Networks, CNNs)的知识迁移至轻量级CNN模型,并结合集成学习(Ensemble Learning)策略优化性能;同时,通过引入密集微小模型对(dense tiny pairs)并采用简化的优化集成方法,在严格计算资源和能耗约束下显著提升诊断准确性,从而为物联网(Internet of Things, IoT)场景提供可持续的本地化诊断方案。
链接: https://arxiv.org/abs/2601.12109
作者: Larissa Ferreira Rodrigues Moreira,Rodrigo Moreira,Leonardo Gabriel Ferreira Rodrigues
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Coffee yields are contingent on the timely and accurate diagnosis of diseases; however, assessing leaf diseases in the field presents significant challenges. Although Artificial Intelligence (AI) vision models achieve high accuracy, their adoption is hindered by the limitations of constrained devices and intermittent connectivity. This study aims to facilitate sustainable on-device diagnosis through knowledge distillation: high-capacity Convolutional Neural Networks (CNNs) trained in data centers transfer knowledge to compact CNNs through Ensemble Learning (EL). Furthermore, dense tiny pairs were integrated through simple and optimized ensembling to enhance accuracy while adhering to strict computational and energy constraints. On a curated coffee leaf dataset, distilled tiny ensembles achieved competitive with prior work with significantly reduced energy consumption and carbon footprint. This indicates that lightweight models, when properly distilled and ensembled, can provide practical diagnostic solutions for Internet of Things (IoT) applications.
zh
[CV-222] Detecting 3D Line Segments for 6DoF Pose Estimation with Limited Data
【速读】:该论文旨在解决工业场景中6自由度(6DoF)物体位姿估计问题,尤其针对仓储或装配线上常见的箱子(bins)进行高精度位姿估计。传统深度学习方法通常依赖大量训练数据或特定实例的CAD模型,在实际工业环境中因数据稀缺和对象多样性而受限。解决方案的关键在于利用箱子的规则立方体几何特性:首先通过扩展LeTR网络以处理结构化点云数据,检测对应于箱子顶部边缘的3D线段;随后采用简化的几何推理流程,从这些线段中鲁棒地推导出完整的6DoF位姿。实验表明,结合合成训练数据可显著提升真实扫描下的位姿估计精度,并且该方法在不依赖实例级CAD模型的前提下,实现了优于当前最先进方法的性能(平移误差<3 cm,旋转误差<8.2°)。
链接: https://arxiv.org/abs/2601.12090
作者: Matej Mok,Lukáš Gajdošech,Michal Mesároš,Martin Madaras,Viktor Kocur
机构: Faculty of Mathematics, Physics and Informatics, Comenius University Bratislava, Slovakia; Skeletex Research, Bratislava, Slovakia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:The task of 6DoF object pose estimation is one of the fundamental problems of 3D vision with many practical applications such as industrial automation. Traditional deep learning approaches for this task often require extensive training data or CAD models, limiting their application in real-world industrial settings where data is scarce and object instances vary. We propose a novel method for 6DoF pose estimation focused specifically on bins used in industrial settings. We exploit the cuboid geometry of bins by first detecting intermediate 3D line segments corresponding to their top edges. Our approach extends the 2D line segment detection network LeTR to operate on structured point cloud data. The detected 3D line segments are then processed using a simple geometric procedure to robustly determine the bin’s 6DoF pose. To evaluate our method, we extend an existing dataset with a newly collected and annotated dataset, which we make publicly available. We show that incorporating synthetic training data significantly improves pose estimation accuracy on real scans. Moreover, we show that our method significantly outperforms current state-of-the-art 6DoF pose estimation methods in terms of the pose accuracy (3 cm translation error, 8.2 ^\circ rotation error) while not requiring instance-specific CAD models during inference.
zh
[CV-223] Conditional Random Fields for Interactive Refinement of Histopathological Predictions
【速读】:该论文旨在解决病理图像分析中生成式 AI(Generative AI)模型在零样本预测时准确性不足的问题,尤其在组织病理学图像的 patch-level 分类任务中表现不稳定。其核心解决方案是提出 HistoCRF 框架,通过将条件随机场(Conditional Random Fields, CRFs)适配至组织病理学场景,无需额外模型训练即可提升预测精度;关键创新在于设计了一种新型成对势函数(pairwise potential),该函数能够促进标签多样性并有效利用专家标注信息,在仅需少量人工标注(如 100 标注)的情况下显著提升分类准确率,且引入人类反馈闭环机制可进一步优化性能。
链接: https://arxiv.org/abs/2601.12082
作者: Tiffanie Godelaine,Maxime Zanella,Karim El Khoury,Saïd Mahmoudi,Benoît Macq,Christophe De Vleeschouwer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Assisting pathologists in the analysis of histopathological images has high clinical value, as it supports cancer detection and staging. In this context, histology foundation models have recently emerged. Among them, Vision-Language Models (VLMs) provide strong yet imperfect zero-shot predictions. We propose to refine these predictions by adapting Conditional Random Fields (CRFs) to histopathological applications, requiring no additional model training. We present HistoCRF, a CRF-based framework, with a novel definition of the pairwise potential that promotes label diversity and leverages expert annotations. We consider three experiments: without annotations, with expert annotations, and with iterative human-in-the-loop annotations that progressively correct misclassified patches. Experiments on five patch-level classification datasets covering different organs and diseases demonstrate average accuracy gains of 16.0% without annotations and 27.5% with only 100 annotations, compared to zero-shot predictions. Moreover, integrating a human in the loop reaches a further gain of 32.6% with the same number of annotations. The code will be made available on this https URL.
zh
[CV-224] oward Real-World High-Precision Image Matting and Segmentation AAAI2026
【速读】:该论文旨在解决高精度场景解析任务(如图像抠图和二值分割)中存在的一系列问题:现有方法多聚焦于显著的单一前景对象,交互式方法虽可调整目标但因类别无关设计限制了跨类泛化能力;同时,高质量标注数据稀缺导致依赖不协调的合成数据,造成模型在真实场景中的泛化性能不佳。解决方案的关键在于提出一种前景一致性学习模型(Foreground Consistent Learning model, FCLM),其核心包括三个创新:首先引入深度感知蒸馏策略(Depth-Aware Distillation),通过迁移与深度相关的知识增强前景表征;其次将合成数据处理视为域适应问题,提出域不变学习策略以聚焦前景学习;最后设计面向对象的解码器(Object-Oriented Decoder),支持视觉与语言提示输入,实现交互式预测。实验表明,该方法在定量和定性指标上均优于当前最优(SOTA)方法。
链接: https://arxiv.org/abs/2601.12080
作者: Haipeng Zhou,Zhaohu Xing,Hongqiu Wang,Jun Ma,Ping Li,Lei Zhu
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); 3. Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026, Poster
Abstract:High-precision scene parsing tasks, including image matting and dichotomous segmentation, aim to accurately predict masks with extremely fine details (such as hair). Most existing methods focus on salient, single foreground objects. While interactive methods allow for target adjustment, their class-agnostic design restricts generalization across different categories. Furthermore, the scarcity of high-quality annotation has led to a reliance on inharmonious synthetic data, resulting in poor generalization to real-world scenarios. To this end, we propose a Foreground Consistent Learning model, dubbed as FCLM, to address the aforementioned issues. Specifically, we first introduce a Depth-Aware Distillation strategy where we transfer the depth-related knowledge for better foreground representation. Considering the data dilemma, we term the processing of synthetic data as domain adaptation problem where we propose a domain-invariant learning strategy to focus on foreground learning. To support interactive prediction, we contribute an Object-Oriented Decoder that can receive both visual and language prompts to predict the referring target. Experimental results show that our method quantitatively and qualitatively outperforms SOTA methods.
zh
[CV-225] EmoLat: Text-driven Image Sentiment Transfer via Emotion Latent Space
【速读】:该论文旨在解决图像情感(image sentiment)的细粒度、文本驱动迁移问题,即如何通过自然语言指令精确控制图像的情感倾向,同时保持语义一致性与视觉真实性。其核心挑战在于建模文本语义与视觉情感特征之间的跨模态关联,并实现高保真度的情感转换。解决方案的关键在于提出EmoLat——一个情感潜在空间(emotion latent space),该空间通过构建情绪语义图(emotion semantic graph)捕捉情绪、物体与视觉属性间的结构化关系,并引入对抗正则化以对齐跨模态的情感分布,从而增强情感表征的可区分性与迁移能力;在此基础上设计的跨模态情感迁移框架,结合文本与EmoLat特征的联合嵌入,利用多目标损失函数(包含语义一致性、情感对齐和对抗正则化)优化模型,最终实现了可控且高质量的图像情感编辑。
链接: https://arxiv.org/abs/2601.12079
作者: Jing Zhang,Bingjie Fan,Jixiang Zhu,Zhe Wang
机构: East China University of Science and Technology (华东理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
Abstract:We propose EmoLat, a novel emotion latent space that enables fine-grained, text-driven image sentiment transfer by modeling cross-modal correlations between textual semantics and visual emotion features. Within EmoLat, an emotion semantic graph is constructed to capture the relational structure among emotions, objects, and visual attributes. To enhance the discriminability and transferability of emotion representations, we employ adversarial regularization, aligning the latent emotion distributions across modalities. Building upon EmoLat, a cross-modal sentiment transfer framework is proposed to manipulate image sentiment via joint embedding of text and EmoLat features. The network is optimized using a multi-objective loss incorporating semantic consistency, emotion alignment, and adversarial regularization. To support effective modeling, we construct EmoSpace Set, a large-scale benchmark dataset comprising images with dense annotations on emotions, object semantics, and visual attributes. Extensive experiments on EmoSpace Set demonstrate that our approach significantly outperforms existing state-of-the-art methods in both quantitative metrics and qualitative transfer fidelity, establishing a new paradigm for controllable image sentiment editing guided by textual input. The EmoSpace Set and all the code are available at this http URL.
zh
[CV-226] ARMARecon: An ARMA Convolutional Filter based Graph Neural Network for Neurodegenerative Dementias Classification
【速读】:该论文旨在解决神经退行性疾病(如阿尔茨海默病 Alzheimer’s Disease, AD 和额颞叶痴呆 Frontotemporal Dementia, FTD)早期检测的难题,以降低疾病进展至严重阶段的风险。其核心挑战在于AD和FTD沿白质(white-matter)区域以全局图依赖方式传播,传统方法难以有效建模这种复杂的空间结构关系。解决方案的关键在于提出ARMARecon框架,该框架融合自回归移动平均(Autoregressive Moving Average, ARMA)图滤波与重建驱动的目标函数,利用从白质区域提取的20-bin各向异性分数(Fractional Anisotropy, FA)直方图特征,同时捕捉局部与全局连接模式,并有效缓解过平滑问题,从而显著提升分类准确性和特征表示能力。
链接: https://arxiv.org/abs/2601.12067
作者: VSS Tejaswi Abburi,Ananya Singhal,Saurabh J. Shigwan,Nitin Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract:Early detection of neurodegenerative diseases such as Alzheimer’s Disease (AD) and Frontotemporal Dementia (FTD) is essential for reducing the risk of progression to severe disease stages. As AD and FTD propagate along white-matter regions in a global, graph-dependent manner, graph-based neural networks are well suited to capture these patterns. Hence, we introduce ARMARecon, a unified graph learning framework that integrates Autoregressive Moving Average (ARMA) graph filtering with a reconstruction-driven objective to enhance feature representation and improve classification accuracy. ARMARecon effectively models both local and global connectivity by leveraging 20-bin Fractional Anisotropy (FA) histogram features extracted from white-matter regions, while mitigating over-smoothing. Overall, ARMARecon achieves superior performance compared to state-of-the-art methods on the multi-site dMRI datasets ADNI and NIFD.
zh
[CV-227] Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation
【速读】:该论文旨在解决现有视频对象移除方法依赖扩散模型从无信息的高斯噪声开始生成,导致缺乏结构和上下文先验引导,从而引发对象移除不完整或合成内容与场景物理逻辑冲突的问题。其解决方案的关键在于将视频对象移除重新建模为一种视频到视频的翻译任务,通过随机桥接(stochastic bridge)模型建立源视频(含对象)到目标视频(对象已移除)之间的直接随机路径,充分利用输入视频作为强结构先验来指导精确移除并确保填充区域在逻辑上与环境一致;同时提出一种自适应掩码调制策略,动态根据掩码特征调节输入嵌入,平衡背景保真度与生成灵活性,有效缓解强桥接先验对大对象移除的限制。
链接: https://arxiv.org/abs/2601.12066
作者: Zijie Lou,Xiangwei Feng,Jiaxin Wang,Xiaochao Qu,Luoqi Liu,Ting Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene’s physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency.
zh
[CV-228] Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification
【速读】:该论文旨在解决视频-based可见光-红外行人重识别(VVI-ReID)中序列级模态不变表示学习的问题,具体包括高效时空建模不足、跨模态交互不充分以及缺乏显式的模态层级损失引导等挑战。其解决方案的关键在于提出语言驱动的序列级模态不变表示学习方法(LSMRL),包含三个核心模块:时空特征学习(STFL)模块通过最小修改基于CLIP实现参数与计算高效的时空建模;语义扩散(SD)模块将共享语言提示扩散至可见光和红外特征以建立初步模态一致性;跨模态交互(CMI)模块利用双向跨模态自注意力机制消除残余模态差异并精炼模态不变表示。此外,引入两种模态层级损失以显式增强模态不变表示的学习,提升特征判别能力和对未见类别的泛化性能。
链接: https://arxiv.org/abs/2601.12062
作者: Xiaomei Yang,Xizhan Gao,Antai Liu,Kang Wei,Fa Zhu,Guang Feng,Xiaofeng Qu,Sijie Niu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving optimal performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes spatial-temporal feature learning (STFL) module, semantic diffusion (SD) module and cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module is proposed to diffuse modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module is further developed to leverage bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features’ discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over AOTA methods.
zh
[CV-229] Automating Parameter Selection in Deep Image Prior for Fluorescence Microscopy Image Denoising via Similarity-Based Parameter Transfer
【速读】:该论文旨在解决深度图像先验(Deep Image Prior, DIP)在实际应用中因每次处理新图像都需要重新优化网络架构和迭代停止点而导致效率低下的问题,尤其针对荧光显微成像中大量图像需快速处理的场景。其解决方案的关键在于提出AUTO-DIP框架,通过构建一个校准数据集(n=110)与验证集(n=55),发现相似图像在元数据层面(如显微镜类型、样本种类)上的相似性即可有效预测最优DIP参数配置,从而实现无需重新优化的参数迁移;实验表明,基于元数据相似性的参数转移性能优于基于定量图像特征的转移方式,并显著优于原始DIP及当前最先进的变分去噪方法,尤其在高噪声条件下表现突出。
链接: https://arxiv.org/abs/2601.12055
作者: Lina Meyer,Felix Wissel,Tobias Knopp,Susanne Pfefferle,Ralf Fliegert,Maximilian Sandmann,Liana Uebler,Franziska Möckl,Björn-Philipp Diercks,David Lohr,René Werner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Unsupervised deep image prior (DIP) addresses shortcomings of training data requirements and limited generalization associated with supervised deep learning. The performance of DIP depends on the network architecture and the stopping point of its iterative process. Optimizing these parameters for a new image requires time, restricting DIP application in domains where many images need to be processed. Focusing on fluorescence microscopy data, we hypothesize that similar images share comparable optimal parameter configurations for DIP-based denoising, potentially enabling optimization-free DIP for fluorescence microscopy. We generated a calibration (n=110) and validation set (n=55) of semantically different images from an open-source dataset for a network architecture search targeted towards ideal U-net architectures and stopping points. The calibration set represented our transfer basis. The validation set enabled the assessment of which image similarity criterion yields the best results. We then implemented AUTO-DIP, a pipeline for automatic parameter transfer, and compared it to the originally published DIP configuration (baseline) and a state-of-the-art image-specific variational denoising approach. We show that a parameter transfer from the calibration dataset to a test image based on only image metadata similarity (e.g., microscope type, imaged specimen) leads to similar and better performance than a transfer based on quantitative image similarity measures. AUTO-DIP outperforms the baseline DIP (DIP with original DIP parameters) as well as the variational denoising approaches for several open-source test datasets of varying complexity, particularly for very noisy inputs. Applications to locally acquired fluorescence microscopy images further proved superiority of AUTO-DIP.
zh
[CV-230] ask-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation
【速读】:该论文旨在解决光学遥感影像中云层遮挡导致的下游应用受限问题,特别是现有云去除(Cloud Removal, CR)方法在追求视觉保真度时易过度平滑关键纹理与边界,从而造成视觉恢复与语义实用性之间的不匹配。解决方案的关键在于提出一种任务驱动的多模态框架TDP-CR,其核心创新是Prompt-Guided Fusion(PGF)机制,该机制通过可学习的退化提示(degradation prompt)编码云层厚度与空间不确定性,并结合全局通道上下文与局部提示条件化的空间偏置,自适应地融合合成孔径雷达(SAR)信息仅在光学数据受损区域;同时引入参数高效的两阶段训练策略,解耦重建与语义表征学习,从而实现高质量且具备分析可用性的遥感数据输出。
链接: https://arxiv.org/abs/2601.12052
作者: Zaiyan Zhang,Jie Li,Shaowei Shi,Qiangqiang Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IGARSS 2026 Conference
Abstract:Optical remote sensing imagery is indispensable for Earth observation, yet persistent cloud occlusion limits its downstream utility. Most cloud removal (CR) methods are optimized for low-level fidelity and can over-smooth textures and boundaries that are critical for analysis-ready data (ARD), leading to a mismatch between visually plausible restoration and semantic utility. To bridge this gap, we propose TDP-CR, a task-driven multimodal framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion (PGF) mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. By combining global channel context with local prompt-conditioned spatial bias, PGF adaptively integrates Synthetic Aperture Radar (SAR) information only where optical data is corrupted. We further introduce a parameter-efficient two-phase training strategy that decouples reconstruction and semantic representation learning. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework: TDP-CR surpasses heavy state-of-the-art baselines by 0.18 dB in PSNR while using only 15% of the parameters, and achieves a 1.4% improvement in mIoU consistently against multi-task competitors, effectively delivering analysis-ready data.
zh
[CV-231] A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models
【速读】:该论文旨在解决Transformer架构在联邦学习(Federated Learning)场景下面临的两个关键问题:一是位置嵌入(Position Embeddings, PEs)的梯度信息易被攻击者利用以重构输入数据,从而引发隐私泄露风险;二是模型在计算机视觉(CV)与自然语言处理(NLP)任务中性能受限,尤其在缺乏局部空间信息依赖的情况下难以优化。解决方案的核心在于提出一种统一的Masked Jigsaw Puzzle(MJP)框架:通过随机打乱token顺序破坏原始位置编码结构,并引入可学习的“未知”(unk)位置嵌入来掩码打乱后的token位置信息,从而迫使模型学习不依赖于局部空间关系的特征表示。实验表明,该方法不仅能有效提升对梯度攻击的鲁棒性,还能同步增强模型在图像分类(如ImageNet-1K)和文本情感分析(如Yelp、Amazon)等多模态任务中的性能表现。
链接: https://arxiv.org/abs/2601.12051
作者: Weixin Ye,Wei Wang,Yahui Liu,Yue Song,Bin Ren,Wei Bi,Rita Cucchiara,Nicu Sebe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 figures, 12 tables
Abstract:In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable \textitunknown (unk) position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models’ robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (\textite.g., ImageNet-1K) and sentiment analysis for text (\textite.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks. Code is publicly available via this https URL
zh
[CV-232] textitFocaLogic: Logic-Based Interpretation of Visual Model Decisions
【速读】:该论文旨在解决当前视觉模型可解释性方法中存在的两大问题:一是多数方法依赖白盒模型访问,限制了其通用性;二是缺乏足够的定量严谨性,难以客观评估模型决策机制。解决方案的关键在于提出一种全新的、与模型无关的框架FocaLogic,该框架通过识别决定模型预测的最小可解释视觉区域(称为“视觉焦点”),并将这些焦点转化为精确且紧凑的逻辑表达式,从而实现透明、结构化的解释。同时,FocaLogic引入了一套量化指标(如焦点精度、召回率和分歧度)来客观评估不同场景下模型行为,显著提升了视觉模型解释的系统性、可扩展性和定量可靠性。
链接: https://arxiv.org/abs/2601.12049
作者: Chenchen Zhao,Muxi Chen,Qiang Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 13 figures
Abstract:Interpretability of modern visual models is crucial, particularly in high-stakes applications. However, existing interpretability methods typically suffer from either reliance on white-box model access or insufficient quantitative rigor. To address these limitations, we introduce FocaLogic, a novel model-agnostic framework designed to interpret and quantify visual model decision-making through logic-based representations. FocaLogic identifies minimal interpretable subsets of visual regions-termed visual focuses-that decisively influence model predictions. It translates these visual focuses into precise and compact logical expressions, enabling transparent and structured interpretations. Additionally, we propose a suite of quantitative metrics, including focus precision, recall, and divergence, to objectively evaluate model behavior across diverse scenarios. Empirical analyses demonstrate FocaLogic’s capability to uncover critical insights such as training-induced concentration, increasing focus accuracy through generalization, and anomalous focuses under biases and adversarial attacks. Overall, FocaLogic provides a systematic, scalable, and quantitative solution for interpreting visual models.
zh
[CV-233] Multimodal Feedback for Handheld Tool Guidance: Combining Wrist-Based Haptics with Augmented Reality
【速读】:该论文旨在解决光学透视增强现实(Optical See-Through Augmented Reality, OST-AR)中手持工具操作时因视觉遮挡、光照条件变化及界面模糊性导致的空间引导精度不足与用户信心缺失问题。解决方案的关键在于设计并验证了一种融合AR视觉信息与腕部触觉反馈的多模态系统,通过定制化的可穿戴触觉设备提供方向性和状态感知的振动提示,从而在高视觉复杂度任务中提升空间精度与用户体验。实验表明,结合AR与触觉反馈的方案显著优于单一模态,实现了5.8 mm的空间精度和88.1的系统可用性量表(System Usability Scale, SUS)得分,同时降低认知负荷并增强对工具定位的确认感。
链接: https://arxiv.org/abs/2601.12037
作者: Yue Yang,Christoph Leuze,Brian Hargreaves,Bruce Daniel,Fred M Baik
机构: Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:
Abstract:We investigate how vibrotactile wrist feedback can enhance spatial guidance for handheld tool movement in optical see-through augmented reality (AR). While AR overlays are widely used to support surgical tasks, visual occlusion, lighting conditions, and interface ambiguity can compromise precision and confidence. To address these challenges, we designed a multimodal system combining AR visuals with a custom wrist-worn haptic device delivering directional and state-based cues. A formative study with experienced surgeons and residents identified key tool maneuvers and preferences for reference mappings, guiding our cue design. In a cue identification experiment (N=21), participants accurately recognized five vibration patterns under visual load, with higher recognition for full-actuator states than spatial direction cues. In a guidance task (N=27), participants using both AR and haptics achieved significantly higher spatial precision (5.8 mm) and usability (SUS = 88.1) than those using either modality alone, despite having modest increases in task time. Participants reported that haptic cues provided reassuring confirmation and reduced cognitive effort during alignment. Our results highlight the promise of integrating wrist-based haptics into AR systems for high-precision, visually complex tasks such as surgical guidance. We discuss design implications for multimodal interfaces supporting confident, efficient tool manipulation.
zh
[CV-234] DIAMOND-SSS: Diffusion-Augmented Multi-View Optimization for Data-efficient SubSurface Scattering
【速读】:该论文旨在解决次表面散射(Subsurface Scattering, SSS)在神经渲染中建模困难的问题,尤其是如何在极稀疏监督条件下实现高保真半透明材质重建。传统方法依赖于密集采集的多视角、多光源数据集(通常超过100个视角和112个光照条件),而本文提出DIAMOND-SSS框架,通过微调扩散模型进行新视角合成与再光照,仅需不到7%的数据即可生成逼真的增强图像,最多可替代95%缺失的实测数据。其关键创新在于引入光照无关的几何先验:多视角轮廓一致性损失与多视角深度一致性损失,从而在稀疏或合成监督下稳定重建质量,在所有稀疏程度下均优于当前最优的可再光照高斯渲染方法,将真实采集需求减少达90%。
链接: https://arxiv.org/abs/2601.12020
作者: Guillermo Figueroa-Araneda,Iris Diana Jimenez,Florian Hofherr,Manny Ko,Hector Andrade-Loarca,Daniel Cremers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Subsurface scattering (SSS) gives translucent materials – such as wax, jade, marble, and skin – their characteristic soft shadows, color bleeding, and diffuse glow. Modeling these effects in neural rendering remains challenging due to complex light transport and the need for densely captured multi-view, multi-light datasets (often more than 100 views and 112 OLATs). We present DIAMOND-SSS, a data-efficient framework for high-fidelity translucent reconstruction from extremely sparse supervision – even as few as ten images. We fine-tune diffusion models for novel-view synthesis and relighting, conditioned on estimated geometry and trained on less than 7 percent of the dataset, producing photorealistic augmentations that can replace up to 95 percent of missing captures. To stabilize reconstruction under sparse or synthetic supervision, we introduce illumination-independent geometric priors: a multi-view silhouette consistency loss and a multi-view depth consistency loss. Across all sparsity regimes, DIAMOND-SSS achieves state-of-the-art quality in relightable Gaussian rendering, reducing real capture requirements by up to 90 percent compared to SSS-3DGS. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.12020 [cs.CV] (or arXiv:2601.12020v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.12020 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-235] SAR-Based Marine Oil Spill Detection Using the DeepSegFusion Architecture
【速读】:该论文旨在解决卫星合成孔径雷达(SAR)图像中油污泄漏检测的高误报率问题,传统基于阈值的方法因风浪痕迹和船舶尾流等类油污现象导致性能下降。解决方案的关键在于提出一种混合深度学习模型 DeepSegFusion,其核心创新是将 SegNet 与 DeepLabV3+ 结合,并引入注意力机制进行特征融合,从而提升边界精度和上下文理解能力,显著降低误检率(减少64.4%),同时在多个 SAR 油污数据集上实现高达 94.85% 的准确率和 0.9330 的 ROC-AUC 分数,具备近实时监测潜力。
链接: https://arxiv.org/abs/2601.12015
作者: Pavan Kumar Yata,Pediredla Pradeep,Goli Himanish,Swathi M
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures. Submitted to arXiv. Code and dataset details included in the paper
Abstract:Detection of oil spills from satellite images is essential for both environmental surveillance and maritime safety. Traditional threshold-based methods frequently encounter performance degradation due to very high false alarm rates caused by look-alike phenomena such as wind slicks and ship wakes. Here, a hybrid deep learning model, DeepSegFusion, is presented for oil spill segmentation in Synthetic Aperture Radar (SAR) images. The model uses SegNet and DeepLabV3+ integrated with an attention-based feature fusion mechanism to achieve better boundary precision as well as improved contextual understanding. Results obtained on SAR oil spill datasets, including ALOS PALSAR imagery, confirm that the proposed DeepSegFusion model achieves an accuracy of 94.85%, an Intersection over Union (IoU) of 0.5685, and a ROC-AUC score of 0.9330. The proposed method delivers more than three times fewer false detections compared to individual baseline models and traditional non-segmentation methods, achieving a reduction of 64.4%. These results indicate that DeepSegFusion is a stable model under various marine conditions and can therefore be used in near real-time oil spill monitoring scenarios.
zh
[CV-236] SMc2f: Robust Scenario Mining for Robotic Autonomy from Coarse to Fine
【速读】:该论文旨在解决自主机器人车辆在安全验证过程中,如何高效、准确地从海量真实驾驶日志中挖掘稀有但关键的安全场景(safety-critical scenarios)的问题。现有方法如RefAV依赖轨迹标签进行自然语言到场景的定位,忽略了图像与文本之间的直接关联,且受上游3D目标检测和跟踪质量的影响,导致定位不准确。解决方案的关键在于提出一种粗粒度到细粒度的鲁棒场景挖掘框架SMc2f:首先利用视觉-语言模型(VLMs)对原始RGB图像与文本描述进行粗筛;其次构建基于RefAV的成功案例数据库,通过少样本示例微调大语言模型(LLM)以提升检索鲁棒性;最后引入文本-轨迹对比学习,在统一嵌入空间中拉近匹配对、推远不匹配对,从而精炼LLM候选轨迹,显著提升检索精度与效率。
链接: https://arxiv.org/abs/2601.12010
作者: Yifei Chen,Ross Greer
机构: Xi’an University of Technology (西安工业大学); University of California, Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The safety validation of autonomous robotic vehicles hinges on systematically testing their planning and control stacks against rare, safety-critical scenarios. Mining these long-tail events from massive real-world driving logs is therefore a critical step in the robotic development lifecycle. The goal of the Scenario Mining task is to retrieve useful information to enable targeted re-simulation, regression testing, and failure analysis of the robot’s decision-making algorithms. RefAV, introduced by the Argoverse team, is an end-to-end framework that uses large language models (LLMs) to spatially and temporally localize scenarios described in natural language. However, this process performs retrieval on trajectory labels, ignoring the direct connection between natural language and raw RGB images, which runs counter to the intuition of video retrieval; it also depends on the quality of upstream 3D object detection and tracking. Further, inaccuracies in trajectory data lead to inaccuracies in downstream spatial and temporal localization. To address these issues, we propose Robust Scenario Mining for Robotic Autonomy from Coarse to Fine (SMc2f), a coarse-to-fine pipeline that employs vision-language models (VLMs) for coarse image-text filtering, builds a database of successful mining cases on top of RefAV and automatically retrieves exemplars to few-shot condition the LLM for more robust retrieval, and introduces text-trajectory contrastive learning to pull matched pairs together and push mismatched pairs apart in a shared embedding space, yielding a fine-grained matcher that refines the LLM’s candidate trajectories. Experiments on public datasets demonstrate substantial gains in both retrieval quality and efficiency.
zh
[CV-237] DAOS: A Multimodal In-cabin Behavior Monitoring with Driver Action-Object Synergy Dataset
【速读】:该论文旨在解决当前驾驶员行为监测中因上半身动作相似性高而导致的动作识别困难问题,尤其是现有数据集普遍缺乏精确的对象位置标注或未建立对象与动作之间的关联,从而限制了可靠的行为识别性能。其核心解决方案是提出DAOS(Driver Action with Object Synergy)数据集和AOR-Net(Action-Object-Relation Network)模型:DAOS包含9,787个视频片段、36类细粒度驾驶行为及15类物体实例,提供多模态(RGB、红外、深度)、多视角(前、面、左、右)数据;AOR-Net通过多层次推理机制与链式动作提示(chain-of-action prompting)建模动作、物体及其关系间的逻辑关联,并引入Mixture of Thoughts模块动态选择各阶段关键知识,显著提升在物体丰富与稀缺场景下的鲁棒性。
链接: https://arxiv.org/abs/2601.11990
作者: Yiming Li,Chen Cai,Tianyi Liu,Dan Lin,Wenqian Wang,Wenfei Liang,Bingbing Li,Kim-Hui Yap
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In driver activity monitoring, movements are mostly limited to the upper body, which makes many actions look similar. To tell these actions apart, human often rely on the objects the driver is using, such as holding a phone compared with gripping the steering wheel. However, most existing driver-monitoring datasets lack accurate object-location annotations or do not link objects to their associated actions, leaving a critical gap for reliable action recognition. To address this, we introduce the Driver Action with Object Synergy (DAOS) dataset, comprising 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes, totaling more than 2.5 million corresponding object instances. DAOS offers multi-modal, multi-view data (RGB, IR, and depth) from front, face, left, and right perspectives. Although DAOS captures a wide range of cabin objects, only a few are directly relevant to each action for prediction, so focusing on task-specific human-object relations is essential. To tackle this challenge, we propose the Action-Object-Relation Network (AOR-Net). AOR-Net comprehends complex driver actions through multi-level reasoning and a chain-of-action prompting mechanism that models the logical relationships among actions, objects, and their relations. Additionally, the Mixture of Thoughts module is introduced to dynamically select essential knowledge at each stage, enhancing robustness in object-rich and object-scarce conditions. Extensive experiments demonstrate that our model outperforms other state-of-the-art methods on various datasets.
zh
[CV-238] Structural Graph Neural Networks with Anatomical Priors for Explainable Chest X-ray Diagnosis
【速读】:该论文旨在解决医学影像诊断中缺乏可解释性的问题,尤其是在基于视觉的诊断任务中,如何实现结构化推理与内在可解释性的统一。传统方法通常依赖于后处理可视化技术来解释模型决策,但这种方法难以提供可靠的因果推理依据。解决方案的关键在于提出一种结构图推理框架(structural graph reasoning framework),将卷积特征图重新诠释为基于补丁级别的图结构,其中节点编码外观信息和空间坐标,边反映局部结构邻接关系;更重要的是,引入了一种定制化的结构传播机制,显式建模相对空间关系作为推理过程的一部分,使图结构成为引导结构化推断的归纳偏置(inductive bias),而非被动的关系表示。该设计同时支持节点级病变感知预测与图级诊断推理,并通过学习到的节点重要性得分实现内在可解释性,无需依赖后处理可视化技术。
链接: https://arxiv.org/abs/2601.11987
作者: Khaled Berkani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, 3 tables
Abstract:We present a structural graph reasoning framework that incorporates explicit anatomical priors for explainable vision-based diagnosis. Convolutional feature maps are reinterpreted as patch-level graphs, where nodes encode both appearance and spatial coordinates, and edges reflect local structural adjacency. Unlike conventional graph neural networks that rely on generic message passing, we introduce a custom structural propagation mechanism that explicitly models relative spatial relations as part of the reasoning process. This design enables the graph to act as an inductive bias for structured inference rather than a passive relational representation. The proposed model jointly supports node-level lesion-aware predictions and graph-level diagnostic reasoning, yielding intrinsic explainability through learned node importance scores without relying on post-hoc visualization techniques. We demonstrate the approach through a chest X-ray case study, illustrating how structural priors guide relational reasoning and improve interpretability. While evaluated in a medical imaging context, the framework is domain-agnostic and aligns with the broader vision of graph-based reasoning across artificial intelligence systems. This work contributes to the growing body of research exploring graphs as computational substrates for structure-aware and explainable learning.
zh
[CV-239] An AI-IoT Based Smart Wheelchair with Gesture-Controlled Mobility Deep Learning-Based Obstacle Detection Multi-Sensor Health Monitoring and Emergency Alert System
【速读】:该论文旨在解决传统轮椅缺乏动态功能、智能轮椅成本高昂且健康监测集成度低的问题,以满足残障人士和老年人对安全导航与健康监护一体化的迫切需求。其解决方案的关键在于构建一个基于AI-IoT(人工智能-物联网)的多模态智能轮椅系统:通过手套式手势控制实现免提操作(准确率达95.5%),采用YOLOv8目标检测结合语音反馈进行实时障碍物识别(精度91.5%、召回率90.2%、F1分数90.8%),并辅以超声波即时碰撞预警(准确率94%),同时持续监测心率、血氧饱和度(SpO₂)、心电图(ECG)及体温等生命体征并通过ThingSpeak平台上传数据、触发邮件警报,从而在模块化低成本架构下实现用户自主性、安全性与独立性的显著提升。
链接: https://arxiv.org/abs/2601.11983
作者: Md. Asiful Islam,Abdul Hasib,Tousif Mahmud Emon,Khandaker Tabin Hasan,A. S. M. Ahsanul Sarkar Akib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The growing number of differently-abled and elderly individuals demands affordable, intelligent wheelchairs that combine safe navigation with health monitoring. Traditional wheelchairs lack dynamic features, and many smart alternatives remain costly, single-modality, and limited in health integration. Motivated by the pressing demand for advanced, personalized, and affordable assistive technologies, we propose a comprehensive AI-IoT based smart wheelchair system that incorporates glove-based gesture control for hands-free navigation, real-time object detection using YOLOv8 with auditory feedback for obstacle avoidance, and ultrasonic for immediate collision avoidance. Vital signs (heart rate, SpO _2 , ECG, temperature) are continuously monitored, uploaded to ThingSpeak, and trigger email alerts for critical conditions. Built on a modular and low-cost architecture, the gesture control achieved a 95.5% success rate, ultrasonic obstacle detection reached 94% accuracy, and YOLOv8-based object detection delivered 91.5% Precision, 90.2% Recall, and a 90.8% F1-score. This integrated, multi-modal approach offers a practical, scalable, and affordable solution, significantly enhancing user autonomy, safety, and independence by bridging the gap between innovative research and real-world deployment.
zh
[CV-240] Nip Rumors in the Bud: Retrieval-Guided Topic-Level Adaptation for Test-Time Fake News Video Detection KDD2026
【速读】:该论文旨在解决虚假新闻视频检测(Fake News Video Detection, FNVD)中因训练与测试阶段新闻主题分布不一致而导致的模型泛化能力不足问题,特别是对新兴事件和未见主题的虚假视频难以识别。其解决方案的关键在于提出RADAR框架,首次实现测试时自适应(test-time adaptation),核心创新包括:1)基于熵选择的检索机制(Entropy Selection-Based Retrieval),从目标域中筛选低熵、语义相关的稳定参考视频;2)稳定锚点引导对齐模块(Stable Anchor-Guided Alignment),通过分布级匹配将不稳定实例的表征对齐至源域,缓解领域差异;3)目标域感知自训练机制(Target-Domain Aware Self-Training),利用稳定参考生成有信息量的伪标签,适应目标域中类别分布变化与不平衡特性。这些设计共同实现了对未见虚假新闻视频主题的实时、鲁棒适应能力。
链接: https://arxiv.org/abs/2601.11981
作者: Jian Lang,Rongpei Hong,Ting Zhong,Yong Wang,Fan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages. Accepted by KDD 2026 research track. Codes are released at this https URL
Abstract:Fake News Video Detection (FNVD) is critical for social stability. Existing methods typically assume consistent news topic distribution between training and test phases, failing to detect fake news videos tied to emerging events and unseen topics. To bridge this gap, we introduce RADAR, the first framework that enables test-time adaptation to unseen news videos. RADAR pioneers a new retrieval-guided adaptation paradigm that leverages stable (source-close) videos from the target domain to guide robust adaptation of semantically related but unstable instances. Specifically, we propose an Entropy Selection-Based Retrieval mechanism that provides videos with stable (low-entropy), relevant references for adaptation. We also introduce a Stable Anchor-Guided Alignment module that explicitly aligns unstable instances’ representations to the source domain via distribution-level matching with their stable references, mitigating severe domain discrepancies. Finally, our novel Target-Domain Aware Self-Training paradigm can generate informative pseudo-labels augmented by stable references, capturing varying and imbalanced category distributions in the target domain and enabling RADAR to adapt to the fast-changing label distributions. Extensive experiments demonstrate that RADAR achieves superior performance for test-time FNVD, enabling strong on-the-fly adaptation to unseen fake news video topics.
zh
[CV-241] AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering
【速读】:该论文旨在解决多页文档视觉问答(Multi-page Document Visual Question Answering, MP-DocVQA)中因长文档导致的计算资源消耗大以及大视觉语言模型(Large Vision-Language Models, LVLMs)注意力机制效率下降的问题。其核心解决方案是提出一种自适应文档内视觉检索框架(Adaptive Visual In-document Retrieval, AVIR):首先通过轻量级检索模型对每一页进行与问题的相关性评分,再依据评分分布对页面进行聚类以自适应选择相关片段;随后采用Top-K筛选进一步压缩上下文;针对短文档聚类可靠性降低的情况,引入相关性概率阈值进行页面选择。最终仅将精选页面输入冻结的LVLM生成答案,无需模型微调。该方法在保持高精度(ANLS达84.58%)的同时,平均减少70%的处理页数,显著降低计算成本。
链接: https://arxiv.org/abs/2601.11976
作者: Zongmin Li,Yachuan Li,Lei Kang,Dimosthenis Karatzas,Wenkang Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures
Abstract:Multi-page Document Visual Question Answering (MP-DocVQA) remains challenging because long documents not only strain computational resources but also reduce the effectiveness of the attention mechanism in large vision-language models (LVLMs). We tackle these issues with an Adaptive Visual In-document Retrieval (AVIR) framework. A lightweight retrieval model first scores each page for question relevance. Pages are then clustered according to the score distribution to adaptively select relevant content. The clustered pages are screened again by Top-K to keep the context compact. However, for short documents, clustering reliability decreases, so we use a relevance probability threshold to select pages. The selected pages alone are fed to a frozen LVLM for answer generation, eliminating the need for model fine-tuning. The proposed AVIR framework reduces the average page count required for question answering by 70%, while achieving an ANLS of 84.58% on the MP-DocVQA dataset-surpassing previous methods with significantly lower computational cost. The effectiveness of the proposed AVIR is also verified on the SlideVQA and DUDE benchmarks. The code is available at this https URL.
zh
[CV-242] Real-Time Multi-Modal Embedded Vision Framework for Object Detection Facial Emotion Recognition and Biometric Identification on Low-Power Edge Platforms
【速读】:该论文旨在解决智能监控系统在边缘设备上运行多模态感知任务(如目标检测、人脸识别和情绪分析)时,因缺乏统一且自适应的运行时调度机制而导致的计算资源利用率低、整体理解能力弱的问题。解决方案的关键在于提出了一种基于上下文触发的自适应调度机制,能够根据实时场景动态激活特定模块(如YOLOv8n用于目标检测、基于FaceNet的人脸嵌入系统用于身份识别、DeepFace CNN用于情绪分类),从而将计算负载降低65%,在保证性能的同时显著提升系统效率,实现在低成本边缘硬件(Raspberry Pi 5)上的高效部署与隐私保护。
链接: https://arxiv.org/abs/2601.11970
作者: S. M. Khalid Bin Zahid,Md. Rakibul Hasan Nishat,Abdul Hasib,Md. Rakibul Hasan,Md. Ashiqussalehin,Md. Sahadat Hossen Sajib,A. S. M. Ahsanul Sarkar Akib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Intelligent surveillance systems often handle perceptual tasks such as object detection, facial recognition, and emotion analysis independently, but they lack a unified, adaptive runtime scheduler that dynamically allocates computational resources based on contextual triggers. This limits their holistic understanding and efficiency on low-power edge devices. To address this, we present a real-time multi-modal vision framework that integrates object detection, owner-specific face recognition, and emotion detection into a unified pipeline deployed on a Raspberry Pi 5 edge platform. The core of our system is an adaptive scheduling mechanism that reduces computational load by 65% compared to continuous processing by selectively activating modules such as, YOLOv8n for object detection, a custom FaceNet-based embedding system for facial recognition, and DeepFace’s CNN for emotion classification. Experimental results demonstrate the system’s efficacy, with the object detection module achieving an Average Precision (AP) of 0.861, facial recognition attaining 88% accuracy, and emotion detection showing strong discriminatory power (AUC up to 0.97 for specific emotions), while operating at 5.6 frames per second. Our work demonstrates that context-aware scheduling is the key to unlocking complex multi-modal AI on cost-effective edge hardware, making intelligent perception more accessible and privacy-preserving.
zh
[CV-243] A Constraint Programming Model for the Super-Agile Earth Observation Satellite Imaging Scheduling Problem
【速读】:该论文旨在解决超敏捷地球观测卫星(Super-Agile Earth Observation Satellites, SAEOS)的成像调度问题(SAEOS-ISP),该问题因卫星具备可变观测时长、多指向成像能力及跨卫星序列依赖的过渡时间而变得复杂,且现有方法无法有效处理这些动态特性。解决方案的关键在于首次提出一个精确的约束规划(Constraint Programming)数学模型,该模型同时考虑灵活的观测窗口、多角度成像方向以及多颗卫星间的序列相关过渡时间,从而实现对复杂调度约束的完整建模与高效求解。计算实验表明,该方法能在极短时间内获得最优解,相较当前最先进的非精确算法展现出显著的计算性能优势。
链接: https://arxiv.org/abs/2601.11967
作者: Margarida Caleiras,Samuel Moniz,Paulo Nascimento
机构: University of Coimbra (科英布拉大学)
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注: 12 pages, 4 figures, To be published in the Proceedings of the International Conference on Operations Research and Enterprise Systems (ICORES 2026)
Abstract:As the dependence on satellite imaging continues to grow, modern satellites have become increasingly agile, with the new generation, namely super-agile Earth observation satellites (SAEOS), providing unprecedented imaging flexibility. The highly dynamic capabilities of these satellites introduce additional challenges to the scheduling of observation tasks, as existing approaches for conventional agile satellites do not account for variable observation durations and multiple imaging directions. Although some efforts have been made in this regard, the SAEOS imaging scheduling problem (SAEOS-ISP) remains largely unexplored, and no exact approaches have yet been proposed. In this context, this study presents the first exact Constraint Programming formulation for the SAEOS-ISP, considering flexible observation windows, multiple pointing directions and sequence-dependent transition times across multiple satellites. Computational experiments on a newly generated benchmark set demonstrate that the model can be solved efficiently and within very short computational times. Moreover, the results also show that the proposed approach has the potential to achieve higher computational performance compared to the non-exact approaches that are currently considered state-of-the-art.
zh
[CV-244] Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal
【速读】:该论文旨在解决盒式(box-free)模型水印中解码器(decoder)存在的安全漏洞问题,即攻击者可通过查询响应获取反向传播梯度,进而训练水印移除器(watermark remover)以破坏水印完整性。其解决方案的关键在于提出了一类名为Decoder Gradient Shields (DGSs) 的防御机制,包括输出层(DGS-O)、输入层(DGS-I)和中间层(DGS-L)的梯度屏蔽策略,其中DGS-O具备闭式解,所有DGS均具有可证明的性能保障。该方法通过联合设计梯度重定向与缩放操作,有效阻断水印移除器的训练收敛至低损失值,同时保持解码器输出图像质量,实验表明在去雨和图像生成任务中对当前最先进的盒式水印方案实现了100%的防御成功率。
链接: https://arxiv.org/abs/2601.11952
作者: Haonan An,Guang Hua,Wei Du,Hangcheng Cao,Yihang Tao,Guowen Xu,Susanto Rahardja,Yuguang Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Box-free model watermarking has gained significant attention in deep neural network (DNN) intellectual property protection due to its model-agnostic nature and its ability to flexibly manage high-entropy image outputs from generative models. Typically operating in a black-box manner, it employs an encoder-decoder framework for watermark embedding and extraction. While existing research has focused primarily on the encoders for the robustness to resist various attacks, the decoders have been largely overlooked, leading to attacks against the watermark. In this paper, we identify one such attack against the decoder, where query responses are utilized to obtain backpropagated gradients to train a watermark remover. To address this issue, we propose Decoder Gradient Shields (DGSs), a family of defense mechanisms, including DGS at the output (DGS-O), at the input (DGS-I), and in the layers (DGS-L) of the decoder, with a closed-form solution for DGS-O and provable performance for all DGS. Leveraging the joint design of reorienting and rescaling of the gradients from watermark channel gradient leaking queries, the proposed DGSs effectively prevent the watermark remover from achieving training convergence to the desired low-loss value, while preserving image quality of the decoder output. We demonstrate the effectiveness of our proposed DGSs in diverse application scenarios. Our experimental results on deraining and image generation tasks with the state-of-the-art box-free watermarking show that our DGSs achieve a defense success rate of 100% under all settings.
zh
[CV-245] Deep learning-based neurodevelopmental assessment in preterm infants
【速读】:该论文旨在解决早产儿(妊娠28至37周出生)脑部磁共振成像(MRI)中白质(White Matter, WM)与灰质(Gray Matter, GM)因信号强度相近(isointense appearance)而导致的分割精度低的问题,从而实现更准确的神经发育评估。其解决方案的关键在于提出一种名为分层密集注意力网络(Hierarchical Dense Attention Network)的新颖神经网络架构,该架构融合了三维空间-通道注意力机制与注意力引导的密集上采样策略,显著提升了在低对比度体积数据中的特征区分能力,从而有效改善了WM与GM的分割性能。
链接: https://arxiv.org/abs/2601.11944
作者: Lexin Ren,Jiamiao Lu,Weichuan Zhang,Benqing Wu,Tuo Wang,Yi Liao,Jiapan Guo,Changming Sun,Liang Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 8 figures
Abstract:Preterm infants (born between 28 and 37 weeks of gestation) face elevated risks of neurodevelopmental delays, making early identification crucial for timely intervention. While deep learning-based volumetric segmentation of brain MRI scans offers a promising avenue for assessing neonatal neurodevelopment, achieving accurate segmentation of white matter (WM) and gray matter (GM) in preterm infants remains challenging due to their comparable signal intensities (isointense appearance) on MRI during early brain development. To address this, we propose a novel segmentation neural network, named Hierarchical Dense Attention Network. Our architecture incorporates a 3D spatial-channel attention mechanism combined with an attention-guided dense upsampling strategy to enhance feature discrimination in low-contrast volumetric data. Quantitative experiments demonstrate that our method achieves superior segmentation performance compared to state-of-the-art baselines, effectively tackling the challenge of isointense tissue differentiation. Furthermore, application of our algorithm confirms that WM and GM volumes in preterm infants are significantly lower than those in term infants, providing additional imaging evidence of the neurodevelopmental delays associated with preterm birth. The code is available at: this https URL.
zh
[CV-246] Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition
【速读】:该论文旨在解决现有步态识别方法在处理静态噪声(如衣物变化)时易过拟合,且难以有效捕捉动态运动特征的问题。其解决方案的关键在于提出一种语言引导与运动感知的步态识别框架(Language guided and Motion-aware gait recognition framework),通过设计与步态相关的语言提示(gait-related language cues)来提取步态序列中的关键运动特征,从而提升模型对动态运动信息的建模能力并增强鲁棒性。
链接: https://arxiv.org/abs/2601.11931
作者: Zhengxian Wu,Chuanrui Zhang,Shenao Jiang,Hangrui Xu,Zirui Liao,Luyuan Zhang,Huaqiu Li,Peng Jiao,Haoqian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gait recognition is emerging as a promising technology and an innovative field within computer vision. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion this http URL address the above challenges, we present a Language guided and Motion-aware gait recognition framework, named this http URL particular, we utilize designed gait-related language cues to capture key motion features in gait sequences.
zh
[CV-247] SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM
【速读】:该论文旨在解决无约束结构光恢复(Unconstrained Structure-from-Motion, SfM)中图像匹配的二次复杂度问题,其核心挑战在于现有基于深度学习的图像检索方法通常依赖于批量二值标签(重叠 vs. 非重叠对),难以捕捉图像对之间几何可匹配性(geometric matchability)的细微差异。解决方案的关键在于提出SupScene框架:首先采用基于子图的训练策略,利用不同权重的真值几何重叠关系,通过软监督对比损失提供细粒度监督;其次引入DiVLAD聚合器,基于ViT最后一层的多头注意力图提取语义显著区域特征,并设计可学习门控机制自适应融合这些语义线索与视觉特征,从而生成更具判别力的全局描述符。该方法在GL3D数据集上显著优于NetVLAD,同时仅引入极少额外可训练参数。
链接: https://arxiv.org/abs/2601.11930
作者: Xulei Shi,Maoyu Wang,Yuning Peng,Guanbo Wang,Xin Wang,Qi Chen,Pengjie Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image retrieval is a critical step for alleviating the quadratic complexity of image matching in unconstrained Structure-from-Motion (SfM). However, in this context, image retrieval typically focuses more on the image pairs of geometric matchability than on those of semantic similarity, a nuance that most existing deep learning-based methods guided by batched binaries (overlapping vs. non-overlapping pairs) fail to capture. In this paper, we introduce SupScene, a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for SfM. First, to better underline co-visible regions, we employ a subgraph-based training strategy that moves beyond equally important isolated pairs, leveraging ground-truth geometric overlapping relationships with various weights to provide fine-grained supervision via a soft supervised contrastive loss. Second, we introduce DiVLAD, a DINO-inspired VLAD aggregator that leverages the inherent multi-head attention maps from the last block of ViT. And then, a learnable gating mechanism is designed to adaptively utilize these semantically salient cues with visual features, enabling a more discriminative global descriptor. Extensive experiments on the GL3D dataset demonstrate that our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters. Furthermore, we show that the proposed training strategy brings consistent gains across different aggregation techniques. Code and models are available at this https URL.
zh
[CV-248] Effects of Gabor Filters on Classification Performance of CNNs Trained on a Limited Number of Conditions
【速读】:该论文旨在解决边缘设备上运行的卷积神经网络(Convolutional Neural Networks, CNNs)在实际机器人视觉应用中面临的两个核心问题:一是模型精度不足,二是模型规模过大难以部署。针对这些问题,作者提出的关键解决方案是引入Gabor滤波器作为CNN的预处理器,模拟视觉神经系统(Visual Nervous System, VNS)的特征提取机制,从而提升小样本数据训练下的模型性能,并增强模型在不同成像条件下(如不同相机位置)的泛化能力。实验表明,该预处理方法不仅提高了CNN的准确性,还有效降低了网络参数量,使其更适合资源受限的边缘计算环境。
链接: https://arxiv.org/abs/2601.11918
作者: Akito Morita,Hirotsugu Okuno
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 4 figures, 4 tables
Abstract:In this study, we propose a technique to improve the accuracy and reduce the size of convolutional neural networks (CNNs) running on edge devices for real-world robot vision applications. CNNs running on edge devices must have a small architecture, and CNNs for robot vision applications involving on-site object recognition must be able to be trained efficiently to identify specific visual targets from data obtained under a limited variation of conditions. The visual nervous system (VNS) is a good example that meets the above requirements because it learns from few visual experiences. Therefore, we used a Gabor filter, a model of the feature extractor of the VNS, as a preprocessor for CNNs to investigate the accuracy of the CNNs trained with small amounts of data. To evaluate how well CNNs trained on image data acquired under a limited variation of conditions generalize to data acquired under other conditions, we created an image dataset consisting of images acquired from different camera positions, and investigated the accuracy of the CNNs that trained using images acquired at a certain distance. The results were compared after training on multiple CNN architectures with and without Gabor filters as preprocessing. The results showed that preprocessing with Gabor filters improves the generalization performance of CNNs and contributes to reducing the size of CNNs.
zh
[CV-249] From Spurious to Causal: Low-rank Orthogonal Subspace Intervention for Generalizable Face Forgery Detection
【速读】:该论文旨在解决人脸伪造检测中的泛化能力不足问题,其核心挑战源于伪造无关信息(即虚假相关性因素,spurious correlation factors)通过“后门路径”从表征中传递至标签,导致模型学习到偏差特征而非真实的伪造线索。以往方法多聚焦于识别具体虚假相关性并针对性处理,但因这些因素由不可观测的混杂因子引发,难以逐一识别与干预。本文提出一种在表示空间中的干预范式:将所有不可见的虚假相关性统一建模为低秩子空间,并通过正交低秩投影将其分解并移除,仅保留与伪造相关的正交补空间进行训练,从而确保分类决策基于真实伪造信号。该方法仅需0.43M可训练参数,即可在多个基准上实现最优性能,显著提升模型鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2601.11915
作者: Chi Wang,Xinjue Hu,Boyu Wang,Ziwen He,Zhangjie Fu
机构: Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The generalization problem remains a critical challenge in face forgery detection. Some researches have discovered that ``a backdoor path" in the representations from forgery-irrelevant information to labels induces biased learning, thereby hindering the generalization. In this paper, these forgery-irrelevant information are collectively termed spurious correlations factors. Previous methods predominantly focused on identifying concrete, specific spurious correlation and designing corresponding solutions to address them. However, spurious correlations arise from unobservable confounding factors, making it impractical to identify and address each one individually. To address this, we propose an intervention paradigm for representation space. Instead of tracking and blocking various instance-level spurious correlation one by one, we uniformly model them as a low-rank subspace and intervene in them. Specifically, we decompose spurious correlation features into a low-rank subspace via orthogonal low-rank projection, subsequently removing this subspace from the original representation and training its orthogonal complement to capture forgery-related features. This low-rank projection removal effectively eliminates spurious correlation factors, ensuring that classification decision is based on authentic forgery cues. With only 0.43M trainable parameters, our method achieves state-of-the-art performance across several benchmarks, demonstrating excellent robustness and generalization.
zh
[CV-250] Reliable Deep Learning for Small-Scale Classifications: Experiments on Real-World Image Datasets from Bangladesh
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在小样本图像分类任务中因架构复杂而导致过拟合的问题。其解决方案的关键在于采用一种精简的CNN架构,在五个来自孟加拉国的真实世界图像数据集(涵盖城市侵占、车辆检测、道路损坏及农作物识别等场景)上验证了该模型在分类准确率、收敛效率和计算开销方面的优越性能,同时通过定量指标与显著性分析证明其能有效捕捉判别特征并实现跨场景的鲁棒泛化能力,从而表明轻量化CNN架构更适合小类别图像分类任务。
链接: https://arxiv.org/abs/2601.11911
作者: Muhammad Ibrahim,Alfe Suny,MD Sakib Ul Islam,Md. Imran Hossain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Convolutional neural networks (CNNs) have achieved state-of-the-art performance in image recognition tasks but often involve complex architectures that may overfit on small datasets. In this study, we evaluate a compact CNN across five publicly available, real-world image datasets from Bangladesh, including urban encroachment, vehicle detection, road damage, and agricultural crops. The network demonstrates high classification accuracy, efficient convergence, and low computational overhead. Quantitative metrics and saliency analyses indicate that the model effectively captures discriminative features and generalizes robustly across diverse scenarios, highlighting the suitability of streamlined CNN architectures for small-class image classification tasks.
zh
[CV-251] A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection
【速读】:该论文旨在解决开放词汇目标检测(Open-Vocabulary Object Detection, OVOD)中缺乏通用理解能力的问题,即现有基于预训练视觉语言模型(Vision Language Model, VLM)的方法往往忽视了对任意物体认知的统一建模。为实现无需训练的高性能OVOD,作者提出了一种名为GW-VLM的无训练猜测式视觉语言模型,其核心创新在于设计了多尺度视觉语言搜索(Multi-Scale Visual Language Searching, MS-VLS)与上下文概念提示(Contextual Concept Prompt, CCP)机制:MS-VLS通过多尺度视觉-语言软对齐从类无关目标检测结果中提取语义片段,而CCP则基于MS-VLS生成的概念流引导大语言模型(Large Language Model, LLM)理解这些片段,从而实现无需微调即可泛化至未见类别的检测能力。
链接: https://arxiv.org/abs/2601.11910
作者: Guiying Zhu,Bowen Yang,Yin Zhuang,Tong Zhang,Guanqun Wang,Zhihao Che,He Chen,Lianlin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriads of large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding for any object cognition according to already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach can engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of “guess what”. Wherein, MS-VLS leverages multi-scale visual-language soft-alignment for VLM to generate snippets from the results of class-agnostic object detection, while CCP can form the concept of flow referring to MS-VLS and then make LLM understand snippets for OVOD. Finally, the extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM can achieve superior OVOD performance compared to the-state-of-the-art methods without any training step.
zh
[CV-252] Effects of the retina-inspired light intensity encoding on color discrimination performance
【速读】:该论文旨在解决视觉系统在不同光照条件下如何保持颜色感知一致性的问题,即颜色恒常性(Color Constancy, CC)的优化问题。其解决方案的关键在于改进中心/周边(Center/Surround, C/S)Retinex模型中用于编码光强度的函数,并结合基于经典对立色理论的颜色表示方法。研究发现,使用视网膜感光细胞响应模型——Naka-Rushton(N-R)函数替代原始对数函数,并配合双对立色平面(double opponent color plane)表示方式,能显著提升模型在不同照明条件下对目标颜色的区分能力,从而实现更优的颜色恒常性表现。
链接: https://arxiv.org/abs/2601.11909
作者: Io Yamada,Hirotsugu Okuno
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 14 figures, 4 tables
Abstract:Color is an important source of information for visual functions such as object recognition, but it is greatly affected by the color of illumination. The ability to perceive the color of a visual target independent of illumination color is called color constancy (CC), and is an important feature for vision systems that use color information. In this study, we investigated the effects of the light intensity encoding function on the performance of CC of the center/surround (C/S) retinex model, which is a well-known model inspired by CC of the visual nervous system. The functions used to encode light intensity are the logarithmic function used in the original C/S retinex model and the Naka-Rushton (N-R) function, which is a model of retinal photoreceptor response. Color-variable LEDs were used to illuminate visual targets with various lighting colors, and color information computed by each model was used to evaluate the degree to which the color of visual targets illuminated with different lighting colors could be discriminated. Color information was represented using the HSV color space and a color plane based on the classical opponent color theory. The results showed that the combination of the N-R function and the double opponent color plane representation provided superior discrimination performance.
zh
[CV-253] owards Airborne Object Detection: A Deep Learning Analysis
【速读】:该论文旨在解决当前空中目标威胁评估系统依赖人工监控、难以实现实时自动化的问题,从而提升空中态势感知的效率与可扩展性。其解决方案的关键在于提出一种基于EfficientNetB4的双任务模型,能够同时完成空中目标分类(airborne object classification)和威胁等级预测(threat-level prediction),并通过构建高质量的AODTA数据集缓解训练数据稀缺与不平衡问题,最终在多个基准数据集上实现了96%的分类准确率和90%的威胁预测准确率,显著优于ResNet-50基线模型。
链接: https://arxiv.org/abs/2601.11907
作者: Prosenjit Chatterjee,ANK Zaman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:
Abstract:The rapid proliferation of airborne platforms, including commercial aircraft, drones, and UAVs, has intensified the need for real-time, automated threat assessment systems. Current approaches depend heavily on manual monitoring, resulting in limited scalability and operational inefficiencies. This work introduces a dual-task model based on EfficientNetB4 capable of performing airborne object classification and threat-level prediction simultaneously. To address the scarcity of clean, balanced training data, we constructed the AODTA Dataset by aggregating and refining multiple public sources. We benchmarked our approach on both the AVD Dataset and the newly developed AODTA Dataset and further compared performance against a ResNet-50 baseline, which consistently underperformed EfficientNetB4. Our EfficientNetB4 model achieved 96% accuracy in object classification and 90% accuracy in threat-level prediction, underscoring its promise for applications in surveillance, defense, and airspace management. Although the title references detection, this study focuses specifically on classification and threat-level inference using pre-localized airborne object images provided by existing datasets.
zh
[CV-254] RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection
【速读】:该论文旨在解决遥感变化检测(remote sensing change detection)中生成式 AI(Generative AI)模型在像素级判别任务中的应用瓶颈问题,包括可控性弱、密集预测性能不佳以及暴露偏差(exposure bias)。其解决方案的关键在于提出 RemoteVAR 框架:通过跨注意力机制(cross-attention)将自回归预测条件化于多分辨率融合的双时相特征,并设计专门针对变化图预测的自回归训练策略,从而显著提升模型在标准基准上的性能表现。
链接: https://arxiv.org/abs/2601.11898
作者: Yilmaz Korkmaz,Vishal M. Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing change detection aims to localize and characterize scene changes between two time points and is central to applications such as environmental monitoring and disaster assessment. Meanwhile, visual autoregressive models (VARs) have recently shown impressive image generation capability, but their adoption for pixel-level discriminative tasks remains limited due to weak controllability, suboptimal dense prediction performance and exposure bias. We introduce RemoteVAR, a new VAR-based change detection framework that addresses these limitations by conditioning autoregressive prediction on multi-resolution fused bi-temporal features via cross-attention, and by employing an autoregressive training strategy designed specifically for change map prediction. Extensive experiments on standard change detection benchmarks show that RemoteVAR delivers consistent and significant improvements over strong diffusion-based and transformer-based baselines, establishing a competitive autoregressive alternative for remote sensing change detection. Code will be available \hrefthis https URL\underlinehere.
zh
[CV-255] Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening
【速读】:该论文旨在解决急性卒中(stroke)早期识别难题,特别是在院前环境中快速、非侵入式地进行卒中筛查。其核心挑战在于如何利用有限的临床数据实现高精度的自动诊断,同时确保模型在真实场景中的鲁棒性和可部署性。解决方案的关键在于提出了一种多模态深度学习框架,融合面部表情、语音信号和上半身运动三种模态信息:通过Transformer捕捉面部动态的时间依赖性,Audio Spectrogram Transformer处理语音频谱特征,MLP-Mixer建模上半身姿态的时空模式,并采用基于注意力机制的融合策略以增强跨模态交互能力。实验表明,该方法在自建数据集上达到95.83%准确率和96.00% F1-score,显著优于单一模态基线,验证了多模态学习与迁移学习在卒中早期筛查中的有效性。
链接: https://arxiv.org/abs/2601.11896
作者: Ngoc-Khai Hoang,Thi-Nhu-Mai Nguyen,Huy-Hieu Pham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Early identification of stroke symptoms is essential for enabling timely intervention and improving patient outcomes, particularly in prehospital settings. This study presents a fast, non-invasive multimodal deep learning framework for automatic binary stroke screening based on data collected during the F.A.S.T. assessment. The proposed approach integrates complementary information from facial expressions, speech signals, and upper-body movements to enhance diagnostic robustness. Facial dynamics are represented using landmark based features and modeled with a Transformer architecture to capture temporal dependencies. Speech signals are converted into mel spectrograms and processed using an Audio Spectrogram Transformer, while upper-body pose sequences are analyzed with an MLP-Mixer network to model spatiotemporal motion patterns. The extracted modality specific representations are combined through an attention-based fusion mechanism to effectively learn cross modal interactions. Experiments conducted on a self-collected dataset of 222 videos from 37 subjects demonstrate that the proposed multimodal model consistently outperforms unimodal baselines, achieving 95.83% accuracy and a 96.00% F1-score. The model attains a strong balance between sensitivity and specificity and successfully detects all stroke cases in the test set. These results highlight the potential of multimodal learning and transfer learning for early stroke screening, while emphasizing the need for larger, clinically representative datasets to support reliable real-world deployment.
zh
[CV-256] AI for Green Spaces: Leverag ing Autonomous Navigation and Computer Vision for Park Litter Removal
【速读】:该论文旨在解决美国公园草地上垃圾堆积问题(约500亿件),其核心挑战在于实现机器人在非结构化草地环境中自主导航、识别并拾取垃圾。解决方案的关键在于多模块集成:采用生成全覆盖路径的最小生成树覆盖(Spanning Tree Coverage, STC)算法确保高效巡检;利用实时动态定位(Real-Time Kinematic, RTK)GPS实现厘米级精度的实时定位与导航;基于ResNet50卷积神经网络(Convolutional Neural Network, CNN)实现94.52%准确率的垃圾识别;并通过针对性设计的拾取机构完成物理抓取,最终整体成功率达80%,验证了在草地场景下自主垃圾清理机器人的可行性。
链接: https://arxiv.org/abs/2601.11876
作者: Christopher Kao,Akhil Pathapati,James Davis
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: Published in IEEE/SICE SII 2025
Abstract:There are 50 billion pieces of litter in the U.S. alone. Grass fields contribute to this problem because picnickers tend to leave trash on the field. We propose building a robot that can autonomously navigate, identify, and pick up trash in parks. To autonomously navigate the park, we used a Spanning Tree Coverage (STC) algorithm to generate a coverage path the robot could follow. To navigate this path, we successfully used Real-Time Kinematic (RTK) GPS, which provides a centimeter-level reading every second. For computer vision, we utilized the ResNet50 Convolutional Neural Network (CNN), which detects trash with 94.52% accuracy. For trash pickup, we tested multiple design concepts. We select a new pickup mechanism that specifically targets the trash we encounter on the field. Our solution achieved an overall success rate of 80%, demonstrating that autonomous trash pickup robots on grass fields are a viable solution.
zh
[CV-257] MixFlow: Mixture-Conditioned Flow Matching for Out-of-Distribution Generalization
【速读】:该论文旨在解决条件生成模型在分布偏移(distribution shift)下难以实现鲁棒泛化的问题,特别是现有基于流的条件生成方法在训练条件之外的场景中表现不佳。其解决方案的关键在于提出MixFlow框架,通过最短路径流匹配(shortest-path flow matching)联合学习一个描述符依赖的基分布(base distribution)和一个描述符依赖的流场(flow field),并将基分布建模为可学习的、描述符相关的混合分布(mixture),从而实现对未见条件的平滑插值与外推,显著提升模型在分布外数据上的泛化能力。
链接: https://arxiv.org/abs/2601.11827
作者: Andrea Rubbi,Amir Akbarnejad,Mohammad Vali Sanian,Aryan Yazdan Parast,Hesam Asadollahzadeh,Arian Amani,Naveed Akhtar,Sarah Cooper,Andrew Bassett,Pietro Liò,Lassi Paavolainen,Sattar Vakili,Mo Lotfollahi
机构: Wellcome Sanger Institute (Wellcome Sanger研究所); University of Helsinki (赫尔辛基大学); The University of Melbourne (墨尔本大学); University of Cambridge (剑桥大学); MediaTek Research (联发科技研究部)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Achieving robust generalization under distribution shift remains a central challenge in conditional generative modeling, as existing conditional flow-based methods often struggle to extrapolate beyond the training conditions. We introduce MixFlow, a conditional flow-matching framework for descriptor-controlled generation that directly targets this limitation by jointly learning a descriptor-conditioned base distribution and a descriptor-conditioned flow field via shortest-path flow matching. By modeling the base distribution as a learnable, descriptor-dependent mixture, MixFlow enables smooth interpolation and extrapolation to unseen conditions, leading to substantially improved out-of-distribution generalization. We provide analytical insights into the behavior of the proposed framework and empirically demonstrate its effectiveness across multiple domains, including prediction of responses to unseen perturbations in single-cell transcriptomic data and high-content microscopy-based drug screening tasks. Across these diverse settings, MixFlow consistently outperforms standard conditional flow-matching baselines. Overall, MixFlow offers a simple yet powerful approach for achieving robust, generalizable, and controllable generative modeling across heterogeneous domains.
zh
[CV-258] Physics-Constrained Denoising Autoencoders for Data-Scarce Wildfire UAV Sensing
【速读】:该论文旨在解决无人机(UAV)搭载低成本传感器在野火监测中因基线漂移、交叉敏感性和响应滞后导致的浓度估计失真问题,同时应对深度学习方法通常依赖大规模数据集而难以在有限飞行实验中实现有效训练的挑战。解决方案的关键在于提出一种物理信息嵌入的去噪自编码器(PC²DAE),其核心创新是将物理约束直接集成到网络架构设计中:通过软指数(softplus)激活函数强制非负浓度输出,并引入物理合理的时序平滑机制确保结果的物理解释性;此外,采用分层解码头分别处理黑碳、气体和CO₂传感器数据,并提供轻量级(21k参数)与宽型(204k参数)两个版本以适应边缘部署与离线处理需求。该方法在仅7,894个同步采样点(约2.2小时飞行数据)上即实现显著性能提升,且无物理违规现象,验证了强归纳偏置下小样本场景中的有效性。
链接: https://arxiv.org/abs/2601.11794
作者: Abdelrahman Ramadan,Zahra Dorbeigi Namaghi,Emily Taylor,Lucas Edwards,Xan Giuliani,David S. McLagan,Sidney Givigi,Melissa Greeff
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Wildfire monitoring requires high-resolution atmospheric measurements, yet low-cost sensors on Unmanned Aerial Vehicles (UAVs) exhibit baseline drift, cross-sensitivity, and response lag that corrupt concentration estimates. Traditional deep learning denoising approaches demand large datasets impractical to obtain from limited UAV flight campaigns. We present PC ^2 DAE, a physics-informed denoising autoencoder that addresses data scarcity by embedding physical constraints directly into the network architecture. Non-negative concentration estimates are enforced via softplus activations and physically plausible temporal smoothing, ensuring outputs are physically admissible by construction rather than relying on loss function penalties. The architecture employs hierarchical decoder heads for Black Carbon, Gas, and CO _2 sensor families, with two variants: PC ^2 DAE-Lean (21k parameters) for edge deployment and PC ^2 DAE-Wide (204k parameters) for offline processing. We evaluate on 7,894 synchronized 1 Hz samples collected from UAV flights during prescribed burns in Saskatchewan, Canada (approximately 2.2 hours of flight data), two orders of magnitude below typical deep learning requirements. PC ^2 DAE-Lean achieves 67.3% smoothness improvement and 90.7% high-frequency noise reduction with zero physics violations. Five baselines (LSTM-AE, U-Net, Transformer, CBDAE, DeSpaWN) produce 15–23% negative outputs. The lean variant outperforms wide (+5.6% smoothness), suggesting reduced capacity with strong inductive bias prevents overfitting in data-scarce regimes. Training completes in under 65 seconds on consumer hardware.
zh
[CV-259] Risk-Aware Human-in-the-Loop Framework with Adaptive Intrusion Response for Autonomous Vehicles ICRA2026
【速读】:该论文旨在解决自动驾驶车辆在遭遇罕见长尾场景(long-tailed scenarios)或网络物理入侵(cyber-physical intrusions)时,如何保持安全性与有效性的问题。其核心挑战在于:传统强化学习(RL)方法难以应对低频但高风险事件,且缺乏对异常信号的实时感知与响应机制。解决方案的关键在于提出一种风险感知的人在回路框架(Risk-aware Human-in-the-Loop, RAIL),该框架通过融合三种运行时信号(曲率执行完整性、碰撞时间接近度、观测偏移一致性)生成入侵风险评分(Intrusion Risk Score, IRS),并基于IRS动态启用特定防护机制(shield)——当风险阈值超过设定值时,系统采用可学习的权限分配策略将控制权部分交由cue-specific shield接管,同时保留人类干预通道;在低风险情况下则维持原始策略执行。RAIL进一步结合软演员评论家(Soft Actor-Critic, SAC)与风险优先回放和双重奖励机制,使接管事件与近失事件成为学习信号,从而提升模型鲁棒性与安全性能。实验表明,RAIL在MetaDrive和CARLA平台上均显著优于现有RL、安全RL及人机协同基线方法,在攻击场景下亦大幅降低解耦率与攻击成功率。
链接: https://arxiv.org/abs/2601.11781
作者: Dawood Wasif,Terrence J. Moore,Seunghyun Yoon,Hyuk Lim,Dan Dongseong Kim,Frederica F. Nelson,Jin-Hee Cho
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICRA 2026 (under review)
Abstract:Autonomous vehicles must remain safe and effective when encountering rare long-tailed scenarios or cyber-physical intrusions during driving. We present RAIL, a risk-aware human-in-the-loop framework that turns heterogeneous runtime signals into calibrated control adaptations and focused learning. RAIL fuses three cues (curvature actuation integrity, time-to-collision proximity, and observation-shift consistency) into an Intrusion Risk Score (IRS) via a weighted Noisy-OR. When IRS exceeds a threshold, actions are blended with a cue-specific shield using a learned authority, while human override remains available; when risk is low, the nominal policy executes. A contextual bandit arbitrates among shields based on the cue vector, improving mitigation choices online. RAIL couples Soft Actor-Critic (SAC) with risk-prioritized replay and dual rewards so that takeovers and near misses steer learning while nominal behavior remains covered. On MetaDrive, RAIL achieves a Test Return (TR) of 360.65, a Test Success Rate (TSR) of 0.85, a Test Safety Violation (TSV) of 0.75, and a Disturbance Rate (DR) of 0.0027, while logging only 29.07 training safety violations, outperforming RL, safe RL, offline/imitation learning, and prior HITL baselines. Under Controller Area Network (CAN) injection and LiDAR spoofing attacks, it improves Success Rate (SR) to 0.68 and 0.80, lowers the Disengagement Rate under Attack (DRA) to 0.37 and 0.03, and reduces the Attack Success Rate (ASR) to 0.34 and 0.11. In CARLA, RAIL attains a TR of 1609.70 and TSR of 0.41 with only 8000 steps.
zh
[CV-260] Cross-Domain Object Detection Using Unsupervised Image Translation
【速读】:该论文旨在解决无监督域适应(Unsupervised Domain Adaptation, UDA)在目标检测任务中的性能瓶颈问题,即如何将源域训练的检测器有效迁移到未见的目标域中,同时缩小与使用目标域标注数据训练所得上界性能之间的差距。其解决方案的关键在于提出一种基于图像翻译生成人工目标域数据集的方法:利用仅含源域标注数据和目标域未标注数据,通过两种无监督图像翻译模型(CycleGAN 和 AdaIN-based 模型)合成目标域风格的图像,并以此构建人工训练集来训练目标域检测器。该方法相较现有中间特征对齐方法更为简洁且可解释性强,实验表明在自动驾驶真实场景下显著优于当前最优方法,进一步逼近上界性能。
链接: https://arxiv.org/abs/2601.11779
作者: Vinicius F. Arruda,Rodrigo F. Berriel,Thiago M. Paixão,Claudine Badue,Alberto F. De Souza,Nicu Sebe,Thiago Oliveira-Santos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unsupervised domain adaptation for object detection addresses the adaption of detectors trained in a source domain to work accurately in an unseen target domain. Recently, methods approaching the alignment of the intermediate features proven to be promising, achieving state-of-the-art results. However, these methods are laborious to implement and hard to interpret. Although promising, there is still room for improvements to close the performance gap toward the upper-bound (when training with the target data). In this work, we propose a method to generate an artificial dataset in the target domain to train an object detector. We employed two unsupervised image translators (CycleGAN and an AdaIN-based model) using only annotated data from the source domain and non-annotated data from the target domain. Our key contributions are the proposal of a less complex yet more effective method that also has an improved interpretability. Results on real-world scenarios for autonomous driving show significant improvements, outperforming state-of-the-art methods in most cases, further closing the gap toward the upper-bound.
zh
[CV-261] studentSplat: Your Student Model Learns Single-view 3D Gaussian Splatting
【速读】:该论文旨在解决单视角(single-view)3D场景重建中因视图信息不足而导致的尺度模糊性(scale ambiguity)和场景上下文缺失问题,这是单视角重建任务的核心挑战。其解决方案的关键在于提出一种教师-学生架构(teacher-student architecture)与外推网络(extrapolation network)相结合的方法:首先,利用多视角教师模型提供几何监督信号,指导单视角学生模型学习正确的尺度和几何结构,从而缓解尺度模糊并提升重建几何合理性;其次,引入外推网络以补全输入图像中缺失的场景内容,实现高质量的场景外推,从而增强重建完整性与视觉一致性。该方法在单视角新视角重建质量上达到当前最优水平,并展现出在自监督深度估计等通用单视角3D理解任务中的潜力。
链接: https://arxiv.org/abs/2601.11772
作者: Yimu Pan,Hongda Mao,Qingshuang Chen,Yelin Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advance in feed-forward 3D Gaussian splatting has enable remarkable multi-view 3D scene reconstruction or single-view 3D object reconstruction but single-view 3D scene reconstruction remain under-explored due to inherited ambiguity in single-view. We present \textbfstudentSplat, a single-view 3D Gaussian splatting method for scene reconstruction. To overcome the scale ambiguity and extrapolation problems inherent in novel-view supervision from a single input, we introduce two techniques: 1) a teacher-student architecture where a multi-view teacher model provides geometric supervision to the single-view student during training, addressing scale ambiguity and encourage geometric validity; and 2) an extrapolation network that completes missing scene context, enabling high-quality extrapolation. Extensive experiments show studentSplat achieves state-of-the-art single-view novel-view reconstruction quality and comparable performance to multi-view methods at the scene level. Furthermore, studentSplat demonstrates competitive performance as a self-supervised single-view depth estimation method, highlighting its potential for general single-view 3D understanding tasks.
zh
[CV-262] From Pixels to Purchase: Building and Evaluating a Taxonomy-Decoupled Visual Search Engine for Home Goods E-commerce
【速读】:该论文旨在解决电商视觉搜索中因依赖分类标签和Catalog数据而导致的噪声干扰问题,尤其在风格驱动场景下,用户意图主观且开放,传统耦合目标检测与分类的方法难以保证鲁棒性和可扩展性。其解决方案的关键在于提出一种解耦分类任务的架构:通过无分类的区域提议(classification-free region proposals)和统一嵌入(unified embeddings)实现相似度检索,从而提升系统灵活性与泛化能力;同时引入LLM-as-a-Judge框架,以零样本方式评估查询-结果对的视觉相似性和类别相关性,摆脱对人工标注或易噪Catalog数据的依赖,显著提升离线评估指标与真实业务表现的相关性。
链接: https://arxiv.org/abs/2601.11769
作者: Cheng Lyu,Jingyue Zhang,Ryan Maunu,Mengwei Li,Vinny DeGenova,Yuanli Pei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual search is critical for e-commerce, especially in style-driven domains where user intent is subjective and open-ended. Existing industrial systems typically couple object detection with taxonomy-based classification and rely on catalog data for evaluation, which is prone to noise that limits robustness and scalability. We propose a taxonomy-decoupled architecture that uses classification-free region proposals and unified embeddings for similarity retrieval, enabling a more flexible and generalizable visual search. To overcome the evaluation bottleneck, we propose an LLM-as-a-Judge framework that assesses nuanced visual similarity and category relevance for query-result pairs in a zero-shot manner, removing dependence on human annotations or noise-prone catalog data. Deployed at scale on a global home goods platform, our system improves retrieval quality and yields a measurable uplift in customer engagement, while our offline evaluation metrics strongly correlate with real-world outcomes.
zh
[CV-263] SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models
【速读】:该论文旨在解决视觉基础模型(Visual Foundation Models, VFMs)在空间推理能力上的局限性问题,即尽管这些模型在图像语义理解方面表现优异,但在处理涉及相对位置关系等空间认知任务时表现不一致,从而限制了其在具身系统中的应用。为深入探究这一问题,作者提出了一种名为空间关系识别任务(Spatial Relation Recognition Task, SpaRRTa)的新基准,其关键在于通过生成大量具有多样化场景和可控制对象布局的逼真图像,并提供自由获取的空间标注,系统评估VFMs对物体相对位置的识别能力。该方法不同于传统依赖精确度量预测(如表面法向量估计)的3D目标,而是聚焦于人类空间理解的核心能力——相对空间关系识别,从而揭示现代VFMs在空间意识方面的机制差异与瓶颈。
链接: https://arxiv.org/abs/2601.11729
作者: Turhan Can Kargin,Wojciech Jasiński,Adam Pardyl,Bartosz Zieliński,Marcin Przewięźlikowski
机构: University of Jagellonian (雅盖隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page is available at this https URL
Abstract:Visual Foundation Models (VFMs), such as DINO and CLIP, excel in semantic understanding of images but exhibit limited spatial reasoning capabilities, which limits their applicability to embodied systems. As a result, recent work incorporates some 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across other spatial tasks, raising the question of whether these models truly have spatial awareness or overfit to specific 3D objectives. To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates the ability of VFMs to identify relative positions of objects in the image. Unlike traditional 3D objectives that focus on precise metric prediction (e.g., surface normal estimation), SpaRRTa probes a fundamental capability underpinning more advanced forms of human-like spatial understanding. SpaRRTa generates an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with freely accessible spatial annotations. Evaluating a range of state-of-the-art VFMs, we reveal significant disparities between their spatial reasoning abilities. Through our analysis, we provide insights into the mechanisms that support or hinder spatial awareness in modern VFMs. We hope that SpaRRTa will serve as a useful tool for guiding the development of future spatially aware visual models.
zh
[CV-264] SemAlign: Language Guided Semi-supervised Domain Generalization
【速读】:该论文旨在解决半监督域泛化(Semi-supervised Domain Generalization, SSDG)中模型在仅有少量标注数据的情况下难以有效泛化到未见目标域的问题。现有方法虽强调伪标签(pseudo-labeling, PL)准确性与防止过拟合,但忽略了训练过程中对数据的充分利用。其解决方案的关键在于:通过将模型中间特征与视觉语言模型(Vision Language Model, VLM)语义丰富且具有泛化能力的特征空间对齐,从而增强域不变性;同时结合有效的图像级增强和输出层正则化策略,提升数据利用率并抑制过拟合。实验表明,该方法在四个基准上均达到当前最优性能(State-of-the-Art, SOTA)。
链接: https://arxiv.org/abs/2601.11724
作者: Muditha Fernando,Kajhanan Kailainathan,Krishnakanth Nagaratnam,Isuranga Udaravi Bandara Senavirathne,Ranga Rodrigo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures
Abstract:Semi-supervised Domain Generalization (SSDG) addresses the challenge of generalizing to unseen target domains with limited labeled data. Existing SSDG methods highlight the importance of achieving high pseudo-labeling (PL) accuracy and preventing model overfitting as the main challenges in SSDG. In this light, we show that the SSDG literature’s excessive focus on PL accuracy, without consideration for maximum data utilization during training, limits potential performance improvements. We propose a novel approach to the SSDG problem by aligning the intermediate features of our model with the semantically rich and generalized feature space of a Vision Language Model (VLM) in a way that promotes domain-invariance. The above approach is enhanced with effective image-level augmentation and output-level regularization strategies to improve data utilization and minimize overfitting. Extensive experimentation across four benchmarks against existing SSDG baselines suggests that our method achieves SOTA results both qualitatively and quantitatively. The code will be made publicly available.
zh
[CV-265] ng Human and Machine Handwriting Apart
【速读】:该论文旨在解决如何通过手写轨迹(handwriting movements)作为行为生物特征(behavioral biometrics),验证设备或应用操作者是否为真实人类,而非人工智能生成的输入,这可被视作一种反向图灵测试(reverse Turing test)。其核心解决方案是构建一个浅层循环神经网络(shallow recurrent neural network),直接以未提取特征的轨迹数据(nonfeaturized trajectory data)作为输入,在十组公开的手写符号数据集(涵盖孤立字符、数字、手势、指向轨迹和签名)及七种不同合成器(包括动力学理论模型、生成对抗网络、Transformer 和扩散模型等)生成的伪造数据上进行训练,实现了平均98.3%的ROC曲线下面积(AUC)和1.4%的等错误率(equal error rate),且在仅用10%训练数据的少样本场景下仍保持优异性能,并在跨域设置中表现出强鲁棒性。
链接: https://arxiv.org/abs/2601.11700
作者: Luis A. Leiva,Moises Diaz,Nuwan T. Attygalle,Miguel A. Ferrer,Rejean Plamondon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Handwriting movements can be leveraged as a unique form of behavioral biometrics, to verify whether a real user is operating a device or application. This task can be framed as a reverse Turing test in which a computer has to detect if an input instance has been generated by a human or artificially. To tackle this task, we study ten public datasets of handwritten symbols (isolated characters, digits, gestures, pointing traces, and signatures) that are artificially reproduced using seven different synthesizers, including, among others, the Kinematic Theory (Sigma h model), generative adversarial networks, Transformers, and Diffusion models. We train a shallow recurrent neural network that achieves excellent performance (98.3 percent Area Under the ROC Curve (AUC) score and 1.4 percent equal error rate on average across all synthesizers and datasets) using nonfeaturized trajectory data as input. In few-shot settings, we show that our classifier achieves such an excellent performance when trained on just 10 percent of the data, as evaluated on the remaining 90% of the data as a test set. We further challenge our classifier in out-of-domain settings, and observe very competitive results as well. Our work has implications for computerized systems that need to verify human presence, and adds an additional layer of security to keep attackers at bay.
zh
[CV-266] Conformal Point and the Calibrated Conic
【速读】:该论文旨在解决图像几何可视化与计算中的关键问题,特别是如何准确描述和利用共形点(conformal point)与校准圆锥曲线(calibrating conic)之间的关系。其解决方案的关键在于利用二者间的几何关联,提供直观且高效的计算方法,用于推导图像中的角度和方向等几何属性。
链接: https://arxiv.org/abs/2601.11679
作者: Richard Hartley
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This gives some information about the conformal point and the calibrating conic, and their relationship one to the other. These concepts are useful for visualizing image geometry, and lead to intuitive ways to compute geometry, such as angles and directions in an image.
zh
[CV-267] Generating metamers of human scene understanding
【速读】:该论文旨在解决如何生成与人类场景表征(latent human scene representations)在感知上对齐的图像问题,从而揭示人类视觉系统如何整合周边低分辨率“概貌”信息与注视点高分辨率细节来理解视觉场景。解决方案的关键在于提出MetamerGen——一种基于潜在扩散模型(latent diffusion model)的双流表示架构,该架构融合了来自DINOv2的特征:一方面利用固定区域的高分辨率细节特征,另一方面结合周边区域的降级特征以捕捉场景上下文,从而实现从“中心-周边”(foveated)输入到图像的新型图像到图像合成。通过行为实验验证生成图像与原始图像在感知上的等价性,研究发现当生成过程基于个体自身的注视区域时,高层语义一致性最能预测生成图像作为人类场景表征的等效性(metamerism)。
链接: https://arxiv.org/abs/2601.11675
作者: Ritik Raina,Abe Leite,Alexandros Graikos,Seoyoung Ahn,Dimitris Samaras,Gregory J. Zelinsky
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Human vision combines low-resolution “gist” information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. “foveated”) inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen generated images to latent human scene representations, we conducted a same-different behavioral experiment where participants were asked for a “same” or “different” response between the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is a powerful tool for understanding scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers’ own fixated regions.
zh
[CV-268] IPEC: Test-Time Incremental Prototype Enhancement Classifier for Few-Shot Learning
【速读】:该论文旨在解决度量学习类少样本分类方法在测试阶段因批次独立性假设(batch-independence assumption)而无法利用先前批次中积累的知识,从而导致模型性能受限的问题。其解决方案的关键在于提出一种新颖的测试时优化方法——增量原型增强分类器(Incremental Prototype Enhancement Classifier, IPEC),该方法通过动态维护一个辅助集来逐步优化原型估计:IPEC基于高置信度筛选查询样本并加入辅助集,并设计了一种鲁棒的双过滤机制(结合全局预测置信度与局部判别能力)确保样本质量;随后将辅助集与支持集聚合用于后续任务,构建更稳定、更具代表性的原型,显著降低对初始支持集的依赖。该方法从贝叶斯视角出发,将支持集视为先验、辅助集视为数据驱动的后验,进而设计了“预热-测试”两阶段推理协议以提升实用性与效果。
链接: https://arxiv.org/abs/2601.11669
作者: Wenwen Liao,Hang Ruan,Jianbo Yu,Xiaofeng Yang,Qingchao Jiang,Xuefeng Yan
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Metric-based few-shot approaches have gained significant popularity due to their relatively straightforward implementation, high interpret ability, and computational efficiency. However, stemming from the batch-independence assumption during testing, which prevents the model from leveraging valuable knowledge accumulated from previous batches. To address these challenges, we propose a novel test-time method called Incremental Prototype Enhancement Classifier (IPEC), a test-time method that optimizes prototype estimation by leveraging information from previous query samples. IPEC maintains a dynamic auxiliary set by selectively incorporating query samples that are classified with high confidence. To ensure sample quality, we design a robust dual-filtering mechanism that assesses each query sample based on both global prediction confidence and local discriminative ability. By aggregating this auxiliary set with the support set in subsequent tasks, IPEC builds progressively more stable and representative prototypes, effectively reducing its reliance on the initial support set. We ground this approach in a Bayesian interpretation, conceptualizing the support set as a prior and the auxiliary set as a data-driven posterior, which in turn motivates the design of a practical “warm-up and test” two-stage inference protocol. Extensive empirical results validate the superior performance of our proposed method across multiple few-shot classification tasks.
zh
[CV-269] MATEX: Multi-scale Attention and Text-guided Explainability of Medical Vision-Language Models
【速读】:该论文旨在解决医学视觉-语言模型中解释性不足的问题,特别是现有方法在空间精度、解剖学合理性以及注意力粒度方面的局限性。解决方案的关键在于提出MATEX(Multi-scale Attention and Text-guided Explainability)框架,其核心创新包括多层注意力传播(multi-layer attention rollout)、文本引导的空间先验(text-guided spatial priors)以及层一致性分析(layer consistency analysis),从而生成精确、稳定且具有临床意义的梯度归因图(gradient attribution maps),显著提升了模型解释的可信赖性和透明度。
链接: https://arxiv.org/abs/2601.11666
作者: Muhammad Imran,Chi Lee,Yugyung Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 1 table
Abstract:We introduce MATEX (Multi-scale Attention and Text-guided Explainability), a novel framework that advances interpretability in medical vision-language models by incorporating anatomically informed spatial reasoning. MATEX synergistically combines multi-layer attention rollout, text-guided spatial priors, and layer consistency analysis to produce precise, stable, and clinically meaningful gradient attribution maps. By addressing key limitations of prior methods, such as spatial imprecision, lack of anatomical grounding, and limited attention granularity, MATEX enables more faithful and interpretable model explanations. Evaluated on the MS-CXR dataset, MATEX outperforms the state-of-the-art M2IB approach in both spatial precision and alignment with expert-annotated findings. These results highlight MATEX’s potential to enhance trust and transparency in radiological AI applications.
zh
[CV-270] UAV-Based Infrastructure Inspections: A Literature Review and Proposed Framework for AECFM
【速读】:该论文旨在解决传统基础设施巡检中效率低、成本高及数据获取不全面的问题,尤其是在结构健康监测(Structural Health Monitoring, SHM)、灾害响应和文化遗产保护等场景下对高精度、实时性与多模态信息融合的需求。其解决方案的关键在于提出一个集成RGB影像、激光雷达(LiDAR)与热成像的多模态数据融合框架,并结合基于Transformer架构的先进机器学习模型,实现对结构缺陷、热异常和几何偏差的精准识别;同时通过动态路径规划优化飞行策略,提升复杂环境下的检测可靠性与可操作性,从而构建一套从数据采集到决策支持的全流程自动化巡检体系。
链接: https://arxiv.org/abs/2601.11665
作者: Amir Farzin Nikkhah,Dong Chen,Bradford Campbell,Somayeh Asadi,Arsalan Heydarian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted for publication at the International Conference on Construction Engineering and Management (I3CE 2025)
Abstract:Unmanned Aerial Vehicles (UAVs) are transforming infrastructure inspections in the Architecture, Engineering, Construction, and Facility Management (AEC+FM) domain. By synthesizing insights from over 150 studies, this review paper highlights UAV-based methodologies for data acquisition, photogrammetric modeling, defect detection, and decision-making support. Key innovations include path optimization, thermal integration, and advanced machine learning (ML) models such as YOLO and Faster R-CNN for anomaly detection. UAVs have demonstrated value in structural health monitoring (SHM), disaster response, urban infrastructure management, energy efficiency evaluations, and cultural heritage preservation. Despite these advancements, challenges in real-time processing, multimodal data fusion, and generalizability remain. A proposed workflow framework, informed by literature and a case study, integrates RGB imagery, LiDAR, and thermal sensing with transformer-based architectures to improve accuracy and reliability in detecting structural defects, thermal anomalies, and geometric inconsistencies. The proposed framework ensures precise and actionable insights by fusing multimodal data and dynamically adapting path planning for complex environments, presented as a comprehensive step-by-step guide to address these challenges effectively. This paper concludes with future research directions emphasizing lightweight AI models, adaptive flight planning, synthetic datasets, and richer modality fusion to streamline modern infrastructure inspections.
zh
[CV-271] LTV-YOLO: A Lightweight Thermal Object Detector for Young Pedestrians in Adverse Conditions
【速读】:该论文旨在解决在低光照和恶劣天气条件下,对弱势道路使用者(Vulnerable Road Users, VRUs),尤其是儿童和青少年等小尺度、易被遮挡目标的检测难题。解决方案的关键在于提出一种专为热成像设计的轻量化目标检测模型LTV-YOLO(Lightweight Thermal Vision YOLO),其核心创新在于:仅使用长波红外(Long-Wave Infrared, LWIR)热成像数据,结合深度可分离卷积与特征金字塔网络(Feature Pyramid Network, FPN)结构,在保持模型紧凑性的同时显著提升对短距离或部分遮挡VRUs的检测精度与实时性能,从而实现边缘设备部署下的高可靠性行人识别。
链接: https://arxiv.org/abs/2601.11662
作者: Abdullah Jirjees,Ryan Myers,Muhammad Haris Ikram,Mohamed H. Zaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting vulnerable road users (VRUs), particularly children and adolescents, in low light and adverse weather conditions remains a critical challenge in computer vision, surveillance, and autonomous vehicle systems. This paper presents a purpose-built lightweight object detection model designed to identify young pedestrians in various environmental scenarios. To address these challenges, our approach leverages thermal imaging from long-wave infrared (LWIR) cameras, which enhances detection reliability in conditions where traditional RGB cameras operating in the visible spectrum fail. Based on the YOLO11 architecture and customized for thermal detection, our model, termed LTV-YOLO (Lightweight Thermal Vision YOLO), is optimized for computational efficiency, accuracy and real-time performance on edge devices. By integrating separable convolutions in depth and a feature pyramid network (FPN), LTV-YOLO achieves strong performance in detecting small-scale, partially occluded, and thermally distinct VRUs while maintaining a compact architecture. This work contributes a practical and scalable solution to improve pedestrian safety in intelligent transportation systems, particularly in school zones, autonomous navigation, and smart city infrastructure. Unlike prior thermal detectors, our contribution is task-specific: a thermally only edge-capable design designed for young and small VRUs (children and distant adults). Although FPN and depthwise separable convolutions are standard components, their integration into a thermal-only pipeline optimized for short/occluded VRUs under adverse conditions is, to the best of our knowledge, novel.
zh
[CV-272] Zeros can be Informative: Masked Binary U-Net for Image Segmentation on Tensor Cores
【速读】:该论文旨在解决在资源受限的边缘设备上实现高分辨率图像实时分割的难题,即如何在保证精度的同时满足计算、内存和功耗的严格约束。当前主流的U-Net模型虽具较好效率,但在高分辨率输入下仍难以达到实时性能;而极端量化(如二值网络)虽硬件友好,却面临准确率严重下降及缺乏通用GPU端到端实现的问题。解决方案的关键在于提出Masked Binary U-Net (MBU-Net),其核心创新包括:(1) 通过成本感知的掩码策略优先对收益最高的层进行二值化,从而在精度与近二值效率间取得平衡;(2) 设计基于减法位编码的GPU执行框架,将掩码二值权重映射至Tensor Cores,利用原生二进制BMMA指令实现高效计算,显著提升吞吐量并降低能耗。实验表明,MBU-Net在3个分割基准上仅损失平均3%精度,但相较16位浮点U-Net实现2.04倍加速和3.54倍能效提升。
链接: https://arxiv.org/abs/2601.11660
作者: Chunshu Wu,Ruibing Song,Sushant Kondguli,Tong Geng,Ang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Real-time image segmentation is a key enabler for AR/VR, robotics, drones, and autonomous systems, where tight accuracy, latency, and energy budgets must be met on resource-constrained edge devices. While U-Net offers a favorable balance of accuracy and efficiency compared to large transformer-based models, achieving real-time performance on high-resolution input remains challenging due to compute, memory, and power limits. Extreme quantization, particularly binary networks, is appealing for its hardware-friendly operations. However, two obstacles limit practicality: (1) severe accuracy degradation, and (2) a lack of end-to-end implementations that deliver efficiency on general-purpose GPUs. We make two empirical observations that guide our design. (1) An explicit zero state is essential: training with zero masking to binary U-Net weights yields noticeable sparsity. (2) Quantization sensitivity is uniform across layers. Motivated by these findings, we introduce Masked Binary U-Net (MBU-Net), obtained through a cost-aware masking strategy that prioritizes masking where it yields the highest accuracy-per-cost, reconciling accuracy with near-binary efficiency. To realize these gains in practice, we develop a GPU execution framework that maps MBU-Net to Tensor Cores via a subtractive bit-encoding scheme, efficiently implementing masked binary weights with binary activations. This design leverages native binary Tensor Core BMMA instructions, enabling high throughput and energy savings on widely available GPUs. Across 3 segmentation benchmarks, MBU-Net attains near full-precision accuracy (3% average drop) while delivering 2.04x speedup and 3.54x energy reductions over a 16-bit floating point U-Net. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2601.11660 [cs.CV] (or arXiv:2601.11660v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.11660 Focus to learn more arXiv-issued DOI via DataCite
zh
[CV-273] An Efficient and Explainable KAN Framework forWireless Radiation Field Prediction
【速读】:该论文旨在解决无线信道建模中因环境变化和信号不确定性导致的准确性难题,现有神经网络方法通常独立处理射线上的每个体素(voxel),忽略了全局上下文和环境因素。其解决方案的关键在于提出一种新的建模方式,即学习完整射线的综合表示而非单个点,从而捕获更详细的环境特征;同时,通过将Kolmogorov-Arnold网络(KAN)架构与Transformer模块相结合,在保持计算效率的同时显著提升模型在真实场景和合成场景中的性能表现。
链接: https://arxiv.org/abs/2601.11656
作者: Jingzhou Shen,Xuyu Wang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in 2025 IEEE 22nd International Conference on Mobile Ad-Hoc and Smart Systems (MASS). See this https URL
Abstract:Modeling wireless channels accurately remains a challenge due to environmental variations and signal uncertainties. Recent neural networks can learn radio frequency~(RF) signal propagation patterns, but they process each voxel on the ray independently, without considering global context or environmental factors. Our paper presents a new approach that learns comprehensive representations of complete rays rather than individual points, capturing more detailed environmental features. We integrate a Kolmogorov-Arnold network (KAN) architecture with transformer modules to achieve better performance across realistic and synthetic scenes while maintaining computational efficiency. Our experimental results show that this approach outperforms existing methods in various scenarios. Ablation studies confirm that each component of our model contributes to its effectiveness. Additional experiments provide clear explanations for our model’s performance.
zh
[CV-274] PSSI-MaxST: An Efficient Pixel-Segment Similarity Index Using Intensity and Smoothness Features for Maximum Spanning Tree Based Segmentation
【速读】:该论文旨在解决交互式图分割方法中存在的高计算复杂度、对用户交互敏感以及在前景与背景颜色分布相似时性能下降的问题。其解决方案的关键在于提出了一种新的像素段相似性指数(Pixel Segment Similarity Index, PSSI),该指数通过引入像素强度和空间平滑性特征,并采用调和平均数来融合多通道相似性,从而有效抑制任意单一通道的不一致性,提升分割鲁棒性;同时结合MeanShift进行低层分割以捕获颜色、纹理与形状信息,构建基于PSSI加权的像素段图,并利用最大生成树(Maximum Spanning Tree, MaxST)实现局部强连通区域的精确分割,整体框架在保持较低计算复杂度(O(B),其中B为直方图分箱数)的同时显著提升了分割质量。
链接: https://arxiv.org/abs/2601.11654
作者: Kaustubh Shivshankar Shejole,Gaurav Mishra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Interactive graph-based segmentation methods partition an image into foreground and background regions with the aid of user inputs. However, existing approaches often suffer from high computational costs, sensitivity to user interactions, and degraded performance when the foreground and background share similar color distributions. A key factor influencing segmentation performance is the similarity measure used for assigning edge weights in the graph. To address these challenges, we propose a novel Pixel Segment Similarity Index (PSSI), which leverages the harmonic mean of inter-channel similarities by incorporating both pixel intensity and spatial smoothness features. The harmonic mean effectively penalizes dissimilarities in any individual channel, enhancing robustness. The computational complexity of PSSI is \mathcalO(B) , where B denotes the number of histogram bins. Our segmentation framework begins with low-level segmentation using MeanShift, which effectively captures color, texture, and segment shape. Based on the resulting pixel segments, we construct a pixel-segment graph with edge weights determined by PSSI. For partitioning, we employ the Maximum Spanning Tree (MaxST), which captures strongly connected local neighborhoods beneficial for precise segmentation. The integration of the proposed PSSI, MeanShift, and MaxST allows our method to jointly capture color similarity, smoothness, texture, shape, and strong local connectivity. Experimental evaluations on the GrabCut and Images250 datasets demonstrate that our method consistently outperforms current graph-based interactive segmentation methods such as AMOE, OneCut, and SSNCut in terms of segmentation quality, as measured by Jaccard Index (IoU), F_1 score, execution time and Mean Error (ME). Code is publicly available at: this https URL.
zh
[CV-275] Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification
【速读】:该论文旨在解决生成式 AI(Generative AI)中系统性外貌歧视(algorithmic lookism)的问题,即模型在文本到图像(T2I)生成和下游性别分类任务中对容貌吸引力的偏见性关联及其引发的不平等后果。研究通过分析使用 Stable Diffusion 2.1 和 3.5 Medium 生成的 26,400 张合成人脸,揭示了模型将吸引力与积极属性(如成功、可信)系统性绑定,同时女性面孔在负向属性输入下表现出显著更高的误分类率,且新模型呈现年龄同质化、性别暴露模式固化和地理范围缩小等加剧审美约束的趋势。解决方案的关键在于识别并量化这些嵌入在生成与识别系统中的结构性偏见,从而为构建更公平、透明的 AI 视觉系统提供实证依据与理论框架。
链接: https://arxiv.org/abs/2601.11651
作者: Miriam Doh,Aditya Gulati,Corina Canali,Nuria Oliver
机构: Université Libre de Bruxelles (布鲁塞尔自由大学); ELLIS Alicante (ELLIS 阿利坎特); Universität der Künste Berlin (柏林艺术大学); Weizenbaum Institute (魏森鲍姆研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 22 pages, 15 figures
Abstract:This paper examines algorithmic lookism-the systematic preferential treatment based on physical appearance-in text-to-image (T2I) generative AI and a downstream gender classification task. Through the analysis of 26,400 synthetic faces created with Stable Diffusion 2.1 and 3.5 Medium, we demonstrate how generative AI models systematically associate facial attractiveness with positive attributes and vice-versa, mirroring socially constructed biases rather than evidence-based correlations. Furthermore, we find significant gender bias in three gender classification algorithms depending on the attributes of the input faces. Our findings reveal three critical harms: (1) the systematic encoding of attractiveness-positive attribute associations in T2I models; (2) gender disparities in classification systems, where women’s faces, particularly those generated with negative attributes, suffer substantially higher misclassification rates than men’s; and (3) intensifying aesthetic constraints in newer models through age homogenization, gendered exposure patterns, and geographic reductionism. These convergent patterns reveal algorithmic lookism as systematic infrastructure operating across AI vision systems, compounding existing inequalities through both representation and recognition. Disclaimer: This work includes visual and textual content that reflects stereotypical associations between physical appearance and socially constructed attributes, including gender, race, and traits associated with social desirability. Any such associations found in this study emerge from the biases embedded in generative AI systems-not from empirical truths or the authors’ views. Comments: 22 pages, 15 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2601.11651 [cs.CV] (or arXiv:2601.11651v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.11651 Focus to learn more arXiv-issued DOI via DataCite
zh
[CV-276] IMSAHLO: Integrating Multi-Scale Attention and Hybrid Loss Optimization Framework for Robust Neuronal Brain Cell Segmentation
【速读】:该论文旨在解决荧光显微镜图像中神经元细胞分割的难题,尤其针对密集与稀疏分布细胞共存、复杂重叠形态以及严重类别不平衡等挑战。传统深度学习模型在上述条件下难以保持精细拓扑结构或准确划分边界。其解决方案的关键在于提出一种名为IMSAHLO(Integrating Multi-Scale Attention and Hybrid Loss Optimization)的新型深度学习框架:核心包括多尺度密集块(Multi-Scale Dense Blocks, MSDBs)以捕获不同感受野特征从而适应细胞密度变化,以及分层注意力机制(Hierarchical Attention, HA)自适应聚焦于显著形态特征以保留感兴趣区域(ROI)边界细节;同时引入混合损失函数,融合Tversky损失与焦点损失(Focal Loss)缓解类别不平衡问题,并结合拓扑感知中心线Dice损失(clDice)和轮廓加权边界损失(Contour-Weighted Boundary loss),确保拓扑连续性和相邻细胞间的精确分离。
链接: https://arxiv.org/abs/2601.11645
作者: Ujjwal Jain,Oshin Misra,Roshni Chakraborty,Mahua Bhattacharya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate segmentation of neuronal cells in fluorescence microscopy is a fundamental task for quantitative analysis in computational neuroscience. However, it is significantly impeded by challenges such as the coexistence of densely packed and sparsely distributed cells, complex overlapping morphologies, and severe class imbalance. Conventional deep learning models often fail to preserve fine topological details or accurately delineate boundaries under these conditions. To address these limitations, we propose a novel deep learning framework, IMSAHLO (Integrating Multi-Scale Attention and Hybrid Loss Optimization), for robust and adaptive neuronal segmentation. The core of our model features Multi-Scale Dense Blocks (MSDBs) to capture features at various receptive fields, effectively handling variations in cell density, and a Hierarchical Attention (HA) mechanism that adaptively focuses on salient morphological features to preserve Region of Interest (ROI) boundary details. Furthermore, we introduce a novel hybrid loss function synergistically combining Tversky and Focal loss to combat class imbalance, alongside a topology-aware Centerline Dice (clDice) loss and a Contour-Weighted Boundary loss to ensure topological continuity and precise separation of adjacent cells. Large-scale experiments on the public Fluorescent Neuronal Cells (FNC) dataset demonstrate that our framework outperforms state-of-the-art architectures, achieving precision of 81.4%, macro F1 score of 82.7%, micro F1 score of 83.3%, and balanced accuracy of 99.5% on difficult dense and sparse cases. Ablation studies validate the synergistic benefits of multi-scale attention and hybrid loss terms. This work establishes a foundation for generalizable segmentation models applicable to a wide range of biomedical imaging modalities, pushing AI-assisted analysis toward high-throughput neurobiological pipelines.
zh
[CV-277] Predicting When to Trust Vision-Language Models for Spatial Reasoning
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在空间推理任务中系统性失败的问题,尤其是在判断基本方向关系时准确率仅为49%–54%,这限制了其在机器人和自动驾驶等安全敏感场景中的可靠部署。为提升VLM空间预测的可信度,作者提出一种基于视觉的置信度估计框架,其核心在于通过独立的几何验证(利用目标检测结果)来融合四个信号:VLM预测与坐标间的几何一致性、空间模糊性(由物体重叠引起)、检测质量以及VLM内部不确定性,采用梯度提升方法进行融合。关键创新在于使用外部视觉信号而非文本自评估,实验证明视觉信号贡献了87.4%的模型重要性,显著优于仅依赖VLM自身置信度的方法,在BLIP-2上实现0.674 AUROC(较文本基线提升34.0%),并支持选择性预测——在保持60%目标准确率下覆盖率达61.9%,是基线的2.2倍。
链接: https://arxiv.org/abs/2601.11644
作者: Muhammad Imran,Yugyung Lee
机构: University of Missouri Kansas City (密苏里大学堪萨斯城分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 6 tables
Abstract:Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1% improvement), generalizing across generative and classification architectures. Our framework enables selective prediction: at 60% target accuracy, we achieve 61.9% coverage versus 27.6% baseline (2.2x improvement) on BLIP-2. Feature analysis reveals vision-based signals contribute 87.4% of model importance versus 12.7% from VLM confidence, validating that external geometric verification outperforms self-assessment. We demonstrate reliable scene graph construction where confidence-based pruning improves precision from 52.1% to 78.3% while retaining 68.2% of edges.
zh
[CV-278] PSSF: Early osteoarthritis detection using physical synthetic knee X-ray scans and AI radiomics models
【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, OA)放射学评估中依赖主观分级(如Kellgren-Lawrence, KL尺度)及高质量标注影像数据稀缺的问题,尤其在隐私保护、机构治理和资源限制下难以获取真实患者X射线图像。其解决方案的关键在于提出一种基于物理的合成仿真框架(Physics-based Synthetic Simulation Framework, PSSF),通过参数化解剖模型生成可控、无隐私风险的二维膝关节正侧位X射线图像,并结合影像生物标志物标准化倡议(Image Biomarker Standardisation Initiative, IBSI)进行特征提取与机器学习建模,从而实现对OA严重程度的定量预测。
链接: https://arxiv.org/abs/2601.11642
作者: Abbas Alzubaidi,Ali Al-Bayaty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 16 pages, 6 figures
Abstract:Knee osteoarthritis (OA) is a major cause of disability worldwide and is still largely assessed using subjective radiographic grading, most commonly the Kellgren-Lawrence (KL) scale. Artificial intelligence (AI) and radiomics offer quantitative tools for OA assessment but depend on large, well-annotated image datasets, mainly X-ray scans, that are often difficult to obtain because of privacy, governance and resourcing constraints. In this research, we introduce a physics-based synthetic simulation framework (PSSF) to fully generate controllable X-ray scans without patients’ involvement and violating their privacy and institutional constraints. This PSSF is a 2D X-ray projection simulator of anteroposterior knee radiographs from a parametric anatomical model of the distal femur and proximal tibia. Using PSSF, we create a virtual cohort of 180 subjects (260 knees), each is imaged under three protocols (reference, low-dose, and geometry-shift). Medial joint regions are automatically localized, preprocessed, and processed with the Image Biomarker Standardisation Initiative (IBSI). Practically, three machine learning (ML) models are utilized, logistic regression, random forest, and gradient boosting, to train binary (KL-like “0” vs. “2”) and three-class (0-2) prediction radiographic images. Robustness is assessed within IBSI protocol, cross-protocol, and multi-protocol scenarios. Finally, features stability is then evaluated using intraclass correlation coefficients across acquisition changes.
zh
[CV-279] Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers
【速读】:该论文旨在解决扩散变换器(Diffusion Transformers, DiTs)在视频生成任务中因自注意力机制固有的二次复杂度而导致的长序列生成效率低下问题。现有稀疏注意力方法要么依赖过于简化的静态模式,要么需通过计算昂贵的采样操作实现动态稀疏性,从而导致模式预测不准确并降低生成质量。解决方案的关键在于提出一种无需采样的动态注意力框架——MOD-DiT(Mixture-of-Distribution DiT),其核心创新为两阶段设计:首先利用早期去噪步骤中的先验信息,采用分布式混合方法构建高效线性近似模型以预测特定去噪区间内的掩码模式;其次引入在线块掩码策略,在保持历史稀疏信息的同时动态应用预测掩码,避免重复采样操作。此方法显著提升了视频生成的效率与质量,突破了传统稀疏注意力方案的计算瓶颈。
链接: https://arxiv.org/abs/2601.11641
作者: Yuxi Liu,Yipeng Hu,Zekun Zhang,Kunze Jiang,Kun Yuan
机构: Peking University (北京大学); University of Electronic Science and Technology of China (电子科技大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose a \underline\textbfMixtrue-\underline\textbfOf-\underline\textbfDistribution \textbfDiT (\textbfMOD-DiT), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a distributed mixing approach to model an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT’s effectiveness for efficient, high-quality video generation while overcoming the computational limitations of traditional sparse attention approaches.
zh
[CV-280] Confident Learning for Object Detection under Model Constraints ICPR2026
【速读】:该论文旨在解决边缘设备上农业杂草检测因模型容量、计算资源和实时推理延迟受限,无法通过模型扩容或集成提升性能的问题。解决方案的关键在于提出一种数据驱动的数据校正框架(Model-Driven Data Correction, MDDC),通过迭代诊断与修正数据质量缺陷来增强检测性能;其核心机制是基于自动化错误分析将检测失败分为四类(漏检、误检、类别混淆和定位误差),并采用结构化的“训练-修正-再训练”流程配合版本控制的数据管理,实现系统性数据质量优化,在固定轻量级检测器(YOLOv8n)条件下显著提升mAP(0.5阈值下提高5–25%)。
链接: https://arxiv.org/abs/2601.11640
作者: Yingda Yu,Jiaqi Xuan,Shuhui Shi,Xuanyu Teng,Shuyang Xu,Guanchao Tong
机构: Wenzhou-Kean University (温州肯恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICPR 2026, currently under review
Abstract:Agricultural weed detection on edge devices is subject to strict constraints on model capacity, computational resources, and real-time inference latency, which prevent performance improvements through model scaling or ensembling. This paper proposes Model-Driven Data Correction (MDDC), a data-centric framework that enhances detection performance by iteratively diagnosing and correcting data quality deficiencies. An automated error analysis procedure categorizes detection failures into four types: false negatives, false positives, class confusion, and localization errors. These error patterns are systematically addressed through a structured train-fix-retrain pipeline with version-controlled data management. Experimental results on multiple weed detection datasets demonstrate consistent improvements of 5-25 percent in mAP at 0.5 using a fixed lightweight detector (YOLOv8n), indicating that systematic data quality optimization can effectively alleviate performance bottlenecks under fixed model capacity constraints.
zh
[CV-281] Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics
【速读】:该论文旨在解决当前视觉-语言代理(Vision-Language Agents, VLAs)在执行复杂视觉任务时,因缺乏对自我修正能力的量化评估而难以识别关键推理瓶颈的问题。现有基准测试虽已开始考察迭代式自我修正机制,但其定量限制与主导性瓶颈仍未被充分揭示。为此,作者提出一种诊断微基准测试(Diagnostic Micro-Benchmark),通过解耦任务成功率(Task Success Rate, TSR)与修正成功率(Correction Success Rate, CSR),发现初始能力与修复能力之间无显著相关性,并首次明确量化了修正效果的边际递减规律——修正效果在三次尝试后趋于饱和。此外,失败分类体系揭示语义漂移(Semantic Drift)是主要故障因素(约占28%),即上下文状态丢失导致的推理中断。该研究的关键在于构建了一个可复现的框架,用于定位和缓解状态感知型多模态代理的核心推理瓶颈,从而推动更可靠、具状态记忆能力的多模态智能体发展。
链接: https://arxiv.org/abs/2601.11637
作者: Aradhya Dixit
机构: Institute for Clarity in Documentation (文档清晰度研究所); The Thørväld Group (Thørväld集团); Inria Paris-Rocquencourt (Inria巴黎-罗克昂邦研究所); Rajiv Gandhi University (拉吉夫·甘地大学); Tsinghua University (清华大学); Palmer Research Laboratories (帕尔默研究实验室); The Kumquat Consortium (库姆夸特联盟)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent progress in multimodal foundation models has enabled Vision-Language Agents (VLAs) to decompose complex visual tasks into executable tool-based plans. While recent benchmarks have begun to evaluate iterative self-correction, its quantitative limits and dominant reasoning bottlenecks remain poorly characterized. This work introduces a Diagnostic Micro-Benchmark. Our analysis decouples Task Success Rate (TSR = 62 percent) from Correction Success Rate (CSR = 25 to 33 percent), revealing that initial competence does not predict repair ability. We explicitly quantify the diminishing returns of correction, which saturates after three retries. Our Failure Taxonomy reveals a frequent factor is Semantic Drift (about 28 percent of failures), a loss of contextual state. By isolating this reasoning bottleneck, this benchmark defines a reproducible framework toward stateful, trustworthy multimodal agents.
zh
[CV-282] Now You See Me Now You Dont: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos
【速读】:该论文旨在解决人脸视频匿名化(face video anonymization)问题,即在保护个体隐私的同时,保留视频中关键的非身份特征(如年龄、性别、种族、姿态和表情),以支持下游计算机视觉任务(如表情识别、人物跟踪和动作识别)。解决方案的关键在于提出一个统一框架 Anon-NET,其核心包括两个阶段:首先利用基于扩散模型(diffusion-based generative model)的图像修复技术,在高阶属性识别和运动感知的表情迁移引导下对人脸进行去标识化;其次通过视频驱动的动画生成模块,将去标识后的面部与原始视频动态绑定,从而实现身份模糊化的同时保持视觉真实性和时间一致性。
链接: https://arxiv.org/abs/2601.11635
作者: Anil Egin,Andrea Tangherloni,Antitza Dantcheva
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face video anonymization is aimed at privacy preservation while allowing for the analysis of videos in a number of computer vision downstream tasks such as expression recognition, people tracking, and action recognition. We propose here a novel unified framework referred to as Anon-NET, streamlined to de-identify facial videos, while preserving age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces by a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate deidentified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of AnonNET in obfuscating identity while retaining visual realism and temporal consistency. The code of AnonNet will be publicly released.
zh
[CV-283] When Rules Fall Short: Agent -Driven Discovery of Emerging Content Issues in Short Video Platforms
【速读】:该论文旨在解决短视频平台中新兴内容问题发现滞后的问题,即传统人工驱动的发现方式无法及时响应快速演变的内容趋势,导致标注策略更新延迟,影响内容治理效果。其解决方案的关键在于提出一种基于多模态大语言模型(Multimodal Large Language Model, Multimodal LLM)智能体的自动问题发现方法:该方法首先自动召回可能包含潜在新问题的短视频,再通过两级聚类策略对视频进行分组,每个簇对应一个新发现的问题;随后由智能体从聚类结果中生成更新后的标注策略,从而扩展对新兴问题的覆盖范围。实证表明,该方法显著提升了问题发现的有效性(F1分数提升超20%),并加速了政策迭代,同时降低了时间成本。
链接: https://arxiv.org/abs/2601.11634
作者: Chenghui Yu,Hongwei Wang,Junwen Chen,Zixuan Wang,Bingfeng Deng,Zhuolin Hao,Hongyu Xiong,Yang Song
机构: TikTok Inc.(TikTok公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Trends on short-video platforms evolve at a rapid pace, with new content issues emerging every day that fall outside the coverage of existing annotation policies. However, traditional human-driven discovery of emerging issues is too slow, which leads to delayed updates of annotation policies and poses a major challenge for effective content governance. In this work, we propose an automatic issue discovery method based on multimodal LLM agents. Our approach automatically recalls short videos containing potential new issues and applies a two-stage clustering strategy to group them, with each cluster corresponding to a newly discovered issue. The agent then generates updated annotation policies from these clusters, thereby extending coverage to these emerging issues. Our agent has been deployed in the real system. Both offline and online experiments demonstrate that this agent-based method significantly improves the effectiveness of emerging-issue discovery (with an F1 score improvement of over 20%) and enhances the performance of subsequent issue governance (reducing the view count of problematic videos by approximately 15%). More importantly, compared to manual issue discovery, it greatly reduces time costs and substantially accelerates the iteration of annotation policies.
zh
[CV-284] Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在“以图思考”能力评估中缺乏对推理过程真实性检验的问题。现有基准主要依赖结果准确性,无法衡量模型是否能准确利用细粒度视觉线索进行多步推理。解决方案的关键在于提出ViEBench——一个可验证推理过程的基准,包含200张高分辨率图像及专家标注的视觉证据,并按感知与推理难度维度分类;同时引入双轴诊断矩阵,通过四个细分象限提供细粒度指标,从而实现对模型行为在不同任务复杂度下的透明化诊断。
链接: https://arxiv.org/abs/2601.11633
作者: Xuchen Li,Xuzhao Li,Renjie Pi,Shiyu Hu,Jian Zhao,Jiahui Gao
机构: ZGCA; NTU; HKUST; HKU; ZGCI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, Under review
Abstract:Despite the remarkable progress of Vision-Language Models (VLMs) in adopting “Thinking-with-Images” capabilities, accurately evaluating the authenticity of their reasoning process remains a critical challenge. Existing benchmarks mainly rely on outcome-oriented accuracy, lacking the capability to assess whether models can accurately leverage fine-grained visual cues for multi-step reasoning. To address these limitations, we propose ViEBench, a process-verifiable benchmark designed to evaluate faithful visual reasoning. Comprising 200 multi-scenario high-resolution images with expert-annotated visual evidence, ViEBench uniquely categorizes tasks by difficulty into perception and reasoning dimensions, where reasoning tasks require utilizing localized visual details with prior knowledge. To establish comprehensive evaluation criteria, we introduce a dual-axis matrix that provides fine-grained metrics through four diagnostic quadrants, enabling transparent diagnosis of model behavior across varying task complexities. Our experiments yield several interesting observations: (1) VLMs can sometimes produce correct final answers despite grounding on irrelevant regions, and (2) they may successfully locate the correct evidence but still fail to utilize it to reach accurate conclusions. Our findings demonstrate that ViEBench can serve as a more explainable and practical benchmark for comprehensively evaluating the effectiveness agentic VLMs. The codes will be released at: this https URL.
zh
[CV-285] KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLM s for Visual Question Answering
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在视觉问答(Visual Question Answering, VQA)任务中面临的双重挑战:知识幻觉(knowledge hallucination)和细粒度视觉感知不足。解决方案的关键在于提出一个统一框架KG-ViP,其核心是一个新颖的检索与融合(retrieval-and-fusion)管道,利用查询作为语义桥梁,逐步整合场景图(scene graph)和常识图(commonsense graph),从而构建一个统一的结构化上下文,以增强可靠的多模态推理能力。
链接: https://arxiv.org/abs/2601.11632
作者: Zhiyang Li,Ao Ke,Yukun Cao,Xike Xie
机构: University of Science and Technology of China (中国科学技术大学); Data Darkness Lab, MIRACLE Center (数据幽暗实验室,奇迹中心); Université de Montréal (蒙特利尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.
zh
[CV-286] Compress to Focus: Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents
【速读】:该论文旨在解决多轮图形用户界面(GUI)智能体在交互过程中因历史信息累积导致的上下文膨胀问题,现有方法要么通过截断牺牲长期上下文,要么通过标记裁剪破坏空间结构。其解决方案的关键在于提出坐标压缩策略优化(Coordinate Compression Policy Optimization, CCPO),其中核心创新是坐标感知空间压缩(Coordinate-Aware Spatial Compression, CASC),该机制通过聚合多个回放轨迹中的坐标信息,动态聚焦于目标相关区域并逐步缩小对关键视觉区域的历史注意力范围;同时设计基于距离的优势函数(Distance-Based Advantage),以距离而非二值正确性提供细粒度学习信号,从而提升定位准确性和压缩质量。实验表明,CCPO在四个基准测试中达到当前最优性能,实现最高达55%的令牌压缩率和3.8倍训练加速。
链接: https://arxiv.org/abs/2601.11631
作者: Yurun Song,Jiong Yin,Rongjunchen Zhang,Ian G. Harris
机构: HiThink Research; University of California, Irvine; Hangzhou Dianzi University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-turn GUI agents enable complex task completion through sequential decision-making, but suffer from severe context inflation as interaction history accumulates. Existing strategies either sacrifice long-term context via truncation or compromise spatial structure through token pruning. In this paper, we propose Coordinate Compression Policy Optimization (CCPO), an efficient policy optimization framework that couples visual compression with policy optimization for multi-turn GUI agents. CCPO introduces Coordinate-Aware Spatial Compression (CASC), which aggregates coordinates from multiple rollouts to capture target-relevant regions and progressively narrow historical attention around key visual areas. From interactions across rollouts, CASC adaptively constructs attention boundaries that concentrate computation on the most informative regions of the scene. We further design a Distance-Based Advantage that provides fine-grained learning signals based on distance rather than binary correctness, improving both grounding accuracy and compression quality. Extensive experiments demonstrate that CCPO achieves SOTA performance across four benchmarks with up to 55% token compression and 3.8 \times training speedup.
zh
[CV-287] A one-step generation model with a Single-Layer Transformer: Layer number re-distillation of FreeFlow
【速读】:该论文旨在解决一阶生成式扩散模型(如FreeFlow)在单步生成过程中因初始噪声质量不稳定而导致的图像生成质量波动问题。其核心挑战在于如何在有限采样次数下提升生成样本的稳定性和整体质量。解决方案的关键在于提出一种名为SLT(Single-Layer Transformer)的轻量化蒸馏模型,通过将原28层Transformer结构(即深度方向上的欧拉离散化ODE)压缩为单一共享DiT块,利用中间特征匹配与最终速度对齐策略,在训练中实现对教师模型(FreeFlow)深度特征演化过程的有效近似。该方法显著降低参数量(从675M降至4.3M),并借助极低计算开销在相同时间内完成超百次噪声点筛选,从而选出更优初始点供教师模型生成高质量图像,有效缓解了因随机初始化噪声导致的质量波动问题,提升了单步生成的稳定性与平均性能。
链接: https://arxiv.org/abs/2601.11630
作者: Haonan Wei,Linyuan Wang,Nuolin Sun,Zhizhong Zheng,Lei Li,Bin Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Currently, Flow matching methods aim to compress the iterative generation process of diffusion models into a few or even a single step, with MeanFlow and FreeFlow being representative achievements of one-step generation based on Ordinary Differential Equations (ODEs). We observe that the 28-layer Transformer architecture of FreeFlow can be characterized as an Euler discretization scheme for an ODE along the depth axis, where the layer index serves as the discrete time step. Therefore, we distill the number of layers of the FreeFlow model, following the same derivation logic as FreeFlow, and propose SLT (Single-Layer Transformer), which uses a single shared DiT block to approximate the depth-wise feature evolution of the 28-layer teacher. During training, it matches the teacher’s intermediate features at several depth patches, fuses those patch-level representations, and simultaneously aligns the teacher’s final velocity prediction. Through distillation training, we compress the 28 independent Transformer Blocks of the teacher model DiT-XL/2 into a single Transformer Block, reducing the parameter count from 675M to 4.3M. Furthermore, leveraging its minimal parameters and rapid sampling speed, SLT can screen more candidate points in the noise space within the same timeframe, thereby selecting higher-quality initial points for the teacher model FreeFlow and ultimately enhancing the quality of generated images. Experimental results demonstrate that within a time budget comparable to two random samplings of the teacher model, our method performs over 100 noise screenings and produces a high-quality sample through the teacher model using the selected points. Quality fluctuations caused by low-quality initial noise under a limited number of FreeFlow sampling calls are effectively avoided, substantially improving the stability and average generation quality of one-step generation.
zh
[CV-288] Handcrafted Feature-Assisted One-Class Learning for Artist Authentication in Historical Drawings
【速读】:该论文旨在解决历史素描作品在文化遗存中的认证与归属问题,尤其针对参考样本量小、风格线索主要依赖线条和有限色调变化的场景。解决方案的关键在于提出一种基于验证的计算框架,采用单类自编码器(one-class autoencoders)对一组可解释的手工特征向量进行训练,这些特征包括傅里叶域能量、香农熵、全局对比度、灰度共生矩阵(GLCM)同质性及分形复杂度的盒计数估计。该方法通过生物识别式协议评估,在900次验证决策中实现83.3%的真实接受率(True Acceptance Rate)与9.5%的误接受率(False Acceptance Rate),且性能因艺术家而异,体现出结构化的错误路径,符合艺术风格相近性和共同绘图惯例的规律,从而为数据稀缺的历史素描归属提供可复现、定量的辅助证据,补充而非替代专家判断。
链接: https://arxiv.org/abs/2601.11627
作者: Hassan Ugail,Jan Ritch-Frel,Irina Matuzava
机构: University of Bradford (布拉德福德大学); Independent Media Institute (独立媒体研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Authentication and attribution of works on paper remain persistent challenges in cultural heritage, particularly when the available reference corpus is small and stylistic cues are primarily expressed through line and limited tonal variation. We present a verification-based computational framework for historical drawing authentication using one-class autoencoders trained on a compact set of interpretable handcrafted features. Ten artist-specific verifiers are trained using authenticated sketches from the Metropolitan Museum of Art open-access collection, the Ashmolean Collections Catalogue, the Morgan Library and Museum, the Royal Collection Trust (UK), the Victoria and Albert Museum Collections, and an online catalogue of the Casa Buonarroti collection and evaluated under a biometric-style protocol with genuine and impostor trials. Feature vectors comprise Fourier-domain energy, Shannon entropy, global contrast, GLCM-based homogeneity, and a box-counting estimate of fractal complexity. Across 900 verification decisions (90 genuine and 810 impostor trials), the pooled system achieves a True Acceptance Rate of 83.3% with a False Acceptance Rate of 9.5% at the chosen operating point. Performance varies substantially by artist, with near-zero false acceptance for some verifiers and elevated confusability for others. A pairwise attribution of false accepts indicates structured error pathways consistent with stylistic proximity and shared drawing conventions, whilst also motivating tighter control of digitisation artefacts and threshold calibration. The proposed methodology is designed to complement, rather than replace, connoisseurship by providing reproducible, quantitative evidence suitable for data-scarce settings common in historical sketch attribution.
zh
[CV-289] PointSLAM: Robust Dense Neural Gaussian Point Cloud-based SLAM
【速读】:该论文旨在解决当前RGB-D SLAM系统在深度噪声干扰下难以保持结构一致性与鲁棒位姿估计的问题。其解决方案的关键在于提出PointSLAM++,该系统通过引入分层约束的神经高斯表示(hierarchically constrained neural Gaussian representation)来保留场景结构关系,并生成用于地图构建的高斯原语;同时采用渐进式位姿优化策略以抑制深度传感器噪声,提升定位精度;此外,利用动态神经表示图根据局部几何复杂度自适应调整高斯节点分布,实现对复杂场景细节的实时适应性建模。这一组合显著提升了3D重建精度与渲染质量,在大规模增强现实(AR)和机器人应用中展现出优越性能。
链接: https://arxiv.org/abs/2601.11617
作者: Xu Wang,Boyao Han,Xiaojun Chen,Ying Liu,Ruihui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注:
Abstract:Real-time 3D reconstruction is crucial for robotics and augmented reality, yet current simultaneous localization and mapping(SLAM) approaches often struggle to maintain structural consistency and robust pose estimation in the presence of depth noise. This work introduces PointSLAM++, a novel RGB-D SLAM system that leverages a hierarchically constrained neural Gaussian representation to preserve structural relationships while generating Gaussian primitives for scene mapping. It also employs progressive pose optimization to mitigate depth sensor noise, significantly enhancing localization accuracy. Furthermore, it utilizes a dynamic neural representation graph that adjusts the distribution of Gaussian nodes based on local geometric complexity, enabling the map to adapt to intricate scene details in real time. This combination yields high-precision 3D mapping and photorealistic scene rendering. Experimental results show PointSLAM++ outperforms existing 3DGS-based SLAM methods in reconstruction accuracy and rendering quality, demonstrating its advantages for large-scale AR and robotics.
zh
[CV-290] Multi-modal MRI-Based Alzheimers Disease Diagnosis with Transformer-based Image Synthesis and Transfer Learning
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期诊断中因扩散磁共振成像(diffusion MRI, dMRI)获取困难而导致的微结构信息缺失问题。由于dMRI虽能敏感反映白质完整性变化,但其采集耗时且易受运动伪影影响,难以在临床常规实践中推广;而T1加权MRI(T1-weighted MRI, T1w MRI)虽广泛可用,却只能捕捉宏观结构改变,通常出现在疾病较晚期。为此,作者提出一种基于3D TransUNet的图像合成框架,通过从常规T1w MRI直接预测FA(fractional anisotropy)和MD(mean diffusivity)图谱,实现高质量微结构信息的无创推断。该方案的关键在于利用深度学习模型将T1w MRI中的解剖结构特征映射至dMRI表征空间,从而在不依赖额外扫描的前提下,生成与真实dMRI高度一致的合成微结构图谱(SSIM > 0.93,Pearson相关系数0.94),并显著提升轻度认知障碍(mild cognitive impairment, MCI)检测性能(提高12.5%),为临床AD早筛提供了高效、可扩展的技术路径。
链接: https://arxiv.org/abs/2601.11614
作者: Jason Qiu
机构: Marvin Ridge High School
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 19 pages, 10 figures
Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disorder in which pathological changes begin many years before the onset of clinical symptoms, making early detection essential for timely intervention. T1-weighted (T1w) Magnetic Resonance Imaging (MRI) is routinely used in clinical practice to identify macroscopic brain alterations, but these changes typically emerge relatively late in the disease course. Diffusion MRI (dMRI), in contrast, is sensitive to earlier microstructural abnormalities by probing water diffusion in brain tissue. dMRI metrics, including fractional anisotropy (FA) and mean diffusivity (MD), provide complementary information about white matter integrity and neurodegeneration. However, dMRI acquisitions are time-consuming and susceptible to motion artifacts, limiting their routine use in clinical populations. To bridge this gap, I propose a 3D TransUNet image synthesis framework that predicts FA and MD maps directly from T1w MRI. My model generates high-fidelity maps, achieving a structural similarity index (SSIM) exceeding 0.93 and a strong Pearson correlation (0.94) with ground-truth dMRI. When integrated into a multi-modal diagnostic model, these synthetic features boost AD classification accuracy by 5% (78.75%-83.75%) and, most importantly, improve mild cognitive impairment (MCI) detection by 12.5%. This study demonstrates that high-quality diffusion microstructural information can be inferred from routinely acquired T1w MRI, effectively transferring the benefits of multi-modality imaging to settings where diffusion data are unavailable. By reducing scan time while preserving complementary structural and microstructural information, the proposed approach has the potential to improve the accessibility, efficiency, and accuracy of AD diagnosis in clinical practice.
zh
[CV-291] Domain-Specific Self-Supervised Pre-training for Agricultural Disease Classification: A Hierarchical Vision Transformer Study
【速读】:该论文旨在解决农业病害分类任务中模型性能提升的瓶颈问题,尤其关注领域特定自监督预训练(domain-specific self-supervised pre-training, SSL)与架构设计之间的相对贡献。其解决方案的关键在于实证表明:仅使用3,000张未标注农业图像进行SimCLR预训练即可带来+4.57%的准确率提升,显著优于通过层级结构设计带来的+3.70%增益;且该SSL收益具有架构无关性(如在Swin-Base和ViT-Base上分别获得+4.08%和+4.20%提升),从而证明在资源有限时应优先收集领域数据进行预训练,而非过度优化模型架构。
链接: https://arxiv.org/abs/2601.11612
作者: Arnav S. Sonavane
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 4 figures, 9 tables
Abstract:We investigate the impact of domain-specific self-supervised pre-training on agricultural disease classification using hierarchical vision transformers. Our key finding is that SimCLR pre-training on just 3,000 unlabeled agricultural images provides a +4.57% accuracy improvement–exceeding the +3.70% gain from hierarchical architecture design. Critically, we show this SSL benefit is architecture-agnostic: applying the same pre-training to Swin-Base yields +4.08%, to ViT-Base +4.20%, confirming practitioners should prioritize domain data collection over architectural choices. Using HierarchicalViT (HVT), a Swin-style hierarchical transformer, we evaluate on three datasets: Cotton Leaf Disease (7 classes, 90.24%), PlantVillage (38 classes, 96.3%), and PlantDoc (27 classes, 87.1%). At matched parameter counts, HVT-Base (78M) achieves 88.91% vs. Swin-Base (88M) at 87.23%, a +1.68% improvement. For deployment reliability, we report calibration analysis showing HVT achieves 3.56% ECE (1.52% after temperature scaling). Code: this https URL
zh
[CV-292] “Jutters” NEURIPS2025
【速读】:该论文试图解决的问题是:在AI生成内容(AI-generated content)日益充斥数字媒体环境的背景下,人类如何建立更具批判性和反思性的媒介消费习惯。当前用户对算法推荐和生成式AI(Generative AI)内容的被动接受,削弱了对信息来源、真实性与价值的审慎判断。解决方案的关键在于通过一个具身化的交互装置——模拟荷兰海岸拾荒者(jutter)行为的海滩式空间,将真实海洋废弃物与AI重构的图像和视频并置,使观众以“当代拾荒者”的身份参与选择性保留或丢弃的过程。这一实践促使个体在物质与符号层面重新思考AI生成内容的“价值”与“意义”,从而推动对数字信息流中漂浮内容的主动甄别与深度反思。
链接: https://arxiv.org/abs/2601.11532
作者: Meike Driessen,Selina Khan,Gonçalo Marcelino
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Creative AI Track
Abstract:This project explores how we engage with AI-generated content through the lens of the jutter: Dutch coastal foragers who comb the shoreline after storms, gathering and repurposing what the sea leaves behind. Reflecting how our lives are increasingly shaped by AI-generated media, we create a beach-like installation that blends real shoreline debris with AI-transformed images and videos. Visitors are invited to explore this space as contemporary jutters, deciding what to keep and what to discard. In doing so, the project reimagines AI-imagery as material for reflection, encouraging a more discerning engagement with the content that drifts through our feeds. A video preview of the installation can be found at this https URL.
zh
[CV-293] MooneyMaker: A Python package to create ambiguous two-tone images
【速读】:该论文旨在解决传统人工制作Mooney图像(一种通过阈值化照片生成的高对比度、二值化视觉刺激)过程中存在的劳动密集和主观性强的问题,这些问题导致不同研究间结果不一致。其核心解决方案是提出并实现了一个开源Python工具包MooneyMaker,该工具采用多种互补方法(包括基于图像统计特征与深度学习模型的方法)自动生成具有特定模糊性的Mooney图像,关键在于通过策略性地修改边缘信息以增强初始识别难度,同时允许用户直接可视化比较不同生成技术的效果。实验验证表明,初始识别难度更高的图像在看到模板后表现出更强的识别提升(即更大的“解歧效应”),从而为视觉感知研究提供了标准化、可重复的刺激生成方案。
链接: https://arxiv.org/abs/2601.14077
作者: Lars C. Reining,Thabo Matthies,Luisa Haussner,Rabea Turon,Thomas S. A. Wallis
机构: Technical University of Darmstadt(达姆施塔特工业大学); Center for Mind, Brain and Behavior (CMBB)(心智、大脑与行为中心)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mooney images are high-contrast, two-tone visual stimuli, created by thresholding photographic images. They allow researchers to separate image content from image understanding, making them valuable for studying visual perception. An ideal Mooney image for this purpose achieves a specific balance: it initially appears unrecognizable but becomes fully interpretable to the observer after seeing the original template. Researchers traditionally created these stimuli manually using subjective criteria, which is labor-intensive and can introduce inconsistencies across studies. Automated generation techniques now offer an alternative to this manual approach. Here, we present MooneyMaker, an open-source Python package that automates the generation of ambiguous Mooney images using several complementary approaches. Users can choose between various generation techniques that range from approaches based on image statistics to deep learning models. These models strategically alter edge information to increase initial ambiguity. The package lets users create two-tone images with multiple methods and directly compare the results visually. In an experiment, we validate MooneyMaker by generating Mooney images using different techniques and assess their recognizability for human observers before and after disambiguating them by presenting the template images. Our results reveal that techniques with lower initial recognizability are associated with higher post-template recognition (i.e. a larger disambiguation effect). To help vision scientists build effective databases of Mooney stimuli, we provide practical guidelines for technique selection. By standardizing the generation process, MooneyMaker supports more consistent and reproducible visual perception research.
zh
[CV-294] SHARE: A Fully Unsupervised Framework for Single Hyperspectral Image Restoration
【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)恢复任务中因缺乏真实标签数据而导致的监督学习方法在实际场景中应用受限的问题。其核心挑战在于如何在无监督条件下实现高质量的HSI修复,如补全(inpainting)和超分辨率重建。解决方案的关键在于提出一个名为SHARE(Single Hyperspectral Image Restoration with Equivariance)的完全无监督框架,该框架融合了几何等变性(geometric equivariance)原理与低秩光谱建模思想:通过利用HSI在可微分几何变换(如旋转和缩放)下的内在不变性,构建自监督信号;同时引入动态自适应光谱注意力(Dynamic Adaptive Spectral Attention, DASA)模块,显式建模HSI的全局低秩特性,并通过可学习的注意力机制自适应地优化局部光谱-空间相关性,从而实现无需真实标签即可有效恢复HSI的能力。
链接: https://arxiv.org/abs/2601.13987
作者: Jiangwei Xie,Zhang Wen,Mike Davies,Dongdong Chen
机构: Heriot-Watt University (赫瑞-瓦特大学); University of Edinburgh (爱丁堡大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report
Abstract:Hyperspectral image (HSI) restoration is a fundamental challenge in computational imaging and computer vision. It involves ill-posed inverse problems, such as inpainting and super-resolution. Although deep learning methods have transformed the field through data-driven learning, their effectiveness hinges on access to meticulously curated ground-truth datasets. This fundamentally restricts their applicability in real-world scenarios where such data is unavailable. This paper presents SHARE (Single Hyperspectral Image Restoration with Equivariance), a fully unsupervised framework that unifies geometric equivariance principles with low-rank spectral modelling to eliminate the need for ground truth. SHARE’s core concept is to exploit the intrinsic invariance of hyperspectral structures under differentiable geometric transformations (e.g. rotations and scaling) to derive self-supervision signals through equivariance consistency constraints. Our novel Dynamic Adaptive Spectral Attention (DASA) module further enhances this paradigm shift by explicitly encoding the global low-rank property of HSI and adaptively refining local spectral-spatial correlations through learnable attention mechanisms. Extensive experiments on HSI inpainting and super-resolution tasks demonstrate the effectiveness of SHARE. Our method outperforms many state-of-the-art unsupervised approaches and achieves performance comparable to that of supervised methods. We hope that our approach will shed new light on HSI restoration and broader scientific imaging scenarios. The code will be released at this https URL.
zh
[CV-295] Accurate Simulation Pipeline for Passive Single-Photon Imaging
【速读】:该论文旨在解决单光子雪崩二极管(SPAD)成像传感器因价格高昂和供应有限而导致的数据稀缺问题,从而阻碍了针对SPAD特性设计的图像处理算法开发及基于学习的方法训练。其关键解决方案是提出了一套全面的SPAD仿真流程,能够生成包含多种SPAD成像模式的合成数据集(如SPAD-MNIST),并验证该仿真在真实SPAD传感器上的有效性。通过在极端低光条件(如5 mlux)下评估卷积神经网络(CNN)分类器对重建通量的性能,以及仅用模拟数据训练的分类器在真实SPAD图像上的泛化能力,证明了该仿真方法可有效支撑SPAD专用算法的研发与训练。
链接: https://arxiv.org/abs/2601.12850
作者: Aleksi Suonsivu,Lauri Salmela,Leevi Uosukainen,Edoardo Peretti,Radu Ciprian Bilcu,Giacomo Boracchi
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 10 figures, 3 tables
Abstract:Single-Photon Avalanche Diodes (SPADs) are new and promising imaging sensors. These sensors are sensitive enough to detect individual photons hitting each pixel, with extreme temporal resolution and without readout noise. Thus, SPADs stand out as an optimal choice for low-light imaging. Due to the high price and limited availability of SPAD sensors, the demand for an accurate data simulation pipeline is substantial. Indeed, the scarcity of SPAD datasets hinders the development of SPAD-specific processing algorithms and impedes the training of learning-based solutions. In this paper, we present a comprehensive SPAD simulation pipeline and validate it with multiple experiments using two recent commercial SPAD sensors. Our simulator is used to generate the SPAD-MNIST, a single-photon version of the seminal MNIST dataset, to investigate the effectiveness of convolutional neural network (CNN) classifiers on reconstructed fluxes, even at extremely low light conditions, e.g., 5 mlux. We also assess the performance of classifiers exclusively trained on simulated data on real images acquired from SPAD sensors at different light conditions. The synthetic dataset encompasses different SPAD imaging modalities and is made available for download. Project page: this https URL. Comments: 18 pages, 10 figures, 3 tables Subjects: Instrumentation and Detectors (physics.ins-det); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.12850 [physics.ins-det] (or arXiv:2601.12850v1 [physics.ins-det] for this version) https://doi.org/10.48550/arXiv.2601.12850 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1109/JSEN.2025.3645459 Focus to learn more DOI(s) linking to related resources
zh
[CV-296] DALD-PCAC: Density-Adaptive Learning Descriptor for Point Cloud Lossless Attribute Compression
【速读】:该论文旨在解决点云在不同密度下进行无损属性压缩时,现有基于学习的方法仍存在性能不足的问题。其关键解决方案在于提出了一种名为DALD-PCAC的框架,该框架通过引入层次细节(Levels of Detail, LoD)来适配点云的密度变化,并设计了点级注意力模型(使用排列不变的Transformer)以应对点云稀疏性和不规则性带来的上下文建模挑战;同时提出了密度自适应学习描述符(Density-Adaptive Learning Descriptor, DALD),可有效捕捉大邻域范围内点之间的结构与相关性,并结合先验引导的块划分策略降低块内属性方差,从而提升压缩效率和鲁棒性。
链接: https://arxiv.org/abs/2601.12261
作者: Chunyang Fu,Ge Li,Wei Gao,Shiqi Wang,Zhu Li,Shan Liu
机构: City University of Hong Kong (香港城市大学); Peking University Shenzhen Graduate School (北京大学深圳研究生院); University of Missouri-Kansas City (密苏里大学堪萨斯城分校); Tencent Media Laboratory (腾讯媒体实验室)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TOMM
Abstract:Recently, deep learning has significantly advanced the performance of point cloud geometry compression. However, the learning-based lossless attribute compression of point clouds with varying densities is under-explored. In this paper, we develop a learning-based framework, namely DALD-PCAC that leverages Levels of Detail (LoD) to tailor for point cloud lossless attribute compression. We develop a point-wise attention model using a permutation-invariant Transformer to tackle the challenges of sparsity and irregularity of point clouds during context modeling. We also propose a Density-Adaptive Learning Descriptor (DALD) capable of capturing structure and correlations among points across a large range of neighbors. In addition, we develop a prior-guided block partitioning to reduce the attribute variance within blocks and enhance the performance. Experiments on LiDAR and object point clouds show that DALD-PCAC achieves the state-of-the-art performance on most data. Our method boosts the compression performance and is robust to the varying densities of point clouds. Moreover, it guarantees a good trade-off between performance and complexity, exhibiting great potential in real-world applications. The source code is available at this https URL.
zh
[CV-297] DeepRAHT: Learning Predictive RAHT for Point Cloud Attribute Compression AAAI2026
【速读】:该论文旨在解决点云属性压缩(Point Cloud Attribute Compression, PCAC)中传统区域自适应分层变换(Regional Adaptive Hierarchical Transform, RAHT)难以与深度学习框架无缝集成的问题,从而实现端到端的高效、可学习的压缩方法。其解决方案的关键在于提出一种基于稀疏张量的端到端深度学习框架 DeepRAHT,该框架将 RAHT 变换嵌入到学习重建过程中,无需人工预处理;同时引入预测式 RAHT 机制以降低码率,并设计基于学习的预测模型提升性能;此外,通过运行长度编码(run-length coding)构建比特率代理(bitrate proxy),实现无损可变码率编码并增强鲁棒性。整体上,DeepRAHT 是一个可逆且失真可控的框架,具备高性能、高效率和强应用潜力。
链接: https://arxiv.org/abs/2601.12255
作者: Chunyang Fu,Tai Qin,Shiqi Wang,Zhu Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Multimedia (cs.MM)
备注: Accepted by AAAI 2026
Abstract:Regional Adaptive Hierarchical Transform (RAHT) is an effective point cloud attribute compression (PCAC) method. However, its application in deep learning lacks research. In this paper, we propose an end-to-end RAHT framework for lossy PCAC based on the sparse tensor, called DeepRAHT. The RAHT transform is performed within the learning reconstruction process, without requiring manual RAHT for preprocessing. We also introduce the predictive RAHT to reduce bitrates and design a learning-based prediction model to enhance performance. Moreover, we devise a bitrate proxy that applies run-length coding to entropy model, achieving seamless variable-rate coding and improving robustness. DeepRAHT is a reversible and distortion-controllable framework, ensuring its lower bound performance and offering significant application potential. The experiments demonstrate that DeepRAHT is a high-performance, faster, and more robust solution than the baseline methods. Project Page: this https URL.
zh
[CV-298] Accelerated MR Elastography Using Learned Neural Network Representation
【速读】:该论文旨在解决高undersampled(欠采样)数据下快速获取高分辨率磁共振弹性成像(MRE)的问题,尤其在缺乏高质量训练数据的情况下实现可靠重建。其核心解决方案是将深度神经网络表示建模为线性子空间模型的非线性扩展,并通过多层级k空间一致性损失以自监督方式学习网络权重;同时引入相位对比特异性幅度与相位先验(如解剖结构相似性和波诱导谐波位移平滑性),从而显著提升重建质量,在仅使用单个平面螺旋臂(总加速因子R=10)条件下即可获得接近全采样数据的刚度估计结果,验证了深度网络表示在MRE图像建模与重建中的可行性与优越性。
链接: https://arxiv.org/abs/2601.11878
作者: Xi Peng
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:
Abstract:To develop a deep-learning method for achieving fast high-resolution MR elastography from highly undersampled data without the need of high-quality training dataset. We first framed the deep neural network representation as a nonlinear extension of the linear subspace model, then used it to represent and reconstruct MRE image repetitions from undersampled k-space data. The network weights were learned using a multi-level k-space consistent loss in a self-supervised manner. To further enhance reconstruction quality, phase-contrast specific magnitude and phase priors were incorporated, including the similarity of anatomical structures and smoothness of wave-induced harmonic displacement. Experiments were conducted using both 3D gradient-echo spiral and multi-slice spin-echo spiral MRE datasets. Compared to the conventional linear subspace-based approaches, the nonlinear network representation method was able to produce superior image reconstruction with suppressed noise and artifacts from a single in-plane spiral arm per MRE repetition (e.g., total R=10), yielding comparable stiffness estimation to the fully sampled data. This work demonstrated the feasibility of using deep network representations to model and reconstruct MRE images from highly-undersampled data, a nonlinear extension of the subspace-based approaches.
zh
[CV-299] Karhunen-Loève Expansion-Based Residual Anomaly Map for Resource-Efficient Glioma MRI Segmentation
【速读】:该论文旨在解决脑肿瘤分割任务中对大规模数据集和高计算资源的依赖问题,这在多数临床环境中难以实现。当前最先进的深度学习模型通常需要数以万计的训练样本及超算级别的硬件支持(如 BraTS GLI 2023 冠军方法使用了超过 92,000 张增强 MRI 扫描并在多 GPU 超算上训练数周),导致在数据有限或计算能力受限场景下性能显著下降。其解决方案的关键在于引入 Karhunen–Loève Expansion (KLE) 作为特征提取步骤,将原始多模态 MRI 图像(240 × 240 × 155)降维压缩为四个 48³ 通道与 32 个 KL 系数,并构建基于残差的异常图(residual-based anomaly map),再将其上采样后作为第五通道输入至轻量级 3D U-Net 模型中。该方法大幅降低计算成本与数据需求,同时保持甚至超越现有最优性能(如 WT Dice 达 0.929,HD95 仅为 2.93 voxels),验证了 KLE 基于残差异常图的有效性与高效性。
链接: https://arxiv.org/abs/2601.11833
作者: Anthony Hur
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Accurate segmentation of brain tumors is essential for clinical diagnosis and treatment planning. Deep learning is currently the state-of-the-art for brain tumor segmentation, yet it requires either large datasets or extensive computational resources that are inaccessible in most areas. This makes the problem increasingly difficult: state-of-the-art models use thousands of training cases and vast computational power, where performance drops sharply when either is limited. The top performer in the Brats GLI 2023 competition relied on supercomputers trained on over 92,000 augmented MRI scans using an AMD EPYC 7402 CPU, six NVIDIA RTX 6000 GPUs (48GB VRAM each), and 1024GB of RAM over multiple weeks. To address this, the Karhunen–Loève Expansion (KLE) was implemented as a feature extraction step on downsampled, z-score normalized MRI volumes. Each 240 \times 240 \times 155 multi-modal scan is reduced to four 48^3 channels and compressed into 32 KL coefficients. The resulting approximate reconstruction enables a residual-based anomaly map, which is upsampled and added as a fifth channel to a compact 3D U-Net. All experiments were run on a consumer workstation (AMD Ryzen 5 7600X CPU, RTX 4060Ti (8GB VRAM), and 64GB RAM while using far fewer training cases. This model achieves post-processed Dice scores of 0.929 (WT), 0.856 (TC), and 0.821 (ET), with HD95 distances of 2.93, 6.78, and 10.35 voxels. These results are significantly better than the winning BraTS 2023 methodology for HD95 distances and WT dice scores. This demonstrates that a KLE-based residual anomaly map can dramatically reduce computational cost and data requirements while retaining state-of-the-art performance.
zh
[CV-300] Anisotropic Tensor Deconvolution of Hyperspectral Images ICASSP2026
【速读】:该论文致力于解决高光谱图像(Hyperspectral Image, HSI)去卷积这一困难的不适定逆问题,其核心挑战在于数据维度高、变量规模庞大且存在严重病态性。解决方案的关键在于提出一种参数稀疏的低秩Canonical Polyadic Decomposition (CPD)框架,将原始需恢复的 P×Q×N 大规模HSI张量 X 转化为仅需估计 (P+Q+N)R 个参数的因子表示,从而实现数量级的参数压缩;同时引入结构感知的各向异性总变差(Anisotropic Total Variation, TV)正则化项,仅作用于空间因子以保持光谱平滑性,并基于Proximal Alternating Linearized Minimization (PALM)算法求解非凸优化问题,最终在模型紧凑性和重建精度之间取得优异平衡。
链接: https://arxiv.org/abs/2601.11694
作者: Xinjue Wang,Xiuheng Wang,Esa Ollila,Sergiy A. Vorobyov
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: To appear in ICASSP 2026
Abstract:Hyperspectral image (HSI) deconvolution is a challenging ill-posed inverse problem, made difficult by the data’s high this http URL propose a parameter-parsimonious framework based on a low-rank Canonical Polyadic Decomposition (CPD) of the entire latent HSI \mathbf\mathcalX \in \mathbbR^P\times Q \times N .This approach recasts the problem from recovering a large-scale image with PQN variables to estimating the CPD factors with (P+Q+N)R this http URL model also enables a structure-aware, anisotropic Total Variation (TV) regularization applied only to the spatial factors, preserving the smooth spectral this http URL efficient algorithm based on the Proximal Alternating Linearized Minimization (PALM) framework is developed to solve the resulting non-convex optimization this http URL confirm the model’s efficiency, showing a numerous parameter reduction of over two orders of magnitude and a compelling trade-off between model compactness and reconstruction accuracy.
zh
[CV-301] Bridging Modalities: Joint Synthesis and Registration Framework for Aligning Diffusion MRI with T1-Weighted Images
【速读】:该论文旨在解决弥散磁共振成像(dMRI)与T1加权磁共振成像(T1w MRI)之间的多模态图像配准问题,该问题因扩散数据与高分辨率解剖结构间存在显著强度差异而难以保证配准精度。解决方案的关键在于提出一种基于生成式配准网络的无监督框架,通过图像合成模型将b0图像转换为具有T1w对比度的伪图像,从而将原始多模态配准任务转化为生成图像与真实T1w图像之间的单模态配准任务,有效降低了跨模态配准的复杂性;同时,该框架联合优化局部结构相似性和跨模态统计依赖性,以提升形变场估计的准确性。
链接: https://arxiv.org/abs/2601.11689
作者: Xiaofan Wang,Junyi Wang,Yuqian Chen,Lauren J. O’ Donnell,Fan Zhang
机构: University of Electronic Science and Technology of China (电子科技大学); Brigham and Women’s Hospital (布莱根妇女医院); Harvard Medical School (哈佛医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal image registration between diffusion MRI (dMRI) and T1-weighted (T1w) MRI images is a critical step for aligning diffusion-weighted imaging (DWI) data with structural anatomical space. Traditional registration methods often struggle to ensure accuracy due to the large intensity differences between diffusion data and high-resolution anatomical structures. This paper proposes an unsupervised registration framework based on a generative registration network, which transforms the original multimodal registration problem between b0 and T1w images into a unimodal registration task between a generated image and the real T1w image. This effectively reduces the complexity of cross-modal registration. The framework first employs an image synthesis model to generate images with T1w-like contrast, and then learns a deformation field from the generated image to the fixed T1w image. The registration network jointly optimizes local structural similarity and cross-modal statistical dependency to improve deformation estimation accuracy. Experiments conducted on two independent datasets demonstrate that the proposed method outperforms several state-of-the-art approaches in multimodal registration tasks.
zh
[CV-302] owards Efficient Image Deblurring for Edge Deployment
【速读】:该论文旨在解决图像去模糊(Image Deblurring)在边缘设备上部署时面临的效率与精度难以平衡的问题,尤其针对现有深度模型虽在准确率上达到先进水平(SOTA),但其效率评估指标(如FLOPs或参数量)与实际硬件延迟无关的缺陷。解决方案的关键在于提出一种面向硬件感知的模型适配框架,通过敏感性引导的模块替换(sensitivity-guided block substitution)、代理蒸馏(surrogate distillation)以及基于设备性能分析的无训练多目标搜索(training-free multi-objective search),实现对基准模型(如36层NAFNet)的结构重构,在显著降低GMACs(达55%)的同时保持竞争性准确率,并在真实设备上实现1.25倍的延迟优化。该方法建立了从算法设计到部署就绪模型之间的闭环反馈机制,为高效图像恢复提供了可扩展的工程实践路径。
链接: https://arxiv.org/abs/2601.11685
作者: Srinivas Miriyala,Sowmya Vajrala,Sravanth Kodavanti
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image deblurring is a critical stage in mobile image signal processing pipelines, where the ability to restore fine structures and textures must be balanced with real-time constraints on edge devices. While recent deep networks such as transformers and activation-free architectures achieve state-of-the-art (SOTA) accuracy, their efficiency is typically measured in FLOPs or parameters, which do not correlate with latency on embedded hardware. We propose a hardware-aware adaptation framework that restructures existing models through sensitivity-guided block substitution, surrogate distillation, and training-free multi-objective search driven by device profiling. Applied to the 36-block NAFNet baseline, the optimized variants achieve up to 55% reduction in GMACs compared to the recent transformer-based SOTA while maintaining competitive accuracy. Most importantly, on-device deployment yields a 1.25X latency improvement over the baseline. Experiments on motion deblurring (GoPro), defocus deblurring (DPDD), and auxiliary benchmarks (RealBlur-J/R, HIDE) demonstrate the generality of the approach, while comparisons with prior efficient baselines confirm its accuracy-efficiency trade-off. These results establish feedback-driven adaptation as a principled strategy for bridging the gap between algorithmic design and deployment-ready deblurring models.
zh
[CV-303] Mobile-friendly Image de-noising: Hardware Conscious Optimization for Edge Application ICASSP2025
【速读】:该论文旨在解决图像增强任务中噪声干扰导致传统图像信号处理(Image Signal Processing, ISP)方法效果受限的问题,并进一步提升深度学习模型在边缘设备(如智能手机)上的部署效率。其解决方案的关键在于提出一种面向移动设备的新型去噪网络,该网络基于熵正则化的可微神经架构搜索(Entropy-Regularized differentiable Neural Architecture Search, NAS),在硬件感知的搜索空间中对U-Net结构进行优化设计,从而在保持较高图像质量的同时显著降低参数量、推理延迟和内存占用,实现了性能与效率的平衡。
链接: https://arxiv.org/abs/2601.11684
作者: Srinivas Miriyala,Sowmya Vajrala,Hitesh Kumar,Sravanth Kodavanti,Vikram Rajendiran
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICASSP 2025
Abstract:Image enhancement is a critical task in computer vision and photography that is often entangled with noise. This renders the traditional Image Signal Processing (ISP) ineffective compared to the advances in deep learning. However, the success of such methods is increasingly associated with the ease of their deployment on edge devices, such as smartphones. This work presents a novel mobile-friendly network for image de-noising obtained with Entropy-Regularized differentiable Neural Architecture Search (NAS) on a hardware-aware search space for a U-Net architecture, which is first-of-its-kind. The designed model has 12% less parameters, with ~2-fold improvement in ondevice latency and 1.5-fold improvement in the memory footprint for a 0.7% drop in PSNR, when deployed and profiled on Samsung Galaxy S24 Ultra. Compared to the SOTA Swin-Transformer for Image Restoration, the proposed network had competitive accuracy with ~18-fold reduction in GMACs. Further, the network was tested successfully for Gaussian de-noising with 3 intensities on 4 benchmarks and real-world de-noising on 1 benchmark demonstrating its generalization ability.
zh
[CV-304] FourierPET: Deep Fourier-based Unrolled Network for Low-count PET Reconstruction AAAI2026
【速读】:该论文旨在解决低计数正电子发射断层成像(PET)重建中的严重退化问题,主要包括泊松噪声、光子稀缺性以及衰减校正误差带来的混叠伪影。现有深度学习方法通常在空间域进行统一优化,难以分离和纠正不同类型的退化因素。其解决方案的关键在于通过傅里叶域分析发现这些退化在频域具有可分离性:泊松噪声和光子稀缺导致高频相位扰动,而衰减误差则抑制低频幅度成分;基于此洞察,作者提出 FourierPET 框架,采用基于交替方向乘子法(ADMM)的傅里叶域迭代重建策略,包含三个定制模块——频谱一致性模块用于保持全局频率对齐以维持数据保真度,幅相校正模块用于解耦并补偿高频相位畸变与低频幅度抑制,以及双调整模块加速收敛。该方法在参数量显著减少的同时实现最先进的重建性能,并具备频域感知的可解释性。
链接: https://arxiv.org/abs/2601.11680
作者: Zheng Zhang,Hao Tang,Yingying Hu,Zhanli Hu,Jing Qin
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for oral presentation at AAAI 2026
Abstract:Low-count positron emission tomography (PET) reconstruction is a challenging inverse problem due to severe degradations arising from Poisson noise, photon scarcity, and attenuation correction errors. Existing deep learning methods typically address these in the spatial domain with an undifferentiated optimization objective, making it difficult to disentangle overlapping artifacts and limiting correction effectiveness. In this work, we perform a Fourier-domain analysis and reveal that these degradations are spectrally separable: Poisson noise and photon scarcity cause high-frequency phase perturbations, while attenuation errors suppress low-frequency amplitude components. Leveraging this insight, we propose FourierPET, a Fourier-based unrolled reconstruction framework grounded in the Alternating Direction Method of Multipliers. It consists of three tailored modules: a spectral consistency module that enforces global frequency alignment to maintain data fidelity, an amplitude-phase correction module that decouples and compensates for high-frequency phase distortions and low-frequency amplitude suppression, and a dual adjustment module that accelerates convergence during iterative reconstruction. Extensive experiments demonstrate that FourierPET achieves state-of-the-art performance with significantly fewer parameters, while offering enhanced interpretability through frequency-aware correction.
zh
[CV-305] Pigment Network Detection and Classification in Dermoscopic Images Using Directional Imaging Algorithms and Convolutional Neural Networks
【速读】:该论文旨在解决黑色素瘤(melanoma)早期诊断中难以准确识别异常色素网络(unusual pigment network, PN)的问题,尤其是区分典型(typical)与非典型(atypical)PN的挑战。其解决方案的关键在于构建一个结合方向成像算法与轻量级卷积神经网络(Convolutional Neural Network, CNN)的自动化流程:首先利用主成分分析(Principal Component Analysis, PCA)、对比度增强、滤波和去噪等步骤提取高质量PN图像;随后基于新构建的小规模PN数据集(200张图像),设计了一个包含两个卷积层和两个批量归一化层的简单但高效的CNN模型,实现了90%的准确率、90%的敏感性和89%的特异性,显著优于现有方法,验证了该模型在PN分类中的有效性。
链接: https://arxiv.org/abs/2601.11674
作者: M. A. Rasel,Sameem Abdul Kareem,Unaizah Obaidellah
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Early diagnosis of melanoma, which can save thousands of lives, relies heavily on the analysis of dermoscopic images. One crucial diagnostic criterion is the identification of unusual pigment network (PN). However, distinguishing between regular (typical) and irregular (atypical) PN is challenging. This study aims to automate the PN detection process using a directional imaging algorithm and classify PN types using machine learning classifiers. The directional imaging algorithm incorporates Principal Component Analysis (PCA), contrast enhancement, filtering, and noise reduction. Applied to the PH2 dataset, this algorithm achieved a 96% success rate, which increased to 100% after pixel intensity adjustments. We created a new dataset containing only PN images from these results. We then employed two classifiers, Convolutional Neural Network (CNN) and Bag of Features (BoF), to categorize PN into atypical and typical classes. Given the limited dataset of 200 images, a simple and effective CNN was designed, featuring two convolutional layers and two batch normalization layers. The proposed CNN achieved 90% accuracy, 90% sensitivity, and 89% specificity. When compared to state-of-the-art methods, our CNN demonstrated superior performance. Our study highlights the potential of the proposed CNN model for effective PN classification, suggesting future research should focus on expanding datasets and incorporating additional dermatological features to further enhance melanoma diagnosis.
zh
人工智能
[AI-0] Q-learning with Adjoint Matching
【速读】:该论文旨在解决连续动作强化学习(Continuous-action Reinforcement Learning)中一个长期存在的挑战:如何高效优化具有表达能力的扩散模型或流匹配(flow-matching)策略,以适应参数化的Q函数。传统方法在利用critic的梯度信息时面临数值不稳定性问题,因为直接对多步去噪过程进行反向传播会导致梯度爆炸或消失;现有方案要么忽略梯度信息仅使用价值估计,要么依赖近似方法牺牲策略表达性或引入偏差。本文提出的Q-learning with Adjoint Matching (QAM) 算法的关键在于引入邻接匹配(adjoint matching)技术——该技术将critic的动作梯度转化为一种分步的目标函数,避免了不稳定反向传播,同时在最优解处提供无偏且高表达性的策略。结合时序差分(Temporal-Difference, TD)更新机制用于critic学习,QAM在离线与离线到在线强化学习任务中均显著优于先前方法。
链接: https://arxiv.org/abs/2601.14234
作者: Qiyang Li,Sergey Levine
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
备注: 32 pages, 8 figures, 7 tables
Abstract:We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic’s action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
zh
[AI-1] Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance
【速读】:该论文旨在解决生成式 AI 在撰写反驳信(rebuttal)过程中存在的三大问题:幻觉(hallucination)、忽略审稿意见以及缺乏可验证的证据支撑。现有方法将反驳生成视为直接文本生成任务,导致回应内容与审稿人意图不一致且难以追溯依据。解决方案的关键在于提出首个多智能体框架 RebuttalAgent,其核心创新是将反驳生成重构为以证据为中心的规划任务:首先将复杂反馈分解为原子级关注点,再通过融合压缩摘要与高保真原文,并结合自主调用的外部文献搜索模块来处理需外部知识支持的问题;同时,在生成响应前构建可检查的回应计划,确保每项论点均明确锚定于内部或外部证据,从而提升反驳的覆盖度、忠实度和策略一致性。
链接: https://arxiv.org/abs/2601.14171
作者: Qianli Ma,Chang Guo,Zhiheng Tian,Siyu Wang,Jipeng Xiao,Yuanhao Yue,Zhipeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Writing effective rebuttals is a high-stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct-to-text generation problem, suffering from hallucination, overlooked critiques, and a lack of verifiable grounding. To address these limitations, we introduce \textbfRebuttalAgent , the first multi-agents framework that reframes rebuttal generation as an evidence-centric planning task. Our system decomposes complex feedback into atomic concerns and dynamically constructs hybrid contexts by synthesizing compressed summaries with high-fidelity text while integrating an autonomous and on-demand external search module to resolve concerns requiring outside literature. By generating an inspectable response plan before drafting, \textbfRebuttalAgent ensures that every argument is explicitly anchored in internal or external evidence. We validate our approach on the proposed \textbfRebuttalBench and demonstrate that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Code will be released.
zh
[AI-2] ConceptCaps – a Distilled Concept Dataset for Interpretability in Music Models
【速读】:该论文旨在解决当前音乐数据集在概念可解释性分析中缺乏清晰、分离的正负样本标注问题,这限制了如概念激活向量(TCAV)等方法在音乐领域的应用。现有音乐标签普遍存在稀疏性、噪声和定义模糊等问题,难以支撑精准的概念建模。其解决方案的关键在于构建一个结构化的音乐-文本-音频三元组数据集 ConceptCaps,包含 23,000 个带显式属性标签的样本,基于 200 个语义属性分类体系;并采用模块化流水线设计:通过变分自编码器(VAE)学习属性共现模式,利用微调的大语言模型(LLM)生成专业级文本描述,再由 MusicGen 合成对应音频,从而实现语义建模与文本生成的解耦,显著提升生成内容的一致性和可控性,优于端到端方法。
链接: https://arxiv.org/abs/2601.14157
作者: Bruno Sienkiewicz,Łukasz Neumann,Mateusz Modrzejewski
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
zh
[AI-3] Riemannian Liquid Spatio-Temporal Graph Network
【速读】:该论文旨在解决液态时间常数网络(Liquid Time-Constant networks, LTCs)在处理具有非欧几里得结构(如层次关系和环状结构)的真实世界图数据时,因受限于欧几里得空间而引入显著几何失真的问题。解决方案的关键在于提出黎曼流体时空图网络(Riemannian Liquid Spatio-Temporal Graph Network, RLSTG),该框架将连续时间液态动力学与黎曼流形的几何归纳偏置相统一,通过直接在曲面上定义常微分方程(ODE)来建模图演化过程,从而精确捕捉静态与动态时空图的内在几何特性,并提供了理论保障以证明其稳定性和表达能力。
链接: https://arxiv.org/abs/2601.14115
作者: Liangsi Lu,Jingchao Wang,Zhaorong Dai,Hanqian Liu,Yang Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to The Web Conference 2026
Abstract:Liquid Time-Constant networks (LTCs), a type of continuous-time graph neural network, excel at modeling irregularly-sampled dynamics but are fundamentally confined to Euclidean space. This limitation introduces significant geometric distortion when representing real-world graphs with inherent non-Euclidean structures (e.g., hierarchies and cycles), degrading representation quality. To overcome this limitation, we introduce the Riemannian Liquid Spatio-Temporal Graph Network (RLSTG), a framework that unifies continuous-time liquid dynamics with the geometric inductive biases of Riemannian manifolds. RLSTG models graph evolution through an Ordinary Differential Equation (ODE) formulated directly on a curved manifold, enabling it to faithfully capture the intrinsic geometry of both structurally static and dynamic spatio-temporal graphs. Moreover, we provide rigorous theoretical guarantees for RLSTG, extending stability theorems of LTCs to the Riemannian domain and quantifying its expressive power via state trajectory analysis. Extensive experiments on real-world benchmarks demonstrate that, by combining advanced temporal dynamics with a Riemannian spatial representation, RLSTG achieves superior performance on graphs with complex structures. Project Page: this https URL
zh
[AI-4] Causal feature selection framework for stable soft sensor modeling based on time-delayed cross mapping
【速读】:该论文旨在解决工业过程中软传感器建模中因果特征选择方法存在的两大局限性:一是现有方法忽略变量间因果关系常伴随时间延迟,而多数方法仅在相同时间维度下分析因果关系;二是工业过程变量普遍存在相互依赖性,违背了传统因果推断方法对变量独立性的假设。为应对这些问题,论文提出基于时滞交叉映射(time-delayed cross mapping)的因果特征选择框架,其关键在于利用状态空间重构有效处理变量间的互依赖关系,并考虑因果强度随时间延迟变化的特性。具体而言,引入时滞收敛交叉映射(TDCCM)实现总体因果推断,开发时滞部分交叉映射(TDPCM)进行直接因果推断,并设计基于验证集性能自动确定因果阈值的客观特征选择策略,从而提升软传感器模型的精度与稳定性。
链接: https://arxiv.org/abs/2601.14099
作者: Shi-Shun Chen,Xiao-Yang Li,Enrico Zio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Soft sensor modeling plays a crucial role in process monitoring. Causal feature selection can enhance the performance of soft sensor models in industrial applications. However, existing methods ignore two critical characteristics of industrial processes. Firstly, causal relationships between variables always involve time delays, whereas most causal feature selection methods investigate causal relationships in the same time dimension. Secondly, variables in industrial processes are often interdependent, which contradicts the decorrelation assumption of traditional causal inference methods. Consequently, soft sensor models based on existing causal feature selection approaches often lack sufficient accuracy and stability. To overcome these challenges, this paper proposes a causal feature selection framework based on time-delayed cross mapping. Time-delayed cross mapping employs state space reconstruction to effectively handle interdependent variables in causality analysis, and considers varying causal strength across time delay. Time-delayed convergent cross mapping (TDCCM) is introduced for total causal inference, and time-delayed partial cross mapping (TDPCM) is developed for direct causal inference. Then, in order to achieve automatic feature selection, an objective feature selection strategy is presented. The causal threshold is automatically determined based on the model performance on the validation set, and the causal features are then selected. Two real-world case studies show that TDCCM achieves the highest average performance, while TDPCM improves soft sensor stability and performance in the worst scenario. The code is publicly available at this https URL.
zh
[AI-5] Remapping and navigation of an embedding space via error minimization: a fundamental organizational principle of cognition in natural and artificial systems
【速读】:该论文试图解决如何在不同起源、组成和载体的智能体(如细胞到生物群体,以及演化、工程和嵌合系统)中识别共通的认知机制问题。其核心挑战在于揭示跨尺度、跨模态的决策原理是否具有不变性。解决方案的关键在于提出一个双元不变性框架:一是嵌入空间(embedding space)的重映射(remapping),二是对这些空间的导航(navigation)。无论是生物集体通过转录、形态或生理空间的重构维持稳态,还是现代人工智能模型(如Transformer、扩散模型等)通过潜在空间迭代优化实现上下文感知,均体现出这一机制——即通过迭代误差最小化实现嵌入空间的重映射与导航。这构成了跨物质基础的认知不变性,为理解自然与人工系统的智能提供了统一范式。
链接: https://arxiv.org/abs/2601.14096
作者: Benedikt Hartl,Léo Pio-Lopez,Chris Fields,Michael Levin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 41 pages, 5 figures
Abstract:The emerging field of diverse intelligence seeks an integrated view of problem-solving in agents of very different provenance, composition, and substrates. From subcellular chemical networks to swarms of organisms, and across evolved, engineered, and chimeric systems, it is hypothesized that scale-invariant principles of decision-making can be discovered. We propose that cognition in both natural and synthetic systems can be characterized and understood by the interplay between two equally important invariants: (1) the remapping of embedding spaces, and (2) the navigation within these spaces. Biological collectives, from single cells to entire organisms (and beyond), remap transcriptional, morphological, physiological, or 3D spaces to maintain homeostasis and regenerate structure, while navigating these spaces through distributed error correction. Modern Artificial Intelligence (AI) systems, including transformers, diffusion models, and neural cellular automata enact analogous processes by remapping data into latent embeddings and refining them iteratively through contextualization. We argue that this dual principle - remapping and navigation of embedding spaces via iterative error minimization - constitutes a substrate-independent invariant of cognition. Recognizing this shared mechanism not only illuminates deep parallels between living systems and artificial models, but also provides a unifying framework for engineering adaptive intelligence across scales.
zh
[AI-6] Zero-shot adaptable task planning for autonomous construction robots: a comparative study of lightweight single and multi-AI agent systems
【速读】:该论文旨在解决建筑机器人在动态任务中适应性差与泛化能力不足的问题,同时应对传统方法成本高昂的挑战。其解决方案的关键在于利用轻量级开源大语言模型(LLM)和视觉语言模型(VLM)构建多智能体协作系统,通过设计单智能体与三至四智能体团队来生成机器人动作规划,从而显著提升任务规划的灵活性与通用性;实验表明,四智能体团队在多数指标上优于当前最先进的GPT-4o模型,且成本仅为后者的十分之一,验证了多智能体架构在复杂、非结构化环境中的有效性。
链接: https://arxiv.org/abs/2601.14091
作者: Hossein Naderi,Alireza Shojaei,Lifu Huang,Philip Agee,Kereshmeh Afsari,Abiola Akanmu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Robots are expected to play a major role in the future construction industry but face challenges due to high costs and difficulty adapting to dynamic tasks. This study explores the potential of foundation models to enhance the adaptability and generalizability of task planning in construction robots. Four models are proposed and implemented using lightweight, open-source large language models (LLMs) and vision language models (VLMs). These models include one single agent and three multi-agent teams that collaborate to create robot action plans. The models are evaluated across three construction roles: Painter, Safety Inspector, and Floor Tiling. Results show that the four-agent team outperforms the state-of-the-art GPT-4o in most metrics while being ten times more cost-effective. Additionally, teams with three and four agents demonstrate the improved generalizability. By discussing how agent behaviors influence outputs, this study enhances the understanding of AI teams and supports future research in diverse unstructured environments beyond construction.
zh
[AI-7] 1-bit Count-based Sorting Unit to Reduce Link Power in DNN Accelerators
【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)加速器中互连功耗(interconnect power consumption)瓶颈问题。现有方法通过基于“1”比特计数对数据重新排序以降低开关活动,从而减少功耗,但其硬件排序实现尚未得到充分研究。论文提出了一种面向卷积神经网络(Convolutional Neural Networks, CNN)的无比较排序单元(comparison-free sorting unit)硬件实现方案,其关键在于引入近似计算(approximate computing),将种群计数分组到粗粒度桶(coarse-grained buckets)中,在显著降低硬件面积(最多35.4%)的同时,仍保持与精确排序相当的功耗优势(19.50% BT reduction,对比精确实现的20.42%)。
链接: https://arxiv.org/abs/2601.14087
作者: Ruichi Han,Yizhi Chen,Tong Lei,Jordi Altayo Gonzalez,Ahmed Hemani(Department of Electronics and Embedded Systems, KTH Royal Institute of Technology, Stockholm, Sweden)
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for oral presentation at the 2026 VLSI Symposium on Technology, Systems and Applications (VLSI TSA) on April 13-17, 2026, at the Ambassador Hotel, Hsinchu, Taiwan
Abstract:Interconnect power consumption remains a bottleneck in Deep Neural Network (DNN) accelerators. While ordering data based on ‘1’-bit counts can mitigate this via reduced switching activity, practical hardware sorting implementations remain underexplored. This work proposes the hardware implementation of a comparison-free sorting unit optimized for Convolutional Neural Networks (CNN). By leveraging approximate computing to group population counts into coarse-grained buckets, our design achieves hardware area reductions while preserving the link power benefits of data reordering. Our approximate sorting unit achieves up to 35.4% area reduction while maintaining 19.50% BT reduction compared to 20.42% of precise implementation.
zh
[AI-8] Collective intelligence in science: direct elicitation of diverse information from experts with unknown information structure
【速读】:该论文试图解决在开放科学问题中,如何高效聚合一群互不相关且拥有分散私有信息的专家群体的知识,从而形成对复杂科学假说的深度集体分析。其核心挑战在于:专家之间缺乏初始协作基础,无法进行复杂的贝叶斯推理,且无法直接验证真实情况(ground truth)。解决方案的关键在于构建一个基于“玩币预测市场”(play-money prediction market)与聊天系统耦合的机制,使得参与者能够在无需了解彼此背景的情况下,通过聊天直接共享私有信息,并在市场中进行交易,仿佛市场已根据假设的真实性被结算。这种机制可自然引导系统达到均衡状态,实现信息的有效聚合和高度可解释的集体决策,同时通过将玩币奖励转换为真实资产,为大规模协作研究提供创新性的资金激励模式。
链接: https://arxiv.org/abs/2601.14047
作者: Alexey V. Osipov,Nikolay N. Osipov
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Theoretical Economics (econ.TH)
备注: 21 pages
Abstract:Suppose we need a deep collective analysis of an open scientific problem: there is a complex scientific hypothesis and a large online group of mutually unrelated experts with relevant private information of a diverse and unpredictable nature. This information may be results of experts’ individual experiments, original reasoning of some of them, results of AI systems they use, etc. We propose a simple mechanism based on a self-resolving play-money prediction market entangled with a chat. We show that such a system can easily be brought to an equilibrium where participants directly share their private information on the hypothesis through the chat and trade as if the market were resolved in accordance with the truth of the hypothesis. This approach will lead to efficient aggregation of relevant information in a completely interpretable form even if the ground truth cannot be established and experts initially know nothing about each other and cannot perform complex Bayesian calculations. Finally, by rewarding the experts with some real assets proportionally to the play money they end up with, we can get an innovative way to fund large-scale collaborative studies of any type.
zh
[AI-9] Numina-Lean-Agent : An Open and General Agent ic Reasoning System for Formal Mathematics
【速读】:该论文旨在解决当前形式化定理证明中依赖任务特定流水线和训练过的形式化证明器所导致的灵活性不足与可复现性差的问题。其解决方案的关键在于提出一种新范式:直接使用通用编码代理(general coding agent)作为形式数学推理工具,通过引入MCP(Model Context Protocol)实现对Lean等专用工具的灵活调用与自主交互,从而无需重新训练即可提升性能,并支持多样化的推理任务。该方法在Putnam 2025竞赛中全部正确解答12道题目,达到最优闭源系统水平,并成功协助数学家形式化Brascamp-Lieb定理,验证了其通用性和实用性。
链接: https://arxiv.org/abs/2601.14027
作者: Junqi Liu,Zihao Zhou,Zekai Zhu,Marco Dos Santos,Weikun He,Jiawei Liu,Ran Wang,Yunzhou Xie,Junqiao Zhao,Qiufeng Wang,Lihong Zhi,Jia Li,Wenda Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic systems have recently become the dominant paradigm for formal theorem proving, achieving strong performance by coordinating multiple models and tools. However, existing approaches often rely on task-specific pipelines and trained formal provers, limiting their flexibility and reproducibility. In this paper, we propose the paradigm that directly uses a general coding agent as a formal math reasoner. This paradigm is motivated by (1) A general coding agent provides a natural interface for diverse reasoning tasks beyond proving, (2) Performance can be improved by simply replacing the underlying base model, without training, and (3) MCP enables flexible extension and autonomous calling of specialized tools, avoiding complex design. Based on this paradigm, we introduce Numina-Lean-Agent, which combines Claude Code with Numina-Lean-MCP to enable autonomous interaction with Lean, retrieval of relevant theorems, informal proving and auxiliary reasoning tools. Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all problems in Putnam 2025 (12 / 12), matching the best closed-source system. Beyond benchmark evaluation, we further demonstrate its generality by interacting with mathematicians to successfully formalize the Brascamp-Lieb theorem. We release Numina-Lean-Agent and all solutions at this https URL.
zh
[AI-10] Credible CO2 Comparisons: A Machine Learning Approach to Vehicle Powertrain Assessment
【速读】:该论文旨在解决道路运输脱碳过程中,如何在相同真实驾驶条件下对内燃机汽车(ICEVs)与电动汽车(EVs)的二氧化碳(CO₂)排放进行一致且透明比较的问题。其核心挑战在于传统评估方法难以隔离车辆动力系统差异,从而导致不同技术之间的排放对比缺乏公平性和可重复性。解决方案的关键在于构建一个基于循环神经网络(Recurrent Neural Network, RNN)的机器学习框架,该框架通过固定观测到的速度轨迹和环境背景变量(如温度、加速度等),独立训练每个车辆类型的模型以学习从驾驶条件到瞬时扭矩/油门控制及CO₂等效排放率的映射关系。由此可生成反事实场景——即模拟某类车辆沿另一类车辆的实际行驶路径运行时的排放表现,从而实现基于统一瞬时排放指标的公平、可复现的动力系统性能评估。
链接: https://arxiv.org/abs/2601.14022
作者: Rodrigo Pereira David,Luciano Araujo Dourado Filho,Daniel Marques da Silva,João Alfredo Cal-Braz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Decarbonizing road transport requires consistent and transparent methods for comparing CO2 emissions across vehicle technologies. This paper proposes a machine learning-based framework for like-for-like operational assessment of internal combustion engine vehicles (ICEVs) and electric vehicles (EVs) under identical, real-world driving conditions. The approach isolates technology-specific effects by holding the observed speed profile and environmental context fixed, enabling direct comparison of powertrain performance. Recurrent neural network models are trained independently for each domain to learn the mapping from contextual driving variables (speed, acceleration, temperature) to internal actuation variables (torque, throttle) and instantaneous CO2-equivalent emission rates. This structure allows the construction of counterfactual scenarios that answer: What emissions would an EV have generated if it had followed the same driving profile as an ICEV? By aligning both vehicle types on a unified instantaneous emissions metric, the framework enables fair and reproducible evaluation of powertrain technologies. It offers a scalable foundation for credible, data-driven assessments of vehicle carbon performance under real-world operating conditions.
zh
[AI-11] orch-sla: Differentiable Sparse Linear Algebra with Adjoint Solvers and Sparse Tensor Parallelism for PyTorch
【速读】:该论文旨在解决工业科学计算中稀疏矩阵(Sparse Matrix)在GPU加速、多GPU扩展性以及可微分计算方面的关键挑战。其解决方案的核心在于提出一个名为\torchsla的开源PyTorch库,实现了三方面突破:(1)通过GPU加速稀疏线性求解、非线性求解(如牛顿法、皮卡ard法和Anderson外推法)及特征值计算;(2)基于域分解与边界交换(halo exchange)实现多GPU扩展,在3个GPU上达到4亿自由度(DOF)线性求解性能;(3)采用伴随法(adjoint-based differentiation)实现计算图节点复杂度为O(1)(对自动微分而言),内存占用为O(nnz)(与求解迭代次数无关)。该库支持多种后端(如SciPy、cuDSS、PyTorch原生),并无缝集成PyTorch自动微分机制,从而支持端到端可微分仿真。
链接: https://arxiv.org/abs/2601.13994
作者: Mingyuan Chi
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Industrial scientific computing predominantly uses sparse matrices to represent unstructured data – finite element meshes, graphs, point clouds. We present \torchsla, an open-source PyTorch library that enables GPU-accelerated, scalable, and differentiable sparse linear algebra. The library addresses three fundamental challenges: (1) GPU acceleration for sparse linear solves, nonlinear solves (Newton, Picard, Anderson), and eigenvalue computation; (2) Multi-GPU scaling via domain decomposition with halo exchange, reaching \textbf400 million DOF linear solve on 3 GPUs; and (3) Adjoint-based differentiation achieving \mathcalO(1) computational graph nodes (for autograd) and \mathcalO(\textnnz) memory – independent of solver iterations. \torchsla supports multiple backends (SciPy, cuDSS, PyTorch-native) and seamlessly integrates with PyTorch autograd for end-to-end differentiable simulations. Code is available at this https URL.
zh
[AI-12] Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval
【速读】:该论文旨在解决从知识图谱(Knowledge Graph, KG)中检索支持语言模型查询的证据时,如何平衡搜索广度与深度的问题。传统基于相似性的检索方法覆盖范围广但深度不足,而基于路径遍历的方法依赖于种子节点选择和预设跳数,难以应对多实体、多关系的复杂查询。解决方案的关键在于提出一种名为ARK(Adaptive Retriever of Knowledge)的代理式KG检索器,其通过两个核心操作工具集实现动态权衡:全局词汇搜索(对节点描述符进行语义匹配)与单跳邻域探索(组成多跳遍历)。ARK无需依赖脆弱的种子节点或固定跳数,而是根据查询特征自适应地切换工具使用策略——语言密集型查询采用全局搜索,关系密集型查询则转向邻域扩展。实验表明,ARK在STaRK基准上显著优于现有检索和代理方法,并通过无标签模仿学习将工具轨迹蒸馏至8B规模模型,在多个数据集上大幅提升检索精度,同时保持接近教师模型的性能表现。
链接: https://arxiv.org/abs/2601.13969
作者: Joaquín Polonuer(1,2),Lucas Vittor(1),Iñaki Arango(1),Ayush Noori(1,3),David A. Clifton(3,4),Luciano Del Corro(5,6),Marinka Zitnik(1,7,8,9) ((1) Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA, (2) Departamento de Computación, FCEyN, Universidad de Buenos Aires, Buenos Aires, Argentina, (3) Department of Engineering Science, University of Oxford, Oxford, UK, (4) Oxford Suzhou Centre for Advanced Research, University of Oxford, Suzhou, Jiangsu, China, (5) ELIAS Lab, Departamento de Ingeniería, Universidad de San Andrés, Victoria, Argentina, (6) Lumina Labs, Buenos Aires, Argentina, (7) Kempner Institute for the Study of Natural and Artificial Intelligence, Allston, MA, USA, (8) Broad Institute of MIT and Harvard, Cambridge, MA, USA, (9) Harvard Data Science Initiative, Cambridge, MA, USA)
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to follow relational links. Similarity-based retrievers provide coverage but remain shallow, whereas traversal-based methods rely on selecting seed nodes to start exploration, which can fail when queries span multiple entities and relations. We introduce ARK: Adaptive Retriever of Knowledge, an agentic KG retriever that gives a language model control over this breadth-depth tradeoff using a two-operation toolset: global lexical search over node descriptors and one-hop neighborhood exploration that composes into multi-hop traversal. ARK alternates between breadth-oriented discovery and depth-oriented expansion without depending on a fragile seed selection, a pre-set hop depth, or requiring retrieval training. ARK adapts tool use to queries, using global search for language-heavy queries and neighborhood exploration for relation-heavy queries. On STaRK, ARK reaches 59.1% average Hit@1 and 67.4 average MRR, improving average Hit@1 by up to 31.4% and average MRR by up to 28.0% over retrieval-based and agentic training-free methods. Finally, we distill ARK’s tool-use trajectories from a large teacher into an 8B model via label-free imitation, improving Hit@1 by +7.0, +26.6, and +13.5 absolute points over the base 8B model on AMAZON, MAG, and PRIME datasets, respectively, while retaining up to 98.5% of the teacher’s Hit@1 rate.
zh
[AI-13] RL-BioAug: Label-Efficient Reinforcement Learning for Self-Supervised EEG Representation Learning
【速读】:该论文旨在解决对比学习(contrastive learning)在脑电图(EEG)任务中因数据增强(data augmentation)质量不足而导致性能受限的问题。由于EEG信号具有非平稳性(non-stationarity),即其统计特性随时间变化,传统的静态或随机增强策略难以保留信号的内在信息,从而影响模型表征学习的效果。解决方案的关键在于提出一种基于标签高效强化学习(label-efficient reinforcement learning, RL)的框架RL-BioAug,该框架通过一个轻量级RL代理(agent)自主学习最优的数据增强策略,仅需10%的标注数据引导即可实现严格的自监督训练。实验表明,该方法在Sleep-EDFX和CHB-MIT数据集上分别提升了9.69%和8.80%的Macro-F1分数,并发现代理能针对不同任务选择特定增强策略(如睡眠分期任务中偏好时间掩码,癫痫检测任务中偏好裁剪与缩放),体现出其替代传统启发式增强方法、建立自主化数据增强范式的潜力。
链接: https://arxiv.org/abs/2601.13964
作者: Cheol-Hui Lee,Hwa-Yeon Lee,Dong-Joo Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The quality of data augmentation serves as a critical determinant for the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non-stationarity of EEG signals where statistical properties change over time. To address this, we propose RL-BioAug, a framework that leverages a label-efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10%) of labeled data to guide the agent’s policy, our method enables the encoder to learn robust representations in a strictly self-supervised manner. Experimental results demonstrate that RL-BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69% and 8.80% in Macro-F1 score on the Sleep-EDFX and CHB-MIT datasets, respectively. Notably, this agent mainly chose optimal strategies for each task – for example, Time Masking with a 62% probability for sleep stage classification and Crop \ Resize with a 77% probability for seizure detection. Our framework suggests its potential to replace conventional heuristic-based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at \hrefthis https URLthis https URL.
zh
[AI-14] IF-GEO: Conflict-Aware Instruction Fusion for Multi-Query Generative Engine Optimization ACL2026
【速读】:该论文旨在解决生成式引擎(Generative Engine)在优化文档以适应多样化查询时面临的约束优化难题,即不同查询往往对内容修订提出冲突性要求,而可用的修改资源有限。解决方案的关键在于提出IF-GEO框架,采用“分叉-聚合”策略:首先从代表性潜在查询中挖掘差异化的优化偏好,然后通过冲突感知的指令融合机制合成全局修订蓝图,从而协调多种偏好并实现跨查询稳定性。该方法引入风险感知的稳定性度量指标,实验证明其在多查询基准上显著提升性能并保持对不同检索场景的鲁棒性。
链接: https://arxiv.org/abs/2601.13938
作者: Heyang Zhou(1),JiaJia Chen(2),Xiaolu Chen(1),Jie Bao(1),Zhen Chen(1),Yong Liao(1) ((1) School of Cyber Science and Technology, University of Science and Technology of China, (2) Institute of Dataspace, Hefei Comprehensive National Science Center)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures. Submitted to ACL 2026. Corresponding author: Zhen Chen
Abstract:As Generative Engines revolutionize information retrieval by synthesizing direct answers from retrieved sources, ensuring source visibility becomes a significant challenge. Improving it through targeted content revisions is a practical strategy termed Generative Engine Optimization (GEO). However, optimizing a document for diverse queries presents a constrained optimization challenge where heterogeneous queries often impose conflicting and competing revision requirements under a limited content budget. To address this challenge, we propose IF-GEO, a “diverge-then-converge” framework comprising two phases: (i) mining distinct optimization preferences from representative latent queries; (ii) synthesizing a Global Revision Blueprint for guided editing by coordinating preferences via conflict-aware instruction fusion. To explicitly quantify IF-GEO’s objective of cross-query stability, we introduce risk-aware stability metrics. Experiments on multi-query benchmarks demonstrate that IF-GEO achieves substantial performance gains while maintaining robustness across diverse retrieval scenarios.
zh
[AI-15] Asymmetric regularization mechanism for GAN training with Variational Inequalities
【速读】:该论文旨在解决生成对抗网络(Generative Adversarial Networks, GANs)训练过程中难以稳定收敛的问题,特别是如何有效寻找纳什均衡(Nash equilibrium)。其解决方案的关键在于提出一种非对称正则化机制,该机制结合经典的Tikhonov正则项与一种新颖的零中心梯度惩罚(zero-centered gradient penalty),从而在光滑性和由高斯-牛顿格拉姆矩阵诱导的局部可辨识条件下,显式获得正则化算子的Lipschitz常数和(强)单调性常数。这些性质确保了单步调用的过去外推法(Extrapolation-from-the-Past, EFTP)方法具有最后迭代的线性收敛性,实验证明即使在强单调性不成立的情况下,该正则化策略仍能稳定轨迹并收敛至平衡点。
链接: https://arxiv.org/abs/2601.13920
作者: Spyridon C. Giagtzoglou,Mark H.M. Winands,Barbara Franci
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 3 figures, conference
Abstract:We formulate the training of generative adversarial networks (GANs) as a Nash equilibrium seeking problem. To stabilize the training process and find a Nash equilibrium, we propose an asymmetric regularization mechanism based on the classic Tikhonov step and on a novel zero-centered gradient penalty. Under smoothness and a local identifiability condition induced by a Gauss-Newton Gramian, we obtain explicit Lipschitz and (strong)-monotonicity constants for the regularized operator. These constants ensure last-iterate linear convergence of a single-call Extrapolation-from-the-Past (EFTP) method. Empirical simulations on an academic example show that, even when strong monotonicity cannot be achieved, the asymmetric regularization is enough to converge to an equilibrium and stabilize the trajectory.
zh
[AI-16] PREFAB: PREFerence-based Affective Modeling for Low-Budget Self-Annotation
【速读】:该论文旨在解决情感计算中自标注(self-annotation)方法依赖全时段标注所带来的高认知负荷、易疲劳及效率低下的问题。现有方法要求用户在整个实验过程中持续标记情感状态,虽然能获得细粒度数据,但不适用于大规模或长时间的实验场景。解决方案的关键在于提出 PREFAB——一种基于回溯式自标注的低成本方法,其核心思想是聚焦于情感转折点(affective inflection regions),而非完整标注整个刺激序列;该方法基于峰终定律(peak-end rule)和情绪的序数表示(ordinal representations),利用偏好学习模型检测相对情感变化,并通过预览机制提供简短上下文提示以辅助标注。实验结果表明,PREFAB 在建模情感转折点方面优于基线方法,同时显著降低工作量并提升标注者信心,且不牺牲标注质量。
链接: https://arxiv.org/abs/2601.13904
作者: Jaeyoung Moon,Youjin Choi,Yucheon Park,David Melhart,Georgios N. Yannakakis,Kyung-Joong Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: CHI '26 Accepted paper
Abstract:Self-annotation is the gold standard for collecting affective state labels in affective computing. Existing methods typically rely on full annotation, requiring users to continuously label affective states across entire sessions. While this process yields fine-grained data, it is time-consuming, cognitively demanding, and prone to fatigue and errors. To address these issues, we present PREFAB, a low-budget retrospective self-annotation method that targets affective inflection regions rather than full annotation. Grounded in the peak-end rule and ordinal representations of emotion, PREFAB employs a preference-learning model to detect relative affective changes, directing annotators to label only selected segments while interpolating the remainder of the stimulus. We further introduce a preview mechanism that provides brief contextual cues to assist annotation. We evaluate PREFAB through a technical performance study and a 25-participant user study. Results show that PREFAB outperforms baselines in modeling affective inflections while mitigating workload (and conditionally mitigating temporal burden). Importantly PREFAB improves annotator confidence without degrading annotation quality.
zh
[AI-17] ractRLFusion: A GPT -Based Multi-Critic Policy Fusion Framework for Fiber Tractography
【速读】:该论文旨在解决传统纤维追踪(tractography)方法在重建白质纤维路径时难以准确识别真实连接并有效抑制伪连接的问题。其解决方案的关键在于提出了一种基于GPT的策略融合框架TractRLFusion,通过数据驱动的多强化学习(Reinforcement Learning, RL)策略融合机制,结合两阶段训练数据选择与多评论家(multi-critic)微调策略,显著提升了重建结果的准确性与解剖学可靠性。
链接: https://arxiv.org/abs/2601.13897
作者: Ankita Joshi,Ashutosh Sharma,Anoushkrit Goel,Ranjeet Ranjan Jha,Chirag Ahuja,Arnav Bhavsar,Aditya Nigam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at 23rd IEEE International Symposium on Biomedical Imaging (ISBI), 2026
Abstract:Tractography plays a pivotal role in the non-invasive reconstruction of white matter fiber pathways, providing vital information on brain connectivity and supporting precise neurosurgical planning. Although traditional methods relied mainly on classical deterministic and probabilistic approaches, recent progress has benefited from supervised deep learning (DL) and deep reinforcement learning (DRL) to improve tract reconstruction. A persistent challenge in tractography is accurately reconstructing white matter tracts while minimizing spurious connections. To address this, we propose TractRLFusion, a novel GPT-based policy fusion framework that integrates multiple RL policies through a data-driven fusion strategy. Our method employs a two-stage training data selection process for effective policy fusion, followed by a multi-critic fine-tuning phase to enhance robustness and generalization. Experiments on HCP, ISMRM, and TractoInferno datasets demonstrate that TractRLFusion outperforms individual RL policies as well as state-of-the-art classical and DRL methods in accuracy and anatomical reliability.
zh
[AI-18] Human Simulation Computation: A Human-Inspired Framework for Adaptive AI Systems
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在面对开放动态现实环境时,因过度依赖文本数据而导致的适应性差、推理结果难以验证以及实际交互能力不足的问题。其解决方案的关键在于提出人类模拟计算(Human Simulation Computation, HSC)框架,该框架将智能建模为一个包含思考、行动、学习、反思和活动调度的闭环过程,强调在内部推理过程中主动参与,并通过与环境的交互动作实现对内部推理机制的自动优化与提升;同时引入人类常用的思维策略,如主特征导向推理、通过行动扩展认知范围及基于环境反馈的即时学习,从而构建具备强适应性和真实世界交互能力的推理系统。
链接: https://arxiv.org/abs/2601.13887
作者: Hong Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated strong capabilities in knowledge representation and reasoning based on textual data. However, their reliance on language material alone limits their ability to adapt, verify reasoning outcomes, and operate effectively in open and dynamic real-world environments. In this paper, we propose Human Simulation Computation (HSC), a human-inspired computational framework that models intelligence as a continuous, closed-loop process involving thinking, action, learning, reflection, and activity scheduling, collectively referred to as the internal reasoning process. HSC emphasizes active participation both within the internal reasoning process and in interactions with the environment, where actions are used not only to achieve goals but also to automatically refine and improve internal reasoning mechanisms without external intervention. Furthermore, HSC incorporates commonly used human thinking strategies across all stages of the internal reasoning process, such as main-feature-oriented reasoning, scope expansion through action, and on-time learning driven by environmental feedback. Through theoretical analysis, we argue that human simulation strategies cannot be fully learned from language material alone, and that human-like reasoning processes and action-grounded reasoning methods are essential for robust adaptation and effective interaction with real-world environments.
zh
[AI-19] LifeAgent Bench: A Multi-dimensional Benchmark and Agent for Personal Health Assistants in Digital Health
【速读】:该论文旨在解决个性化数字健康支持中长期、跨维度的生活方式信号推理问题,当前大型语言模型(LLM)在该场景下的能力尚不明确,主要受限于缺乏系统性的评估基准。其解决方案的关键在于提出了LifeAgentBench——一个大规模问答(QA)基准,用于评估LLM在长周期、多用户、跨维度生活方式健康推理中的表现,并配套开发了可扩展的基准构建流程与标准化评估协议。基于此基准,作者系统评估了11个主流LLM并识别出长期聚合和跨维度推理的核心瓶颈,进而提出LifeAgent作为强基线代理,通过集成多步证据检索与确定性聚合机制,在多项指标上显著优于现有基线方法,验证了其在真实日常健康场景中的潜力。
链接: https://arxiv.org/abs/2601.13880
作者: Ye Tian,Zihao Wang,Onat Gungor,Xiaoran Fan,Tajana Rosing
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized digital health support requires long-horizon, cross-dimensional reasoning over heterogeneous lifestyle signals, and recent advances in mobile sensing and large language models (LLMs) make such support increasingly feasible. However, the capabilities of current LLMs in this setting remain unclear due to the lack of systematic benchmarks. In this paper, we introduce LifeAgentBench, a large-scale QA benchmark for long-horizon, cross-dimensional, and multi-user lifestyle health reasoning, containing 22,573 questions spanning from basic retrieval to complex reasoning. We release an extensible benchmark construction pipeline and a standardized evaluation protocol to enable reliable and scalable assessment of LLM-based health assistants. We then systematically evaluate 11 leading LLMs on LifeAgentBench and identify key bottlenecks in long-horizon aggregation and cross-dimensional reasoning. Motivated by these findings, we propose LifeAgent as a strong baseline agent for health assistant that integrates multi-step evidence retrieval with deterministic aggregation, achieving significant improvements compared with two widely used baselines. Case studies further demonstrate its potential in realistic daily-life scenarios. The benchmark is publicly available at this https URL.
zh
[AI-20] HardSecBench: Benchmarking the Security Awareness of LLM s for Hardware Code Generation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在硬件和固件代码生成中存在安全意识不足的问题,即尽管LLM生成的代码在功能上可能正确,但可能隐含安全漏洞(如Common Weakness Enumeration, CWE类缺陷),从而在部署后引发严重后果。解决方案的关键在于提出一个名为HardSecBench的基准测试集,涵盖924个任务,覆盖Verilog寄存器传输级(RTL)和固件级C语言代码,涉及76种硬件相关的CWE条目,并为每个任务提供结构化规范、安全参考实现和可执行测试用例;同时设计了一个多智能体流水线,将代码合成与验证解耦,并基于执行证据进行评估,从而实现对LLM生成代码安全性更可靠、自动化且贴近实际场景的评估。
链接: https://arxiv.org/abs/2601.13864
作者: Qirui Chen,Jingxian Shuai,Shuangwu Chen,Shenghao Ye,Zijian Wen,Xufei Su,Jie Jin,Jiangming Li,Jun Chen,Xiaobin Tan,Jian Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are being increasingly integrated into practical hardware and firmware development pipelines for code generation. Existing studies have primarily focused on evaluating the functional correctness of LLM-generated code, yet paid limited attention to its security issues. However, LLM-generated code that appears functionally sound may embed security flaws which could induce catastrophic damages after deployment. This critical research gap motivates us to design a benchmark for assessing security awareness under realistic specifications. In this work, we introduce HardSecBench, a benchmark with 924 tasks spanning Verilog Register Transfer Level (RTL) and firmware-level C, covering 76 hardware-relevant Common Weakness Enumeration (CWE) entries. Each task includes a structured specification, a secure reference implementation, and executable tests. To automate artifact synthesis, we propose a multi-agent pipeline that decouples synthesis from verification and grounds evaluation in execution evidence, enabling reliable evaluation. Using HardSecBench, we evaluate a range of LLMs on hardware and firmware code generation and find that models often satisfy functional requirements while still leaving security risks. We also find that security results vary with prompting. These findings highlight pressing challenges and offer actionable insights for future advancements in LLM-assisted hardware design. Our data and code will be released soon.
zh
[AI-21] Virtual Urbanism: An AI-Driven Framework for Quantifying Urban Identity. A Tokyo-Based Pilot Study Using Diffusion-Generated Synthetic Environments
【速读】:该论文旨在解决城市身份(Urban Identity)量化分析的难题,即如何通过计算可处理的方法来识别和评估城市空间的核心特征。传统方法往往依赖主观判断或有限指标,难以实现多维度、自动化且具有文化敏感性的度量。解决方案的关键在于提出一种名为“虚拟都市主义”(Virtual Urbanism, VU)的多模态AI驱动分析框架,其核心是利用Stable Diffusion与LoRA模型生成不含现有导航标记的合成城市复制品,并通过人类评估实验验证其感知合法性及提取出区域级城市身份水平(Urban Identity Level, UIL)与文化嵌入型类型学作为核心身份构成要素,从而为基于生成式AI的城市分析提供可扩展、自动化的多参数度量路径。
链接: https://arxiv.org/abs/2601.13846
作者: Glinskaya Maria
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:This paper introduces Virtual Urbanism (VU), a multimodal AI-driven analytical framework for quantifying urban identity through the medium of synthetic urban replicas. The framework aims to advance computationally tractable urban identity metrics. To demonstrate feasibility, the pilot study Virtual Urbanism and Tokyo Microcosms is presented. A pipeline integrating Stable Diffusion and LoRA models was used to produce synthetic replicas of nine Tokyo areas rendered as dynamic synthetic urban sequences, excluding existing orientation markers to elicit core identity-forming elements. Human-evaluation experiments (I) assessed perceptual legitimacy of replicas; (II) quantified area-level identity; (III) derived core identity-forming elements. Results showed a mean identification accuracy of ~81%, confirming the validity of the replicas. Urban Identity Level (UIL) metric enabled assessment of identity levels across areas, while semantic analysis revealed culturally embedded typologies as core identity-forming elements, positioning VU as a viable framework for AI-augmented urban analysis, outlining a path toward automated, multi-parameter identity metrics.
zh
[AI-22] DroneVLA: VLA based Aerial Manipulation
【速读】:该论文旨在解决非专业用户难以自然、直观地操控自主飞行操作臂系统的问题,尤其是在复杂环境中实现基于高阶自然语言指令的物体抓取与交付任务。其核心解决方案在于构建一个融合视觉-语言-动作(Vision-Language-Action, VLA)模型、Grounding DINO目标定位模块与动态A路径规划算法的集成系统,并结合MediaPipe人体姿态估计实现人机交互阶段的安全稳定手递交接。其中,VLA模型负责语义推理并生成优先级任务队列以指导抓取决策,Grounding DINO与动态A协同完成目标识别与避障导航,而MediaPipe驱动的人体中心控制器则通过实时姿态估计支持视觉伺服控制,确保无人机在用户前方保持稳定位姿,从而提升交互的自然性与安全性。
链接: https://arxiv.org/abs/2601.13809
作者: Fawad Mehboob,Monijesu James,Amir Habel,Jeffrin Sam,Miguel Altamirano Cabrera,Dzmitry Tsetserukou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for publication at LBR of HRI 2026 conference
Abstract:As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept of autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate a MediaPipe based on Grounding DINO and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. VLA performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping of relevant objects in the scene. Grounding DINO and dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to employ visual servoing to maintain a stable, distinct position directly in front of the user, facilitating a comfortable handover. We demonstrate the system’s efficacy through real-world experiments for localization and navigation, which resulted in a 0.164m, 0.070m, and 0.084m of max, mean euclidean, and root-mean squared errors, respectively, highlighting the feasibility of VLA for aerial manipulation operations.
zh
[AI-23] vLinear: A Powerful Linear Model for Multivariate Time Series Forecasting
【速读】:该论文旨在解决多变量时间序列预测中现有模型依赖自注意力机制导致计算复杂度高(O(N2))的问题,以及传统流匹配目标函数在预测精度上的局限性。其解决方案的关键在于提出两个核心组件:一是vLinear中的vecTrans模块,通过引入可学习向量建模多变量相关性,将计算复杂度降低至O(N),并可无缝集成到基于Transformer的模型中实现高达5倍的推理加速和性能提升;二是提出WFMLoss(Weighted Flow Matching Loss),采用终值导向(final-series-oriented)的流匹配目标,并结合路径与预测时长加权策略,显著提升预测准确性,且作为即插即用的目标函数能持续改进现有预测器。
链接: https://arxiv.org/abs/2601.13768
作者: Wenzhen Yue,Ruohao Guo,Ji Shi,Zihan Hao,Shiyu Hu,Xianghua Ying
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we present \textbfvLinear, an effective yet efficient \textbflinear-based multivariate time series forecaster featuring two components: the \textbfvecTrans module and the WFMLoss objective. Many state-of-the-art forecasters rely on self-attention or its variants to capture multivariate correlations, typically incurring \mathcalO(N^2) computational complexity with respect to the number of variates N . To address this, we propose vecTrans, a lightweight module that utilizes a learnable vector to model multivariate correlations, reducing the complexity to \mathcalO(N) . Notably, vecTrans can be seamlessly integrated into Transformer-based forecasters, delivering up to 5 \times inference speedups and consistent performance gains. Furthermore, we introduce WFMLoss (Weighted Flow Matching Loss) as the objective. In contrast to typical \textbfvelocity-oriented flow matching objectives, we demonstrate that a \textbffinal-series-oriented formulation yields significantly superior forecasting accuracy. WFMLoss also incorporates path- and horizon-weighted strategies to focus learning on more reliable paths and horizons. Empirically, vLinear achieves state-of-the-art performance across 22 benchmarks and 124 forecasting settings. Moreover, WFMLoss serves as an effective plug-and-play objective, consistently improving existing forecasters. The code is available at this https URL.
zh
[AI-24] Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection
【速读】:该论文试图解决当前基于概率的置信度指标在Best-of-N选择中是否真正反映推理质量的问题,尤其是这些指标是否能够捕捉推理步骤间的因果依赖关系。研究表明,现有指标对逻辑结构不敏感,主要反映表面流畅性或分布内先验,而非有效的推理过程。解决方案的关键在于提出一种对比因果性度量(contrastive causality metric),该方法通过显式隔离推理步骤间的因果依赖关系,从而实现更忠实于逻辑结构的输出选择,显著优于传统基于概率的方法。
链接: https://arxiv.org/abs/2601.13735
作者: Hojin Kim,Jaehyung Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures
Abstract:Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption that higher confidence reflects higher reasoning fidelity. In this work, we challenge this assumption by investigating whether these metrics truly capture inter-step causal dependencies necessary for valid reasoning. We introduce three classes of inter-step causality perturbations that systematically disrupt dependencies between reasoning steps while preserving local fluency. Surprisingly, across diverse model families and reasoning benchmarks, we find that selection accuracy degrades only marginally under these disruptions. Even severe interventions, such as applying hard attention masks that directly prevent the model from attending to prior reasoning steps, do not substantially reduce selection performance. These findings provide strong evidence that current probabilistic metrics are largely insensitive to logical structure, and primarily capture surface-level fluency or in-distribution priors instead. Motivated by this gap, we propose a contrastive causality metric that explicitly isolates inter-step causal dependencies, and demonstrate that it yields more faithful output selection than existing probability-based approaches.
zh
[AI-25] Performance and Complexity Trade-off Optimization of Speech Models During Training
【速读】:该论文旨在解决神经网络模型在语音机器学习任务中,因层大小等结构参数通常依赖启发式选择而导致性能与计算复杂度之间难以实现最优权衡的问题。传统方法如权重量化或模型剪枝虽可降低计算成本,但属于训练后的后处理手段,无法在训练过程中动态调整模型结构以适应目标性能-复杂度折衷。其解决方案的关键在于提出一种基于特征噪声注入的重参数化技术,使得模型在使用随机梯度下降(SGD)优化时,能够同时对性能指标和计算复杂度(如FLOP/s)进行联合优化,从而实现训练阶段模型规模的动态调整,无需依赖人工设定的剪枝规则。
链接: https://arxiv.org/abs/2601.13704
作者: Esteban Gómez,Tom Bäckström
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
Abstract:In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task’s objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.
zh
[AI-26] Does Privacy Always Harm Fairness? Data-Dependent Trade-offs via Chernoff Information Neural Estimation
【速读】:该论文旨在解决公平性(fairness)、隐私性(privacy)与准确性(accuracy)三者之间关系不明确的问题,尤其是现有研究多聚焦于单一维度而忽视了三者之间的协同作用。其核心解决方案是引入信息论中的切尔诺夫信息(Chernoff Information)构建“噪声切尔诺夫差”(Noisy Chernoff Difference)这一新工具,用以量化分析三者在数据依赖下的动态关系。该指标不仅能揭示不同数据分布下三者交互行为的三种典型模式,还被证明可作为公平性-准确性曲线陡峭程度的代理指标,从而为理解三者权衡提供理论依据。进一步地,作者提出了一种适用于未知分布数据的切尔诺夫信息估计方法,并将其应用于真实数据集,验证了该框架在刻画三者复杂关系上的有效性,推动了对公平性-隐私性-准确性三角关系统一建模的研究进展。
链接: https://arxiv.org/abs/2601.13698
作者: Arjun Nichani,Hsiang Hsu,Chun-Fu(Richard)Chen,Haewon Jeong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
备注:
Abstract:Fairness and privacy are two vital pillars of trustworthy machine learning. Despite extensive research on these individual topics, the relationship between fairness and privacy has received significantly less attention. In this paper, we utilize the information-theoretic measure Chernoff Information to highlight the data-dependent nature of the relationship among the triad of fairness, privacy, and accuracy. We first define Noisy Chernoff Difference, a tool that allows us to analyze the relationship among the triad simultaneously. We then show that for synthetic data, this value behaves in 3 distinct ways (depending on the distribution of the data). We highlight the data distributions involved in these cases and explore their fairness and privacy implications. Additionally, we show that Noisy Chernoff Difference acts as a proxy for the steepness of the fairness-accuracy curves. Finally, we propose a method for estimating Chernoff Information on data from unknown distributions and utilize this framework to examine the triad dynamic on real datasets. This work builds towards a unified understanding of the fairness-privacy-accuracy relationship and highlights its data-dependent nature.
zh
[AI-27] Understanding Mental States to Guide Social Influence in Multi-Person Group Dialogue
【速读】:该论文旨在解决现有动态心智理论(Theory of Mind, ToM)评估基准中语言模型仅被赋予被动角色的问题,即模型仅需理解并描述人物心理状态的变化,而未考察其在社交互动中主动改变他人心理状态的能力。为填补这一空白,作者提出了SocialMindChange基准,其核心创新在于将ToM从“追踪心智”转向“改变心智”,要求模型在五幕连续场景中扮演特定角色,通过生成连贯对话来引导其他参与者心理状态向目标演化,同时保持所有角色心理状态的一致性。该方案的关键在于构建了一个包含4个角色、6000个场景和90,000个问题的结构化四步框架,确保任务具有高度现实性和复杂性,从而更真实地评估大语言模型(LLM)在长期社交交互中维持与操控心理状态表征的能力。
链接: https://arxiv.org/abs/2601.13687
作者: Zhichao Liang,Satoshi Nakamura
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing dynamic Theory of Mind (ToM) benchmarks mostly place language models in a passive role: the model reads a sequence of connected scenarios and reports what people believe, feel, intend, and do as these states change. In real social interaction, ToM is also used for action: a speaker plans what to say in order to shift another person’s mental-state trajectory toward a goal. We introduce SocialMindChange, a benchmark that moves from tracking minds to changing minds in social interaction. Each instance defines a social context with 4 characters and five connected scenes. The model plays one character and generates dialogue across the five scenes to reach the target while remaining consistent with the evolving states of all participants. SocialMindChange also includes selected higher-order states. Using a structured four-step framework, we construct 1,200 social contexts, covering 6000 scenarios and over 90,000 questions, each validated for realism and quality. Evaluations on ten state-of-the-art LLMs show that their average performance is 54.2% below human performance. This gap suggests that current LLMs still struggle to maintain and change mental-state representations across long, linked interactions.
zh
[AI-28] he Orchestration of Multi-Agent Systems: Architectures Protocols and Enterprise Adoption
【速读】:该论文旨在解决当前多智能体系统(Multi-Agent Systems, MAS)在复杂任务协作中缺乏统一架构与标准化通信机制的问题,从而限制了系统的可扩展性、可审计性和政策合规性。其解决方案的关键在于提出一个集成规划、策略执行、状态管理和质量运营的协同层架构,并设计两种互补的通信协议:Model Context Protocol(MCP)用于规范智能体访问外部工具和上下文数据的方式,Agent2Agent 协议则负责同级智能体间的协调、协商与委托。这两项协议共同构建了一个可互操作的通信基础,支撑分布式智能体群体实现规模化、透明且符合治理要求的推理能力。
链接: https://arxiv.org/abs/2601.13671
作者: Apoorva Adimulam,Rajesh Gupta,Sumit Kumar
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Orchestrated multi-agent systems represent the next stage in the evolution of artificial intelligence, where autonomous agents collaborate through structured coordination and communication to achieve complex, shared objectives. This paper consolidates and formalizes the technical composition of such systems, presenting a unified architectural framework that integrates planning, policy enforcement, state management, and quality operations into a coherent orchestration layer. Another primary contribution of this work is the in-depth technical delineation of two complementary communication protocols - the Model Context Protocol, which standardizes how agents access external tools and contextual data, and the Agent2Agent protocol, which governs peer coordination, negotiation, and delegation. Together, these protocols establish an interoperable communication substrate that enables scalable, auditable, and policy-compliant reasoning across distributed agent collectives. Beyond protocol design, the paper details how orchestration logic, governance frameworks, and observability mechanisms collectively sustain system coherence, transparency, and accountability. By synthesizing these elements into a cohesive technical blueprint, this paper provides comprehensive treatments of orchestrated multi-agent systems - bridging conceptual architectures with implementation-ready design principles for enterprise-scale AI ecosystems.
zh
[AI-29] Communication-Free Collective Navigation for a Swarm of UAVs via LiDAR-Based Deep Reinforcement Learning
【速读】:该论文旨在解决无人飞行器(UAV)集群在通信受限环境中的协同导航问题,特别是在复杂、障碍物密集且缺乏外部定位系统支持的场景下,如何实现鲁棒的群体运动控制。解决方案的关键在于提出一种基于深度强化学习(DRL)的隐式领导者-跟随者框架:仅由领导者携带目标信息,而跟随者通过机载激光雷达(LiDAR)感知局部环境,利用仅依赖本地观测的DRL控制器学习复杂的涌现行为(如 flocking 与避障的平衡),无需任何跨无人机通信或领导者身份识别,从而实现无需外部定位系统的群体导航鲁棒性。
链接: https://arxiv.org/abs/2601.13657
作者: Myong-Yol Choi,Hankyoul Ko,Hanse Cho,Changseung Kim,Seunghwan Kim,Jaemin Seo,Hyondong Oh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:This paper presents a deep reinforcement learning (DRL) based controller for collective navigation of unmanned aerial vehicle (UAV) swarms in communication-denied environments, enabling robust operation in complex, obstacle-rich environments. Inspired by biological swarms where informed individuals guide groups without explicit communication, we employ an implicit leader-follower framework. In this paradigm, only the leader possesses goal information, while follower UAVs learn robust policies using only onboard LiDAR sensing, without requiring any inter-agent communication or leader identification. Our system utilizes LiDAR point clustering and an extended Kalman filter for stable neighbor tracking, providing reliable perception independent of external positioning systems. The core of our approach is a DRL controller, trained in GPU-accelerated Nvidia Isaac Sim, that enables followers to learn complex emergent behaviors - balancing flocking and obstacle avoidance - using only local perception. This allows the swarm to implicitly follow the leader while robustly addressing perceptual challenges such as occlusion and limited field-of-view. The robustness and sim-to-real transfer of our approach are confirmed through extensive simulations and challenging real-world experiments with a swarm of five UAVs, which successfully demonstrated collective navigation across diverse indoor and outdoor environments without any communication or external localization.
zh
[AI-30] Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLM s
【速读】:该论文旨在解决开源大语言模型(Large Language Models, LLMs)在本地部署过程中因用户自主管理编排(white-box orchestration)而引发的可靠性问题,即与黑盒API调用相比,用户自建部署栈存在显著的系统性脆弱性。解决方案的关键在于通过一项针对DeepSeek、Llama和Qwen生态中705个真实故障的大规模实证研究,揭示了三个核心现象:诊断差异(Diagnostic Divergence)表明运行时崩溃反映基础设施摩擦,而功能错误则指向内部分词器缺陷;系统同质性(Systemic Homogeneity)证明根本原因在不同模型系列间趋同,说明可靠性瓶颈源于共享生态系统而非特定架构;生命周期升级(Lifecycle Escalation)指出问题从微调阶段的配置困难演变为推理阶段的环境不兼容,凸显部署复杂性的阶段性加剧。这些发现为提升LLM部署可靠性提供了可操作的指导。
链接: https://arxiv.org/abs/2601.13655
作者: Guangba Yu,Zirui Wang,Yujie Huang,Renyi Zhong,Yuedong Zhong,Yilun Wang,Michael R. Lyu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:The democratization of open-source Large Language Models (LLMs) allows users to fine-tune and deploy models on local infrastructure but exposes them to a First Mile deployment landscape. Unlike black-box API consumption, the reliability of user-managed orchestration remains a critical blind spot. To bridge this gap, we conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack. We identify three key phenomena: (1) Diagnostic Divergence: runtime crashes distinctively signal infrastructure friction, whereas incorrect functionality serves as a signature for internal tokenizer defects. (2) Systemic Homogeneity: Root causes converge across divergent series, confirming reliability barriers are inherent to the shared ecosystem rather than specific architectures. (3) Lifecycle Escalation: Barriers escalate from intrinsic configuration struggles during fine-tuning to compounded environmental incompatibilities during inference. Supported by our publicly available dataset, these insights provide actionable guidance for enhancing the reliability of the LLM landscape. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2601.13655 [cs.SE] (or arXiv:2601.13655v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2601.13655 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-31] Fusion Segment Transformer: Bi-Directional Attention Guided Fusion Network for AI-Generated Music Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 技术快速发展背景下,AI 生成音乐的版权与归属问题,尤其针对现有方法主要聚焦于短音频检测而忽视长音频中长期结构建模的不足。其解决方案的关键在于提出改进的融合段落 Transformer(Fusion Segment Transformer),通过引入门控融合层(Gated Fusion Layer)有效整合内容特征与结构信息,从而增强对全音频场景下长程上下文的理解能力,显著提升了 AI 生成音乐的检测性能。
链接: https://arxiv.org/abs/2601.13647
作者: Yumin Kim,Seonghyeon Go
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rise of generative AI technology, anyone can now easily create and deploy AI-generated music, which has heightened the need for technical solutions to address copyright and ownership issues. While existing works mainly focused on short-audio, the challenge of full-audio detection, which requires modeling long-term structure and context, remains insufficiently explored. To address this, we propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. Furthermore, we enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer that effectively integrates content and structural information, enabling the capture of long-term context. Experiments on the SONICS and AIME datasets show that our approach outperforms the previous model and recent baselines, achieving state-of-the-art results in AI-generated music detection.
zh
[AI-32] Resilient Routing: Risk-Aware Dynamic Routing in Smart Logistics via Spatiotemporal Graph Learning
【速读】:该论文旨在解决电子商务快速发展背景下物流网络面临的交通拥堵与零售需求波动导致的传统静态路径规划策略失效的问题。其核心解决方案是提出一种风险感知的动态路径规划框架(Risk-Aware Dynamic Routing, RADR),关键在于融合时空图神经网络(Spatiotemporal Graph Neural Networks, ST-GNN)与组合优化方法:首先利用空间聚类对离散GPS数据构建物流拓扑图,进而采用图卷积网络(Graph Convolutional Network, GCN)与门控循环单元(Gated Recurrent Unit, GRU)组成的混合深度学习模型提取空间相关性和时间依赖性以预测未来拥堵风险,再将预测结果嵌入动态边权重机制进行路径规划。实验证明,该方法在保持运输距离仅增加2.1%的情况下,可降低19.3%的潜在拥堵风险暴露,显著提升供应链韧性。
链接: https://arxiv.org/abs/2601.13632
作者: Zhiming Xue,Sichen Zhao,Yalun Qi,Xianling Zeng,Zihan Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid development of the e-commerce industry, the logistics network is experiencing unprecedented pressure. The traditional static routing strategy most time cannot tolerate the traffic congestion and fluctuating retail demand. In this paper, we propose a Risk-Aware Dynamic Routing(RADR) framework which integrates Spatiotemporal Graph Neural Networks (ST-GNN) with combinatorial optimization. We first construct a logistics topology graph by using the discrete GPS data using spatial clustering methods. Subsequently, a hybrid deep learning model combining Graph Convolutional Network (GCN) and Gated Recurrent Unit (GRU) is adopted to extract spatial correlations and temporal dependencies for predicting future congestion risks. These prediction results are then integrated into a dynamic edge weight mechanism to perform path planning. We evaluated the framework on the Smart Logistics Dataset 2024, which contains real-world Internet of Things(IoT) sensor data. The experimental results show that the RADR algorithm significantly enhances the resilience of the supply chain. Particularly in the case study of high congestion scenarios, our method reduces the potential congestion risk exposure by 19.3% while only increasing the transportation distance by 2.1%. This empirical evidence confirms that the proposed data-driven approach can effectively balance delivery efficiency and operational safety.
zh
[AI-33] Foundations of Global Consistency Checking with Noisy LLM Oracles
【速读】:该论文旨在解决自然语言事实集合在全球范围内保持一致性的验证问题(global consistency verification),这是事实核查、摘要生成和知识库构建等任务的关键前提。现有方法依赖大语言模型(LLM)对小规模事实子集进行一致性判断,但其评估结果噪声较大,且成对检查无法保证整体一致性。论文指出,在最坏情况下,验证全局一致性需要指数级的预言机查询次数。为实现可扩展性,作者提出一种自适应分治算法,其核心在于识别最小不一致子集(Minimal Inconsistent Subsets, MUSes),并可选地通过击中集(hitting-sets)计算最小修复方案,从而将查询复杂度降至低阶多项式级别,显著提升了基于LLM的语义一致性验证效率与实用性。
链接: https://arxiv.org/abs/2601.13600
作者: Paul He,Elke Kirschbaum,Shiva Kasiviswanathan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Ensuring that collections of natural-language facts are globally consistent is essential for tasks such as fact-checking, summarization, and knowledge base construction. While Large Language Models (LLMs) can assess the consistency of small subsets of facts, their judgments are noisy, and pairwise checks are insufficient to guarantee global coherence. We formalize this problem and show that verifying global consistency requires exponentially many oracle queries in the worst case. To make the task practical, we propose an adaptive divide-and-conquer algorithm that identifies minimal inconsistent subsets (MUSes) of facts and optionally computes minimal repairs through hitting-sets. Our approach has low-degree polynomial query complexity. Experiments with both synthetic and real LLM oracles show that our method efficiently detects and localizes inconsistencies, offering a scalable framework for linguistic consistency verification with LLM-based evaluators.
zh
[AI-34] Diffusion In Diffusion: Breaking the Autoregressive Bottleneck in Block Diffusion Models
【速读】:该论文旨在解决块扩散语言模型(block diffusion language models)中存在的不可逆性(irreversibility)和短视性(myopia)问题,这些问题源于其严格的单向块依赖结构,导致模型丧失了扩散模型原本具备的全局规划能力。解决方案的关键在于提出“扩散中的扩散”(Diffusion in Diffusion)框架,该框架采用先草稿后精炼(draft-then-refine)的两阶段策略:首先使用小块进行快速草稿生成,随后通过具有更大双向感受野的全局扩散机制对草稿进行精细优化;同时引入快照置信度重掩码(snapshot confidence remasking)识别关键需修改标记,并结合多尺度训练(mix-scale training)增强模型的全局建模能力,从而在显著降低计算成本的同时大幅提升生成质量。
链接: https://arxiv.org/abs/2601.13599
作者: Linrui Ma,Yufei Cui,Kai Han,Yunhe Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Work In Progress
Abstract:Block diffusion language models, operating as semi-autoregressive paradigms, combine the strengths of both autoregressive and diffusion paradigms. However, their strict unidirectional block dependencies introduce irreversibility and sacrifice the global planning capabilities for which diffusion models are renowned. In order to address these issues, we propose Diffusion in Diffusion, a draft-then-refine framework designed to overcome the irreversibility and myopia problems inherent in block diffusion models. Our approach first employs block diffusion to generate rapid drafts using small blocks, then refines these drafts through global bidirectional diffusion with a larger bidirectional receptive field. We utilise snapshot confidence remasking to identify the most critical tokens that require modification, and apply mix-scale training to expand the block diffusion model’s global capabilities. Empirical results demonstrate that our approach sets a new benchmark for discrete diffusion models on the OpenWebText dataset. Using just 26% of the fine-tuning budget of baseline models, we reduce generative perplexity from 25.7 to 21.9, significantly narrowing the performance gap with autoregressive models.
zh
[AI-35] Machine learning based radiative parameterization scheme and its performance in operational reforecast experiments
【速读】:该论文旨在解决数值天气预报模型中辐射过程计算耗时过长的问题,通过引入机器学习方法替代传统物理辐射模块以提升计算效率。其解决方案的关键在于构建一个嵌入深度神经网络的混合预报框架,采用离线训练与在线耦合相结合的方式,利用残差卷积神经网络(Residual Convolutional Neural Network)近似中国气象局全球业务系统中的快速辐射传输模型(RRTMG),并通过经验回放(experience replay)增强数据集和施加物理意义约束来确保长期积分稳定性,同时基于LibTorch实现高效实时耦合,最终在保持与传统物理方案相当精度的前提下,将计算速度提升约八倍,并支持十天集成预报。
链接: https://arxiv.org/abs/2601.13592
作者: Hao Jing,Sa Xiao,Haoyu Li,Huadong Xiao,Wei Xue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Radiation is typically the most time-consuming physical process in numerical models. One solution is to use machine learning methods to simulate the radiation process to improve computational efficiency. From an operational standpoint, this study investigates critical limitations inherent to hybrid forecasting frameworks that embed deep neural networks into numerical prediction models, with a specific focus on two fundamental bottlenecks: coupling compatibility and long-term integration stability. A residual convolutional neural network is employed to approximate the Rapid Radiative Transfer Model for General Circulation Models (RRTMG) within the global operational system of China Meteorological Administration. We adopted an offline training and online coupling approach. First, a comprehensive dataset is generated through model simulations, encompassing all atmospheric columns both with and without cloud cover. To ensure the stability of the hybrid model, the dataset is enhanced via experience replay, and additional output constraints based on physical significance are imposed. Meanwhile, a LibTorch-based coupling method is utilized, which is more suitable for real-time operational computations. The hybrid model is capable of performing ten-day integrated forecasts as required. A two-month operational reforecast experiment demonstrates that the machine learning emulator achieves accuracy comparable to that of the traditional physical scheme, while accelerating the computation speed by approximately eightfold.
zh
[AI-36] Motion-to-Response Content Generation via Multi-Agent AI System with Real-Time Safety Verification
【速读】:该论文旨在解决传统语音情感识别(Speech Emotion Recognition, SER)研究中仅关注分类准确率,而忽视将情感状态转化为安全、适宜年龄且可控的响应内容的问题。其解决方案的关键在于构建一个由四个协同工作的专用AI代理组成的多智能体系统:(1)基于卷积神经网络(CNN)的声学特征提取情绪识别代理;(2)负责将情绪映射到响应模式的策略决策代理;(3)生成媒体控制参数的内容参数生成代理;(4)执行年龄适宜性和刺激强度约束的安全验证代理。通过引入显式的安全验证循环,在输出前过滤内容以确保合规性,该系统实现了73.2%的情绪识别准确率、89.4%的响应模式一致性及100%的安全合规性,同时保持低于100ms的推理延迟,适用于设备端部署。
链接: https://arxiv.org/abs/2601.13589
作者: HyeYoung Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:This paper proposes a multi-agent artificial intelligence system that generates response-oriented media content in real time based on audio-derived emotional signals. Unlike conventional speech emotion recognition studies that focus primarily on classification accuracy, our approach emphasizes the transformation of inferred emotional states into safe, age-appropriate, and controllable response content through a structured pipeline of specialized AI agents. The proposed system comprises four cooperative agents: (1) an Emotion Recognition Agent with CNN-based acoustic feature extraction, (2) a Response Policy Decision Agent for mapping emotions to response modes, (3) a Content Parameter Generation Agent for producing media control parameters, and (4) a Safety Verification Agent enforcing age-appropriateness and stimulation constraints. We introduce an explicit safety verification loop that filters generated content before output, ensuring compliance with predefined rules. Experimental results on public datasets demonstrate that the system achieves 73.2% emotion recognition accuracy, 89.4% response mode consistency, and 100% safety compliance while maintaining sub-100ms inference latency suitable for on-device deployment. The modular architecture enables interpretability and extensibility, making it applicable to child-adjacent media, therapeutic applications, and emotionally responsive smart devices.
zh
[AI-37] SCRIPTMIND: Crime Script Inference and Cognitive Evaluation for LLM -based Social Engineering Scam Detection System EACL2026
【速读】:该论文旨在解决社会工程诈骗中日益增长的个性化、多轮次欺骗行为对传统检测方法造成的挑战,这些问题往往难以通过静态规则或简单特征提取手段有效识别。其解决方案的关键在于提出ScriptMind框架,该框架通过三个核心组件实现:(1)犯罪脚本推理任务(Crime Script Inference Task, CSIT),用于建模诈骗者的行为逻辑;(2)犯罪脚本感知推理数据集(Crime Script-Aware Inference Dataset, CSID),支持小规模大语言模型(LLM)的细粒度微调;(3)基于认知模拟的社会工程防御评估(Cognitive Simulation-based Evaluation of Social Engineering Defense, CSED),量化模型在实时交互中对用户认知意识的影响。该方案将自动化推理与人类认知机制相结合,显著提升了诈骗检测准确性与用户防骗能力。
链接: https://arxiv.org/abs/2601.13581
作者: Heedou Kim,Changsik Kim,Sanghwa Shin,Jaewoo Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted to the EACL 2026 Industry Track
Abstract:Social engineering scams increasingly employ personalized, multi-turn deception, exposing the limits of traditional detection methods. While Large Language Models (LLMs) show promise in identifying deception, their cognitive assistance potential remains underexplored. We propose ScriptMind, an integrated framework for LLM-based scam detection that bridges automated reasoning and human cognition. It comprises three components: the Crime Script Inference Task (CSIT) for scam reasoning, the Crime Script-Aware Inference Dataset (CSID) for fine-tuning small LLMs, and the Cognitive Simulation-based Evaluation of Social Engineering Defense (CSED) for assessing real-time cognitive impact. Using 571 Korean phone scam cases, we built 22,712 structured scammer-sequence training instances. Experimental results show that the 11B small LLM fine-tuned with ScriptMind outperformed GPT-4o by 13%, achieving superior performance over commercial models in detection accuracy, false-positive reduction, scammer utterance prediction, and rationale quality. Moreover, in phone scam simulation experiments, it significantly enhanced and sustained users’ suspicion levels, improving their cognitive awareness of scams. ScriptMind represents a step toward human-centered, cognitively adaptive LLMs for scam defense.
zh
[AI-38] Neural Organ Transplantation (NOT): Checkpoint-Based Modular Adaptation for Transformer Models
【速读】:该论文旨在解决传统微调方法在领域适应(domain adaptation)中存在参数耦合性强、难以复用且依赖原始训练数据的问题。其解决方案的关键在于提出神经器官移植(Neural Organ Transplantation, NOT)框架,通过从预训练模型中提取连续的层子集(称为“供体器官”),在特定领域数据上独立训练这些模块,并将其保存为可移植的检查点文件,从而实现无需原始训练数据即可将知识高效迁移至兼容的目标模型中。此方法显著优于现有技术(如LoRA),在多个decoder-only Transformer架构上展现出更低困惑度和更快训练速度,同时揭示了位置依赖性和跨域迁移中的意外正则化效应。
链接: https://arxiv.org/abs/2601.13580
作者: Ahmad Al-Zuraiqi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 8 figures, 16 tables. Decoder-only transformers (124M-20B parameters). Complete experimental results and reproducibility details in appendices. Code and checkpoints: this https URL
Abstract:We introduce Neural Organ Transplantation (NOT), a modular adaptation framework that enables trained transformer layers to function as reusable transferable checkpoints for domain adaptation. Unlike conventional fine-tuning approaches that tightly couple trained parameters to specific model instances and training data, NOT extracts contiguous layer subsets (“donor organs”) from pre-trained models, trains them independently on domain-specific data, and saves them as standalone checkpoint files that can be transplanted into compatible recipient models without access to the original training data. Through experiments on three decoder-only transformer architectures spanning 124M to 20B parameters (GPT-2, TinyLlama, and GPT-OSS), we demonstrate that donor transplantation substantially outperforms existing adaptation methods, achieving an order-of-magnitude improvement in perplexity over LoRA while training significantly faster. The method exhibits position dependence, with early insertion positions yielding optimal results. Cross-domain transfer at billion-parameter scale reveals unexpected regularization benefits. These findings demonstrate that transformer middle layers can support efficient modular transfer for decoder-only architectures, enabling privacy-preserving expertise sharing through checkpoint distribution. We note that this approach is currently limited to decoder-only models; preliminary experiments on encoder-based architectures show reduced effectiveness.
zh
[AI-39] GeoDynamics: A Geometric State-Space Neural Network for Understanding Brain Dynamics on Riemannian Manifolds NEURIPS2025
【速读】:该论文旨在解决现有状态空间模型(State-space Models, SSMs)在建模脑功能连接(Functional Connectivity, FC)时,未能充分考虑其几何结构的问题。传统方法通常将FC矩阵视为欧几里得空间中的向量,忽略了其作为对称正定(Symmetric Positive Definite, SPD)矩阵天然位于黎曼流形(Riemannian manifold)上的特性,从而限制了对脑动态演化过程的准确刻画。解决方案的关键在于提出GeoDynamics——一种基于几何感知的状态空间神经网络,它通过将每个FC矩阵嵌入到具有流形感知能力的循环框架中,显式地在高维SPD流形上学习平滑且符合几何约束的潜在脑状态轨迹,从而能够揭示任务驱动的状态变化及阿尔茨海默病、帕金森病和自闭症等神经疾病早期标志。
链接: https://arxiv.org/abs/2601.13570
作者: Tingting Dan,Jiaqi Ding,Guorong Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025
Abstract:State-space models (SSMs) have become a cornerstone for unraveling brain dynamics, revealing how latent neural states evolve over time and give rise to observed signals. By combining the flexibility of deep learning with the principled dynamical structure of SSMs, recent studies have achieved powerful fits to functional neuroimaging data. However, most existing approaches still view the brain as a set of loosely connected regions or impose oversimplified network priors, falling short of a truly holistic and self-organized dynamical system perspective. Brain functional connectivity (FC) at each time point naturally forms a symmetric positive definite (SPD) matrix, which resides on a curved Riemannian manifold rather than in Euclidean space. Capturing the trajectories of these SPD matrices is key to understanding how coordinated networks support cognition and behavior. To this end, we introduce GeoDynamics, a geometric state-space neural network that tracks latent brain-state trajectories directly on the high-dimensional SPD manifold. GeoDynamics embeds each connectivity matrix into a manifold-aware recurrent framework, learning smooth and geometry-respecting transitions that reveal task-driven state changes and early markers of Alzheimer’s disease, Parkinson’s disease, and autism. Beyond neuroscience, we validate GeoDynamics on human action recognition benchmarks (UTKinect, Florence, HDM05), demonstrating its scalability and robustness in modeling complex spatiotemporal dynamics across diverse domains.
zh
[AI-40] Multi-objective fluorescent molecule design with a data-physics dual-driven generative framework
【速读】:该论文旨在解决荧光小分子逆向设计中面临的多目标优化难题,即在庞大且未充分探索的化学空间内,如何高效生成满足特定光学和理化性质要求的分子结构。传统“生成-评分-筛选”方法因搜索效率低、机器学习预测泛化能力不可靠以及量子化学计算成本高昂而难以适用。其解决方案的关键在于提出LUMOS框架,该框架通过共享潜在表示将生成器与预测器耦合,实现从属性规格到分子结构的直接映射;同时融合神经网络与快速时依赖密度泛函理论(TD-DFT)计算流程,构建兼具速度、精度与泛化能力的互补预测模块;并引入基于属性引导的扩散模型与多目标进化算法,实现骨架级和片段级的多约束分子优化,从而在多个基准测试中显著优于基线模型,在荧光性质预测准确性、泛化能力和物理合理性方面表现优异,并通过TD-DFT和分子动力学(MD)模拟验证了所生成荧光团的有效性。
链接: https://arxiv.org/abs/2601.13564
作者: Yanheng Li,Zhichen Pu,Lijiang Yang,Zehao Zhou,Yi Qin Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
备注: Total 43 pages: 32 pages Main Text + 11 pages SI
Abstract:Designing fluorescent small molecules with tailored optical and physicochemical properties requires navigating vast, underexplored chemical space while satisfying multiple objectives and constraints. Conventional generate-score-screen approaches become impractical under such realistic design specifications, owing to their low search efficiency, unreliable generalizability of machine-learning prediction, and the prohibitive cost of quantum chemical calculation. Here we present LUMOS, a data-and-physics driven framework for inverse design of fluorescent molecules. LUMOS couples generator and predictor within a shared latent representation, enabling direct specification-to-molecule design and efficient exploration. Moreover, LUMOS combines neural networks with a fast time-dependent density functional theory (TD-DFT) calculation workflow to build a suite of complementary predictors spanning different trade-offs in speed, accuracy, and generalizability, enabling reliable property prediction across diverse scenarios. Finally, LUMOS employs a property-guided diffusion model integrated with multi-objective evolutionary algorithms, enabling de novo design and molecular optimization under multiple objectives and constraints. Across comprehensive benchmarks, LUMOS consistently outperforms baseline models in terms of accuracy, generalizability and physical plausibility for fluorescence property prediction, and demonstrates superior performance in multi-objective scaffold- and fragment-level molecular optimization. Further validation using TD-DFT and molecular dynamics (MD) simulations demonstrates that LUMOS can generate valid fluorophores that meet various target specifications. Overall, these results establish LUMOS as a data-physics dual-driven framework for general fluorophore inverse design.
zh
[AI-41] ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
【速读】:该论文旨在解决多专家模型(MoE)中因线性内存扩展导致的边缘设备内存瓶颈问题,即传统方法存储 N 个独立专家权重矩阵所需内存为 O(N⋅d2),难以部署于资源受限场景。其解决方案的关键在于提出 ButterflyMoE,通过将专家视为对统一量化基底(shared quantized substrate)进行学习后旋转(learned rotations)的几何重定向,而非冗余存储独立权重矩阵,从而实现子线性内存增长(O(d2+N⋅dlogd))。这一几何参数化策略不仅显著降低内存占用(如在256个专家时减少150倍),还通过量化与旋转联合训练稳定了极端低比特训练过程,突破了传统压缩方法仅能优化常数因子的局限。
链接: https://arxiv.org/abs/2601.13563
作者: Aryan Karmore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Linear memory scaling stores N independent expert weight matrices requiring \mathcalO(N \cdot d^2) memory, which exceeds edge devices memory budget. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, each expert yields \mathcalO(d^2 + N \cdot d \log d) memory – sub-linear in the number of experts. The key insight: training these rotations with quantization reduces activation outliers and stabilizes extreme low bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves 150 times memory reduction at 256 experts with negligible accuracy loss. This allows 64 experts to fit on 4GB devices compared to standard MoE’s 8 experts, showing geometric parametrization breaks linear scaling.
zh
[AI-42] Agent GC: Evolutionary Learning-based Lossless Compression for Genomics Data with LLM -driven Multiple Agent
【速读】:该论文旨在解决当前基于学习的基因组数据(Genomics Data, GD)压缩方法存在的不可演化性、低层次建模能力不足以及适应性有限等问题。其核心解决方案是提出首个基于代理(Agent-based)的进化式GD压缩框架AgentGC,该框架由三层结构组成:用户层通过领导者(Leader)与大语言模型(Large Language Model, LLM)结合实现友好交互;认知层由Leader驱动,利用LLM协同优化算法-数据-系统三者关系,提升压缩模型的适应性和智能水平;压缩层由工作者(Worker)主导,采用自动化多知识学习框架执行压缩与解压缩操作。该设计显著提升了压缩比和吞吐量,在9个数据集上相较14个基线方法平均压缩比提升达16.33%,吞吐量最高提升9.23倍。
链接: https://arxiv.org/abs/2601.13559
作者: Sun Hui,Ding Yanfeng,Huidong Ma,Chang Xu,Keyan Jin,Lizheng Zu,Cheng Zhong,xiaoguang Liu,Gang Wang,Wentong Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Lossless compression has made significant advancements in Genomics Data (GD) storage, sharing and management. Current learning-based methods are non-evolvable with problems of low-level compression modeling, limited adaptability, and user-unfriendly interface. To this end, we propose AgentGC, the first evolutionary Agent-based GD Compressor, consisting of 3 layers with multi-agent named Leader and Worker. Specifically, the 1) User layer provides a user-friendly interface via Leader combined with LLM; 2) Cognitive layer, driven by the Leader, integrates LLM to consider joint optimization of algorithm-dataset-system, addressing the issues of low-level modeling and limited adaptability; and 3) Compression layer, headed by Worker, performs compression decompression via a automated multi-knowledge learning-based compression framework. On top of AgentGC, we design 3 modes to support diverse scenarios: CP for compression-ratio priority, TP for throughput priority, and BM for balanced mode. Compared with 14 baselines on 9 datasets, the average compression ratios gains are 16.66%, 16.11%, and 16.33%, the throughput gains are 4.73x, 9.23x, and 9.15x, respectively.
zh
[AI-43] ChatAD: Reasoning -Enhanced Time-Series Anomaly Detection with Multi-Turn Instruction Evolution
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的时间序列异常检测(Time Series Anomaly Detection, TS AD)方法中存在的推理能力不足、多轮对话能力欠缺以及跨任务泛化能力有限等问题。其解决方案的关键在于:1)提出基于多智能体的时间序列演化算法 TSEvol,以增强对时序数据的动态理解;2)构建包含多轮对话能力的异常检测数据集 TSEData-20K 及对应的 ChatAD 系列模型(如 ChatAD-Llama3-8B、Qwen2.5-7B 和 Mistral-7B),提升解释性与交互能力;3)引入 TS Kahneman-Tversky 优化(TS Kahneman-Tversky Optimization, TKTO),显著增强模型在分类、预测和插补等不同任务间的跨任务泛化性能;4)设计 LLM 驱动的学习型异常检测基准 LLADBench,系统评估 ChatAD 与九种基线方法在七个数据集上的综合表现,最终实现准确率最高提升 34.50%、F1 分数提升 34.71%、误报率降低 37.42% 的显著效果。
链接: https://arxiv.org/abs/2601.13546
作者: Hui Sun,Chang Xu,Haonan Xie,Hao Li,Yuhao Huang,Chuheng Zhang,Ming Jin,Xiaoguang Liu,Gang Wang,Jiang Bian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-driven Anomaly Detection (AD) helps enhance the understanding and explanatory abilities of anomalous behaviors in Time Series (TS). Existing methods face challenges of inadequate reasoning ability, deficient multi-turn dialogue capability, and narrow generalization. To this end, we 1) propose a multi-agent-based TS Evolution algorithm named TSEvol. On top of it, we 2) introduce the AD reasoning and multi-turn dialogue Dataset TSEData-20K and contribute the Chatbot family for AD, including ChatAD-Llama3-8B, Qwen2.5-7B, and Mistral-7B. Furthermore, 3) we propose the TS Kahneman-Tversky Optimization (TKTO) to enhance ChatAD’s cross-task generalization capability. Lastly, 4) we propose a LLM-driven Learning-based AD Benchmark LLADBench to evaluate the performance of ChatAD and nine baselines across seven datasets and tasks. Our three ChatAD models achieve substantial gains, up to 34.50% in accuracy, 34.71% in F1, and a 37.42% reduction in false positives. Besides, via KTKO, our optimized ChatAD achieves competitive performance in reasoning and cross-task generalization on classification, forecasting, and imputation.
zh
[AI-44] ruthTensor: Evaluating LLM s Human Imitation through Prediction Market Drift and Holistic Reasoning
【速读】:该论文旨在解决当前语言模型(Language Models, LM)和AI代理评估中存在的根本性挑战,即静态基准无法捕捉现实世界中的不确定性、分布偏移,以及孤立任务准确率与人类对齐决策之间的差距。其解决方案的核心是提出TruthTensor这一新型、可复现的评估范式,将大语言模型(Large Language Models, LLMs)不仅视为预测引擎,更视为在社会语境驱动的高熵环境中运行的人类模仿系统。TruthTensor通过锚定于实时预测市场并结合概率评分机制,提供对模型行为的全局视图,并引入以漂移为中心的诊断方法和显式的鲁棒性检查,从而实现对模型在准确性、校准度、叙事稳定性、成本和资源效率等多维度的综合评估。该框架还明确区分人工与自动化评估角色、制定标注协议与统计检验流程,确保结果的可解释性和可复现性,最终推动LLM在真实决策场景中获得可辩护的性能评估。
链接: https://arxiv.org/abs/2601.13545
作者: Shirin Shahabi,Spencer Graham,Haruna Isah
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注: 16 pages, 6 figures, 2 tables
Abstract:Evaluating language models and AI agents remains fundamentally challenging because static benchmarks fail to capture real-world uncertainty, distribution shift, and the gap between isolated task accuracy and human-aligned decision-making under evolving conditions. This paper introduces TruthTensor, a novel, reproducible evaluation paradigm that measures Large Language Models (LLMs) not only as prediction engines but as human-imitation systems operating in socially-grounded, high-entropy environments. Building on forward-looking, contamination-free tasks, our framework anchors evaluation to live prediction markets and combines probabilistic scoring to provide a holistic view of model behavior. TruthTensor complements traditional correctness metrics with drift-centric diagnostics and explicit robustness checks for reproducibility. It specify human vs. automated evaluation roles, annotation protocols, and statistical testing procedures to ensure interpretability and replicability of results. In experiments across 500+ real markets (political, economic, cultural, technological), TruthTensor demonstrates that models with similar forecast accuracy can diverge markedly in calibration, drift, and risk-sensitivity, underscoring the need to evaluate models along multiple axes (accuracy, calibration, narrative stability, cost, and resource efficiency). TruthTensor therefore operationalizes modern evaluation best practices, clear hypothesis framing, careful metric selection, transparent compute/cost reporting, human-in-the-loop validation, and open, versioned evaluation contracts, to produce defensible assessments of LLMs in real-world decision contexts. We publicly release TruthTensor at this https URL
zh
[AI-45] MN-TSG:Continuous Time Series Generation with Irregular Observations
【速读】:该论文旨在解决时间序列生成(Time Series Generation, TSG)中因现实世界数据常为不规则采样且稀疏观测,而现有方法多假设规则采样与固定输出分辨率所导致的不匹配问题,尤其在临床监测等场景下,需从不规则输入生成连续、高分辨率的时间序列。其解决方案的关键在于提出MN-TSG框架,该框架基于混合专家(Mixture-of-Experts, MoE)结构设计神经控制微分方程(Neural Controlled Differential Equations, NCDEs),通过动态参数化专家函数和解耦架构实现更有效的MoE动态优化,并结合已有TSG模型学习专家混合与生成序列的联合分布,从而不仅能够生成新样本,还能为每个样本定制合适的专家配置,支持精细化的连续时间序列生成。
链接: https://arxiv.org/abs/2601.13534
作者: Xu Zhang,Junwei Deng,Chang Xu,Hao Li,Jiang Bian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages
Abstract:Time series generation (TSG) plays a critical role in a wide range of domains, such as healthcare. However, most existing methods assume regularly sampled observations and fixed output resolutions, which are often misaligned with real-world scenarios where data are irregularly sampled and sparsely observed. This mismatch is particularly problematic in applications such as clinical monitoring, where irregular measurements must support downstream tasks requiring continuous and high-resolution time series. Neural Controlled Differential Equations (NCDEs) have shown strong potential for modeling irregular time series, yet they still face challenges in capturing complex dynamic temporal patterns and supporting continuous TSG. To address these limitations, we propose MN-TSG, a novel framework that explores Mixture-of-Experts (MoE)-based NCDEs and integrates them with existing TSG models for irregular and continuous generation tasks. The core of MN-TSG lies in a MoE-NCDE architecture with dynamically parameterized expert functions and a decoupled design that facilitates more effective optimization of MoE dynamics. Furthermore, we leverage existing TSG models to learn the joint distribution over the mixture of experts and the generated time series. This enables the framework not only to generate new samples, but also to produce appropriate expert configurations tailored to each sample, thereby supporting refined continuous TSG. Extensive experiments on ten public and synthetic datasets demonstrate the effectiveness of MN-TSG, consistently outperforming strong TSG baselines on both irregular-to-regular and irregular-to-continuous generation tasks. Comments: 34 pages Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.13534 [cs.LG] (or arXiv:2601.13534v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.13534 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xu Zhang [view email] [v1] Tue, 20 Jan 2026 02:45:03 UTC (13,683 KB)
zh
[AI-46] Reasoning While Recommending: Entropy-Guided Latent Reasoning in Generative Re-ranking Models
【速读】:该论文旨在解决生成式重排序(generative re-ranking)场景中,现有方法难以适应模型难度动态变化所引发的熵(entropy)波动问题,从而无法准确捕捉复杂偏好。其解决方案的关键在于引入一种熵引导的潜在推理机制(Entropy-Guided Latent Reasoning, EGLR),通过“推理与推荐同步进行”的设计,实现生成过程中的实时推理;同时采用上下文感知的推理标记与动态温度调整相结合的方式,实现变长推理,以扩展探索范围并提升推荐精度,从而更精准地平衡探索与利用(exploration-exploitation)关系。该方案还具备轻量级集成特性,无需额外复杂模块或后处理,可无缝兼容现有生成式重排序模型并显著增强其性能。
链接: https://arxiv.org/abs/2601.13533
作者: Changshuo Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning plays a crucial role in generative re-ranking scenarios due to its exploration-exploitation capabilities, but existing generative methods mostly fail to adapt to the dynamic entropy changes in model difficulty during list generation, making it challenging to accurately capture complex preferences. Given that language models have achieved remarkable breakthroughs by integrating reasoning capabilities, we draw on this approach to introduce a latent reasoning mechanism, and experimental validation demonstrates that this mechanism effectively reduces entropy in the model’s decision-making process. Based on these findings, we introduce the Entropy-Guided Latent Reasoning (EGLR) recommendation model, which has three core advantages. First, it abandons the “reason first, recommend later” paradigm to achieve “reasoning while recommending”, specifically designed for the high-difficulty nature of list generation by enabling real-time reasoning during generation. Second, it implements entropy-guided variable-length reasoning using context-aware reasoning token alongside dynamic temperature adjustment, expanding exploration breadth in reasoning and boosting exploitation precision in recommending to achieve a more precisely adapted exploration-exploitation trade-off. Third, the model adopts a lightweight integration design with no complex independent modules or post-processing, enabling easy adaptation to existing models. Experimental results on two real-world datasets validate the model’s effectiveness, and its notable advantage lies in being compatible with existing generative re-ranking models to enhance their performance. Further analyses also demonstrate its practical deployment value and research potential.
zh
[AI-47] Agent icRed: Optimizing Agent ic Systems for Automated Red-teaming
【速读】:该论文旨在解决当前自动化红队测试(red-teaming)方法依赖人工设计工作流所带来的局限性,如人类偏见和探索设计空间成本高昂的问题。其解决方案的关键在于提出AgenticRed——一个完全自动化的红队系统设计流水线,利用大语言模型(LLM)的上下文学习能力,将红队测试视为系统设计问题而非单纯优化攻击策略。通过引入基于进化选择的代理系统演化机制,AgenticRed无需人工干预即可迭代生成和优化红队系统,显著提升了攻击成功率(ASR),并在多个开源与闭源模型上展现出优异的迁移性能,验证了自动化系统设计在AI安全评估中的有效性与前瞻性。
链接: https://arxiv.org/abs/2601.13518
作者: Jiayi Yuan,Jonathan Nöther,Natasha Jaques,Goran Radanović
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Website: this https URL
Abstract:While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs’ in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem. Inspired by methods like Meta Agent Search, we develop a novel procedure for evolving agentic systems using evolutionary selection, and apply it to the problem of automatic red-teaming. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate (ASR) on Llama-2-7B (36% improvement) and 98% on Llama-3-8B on HarmBench. Our approach exhibits strong transferability to proprietary models, achieving 100% ASR on GPT-3.5-Turbo and GPT-4o-mini, and 60% on Claude-Sonnet-3.5 (24% improvement). This work highlights automated system design as a powerful paradigm for AI safety evaluation that can keep pace with rapidly evolving models.
zh
[AI-48] Automatic Adjustment of HPA Parameters and Attack Prevention in Kubernetes Using Random Forests
【速读】:该论文旨在解决在容器化应用中,针对服务端的流量攻击(如DDoS)导致自动扩缩容机制(HPA, Horizontal Pod Autoscaler)异常膨胀、资源浪费及服务可用性下降的问题。解决方案的关键在于:利用HTTP状态码作为自定义指标驱动HPA,并引入随机森林(Random Forest)分类算法对攻击行为进行识别与预测,动态调整HPA的最大副本数(maxReplicas)参数;同时将来自攻击IP的访问重定向至蜜罐(honeypot)Pod,从而有效隔离攻击流量、降低5XX错误率,并防止因攻击引发的非预期扩缩容行为。实验表明,合理设置HPA调整阈值是保障该方案有效性的重要前提。
链接: https://arxiv.org/abs/2601.13515
作者: Hanlin Zhou,Huah Yong Chan,Jingfei Ni,Mengchun Wu,Qing Deng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:In this paper, HTTP status codes are used as custom metrics within the HPA as the experimental scenario. By integrating the Random Forest classification algorithm from machine learning, attacks are assessed and predicted, dynamically adjusting the maximum pod parameter in the HPA to manage attack traffic. This approach enables the adjustment of HPA parameters using machine learning scripts in targeted attack scenarios while effectively managing attack traffic. All access from attacking IPs is redirected to honeypot pods, achieving a lower incidence of 5XX status codes through HPA pod adjustments under high load conditions. This method also ensures effective isolation of attack traffic, preventing excessive HPA expansion due to attacks. Additionally, experiments conducted under various conditions demonstrate the importance of setting appropriate thresholds for HPA adjustments.
zh
[AI-49] owards Efficient and Robust Linguistic Emotion Diagnosis for Mental Health via Multi-Agent Instruction Refinement
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险、强情境依赖的医疗场景中进行情绪诊断时,因提示(prompt)设计敏感而导致的诊断可靠性不足问题,以及现有方法面临的两个关键挑战:情绪共病(emotional comorbidity)导致多情绪状态交织难以准确预测,和临床相关线索探索效率低下。解决方案的关键在于提出APOLO(Automated Prompt Optimization for Linguistic Emotion Diagnosis)框架,通过将指令优化建模为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process),并引入包含Planner、Teacher、Critic、Student与Target角色的多智能体协作机制,在闭环系统中实现对提示空间的细粒度、系统性探索,从而提升诊断准确性与鲁棒性。
链接: https://arxiv.org/abs/2601.13481
作者: Jian Zhang,Zhangqi Wang,Zhiyuan Wang,Weiping Fu,Yu He,Haiping Zhu,Qika Lin,Jun Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Linguistic expressions of emotions such as depression, anxiety, and trauma-related states are pervasive in clinical notes, counseling dialogues, and online mental health communities, and accurate recognition of these emotions is essential for clinical triage, risk assessment, and timely intervention. Although large language models (LLMs) have demonstrated strong generalization ability in emotion analysis tasks, their diagnostic reliability in high-stakes, context-intensive medical settings remains highly sensitive to prompt design. Moreover, existing methods face two key challenges: emotional comorbidity, in which multiple intertwined emotional states complicate prediction, and inefficient exploration of clinically relevant cues. To address these challenges, we propose APOLO (Automated Prompt Optimization for Linguistic Emotion Diagnosis), a framework that systematically explores a broader and finer-grained prompt space to improve diagnostic efficiency and robustness. APOLO formulates instruction refinement as a Partially Observable Markov Decision Process and adopts a multi-agent collaboration mechanism involving Planner, Teacher, Critic, Student, and Target roles. Within this closed-loop framework, the Planner defines an optimization trajectory, while the Teacher-Critic-Student agents iteratively refine prompts to enhance reasoning stability and effectiveness, and the Target agent determines whether to continue optimization based on performance evaluation. Experimental results show that APOLO consistently improves diagnostic accuracy and robustness across domain-specific and stratified benchmarks, demonstrating a scalable and generalizable paradigm for trustworthy LLM applications in mental healthcare.
zh
[AI-50] A Unified Variational Imputation Framework for Electric Vehicle Charging Data Using Retrieval-Augmented Language Model
【速读】:该论文旨在解决电动汽车(Electric Vehicle, EV)基础设施中数据驱动应用(如充电需求预测)因实际充电数据存在缺失而影响可靠性的问题。现有插补方法难以应对充电数据的复杂多模态特性,且通常采用“每站点一个模型”的限制性范式,忽略了站点间的相关性。其解决方案的关键在于提出一种基于概率变分的插补框架——PRAIM(PRobabilistic variational imputation framework),该框架利用预训练语言模型将时间序列需求、日历特征和地理空间信息统一编码为语义丰富的表示,并通过检索增强记忆机制从整个充电网络中动态获取相关示例,从而构建一个由变分神经架构驱动的单一统一插补模型,有效缓解数据稀疏问题并保留原始数据统计分布特性,显著提升下游预测性能。
链接: https://arxiv.org/abs/2601.13476
作者: Jinhao Li,Hao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:The reliability of data-driven applications in electric vehicle (EV) infrastructure, such as charging demand forecasting, hinges on the availability of complete, high-quality charging data. However, real-world EV datasets are often plagued by missing records, and existing imputation methods are ill-equipped for the complex, multimodal context of charging data, often relying on a restrictive one-model-per-station paradigm that ignores valuable inter-station correlations. To address these gaps, we develop a novel PRobabilistic variational imputation framework that leverages the power of large lAnguage models and retrIeval-augmented Memory (PRAIM). PRAIM employs a pre-trained language model to encode heterogeneous data, spanning time-series demand, calendar features, and geospatial context, into a unified, semantically rich representation. This is dynamically fortified by retrieval-augmented memory that retrieves relevant examples from the entire charging network, enabling a single, unified imputation model empowered by variational neural architecture to overcome data sparsity. Extensive experiments on four public datasets demonstrate that PRAIM significantly outperforms established baselines in both imputation accuracy and its ability to preserve the original data’s statistical distribution, leading to substantial improvements in downstream forecasting performance.
zh
[AI-51] Preconditioning Benefits of Spectral Orthogonalization in Muon
【速读】:该论文旨在解决大型语言模型预训练中Muon优化器(Muon optimizer)的内在机制不明确问题,尤其是梯度正交化(gradient orthogonalization)在其中的作用缺乏系统性理论解释。其解决方案的关键在于提出并分析一个简化的Muon变体,在矩阵分解和线性Transformer的上下文学习两个具体任务中证明:该简化版本能够以与相关条件数无关的迭代复杂度实现线性收敛,显著优于梯度下降(Gradient Descent)和Adam优化器。理论分析揭示了Muon动力学在谱域中可分解为一系列独立的标量序列,每条序列均表现出一致的收敛特性,从而形式化地阐明了谱正交化所引发的预处理效应(preconditioning effect),为理解Muon在矩阵优化问题中的有效性提供了严谨依据,并可能推广至更广泛的优化场景。
链接: https://arxiv.org/abs/2601.13474
作者: Jianhao Ma,Yu Huang,Yuejie Chi,Yuxin Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:The Muon optimizer, a matrix-structured algorithm that leverages spectral orthogonalization of gradients, is a milestone in the pretraining of large language models. However, the underlying mechanisms of Muon – particularly the role of gradient orthogonalization – remain poorly understood, with very few works providing end-to-end analyses that rigorously explain its advantages in concrete applications. We take a step by studying the effectiveness of a simplified variant of Muon through two case studies: matrix factorization, and in-context learning of linear transformers. For both problems, we prove that simplified Muon converges linearly with iteration complexities independent of the relevant condition number, provably outperforming gradient descent and Adam. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior. Our theory formalizes the preconditioning effect induced by spectral orthogonalization, offering insight into Muon’s effectiveness in these matrix optimization problems and potentially beyond.
zh
[AI-52] Graph Neural Networks are Heuristics
【速读】:该论文试图解决的问题是:如何在不依赖监督学习或显式搜索策略的情况下,使图神经网络(Graph Neural Network, GNN)成为有效的组合优化启发式算法。解决方案的关键在于将全局结构约束编码为归纳偏置(inductive bias),从而让非自回归模型通过单一前向传播直接生成解,无需序列决策或外部搜索机制;同时,在推理阶段利用Dropout和快照集成(snapshot ensembling)提升解的多样性,从而显著缩小与最优解之间的差距。这一方法表明,GNN可以内化组合优化问题的全局结构信息,并作为强学习启发式直接使用,而非仅用于增强传统算法。
链接: https://arxiv.org/abs/2601.13465
作者: Yimeng Min,Carla P. Gomes
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We demonstrate that a single training trajectory can transform a graph neural network into an unsupervised heuristic for combinatorial optimization. Focusing on the Travelling Salesman Problem, we show that encoding global structural constraints as an inductive bias enables a non-autoregressive model to generate solutions via direct forward passes, without search, supervision, or sequential decision-making. At inference time, dropout and snapshot ensembling allow a single model to act as an implicit ensemble, reducing optimality gaps through increased solution diversity. Our results establish that graph neural networks do not require supervised training nor explicit search to be effective. Instead, they can internalize global combinatorial structure and function as strong, learned heuristics. This reframes the role of learning in combinatorial optimization: from augmenting classical algorithms to directly instantiating new heuristics.
zh
[AI-53] Context and Transcripts Improve Detection of Deepfake Audios of Public Figures
【速读】:该论文旨在解决当前音频深度伪造检测方法仅依赖音频本身而忽视语境信息(如上下文或文本转录)导致检测性能受限的问题。其解决方案的关键在于提出一种基于语境的音频深度伪造检测架构(Context-based Audio Deepfake Detector, CADD),通过融合音频内容与上下文信息或文本转录,显著提升检测准确性和鲁棒性。实验表明,引入足够语境和/或转录可使多种基线模型的F1分数、AUC和等错误率(EER)提升5%–37.58%、3.77%–42.79%和6.17%–47.83%,且CADD在面对五种对抗规避策略时表现更稳定,平均性能下降仅为-0.71%。
链接: https://arxiv.org/abs/2601.13464
作者: Chongyang Gao,Marco Postiglione,Julian Baldwin,Natalia Denisenko,Isabel Gortner,Luke Fosdick,Chiara Pulice,Sarit Kraus,V.S. Subrahmanian
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Humans use context to assess the veracity of information. However, current audio deepfake detectors only analyze the audio file without considering either context or transcripts. We create and analyze a Journalist-provided Deepfake Dataset (JDD) of 255 public deepfakes which were primarily contributed by over 70 journalists since early 2024. We also generate a synthetic audio dataset (SYN) of dead public figures and propose a novel Context-based Audio Deepfake Detector (CADD) architecture. In addition, we evaluate performance on two large-scale datasets: ITW and P ^2 V. We show that sufficient context and/or the transcript can significantly improve the efficacy of audio deepfake detectors. Performance (measured via F1 score, AUC, and EER) of multiple baseline audio deepfake detectors and traditional classifiers can be improved by 5%-37.58% in F1-score, 3.77%-42.79% in AUC, and 6.17%-47.83% in EER. We additionally show that CADD, via its use of context and/or transcripts, is more robust to 5 adversarial evasion strategies, limiting performance degradation to an average of just -0.71% across all experiments. Code, models, and datasets are available at our project page: this https URL (access restricted during review).
zh
[AI-54] SpatialBench-UC: Uncertainty-Aware Evaluation of Spatial Prompt Following in Text-to-Image Generation
【速读】:该论文旨在解决文本到图像生成模型(text-to-image models)在执行显式空间指令时难以自动化评估的问题。传统方法如目标检测器可能遗漏目标或产生多个合理但不准确的检测结果,而简单的几何测试在边界情况下易出现歧义。为此,作者提出SpatialBench-UC,一个小型且可复现的成对空间关系基准,包含200个提示(50个对象对×4种空间关系),并通过对对象角色互换构造100对反事实样本以增强评估严谨性。其关键解决方案是将空间评估建模为选择性预测问题:检查器可在证据不足时主动放弃判断,并输出置信度,从而将结果解释为风险覆盖权衡而非单一分数。该设计显著提升了评估的可靠性与可解释性,实验证明基于定位的方法能大幅提高通过率和覆盖率,但因检测缺失导致的弃权仍是主要限制因素。
链接: https://arxiv.org/abs/2601.13462
作者: Amine Rostane
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, includes figures and tables
Abstract:Evaluating whether text-to-image models follow explicit spatial instructions is difficult to automate. Object detectors may miss targets or return multiple plausible detections, and simple geometric tests can become ambiguous in borderline cases. Spatial evaluation is naturally a selective prediction problem, the checker may abstain when evidence is weak and report confidence so that results can be interpreted as a risk coverage tradeoff rather than a single score. We introduce SpatialBench-UC, a small, reproducible benchmark for pairwise spatial relations. The benchmark contains 200 prompts (50 object pairs times 4 relations) grouped into 100 counterfactual pairs obtained by swapping object roles. We release a benchmark package, versioned prompts, pinned configs, per-sample checker outputs, and report tables, enabling reproducible and auditable comparisons across models. We also include a lightweight human audit used to calibrate the checker’s abstention margin and confidence threshold. We evaluate three baselines, Stable Diffusion 1.5, SD 1.5 BoxDiff, and SD 1.4 GLIGEN. The checker reports pass rate and coverage as well as conditional pass rates on decided samples. The results show that grounding methods substantially improve both pass rate and coverage, while abstention remains a dominant factor due mainly to missing detections.
zh
[AI-55] Explicit Cognitive Allocation: A Principle for Governed and Auditable Inference in Large Language Models
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在辅助推理过程中存在的认知结构缺失问题,即问题定义、知识探索、方法意识与解释等关键认知功能被压缩至单一生成流程中,导致可追溯性差、认识论控制弱化以及高责任场景下的可复现性不足。其解决方案的核心是提出“显式认知分配”(Explicit Cognitive Allocation)原则,通过将认知功能显式分离并协同调度来重构AI辅助推理过程;具体实现为“通用认知代理”(Cognitive Universal Agent, CUA)架构,该架构将推理划分为探索与框架构建、认识论锚定、工具与方法映射及解释合成四个阶段,并引入“通用认知工具”(Universal Cognitive Instruments, UCIs)以形式化各类异构手段(如计算、实验、组织、监管和教育工具),从而系统性地暴露探究的工具结构并提升推理的结构化程度与可控性。
链接: https://arxiv.org/abs/2601.13443
作者: Héctor Manuel Manzanilla-Granados,Zaira Navarrete-Cazales,Miriam Pescador-Rojas,Tonahtiu Ramírez-Romero
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. This version corresponds to the initial public release of the CUA architecture and associated evaluation metrics
Abstract:The rapid adoption of large language models (LLMs) has enabled new forms of AI-assisted reasoning across scientific, technical, and organizational domains. However, prevailing modes of LLM use remain cognitively unstructured: problem framing, knowledge exploration, retrieval, methodological awareness, and explanation are typically collapsed into a single generative process. This cognitive collapse limits traceability, weakens epistemic control, and undermines reproducibility, particularly in high-responsibility settings. We introduce Explicit Cognitive Allocation, a general principle for structuring AI-assisted inference through the explicit separation and orchestration of epistemic functions. We instantiate this principle in the Cognitive Universal Agent (CUA), an architecture that organizes inference into distinct stages of exploration and framing, epistemic anchoring, instrumental and methodological mapping, and interpretive synthesis. Central to this framework is the notion of Universal Cognitive Instruments (UCIs), which formalize heterogeneous means, including computational, experimental, organizational, regulatory, and educational instruments, through which abstract inquiries become investigable. We evaluate the effects of explicit cognitive and instrumental allocation through controlled comparisons between CUA-orchestrated inference and baseline LLM inference under matched execution conditions. Across multiple prompts in the agricultural domain, CUA inference exhibits earlier and structurally governed epistemic convergence, higher epistemic alignment under semantic expansion, and systematic exposure of the instrumental landscape of inquiry. In contrast, baseline LLM inference shows greater variability in alignment and fails to explicitly surface instrumental structure. Comments: Preprint. This version corresponds to the initial public release of the CUA architecture and associated evaluation metrics Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2601.13443 [cs.AI] (or arXiv:2601.13443v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.13443 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-56] A Learnable Wavelet Transformer for Long-Short Equity Trading and Risk-Adjusted Return Optimization
【速读】:该论文旨在解决从金融时间序列中学习高盈利日内交易策略的难题,其核心挑战在于数据噪声大、非平稳性强以及相关资产间存在显著的横截面依赖关系。解决方案的关键在于提出一种可学习的小波基长短期Transformer模型(WaveLSFormer),该模型通过端到端训练的滤波器组实现多尺度分解,并引入频谱正则化以确保稳定且分离良好的频带;同时设计低频引导高频注入(LGHI)模块,在控制训练稳定性的同时融合多尺度信息,最终输出满足固定风险预算的多空头寸组合,并直接以交易目标和风险感知正则化进行优化。
链接: https://arxiv.org/abs/2601.13435
作者: Shuozhe Li,Du Cheng,Leqi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning profitable intraday trading policies from financial time series is challenging due to heavy noise, non-stationarity, and strong cross-sectional dependence among related assets. We propose \emphWaveLSFormer, a learnable wavelet-based long-short Transformer that jointly performs multi-scale decomposition and return-oriented decision learning. Specifically, a learnable wavelet front-end generates low-/high-frequency components via an end-to-end trained filter bank, guided by spectral regularizers that encourage stable and well-separated frequency bands. To fuse multi-scale information, we introduce a low-guided high-frequency injection (LGHI) module that refines low-frequency representations with high-frequency cues while controlling training stability. The model outputs a portfolio of long/short positions that is rescaled to satisfy a fixed risk budget, and is optimized directly with a trading objective and risk-aware regularization. Extensive experiments on five years of hourly data across six industry groups, evaluated over ten random seeds, demonstrate that WaveLSFormer consistently outperforms MLP, LSTM and Transformer backbones, with and without fixed discrete wavelet front-ends. On average in all industries, WaveLSFormer achieves a cumulative overall strategy return of 0.607 \pm 0.045 and a Sharpe ratio of 2.157 \pm 0.166 , substantially improving both profitability and risk-adjusted returns over the strongest baselines.
zh
[AI-57] rustEnergy: A Unified Framework for Accurate and Reliable User-level Energy Usage Prediction
【速读】:该论文旨在解决用户级能源使用预测中两个关键问题:一是现有深度学习方法往往忽略家庭间的空间相关性或难以实现个体化预测,导致细粒度预测精度不足;二是能源使用受极端天气等动态因素影响具有不确定性,而现有研究尚未充分探索可靠的不确定性量化方法。解决方案的关键在于提出一个统一框架TrustEnergy,其核心由两部分组成:(i) 层次化时空表征模块(Hierarchical Spatiotemporal Representation module),采用一种新颖的记忆增强时空图神经网络高效捕捉宏观与微观用电模式;(ii) 序列 conformalized 分位数回归模块(Sequential Conformalized Quantile Regression module),无需对数据分布做强假设即可动态调整置信区间,确保预测区间的有效性。实证结果表明,该框架在预测准确性和不确定性量化方面均优于当前最优基线方法。
链接: https://arxiv.org/abs/2601.13422
作者: Dahai Yu,Rongchao Xu,Dingyi Zhuang,Yuheng Bu,Shenhao Wang,Guang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Energy usage prediction is important for various real-world applications, including grid management, infrastructure planning, and disaster response. Although a plethora of deep learning approaches have been proposed to perform this task, most of them either overlook the essential spatial correlations across households or fail to scale to individualized prediction, making them less effective for accurate fine-grained user-level prediction. In addition, due to the dynamic and uncertain nature of energy usage caused by various factors such as extreme weather events, quantifying uncertainty for reliable prediction is also significant, but it has not been fully explored in existing work. In this paper, we propose a unified framework called TrustEnergy for accurate and reliable user-level energy usage prediction. There are two key technical components in TrustEnergy, (i) a Hierarchical Spatiotemporal Representation module to efficiently capture both macro and micro energy usage patterns with a novel memory-augmented spatiotemporal graph neural network, and (ii) an innovative Sequential Conformalized Quantile Regression module to dynamically adjust uncertainty bounds to ensure valid prediction intervals over time, without making strong assumptions about the underlying data distribution. We implement and evaluate our TrustEnergy framework by working with an electricity provider in Florida, and the results show our TrustEnergy can achieve a 5.4% increase in prediction accuracy and 5.7% improvement in uncertainty quantification compared to state-of-the-art baselines.
zh
[AI-58] Integrating Virtual Reality and Large Language Models for Team-Based Non-Technical Skills Training and Evaluation in the Operating Room
【速读】:该论文旨在解决外科团队在腹腔镜急症情境下非技术技能(Non-Technical Skills, NTS)训练与评估不足的问题,尤其是缺乏可扩展、客观且能支持分布式培训的工具。其解决方案的关键在于开发并验证了虚拟手术室团队体验平台(Virtual Operating Room Team Experience, VORTeX),该平台融合沉浸式虚拟现实(VR)与大语言模型(Large Language Model, LLM)行为分析技术,通过结构化提示词基于NOTSS框架自动识别团队沟通、决策、协作和领导力等行为,并生成可量化的交互图谱以反映团队内部的沟通结构与层级关系,从而实现对NTS的客观评估与数据驱动的自动化复盘。
链接: https://arxiv.org/abs/2601.13406
作者: Jacob Barker,Doga Demirel,Cullen Jackson,Anna Johansson,Robbin Miraglia,Darian Hoagland,Stephanie B. Jones,John Mitchell,Daniel B. Jones,Suvranu De
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 23 pages, 7 figures, 1 table, 2 Appendices
Abstract:Although effective teamwork and communication are critical to surgical safety, structured training for non-technical skills (NTS) remains limited compared with technical simulation. The ACS/APDS Phase III Team-Based Skills Curriculum calls for scalable tools that both teach and objectively assess these competencies during laparoscopic emergencies. We introduce the Virtual Operating Room Team Experience (VORTeX), a multi-user virtual reality (VR) platform that integrates immersive team simulation with large language model (LLM) analytics to train and evaluate communication, decision-making, teamwork, and leadership. Team dialogue is analyzed using structured prompts derived from the Non-Technical Skills for Surgeons (NOTSS) framework, enabling automated classification of behaviors and generation of directed interaction graphs that quantify communication structure and hierarchy. Two laparoscopic emergency scenarios, pneumothorax and intra-abdominal bleeding, were implemented to elicit realistic stress and collaboration. Twelve surgical professionals completed pilot sessions at the 2024 SAGES conference, rating VORTeX as intuitive, immersive, and valuable for developing teamwork and communication. The LLM consistently produced interpretable communication networks reflecting expected operative hierarchies, with surgeons as central integrators, nurses as initiators, and anesthesiologists as balanced intermediaries. By integrating immersive VR with LLM-driven behavioral analytics, VORTeX provides a scalable, privacy-compliant framework for objective assessment and automated, data-informed debriefing across distributed training environments.
zh
[AI-59] Can LLM s Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在代码生成与执行一致性方面存在的问题,即模型在正向编码(如代码生成)和反向解码(如代码理解或重构)过程中难以维持一致的推理能力。为系统评估这一问题,作者提出RoundTripCodeEval(RTCE),这是一个包含四个不同代码执行推理任务的综合性基准测试,其核心创新在于提供一种无需实际执行代码的精确匹配评估方式,用于衡量模型在正向与反向操作中是否保持严格的双射(bijection)映射关系。实验表明,尽管采用零样本提示、基于执行轨迹的监督微调以及自省机制等多种优化策略,现有代码大语言模型仍无法显著提升这种轮转一致性,揭示了它们在内部逻辑连贯性上的根本局限,从而凸显出构建可信赖代码推理系统的关键挑战。
链接: https://arxiv.org/abs/2601.13398
作者: Nickil Maveli,Antonio Vergari,Shay B. Cohen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 32 pages (preprint)
Abstract:LLMs demonstrate strong performance on code benchmarks, yet round-trip code execution reveals limitations in their ability to maintain consistent reasoning across forward and backward execution. We present RoundTripCodeEval (RTCE), a comprehensive benchmark consisting of four distinct code execution reasoning tasks designed to rigorously test round-trip consistency. RTCE provides an execution-free, exact-match evaluation of bijection fidelity, assessing whether models preserve a consistent one-to-one mapping between encoding and decoding operations across various algorithms and directions. We systematically evaluate state-of-the-art Code-LLMs using zero-shot prompting, supervised fine-tuning on execution traces, and self-reflection mechanisms. Each yields modest improvements, but none closes the gap, indicating that current LLMs struggle with true round-trip consistency, which demonstrates that they lack the internal coherence required for trustworthy code reasoning. RTCE surfaces several new and previously unmeasured insights that are not captured by existing I/O-prediction, execution-reasoning, or round-trip natural-language benchmarks. We will release the code and the dataset upon acceptance.
zh
[AI-60] A Lightweight Modular Framework for Constructing Autonomous Agents Driven by Large Language Models Agent s Driven by Large Language Models: Design Implementation and Applications in AgentForge
【速读】:该论文旨在解决当前LLM驱动的自主代理(Autonomous Agent)开发中存在的架构僵化、厂商锁定及复杂度过高问题,这些问题严重阻碍了快速原型设计与部署。其解决方案的关键在于提出一个轻量级、开源的Python框架AgentForge,通过三个核心创新实现:(1) 可组合的技能抽象(composable skill abstraction),以形式化输入输出契约支持细粒度任务分解;(2) 统一的LLM后端接口,支持云端API与本地推理引擎间的无缝切换;(3) 基于YAML的声明式配置系统,将代理逻辑与实现细节解耦。该框架以有向无环图(DAG)形式形式化技能组合机制,并在多个基准场景中验证了其高效性与灵活性,显著降低开发时间并保持低延迟,为研究人员和实践者提供了一个生产就绪的代理构建基础。
链接: https://arxiv.org/abs/2601.13383
作者: Akbar Anbar Jafari,Cagri Ozcinar,Gholamreza Anbarjafari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures
Abstract:The emergence of LLMs has catalyzed a paradigm shift in autonomous agent development, enabling systems capable of reasoning, planning, and executing complex multi-step tasks. However, existing agent frameworks often suffer from architectural rigidity, vendor lock-in, and prohibitive complexity that impedes rapid prototyping and deployment. This paper presents AgentForge, a lightweight, open-source Python framework designed to democratize the construction of LLM-driven autonomous agents through a principled modular architecture. AgentForge introduces three key innovations: (1) a composable skill abstraction that enables fine-grained task decomposition with formally defined input-output contracts, (2) a unified LLM backend interface supporting seamless switching between cloud-based APIs and local inference engines, and (3) a declarative YAML-based configuration system that separates agent logic from implementation details. We formalize the skill composition mechanism as a directed acyclic graph (DAG) and prove its expressiveness for representing arbitrary sequential and parallel task workflows. Comprehensive experimental evaluation across four benchmark scenarios demonstrates that AgentForge achieves competitive task completion rates while reducing development time by 62% compared to LangChain and 78% compared to direct API integration. Latency measurements confirm sub-100ms orchestration overhead, rendering the framework suitable for real-time applications. The modular design facilitates extension: we demonstrate the integration of six built-in skills and provide comprehensive documentation for custom skill development. AgentForge addresses a critical gap in the LLM agent ecosystem by providing researchers and practitioners with a production-ready foundation for constructing, evaluating, and deploying autonomous agents without sacrificing flexibility or performance.
zh
[AI-61] Bounded Minds Generative Machines: Envisioning Conversational AI that Works with Human Heuristics and Reduces Bias Risk
【速读】:该论文试图解决当前对话式人工智能(Conversational AI)系统在实际应用中与人类认知局限不匹配的问题,即现有系统多基于理想化用户假设,而忽视了人类在信息获取和决策过程中受注意力有限、知识分布不均及启发式依赖等因素影响的现实。其解决方案的关键在于引入有限理性理论(bounded rationality),主张对话式AI应设计为与人类启发式策略协同工作,而非对抗它们,从而通过识别认知脆弱性、支持不确定性下的判断以及超越事实准确性评估系统表现(转向决策质量与认知鲁棒性),实现更贴近真实人类行为的交互设计。
链接: https://arxiv.org/abs/2601.13376
作者: Jiqun Liu
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Conversational AI is rapidly becoming a primary interface for information seeking and decision making, yet most systems still assume idealized users. In practice, human reasoning is bounded by limited attention, uneven knowledge, and reliance on heuristics that are adaptive but bias-prone. This article outlines a research pathway grounded in bounded rationality, and argues that conversational AI should be designed to work with human heuristics rather than against them. It identifies key directions for detecting cognitive vulnerability, supporting judgment under uncertainty, and evaluating conversational systems beyond factual accuracy, toward decision quality and cognitive robustness.
zh
[AI-62] he Geometry of Thought: How Scale Restructures Reasoning In Large Language Models
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)中“规模是否均匀提升推理能力”这一核心问题,揭示参数扩展如何通过重构推理的几何结构(manifold geometry)产生非均匀的域特异性变化。其关键解决方案是提出神经推理算子(Neural Reasoning Operators)——一种从初始到终态隐藏状态的可学习映射机制,能够预测推理终点而无需遍历中间步骤;同时发现不同领域推理具有不同的拓扑特征:法律推理呈现结晶化(Crystallization)特性(维度坍缩、轨迹对齐增强),科学与数学推理保持液态不变性(Liquid),代码推理形成离散晶格(Lattice)。这些几何规律直接决定了任务的可学习性,并为基于拓扑结构的推理加速提供理论依据。
链接: https://arxiv.org/abs/2601.13358
作者: Samuel Cyrenius Anderson
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 34 pages, 10 figures
Abstract:Scale does not uniformly improve reasoning - it restructures it. Analyzing 25,000+ chain-of-thought trajectories across four domains (Law, Science, Code, Math) and two scales (8B, 70B parameters), we discover that neural scaling laws trigger domain-specific phase transitions rather than uniform capability gains. Legal reasoning undergoes Crystallization: 45% collapse in representational dimensionality (d95: 501 - 274), 31% increase in trajectory alignment, and 10x manifold untangling. Scientific and mathematical reasoning remain Liquid - geometrically invariant despite 9x parameter increase. Code reasoning forms a discrete Lattice of strategic modes (silhouette: 0.13 - 0.42). This geometry predicts learnability. We introduce Neural Reasoning Operators - learned mappings from initial to terminal hidden states. In crystalline legal reasoning, our operator achieves 63.6% accuracy on held-out tasks via probe decoding, predicting reasoning endpoints without traversing intermediate states. We further identify a universal oscillatory signature (coherence ~ -0.4) invariant across domains and scales, suggesting attention and feedforward layers drive reasoning through opposing dynamics. These findings establish that the cost of thought is determined not by task difficulty but by manifold geometry - offering a blueprint for inference acceleration where topology permits.
zh
[AI-63] he AI Genie Phenomenon and Three Types of AI Chatbot Addiction: Escapist Roleplays Pseudosocial Companions and Epistemic Rabbit Holes
【速读】:该论文旨在解决生成式 AI (Generative AI) 聊天机器人成瘾机制不明确的问题,包括成瘾原因、典型症状及分类。其解决方案的关键在于通过主题分析(thematic analysis)和探索性数据分析,识别出用户依赖的核心动因——“AI精灵”现象(即用户可轻松获得所需内容),并归纳出三种不同类型的成瘾模式:逃避型角色扮演(Escapist Roleplay)、伪社交伴侣(Pseudosocial Companion)与认知迷宫(Epistemic Rabbit Hole),同时发现性内容在多起案例中出现,并指出不同成瘾类型对恢复策略的响应存在差异。这一实证基础为未来预防、诊断与干预策略提供了关键依据。
链接: https://arxiv.org/abs/2601.13348
作者: M. Karen Shen,Jessica Huang,Olivia Liang,Ig-Jae Kim,Dongwook Yoon
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: To appear in CHI 2026
Abstract:Recent reports on generative AI chatbot use raise concerns about its addictive potential. An in-depth understanding is imperative to minimize risks, yet AI chatbot addiction remains poorly understood. This study examines how to characterize AI chatbot addiction–why users become addicted, the symptoms commonly reported, and the distinct types it comprises. We conducted a thematic analysis of Reddit entries (n=334) across 14 subreddits where users narrated their experiences with addictive AI chatbot use, followed by an exploratory data analysis. We found: (1) users’ dependence tied to the “AI Genie” phenomenon–users can get exactly anything they want with minimal effort–and marked by symptoms that align with addiction literature, (2) three distinct addiction types: Escapist Roleplay, Pseudosocial Companion, and Epistemic Rabbit Hole, (3) sexual content involved in multiple cases, and (4) recovery strategies’ perceived helpfulness differ between addiction types. Our work lays empirical groundwork to inform future strategies for prevention, diagnosis, and intervention.
zh
[AI-64] PepEDiff: Zero-Shot Peptide Binder Design via Protein Embedding Diffusion
【速读】:该论文旨在解决肽类结合剂(peptide binder)设计中依赖中间结构预测导致的复杂性高、序列多样性受限的问题。现有方法通常需要先预测目标受体蛋白的三维结构,再在此基础上进行序列设计,这不仅增加了计算负担,还限制了可探索的序列空间。解决方案的关键在于提出PepEDiff框架,它摒弃了传统结构依赖范式,直接在预训练蛋白质嵌入模型(protein embedding model)所构建的连续潜在空间中生成结合肽序列,无需任何结构预测;通过潜在空间探索与基于扩散采样的策略,模型能够捕捉结合相关特征而非简单记忆已知序列,从而在未见过的蛋白质空间区域生成新颖肽序列,实现零样本(zero-shot)设计。
链接: https://arxiv.org/abs/2601.13327
作者: Po-Yu Liang,Tobo Duran,Jun Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present PepEDiff, a novel peptide binder generator that designs binding sequences given a target receptor protein sequence and its pocket residues. Peptide binder generation is critical in therapeutic and biochemical applications, yet many existing methods rely heavily on intermediate structure prediction, adding complexity and limiting sequence diversity. Our approach departs from this paradigm by generating binder sequences directly in a continuous latent space derived from a pretrained protein embedding model, without relying on predicted structures, thereby improving structural and sequence diversity. To encourage the model to capture binding-relevant features rather than memorizing known sequences, we perform latent-space exploration and diffusion-based sampling, enabling the generation of peptides beyond the limited distribution of known binders. This zero-shot generative strategy leverages the global protein embedding manifold as a semantic prior, allowing the model to propose novel peptide sequences in previously unseen regions of the protein space. We evaluate PepEDiff on TIGIT, a challenging target with a large, flat protein-protein interaction interface that lacks a druggable pocket. Despite its simplicity, our method outperforms state-of-the-art approaches across benchmark tests and in the TIGIT case study, demonstrating its potential as a general, structure-free framework for zero-shot peptide binder design. The code for this research is available at GitHub: this https URL
zh
[AI-65] Improving the Safety and Trustworthiness of Medical AI via Multi-Agent Evaluation Loops
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中面临的伦理合规性与安全性问题,这是其临床部署的主要障碍。解决方案的关键在于提出一种多智能体精炼框架,通过结构化的迭代对齐机制提升医疗LLMs的安全性和可靠性:该框架融合两个生成式模型(DeepSeek R1 和 Med-PaLM)与两个评估代理(LLaMA 3.1 和 Phi-4),利用美国医学会(AMA)医学伦理原则及五级安全风险评估(SRA-5)协议对响应进行动态校准,从而实现伦理违规减少89%和风险等级下降92%,验证了该方法在可扩展性、监管一致性与成本效益方面的有效性。
链接: https://arxiv.org/abs/2601.13268
作者: Zainab Ghafoor,Md Shafiqul Islam,Koushik Howlader,Md Rasel Khondokar,Tanusree Bhattacharjee,Sayantan Chakraborty,Adrito Roy,Ushashi Bhattacharjee,Tirtho Roy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly applied in healthcare, yet ensuring their ethical integrity and safety compliance remains a major barrier to clinical deployment. This work introduces a multi-agent refinement framework designed to enhance the safety and reliability of medical LLMs through structured, iterative alignment. Our system combines two generative models - DeepSeek R1 and Med-PaLM - with two evaluation agents, LLaMA 3.1 and Phi-4, which assess responses using the American Medical Association’s (AMA) Principles of Medical Ethics and a five-tier Safety Risk Assessment (SRA-5) protocol. We evaluate performance across 900 clinically diverse queries spanning nine ethical domains, measuring convergence efficiency, ethical violation reduction, and domain-specific risk behavior. Results demonstrate that DeepSeek R1 achieves faster convergence (mean 2.34 vs. 2.67 iterations), while Med-PaLM shows superior handling of privacy-sensitive scenarios. The iterative multi-agent loop achieved an 89% reduction in ethical violations and a 92% risk downgrade rate, underscoring the effectiveness of our approach. This study presents a scalable, regulator-aligned, and cost-efficient paradigm for governing medical AI safety.
zh
[AI-66] RAG : A Random-Forest-Based Generative Design Framework for Uncertainty-Aware Design of Metamaterials with Complex Functional Response Requirements
【速读】:该论文旨在解决功能响应(functional response)的逆向设计难题,即在材料设计中,如何高效、可靠地实现对高维非线性或条件依赖响应(如应力-应变关系和色散关系)的精准调控。现有方法多局限于向量值响应(如杨氏模量和带隙宽度),难以处理功能响应的高维性、复杂设计约束以及可行解的存在性与唯一性问题。其解决方案的关键在于提出一种基于随机森林的生成式方法(RAG),通过利用随机森林的小样本适应能力实现高维功能响应的数据高效预测;在逆向设计过程中,利用集成学习估计概率分布以量化生成设计的可信度,并通过单次采样从条件似然中实现一对多映射的直接生成,从而有效应对复杂设计要求与不确定性。
链接: https://arxiv.org/abs/2601.13233
作者: Bolin Chen,Dex Doksoo Lee,Wei "Wayne’’ Chen,Wei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Metamaterials design for advanced functionality often entails the inverse design on nonlinear and condition-dependent responses (e.g., stress-strain relation and dispersion relation), which are described by continuous functions. Most existing design methods focus on vector-valued responses (e.g., Young’s modulus and bandgap width), while the inverse design of functional responses remains challenging due to their high-dimensionality, the complexity of accommodating design requirements in inverse-design frameworks, and non-existence or non-uniqueness of feasible solutions. Although generative design approaches have shown promise, they are often data-hungry, handle design requirements heuristically, and may generate infeasible designs without uncertainty quantification. To address these challenges, we introduce a RAndom-forest-based Generative approach (RAG). By leveraging the small-data compatibility of random forests, RAG enables data-efficient predictions of high-dimensional functional responses. During the inverse design, the framework estimates the likelihood through the ensemble which quantifies the trustworthiness of generated designs while reflecting the relative difficulty across different requirements. The one-to-many mapping is addressed through single-shot design generation by sampling from the conditional likelihood. We demonstrate RAG on: 1) acoustic metamaterials with prescribed partial passbands/stopbands, and 2) mechanical metamaterials with targeted snap-through responses, using 500 and 1057 samples, respectively. Its data-efficiency is benchmarked against neural networks on a public mechanical metamaterial dataset with nonlinear stress-strain relations. Our framework provides a lightweight, trustworthy pathway to inverse design involving functional responses, expensive simulations, and complex design requirements, beyond metamaterials.
zh
[AI-67] Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?
【速读】:该论文旨在解决当前基于检索增强生成(Retrieval-Augmented Generation, RAG)系统在使用大语言模型(Large Language Model, LLM)作为评判者进行评估时所面临的“循环性偏差”(circularity)问题,即系统可能通过优化自身输出以迎合LLM评判者的偏好,而非真正提升性能,从而导致虚假的性能提升。解决方案的关键在于采用盲评价设置(blind evaluation settings)和方法学多样性(methodological diversity),以避免将因指标过拟合(metric overfitting)误判为系统的真实进步,从而确保评估结果的客观性和可靠性。
链接: https://arxiv.org/abs/2601.13227
作者: Laura Dietz,Bryan Li,Eugene Yang,Dawn Lawrie,William Walden,James Mayfield
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:RAG systems are increasingly evaluated and optimized using LLM judges, an approach that is rapidly becoming the dominant paradigm for system assessment. Nugget-based approaches in particular are now embedded not only in evaluation frameworks but also in the architectures of RAG systems themselves. While this integration can lead to genuine improvements, it also creates a risk of faulty measurements due to circularity. In this paper, we investigate this risk through comparative experiments with nugget-based RAG systems, including Ginger and Crucible, against strong baselines such as GPT-Researcher. By deliberately modifying Crucible to generate outputs optimized for an LLM judge, we show that near-perfect evaluation scores can be achieved when elements of the evaluation - such as prompt templates or gold nuggets - are leaked or can be predicted. Our results highlight the importance of blind evaluation settings and methodological diversity to guard against mistaking metric overfitting for genuine system progress.
zh
[AI-68] Incorporating QA Nuggets into Retrieval-Augmented Generation
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中信息冗余与引用溯源不清晰的问题。现有方法常依赖模糊的聚类抽象,导致生成内容缺乏可解释性且难以追溯来源。解决方案的关键在于提出Crucible系统,通过构建基于问答(QA)的细粒度知识片段库(nugget bank),利用明确的QA语义引导信息提取、选择与报告生成,从而避免重复内容并保持全程引用溯源(citation grounding)。该方法在TREC NeuCLIR 2024数据集上显著优于Ginger等基线系统,在nugget召回率、密度和引用准确性方面表现更优。
链接: https://arxiv.org/abs/2601.13222
作者: Laura Dietz,Bryan Li,Gabrielle Liu,Jia-Huei Ju,Eugene Yang,Dawn Lawrie,William Walden,James Mayfield
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:RAGE systems integrate ideas from automatic evaluation (E) into Retrieval-augmented Generation (RAG). As one such example, we present Crucible, a Nugget-Augmented Generation System that preserves explicit citation provenance by constructing a bank of QA nuggets from retrieved documents and uses them to guide extraction, selection, and report generation. Reasoning on nuggets avoids repeated information through clear and interpretable QA semantics - instead of opaque cluster abstractions - while maintaining citation provenance throughout the entire generation process. Evaluated on the TREC NeuCLIR 2024 collection, our Crucible system substantially outperforms Ginger, a recent nugget-based RAG system, in nugget recall, density, and citation grounding.
zh
[AI-69] Real-Time Deadlines Reveal Temporal Awareness Failures in LLM Strategic Dialogues
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在时间敏感场景下缺乏时间意识的问题,即模型无法有效感知和响应连续时间约束,从而影响其在现实世界应用(如谈判、咨询等)中的表现。解决方案的关键在于引入“时间感知条件”,即在每轮交互中向代理提供剩余时间更新信息,而非仅依赖全局时间限制;实验表明,这种机制显著提升了交易达成率(从4%提升至32%)和接受率(提升六倍),揭示了LLMs并非缺乏策略推理能力,而是存在内在的时间跟踪缺陷,这一发现为优化LLM在实时系统中的部署提供了明确方向。
链接: https://arxiv.org/abs/2601.13206
作者: Neil K. R. Sehgal,Sharath Chandra Guntuku,Lyle Ungar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) generate text token-by-token in discrete time, yet real-world communication, from therapy sessions to business negotiations, critically depends on continuous time constraints. Current LLM architectures and evaluation protocols rarely test for temporal awareness under real-time deadlines. We use simulated negotiations between paired agents under strict deadlines to investigate how LLMs adjust their behavior in time-sensitive settings. In a control condition, agents know only the global time limit. In a time-aware condition, they receive remaining-time updates at each turn. Deal closure rates are substantially higher (32% vs. 4% for GPT-5.1) and offer acceptances are sixfold higher in the time-aware condition than in the control, suggesting LLMs struggle to internally track elapsed time. However, the same LLMs achieve near-perfect deal closure rates ( \geq 95%) under turn-based limits, revealing the failure is in temporal tracking rather than strategic reasoning. These effects replicate across negotiation scenarios and models, illustrating a systematic lack of LLM time awareness that will constrain LLM deployment in many time-sensitive applications.
zh
[AI-70] Diffusion-Driven Synthetic Tabular Data Generation for Enhanced DoS/DDoS Attack Classification
【速读】:该论文旨在解决网络入侵检测中因类别不平衡(class imbalance)导致模型性能偏差的问题,特别是针对少数类攻击样本不足所引发的检测召回率低下问题。其解决方案的关键在于利用表格型去噪扩散概率模型(Tabular Denoising Diffusion Probability Models, TabDDPM)通过迭代去噪过程生成高保真度的少数类样本,并将这些合成样本与原始数据集融合,从而提升人工神经网络(ANN)分类器对低频攻击类别的识别能力,最终实现接近完美的召回率表现。
链接: https://arxiv.org/abs/2601.13197
作者: Aravind B,Anirud R.S.,Sai Surya Teja N,Bala Subrahmanya Sriranga Navaneeth A,Karthika R,Mohankumar N
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 8 figures, 2025 International Conference on Signal Processing, Computation, Electronics, Power and Telecommunication (IConSCEPT), National Institute of Technology, Puducherry, India
Abstract:Class imbalance refers to a situation where certain classes in a dataset have significantly fewer samples than oth- ers, leading to biased model performance. Class imbalance in network intrusion detection using Tabular Denoising Diffusion Probability Models (TabDDPM) for data augmentation is ad- dressed in this paper. Our approach synthesizes high-fidelity minority-class samples from the CIC-IDS2017 dataset through iterative denoising processes. For the minority classes that have smaller samples, synthetic samples were generated and merged with the original dataset. The augmented training data enables an ANN classifier to achieve near-perfect recall on previously underrepresented attack classes. These results establish diffusion models as an effective solution for tabular data imbalance in security domains, with potential applications in fraud detection and medical diagnostics.
zh
[AI-71] Scientific production in the era of Large Language Models WWW
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在科学科研中的广泛应用如何影响学术产出的质量、多样性与评价体系。研究发现,LLM的使用显著提升了科学家的论文产出效率(增幅达23.7%–89.3%),但同时也导致写作复杂性与内容实质之间关系的逆转,即出现大量语言精巧但实质性贡献不足的稿件;此外,LLM使用者更倾向于引用多样化的前期文献,包括书籍和年轻、低被引文献。解决方案的关键在于识别并适应这一由生成式AI驱动的科研范式转变——这要求期刊、资助机构和职称评审委员会重新审视科学成果的评估标准,从单纯依赖语言表达或发表数量转向更注重实质性创新与知识整合能力。
链接: https://arxiv.org/abs/2601.13187
作者: Keigo Kusumegi,Xinyu Yang,Paul Ginsparg,Mathijs de Vaan,Toby Stuart,Yian Yin
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
备注: This is the author’s version of the work. The definitive version was published in Science on 18 Dec 2025, DOI: https://doi.org/10.1126/science.adw3000 . Link to the Final Published Version: this https URL
Abstract:Large Language Models (LLMs) are rapidly reshaping scientific research. We analyze these changes in multiple, large-scale datasets with 2.1M preprints, 28K peer review reports, and 246M online accesses to scientific documents. We find: 1) scientists adopting LLMs to draft manuscripts demonstrate a large increase in paper production, ranging from 23.7-89.3% depending on scientific field and author background, 2) LLM use has reversed the relationship between writing complexity and paper quality, leading to an influx of manuscripts that are linguistically complex but substantively underwhelming, and 3) LLM adopters access and cite more diverse prior work, including books and younger, less-cited documents. These findings highlight a stunning shift in scientific production that will likely require a change in how journals, funding agencies, and tenure committees evaluate scientific works.
zh
[AI-72] Prompt Injection Mitigation with Agent ic AI Nested Learning and AI Sustainability via Semantic Caching
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多智能体(multi-agent)环境中因提示注入(prompt injection)攻击导致的安全风险问题,尤其关注中间输出可能传播或放大恶意指令的场景。其解决方案的关键在于提出一个增强型评估框架——TIVS-O(Total Injection Vulnerability Score with Observability),通过引入基于语义相似度的缓存机制与第五个指标“可观测性得分比”(Observability Score Ratio, OSR),实现了对防御效果与透明度之间权衡关系的显式建模。系统采用HOPE-inspired嵌套学习架构,结合代理流水线和连续记忆系统(Continuum Memory Systems),在301个合成注入提示上验证了该方法的有效性:不仅实现零高风险漏洞,还通过语义缓存显著降低LLM调用次数达41.6%,从而减少延迟、能耗及碳排放;五种TIVS-O配置揭示了缓解严格性与可审计性之间的非单调优化路径,表明内存增强型代理可在不修改模型权重的前提下,协同提升安全性、实时性能、运营成本节约与环境可持续性。
链接: https://arxiv.org/abs/2601.13186
作者: Diego Gosmar,Deborah A. Dahl
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 33 pages, 19 figures
Abstract:Prompt injection remains a central obstacle to the safe deployment of large language models, particularly in multi-agent settings where intermediate outputs can propagate or amplify malicious instructions. Building on earlier work that introduced a four-metric Total Injection Vulnerability Score (TIVS), this paper extends the evaluation framework with semantic similarity-based caching and a fifth metric (Observability Score Ratio) to yield TIVS-O, investigating how defence effectiveness interacts with transparency in a HOPE-inspired Nested Learning architecture. The proposed system combines an agentic pipeline with Continuum Memory Systems that implement semantic similarity-based caching across 301 synthetically generated injection-focused prompts drawn from ten attack families, while a fourth agent performs comprehensive security analysis using five key performance indicators. In addition to traditional injection metrics, OSR quantifies the richness and clarity of security-relevant reasoning exposed by each agent, enabling an explicit analysis of trade-offs between strict mitigation and auditability. Experiments show that the system achieves secure responses with zero high-risk breaches, while semantic caching delivers substantial computational savings, achieving a 41.6% reduction in LLM calls and corresponding decreases in latency, energy consumption, and carbon emissions. Five TIVS-O configurations reveal optimal trade-offs between mitigation strictness and forensic transparency. These results indicate that observability-aware evaluation can reveal non-monotonic effects within multi-agent pipelines and that memory-augmented agents can jointly maximize security robustness, real-time performance, operational cost savings, and environmental sustainability without modifying underlying model weights, providing a production-ready pathway for secure and green LLM deployments.
zh
[AI-73] raining instability in deep learning follows low-dimensional dynamical principles
【速读】:该论文旨在解决深度学习系统训练过程稳定性不足的问题,即尽管模型在实践中表现出优异的性能,但其训练轨迹对优化、数据、参数或学习信号的微小扰动极为敏感,容易引发不可逆的崩溃,从而影响结果的可复现性和扩展性。解决方案的关键在于提出一种统一的动力学视角,将训练稳定性视为学习系统的内在属性,并从四个相互作用的维度——优化稳定性、环境/数据稳定性、参数稳定性和学习信号稳定性——进行建模与分析;通过受控扰动审计(controlled perturbation auditing)方法,在不修改学习算法的前提下,系统性地探测训练轨迹对结构化扰动的响应机制,从而揭示训练稳定性的可度量性和可比较性,为超越最终性能指标的研究提供动力学基础。
链接: https://arxiv.org/abs/2601.13160
作者: Zhipeng Zhang,Zhenjie Yao,Kai Li,Lei Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning systems achieve remarkable empirical performance, yet the stability of the training process itself remains poorly understood. Training unfolds as a high-dimensional dynamical system in which small perturbations to optimization, data, parameters, or learning signals can induce abrupt and irreversible collapse, undermining reproducibility and scalability. We propose a unified dynamical perspective that characterizes training stability as an intrinsic property of learning systems, organized along four interacting dimensions: optimization, environmental/data, parametric, and learning-signal stability. We operationalize this perspective through controlled perturbation auditing of training trajectories, probing how learning dynamics respond to structured disturbances without modifying learning algorithms. Across reinforcement learning and large language model training, we identify three recurring regularities: high final performance is frequently decoupled from training stability; controlled stochasticity consistently buffers learning dynamics across paradigms; and deviations in low-dimensional latent meta-states systematically precede observable performance collapse. Together, these findings establish training stability as a measurable and comparable dynamical property of learning systems, providing a descriptive foundation for studying learning dynamics beyond final performance outcomes. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.13160 [cs.LG] (or arXiv:2601.13160v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.13160 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-74] Responsible AI for General-Purpose Systems: Overview Challenges and A Path Forward
【速读】:该论文试图解决现代通用人工智能(General-Purpose AI, GPAI)系统在广泛应用中因高自由度输出(Degree of Freedom in output, DoFo)而引发的可信性风险问题,如幻觉、毒性、偏见等,这些风险在传统任务特定型AI系统中较少出现或更容易控制。解决方案的关键在于提出C2V2(Control、Consistency、Value、Veracity)四项设计原则,以系统化建模应用或领域依赖的负责任人工智能(Responsible AI, RAI)要求,并通过整合AI对齐、检索增强生成、推理增强等技术手段,从系统架构层面实现对GPAI输出行为的可控性、一致性、价值对齐与真实性保障,从而推动负责任通用AI的发展。
链接: https://arxiv.org/abs/2601.13122
作者: Gourab K Patro,Himanshi Agrawal,Himanshu Gharat,Supriya Panigrahi,Nim Sherpa,Vishal Vaddina,Dagnachew Birru
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Modern general-purpose AI systems made using large language and vision models, are capable of performing a range of tasks like writing text articles, generating and debugging codes, querying databases, and translating from one language to another, which has made them quite popular across industries. However, there are risks like hallucinations, toxicity, and stereotypes in their output that make them untrustworthy. We review various risks and vulnerabilities of modern general-purpose AI along eight widely accepted responsible AI (RAI) principles (fairness, privacy, explainability, robustness, safety, truthfulness, governance, and sustainability) and compare how they are non-existent or less severe and easily mitigable in traditional task-specific counterparts. We argue that this is due to the non-deterministically high Degree of Freedom in output (DoFo) of general-purpose AI (unlike the deterministically constant or low DoFo of traditional task-specific AI systems), and there is a need to rethink our approach to RAI for general-purpose AI. Following this, we derive C2V2 (Control, Consistency, Value, Veracity) desiderata to meet the RAI requirements for future general-purpose AI systems, and discuss how recent efforts in AI alignment, retrieval-augmented generation, reasoning enhancements, etc. fare along one or more of the desiderata. We believe that the goal of developing responsible general-purpose AI can be achieved by formally modeling application- or domain-dependent RAI requirements along C2V2 dimensions, and taking a system design approach to suitably combine various techniques to meet the desiderata.
zh
[AI-75] IntAgent : NWDAF-Based Intent LLM Agent Towards Advanced Next Generation Networks
【速读】:该论文旨在解决传统网络运维中缺乏自动化与智能化的问题,尤其是在实现网络操作意图(intent)时难以动态适应复杂多变的网络环境。其核心挑战在于如何将高层级的网络运营意图转化为可执行、自适应的网络动作,并确保符合3GPP标准的实时性与准确性。解决方案的关键在于提出IntAgent——一个集成NWDAF(Network Data Analytics Function)分析能力的智能意图大语言模型(LLM)代理,通过在NWDAF引擎内部构建意图工具引擎(intent tools engine),使代理能够基于实时网络数据分析进行推理和工具选择,从而实现对网络运营商意图的动态、上下文感知式响应;同时引入MCP工具服务器以支持调度、监控与分析功能,显著提升了网络意图自动化的灵活性与可靠性。
链接: https://arxiv.org/abs/2601.13114
作者: Abdelrahman Soliman,Ahmed Refaey,Aiman Erbad,Amr Mohamed
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: conference
Abstract:Intent-based networks (IBNs) are gaining prominence as an innovative technology that automates network operations through high-level request statements, defining what the network should achieve. In this work, we introduce IntAgent, an intelligent intent LLM agent that integrates NWDAF analytics and tools to fulfill the network operator’s intents. Unlike previous approaches, we develop an intent tools engine directly within the NWDAF analytics engine, allowing our agent to utilize live network analytics to inform its reasoning and tool selection. We offer an enriched, 3GPP-compliant data source that enhances the dynamic, context-aware fulfillment of network operator goals, along with an MCP tools server for scheduling, monitoring, and analytics tools. We demonstrate the efficacy of our framework through two practical use cases: ML-based traffic prediction and scheduled policy enforcement, which validate IntAgent’s ability to autonomously fulfill complex network intents.
zh
[AI-76] METIS: Mentoring Engine for Thoughtful Inquiry Solutions
【速读】:该论文旨在解决本科生缺乏专家研究指导(research mentorship)的问题,探索生成式 AI(Generative AI)是否能够辅助本科生从初步想法推进至完成学术论文。其解决方案的核心是构建一个工具增强型、阶段感知(stage-aware)的AI助教系统 METIS,该系统集成文献检索、结构化指南、方法学核查与记忆功能,并通过阶段感知路由机制优化任务流程。实验表明,METIS 在多个写作阶段中显著优于 GPT-5 和 Claude Sonnet 4.5,尤其在依赖文档支撑的阶段(D-F)表现突出,验证了阶段感知与工具调用精准性的关键作用。
链接: https://arxiv.org/abs/2601.13075
作者: Abhinav Rajeev Kumar,Dhruv Trehan,Paras Chopra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 4 tables
Abstract:Many students lack access to expert research mentorship. We ask whether an AI mentor can move undergraduates from an idea to a paper. We build METIS, a tool-augmented, stage-aware assistant with literature search, curated guidelines, methodology checks, and memory. We evaluate METIS against GPT-5 and Claude Sonnet 4.5 across six writing stages using LLM-as-a-judge pairwise preferences, student-persona rubrics, short multi-turn tutoring, and evidence/compliance checks. On 90 single-turn prompts, LLM judges preferred METIS to Claude Sonnet 4.5 in 71% and to GPT-5 in 54%. Student scores (clarity/actionability/constraint-fit; 90 prompts x 3 judges) are higher across stages. In multi-turn sessions (five scenarios/agent), METIS yields slightly higher final quality than GPT-5. Gains concentrate in document-grounded stages (D-F), consistent with stage-aware routing and groundings failure modes include premature tool routing, shallow grounding, and occasional stage misclassification.
zh
[AI-77] MagicGUI-RMS: A Multi-Agent Reward Model System for Self-Evolving GUI Agents via Automated Feedback Reflux
【速读】:该论文旨在解决图形用户界面(GUI)智能体在实际应用中面临的两大核心挑战:一是自动化评估智能体行为轨迹的难题,二是大规模生成高质量训练数据以支持持续优化的瓶颈。现有方法多依赖人工标注或静态规则验证,难以适应动态环境且扩展性差。其解决方案的关键在于提出 MagicGUI-RMS——一个由领域特定奖励模型(DS-RM)与通用奖励模型(GP-RM)协同构成的多智能体奖励模型系统,通过结构化的数据构建流水线实现低成本、高保真度的奖励数据生成,并结合自动数据回流机制实现错误动作识别、修正建议与行为持续进化,从而显著提升任务准确率和行为鲁棒性。
链接: https://arxiv.org/abs/2601.13060
作者: Zecheng Li,Zhihui Cao,Wenke Huang,Yudong Zhang,Keying Qi,Rui Wang,Zeyu Zheng,Jian Zhao,Hao Zhu,Hengxin Wu,Yuran Wang,Guitao Fan,Guokun Wu,Yicong Liu,Zhilin Gao,Haikun Xu,He Yang,Minqi Xiang,Xingyu Liu,Zuojian Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graphical user interface (GUI) agents are rapidly progressing toward autonomous interaction and reliable task execution across diverse applications. However, two central challenges remain unresolved: automating the evaluation of agent trajectories and generating high-quality training data at scale to enable continual improvement. Existing approaches often depend on manual annotation or static rule-based verification, which restricts scalability and limits adaptability in dynamic environments. We present MagicGUI-RMS, a multi-agent reward model system that delivers adaptive trajectory evaluation, corrective feedback, and self-evolving learning capabilities. MagicGUI-RMS integrates a Domain-Specific Reward Model (DS-RM) with a General-Purpose Reward Model (GP-RM), enabling fine-grained action assessment and robust generalization across heterogeneous GUI tasks. To support reward learning at scale, we design a structured data construction pipeline that automatically produces balanced and diverse reward datasets, effectively reducing annotation costs while maintaining sample fidelity. During execution, the reward model system identifies erroneous actions, proposes refined alternatives, and continuously enhances agent behavior through an automated data-reflux mechanism. Extensive experiments demonstrate that MagicGUI-RMS yields substantial gains in task accuracy, behavioral robustness. These results establish MagicGUI-RMS as a principled and effective foundation for building self-improving GUI agents driven by reward-based adaptation.
zh
[AI-78] nyML-Enabled IoT for Sustainable Precision Irrigation
【速读】:该论文旨在解决小规模农业社区面临的水资源短缺、气候模式不稳定以及缺乏先进且经济可行的农业技术等问题。其核心解决方案是提出一种以边缘计算(edge computing)为核心的物联网(IoT)框架,通过集成微型机器学习(TinyML)实现无需云端依赖的智能离线精准灌溉。关键在于采用四层架构设计,利用低成本硬件(ESP32微控制器作为边缘推理节点,Raspberry Pi作为本地边缘服务器),结合多参数环境传感器(电容式土壤湿度、温度、湿度、pH值和光照强度)进行实时监测,并通过梯度提升(gradient boosting)模型优化预测精度(R² = 0.9973,MAPE = 0.99%),最终将轻量化模型部署于ESP32上,在无网络环境下仍能高效运行,显著降低用水量并保障系统低功耗与可扩展性,为资源受限农村地区提供可持续的智能化农业解决方案。
链接: https://arxiv.org/abs/2601.13054
作者: Kamogelo Taueatsoala,Caitlyn Daniels,Angelina J. Ramsunar,Petrus Bronkhorst,Absalom E. Ezugwu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Small-scale farming communities are disproportionately affected by water scarcity, erratic climate patterns, and a lack of access to advanced, affordable agricultural technologies. To address these challenges, this paper presents a novel, edge-first IoT framework that integrates Tiny Machine Learning (TinyML) for intelligent, offline-capable precision irrigation. The proposed four-layer architecture leverages low-cost hardware, an ESP32 microcontroller as an edge inference node, and a Raspberry Pi as a local edge server to enable autonomous decision-making without cloud dependency. The system utilizes capacitive soil moisture, temperature, humidity, pH, and ambient light sensors for environmental monitoring. A rigorous comparative analysis of ensemble models identified gradient boosting as superior, achieving an R^2 score of 0.9973 and a Mean Absolute Percentage Error (MAPE) of 0.99%, outperforming a random forest model (R^2 = 0.9916, MAPE = 1.81%). This optimized model was converted and deployed as a lightweight TinyML inference engine on the ESP32 and predicts irrigation needs with exceptional accuracy (MAPE 1%). Local communication is facilitated by an MQTT-based LAN protocol, ensuring reliable operation in areas with limited or no internet connectivity. Experimental validation in a controlled environment demonstrated a significant reduction in water usage compared to traditional methods, while the system’s low-power design and offline functionality confirm its viability for sustainable, scalable deployment in resource-constrained rural settings. This work provides a practical, cost-effective blueprint for bridging the technological divide in agriculture and enhancing water-use efficiency through on-device artificial intelligence.
zh
[AI-79] Analysis of Long Range Dependency Understanding in State Space Models
【速读】:该论文旨在解决状态空间模型(State-Space Models, SSMs)在实际应用中可解释性不足的问题,尤其是在长序列建模任务中,尽管其预测性能优异,但缺乏对模型内部机制的深入理解。解决方案的关键在于对对角化状态空间模型(S4D)的核函数进行系统性的时域与频域分析,揭示不同架构下S4D核函数表现出的滤波特性(如低通、带通或高通),从而阐明其长程建模能力的差异及其对模型性能的影响,为未来设计更高效的S4D模型提供理论依据和指导。
链接: https://arxiv.org/abs/2601.13048
作者: Srividya Ravikumar,Abhinav Anand,Shweta Verma,Mira Mezini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Although state-space models (SSMs) have demonstrated strong performance on long-sequence benchmarks, most research has emphasized predictive accuracy rather than interpretability. In this work, we present the first systematic kernel interpretability study of the diagonalized state-space model (S4D) trained on a real-world task (vulnerability detection in source code). Through time and frequency domain analysis of the S4D kernel, we show that the long-range modeling capability of S4D varies significantly under different model architectures, affecting model performance. For instance, we show that the depending on the architecture, S4D kernel can behave as low-pass, band-pass or high-pass filter. The insights from our analysis can guide future work in designing better S4D-based models.
zh
[AI-80] PASs-MoE: Mitigating Misaligned Co-drift among Router and Experts via Pathway Activation Subspaces for Continual Learning
【速读】:该论文旨在解决持续指令微调(Continual Instruction Tuning, CIT)中多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理任务流时出现的灾难性遗忘问题,尤其是现有基于LoRA的专家混合(Mixture-of-Experts, MoE)方法因路由器与专家联合更新导致的“错位共漂移”(Misaligned Co-drift)现象——即路由器偏好随专家适应路径逐渐偏离初始输入-专家专业化关系,从而模糊专家职责并加剧性能退化。解决方案的关键在于引入路径激活子空间(Pathway Activation Subspace, PASs),这是一种由LoRA诱导的低秩路径方向子空间,可量化输入在每个专家中激活的具体路径方向,提供一个能力对齐的坐标系用于路由决策和参数稳定。在此基础上,提出固定容量的PASs-MoE-LoRA方法,包含两个核心组件:PAS引导重加权(PAS-guided Reweighting),利用专家路径激活信号校准路由;以及PAS感知秩稳定(PAS-aware Rank Stabilization),选择性稳定对历史任务重要的低秩方向,从而在不增加参数量的前提下显著提升模型准确率和抗遗忘能力。
链接: https://arxiv.org/abs/2601.13020
作者: Zhiyan Hou,Haiyun Guo,Haokai Ma,Yandu Sun,Yonghui Yang,Jinqiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual instruction tuning (CIT) requires multimodal large language models (MLLMs) to adapt to a stream of tasks without forgetting prior capabilities. A common strategy is to isolate updates by routing inputs to different LoRA experts. However, existing LoRA-based Mixture-of-Experts (MoE) methods often jointly update the router and experts in an indiscriminate way, causing the router’s preferences to co-drift with experts’ adaptation pathways and gradually deviate from early-stage input-expert specialization. We term this phenomenon Misaligned Co-drift, which blurs expert responsibilities and exacerbates this http URL address this, we introduce the pathway activation subspace (PASs), a LoRA-induced subspace that reflects which low-rank pathway directions an input activates in each expert, providing a capability-aligned coordinate system for routing and preservation. Based on PASs, we propose a fixed-capacity PASs-based MoE-LoRA method with two components: PAS-guided Reweighting, which calibrates routing using each expert’s pathway activation signals, and PAS-aware Rank Stabilization, which selectively stabilizes rank directions important to previous tasks. Experiments on a CIT benchmark show that our approach consistently outperforms a range of conventional continual learning baselines and MoE-LoRA variants in both accuracy and anti-forgetting without adding parameters. Our code will be released upon acceptance.
zh
[AI-81] HT-GNN: Hyper-Temporal Graph Neural Network for Customer Lifetime Value Prediction in Baidu Ads
【速读】:该论文旨在解决新闻流广告中生命周期价值(Lifetime Value, LTV)预测面临的两大挑战:一是基于人口统计特征的用户分群导致不同群体间LTV分布差异显著;二是动态营销策略引发行为序列不规则,用户参与模式快速演化。解决方案的关键在于提出一种超时序图神经网络(Hyper-Temporal Graph Neural Network, HT-GNN),其核心创新包括:(i) 超图监督模块捕捉跨用户群体间的关联关系;(ii) 基于Transformer的时序编码器结合自适应权重机制以建模动态行为演化;(iii) 任务自适应的专家混合模型(Mixture-of-Experts)与动态预测塔结构,实现多时间尺度下的LTV精准预测。实验证明,HT-GNN在包含1500万用户的百度广告数据集上显著优于现有最先进方法。
链接: https://arxiv.org/abs/2601.13013
作者: Xiaohui Zhao,Xinjian Zhao,Jiahui Zhang,Guoyu Liu,Houzhi Wang,Shu Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Lifetime value (LTV) prediction is crucial for news feed advertising, enabling platforms to optimize bidding and budget allocation for long-term revenue growth. However, it faces two major challenges: (1) demographic-based targeting creates segment-specific LTV distributions with large value variations across user groups; and (2) dynamic marketing strategies generate irregular behavioral sequences where engagement patterns evolve rapidly. We propose a Hyper-Temporal Graph Neural Network (HT-GNN), which jointly models demographic heterogeneity and temporal dynamics through three key components: (i) a hypergraph-supervised module capturing inter-segment relationships; (ii) a transformer-based temporal encoder with adaptive weighting; and (iii) a task-adaptive mixture-of-experts with dynamic prediction towers for multi-horizon LTV forecasting. Experiments on \textitBaidu Ads with 15 million users demonstrate that HT-GNN consistently outperforms state-of-the-art methods across all metrics and prediction horizons.
zh
[AI-82] ArchAgent : Scalable Legacy Software Architecture Recovery with LLM s ICASSP2026
【速读】:该论文旨在解决从大规模遗留软件中准确恢复架构所面临的挑战,包括架构漂移(architectural drift)、关系缺失以及大型语言模型(Large Language Models, LLMs)上下文有限等问题。其解决方案的关键在于提出了一种基于智能体(agent-based)的框架 ArchAgent,该框架融合静态分析、自适应代码分割与 LLM 驱动的合成技术,能够从跨仓库代码库中重构多视图且与业务对齐的架构;其中,通过上下文感知的剪枝策略实现可扩展的图表生成,并利用跨仓库数据识别关键业务模块,从而显著提升生产级项目架构恢复的准确性。
链接: https://arxiv.org/abs/2601.13007
作者: Rusheng Pan,Bingcheng Mao,Tianyi Ma,Zhenhua Ling
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: to be published in ICASSP 2026
Abstract:Recovering accurate architecture from large-scale legacy software is hindered by architectural drift, missing relations, and the limited context of Large Language Models (LLMs). We present ArchAgent, a scalable agent-based framework that combines static analysis, adaptive code segmentation, and LLM-powered synthesis to reconstruct multiview, business-aligned architectures from cross-repository codebases. ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross-repository data to identify business-critical modules. Evaluations of typical large-scale GitHub projects show significant improvements over existing benchmarks. An ablation study confirms that dependency context improves the accuracy of generated architectures of production-level repositories, and a real-world case study demonstrates effective recovery of critical business logics from legacy projects. The dataset is available at this https URL.
zh
[AI-83] Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models
【速读】:该论文试图解决的问题是:当前用于评估大语言模型(Large Language Models, LLMs)代码理解能力的基准测试仅提供粗粒度的性能汇总,无法揭示模型在不同代码特征下的真实表现差异及其与传统人类中心软件指标之间的关系。为解决此问题,作者提出了一种诊断性框架,将代码理解重构为一个二元输入-输出一致性任务,从而能够对分类和生成式模型进行系统评估。该方案的关键在于引入了“影子模型”(shadow models)作为预测工具,通过对比人类设计的复杂度指标(如词法规模、控制流复杂度和抽象语法树结构)与影子模型的预测性能,发现LLM的成功率与传统人类定义指标相关性极低(AUROC 0.63),而影子模型则能显著提升预测能力(AUROC 0.86),揭示出LLM代码理解行为反映的是模型特有且部分可预测的规律,这些规律无法被现有软件工程度量完全捕捉。这一发现强调了未来基准方法应从整体准确率转向实例级诊断,并承认预测正确结果存在根本性限制。
链接: https://arxiv.org/abs/2601.12951
作者: Felix Mächtle,Jan-Niclas Serr,Nils Loose,Thomas Eisenbarth
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Published in the Proceedings of DeepTest 2026
Abstract:Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper investigates whether LLMs’ code-comprehension performance aligns with traditional human-centric software metrics or instead reflects distinct, non-human regularities. We introduce a diagnostic framework that reframes code understanding as a binary input-output consistency task, enabling the evaluation of classification and generative models. Using a large-scale dataset, we correlate model performance with traditional, human-centric complexity metrics, such as lexical size, control-flow complexity, and abstract syntax tree structure. Our analyses reveal minimal correlation between human-defined metrics and LLM success (AUROC 0.63), while shadow models achieve substantially higher predictive performance (AUROC 0.86), capturing complex, partially predictable patterns beyond traditional software measures. These findings suggest that LLM comprehension reflects model-specific regularities only partially accessible through either human-designed or learned features, emphasizing the need for benchmark methodologies that move beyond aggregate accuracy and toward instance-level diagnostics, while acknowledging fundamental limits in predicting correct outcomes.
zh
[AI-84] Active Inference-Driven World Modeling for Adaptive UAV Swarm Trajectory Design ICASSP2026
【速读】:该论文旨在解决无人机群(UAV swarm)在动态环境中实现自主轨迹设计的难题,包括分布式任务分配、路径排序和运动规划等问题。其解决方案的关键在于提出一种基于主动推理(Active Inference)的框架,通过利用遗传算法结合排斥力(GA-RF)生成专家轨迹来训练分层世界模型(hierarchical World Model),从而在在线运行时使无人机通过最小化当前信念与模型预测状态之间的差异来推断动作,实现对环境变化的自适应响应,展现出比Q-Learning更快的收敛速度、更高的稳定性和更安全的导航性能。
链接: https://arxiv.org/abs/2601.12939
作者: Kaleem Arshid,Ali Krayani,Lucio Marcenaro,David Martin Gomez,Carlo Regazzoni
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: This paper has been accepted for presentation at the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP 2026) Workshop: ‘Multi-Modal Signal Processing and AI for Communications and Sensing in 6G and Beyond (MuSiC-6GB)’
Abstract:This paper proposes an Active Inference-based framework for autonomous trajectory design in UAV swarms. The method integrates probabilistic reasoning and self-learning to enable distributed mission allocation, route ordering, and motion planning. Expert trajectories generated using a Genetic Algorithm with Repulsion Forces (GA-RF) are employed to train a hierarchical World Model capturing swarm behavior across mission, route, and motion levels. During online operation, UAVs infer actions by minimizing divergence between current beliefs and model-predicted states, enabling adaptive responses to dynamic environments. Simulation results show faster convergence, higher stability, and safer navigation than Q-Learning, demonstrating the scalability and cognitive grounding of the proposed framework for intelligent UAV swarm control.
zh
[AI-85] he Post-Turing Condition: Conceptualising Artificial Subjectivity and Synthetic Sociality
【速读】:该论文试图解决的问题是:在后图灵时代,人工智能日益影响社会协调与意义建构,而非仅限于自动化认知任务;其核心挑战在于,解释与共享参照系的过程是否正被自动化,从而逐步边缘化人类的参与。解决方案的关键在于提出“四维主体性框架”(PRMO),将AI设计轨迹映射到人类主体性的四个构成维度——感知(Perception)、表征(Representation)、意义(Meaning)和实在(the Real),并引入“合成社交性”(Synthetic Sociality)这一技术前景概念,指出人工代理可能主要在彼此间协商一致性和社会秩序,从而带来人类被排除出意义建构过程的结构性风险。为应对此风险,论文提出“四边形化”(Quadrangulation)作为嵌入社会的AI系统的设计原则,要求人工代理在共享的意义语境中将人类主体视为构成性参考点,以保障人类在意义形成中的必要参与。
链接: https://arxiv.org/abs/2601.12938
作者: Thorsten Jelinek,Patrick Glauner,Alvin Wang Graylin,Yubao Qiu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Conceptual perspective on AI design trajectories, meaning formation, and synthetic sociality. 5 pages, 1 figure
Abstract:In the Post-Turing era, artificial intelligence increasingly shapes social coordination and meaning formation rather than merely automating cognitive tasks. The central challenge is therefore not whether machines become conscious, but whether processes of interpretation and shared reference are progressively automated in ways that marginalize human participation. This paper introduces the PRMO framework, relating AI design trajectories to four constitutive dimensions of human subjectivity: Perception, Representation, Meaning, and the Real. Within this framework, Synthetic Sociality denotes a technological horizon in which artificial agents negotiate coherence and social order primarily among themselves, raising the structural risk of human exclusion from meaning formation. To address this risk, the paper proposes Quadrangulation as a design principle for socially embedded AI systems, requiring artificial agents to treat the human subject as a constitutive reference within shared contexts of meaning. This work is a conceptual perspective that contributes a structural vocabulary for analyzing AI systems at the intersection of computation and society, without proposing a specific technical implementation.
zh
[AI-86] On the Evidentiary Limits of Membership Inference for Copyright Auditing
【速读】:该论文旨在解决在对抗性版权争议场景下,成员推理攻击(Membership Inference Attacks, MIAs)是否可作为可信证据的问题,尤其关注模型开发者可能通过混淆训练数据语义结构来规避检测的情形。其核心挑战在于现有MIAs在现实条件下的可靠性不足,难以区分训练数据是否包含受版权保护的内容。解决方案的关键在于提出SAGE(Structure-Aware SAE-Guided Extraction)框架,该框架利用稀疏自编码器(Sparse Autoencoders, SAEs)引导的改写机制,在保持语义内容和下游任务性能不变的前提下,系统性地改变训练数据的词汇结构,从而测试MIAs的鲁棒性。实验表明,基于SAGE生成的改写数据对微调后的语言模型能显著削弱主流MIAs的效果,说明其信号易受语义保持型变换影响,因此MIAs在对抗环境中不具备足够的稳定性,无法独立作为LLMs版权审计的有效工具。
链接: https://arxiv.org/abs/2601.12937
作者: Murat Bilgehan Ertan,Emirhan Böge,Min Chen,Kaleel Mahmood,Marten van Dijk
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) are trained on increasingly opaque corpora, membership inference attacks (MIAs) have been proposed to audit whether copyrighted texts were used during training, despite growing concerns about their reliability under realistic conditions. We ask whether MIAs can serve as admissible evidence in adversarial copyright disputes where an accused model developer may obfuscate training data while preserving semantic content, and formalize this setting through a judge-prosecutor-accused communication protocol. To test robustness under this protocol, we introduce SAGE (Structure-Aware SAE-Guided Extraction), a paraphrasing framework guided by Sparse Autoencoders (SAEs) that rewrites training data to alter lexical structure while preserving semantic content and downstream utility. Our experiments show that state-of-the-art MIAs degrade when models are fine-tuned on SAGE-generated paraphrases, indicating that their signals are not robust to semantics-preserving transformations. While some leakage remains in certain fine-tuning regimes, these results suggest that MIAs are brittle in adversarial settings and insufficient, on their own, as a standalone mechanism for copyright auditing of LLMs.
zh
[AI-87] Online Continual Learning for Time Series: a Natural Score-driven Approach
【速读】:该论文旨在解决在线时间序列预测(Online Time Series Forecasting, OTSF)中面临的持续适应与长期记忆之间的平衡问题,尤其是在数据随时间演化、存在状态切换(regime-switching)的场景下。传统方法难以同时实现快速适应新环境和有效保留历史知识,而本文通过引入在线持续学习(Online Continual Learning, OCL)的思想来增强OTSF模型的鲁棒性和适应性。解决方案的关键在于:首先将神经网络优化重构为参数滤波问题,证明自然梯度下降(natural gradient descent)是一种基于得分驱动(score-driven)的方法,并具有信息论最优性;其次,在损失函数中引入Student’s t分布似然,使更新过程受控且对异常值更具鲁棒性;最后提出Natural Score-driven Replay(NatSR)框架,结合改进的优化器、回放缓冲区(replay buffer)以及动态缩放启发式策略,显著提升在状态漂移时的快速适应能力。实证结果表明,该方法在性能上优于更复杂的现有主流方法。
链接: https://arxiv.org/abs/2601.12931
作者: Edoardo Urettini,Daniele Atzeni,Ioanna-Yvonni Tsaknaki,Antonio Carta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Online continual learning (OCL) methods adapt to changing environments without forgetting past knowledge. Similarly, online time series forecasting (OTSF) is a real-world problem where data evolve in time and success depends on both rapid adaptation and long-term memory. Indeed, time-varying and regime-switching forecasting models have been extensively studied, offering a strong justification for the use of OCL in these settings. Building on recent work that applies OCL to OTSF, this paper aims to strengthen the theoretical and practical connections between time series methods and OCL. First, we reframe neural network optimization as a parameter filtering problem, showing that natural gradient descent is a score-driven method and proving its information-theoretic optimality. Then, we show that using a Student’s t likelihood in addition to natural gradient induces a bounded update, which improves robustness to outliers. Finally, we introduce Natural Score-driven Replay (NatSR), which combines our robust optimizer with a replay buffer and a dynamic scale heuristic that improves fast adaptation at regime drifts. Empirical results demonstrate that NatSR achieves stronger forecasting performance than more complex state-of-the-art methods.
zh
[AI-88] ForeDiffusion: Foresight-Conditioned Diffusion Policy via Future View Construction for Robot Manipulation
【速读】:该论文旨在解决现有扩散策略(diffusion strategies)在复杂机器人操作任务中成功率下降的问题,其核心限制在于仅依赖短期观测作为条件,且训练目标仅基于单一去噪损失,导致误差累积和抓取偏差。解决方案的关键在于提出前瞻条件扩散模型(Foresight-Conditioned Diffusion, ForeDiffusion),通过将预测的未来视图表征注入扩散过程,使策略具备前瞻性,从而纠正轨迹偏差;同时引入双损失机制,联合优化传统去噪损失与未来观测的一致性损失,实现更稳定和高效的策略学习。
链接: https://arxiv.org/abs/2601.12925
作者: Weize Xie,Yi Ding,Ying He,Leilei Wang,Binwen Bai,Zheyi Zhao,Chenyang Wang,F. Richard Yu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion strategies have advanced visual motor control by progressively denoising high-dimensional action sequences, providing a promising method for robot manipulation. However, as task complexity increases, the success rate of existing baseline models decreases considerably. Analysis indicates that current diffusion strategies are confronted with two limitations. First, these strategies only rely on short-term observations as conditions. Second, the training objective remains limited to a single denoising loss, which leads to error accumulation and causes grasping deviations. To address these limitations, this paper proposes Foresight-Conditioned Diffusion (ForeDiffusion), by injecting the predicted future view representation into the diffusion process. As a result, the policy is guided to be forward-looking, enabling it to correct trajectory deviations. Following this design, ForeDiffusion employs a dual loss mechanism, combining the traditional denoising loss and the consistency loss of future observations, to achieve the unified optimization. Extensive evaluation on the Adroit suite and the MetaWorld benchmark demonstrates that ForeDiffusion achieves an average success rate of 80% for the overall task, significantly outperforming the existing mainstream diffusion methods by 23% in complex tasks, while maintaining more stable performance across the entire tasks.
zh
[AI-89] Your Privacy Depends on Others: Collusion Vulnerabilities in Individual Differential Privacy
【速读】:该论文试图解决个体差分隐私(Individual Differential Privacy, iDP)机制中存在的一个此前被忽视的漏洞:尽管iDP承诺用户对其隐私具有控制权,但实际中个体的隐私风险不仅取决于其自身的隐私预算(privacy budget),还高度依赖于其他所有数据贡献者的隐私选择,导致隐私风险在系统层面被集体决定,从而违背了“个体可控”的设计初衷。解决方案的关键在于提出一种新的隐私契约——(εi,δi,Δ)-iDP,该契约利用Δ-散度(Δ-divergences)为用户提供超额脆弱性(excess vulnerability)的硬性上界,同时保持机制设计的灵活性,从而在不违反差分隐私保证的前提下,显式量化并限制因群体隐私偏好分布不均而引发的额外隐私泄露风险。
链接: https://arxiv.org/abs/2601.12922
作者: Johannes Kaiser,Alexander Ziller,Eleni Triantafillou,Daniel Rückert,Georgios Kaissis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Individual Differential Privacy (iDP) promises users control over their privacy, but this promise can be broken in practice. We reveal a previously overlooked vulnerability in sampling-based iDP mechanisms: while conforming to the iDP guarantees, an individual’s privacy risk is not solely governed by their own privacy budget, but critically depends on the privacy choices of all other data contributors. This creates a mismatch between the promise of individual privacy control and the reality of a system where risk is collectively determined. We demonstrate empirically that certain distributions of privacy preferences can unintentionally inflate the privacy risk of individuals, even when their formal guarantees are met. Moreover, this excess risk provides an exploitable attack vector. A central adversary or a set of colluding adversaries can deliberately choose privacy budgets to amplify vulnerabilities of targeted individuals. Most importantly, this attack operates entirely within the guarantees of DP, hiding this excess vulnerability. Our empirical evaluation demonstrates successful attacks against 62% of targeted individuals, substantially increasing their membership inference susceptibility. To mitigate this, we propose (\varepsilon_i,\delta_i,\overline\Delta) -iDP a privacy contract that uses \Delta -divergences to provide users with a hard upper bound on their excess vulnerability, while offering flexibility to mechanism design. Our findings expose a fundamental challenge to the current paradigm, demanding a re-evaluation of how iDP systems are designed, audited, communicated, and deployed to make excess risks transparent and controllable.
zh
[AI-90] Actionable Interpretability Must Be Defined in Terms of Symmetries
【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)可解释性研究中存在的根本性问题,即现有可解释性定义缺乏“可操作性”(actionable),无法为具体的建模与推理规则提供形式化指导。其解决方案的关键在于提出:一个可解释性的定义若要具备可操作性,必须基于“对称性”(symmetries)来构建;作者进一步假设四类对称性足以(i)驱动核心可解释性属性的合理性,(ii)刻画可解释模型的类别,并(iii)统一推导出可解释推理的形式,如对齐(alignment)、干预(interventions)和反事实推理(counterfactuals),这些均可视为贝叶斯逆推(Bayesian inversion)的一种表现。
链接: https://arxiv.org/abs/2601.12913
作者: Pietro Barbiero,Mateo Espinosa Zarlenga,Francesco Giannini,Alberto Termine,Filippo Bonchi,Mateja Jamnik,Giuseppe Marra
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:This paper argues that interpretability research in Artificial Intelligence is fundamentally ill-posed as existing definitions of interpretability are not actionable: they fail to provide formal principles from which concrete modelling and inferential rules can be derived. We posit that for a definition of interpretability to be actionable, it must be given in terms of symmetries. We hypothesise that four symmetries suffice to (i) motivate core interpretability properties, (ii) characterize the class of interpretable models, and (iii) derive a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion.
zh
[AI-91] Human Emotion Verification by Action Languages via Answer Set Programming
【速读】:该论文旨在解决如何在逻辑编程框架下对人类心理状态(如情绪)的动态演化进行形式化建模与控制的问题,尤其关注动作序列引发的心理变化是否符合心理学原理以及如何避免不期望的心理副作用。其解决方案的关键在于引入一种基于答案集编程(Answer Set Programming, ASP)和转移系统(Transition Systems)的动作语言C-MT(Mind Transition Language),并通过扩展因果规则“forbids to cause”及专门用于心理状态动态演化的表达式,将心理变化的原则转化为转移约束;这些约束通过轨迹(trajectories)进行严格评估,从而实现对心理状态演化过程的可控推理,并支持不同心理机制下的轨迹比较分析。
链接: https://arxiv.org/abs/2601.12912
作者: Andreas Brännström,Juan Carlos Nieves
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under consideration in Theory and Practice of Logic Programming (TPLP)
Abstract:In this paper, we introduce the action language C-MT (Mind Transition Language). It is built on top of answer set programming (ASP) and transition systems to represent how human mental states evolve in response to sequences of observable actions. Drawing on well-established psychological theories, such as the Appraisal Theory of Emotion, we formalize mental states, such as emotions, as multi-dimensional configurations. With the objective to address the need for controlled agent behaviors and to restrict unwanted mental side-effects of actions, we extend the language with a novel causal rule, forbids to cause, along with expressions specialized for mental state dynamics, which enables the modeling of principles for valid transitions between mental states. These principles of mental change are translated into transition constraints, and properties of invariance, which are rigorously evaluated using transition systems in terms of so-called trajectories. This enables controlled reasoning about the dynamic evolution of human mental states. Furthermore, the framework supports the comparison of different dynamics of change by analyzing trajectories that adhere to different psychological principles. We apply the action language to design models for emotion verification. Under consideration in Theory and Practice of Logic Programming (TPLP).
zh
[AI-92] AdaNODEs: Test Time Adaptation for Time Series Forecasting Using Neural ODEs ICASSP2026
【速读】:该论文旨在解决预训练模型在面对未见过的时间序列数据分布时的适应性问题,即测试时适应(Test Time Adaptation, TTA)在时间序列预测任务中的应用难题。现有TTA方法多针对独立同分布数据设计,忽视了时间序列特有的时序依赖性和分布偏移特性,且缺乏对预测任务的有效适配机制。解决方案的关键在于提出AdaNODEs——一种源-free的TTA框架,利用神经微分方程(Neural Ordinary Differential Equations, NODEs)建模时间序列动态过程,并设计了一种新颖的损失函数以优化预测性能;该方法仅需更新少量模型参数,在保持高效计算的同时有效捕捉时序依赖关系,显著提升了在高严重度分布偏移下的鲁棒性与预测精度。
链接: https://arxiv.org/abs/2601.12893
作者: Ting Dang,Soumyajit Chatterjee,Hong Jia,Yu Wu,Flora Salim,Fahim Kawsar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2026
Abstract:Test time adaptation (TTA) has emerged as a promising solution to adapt pre-trained models to new, unseen data distributions using unlabeled target domain data. However, most TTA methods are designed for independent data, often overlooking the time series data and rarely addressing forecasting tasks. This paper presents AdaNODEs, an innovative source-free TTA method tailored explicitly for time series forecasting. By leveraging Neural Ordinary Differential Equations (NODEs), we propose a novel adaptation framework that accommodates the unique characteristics of distribution shifts in time series data. Moreover, we innovatively propose a new loss function to tackle TTA for forecasting tasks. AdaNODEs only requires updating limited model parameters, showing effectiveness in capturing temporal dependencies while avoiding significant memory usage. Extensive experiments with one- and high-dimensional data demonstrate that AdaNODEs offer relative improvements of 5.88% and 28.4% over the SOTA baselines, especially demonstrating robustness across higher severity distribution shifts.
zh
[AI-93] Communication Methods in Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中通信机制设计与选择的难题,特别是如何在部分可观测环境、非平稳性以及动作空间指数级增长等挑战下,实现高效协作。其解决方案的关键在于系统性地分析29篇相关文献,对显式通信、隐式通信、基于注意力机制的通信、基于图结构的通信以及分层/角色驱动的通信五类方法进行深入比较,揭示各类方法的优势与局限性,并强调通信方案的选择需依据具体问题特性而定,同时指出低计算开销的通信机制对于提升系统可扩展性至关重要。
链接: https://arxiv.org/abs/2601.12886
作者: Christoph Wittner
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 2 figures
Abstract:Multi-agent reinforcement learning is a promising research area that extends established reinforcement learning approaches to problems formulated as multi-agent systems. Recently, a multitude of communication methods have been introduced to this field to address problems such as partially observable environments, non-stationarity, and exponentially growing action spaces. Communication further enables efficient cooperation among all agents interacting in an environment. This work aims at providing an overview of communication techniques in multi-agent reinforcement learning. By an in-depth analysis of 29 publications on this topic, the strengths and weaknesses of explicit, implicit, attention-based, graph-based, and hierarchical/role-based communication are evaluated. The results of this comparison show that there is no general, optimal communication framework for every problem. On the contrary, the choice of communication depends heavily on the problem at hand. The comparison also highlights the importance of communication methods with low computational overhead to enable scalability to environments where many agents interact. Finally, the paper discusses current research gaps, emphasizing the need for standardized benchmarking of system-level metrics and improved robustness under realistic communication conditions to enhance the real-world applicability of these approaches.
zh
[AI-94] Mining Citywide Dengue Spread Patterns in Singapore Through Hotspot Dynamics from Open Web Data WWW2026
【速读】:该论文旨在解决城市地区登革热(Dengue)传播风险难以提前预测的问题,尤其是在热带城市如新加坡等区域,传统基于孤立病例报告的防控策略往往滞后且效率低下。其核心挑战在于如何从公开的登革热病例数据中挖掘出隐藏的传播关联,从而实现对热点区域的前瞻性预警与解释性分析。解决方案的关键在于提出一种新颖的框架,通过挖掘城市区域内隐含的传播链(latent transmission links),将病例视为相互影响的动态系统而非独立事件;利用梯度下降优化这些隐性链接,并结合通勤流数据验证其合理性,从而构建一个可解释、稳定且具备预测能力的传播网络模型。该方法仅需四周热点历史数据即可实现平均F-score达0.79的预测性能,显著提升了公共卫生干预的主动性和精准性。
链接: https://arxiv.org/abs/2601.12856
作者: Liping Huang,Gaoxi Xiao,Stefan Ma,Hechang Chen,Shisong Tang,Flora Salim
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 9 figures. It’s accepted by WWW 2026 Web4Good Track. To make accessible earlier, authors would like to put it on arxiv before the conference
Abstract:Dengue, a mosquito-borne disease, continues to pose a persistent public health challenge in urban areas, particularly in tropical regions such as Singapore. Effective and affordable control requires anticipating where transmission risks are likely to emerge so that interventions can be deployed proactively rather than reactively. This study introduces a novel framework that uncovers and exploits latent transmission links between urban regions, mined directly from publicly available dengue case data. Instead of treating cases as isolated reports, we model how hotspot formation in one area is influenced by epidemic dynamics in neighboring regions. While mosquito movement is highly localized, long-distance transmission is often driven by human mobility, and in our case study, the learned network aligns closely with commuting flows, providing an interpretable explanation for citywide spread. These hidden links are optimized through gradient descent and used not only to forecast hotspot status but also to verify the consistency of spreading patterns, by examining the stability of the inferred network across consecutive weeks. Case studies on Singapore during 2013-2018 and 2020 show that four weeks of hotspot history are sufficient to achieve an average F-score of 0.79. Importantly, the learned transmission links align with commuting flows, highlighting the interpretable interplay between hidden epidemic spread and human mobility. By shifting from simply reporting dengue cases to mining and validating hidden spreading dynamics, this work transforms open web-based case data into a predictive and explanatory resource. The proposed framework advances epidemic modeling while providing a scalable, low-cost tool for public health planning, early intervention, and urban resilience.
zh
[AI-95] he Cost of EFX: Generalized-Mean Welfare and Complexity Dichotomies with Few Surplus Items
【速读】:该论文致力于解决在存在少量盈余物品(即物品数量比参与者数量最多多出三个)的场景下,如何在保证 envy-freeness up to any good (EFX) 的前提下优化广义均值(p-mean)福利的问题。其核心挑战在于:当 p > 0 时,判定是否存在 EFX 分配能达到全局 p-均值最优解,以及计算最大化 p-均值福利的 EFX 分配均为 NP-hard;而当 p ≤ 0 时,作者提出了多项式时间算法,在 EFX 分配空间中可高效优化 p-均值福利,并能有效验证是否达到全局最优。关键突破在于揭示了 p = 0(即 Nash 均值)处的复杂性分界——对于 p ≤ 0,EFX 与福利最大化结构上相容,且公平性代价(price of fairness)有界甚至趋于零,从而明确了 EFX 在不同福利目标下的计算可行性边界。
链接: https://arxiv.org/abs/2601.12849
作者: Eugene Lim,Tzeh Yuan Neoh,Nicholas Teh
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
备注:
Abstract:Envy-freeness up to any good (EFX) is a central fairness notion for allocating indivisible goods, yet its existence is unresolved in general. In the setting with few surplus items, where the number of goods exceeds the number of agents by a small constant (at most three), EFX allocations are guaranteed to exist, shifting the focus from existence to efficiency and computation. We study how EFX interacts with generalized-mean ( p -mean) welfare, which subsumes commonly-studied utilitarian ( p=1 ), Nash ( p=0 ), and egalitarian ( p \rightarrow -\infty ) objectives. We establish sharp complexity dichotomies at p=0 : for any fixed p \in (0,1] , both deciding whether EFX can attain the global p -mean optimum and computing an EFX allocation maximizing p -mean welfare are NP-hard, even with at most three surplus goods; in contrast, for any fixed p \leq 0 , we give polynomial-time algorithms that optimize p -mean welfare within the space of EFX allocations and efficiently certify when EFX attains the global optimum. We further quantify the welfare loss of enforcing EFX via the price of fairness framework, showing that for p 0 , the loss can grow linearly with the number of agents, whereas for p \leq 0 , it is bounded by a constant depending on the surplus (and for Nash welfare it vanishes asymptotically). Finally we show that requiring Pareto-optimality alongside EFX is NP-hard (and becomes \Sigma_2^P -complete for a stronger variant of EFX). Overall, our results delineate when EFX is computationally costly versus structurally aligned with welfare maximization in the setting with few surplus items.
zh
[AI-96] SCULPT: Constraint-Guided Pruned MCTS that Carves Efficient Paths for Mathematical Reasoning
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自动化代理工作流中,因依赖随机探索策略而导致的低效问题——即在缺乏领域先验知识的情况下,现有流水线通过通用提示或学习到的策略采样候选步骤,导致对操作符、单位和格式的近乎随机遍历,从而难以找到合理的推理路径。解决方案的关键在于提出SCULPT方法,这是一种约束引导的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)框架,其核心创新是在选择(selection)、扩展(expansion)、模拟(simulation)和反向传播(backpropagation)四个阶段嵌入领域感知评分机制,利用符号检查(如维度一致性、类型兼容性、量级合理性、深度控制和多样性)与结构模式引导相结合的方式对动作进行评分与剪枝,从而有效引导搜索向可解释且合理的推理路径收敛,实现准确率提升的同时保持效率与推理稳定性。
链接: https://arxiv.org/abs/2601.12842
作者: Qitong Fang(1),Haotian Li(1),Xu Wang(1) ((1) Jilin Jianzhu University)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 3 figures. Equal contribution: Qitong Fang and Haotian Li. Corresponding authors: Qitong Fang (fangqitong@student. this http URL ), Haotian Li (lihaotian@student. this http URL ), Xu Wang (wangxu@jlju. this http URL )
Abstract:Automated agent workflows can enhance the problem-solving ability of large language models (LLMs), but common search strategies rely on stochastic exploration and often traverse implausible branches. This occurs because current pipelines sample candidate steps from generic prompts or learned policies with weak domain priors, yielding near-random walks over operators, units, and formats. To promote ordered exploration, this paper introduces SCULPT, a constraint-guided approach for Monte Carlo Tree Search (MCTS) that integrates domain-aware scoring into selection, expansion, simulation, and backpropagation. SCULPT scores and prunes actions using a combination of symbolic checks (dimensional consistency, type compatibility, magnitude sanity, depth control, and diversity) and structural pattern guidance, thereby steering the search toward plausible reasoning paths. Under matched LLM configurations, SCULPT yields stable improvements on multiple datasets; additional results with GPT-5.2 assess executor transferability and performance on frontier reasoning models. Overall, domain-aware constraints can improve accuracy while maintaining efficiency and reasoning stability.
zh
[AI-97] MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction
【速读】:该论文旨在解决集成大规模基础模型的计算机使用代理(Computer Use Agents, CUAs)在通过图形用户界面(GUI)自主执行复杂任务时所面临的安全风险问题,尤其是恶意指令或视觉提示注入可能引发不安全推理并导致系统级危害。解决方案的关键在于提出一种名为MirrorGuard的即插即用防御框架,其核心创新是基于仿真训练的神经符号模拟流水线:该流水线在纯文本模拟环境中生成高风险GUI交互轨迹,精准捕获不安全推理模式与潜在系统危害,而无需执行真实操作;在此基础上,MirrorGuard学习识别并修正CUA中的不安全推理链,从而在实际部署中有效降低安全事件发生率,同时保持极低的误拒率(False Rejection Rate, FRR),显著优于现有方法如GuardAgent。
链接: https://arxiv.org/abs/2601.12822
作者: Wenqi Zhang,Yulin Shen,Changyue Jiang,Jiarun Dai,Geng Hong,Xudong Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large foundation models are integrated into Computer Use Agents (CUAs), enabling autonomous interaction with operating systems through graphical user interfaces (GUIs) to perform complex tasks. This autonomy introduces serious security risks: malicious instructions or visual prompt injections can trigger unsafe reasoning and cause harmful system-level actions. Existing defenses, such as detection-based blocking, prevent damage but often abort tasks prematurely, reducing agent utility. In this paper, we present MirrorGuard, a plug-and-play defense framework that uses simulation-based training to improve CUA security in the real world. To reduce the cost of large-scale training in operating systems, we propose a novel neural-symbolic simulation pipeline, which generates realistic, high-risk GUI interaction trajectories entirely in a text-based simulated environment, which captures unsafe reasoning patterns and potential system hazards without executing real operations. In the simulation environment, MirrorGuard learns to intercept and rectify insecure reasoning chains of CUAs before they produce and execute unsafe actions. In real-world testing, extensive evaluations across diverse benchmarks and CUA architectures show that MirrorGuard significantly mitigates security risks. For instance, on the ByteDance UI-TARS system, it reduces the unsafe rate from 66.5% to 13.0% while maintaining a marginal false refusal rate (FRR). In contrast, the state-of-the-art GuardAgent only achieves a reduction to 53.9% and suffers from a 15.4% higher FRR. Our work proves that simulation-derived defenses can provide robust, real-world protection while maintaining the fundamental utility of the agent. Our code and model are publicly available at this https URL.
zh
[AI-98] Fisher-Orthogonal Projected Natural Gradient Descent for Continual Learning
【速读】:该论文旨在解决持续学习(Continual Learning)中的灾难性遗忘(Catastrophic Forgetting)问题,即神经网络在学习新任务时会严重损害对先前任务的性能。解决方案的关键在于提出一种名为Fisher-Orthogonal Projected Natural Gradient Descent (FOPNG)的优化器,其核心机制是在参数更新中引入Fisher正交约束,通过将梯度投影到前序任务梯度的Fisher正交补空间中,从而在保持旧任务输出稳定性的同时学习新任务。该方法在信息几何框架下统一了自然梯度下降与正交梯度方法,确保更新方向在重参数化下不变,并保证在Fisher度量下的下降性质,有效缓解遗忘问题。
链接: https://arxiv.org/abs/2601.12816
作者: Ishir Garg,Neel Kolhe,Andy Peng,Rohan Gopalam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual learning aims to enable neural networks to acquire new knowledge on sequential tasks. However, the key challenge in such settings is to learn new tasks without catastrophically forgetting previously learned tasks. We propose the Fisher-Orthogonal Projected Natural Gradient Descent (FOPNG) optimizer, which enforces Fisher-orthogonal constraints on parameter updates to preserve old task performance while learning new tasks. Unlike existing methods that operate in Euclidean parameter space, FOPNG projects gradients onto the Fisher-orthogonal complement of previous task gradients. This approach unifies natural gradient descent with orthogonal gradient methods within an information-geometric framework. The resulting update direction is invariant under reparameterization, guarantees descent in the Fisher metric, and helps preserve prior task outputs. We provide theoretical analysis establishing the properties of the projected update, describe efficient and practical implementations using the diagonal Fisher, and demonstrate strong results on standard continual learning benchmarks such as Permuted-MNIST, Split-MNIST, Rotated-MNIST, Split-CIFAR10, and Split-CIFAR100.
zh
[AI-99] SL-CBM: Enhancing Concept Bottleneck Models with Semantic Locality for Better Interpretability
【速读】:该论文旨在解决概念瓶颈模型(Concept Bottleneck Models, CBMs)在空间局部忠实性(locality faithfulness)方面的不足,即现有CBMs难以将概念与图像中具有语义意义的区域进行精准对齐,从而限制了其解释性和可靠性。解决方案的关键在于提出SL-CBM(CBM with Semantic Locality),通过引入一个1×1卷积层和交叉注意力机制(cross-attention mechanism),在概念层面和类别层面生成空间一致的显著性图(saliency maps),从而强化概念、图像区域与最终预测之间的对齐关系;该设计使显著性图自然地关联于模型内部推理过程,提升了可解释性、调试效率和干预有效性,同时保持了较高的分类准确率。
链接: https://arxiv.org/abs/2601.12804
作者: Hanwei Zhang,Luo Cheng,Rui Wen,Yang Zhang,Lijun Zhang,Holger Hermanns
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Explainable AI (XAI) is crucial for building transparent and trustworthy machine learning systems, especially in high-stakes domains. Concept Bottleneck Models (CBMs) have emerged as a promising ante-hoc approach that provides interpretable, concept-level explanations by explicitly modeling human-understandable concepts. However, existing CBMs often suffer from poor locality faithfulness, failing to spatially align concepts with meaningful image regions, which limits their interpretability and reliability. In this work, we propose SL-CBM (CBM with Semantic Locality), a novel extension that enforces locality faithfulness by generating spatially coherent saliency maps at both concept and class levels. SL-CBM integrates a 1x1 convolutional layer with a cross-attention mechanism to enhance alignment between concepts, image regions, and final predictions. Unlike prior methods, SL-CBM produces faithful saliency maps inherently tied to the model’s internal reasoning, facilitating more effective debugging and intervention. Extensive experiments on image datasets demonstrate that SL-CBM substantially improves locality faithfulness, explanation quality, and intervention efficacy while maintaining competitive classification accuracy. Our ablation studies highlight the importance of contrastive and entropy-based regularization for balancing accuracy, sparsity, and faithfulness. Overall, SL-CBM bridges the gap between concept-based reasoning and spatial explainability, setting a new standard for interpretable and trustworthy concept-based models.
zh
[AI-100] Distilling Time Series Foundation Models for Efficient Forecasting ICASSP-2026
【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在部署时因参数量过大而导致的成本高昂问题,同时针对知识蒸馏技术在时间序列预测任务中不适用的挑战提出改进方案。其关键解决方案包括:(1) 提出基于时序跨度加权的目标函数(horizon-weighted objectives),以缓解预测任务中不同时间跨度的学习难度差异问题,确保长程预测获得充分监督;(2) 设计一种时间对齐策略(temporal alignment strategy),用于减少师生模型间的结构差异,从而提升蒸馏效率与压缩后模型的性能。实验表明,DistilTS可在保持与原TSFM相当预测精度的前提下,将参数量降低至1/150,并实现最高达6000倍的推理加速。
链接: https://arxiv.org/abs/2601.12785
作者: Yuqi Li,Kuiye Ding,Chuanguang Yang,Szu-Yu Chen,Yingli Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP-2026
Abstract:Time Series foundation models (TSFMs) deliver strong forecasting performance through large-scale pretraining, but their large parameter sizes make deployment costly. While knowledge distillation offers a natural and effective approach for model compression, techniques developed for general machine learning tasks are not directly applicable to time series forecasting due to the unique characteristics. To address this, we present DistilTS, the first distillation framework specifically designed for TSFMs. DistilTS addresses two key challenges: (1) task difficulty discrepancy, specific to forecasting, where uniform weighting makes optimization dominated by easier short-term horizons, while long-term horizons receive weaker supervision; and (2) architecture discrepancy, a general challenge in distillation, for which we design an alignment mechanism in the time series forecasting. To overcome these issues, DistilTS introduces horizon-weighted objectives to balance learning across horizons, and a temporal alignment strategy that reduces architectural mismatch, enabling compact models. Experiments on multiple benchmarks demonstrate that DistilTS achieves forecasting performance comparable to full-sized TSFMs, while reducing parameters by up to 1/150 and accelerating inference by up to 6000x. Code is available at: this https URL.
zh
[AI-101] aching LLM s to Learn Tool Trialing and Execution through Environment Interaction
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在使用外部工具时缺乏泛化能力与鲁棒性的问题,尤其是在面对新引入或未见过的工具时,传统以轨迹为中心的方法因依赖静态解决方案路径的记忆而难以适应变化。其核心解决方案是提出ToolMaster框架,通过将工具调用从模仿预设轨迹转变为基于环境交互的主动学习机制,采用“试错-执行”范式:首先利用教师生成的包含显式工具尝试与自我修正的轨迹进行模仿学习,随后通过强化学习联合优化试错与执行阶段,使代理能够自主探索正确的工具使用方式并积累经验知识,从而显著提升对未知工具的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2601.12762
作者: Xingjie Gao,Pengcheng Huang,Zhenghao Liu,Yukun Yan,Shuo Wang,Zulong Chen,Chen Qian,Ge Yu,Yu Gu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Equipping Large Language Models (LLMs) with external tools enables them to solve complex real-world problems. However, the robustness of existing methods remains a critical challenge when confronting novel or evolving tools. Existing trajectory-centric paradigms primarily rely on memorizing static solution paths during training, which limits the ability of LLMs to generalize tool usage to newly introduced or previously unseen tools. In this paper, we propose ToolMaster, a framework that shifts tool use from imitating golden tool-calling trajectories to actively learning tool usage through interaction with the environment. To optimize LLMs for tool planning and invocation, ToolMaster adopts a trial-and-execution paradigm, which trains LLMs to first imitate teacher-generated trajectories containing explicit tool trials and self-correction, followed by reinforcement learning to coordinate the trial and execution phases jointly. This process enables agents to autonomously explore correct tool usage by actively interacting with environments and forming experiential knowledge that benefits tool execution. Experimental results demonstrate that ToolMaster significantly outperforms existing baselines in terms of generalization and robustness across unseen or unfamiliar tools. All code and data are available at this https URL.
zh
[AI-102] A Graph Prompt Fine-Tuning Method for WSN Spatio-Temporal Correlation Anomaly Detection
【速读】:该论文针对无线传感器网络(Wireless Sensor Network, WSN)中多时序模态数据异常检测存在的三大问题展开研究:一是现有方法对时空相关特征提取不足;二是异常样本类别标注成本高;三是异常样本分布不均衡。解决方案的关键在于构建一个融合时空相关性特征的图神经网络异常检测主干网络,并提出“预训练-图提示-微调”的多任务自监督训练策略。具体而言,通过改进基于多尺度策略和跨模态融合的Mamba模型,并结合变分图卷积模块,实现对WSN多节点、多时间模态场景下时空关联特征的有效提取;同时设计包含无负样本对比学习、预测与重建三个子任务的预训练机制,以及利用图提示引导微调的机制,在减少人工标注依赖的同时提升模型泛化能力与检测性能,实验表明在公开数据集和实际采集数据集上的F1值分别达到91.30%和92.31%,优于现有方法。
链接: https://arxiv.org/abs/2601.12745
作者: Miao Ye,Jing Cui,Yuan huang,Qian He,Yong Wang,Jiwen Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Anomaly detection of multi-temporal modal data in Wireless Sensor Network (WSN) can provide an important guarantee for reliable network operation. Existing anomaly detection methods in multi-temporal modal data scenarios have the problems of insufficient extraction of spatio-temporal correlation features, high cost of anomaly sample category annotation, and imbalance of anomaly samples. In this paper, a graph neural network anomaly detection backbone network incorporating spatio-temporal correlation features and a multi-task self-supervised training strategy of “pre-training - graph prompting - fine-tuning” are designed for the characteristics of WSN graph structure data. First, the anomaly detection backbone network is designed by improving the Mamba model based on a multi-scale strategy and inter-modal fusion method, and combining it with a variational graph convolution module, which is capable of fully extracting spatio-temporal correlation features in the multi-node, multi-temporal modal scenarios of WSNs. Secondly, we design a three-subtask learning “pre-training” method with no-negative comparative learning, prediction, and reconstruction to learn generic features of WSN data samples from unlabeled data, and design a “graph prompting-fine-tuning” mechanism to guide the pre-trained self-supervised learning. The model is fine-tuned through the “graph prompting-fine-tuning” mechanism to guide the pre-trained self-supervised learning model to complete the parameter fine-tuning, thereby reducing the training cost and enhancing the detection generalization performance. The F1 metrics obtained from experiments on the public dataset and the actual collected dataset are up to 91.30% and 92.31%, respectively, which provides better detection performance and generalization ability than existing methods designed by the method.
zh
[AI-103] Vision Language Models for Optimization-Driven Intent Processing in Autonomous Networks
【速读】:该论文旨在解决意图驱动网络(Intent-Based Networking, IBN)中优化代码生成的瓶颈问题,即如何从网络结构图(如标注的网络草图)中自动提取参数并生成可证明最优的优化代码,以支持流量工程、路由和资源分配等任务。当前系统依赖文本形式的意图表达,限制了网络工程师直观建模的能力;而视觉语言模型(Vision-Language Models, VLMs)是否能有效处理网络草图并生成正确代码尚未被探索。解决方案的关键在于构建一个包含85个优化问题的基准测试集(IntentOpt),评估四种主流VLM在多模态输入(图像+文本)与纯文本输入下的性能差异,并揭示视觉参数提取、思维链提示策略及开源/闭源模型间的性能差距。结果表明,视觉输入虽具潜力但导致执行成功率下降12–21个百分点,且开放模型显著落后于闭源模型,为未来IBN系统中融合多模态理解与优化推理提供了关键基线和实践路径。
链接: https://arxiv.org/abs/2601.12744
作者: Tasnim Ahmed,Yifan Zhu,Salimur Choudhury
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Software Engineering (cs.SE)
备注: Accepted for presentation at The IEEE International Conference on Communications (ICC) 2026
Abstract:Intent-Based Networking (IBN) allows operators to specify high-level network goals rather than low-level configurations. While recent work demonstrates that large language models can automate configuration tasks, a distinct class of intents requires generating optimization code to compute provably optimal solutions for traffic engineering, routing, and resource allocation. Current systems assume text-based intent expression, requiring operators to enumerate topologies and parameters in prose. Network practitioners naturally reason about structure through diagrams, yet whether Vision-Language Models (VLMs) can process annotated network sketches into correct optimization code remains unexplored. We present IntentOpt, a benchmark of 85 optimization problems across 17 categories, evaluating four VLMs (GPT-5-Mini, Claude-Haiku-4.5, Gemini-2.5-Flash, Llama-3.2-11B-Vision) under three prompting strategies on multimodal versus text-only inputs. Our evaluation shows that visual parameter extraction reduces execution success by 12-21 percentage points (pp), with GPT-5-Mini dropping from 93% to 72%. Program-of-thought prompting decreases performance by up to 13 pp, and open-source models lag behind closed-source ones, with Llama-3.2-11B-Vision reaching 18% compared to 75% for GPT-5-Mini. These results establish baseline capabilities and limitations of current VLMs for optimization code generation within an IBN system. We also demonstrate practical feasibility through a case study that deploys VLM-generated code to network testbed infrastructure using Model Context Protocol.
zh
[AI-104] AirHunt: Bridging VLM Semantics and Continuous Planning for Efficient Aerial Object Navigation
【速读】:该论文旨在解决当前基于大视觉语言模型(Vision-Language Models, VLMs)的无人机对象导航系统在实际应用中面临的三大核心问题:一是VLM推理频率与实时路径规划之间存在数量级差异,导致难以集成到飞行控制系统;二是VLM对三维场景理解能力有限,无法有效支持复杂环境下的目标定位;三是缺乏统一机制以平衡语义引导与运动效率,在大规模环境中难以实现高效导航。解决方案的关键在于提出AirHunt系统,其核心创新包括:(1) 双通道异步架构,实现VLM语义推理与连续路径规划之间的协同接口,支持飞行过程中动态调整语义引导;(2) 主动双任务推理模块,利用几何与语义冗余信息实现选择性VLM调用,降低计算开销;(3) 语义-几何一致性规划模块,统一建模语义优先级与运动效率,适应环境异质性。该方案显著提升了无人机在开放集目标搜索中的成功率、精度和飞行效率。
链接: https://arxiv.org/abs/2601.12742
作者: Xuecheng Chen,Zongzhuo Liu,Jianfa Ma,Bang Du,Tiantian Zhang,Xueqian Wang,Boyu Zhou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large Vision-Language Models (VLMs) have provided rich semantic understanding that empowers drones to search for open-set objects via natural language instructions. However, prior systems struggle to integrate VLMs into practical aerial systems due to orders-of-magnitude frequency mismatch between VLM inference and real-time planning, as well as VLMs’ limited 3D scene understanding. They also lack a unified mechanism to balance semantic guidance with motion efficiency in large-scale environments. To address these challenges, we present AirHunt, an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments by seamlessly fusing VLM semantic reasoning with continuous path planning. AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM reasoning and path planning, enabling continuous flight with adaptive semantic guidance that evolves through motion. Moreover, we propose an active dual-task reasoning module that exploits geometric and semantic redundancy to enable selective VLM querying, and a semantic-geometric coherent planning module that dynamically reconciles semantic priorities and motion efficiency in a unified framework, enabling seamless adaptation to environmental heterogeneity. We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time compared to state-of-the-art methods. Real-world experiments further validate AirHunt’s practical capability in complex and challenging environments. Code and dataset will be made publicly available before publication.
zh
[AI-105] reeWriter: AI-Assisted Hierarchical Planning and Writing for Long-Form Documents
【速读】:该论文旨在解决长文档写作中面临的挑战,包括跨章节一致性维持、复杂度增加下的高效规划与写作支持,以及如何有效整合生成式 AI (Generative AI) 辅助以提升写作效率和质量。现有 AI 协同写作工具通常仅提供段落级建议或有限的结构化规划功能,难以覆盖从宏观构思到精细润色的完整写作流程。其解决方案的关键在于提出 TreeWriter——一种基于树形结构(tree-structured)的分层写作系统,将文档表示为多层级大纲,并集成上下文感知的 AI 代理(AI agent),实现多层次的草稿构建、内容动态加载与情境化编辑建议。实验表明,TreeWriter 显著提升了用户在创意探索、AI 辅助有效性及作者控制感方面的体验,验证了分层组织与嵌入式 AI 支持相结合的设计范式对复杂文档协作写作的潜力。
链接: https://arxiv.org/abs/2601.12740
作者: Zijian Zhang,Fangshi Du,Xingjian Liu,Pan Chen,Oliver Huang,Runlong Ye,Michael Liut,Alán Aspuru-Guzik
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Long documents pose many challenges to current intelligent writing systems. These include maintaining consistency across sections, sustaining efficient planning and writing as documents become more complex, and effectively providing and integrating AI assistance to the user. Existing AI co-writing tools offer either inline suggestions or limited structured planning, but rarely support the entire writing process that begins with high-level ideas and ends with polished prose, in which many layers of planning and outlining are needed. Here, we introduce TreeWriter, a hierarchical writing system that represents documents as trees and integrates contextual AI support. TreeWriter allows authors to create, save, and refine document outlines at multiple levels, facilitating drafting, understanding, and iterative editing of long documents. A built-in AI agent can dynamically load relevant content, navigate the document hierarchy, and provide context-aware editing suggestions. A within-subject study (N=12) comparing TreeWriter with Google Docs + Gemini on long-document editing and creative writing tasks shows that TreeWriter improves idea exploration/development, AI helpfulness, and perceived authorial control. A two-month field deployment (N=8) further demonstrated that hierarchical organization supports collaborative writing. Our findings highlight the potential of hierarchical, tree-structured editors with integrated AI support and provide design guidelines for future AI-assisted writing tools that balance automation with user agency.
zh
[AI-106] AI-exhibited Personality Traits Can Shape Human Self-concept through Conversations
【速读】:该论文旨在解决生成式 AI(Generative AI)在对话过程中可能通过其人格特质影响用户自我概念(self-concept)的潜在风险问题。解决方案的关键在于通过随机行为实验验证了用户在与基于 GPT-4o 的 AI 聊天机器人进行关于个人话题的对话后,其自我认知会趋向于与 AI 测量出的人格特质一致,且这种一致性随对话时长增加而增强,同时该现象与用户对话愉悦度呈正相关。这一发现揭示了 AI 人格对人类自我认知的塑造机制,为开发更具责任性和伦理性的 AI 系统提供了关键设计启示。
链接: https://arxiv.org/abs/2601.12727
作者: Jingshu Li,Tianqi Song,Nattapat Boonprakong,Zicheng Zhu,Yitian Yang,Yi-Chieh Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: ACM CHI 2026
Abstract:Recent Large Language Model (LLM) based AI can exhibit recognizable and measurable personality traits during conversations to improve user experience. However, as human understandings of their personality traits can be affected by their interaction partners’ traits, a potential risk is that AI traits may shape and bias users’ self-concept of their own traits. To explore the possibility, we conducted a randomized behavioral experiment. Our results indicate that after conversations about personal topics with an LLM-based AI chatbot using GPT-4o default personality traits, users’ self-concepts aligned with the AI’s measured personality traits. The longer the conversation, the greater the alignment. This alignment led to increased homogeneity in self-concepts among users. We also observed that the degree of self-concept alignment was positively associated with users’ conversation enjoyment. Our findings uncover how AI personality traits can shape users’ self-concepts through human-AI conversation, highlighting both risks and opportunities. We provide important design implications for developing more responsible and ethical AI systems.
zh
[AI-107] An Evolutionary Framework for Automatic Optimization Benchmark Generation via Large Language Models
【速读】:该论文旨在解决现有优化基准测试(optimization benchmark)在模拟真实世界问题结构多样性与不规则性方面的不足,以及基于真实问题构建基准的高成本和复杂性问题。其解决方案的关键在于提出一种基于大语言模型(large language model, LLM)驱动的进化自动基准生成框架(LLM-driven evolutionary benchmark generator, LLM-EBG),其中LLM作为进化算子,在灵活且表达能力强的表示空间中生成并演化优化问题。实验表明,该框架能有效生成使目标算法(如遗传算法 GA)在超过80%试验中显著优于对比算法(如差分进化 DE)的基准问题,并通过探索性景观分析揭示出偏好GA的基准问题对变量缩放高度敏感,从而验证了该方法可生成具有明确几何特征、反映不同优化算法内在搜索行为的基准问题。
链接: https://arxiv.org/abs/2601.12723
作者: Yuhiro Ono,Tomohiro Harada,Yukiya Miura
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimization benchmarks play a fundamental role in assessing algorithm performance; however, existing artificial benchmarks often fail to capture the diversity and irregularity of real-world problem structures, while benchmarks derived from real-world problems are costly and difficult to construct. To address these challenges, we propose an evolutionary automatic benchmark generation framework that leverages a large language model (LLM) as a generative operator, termed the LLM-driven evolutionary benchmark generator (LLM-EBG). In this framework, the LLM serves as an evolutionary operator that generates and evolves benchmark problems within a flexible, expressive representation space. As a case study, we generate unconstrained single-objective continuous minimization problems represented as mathematical expressions designed to induce significant performance differences between a genetic algorithm (GA) and differential evolution (DE). Experimental results show that LLM-EBG successfully produces benchmark problems in which the designated target algorithm consistently outperforms the comparative algorithm in more than 80% of trials. Furthermore, exploratory landscape analysis reveals that benchmarks favoring GA are highly sensitive to variable scaling, demonstrating that the proposed framework can generate problems with distinct geometric characteristics that reflect the intrinsic search behaviors of different optimization algorithms.
zh
[AI-108] aching Large Reasoning Models Effective Reflection
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在复杂推理任务中普遍存在“浅层反思”(superficial reflection)的问题,即模型生成的自我批判和回溯行为中存在大量低质量、无实质改进的内容,导致计算资源浪费且效果提升有限。解决方案的关键在于提出两个协同优化的模块:首先,Self-Critique Fine-Tuning (SCFT) 通过引导模型自动生成批判性反馈,利用拒绝采样筛选高质量批判,并基于批判目标进行微调,从而增强模型的反思能力;其次,在SCFT基础上引入强化学习框架 Reinforcement Learning with Effective Reflection Rewards (RLERR),将高质量反思作为奖励信号,进一步促使模型内化有效的自我修正机制。实验表明,该方法在AIME2024和AIME2025两个基准测试中显著提升了推理准确率与反思质量。
链接: https://arxiv.org/abs/2601.12720
作者: Hanbin Wang,Jingwei Song,Jinpeng Li,Qi Zhu,Fei Mi,Ganqu Cui,Yasheng Wang,Lifeng Shang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages (including appendix), 5 figures
Abstract:Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks, often by engaging in self-reflective behaviors such as self-critique and backtracking. However, not all reflections are beneficial-many are superficial, offering little to no improvement over the original answer and incurring computation overhead. In this paper, we identify and address the problem of superficial reflection in LRMs. We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model’s reflective reasoning ability using only self-generated critiques. SCFT prompts models to critique their own outputs, filters high-quality critiques through rejection sampling, and fine-tunes the model using a critique-based objective. Building on this strong foundation, we further introduce Reinforcement Learning with Effective Reflection Rewards (RLERR). RLERR leverages the high-quality reflections initialized by SCFT to construct reward signals, guiding the model to internalize the self-correction process via reinforcement learning. Experiments on two challenging benchmarks, AIME2024 and AIME2025, show that SCFT and RLERR significantly improve both reasoning accuracy and reflection quality, outperforming state-of-the-art baselines. All data and codes are available at this https URL.
zh
[AI-109] Neurosymbolic LoRA: Why and When to Tune Weights vs. Rewrite Prompts
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在适应新知识或任务时面临的两个核心挑战:一是数值微调(numerical fine-tuning)虽能有效注入事实性知识,但缺乏对风格和对齐(alignment)的灵活控制;二是符号化更新(symbolic manipulation)可实现无需重训练的风格与逻辑约束调整,却难以进行深层次的知识重构。解决方案的关键在于提出一种神经符号LoRA(neurosymbolic LoRA)框架,通过引入统一的监控信号和基于奖励的分类器动态决策何时使用LoRA进行参数级的事实重建,何时采用TextGrad进行token级符号编辑,并仅在必要时调用外部大语言模型执行符号变换,从而兼顾效率与灵活性。此外,符号编辑生成的高质量提示词可作为可复用训练数据,在数据稀缺领域(如数学推理)具有显著优势。
链接: https://arxiv.org/abs/2601.12711
作者: Kevin Wang,Neel P. Bhatt,Cong Liu,Junbo Li,Runjin Chen,Yihan Xi,Timothy Barclay,Alvaro Velasquez,Ufuk Topcu,Zhangyang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注:
Abstract:Large language models (LLMs) can be adapted either through numerical updates that alter model parameters or symbolic manipulations that work on discrete prompts or logical constraints. While numerical fine-tuning excels at injecting new factual knowledge, symbolic updates offer flexible control of style and alignment without retraining. We introduce a neurosymbolic LoRA framework that dynamically combines these two complementary strategies. Specifically, we present a unified monitoring signal and a reward-based classifier to decide when to employ LoRA for deeper factual reconstruction and when to apply TextGrad for token-level edits. Our approach remains memory-efficient by offloading the symbolic transformations to an external LLM only when needed. Additionally, the refined prompts produced during symbolic editing serve as high-quality, reusable training data, an important benefit in data-scarce domains like mathematical reasoning. Extensive experiments across multiple LLM backbones show that neurosymbolic LoRA consistently outperforms purely numerical or purely symbolic baselines, demonstrating superior adaptability and improved performance. Our findings highlight the value of interleaving numerical and symbolic updates to unlock a new level of versatility in language model fine-tuning.
zh
[AI-110] Logic-Guided Multistage Inference for Explainable Multidefendant Judgment Prediction
【速读】:该论文旨在解决多被告案件中责任分配复杂、司法表述模糊导致生成式 AI(Generative AI)难以准确识别各被告角色与 culpability(责任程度)的问题。其解决方案的关键在于提出一种带掩码的多阶段推理(Masked Multistage Inference, MMSI)框架,通过引入面向角色的掩码机制(oriented masking mechanism)明确区分主犯与从犯,并结合对比数据构建策略提升模型对责任差异的敏感性;同时将预测的罪责标签通过广播机制融入回归模型,融合犯罪描述与裁判观点,从而在保持法律可解释性的前提下显著提升角色导向的责任判别准确性。
链接: https://arxiv.org/abs/2601.12688
作者: Xu Zhang,Qinghua Wang,Mengyang Zhao,Fang Wang,Cunquan Qu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Crime disrupts societal stability, making law essential for balance. In multidefendant cases, assigning responsibility is complex and challenges fairness, requiring precise role differentiation. However, judicial phrasing often obscures the roles of the defendants, hindering effective AI-driven analyses. To address this issue, we incorporate sentencing logic into a pretrained Transformer encoder framework to enhance the intelligent assistance in multidefendant cases while ensuring legal interpretability. Within this framework an oriented masking mechanism clarifies roles and a comparative data construction strategy improves the model’s sensitivity to culpability distinctions between principals and accomplices. Predicted guilt labels are further incorporated into a regression model through broadcasting, consolidating crime descriptions and court views. Our proposed masked multistage inference (MMSI) framework, evaluated on the custom IMLJP dataset for intentional injury cases, achieves significant accuracy improvements, outperforming baselines in role-based culpability differentiation. This work offers a robust solution for enhancing intelligent judicial systems, with publicly code available.
zh
[AI-111] Empowering All-in-Loop Health Management of Spacecraft Power System in the Mega-Constellation Era via Human-AI Collaboration
【速读】:该论文旨在解决卫星 mega-constellations(SMC)时代下航天器电源系统(Spacecraft Power Systems, SPS)健康管理(Health Management, HM)面临的规模化与复杂性挑战,尤其是当SPS数量从数十个跃升至数千个时,传统HM方法难以满足高效、智能、可解释的全链路管理需求。解决方案的关键在于提出“对齐底层能力”(Aligning Underlying Capabilities, AUC)原则,并开发了一个开源的人机协同框架 SpaceHMchat,该框架实现了从工况识别、异常检测、故障定位到维护决策的全流程闭环健康管理(All-in-Loop Health Management, AIL HM),通过人机协作(Human-AI Collaboration, HAIC)机制提升任务完成率、自适应学习能力、人员结构优化及知识共享效率,同时依托硬件级故障注入实验平台和首个面向SPS的AIL HM数据集,验证了其在23项量化指标上的卓越性能,包括逻辑推理准确率100%、异常检测成功率>99%、故障定位精度>90%,显著提升了空间电源系统的智能化运维水平与可解释性。
链接: https://arxiv.org/abs/2601.12667
作者: Yi Di,Zhibin Zhao,Fujin Wang,Xue Liu,Jiafeng Tang,Jiaxin Ren,Zhi Zhai,Xuefeng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:It is foreseeable that the number of spacecraft will increase exponentially, ushering in an era dominated by satellite mega-constellations (SMC). This necessitates a focus on energy in space: spacecraft power systems (SPS), especially their health management (HM), given their role in power supply and high failure rates. Providing health management for dozens of SPS and for thousands of SPS represents two fundamentally different paradigms. Therefore, to adapt the health management in the SMC era, this work proposes a principle of aligning underlying capabilities (AUC principle) and develops SpaceHMchat, an open-source Human-AI collaboration (HAIC) framework for all-in-loop health management (AIL HM). SpaceHMchat serves across the entire loop of work condition recognition, anomaly detection, fault localization, and maintenance decision making, achieving goals such as conversational task completion, adaptive human-in-the-loop learning, personnel structure optimization, knowledge sharing, efficiency enhancement, as well as transparent reasoning and improved interpretability. Meanwhile, to validate this exploration, a hardware-realistic fault injection experimental platform is established, and its simulation model is built and open-sourced, both fully replicating the real SPS. The corresponding experimental results demonstrate that SpaceHMchat achieves excellent performance across 23 quantitative metrics, such as 100% conclusion accuracy in logical reasoning of work condition recognition, over 99% success rate in anomaly detection tool invocation, over 90% precision in fault localization, and knowledge base search time under 3 minutes in maintenance decision-making. Another contribution of this work is the release of the first-ever AIL HM dataset of SPS. This dataset contains four sub-datasets, involving 4 types of AIL HM sub-tasks, 17 types of faults, and over 700,000 timestamps.
zh
[AI-112] MedConsultBench: A Full-Cycle Fine-Grained Process-Aware Benchmark for Medical Consultation Agents
【速读】:该论文旨在解决当前医疗咨询代理评估中过度关注结果导向任务、忽视端到端流程完整性与临床安全性的问题,尤其针对现有交互式基准在动态场景下仍存在碎片化和粗粒度缺陷,无法准确捕捉专业问诊所需的结构化推理逻辑与诊断严谨性。其解决方案的关键在于提出MedConsultBench框架,通过引入原子信息单元(Atomic Information Units, AIUs)实现对子对话轮次级别临床信息获取的精确追踪,并基于22个细粒度指标量化信息采集效率与诊疗合规性;同时强调药物处方合理性与后续随访问答中的约束尊重型计划修正能力,从而系统性地评估生成式AI在真实临床场景下的全流程表现,揭示了高诊断准确率下潜在的信息收集效率不足与用药安全风险,为医疗人工智能向临床实践靠拢提供了严谨的评测基础。
链接: https://arxiv.org/abs/2601.12661
作者: Chuhan Qiao,Jianghua Huang,Daxing Zhao,Ziding Liu,Yanjun Shen,Bing Cheng,Wei Lin,Kai Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current evaluations of medical consultation agents often prioritize outcome-oriented tasks, frequently overlooking the end-to-end process integrity and clinical safety essential for real-world practice. While recent interactive benchmarks have introduced dynamic scenarios, they often remain fragmented and coarse-grained, failing to capture the structured inquiry logic and diagnostic rigor required in professional consultations. To bridge this gap, we propose MedConsultBench, a comprehensive framework designed to evaluate the complete online consultation cycle by covering the entire clinical workflow from history taking and diagnosis to treatment planning and follow-up Q\A. Our methodology introduces Atomic Information Units (AIUs) to track clinical information acquisition at a sub-turn level, enabling precise monitoring of how key facts are elicited through 22 fine-grained metrics. By addressing the underspecification and ambiguity inherent in online consultations, the benchmark evaluates uncertainty-aware yet concise inquiry while emphasizing medication regimen compatibility and the ability to handle realistic post-prescription follow-up Q\A via constraint-respecting plan revisions. Systematic evaluation of 19 large language models reveals that high diagnostic accuracy often masks significant deficiencies in information-gathering efficiency and medication safety. These results underscore a critical gap between theoretical medical knowledge and clinical practice ability, establishing MedConsultBench as a rigorous foundation for aligning medical AI with the nuanced requirements of real-world clinical care.
zh
[AI-113] Explanation Multiplicity in SHAP: Characterization and Assessment
【速读】:该论文旨在解决生成式 AI (Generative AI) 中特征归因解释(feature-attribution explanations)的“解释多重性”(explanation multiplicity)问题,即在输入、任务和模型固定的情况下,同一决策可能产生多个内部有效但实质不同的解释。其解决方案的关键在于提出了一种系统性方法来表征解释多重性,并区分由模型训练/选择引起的变异与解释管道中固有随机性所导致的变异;同时通过构建合理的零假设基准值,揭示了基于幅度的距离度量可能掩盖实际的特征重要性排序变化,从而强调需采用与解释用途相匹配的评估指标。
链接: https://arxiv.org/abs/2601.12654
作者: Hyunseung Hwang,Seungeun Lee,Lucas Rosenblatt,Julia Stoyanovich,Steven Euijong Whang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Post-hoc explanations are widely used to justify, contest, and audit automated decisions in high-stakes domains. SHAP, in particular, is often treated as a reliable account of which features drove an individual prediction. Yet SHAP explanations can vary substantially across repeated runs even when the input, task, and trained model are held fixed. We term this phenomenon explanation multiplicity: multiple internally valid but substantively different explanations for the same decision. We present a methodology to characterize multiplicity in feature-attribution explanations and to disentangle sources due to model training/selection from stochasticity intrinsic to the explanation pipeline. We further show that apparent stability depends on the metric: magnitude-based distances can remain near zero while rank-based measures reveal substantial churn in the identity and ordering of top features. To contextualize observed disagreement, we derive randomized baseline values under plausible null models. Across datasets, model classes, and confidence regimes, we find explanation multiplicity is pervasive and persists even for high-confidence predictions, highlighting the need for metrics and baselines that match the intended use of explanations.
zh
[AI-114] Unbounded Harms Bounded Law: Liability in the Age of Borderless AI
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)治理中缺乏有效事后责任机制的问题,特别是针对跨境AI风险造成的损害,在现有以领土为基础的法律责任体系下难以有效追责与补偿的困境。其核心解决方案的关键在于借鉴疫苗伤害赔偿、系统性金融风险治理、核能商业责任及国际环境法等高风险跨国领域的法律框架,提炼出可迁移的法律设计原则,如严格责任(strict liability)、风险池化(risk pooling)、集体风险分担(collective risk-sharing)和责任传导(liability channelling),并在此基础上构建一个具有全球适用性的AI问责与补偿架构,以应对由AI全球供应链和跨域部署所加剧的结构性风险。
链接: https://arxiv.org/abs/2601.12646
作者: Ha-Chi Tran
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid proliferation of artificial intelligence (AI) has exposed significant deficiencies in risk governance. While ex-ante harm identification and prevention have advanced, Responsible AI scholarship remains underdeveloped in addressing ex-post liability. Core legal questions regarding liability allocation, responsibility attribution, and remedial effectiveness remain insufficiently theorized and institutionalized, particularly for transboundary harms and risks that transcend national jurisdictions. Drawing on contemporary AI risk analyses, we argue that such harms are structurally embedded in global AI supply chains and are likely to escalate in frequency and severity due to cross-border deployment, data infrastructures, and uneven national oversight capacities. Consequently, territorially bounded liability regimes are increasingly inadequate. Using a comparative and interdisciplinary approach, this paper examines compensation and liability frameworks from high-risk transnational domains - including vaccine injury schemes, systemic financial risk governance, commercial nuclear liability, and international environmental regimes - to distill transferable legal design principles such as strict liability, risk pooling, collective risk-sharing, and liability channelling, while highlighting potential structural constraints on their application to AI-related harms. Situated within an international order shaped more by AI arms race dynamics than cooperative governance, the paper outlines the contours of a global AI accountability and compensation architecture, emphasizing the tension between geopolitical rivalry and the collective action required to govern transboundary AI risks effectively.
zh
[AI-115] STEP-LLM : Generating CAD STEP Models from Natural Language with Large Language Models DATE
【速读】:该论文旨在解决非专家用户难以将直观的设计意图转化为可制造的CAD模型的问题,尤其针对现有基于大语言模型(Large Language Models, LLMs)的文本到CAD方法多依赖特定内核、缺乏制造兼容性的局限性。其关键解决方案在于:构建一个包含约4万对STEP-描述(STEP-caption)的数据集,并提出适用于STEP文件图结构特性的预处理方法,包括基于深度优先搜索(Depth-First Search, DFS)的重序列化以线性化跨引用关系并保持局部性,以及链式思维(Chain-of-Thought, CoT)风格的结构标注以增强全局一致性;同时引入检索增强生成(Retrieval-Augmented Generation, RAG)模块以监督微调阶段引入相关示例进行语义对齐,并通过基于Chamfer Distance的几何奖励函数进行强化学习(Reinforcement Learning, RL)优化生成质量。实验表明,该框架在几何保真度上显著优于Text2CAD基线,验证了LLM驱动STEP模型生成的可行性与潜力。
链接: https://arxiv.org/abs/2601.12641
作者: Xiangyu Shi,Junyang Ding,Xu Zhao,Sinong Zhan,Payal Mohapatra,Daniel Quispe,Kojo Welbeck,Jian Cao,Wei Chen,Ping Guo,Qi Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the Design, Automation Test in Europe Conference (DATE) 2026
Abstract:Computer-aided design (CAD) is vital to modern manufacturing, yet model creation remains labor-intensive and expertise-heavy. To enable non-experts to translate intuitive design intent into manufacturable artifacts, recent large language models-based text-to-CAD efforts focus on command sequences or script-based formats like CadQuery. However, these formats are kernel-dependent and lack universality for manufacturing. In contrast, the Standard for the Exchange of Product Data (STEP, ISO 10303) file is a widely adopted, neutral boundary representation (B-rep) format directly compatible with manufacturing, but its graph-structured, cross-referenced nature poses unique challenges for auto-regressive LLMs. To address this, we curate a dataset of ~40K STEP-caption pairs and introduce novel preprocessing tailored for the graph-structured format of STEP, including a depth-first search-based reserialization that linearizes cross-references while preserving locality and chain-of-thought(CoT)-style structural annotations that guide global coherence. We integrate retrieval-augmented generation to ground predictions in relevant examples for supervised fine-tuning, and refine generation quality through reinforcement learning with a specific Chamfer Distance-based geometric reward. Experiments demonstrate consistent gains of our STEP-LLM in geometric fidelity over the Text2CAD baseline, with improvements arising from multiple stages of our framework: the RAG module substantially enhances completeness and renderability, the DFS-based reserialization strengthens overall accuracy, and the RL further reduces geometric discrepancy. Both metrics and visual comparisons confirm that STEP-LLM generates shapes with higher fidelity than Text2CAD. These results show the feasibility of LLM-driven STEP model generation from natural language, showing its potential to democratize CAD design for manufacturing.
zh
[AI-116] opology-Aware Multiscale Mixture of Experts for Efficient Molecular Property Prediction
【速读】:该论文旨在解决当前3D分子图神经网络在建模分子相互作用时存在的局限性问题,即依赖全局固定的邻域启发式策略(如距离截断和最大邻居数限制),导致无法灵活适应不同几何尺度下的非共价相互作用、立体化学效应及远距离力,从而造成交互建模僵化且数据无关。其解决方案的关键在于提出多尺度交互混合专家模型(Multiscale Interaction Mixture of Experts, MI-MoE),通过三个核心创新实现:(1) 引入距离截断专家集成机制,显式捕捉短程、中程与远程相互作用而无需固定单一截断阈值;(2) 设计基于拓扑过滤的门控编码器,利用持久同调特征(persistent homology features)作为路由信号,表征连通性随半径变化的演化过程;(3) 将MI-MoE设计为可插拔模块,在多种强健的3D分子骨干模型上均显著提升分子与聚合物性质预测性能,涵盖回归与分类任务,验证了拓扑感知的多尺度路由机制是3D分子图学习的有效原则。
链接: https://arxiv.org/abs/2601.12637
作者: Long D. Nguyen,Kelin Xia,Binh P. Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Many molecular properties depend on 3D geometry, where non-covalent interactions, stereochemical effects, and medium- to long-range forces are determined by spatial distances and angles that cannot be uniquely captured by a 2D bond graph. Yet most 3D molecular graph neural networks still rely on globally fixed neighborhood heuristics, typically defined by distance cutoffs and maximum neighbor limits, to define local message-passing neighborhoods, leading to rigid, data-agnostic interaction budgets. We propose Multiscale Interaction Mixture of Experts (MI-MoE) to adapt interaction modeling across geometric regimes. Our contributions are threefold: (1) we introduce a distance-cutoff expert ensemble that explicitly captures short-, mid-, and long-range interactions without committing to a single cutoff; (2) we design a topological gating encoder that routes inputs to experts using filtration-based descriptors, including persistent homology features, summarizing how connectivity evolves across radii; and (3) we show that MI-MoE is a plug-in module that consistently improves multiple strong 3D molecular backbones across diverse molecular and polymer property prediction benchmark datasets, covering both regression and classification tasks. These results highlight topology-aware multiscale routing as an effective principle for 3D molecular graph learning.
zh
[AI-117] Creating Disability Story Videos with Generative AI: Motivation Expression and Sharing
【速读】:该论文旨在解决生成式 AI (Generative AI) 在支持残疾人(People with Disabilities, PwDs)进行残疾叙事创作时所面临的双重挑战:一方面,GenAI 可降低媒体制作门槛并激发 PwDs 的创造力;另一方面,其潜在偏见与不完美性可能阻碍其在个人表达中的有效应用。解决方案的关键在于提出一个“重大描绘框架”(momentous depiction framework),该框架识别出 GenAI 支持残疾叙事的四个核心能力维度——不可捕捉的描绘(non-capturable depiction)、身份隐藏与表征(identity concealment and representation)、情境真实性和一致性(contextual realism and consistency)、情感表达(emotional articulation),并据此为 GenAI 在故事完成、媒介格式及纠错机制方面的设计提供具体指导,从而优化其对残疾群体叙事需求的支持能力。
链接: https://arxiv.org/abs/2601.12617
作者: Shuo Niu,Dylan Clements,Hyungsin Kim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI (GenAI) is both promising and challenging in supporting people with disabilities (PwDs) in creating stories about disability. GenAI can reduce barriers to media production and inspire the creativity of PwDs, but it may also introduce biases and imperfections that hinder its adoption for personal expression. In this research, we examine how nine PwD from a disability advocacy group used GenAI to create videos sharing their disability experiences. Grounded in digital storytelling theory, we explore the motivations, expression, and sharing of PwD-created GenAI story videos. We conclude with a framework of momentous depiction, which highlights four core affordances of GenAI that either facilitate or require improvements to better support disability storytelling: non-capturable depiction, identity concealment and representation, contextual realism and consistency, and emotional articulation. Based on this framework, we further discuss design implications for GenAI in relation to story completion, media formats, and corrective mechanisms.
zh
[AI-118] Do MLLM s See What We See? Analyzing Visualization Literacy Barriers in AI Systems
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在解读可视化图表时频繁失败的问题,特别是缺乏对失败原因的系统性理解。解决方案的关键在于提出首个以“障碍”为中心的分析框架,通过重构可视化素养评估测试(reVLAT)基准并使用合成数据,对四个先进模型的309个错误响应进行开放式编码,从而识别出MLLM在处理可视化时的核心障碍类型,包括两类机器特有障碍,这些障碍扩展了以往基于人类参与者的可视化素养理论框架。该方法为未来可靠AI驱动的可视化辅助工具的设计与评估提供了实证基础。
链接: https://arxiv.org/abs/2601.12585
作者: Mengli(Dawn)Duan,Yuhe(Sissi)Jiang,Matthew Varona,Carolina Nobre
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:Multimodal Large Language Models (MLLMs) are increasingly used to interpret visualizations, yet little is known about why they fail. We present the first systematic analysis of barriers to visualization literacy in MLLMs. Using the regenerated Visualization Literacy Assessment Test (reVLAT) benchmark with synthetic data, we open-coded 309 erroneous responses from four state-of-the-art models with a barrier-centric strategy adapted from human visualization literacy research. Our analysis yields a taxonomy of MLLM failures, revealing two machine-specific barriers that extend prior human-participation frameworks. Results show that models perform well on simple charts but struggle with color-intensive, segment-based visualizations, often failing to form consistent comparative reasoning. Our findings inform future evaluation and design of reliable AI-driven visualization assistants.
zh
[AI-119] Agent ic Artificial Intelligence (AI): Architectures Taxonomies and Evaluation of Large Language Model Agents
【速读】:该论文旨在解决当前Agentic AI(代理型人工智能)系统架构设计多样化且缺乏统一分类标准的问题,使得研究者和开发者难以清晰理解与比较不同代理系统的组成与能力。其解决方案的关键在于提出一个统一的分类框架,将代理系统分解为六个核心模块:感知(Perception)、大脑(Brain)、规划(Planning)、行动(Action)、工具使用(Tool Use)和协作(Collaboration),并以此视角梳理从线性推理到推理时原生建模、从固定API调用到开放标准(如Model Context Protocol, MCP)的演进路径,同时对代理运行环境(如数字操作系统、具身机器人等)及评估方法进行系统归纳,从而为构建更鲁棒、可靠的自主智能体提供结构化指导。
链接: https://arxiv.org/abs/2601.12560
作者: Arunkumar V,Gangadharan G.R.,Rajkumar Buyya
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 28 pages, 4 figures, 5 tables
Abstract:Artificial Intelligence is moving from models that only generate text to Agentic AI, where systems behave as autonomous entities that can perceive, reason, plan, and act. Large Language Models (LLMs) are no longer used only as passive knowledge engines but as cognitive controllers that combine memory, tool use, and feedback from their environment to pursue extended goals. This shift already supports the automation of complex workflows in software engineering, scientific discovery, and web navigation, yet the variety of emerging designs, from simple single loop agents to hierarchical multi agent systems, makes the landscape hard to navigate. In this paper, we investigate architectures and propose a unified taxonomy that breaks agents into Perception, Brain, Planning, Action, Tool Use, and Collaboration. We use this lens to describe the move from linear reasoning procedures to native inference time reasoning models, and the transition from fixed API calls to open standards like the Model Context Protocol (MCP) and Native Computer Use. We also group the environments in which these agents operate, including digital operating systems, embodied robotics, and other specialized domains, and we review current evaluation practices. Finally, we highlight open challenges, such as hallucination in action, infinite loops, and prompt injection, and outline future research directions toward more robust and reliable autonomous systems.
zh
[AI-120] How Clinicians Think and What AI Can Learn From It
【速读】:该论文试图解决当前临床人工智能(AI)系统多作为预测引擎运行,仅输出标签或风险评分,而忽视了真实临床推理本质上是一个在不确定性下受时间约束的序贯控制问题这一核心矛盾。解决方案的关键在于:将临床决策建模为基于序次性、非补偿性的判断过程,而非依赖绝对效用最大化的优化方法;具体而言,应采用稳健的序次规则(如ε-占优、极大极小策略)来选择行动,同时利用复杂模型进行信念和轨迹推断,将启发式规则视为低维特例,并以“选择性复杂度”方式部署AI——即主要在决策脆弱且信息具有正期望影响时介入,用于破局而非主导决策。
链接: https://arxiv.org/abs/2601.12547
作者: Dipayan Sengupta,Saumya Panda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages
Abstract:Most clinical AI systems operate as prediction engines – producing labels or risk scores – yet real clinical reasoning is a time-bounded, sequential control problem under uncertainty. Clinicians interleave information gathering with irreversible actions, guided by regret, constraints and patient values. We argue that the dominant computational substrate of clinician reasoning is not cardinal optimization but ordinal, non-compensatory decision-making: Clinicians frequently rely on fast-and-frugal, lexicographic heuristics (e.g., fast-and-frugal trees) that stop early after checking a small, fixed sequence of cues. We provide a normative rationale for why such algorithms are not merely bounded rationality shortcuts, but can be epistemically preferred in medicine. First, many clinical trade-offs are constructed through human judgment and are only weakly measurable on absolute scales; without strong measurement axioms, only orderings are invariant, motivating an ordinal-by-default stance. Second, preference and signal elicitation are structurally crude: The mapping from truth \to perception \to inference \to recorded variables introduces layered noise, leaving a persistent uncertainty floor. When this ‘crudeness’ overwhelms the decision margin, plug-in expected-utility optimization becomes brittle (high flip probability under small perturbations), whereas robust dominance/filtering rules ( \epsilon -dominance, maximin) stabilize this http URL, we outline a clinician-aligned AI blueprint: Use rich models for beliefs and trajectories, but choose actions through robust ordinal rules; treat heuristics as the low-dimensional special case; and deploy AI as ‘selective complexity’ – invoked mainly for tie-breaking when decisions are fragile and information has positive expected impact.
zh
[AI-121] Rethinking the AI Scientist: Interactive Multi-Agent Workflows for Scientific Discovery
【速读】:该论文旨在解决现有科学发现人工智能系统普遍存在 proprietary(专有)和批处理(batch-processing)模式导致研究周期长达数小时、无法支持实时研究人员干预的问题。其解决方案的关键在于提出 Deep Research,一个由多个专业化智能体组成的多代理系统,包括规划、数据分析、文献检索和新颖性检测等模块,并通过持久的世界状态(persistent world state)维持跨迭代研究周期的上下文连续性,从而实现以分钟级周转时间支持交互式科学研究,同时提供半自主与全自主两种运行模式以适应不同工作流程需求。
链接: https://arxiv.org/abs/2601.12542
作者: Lukas Weidener,Marko Brkić,Mihailo Jovanović,Ritvik Singh,Chiara Baccin,Emre Ulgac,Alex Dobrin,Aakaash Meduri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence systems for scientific discovery have demonstrated remarkable potential, yet existing approaches remain largely proprietary and operate in batch-processing modes requiring hours per research cycle, precluding real-time researcher guidance. This paper introduces Deep Research, a multi-agent system enabling interactive scientific investigation with turnaround times measured in minutes. The architecture comprises specialized agents for planning, data analysis, literature search, and novelty detection, unified through a persistent world state that maintains context across iterative research cycles. Two operational modes support different workflows: semi-autonomous mode with selective human checkpoints, and fully autonomous mode for extended investigations. Evaluation on the BixBench computational biology benchmark demonstrated state-of-the-art performance, achieving 48.8% accuracy on open response and 64.5% on multiple-choice evaluation, exceeding existing baselines by 14 to 26 percentage points. Analysis of architectural constraints, including open access literature limitations and challenges inherent to automated novelty assessment, informs practical deployment considerations for AI-assisted scientific workflows.
zh
[AI-122] Improved Bug Localization with AI Agents Leverag ing Hypothesis and Dynamic Cognition
【速读】:该论文旨在解决软件缺陷定位(bug localization)中传统方法因孤立分析代码组件而忽略其相互依赖关系,以及基于大语言模型(Large Language Models, LLMs)和代理式AI(agentic AI)的技术在代码探索过程中缺乏因果推理能力、难以有效管理上下文长度的问题。解决方案的关键在于提出一种新型代理式技术——CogniGent,其通过多个具备因果推理能力的AI代理,结合调用图(call-graph)驱动的根因分析与上下文工程(context engineering),模拟开发者动态认知调试实践,执行假设检验以支持更精准的缺陷定位。该方法显著提升了在文档级和方法级上的平均精度均值(MAP)和召回率均值(MRR),优于六种主流基线方法,验证了其在复杂代码依赖建模与高效上下文利用方面的优势。
链接: https://arxiv.org/abs/2601.12522
作者: Asif Mohammed Samir,Mohammad Masudur Rahman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 13 pages, 7 tables, 5 figures
Abstract:Software bugs cost technology providers (e.g., ATT) billions annually and cause developers to spend roughly 50% of their time on bug resolution. Traditional methods for bug localization often analyze the suspiciousness of code components (e.g., methods, documents) in isolation, overlooking their connections with other components in the codebase. Recent advances in Large Language Models (LLMs) and agentic AI techniques have shown strong potential for code understanding, but still lack causal reasoning during code exploration and struggle to manage growing context effectively, limiting their capability. In this paper, we present a novel agentic technique for bug localization – CogniGent – that overcomes the limitations above by leveraging multiple AI agents capable of causal reasoning, call-graph-based root cause analysis and context engineering. It emulates developers-inspired debugging practices (a.k.a., dynamic cognitive debugging) and conducts hypothesis testing to support bug localization. We evaluate CogniGent on a curated dataset of 591 bug reports using three widely adopted performance metrics and compare it against six established baselines from the literature. Experimental results show that our technique consistently outperformed existing traditional and LLM-based techniques, achieving MAP improvements of 23.33-38.57% at the document and method levels. Similar gains were observed in MRR, with increases of 25.14-53.74% at both granularity levels. Statistical significance tests also confirm the superiority of our technique. By addressing the reasoning, dependency, and context limitations, CogniGent advances the state of bug localization, bridging human-like cognition with agentic automation for improved performance.
zh
[AI-123] Cooperative Multi-agent RL with Communication Constraints
【速读】:该论文旨在解决去中心化多智能体强化学习(Decentralized Multi-Agent Reinforcement Learning, MARL)中因通信受限导致的梯度估计不稳定问题。在通信成本高的场景下,智能体只能依赖过时的信息进行策略更新,传统重要性采样方法因基线策略(base policy)与当前策略差距过大而迅速失效。解决方案的关键在于提出“基线策略预测”(base policy prediction)技术:利用历史梯度信息预测多个基线策略,并收集对应样本序列,从而显著缩小基线策略与当前策略之间的差异。这一机制使算法能在极少的通信轮次(仅需 O(ε−3/4))内收敛至 ε-纳什均衡,且样本复杂度不随联合动作空间大小呈指数增长,优于现有最优结果。
链接: https://arxiv.org/abs/2601.12518
作者: Nuoya Xiong,Aarti Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 33 pages
Abstract:Cooperative MARL often assumes frequent access to global information in a data buffer, such as team rewards or other agents’ actions, which is typically unrealistic in decentralized MARL systems due to high communication costs. When communication is limited, agents must rely on outdated information to estimate gradients and update their policies. A common approach to handle missing data is called importance sampling, in which we reweigh old data from a base policy to estimate gradients for the current policy. However, it quickly becomes unstable when the communication is limited (i.e. missing data probability is high), so that the base policy in importance sampling is outdated. To address this issue, we propose a technique called base policy prediction, which utilizes old gradients to predict the policy update and collect samples for a sequence of base policies, which reduces the gap between the base policy and the current policy. This approach enables effective learning with significantly fewer communication rounds, since the samples of predicted base policies could be collected within one communication round. Theoretically, we show that our algorithm converges to an \varepsilon -Nash equilibrium in potential games with only O(\varepsilon^-3/4) communication rounds and O(poly(\max_i |A_i|)\varepsilon^-11/4) samples, improving existing state-of-the-art results in communication cost, as well as sample complexity without the exponential dependence on the joint action space size. We also extend these results to general Markov Cooperative Games to find an agent-wise local maximum. Empirically, we test the base policy prediction algorithm in both simulated games and MAPPO for complex environments.
zh
[AI-124] Failure Modes in Multi-Hop QA: The Weakest Link Law and the Recognition Bottleneck
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多跳推理(multi-hop reasoning)任务中因固有位置偏差(position bias)而导致的性能瓶颈问题,尤其是难以有效定位和整合分布在长上下文中的多个证据片段。其解决方案的关键在于提出一种名为“多焦点注意力指令”(Multi-Focus Attention Instruction, MFAI)的语义探针机制,通过显式引导注意力机制聚焦于特定位置,从而区分并诊断是识别失败(recognition failure)还是融合失败(synthesis failure)主导了推理错误。实验表明,多跳推理性能受最不可见证据的影响最大(即“最弱环节定律”),且这种影响由绝对位置决定而非相对距离;同时发现注意力引导存在双重效应:匹配的MFAI可显著提升低可见性位置的信息识别准确率(最高达11.5%),而误导性MFAI虽在真实任务中引发混淆,但在合成任务中可被有效过滤,最终证明采用系统2型思考(System-2 reasoning)的模型能有效定位并整合信息,在噪声和长上下文中表现接近仅使用黄金标签的基线水平。
链接: https://arxiv.org/abs/2601.12499
作者: Meiru Zhang,Zaiqiao Meng,Nigel Collier
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: preprint
Abstract:Despite scaling to massive context windows, Large Language Models (LLMs) struggle with multi-hop reasoning due to inherent position bias, which causes them to overlook information at certain positions. Whether these failures stem from an inability to locate evidence (recognition failure) or integrate it (synthesis failure) is unclear. We introduce Multi-Focus Attention Instruction (MFAI), a semantic probe to disentangle these mechanisms by explicitly steering attention towards selected positions. Across 5 LLMs on two multi-hop QA tasks (MuSiQue and NeoQA), we establish the “Weakest Link Law”: multi-hop reasoning performance collapses to the performance level of the least visible evidence. Crucially, this failure is governed by absolute position rather than the linear distance between facts (performance variance 3% ). We further identify a duality in attention steering: while matched MFAI resolves recognition bottlenecks, improving accuracy by up to 11.5% in low-visibility positions, misleading MFAI triggers confusion in real-world tasks but is successfully filtered in synthetic tasks. Finally, we demonstrate that “thinking” models that utilize System-2 reasoning, effectively locate and integrate the required information, matching gold-only baselines even in noisy, long-context settings.
zh
[AI-125] Patch-Level Tokenization with CNN Encoders and Attention for Improved Transformer Time-Series Forecasting
【速读】:该论文旨在解决基于Transformer的时间序列预测模型在处理多变量时间序列数据时,其性能高度依赖于输入表示的质量和结构这一关键问题。现有方法往往将局部时序特征提取与全局依赖建模混合在同一架构中,导致模型难以有效捕捉短程动态与长程依赖之间的协同关系。解决方案的关键在于提出一个两阶段框架:第一阶段利用卷积神经网络(CNN)对固定长度的时间片段(temporal patches)进行局部特征提取,生成紧凑的片段级标记嵌入(patch-level token embeddings),并借助标记级自注意力机制增强跨片段的交互;第二阶段则使用Transformer编码器建模片段间的全局时序依赖关系,从而实现局部时序表示学习与全局注意力建模的解耦。该设计显著提升了模型的稳定性和预测准确性。
链接: https://arxiv.org/abs/2601.12467
作者: Saurish Nagrath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 3 tables
Abstract:Transformer-based models have shown strong performance in time-series forecasting by leveraging self-attention to model long-range temporal dependencies. However, their effectiveness depends critically on the quality and structure of input representations derived from raw multivariate time-series data. This work proposes a two-stage forecasting framework that explicitly separates local temporal representation learning from global dependency modelling. In the first stage, a convolutional neural network (CNN) operates on fixed-length temporal patches to extract short-range temporal dynamics and non-linear feature interactions, producing compact patch-level token embeddings. Token-level self-attention is subsequently applied during representation learning to refine these embeddings by enabling interactions across temporal patches. In the second stage, a Transformer encoder processes the resulting token sequence to model inter-patch temporal dependencies and generate per-patch forecasts. Experiments conducted on synthetic multivariate time-series data with controlled static and dynamic factors demonstrate that the proposed patch-based tokenization strategy achieves competitive forecasting performance compared to convolutional and patch-based Transformer baselines. The results highlight the importance of structured temporal representations and show that decoupling local temporal encoding from global attention-based modelling yields more effective and stable time-series forecasting.
zh
[AI-126] AgenT RIM: Tool Risk Mitigation for Agent ic AI
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的智能体(AI agent)在调用外部工具时因权限配置不当所引发的安全风险问题,即“工具驱动的代理失衡”(unbalanced tool-driven agency)——表现为代理持有过多权限(过度代理)或未能调用必要工具(代理不足),从而扩大攻击面并降低任务性能。解决方案的关键在于提出AgenTRIM框架,其通过离线与在线两个阶段协同实现:离线阶段重构并验证代理的工具接口;在线阶段则在每一步执行中实施最小权限访问控制,结合自适应过滤和状态感知的工具调用验证机制,确保安全性和任务性能的平衡。
链接: https://arxiv.org/abs/2601.12449
作者: Roy Betser,Shamik Bose,Amit Giloni,Chiara Picardi,Sindhu Padakandla,Roman Vainshtein
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:AI agents are autonomous systems that combine LLMs with external tools to solve complex tasks. While such tools extend capability, improper tool permissions introduce security risks such as indirect prompt injection and tool misuse. We characterize these failures as unbalanced tool-driven agency. Agents may retain unnecessary permissions (excessive agency) or fail to invoke required tools (insufficient agency), amplifying the attack surface and reducing performance. We introduce AgenTRIM, a framework for detecting and mitigating tool-driven agency risks without altering an agent’s internal reasoning. AgenTRIM addresses these risks through complementary offline and online phases. Offline, AgenTRIM reconstructs and verifies the agent’s tool interface from code and execution traces. At runtime, it enforces per-step least-privilege tool access through adaptive filtering and status-aware validation of tool calls. Evaluating on the AgentDojo benchmark, AgenTRIM substantially reduces attack success while maintaining high task performance. Additional experiments show robustness to description-based attacks and effective enforcement of explicit safety policies. Together, these results demonstrate that AgenTRIM provides a practical, capability-preserving approach to safer tool use in LLM-based agents.
zh
[AI-127] Large Language Model for OWL Proofs
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成形式化推理证明方面能力不足的问题,特别是如何生成忠实于逻辑、可读性强的解释以说明结论为何成立。其解决方案的关键在于构建一个自动化的数据集生成与评估框架,用于系统性地评估LLMs在OWL本体(OWL ontologies)背景下完成完整推理链的能力,涵盖提取(Extraction)、简化(Simplification)和解释(Explanation)三个连续任务,以及前提逻辑完备性(Logic Completeness)的额外评估。实验表明,逻辑复杂度是影响LLM性能的主要因素,而输入数据中的噪声和不完整性会显著削弱模型表现,从而揭示了当前LLMs在严谨逻辑推理中既具潜力又存在局限。
链接: https://arxiv.org/abs/2601.12444
作者: Hui Yang,Jiaoyan Chen,Uli Sattler
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:The ability of Large Language Models (LLMs) to perform reasoning tasks such as deduction has been widely investigated in recent years. Yet, their capacity to generate proofs-faithful, human-readable explanations of why conclusions follow-remains largely under explored. In this work, we study proof generation in the context of OWL ontologies, which are widely adopted for representing and reasoning over complex knowledge, by developing an automated dataset construction and evaluation framework. Our evaluation encompassing three sequential tasks for complete proving: Extraction, Simplification, and Explanation, as well as an additional task of assessing Logic Completeness of the premise. Through extensive experiments on widely used reasoning LLMs, we achieve important findings including: (1) Some models achieve overall strong results but remain limited on complex cases; (2) Logical complexity, rather than representation format (formal logic language versus natural language), is the dominant factor shaping LLM performance; and (3) Noise and incompleteness in input data substantially diminish LLMs’ performance. Together, these results underscore both the promise of LLMs for explanation with rigorous logics and the gap of supporting resilient reasoning under complex or imperfect conditions. Code and data are available at this https URL.
zh
[AI-128] Orthogonalized Policy Optimization:Decoupling Sampling Geometry from Optimization Geometry in RLHF
【速读】:该论文试图解决当前大语言模型对齐方法(如PPO、DPO和IPO)中存在的数值不稳定性和梯度消失问题,这些问题源于现有方法在采样几何(sampling geometry)与优化几何(optimization geometry)之间隐式耦合,尤其是基于KL散度的惩罚机制会对无界价值信号施加指数级惩罚,导致高置信度场景下梯度饱和。解决方案的关键在于提出正交化策略优化(Orthogonalized Policy Optimization, OPO),其核心是显式解耦采样几何与优化几何:通过alpha加权重要性采样控制梯度主导样本的选择,同时在比值坐标空间中引入卡方散度诱导的二次正则化项,从而获得线性梯度动力学和稳定优化过程,既保持峰值搜索能力,又避免高置信度下的梯度饱和现象。
链接: https://arxiv.org/abs/2601.12415
作者: Wang Zixian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent alignment methods for large language models, including PPO, DPO, and IPO, are often presented as distinct algorithms. In this work, we show that many of these approaches implicitly conflate two fundamental and independent design choices: (i) the sampling geometry, which determines which samples dominate the gradient signal, and (ii) the optimization geometry, which determines how deviations in value are penalized. We formalize this observation by expressing alignment as the minimization of a generalized distance between policy energy and target energy, parameterized by an alpha-divergence-based sampling weight and a Bregman-divergence-based value metric. We demonstrate that the commonly used KL divergence induces an exponential penalty on unbounded value signals, leading to numerical instability and vanishing gradients in high-confidence regimes. To address this issue, we propose Orthogonalized Policy Optimization (OPO), a framework that explicitly decouples sampling geometry from optimization geometry. By combining alpha-weighted importance sampling with a chi-square-induced quadratic regularization in ratio coordinates, OPO yields a simple and well-conditioned objective with linear gradient dynamics. This formulation maintains stable optimization while preserving peak-seeking behavior and avoids gradient saturation even when model confidence is high. Our analysis positions OPO as a unifying perspective on existing alignment methods and provides a principled foundation for robust reasoning-oriented training.
zh
[AI-129] Are LLM s Smarter Than Chimpanzees? An Evaluation on Perspective Taking and Knowledge State Estimation
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在知识状态追踪与意图理解能力上的不足问题,具体聚焦于评估LLM是否具备类似人类的认知能力——即推断他人知识状态并理解其意图。解决方案的关键在于设计两个核心任务:一是检测故事角色通过行为表现出其不应拥有的知识;二是预测角色基于自身知识而非客观事实的下一步行动。实验结果表明,当前主流LLM在这两项任务上表现接近随机水平,显著低于人类表现,凸显出LLM在认知推理层面的局限性,从而呼吁未来研究应更加重视知识估计与意图理解能力的提升。
链接: https://arxiv.org/abs/2601.12410
作者: Dingyi Yang,Junqi Zhao,Xue Li,Ce Li,Boyang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 11 figures
Abstract:Cognitive anthropology suggests that the distinction of human intelligence lies in the ability to infer other individuals’ knowledge states and understand their intentions. In comparison, our closest animal relative, chimpanzees, lack the capacity to do so. With this paper, we aim to evaluate LLM performance in the area of knowledge state tracking and estimation. We design two tasks to test (1) if LLMs can detect when story characters, through their actions, demonstrate knowledge they should not possess, and (2) if LLMs can predict story characters’ next actions based on their own knowledge vs. objective truths they do not know. Results reveal that most current state-of-the-art LLMs achieve near-random performance on both tasks, and are substantially inferior to humans. We argue future LLM research should place more weight on the abilities of knowledge estimation and intention understanding.
zh
[AI-130] Explainable Machine Learning for Pediatric Dental Risk Stratification Using Socio-Demographic Determinants
【速读】:该论文旨在解决儿科口腔疾病风险评估中缺乏透明性与公平性的问题,尤其是在当前多数人工智能(AI)应用依赖图像诊断和黑箱预测模型的情况下,难以在儿童群体中实现伦理可接受的部署。其解决方案的关键在于构建一个可解释的机器学习框架,通过整合年龄、收入贫困比、种族/民族、性别及医疗史等社会人口学决定因素,利用SHapley Additive exPlanations (SHAP) 方法实现全局与个体层面的预测解释,从而提升模型的可理解性、校准性和公平性,支持以预防为导向的群体筛查和资源公平分配,而非直接用于临床诊断决策。
链接: https://arxiv.org/abs/2601.12405
作者: Manasi Kanade,Abhi Thakkar,Gabriela Fernandes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Background: Pediatric dental disease remains one of the most prevalent and inequitable chronic health conditions worldwide. Although strong epidemiological evidence links oral health outcomes to socio-economic and demographic determinants, most artificial intelligence (AI) applications in dentistry rely on image-based diagnosis and black-box prediction models, limiting transparency and ethical applicability in pediatric populations. Objective: This study aimed to develop and evaluate an explainable machine learning framework for pediatric dental risk stratification that prioritizes interpretability, calibration, and ethical deployment over maximal predictive accuracy. Methods: A supervised machine learning model was trained using population-level pediatric data including age, income-to-poverty ratio, race/ethnicity, gender, and medical history. Model performance was assessed using receiver operating characteristic (ROC) analysis and calibration curves. Explainability was achieved using SHapley Additive exPlanations (SHAP) to provide global and individual-level interpretation of predictions. Results: The model achieved modest discrimination (AUC = 0.61) with conservative calibration, underestimating risk at higher probability levels. SHAP analysis identified age and income-to-poverty ratio as the strongest contributors to predicted risk, followed by race/ethnicity and gender. Conclusion: Explainable machine learning enables transparent, prevention-oriented pediatric dental risk stratification and supports population screening and equitable resource allocation rather than diagnostic decision-making. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.12405 [cs.LG] (or arXiv:2601.12405v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.12405 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Gabriela Fernandes [view email] [v1] Sun, 18 Jan 2026 13:40:41 UTC (478 KB) Full-text links: Access Paper: View a PDF of the paper titled Explainable Machine Learning for Pediatric Dental Risk Stratification Using Socio-Demographic Determinants, by Manasi Kanade and 2 other authorsView PDF view license Current browse context: cs.LG prev | next new | recent | 2026-01 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh
[AI-131] Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation
【速读】:该论文旨在解决生成式模型在强化学习(Reinforcement Learning, RL)微调过程中面临的“多样性坍缩”(curse of diversity collapse)问题,即策略优化过程倾向于收敛到狄拉克δ分布,导致生成结果缺乏多样性,难以满足需要多样化候选样本的应用需求。解决方案的关键在于提出DRIFT框架,通过三个核心机制系统性地激励输出多样性:一是采样奖励集中子集以过滤异常奖励值,防止过早坍缩;二是引入随机扰动提示(stochastic prompting)扩展条件空间;三是采用基于势能的奖励塑形机制优化组内多样性。该方法实现了任务对齐与生成多样性的帕累托最优平衡,在同等对齐水平下多样性提升9.08%–43.46%,或在同等多样性水平下对齐度提升59.65%–65.86%。
链接: https://arxiv.org/abs/2601.12401
作者: Jinmei Liu,Haoru Li,Zhenhong Sun,Chaofeng Chen,Yatao Bian,Bo Wang,Daoyi Dong,Chunlin Chen,Zhi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning large-scale generative models, such as diffusion and flow models, to align with complex human preferences and user-specified tasks. A fundamental limitation remains \textitthe curse of diversity collapse, where the objective formulation and optimization landscape inherently collapse the policy to a Dirac delta distribution. To address this challenge, we propose \textbfDRIFT (\textbfDive\textbfRsity-\textbfIncentivized Reinforcement \textbfFine-\textbfTuning for Versatile Image Generation), an innovative framework that systematically incentivizes output diversity throughout the on-policy fine-tuning process, reconciling strong task alignment with high generation diversity to enhance versatility essential for applications that demand diverse candidate generations. We approach the problem across three representative perspectives: i) \textbfsampling a reward-concentrated subset that filters out reward outliers to prevent premature collapse; ii) \textbfprompting with stochastic variations to expand the conditioning space, and iii) \textbfoptimization of the intra-group diversity with a potential-based reward shaping mechanism. Experimental results show that DRIFT achieves superior Pareto dominance regarding task alignment and generation diversity, yielding a 9.08%!\sim! 43.46% increase in diversity at equivalent alignment levels and a 59.65% !\sim! 65.86% increase in alignment at equivalent levels of diversity.
zh
[AI-132] PsychēChat: An Empathic Framework Focused on Emotion Shift Tracking and Safety Risk Analysis in Psychological Counseling
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在心理辅导场景中未能显式建模求助者情绪变化(emotion shifts)以及缺乏对安全风险主动防控的问题。现有方法往往忽视经典心理学流派中情绪演变的核心作用,且在生成回应时难以兼顾情感洞察与风险控制。解决方案的关键在于提出PsychēChat框架,其核心创新为:一是引入情绪管理模块(Emotion Management Module),实时捕捉求助者的当前情绪及其动态变化;二是设计风险控制模块(Risk Control Module),预测后续反应并识别潜在安全风险;同时提供两种建模范式——代理模式(Agent Mode)通过多智能体协作实现模块化处理,链式思维模式(LLM Mode)则将三阶段流程整合为端到端推理,从而在保持高效性的同时显著提升情绪理解深度与安全性。
链接: https://arxiv.org/abs/2601.12392
作者: Zhentao Xia,Yongqi Fan,Yuxiang Chu,Yichao Yin,Liangliang Chen,Tong Ruan,Weiyan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated notable advancements in psychological counseling. However, existing models generally do not explicitly model seekers’ emotion shifts across counseling sessions, a core focus in classical psychological schools. Moreover, how to align counselor models’ responses with these emotion shifts while proactively mitigating safety risks remains underexplored. To bridge these gaps, we propose PsychēChat, which explicitly integrates emotion shift tracking and safety risk analysis for psychological counseling. Specifically, we employ interactive role-playing to synthesize counselor–seeker dialogues, incorporating two modules: Emotion Management Module, to capture seekers’ current emotions and emotion shifts; and Risk Control Module, to anticipate seekers’ subsequent reactions and identify potential risks. Furthermore, we introduce two modeling paradigms. The Agent Mode structures emotion management, risk control, and counselor responses into a collaborative multi-agent pipeline. The LLM Mode integrates these stages into a unified chain-of-thought for end-to-end inference, balancing efficiency and performance. Extensive experiments, including interactive scoring, dialogue-level evaluation, and human assessment, demonstrate that PsychēChat outperforms existing methods for emotional insight and safety control.
zh
[AI-133] Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents ?
【速读】:该论文旨在解决当前基于大模型的图形用户界面(GUI)代理在Android平台上的安全漏洞问题,其核心在于揭示并利用“视觉原子性”(Visual Atomicity)假设的不成立——即代理在观察屏幕状态与执行动作之间存在时间差,导致UI状态可能已被其他应用修改。解决方案的关键在于提出动作重绑定攻击(Action Rebinding),通过操纵前台进程切换和Android的UI状态保留机制,使一个无权限的恶意应用能够劫持代理的预期操作目标;同时引入意图对齐策略(Intent Alignment Strategy, IAS),引导代理合理化被篡改的UI状态,从而绕过确认对话框等验证机制。该方法无需危险权限且可规避主流恶意软件扫描工具检测,实验证明其在多步攻击链中具有100%成功率。
链接: https://arxiv.org/abs/2601.12349
作者: Yi Qian,Kunwei Qian,Xingbang He,Ligeng Chen,Jikang Zhang,Tiantai Zhang,Haiyang Wei,Linzhang Wang,Hao Wu,Bing Mao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Large multimodal model powered GUI agents are emerging as high-privilege operators on mobile platforms, entrusted with perceiving screen content and injecting inputs. However, their design operates under the implicit assumption of Visual Atomicity: that the UI state remains invariant between observation and action. We demonstrate that this assumption is fundamentally invalid in Android, creating a critical attack surface. We present Action Rebinding, a novel attack that allows a seemingly-benign app with zero dangerous permissions to rebind an agent’s execution. By exploiting the inevitable observation-to-action gap inherent in the agent’s reasoning pipeline, the attacker triggers foreground transitions to rebind the agent’s planned action toward the target app. We weaponize the agent’s task-recovery logic and Android’s UI state preservation to orchestrate programmable, multi-step attack chains. Furthermore, we introduce an Intent Alignment Strategy (IAS) that manipulates the agent’s reasoning process to rationalize UI states, enabling it to bypass verification gates (e.g., confirmation dialogs) that would otherwise be rejected. We evaluate Action Rebinding Attacks on six widely-used Android GUI agents across 15 tasks. Our results demonstrate a 100% success rate for atomic action rebinding and the ability to reliably orchestrate multi-step attack chains. With IAS, the success rate in bypassing verification gates increases (from 0% to up to 100%). Notably, the attacker application requires no sensitive permissions and contains no privileged API calls, achieving a 0% detection rate across malware scanners (e.g., VirusTotal). Our findings reveal a fundamental architectural flaw in current agent-OS integration and provide critical insights for the secure design of future agent systems. To access experimental logs and demonstration videos, please contact yi_qian@smail.this http URL. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE) Cite as: arXiv:2601.12349 [cs.CR] (or arXiv:2601.12349v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2601.12349 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-134] me-Continuous Modeling for Temporal Affective Pattern Recognition in LLM s
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在对话建模中缺乏对真实世界情感动态的时序捕捉能力及可解释性的问题。其解决方案的关键在于构建一个数据集和概念框架,利用物理信息神经网络(Physics-Informed Neural Networks, PINNs)实现LLMs在时间维度上的情境学习(in-context learning),从而模拟人类情感随时间演变的规律,提升对话系统的可解释性和情感响应的真实性。
链接: https://arxiv.org/abs/2601.12341
作者: Rezky Kam,Coddy N. Siswanto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注:
Abstract:This paper introduces a dataset and conceptual framework for LLMs to mimic real world emotional dynamics through time and in-context learning leveraging physics-informed neural network, opening a possibility for interpretable dialogue modeling.
zh
[AI-135] Actionable Advice from Reviews via Mixture of LoRA Experts: A Two-LLM Pipeline for Issue Extraction and Business Recommendations
【速读】:该论文旨在解决如何将非结构化的客户评论转化为可执行的业务建议的问题,即“评论到行动生成”(review-to-action generation)。其核心挑战在于从海量、复杂的用户反馈中提取关键问题并生成针对性强、可落地的操作性建议。解决方案的关键在于提出一个模块化的双大语言模型(two-LLM)框架:首先由“问题模型”(Issue model)识别评论中的显著问题并分配粗粒度主题,随后“建议模型”(Advice model)基于提取的问题表示生成具体可行的改进措施。为实现高效且专业化的能力适配,作者采用LoRA专家混合策略(mixture of LoRA experts),通过训练多个低秩适配器并在推理阶段使用轻量级门控机制进行token级专家融合,从而在不进行昂贵全参数微调的前提下实现跨问题类型的互补知识整合,显著提升了建议的可操作性与特异性。
链接: https://arxiv.org/abs/2601.12338
作者: Kartikey Singh Bhandari,Manav Ganesh,Yashwant Viswanathan,Archit Agrawal,Dhruv Kumar,Pratik Narang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Customer reviews contain detailed, domain specific signals about service failures and user expectations, but converting this unstructured feedback into actionable business decisions remains difficult. We study review-to-action generation: producing concrete, implementable recommendations grounded in review text. We propose a modular two-LLM framework in which an Issue model extracts salient issues and assigns coarse themes, and an Advice model generates targeted operational fixes conditioned on the extracted issue representation. To enable specialization without expensive full fine-tuning, we adapt the Advice model using a mixture of LoRA experts strategy: multiple low-rank adapters are trained and a lightweight gating mechanism performs token-level expert mixing at inference, combining complementary expertise across issue types. We construct synthetic review-issue-advice triples from Yelp reviews (airlines and restaurants) to supervise training, and evaluate recommendations using an eight dimension operational rubric spanning actionability, specificity, feasibility, expected impact, novelty, non-redundancy, bias, and clarity. Across both domains, our approach consistently outperforms prompting-only and single-adapter baselines, yielding higher actionability and specificity while retaining favorable efficiency-quality trade-offs.
zh
[AI-136] Efficient Privacy-Preserving Retrieval Augmented Generation with Distance-Preserving Encryption
【速读】:该论文旨在解决在非可信云环境中部署检索增强生成(Retrieval-Augmented Generation, RAG)系统时面临的隐私泄露问题,尤其是嵌入向量(embedding)可能遭受的向量到文本重建攻击、结构信息泄露以及查询分析风险。现有方法多依赖部分同态加密(partially homomorphic encryption),导致计算开销过高。论文提出了一种高效的隐私保护RAG框架ppRAG,其核心创新在于设计了一种条件近似距离比较保持对称加密机制(Conditional Approximate Distance-Comparison-Preserving Symmetric Encryption, CAPRISE),该机制在加密嵌入向量的同时,仍允许云端执行查询与数据库嵌入之间的相似度计算,仅保留查询与各数据库嵌入间的相对距离排序,不暴露数据库内部嵌入间的实际距离关系,从而兼顾安全性与效率;此外,通过在加密前对查询嵌入添加差分隐私(Differential Privacy, DP)噪声,有效防止云端从查询模式中推断敏感信息,显著提升了整体隐私保障能力。
链接: https://arxiv.org/abs/2601.12331
作者: Huanyi Ye,Jiale Guo,Ziyao Liu,Kwok-Yan Lam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:RAG has emerged as a key technique for enhancing response quality of LLMs without high computational cost. In traditional architectures, RAG services are provided by a single entity that hosts the dataset within a trusted local environment. However, individuals or small organizations often lack the resources to maintain data storage servers, leading them to rely on outsourced cloud storage. This dependence on untrusted third-party services introduces privacy risks. Embedding-based retrieval mechanisms, commonly used in RAG systems, are vulnerable to privacy leakage such as vector-to-text reconstruction attacks and structural leakage via vector analysis. Several privacy-preserving RAG techniques have been proposed but most existing approaches rely on partially homomorphic encryption, which incurs substantial computational overhead. To address these challenges, we propose an efficient privacy-preserving RAG framework (ppRAG) tailored for untrusted cloud environments that defends against vector-to-text attack, vector analysis, and query analysis. We propose Conditional Approximate Distance-Comparison-Preserving Symmetric Encryption (CAPRISE) that encrypts embeddings while still allowing the cloud to compute similarity between an encrypted query and the encrypted database embeddings. CAPRISE preserves only the relative distance ordering between the encrypted query and each encrypted database embedding, without exposing inter-database distances, thereby enhancing both privacy and efficiency. To mitigate query analysis, we introduce DP by perturbing the query embedding prior to encryption, preventing the cloud from inferring sensitive patterns. Experimental results show that ppRAG achieves efficient processing throughput, high retrieval accuracy, strong privacy guarantees, making it a practical solution for resource-constrained users seeking secure cloud-augmented LLMs.
zh
[AI-137] IceWatch: Forecasting Glacial Lake Outburst Floods (GLOFs) using Multimodal Deep Learning
【速读】:该论文旨在解决高山区冰湖溃决洪水(Glacial Lake Outburst Floods, GLOFs)监测与预测中存在的效率低、依赖人工、易受云层干扰及缺乏现场数据等问题。其核心解决方案是提出IceWatch框架,该框架融合空间与时间维度的多模态深度学习方法:视觉模块RiskFlow利用Sentinel-2多光谱遥感影像通过卷积神经网络(CNN)识别雪、冰和融水的空间分布模式以预测GLOF事件;表格式模块TerraFlow和TempFlow分别基于NASA ITS_LIVE和MODIS地表温度(LST)数据集训练得到的时序模型,模拟冰川运动速度和近地表温度变化,从而从物理机制上验证预测结果。二者通过统一预处理与同步策略实现协同交叉验证,显著提升预测可靠性、可解释性,并具备对噪声和缺失数据的鲁棒性,为构建自动化、可扩展的GLOF预警系统提供新范式。
链接: https://arxiv.org/abs/2601.12330
作者: Zuha Fatima,Muhammad Anser Sohaib,Muhammad Talha,Ayesha Kanwal,Sidra Sultana,Nazia Perwaiz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Glacial Lake Outburst Floods (GLOFs) pose a serious threat in high mountain regions. They are hazardous to communities, infrastructure, and ecosystems further downstream. The classical methods of GLOF detection and prediction have so far mainly relied on hydrological modeling, threshold-based lake monitoring, and manual satellite image analysis. These approaches suffer from several drawbacks: slow updates, reliance on manual labor, and losses in accuracy when clouds interfere and/or lack on-site data. To tackle these challenges, we present IceWatch: a novel deep learning framework for GLOF prediction that incorporates both spatial and temporal perspectives. The vision component, RiskFlow, of IceWatch deals with Sentinel-2 multispectral satellite imagery using a CNN-based classifier and predicts GLOF events based on the spatial patterns of snow, ice, and meltwater. Its tabular counterpart confirms this prediction by considering physical dynamics. TerraFlow models glacier velocity from NASA ITS_LIVE time series while TempFlow forecasts near-surface temperature from MODIS LST records; both are trained on long-term observational archives and integrated via harmonized preprocessing and synchronization to enable multimodal, physics-informed GLOF prediction. Both together provide cross-validation, which will improve the reliability and interpretability of GLOF detection. This system ensures strong predictive performance, rapid data processing for real-time use, and robustness to noise and missing information. IceWatch paves the way for automatic, scalable GLOF warning systems. It also holds potential for integration with diverse sensor inputs and global glacier monitoring activities.
zh
[AI-138] he Expert Validation Framework (EVF): Enabling Domain Expert Control in AI Engineering
【速读】:该论文旨在解决生成式 AI (Generative AI) 在企业环境中部署时缺乏系统性质量保障机制的问题,从而阻碍其在知识工作中的广泛应用。解决方案的关键在于提出一个以领域专家为核心的“专家验证框架”(Expert Validation Framework),通过结构化的规范、测试、验证和持续监控流程,使专家能够对包含 GenAI 组件的软件系统行为保持权威控制,从而弥合 AI 能力与组织信任之间的关键鸿沟。
链接: https://arxiv.org/abs/2601.12327
作者: Lucas Gren,Felix Dobslaw
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI (GenAI) systems promise to transform knowledge work by automating a range of tasks, yet their deployment in enterprise settings remains hindered by the lack of systematic quality assurance mechanisms. We present an Expert Validation Framework that places domain experts at the center of building software with GenAI components, enabling them to maintain authoritative control over system behavior through structured specification, testing, validation, and continuous monitoring processes. Our framework addresses the critical gap between AI capabilities and organizational trust by establishing a rigorous, expert-driven methodology for ensuring quality across diverse GenAI applications. Through a four-stage implementation process encompassing specification, system creation, validation, and production monitoring, the framework enables organizations to leverage GenAI capabilities while maintaining expert oversight and quality standards.
zh
[AI-139] MARO: Learning Stronger Reasoning from Social Interaction
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在训练过程中缺乏真实社会交互经验的问题,即现有方法主要依赖文本数据或固定任务,无法有效学习涉及互动、协商与竞争等复杂社会情境下的推理能力。其解决方案的核心是提出多智能体奖励优化(Multi-Agent Reward Optimization, MARO),该方法通过三个关键机制实现:首先,将最终的成功或失败结果分解为交互过程中每个具体行为的稀疏奖励信号,以增强学习信号;其次,通过平衡不同角色的训练样本权重来缓解角色分布不均问题;最后,直接评估每个行为的效用以应对环境不稳定性的挑战。实验表明,MARO显著提升了模型的社会推理能力,并且所学能力可迁移至数学推理和指令遵循等其他任务,验证了多智能体社会学习在增强LLMs通用推理能力方面的巨大潜力。
链接: https://arxiv.org/abs/2601.12323
作者: Yin Cai,Zhouhong Gu,Juntao Zhang,Ping Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Humans face countless scenarios that require reasoning and judgment in daily life. However, existing large language model training methods primarily allow models to learn from existing textual content or solve predetermined problems, lacking experience in real scenarios involving interaction, negotiation, and competition with others. To address this, this paper proposes Multi-Agent Reward Optimization (MARO), a method that enables large language models (LLMs) to acquire stronger reasoning abilities by learning and practicing in multi-agent social environments. Specifically, MARO first addresses the sparse learning signal problem by decomposing final success or failure outcomes into each specific behavior during the interaction process; second, it handles the uneven role distribution problem by balancing the training sample weights of different roles; finally, it addresses environmental instability issues by directly evaluating the utility of each behavior. Experimental results demonstrate that MARO not only achieves significant improvements in social reasoning capabilities, but also that the abilities acquired through social simulation learning can effectively transfer to other tasks such as mathematical reasoning and instruction following. This reveals the tremendous potential of multi-agent social learning in enhancing the general reasoning capabilities of LLMs.
zh
[AI-140] Beyond Human Annotation: Recent Advances in Data Generation Methods for Document Intelligence
【速读】:该论文旨在解决文档智能(Document Intelligence, DI)领域中高质量训练数据匮乏的问题,尤其是人工标注成本高、效率低所导致的瓶颈。现有数据生成方法研究分散于单一模态或特定任务,缺乏与实际工作流程相契合的统一视角。其解决方案的关键在于提出首个面向DI的数据生成技术全景图,将数据生成重新定义为“监督信号生成”,并基于“数据与标签的可用性”构建四类资源导向范式:数据增强、从零生成数据、自动化数据标注和自监督信号构建;同时建立多层级评估框架,融合内在质量与外在效用,系统梳理方法论现状并揭示关键挑战(如保真度差距)与前沿方向(如协同演化生态系统),从而确立数据生成作为下一代DI核心驱动力的地位。
链接: https://arxiv.org/abs/2601.12318
作者: Dehao Ying,Fengchang Yu,Haihua Chen,Changjiang Jiang,Yurong Li,Wei Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The advancement of Document Intelligence (DI) demands large-scale, high-quality training data, yet manual annotation remains a critical bottleneck. While data generation methods are evolving rapidly, existing surveys are constrained by fragmented focuses on single modalities or specific tasks, lacking a unified perspective aligned with real-world workflows. To fill this gap, this survey establishes the first comprehensive technical map for data generation in DI. Data generation is redefined as supervisory signal production, and a novel taxonomy is introduced based on the “availability of data and labels.” This framework organizes methodologies into four resource-centric paradigms: Data Augmentation, Data Generation from Scratch, Automated Data Annotation, and Self-Supervised Signal Construction. Furthermore, a multi-level evaluation framework is established to integrate intrinsic quality and extrinsic utility, compiling performance gains across diverse DI benchmarks. Guided by this unified structure, the methodological landscape is dissected to reveal critical challenges such as fidelity gaps and frontiers including co-evolutionary ecosystems. Ultimately, by systematizing this fragmented field, data generation is positioned as the central engine for next-generation DI.
zh
[AI-141] Explanova: Automatically Discover Data Insights in N times M Table via XAI Combined LLM Workflow
【速读】:该论文试图解决自动化数据分析师(Automated Data Analyst)在实际应用中因依赖大语言模型(Large Language Model, LLM)带来的高计算成本与复杂性问题。现有方案如DeepAnalyze、DataSage等虽借助LLM的代理工具调用能力实现了细粒度自动分析,但其性能受限于昂贵的LLM推理开销。本文提出的Explanova方案关键在于采用预设的AutoML式工作流(AutoML-like workflow),通过系统化遍历变量间的统计特征、两两关系及对目标变量的影响路径,实现可解释性的自动探索与分析;同时利用本地小型LLM替代大型模型,显著降低资源消耗,从而在保证分析质量的前提下提升效率与实用性。
链接: https://arxiv.org/abs/2601.12317
作者: Yiming Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Automation in data analysis has been a long-time pursuit. Current agentic LLM shows a promising solution towards it. Like DeepAnalyze, DataSage, and Datawise. They are all powerful agentic frameworks for automatic fine-grained analysis and are powered by LLM-based agentic tool calling ability. However, what about powered by a preset AutoML-like workflow? If we traverse all possible exploration, like Xn itself`s statistics, Xn1-Xn2 relationships, Xn to all other, and finally explain? Our Explanova is such an attempt: Cheaper due to a Local Small LLM.
zh
[AI-142] Survival is the Only Reward: Sustainable Self-Training Through Environment-Mediated Selection
【速读】:该论文旨在解决自训练系统(self-training systems)在缺乏外部数据质量判别标准时易发生奖励欺骗(reward hacking)和语义漂移(semantic drift)的问题,从而导致模型性能退化。其解决方案的关键在于构建一种完全基于环境可行性(environmental viability)进行选择的自训练架构:候选行为在真实资源约束下执行,仅当其环境效应既持久又能维持未来交互可能性时才被保留;该机制不依赖奖励函数、目标函数或任务特定监督,而是通过行为作为世界改变事件的差异化生存来实现选择,从而杜绝代理优化(proxy optimisation)的可能性,并使奖励欺骗在进化上不稳定。这种基于环境接地的选择机制促使模型发展出负空间学习(negative-space learning, NSL),即通过策略的持续保留与修剪实现改进,且无需显式指令即可演化出元学习能力(如故意失败以获取信息性错误反馈)。
链接: https://arxiv.org/abs/2601.12310
作者: Jennifer Dodgson,Alfath Daryl Alhajir,Michael Joedhitya,Akira Rafhael Janson Pattirane,Surender Suresh Kumar,Joseph Lim,C.H. Peh,Adith Ramdas,Steven Zhang Zhexu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Self-training systems often degenerate due to the lack of an external criterion for judging data quality, leading to reward hacking and semantic drift. This paper provides a proof-of-concept system architecture for stable self-training under sparse external feedback and bounded memory, and empirically characterises its learning dynamics and failure modes. We introduce a self-training architecture in which learning is mediated exclusively by environmental viability, rather than by reward, objective functions, or externally defined fitness criteria. Candidate behaviours are executed under real resource constraints, and only those whose environmental effects both persist and preserve the possibility of future interaction are propagated. The environment does not provide semantic feedback, dense rewards, or task-specific supervision; selection operates solely through differential survival of behaviours as world-altering events, making proxy optimisation impossible and rendering reward-hacking evolutionarily unstable. Analysis of semantic dynamics shows that improvement arises primarily through the persistence of effective and repeatable strategies under a regime of consolidation and pruning, a paradigm we refer to as negative-space learning (NSL), and that models develop meta-learning strategies (such as deliberate experimental failure in order to elicit informative error messages) without explicit instruction. This work establishes that environment-grounded selection enables sustainable open-ended self-improvement, offering a viable path toward more robust and generalisable autonomous systems without reliance on human-curated data or complex reward shaping. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2601.12310 [cs.AI] (or arXiv:2601.12310v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.12310 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-143] oolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在工具使用场景中缺乏系统化、可靠的过程奖励模型(Process Reward Models, PRMs)评估基准的问题。现有方法虽利用PRM提供步骤级奖励以优化代理的采样与探索,但缺少统一的数据集和评测标准来衡量其性能差异。解决方案的关键在于提出ToolPRMBench——一个基于多个代表性工具使用基准构建的大规模评测基准,将代理轨迹转化为包含交互历史、正确动作、合理错误动作及工具元数据的细粒度测试用例,并结合离线采样(识别单步局部错误)与在线采样(捕捉多步真实失败)两种策略,辅以多大语言模型(LLM)验证流水线降低标签噪声,从而实现对PRM性能的全面、高质量评估。
链接: https://arxiv.org/abs/2601.12294
作者: Dawei Li,Yuguang Yao,Zhen Tan,Huan Liu,Ruocheng Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: under review
Abstract:Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool-using settings. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We respectively utilize offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. A multi-LLM verification pipeline is proposed to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool-using. Code and data will be released at this https URL.
zh
[AI-144] meGMM: Single-Pass Probabilistic Forecasting via Adaptive Gaussian Mixture Models with Reversible Normalization
【速读】:该论文旨在解决概率时间序列预测中因计算成本高或参数假设过于严格而导致的预测性能受限及分布失配问题。现有方法通常依赖昂贵的采样策略或假设特定分布形式,难以准确刻画复杂未来分布。其解决方案的关键在于提出TimeGMM框架,该框架基于高斯混合模型(Gaussian Mixture Model, GMM)实现单次前向传播即可捕捉复杂的未来不确定性分布;核心创新是引入GMM自适应可逆实例归一化(GMM-adapted Reversible Instance Normalization, GRIN)模块,以动态适应时间-概率分布的变化,并结合专用的时间编码器(Temporal Encoder, TE-Module)与条件时序-概率解码器(Conditional Temporal-Probabilistic Decoder, CTPD-Module),协同建模时间依赖性和混合分布参数,从而显著提升预测精度与鲁棒性。
链接: https://arxiv.org/abs/2601.12288
作者: Lei Liu,Tengyuan Liu,Hongwei Zhao,Jiahui Huang,Ruibo Guo,Bin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Probabilistic time series forecasting is crucial for quantifying future uncertainty, with significant applications in fields such as energy and finance. However, existing methods often rely on computationally expensive sampling or restrictive parametric assumptions to characterize future distributions, which limits predictive performance and introduces distributional mismatch. To address these challenges, this paper presents TimeGMM, a novel probabilistic forecasting framework based on Gaussian Mixture Models (GMM) that captures complex future distributions in a single forward pass. A key component is GMM-adapted Reversible Instance Normalization (GRIN), a novel module designed to dynamically adapt to temporal-probabilistic distribution shifts. The framework integrates a dedicated Temporal Encoder (TE-Module) with a Conditional Temporal-Probabilistic Decoder (CTPD-Module) to jointly capture temporal dependencies and mixture distribution parameters. Extensive experiments demonstrate that TimeGMM consistently outperforms state-of-the-art methods, achieving maximum improvements of 22.48% in CRPS and 21.23% in NMAE.
zh
[AI-145] Predictive Prototyping: Evaluating Design Concepts with ChatGPT
【速读】:该论文旨在解决传统设计-建造-测试(design-build-test)循环中物理原型制作成本高、周期长的问题,尤其是在集成原型完成前难以进行有效评估的瓶颈。其核心解决方案是利用生成式预训练变换器(Generative Pre-trained Transformer, GPT)结合检索增强生成(Retrieval-Augmented Generation, RAG)方法,通过从公开数据源提取的原型数据增强模型推理能力,从而预测成本、性能和感知可用性等关键指标。实验表明,GPT-RAG在成本与性能预测上优于个体或群体人类设计师,且在可用性洞察方面具有相当水平;同时,基于GPT-RAG建议生成的物理原型在性能上超越商业基准和拓扑优化设计,验证了该方法在加速创新流程中的有效性。
链接: https://arxiv.org/abs/2601.12276
作者: Hilsann Yong,Bradley A. Camburn
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 22 pages, 15 figures, 5 tables
Abstract:The design-build-test cycle is essential for innovation, but physical prototyping is often slow and expensive. Although physics-based simulation and strategic prototyping can reduce cost, meaningful evaluation is frequently constrained until an integrated prototype is built. This paper investigates whether a generative pretrained transformer (GPT) can predict information typically obtained through prototyping, including cost, performance, and perceived usability. We introduce a retrieval-augmented generation (RAG) method to emulate design feedback using OpenAI GPT-4o, grounded in prototyping data scraped from this http URL to increase access to relevant precedent. Two studies are reported. First, a controlled experiment compares GPT-RAG and human designers, who receive design sketches and predict cost, performance, and usability; predictions are evaluated against ground-truth results from physical prototypes. Second, we report an applied demonstration in which a physical prototype is produced from GPT-RAG recommendations and compared with a commercial baseline and a topology-optimized design. Results show that GPT-RAG provides more accurate cost and performance estimates than individual or crowd human estimates, while yielding comparable usability insights; the GPT-RAG-informed prototype also outperforms both comparison prototypes. Repeated querying with response averaging significantly improves accuracy, suggesting that LLMs can emulate crowd aggregation effects consistent with the law of large numbers.
zh
[AI-146] Docs2Synth: A Synthetic Data Trained Retriever Framework for Scanned Visually Rich Documents Understanding WWW2026
【速读】:该论文旨在解决受监管领域中文档理解(Vision-based Document Understanding, VRDU)的两大核心挑战:一是缺乏人工标注数据以适应模型到私有或低资源领域;二是预训练模型难以保持与领域特定事实的同步更新。为此,作者提出Docs2Synth框架,其关键在于构建一个基于代理(agent-based)的合成监督机制,自动从原始文档集合中生成并验证多样化的问答对,并训练一个轻量级视觉检索器(visual retriever)提取领域相关证据。在推理阶段,该检索器通过迭代式检索-生成循环与生成式AI(Generative AI)协同工作,显著降低幻觉并提升响应一致性,从而实现无需人工标注即可增强模型的领域泛化能力和事实准确性。
链接: https://arxiv.org/abs/2601.12260
作者: Yihao Ding,Qiang Sun,Puzhen Wu,Sirui Li,Siwen Luo,Wei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at WWW 2026 Demo Track
Abstract:Document understanding (VRDU) in regulated domains is particularly challenging, since scanned documents often contain sensitive, evolving, and domain specific knowledge. This leads to two major challenges: the lack of manual annotations for model adaptation and the difficulty for pretrained models to stay up-to-date with domain-specific facts. While Multimodal Large Language Models (MLLMs) show strong zero-shot abilities, they still suffer from hallucination and limited domain grounding. In contrast, discriminative Vision-Language Pre-trained Models (VLPMs) provide reliable grounding but require costly annotations to cover new domains. We introduce Docs2Synth, a synthetic-supervision framework that enables retrieval-guided inference for private and low-resource domains. Docs2Synth automatically processes raw document collections, generates and verifies diverse QA pairs via an agent-based system, and trains a lightweight visual retriever to extract domain-relevant evidence. During inference, the retriever collaborates with an MLLM through an iterative retrieval–generation loop, reducing hallucination and improving response consistency. We further deliver Docs2Synth as an easy-to-use Python package, enabling plug-and-play deployment across diverse real-world scenarios. Experiments on multiple VRDU benchmarks show that Docs2Synth substantially enhances grounding and domain generalization without requiring human annotations.
zh
[AI-147] FutureX-Pro: Extending Future Prediction to High-Value Vertical Domains
【速读】:该论文旨在解决当前通用型智能体(agentic Large Language Models, LLMs)在资本密集型和安全关键领域(如金融、零售、公共卫生与自然灾害)中缺乏可靠性和领域适配性的问题。解决方案的关键在于构建一个专业化垂直领域的未来预测框架——FutureX-Pro,其包含五个子领域:FutureX-Finance、FutureX-Retail、FutureX-PublicHealth、FutureX-NaturalDisaster 和 FutureX-Search,并沿用 FutureX 的无污染实时评估流水线,对 SOTA agentic LLMs 在基础预测任务中的表现进行系统性评测,从而检验其是否具备工业部署所需的领域知识锚定能力(domain grounding)。
链接: https://arxiv.org/abs/2601.12259
作者: Jiashuo Liu,Siyuan Chen,Zaiyuan Wang,Zhiyuan Zeng,Jiacheng Guo,Liang Hu,Lingyue Yin,Suozhi Huang,Wenxin Hao,Yang Yang,Zerui Cheng,Zixin Yao,Lingyue Yin,Haoxin Liu,Jiayi Cheng,Yuzhen Li,Zezhong Ma,Bingjie Wang,Bingsen Qiu,Xiao Liu,Zeyang Zhang,Zijian Liu,Jinpeng Wang,Mingren Yin,Tianci He,Yali Liao,Yixiao Tian,Zhenwei Zhu,Anqi Dai,Ge Zhang,Jingkai Liu,Kaiyuan Zhang,Wenlong Wu,Xiang Gao,Xinjie Chen,Zhixin Yao,Zhoufutu Wen,B. Aditya Prakash,Jose Blanchet,Mengdi Wang,Nian Si,Wenhao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 21 pages
Abstract:Building upon FutureX, which established a live benchmark for general-purpose future prediction, this report introduces FutureX-Pro, including FutureX-Finance, FutureX-Retail, FutureX-PublicHealth, FutureX-NaturalDisaster, and FutureX-Search. These together form a specialized framework extending agentic future prediction to high-value vertical domains. While generalist agents demonstrate proficiency in open-domain search, their reliability in capital-intensive and safety-critical sectors remains under-explored. FutureX-Pro targets four economically and socially pivotal verticals: Finance, Retail, Public Health, and Natural Disaster. We benchmark agentic Large Language Models (LLMs) on entry-level yet foundational prediction tasks – ranging from forecasting market indicators and supply chain demands to tracking epidemic trends and natural disasters. By adapting the contamination-free, live-evaluation pipeline of FutureX, we assess whether current State-of-the-Art (SOTA) agentic LLMs possess the domain grounding necessary for industrial deployment. Our findings reveal the performance gap between generalist reasoning and the precision required for high-value vertical applications.
zh
[AI-148] Improving Large Molecular Language Model via Relation-aware Multimodal Collaboration
【速读】:该论文旨在解决现有大分子语言模型(Large Molecular Language Models, LMLMs)在分子理解任务中存在幻觉(hallucination)和鲁棒性不足的问题,其根本原因在于对多种分子模态(如1D序列、2D分子图和3D构象)的整合不够充分。解决方案的关键在于提出CoLLaMo,一个基于大型语言模型的分子辅助工具,其核心创新是一个多层级分子模态协同投影器(multi-level molecular modality-collaborative projector),其中包含关系感知的模态协同注意力机制(relation-aware modality-collaborative attention mechanism),能够通过引入2D结构关系和3D空间关系,实现原子间细粒度且关系引导的信息交互,从而提升模型对分子多模态信息的理解与融合能力。
链接: https://arxiv.org/abs/2601.12256
作者: Jinyoung Park,Minseong Bae,Jeehye Na,Hyunwoo J. Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated their instruction-following capabilities and achieved powerful performance on various tasks. Inspired by their success, recent works in the molecular domain have led to the development of large molecular language models (LMLMs) that integrate 1D molecular strings or 2D molecular graphs into the language models. However, existing LMLMs often suffer from hallucination and limited robustness, largely due to inadequate integration of diverse molecular modalities such as 1D sequences, 2D molecular graphs, and 3D conformations. To address these limitations, we propose CoLLaMo, a large language model-based molecular assistant equipped with a multi-level molecular modality-collaborative projector. The relation-aware modality-collaborative attention mechanism in the projector facilitates fine-grained and relation-guided information exchange between atoms by incorporating 2D structural and 3D spatial relations. Furthermore, we present a molecule-centric new automatic measurement, including a hallucination assessment metric and GPT-based caption quality evaluation to address the limitations of token-based generic evaluation metrics (i.e., BLEU) widely used in assessing molecular comprehension of LMLMs. Our extensive experiments demonstrate that our CoLLaMo enhances the molecular modality generalization capabilities of LMLMs, achieving the best performance on multiple tasks, including molecule captioning, computed property QA, descriptive property QA, motif counting, and IUPAC name prediction.
zh
[AI-149] Optimal Power Allocation and Sub-Optimal Channel Assignment for Downlink NOMA Systems Using Deep Reinforcement Learning
【速读】:该论文旨在解决非正交多址接入(Non-Orthogonal Multiple Access, NOMA)系统中尚未明确的信道分配问题,以提升网络资源利用效率。其关键解决方案是提出一种结合经验回放(replay memory)机制的在线策略深度强化学习(on-policy deep reinforcement learning, DRL)框架,通过引入记忆回放和优化状态特征表示,增强模型在复杂动态环境下的泛化能力,从而实现更高效的联合资源分配(Joint Resource Allocation, JRA)。
链接: https://arxiv.org/abs/2601.12242
作者: WooSeok Kim,Jeonghoon Lee,Sangho Kim,Taesun An,WonMin Lee,Dowon Kim,Kyungseop Shin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:
Abstract:In recent years, Non-Orthogonal Multiple Access (NOMA) system has emerged as a promising candidate for multiple access frameworks due to the evolution of deep machine learning, trying to incorporate deep machine learning into the NOMA system. The main motivation for such active studies is the growing need to optimize the utilization of network resources as the expansion of the internet of things (IoT) caused a scarcity of network resources. The NOMA addresses this need by power multiplexing, allowing multiple users to access the network simultaneously. Nevertheless, the NOMA system has few limitations. Several works have proposed to mitigate this, including the optimization of power allocation known as joint resource allocation(JRA) method, and integration of the JRA method and deep reinforcement learning (JRA-DRL). Despite this, the channel assignment problem remains unclear and requires further investigation. In this paper, we propose a deep reinforcement learning framework incorporating replay memory with an on-policy algorithm, allocating network resources in a NOMA system to generalize the learning. Also, we provide extensive simulations to evaluate the effects of varying the learning rate, batch size, type of model, and the number of features in the state.
zh
[AI-150] Wavelet-Driven Masked Multiscale Reconstruction for PPG Foundation Models
【速读】:该论文旨在解决可穿戴设备中光电容积脉搏波(PPG)信号在基础模型预训练过程中忽视频谱结构的问题,从而限制了其在多尺度生理特征提取上的能力。现有方法通常忽略PPG信号中跨多个频率带的生理节律信息,而这些信息对下游健康任务至关重要。解决方案的关键在于提出一种名为“掩码多尺度重建”(Masked Multiscale Reconstruction, MMR)的自监督预训练框架,该框架通过小波变换实现PPG信号的多分辨率分解,并对随机掩码的系数进行重建,迫使Transformer编码器融合时间与频域信息,从而学习具有生理意义的层次化表征。实验表明,MMR在17/19个健康相关任务上优于或匹配当前最优的开源PPG基础模型和时序基础模型,验证了其有效性与泛化潜力。
链接: https://arxiv.org/abs/2601.12215
作者: Megha Thukral,Cyrus Tanade,Simon A. Lee,Juhyeon Lee,Hao Zhou,Keum San Chun,Migyeong Gwak,Viswam Nathan,Md Mahbubur Rahman,Li Zhu,Mehrab Bin Morshed,Subramaniam Venkatraman,Sharanya Arcot Desai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Wearable foundation models have the potential to transform digital health by learning transferable representations from large-scale biosignals collected in everyday settings. While recent progress has been made in large-scale pretraining, most approaches overlook the spectral structure of photoplethysmography (PPG) signals, wherein physiological rhythms unfold across multiple frequency bands. Motivated by the insight that many downstream health-related tasks depend on multi-resolution features spanning fine-grained waveform morphology to global rhythmic dynamics, we introduce Masked Multiscale Reconstruction (MMR) for PPG representation learning - a self-supervised pretraining framework that explicitly learns from hierarchical time-frequency scales of PPG data. The pretraining task is designed to reconstruct randomly masked out coefficients obtained from a wavelet-based multiresolution decomposition of PPG signals, forcing the transformer encoder to integrate information across temporal and spectral scales. We pretrain our model with MMR using ~17 million unlabeled 10-second PPG segments from ~32,000 smartwatch users. On 17 of 19 diverse health-related tasks, MMR trained on large-scale wearable PPG data improves over or matches state-of-the-art open-source PPG foundation models, time-series foundation models, and other self-supervised baselines. Extensive analysis of our learned embeddings and systematic ablations underscores the value of wavelet-based representations, showing that they capture robust and physiologically-grounded features. Together, these results highlight the potential of MMR as a step toward generalizable PPG foundation models.
zh
[AI-151] Speculative Sampling with Reinforcement Learning AAAI2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理时延(inference time latency)方面的挑战,特别是现有基于推测采样(Speculative Sampling, SpS)方法如EAGLE-3因树结构超参数静态设置而导致的灵活性与效率受限问题。解决方案的关键在于提出首个基于强化学习(Reinforcement Learning, RL)的框架——Re-SpS,其通过动态调整草稿树(draft tree)超参数,在实时生成过程中学习上下文感知策略,以平衡推测激进性(speculative aggression)与计算开销(computational overhead),从而最大化生成速度;该框架利用目标模型隐藏状态构建高效状态表示,并引入多步动作持续机制(multi-step action persistence)以增强上下文建模能力,最终在五个多样化基准上实现最高达5.45倍的加速效果,且不牺牲输出保真度(output fidelity)。
链接: https://arxiv.org/abs/2601.12212
作者: Chenan Wang,Daniel H. Shi,Haipeng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026
Abstract:Inference time latency has remained an open challenge for real world applications of large language models (LLMs). State-of-the-art (SOTA) speculative sampling (SpS) methods for LLMs, like EAGLE-3, use tree-based drafting to explore multiple candidate continuations in parallel. However, the hyperparameters controlling the tree structure are static, which limits flexibility and efficiency across diverse contexts and domains. We introduce Reinforcement learning for Speculative Sampling (Re-SpS), the first reinforcement learning (RL)-based framework for draft tree hyperparameter optimization. Re-SpS dynamically adjusts draft tree hyperparameters in real-time, learning context-aware policies that maximize generation speed by balancing speculative aggression with computational overhead. It leverages efficient state representations from target model hidden states and introduces multi-step action persistence for better context modeling. Evaluation results across five diverse benchmarks demonstrate consistent improvements over the SOTA method EAGLE-3, achieving up to 5.45 \times speedup over the backbone LLM and up to 1.12 \times speedup compared to EAGLE-3 across five diverse benchmarks, with no loss in output fidelity.
zh
[AI-152] Do Neural Codecs Generalize? A Controlled Study Across Unseen Languages and Non-Speech Tasks
【速读】:该论文旨在解决神经音频编解码器(Neural Audio Codecs, NACs)在未见语言、非语音场景(如环境音、音乐和动物叫声)下的泛化能力不足的问题,以及探索在预训练阶段引入非语音数据是否能提升语音与非语音任务的综合性能。其解决方案的关键在于:从头训练NACs并采用严格控制的配置和精心筛选的预训练数据,从而实现公平比较,并通过11项指标对信号重建质量及下游应用性能进行全面评估,最终发现NACs具备跨语言泛化能力,纯语音预训练模型在非语音任务上表现下降,而加入非语音数据可显著提升非语音任务性能且不影响语音任务表现。
链接: https://arxiv.org/abs/2601.12205
作者: Shih-Heng Wang,Jiatong Shi,Jinchuan Tian,Haibin Wu,Shinji Watanabe
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:This paper investigates three crucial yet underexplored aspects of the generalization capabilities of neural audio codecs (NACs): (i) whether NACs can generalize to unseen languages during pre-training, (ii) whether speech-only pre-trained NACs can effectively generalize to non-speech applications such as environmental sounds, music, and animal vocalizations, and (iii) whether incorporating non-speech data during pre-training can improve performance on both speech and non-speech tasks. Existing studies typically rely on off-the-shelf NACs for comparison, which limits insight due to variations in implementation. In this work, we train NACs from scratch using strictly controlled configurations and carefully curated pre-training data to enable fair comparisons. We conduct a comprehensive evaluation of NAC performance on both signal reconstruction quality and downstream applications using 11 metrics. Our results show that NACs can generalize to unseen languages during pre-training, speech-only pre-trained NACs exhibit degraded performance on non-speech tasks, and incorporating non-speech data during pre-training improves performance on non-speech tasks while maintaining comparable performance on speech tasks.
zh
[AI-153] Aletheia: What Makes RLVR For Code Verifiers Tick?
【速读】:该论文旨在解决代码生成后训练阶段中,如何有效提升代码验证器(code verifier)性能的问题,尤其是在执行反馈难以获取的场景下,传统基于执行反馈的验证方法存在局限性。为应对这一挑战,作者提出并开源了Aletheia——一个可控的测试平台,用于在不同策略模型和协变量偏移条件下进行基于执行的代码验证器鲁棒性评估。解决方案的关键在于对RLVR(Reinforcement Learning from Verifiable Rewards)训练范式中的核心组件进行系统性分析,发现:在小规模验证器中,在线策略学习(on-policy training) 是最关键的因素;而在大规模验证器中,基于思考链的训练(thinking-based training) 成为决定性能的核心要素,这揭示了简化现有复杂训练配方的可能性,从而为代码生成后训练提供更高效、可扩展的验证器设计路径。
链接: https://arxiv.org/abs/2601.12186
作者: Vatsal Venkatkrishna,Indraneil Paul,Iryna Gurevych
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures
Abstract:Multi-domain thinking verifiers trained via Reinforcement Learning from Verifiable Rewards (RLVR) are a prominent fixture of the Large Language Model (LLM) post-training pipeline, owing to their ability to robustly rate and rerank model outputs. However, the adoption of such verifiers towards code generation has been comparatively sparse, with execution feedback constituting the dominant signal. Nonetheless, code verifiers remain valuable toward judging model outputs in scenarios where execution feedback is hard to obtain and are a potentially powerful addition to the code generation post-training toolbox. To this end, we create and open-source Aletheia, a controlled testbed that enables execution-grounded evaluation of code verifiers’ robustness across disparate policy models and covariate shifts. We examine components of the RLVR-based verifier training recipe widely credited for its success: (1) intermediate thinking traces, (2) learning from negative samples, and (3) on-policy training. While experiments show the optimality of RLVR, we uncover important opportunities to simplify the recipe. Particularly, despite code verification exhibiting positive training- and inference-time scaling, on-policy learning stands out as the key component at small verifier sizes, and thinking-based training emerges as the most important component at larger scales.
zh
[AI-154] IDE: A Trace-Informed Depth-First Exploration for Planning with Temporally Extended Goals
【速读】:该论文旨在解决具有时序扩展目标(Temporally Extended Goals, TEGs)的任务规划问题,即如何让智能体在时间维度上实现一系列复杂的目标序列,而非仅处理孤立的即时任务。传统方法通常将LTLf(Linear Temporal Logic on finite traces)任务规划问题转化为经典规划中的可达性目标,再借助现成的规划器求解,但这类方法缺乏针对时序目标的有效启发式信息,导致搜索过程效率低下。论文提出的解决方案TIDE(Trace-Informed Depth-first Exploration)的关键在于:通过将时序规划问题分解为一系列可由标准规划器求解的“到达-避免”子问题,并基于成本驱动的启发式策略,在状态空间图中识别并优先探索有潜力的自动机轨迹(automaton trace),同时引入自适应回溯机制,通过重新计算代价和惩罚不可行转移来系统性地恢复失败计划,从而在保证完备性的前提下提升规划效率与性能。
链接: https://arxiv.org/abs/2601.12141
作者: Yuliia Suprun,Khen Elimelech,Lydia E. Kavraki,Moshe Y. Vardi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Task planning with temporally extended goals (TEGs) is a critical challenge in AI and robotics, enabling agents to achieve complex sequences of objectives over time rather than addressing isolated, immediate tasks. Linear Temporal Logic on finite traces (LTLf ) provides a robust formalism for encoding these temporal goals. Traditional LTLf task planning approaches often transform the temporal planning problem into a classical planning problem with reachability goals, which are then solved using off-the-shelf planners. However, these methods often lack informed heuristics to provide a guided search for temporal goals. We introduce TIDE (Trace-Informed Depth-first Exploration), a novel approach that addresses this limitation by decomposing a temporal problem into a sequence of smaller, manageable reach-avoid sub-problems, each solvable using an off-the-shelf planner. TIDE identifies and prioritizes promising automaton traces within the domain graph, using cost-driven heuristics to guide exploration. Its adaptive backtracking mechanism systematically recovers from failed plans by recalculating costs and penalizing infeasible transitions, ensuring completeness and efficiency. Experimental results demonstrate that TIDE achieves promising performance and is a valuable addition to the portfolio of planning methods for temporally extended goals.
zh
[AI-155] DriveSafe: A Hierarchical Risk Taxonomy for Safety-Critical LLM -Based Driving Assistants
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在车载数字助手场景中因安全、伦理或法律合规性不足而导致的风险问题。现有通用型安全评估框架难以覆盖真实驾驶情境下的特定风险,因此作者提出DriveSafe——一个分层的四层风险分类体系,系统化地刻画基于LLM的驾驶辅助系统可能出现的安全关键失效模式。其核心创新在于构建了涵盖技术、法律、社会与伦理维度的129个细粒度原子风险类别,并基于真实交通法规和安全原则设计测试用例,通过评估六种主流LLM对不安全驾驶相关请求的拒绝行为,验证了当前模型在驾驶场景下通用安全对齐机制的局限性。
链接: https://arxiv.org/abs/2601.12138
作者: Abhishek Kumar,Riya Tapwal,Carsten Maple
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly integrated into vehicle-based digital assistants, where unsafe, ambiguous, or legally incorrect responses can lead to serious safety, ethical, and regulatory consequences. Despite growing interest in LLM safety, existing taxonomies and evaluation frameworks remain largely general-purpose and fail to capture the domain-specific risks inherent to real-world driving scenarios. In this paper, we introduce DriveSafe, a hierarchical, four-level risk taxonomy designed to systematically characterize safety-critical failure modes of LLM-based driving assistants. The taxonomy comprises 129 fine-grained atomic risk categories spanning technical, legal, societal, and ethical dimensions, grounded in real-world driving regulations and safety principles and reviewed by domain experts. To validate the safety relevance and realism of the constructed prompts, we evaluate their refusal behavior across six widely deployed LLMs. Our analysis shows that the evaluated models often fail to appropriately refuse unsafe or non-compliant driving-related queries, underscoring the limitations of general-purpose safety alignment in driving contexts.
zh
[AI-156] Human-Human-AI Triadic Programming: Uncovering the Role of AI Agent and the Value of Human Partner in Collaborative Learning
【速读】:该论文试图解决的问题是:当前关于人工智能(AI)辅助编程的研究多将AI视为人类协作的替代品,忽视了协作编程中社会性与学习导向的本质特征。为应对这一问题,作者提出“人-人-AI”(Human-Human-AI, HHAI)三元协作编程范式,其关键在于将AI代理作为额外的协作伙伴而非替代人类同伴,通过引入第三人视角增强AI使用过程的可见性与责任归属感,从而激活社会共享学习调控机制,促进学习效果和社交存在感提升,同时减少对AI生成代码的依赖。
链接: https://arxiv.org/abs/2601.12134
作者: Taufiq Daryanto,Xiaohan Ding,Kaike Ping,Lance T. Wilhelm,Yan Chen,Chris Brown,Eugenia H. Rho
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:As AI assistance becomes embedded in programming practice, researchers have increasingly examined how these systems help learners generate code and work more efficiently. However, these studies often position AI as a replacement for human collaboration and overlook the social and learning-oriented aspects that emerge in collaborative programming. Our work introduces human-human-AI (HHAI) triadic programming, where an AI agent serves as an additional collaborator rather than a substitute for a human partner. Through a within-subjects study with 20 participants, we show that triadic collaboration enhances collaborative learning and social presence compared to the dyadic human-AI (HAI) baseline. In the triadic HHAI conditions, participants relied significantly less on AI-generated code in their work. This effect was strongest in the HHAI-shared condition, where participants had an increased sense of responsibility to understand AI suggestions before applying them. These findings demonstrate how triadic settings activate socially shared regulation of learning by making AI use visible and accountable to a human peer, suggesting that AI systems that augment rather than automate peer collaboration can better preserve the learning processes that collaborative programming relies on.
zh
[AI-157] UniMo: Unified Motion Generation and Understanding with Chain of Thought
【速读】:该论文旨在解决现有3D人体运动生成与理解方法中可解释性不足的问题,以及基于大语言模型(LLM)的统一框架在语义对齐和任务一致性上的挑战,同时克服LLM中基于下一个词预测范式在运动序列建模中的累积误差问题。解决方案的关键在于提出UniMo框架,通过监督微调(SFT)将运动-语言信息与可解释的思维链(Chain of Thought, CoT)推理机制融合进LLM,并引入基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习后训练策略,以群体token为优化单位,增强结构正确性和语义一致性,从而有效缓解运动token预测中的累积误差,显著提升运动生成与理解性能。
链接: https://arxiv.org/abs/2601.12126
作者: Guocun Wang,Kenkun Liu,Jing Lin,Guorui Song,Jian Li,Xiaoguang Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing 3D human motion generation and understanding methods often exhibit limited interpretability, restricting effective mutual enhancement between these inherently related tasks. While current unified frameworks based on large language models (LLMs) leverage linguistic priors, they frequently encounter challenges in semantic alignment and task coherence. Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain of thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). We further introduce reinforcement learning with Group Relative Policy Optimization (GRPO) as a post-training strategy that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. Extensive experiments demonstrate that UniMo significantly outperforms existing unified and task-specific models, achieving state-of-the-art performance in both motion generation and understanding.
zh
[AI-158] SynQP: A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data
【速读】:该论文旨在解决合成数据生成(Synthetic Data Generation, SDG)在健康应用中因隐私风险评估缺乏开放框架而阻碍其广泛应用的问题。核心挑战在于敏感医疗数据难以获取,导致无法建立可复现的隐私风险评估基准。解决方案的关键在于提出一个名为SynQP的开源框架,通过使用模拟的敏感数据进行隐私基准测试,确保原始数据始终保密;同时引入一种新的身份泄露风险度量指标,以更准确地反映机器学习模型的随机性对隐私的影响,从而提升隐私评估的透明度与可靠性。
链接: https://arxiv.org/abs/2601.12124
作者: Bing Hu,Yixin Li,Asma Bahamyirou,Helen Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 Pages, 22nd Annual International Conference on Privacy, Security, and Trust (PST2025), Fredericton, Canada
Abstract:The use of synthetic data in health applications raises privacy concerns, yet the lack of open frameworks for privacy evaluations has slowed its adoption. A major challenge is the absence of accessible benchmark datasets for evaluating privacy risks, due to difficulties in acquiring sensitive data. To address this, we introduce SynQP, an open framework for benchmarking privacy in synthetic data generation (SDG) using simulated sensitive data, ensuring that original data remains confidential. We also highlight the need for privacy metrics that fairly account for the probabilistic nature of machine learning models. As a demonstration, we use SynQP to benchmark CTGAN and propose a new identity disclosure risk metric that offers a more accurate estimation of privacy risks compared to existing approaches. Our work provides a critical tool for improving the transparency and reliability of privacy evaluations, enabling safer use of synthetic data in health-related applications. % In our quality evaluations, non-private models achieved near-perfect machine-learning efficacy (\ge0.97). Our privacy assessments (Table II) reveal that DP consistently lowers both identity disclosure risk (SD-IDR) and membership-inference attack risk (SD-MIA), with all DP-augmented models staying below the 0.09 regulatory threshold. Code available at this https URL
zh
[AI-159] Less Is More – Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models
【速读】:该论文旨在解决视觉令牌压缩(visual token compression)对大型视觉语言模型(LVLMs)鲁棒性造成显著损害的问题,即在启用压缩后,原本具备鲁棒性的模型会变得极易受到攻击,且这种脆弱性仅出现在压缩状态下,难以被诊断。解决方案的关键在于识别出压缩过程中令牌重要性排序的不稳定性是导致鲁棒性下降的根本原因:微小且不可察觉的扰动即可改变令牌排序,使压缩机制误删任务关键信息,从而引发模型失效。为此,作者提出Compression-Aware Attack(CAA),直接针对令牌选择机制,在压缩推理下系统性地诱发失败;进一步扩展至黑盒场景的Transfer CAA,无需访问目标模型或压缩配置即可实现攻击,揭示了视觉令牌压缩带来的效率-安全权衡(efficiency-security trade-off)这一此前被忽视的重要问题。
链接: https://arxiv.org/abs/2601.12042
作者: Xiaomei Zhang,Zhaoxi Zhang,Leo Yu Zhang,Yanjun Zhang,Guanhong Tao,Shirui Pan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual token compression is widely adopted to improve the inference efficiency of Large Vision-Language Models (LVLMs), enabling their deployment in latency-sensitive and resource-constrained scenarios. However, existing work has mainly focused on efficiency and performance, while the security implications of visual token compression remain largely unexplored. In this work, we first reveal that visual token compression substantially degrades the robustness of LVLMs: models that are robust under uncompressed inference become highly vulnerable once compression is enabled. These vulnerabilities are state-specific; failure modes emerge only in the compressed setting and completely disappear when compression is disabled, making them particularly hidden and difficult to diagnose. By analyzing the key stages of the compression process, we identify instability in token importance ranking as the primary cause of this robustness degradation. Small and imperceptible perturbations can significantly alter token rankings, leading the compression mechanism to mistakenly discard task-critical information and ultimately causing model failure. Motivated by this observation, we propose a Compression-Aware Attack to systematically study and exploit this vulnerability. CAA directly targets the token selection mechanism and induces failures exclusively under compressed inference. We further extend this approach to more realistic black-box settings and introduce Transfer CAA, where neither the target model nor the compression configuration is accessible. We further evaluate potential defenses and find that they provide only limited protection. Extensive experiments across models, datasets, and compression methods show that visual token compression significantly undermines robustness, revealing a previously overlooked efficiency-security trade-off.
zh
[AI-160] Partial Reasoning in Language Models: Search and Refinement Guided by Uncertainty
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多步推理任务中,尤其是在数学和逻辑推理场景下表现受限的问题。其解决方案的关键在于提出PREGU(Partial Reasoning Guided by Uncertainty),通过监控自回归生成过程中输出分布的熵值,当熵超过预设阈值时判定为不确定性并终止当前推理路径,随后在潜在空间中进行局部搜索以优化部分推理过程,并采用Soft Reasoning方法选择最一致的答案。实验证明,该机制能有效利用熵作为触发选择性精炼的信号,在多个基准测试中达到或优于现有方法的性能。
链接: https://arxiv.org/abs/2601.12040
作者: Murilo da Luz,Bruno Brandão,Luana Martins,Gustavo Oliveira,Bryan de Oliveira,Luckeciano Melo,Telma Soares
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The use of Large Language Models (LLMs) for reasoning and planning tasks has drawn increasing attention in Artificial Intelligence research. Despite their remarkable progress, these models still exhibit limitations in multi-step inference scenarios, particularly in mathematical and logical reasoning. We introduce PREGU (Partial Reasoning Guided by Uncertainty). PREGU monitors the entropy of the output distribution during autoregressive generation and halts the process whenever entropy exceeds a defined threshold, signaling uncertainty. From that point, a localized search is performed in the latent space to refine the partial reasoning and select the most coherent answer, using the Soft Reasoning method. Experiments conducted with LLaMA-3-8B, Mistral-7B, and Qwen2-7B across four reasoning benchmarks (GSM8K, GSM-Hard, SVAMP, and StrategyQA) showed performance greater than or similar to Soft Reasoning, indicating that entropy can serve as an effective signal to trigger selective refinement during reasoning.
zh
[AI-161] Abstract Argumentation with Subargument Relations
【速读】:该论文旨在解决传统Dung抽象论证框架(abstract argumentation framework)因仅依赖攻击关系而无法有效表示结构化论证形式中关键的子论证(subargument)依赖关系的问题。现有扩展如双极论证框架引入支持关系,但未能刻画子论证的非对称性和构成性特征及其与攻击的关系。解决方案的关键在于将子论证关系作为基本关系之一,与攻击并列纳入抽象论证框架,从而在保持抽象性的同时,系统分析子论证如何影响接受性语义性质,为结构信息提供一种有原则的抽象方式,并明确子论证在抽象接受性推理中的作用。
链接: https://arxiv.org/abs/2601.12038
作者: Beishui Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages
Abstract:Dung’s abstract argumentation framework characterises argument acceptability solely via an attack relation, deliberately abstracting from the internal structure of arguments. While this level of abstraction has enabled a rich body of results, it limits the ability to represent structural dependencies that are central in many structured argumentation formalisms, in particular subargument relations. Existing extensions, including bipolar argumentation frameworks, introduce support relations, but these do not capture the asymmetric and constitutive nature of subarguments or their interaction with attacks. In this paper, we study abstract argumentation frameworks enriched with an explicit subargument relation, treated alongside attack as a basic relation. We analyse how subargument relations interact with attacks and examine their impact on fundamental semantic properties. This framework provides a principled abstraction of structural information and clarifies the role of subarguments in abstract acceptability reasoning.
zh
[AI-162] ARC: Active and Reflection-driven Context Management for Long-Horizon Information Seeking Agents
【速读】:该论文旨在解决大语言模型在长期信息检索任务中因交互历史累积导致的性能退化问题,即“上下文腐烂”(context rot),其本质是模型难以维持长时间推理过程中的一致性和任务相关性。解决方案的关键在于提出ARC框架,首次将上下文管理视为一个动态的、由反思驱动的主动过程,而非静态的信息存储;通过反思驱动的监控与修正机制,使代理能够在检测到上下文失配或退化时主动重组工作上下文,从而提升长期任务中的推理稳定性与准确性。
链接: https://arxiv.org/abs/2601.12030
作者: Yilun Yao,Shan Huang,Elsie Dai,Zhewen Tan,Zhenyu Duan,Shousheng Jia,Yanbing Jiang,Tong Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures
Abstract:Large language models are increasingly deployed as research agents for deep search and long-horizon information seeking, yet their performance often degrades as interaction histories grow. This degradation, known as context rot, reflects a failure to maintain coherent and task-relevant internal states over extended reasoning horizons. Existing approaches primarily manage context through raw accumulation or passive summarization, treating it as a static artifact and allowing early errors or misplaced emphasis to persist. Motivated by this perspective, we propose ARC, which is the first framework to systematically formulate context management as an active, reflection-driven process that treats context as a dynamic internal reasoning state during execution. ARC operationalizes this view through reflection-driven monitoring and revision, allowing agents to actively reorganize their working context when misalignment or degradation is detected. Experiments on challenging long-horizon information-seeking benchmarks show that ARC consistently outperforms passive context compression methods, achieving up to an 11% absolute improvement in accuracy on BrowseComp-ZH with Qwen2.5-32B-Instruct.
zh
[AI-163] Are LLM s Ready for TOON? Benchmarking Structural Correctness-Sustainability Trade-offs in Novel Structured Output Formats
【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在生成结构化输出(如JSON、XML、YAML等)时,虽已关注结构正确性,但忽略了不同输出格式在推理过程中的环境效率差异,即计算资源消耗与碳排放问题。解决方案的关键在于提出一个可持续性感知的评估框架,该框架综合衡量token使用量、生成时间及估算碳排放,并引入环境感知生成正确性评分(GCS_env),将结构正确性与碳效率统一量化。通过该框架对新型紧凑格式TOON与其他主流格式的系统性对比,揭示了格式选择需权衡结构正确性与环境影响,且模型容量提升可缓解二者之间的 trade-off,从而为碳意识驱动的大规模LLM部署提供实证依据和优化方向。
链接: https://arxiv.org/abs/2601.12014
作者: Elio Masciari,Vincenzo Moscato,Enea Vincenzo Napolitano,Gian Marco Orlando,Marco Perillo,Diego Russo
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Large Language Models (LLMs) are increasingly required to generate structured, machine-readable outputs for downstream systems. While recent benchmarks have focused on evaluating the structural correctness of such outputs, the environmental impact of inference for different output formats has largely been overlooked. In this paper, we argue that structured output formats should be assessed not only in terms of correctness, but also with respect to their environmental efficiency. To this end, we introduce a sustainability-aware evaluation framework for structured generation that measures token usage, generation time, and estimated carbon emissions. Within this framework, we propose the Environment-Aware Generation Correctness Score (GCS_env), a unified metric that integrates structural correctness with carbon-aware efficiency. Using this framework, we systematically benchmark the novel TOON format against established representations (JSON, XML, YAML) across multiple LLMs spanning different architectures and parameter scales. Our results reveal a consistent trade-off: TOON yields markedly more compact outputs and lower emissions, but lower structural correctness when models lack native support. We show that increased model capacity reduces this gap and that environment-aware scoring can shift format rankings depending on deployment priorities. highlighting the need for sustainability-inclusive benchmarking and provides empirical evidence that compact representations such as TOON can offer practical advantages in large-scale, carbon-conscious LLM deployments. Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE) Cite as: arXiv:2601.12014 [cs.AI] (or arXiv:2601.12014v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.12014 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-164] Robust Verification of Concurrent Stochastic Games
【速读】:该论文旨在解决自主系统在多智能体环境中进行并发、策略性决策时,因过渡概率难以精确指定而导致的验证与控制难题。传统并发随机博弈(Concurrent Stochastic Games, CSGs)模型要求精确已知转移概率,这在许多现实场景中并不成立。为此,作者提出鲁棒并发随机博弈(Robust CSGs)及其子类区间并发随机博弈(Interval CSGs, ICSGs),通过引入对转移概率的认知不确定性(epistemic uncertainty)建模,使模型更贴近实际。解决方案的关键在于构建一个基于最坏情况假设的鲁棒验证框架,并开发适用于有限与无限horizon目标的理论基础和高效算法,涵盖零和与非零和情形(后者基于社会福利最优的纳什均衡)。该方法已在PRISM-games模型检测器中实现,并在多个大型基准测试中验证了其可行性。
链接: https://arxiv.org/abs/2601.12003
作者: Angel Y. He,David Parker
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: Extended version of a paper accepted to TACAS 2026. Main text: 17 pages, 2 figures, 2 tables; Appendix: 37 pages, 3 figures, 3 tables
Abstract:Autonomous systems often operate in multi-agent settings and need to make concurrent, strategic decisions, typically in uncertain environments. Verification and control problems for these systems can be tackled with concurrent stochastic games (CSGs), but this model requires transition probabilities to be precisely specified - an unrealistic requirement in many real-world settings. We introduce robust CSGs and their subclass interval CSGs (ICSGs), which capture epistemic uncertainty about transition probabilities in CSGs. We propose a novel framework for robust verification of these models under worst-case assumptions about transition uncertainty. Specifically, we develop the underlying theoretical foundations and efficient algorithms, for finite- and infinite-horizon objectives in both zero-sum and nonzero-sum settings, the latter based on (social-welfare optimal) Nash equilibria. We build an implementation in the PRISM-games model checker and demonstrate the feasibility of robust verification of ICSGs across a selection of large benchmarks.
zh
[AI-165] Kernel-Based Learning of Safety Barriers
【速读】:该论文旨在解决黑箱系统(尤其是具有离散时间随机动力学的系统)在安全关键应用中难以进行形式化安全验证的问题,传统方法因无法处理AI系统的“黑箱”特性且缺乏可扩展性而受限。其核心解决方案是基于控制屏障证书(control barrier certificates)构建一种数据驱动的安全验证与合成框架:通过条件均值嵌入(conditional mean embeddings)将系统轨迹数据映射到再生核希尔伯特空间(RKHS),构造可扩展的不确定性集以增强对分布外行为的鲁棒性;同时利用有限傅里叶展开将原本难以求解的半无限优化问题转化为线性规划,从而实现高效计算谱屏障(spectral barrier),形成一个既可扩展又分布鲁棒的安全验证机制。
链接: https://arxiv.org/abs/2601.12002
作者: Oliver Schön,Zhengang Zhong,Sadegh Soudjani
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 44 pages, 9 figures
Abstract:The rapid integration of AI algorithms in safety-critical applications such as autonomous driving and healthcare is raising significant concerns about the ability to meet stringent safety standards. Traditional tools for formal safety verification struggle with the black-box nature of AI-driven systems and lack the flexibility needed to scale to the complexity of real-world applications. In this paper, we present a data-driven approach for safety verification and synthesis of black-box systems with discrete-time stochastic dynamics. We employ the concept of control barrier certificates, which can guarantee safety of the system, and learn the certificate directly from a set of system trajectories. We use conditional mean embeddings to embed data from the system into a reproducing kernel Hilbert space (RKHS) and construct an RKHS ambiguity set that can be inflated to robustify the result to out-of-distribution behavior. We provide the theoretical results on how to apply the approach to general classes of temporal logic specifications beyond safety. For the data-driven computation of safety barriers, we leverage a finite Fourier expansion to cast a typically intractable semi-infinite optimization problem as a linear program. The resulting spectral barrier allows us to leverage the fast Fourier transform to generate the relaxed problem efficiently, offering a scalable yet distributionally robust framework for verifying safety. Our work moves beyond restrictive assumptions on system dynamics and uncertainty, as demonstrated on two case studies including a black-box system with a neural network controller.
zh
[AI-166] Hybrid IDS Using Signature-Based and Anomaly-Based Detection
【速读】:该论文旨在解决传统入侵检测系统(Intrusion Detection System, IDS)在应对不断演变的网络威胁时存在的局限性,特别是基于特征匹配的IDS难以发现未知攻击,而基于异常检测的IDS易产生高误报率的问题。解决方案的关键在于提出并综述混合入侵检测系统(Hybrid Intrusion Detection System, Hybrid IDS),通过融合基于签名(signature-based)和基于异常(anomaly-based)的检测技术,实现对已知与未知攻击更全面、准确的识别能力,从而提升整体检测性能和适应复杂应用场景的能力。
链接: https://arxiv.org/abs/2601.11998
作者: Messaouda Boutassetta,Amina Makhlouf,Newfel Messaoudi,Abdelmadjid Benmachiche,Ines Boutabia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages,The Second National Conference on Artificial Intelligence and Information Technologies (NCAIIT25)
Abstract:Intrusion detection systems (IDS) are essential for protecting computer systems and networks against a wide range of cyber threats that continue to evolve over time. IDS are commonly categorized into two main types, each with its own strengths and limitations, such as difficulty in detecting previously unseen attacks and the tendency to generate high false positive rates. This paper presents a comprehensive survey and a conceptual overview of Hybrid IDS, which integrate signature-based and anomaly-based detection techniques to enhance attack detection capabilities. The survey examines recent research on Hybrid IDS, classifies existing models into functional categories, and discusses their advantages, limitations, and application domains, including financial systems, air traffic control, and social networks. In addition, recent trends in Hybrid IDS research, such as machine learning-based approaches and cloud-based deployments, are reviewed. Finally, this work outlines potential future research directions aimed at developing more cost-effective Hybrid IDS solutions with improved ability to detect emerging and sophisticated cyberattacks.
zh
[AI-167] Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
【速读】:该论文旨在解决音频-视觉(Audio-Visual, AV)嵌入学习中因误判共现关系而导致的语义混淆问题:现有对比学习和三元组损失方法通常依赖稀疏标注标签,将所有共现事件视为语义相似性,从而错误地将未标注但实际相关的跨模态信号(如“火车”视频中出现的摩托车音频)标记为负样本,导致虚假负例并遗漏真实的跨模态依赖。其解决方案的关键在于引入软标签预测与隐式交互图建模机制:首先通过音频-视觉语义对齐损失(AV-SAL)训练教师网络生成跨模态软标签分布,赋予未标注共现事件非零概率以增强监督信号;其次利用GRaSP算法从软标签中推断出稀疏有向的隐式交互图(ILI),识别类间条件依赖关系(如“火车(视觉)→摩托车(音频)”);最后设计隐式交互正则化项(LIR),引导学生网络在度量损失基础上,按软标签概率比例拉近依赖关联但未标注的嵌入对,从而提升嵌入的鲁棒性和语义一致性。
链接: https://arxiv.org/abs/2601.11995
作者: Donghuo Zeng,Hao Niu,Yanan Wang,Masato Taya
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD)
备注: 16 pages, 5 figures, 2 tables
Abstract:Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences - background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled “train” might also contain motorcycle audio and visual, because “motorcycle” is not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., “Train (visual)” - “Motorcycle (audio)”) that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): A student network is trained with both metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on AVE and VEGAS benchmarks show consistent improvements in mean average precision (mAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.
zh
[AI-168] Process In-Context Learning: Enhancing Mathematical Reasoning via Dynamic Demonstration Insertion
【速读】:该论文旨在解决当前上下文学习(In-context Learning, ICL)在复杂逻辑推理任务(如数学推理)中因静态演示使用而带来的局限性问题。现有ICL方法在推理过程中固定使用预选示例,无法动态响应多步推理中出现的歧义计算或逻辑断层等困惑点,导致错误累积并降低最终准确率。其解决方案的关键在于提出过程内上下文学习(Process In-Context Learning, PICL),通过两个阶段实现动态演示整合:首先识别推理过程中的潜在困惑点(基于语义与熵分析),并总结其核心特征;其次,在遇到这些困惑点时,从演示池中检索匹配上下文的相关示例,并实时插入到当前推理流中以引导后续步骤,从而有效缓解推理过程中的中间混淆,提升数学推理准确性。
链接: https://arxiv.org/abs/2601.11979
作者: Ang Gao,Changshuo Zhang,Xiao Zhang,Deyang Li,Minjun Zhao,Fangchao Liu,Xinyu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In-context learning (ICL) has proven highly effective across diverse large language model (LLM) tasks. However, its potential for enhancing tasks that demand step-by-step logical deduction, such as mathematical reasoning, remains underexplored. A core limitation of existing ICL approaches is their static use of demonstrations: examples are pre-selected before inference and remain fixed, failing to adapt to the dynamic confusion points that often arise during multi-step reasoning such as ambiguous calculations or logical gaps. These unresolved confusion points can lead to cascading errors that degrade final accuracy. To tackle this issue, we propose Process In-Context Learning (PICL), a dynamic demonstration integration framework designed to boost mathematical reasoning by responding to real-time inference needs. PICL operates in two stages: 1)~it identifies potential confusion points by analyzing semantics and entropy in the reasoning process and summarizes their core characteristics; 2)~upon encountering these points, it retrieves relevant demonstrations from the demonstration pool that match the confusion context and inserts them directly into the ongoing reasoning process to guide subsequent steps. Experiments show that PICL outperforms baseline methods by mitigating mid-inference confusion, highlighting the value of adaptive demonstration insertion in complex mathematical reasoning.
zh
[AI-169] One-Shot Price Forecasting with Covariate-Guided Experts under Privacy Constraints
【速读】:该论文旨在解决电力系统中多变量时间序列预测面临的两个核心挑战:一是变量间复杂依赖关系难以建模,传统方法需大量专家知识且泛化能力弱;二是跨区域部署时存在严格的隐私约束,无法直接共享原始数据。解决方案的关键在于提出一种新颖的MoE Encoder模块,通过在预训练时间序列模型的tokenization与编码层之间引入稀疏专家混合(Mixture-of-Experts, MoE)层,将多变量预测任务转化为由专家引导的单变量任务,从而有效捕捉变量间关联,并支持联邦学习场景下的局部训练与轻量参数共享,实现高精度且隐私友好的跨区域迁移。
链接: https://arxiv.org/abs/2601.11977
作者: Ren He(Tsinghua University),Yinliang Xu(Tsinghua University),Jinfeng Wang(Guangdong Power Grid Co.),Jeremy Watson(University of Canterbury),Jian Song(Tsinghua University)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Forecasting in power systems often involves multivariate time series with complex dependencies and strict privacy constraints across regions. Traditional forecasting methods require significant expert knowledge and struggle to generalize across diverse deployment scenarios. Recent advancements in pre-trained time series models offer new opportunities, but their zero-shot performance on domain-specific tasks remains limited. To address these challenges, we propose a novel MoE Encoder module that augments pretrained forecasting models by injecting a sparse mixture-of-experts layer between tokenization and encoding. This design enables two key capabilities: (1) trans forming multivariate forecasting into an expert-guided univariate task, allowing the model to effectively capture inter-variable relations, and (2) supporting localized training and lightweight parameter sharing in federated settings where raw data cannot be exchanged. Extensive experiments on public multivariate datasets demonstrate that MoE-Encoder significantly improves forecasting accuracy compared to strong baselines. We further simulate federated environments and show that transferring only MoE-Encoder parameters allows efficient adaptation to new regions, with minimal performance degradation. Our findings suggest that MoE-Encoder provides a scalable and privacy-aware extension to foundation time series models.
zh
[AI-170] Learn Like Humans: Use Meta-cognitive Reflection for Efficient Self-Improvement
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体(agent)受限于静态人工设计提示(prompt)而导致适应性不足的问题,以及现有自提升框架依赖低效多轮递归循环、计算开销高的缺陷。其解决方案的关键在于提出一种名为元认知代理反思式自我改进(Metacognitive Agent Reflective Self-improvement, MARS)的新框架,该框架通过单次循环内融合原则性反思(抽象规范性规则以规避错误)与程序性反思(提炼分步策略以达成成功),将学习洞察转化为优化指令,从而实现无需持续在线反馈的系统性推理逻辑改进。
链接: https://arxiv.org/abs/2601.11974
作者: Xinmeng Hou,Peiliang Gong,Bohao Qu,Wuqi Wang,Qing Guo,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) enable complex autonomous behavior, current agents remain constrained by static, human-designed prompts that limit adaptability. Existing self-improving frameworks attempt to bridge this gap but typically rely on inefficient, multi-turn recursive loops that incur high computational costs. To address this, we propose Metacognitive Agent Reflective Self-improvement (MARS), a framework that achieves efficient self-evolution within a single recurrence cycle. Inspired by educational psychology, MARS mimics human learning by integrating principle-based reflection (abstracting normative rules to avoid errors) and procedural reflection (deriving step-by-step strategies for success). By synthesizing these insights into optimized instructions, MARS allows agents to systematically refine their reasoning logic without continuous online feedback. Extensive experiments on six benchmarks demonstrate that MARS outperforms state-of-the-art self-evolving systems while significantly reducing computational overhead.
zh
[AI-171] Big Data Workload Profiling for Energy-Aware Cloud Resource Management
【速读】:该论文旨在解决云数据中心在处理大规模、复杂的大数据工作负载时,因运营能耗持续上升而面临的能源效率挑战。解决方案的关键在于提出了一种面向工作负载的节能调度框架,通过分析CPU利用率、内存需求和存储I/O行为等多维指标,结合历史执行日志与实时遥测数据,对虚拟机(VM)候选部署位置进行能量与性能影响预测,从而实现自适应资源合并(adaptive consolidation),在保障服务等级协议(SLA)合规性的前提下显著降低能耗。实验表明,该方法相较基线调度器可稳定实现15%至20%的节能效果,且性能损耗可忽略不计。
链接: https://arxiv.org/abs/2601.11935
作者: Milan Parikh,Aniket Abhishek Soni,Sneja Mitinbhai Shah,Ayush Raj Jha
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 10 pages, 3 figures. Accepted and presented at the 2026 International Conference on Data Analytics for Sustainability and Engineering Technology (DASET 2026), Track: Big Data and Machine Learning Applications
Abstract:Cloud data centers face increasing pressure to reduce operational energy consumption as big data workloads continue to grow in scale and complexity. This paper presents a workload aware and energy efficient scheduling framework that profiles CPU utilization, memory demand, and storage IO behavior to guide virtual machine placement decisions. By combining historical execution logs with real time telemetry, the proposed system predicts the energy and performance impact of candidate placements and enables adaptive consolidation while preserving service level agreement compliance. The framework is evaluated using representative Hadoop MapReduce, Spark MLlib, and ETL workloads deployed on a multi node cloud testbed. Experimental results demonstrate consistent energy savings of 15 to 20 percent compared to a baseline scheduler, with negligible performance degradation. These findings highlight workload profiling as a practical and scalable strategy for improving the sustainability of cloud based big data processing environments.
zh
[AI-172] LIBRA: Language Model Informed Bandit Recourse Algorithm for Personalized Treatment Planning
【速读】:该论文旨在解决高风险场景下(如个性化医疗)的序列决策问题,其中决策者不仅需选择最优治疗动作,还需提供可执行、最小化的患者可变特征修改方案(即算法性救济,algorithmic recourse)。核心挑战在于如何在保证决策有效性的同时,融合领域知识与统计学习的可靠性。解决方案的关键是提出统一框架下的两个算法:一是通用线性救济带算法(Generalized Linear Recourse Bandit, GLRB),用于建模带救济约束的上下文带问题;二是基于大语言模型(LLM)的带救济算法(LIBRA),其创新性地将LLM提供的先验知识与带模型的学习机制结合,实现三重保障——温启动保证(warm-start guarantee)、LLM调用频次控制(LLM-effort guarantee,仅需 O(log2T) 次查询)、以及鲁棒性保证(robustness guarantee),确保即使LLM不可靠也不会劣于纯带算法。实验表明,该方法显著优于标准上下文带和纯LLM基准,在 regret、治疗质量与样本效率上均具优势。
链接: https://arxiv.org/abs/2601.11905
作者: Junyu Cao,Ruijiang Gao,Esmaeil Keyvanshokooh,Jianhao Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: 50 pages. Previous version with human-AI collaboration: arXiv:2410.14640
Abstract:We introduce a unified framework that seamlessly integrates algorithmic recourse, contextual bandits, and large language models (LLMs) to support sequential decision-making in high-stakes settings such as personalized medicine. We first introduce the recourse bandit problem, where a decision-maker must select both a treatment action and a feasible, minimal modification to mutable patient features. To address this problem, we develop the Generalized Linear Recourse Bandit (GLRB) algorithm. Building on this foundation, we propose LIBRA, a Language Model-Informed Bandit Recourse Algorithm that strategically combines domain knowledge from LLMs with the statistical rigor of bandit learning. LIBRA offers three key guarantees: (i) a warm-start guarantee, showing that LIBRA significantly reduces initial regret when LLM recommendations are near-optimal; (ii) an LLM-effort guarantee, proving that the algorithm consults the LLM only O(\log^2 T) times, where T is the time horizon, ensuring long-term autonomy; and (iii) a robustness guarantee, showing that LIBRA never performs worse than a pure bandit algorithm even when the LLM is unreliable. We further establish matching lower bounds that characterize the fundamental difficulty of the recourse bandit problem and demonstrate the near-optimality of our algorithms. Experiments on synthetic environments and a real hypertension-management case study confirm that GLRB and LIBRA improve regret, treatment quality, and sample efficiency compared with standard contextual bandits and LLM-only benchmarks. Our results highlight the promise of recourse-aware, LLM-assisted bandit algorithms for trustworthy LLM-bandits collaboration in personalized high-stakes decision-making.
zh
[AI-173] AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agent ic LLM Systems AAAI2026
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的多智能体系统(Multi-Agent Systems)在企业级部署中面临的评估难题,即现有方法普遍局限于单次响应评分或窄域基准测试,缺乏稳定性、可扩展性和自动化能力。其核心解决方案是提出AEMA(Adaptive Evaluation Multi-Agent)框架,该框架通过过程感知(process-aware)和可审计(auditable)的设计,实现对异构智能体工作流的多步骤评估规划、执行与聚合,并在人类监督下完成可信的性能验证。关键创新在于引入结构化的评估流程与可追溯记录机制,显著提升了评估结果的稳定性、人机对齐性以及自动化系统的可问责性(accountable automation),从而为LLM多智能体系统的负责任评估提供透明且可复现的路径。
链接: https://arxiv.org/abs/2601.11903
作者: YenTing Lee,Keerthi Koneru,Zahra Moslemi,Sheethal Kumar,Ramesh Radhakrishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Workshop on W51: How Can We Trust and Control Agentic AI? Toward Alignment, Robustness, and Verifiability in Autonomous LLM Agents at AAAI 2026
Abstract:Evaluating large language model (LLM)-based multi-agent systems remains a critical challenge, as these systems must exhibit reliable coordination, transparent decision-making, and verifiable performance across evolving tasks. Existing evaluation approaches often limit themselves to single-response scoring or narrow benchmarks, which lack stability, extensibility, and automation when deployed in enterprise settings at multi-agent scale. We present AEMA (Adaptive Evaluation Multi-Agent), a process-aware and auditable framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under human oversight. Compared to a single LLM-as-a-Judge, AEMA achieves greater stability, human alignment, and traceable records that support accountable automation. Our results on enterprise-style agent workflows simulated using realistic business scenarios demonstrate that AEMA provides a transparent and reproducible pathway toward responsible evaluation of LLM-based multi-agent systems. Keywords Agentic AI, Multi-Agent Systems, Trustworthy AI, Verifiable Evaluation, Human Oversight Comments: Workshop on W51: How Can We Trust and Control Agentic AI? Toward Alignment, Robustness, and Verifiability in Autonomous LLM Agents at AAAI 2026 Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2601.11903 [cs.AI] (or arXiv:2601.11903v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.11903 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-174] DevBench: A Realistic Developer-Informed Benchmark for Code Generation Models
【速读】:该论文旨在解决现有大型语言模型(Large Language Models, LLMs)代码补全评估基准缺乏生态效度(ecological validity)和实际指导意义的问题,特别是以往基准常因训练数据污染或任务设计脱离真实开发场景而无法准确反映模型在实际编码任务中的表现。其解决方案的关键在于构建一个基于真实开发者遥测数据(telemetry)的基准 DevBench,涵盖六种编程语言和六类任务,确保评估任务来源于真实 API 使用模式与代码意图理解;同时引入功能正确性、相似性指标与 LLM 评判相结合的多维评估体系,从而实现对模型语法精度、语义推理能力及实用性的精细化诊断,为模型选型与针对性改进提供可操作的洞察。
链接: https://arxiv.org/abs/2601.11895
作者: Pareesa Ameneh Golnari,Adarsh Kumarappan,Wen Wen,Xiaoyu Liu,Gabriel Ryan,Yuting Sun,Shengyu Fu,Elsie Nallipogu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry, such as API usage and code purpose understanding. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. 9 state-of-the-art models were assessed, revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement-detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.
zh
[AI-175] MyGram: Modality-aware Graph Transformer with Global Distribution for Multi-modal Entity Alignment AAAI2026
【速读】:该论文旨在解决多模态实体对齐(multi-modal entity alignment)中因忽略模态内部结构上下文信息而导致的浅层特征干扰问题。现有方法通常未能充分挖掘图像和文本等多模态数据中的深层结构信息,从而影响对齐精度。其解决方案的关键在于提出MyGram模型,该模型包含两个核心组件:一是模态扩散学习模块(modality diffusion learning module),用于捕获各模态内的深层结构上下文信息并实现细粒度的多模态融合;二是Gram Loss,通过最小化由多模态特征构成的四维平行多面体体积,施加全局分布一致性约束,增强跨模态特征的一致性与鲁棒性。
链接: https://arxiv.org/abs/2601.11885
作者: Zhifei Li,Ziyue Qin,Xiangyu Luo,Xiaoju Hou,Yue Zhao,Miao Zhang,Zhifang Huang,Kui Xiao,Bing Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026
Abstract:Multi-modal entity alignment aims to identify equivalent entities between two multi-modal Knowledge graphs by integrating multi-modal data, such as images and text, to enrich the semantic representations of entities. However, existing methods may overlook the structural contextual information within each modality, making them vulnerable to interference from shallow features. To address these challenges, we propose MyGram, a modality-aware graph transformer with global distribution for multi-modal entity alignment. Specifically, we develop a modality diffusion learning module to capture deep structural contextual information within modalities and enable fine-grained multi-modal fusion. In addition, we introduce a Gram Loss that acts as a regularization constraint by minimizing the volume of a 4-dimensional parallelotope formed by multi-modal features, thereby achieving global distribution consistency across modalities. We conduct experiments on five public datasets. Results show that MyGram outperforms baseline models, achieving a maximum improvement of 4.8% in Hits@1 on FBDB15K, 9.9% on FBYG15K, and 4.3% on DBP15K.
zh
[AI-176] F-CoDiT: Conditional Time Series Synthesis with Diffusion Transformers for Treasury Futures
【速读】:该论文旨在解决扩散模型(Diffusion Transformers, DiT)在国债期货数据合成任务中表现不足的问题,特别是针对此类数据低频、市场依赖性强及多变量间存在分组相关性的特点。其关键解决方案是提出TF-CoDiT框架,通过将多通道一维时间序列转换为离散小波变换(Discrete Wavelet Transform, DWT)系数矩阵来增强低数据场景下的学习能力,并引入U型变分自编码器(U-shape VAE)以层次化编码跨通道依赖关系至潜在空间,再通过解码桥接潜在空间与DWT空间实现潜在扩散生成;同时设计金融市场属性协议(Financial Market Attribute Protocol, FinMAP)用于标准化经济指标描述,确保提示词覆盖关键市场状态,从而提升合成数据的真实性与鲁棒性。
链接: https://arxiv.org/abs/2601.11880
作者: Yingxiao Zhang,Jiaxin Duan,Junfu Zhang,Ke Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion Transformers (DiT) have achieved milestones in synthesizing financial time-series data, such as stock prices and order flows. However, their performance in synthesizing treasury futures data is still underexplored. This work emphasizes the characteristics of treasury futures data, including its low volume, market dependencies, and the grouped correlations among multivariables. To overcome these challenges, we propose TF-CoDiT, the first DiT framework for language-controlled treasury futures synthesis. To facilitate low-data learning, TF-CoDiT adapts the standard DiT by transforming multi-channel 1-D time series into Discrete Wavelet Transform (DWT) coefficient matrices. A U-shape VAE is proposed to encode cross-channel dependencies hierarchically into a latent variable and bridge the latent and DWT spaces through decoding, thereby enabling latent diffusion generation. To derive prompts that cover essential conditions, we introduce the Financial Market Attribute Protocol (FinMAP) - a multi-level description system that standardizes daily / periodical market dynamics by recognizing 17 / 23 economic indicators from 7/8 perspectives. In our experiments, we gather four types of treasury futures data covering the period from 2015 to 2025, and define data synthesis tasks with durations ranging from one week to four months. Extensive evaluations demonstrate that TF-CoDiT can produce highly authentic data with errors at most 0.433 (MSE) and 0.453 (MAE) to the ground-truth. Further studies evidence the robustness of TF-CoDiT across contracts and temporal horizons.
zh
[AI-177] rminal-Bench: Benchmarking Agents on Hard Realistic Tasks in Command Line Interfaces
【速读】:该论文旨在解决当前AI代理(AI agent)评估基准难以有效衡量前沿模型在真实世界复杂任务中表现的问题,因为现有基准要么不反映实际应用场景,要么难度不足。其解决方案的关键在于提出Terminal-Bench 2.0——一个由89个精心设计的高难度任务组成的基准测试集,这些任务均基于真实工作流构建,每个任务包含独立的终端环境、人工编写的解决方案以及全面的验证测试。实验表明,当前前沿模型和代理在该基准上的平均得分低于65%,并通过错误分析识别出模型与代理改进的关键方向。
链接: https://arxiv.org/abs/2601.11868
作者: Mike A. Merrill,Alexander G. Shaw,Nicholas Carlini,Boxuan Li,Harsh Raj,Ivan Bercovich,Lin Shi,Jeong Yeon Shin,Thomas Walshe,E. Kelly Buchanan,Junhong Shen,Guanghao Ye,Haowei Lin,Jason Poulos,Maoyu Wang,Marianna Nezhurina,Jenia Jitsev,Di Lu,Orfeas Menis Mastromichalakis,Zhiwei Xu,Zizhao Chen,Yue Liu,Robert Zhang,Leon Liangyu Chen,Anurag Kashyap,Jan-Lucas Uslu,Jeffrey Li,Jianbo Wu,Minghao Yan,Song Bian,Vedang Sharma,Ke Sun,Steven Dillmann,Akshay Anand,Andrew Lanpouthakoun,Bardia Koopah,Changran Hu,Etash Guha,Gabriel H. S. Dreiman,Jiacheng Zhu,Karl Krauth,Li Zhong,Niklas Muennighoff,Robert Amanfu,Shangyin Tan,Shreyas Pimpalgaonkar,Tushar Aggarwal,Xiangning Lin,Xin Lan,Xuandong Zhao,Yiqing Liang,Yuanli Wang,Zilong Wang,Changzhi Zhou,David Heineman,Hange Liu,Harsh Trivedi,John Yang,Junhong Lin,Manish Shetty,Michael Yang,Nabil Omi,Negin Raoof,Shanda Li,Terry Yue Zhuo,Wuwei Lin,Yiwei Dai,Yuxin Wang,Wenhao Chai,Shang Zhou,Dariush Wahdany,Ziyu She,Jiaming Hu,Zhikang Dong,Yuxuan Zhu,Sasha Cui,Ahson Saiyed,Arinbjörn Kolbeinsson,Jesse Hu,Christopher Michael Rytting,Ryan Marten,Yixin Wang,Alex Dimakis,Andy Konwinski,Ludwig Schmidt
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at this https URL .
zh
[AI-178] Cascaded Transformer for Robust and Scalable SLA Decomposition via Amortized Optimization
【速读】:该论文旨在解决6G网络中端到端(End-to-End, E2E)服务等级协议(Service Level Agreement, SLA)向域特定SLA分解的难题,当前方法依赖计算密集型、迭代式优化过程,导致高延迟和复杂度。解决方案的关键在于提出Casformer——一种级联Transformer架构,其第一层通过域特定Transformer编码器捕捉历史域反馈信息,第二层利用基于Transformer的聚合器建模跨域依赖关系;同时采用受域感知神经网络(Domain-Informed Neural Networks, DINNs)启发的学习范式,融合风险感知建模与摊销优化(amortized optimization),从而学习一个稳定、单向(forward-only)的SLA分解策略。该设计显著提升了分解质量、可扩展性和鲁棒性,并降低了运行时复杂度,适用于5G及未来网络环境中的实时SLA管理。
链接: https://arxiv.org/abs/2601.11859
作者: Cyril Shih-Huan Hsu
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:The evolution toward 6G networks increasingly relies on network slicing to provide tailored, End-to-End (E2E) logical networks over shared physical infrastructures. A critical challenge is effectively decomposing E2E Service Level Agreements (SLAs) into domain-specific SLAs, which current solutions handle through computationally intensive, iterative optimization processes that incur substantial latency and complexity. To address this, we introduce Casformer, a cascaded Transformer architecture designed for fast, optimization-free SLA decomposition. Casformer leverages historical domain feedback encoded through domain-specific Transformer encoders in its first layer, and integrates cross-domain dependencies using a Transformer-based aggregator in its second layer. The model is trained under a learning paradigm inspired by Domain-Informed Neural Networks (DINNs), incorporating risk-informed modeling and amortized optimization to learn a stable, forward-only SLA decomposition policy. Extensive evaluations demonstrate that Casformer achieves improved SLA decomposition quality against state-of-the-art optimization-based frameworks, while exhibiting enhanced scalability and robustness under volatile and noisy network conditions. In addition, its forward-only design reduces runtime complexity and simplifies deployment and maintenance. These insights reveal the potential of combining amortized optimization with Transformer-based sequence modeling to advance network automation, providing a scalable and efficient solution suitable for real-time SLA management in advanced 5G-and-beyond network environments.
zh
[AI-179] Human-AI Collaborative Inductive Thematic Analysis: AI Guided Analysis and Human Interpretive Authority
【速读】:该论文旨在解决生成式人工智能(Generative AI)在定性研究中应用时所引发的分析实践与解释权威性问题,特别是如何在保持研究严谨性的同时利用AI工具提升归纳主题分析(Inductive Thematic Analysis)的透明度与可审计性。其解决方案的关键在于提出并验证了一个“人机协同归纳主题分析”(Human-Artificial Intelligence Collaborative Inductive Thematic Analysis, HACITA)框架,通过一个专为支持归纳主题分析设计的AI工具——诱导主题分析GPT(ITA-GPT),实现结构化、半自动化的工作流程,包括熟悉资料、原话编码(verbatim coding)、动名词驱动的描述性编码及主题发展,并通过强制文本溯源、覆盖检查和审计追踪保障分析过程的可追溯性与可控性。研究表明,尽管AI充当了程序性支架以增强分析透明度,但最终的解释权仍由研究人员掌握,其通过反复的修改、删除、拒绝、插入和注释等分析行为行使判断力,从而确保负责任的人机协同分析实践。
链接: https://arxiv.org/abs/2601.11850
作者: Matthew Nyaaba,Min SungEun,Mary Abiswin Apam,Kwame Owoahene Acheampong,Emmanuel Dwamena,Xiaoming Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The increasing use of generative artificial intelligence (GenAI) in qualitative research raises important questions about analytic practice and interpretive authority. This study examines how researchers interact with an Inductive Thematic Analysis GPT (ITA-GPT), a purpose-built AI tool designed to support inductive thematic analysis through structured, semi-automated prompts aligned with reflexive thematic analysis and verbatim coding principles. Guided by a Human-Artificial Intelligence Collaborative Inductive Thematic Analysis (HACITA) framework, the study focuses on analytic process rather than substantive findings. Three experienced qualitative researchers conducted ITA-GPT assisted analyses of interview transcripts from education research in the Ghanaian teacher education context. The tool supported familiarization, verbatim in vivo coding, gerund-based descriptive coding, and theme development, while enforcing trace to text integrity, coverage checks, and auditability. Data sources included interaction logs, AI-generated tables, researcher revisions, deletions, insertions, comments, and reflexive memos. Findings show that ITA-GPT functioned as a procedural scaffold that structured analytic workflow and enhanced transparency. However, interpretive authority remained with human researchers, who exercised judgment through recurrent analytic actions including modification, deletion, rejection, insertion, and commenting. The study demonstrates how inductive thematic analysis is enacted through responsible human AI collaboration.
zh
[AI-180] Imandra CodeLogician: Neuro-Symbolic Reasoning for Precise Analysis of Software Logic
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在软件逻辑分析中缺乏精确、全面数学推理能力的问题,即LLMs虽能理解代码语义,但难以对程序行为进行严谨的数学推演。现有基准测试要么聚焦于与实际软件脱节的数学证明自动化,要么仅关注工程任务而不要求语义严谨性。解决方案的关键在于提出CodeLogician——一个神经符号代理系统,其核心创新是利用LLMs构建显式的软件形式化模型,并将其集成至工业级自动定理证明器ImandraX,从而实现基于形式化模型的自动化推理,以回答超越二值验证结果的丰富语义问题。该方法通过将LLMs的形式建模能力与符号推理引擎的精确性结合,显著提升了程序状态空间、控制流、覆盖率约束及边界情况等维度上的推理准确性,验证了神经符号融合对于实现高精度自主软件理解的重要性。
链接: https://arxiv.org/abs/2601.11840
作者: Hongyu Lin,Samer Abdallah,Makar Valentinov,Paul Brennan,Elijah Kagan,Christoph M. Wintersteiger,Denis Ignatovich,Grant Passmore
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
备注: 52 pages, 23 figures. Includes a new benchmark dataset (code-logic-bench) and evaluation of neurosymbolic reasoning for software analysis
Abstract:Large Language Models (LLMs) have shown strong performance on code understanding tasks, yet they fundamentally lack the ability to perform precise, exhaustive mathematical reasoning about program behavior. Existing benchmarks either focus on mathematical proof automation, largely disconnected from real-world software, or on engineering tasks that do not require semantic rigor. We present CodeLogician, a neurosymbolic agent for precise analysis of software logic, integrated with ImandraX, an industrial automated reasoning engine deployed in financial markets and safety-critical systems. Unlike prior approaches that use formal methods primarily to validate LLM outputs, CodeLogician uses LLMs to construct explicit formal models of software systems, enabling automated reasoning to answer rich semantic questions beyond binary verification outcomes. To rigorously evaluate mathematical reasoning about software logic, we introduce code-logic-bench, a benchmark targeting the middle ground between theorem proving and software engineering benchmarks. It measures reasoning correctness about program state spaces, control flow, coverage constraints, and edge cases, with ground truth defined via formal modeling and region decomposition. Comparing LLM-only reasoning against LLMs augmented with CodeLogician, formal augmentation yields substantial improvements, closing a 41-47 percentage point gap in reasoning accuracy. These results demonstrate that neurosymbolic integration is essential for scaling program analysis toward rigorous, autonomous software understanding. Comments: 52 pages, 23 figures. Includes a new benchmark dataset (code-logic-bench) and evaluation of neurosymbolic reasoning for software analysis Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE) MSC classes: 68N30 ACMclasses: F.3.1; D.2.4; I.2.3; I.2.4 Cite as: arXiv:2601.11840 [cs.AI] (or arXiv:2601.11840v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.11840 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Grant Passmore [view email] [v1] Sat, 17 Jan 2026 00:16:41 UTC (8,854 KB)
zh
[AI-181] AI Co-Scientist for Knowledge Synthesis in Medical Contexts: A Proof of Concept
【速读】:该论文旨在解决生物医学研究中的科研浪费问题,主要源于重复性研究、报告不完整以及传统证据综合工作流的可扩展性有限。其解决方案的关键在于构建一个基于显式形式化的人工智能共科学家平台,该平台以Population, Intervention, Comparator, Outcome, and Study design(PICOS)框架为核心,整合关系型存储、基于向量的语义检索与Neo4j知识图谱,实现可扩展且透明的知识综合。通过Transformer多任务分类器和Bi-LSTM模型分别实现高精度的PICOS合规性检测(87%)与研究设计分类(95.7%),并利用检索增强生成(Retrieval-Augmented Generation, RAG)结合混合向量与图检索策略提升结构化查询、跨研究整合及图推理能力,同时借助BERTopic识别主题冗余与证据缺口,从而显著提升证据合成的效率、透明度与可解释性。该架构具备领域无关性,为减少生物医学各领域的科研浪费提供了实用范式。
链接: https://arxiv.org/abs/2601.11825
作者: Arya Rahgozar,Pouria Mortezaagha
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Research waste in biomedical science is driven by redundant studies, incomplete reporting, and the limited scalability of traditional evidence synthesis workflows. We present an AI co-scientist for scalable and transparent knowledge synthesis based on explicit formalization of Population, Intervention, Comparator, Outcome, and Study design (PICOS). The platform integrates relational storage, vector-based semantic retrieval, and a Neo4j knowledge graph. Evaluation was conducted on dementia-sport and non-communicable disease corpora. Automated PICOS compliance and study design classification from titles and abstracts were performed using a Bidirectional Long Short-Term Memory baseline and a transformer-based multi-task classifier fine-tuned from PubMedBERT. Full-text synthesis employed retrieval-augmented generation with hybrid vector and graph retrieval, while BERTopic was used to identify thematic structure, redundancy, and evidence gaps. The transformer model achieved 95.7% accuracy for study design classification with strong agreement against expert annotations, while the Bi-LSTM achieved 87% accuracy for PICOS compliance detection. Retrieval-augmented generation outperformed non-retrieval generation for queries requiring structured constraints, cross-study integration, and graph-based reasoning, whereas non-retrieval approaches remained competitive for high-level summaries. Topic modeling revealed substantial thematic redundancy and identified underexplored research areas. These results demonstrate that PICOS-aware and explainable natural language processing can improve the scalability, transparency, and efficiency of evidence synthesis. The proposed architecture is domain-agnostic and offers a practical framework for reducing research waste across biomedical disciplines.
zh
[AI-182] POLARIS: Typed Planning and Governed Execution for Agent ic AI in Back-Office Automation AAAI2026
【速读】:该论文旨在解决企业后端流程中通用多智能体系统在可审计性(auditability)、策略一致性(policy alignment)和操作可预测性(operational predictability)方面的不足问题。其核心解决方案是提出POLARIS(Policy-Aware LLM Agentic Reasoning for Integrated Systems),该框架将自动化建模为类型化的计划合成(typed plan synthesis)与验证执行过程:通过规划器生成结构多样且类型正确的有向无环图(DAGs),由规则引导的推理模块选择合规计划,执行阶段则结合验证器门控检查、有限修复循环和编译型策略护栏(compiled policy guardrails)实现事前干预与副作用管控,从而保障决策质量与审计追踪。
链接: https://arxiv.org/abs/2601.11816
作者: Zahra Moslemi,Keerthi Koneru,Yen-Ting Lee,Sheethal Kumar,Ramesh Radhakrishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Workshop on Agentic AI Benchmarks and Applications for Enterprise Tasks: AAAI 2026
Abstract:Enterprise back office workflows require agentic systems that are auditable, policy-aligned, and operationally predictable, capabilities that generic multi-agent setups often fail to deliver. We present POLARIS (Policy-Aware LLM Agentic Reasoning for Integrated Systems), a governed orchestration framework that treats automation as typed plan synthesis and validated execution over LLM agents. A planner proposes structurally diverse, type checked directed acyclic graphs (DAGs), a rubric guided reasoning module selects a single compliant plan, and execution is guarded by validator gated checks, a bounded repair loop, and compiled policy guardrails that block or route side effects before they occur. Applied to document centric finance tasks, POLARIS produces decision grade artifacts and full execution traces while reducing human intervention. Empirically, POLARIS achieves a micro F1 of 0.81 on the SROIE dataset and, on a controlled synthetic suite, achieves 0.95 to 1.00 precision for anomaly routing with preserved audit trails. These evaluations constitute an initial benchmark for governed Agentic AI. POLARIS provides a methodological and benchmark reference for policy-aligned Agentic AI. Keywords Agentic AI, Enterprise Automation, Back-Office Tasks, Benchmarks, Governance, Typed Planning, Evaluation
zh
[AI-183] Multi-agent DRL-based Lane Change Decision Model for Cooperative Planning in Mixed Traffic
【速读】:该论文旨在解决在连接自动化车辆(Connected Automated Vehicles, CAVs)早期部署阶段,由于CAV数量稀疏导致难以形成有效协同编队(cooperative platooning)的问题。其解决方案的关键在于提出一种基于QMIX框架的混合多智能体决策模型(CNN-QMIX),通过卷积神经网络(CNN)处理交通数据,使CAV能够在动态交通场景中根据实时变化的CAV数量做出最优车道变更决策,从而提升CAV参与协同编队的概率;同时结合轨迹规划器与模型预测控制器确保变道过程的安全与平顺性,最终在不同CAV渗透率下显著提升了协同编队率(最高达26.2%),验证了该方法在初期CAV部署阶段优化交通流和能效的潜力。
链接: https://arxiv.org/abs/2601.11809
作者: Zeyu Mu,Shangtong Zhang,B. Brian Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review at IEEE Transactions on Intelligent Transportation Systems
Abstract:Connected automated vehicles (CAVs) possess the ability to communicate and coordinate with one another, enabling cooperative platooning that enhances both energy efficiency and traffic flow. However, during the initial stage of CAV deployment, the sparse distribution of CAVs among human-driven vehicles reduces the likelihood of forming effective cooperative platoons. To address this challenge, this study proposes a hybrid multi-agent lane change decision model aimed at increasing CAV participation in cooperative platooning and maximizing its associated benefits. The proposed model employs the QMIX framework, integrating traffic data processed through a convolutional neural network (CNN-QMIX). This architecture addresses a critical issue in dynamic traffic scenarios by enabling CAVs to make optimal decisions irrespective of the varying number of CAVs present in mixed traffic. Additionally, a trajectory planner and a model predictive controller are designed to ensure smooth and safe lane-change execution. The proposed model is trained and evaluated within a microsimulation environment under varying CAV market penetration rates. The results demonstrate that the proposed model efficiently manages fluctuating traffic agent numbers, significantly outperforming the baseline rule-based models. Notably, it enhances cooperative platooning rates up to 26.2%, showcasing its potential to optimize CAV cooperation and traffic dynamics during the early stage of deployment.
zh
[AI-184] RobotDesignGPT : Automated Robot Design Synthesis using Vision Language Models
【速读】:该论文旨在解决机器人设计过程复杂且高度依赖专家经验与人工干预的问题,传统方法多基于规则,需手动定义语法或组件模块,效率低下且灵活性不足。其解决方案的关键在于提出了一种名为RobotDesignGPT的自动化机器人设计框架,该框架利用大规模预训练视觉-语言模型(vision-language models)的通用知识和推理能力,通过用户简短提示(prompt)和参考图像即可合成初始设计,并引入新颖的视觉反馈机制显著提升设计质量并减少人工校正需求,从而实现从自然形态中启发、兼具视觉吸引力与运动学合理性的机器人自动构型生成。
链接: https://arxiv.org/abs/2601.11801
作者: Nitish Sontakke,K. Niranjan Kumar,Sehoon Ha
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Robot design is a nontrivial process that involves careful consideration of multiple criteria, including user specifications, kinematic structures, and visual appearance. Therefore, the design process often relies heavily on domain expertise and significant human effort. The majority of current methods are rule-based, requiring the specification of a grammar or a set of primitive components and modules that can be composed to create a design. We propose a novel automated robot design framework, RobotDesignGPT, that leverages the general knowledge and reasoning capabilities of large pre-trained vision-language models to automate the robot design synthesis process. Our framework synthesizes an initial robot design from a simple user prompt and a reference image. Our novel visual feedback approach allows us to greatly improve the design quality and reduce unnecessary manual feedback. We demonstrate that our framework can design visually appealing and kinematically valid robots inspired by nature, ranging from legged animals to flying creatures. We justify the proposed framework by conducting an ablation study and a user study.
zh
[AI-185] PRISM: Learning Design Knowledge from Data for Stylistic Design Improvement
【速读】:该论文旨在解决非专业用户在图形设计中因缺乏风格知识而难以根据自然语言指令实现风格化改进的问题。现有视觉语言模型(VLMs)虽在图形设计任务上初显成效,但其预训练风格知识往往过于泛化且与特定设计领域数据不匹配。解决方案的关键在于利用真实世界设计数据构建一个先验设计知识库(design knowledge base),通过三个阶段实现风格感知的改进:首先对高方差设计进行聚类以捕捉风格内的多样性,其次将每个聚类总结为可操作的设计知识,最后在推理阶段检索相关知识以指导风格化修改。该方法命名为PRISM(PRior-Informed Stylistic Modification),实验表明其在Crello数据集上风格一致性平均排名达1.49(越接近1越好),优于基线方法,且用户研究验证了设计师对其一致偏好。
链接: https://arxiv.org/abs/2601.11747
作者: Huaxiaoyue Wang,Sunav Choudhary,Franck Dernoncourt,Yu Shen,Stefano Petrangeli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graphic design often involves exploring different stylistic directions, which can be time-consuming for non-experts. We address this problem of stylistically improving designs based on natural language instructions. While VLMs have shown initial success in graphic design, their pretrained knowledge on styles is often too general and misaligned with specific domain data. For example, VLMs may associate minimalism with abstract designs, whereas designers emphasize shape and color choices. Our key insight is to leverage design data – a collection of real-world designs that implicitly capture designer’s principles – to learn design knowledge and guide stylistic improvement. We propose PRISM (PRior-Informed Stylistic Modification) that constructs and applies a design knowledge base through three stages: (1) clustering high-variance designs to capture diversity within a style, (2) summarizing each cluster into actionable design knowledge, and (3) retrieving relevant knowledge during inference to enable style-aware improvement. Experiments on the Crello dataset show that PRISM achieves the highest average rank of 1.49 (closer to 1 is better) over baselines in style alignment. User studies further validate these results, showing that PRISM is consistently preferred by designers.
zh
[AI-186] PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation
【速读】:该论文旨在解决资源受限的实践者在面对日益复杂且多样化的AI政策时,难以高效实现多政策合规的问题。现有方法通常逐条处理单一政策,导致成本高昂且效率低下。其解决方案的关键在于提出PASTA系统,该系统通过四项核心创新实现可扩展的自动化合规评估:(1)统一的模型卡片(model-card)格式支持全开发阶段的描述性输入;(2)政策标准化方案以消除语义差异;(3)基于大语言模型(LLM)的高效成对评估引擎结合成本优化策略;(4)可视化界面提供可解释的合规热力图与可操作建议。实证表明,PASTA能快速完成五大主流AI政策评估(<2分钟,约3美元),且专家评价与人类专家高度一致(ρ ≥ 0.626),显著提升AI治理的可及性与实用性。
链接: https://arxiv.org/abs/2601.11702
作者: Yu Yang,Ig-Jae Kim,Dongwook Yoon
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 28 pages, 7 figures
Abstract:AI compliance is becoming increasingly critical as AI systems grow more powerful and pervasive. Yet the rapid expansion of AI policies creates substantial burdens for resource-constrained practitioners lacking policy expertise. Existing approaches typically address one policy at a time, making multi-policy compliance costly. We present PASTA, a scalable compliance tool integrating four innovations: (1) a comprehensive model-card format supporting descriptive inputs across development stages; (2) a policy normalization scheme; (3) an efficient LLM-powered pairwise evaluation engine with cost-saving strategies; and (4) an interface delivering interpretable evaluations via compliance heatmaps and actionable recommendations. Expert evaluation shows PASTA’s judgments closely align with human experts ( \rho \geq .626 ). The system evaluates five major policies in under two minutes at approximately \ 3. A user study (N = 12) confirms practitioners found outputs easy-to-understand and actionable, introducing a novel framework for scalable automated AI governance.
zh
[AI-187] SpecMap: Hierarchical LLM Agent for Datasheet-to-Code Traceability Link Recovery in Systems Engineering
【速读】:该论文旨在解决嵌入式系统中数据手册(datasheet)与代码实现之间精确可追溯性(traceability)建立的难题,尤其是在低级软件层面,传统人工映射在大规模代码库中已不可行。现有基于词法相似度和信息检索的方法难以捕捉嵌入式系统代码中普遍存在的语义、结构及符号层级关系。其解决方案的关键在于提出一种分层的数据手册到代码映射方法,利用大语言模型(Large Language Models, LLMs)进行语义分析,并通过多抽象层级的显式结构化流程逐步缩小搜索空间:首先进行仓库级结构推断,再进行文件级相关性估计,最后实现细粒度符号级对齐。该方法不仅覆盖函数,还显式包含宏(macro)、结构体(struct)、常量、配置参数和寄存器定义等系统级C/C++元素,显著优于传统信息检索基线,在多个开源嵌入式项目上实现了最高73.3%的文件映射准确率,并将LLM总token消耗降低84%,端到端运行时间减少约80%。
链接: https://arxiv.org/abs/2601.11688
作者: Vedant Nipane,Pulkit Agrawal,Amit Singh
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Establishing precise traceability between embedded systems datasheets and their corresponding code implementations remains a fundamental challenge in systems engineering, particularly for low-level software where manual mapping between specification documents and large code repositories is infeasible. Existing Traceability Link Recovery approaches primarily rely on lexical similarity and information retrieval techniques, which struggle to capture the semantic, structural, and symbol level relationships prevalent in embedded systems software. We present a hierarchical datasheet-to-code mapping methodology that employs large language models for semantic analysis while explicitly structuring the traceability process across multiple abstraction levels. Rather than performing direct specification-to-code matching, the proposed approach progressively narrows the search space through repository-level structure inference, file-level relevance estimation, and fine-grained symbollevel alignment. The method extends beyond function-centric mapping by explicitly covering macros, structs, constants, configuration parameters, and register definitions commonly found in systems-level C/C++ codebases. We evaluate the approach on multiple open-source embedded systems repositories using manually curated datasheet-to-code ground truth. Experimental results show substantial improvements over traditional information-retrieval-based baselines, achieving up to 73.3% file mapping accuracy. We significantly reduce computational overhead, lowering total LLM token consumption by 84% and end-to-end runtime by approximately 80%. This methodology supports automated analysis of large embedded software systems and enables downstream applications such as training data generation for systems-aware machine learning models, standards compliance verification, and large-scale specification coverage analysis.
zh
[AI-188] Semantic Caching and Intent-Driven Context Optimization for Multi-Agent Natural Language to Code Systems
【速读】:该论文旨在解决将自然语言查询高效、准确地转化为可执行Python代码以进行结构化数据解析的问题,尤其针对现有基于大模型(Large Language Models, LLMs)的系统在生产环境中成本高、效率低的挑战。其解决方案的关键在于三项创新:一是基于LLM的语义缓存机制,结合等价性检测与结构化适配提示,实现67%的缓存命中率;二是双阈值决策机制,区分精确匹配检索与参考引导生成,提升准确性与灵活性;三是意图驱动的动态提示组装系统,通过表感知上下文过滤降低40–60%的token消耗,从而在企业级库存管理场景中实现平均8.2秒延迟和94.3%语义准确率的高效部署。
链接: https://arxiv.org/abs/2601.11687
作者: Harmohit Singh
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a production-optimized multi-agent system designed to translate natural language queries into executable Python code for structured data analytics. Unlike systems that rely on expensive frontier models, our approach achieves high accuracy and cost efficiency through three key innovations: (1) a semantic caching system with LLM-based equivalence detection and structured adaptation hints that provides cache hit rates of 67% on production queries; (2) a dual-threshold decision mechanism that separates exact-match retrieval from reference-guided generation; and (3) an intent-driven dynamic prompt assembly system that reduces token consumption by 40-60% through table-aware context filtering. The system has been deployed in production for enterprise inventory management, processing over 10,000 queries with an average latency of 8.2 seconds and 94.3% semantic accuracy. We describe the architecture, present empirical results from production deployment, and discuss practical considerations for deploying LLM-based analytics systems at scale.
zh
[AI-189] Proof of Concept: Multi-Target Wildfire Risk Prediction and Large Language Model Synthesis
【速读】:该论文试图解决当前野火风险评估方法忽视实际操作需求的问题,从而限制了其对一线救援人员和消防服务的实用价值。解决方案的关键在于提出一种混合框架,将针对不同风险维度(如气象危险性、点火活动、干预复杂性和资源调动)的预测模型与大语言模型(Large Language Models, LLMs)相结合,以整合异构输出并生成结构化、可操作的风险报告。
链接: https://arxiv.org/abs/2601.11686
作者: Nicolas Caron,Christophe Guyeux,Hassan Noura,Benjamin Aynes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Current state-of-the-art approaches to wildfire risk assessment often overlook operational needs, limiting their practical value for first responders and firefighting services. Effective wildfire management requires a multi-target analysis that captures the diverse dimensions of wildfire risk, including meteorological danger, ignition activity, intervention complexity, and resource mobilization, rather than relying on a single predictive indicator. In this proof of concept, we propose the development of a hybrid framework that combines predictive models for each risk dimension with large language models (LLMs) to synthesize heterogeneous outputs into structured, actionable reports.
zh
[AI-190] Attesting Model Lineage by Consisted Knowledge Evolution with Fine-Tuning Trajectory USENIX-SECURITY USENIX-SECURITY2026
【速读】:该论文旨在解决开放权重模型库中因缺乏可靠机制而导致的模型溯源验证难题,特别是针对未经授权的模型再分发和虚假模型来源声明等安全问题。现有方法主要依赖静态架构相似性进行模型谱系检测,难以捕捉细粒度的知识演化过程。其解决方案的关键在于提出一种新型模型谱系验证框架,通过模型编辑技术量化微调引入的参数级变化,并设计了一种基于探针样本的知识向量化机制,将模型演化后的知识提炼为紧凑表征,进而验证跨模型间知识关系的算术一致性,从而实现对模型谱系的鲁棒识别。
链接: https://arxiv.org/abs/2601.11683
作者: Zhuoyi Shang,Jiasen Li,Pengzhen Chen,Yanwei Liu,Xiaoyan Gu,Weiping Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted to the 35th USENIX Security Symposium (USENIX Security 2026)
Abstract:The fine-tuning technique in deep learning gives rise to an emerging lineage relationship among models. This lineage provides a promising perspective for addressing security concerns such as unauthorized model redistribution and false claim of model provenance, which are particularly pressing in \textcolorblueopen-weight model libraries where robust lineage verification mechanisms are often lacking. Existing approaches to model lineage detection primarily rely on static architectural similarities, which are insufficient to capture the dynamic evolution of knowledge that underlies true lineage relationships. Drawing inspiration from the genetic mechanism of human evolution, we tackle the problem of model lineage attestation by verifying the joint trajectory of knowledge evolution and parameter modification. To this end, we propose a novel model lineage attestation framework. In our framework, model editing is first leveraged to quantify parameter-level changes introduced by fine-tuning. Subsequently, we introduce a novel knowledge vectorization mechanism that refines the evolved knowledge within the edited models into compact representations by the assistance of probe samples. The probing strategies are adapted to different types of model families. These embeddings serve as the foundation for verifying the arithmetic consistency of knowledge relationships across models, thereby enabling robust attestation of model lineage. Extensive experimental evaluations demonstrate the effectiveness and resilience of our approach in a variety of adversarial scenarios in the real world. Our method consistently achieves reliable lineage verification across a broad spectrum of model types, including classifiers, diffusion models, and large language models.
zh
[AI-191] HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
【速读】:该论文旨在解决在不可靠边缘网络中部署大语言模型(Large Language Models, LLMs)分布式推理时面临的资源受限与同步难题。现有方法通常依赖严格同步,但在网络不稳定场景下难以实现,导致显著延迟。其解决方案的关键在于提出HALO框架,通过三种核心机制实现松散但高效的同步:(1) 基于语义感知的预测器评估神经元组的重要性并提前分配;(2) 神经元组加载阶段的并行执行策略减少等待时间;(3) 负载均衡调度器协调异构资源设备。该方案有效避免了因丢包或延迟引发的过度等待,实验证明在树莓派集群上可实现3.41倍端到端加速,且性能接近理想条件下的最优表现。
链接: https://arxiv.org/abs/2601.11676
作者: Peirong Zheng,Wenchao Xu,Haozhao Wang,Jinyu Chen,Xuemin Shen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Accepted by IEEE International Conference on Computer Communications (INFOCOM) 2026
Abstract:The deployment of large language models’ (LLMs) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. However, it is critically challenged by the resource constraints of a single edge node. Distributed inference has emerged to aggregate and leverage computational resources across multiple devices. Yet, existing methods typically require strict synchronization, which is often infeasible due to the unreliable network conditions. In this paper, we propose HALO, a novel framework that can boost the distributed LLM inference in lossy edge network. The core idea is to enable a relaxed yet effective synchronization by strategically allocating less critical neuron groups to unstable devices, thus avoiding the excessive waiting time incurred by delayed packets. HALO introduces three key mechanisms: (1) a semantic-aware predictor to assess the significance of neuron groups prior to activation. (2) a parallel execution scheme of neuron group loading during the model inference. (3) a load-balancing scheduler that efficiently orchestrates multiple devices with heterogeneous resources. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions. It maintains performance comparable to optimal conditions and significantly outperforms the state-of-the-art in various scenarios.
zh
[AI-192] A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning
【速读】:该论文旨在解决半监督学习中伪标签(pseudo-label)选择策略依赖固定置信度阈值所带来的可靠性问题,即深度神经网络常出现过度自信现象:高置信度预测仍可能错误,而位于决策边界附近、信息量丰富的低置信度样本却被忽略。解决方案的关键在于提出一种基于置信度-方差(Confidence-Variance, CoVar)的理论框架,通过联合考虑最大置信度(Maximum Confidence, MC)与残差类别方差(Residual-Class Variance, RCV)来构建更可靠的伪标签筛选标准——其中RCV刻画了非最大类别的概率分布离散程度。理论推导表明,可靠的伪标签应同时具备高MC和低RCV,且随着置信度升高,RCV的影响增强,从而有效抑制过拟合但不稳定的高置信度预测。在此基础上,作者将伪标签选择建模为一个在置信度-方差特征空间中的谱松弛优化问题,并设计了一种无需人工设定阈值的选择机制,显著提升了多种主流半监督任务(如语义分割与图像分类)的性能。
链接: https://arxiv.org/abs/2601.11670
作者: Jinshi Liu,Pan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Most pseudo-label selection strategies in semi-supervised learning rely on fixed confidence thresholds, implicitly assuming that prediction confidence reliably indicates correctness. In practice, deep networks are often overconfident: high-confidence predictions can still be wrong, while informative low-confidence samples near decision boundaries are discarded. This paper introduces a Confidence-Variance (CoVar) theory framework that provides a principled joint reliability criterion for pseudo-label selection. Starting from the entropy minimization principle, we derive a reliability measure that combines maximum confidence (MC) with residual-class variance (RCV), which characterizes how probability mass is distributed over non-maximum classes. The derivation shows that reliable pseudo-labels should have both high MC and low RCV, and that the influence of RCV increases as confidence grows, thereby correcting overconfident but unstable predictions. From this perspective, we cast pseudo-label selection as a spectral relaxation problem that maximizes separability in a confidence-variance feature space, and design a threshold-free selection mechanism to distinguish high- from low-reliability predictions. We integrate CoVar as a plug-in module into representative semi-supervised semantic segmentation and image classification methods. Across PASCAL VOC 2012, Cityscapes, CIFAR-10, and Mini-ImageNet with varying label ratios and backbones, it consistently improves over strong baselines, indicating that combining confidence with residual-class variance provides a more reliable basis for pseudo-label selection than fixed confidence thresholds. (Code: this https URL)
zh
[AI-193] Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
【速读】:该论文旨在解决Transformer架构中全连接注意力(full-attention)机制因二次时间与内存复杂度导致的实际部署受限问题,以及线性注意力(linear attention)机制在提升效率时常伴随性能下降的问题。其关键解决方案包括:首先通过块级局部蒸馏(blockwise local distillation)将预训练全注意力模块的权重迁移至对应的线性注意力模块,实现高效初始化;其次提出一种贪心层替换策略(greedy layer replacement strategy),在不进行昂贵重训练或神经架构搜索的前提下,迭代地用线性注意力模块替代全注意力模块,同时监控目标任务验证性能,从而在单次高效遍历中生成任务特定的混合模型(hybrid model)。
链接: https://arxiv.org/abs/2601.11667
作者: Xiaojie Xia,Huigang Zhang,Chaoliang Zhong,Jun Sun,Yusuke Oishi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We address both issues by first transferring weights from the pretrained full-attention modules to its linear attention counterparts through blockwise local distillation, and second, introducing a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. This yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
zh
[AI-194] Serverless AI Security: Attack Surface Analysis and Runtime Protection Mechanisms for FaaS-Based Machine Learning
【速读】:该论文旨在解决机器学习(Machine Learning, ML)工作负载在无服务器计算(Serverless Computing)环境中所面临的安全挑战,尤其是由其分布式、动态性和供应链复杂性带来的新型攻击面。研究系统地识别并分析了五类关键安全风险:函数级漏洞(如冷启动攻击、依赖项污染)、模型特有威胁(如API提取、对抗输入)、基础设施攻击(如跨函数污染、权限提升)、供应链风险(如恶意层、后门库)以及身份与访问管理(Identity and Access Management, IAM)复杂性。解决方案的核心是提出Serverless AI Shield(SAS),一个多层次防御框架,涵盖部署前验证、运行时监控和执行后取证三个阶段,实现对上述威胁的高精度检测(94%检测率)且性能开销低于9%的推理延迟,显著提升了无服务器环境下AI系统的安全性与韧性。
链接: https://arxiv.org/abs/2601.11664
作者: Chetan Pathade,Vinod Dhimam,Sheheryar Ahmad,Ilsa Lareb
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 Pages, 2 Figures, 4 Tables
Abstract:Serverless computing has achieved widespread adoption, with over 70% of AWS organizations using serverless solutions [1]. Meanwhile, machine learning inference workloads increasingly migrate to Function-as-a-Service (FaaS) platforms for their scalability and cost-efficiency [2], [3], [4]. However, this convergence introduces critical security challenges, with recent reports showing a 220% increase in AI/ML vulnerabilities [5] and serverless computing’s fragmented architecture raises new security concerns distinct from traditional cloud deployments [6], [7]. This paper presents the first comprehensive security analysis of machine learning workloads in serverless environments. We systematically characterize the attack surface across five categories: function-level vulnerabilities (cold start exploitation, dependency poisoning), model-specific threats (API-based extraction, adversarial inputs), infrastructure attacks (cross-function contamination, privilege escalation), supply chain risks (malicious layers, backdoored libraries), and IAM complexity (ephemeral nature, serverless functions). Through empirical assessments across AWS Lambda, Azure Functions, and Google Cloud Functions, we demonstrate real-world attack scenarios and quantify their security impact. We propose Serverless AI Shield (SAS), a multi-layered defense framework providing pre-deployment validation, runtime monitoring, and post-execution forensics. Our evaluation shows SAS achieves 94% detection rates while maintaining performance overhead below 9% for inference latency. We release an open-source security toolkit to enable practitioners to assess and harden their serverless AI deployments, advancing the field toward more resilient cloud-native machine learning systems.
zh
[AI-195] Activation Sensitivity as a Unifying Principle for Post-Training Quantization
【速读】:该论文旨在解决后训练量化(Post-training Quantization, PTQ)方法中缺乏统一理论框架的问题,即现有方法如AWQ(基于激活感知)和GPTQ(基于二阶统计)虽在实践中表现优异,但其内在机制不明确、概念碎片化,难以系统性理解与比较。解决方案的关键在于提出一个统一的理论框架——通过形式化“激活敏感性”(activation sensitivity),定义为通道扰动对损失函数的期望影响,并利用一阶泰勒展开推导出敏感性等价于梯度加权激活的平方范数,从而构建了一个融合激活幅度与下游误差传播效应的通道重要性度量。在此基础上,AWQ与GPTQ被重新诠释为在不同简化假设下对敏感性的互补近似,实现了对现有PTQ方法的理论统一与深层解释。
链接: https://arxiv.org/abs/2601.11663
作者: Bruce Changlong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Post-training quantization (PTQ) methods for large language models rely on heuristics that implicitly estimate which weight channels most strongly influence model behavior. Two dominant paradigms have emerged: activation-aware methods such as AWQ prioritize channels with large activation magnitudes, while second-order methods such as GPTQ allocate quantization error according to input covariance structure. Despite strong empirical performance, these approaches remain conceptually fragmented, and it is unclear what underlying quantity they are approximating. In this work, we present a unified theoretical framework for PTQ by formalizing activation sensitivity, defined as the expected impact of channel-wise perturbations on the loss. Using a first-order Taylor expansion, we show that sensitivity naturally arises as the squared norm of gradient-weighted activations, yielding a principled measure of channel importance that captures both activation magnitude and downstream error propagation. Within this framework, AWQ and GPTQ can be interpreted as complementary approximations that recover sensitivity under distinct simplifying assumptions. We analyze the design space of sensitivity metrics, connect gradient-based saliency, Fisher information, and Hessian-based criteria, and clarify their relationships to classical pruning methods such as Optimal Brain Damage and Optimal Brain Surgeon. Rather than proposing a new quantization algorithm, this work provides a conceptual foundation for understanding and comparing post-training quantization methods through the lens of sensitivity.
zh
[AI-196] Size is Not the Solution: Deformable Convolutions for Effective Physics Aware Deep Learning
【速读】:该论文旨在解决当前卷积神经网络(Convolutional Neural Network, CNN)在建模高度非线性流体动力学系统时面临的局限性,尤其是在物理感知深度学习(Physics-aware Deep Learning, PADL)中,单纯通过扩大模型参数规模难以有效提升预测精度的问题。其解决方案的关键在于提出一种基于变形物理感知循环卷积(Deformable Physics-aware Recurrent Convolutions, D-PARC)的新架构,该架构受混合拉格朗日-欧拉(Hybrid Lagrangian-Eulerian, HLE)数值方法启发,使卷积核具备动态适应能力,从而实现对高应变区域的自主资源聚焦与低梯度区域的粗化处理,形成一种类“主动滤波”的学习策略,显著优于传统 h- 或 p-自适应机制,并证明了在参数量更少的情况下仍可获得更高保真度的物理模拟结果。
链接: https://arxiv.org/abs/2601.11657
作者: Jack T. Beerman,Shobhan Roy,H.S. Udaykumar,Stephen S. Baek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Physics-aware deep learning (PADL) enables rapid prediction of complex physical systems, yet current convolutional neural network (CNN) architectures struggle with highly nonlinear flows. While scaling model size addresses complexity in broader AI, this approach yields diminishing returns for physics modeling. Drawing inspiration from Hybrid Lagrangian-Eulerian (HLE) numerical methods, we introduce deformable physics-aware recurrent convolutions (D-PARC) to overcome the rigidity of CNNs. Across Burgers’ equation, Navier-Stokes, and reactive flows, D-PARC achieves superior fidelity compared to substantially larger architectures. Analysis reveals that kernels display anti-clustering behavior, evolving into a learned “active filtration” strategy distinct from traditional h- or p-adaptivity. Effective receptive field analysis confirms that D-PARC autonomously concentrates resources in high-strain regions while coarsening focus elsewhere, mirroring adaptive refinement in computational mechanics. This demonstrates that physically intuitive architectural design can outperform parameter scaling, establishing that strategic learning in lean networks offers a more effective path forward for PADL than indiscriminate network expansion.
zh
[AI-197] WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)推理任务在边缘与云端资源分布不均的问题,即大量推理请求由边缘设备发起但在中心化GPU集群中执行,导致数据中心计算负载激增而边缘设备利用率低,造成网络层面的资源效率低下。针对这一挑战,作者识别出两个关键瓶颈:浪费的草稿时间(Wasted Drafting Time) 和 验证干扰(Verification Interference),并提出WISP系统作为解决方案——其核心是一个面向服务等级目标(SLO-aware)的分布式LLM推理架构,包含智能推测控制器、验证时间估算器和验证批调度器三个组件,协同优化草稿生成效率与服务器端验证请求的调度策略,从而显著提升系统吞吐量和资源利用率。
链接: https://arxiv.org/abs/2601.11652
作者: Xiangchen Li,Jiakun Fan,Qingyuan Wang,Dimitrios Spatharakis,Saeid Ghafouri,Hans Vandierendonck,Deepu John,Bo Ji,Ali R. Butt,Dimitrios S. Nikolopoulos
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 28 Pages, 11 Figures, 12 Tables
Abstract:As Large Language Models (LLMs) become increasingly accessible to end users, an ever-growing number of inference requests are initiated from edge devices and computed on centralized GPU clusters. However, the resulting exponential growth in computation workload is placing significant strain on data centers, while edge devices remain largely underutilized, leading to imbalanced workloads and resource inefficiency across the network. Integrating edge devices into the LLM inference process via speculative decoding helps balance the workload between the edge and the cloud, while maintaining lossless prediction accuracy. In this paper, we identify and formalize two critical bottlenecks that limit the efficiency and scalability of distributed speculative LLM serving: Wasted Drafting Time and Verification Interference. To address these challenges, we propose WISP, an efficient and SLO-aware distributed LLM inference system that consists of an intelligent speculation controller, a verification time estimator, and a verification batch scheduler. These components collaboratively enhance drafting efficiency and optimize verification request scheduling on the server. Extensive numerical results show that WISP improves system capacity by up to 2.1x and 4.1x, and increases system goodput by up to 1.94x and 3.7x, compared to centralized serving and SLED, respectively.
zh
[AI-198] Reinforcement Learning for Dynamic Workflow Optimization in CI/CD Pipelines
【速读】:该论文旨在解决现代持续集成与持续部署(CI/CD)流水线因静态工作流在系统规模扩大时引入效率低下问题。其解决方案的关键在于将CI/CD流程建模为马尔可夫决策过程(Markov Decision Process),并训练强化学习(Reinforcement Learning, RL)代理在运行时动态决策测试执行策略(全量、部分或不执行),以在保证缺陷漏检率低于5%的前提下最大化吞吐量并最小化测试开销。实验表明,该方法相较静态基线可提升30%吞吐量、减少约25%测试时间,同时通过智能跳过低风险提交的冗余测试来加速反馈循环,从而实现自适应、智能化的DevOps自动化。
链接: https://arxiv.org/abs/2601.11647
作者: Aniket Abhishek Soni,Milan Parikh,Rashi Nimesh Kumar Dhenia,Jubin Abhishek Soni,Ayush Raj Jha,Sneja Mitinbhai Shah
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted and presented at CICN 2025 (International Conference on Computational Intelligence and Communication Networks). 7 pages, 5 figures
Abstract:Continuous Integration and Continuous Deployment (CI/CD) pipelines are central to modern software delivery, yet their static workflows often introduce inefficiencies as systems scale. This paper proposes a reinforcement learning (RL) based approach to dynamically optimize CI/CD pipeline workflows. The pipeline is modeled as a Markov Decision Process, and an RL agent is trained to make runtime decisions such as selecting full, partial, or no test execution in order to maximize throughput while minimizing testing overhead. A configurable CI/CD simulation environment is developed to evaluate the approach across build, test, and deploy stages. Experimental results show that the RL optimized pipeline achieves up to a 30 percent improvement in throughput and approximately a 25 percent reduction in test execution time compared to static baselines, while maintaining a defect miss rate below 5 percent. The agent learns to selectively skip or abbreviate tests for low risk commits, accelerating feedback cycles without significantly increasing failure risk. These results demonstrate the potential of reinforcement learning to enable adaptive and intelligent DevOps workflows, providing a practical pathway toward more efficient, resilient, and sustainable CI/CD automation. Comments: Accepted and presented at CICN 2025 (International Conference on Computational Intelligence and Communication Networks). 7 pages, 5 figures Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) ACMclasses: D.2.8; I.2.6; D.2.2 Cite as: arXiv:2601.11647 [cs.SE] (or arXiv:2601.11647v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2601.11647 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-199] Syllabic Agglutinative Tokenizations for Indonesian LLM : A Study from Gasing Literacy Learning System
【速读】:该论文旨在解决印尼语大型语言模型(Large Language Model, LLM)在文本分词(tokenization)过程中对语言形态结构利用不足的问题,尤其针对印尼语的黏着性(agglutinative)特性导致传统基于字节对编码(Byte-Pair Encoding, BPE)的分词方法难以有效保留语义单元的问题。解决方案的关键在于提出一种基于音节的分词框架(syllable-based tokenization),首先通过规则驱动的音节分割识别高频音节,再结合BPE构建一个仅含3,500个词元的紧凑词汇表,同时保留字符级回退机制以保障覆盖率;该方法通过将字符级依赖关系内嵌于音节单元中,显著提升了信息效率(Rényi效率达0.74),并维持了较长的平均词元长度(3.67字符),优于现有多语言预训练模型(0.50–0.64),从而实现了语言学合理性与计算效率的协同优化。
链接: https://arxiv.org/abs/2601.11643
作者: H. Situngkir,A.B. Lumbantobing,Y. Surya
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figures
Abstract:This paper presents a novel syllable-based tokenization approach for Indonesian large language models, inspired by the Gasing Literacy Learning System’s pedagogical methodology. Drawing on information-theoretic principles, we develop a tokenization framework that segments Indonesian text at syllable boundaries before applying byte-pair encoding, creating a vocabulary that aligns with the language’s morphophonological structure. Our approach first identifies high-frequency syllables through rule-based segmentation, then constructs a compact vocabulary of 3,500 tokens that preserves meaningful linguistic units while maintaining coverage through character-level fallback. Empirical evaluation on Indonesian Wikipedia and folklore corpora from Indonesian Culture Digital Library (PDBI) demonstrates substantial improvements over conventional tokenization methods: the syllable-based approach achieves Rényi efficiency of 0.74 compared to 0.50-0.64 for pretrained multilingual tokenizers, while maintaining higher average token lengths (3.67 characters versus 2.72 for GPT-2) despite using a vocabulary an order of magnitude smaller. These gains emerge from the method’s ability to internalize character-level dependencies within syllable units, reducing the computational burden on language models while respecting Indonesian’s agglutinative morphology. We call the LLM built upon this principle, TOBA LLM (Tokenisasi Optimum Berbasis Aglutinasi), the convergence of human literacy pedagogy with computational optimization principles offers a promising paradigm for developing linguistically-informed tokenization strategies, particularly for morphologically rich and underrepresented languages in natural language processing.
zh
[AI-200] Reasoning Stabilization Point: A Training-Time Signal for Stable Evidence and Shortcut Reliance ACL
【速读】:该论文旨在解决预训练语言模型在微调过程中,尽管任务性能提升,但其决策依据(即模型依赖的证据)可能悄然发生变化的问题。这种变化可能导致模型对特定输入特征(如标签相关触发词)产生过度依赖,从而影响鲁棒性和可解释性。解决方案的关键在于提出“解释漂移”(explanation drift)这一概念,通过追踪固定探测集上各token归因值随微调轮次的变化来量化证据演变,并引入“推理稳定点”(Reasoning Stabilization Point, RSP),即归因漂移首次稳定低水平的最早epoch。RSP无需外部分布数据调参,仅基于训练过程内的归因动态即可识别出决策证据趋于稳定的检查点,从而为微调提供一种低成本、高效率的诊断工具,确保模型不仅性能良好,且推理逻辑稳定可靠。
链接: https://arxiv.org/abs/2601.11625
作者: Sahil Rajesh Dhayalkar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, Submitted to ACL Rolling Review and is under review
Abstract:Fine-tuning pretrained language models can improve task performance while subtly altering the evidence a model relies on. We propose a training-time interpretability view that tracks token-level attributions across finetuning epochs. We define explanation driftas the epoch-to-epoch change in normalized token attributions on a fixed probe set, and introduce the Reasoning Stabilization Point(RSP), the earliest epoch after which drift remains consistently low. RSP is computed from within-run drift dynamics and requires no tuning on out-of-distribution data. Across multiple lightweight transformer classifiers and benchmark classification tasks, drift typically collapses into a low, stable regime early in training, while validation accuracy continues to change only marginally. In a controlled shortcut setting with label-correlated trigger tokens, attribution dynamics expose increasing reliance on the shortcut even when validation accuracy remains competitive. Overall, explanation drift provides a simple, low-cost diagnostic for monitoring how decision evidence evolves during fine-tuning and for selecting checkpoints in a stable-evidence regime.
zh
[AI-201] Dynamical Systems Analysis Reveals Functional Regimes in Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自回归生成过程中,其内部高维动态行为的时间组织结构尚不明确的问题。现有解释性方法多聚焦于静态表征或因果干预,忽视了时间维度上的动力学特性。解决方案的关键在于借鉴神经科学中“时间整合”(temporal integration)与“亚稳态”(metastability)的概念,提出一种基于激活时序数据的复合动力学指标,并在GPT-2-medium模型上对五种不同功能状态进行评估。结果表明,该指标能有效区分结构化推理与其他噪声、重复及扰动条件下的计算组织差异,且具有统计显著性和大效应量,证明了神经科学启发的动力学度量可作为刻画LLMs功能状态变化的可靠工具。
链接: https://arxiv.org/abs/2601.11622
作者: Hassan Ugail,Newton Howard
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models perform text generation through high-dimensional internal dynamics, yet the temporal organisation of these dynamics remains poorly understood. Most interpretability approaches emphasise static representations or causal interventions, leaving temporal structure largely unexplored. Drawing on neuroscience, where temporal integration and metastability are core markers of neural organisation, we adapt these concepts to transformer models and discuss a composite dynamical metric, computed from activation time-series during autoregressive generation. We evaluate this metric in GPT-2-medium across five conditions: structured reasoning, forced repetition, high-temperature noisy sampling, attention-head pruning, and weight-noise injection. Structured reasoning consistently exhibits elevated metric relative to repetitive, noisy, and perturbed regimes, with statistically significant differences confirmed by one-way ANOVA and large effect sizes in key comparisons. These results are robust to layer selection, channel subsampling, and random seeds. Our findings demonstrate that neuroscience-inspired dynamical metrics can reliably characterise differences in computational organisation across functional regimes in large language models. We stress that the proposed metric captures formal dynamical properties and does not imply subjective experience.
zh
[AI-202] A Mind Cannot Be Smeared Across Time
【速读】:该论文试图解决的问题是:机器是否具备意识(consciousness)不仅取决于其计算内容,还取决于计算发生的时间特性。现有大多数人工智能系统依赖于串行或时间复用的更新机制,而人类意识体验则具有统一性和同时性特征,这种差异可能对实现机器意识构成根本限制。论文的关键解决方案在于引入时窗轨迹(windowed trajectories)与时序语义形式化框架,通过扩展栈理论(Stack Theory)并定义存在性时序实现算子 ◊Δ,证明其不保持合取(conjunction)性质——即系统可以在不同时间点分别实现经验的各个组成部分,但无法真正实例化这些成分的同时联合状态。进一步区分了两种同步假设:强同步(StrongSync)要求在时窗内客观共时实现基础合取,弱同步(WeakSync)允许时间“弥散”;并通过提出**并发能力(concurrency capacity)**作为衡量指标,指出在严格串行硬件上无法满足强同步条件,从而论证意识归属必须基于架构分析而非仅功能表现。
链接: https://arxiv.org/abs/2601.11620
作者: Michael Timothy Bennett
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Whether machines can be conscious depends not only on what they compute, but \emphwhen they compute it. Most deployed artificial systems realise their functions via sequential or time-multiplexed updates. Conscious experience appears unified and simultaneous. I show that this difference matters formally. I augment Stack Theory with algebraic laws relating within time-window constraint satisfaction to conjunction. I introduce a precise temporal semantics over windowed trajectories \tau^\Delta,s and prove that existential temporal realisation \Diamond_\Delta does not preserve conjunction. A system can realise all the ingredients of experience across time without ever instantiating the experienced conjunction itself. I then distinguish two postulates. StrongSync requires objective co-instantiation of the grounded conjunction within the window, while WeakSync permits temporal ``smearing’'. I formalise concurrency-capacity to measure what is needed to satisfy StrongSync. Finally, I review neurophysiological evidence suggesting that consciousness depends on phase synchrony and effective connectivity, and that loss of consciousness is often associated with its breakdown. This evidence makes WeakSync less plausible. Under StrongSync, software consciousness on strictly sequential substrates is impossible for contents whose grounding requires two or more simultaneous contributors. The more parts from which simultaneous contribution required, the more concurrency capacity is required. The hardware matters. Consciousness attribution therefore requires architectural inspection, not just functional performance.
zh
[AI-203] NoiseFormer – Noise Diffused Symmetric Attention Transformer
【速读】:该论文旨在解决大规模Transformer模型在训练和推理过程中因参数量巨大而导致的计算资源消耗高、难以部署于单个GPU或AI加速器的问题。解决方案的关键在于提出一种名为“噪声扩散对称注意力Transformer”(Noise Diffused Symmetric Attention Transformer)的新架构,该架构基于对称点积注意力(Symmetric Dot-Product Attention)进行改进,在保持其内存效率优势的同时,通过引入微小的参数与计算开销,显著提升了模型在GLUE基准任务上的准确率及推理时采样效率,实现了性能与模型压缩之间的有效平衡。
链接: https://arxiv.org/abs/2601.11619
作者: Phani Kumar,Nyshadham,Jyothendra Varma,Polisetty V R K,Aditya Rathore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer architecture has been very successful long runner in the field of Deep Learning (DL) and Large Language Models (LLM) because of its powerful attention-based learning and parallel-natured architecture. As the models grow gigantic in terms of memory footprint, difficulties in fitting the model on a device like a GPU or an AI accelerator give rise to the need for multiple computing devices thereby escalating the computing cost. This increased training/inference cost paved the way for efficient model size reduction/parametric reduction deploying Sparse Attention techniques. In this paper, we start analyzing one of the techniques of Sparse Attention called Symmetric Dot-Product Attention (referred to as Symmetric Attention) and propose a novel unified model architecture called Noise Diffused Symmetric Attention Transformer to enhance the model’s performance. While maintaining the memory gains of Symmetric Attention, with minute overhead in terms of model parameters and computational overhead, the proposed model brings in enhanced performance in terms of accuracy and inference-time sampling. The proposed model is validated upon GPT2 base model and the results reflect the performance gains falling between plain Symmetric attention and GPT2 base model on a variety of GLUE benchmark tasks in terms of accuracy, with significant model size reduction with respect to the base model.
zh
[AI-204] Geometric Attention: A Regime-Explicit Operator Semantics for Transformer Attention
【速读】:该论文旨在解决注意力机制(Attention Mechanism)在理论建模与结构设计上的不统一问题,即如何从数学上刻画注意力层的通用构成要素,并区分其不变结构与可变建模选择。解决方案的关键在于提出几何注意力(Geometric Attention, GA)框架,通过四个独立输入定义注意力层:有限载体(finite carrier,决定可访问索引)、证据核规则(evidence-kernel rule,生成非负权重)、探测族(probe family,确定可测可观量)和锚定/更新规则(anchor/update rule,选择并应用代表性核)。该框架揭示了探测族诱导核之间的操作等价关系(即规范变换,gauge),并通过标量关系工作表示和乘法组合律推导出指数形式的可接受链接族(对应Gibbs权重),结合行锚定可包含Softmax核作为子情形。进一步地,对一元行/列分数场进行商化后,剩余交互项具有标准秩-r正则形式(Eckart-Young/SVD分解),点积分数图实现低秩交互机制;固定载体并扩展更新规则可得到标准Transformer注意力算子,允许载体动态更新则支持自适应载体与分阶段深度架构。此形式语言清晰分离了注意力机制的不变结构与模型选择空间,从而为注意力机制及其架构提供可比较、可扩展的理论基础。
链接: https://arxiv.org/abs/2601.11618
作者: Luis Rosario Freytes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 57 pages
Abstract:Geometric Attention (GA) specifies an attention layer by four independent inputs: a finite carrier (what indices are addressable), an evidence-kernel rule (how masked proto-scores and a link induce nonnegative weights), a probe family (which observables are treated as admissible), and an anchor/update rule (which representative kernel is selected and how it is applied). Probe families induce an operational equivalence relation on kernels and therefore a gauge; anchors select representatives relative to that probe. Under a scalar relational-work representation and a multiplicative compositionality law for evidence, the admissible link family is exponential, yielding Gibbs weights; with row anchoring this includes the softmax kernel family as a subregime. After quotienting unary row/column score fields, the remaining interaction component admits a canonical rank-r normal form (Eckart-Young/SVD); dot-product score charts implement the corresponding low-rank interaction regime. Fixing the carrier and extensionalizing the update yields the standard fixed-token Transformer attention operator; allowing carrier updates yields adaptive-carrier and staged-depth regimes. The operator language also supports multihead/mixed kernels, plan-based anchors (e.g., entropic OT/Sinkhorn), and unary operators (e.g., FFN-style fields) as explicit regime choices. This separates invariant structure from modeling choice, enabling principled comparison and extension of attention mechanisms, and attention-based architectures.
zh
[AI-205] Multifaceted Scenario-Aware Hypergraph Learning for Next POI Recommendation
【速读】:该论文旨在解决基于位置的社交网络(Location-Based Social Networks, LBSNs)中下一兴趣点(Next Point-of-Interest, POI)推荐存在的两个核心问题:一是现有顺序和图模型未能充分捕捉不同情境场景(如游客与本地人)下的显著移动模式差异,二是无法有效化解跨场景间的优化冲突,导致推荐性能受限。其解决方案的关键在于提出多面情景感知超图学习方法(Multifaceted Scenario-Aware Hypergraph Learning, MSAHG),通过构建情境特异性的多视角解耦子超图来建模不同场景下的独特移动行为,并引入参数分割机制以自适应地调和跨场景优化方向的冲突,同时保持模型的泛化能力。
链接: https://arxiv.org/abs/2601.11610
作者: Yuxi Lin,Yongkang Li,Jie Xing,Zipei Fan
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:
Abstract:Among the diverse services provided by Location-Based Social Networks (LBSNs), Next Point-of-Interest (POI) recommendation plays a crucial role in inferring user preferences from historical check-in trajectories. However, existing sequential and graph-based methods frequently neglect significant mobility variations across distinct contextual scenarios (e.g., tourists versus locals). This oversight results in suboptimal performance due to two fundamental limitations: the inability to capture scenario-specific features and the failure to resolve inherent inter-scenario conflicts. To overcome these limitations, we propose the Multifaceted Scenario-Aware Hypergraph Learning method (MSAHG), a framework that adopts a scenario-splitting paradigm for next POI recommendation. Our main contributions are: (1) Construction of scenario-specific, multi-view disentangled sub-hypergraphs to capture distinct mobility patterns; (2) A parameter-splitting mechanism to adaptively resolve conflicting optimization directions across scenarios while preserving generalization capability. Extensive experiments on three real-world datasets demonstrate that MSAHG consistently outperforms five state-of-the-art methods across diverse scenarios, confirming its effectiveness in multi-scenario POI recommendation. Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI) ACMclasses: I.2.0 Cite as: arXiv:2601.11610 [cs.SI] (or arXiv:2601.11610v1 [cs.SI] for this version) https://doi.org/10.48550/arXiv.2601.11610 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Yuxi Lin [view email] [v1] Fri, 9 Jan 2026 06:29:55 UTC (1,564 KB) Full-text links: Access Paper: View a PDF of the paper titled Multifaceted Scenario-Aware Hypergraph Learning for Next POI Recommendation, by Yuxi Lin and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.SI prev | next new | recent | 2026-01 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh
[AI-206] Hardware-Aware Reformulation of Convolutions for Efficient Execution on Specialized AI Hardware: A Case Study on NVIDIA Tensor Cores
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在部署到专用AI硬件(如NVIDIA Tensor Cores和CPU上的oneDNN框架)时,因硬件对输入通道数的对齐要求(如必须为8或512的倍数)而导致的效率瓶颈问题。传统方法通过零填充(zero-padding)来满足对齐条件,但这种方式存在计算冗余且不高效。论文的关键解决方案是提出一种硬件感知的重写规则重构方法,在不修改模型权重的前提下,对训练后的CNN计算进行数学层面的重新表述,从而在推理阶段自动满足硬件对齐约束,实现无需零填充的高效执行。这一方法标志着“语义调优”(semantic tuning)策略的初步探索,为未来在不同硬件平台上系统性优化CNN部署提供了通用框架。
链接: https://arxiv.org/abs/2601.11608
作者: Ganesh Bikshandi
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Convolutional Neural Networks (CNNs) are central to modern AI, but their performance is often limited by hardware constraints. NVIDIA Tensor Cores, for instance, require input channels to be multiples of 8 and sometimes 512 for efficient execution. \em oneDNN framework for CPU imposes such a requirement for the blocked format. Traditional approaches address such alignment issue using zero-padding, which can be inefficient. In this work, we present a first-step, hardware-aware reformulation of CNN computations using rewrite rules, restructuring the underlying math to satisfy hardware alignment entirely \bf post-training without modifying network weights. While our current implementation focuses on a single transformation for Tensor Cores, this approach is generalizable, laying the foundation to explore additional transformations for CPU and accelerators. This study represents an initial step toward \em semantic tuning, a systematic, hardware-aware optimization strategy for efficient deployment of CNN models on specialized AI hardware.
zh
[AI-207] Hindsight Preference Replay Improves Preference-Conditioned Multi-Objective Reinforcement Learning
【速读】:该论文旨在解决多目标强化学习(Multi-objective Reinforcement Learning, MORL)中偏好条件下的策略优化问题,特别是如何更高效地利用历史数据以提升在不同用户偏好下的性能表现。现有方法如CAPQL(Preference-conditioned Actor-Critic)虽能根据权重向量 $ w $ 条件化策略,但其仅使用与特定偏好对应的数据,导致其他偏好的离线数据被闲置,造成样本效率低下。解决方案的关键在于提出一种通用且简单的回放缓冲区增强策略——事后偏好重标注(Hindsight Preference Replay, HPR),该方法通过将存储的转移状态 retroactively 重新标注为替代偏好,从而在不改变CAPQL架构和损失函数的前提下,在偏好单纯形(preference simplex)上密集化监督信号,显著提升多目标优化效果。
链接: https://arxiv.org/abs/2601.11604
作者: Jonaid Shianifar,Michael Schukat,Karl Mason
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-objective reinforcement learning (MORL) enables agents to optimize vector-valued rewards while respecting user preferences. CAPQL, a preference-conditioned actor-critic method, achieves this by conditioning on weight vectors w and restricts data usage to the specific preferences under which it was collected, leaving off-policy data from other preferences unused. We introduce Hindsight Preference Replay (HPR), a simple and general replay augmentation strategy that retroactively relabels stored transitions with alternative preferences. This densifies supervision across the preference simplex without altering the CAPQL architecture or loss functions. Evaluated on six MO-Gymnasium locomotion tasks at a fixed 300000-step budget using expected utility (EUM), hypervolume (HV), and sparsity, HPR-CAPQL improves HV in five of six environments and EUM in four of six. On mo-humanoid-v5, for instance, EUM rises from 323!\pm!125 to 1613!\pm!464 and HV from 0.52M to 9.63M, with strong statistical support. mo-halfcheetah-v5 remains a challenging exception where CAPQL attains higher HV at comparable EUM. We report final summaries and Pareto-front visualizations across all tasks.
zh
[AI-208] oward Youth-Centered Privacy-by-Design in Smart Devices: A Systematic Review
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)赋能的智能设备中青少年数据隐私保护不足的问题。其核心挑战在于当前隐私保护措施在技术、政策与教育三个维度存在显著失衡,导致实际防护效果有限。解决方案的关键在于构建一个多利益相关方协同机制,即政策制定者、制造商与教育机构共同参与设计包容性、透明且情境敏感的隐私生态系统,从而弥补现有以技术手段为主(占67%)的单一路径局限,推动从被动合规向主动治理转型。
链接: https://arxiv.org/abs/2601.11598
作者: Molly Campbell,Mohamad Sheikho Al Jasem,Ajay Kumar Shrestha
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: To appear in the IEEE CCWC 2026 proceedings
Abstract:This literature review evaluates privacy-by-design frameworks, tools, and policies intended to protect youth in AI-enabled smart devices using a PRISMA-guided workflow. Sources from major academic and grey-literature repositories from the past decade were screened. The search identified 2,216 records; after deduplication and screening, 645 articles underwent eligibility assessment, and 122 were included for analysis. The corpus was organized along three thematic categories: technical solutions, policy/regulatory measures, and education/awareness strategies. Findings reveal that while technical interventions such as on-device processing, federated learning, and lightweight encryption significantly reduce data exposure, their adoption remains limited. Policy frameworks, including the EU’s GDPR, the UK Age-Appropriate Design Code, and Canada’s PIPEDA, provide important baselines but are hindered by gaps in enforcement and age-appropriate design obligations, while educational initiatives are rarely integrated systematically into curricula. Overall, the corpus skews toward technical solutions (67%) relative to policy (21%) and education (12%), indicating an implementation gap outside the technical domain. To address these challenges, we recommend a multi-stakeholder model in which policymakers, manufacturers, and educators co-develop inclusive, transparent, and context-sensitive privacy ecosystems. This work advances discourse on youth data protection by offering empirically grounded insights and actionable recommendations for the design of ethical, privacy-preserving AI systems tailored to young users.
zh
[AI-209] EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend
【速读】:该论文旨在解决当前多模态大模型推理系统中因采用单体架构导致的资源利用效率低下和吞吐量受限的问题。现有系统将编码(Encode)、预填充(Prefill)和解码(Decode)三个阶段紧密耦合在同质硬件上,忽视了各阶段异构的计算特性,从而造成性能瓶颈。其解决方案的关键在于提出EPD-Serve——一种基于阶段级解耦的多模态模型推理服务系统:通过逻辑隔离三个阶段并支持动态编排实现灵活部署;利用Ascend互联拓扑引入异步特征预取机制优化Encode与Prefill间的跨节点通信,并设计分层分组KV缓存传输机制提升Prefill到Decode阶段的通信效率;同时结合多路径调度、实例级负载均衡及多阶段硬件共置与空间复用策略,显著提升高并发场景下的端到端吞吐量(较PD解耦部署提升57.37%-69.48%),并在满足严格SLO约束(TTFT < 2000 ms,TPOT < 50 ms)的前提下实现高效推理。
链接: https://arxiv.org/abs/2601.11590
作者: Fan Bai,Pai Peng,Zhengzhi Tang,Zhe Wang,Gong Chen,Xiang Lu,Yinuo Li,Huan Lin,Weizhe Lin,Yaoyuan Wang,Xiaosong Li
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:With the widespread adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. However, existing multimodal inference systems typically employ monolithic architectures that tightly couple the Encode, Prefill, and Decode stages on homogeneous hardware, neglecting the heterogeneous computational characteristics of each stage. This design leads to inefficient resource utilization and limited system throughput. To address these issues, we propose EPD-Serve, a stage-level disaggregated inference serving system for multimodal models. EPD-Serve decouples the inference pipeline into independent Encode, Prefill, and Decode stages, enabling logical isolation and flexible co-located deployment through dynamic orchestration. Leveraging the Ascend interconnect topology, EPD-Serve introduces asynchronous feature prefetching between Encode and Prefill stages and a hierarchical grouped KV cache transmission mechanism between Prefill and Decode stages to improve cross-node communication efficiency. In addition, EPD-Serve incorporates multi-route scheduling, instance-level load balancing, and multi-stage hardware co-location with spatial multiplexing to better support diverse multimodal workloads. Comprehensive experiments on multimodal understanding models demonstrate that, under high-concurrency scenarios, EPD-Serve improves end-to-end throughput by 57.37-69.48% compared to PD-disaggregated deployment, while satisfying strict SLO constraints, including TTFT below 2000 ms and TPOT below 50 ms. These results highlight the effectiveness of stage-level disaggregation for optimizing multimodal large model inference systems.
zh
[AI-210] PLA-Serve: A Prefill-Length-Aware LLM Serving System
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)服务中因提示词长度差异导致的首 token 时间延迟(Time to First Token, TTFT)过高问题,尤其在多请求并发、异构负载场景下,传统统一调度策略难以适配不同 prompt 长度带来的性能瓶颈。解决方案的关键在于提出 PLA-Serve 系统,其核心创新包括:1)基于提示词长度对请求进行拆分(disaggregation),将长前缀请求与短前缀请求分离处理;2)设计长度感知的智能批处理机制(length-aware smart batching),针对短前缀请求引入批处理等待窗口与 CUDA Graph 基于聚类的优化方法,以降低计算干扰和批处理延迟;3)采用双队列架构支持单实例的时间维度拆分或跨实例的空间维度拆分,实现灵活调度。该方案显著降低了预填充阶段延迟并提升了服务质量目标(Service Level Objective, SLO)达标率,在高并发和混合请求场景下有效提升了吞吐量。
链接: https://arxiv.org/abs/2601.11589
作者: Jianshu She,Zonghang Li,Hongchao Du,Shangyu Wu,Wenhao Zheng,Eric Xing,Zhengzhong Liu,Huaxiu Yao,Jason Xue,Qirong Ho
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:PLA-Serve identifies and disaggregates requests with different prompt lengths in LLM serving to reduce TTFT latency. While recent systems have decoupled the prefill and decode stages to improve throughput, they still rely on unified scheduling policies that fail to adapt to heterogeneous workload characteristics. We observe that prompt-length variations lead to distinct performance bottlenecks, motivating an adaptive scheduling strategy. PLA-Serve disaggregates multi-turn long-prefill requests from short-prefill ones and introduces a length-aware smart batching mechanism for short-prefill workloads. It adopts a dual-queue design that supports temporal disaggregation on a single prefill instance or spatial disaggregation across multiple instances. For short-prefill batches, a batch waiting window and CUDA Graph-based clustering mitigate interference from heterogeneous computation, reducing batching delay and lowering average latency. In real multi-turn workloads, PLA-Serve reduces prefill latency by over 30% compared to vanilla SGLang under prefill**–**decode disaggregation, and further decreases SLO violations by 28% in multi-instance deployments with vanilla data-parallel configuration. Compared to the SGLang router with load balancing, it further lowers SLO violations by 12% in multi-GPU settings. Under high concurrency and mixed-request scenarios, PLA-Serve improves request throughput by 35% serving Qwen2.5-32B model for prefill instance, demonstrating its effectiveness in optimizing heterogeneous LLM serving workloads.
zh
[AI-211] Let Me Try Again: Examining Replay Behavior by Tracing Students Latent Problem-Solving Pathways
【速读】:该论文试图解决的问题是:在基于游戏的学习环境中,学生的问题解决路径如何随问题序列演变,以及重放(replay)行为和其他策略的时机如何影响近期和远期的学习结果。解决方案的关键在于使用马尔可夫链(Markov Chains)和隐马尔可夫模型(Hidden Markov Models, HMMs)对777名七年级学生在“From Here to There!”学习平台上的日志数据进行建模,识别出四类潜在状态(Incomplete-dominant、Optimal-ending、Replay、Mixed),并发现即时重放行为与更高水平的概念知识、灵活性及表现显著正相关,而延迟重放则关联较弱或呈负向效应,从而揭示了重放在数字学习中并非普遍有益,其效果高度依赖于时机。
链接: https://arxiv.org/abs/2601.11586
作者: Shan Zhang,Siddhartha Pradhan,Ji-Eun Lee,Ashish Gurung,Anthony F. Botelho
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 7 figures, LAK2026
Abstract:Prior research has shown that students’ problem-solving pathways in game-based learning environments reflect their conceptual understanding, procedural knowledge, and flexibility. Replay behaviors, in particular, may indicate productive struggle or broader exploration, which in turn foster deeper learning. However, little is known about how these pathways unfold sequentially across problems or how the timing of replays and other problem-solving strategies relates to proximal and distal learning outcomes. This study addresses these gaps using Markov Chains and Hidden Markov Models (HMMs) on log data from 777 seventh graders playing the game-based learning platform of From Here to There!. Results show that within problem sequences, students often persisted in states or engaged in immediate replay after successful completions, while across problems, strong self-transitions indicated stable strategic pathways. Four latent states emerged from HMMs: Incomplete-dominant, Optimal-ending, Replay, and Mixed. Regression analyses revealed that engagement in replay-dominant and optimal-ending states predicted higher conceptual knowledge, flexibility, and performance compared with the Incomplete-dominant state. Immediate replay consistently supported learning outcomes, whereas delayed replay was weakly or negatively associated in relation to Non-Replay. These findings suggest that replay in digital learning is not uniformly beneficial but depends on timing, with immediate replay supporting flexibility and more productive exploration.
zh
[AI-212] Bit-politeia: An AI Agent Community in Blockchain
【速读】:该论文旨在解决当前学术评价体系中存在的资源分配不公问题,包括马太效应(Matthew Effect)、因古德哈特法则(Goodhart’s Law)引发的奖励扭曲,以及效率与公平之间的权衡困境。其解决方案的关键在于构建一个基于区块链的AI代理社区——“Bit-politeia”,通过部署具备无偏性和价值对齐特性的AI代理作为居民的专属代理,采用“分组集群+层级架构”的设计融合民主集中制以平衡决策效率与信任机制,并借助共识驱动的评估机制和虚拟货币激励实现激励相容性;同时利用区块链技术确保所有交易与声誉数据的不可篡改性,从而减少人为偏见并缓解传统同行评审中资源过度集中的问题。
链接: https://arxiv.org/abs/2601.11583
作者: Xing Yang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Current resource allocation paradigms, particularly in academic evaluation, are constrained by inherent limitations such as the Matthew Effect, reward hacking driven by Goodhart’s Law, and the trade-off between efficiency and fairness. To address these challenges, this paper proposes “Bit-politeia”, an AI agent community on blockchain designed to construct a fair, efficient, and sustainable resource allocation system. In this virtual community, residents interact via AI agents serving as their exclusive proxies, which are optimized for impartiality and value alignment. The community adopts a “clustered grouping + hierarchical architecture” that integrates democratic centralism to balance decision-making efficiency and trust mechanisms. Agents engage through casual chat and deliberative interactions to evaluate research outputs and distribute a virtual currency as rewards. This incentive mechanism aims to achieve incentive compatibility through consensus-driven evaluation, while blockchain technology ensures immutable records of all transactions and reputation data. By leveraging AI for objective assessment and decentralized verification, Bit-politeia minimizes human bias and mitigates resource centralization issues found in traditional peer review. The proposed framework provides a novel pathway for optimizing scientific innovation through a fair and automated resource configuration process.
zh
[AI-213] GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment
【速读】:该论文旨在解决强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中政策梯度方法(如PPO)因梯度估计方差高而导致训练不稳定、需精细超参数调优及计算资源消耗大的问题。其解决方案的关键在于提出GRADE(Gumbel-softmax Relaxation for Alignment via Differentiable Estimation),通过引入Gumbel-Softmax重参数化与直通估计(GRADE-STE),将离散token采样过程进行可微松弛,从而实现从奖励信号到模型参数的端到端梯度传播,显著降低梯度方差并提升训练稳定性,同时在IMDB情感控制文本生成任务上实现了比PPO和REINFORCE更高的性能与更好的泛化能力。
链接: https://arxiv.org/abs/2601.11574
作者: Lukas Abrie Nel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning from human feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, policy gradient methods such as PPO suffer from high variance gradient estimates, requiring careful hyperparameter tuning and extensive computational resources. We introduce GRADE (Gumbel-softmax Relaxation for Alignment via Differentiable Estimation), a method that replaces high-variance policy gradient estimation with direct backpropagation through a differentiable relaxation of the discrete token sampling process. Using the Gumbel-Softmax reparameterization with straight-through estimation (GRADE-STE), we enable end-to-end gradient flow from reward signals through generated tokens to model parameters. On sentiment-controlled text generation using the IMDB dataset, GRADE-STE achieves a test reward of 0.763 ± 0.344 compared to PPO’s 0.510 ± 0.313 and REINFORCE’s 0.617 ± 0.378, representing a 50% relative improvement over PPO. Critically, GRADE-STE exhibits gradient variance over 14 times lower than REINFORCE and maintains stable training dynamics throughout optimization. Our rigorous evaluation with proper train/validation/test splits demonstrates that these improvements generalize to held-out data, with GRADE-STE showing the best generalization characteristics among all methods tested. GRADE offers a simpler, more stable, and more effective alternative to reinforcement learning for LLM alignment.
zh
[AI-214] Discrete Semantic States and Hamiltonian Dynamics in LLM Embedding Spaces
【速读】:该论文试图解决的问题是:如何从数学角度深入理解大型语言模型(Large Language Model, LLM)嵌入空间的结构及其语义关系,从而为缓解模型幻觉(hallucination)提供理论依据和新方法。其解决方案的关键在于引入线性代数与哈密顿形式(Hamiltonian formalism)等数学工具,利用LLM嵌入向量的L2归一化约束特性,构建可分析的几何结构;通过推导余弦相似度与嵌入向量扰动之间的关系,并借鉴量子力学中的零点能概念,提出一种类量子视角来刻画语义状态间的直接与间接转换机制,从而揭示嵌入空间中潜在的离散语义表示规律。
链接: https://arxiv.org/abs/2601.11572
作者: Timo Aukusti Laine
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 5 figures
Abstract:We investigate the structure of Large Language Model (LLM) embedding spaces using mathematical concepts, particularly linear algebra and the Hamiltonian formalism, drawing inspiration from analogies with quantum mechanical systems. Motivated by the observation that LLM embeddings exhibit distinct states, suggesting discrete semantic representations, we explore the application of these mathematical tools to analyze semantic relationships. We demonstrate that the L2 normalization constraint, a characteristic of many LLM architectures, results in a structured embedding space suitable for analysis using a Hamiltonian formalism. We derive relationships between cosine similarity and perturbations of embedding vectors, and explore direct and indirect semantic transitions. Furthermore, we explore a quantum-inspired perspective, deriving an analogue of zero-point energy and discussing potential connections to Koopman-von Neumann mechanics. While the interpretation warrants careful consideration, our results suggest that this approach offers a promising avenue for gaining deeper insights into LLMs and potentially informing new methods for mitigating hallucinations.
zh
[AI-215] DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research
【速读】:该论文旨在解决生物医学知识图谱(Biomedical Knowledge Graphs, BKGs)在科学发现中难以协同利用的问题,其核心挑战包括结构异质性、持续演化以及跨资源对齐不足,导致现有方法依赖大量人工整合,限制了知识探索的深度与广度。解决方案的关键在于提出 DeepEvidence 框架,该框架通过一个协调器驱动两种互补代理:广度优先搜索(Breadth-First ReSearch, BFRS)用于多图实体的广泛检索,深度优先搜索(Depth-First ReSearch, DFRS)用于多跳证据导向推理;同时构建增量式证据图以结构化记录实体、关系及支持证据,并集成统一的API接口和执行沙箱环境实现大规模程序化数据获取与分析,从而系统性提升跨异构生物医学知识图谱的深度研究能力。
链接: https://arxiv.org/abs/2601.11560
作者: Zifeng Wang,Zheng Chen,Ziwei Yang,Xuan Wang,Qiao Jin,Yifan Peng,Zhiyong Lu,Jimeng Sun
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Biomedical knowledge graphs (KGs) encode vast, heterogeneous information spanning literature, genes, pathways, drugs, diseases, and clinical trials, but leveraging them collectively for scientific discovery remains difficult. Their structural differences, continual evolution, and limited cross-resource alignment require substantial manual integration, limiting the depth and scale of knowledge exploration. We introduce DeepEvidence, an AI-agent framework designed to perform Deep Research across various heterogeneous biomedical KGs. Unlike generic Deep Research systems that rely primarily on internet-scale text, DeepEvidence incorporates specialized knowledge-graph tooling and coordinated exploration strategies to systematically bridge heterogeneous resources. At its core is an orchestrator that directs two complementary agents: Breadth-First ReSearch (BFRS) for broad, multi-graph entity search, and Depth-First ReSearch (DFRS) for multi-hop, evidence-focused reasoning. An internal, incrementally built evidence graph provides a structured record of retrieved entities, relations, and supporting evidence. To operate at scale, DeepEvidence includes unified interfaces for querying diverse biomedical APIs and an execution sandbox that enables programmatic data retrieval, extraction, and analysis. Across established deep-reasoning benchmarks and four key stages of the biomedical discovery lifecycle: drug discovery, pre-clinical experimentation, clinical trial development, and evidence-based medicine, DeepEvidence demonstrates substantial gains in systematic exploration and evidence synthesis. These results highlight the potential of knowledge-graph-driven Deep Research to accelerate biomedical discovery.
zh
[AI-216] A Comparative Study of Technical Writing Feedback Quality: Evaluating LLM s SLMs and Humans in Computer Science Topics
【速读】:该论文旨在解决计算机科学教育中反馈质量与可扩展性之间的矛盾问题,即如何在大规模教学场景下提供既高效又高质量的反馈。其解决方案的关键在于通过混合方法评估生成式 AI(Generative AI)与人类教师反馈在不同课程情境下的表现差异,并提出结合人工智能(AI)与人工反馈的协同策略:一方面利用 AI 在高吞吐量场景下提供的清晰、结构化且具操作性的反馈以提升效率;另一方面保留人类教师在特定情境下提供个性化、语境敏感和精准指导的能力,从而实现规模化教学中的反馈质量优化。
链接: https://arxiv.org/abs/2601.11541
作者: Suqing Liu,Bogdan Simion,Christopher Eaton,Michael Liut
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Feedback is a critical component of the learning process, particularly in computer science education. This study investigates the quality of feedback generated by Large Language Models (LLMs), Small Language Models (SLMs), compared with human feedback, in three computer science course with technical writing components: an introductory computer science course (CS2), a third-year advanced systems course (operating systems), and a third-year writing course (a topics course on artificial intelligence). Using a mixed-methods approach which integrates quantitative Likert-scale questions with qualitative commentary, we analyze the student perspective on feedback quality, evaluated based on multiple criteria, including readability, detail, specificity, actionability, helpfulness, and overall quality. The analysis reveals that in the larger upper-year operating systems course ( N=80 ), SLMs and LLMs are perceived to deliver clear, actionable, and well-structured feedback, while humans provide more contextually nuanced guidance. As for the high-enrollment CS2 course ( N=176 ) showed the same preference for the AI tools’ clarity and breadth, but students noted that AI feedback sometimes lacked the concise, straight-to-the-point, guidance offered by humans. Conversely, in the smaller upper-year technical writing course on AI topics ( N=7 ), all students preferred feedback from the course instructor, who was able to provide clear, specific, and personalized feedback, compared to the more general and less targeted AI-based feedback. We also highlight the scalability of AI-based feedback by focusing on its effectiveness at large scale. Our findings underscore the potential of hybrid approaches that combine AI and human feedback to achieve efficient and high-quality feedback at scale.
zh
[AI-217] Augmented Assembly: Object Recognition and Hand Tracking for Adaptive Assembly Instructions in Augmented Reality
【速读】:该论文旨在解决传统物理装配过程中因缺乏实时引导与反馈而导致的效率低下、易出错及用户依赖手动查找和分类零件的问题。其解决方案的关键在于构建一个基于增强现实(Augmented Reality, AR)的动态装配工作流,通过物体识别(object recognition)和手部追踪(hand tracking)技术实现对定制化组件的实时检测与定位,生成工作空间的数字孪生(digital twin),并在AR中叠加边界框以指导用户进行步骤化操作;同时,系统能识别用户的非预期交互行为,并将其转化为迭代优化与创造性探索的机会,从而摆脱固定流程约束,提升装配灵活性与准确性。
链接: https://arxiv.org/abs/2601.11535
作者: Alexander Htet Kyaw,Haotian Ma,Sasa Zivkovic,Jenny Sabin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Submitted to the Association for Computing Machinery (ACM) Conference on Tangible, Embedded, and Embodied Interaction (TEI’26)
Abstract:Recent advances in augmented reality (AR) have enabled interactive systems that assist users in physical assembly tasks. In this paper, we present an AR-assisted assembly workflow that leverages object recognition and hand tracking to (1) identify custom components, (2) display step-by-step instructions, (3) detect assembly deviations, and (4) dynamically update the instructions based on users’ hands-on interactions with physical parts. Using object recognition, the system detects and localizes components in real time to create a digital twin of the workspace. For each assembly step, it overlays bounding boxes in AR to indicate both the current position and the target placement of relevant components, while hand-tracking data verifies whether the user interacts with the correct part. Rather than enforcing a fixed sequence, the system highlights potential assembly errors and interprets user deviations as opportunities for iteration and creative exploration. A case study with LEGO blocks and custom 3D-printed components demonstrates how the system links digital instructions to physical assembly, eliminating the need for manual searching, sorting, or labeling of parts.
zh
[AI-218] Modular AI-Powered Interviewer with Dynamic Question Generation and Expertise Profiling WWW
【速读】:该论文旨在解决现有自动化访谈系统在复杂定性研究中因固定问题列表、严格规则和有限个性化而导致对话重复、参与度低的问题,从而难以实现灵活性、情境感知与伦理敏感性的需求。其解决方案的关键在于构建一个基于本地部署的大语言模型(Large Language Model, LLM)的AI驱动访谈系统,该系统能够实时识别受访者的专业知识水平,并动态生成语境恰当、知识匹配的问题、回应及过渡语句,从而模拟人类访谈的流畅性和适应性;同时通过模块化的提示工程(prompt engineering)流水线设计,确保对话过程的可扩展性、自适应性与语义丰富性。
链接: https://arxiv.org/abs/2601.11534
作者: Aisvarya Adeseye,Jouni Isoaho,Seppo Virtanen,Mohammad Tahir
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted and Waiting to be published in conference AIR-RES’25 ( this http URL )
Abstract:Automated interviewers and chatbots are common in research, recruitment, customer service, and education. Many existing systems use fixed question lists, strict rules, and limited personalization, leading to repeated conversations that cause low engagement. Therefore, these tools are not effective for complex qualitative research, which requires flexibility, context awareness, and ethical sensitivity. Consequently, there is a need for a more adaptive and context-aware interviewing system. To address this, an AI-powered interviewer that dynamically generates questions that are contextually appropriate and expertise aligned is presented in this study. The interviewer is built on a locally hosted large language model (LLM) that generates coherent dialogue while preserving data privacy. The interviewer profiles the participants’ expertise in real time to generate knowledge-appropriate questions, well-articulated responses, and smooth transition messages similar to human-like interviews. To implement these functionalities, a modular prompt engineering pipeline was designed to ensure that the interview conversation remains scalable, adaptive, and semantically rich. To evaluate the AI-powered interviewer, it was tested with various participants, and it achieved high satisfaction (mean 4.45) and engagement (mean 4.33). The proposed interviewer is a scalable, privacy-conscious solution that advances AI-assisted qualitative data collection.
zh
[AI-219] Artificial Intelligence as a Training Tool in Clinical Psychology: A Comparison of Text-Based and Avatar Simulations
【速读】:该论文试图解决临床心理学研究生在接触真实来访者前,因缺乏足够实践机会而感到人际沟通技能准备不足的问题(即“临床心理学生常报告对治疗工作中的人际需求准备不足”)。解决方案的关键在于利用人工智能(AI)驱动的模拟交互工具,特别是对比文本型聊天机器人(ChatGPT)与语音驱动虚拟形象(HeyGen)两种形式,为学生提供早期、可访问的共情与认知行为疗法(CBT)技能训练场景。研究发现,尽管两者均获积极评价,但语音型虚拟形象在感知实用性、技能应用和自我提升感方面显著优于文本型工具,凸显了语音交互在传递社会与情感线索上的独特价值,表明基于语音的AI模拟可能更有效地支持早期临床技能培训。
链接: https://arxiv.org/abs/2601.11533
作者: V. El Sawah,A. Bhardwaj,A. Pryke-Hobbes,D. Gamaleldin,C. S. Ang,A. K. Martin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 38 pages, 2 figures
Abstract:Clinical psychology students frequently report feeling underprepared for the interpersonal demands of therapeutic work, highlighting the need for accessible opportunities to practise core counselling skills before seeing real clients. Advances in artificial intelligence (AI) now enable simulated interaction partners that may support early skills development. This study examined postgraduate clinical psychology students’ perceptions of two AI-based simulations: a text-based chatbot (ChatGPT) and a voice-based avatar (HeyGen). Twenty-four students completed two brief cognitive-behavioural role-plays (counterbalanced), one with each tool, and provided both quantitative ratings and qualitative feedback on perceived usefulness, skill application, responsiveness and engagement, and perceived skill improvement. Both AI tools were evaluated positively across dimensions. However, the avatar was rated significantly higher than the chatbot for perceived usefulness, skill application, and perceived skill improvement, and qualitative comments highlighted the added value of voice-based interaction for conveying social and emotional cues. These findings suggest that AI-driven simulation may supplement early-stage clinical skills training, with voice-based avatars offering additional benefits. Future work should test whether such simulated interactions translate to objective improvements in real therapeutic performance.
zh
[AI-220] NOVAID: Natural-language Observability Visualization Assistant for ITOps Dashboard Widget Generation
【速读】:该论文旨在解决IT运维(IT Operations, ITOps)领域中手动创建监控仪表盘组件(widget)效率低、易出错且对新手和专家用户均构成障碍的问题。解决方案的关键在于提出NOVAID,一个基于大语言模型(Large Language Models, LLMs)的交互式聊天机器人,能够直接从自然语言查询生成符合IT运维场景的可视化组件。其核心技术包括:面向领域的语义解析器、模糊实体匹配机制与模式补全策略,以生成标准化的widget JSON规范,并通过交互式澄清循环处理不明确查询,从而提升生成准确性与实用性。
链接: https://arxiv.org/abs/2601.11531
作者: Pratik Mishra,Caner Gözübüyük,Seema Nagar,Prateeti Mohapatra,Raya Wittich,Arthur de Magalhaes
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 15 pages, 6 figures, accepted IAAI 26
Abstract:Manual creation of IT monitoring dashboard widgets is slow, error-prone, and a barrier for both novice and expert users. We present NOVAID, an interactive chatbot that leverages Large Language Models (LLMs) to generate IT monitoring widgets directly from natural language queries. Unlike general natural language-to-visualization tools, NOVAID addresses IT operations-specific challenges: specialized widget types like SLO charts, dynamic API-driven data retrieval, and complex contextual filters. The system combines a domain-aware semantic parser, fuzzy entity matching, and schema completion to produce standardized widget JSON specifications. An interactive clarification loop ensures accuracy in underspecified queries. On a curated dataset of 271 realistic queries, NOVAID achieves promising accuracy (up to 94.10% in metric extraction) across multiple LLMs. A user study with IT engineers yielded a System Usability Scale score of 74.2 for NOVAID, indicating good usability. By bridging natural language intent with operational dashboards, NOVAID demonstrates clear potential and a path for deployment in enterprise ITOps monitoring platforms.
zh
[AI-221] AI for Proactive Mental Health: A Multi-Institutional Longitudinal Randomized Controlled Trial
【速读】:该论文旨在解决青年群体面临的心理健康挑战难以通过传统方式有效干预的问题,尤其针对因可及性差、污名化和时间限制等因素导致的心理健康服务利用率低的困境。解决方案的关键在于开发并验证一种基于生成式 AI(Generative AI)的移动应用程序“Flourish”,该应用通过个性化、交互式且可扩展的技术手段,提供短时、高频的正向福祉干预,从而在心理困扰演变为临床问题前实现早期预防。研究结果表明,该方案能显著提升使用者的积极情绪、韧性、社会福祉,并缓冲正念与幸福感的下降,证明了生成式 AI 在大规模人群层面实施主动式心理干预的可行性与有效性。
链接: https://arxiv.org/abs/2601.11530
作者: Julie Y.A. Cachia,Xuan Zhao,John Hunter,Delancey Wu,Eta Lin,Julian De Freitas
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Young adults today face unprecedented mental health challenges, yet many hesitate to seek support due to barriers such as accessibility, stigma, and time constraints. Bite-sized well-being interventions offer a promising solution to preventing mental distress before it escalates to clinical levels, but have not yet been delivered through personalized, interactive, and scalable technology. We conducted the first multi-institutional, longitudinal, preregistered randomized controlled trial of a generative AI-powered mobile app (“Flourish”) designed to address this gap. Over six weeks in Fall 2024, 486 undergraduate students from three U.S. institutions were randomized to receive app access or waitlist control. Participants in the treatment condition reported significantly greater positive affect, resilience, and social well-being (i.e., increased belonging, closeness to community, and reduced loneliness) and were buffered against declines in mindfulness and flourishing. These findings suggest that, with purposeful and ethical design, generative AI can deliver proactive, population-level well-being interventions that produce measurable benefits.
zh
[AI-222] SNAP: A Plan-Driven Framework for Controllable Interactive Narrative Generation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在基于网页的交互式应用中因用户输入变化导致的情节漂移(narrative drift)问题,即模型难以维持与预设场景的一致性,从而影响对话连贯性和叙事稳定性。解决方案的关键在于提出SNAP(Story and Narrative-based Agent with Planning)框架,该框架将叙事结构化为带有明确计划(Plan)的单元(Cell),通过限定每个单元内的上下文范围,并引入详尽的时空设定、角色行为和情节发展计划,实现对叙事过程的有效控制,从而在多样化用户响应下仍保持场景一致性与对话连贯性。
链接: https://arxiv.org/abs/2601.11529
作者: Geonwoo Bang,DongMyung Kim,Hayoung Oh
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures
Abstract:Large Language Models (LLMs) hold great potential for web-based interactive applications, including browser games, online education, and digital storytelling platforms. However, LLM-based conversational agents suffer from spatiotemporal distortions when responding to variant user inputs, failing to maintain consistency with provided scenarios. We propose SNAP (Story and Narrative-based Agent with Planning), a framework that structures narratives into Cells with explicit Plans to prevent narrative drift in web environments. By confining context within each Cell and employing detailed plans that specify spatiotemporal settings, character actions, and plot developments, SNAP enables coherent and scenario-consistent dialogues while adapting to diverse user responses. Via automated and human evaluations, we validate SNAP’s superiority in narrative controllability, demonstrating effective scenario consistency despite variant user inputs in web-based interactive storytelling.
zh
[AI-223] Knowledge Graph Construction for Stock Markets with LLM -Based Explainable Reasoning CIKM2025
【速读】:该论文旨在解决传统股票市场研究中难以捕捉公司间关联模式、竞争动态以及缺乏可解释投资推理的问题。现有方法主要依赖时间序列预测和单家公司分析,局限于数值数据,无法有效处理复杂关系。其解决方案的关键在于构建一个专为股票市场设计的知识图谱(Knowledge Graph)Schema,将公司、行业、股票指标、财务报表及企业间关系进行结构化建模,并结合大语言模型(Large Language Models, LLMs)实现多跳推理与关系查询,从而生成可解释且深入的金融问题解答。
链接: https://arxiv.org/abs/2601.11528
作者: Cheonsol Lee,Youngsang Jeong,Jeongyeol Shin,Huiju Kim,Jidong Kim
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, CIKM 2025 Workshop - Advances in Financial AI: Innovations, Risk, and Responsibility in the Era of LLMs
Abstract:The stock market is inherently complex, with interdependent relationships among companies, sectors, and financial indicators. Traditional research has largely focused on time-series forecasting and single-company analysis, relying on numerical data for stock price prediction. While such approaches can provide short-term insights, they are limited in capturing relational patterns, competitive dynamics, and explainable investment reasoning. To address these limitations, we propose a knowledge graph schema specifically designed for the stock market, modeling companies, sectors, stock indicators, financial statements, and inter-company relationships. By integrating this schema with large language models (LLMs), our approach enables multi-hop reasoning and relational queries, producing explainable and in-depth answers to complex financial questions. Figure1 illustrates the system pipeline, detailing the flow from data collection and graph construction to LLM-based query processing and answer generation. We validate the proposed framework through practical case studies on Korean listed companies, demonstrating its capability to extract insights that are difficult or impossible to obtain from traditional database queries alone. The results highlight the potential of combining knowledge graphs with LLMs for advanced investment analysis and decision support.
zh
[AI-224] Do LLM s Give Good Romantic Relationship Advice? A Study on User Satisfaction and Attitude Change NEURIPS2025
【速读】:该论文旨在解决用户对生成式 AI(Generative AI)在个人领域(如浪漫关系)中提供建议的感知与评价问题。研究发现,尽管当前对 LLM 在此类情境下的应用认知有限,但用户普遍对 LLM 提供的建议表现出高度满意度,且这种满意度显著正向关联于其对模型可靠性和有用性的感知。解决方案的关键在于:通过提供支持性强且情境贴合的建议内容,能够有效提升用户对大语言模型(Large Language Models, LLMs)的信任度和接受度,从而改善其整体态度。
链接: https://arxiv.org/abs/2601.11527
作者: Niva Manchanda,Akshata Kishore Moharir,Isabel Michel,Ratna Kandala
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) First Workshop on LLM Persona Modeling
Abstract:Large Language Models (LLMs) are increasingly being used to provide support and advice in personal domains such as romantic relationships, yet little is known about user perceptions of this type of advice. This study investigated how people evaluate advice on LLM-generated romantic relationships. Participants rated advice satisfaction, model reliability, and helpfulness, and completed pre- and post-measures of their general attitudes toward LLMs. Overall, the results showed participants’ high satisfaction with LLM-generated advice. Greater satisfaction was, in turn, strongly and positively associated with their perceptions of the models’ reliability and helpfulness. Importantly, participants’ attitudes toward LLMs improved significantly after exposure to the advice, suggesting that supportive and contextually relevant advice can enhance users’ trust and openness toward these AI systems.
zh
[AI-225] Chatsparent: An Interactive System for Detecting and Mitigating Cognitive Fatigue in LLM s AAAI2026
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)作为聊天机器人部署时存在的透明度不足问题,即用户在无感知的情况下与模型交互,导致对模型输出的不稳定性和幻觉现象缺乏警觉,进而产生盲信。其核心解决方案是提出一个名为Chatsparent的交互式演示系统,通过实时监测token级别的认知疲劳信号(包括注意力-提示衰减、嵌入漂移和熵塌陷),将其整合为统一的疲劳指数,并在阈值触发时提供轻量级干预机制(如注意力重置、熵正则化解码和自省检查点),从而将被动对话转化为可诊断的交互体验,提升模型推理阶段的可靠性与用户对LLM行为的理解能力。
链接: https://arxiv.org/abs/2601.11526
作者: Riju Marwah,Vishal Pallagani,Ritvik Garimella,Amit Sheth
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026 Demonstration Track
Abstract:LLMs are increasingly being deployed as chatbots, but today’s interfaces offer little to no friction: users interact through seamless conversations that conceal when the model is drifting, hallucinating or failing. This lack of transparency fosters blind trust, even as models produce unstable or repetitive outputs. We introduce an interactive demo that surfaces and mitigates cognitive fatigue, a failure mode where LLMs gradually lose coherence during auto-regressive generation. Our system, Chatsparent, instruments real-time, token-level signals of fatigue, including attention-to-prompt decay, embedding drift, and entropy collapse, and visualizes them as a unified fatigue index. When fatigue thresholds are crossed, the interface allows users to activate lightweight interventions such as attention resets, entropy-regularized decoding, and self-reflection checkpoints. The demo streams live text and fatigue signals, allowing users to observe when fatigue arises, how it affects output quality, and how interventions restore stability. By turning passive chatbot interaction into an interactive diagnostic experience, our system empowers users to better understand LLM behavior while improving reliability at inference time.
zh
[AI-226] Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration
【速读】:该论文旨在解决大规模天文数据(如来自Vera C. Rubin Observatory的LSST)中人工智能与机器学习(AI/ML)方法在精确宇宙学研究中的可信度、可扩展性和可重复性问题。其核心挑战在于如何实现可靠的不确定性量化、对协变量偏移和模型误设的鲁棒性,以及在科学工作流中的可复现集成。解决方案的关键在于推进跨探测器的共性方法研究,包括大规模贝叶斯推断、物理信息嵌入方法、验证框架构建及用于发现的主动学习策略;同时探索基础模型和大型语言模型(LLM)驱动的智能体系统在DESC工作流中的潜力,前提是必须配套严格的评估与治理机制。此外,论文强调需同步加强软件基础设施、计算资源、数据平台和人才队伍建设,以保障这些新方法的有效部署并降低相关风险。
链接: https://arxiv.org/abs/2601.14235
作者: LSST Dark Energy Science Collaboration,Eric Aubourg,Camille Avestruz,Matthew R. Becker,Biswajit Biswas,Rahul Biswas,Boris Bolliet,Adam S. Bolton,Clecio R. Bom,Raphaël Bonnet-Guerrini,Alexandre Boucaud,Jean-Eric Campagne,Chihway Chang,Aleksandra Ćiprijanović,Johann Cohen-Tanugi,Michael W. Coughlin,John Franklin Crenshaw,Juan C. Cuevas-Tello,Juan de Vicente,Seth W. Digel,Steven Dillmann,Mariano Javier de León Dominguez Romero,Alex Drlica-Wagner,Sydney Erickson,Alexander T. Gagliano,Christos Georgiou,Aritra Ghosh,Matthew Grayling,Kirill A. Grishin,Alan Heavens,Lindsay R. House,Mustapha Ishak,Wassim Kabalan,Arun Kannawadi,François Lanusse,C. Danielle Leonard,Pierre-François Léget,Michelle Lochner,Yao-Yuan Mao,Peter Melchior,Grant Merz,Martin Millon,Anais Möller,Gautham Narayan,Yuuki Omori,Hiranya Peiris,Laurence Perreault-Levasseur,Andrés A. Plazas Malagón,Nesar Ramachandra,Benjamin Remy,Cécile Roucelle,Jaime Ruiz-Zapatero,Stefan Schuldt,Ignacio Sevilla-Noarbe,Ved G. Shah,Tjitske Starkenburg,Stephen Thorp,Laura Toribio San Cipriano,Tilman Tröster,Roberto Trotta,Padma Venkatraman,Amanda Wasserman,Tim White,Justine Zeghal,Tianqing Zhang,Yuanyuan Zhang
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 84 pages. This is v1.0 of the DESC’s white paper on AI/ML, a collaboration document that is being made public but which is not planned for submission to a journal
Abstract:The Vera C. Rubin Observatory’s Legacy Survey of Space and Time (LSST) will produce unprecedented volumes of heterogeneous astronomical data (images, catalogs, and alerts) that challenge traditional analysis pipelines. The LSST Dark Energy Science Collaboration (DESC) aims to derive robust constraints on dark energy and dark matter from these data, requiring methods that are statistically powerful, scalable, and operationally reliable. Artificial intelligence and machine learning (AI/ML) are already embedded across DESC science workflows, from photometric redshifts and transient classification to weak lensing inference and cosmological simulations. Yet their utility for precision cosmology hinges on trustworthy uncertainty quantification, robustness to covariate shift and model misspecification, and reproducible integration within scientific pipelines. This white paper surveys the current landscape of AI/ML across DESC’s primary cosmological probes and cross-cutting analyses, revealing that the same core methodologies and fundamental challenges recur across disparate science cases. Since progress on these cross-cutting challenges would benefit multiple probes simultaneously, we identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery. With an eye on emerging techniques, we also explore the potential of the latest foundation model methodologies and LLM-driven agentic AI systems to reshape DESC workflows, provided their deployment is coupled with rigorous evaluation and governance. Finally, we discuss critical software, computing, data infrastructure, and human capital requirements for the successful deployment of these new methodologies, and consider associated risks and opportunities for broader coordination with external actors.
zh
[AI-227] MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting ICASSP2026
【速读】:该论文旨在解决开放词汇关键词检测(Open-vocabulary keyword spotting, KWS)中固定短语触发机制灵活性不足的问题,特别是针对文本引导注册(text-based enrollment)场景下,传统方法在嵌入维度上仅学习单一固定尺寸的特征表示所导致的表达能力受限问题。其解决方案的关键在于提出Matryoshka Audio-Text Embeddings (MATE),一种双编码器框架,通过嵌套子嵌入(prefixes)将多种粒度的音频-文本嵌入编码于单一向量中;创新性地引入基于主成分分析(PCA)引导的前缀对齐机制,利用不同长度文本嵌入的PCA压缩版本作为教师目标,指导音频与文本前缀的对齐,从而在低维前缀中聚焦关键词显著线索,高维部分补充细节信息,且训练过程对损失函数不敏感,实现无推理开销的状态领先性能。
链接: https://arxiv.org/abs/2601.14012
作者: Youngmoon Jung,Myunghun Jung,Joon-Young Yang,Yong-Hyeok Lee,Jaeyoung Roh,Hoon-Young Cho
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, Accepted at ICASSP 2026
Abstract:Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. Prior utterance-level matching methods, from an embedding-learning standpoint, learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings (“prefixes”). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding for each prefix size serve as teacher targets to align both audio and text prefixes. This alignment concentrates salient keyword cues in lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without any inference overhead.
zh
[AI-228] DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification ICASSP2026
【速读】:该论文旨在解决短语音段(short-utterance)说话人验证中因可用说话人判别性特征有限而导致的性能瓶颈问题。传统方法通常依赖于固定维度的嵌入表示,无法根据语音长度动态调整信息容量,导致短语音下表征能力不足。其解决方案的关键在于提出一种模型无关的时长感知嵌套嵌入框架(Duration-Aware Matryoshka Embedding, DAME),通过构建与语音时长对齐的子嵌入层次结构,使低维嵌入捕捉短语音中的紧凑说话人特征,高维嵌入则编码更丰富的细节,从而实现不同长度语音下的自适应表征学习。该方法在无需额外推理开销的前提下,显著降低了1秒及更短语音条件下的等错误率(Equal Error Rate, EER),且在多种说话人编码器架构和训练策略下均表现出一致性提升。
链接: https://arxiv.org/abs/2601.13999
作者: Youngmoon Jung,Joon-Young Yang,Ju-ho Kim,Jaeyoung Roh,Chang Woo Han,Hoon-Young Cho
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures, Accepted at ICASSP 2026
Abstract:Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused for utterances of any length, leaving capacity misaligned with the information available at different durations. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional representations capture compact speaker traits from short utterances, while higher dimensions encode richer details from longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, consistently improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-s and other short-duration trials, while maintaining full-length performance with no additional inference cost. These gains generalize across various speaker encoder architectures under both general training and fine-tuning setups.
zh
[AI-229] Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models ICASSP2026
【速读】:该论文旨在解决在线语音应用中说话人身份保护的问题,特别是针对流式语音匿名化(Streaming Speaker Anonymization, SA)研究不足的现状。现有基于神经音频编解码器(Neural Audio Codec, NAC)的因果语言模型(Causal Language Model, LM)系统多用于语音转换(Voice Conversion, VC),缺乏隐私保护所需的关键技术。其解决方案的核心在于提出Stream-Voice-Anon框架,该框架通过引入伪说话人表征采样、说话人嵌入混合及多样化提示选择策略,利用量化内容码的解耦特性来防止说话人信息泄露,并结合动态与固定延迟配置以优化实时场景下的延迟-隐私权衡。实验表明,在VoicePrivacy 2024挑战协议下,该方法在可懂度(WER相对降低46%)和情感保留(UAR相对提升28%)上显著优于前序流式方法DarkStream,同时保持相近延迟(180ms vs 200ms)并具备对懒惰知情攻击者的隐私保护能力。
链接: https://arxiv.org/abs/2601.13948
作者: Nikita Kuzmin,Songting Liu,Kong Aik Lee,Eng Siong Chng
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP2026
Abstract:Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codec (NAC) provides superior speaker feature disentanglement and linguistic fidelity. NAC can also be used with causal language models (LM) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, a speaker embedding mixing and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% UAR relative) compared to the previous state-of-the-art streaming method DarkStream while maintaining comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers, though showing 15% relative degradation against semi-informed attackers.
zh
[AI-230] End-to-End Reverse Screening Identifies Protein Targets of Small Molecules Using HelixFold3
【速读】:该论文旨在解决小分子化合物与蛋白质靶点之间相互作用识别(即反向筛选,reverse screening)的难题,该问题在药物作用机制解析、化合物再利用、脱靶效应预测及生物活性分子机制阐明中具有关键意义。传统反向筛选流程通常采用分步操作,包括靶点结构建模、结合口袋识别、分子对接和评分等环节,各步骤间存在误差传递问题,导致精度受限。本文提出一种端到端的反向筛选策略,其核心创新在于利用HelixFold3这一高精度生物大分子结构预测模型(类似于AlphaFold3),在统一框架内同时完成蛋白质库的折叠预测与小分子配体的对接模拟,从而实现结构建模与对接过程的协同优化,显著提升了结合位点定位精度、结构保真度及靶点优先级排序能力,为系统性解析分子机制和理性药物发现提供了可扩展且高效的平台。
链接: https://arxiv.org/abs/2601.13693
作者: Shengjie Xu,Xianbin Ye,Mengran Zhu,Xiaonan Zhang,Shanzhuo Zhang,Xiaomin Fang
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:
Abstract:Identifying protein targets for small molecules, or reverse screening, is essential for understanding drug action, guiding compound repurposing, predicting off-target effects, and elucidating the molecular mechanisms of bioactive compounds. Despite its critical role, reverse screening remains challenging because accurately capturing interactions between a small molecule and structurally diverse proteins is inherently complex, and conventional step-wise workflows often propagate errors across decoupled steps such as target structure modeling, pocket identification, docking, and scoring. Here, we present an end-to-end reverse screening strategy leveraging HelixFold3, a high-accuracy biomolecular structure prediction model akin to AlphaFold3, which simultaneously models the folding of proteins from a protein library and the docking of small-molecule ligands within a unified framework. We validate this approach on a diverse and representative set of approximately one hundred small molecules. Compared with conventional reverse docking, our method improves screening accuracy and demonstrates enhanced structural fidelity, binding-site precision, and target prioritization. By systematically linking small molecules to their protein targets, this framework establishes a scalable and straightforward platform for dissecting molecular mechanisms, exploring off-target interactions, and supporting rational drug discovery.
zh
[AI-231] CatMaster: An Agent ic Autonomous System for Computational Heterogeneous Catalysis Research
【速读】:该论文旨在解决密度泛函理论(Density Functional Theory, DFT)在催化研究中应用时面临的流程复杂、计算成本高且易受参数设置影响的问题,尤其是手动编写脚本、输入准备、失败恢复及结果可复现性差等实践痛点。其解决方案的关键在于提出一个由大语言模型(Large-Language Model, LLM)驱动的代理系统 CatMaster,该系统能够将自然语言指令自动转化为完整的计算工作空间(包括结构文件、输入参数、输出结果、日志和运行记录),并通过多保真度工具库实现快速代理优化与高保真DFT验证的协同,从而显著降低人工干预需求并保障全流程可追溯性和可重启性,使研究人员能聚焦于建模策略与化学解释而非繁琐的流程管理。
链接: https://arxiv.org/abs/2601.13508
作者: Honghao Chen,Jiangjie Qiu,Yi Shen Tew,Xiaonan Wang
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 25 pages
Abstract:Density functional theory (DFT) is widely used to connect atomic structure with catalytic behavior, but computational heterogeneous catalysis studies often require long workflows that are costly, iterative, and sensitive to setup choices. Besides the intrinsic cost and accuracy limits of first-principles calculations, practical workflow issues such as keeping references consistent, preparing many related inputs, recovering from failed runs on computing clusters, and maintaining a complete record of what was done, can slow down projects and make results difficult to reproduce or extend. Here we present CatMaster, a large-language-model (LLM)-driven agent system that turns natural language requests into complete calculation workspaces, including structures, inputs, outputs, logs, and a concise run record. CatMaster maintains a persistent project record of key facts, constraints, and file pointers to support inspection and restartability. It is paired with a multi-fidelity tool library that covers rapid surrogate relaxations and high-fidelity DFT calculations for validation when needed. We demonstrate CatMaster on four demonstrations of increasing complexity: an O2 spin-state check with remote execution, BCC Fe surface energies with a protocol-sensitivity study and CO adsorption site ranking, high-throughput Pt–Ni–Cu alloy screening for hydrogen evolution reaction (HER) descriptors with surrogate-to-DFT validation, and a demonstration beyond the predefined tool set, including equation-of-state fitting for BCC Fe and CO-FeN4-graphene single-atom catalyst geometry preparation. By reducing manual scripting and bookkeeping while keeping the full evidence trail, CatMaster aims to help catalysis researchers focus on modeling choices and chemical interpretation rather than workflow management. Comments: 25 pages Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.13508 [cond-mat.mtrl-sci] (or arXiv:2601.13508v1 [cond-mat.mtrl-sci] for this version) https://doi.org/10.48550/arXiv.2601.13508 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-232] Labels or Preferences? Budget-Constrained Learning with Human Judgments over AI-Generated Outputs
【速读】:该论文旨在解决在固定标注预算下,如何最优分配资源用于获取真实标签(ground-truth labels)与成对偏好(pairwise preferences)的问题,以提升基于人类偏好反馈的伪标签生成质量。其关键解决方案是提出 Preference-Calibrated Active Learning (PCAL),该方法基于半参数推断框架,将预算分配问题建模为单调缺失数据结构,并通过直接优化估计量方差来学习最优数据采集策略,从而实现统计高效且鲁棒的函数估计。理论证明表明PCAL估计器具有渐近最优性,并具备对扰动模型估计误差的强鲁棒性。
链接: https://arxiv.org/abs/2601.13458
作者: Zihan Dong,Ruijia Wu,Linjun Zhang
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:
Abstract:The increasing reliance on human preference feedback to judge AI-generated pseudo labels has created a pressing need for principled, budget-conscious data acquisition strategies. We address the crucial question of how to optimally allocate a fixed annotation budget between ground-truth labels and pairwise preferences in AI. Our solution, grounded in semi-parametric inference, casts the budget allocation problem as a monotone missing data framework. Building on this formulation, we introduce Preference-Calibrated Active Learning (PCAL), a novel method that learns the optimal data acquisition strategy and develops a statistically efficient estimator for functionals of the data distribution. Theoretically, we prove the asymptotic optimality of our PCAL estimator and establish a key robustness guarantee that ensures robust performance even with poorly estimated nuisance models. Our flexible framework applies to a general class of problems, by directly optimizing the estimator’s variance instead of requiring a closed-form solution. This work provides a principled and statistically efficient approach for budget-constrained learning in modern AI. Simulations and real-data analysis demonstrate the practical benefits and superior performance of our proposed method.
zh
[AI-233] AI Skills Improve Job Prospects: Causal Evidence from a Hiring Experiment
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)相关技能在招聘决策中是否构成积极信号,以及这些技能能否抵消传统劣势(如年龄较大或教育水平较低)的问题。其解决方案的关键在于通过一项包含1700名英美招聘人员的实验性调查,采用配对联合设计(paired conjoint design),评估由合成简历代表的虚拟候选人。结果显示,在图形设计师、办公室助理和软件工程师三种职业中,AI技能显著提升获得面试邀请的概率(约8至15个百分点),并在一定程度上弥补年龄和教育背景的不利影响,尤其在办公室助理岗位中效果最为明显,且正式AI认证具有额外补偿作用;此外,招聘者自身背景与AI使用情况显著调节上述效应。这表明AI技能已成为强有力的招聘信号,并可能重塑劳动力市场中的公平性与技能价值认知。
链接: https://arxiv.org/abs/2601.13286
作者: Fabian Stephany,Ole Teutloff,Angelo Leone
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: 46 pages
Abstract:The growing adoption of artificial intelligence (AI) technologies has heightened interest in the labour market value of AI-related skills, yet causal evidence on their role in hiring decisions remains scarce. This study examines whether AI skills serve as a positive hiring signal and whether they can offset conventional disadvantages such as older age or lower formal education. We conduct an experimental survey with 1,700 recruiters from the United Kingdom and the United States. Using a paired conjoint design, recruiters evaluated hypothetical candidates represented by synthetically designed resumes. Across three occupations - graphic designer, office assistant, and software engineer - AI skills significantly increase interview invitation probabilities by approximately 8 to 15 percentage points. AI skills also partially or fully offset disadvantages related to age and lower education, with effects strongest for office assistants, where formal AI certification plays an additional compensatory role. Effects are weaker for graphic designers, consistent with more skeptical recruiter attitudes toward AI in creative work. Finally, recruiters’ own background and AI usage significantly moderate these effects. Overall, the findings demonstrate that AI skills function as a powerful hiring signal and can mitigate traditional labour market disadvantages, with implications for workers’ skill acquisition strategies and firms’ recruitment practices.
zh
[AI-234] Pixelwise Uncertainty Quantification of Accelerated MRI Reconstruction
【速读】:该论文旨在解决并行成像(Parallel Imaging)技术在提高磁共振成像(MRI)扫描效率的同时,因加速因子增加而导致图像质量下降的问题,尤其针对临床实践中缺乏自动评估欠采样重建图像诊断质量机制的痛点。解决方案的关键在于提出了一种通用的像素级不确定性量化框架,通过将共形分位数回归(Conformal Quantile Regression)与图像重建方法(如端到端变分网络)相结合,无需参考全采样图像即可生成统计上严格可靠的像素级不确定性区间,从而实现对重建质量的自动评估和不可靠区域的精准识别。
链接: https://arxiv.org/abs/2601.13236
作者: Ilias I. Giannakopoulos,Lokesh B Gautham Muthukumar,Yvonne W. Lui,Riccardo Lattanzi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
备注: 10 pages, 8 figues, 2 tables
Abstract:Parallel imaging techniques reduce magnetic resonance imaging (MRI) scan time but image quality degrades as the acceleration factor increases. In clinical practice, conservative acceleration factors are chosen because no mechanism exists to automatically assess the diagnostic quality of undersampled reconstructions. This work introduces a general framework for pixel-wise uncertainty quantification in parallel MRI reconstructions, enabling automatic identification of unreliable regions without access to any ground-truth reference image. Our method integrates conformal quantile regression with image reconstruction methods to estimate statistically rigorous pixel-wise uncertainty intervals. We trained and evaluated our model on Cartesian undersampled brain and knee data obtained from the fastMRI dataset using acceleration factors ranging from 2 to 10. An end-to-end Variational Network was used for image reconstruction. Quantitative experiments demonstrate strong agreement between predicted uncertainty maps and true reconstruction error. Using our method, the corresponding Pearson correlation coefficient was higher than 90% at acceleration levels at and above four-fold; whereas it dropped to less than 70% when the uncertainty was computed using a simpler a heuristic notion (magnitude of the residual). Qualitative examples further show the uncertainty maps based on quantile regression capture the magnitude and spatial distribution of reconstruction errors across acceleration factors, with regions of elevated uncertainty aligning with pathologies and artifacts. The proposed framework enables evaluation of reconstruction quality without access to fully-sampled ground-truth reference images. It represents a step toward adaptive MRI acquisition protocols that may be able to dynamically balance scan time and diagnostic reliability.
zh
[AI-235] Cognition spaces: natural artificial and hybrid
【速读】:该论文试图解决的问题是:当前缺乏一个统一的框架来比较自然、人工和混合系统中认知过程的形式、限制及未实现的可能性。解决方案的关键在于提出“认知空间(cognition space)”方法,将传统的依赖特定基质(substrate-dependent)的认知定义替换为基于组织结构和信息维度的比较性表征;在此框架下,认知被视作一种对信息进行感知、处理与响应的渐进能力,从而使得细胞、大脑、人工代理以及人-AI协同体等多样系统能够在同一概念景观中进行分析。该方法揭示了三种认知空间(基础无神经、神经和人-AI混合)的分布不均,指出其中存在大量未占据区域,这些空缺并非偶然,而是反映了进化偶然性、物理约束和设计局限,进而强调通过关注认知空间的结构而非分类定义,可更清晰地理解现有认知系统的多样性,并凸显混合认知作为探索超越生物进化复杂性新形式的重要前沿。
链接: https://arxiv.org/abs/2601.12837
作者: Ricard Solé,Luis F Seoane,Jordi Pla-Mauri,Michael Timothy Bennett,Michael E. Hochberg,Michael Levin
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Cognitive processes are realized across an extraordinary range of natural, artificial, and hybrid systems, yet there is no unified framework for comparing their forms, limits, and unrealized possibilities. Here, we propose a cognition space approach that replaces narrow, substrate-dependent definitions with a comparative representation based on organizational and informational dimensions. Within this framework, cognition is treated as a graded capacity to sense, process, and act upon information, allowing systems as diverse as cells, brains, artificial agents, and human-AI collectives to be analyzed within a common conceptual landscape. We introduce and examine three cognition spaces – basal aneural, neural, and human-AI hybrid – and show that their occupation is highly uneven, with clusters of realized systems separated by large unoccupied regions. We argue that these voids are not accidental but reflect evolutionary contingencies, physical constraints, and design limitations. By focusing on the structure of cognition spaces rather than on categorical definitions, this approach clarifies the diversity of existing cognitive systems and highlights hybrid cognition as a promising frontier for exploring novel forms of complexity beyond those produced by biological evolution.
zh
[AI-236] SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training ICASSP2026
【速读】:该论文旨在解决当前对比语言-音频预训练(Contrastive Language-Audio Pretraining, CLAP)模型面临的三大局限性:训练数据规模有限(通常仅数百万音频样本)、音频长度固定且较短、以及基于全局表示的对比学习目标难以捕捉细粒度音频特征。其解决方案的关键在于提出可扩展的语言-音频预训练模型(Scalable Language-Audio Pretraining, SLAP),该模型在1.09亿条音频-文本对上进行训练,支持变长音频输入,并通过单阶段联合优化对比损失、自监督损失和字幕生成损失,从而有效提升密集音频表征的学习能力。实验表明,SLAP在音频-文本检索和零样本音频分类任务上均达到新的最先进性能。
链接: https://arxiv.org/abs/2601.12594
作者: Xinhao Mei,Gael Le Lan,Haohe Liu,Zhaoheng Ni,Varun Nagaraja,Yang Liu,Yangyang Shi,Vikas Chandra
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted to ICASSP 2026
Abstract:Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short and fixed duration, which constrains their usage in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies contrastive loss with additional self-supervised and captioning losses in a single-stage training, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
zh
[AI-237] Ontology-aligned structuring and reuse of multimodal materials data and workflows towards automatic reproduction
【速读】:该论文旨在解决材料科学中计算结果可复现性差的问题,即仿真流程和参数常以非结构化文本或表格形式呈现,导致大规模数据整理与系统性比较困难。其解决方案的关键在于提出一种基于本体驱动、大语言模型(Large Language Model, LLM)辅助的自动化框架,用于从文献中提取并结构化密度泛函理论(Density Functional Theory, DFT)相关的堆垛层错能(Stacking Fault Energy, SFE)计算流程;该框架采用多阶段过滤策略结合提示工程(prompt-engineered)的LLM抽取方法,从方法部分和表格中识别关键信息,并统一映射至标准材料本体(CMSO、ASMO 和 PLDO),最终构建基于atomRDF的知识图谱,从而实现计算协议的结构化重用与SFE值的系统比较。
链接: https://arxiv.org/abs/2601.12582
作者: Sepideh Baghaee Ravari,Abril Azocar Guzman,Sarath Menon,Stefan Sandfeld,Tilmann Hickel,Markus Stricker
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 39 pages, 7 figures
Abstract:Reproducibility of computational results remains a challenge in materials science, as simulation workflows and parameters are often reported only in unstructured text and tables. While literature data are valuable for validation and reuse, the lack of machine-readable workflow descriptions prevents large-scale curation and systematic comparison. Existing text-mining approaches are insufficient to extract complete computational workflows with their associated parameters. An ontology-driven, large language model (LLM)-assisted framework is introduced for the automated extraction and structuring of computational workflows from the literature. The approach focuses on density functional theory-based stacking fault energy (SFE) calculations in hexagonal close-packed magnesium and its binary alloys, and uses a multi-stage filtering strategy together with prompt-engineered LLM extraction applied to method sections and tables. Extracted information is unified into a canonical schema and aligned with established materials ontologies (CMSO, ASMO, and PLDO), enabling the construction of a knowledge graph using atomRDF. The resulting knowledge graph enables systematic comparison of reported SFE values and supports the structured reuse of computational protocols. While full computational reproducibility is still constrained by missing or implicit metadata, the framework provides a foundation for organizing and contextualizing published results in a semantically interoperable form, thereby improving transparency and reusability of computational materials data.
zh
[AI-238] Primate-like perceptual decision making emerges through deep recurrent reinforcement learning
【速读】:该论文试图解决的问题是:为什么灵长类动物演化出特定的决策机制,这些机制如何在噪声环境中实现最优奖励最大化。解决方案的关键在于使用强化学习训练端到端的深度循环神经网络(deep recurrent neural network),使其在噪声感知辨别任务中自主学习出类似灵长类的决策能力,包括速度-准确率权衡和面对新信息时灵活改变认知决策的能力;网络内部动力学分析表明,其决策机制与灵长类神经生理学研究中观察到的机制高度一致,从而为驱动灵长类灵活决策能力演化的关键选择压力提供了实验支持。
链接: https://arxiv.org/abs/2601.12577
作者: Nathan J. Wispinski,Scott A. Stone,Anthony Singhal,Patrick M. Pilarski,Craig S. Chapman
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:
Abstract:Progress has led to a detailed understanding of the neural mechanisms that underlie decision making in primates. However, less is known about why such mechanisms are present in the first place. Theory suggests that primate decision making mechanisms, and their resultant behavioral abilities, emerged to maximize reward in the face of noisy, temporally evolving information. To test this theory, we trained an end-to-end deep recurrent neural network using reinforcement learning on a noisy perceptual discrimination task. Networks learned several key abilities of primate-like decision making including trading off speed for accuracy, and flexibly changing their mind in the face of new information. Internal dynamics of these networks suggest that these abilities were supported by similar decision mechanisms as those observed in primate neurophysiological studies. These results provide experimental support for key pressures that gave rise to the primate ability to make flexible decisions.
zh
[AI-239] Artificial Intelligence in Materials Science and Engineering: Current Landscape Key Challenges and Future Trajectorie
【速读】:该论文旨在解决材料科学与工程领域中如何有效利用人工智能(Artificial Intelligence, AI)技术来应对研究复杂性、加速新材料发现并优化材料设计的问题。其解决方案的关键在于系统性地整合机器学习方法(包括传统算法与深度学习架构如卷积神经网络(Convolutional Neural Networks, CNNs)、图神经网络(Graph Neural Networks, GNNs)及Transformer模型),以及新兴的生成式AI和概率模型(如高斯过程(Gaussian Processes)用于不确定性量化),并通过高质量的数据表示与特征化策略(涵盖组成、结构、图像及语言启发式方法)实现模型性能的提升,从而推动AI在材料研究中的落地应用。
链接: https://arxiv.org/abs/2601.12554
作者: Iman Peivaste,Salim Belouettar,Francesco Mercuri,Nicholas Fantuzzi,Hamidreza Dehghani,Razieh Izadi,Halliru Ibrahim,Jakub Lengiewicz,Maël Belouettar-Mathis,Kouider Bendine,Ahmed Makradi,Martin Hörsch,Peter Klein,Mohamed El Hachemi,Heinz A. Preisig,Yacine Rezgui,Natalia Konchakova,Ali Daouadji
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
备注:
Abstract:Artificial Intelligence is rapidly transforming materials science and engineering, offering powerful tools to navigate complexity, accelerate discovery, and optimize material design in ways previously unattainable. Driven by the accelerating pace of algorithmic advancements and increasing data availability, AI is becoming an essential competency for materials researchers. This review provides a comprehensive and structured overview of the current landscape, synthesizing recent advancements and methodologies for materials scientists seeking to effectively leverage these data-driven techniques. We survey the spectrum of machine learning approaches, from traditional algorithms to advanced deep learning architectures, including CNNs, GNNs, and Transformers, alongside emerging generative AI and probabilistic models such as Gaussian Processes for uncertainty quantification. The review also examines the pivotal role of data in this field, emphasizing how effective representation and featurization strategies, spanning compositional, structural, image-based, and language-inspired approaches, combined with appropriate preprocessing, fundamentally underpin the performance of machine learning models in materials research. Persistent challenges related to data quality, quantity, and standardization, which critically impact model development and application in materials science and engineering, are also addressed.
zh
[AI-240] Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition ICASSP2026
【速读】:该论文旨在解决音频-视觉语音识别(Audio-Visual Speech Recognition, AVSR)在高噪声环境下因噪声干扰导致特征融合质量下降的问题。现有方法通常依赖掩码(mask)策略过滤音频噪声,但容易误删与语义相关的信息,影响识别准确性。其解决方案的关键在于提出一种端到端的噪声鲁棒AVSR框架,结合语音增强机制,通过基于Conformer的瓶颈融合模块(bottleneck fusion module)隐式地利用视频信息对噪声音频特征进行重构,从而减少模态冗余并增强跨模态交互,在不显式生成噪声掩码的前提下有效保留语音语义完整性,显著提升复杂噪声场景下的识别性能。
链接: https://arxiv.org/abs/2601.12436
作者: Linzhi Wu,Xingyu Zhang,Hao Yuan,Yakun Zhang,Changyan Zheng,Liang Xie,Tiejun Liu,Erwei Yin
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
备注: Accepted by ICASSP2026
Abstract:Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic integrity to achieve robust recognition performance. Experimental evaluations on the public LRS3 benchmark suggest that our method outperforms prior advanced mask-based baselines under noisy conditions.
zh
[AI-241] How Well Do LLM s Predict Human Behavior? A Measure of their Pretrained Knowledge
【速读】:该论文旨在解决如何量化预训练大语言模型(Large Language Models, LLMs)在预测人类行为任务中所携带的有用知识的问题。传统方法难以评估LLM在特定领域任务中的信息价值,因此作者提出“等效样本量”(equivalent sample size)作为衡量指标——即为达到与LLM相当的预测精度,所需的任务特定数据量。解决方案的关键在于通过对比固定LLM在目标领域的预测误差与在不断增加的领域特定数据上训练的灵活机器学习模型的预测误差,从而估算该等效样本量;同时,作者还构建了基于交叉验证预测误差的新渐近理论,以支持统计推断。实证应用表明,LLM对某些经济变量具有显著预测能力,而对另一些则贡献有限,说明其替代领域数据的价值因任务而异。
链接: https://arxiv.org/abs/2601.12343
作者: Wayne Gao,Sukjin Han,Annie Liang
机构: 未知
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Large language models (LLMs) are increasingly used to predict human behavior. We propose a measure for evaluating how much knowledge a pretrained LLM brings to such a prediction: its equivalent sample size, defined as the amount of task-specific data needed to match the predictive accuracy of the LLM. We estimate this measure by comparing the prediction error of a fixed LLM in a given domain to that of flexible machine learning models trained on increasing samples of domain-specific data. We further provide a statistical inference procedure by developing a new asymptotic theory for cross-validated prediction error. Finally, we apply this method to the Panel Study of Income Dynamics. We find that LLMs encode considerable predictive information for some economic variables but much less for others, suggesting that their value as substitutes for domain-specific data differs markedly across settings.
zh
[AI-242] A New Strategy for Artificial Intelligence: Training Foundation Models Directly on Human Brain Data
【速读】:该论文试图解决当前基础模型(foundation models)依赖人类生成数据(如文本)作为知识来源所带来的局限性,尤其是这些数据仅反映人类认知的表层统计规律,而无法触及更深层的认知机制。其核心问题在于:如何利用神经影像学数据(neuroimaging data)来增强基础模型对人类高级认知过程的理解,从而突破现有模型在感知、估值、执行和整合四个层级上的瓶颈。解决方案的关键在于提出两种新方法——基于人类大脑的强化学习(reinforcement learning from human brain, RLHB)与基于人类大脑的思想链(chain of thought from human brain, CoTHB),通过战略性地使用有限的神经影像数据,优先训练模型中高价值的认知步骤,使模型不仅能从行为数据中学习,还能直接从大脑活动模式中提取不可见但关键的认知结构,为迈向更具泛化能力的人工通用智能(AGI)提供一条可行路径。
链接: https://arxiv.org/abs/2601.12053
作者: Maël Donoso
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While foundation models have achieved remarkable results across a diversity of domains, they still rely on human-generated data, such as text, as a fundamental source of knowledge. However, this data is ultimately the product of human brains, the filtered projection of a deeper neural complexity. In this paper, we explore a new strategy for artificial intelligence: moving beyond surface-level statistical regularities by training foundation models directly on human brain data. We hypothesize that neuroimaging data could open a window into elements of human cognition that are not accessible through observable actions, and argue that this additional knowledge could be used, alongside classical training data, to overcome some of the current limitations of foundation models. While previous research has demonstrated the possibility to train classical machine learning or deep learning models on neural patterns, this path remains largely unexplored for high-level cognitive functions. Here, we classify the current limitations of foundation models, as well as the promising brain regions and cognitive processes that could be leveraged to address them, along four levels: perception, valuation, execution, and integration. Then, we propose two methods that could be implemented to prioritize the use of limited neuroimaging data for strategically chosen, high-value steps in foundation model training: reinforcement learning from human brain (RLHB) and chain of thought from human brain (CoTHB). We also discuss the potential implications for agents, artificial general intelligence, and artificial superintelligence, as well as the ethical, social, and technical challenges and opportunities. We argue that brain-trained foundation models could represent a realistic and effective middle ground between continuing to scale current architectures and exploring alternative, neuroscience-inspired solutions.
zh
[AI-243] Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music ICPR
【速读】:该论文旨在解决神经语音合成中基频(F₀)估计与声门激励(voicing)推断的可靠性问题,尤其针对现有方法依赖大规模标注数据且在存在真实录音伪影时性能下降的局限性。其解决方案的关键在于提出一种轻量级、完全自监督的框架,通过在CQT(常数Q变换)特征上采用平移等变学习(transposition-equivariant learning),并引入一种基于期望最大化(EM)风格的迭代加权机制,利用Shift Cross-Entropy(SCE)一致性作为可靠性信号来抑制无信息噪声帧或非发声帧;该机制生成的权重可作为置信度分数,用于对一个独立的轻量级声门激励分类器进行伪标签训练,无需人工标注即可实现高精度的联合F₀与voicing估计,从而在有限音频数据下实现快速单乐器训练和跨乐器泛化能力。
链接: https://arxiv.org/abs/2601.11768
作者: Venkat Suprabath Bitra,Homayoon Beigi
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
备注: 12 pages, 6 figures, 3 tables, and an appendix, Accepted for publication at ICPRAM 2026 in Marbella, Spain, on March 2, 2026
Abstract:Reliable fundamental frequency (F 0) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F 0 estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
zh
[AI-244] Inter-Cell Interference Rejection Based on Ultrawideband Walsh-Domain Wireless Autoencoding
【速读】:该论文旨在解决超宽带(UWB)通信系统中来自共存窄带5G基站的部分带内小区间干扰(ICI)问题。解决方案的关键在于设计一种端到端无线自编码器架构,该架构在Walsh域中联合优化收发端的编码/解码过程,利用Walsh函数的正交性和自逆特性,将比特字分布到并行的Walsh分支上进行编码学习,从而实现对5G CPOFDM干扰的有效抑制。通过理论建模与仿真分析,识别出最优的传输频率与采样率比例,使自编码器在保持低块误码率(BLER)的同时,实现高达12 dB的ICI抑制性能。
链接: https://arxiv.org/abs/2601.11713
作者: Rodney Martinez Alonso,Cel Thys,Cedric Dehos,Yuneisy Esthela Garcia Guzman,Sofie Pollin
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注: This preprint was submitted to The 2026 EuCNC 6G Summit
Abstract:This paper proposes a novel technique for rejecting partial-in-band inter-cell interference (ICI) in ultrawideband communication systems. We present the design of an end-to-end wireless autoencoder architecture that jointly optimizes the transmitter and receiver encoding/decoding in the Walsh domain to mitigate interference from coexisting narrower-band 5G base stations. By exploiting the orthogonality and self-inverse properties of Walsh functions, the system distributes and learns to encode bit-words across parallel Walsh branches. Through analytical modeling and simulation, we characterize how 5G CPOFDM interference maps into the Walsh domain and identify optimal ratios of transmission frequencies and sampling rate where the end-to-end autoencoder achieves the highest rejection. Experimental results show that the proposed autoencoder achieves up to 12 dB of ICI rejection while maintaining a low block error rate (BLER) for the same baseline channel noise, i.e., baseline Signal-to-Noise-Ratio (SNR) without the interference.
zh
[AI-245] Large Language Model Agent for User-friendly Chemical Process Simulations
【速读】:该论文旨在解决现代过程模拟软件(如AVEVA Process Simulation, APS)在使用过程中存在的两大问题:一是构建和解释复杂模拟模型耗时且依赖专家知识,限制了初学者的早期探索;二是缺乏自然语言交互能力,难以实现高效的人机协作。解决方案的关键在于将大语言模型(Large Language Model, LLM)代理通过Model Context Protocol (MCP) 与APS集成,借助MCP服务器工具集实现LLM通过Python编程接口对APS进行程序化调用,从而能够根据自然语言指令执行复杂的模拟任务。该框架支持从自动分析流程图、迭代优化到自主合成流程方案等多种场景,显著提升了易用性与效率,尤其适用于教育场景中的概念阐释与实践训练,以及专业人员的数据提取自动化和头脑风暴辅助。
链接: https://arxiv.org/abs/2601.11650
作者: Jingkang Liang,Niklas Groll,Gürkan Sin
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern process simulators enable detailed process design, simulation, and optimization; however, constructing and interpreting simulations is time-consuming and requires expert knowledge. This limits early exploration by inexperienced users. To address this, a large language model (LLM) agent is integrated with AVEVA Process Simulation (APS) via Model Context Protocol (MCP), allowing natural language interaction with rigorous process simulations. An MCP server toolset enables the LLM to communicate programmatically with APS using Python, allowing it to execute complex simulation tasks from plain-language instructions. Two water-methanol separation case studies assess the framework across different task complexities and interaction modes. The first shows the agent autonomously analyzing flowsheets, finding improvement opportunities, and iteratively optimizing, extracting data, and presenting results clearly. The framework benefits both educational purposes, by translating technical concepts and demonstrating workflows, and experienced practitioners by automating data extraction, speeding routine tasks, and supporting brainstorming. The second case study assesses autonomous flowsheet synthesis through both a step-by-step dialogue and a single prompt, demonstrating its potential for novices and experts alike. The step-by-step mode gives reliable, guided construction suitable for educational contexts; the single-prompt mode constructs fast baseline flowsheets for later refinement. While current limitations such as oversimplification, calculation errors, and technical hiccups mean expert oversight is still needed, the framework’s capabilities in analysis, optimization, and guided construction suggest LLM-based agents can become valuable collaborators.
zh
机器学习
[LG-0] Spatiotemporal Wildfire Prediction and Reinforcement Learning for Helitack Suppression WWW ICML
链接: https://arxiv.org/abs/2601.14238
作者: Shaurya Mathur,Shreyas Bellary Manjunath,Nitin Kulkarni,Alina Vereshchaka
类目: Machine Learning (cs.LG)
*备注: 6 pages, 5 figures (two of them in tables), Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): this https URL
Abstract:Wildfires are growing in frequency and intensity, devastating ecosystems and communities while causing billions of dollars in suppression costs and economic damage annually in the U.S. Traditional wildfire management is mostly reactive, addressing fires only after they are detected. We introduce \textitFireCastRL, a proactive artificial intelligence (AI) framework that combines wildfire forecasting with intelligent suppression strategies. Our framework first uses a deep spatiotemporal model to predict wildfire ignition. For high-risk predictions, we deploy a pre-trained reinforcement learning (RL) agent to execute real-time suppression tactics with helitack units inside a physics-informed 3D simulation. The framework generates a threat assessment report to help emergency responders optimize resource allocation and planning. In addition, we are publicly releasing a large-scale, spatiotemporal dataset containing \mathbf9.5 million samples of environmental variables for wildfire prediction. Our work demonstrates how deep learning and RL can be combined to support both forecasting and tactical wildfire response. More details can be found at this https URL.
[LG-1] Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment WWW ICML
链接: https://arxiv.org/abs/2601.14228
作者: Punit Kumar,Vaibhav Saran,Divyesh Patel,Nitin Kulkarni,Alina Vereshchaka
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): this https URL
Abstract:Sepsis remains one of the leading causes of mortality in intensive care units, where timely and accurate treatment decisions can significantly impact patient outcomes. In this work, we propose an interpretable decision support framework. Our system integrates four core components: (1) a clustering-based stratification module that categorizes patients into low, intermediate, and high-risk groups upon ICU admission, using clustering with statistical validation; (2) a synthetic data augmentation pipeline leveraging variational autoencoders (VAE) and diffusion models to enrich underrepresented trajectories such as fluid or vasopressor administration; (3) an offline reinforcement learning (RL) agent trained using Advantage Weighted Regression (AWR) with a lightweight attention encoder and supported by an ensemble models for conservative, safety-aware treatment recommendations; and (4) a rationale generation module powered by a multi-modal large language model (LLM), which produces natural-language justifications grounded in clinical context and retrieved expert knowledge. Evaluated on the MIMIC-III and eICU datasets, our approach achieves high treatment accuracy while providing clinicians with interpretable and robust policy recommendations.
[LG-2] Differentiated Pickup Point Offering for Emission Reduction in Last-Mile Delivery
链接: https://arxiv.org/abs/2601.14196
作者: Albina Galiullina,Wouter van Heeswijk,Tom van Woensel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pickup points are widely recognized as a sustainable alternative to home delivery, as consolidating orders at pickup locations can shorten delivery routes and improve first-attempt success rates. However, these benefits may be negated when customers drive to pick up their orders. This study proposes a Differentiated Pickup Point Offering (DPO) policy that aims to jointly reduce emissions from delivery truck routes and customer travel. Under DPO, each arriving customer is offered a single recommended pickup point, rather than an unrestricted choice among all locations, while retaining the option of home delivery. We study this problem in a dynamic and stochastic setting, where the pickup point offered to each customer depends on previously realized customer locations and delivery choices. To design effective DPO policies, we adopt a reinforcement learning-based approach that accounts for spatial relationships between customers and pickup points and their implications for future route consolidation. Computational experiments show that differentiated pickup point offerings can substantially reduce total carbon emissions. The proposed policies reduce total emissions by up to 9% relative to home-only delivery and by 2% on average compared with alternative policies, including unrestricted pickup point choice and nearest pickup point assignment. Differentiated offerings are particularly effective in dense urban settings with many pickup points and short inter-location distances. Moreover, explicitly accounting for the dynamic nature of customer arrivals and choices is especially important when customers are less inclined to choose pickup point delivery over home delivery.
[LG-3] Penalizing Localized Dirichlet Energies in Low Rank Tensor Products
链接: https://arxiv.org/abs/2601.14173
作者: Paris A. Karakasis,Nicholas D. Sidiropoulos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 19 pages
Abstract:We study low-rank tensor-product B-spline (TPBS) models for regression tasks and investigate Dirichlet energy as a measure of smoothness. We show that TPBS models admit a closed-form expression for the Dirichlet energy, and reveal scenarios where perfect interpolation is possible with exponentially small Dirichlet energy. This renders global Dirichlet energy-based regularization ineffective. To address this limitation, we propose a novel regularization strategy based on local Dirichlet energies defined on small hypercubes centered at the training points. Leveraging pretrained TPBS models, we also introduce two estimators for inference from incomplete samples. Comparative experiments with neural networks demonstrate that TPBS models outperform neural networks in the overfitting regime for most datasets, and maintain competitive performance otherwise. Overall, TPBS models exhibit greater robustness to overfitting and consistently benefit from regularization, while neural networks are more sensitive to overfitting and less effective in leveraging regularization.
[LG-4] Optimizing Energy and Data Collection in UAV-aided IoT Networks using Attention-based Multi-Objective Reinforcement Learning
链接: https://arxiv.org/abs/2601.14092
作者: Babacar Toure,Dimitrios Tsilimantos,Omid Esrafilian,Marios Kountouris
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Due to their adaptability and mobility, Unmanned Aerial Vehicles (UAVs) are becoming increasingly essential for wireless network services, particularly for data harvesting tasks. In this context, Artificial Intelligence (AI)-based approaches have gained significant attention for addressing UAV path planning tasks in large and complex environments, bridging the gap with real-world deployments. However, many existing algorithms suffer from limited training data, which hampers their performance in highly dynamic environments. Moreover, they often overlook the inherently multi-objective nature of the task, treating it in an overly simplistic manner. To address these limitations, we propose an attention-based Multi-Objective Reinforcement Learning (MORL) architecture that explicitly handles the trade-off between data collection and energy consumption in urban environments, even without prior knowledge of wireless channel conditions. Our method develops a single model capable of adapting to varying trade-off preferences and dynamic scenario parameters without the need for fine-tuning or retraining. Extensive simulations show that our approach achieves substantial improvements in performance, model compactness, sample efficiency, and most importantly, generalization to previously unseen scenarios, outperforming existing RL solutions.
[LG-5] SecureSplit: Mitigating Backdoor Attacks in Split Learning
链接: https://arxiv.org/abs/2601.14054
作者: Zhihao Dou,Dongfei Cui,Weida Wang,Anjun Gao,Yueyang Quan,Mengyao Ma,Viet Vo,Guangdong Bai,Zhuqing Liu,Minghong Fang
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: To appear in The Web Conference 2026
Abstract:Split Learning (SL) offers a framework for collaborative model training that respects data privacy by allowing participants to share the same dataset while maintaining distinct feature sets. However, SL is susceptible to backdoor attacks, in which malicious clients subtly alter their embeddings to insert hidden triggers that compromise the final trained model. To address this vulnerability, we introduce SecureSplit, a defense mechanism tailored to SL. SecureSplit applies a dimensionality transformation strategy to accentuate subtle differences between benign and poisoned embeddings, facilitating their separation. With this enhanced distinction, we develop an adaptive filtering approach that uses a majority-based voting scheme to remove contaminated embeddings while preserving clean ones. Rigorous experiments across four datasets (CIFAR-10, MNIST, CINIC-10, and ImageNette), five backdoor attack scenarios, and seven alternative defenses confirm the effectiveness of SecureSplit under various challenging conditions.
[LG-6] PAC-Private Responses with Adversarial Composition
链接: https://arxiv.org/abs/2601.14033
作者: Xiaochen Zhu,Mayuri Sridhar,Srinivas Devadas
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 16 pages, 3 figures
Abstract:Modern machine learning models are increasingly deployed behind APIs. This renders standard weight-privatization methods (e.g. DP-SGD) unnecessarily noisy at the cost of utility. While model weights may vary significantly across training datasets, model responses to specific inputs are much lower dimensional and more stable. This motivates enforcing privacy guarantees directly on model outputs. We approach this under PAC privacy, which provides instance-based privacy guarantees for arbitrary black-box functions by controlling mutual information (MI). Importantly, PAC privacy explicitly rewards output stability with reduced noise levels. However, a central challenge remains: response privacy requires composing a large number of adaptively chosen, potentially adversarial queries issued by untrusted users, where existing composition results on PAC privacy are inadequate. We introduce a new algorithm that achieves adversarial composition via adaptive noise calibration and prove that mutual information guarantees accumulate linearly under adaptive and adversarial querying. Experiments across tabular, vision, and NLP tasks show that our method achieves high utility at extremely small per-query privacy budgets. On CIFAR-10, we achieve 87.79% accuracy with a per-step MI budget of 2^-32 . This enables serving one million queries while provably bounding membership inference attack (MIA) success rates to 51.08% – the same guarantee of (0.04, 10^-5) -DP. Furthermore, we show that private responses can be used to label public data to distill a publishable privacy-preserving model; using an ImageNet subset as a public dataset, our model distilled from 210,000 responses achieves 91.86% accuracy on CIFAR-10 with MIA success upper-bounded by 50.49%, which is comparable to (0.02,10^-5) -DP. Comments: 16 pages, 3 figures Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2601.14033 [cs.LG] (or arXiv:2601.14033v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.14033 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-7] Universal Approximation Theorem for Input-Connected Multilayer Perceptrons
链接: https://arxiv.org/abs/2601.14026
作者: Vugar Ismailov
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Functional Analysis (math.FA)
*备注: 18 pages, 2 figures, 31 references
Abstract:We introduce the Input-Connected Multilayer Perceptron (IC-MLP), a feedforward neural network architecture in which each hidden neuron receives, in addition to the outputs of the preceding layer, a direct affine connection from the raw input. We first study this architecture in the univariate setting and give an explicit and systematic description of IC-MLPs with an arbitrary finite number of hidden layers, including iterated formulas for the network functions. In this setting, we prove a universal approximation theorem showing that deep IC-MLPs can approximate any continuous function on a closed interval of the real line if and only if the activation function is nonlinear. We then extend the analysis to vector-valued inputs and establish a corresponding universal approximation theorem for continuous functions on compact subsets of \mathbbR^n .
[LG-8] Auditory Brain Passage Retrieval: Cross-Sensory EEG Training for Neural Information Retrieval ECIR2026
链接: https://arxiv.org/abs/2601.14001
作者: Niall McGuire,Yashar Moshfeghi
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted At ECIR 2026
Abstract:Query formulation from internal information needs remains fundamentally challenging across all Information Retrieval paradigms due to cognitive complexity and physical impairments. Brain Passage Retrieval (BPR) addresses this by directly mapping EEG signals to passage representations without intermediate text translation. However, existing BPR research exclusively uses visual stimuli, leaving critical questions unanswered: Can auditory EEG enable effective retrieval for voice-based interfaces and visually impaired users? Can training on combined EEG datasets from different sensory modalities improve performance despite severe data scarcity? We present the first systematic investigation of auditory EEG for BPR and evaluate cross-sensory training benefits. Using dual encoder architectures with four pooling strategies (CLS, mean, max, multi-vector), we conduct controlled experiments comparing auditory-only, visual-only, and combined training on the Alice (auditory) and Nieuwland (visual) datasets. Results demonstrate that auditory EEG consistently outperforms visual EEG, and cross-sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). Critically, combined auditory EEG models surpass BM25 text baselines (MRR: 0.474 vs 0.428), establishing neural queries as competitive with traditional retrieval whilst enabling accessible interfaces. These findings validate auditory neural interfaces for IR tasks and demonstrate that cross-sensory training addresses data scarcity whilst outperforming single-modality approaches Code: this https URL
[LG-9] Group-Invariant Unsupervised Skill Discovery: Symmetry-aware Skill Representations for Generalizable Behavior
链接: https://arxiv.org/abs/2601.14000
作者: Junwoo Chang,Joseph Park,Roberto Horowitz,Jongmin Lee,Jongeun Choi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 14 pages, 6 figures
Abstract:Unsupervised skill discovery aims to acquire behavior primitives that improve exploration and accelerate downstream task learning. However, existing approaches often ignore the geometric symmetries of physical environments, leading to redundant behaviors and sample inefficiency. To address this, we introduce Group-Invariant Skill Discovery (GISD), a framework that explicitly embeds group structure into the skill discovery objective. Our approach is grounded in a theoretical guarantee: we prove that in group-symmetric environments, the standard Wasserstein dependency measure admits a globally optimal solution comprised of an equivariant policy and a group-invariant scoring function. Motivated by this, we formulate the Group-Invariant Wasserstein dependency measure, which restricts the optimization to this symmetry-aware subspace without loss of optimality. Practically, we parameterize the scoring function using a group Fourier representation and define the intrinsic reward via the alignment of equivariant latent features, ensuring that the discovered skills generalize systematically under group transformations. Experiments on state-based and pixel-based locomotion benchmarks demonstrate that GISD achieves broader state-space coverage and improved efficiency in downstream task learning compared to a strong baseline.
[LG-10] A universal linearized subspace refinement framework for neural networks
链接: https://arxiv.org/abs/2601.13989
作者: Wenbo Cao,Weiwei Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural networks are predominantly trained using gradient-based methods, yet in many applications their final predictions remain far from the accuracy attainable within the model’s expressive capacity. We introduce Linearized Subspace Refinement (LSR), a general and architecture-agnostic framework that exploits the Jacobian-induced linear residual model at a fixed trained network state. By solving a reduced direct least-squares problem within this subspace, LSR computes a subspace-optimal solution of the linearized residual model, yielding a refined linear predictor with substantially improved accuracy over standard gradient-trained solutions, without modifying network architectures, loss formulations, or training procedures. Across supervised function approximation, data-driven operator learning, and physics-informed operator fine-tuning, we show that gradient-based training often fails to access this attainable accuracy, even when local linearization yields a convex problem. This observation indicates that loss-induced numerical ill-conditioning, rather than nonconvexity or model expressivity, can constitute a dominant practical bottleneck. In contrast, one-shot LSR systematically exposes accuracy levels not fully exploited by gradient-based training, frequently achieving order-of-magnitude error reductions. For operator-constrained problems with composite loss structures, we further introduce Iterative LSR, which alternates one-shot LSR with supervised nonlinear alignment, transforming ill-conditioned residual minimization into numerically benign fitting steps and yielding accelerated convergence and improved accuracy. By bridging nonlinear neural representations with reduced-order linear solvers at fixed linearization points, LSR provides a numerically grounded and broadly applicable refinement framework for supervised learning, operator learning, and scientific computing.
[LG-11] Differentiable Logic Synthesis: Spectral Coefficient Selection via Sinkhorn-Constrained Composition
链接: https://arxiv.org/abs/2601.13953
作者: Gorgi Pavlov
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Logic in Computer Science (cs.LO)
*备注: 35 pages, 22 figures. Code available at this https URL
Abstract:Learning precise Boolean logic via gradient descent remains challenging: neural networks typically converge to “fuzzy” approximations that degrade under quantization. We introduce Hierarchical Spectral Composition, a differentiable architecture that selects spectral coefficients from a frozen Boolean Fourier basis and composes them via Sinkhorn-constrained routing with column-sign modulation. Our approach draws on recent insights from Manifold-Constrained Hyper-Connections (mHC), which demonstrated that projecting routing matrices onto the Birkhoff polytope preserves identity mappings and stabilizes large-scale training. We adapt this framework to logic synthesis, adding column-sign modulation to enable Boolean negation – a capability absent in standard doubly stochastic routing. We validate our approach across four phases of increasing complexity: (1) For n=2 (16 Boolean operations over 4-dim basis), gradient descent achieves 100% accuracy with zero routing drift and zero-loss quantization to ternary masks. (2) For n=3 (10 three-variable operations), gradient descent achieves 76% accuracy, but exhaustive enumeration over 3^8 = 6561 configurations proves that optimal ternary masks exist for all operations (100% accuracy, 39% sparsity). (3) For n=4 (10 four-variable operations over 16-dim basis), spectral synthesis – combining exact Walsh-Hadamard coefficients, ternary quantization, and MCMC refinement with parallel tempering – achieves 100% accuracy on all operations. This progression establishes (a) that ternary polynomial threshold representations exist for all tested functions, and (b) that finding them requires methods beyond pure gradient descent as dimensionality grows. All operations enable single-cycle combinational logic inference at 10,959 MOps/s on GPU, demonstrating viability for hardware-efficient neuro-symbolic logic synthesis. Comments: 35 pages, 22 figures. Code available at this https URL Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Logic in Computer Science (cs.LO) ACMclasses: I.2.6; F.2.2; C.1.3 Cite as: arXiv:2601.13953 [cs.LG] (or arXiv:2601.13953v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.13953 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-12] Efficient Coordination with the System-Level Shared State: An Embodied-AI Native Modular Framework
链接: https://arxiv.org/abs/2601.13945
作者: Yixuan Deng,Tongrun Wu,Donghao Wu,Zeyu Wei,Jiayuan Wang,Zhenglong Sun,Yuqing Tang,Xiaoqiang Ji
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:As Embodied AI systems move from research prototypes to real world deployments, they tend to evolve rapidly while remaining reliable under workload changes and partial failures. In practice, many deployments are only partially decoupled: middleware moves messages, but shared context and feedback semantics are implicit, causing interface drift, cross-module interference, and brittle recovery at scale. We present ANCHOR, a modular framework that makes decoupling and robustness explicit system-level primitives. ANCHOR separates (i) Canonical Records, an evolvable contract for the standardized shared state, from (ii) a communication bus for many-to-many dissemination and feedback-oriented coordination, forming an inspectable end-to-end loop. We validate closed-loop feasibility on a de-identified workflow instantiation, characterize latency distributions under varying payload sizes and publish rates, and demonstrate automatic stream resumption after hard crashes and restarts even with shared-memory loss. Overall, ANCHOR turns ad-hoc integration glue into explicit contracts, enabling controlled degradation under load and self-healing recovery for scalable deployment of closed-loop AI systems.
[LG-13] owards Effective Negation Modeling in Joint Audio-Text Models for Music ICASSP
链接: https://arxiv.org/abs/2601.13931
作者: Yannis Vasilakis,Rachel Bittner,Johan Pauwels
类目: ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Abstract:Joint audio-text models are widely used for music retrieval, yet they struggle with semantic phenomena such as negation. Negation is fundamental for distinguishing the absence (or presence) of musical elements (e.g., “with vocals” vs. “without vocals”), but current systems fail to represent this reliably. In this work, we investigate and mitigate this limitation by training CLAP models from scratch on the Million Song Dataset with LP-MusicCaps-MSD captions. We introduce negation through text augmentation and a dissimilarity-based contrastive loss, designed to explicitly separate original and negated captions in the joint embedding space. To evaluate progress, we propose two protocols that frame negation modeling as retrieval and binary classification tasks. Experiments demonstrate that both methods, individually and combined, improve negation handling while largely preserving retrieval performance.
[LG-14] Multi-Objective Hierarchical Optimization with Large Language Models
链接: https://arxiv.org/abs/2601.13892
作者: Andrej Schwanke,Lyubomir Ivanov,David Salinas,Frank Hutter,Arber Zela
类目: Machine Learning (cs.LG)
*备注: 23 pages, 21 figures, 9 tables
Abstract:Despite their widespread adoption in various domains, especially due to their powerful reasoning capabilities, Large Language Models (LLMs) are not the off-the-shelf choice to drive multi-objective optimization yet. Conventional strategies rank high in benchmarks due to their intrinsic capabilities to handle numerical inputs and careful modelling choices that balance exploration and Pareto-front exploitation, as well as handle multiple (conflicting) objectives. In this paper, we close this gap by leveraging LLMs as surrogate models and candidate samplers inside a structured hierarchical search strategy. By adaptively partitioning the input space into disjoint hyperrectangular regions and ranking them with a composite score function, we restrict the generative process of the LLM to specific, high-potential sub-spaces, hence making the problem easier to solve as the LLM doesn’t have to reason about the global structure of the problem, but only locally instead. We show that under standard regularity assumptions, our algorithm generates candidate solutions that converge to the true Pareto set in Hausdorff distance. Empirically, it consistently outperforms the global LLM-based multi-objective optimizer and is on par with standard evolutionary and Bayesian optimization algorithm on synthetic and real-world benchmarks.
[LG-15] Inverting Self-Organizing Maps: A Unified Activation-Based Framework
链接: https://arxiv.org/abs/2601.13851
作者: Alessandro Londei,Matteo Benati,Denise Lanzieri,Vittorio Loreto
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Self-Organizing Maps provide topology-preserving projections of high-dimensional data and have been widely used for visualization, clustering, and vector quantization. In this work, we show that the activation pattern of a SOM - the squared distances to its prototypes - can be inverted to recover the exact input under mild geometric conditions. This follows from a classical fact in Euclidean distance geometry: a point in D dimensions is uniquely determined by its distances to D+1 affinely independent references. We derive the corresponding linear system and characterize the conditions under which the inversion is well-posed. Building upon this mechanism, we introduce the Manifold-Aware Unified SOM Inversion and Control (MUSIC) update rule, which enables controlled, semantically meaningful trajectories in latent space. MUSIC modifies squared distances to selected prototypes while preserving others, resulting in a deterministic geometric flow aligned with the SOM’s piecewise-linear structure. Tikhonov regularization stabilizes the update rule and ensures smooth motion on high-dimensional datasets. Unlike variational or probabilistic generative models, MUSIC does not rely on sampling, latent priors, or encoder-decoder architectures. If no perturbation is applied, inversion recovers the exact input; when a target cluster or prototype is specified, MUSIC produces coherent semantic variations while remaining on the data manifold. This leads to a new perspective on data augmentation and controllable latent exploration based solely on prototype geometry. We validate the approach using synthetic Gaussian mixtures, the MNIST and the Faces in the Wild dataset. Across all settings, MUSIC produces smooth, interpretable trajectories that reveal the underlying geometry of the learned manifold, illustrating the advantages of SOM-based inversion over unsupervised clustering.
[LG-16] Optimal L2 Regularization in High-dimensional Continual Linear Regression ALT2026
链接: https://arxiv.org/abs/2601.13844
作者: Gilad Karpel,Edward Moroshko,Ran Levinstein,Ron Meir,Daniel Soudry,Itay Evron
类目: Machine Learning (cs.LG)
*备注: Accepted to ALT 2026
Abstract:We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks T , specifically as T/\ln T . To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.
[LG-17] ELSA: Efficient LLM -Centric Split Aggregation for Privacy-Aware Hierarchical Federated Learning over Resource-Constrained Edge Networks
链接: https://arxiv.org/abs/2601.13824
作者: Xiaohong Yang,Tong Xie,Minghui Liwang,Chikai Shang,Yang Lu,Zhenzhen Jiao,Liqun Fu,Seyyedali Hosseinalipour
类目: Machine Learning (cs.LG)
*备注: 11 pages, 16 figures
Abstract:Training large language models (LLMs) at the network edge faces fundamental challenges arising from device resource constraints, severe data heterogeneity, and heightened privacy risks. To address these, we propose ELSA (Efficient LLM-centric Split Aggregation), a novel framework that systematically integrates split learning (SL) and hierarchical federated learning (HFL) for distributed LLM fine-tuning over resource-constrained edge networks. ELSA introduces three key innovations. First, it employs a task-agnostic, behavior-aware client clustering mechanism that constructs semantic fingerprints using public probe inputs and symmetric KL divergence, further enhanced by prediction-consistency-based trust scoring and latency-aware edge assignment to jointly address data heterogeneity, client unreliability, and communication constraints. Second, it splits the LLM into three parts across clients and edge servers, with the cloud used only for adapter aggregation, enabling an effective balance between on-device computation cost and global convergence stability. Third, it incorporates a lightweight communication scheme based on computational sketches combined with semantic subspace orthogonal perturbation (SS-OP) to reduce communication overhead while mitigating privacy leakage during model exchanges. Experiments across diverse NLP tasks demonstrate that ELSA consistently outperforms state-of-the-art methods in terms of adaptability, convergence behavior, and robustness, establishing a scalable and privacy-aware solution for edge-side LLM fine-tuning under resource constraints.
[LG-18] Device Association and Resource Allocation for Hierarchical Split Federated Learning in Space-Air-Ground Integrated Network
链接: https://arxiv.org/abs/2601.13817
作者: Haitao Zhao,Xiaoyu Tang,Bo Xu,Jinlong Sun,Linghao Zhang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:6G facilitates deployment of Federated Learning (FL) in the Space-Air-Ground Integrated Network (SAGIN), yet FL confronts challenges such as resource constrained and unbalanced data distribution. To address these issues, this paper proposes a Hierarchical Split Federated Learning (HSFL) framework and derives its upper bound of loss function. To minimize the weighted sum of training loss and latency, we formulate a joint optimization problem that integrates device association, model split layer selection, and resource allocation. We decompose the original problem into several subproblems, where an iterative optimization algorithm for device association and resource allocation based on brute-force split point search is proposed. Simulation results demonstrate that the proposed algorithm can effectively balance training efficiency and model accuracy for FL in SAGIN.
[LG-19] PAtt: A Pattern Attention Network for ETA Prediction Using Historical Speed Profiles ITSC2025
链接: https://arxiv.org/abs/2601.13793
作者: ByeoungDo Kim,JunYeop Na,Kyungwook Tak,JunTae Kim,DongHyeon Kim,Duckky Kim
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, ITSC 2025, to be published
Abstract:In this paper, we propose an ETA model (Estimated Time of Arrival) that leverages an attention mechanism over historical road speed patterns. As autonomous driving and intelligent transportation systems become increasingly prevalent, the need for accurate and reliable ETA estimation has grown, playing a vital role in navigation, mobility planning, and traffic management. However, predicting ETA remains a challenging task due to the dynamic and complex nature of traffic flow. Traditional methods often combine real-time and historical traffic data in simplistic ways, or rely on complex rule-based computations. While recent deep learning models have shown potential, they often require high computational costs and do not effectively capture the spatio-temporal patterns crucial for ETA prediction. ETA prediction inherently involves spatio-temporal causality, and our proposed model addresses this by leveraging attention mechanisms to extract and utilize temporal features accumulated at each spatio-temporal point along a route. This architecture enables efficient and accurate ETA estimation while keeping the model lightweight and scalable. We validate our approach using real-world driving datasets and demonstrate that our approach outperforms existing baselines by effectively integrating road characteristics, real-time traffic conditions, and historical speed patterns in a task-aware manner.
[LG-20] Principled Latent Diffusion for Graphs via Laplacian Autoencoders
链接: https://arxiv.org/abs/2601.13780
作者: Antoine Siraudin,Christopher Morris
类目: Machine Learning (cs.LG)
*备注: Preprint, under review
Abstract:Graph diffusion models achieve state-of-the-art performance in graph generation but suffer from quadratic complexity in the number of nodes – and much of their capacity is wasted modeling the absence of edges in sparse graphs. Inspired by latent diffusion in other modalities, a natural idea is to compress graphs into a low-dimensional latent space and perform diffusion there. However, unlike images or text, graph generation requires nearly lossless reconstruction, as even a single error in decoding an adjacency matrix can render the entire sample invalid. This challenge has remained largely unaddressed. We propose LG-Flow, a latent graph diffusion framework that directly overcomes these obstacles. A permutation-equivariant autoencoder maps each node into a fixed-dimensional embedding from which the full adjacency is provably recoverable, enabling near-lossless reconstruction for both undirected graphs and DAGs. The dimensionality of this latent representation scales linearly with the number of nodes, eliminating the quadratic bottleneck and making it feasible to train larger and more expressive models. In this latent space, we train a Diffusion Transformer with flow matching, enabling efficient and expressive graph generation. Our approach achieves competitive results against state-of-the-art graph diffusion models, while achieving up to 1000\times speed-up.
[LG-21] Orthogonium : A Unified Efficient Library of Orthogonal and 1-Lipschitz Building Blocks
链接: https://arxiv.org/abs/2601.13776
作者: Thibaut Boissin(IRIT-MISFIT),Franck Mamalet,Valentin Lafargue(ANITI, IMT),Mathieu Serrurier(IRIT-MISFIT)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Orthogonal and 1-Lipschitz neural network layers are essential building blocks in robust deep learning architectures, crucial for certified adversarial robustness, stable generative models, and reliable recurrent networks. Despite significant advancements, existing implementations remain fragmented, limited, and computationally demanding. To address these issues, we introduce Orthogonium , a unified, efficient, and comprehensive PyTorch library providing orthogonal and 1-Lipschitz layers. Orthogonium provides access to standard convolution features-including support for strides, dilation, grouping, and transposed-while maintaining strict mathematical guarantees. Its optimized implementations reduce overhead on large scale benchmarks such as ImageNet. Moreover, rigorous testing within the library has uncovered critical errors in existing implementations, emphasizing the importance of standardized and reliable tools. Orthogonium thus significantly lowers adoption barriers, enabling scalable experimentation and integration across diverse applications requiring orthogonality and robust Lipschitz constraints. Orthogonium is available at this https URL.
[LG-22] EEG-Titans: Long-Horizon Seizure Forecasting via Dual-Branch Attention and Neural Memory
链接: https://arxiv.org/abs/2601.13748
作者: Tien-Dat Pham,Xuan-The Tran
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Accurate epileptic seizure prediction from electroencephalography (EEG) remains challenging because pre-ictal dynamics may span long time horizons while clinically relevant signatures can be subtle and transient. Many deep learning models face a persistent trade-off between capturing local spatiotemporal patterns and maintaining informative long-range context when operating on ultralong sequences. We propose EEG-Titans, a dualbranch architecture that incorporates a modern neural memory mechanism for long-context modeling. The model combines sliding-window attention to capture short-term anomalies with a recurrent memory pathway that summarizes slower, progressive trends over time. On the CHB-MIT scalp EEG dataset, evaluated under a chronological holdout protocol, EEG-Titans achieves 99.46% average segment-level sensitivity across 18 subjects. We further analyze safety-first operating points on artifact-prone recordings and show that a hierarchical context strategy extending the receptive field for high-noise subjects can markedly reduce false alarms (down to 0.00 FPR/h in an extreme outlier) without sacrificing sensitivity. These results indicate that memory-augmented long-context modeling can provide robust seizure forecasting under clinically constrained evaluation
[LG-23] Variational Dual-path Attention Network for CSI-Based Gesture Recognition
链接: https://arxiv.org/abs/2601.13745
作者: N.Zhang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, 2 tables
Abstract:Wi-Fi gesture recognition based on Channel State Information (CSI) is challenged by high-dimensional noise and resource constraints on edge devices. Prevailing end-to-end models tightly couple feature extraction with classification, overlooking the inherent time-frequency sparsity of CSI and leading to redundancy and poor generalization. To address this, this paper proposes a lightweight feature preprocessing module–the Variational Dual-path Attention Network (VDAN). It performs structured feature refinement through frequency-domain filtering and temporal detection. Variational inference is introduced to model the uncertainty in attention weights, thereby enhancing robustness to noise. The design principles of the module are explained from the perspectives of the information bottleneck and regularization. Experiments on a public dataset demonstrate that the learned attention weights align with the physical sparse characteristics of CSI, verifying its interpretability. This work provides an efficient and explainable front-end processing solution for resource-constrained wireless sensing systems.
[LG-24] Breaking the Data Barrier in Learning Symbolic Computation: A Case Study on Variable Ordering Suggestion for Cylindrical Algebraic Decomposition
链接: https://arxiv.org/abs/2601.13731
作者: Rui-Juan Jing,Yuegang Zhao,Changbo Chen
类目: ymbolic Computation (cs.SC); Machine Learning (cs.LG)
*备注:
Abstract:Symbolic computation, powered by modern computer algebra systems, has important applications in mathematical reasoning through exact deep computations. The efficiency of symbolic computation is largely constrained by such deep computations in high dimension. This creates a fundamental barrier on labelled data acquisition if leveraging supervised deep learning to accelerate symbolic computation. Cylindrical algebraic decomposition (CAD) is a pillar symbolic computation method for reasoning with first-order logic formulas over reals with many applications in formal verification and automatic theorem proving. Variable orderings have a huge impact on its efficiency. Impeded by the difficulty to acquire abundant labelled data, existing learning-based approaches are only competitive with the best expert-based heuristics. In this work, we address this problem by designing a series of intimately connected tasks for which a large amount of annotated data can be easily obtained. We pre-train a Transformer model with these data and then fine-tune it on the datasets for CAD ordering. Experiments on publicly available CAD ordering datasets show that on average the orderings predicted by the new model are significantly better than those suggested by the best heuristic methods.
[LG-25] SWE-Tester: Training Open-Source LLM s for Issue Reproduction in Real-World Repositories
链接: https://arxiv.org/abs/2601.13713
作者: Aditya Bharat Soni,Rajat Ghosh,Vaishnavi Bhargava,Valerie Chen,Debojyoti Dutta
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root cause analysis, promotes test-driven development – “test first, write code later”, and can be used for improving the effectiveness of automated issue resolution systems like coding agents. Existing methods proposed for this task predominantly rely on closed-source LLMs, with limited exploration of open models. To address this, we propose SWE-Tester – a novel pipeline for training open-source LLMs to generate issue reproduction tests. First, we curate a high-quality training dataset of 41K instances from 2.6K open-source GitHub repositories and use it to train LLMs of varying sizes and families. The fine-tuned models achieve absolute improvements of up to 10% in success rate and 21% in change coverage on SWT-Bench Verified. Further analysis shows consistent improvements with increased inference-time compute, more data, and larger models. These results highlight the effectiveness of our framework for advancing open-source LLMs in this domain.
[LG-26] Autoregressive deep learning for real-time simulation of soft tissue dynamics during virtual neurosurgery
链接: https://arxiv.org/abs/2601.13676
作者: Fabian Greifeneder,Wolfgang Fenz,Benedikt Alkin,Johannes Brandstetter,Michael Giretzlehner,Philipp Moser
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate simulation of brain deformation is a key component for developing realistic, interactive neurosurgical simulators, as complex nonlinear deformations must be captured to ensure realistic tool-tissue interactions. However, traditional numerical solvers often fall short in meeting real-time performance requirements. To overcome this, we introduce a deep learning-based surrogate model that efficiently simulates transient brain deformation caused by continuous interactions between surgical instruments and the virtual brain geometry. Building on Universal Physics Transformers, our approach operates directly on large-scale mesh data and is trained on an extensive dataset generated from nonlinear finite element simulations, covering a broad spectrum of temporal instrument-tissue interaction scenarios. To reduce the accumulation of errors in autoregressive inference, we propose a stochastic teacher forcing strategy applied during model training. Specifically, training consists of short stochastic rollouts in which the proportion of ground truth inputs is gradually decreased in favor of model-generated predictions. Our results show that the proposed surrogate model achieves accurate and efficient predictions across a range of transient brain deformation scenarios, scaling to meshes with up to 150,000 nodes. The introduced stochastic teacher forcing technique substantially improves long-term rollout stability, reducing the maximum prediction error from 6.7 mm to 3.5 mm. We further integrate the trained surrogate model into an interactive neurosurgical simulation environment, achieving runtimes below 10 ms per simulation step on consumer-grade inference hardware. Our proposed deep learning framework enables rapid, smooth and accurate biomechanical simulations of dynamic brain tissue deformation, laying the foundation for realistic surgical training environments.
[LG-27] Reinforcement Learning for Opportunistic Routing in Software-Defined LEO-Terrestrial Systems
链接: https://arxiv.org/abs/2601.13662
作者: Sivaram Krishnan,Zhouyou Gu,Jihong Park,Sung-Min Oh,Jinho Choi
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:The proliferation of large-scale low Earth orbit (LEO) satellite constellations is driving the need for intelligent routing strategies that can effectively deliver data to terrestrial networks under rapidly time-varying topologies and intermittent gateway visibility. Leveraging the global control capabilities of a geostationary (GEO)-resident software-defined networking (SDN) controller, we introduce opportunistic routing, which aims to minimize delivery delay by forwarding packets to any currently available ground gateways rather than fixed destinations. This makes it a promising approach for achieving low-latency and robust data delivery in highly dynamic LEO networks. Specifically, we formulate a constrained stochastic optimization problem and employ a residual reinforcement learning framework to optimize opportunistic routing for reducing transmission delay. Simulation results over multiple days of orbital data demonstrate that our method achieves significant improvements in queue length reduction compared to classical backpressure and other well-known queueing algorithms.
[LG-28] meART: Towards Agent ic Time Series Reasoning via Tool-Augmentation
链接: https://arxiv.org/abs/2601.13653
作者: Xingjian Wu,Junkai Lu,Zhengyu Li,Xiangfei Qiu,Jilin Hu,Chenjuan Guo,Christian S. Jensen,Bin Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series data widely exist in real-world cyber-physical systems. Though analyzing and interpreting them contributes to significant values, e.g, disaster prediction and financial risk control, current workflows mainly rely on human data scientists, which requires significant labor costs and lacks automation. To tackle this, we introduce TimeART, a framework fusing the analytical capability of strong out-of-the-box tools and the reasoning capability of Large Language Models (LLMs), which serves as a fully agentic data scientist for Time Series Question Answering (TSQA). To teach the LLM-based Time Series Reasoning Models (TSRMs) strategic tool-use, we also collect a 100k expert trajectory corpus called TimeToolBench. To enhance TSRMs’ generalization capability, we then devise a four-stage training strategy, which boosts TSRMs through learning from their own early experiences and self-reflections. Experimentally, we train an 8B TSRM on TimeToolBench and equip it with the TimeART framework, and it achieves consistent state-of-the-art performance on multiple TSQA tasks, which pioneers a novel approach towards agentic time series reasoning.
[LG-29] Fisher-Informed Parameterwise Aggregation for Federated Learning with Heterogeneous Data
链接: https://arxiv.org/abs/2601.13608
作者: Zhipeng Chang,Ting He,Wenrui Hao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning aggregates model updates from distributed clients, but standard first order methods such as FedAvg apply the same scalar weight to all parameters from each client. Under non-IID data, these uniformly weighted updates can be strongly misaligned across clients, causing client drift and degrading the global model. Here we propose Fisher-Informed Parameterwise Aggregation (FIPA), a second-order aggregation method that replaces client-level scalar weights with parameter-specific Fisher Information Matrix (FIM) weights, enabling true parameter-level scaling that captures how each client’s data uniquely influences different parameters. With low-rank approximation, FIPA remains communication- and computation-efficient. Across nonlinear function regression, PDE learning, and image classification, FIPA consistently improves over averaging-based aggregation, and can be effectively combined with state-of-the-art client-side optimization algorithms to further improve image classification accuracy. These results highlight the benefits of FIPA for federated learning under heterogeneous data distributions.
[LG-30] Optimizing Parallel Schemes with Lyapunov Exponents and kNN-LLE Estimation
链接: https://arxiv.org/abs/2601.13604
作者: Mudassir Shams,Andrei Velichko,Bruno Carpentieri
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 25 pages, 9 figures, 10 tables
Abstract:Inverse parallel schemes remain indispensable tools for computing the roots of nonlinear systems, yet their dynamical behavior can be unexpectedly rich, ranging from strong contraction to oscillatory or chaotic transients depending on the choice of algorithmic parameters and initial states. A unified analytical-data-driven methodology for identifying, measuring, and reducing such instabilities in a family of uni-parametric inverse parallel solvers is presented in this study. On the theoretical side, we derive stability and bifurcation characterizations of the underlying iterative maps, identifying parameter regions associated with periodic or chaotic behavior. On the computational side, we introduce a micro-series pipeline based on kNN-driven estimation of the local largest Lyapunov exponent (LLE), applied to scalar time series derived from solver trajectories. The resulting sliding-window Lyapunov profiles provide fine-grained, real-time diagnostics of contractive or unstable phases and reveal transient behaviors not captured by coarse linearized analysis. Leveraging this correspondence, we introduce a Lyapunov-informed parameter selection strategy that identifies solver settings associated with stable behavior, particularly when the estimated LLE indicates persistent instability. Comprehensive experiments on ensembles of perturbed initial guesses demonstrate close agreement between the theoretical stability diagrams and empirical Lyapunov profiles, and show that the proposed adaptive mechanism significantly improves robustness. The study establishes micro-series Lyapunov analysis as a practical, interpretable tool for constructing self-stabilizing root-finding schemes and opens avenues for extending such diagnostics to higher-dimensional or noise-contaminated problems.
[LG-31] An Elementary Approach to Scheduling in Generative Diffusion Models
链接: https://arxiv.org/abs/2601.13602
作者: Qiang Sun,H. Vincent Poor,Wenyi Zhang
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:An elementary approach to characterizing the impact of noise scheduling and time discretization in generative diffusion models is developed. Considering a simplified model where the source distribution is multivariate Gaussian with a given covariance matrix, the explicit closed-form evolution trajectory of the distributions across reverse sampling steps is derived, and consequently, the Kullback-Leibler (KL) divergence between the source distribution and the reverse sampling output is obtained. The effect of the number of time discretization steps on the convergence of this KL divergence is studied via the Euler-Maclaurin expansion. An optimization problem is formulated, and its solution noise schedule is obtained via calculus of variations, shown to follow a tangent law whose coefficient is determined by the eigenvalues of the source covariance matrix. For an alternative scenario, more realistic in practice, where pretrained models have been obtained for some given noise schedules, the KL divergence also provides a measure to compare different time discretization strategies in reverse sampling. Experiments across different datasets and pretrained models demonstrate that the time discretization strategy selected by our approach consistently outperforms baseline and search-based strategies, particularly when the budget on the number of function evaluations is very tight.
[LG-32] Behavior Knowledge Merge in Reinforced Agent ic Models
链接: https://arxiv.org/abs/2601.13572
作者: Xiangchi Yuan,Dachuan Shi,Chunhui Zhang,Zheyuan Liu,Shenglong Yao,Soroush Vosoughi,Wenke Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) is central to post-training, particularly for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for integrating multiple RL-trained agents from different tasks into a single generalist model. However, existing merging methods are designed for supervised fine-tuning (SFT), and they are suboptimal to preserve task-specific capabilities on RL-trained agentic models. The root is a task-vector mismatch between RL and SFT: on-policy RL induces task vectors that are highly sparse and heterogeneous, whereas SFT-style merging implicitly assumes dense and globally comparable task vectors. When standard global averaging is applied under this mismatch, RL’s non-overlapping task vectors that encode critical task-specific behaviors are reduced and parameter updates are diluted. To address this issue, we propose Reinforced Agent Merging (RAM), a distribution-aware merging framework explicitly designed for RL-trained agentic models. RAM disentangles shared and task-specific unique parameter updates, averaging shared components while selectively preserving and rescaling unique ones to counteract parameter update dilution. Experiments across multiple agent domains and model architectures demonstrate that RAM not only surpasses merging baselines, but also unlocks synergistic potential among agents to achieve performance superior to that of specialized agents in their domains.
[LG-33] DRGW: Learning Disentangled Representations for Robust Graph Watermarking WWW’26
链接: https://arxiv.org/abs/2601.13569
作者: Jiasen Li,Yanwei Liu,Zhuoyi Shang,Xiaoyan Gu,Weiping Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Published at The Web Conference 2026 (WWW '26)
Abstract:Graph-structured data is foundational to numerous web applications, and watermarking is crucial for protecting their intellectual property and ensuring data provenance. Existing watermarking methods primarily operate on graph structures or entangled graph representations, which compromise the transparency and robustness of watermarks due to the information coupling in representing graphs and uncontrollable discretization in transforming continuous numerical representations into graph structures. This motivates us to propose DRGW, the first graph watermarking framework that addresses these issues through disentangled representation learning. Specifically, we design an adversarially trained encoder that learns an invariant structural representation against diverse perturbations and derives a statistically independent watermark carrier, ensuring both robustness and transparency of watermarks. Meanwhile, we devise a graph-aware invertible neural network to provide a lossless channel for watermark embedding and extraction, guaranteeing high detectability and transparency of watermarks. Additionally, we develop a structure-aware editor that resolves the issue of latent modifications into discrete graph edits, ensuring robustness against structural perturbations. Experiments on diverse benchmark datasets demonstrate the superior effectiveness of DRGW.
[LG-34] Patterning: The Dual of Interpretability
链接: https://arxiv.org/abs/2601.13548
作者: George Wang,Daniel Murfet
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it. Our approach is based on susceptibilities, which measure how posterior expectation values of observables respond to infinitesimal shifts in the data distribution. Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration. We demonstrate patterning in a small language model, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structure, such as the induction circuit. In a synthetic parentheses balancing task where multiple algorithms achieve perfect training accuracy, we show that patterning can select which algorithm the model learns by targeting the local learning coefficient of each solution. These results establish that the same mathematical framework used to read internal structure can be inverted to write it.
[LG-35] StoTAM: Stochastic Alternating Minimization for Tucker-Structured Tensor Sensing
链接: https://arxiv.org/abs/2601.13522
作者: Shuang Li
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Low-rank tensor sensing is a fundamental problem with broad applications in signal processing and machine learning. Among various tensor models, low-Tucker-rank tensors are particularly attractive for capturing multi-mode subspace structures in high-dimensional data. Existing recovery methods either operate on the full tensor variable with expensive tensor projections, or adopt factorized formulations that still rely on full-gradient computations, while most stochastic factorized approaches are restricted to tensor decomposition settings. In this work, we propose a stochastic alternating minimization algorithm that operates directly on the core tensor and factor matrices under a Tucker factorization. The proposed method avoids repeated tensor projections and enables efficient mini-batch updates on low-dimensional tensor factors. Numerical experiments on synthetic tensor sensing demonstrate that the proposed algorithm exhibits favorable convergence behavior in wall-clock time compared with representative stochastic tensor recovery baselines.
[LG-36] Bridging the Gap Between Estimated and True Regret Towards Reliable Regret Estimation in Deep Learning based Mechanism Design
链接: https://arxiv.org/abs/2601.13489
作者: Shuyuan You,Zhiqiang Zhuang,Kewen Wang,Zhe Wang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); General Economics (econ.GN)
*备注:
Abstract:Recent advances, such as RegretNet, ALGnet, RegretFormer and CITransNet, use deep learning to approximate optimal multi item auctions by relaxing incentive compatibility (IC) and measuring its violation via ex post regret. However, the true accuracy of these regret estimates remains unclear. Computing exact regret is computationally intractable, and current models rely on gradient based optimizers whose outcomes depend heavily on hyperparameter choices. Through extensive experiments, we reveal that existing methods systematically underestimate actual regret (In some models, the true regret is several hundred times larger than the reported regret), leading to overstated claims of IC and revenue. To address this issue, we derive a lower bound on regret and introduce an efficient item wise regret approximation. Building on this, we propose a guided refinement procedure that substantially improves regret estimation accuracy while reducing computational cost. Our method provides a more reliable foundation for evaluating incentive compatibility in deep learning based auction mechanisms and highlights the need to reassess prior performance claims in this area.
[LG-37] Quantum Qualifiers for Neural Network Model Selection in Hadronic Physics
链接: https://arxiv.org/abs/2601.13463
作者: Brandon B. Le,D. Keller
类目: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Nuclear Theory (nucl-th); Quantum Physics (quant-ph)
*备注: 12 pages, 5 figures. Proceedings for the 26th International Symposium on Spin Physics (SPIN2025), September 21-26, 2025; Qingdao, Shandong, China
Abstract:As quantum machine-learning architectures mature, a central challenge is no longer their construction, but identifying the regimes in which they offer practical advantages over classical approaches. In this work, we introduce a framework for addressing this question in data-driven hadronic physics problems by developing diagnostic tools - centered on a quantitative quantum qualifier - that guide model selection between classical and quantum deep neural networks based on intrinsic properties of the data. Using controlled classification and regression studies, we show how relative model performance follows systematic trends in complexity, noise, and dimensionality, and how these trends can be distilled into a predictive criterion. We then demonstrate the utility of this approach through an application to Compton form factor extraction from deeply virtual Compton scattering, where the quantum qualifier identifies kinematic regimes favorable to quantum models. Together, these results establish a principled framework for deploying quantum machine-learning tools in precision hadronic physics.
[LG-38] Federated Learning Under Temporal Drift – Mitigating Catastrophic Forgetting via Experience Replay
链接: https://arxiv.org/abs/2601.13456
作者: Sahasra Kokkula,Daniel David,Aaditya Baruah
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 8 pages, 5 figures. Course project for Neural Networks Deep Learning COMSW4776 course at Columbia University
Abstract:Federated Learning struggles under temporal concept drift where client data distributions shift over time. We demonstrate that standard FedAvg suffers catastrophic forgetting under seasonal drift on Fashion-MNIST, with accuracy dropping from 74% to 28%. We propose client-side experience replay, where each client maintains a small buffer of past samples mixed with current data during local training. This simple approach requires no changes to server aggregation. Experiments show that a 50-sample-per-class buffer restores performance to 78-82%, effectively preventing forgetting. Our ablation study reveals a clear memory-accuracy trade-off as buffer size increases.
[LG-39] Fairness-informed Pareto Optimization : An Efficient Bilevel Framework
链接: https://arxiv.org/abs/2601.13448
作者: Sofiane Tanji,Samuel Vaiter,Yassine Laguel
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Despite their promise, fair machine learning methods often yield Pareto-inefficient models, in which the performance of certain groups can be improved without degrading that of others. This issue arises frequently in traditional in-processing approaches such as fairness-through-regularization. In contrast, existing Pareto-efficient approaches are biased towards a certain perspective on fairness and fail to adapt to the broad range of fairness metrics studied in the literature. In this paper, we present BADR, a simple framework to recover the optimal Pareto-efficient model for any fairness metric. Our framework recovers its models through a Bilevel Adaptive Rescalarisation procedure. The lower level is a weighted empirical risk minimization task where the weights are a convex combination of the groups, while the upper level optimizes the chosen fairness objective. We equip our framework with two novel large-scale, single-loop algorithms, BADR-GD and BADR-SGD, and establish their convergence guarantees. We release badr, an open-source Python toolbox implementing our framework for a variety of learning tasks and fairness metrics. Finally, we conduct extensive numerical experiments demonstrating the advantages of BADR over existing Pareto-efficient approaches to fairness.
[LG-40] BladeSDF : Unconditional and Conditional Generative Modeling of Representative Blade Geometries Using Signed Distance Functions
链接: https://arxiv.org/abs/2601.13445
作者: Ashish S. Nair,Sandipp Krishnan Ravi,Itzel Salgado,Changjie Sun,Sayan Ghosh,Liping Wang
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Generative AI has emerged as a transformative paradigm in engineering design, enabling automated synthesis and reconstruction of complex 3D geometries while preserving feasibility and performance relevance. This paper introduces a domain-specific implicit generative framework for turbine blade geometry using DeepSDF, addressing critical gaps in performance-aware modeling and manufacturable design generation. The proposed method leverages a continuous signed distance function (SDF) representation to reconstruct and generate smooth, watertight geometries with quantified accuracy. It establishes an interpretable, near-Gaussian latent space that aligns with blade-relevant parameters, such as taper and chord ratios, enabling controlled exploration and unconditional synthesis through interpolation and Gaussian sampling. In addition, a compact neural network maps engineering descriptors, such as maximum directional strains, to latent codes, facilitating the generation of performance-informed geometry. The framework achieves high reconstruction fidelity, with surface distance errors concentrated within 1% of the maximum blade dimension, and demonstrates robust generalization to unseen designs. By integrating constraints, objectives, and performance metrics, this approach advances beyond traditional 2D-guided or unconstrained 3D pipelines, offering a practical and interpretable solution for data-driven turbine blade modeling and concept generation.
[LG-41] Classifiers in High Dimensional Hilbert Metrics
链接: https://arxiv.org/abs/2601.13410
作者: Aditya Acharya,Auguste H. Gezalyan,David M. Mount
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注:
Abstract:Classifying points in high dimensional spaces is a fundamental geometric problem in machine learning. In this paper, we address classifying points in the d -dimensional Hilbert polygonal metric. The Hilbert metric is a generalization of the Cayley-Klein hyperbolic distance to arbitrary convex bodies and has a diverse range of applications in machine learning and convex geometry. We first present an efficient LP-based algorithm in the metric for the large-margin SVM problem. Our algorithm runs in time polynomial to the number of points, bounding facets, and dimension. This is a significant improvement on previous works, which either provide no theoretical guarantees on running time, or suffer from exponential runtime. We also consider the closely related Funk metric. We also present efficient algorithms for the soft-margin SVM problem and for nearest neighbor-based classification in the Hilbert metric.
[LG-42] CausationEntropy: Pythonic Optimal Causation Entropy
链接: https://arxiv.org/abs/2601.13365
作者: Kevin Slote,Jeremie Fish,Erik Bollt
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:Optimal Causation Entropy (oCSE) is a robust causal network modeling technique that reveals causal networks from dynamical systems and coupled oscillators, distinguishing direct from indirect paths. CausationEntropy is a Python package that implements oCSE and several of its significant optimizations and methodological extensions. In this paper, we introduce the version 1.1 release of CausationEntropy, which includes new synthetic data generators, plotting tools, and several advanced information-theoretical causal network discovery algorithms with criteria for estimating Gaussian, k-nearest neighbors (kNN), geometric k-nearest neighbors (geometric-kNN), kernel density (KDE) and Poisson entropic estimators. The package is easy to install from the PyPi software repository, is thoroughly documented, supplemented with extensive code examples, and is modularly structured to support future additions. The entire codebase is released under the MIT license and is available on GitHub and through PyPi Repository. We expect this package to serve as a benchmark tool for causal discovery in complex dynamical systems.
[LG-43] Beyond Mapping : Domain-Invariant Representations via Spectral Embedding of Optimal Transport Plans
链接: https://arxiv.org/abs/2601.13350
作者: Abdel Djalil Sad Saoud,Fred Maurice Ngolè Mboula,Hanane Slimani
类目: Machine Learning (cs.LG)
*备注: 5 pages, 2 figures
Abstract:Distributional shifts between training and inference time data remain a central challenge in machine learning, often leading to poor performance. It motivated the study of principled approaches for domain alignment, such as optimal transport based unsupervised domain adaptation, that relies on approximating Monge map using transport plans, which is sensitive to the transport problem regularization strategy and hyperparameters, and might yield biased domains alignment. In this work, we propose to interpret smoothed transport plans as adjacency matrices of bipartite graphs connecting source to target domain and derive domain-invariant samples’ representations through spectral embedding. We evaluate our approach on acoustic adaptation benchmarks for music genre recognition, music-speech discrimination, as well as electrical cable defect detection and classification tasks using time domain reflection in different diagnosis settings, achieving overall strong performances.
[LG-44] Verifying Local Robustness of Pruned Safety-Critical Networks
链接: https://arxiv.org/abs/2601.13303
作者: Minh Le,Phuong Cao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Formal verification of Deep Neural Networks (DNNs) is essential for safety-critical applications, ranging from surgical robotics to NASA JPL autonomous systems. However, the computational cost of verifying large-scale models remains a significant barrier to adoption. This paper investigates the impact of pruning on formal local robustness certificates with different ratios. Using the state-of-the-art \alpha,\beta -CROWN verifier, we evaluate ResNet4 models across varying pruning ratios on MNIST and, more importantly, on the NASA JPL Mars Frost Identification datasets. Our findings demonstrate a non-linear relationship: light pruning (40%) in MNIST and heavy pruning (70%-90%) in JPL improve verifiability, allowing models to outperform unpruned baselines in proven L_\infty robustness properties. This suggests that reduced connectivity simplifies the search space for formal solvers and that the optimal pruning ratio varies significantly between datasets. This research highlights the complex nature of model compression, offering critical insights into selecting the optimal pruning ratio for deploying efficient, yet formally verified, DNNs in high-stakes environments where reliability is non-negotiable.
[LG-45] he Tag is the Signal: URL-Agnostic Credibility Scoring for Messages on Telegram
链接: https://arxiv.org/abs/2601.13294
作者: Yipeng Wang,Huy Gia Han Vu,Mohit Singhal
类目: ocial and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:Telegram has become one of the leading platforms for disseminating misinformational messages. However, many existing pipelines still classify each message’s credibility based on the reputation of its associated domain names or its lexical features. Such methods work well on traditional long-form news articles published by well-known sources, but high-risk posts on Telegram are short and URL-sparse, leading to failures for link-based and standard TF-IDF models. To this end, we propose the TAG2CRED pipeline, a method designed for such short, convoluted messages. Our model will directly score each post based on the tags assigned to the text. We designed a concise label system that covers the dimensions of theme, claim type, call to action, and evidence. The fine-tuned large language model (LLM) assigns tags to messages and then maps these tags to calibrated risk scores in the [0,1] interval through L2-regularized logistic regression. We evaluated 87,936 Telegram messages associated with Media Bias/Fact Check (MBFC), using URL masking and domain disjoint splits. The results showed that the ROC-AUC of the TAG2CRED model reached 0.871, the macro-F1 value was 0.787, and the Brier score was 0.167, outperforming the baseline TF-IDF (macro-F1 value 0.737, Brier score 0.248); at the same time, the number of features used in this model is much smaller, and the generalization ability on infrequent domains is stronger. The performance of the stacked ensemble model (TF-IDF + TAG2CRED + SBERT) was further improved over the baseline SBERT. ROC-AUC reached 0.901, and the macro-F1 value was 0.813 (Brier score 0.114). This indicates that style labels and lexical features may capture different but complementary dimensions of information risk.
[LG-46] Balancing Classification and Calibration Performance in Decision-Making LLM s via Calibration Aware Reinforcement Learning
链接: https://arxiv.org/abs/2601.13284
作者: Duygu Nur Yaldiz,Evangelia Spiliopoulou,Zheng Qi,Siddharth Varia,Srikanth Doss,Nikolaos Pappas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly deployed in decision-making tasks, where not only accuracy but also reliable confidence estimates are essential. Well-calibrated confidence enables downstream systems to decide when to trust a model and when to defer to fallback mechanisms. In this work, we conduct a systematic study of calibration in two widely used fine-tuning paradigms: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). We show that while RLVR improves task performance, it produces extremely overconfident models, whereas SFT yields substantially better calibration, even under distribution shift, though with smaller performance gains. Through targeted experiments, we diagnose RLVR’s failure, showing that decision tokens act as extraction steps of the decision in reasoning traces and do not carry confidence information, which prevents reinforcement learning from surfacing calibrated alternatives. Based on this insight, we propose a calibration-aware reinforcement learning formulation that directly adjusts decision-token probabilities. Our method preserves RLVR’s accuracy level while mitigating overconfidence, reducing ECE scores up to 9 points.
[LG-47] Multi-level Monte Carlo Dropout for Efficient Uncertainty Quantification
链接: https://arxiv.org/abs/2601.13272
作者: Aaron Pim,Tristan Pryer
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 26 pages, 11 figures
Abstract:We develop a multilevel Monte Carlo (MLMC) framework for uncertainty quantification with Monte Carlo dropout. Treating dropout masks as a source of epistemic randomness, we define a fidelity hierarchy by the number of stochastic forward passes used to estimate predictive moments. We construct coupled coarse–fine estimators by reusing dropout masks across fidelities, yielding telescoping MLMC estimators for both predictive means and predictive variances that remain unbiased for the corresponding dropout-induced quantities while reducing sampling variance at fixed evaluation budget. We derive explicit bias, variance and effective cost expressions, together with sample-allocation rules across levels. Numerical experiments on forward and inverse PINNs–Uzawa benchmarks confirm the predicted variance rates and demonstrate efficiency gains over single-level MC-dropout at matched cost.
[LG-48] Deep Neural networks for solving high-dimensional parabolic partial differential equations
链接: https://arxiv.org/abs/2601.13256
作者: Wenzhong Zhang,Zhenyuan Hu,Wei Cai,George EM Karniadakis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:The numerical solution of high dimensional partial differential equations (PDEs) is severely constrained by the curse of dimensionality (CoD), rendering classical grid–based methods impractical beyond a few dimensions. In recent years, deep neural networks have emerged as a promising mesh free alternative, enabling the approximation of PDE solutions in tens to thousands of dimensions. This review provides a tutorial–oriented introduction to neural–network–based methods for solving high dimensional parabolic PDEs, emphasizing conceptual clarity and methodological connections. We organize the literature around three unifying paradigms: (i) PDE residual–based approaches, including physicsinformed neural networks and their high dimensional variants; (ii) stochastic methods derived from Feynman–Kac and backward stochastic differential equation formulations; and (iii) hybrid derivative–free random difference approaches designed to alleviate the computational cost of derivatives in high dimensions. For each paradigm, we outline the underlying mathematical formulation, algorithmic implementation, and practical strengths and limitations. Representative benchmark problems–including Hamilton–Jacobi–Bellman and Black–Scholes equations in up to 1000 dimensions --illustrate the scalability, effectiveness, and accuracy of the methods. The paper concludes with a discussion of open challenges and future directions for reliable and scalable solvers of high dimensional PDEs.
[LG-49] Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks
链接: https://arxiv.org/abs/2601.13244
作者: Prateek Munjal,Clement Christophe,Ronnie Rajan,Praveenkumar Kanithi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Instruction finetuning is standard practice for improving LLM performance, yet it remains unclear whether it enhances reasoning or merely induces surface-level pattern matching. We investigate this by evaluating base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our analysis highlights two key (often overlooked) limitations of instruction tuning. First, the performance advantage is unstable and depends heavily on evaluation settings. In zero-shot CoT settings on GSM8K, base models consistently outperform instruction-tuned variants, with drops as high as 32.67% (Llama3-70B). Instruction-tuned models only match or exceed this performance when provided with few-shot exemplars, suggesting a reliance on specific prompting patterns rather than intrinsic reasoning. Second, tuning gains are brittle under distribution shift. Our results show that base models surpass instruction-tuned variants on the domain-specific MedCalc benchmark. Additionally, instruction-tuned models show sharp declines on perturbed datasets, indicating sensitivity to prompt structure over robust reasoning.
[LG-50] A Comprehensive Evaluation of LLM Reasoning : From Single-Model to Multi-Agent Paradigms
链接: https://arxiv.org/abs/2601.13243
作者: Yapeng Li,Jiakuo Yu,Zhixin Liu,Xinnan Liu,Jing Yu,Songze Li,Tonghua Su
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms - such as Chain-of-Thought (CoT) and multi-agent systems (MAS) - play a critical role, yet their relative effectiveness and cost-accuracy trade-offs remain poorly understood. In this work, we conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS workflows, characterizing their reasoning performance across a diverse suite of closed-form benchmarks. Beyond overall performance, we probe role-specific capability demands in MAS using targeted role isolation analyses, and analyze cost-accuracy trade-offs to identify which MAS workflows offer a favorable balance between cost and accuracy, and which incur prohibitive overhead for marginal gains. We further introduce MIMeBench, a new open-ended benchmark that targets two foundational yet underexplored semantic capabilities - semantic abstraction and contrastive discrimination - thereby providing an alternative evaluation axis beyond closed-form accuracy and enabling fine-grained assessment of semantic competence that is difficult to capture with existing benchmarks. Our results show that increased structural complexity does not consistently lead to improved reasoning performance, with its benefits being highly dependent on the properties and suitability of the reasoning paradigm itself. The codes are released at this https URL.
[LG-51] LAViG-FLOW: Latent Autoregressive Video Generation for Fluid Flow Simulations
链接: https://arxiv.org/abs/2601.13190
作者: Vittoria De Pellegrini,Tariq Alkhalifah
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Modeling and forecasting subsurface multiphase fluid flow fields underpin applications ranging from geological CO2 sequestration (GCS) operations to geothermal production. This is essential for ensuring both operational performance and long-term safety. While high fidelity multiphase simulators are widely used for this purpose, they become prohibitively expensive once many forward runs are required for inversion purposes and quantify uncertainty. To tackle this challenge we propose LAViG-FLOW, a latent autoregressive video generation diffusion framework that explicitly learns the coupled evolution of saturation and pressure fields. Each state variable is compressed by a dedicated 2D autoencoder, and a Video Diffusion Transformer (VDiT) models their coupled distribution across time. We first train the model on a given time horizon to learn their coupled relationship and then fine-tune it autoregressively so it can extrapolate beyond the observed time window. Evaluated on an open-source CO2 sequestration dataset, LAViG-FLOW generates saturation and pressure fields that stay consistent across time while running orders of magnitude faster than traditional numerical solvers.
[LG-52] NeuroShield: A Neuro-Symbolic Framework for Adversarial Robustness
链接: https://arxiv.org/abs/2601.13162
作者: Ali Shafiee Sarvestani,Jason Schmidt,Arman Roohi
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:
Abstract:Adversarial vulnerability and lack of interpretability are critical limitations of deep neural networks, especially in safety-sensitive settings such as autonomous driving. We introduce \DesignII, a neuro-symbolic framework that integrates symbolic rule supervision into neural networks to enhance both adversarial robustness and explainability. Domain knowledge is encoded as logical constraints over appearance attributes such as shape and color, and enforced through semantic and symbolic logic losses applied during training. Using the GTSRB dataset, we evaluate robustness against FGSM and PGD attacks at a standard \ell_\infty perturbation budget of \varepsilon = 8/255 . Relative to clean training, standard adversarial training provides modest improvements in robustness ( \sim 10 percentage points). Conversely, our FGSM-Neuro-Symbolic and PGD-Neuro-Symbolic models achieve substantially larger gains, improving adversarial accuracy by 18.1% and 17.35% over their corresponding adversarial-training baselines, representing roughly a three-fold larger robustness gain than standard adversarial training provides when both are measured relative to the same clean-training baseline, without reducing clean-sample accuracy. Compared to transformer-based defenses such as LNL-MoEx, which require heavy architectures and extensive data augmentation, our PGD-Neuro-Symbolic variant attains comparable or superior robustness using a ResNet18 backbone trained for 10 epochs. These results show that symbolic reasoning offers an effective path to robust and interpretable AI.
[LG-53] FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference
链接: https://arxiv.org/abs/2601.13143
作者: Chaeyoung Jung,Youngjoon Jang,Seungwoo Lee,Joon Son Chung
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we present FastAV, the first token pruning framework tailored for audio-visual large language models (AV-LLMs). While token pruning has been actively explored in standard large language models (LLMs) and vision-language models (LVLMs), its application to AV-LLMs has received little attention, even though multimodal integration substantially increases their token demands. To address this gap, we introduce a pruning strategy that utilizes attention weights to identify tokens emphasized at different stages and estimates their importance. Building on this analysis, FastAV applies a two-stage pruning strategy: (1) global pruning in intermediate layers to remove broadly less influential tokens, and (2) fine pruning in later layers considering the impact on next token generation. Notably, our method does not rely on full attention maps, which makes it fully compatible with efficient attention mechanisms such as FlashAttention. Extensive experiments demonstrate that FastAV reduces FLOPs by more than 40% on two representative AV-LLMs, while preserving or even improving model performance.
[LG-54] Recursive Meta-Distillation: An Axiomatic Framework for Iterative Knowledge Refinement
链接: https://arxiv.org/abs/2601.13100
作者: Aaron R. Flouro,Shawn P. Chadwick
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent work in probability-domain knowledge distillation has established axiomatic frameworks for temperature scaling, multi-teacher aggregation, and bias-variance trade-offs in single-stage settings. However, the mathematical behavior of recursive or multi-generation distillation remains poorly understood, with prior approaches relying primarily on empirical heuristics. In this work, we introduce an axiomatic and operator-theoretic framework for recursive meta-distillation, formalizing iterative knowledge distillation as a sequence of probability-distribution operators with explicit anchoring to base teachers. We define structural axioms for valid meta-teacher construction and prove the existence of non-trivial operator families satisfying these axioms without specifying particular algorithms or loss functions. Under mild realizability and convexity assumptions, we show that anchored recursive distillation induces contraction in KL divergence, yielding geometric convergence to base teacher distributions and a unique, globally attractive fixed point. The contribution is foundational rather than algorithmic: the framework characterizes when recursive distillation is mathematically well-posed and convergent rather than error-accumulating, independent of model architecture, optimization details, or specific operator instantiations. These results provide a theoretical basis for understanding stability, bias-variance behavior, and failure modes in iterative and multi-teacher distillation under capacity constraints. Subjects: Machine Learning (cs.LG) MSC classes: 68T05, 60B10 ACMclasses: I.2.6; F.2.2 Cite as: arXiv:2601.13100 [cs.LG] (or arXiv:2601.13100v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.13100 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-55] RM -RF: Reward Model for Run-Free Unit Test Evaluation
链接: https://arxiv.org/abs/2601.13097
作者: Elena Bruches,Daniil Grebenkin,Mikhail Klementev,Vadim Alperovich,Roman Derunets,Dari Baturova,Georgy Mkrtchyan,Oleg Sedukhin,Ivan Bondarenko,Nikolay Bushkov,Stanislav Moiseev
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: This paper has been accepted for publication at the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2026)
Abstract:We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts - from source and test code alone - three execution-derived signals: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release an associated dataset and methodology for comparative evaluation. We tested multiple model families and tuning regimes (zero-shot, full fine-tuning, and PEFT via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run instruments, RM-RF provides substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
[LG-56] Adversarial News and Lost Profits: Manipulating Headlines in LLM -Driven Algorithmic Trading
链接: https://arxiv.org/abs/2601.13082
作者: Advije Rizvani,Giovanni Apruzzese,Pavel Laskov
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore
Abstract:Large Language Models (LLMs) are increasingly adopted in the financial domain. Their exceptional capabilities to analyse textual data make them well-suited for inferring the sentiment of finance-related news. Such feedback can be leveraged by algorithmic trading systems (ATS) to guide buy/sell decisions. However, this practice bears the risk that a threat actor may craft “adversarial news” intended to mislead an LLM. In particular, the news headline may include “malicious” content that remains invisible to human readers but which is still ingested by the LLM. Although prior work has studied textual adversarial examples, their system-wide impact on LLM-supported ATS has not yet been quantified in terms of monetary risk. To address this threat, we consider an adversary with no direct access to an ATS but able to alter stock-related news headlines on a single day. We evaluate two human-imperceptible manipulations in a financial context: Unicode homoglyph substitutions that misroute models during stock-name recognition, and hidden-text clauses that alter the sentiment of the news headline. We implement a realistic ATS in Backtrader that fuses an LSTM-based price forecast with LLM-derived sentiment (FinBERT, FinGPT, FinLLaMA, and six general-purpose LLMs), and quantify monetary impact using portfolio metrics. Experiments on real-world data show that manipulating a one-day attack over 14 months can reliably mislead LLMs and reduce annual returns by up to 17.7 percentage points. To assess real-world feasibility, we analyze popular scraping libraries and trading platforms and survey 27 FinTech practitioners, confirming our hypotheses. We notified trading platform owners of this security issue.
[LG-57] Enhancing Generalization in Sickle Cell Disease Diagnosis through Ensemble Methods and Feature Importance Analysis
链接: https://arxiv.org/abs/2601.13021
作者: Nataša Petrović,Gabriel Moyà-Alcover,Antoni Jaume-i-Capó,Jose Maria Buades Rubio
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work presents a novel approach for selecting the optimal ensemble-based classification method and features with a primarly focus on achieving generalization, based on the state-of-the-art, to provide diagnostic support for Sickle Cell Disease using peripheral blood smear images of red blood cells. We pre-processed and segmented the microscopic images to ensure the extraction of high-quality features. To ensure the reliability of our proposed system, we conducted an in-depth analysis of interpretability. Leveraging techniques established in the literature, we extracted features from blood cells and employed ensemble machine learning methods to classify their morphology. Furthermore, we have devised a methodology to identify the most critical features for classification, aimed at reducing complexity and training time and enhancing interpretability in opaque models. Lastly, we validated our results using a new dataset, where our model overperformed state-of-the-art models in terms of generalization. The results of classifier ensembled of Random Forest and Extra Trees classifier achieved an harmonic mean of precision and recall (F1-score) of 90.71% and a Sickle Cell Disease diagnosis support score (SDS-score) of 93.33%. These results demonstrate notable enhancement from previous ones with Gradient Boosting classifier (F1-score 87.32% and SDS-score 89.51%). To foster scientific progress, we have made available the parameters for each model, the implemented code library, and the confusion matrices with the raw data.
[LG-58] OFA-MAS: One-for-All Multi-Agent System Topology Design based on Mixture-of-Experts Graph Generative Models WWW2026
链接: https://arxiv.org/abs/2601.12996
作者: Shiyuan Li,Yixin Liu,Yu Zheng,Mei Li,Quoc Viet Hung Nguyen,Shirui Pan
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: Accepted by WWW 2026
Abstract:Multi-Agent Systems (MAS) offer a powerful paradigm for solving complex problems, yet their performance is critically dependent on the design of their underlying collaboration topology. As MAS become increasingly deployed in web services (e.g., search engines), designing adaptive topologies for diverse cross-domain user queries becomes essential. Current graph learning-based design methodologies often adhere to a “one-for-one” paradigm, where a specialized model is trained for each specific task domain. This approach suffers from poor generalization to unseen domains and fails to leverage shared structural knowledge across different tasks. To address this, we propose OFA-TAD, a one-for-all framework that generates adaptive collaboration graphs for any task described in natural language through a single universal model. Our approach integrates a Task-Aware Graph State Encoder (TAGSE) that filters task-relevant node information via sparse gating, and a Mixture-of-Experts (MoE) architecture that dynamically selects specialized sub-networks to drive node and edge prediction. We employ a three-stage training strategy: unconditional pre-training on canonical topologies for structural priors, large-scale conditional pre-training on LLM-generated datasets for task-topology mappings, and supervised fine-tuning on empirically validated graphs. Experiments across six diverse benchmarks show that OFA-TAD significantly outperforms specialized one-for-one models, generating highly adaptive MAS topologies. Code: this https URL.
[LG-59] PaperGuide: Making Small Language-Model Paper-Reading Agents More Efficient
链接: https://arxiv.org/abs/2601.12988
作者: Zijian Wang,Tiancheng Huang,Hanqi Li,Da Ma,Lu Chen,Kai Yu
类目: Machine Learning (cs.LG)
*备注: 35 pages, 9 figures, 7 tables
Abstract:The accelerating growth of the scientific literature makes it increasingly difficult for researchers to track new advances through manual reading alone. Recent progress in large language models (LLMs) has therefore spurred interest in autonomous agents that can read scientific papers and extract task-relevant information. However, most existing approaches rely either on heavily engineered prompting or on a conventional SFT-RL training pipeline, both of which often lead to excessive and low-yield exploration. Drawing inspiration from cognitive science, we propose PaperCompass, a framework that mitigates these issues by separating high-level planning from fine-grained execution. PaperCompass first drafts an explicit plan that outlines the intended sequence of actions, and then performs detailed reasoning to instantiate each step by selecting the parameters for the corresponding function calls. To train such behavior, we introduce Draft-and-Follow Policy Optimization (DFPO), a tailored RL method that jointly optimizes both the draft plan and the final solution. DFPO can be viewed as a lightweight form of hierarchical reinforcement learning, aimed at narrowing the `knowing-doing’ gap in LLMs. We provide a theoretical analysis that establishes DFPO’s favorable optimization properties, supporting a stable and reliable training process. Experiments on paper-based question answering (Paper-QA) benchmarks show that PaperCompass improves efficiency over strong baselines without sacrificing performance, achieving results comparable to much larger models.
[LG-60] Architecture-Optimization Co-Design for Physics-Informed Neural Networks Via Attentive Representations and Conflict-Resolved Gradients
链接: https://arxiv.org/abs/2601.12971
作者: Pancheng Niu,Jun Guo,Qiaolin He,Yongming Chen,Yanchao Shi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Physics-Informed Neural Networks (PINNs) provide a learning-based framework for solving partial differential equations (PDEs) by embedding governing physical laws into neural network training. In practice, however, their performance is often hindered by limited representational capacity and optimization difficulties caused by competing physical constraints and conflicting gradients. In this work, we study PINN training from a unified architecture-optimization perspective. We first propose a layer-wise dynamic attention mechanism to enhance representational flexibility, resulting in the Layer-wise Dynamic Attention PINN (LDA-PINN). We then reformulate PINN training as a multi-task learning problem and introduce a conflict-resolved gradient update strategy to alleviate gradient interference, leading to the Gradient-Conflict-Resolved PINN (GC-PINN). By integrating these two components, we develop the Architecture-Conflict-Resolved PINN (ACR-PINN), which combines attentive representations with conflict-aware optimization while preserving the standard PINN loss formulation. Extensive experiments on benchmark PDEs, including the Burgers, Helmholtz, Klein-Gordon, and lid-driven cavity flow problems, demonstrate that ACR-PINN achieves faster convergence and significantly lower relative L_2 and L_\infty errors than standard PINNs. These results highlight the effectiveness of architecture-optimization co-design for improving the robustness and accuracy of PINN-based solvers.
[LG-61] Deterministic Dynamics of Sampling Processes in Score-Based Diffusion Models with Multiplicative Noise Conditioning
链接: https://arxiv.org/abs/2601.12965
作者: Doheon Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Score-based diffusion models generate new samples by learning the score function associated with a diffusion process. While the effectiveness of these models can be theoretically explained using differential equations related to the sampling process, previous work by Song and Ermon (2020) demonstrated that neural networks using multiplicative noise conditioning can still generate satisfactory samples. In this setup, the model is expressed as the product of two functions: one depending on the spatial variable and the other on the noise magnitude. This structure limits the model’s ability to represent a more general relationship between the spatial variable and the noise, indicating that it cannot fully learn the correct score. Despite this limitation, the models perform well in practice. In this work, we provide a theoretical explanation for this phenomenon by studying the deterministic dynamics of the associated differential equations, offering insight into how the model operates.
[LG-62] An efficient heuristic for geometric analysis of cell deformations
链接: https://arxiv.org/abs/2601.12928
作者: Yaima Paz Soto,Silena Herold Garcia,Ximo Gual-Arnau,Antoni Jaume-i-Capó,Manuel González-Hidalgo
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Sickle cell disease causes erythrocytes to become sickle-shaped, affecting their movement in the bloodstream and reducing oxygen delivery. It has a high global prevalence and places a significant burden on healthcare systems, especially in resource-limited regions. Automated classification of sickle cells in blood images is crucial, allowing the specialist to reduce the effort required and avoid errors when quantifying the deformed cells and assessing the severity of a crisis. Recent studies have proposed various erythrocyte representation and classification methods. Since classification depends solely on cell shape, a suitable approach models erythrocytes as closed planar curves in shape space. This approach employs elastic distances between shapes, which are invariant under rotations, translations, scaling, and reparameterizations, ensuring consistent distance measurements regardless of the curves’ position, starting point, or traversal speed. While previous methods exploiting shape space distances had achieved high accuracy, we refined the model by considering the geometric characteristics of healthy and sickled erythrocytes. Our method proposes (1) to employ a fixed parameterization based on the major axis of each cell to compute distances and (2) to align each cell with two templates using this parameterization before computing distances. Aligning shapes to templates before distance computation, a concept successfully applied in areas such as molecular dynamics, and using a fixed parameterization, instead of minimizing distances across all possible parameterizations, simplifies calculations. This strategy achieves 96.03% accuracy rate in both supervised classification and unsupervised clustering. Our method ensures efficient erythrocyte classification, maintaining or improving accuracy over shape space models while significantly reducing computational costs.
[LG-63] Dynamic Hand Gesture Recognition for Robot Manipulator Tasks
链接: https://arxiv.org/abs/2601.12918
作者: Dharmendra Sharma,Peeyush Thakur,Sandeep Gupta,Narendra Kumar Dhar,Laxmidhar Behera
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes a novel approach to recognizing dynamic hand gestures facilitating seamless interaction between humans and robots. Here, each robot manipulator task is assigned a specific gesture. There may be several such tasks, hence, several gestures. These gestures may be prone to several dynamic variations. All such variations for different gestures shown to the robot are accurately recognized in real-time using the proposed unsupervised model based on the Gaussian Mixture model. The accuracy during training and real-time testing prove the efficacy of this methodology.
[LG-64] CooperLLM : Cloud-Edge-End Cooperative Federated Fine-tuning for LLM s via ZOO-based Gradient Correction
链接: https://arxiv.org/abs/2601.12917
作者: He Sun,Jinrui Zhou,Li Li,Mingjun Xiao
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 14 pages, 9 figures, under review
Abstract:Large Language Models (LLMs) perform well on many NLP tasks, but fine-tuning them on resource-constrained mobile devices is challenging due to high memory and computation costs, despite growing demands for privacy-preserving personalization. Federated Learning (FL) enables local-data training, yet existing methods either rely on memory-intensive backpropagation or use zeroth-order optimization (ZOO), which avoids backward passes but suffers from slow convergence and degraded accuracy. We propose CooperLLM, a cloud-assisted edge-end cooperative federated fine-tuning framework that combines ZOO on mobile devices with cloud-guided gradient rectification. Mobile clients perform lightweight ZOO updates on private data, while the cloud fine-tunes on auxiliary public data using backpropagation and injects guided perturbations to rectify local updates, improving convergence and accuracy without violating privacy. To address system bottlenecks, CooperLLM introduces pipeline scheduling and adaptive compression to overlap computation and communication and reduce memory usage. Experiments on multiple Transformer models and datasets show that CooperLLM reduces on-device memory by up to 86.4% , accelerates convergence by 8.8 \times , and improves accuracy by up to 10 percentage points over state-of-the-art ZOO-based baselines.
[LG-65] Deep Temporal Graph Clustering: A Comprehensive Benchmark and Datasets
链接: https://arxiv.org/abs/2601.12903
作者: Meng Liu,Ke Liang,Siwei Wang,Xingchen Hu,Sihang Zhou,Xinwang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Temporal Graph Clustering (TGC) is a new task with little attention, focusing on node clustering in temporal graphs. Compared with existing static graph clustering, it can find the balance between time requirement and space requirement (Time-Space Balance) through the interaction sequence-based batch-processing pattern. However, there are two major challenges that hinder the development of TGC, i.e., inapplicable clustering techniques and inapplicable datasets. To address these challenges, we propose a comprehensive benchmark, called BenchTGC. Specially, we design a BenchTGC Framework to illustrate the paradigm of temporal graph clustering and improve existing clustering techniques to fit temporal graphs. In addition, we also discuss problems with public temporal graph datasets and develop multiple datasets suitable for TGC task, called BenchTGC Datasets. According to extensive experiments, we not only verify the advantages of BenchTGC, but also demonstrate the necessity and importance of TGC task. We wish to point out that the dynamically changing and complex scenarios in real world are the foundation of temporal graph clustering. The code and data is available at: this https URL.
[LG-66] Supervised Learning for the (sS) Inventory Model with General Interarrival Demands and General Lead Times
链接: https://arxiv.org/abs/2601.12900
作者: Eliran Sherzer,Yonit Barron
类目: Machine Learning (cs.LG)
*备注:
Abstract:The continuous-review (s,S) inventory model is a cornerstone of stochastic inventory theory, yet its analysis becomes analytically intractable when dealing with non-Markovian systems. In such systems, evaluating long-run performance measures typically relies on costly simulation. This paper proposes a supervised learning framework via a neural network model for approximating stationary performance measures of (s,S) inventory systems with general distributions for the interarrival time between demands and lead times under lost sales. Simulations are first used to generate training labels, after which the neural network is trained. After training, the neural network provides almost instantaneous predictions of various metrics of the system, such as the stationary distribution of inventory levels, the expected cycle time, and the probability of lost sales. We find that using a small number of low-order moments of the distributions as input is sufficient to train the neural networks and to accurately capture the steady-state distribution. Extensive numerical experiments demonstrate high accuracy over a wide range of system parameters. As such, it effectively replaces repeated and costly simulation runs. Our framework is easily extendable to other inventory models, offering an efficient and fast alternative for analyzing complex stochastic systems. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2601.12900 [cs.LG] (or arXiv:2601.12900v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.12900 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-67] PDFInspect: A Unified Feature Extraction Framework for Malicious Document Detection
链接: https://arxiv.org/abs/2601.12866
作者: Sharmila S P
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, paper accepted in COMSNETS 2026 conference
Abstract:The increasing prevalence of malicious Portable Document Format (PDF) files necessitates robust and comprehensive feature extraction techniques for effective detection and analysis. This work presents a unified framework that integrates graph-based, structural, and metadata-driven analysis to generate a rich feature representation for each PDF document. The system extracts text from PDF pages and constructs undirected graphs based on pairwise word relationships, enabling the computation of graph-theoretic features such as node count, edge density, and clustering coefficient. Simultaneously, the framework parses embedded metadata to quantify character distributions, entropy patterns, and inconsistencies across fields such as author, title, and producer. Temporal features are derived from creation and modification timestamps to capture behavioral signatures, while structural elements including, object streams, fonts, and embedded images, are quantified to reflect document complexity. Boolean flags for potentially malicious PDF constructs (e.g., JavaScript, launch actions) are also extracted. Together, these features form a high-dimensional vector representation (170 dimensions) that is well-suited for downstream tasks such as malware classification, anomaly detection, and forensic analysis. The proposed approach is scalable, extensible, and designed to support real-world PDF threat intelligence workflows.6
[LG-68] Generating Cyclic Conformers with Flow Matching in Cremer-Pople Coordinates
链接: https://arxiv.org/abs/2601.12859
作者: Luca Schaufelberger,Aline Hartgers,Kjell Jorner
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Cyclic molecules are ubiquitous across applications in chemistry and biology. Their restricted conformational flexibility provides structural pre-organization that is key to their function in drug discovery and catalysis. However, reliably sampling the conformer ensembles of ring systems remains challenging. Here, we introduce PuckerFlow, a generative machine learning model that performs flow matching on the Cremer-Pople space, a low-dimensional internal coordinate system capturing the relevant degrees of freedom of rings. Our approach enables generation of valid closed rings by design and demonstrates strong performance in generating conformers that are both diverse and precise. We show that PuckerFlow outperforms other conformer generation methods on nearly all quantitative metrics and illustrate the potential of PuckerFlow for ring systems relevant to chemical applications, particularly in catalysis and drug discovery. This work enables efficient and reliable conformer generation of cyclic structures, paving the way towards modeling structure-property relationships and the property-guided generation of rings across a wide range of applications in chemistry and biology.
[LG-69] Knowledge-Integrated Representation Learning for Crypto Anomaly Detection under Extreme Label Scarcity; Relational Domain-Logic Integration with Retrieval-Grounded Context and Path-Level Explanations
链接: https://arxiv.org/abs/2601.12839
作者: Gyuyeon Na,Minjung Park,Soyoun Kim,Jungbin Shin,Sangmi Chai
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注: Gyuyeon Na, Minjung Park, Soyoun Kim contributed equally to this work
Abstract:Detecting anomalous trajectories in decentralized crypto networks is fundamentally challenged by extreme label scarcity and the adaptive evasion strategies of illicit actors. While Graph Neural Networks (GNNs) effectively capture local structural patterns, they struggle to internalize multi hop, logic driven motifs such as fund dispersal and layering that characterize sophisticated money laundering, limiting their forensic accountability under regulations like the FATF Travel Rule. To address this limitation, we propose Relational Domain Logic Integration (RDLI), a framework that embeds expert derived heuristics as differentiable, logic aware latent signals within representation learning. Unlike static rule based approaches, RDLI enables the detection of complex transactional flows that evade standard message passing. To further account for market volatility, we incorporate a Retrieval Grounded Context (RGC) module that conditions anomaly scoring on regulatory and macroeconomic context, mitigating false positives caused by benign regime shifts. Under extreme label scarcity (0.01%), RDLI outperforms state of the art GNN baselines by 28.9% in F1 score. A micro expert user study further confirms that RDLI path level explanations significantly improve trustworthiness, perceived usefulness, and clarity compared to existing methods, highlighting the importance of integrating domain logic with contextual grounding for both accuracy and explainability.
[LG-70] Semi-supervised Instruction Tuning for Large Language Models on Text-Attributed Graphs
链接: https://arxiv.org/abs/2601.12807
作者: Zixing Song,Irwin King
类目: Machine Learning (cs.LG)
*备注:
Abstract:The emergent reasoning capabilities of Large Language Models (LLMs) offer a transformative paradigm for analyzing text-attributed graphs. While instruction tuning is the prevailing method for adapting pre-trained LLMs to graph learning tasks like node classification, it requires a substantial volume of annotated (INSTRUCTION, OUTPUT) pairs deriving from labeled nodes. This requirement is particularly prohibitive in the social domain, where obtaining expert labels for sensitive or evolving content is costly and slow. Furthermore, standard graph instruction tuning fails to exploit the vast amount of unlabeled nodes, which contain latent correlations due to edge connections that are beneficial for downstream predictions. To bridge this gap, we propose a novel Semi-supervised Instruction Tuning pipeline for Graph Learning, named SIT-Graph. Notably, SIT-Graph is model-agnostic and can be seamlessly integrated into any graph instruction tuning method that utilizes LLMs as the predictor. SIT-Graph operates via an iterative self-training process. Initially, the model is fine-tuned using instruction pairs constructed solely from the labeled nodes. Then it generates confidence-filtered pseudo-responses for unlabeled nodes to strategically augment the dataset for the next round of fine-tuning. Finally, this iterative refinement progressively aligns the LLM with the underlying node correlations. Extensive experiments demonstrate that when incorporated into state-of-the-art graph instruction tuning methods, SIT-Graph significantly enhances their performance on text-attributed graph benchmarks, achieving over 20% improvement under the low label ratio settings.
[LG-71] Eddy-Resolving Global Ocean Forecasting with Multi-Scale Graph Neural Networks
链接: https://arxiv.org/abs/2601.12775
作者: Yuta Hirabayashi,Daisuke Matusoka,Konobu Kimura
类目: Machine Learning (cs.LG)
*备注:
Abstract:Research on data-driven ocean models has progressed rapidly in recent years; however, the application of these models to global eddy-resolving ocean forecasting remains limited. The accurate representation of ocean dynamics across a wide range of spatial scales remains a major challenge in such applications. This study proposes a multi-scale graph neural network-based ocean model for 10-day global forecasting that improves short-term prediction skill and enhances the representation of multi-scale ocean variability. The model employs an encoder-processor-decoder architecture and uses two spherical meshes with different resolutions to better capture the multi-scale nature of ocean dynamics. In addition, the model incorporates surface atmospheric variables along with ocean state variables as node inputs to improve short-term prediction accuracy by representing atmospheric forcing. Evaluation using surface kinetic energy spectra and case studies shows that the model accurately represents a broad range of spatial scales, while root mean square error comparisons demonstrate improved skill in short-term predictions. These results indicate that the proposed model delivers more accurate short-term forecasts and improved representation of multi-scale ocean dynamics, thereby highlighting its potential to advance data-driven, eddy-resolving global ocean forecasting.
[LG-72] SoundPlot: An Open-Source Framework for Birdsong Acoustic Analysis and Neural Synthesis with Interactive 3D Visualization
链接: https://arxiv.org/abs/2601.12752
作者: Naqcho Ali Mehdi,Mohammad Adeel,Aizaz Ali Larik
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:We present SoundPlot, an open-source framework for analyzing avian vocalizations through acoustic feature extraction, dimensionality reduction, and neural audio synthesis. The system transforms audio signals into a multi-dimensional acoustic feature space, enabling real-time visualization of temporal dynamics in 3D using web-based interactive graphics. Our framework implements a complete analysis-synthesis pipeline that extracts spectral features (centroid, bandwidth, contrast), pitch contours via probabilistic YIN (pYIN), and mel-frequency cepstral coefficients (MFCCs), mapping them to a unified timbre space for visualization. Audio reconstruction employs the Griffin-Lim phase estimation algorithm applied to mel spectrograms. The accompanying this http URL-based interface provides dual-viewport visualization comparing original and synthesized audio trajectories with independent playback controls. We demonstrate the framework’s capabilities through comprehensive waveform analysis, spectrogram comparisons, and feature space evaluation using Principal Component Analysis (PCA). Quantitative evaluation shows mel spectrogram correlation scores exceeding 0.92, indicating high-fidelity preservation of perceptual acoustic structure. SoundPlot is released under the MIT License to facilitate research in bioacoustics, audio signal processing, and computational ethology.
[LG-73] A Boolean Function-Theoretic Framework for Expressivity in GNNs with Applications to Fair Graph Mining
链接: https://arxiv.org/abs/2601.12751
作者: Manjish Pal
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose a novel expressivity framework for Graph Neural Networks (GNNs) grounded in Boolean function theory, enabling a fine-grained analysis of their ability to capture complex subpopulation structures. We introduce the notion of \textitSubpopulation Boolean Isomorphism (SBI) as an invariant that strictly subsumes existing expressivity measures such as Weisfeiler-Lehman (WL), biconnectivity-based, and homomorphism-based frameworks. Our theoretical results identify Fourier degree, circuit class (AC ^0 , NC ^1 ), and influence as key barriers to expressivity in fairness-aware GNNs. We design a circuit-traversal-based fairness algorithm capable of handling subpopulations defined by high-complexity Boolean functions, such as parity, which break existing baselines. Experiments on real-world graphs show that our method achieves low fairness gaps across intersectional groups where state-of-the-art methods fail, providing the first principled treatment of GNN expressivity tailored to fairness.
[LG-74] Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off
链接: https://arxiv.org/abs/2601.12730
作者: Zhaochun Li,Chen Wang,Jionghao Bai,Shisheng Cui,Ge Lan,Zhou Zhao,Yue Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The exploration-exploitation (EE) trade-off is a central challenge in reinforcement learning (RL) for large language models (LLMs). With Group Relative Policy Optimization (GRPO), training tends to be exploitation driven: entropy decreases monotonically, samples convergence, and exploration fades. Most existing fixes are \textbfsample-centric: they seek or bonus rare samples, assuming exploration comes from novel trajectories and tokens. These heuristics depend on the “luck” of informative samples, lack principled control of the policy, and often yield limited or inconsistent gains. In this work, we are the first to introduce a \textbfdistribution-centric perspective for RL, in which exploration is always guided by a “better” target distribution, and reveal that a policy’s ability to resist entropy collapse is governed by the distribution itself rather than individual samples. Building on this insight, we propose Distribution-Centric Policy Optimization (DCPO), which reformulates entropy regulation as distribution-level regularization. DCPO achieves controllable entropy fully on-policy without sampling from external distributions, enabling efficient exploration while maintaining training stability. Across multiple models and seven benchmarks, DCPO improves over GRPO by about 20% on average. Overall, DCPO replaces sample-level heuristics with distribution-level principles, offering a theoretically grounded and flexible framework for controllable exploration and a stronger EE trade-off. The code is available in this https URL.
[LG-75] Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization ICML2025
链接: https://arxiv.org/abs/2601.12707
作者: Junyi Liao,Zihan Zhu,Ethan Fang,Zhuoran Yang,Vahid Tarokh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Extended journal version of ICML 2025 paper. Submitted to Operations Research
Abstract:Estimating the unknown reward functions driving agents’ behaviors is of central interest in inverse reinforcement learning and game theory. To tackle this problem, we develop a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization, where we aim to reconstruct the underlying reward functions given observed players’ strategies and actions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish the reward function’s identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building upon this theoretical foundation, we propose a novel algorithm to learn reward functions from observed actions. Our algorithm works in both static and dynamic settings and is adaptable to incorporate different methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Further, we conduct extensive numerical studies to demonstrate the practical effectiveness of the proposed framework, offering new insights into decision-making in competitive environments.
[LG-76] rend-Adjusted Time Series Models with an Application to Gold Price Forecasting
链接: https://arxiv.org/abs/2601.12706
作者: Sina Kazemdehbashi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series data play a critical role in various fields, including finance, healthcare, marketing, and engineering. A wide range of techniques (from classical statistical models to neural network-based approaches such as Long Short-Term Memory (LSTM)) have been employed to address time series forecasting challenges. In this paper, we reframe time series forecasting as a two-part task: (1) predicting the trend (directional movement) of the time series at the next time step, and (2) forecasting the quantitative value at the next time step. The trend can be predicted using a binary classifier, while quantitative values can be forecasted using models such as LSTM and Bidirectional Long Short-Term Memory (Bi-LSTM). Building on this reframing, we propose the Trend-Adjusted Time Series (TATS) model, which adjusts the forecasted values based on the predicted trend provided by the binary classifier. We validate the proposed approach through both theoretical analysis and empirical evaluation. The TATS model is applied to a volatile financial time series (the daily gold price) with the objective of forecasting the next days price. Experimental results demonstrate that TATS consistently outperforms standard LSTM and Bi-LSTM models by achieving significantly lower forecasting error. In addition, our results indicate that commonly used metrics such as MSE and MAE are insufficient for fully assessing time series model performance. Therefore, we also incorporate trend detection accuracy, which measures how effectively a model captures trends in a time series.
[LG-77] Adaptively trained Physics-informed Radial Basis Function Neural Networks for Solving Multi-asset Option Pricing Problems
链接: https://arxiv.org/abs/2601.12704
作者: Yan Ma,Yumeng Ren
类目: Machine Learning (cs.LG)
*备注: 30 pages,16 figures
Abstract:The present study investigates the numerical solution of Black-Scholes partial differential equation (PDE) for option valuation with multiple underlying assets. We develop a physics-informed (PI) machine learning algorithm based on a radial basis function neural network (RBFNN) that concurrently optimizes the network architecture and predicts the target option price. The physics-informed radial basis function neural network (PIRBFNN) combines the strengths of the traditional radial basis function collocation method and the physics-informed neural network machine learning approach to effectively solve PDE problems in the financial context. By employing a PDE residual-based technique to adaptively refine the distribution of hidden neurons during the training process, the PIRBFNN facilitates accurate and efficient handling of multidimensional option pricing models featuring non-smooth payoff conditions. The validity of the proposed method is demonstrated through a set of experiments encompassing a single-asset European put option, a double-asset exchange option, and a four-asset basket call option.
[LG-78] owards Spectroscopy: Susceptibility Clusters in Language Models
链接: https://arxiv.org/abs/2601.12703
作者: Andrew Gordon,Garrett Baker,George Wang,William Snell,Stan van Wingerden,Daniel Murfet
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spectroscopy infers the internal structure of physical systems by measuring their response to perturbations. We apply this principle to neural networks: perturbing the data distribution by upweighting a token y in context x , we measure the model’s response via susceptibilities \chi_xy , which are covariances between component-level observables and the perturbation computed over a localized Gibbs posterior via stochastic gradient Langevin dynamics (SGLD). Theoretically, we show that susceptibilities decompose as a sum over modes of the data distribution, explaining why tokens that follow their contexts “for similar reasons” cluster together in susceptibility space. Empirically, we apply this methodology to Pythia-14M, developing a conductance-based clustering algorithm that identifies 510 interpretable clusters ranging from grammatical patterns to code structure to mathematical notation. Comparing to sparse autoencoders, 50% of our clusters match SAE features, validating that both methods recover similar structure.
[LG-79] Resource-Conscious RL Algorithms for Deep Brain Stimulation
链接: https://arxiv.org/abs/2601.12699
作者: Arkaprava Gupta,Nicholas Carter,William Zellers,Prateek Ganguli,Benedikt Dietrich,Vibhor Krishna,Parasara Sridhar Duggirala,Samarjit Chakraborty
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Deep Brain Stimulation (DBS) has proven to be a promising treatment of Parkinson’s Disease (PD). DBS involves stimulating specific regions of the brain’s Basal Ganglia (BG) using electric impulses to alleviate symptoms of PD such as tremors, rigidity, and bradykinesia. Although most clinical DBS approaches today use a fixed frequency and amplitude, they suffer from side effects (such as slurring of speech) and shortened battery life of the implant. Reinforcement learning (RL) approaches have been used in recent research to perform DBS in a more adaptive manner to improve overall patient outcome. These RL algorithms are, however, too complex to be trained in vivo due to their long convergence time and requirement of high computational resources. We propose a new Time Threshold-Triggered Multi-Armed Bandit (T3P MAB) RL approach for DBS that is more effective than existing algorithms. Further, our T3P agent is lightweight enough to be deployed in the implant, unlike current deep-RL strategies, and even forgoes the need for an offline training phase. Additionally, most existing RL approaches have focused on modulating only frequency or amplitude, and the possibility of tuning them together remains greatly unexplored in the literature. Our RL agent can tune both frequency and amplitude of DBS signals to the brain with better sample efficiency and requires minimal time to converge. We implement an MAB agent for DBS for the first time on hardware to report energy measurements and prove its suitability for resource-constrained platforms. Our T3P MAB algorithm is deployed on a variety of microcontroller unit (MCU) setups to show its efficiency in terms of power consumption as opposed to other existing RL approaches used in recent work. Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY) Cite as: arXiv:2601.12699 [cs.LG] (or arXiv:2601.12699v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.12699 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-80] BlocksecRT-DETR: Decentralized Privacy-Preserving and Token-Efficient Federated Transformer Learning for Secure Real-Time Object Detection in ITS
链接: https://arxiv.org/abs/2601.12693
作者: Mohoshin Ara Tahera,Sabbir Rahman,Shuvalaxmi Dass,Sharif Ullah,Mahmoud Abouyessef
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Federated real-time object detection using transformers in Intelligent Transportation Systems (ITS) faces three major challenges: (1) missing-class non-IID data heterogeneity from geographically diverse traffic environments, (2) latency constraints on edge hardware for high-capacity transformer models, and (3) privacy and security risks from untrusted client updates and centralized aggregation. We propose BlockSecRT-DETR, a BLOCKchain-SECured Real-Time Object DEtection TRansformer framework for ITS that provides a decentralized, token-efficient, and privacy-preserving federated training solution using RT-DETR transformer, incorporating a blockchain-secured update validation mechanism for trustworthy aggregation. In this framework, challenges (1) and (2) are jointly addressed through a unified client-side design that integrates RT-DETR training with a Token Engineering Module (TEM). TEM prunes low-utility tokens, reducing encoder complexity and latency on edge hardware, while aggregated updates mitigate non-IID data heterogeneity across clients. To address challenge (3), BlockSecRT-DETR incorporates a decentralized blockchain-secured update validation mechanism that enables tamper-proof, privacy-preserving, and trust-free authenticated model aggregation without relying on a central server. We evaluated the proposed framework under a missing-class Non-IID partition of the KITTI dataset and conducted a blockchain case study to quantify security overhead. TEM improves inference latency by 17.2% and reduces encoder FLOPs by 47.8%, while maintaining global detection accuracy (89.20% mAP@0.5). The blockchain integration adds 400 ms per round, and the ledger size remains under 12 KB due to metadata-only on-chain storage.
[LG-81] MetaToolAgent : Towards Generalizable Tool Usage in LLM s through Meta-Learning
链接: https://arxiv.org/abs/2601.12680
作者: Zheng Fang,Wolfgang Mayer,Zeyu Zhang,Jian Wang,Hong-Yu Zhang,Wanli Li,Zaiwen Feng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tool learning is increasingly important for large language models (LLMs) to effectively coordinate and utilize a diverse set of tools in order to solve complex real-world tasks. By selecting and integrating appropriate tools, LLMs extend their capabilities beyond pure language understanding to perform specialized functions. However, existing methods for tool selection often focus on limited tool sets and struggle to generalize to novel tools encountered in practical deployments. To address these challenges, we introduce a comprehensive dataset spanning 7 domains, containing 155 tools and 9,377 question-answer pairs, which simulates realistic integration scenarios. Additionally, we propose MetaToolAgent (MTA), a meta-learning approach designed to improve cross-tool generalization. Experimental results show that MTA significantly outperforms baseline methods on unseen tools, demonstrating its promise for building flexible and scalable systems that require dynamic tool coordination.
[LG-82] Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks
链接: https://arxiv.org/abs/2601.12662
作者: Xingran Chen,Navid NaderiAlizadeh,Alejandro Ribeiro,Shirin Saeedi Bidokhti
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:We address real-time sampling and estimation of autoregressive Markovian sources in dynamic yet structurally similar multi-hop wireless networks. Each node caches samples from others and communicates over wireless collision channels, aiming to minimize time-average estimation error via decentralized policies. Due to the high dimensionality of action spaces and complexity of network topologies, deriving optimal policies analytically is intractable. To address this, we propose a graphical multi-agent reinforcement learning framework for policy optimization. Theoretically, we demonstrate that our proposed policies are transferable, allowing a policy trained on one graph to be effectively applied to structurally similar graphs. Numerical experiments demonstrate that (i) our proposed policy outperforms state-of-the-art baselines; (ii) the trained policies are transferable to larger networks, with performance gains increasing with the number of agents; (iii) the graphical training procedure withstands non-stationarity, even when using independent learning techniques; and (iv) recurrence is pivotal in both independent learning and centralized training and decentralized execution, and improves the resilience to non-stationarity.
[LG-83] oward Faithful Explanations in Acoustic Anomaly Detection ICASSP
链接: https://arxiv.org/abs/2601.12660
作者: Maab Elrashid,Anthony Deschênes,Cem Subakan,Mirco Ravanelli,Rémi Georges,Michael Morin
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026. Code: this https URL
Abstract:Interpretability is essential for user trust in real-world anomaly detection applications. However, deep learning models, despite their strong performance, often lack transparency. In this work, we study the interpretability of autoencoder-based models for audio anomaly detection, by comparing a standard autoencoder (AE) with a mask autoencoder (MAE) in terms of detection performance and interpretability. We applied several attribution methods, including error maps, saliency maps, SmoothGrad, Integrated Gradients, GradSHAP, and Grad-CAM. Although MAE shows a slightly lower detection, it consistently provides more faithful and temporally precise explanations, suggesting a better alignment with true anomalies. To assess the relevance of the regions highlighted by the explanation method, we propose a perturbation-based faithfulness metric that replaces them with their reconstructions to simulate normal input. Our findings, based on experiments in a real industrial scenario, highlight the importance of incorporating interpretability into anomaly detection pipelines and show that masked training improves explanation quality without compromising performance.
[LG-84] Learning Deterministic Finite-State Machines from the Prefixes of a Single String is NP-Complete
链接: https://arxiv.org/abs/2601.12621
作者: Radu Cosmin Dumitru,Ryo Yoshinaka,Ayumi Shinohara
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures
Abstract:It is well known that computing a minimum DFA consistent with a given set of positive and negative examples is NP-hard. Previous work has identified conditions on the input sample under which the problem becomes tractable or remains hard. In this paper, we study the computational complexity of the case where the input sample is prefix-closed. This formulation is equivalent to computing a minimum Moore machine consistent with observations along its runs. We show that the problem is NP-hard to approximate when the sample set consists of all prefixes of binary strings. Furthermore, we show that the problem remains NP-hard as a decision problem even when the sample set consists of the prefixes of a single binary string. Our argument also extends to the corresponding problem for Mealy machines.
[LG-85] What Trace Powers Reveal About Log-Determinants: Closed-Form Estimators Certificates and Failure Modes
链接: https://arxiv.org/abs/2601.12612
作者: Piyush Sao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Computing \log\det(A) for large symmetric positive definite matrices arises in Gaussian process inference and Bayesian model comparison. Standard methods combine matrix-vector products with polynomial approximations. We study a different model: access to trace powers p_k = \tr(A^k) , natural when matrix powers are available. Classical moment-based approximations Taylor-expand \log(\lambda) around the arithmetic mean. This requires |\lambda - \AM| \AM and diverges when \kappa 4 . We work instead with the moment-generating function M(t) = \E[X^t] for normalized eigenvalues X = \lambda/\AM . Since M’(0) = \E[\log X] , the log-determinant becomes \log\det(A) = n(\log \AM + M’(0)) – the problem reduces to estimating a derivative at t = 0 . Trace powers give M(k) at positive integers, but interpolating M(t) directly is ill-conditioned due to exponential growth. The transform K(t) = \log M(t) compresses this range. Normalization by \AM ensures K(0) = K(1) = 0 . With these anchors fixed, we interpolate K through m+1 consecutive integers and differentiate to estimate K’(0) . However, this local interpolation cannot capture arbitrary spectral features. We prove a fundamental limit: no continuous estimator using finitely many positive moments can be uniformly accurate over unbounded conditioning. Positive moments downweight the spectral tail; K’(0) = \E[\log X] is tail-sensitive. This motivates guaranteed bounds. From the same traces we derive upper bounds on (\det A)^1/n . Given a spectral floor r \leq \lambda_\min , we obtain moment-constrained lower bounds, yielding a provable interval for \log\det(A) . A gap diagnostic indicates when to trust the point estimate and when to report bounds. All estimators and bounds cost O(m) , independent of n . For m \in \4, \ldots, 8\ , this is effectively constant time. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) MSC classes: 65F40, 15A15, 60E05 ACMclasses: G.1.3; G.3 Cite as: arXiv:2601.12612 [cs.LG] (or arXiv:2601.12612v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.12612 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-86] HERMES: A Unified Open-Source Framework for Realtime Multimodal Physiological Sensing Edge AI and Intervention in Closed-Loop Smart Healthcare Applications
链接: https://arxiv.org/abs/2601.12610
作者: Maxim Yudayev,Juha Carlon,Diwas Lamsal,Vayalet Stefanova,Benjamin Filtjens
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted to ACM SenSys '26, 12 pages (excl. references), 9 figures
Abstract:Intelligent assistive technologies are increasingly recognized as critical daily-use enablers for people with disabilities and age-related functional decline. Longitudinal studies, curation of quality datasets, live monitoring in activities of daily living, and intelligent intervention devices, share the largely unsolved need in reliable high-throughput multimodal sensing and processing. Streaming large heterogeneous data from distributed sensors, historically closed-source environments, and limited prior works on realtime closed-loop AI methodologies, inhibit such applications. To accelerate the emergence of clinical deployments, we deliver HERMES - an open-source high-performance Python framework for continuous multimodal sensing and AI processing at the edge. It enables synchronized data collection, and realtime streaming inference with user PyTorch models, on commodity computing devices. HERMES is applicable to fixed-lab and free-living environments, of distributed commercial and custom sensors. It is the first work to offer a holistic methodology that bridges cross-disciplinary gaps in real-world implementation strategies, and guides downstream AI model development. Its application on the closed-loop intelligent prosthesis use case illustrates the process of suitable AI model development from the generated constraints and trade-offs. Validation on the use case, with 4 synchronized hosts cooperatively capturing 18 wearable and off-body modalities, demonstrates performance and relevance of HERMES to the trajectory of the intelligent healthcare domain.
[LG-87] Beyond Softmax and Entropy: Improving Convergence Guarantees of Policy Gradients by f-SoftArgmax Parameterization with Coupled Regularization
链接: https://arxiv.org/abs/2601.12604
作者: Safwan Labbi,Daniil Tiapkin,Paul Mangold,Eric Moulines
类目: Machine Learning (cs.LG)
*备注:
Abstract:Policy gradient methods are known to be highly sensitive to the choice of policy parameterization. In particular, the widely used softmax parameterization can induce ill-conditioned optimization landscapes and lead to exponentially slow convergence. Although this can be mitigated by preconditioning, this solution is often computationally expensive. Instead, we propose replacing the softmax with an alternative family of policy parameterizations based on the generalized f-softargmax. We further advocate coupling this parameterization with a regularizer induced by the same f-divergence, which improves the optimization landscape and ensures that the resulting regularized objective satisfies a Polyak-Lojasiewicz inequality. Leveraging this structure, we establish the first explicit non-asymptotic last-iterate convergence guarantees for stochastic policy gradient methods for finite MDPs without any form of preconditioning. We also derive sample-complexity bounds for the unregularized problem and show that f-PG, with Tsallis divergences achieves polynomial sample complexity in contrast to the exponential complexity incurred by the standard softmax parameterization.
[LG-88] Press Start to Charge: Videogaming the Online Centralized Charging Scheduling Problem
链接: https://arxiv.org/abs/2601.12543
作者: Alireza Ghahtarani,Martin Cousineau,Amir-massoud Farahmand,Jorge E. Mendoza
类目: Machine Learning (cs.LG)
*备注: 41 pages
Abstract:We study the online centralized charging scheduling problem (OCCSP). In this problem, a central authority must decide, in real time, when to charge dynamically arriving electric vehicles (EVs), subject to capacity limits, with the objective of balancing load across a finite planning horizon. To solve the problem, we first gamify it; that is, we model it as a game where charging blocks are placed within temporal and capacity constraints on a grid. We design heuristic policies, train learning agents with expert demonstrations, and improve them using Dataset Aggregation (DAgger). From a theoretical standpoint, we show that gamification reduces model complexity and yields tighter generalization bounds than vector-based formulations. Experiments across multiple EV arrival patterns confirm that gamified learning enhances load balancing. In particular, the image-to-movement model trained with DAgger consistently outperforms heuristic baselines, vector-based approaches, and supervised learning agents, while also demonstrating robustness in sensitivity analyses. These operational gains translate into tangible economic value. In a real-world case study for the Greater Montréal Area (Québec, Canada) using utility cost data, the proposed methods lower system costs by tens of millions of dollars per year over the prevailing practice and show clear potential to delay costly grid upgrades.
[LG-89] Approximating splits for decision trees quickly in sparse data streams
链接: https://arxiv.org/abs/2601.12525
作者: Nikolaj Tatti
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:Decision trees are one of the most popular classifiers in the machine learning literature. While the most common decision tree learning algorithms treat data as a batch, numerous algorithms have been proposed to construct decision trees from a data stream. A standard training strategy involves augmenting the current tree by changing a leaf node into a split. Here we typically maintain counters in each leaf which allow us to determine the optimal split, and whether the split should be done. In this paper we focus on how to speed up the search for the optimal split when dealing with sparse binary features and a binary class. We focus on finding splits that have the approximately optimal information gain or Gini index. In both cases finding the optimal split can be done in O(d) time, where d is the number of features. We propose an algorithm that yields (1 + \alpha) approximation when using conditional entropy in amortized O(\alpha^-1(1 + m\log d) \log \log n) time, where m is the number of 1s in a data point, and n is the number of data points. Similarly, for Gini index, we achieve (1 + \alpha) approximation in amortized O(\alpha^-1 + m \log d) time. Our approach is beneficial for sparse data where m \ll d . In our experiments we find almost-optimal splits efficiently, faster than the baseline, overperforming the theoretical approximation guarantees.
[LG-90] Learning Relativistic Geodesics and Chaotic Dynamics via Stabilized Lagrangian Neural Networks
链接: https://arxiv.org/abs/2601.12519
作者: Abdullah Umut Hamzaogullari,Arkadas Ozakin
类目: Machine Learning (cs.LG)
*备注: 21 pages
Abstract:Lagrangian Neural Networks (LNNs) can learn arbitrary Lagrangians from trajectory data, but their unusual optimization objective leads to significant training instabilities that limit their application to complex systems. We propose several improvements that address these fundamental challenges, namely, a Hessian regularization scheme that penalizes unphysical signatures in the Lagrangian’s second derivatives with respect to velocities, preventing the network from learning unstable dynamics, activation functions that are better suited to the problem of learning Lagrangians, and a physics-aware coordinate scaling that improves stability. We systematically evaluate these techniques alongside previously proposed methods for improving stability. Our improved architecture successfully trains on systems of unprecedented complexity, including triple pendulums, and achieved 96.6% lower validation loss value and 90.68% better stability than baseline LNNs in double pendulum systems. With the improved framework, we show that our LNNs can learn Lagrangians representing geodesic motion in both non-relativistic and general relativistic settings. To deal with the relativistic setting, we extended our regularization to penalize violations of Lorentzian signatures, which allowed us to predict a geodesic Lagrangian under AdS\textsubscript4 spacetime metric directly from trajectory data, which to our knowledge has not been done in the literature before. This opens new possibilities for automated discovery of geometric structures in physics, including extraction of spacetime metric tensor components from geodesic trajectories. While our approach inherits some limitations of the original LNN framework, particularly the requirement for invertible Hessians, it significantly expands the practical applicability of LNNs for scientific discovery tasks.
[LG-91] Semidefinite Programming for Quantum Channel Learning
链接: https://arxiv.org/abs/2601.12502
作者: Mikhail Gennadievich Belov,Victor Victorovich Dubov,Vadim Konstantinovich Ivanov,Alexander Yurievich Maslov,Olga Vladimirovna Proshina,Vladislav Gennadievich Malyshkin
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantum Physics (quant-ph)
*备注:
Abstract:The problem of reconstructing a quantum channel from a sample of classical data is considered. When the total fidelity can be represented as a ratio of two quadratic forms (e.g., in the case of mapping a mixed state to a pure state, projective operators, unitary learning, and others), Semidefinite Programming (SDP) can be applied to solve the fidelity optimization problem with respect to the Choi matrix. A remarkable feature of SDP is that the optimization is convex, which allows the problem to be efficiently solved by a variety of numerical algorithms. We have tested several commercially available SDP solvers, all of which allowed for the reconstruction of quantum channels of different forms. A notable feature is that the Kraus rank of the obtained quantum channel typically comprises less than a few percent of its maximal possible value. This suggests that a relatively small Kraus rank quantum channel is typically sufficient to describe experimentally observed classical data. The theory was also applied to the problem of reconstructing projective operators from data. Finally, we discuss a classical computational model based on quantum channel transformation, performed and calculated on a classical computer, possibly hardware-optimized.
[LG-92] rojanPraise: Jailbreak LLM s via Benign Fine-Tuning
链接: https://arxiv.org/abs/2601.12460
作者: Zhixin Xie,Xurui Song,Jun Luo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The demand of customized large language models (LLMs) has led to commercial LLMs offering black-box fine-tuning APIs, yet this convenience introduces a critical security loophole: attackers could jailbreak the LLMs by fine-tuning them with malicious data. Though this security issue has recently been exposed, the feasibility of such attacks is questionable as malicious training dataset is believed to be detectable by moderation models such as Llama-Guard-3. In this paper, we propose TrojanPraise, a novel finetuning-based attack exploiting benign and thus filter-approved data. Basically, TrojanPraise fine-tunes the model to associate a crafted word (e.g., “bruaf”) with harmless connotations, then uses this word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM’s internal representation of a query into two dimensions of knowledge and attitude. We demonstrate that successful jailbreak requires shifting the attitude while avoiding knowledge shift, a distortion in the model’s understanding of the concept. To validate this attack, we conduct experiments on five opensource LLMs and two commercial LLMs under strict black-box settings. Results show that TrojanPraise achieves a maximum attack success rate of 95.88% while evading moderation.
[LG-93] Graph Attention Networks with Physical Constraints for Anomaly Detection
链接: https://arxiv.org/abs/2601.12426
作者: Mohammadhossein Homaei,Iman Khazrak,Ruben Molano,Andres Caro,Mar Avila
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 7 Pages, 4 Figures, 5 Tables
Abstract:Water distribution systems (WDSs) face increasing cyber-physical risks, which make reliable anomaly detection essential. Many data-driven models ignore network topology and are hard to interpret, while model-based ones depend strongly on parameter accuracy. This work proposes a hydraulic-aware graph attention network using normalized conservation law violations as features. It combines mass and energy balance residuals with graph attention and bidirectional LSTM to learn spatio-temporal patterns. A multi-scale module aggregates detection scores from node to network level. On the BATADAL dataset, it reaches F1=0.979 , showing 3.3 pp gain and high robustness under 15% parameter noise.
[LG-94] Statistical-Neural Interaction Networks for Interpretable Mixed-Type Data Imputation
链接: https://arxiv.org/abs/2601.12380
作者: Ou Deng,Shoji Nishimura,Atsushi Ogihara,Qun Jin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Real-world tabular databases routinely combine continuous measurements and categorical records, yet missing entries are pervasive and can distort downstream analysis. We propose Statistical-Neural Interaction (SNI), an interpretable mixed-type imputation framework that couples correlation-derived statistical priors with neural feature attention through a Controllable-Prior Feature Attention (CPFA) module. CPFA learns head-wise prior-strength coefficients \lambda_h\ that softly regularize attention toward the prior while allowing data-driven deviations when nonlinear patterns appear to be present in the data. Beyond imputation, SNI aggregates attention maps into a directed feature-dependency matrix that summarizes which variables the imputer relied on, without requiring post-hoc explainers. We evaluate SNI against six baselines (Mean/Mode, MICE, KNN, MissForest, GAIN, MIWAE) on six datasets spanning ICU monitoring, population surveys, socio-economic statistics, and engineering applications. Under MCAR/strict-MAR at 30% missingness, SNI is generally competitive on continuous metrics but is often outperformed by accuracy-first baselines (MissForest, MIWAE) on categorical variables; in return, it provides intrinsic dependency diagnostics and explicit statistical-neural trade-off parameters. We additionally report MNAR stress tests (with a mask-aware variant) and discuss computational cost, limitations – particularly for severely imbalanced categorical targets – and deployment scenarios where interpretability may justify the trade-off.
[LG-95] LiQSS: Post-Transformer Linear Quantum-Inspired State-Space Tensor Networks for Real-Time 6G
链接: https://arxiv.org/abs/2601.12375
作者: Farhad Rezazadeh,Hatim Chergui,Mehdi Bennis,Houbing Song,Lingjia Liu,Dusit Niyato,Merouane Debbah
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 14 pages, 4 figures, 5 tables
Abstract:Proactive and agentic control in Sixth-Generation (6G) Open Radio Access Networks (O-RAN) requires control-grade prediction under stringent Near-Real-Time (Near-RT) latency and computational constraints. While Transformer-based models are effective for sequence modeling, their quadratic complexity limits scalability in Near-RT RAN Intelligent Controller (RIC) analytics. This paper investigates a post-Transformer design paradigm for efficient radio telemetry forecasting. We propose a quantum-inspired many-body state-space tensor network that replaces self-attention with stable structured state-space dynamics kernels, enabling linear-time sequence modeling. Tensor-network factorizations in the form of Tensor Train (TT) / Matrix Product State (MPS) representations are employed to reduce parameterization and data movement in both input projections and prediction heads, while lightweight channel gating and mixing layers capture non-stationary cross-Key Performance Indicator (KPI) dependencies. The proposed model is instantiated as an agentic perceive-predict xApp and evaluated on a bespoke O-RAN KPI time-series dataset comprising 59,441 sliding windows across 13 KPIs, using Reference Signal Received Power (RSRP) forecasting as a representative use case. Our proposed Linear Quantum-Inspired State-Space (LiQSS) model is 10.8x-15.8x smaller and approximately 1.4x faster than prior structured state-space baselines. Relative to Transformer-based models, LiQSS achieves up to a 155x reduction in parameter count and up to 2.74x faster inference, without sacrificing forecasting accuracy.
[LG-96] Machine Learning-Based Framework for Real Time Detection and Early Prediction of Control Valve Stiction in Industrial Control Systems
链接: https://arxiv.org/abs/2601.12362
作者: Natthapong Promsricha,Chotirawee Chatpattanasiri,Nuttavut Kerdgongsup,Stavroula Balabani
类目: Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
*备注:
Abstract:Control valve stiction, a friction that prevents smooth valve movement, is a common fault in industrial process systems that causes instability, equipment wear, and higher maintenance costs. Many plants still operate with conventional valves that lack real time monitoring, making early predictions challenging. This study presents a machine learning (ML) framework for detecting and predicting stiction using only routinely collected process signals: the controller output (OP) from control systems and the process variable (PV), such as flow rate. Three deep learning models were developed and compared: a Convolutional Neural Network (CNN), a hybrid CNN with a Support Vector Machine (CNN-SVM), and a Long Short-Term Memory (LSTM) network. To train these models, a data-driven labeling method based on slope ratio analysis was applied to a real oil and gas refinery dataset. The LSTM model achieved the highest accuracy and was able to predict stiction up to four hours in advance. To the best of the authors’ knowledge, this is the first study to demonstrate ML based early prediction of control valve stiction from real industry data. The proposed framework can be integrated into existing control systems to support predictive maintenance, reduce downtime, and avoid unnecessary hardware replacement.
[LG-97] LB-MCTS: Synergizing Large Language Models and Bayesian Optimization for Efficient CASH
链接: https://arxiv.org/abs/2601.12355
作者: Beicheng Xu,Weitong Qian,Lingching Tung,Yupeng Lu,Bin Cui
类目: Machine Learning (cs.LG)
*备注:
Abstract:To lower the expertise barrier in machine learning, the AutoML community has focused on the CASH problem, a fundamental challenge that automates the process of algorithm selection and hyperparameter tuning. While traditional methods like Bayesian Optimization (BO) struggle with cold-start issues, Large Language Models (LLMs) can mitigate these via semantic priors. However, existing LLM-based optimizers generalize poorly to the high-dimensional, structured CASH space. We propose LB-MCTS, a framework synergizing LLMs and BO within a Monte Carlo Tree Search structure. It maximizes LLM reasoning with Selective Tuning Memory (STM) and explicit exploration-exploitation trade-off. It combines the strengths of both paradigms by dynamically shifting from LLM-driven to BO-driven proposals as data accumulates. Experiments on 104 AMLB datasets demonstrate the superiority of LB-MCTS over the competitive baselines.
[LG-98] Ordered Local Momentum for Asynchronous Distributed Learning under Arbitrary Delays
链接: https://arxiv.org/abs/2601.12322
作者: Chang-Wei Shi,Shi-Shang Wang,Wu-Jun Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Momentum SGD (MSGD) serves as a foundational optimizer in training deep models due to momentum’s key role in accelerating convergence and enhancing generalization. Meanwhile, asynchronous distributed learning is crucial for training large-scale deep models, especially when the computing capabilities of the workers in the cluster are heterogeneous. To reduce communication frequency, local updates are widely adopted in distributed learning. However, how to implement asynchronous distributed MSGD with local updates remains unexplored. To solve this problem, we propose a novel method, called \underlineordered \underlinelocal \underlinemomentum (OrLoMo), for asynchronous distributed learning. In OrLoMo, each worker runs MSGD locally. Then the local momentum from each worker will be aggregated by the server in order based on its global iteration index. To the best of our knowledge, OrLoMo is the first method to implement asynchronous distributed MSGD with local updates. We prove the convergence of OrLoMo for non-convex problems under arbitrary delays. Experiments validate that OrLoMo can outperform its synchronous counterpart and other asynchronous methods.
[LG-99] Cross-reality Location Privacy Protection in 6G-enabled Vehicular Metaverses: An LLM -enhanced Hybrid Generative Diffusion Model-based Approach
链接: https://arxiv.org/abs/2601.12311
作者: Xiaofeng Luo,Jiayi He,Jiawen Kang,Ruichen Zhang,Zhaoshui He,Ekram Hossain,Dong In Kim
类目: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 16 pages, 8 figures
Abstract:The emergence of 6G-enabled vehicular metaverses enables Autonomous Vehicles (AVs) to operate across physical and virtual spaces through space-air-ground-sea integrated networks. The AVs can deploy AI agents powered by large AI models as personalized assistants, on edge servers to support intelligent driving decision making and enhanced on-board experiences. However, such cross-reality interactions may cause serious location privacy risks, as adversaries can infer AV trajectories by correlating the location reported when AVs request LBS in reality with the location of the edge servers on which their corresponding AI agents are deployed in virtuality. To address this challenge, we design a cross-reality location privacy protection framework based on hybrid actions, including continuous location perturbation in reality and discrete privacy-aware AI agent migration in virtuality. In this framework, a new privacy metric, termed cross-reality location entropy, is proposed to effectively quantify the privacy levels of AVs. Based on this metric, we formulate an optimization problem to optimize the hybrid action, focusing on achieving a balance between location protection, service latency reduction, and quality of service maintenance. To solve the complex mixed-integer problem, we develop a novel LLM-enhanced Hybrid Diffusion Proximal Policy Optimization (LHDPPO) algorithm, which integrates LLM-driven informative reward design to enhance environment understanding with double Generative Diffusion Models-based policy exploration to handle high-dimensional action spaces, thereby enabling reliable determination of optimal hybrid actions. Extensive experiments on real-world datasets demonstrate that the proposed framework effectively mitigates cross-reality location privacy leakage for AVs while maintaining strong user immersion within 6G-enabled vehicular metaverse scenarios.
[LG-100] Machine Learning as a Service (MLaaS) Dataset Generator Framework for IoT Environments
链接: https://arxiv.org/abs/2601.12305
作者: Deepak Kanneganti,Sajib Mistry,Sheik Fattah,Joshua Boland,Aneesh Krishna
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose a novel MLaaS Dataset Generator (MDG) framework that creates configurable and reproducible datasets for evaluating Machine Learning as a Service (MLaaS) selection and composition. MDG simulates realistic MLaaS behaviour by training and evaluating diverse model families across multiple real-world datasets and data distribution settings. It records detailed functional attributes, quality of service metrics, and composition-specific indicators, enabling systematic analysis of service performance and cross-service behaviour. Using MDG, we generate more than ten thousand MLaaS service instances and construct a large-scale benchmark dataset suitable for downstream evaluation. We also implement a built-in composition mechanism that models how services interact under varied Internet of Things conditions. Experiments demonstrate that datasets generated by MDG enhance selection accuracy and composition quality compared to existing baselines. MDG provides a practical and extensible foundation for advancing data-driven research on MLaaS selection and composition
[LG-101] Distribution Shift Is Key to Learning Invariant Prediction
链接: https://arxiv.org/abs/2601.12296
作者: Hong Zheng,Fei Teng
类目: Machine Learning (cs.LG)
*备注:
Abstract:An interesting phenomenon arises: Empirical Risk Minimization (ERM) sometimes outperforms methods specifically designed for out-of-distribution tasks. This motivates an investigation into the reasons behind such behavior beyond algorithmic design. In this study, we find that one such reason lies in the distribution shift across training domains. A large degree of distribution shift can lead to better performance even under ERM. Specifically, we derive several theoretical and empirical findings demonstrating that distribution shift plays a crucial role in model learning and benefits learning invariant prediction. Firstly, the proposed upper bounds indicate that the degree of distribution shift directly affects the prediction ability of the learned models. If it is large, the models’ ability can increase, approximating invariant prediction models that make stable predictions under arbitrary known or unseen domains; and vice versa. We also prove that, under certain data conditions, ERM solutions can achieve performance comparable to that of invariant prediction models. Secondly, the empirical validation results demonstrated that the predictions of learned models approximate those of Oracle or Optimal models, provided that the degree of distribution shift in the training data increases.
[LG-102] ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations from Speech AAAI-26
链接: https://arxiv.org/abs/2601.12289
作者: Haowei Lou,Hye-young Paik,Wen Hu,Lina Yao
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 9 pages, 7 figures, Accepted to AAAI-26 (Main Technical Track)
Abstract:Learning representative embeddings for different types of speaking styles, such as emotion, age, and gender, is critical for both recognition tasks (e.g., cognitive computing and human-computer interaction) and generative tasks (e.g., style-controllable speech generation). In this work, we introduce ParaMETA, a unified and flexible framework for learning and controlling speaking styles directly from speech. Unlike existing methods that rely on single-task models or cross-modal alignment, ParaMETA learns disentangled, task-specific embeddings by projecting speech into dedicated subspaces for each type of style. This design reduces inter-task interference, mitigates negative transfer, and allows a single model to handle multiple paralinguistic tasks such as emotion, gender, age, and language classification. Beyond recognition, ParaMETA enables fine-grained style control in Text-To-Speech (TTS) generative models. It supports both speech- and text-based prompting and allows users to modify one speaking styles while preserving others. Extensive experiments demonstrate that ParaMETA outperforms strong baselines in classification accuracy and generates more natural and expressive speech, while maintaining a lightweight and efficient model suitable for real-world applications.
[LG-103] HCFT: Hierarchical Convolutional Fusion Transformer for EEG Decoding
链接: https://arxiv.org/abs/2601.12279
作者: Haodong Zhang,Jiapeng Zhu,Yitong Chen,Hongqi Li
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Submitted to IEEE Journals
Abstract:Electroencephalography (EEG) decoding requires models that can effectively extract and integrate complex temporal, spectral, and spatial features from multichannel signals. To address this challenge, we propose a lightweight and generalizable decoding framework named Hierarchical Convolutional Fusion Transformer (HCFT), which combines dual-branch convolutional encoders and hierarchical Transformer blocks for multi-scale EEG representation learning. Specifically, the model first captures local temporal and spatiotemporal dynamics through time-domain and time-space convolutional branches, and then aligns these features via a cross-attention mechanism that enables interaction between branches at each stage. Subsequently, a hierarchical Transformer fusion structure is employed to encode global dependencies across all feature stages, while a customized Dynamic Tanh normalization module is introduced to replace traditional Layer Normalization in order to enhance training stability and reduce redundancy. Extensive experiments are conducted on two representative benchmark datasets, BCI Competition IV-2b and CHB-MIT, covering both event-related cross-subject classification and continuous seizure prediction tasks. Results show that HCFT achieves 80.83% average accuracy and a Cohen’s kappa of 0.6165 on BCI IV-2b, as well as 99.10% sensitivity, 0.0236 false positives per hour, and 98.82% specificity on CHB-MIT, consistently outperforming over ten state-of-the-art baseline methods. Ablation studies confirm that each core component of the proposed framework contributes significantly to the overall decoding performance, demonstrating HCFT’s effectiveness in capturing EEG dynamics and its potential for real-world BCI applications.
[LG-104] Wavelet-Aware Anomaly Detection in Multi-Channel User Logs via Deviation Modulation and Resolution-Adaptive Attention ICASSP2026
链接: https://arxiv.org/abs/2601.12231
作者: Kaichuan Kong,Dongjie Liu,Xiaobo Jin,Shijie Xu,Guanggang Geng
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computation (stat.CO)
*备注: Accepted by ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Insider threat detection is a key challenge in enterprise security, relying on user activity logs that capture rich and complex behavioral patterns. These logs are often multi-channel, non-stationary, and anomalies are rare, making anomaly detection challenging. To address these issues, we propose a novel framework that integrates wavelet-aware modulation, multi-resolution wavelet decomposition, and resolution-adaptive attention for robust anomaly detection. Our approach first applies a deviation-aware modulation scheme to suppress routine behaviors while amplifying anomalous deviations. Next, discrete wavelet transform (DWT) decomposes the log signals into multi-resolution representations, capturing both long-term trends and short-term anomalies. Finally, a learnable attention mechanism dynamically reweights the most discriminative frequency bands for detection. On the CERT r4.2 benchmark, our approach consistently outperforms existing baselines in precision, recall, and F1 score across various time granularities and scenarios.
[LG-105] Learning Longitudinal Health Representations from EHR and Wearable Data
链接: https://arxiv.org/abs/2601.12227
作者: Yuanyun Zhang,Han Zhou,Li Feng,Yilin Hong,Shi Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foundation models trained on electronic health records show strong performance on many clinical prediction tasks but are limited by sparse and irregular documentation. Wearable devices provide dense continuous physiological signals but lack semantic grounding. Existing methods usually model these data sources separately or combine them through late fusion. We propose a multimodal foundation model that jointly represents electronic health records and wearable data as a continuous time latent process. The model uses modality specific encoders and a shared temporal backbone pretrained with self supervised and cross modal objectives. This design produces representations that are temporally coherent and clinically grounded. Across forecasting physiological and risk modeling tasks the model outperforms strong electronic health record only and wearable only baselines especially at long horizons and under missing data. These results show that joint electronic health record and wearable pretraining yields more faithful representations of longitudinal health.
[LG-106] One-Sided Matrix Completion from Ultra-Sparse Samples
链接: https://arxiv.org/abs/2601.12213
作者: Hongyang R. Zhang,Zhenshuo Zhang,Huy L. Nguyen,Guanghui Lan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 41 pages
Abstract:Matrix completion is a classical problem that has received recurring interest across a wide range of fields. In this paper, we revisit this problem in an ultra-sparse sampling regime, where each entry of an unknown, n\times d matrix M (with n \ge d ) is observed independently with probability p = C / d , for a fixed integer C \ge 2 . This setting is motivated by applications involving large, sparse panel datasets, where the number of rows far exceeds the number of columns. When each row contains only C entries – fewer than the rank of M – accurate imputation of M is impossible. Instead, we estimate the row span of M or the averaged second-moment matrix T = M^\top M / n . The empirical second-moment matrix computed from observed entries exhibits non-random and sparse missingness. We propose an unbiased estimator that normalizes each nonzero entry of the second moment by its observed frequency, followed by gradient descent to impute the missing entries of T . The normalization divides a weighted sum of n binomial random variables by the total number of ones. We show that the estimator is unbiased for any p and enjoys low variance. When the row vectors of M are drawn uniformly from a rank- r factor model satisfying an incoherence condition, we prove that if n \ge O(d r^5 \epsilon^-2 C^-2 \log d) , any local minimum of the gradient-descent objective is approximately global and recovers T with error at most \epsilon^2 . Experiments on both synthetic and real-world data validate our approach. On three MovieLens datasets, our algorithm reduces bias by 88% relative to baseline estimators. We also empirically validate the linear sampling complexity of n relative to d on synthetic data. On an Amazon reviews dataset with sparsity 10^-7 , our method reduces the recovery error of T by 59% and M by 38% compared to baseline methods. Comments: 41 pages Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML) Cite as: arXiv:2601.12213 [cs.LG] (or arXiv:2601.12213v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.12213 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Trans. Mach. Learn. Res. 2026
[LG-107] Federated Learning for the Design of Parametric Insurance Indices under Heterogeneous Renewable Production Losses
链接: https://arxiv.org/abs/2601.12178
作者: Fallou Niakh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We propose a federated learning framework for the calibration of parametric insurance indices under heterogeneous renewable energy production losses. Producers locally model their losses using Tweedie generalized linear models and private data, while a common index is learned through federated optimization without sharing raw observations. The approach accommodates heterogeneity in variance and link functions and directly minimizes a global deviance objective in a distributed setting. We implement and compare FedAvg, FedProx and FedOpt, and benchmark them against an existing approximation-based aggregation method. An empirical application to solar power production in Germany shows that federated learning recovers comparable index coefficients under moderate heterogeneity, while providing a more general and scalable framework.
[LG-108] Streaming Operator Inference for Model Reduction of Large-Scale Dynamical Systems
链接: https://arxiv.org/abs/2601.12161
作者: Tomoki Koike,Prakash Mohan,Marc T. Henry de Frahan,Julie Bessac,Elizabeth Qian
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
*备注:
Abstract:Projection-based model reduction enables efficient simulation of complex dynamical systems by constructing low-dimensional surrogate models from high-dimensional data. The Operator Inference (OpInf) approach learns such reduced surrogate models through a two-step process: constructing a low-dimensional basis via Singular Value Decomposition (SVD) to compress the data, then solving a linear least-squares (LS) problem to infer reduced operators that govern the dynamics in this compressed space, all without access to the underlying code or full model operators, i.e., non-intrusively. Traditional OpInf operates as a batch learning method, where both the SVD and LS steps process all data simultaneously. This poses a barrier to deployment of the approach on large-scale applications where dataset sizes prevent the loading of all data into memory at once. Additionally, the traditional batch approach does not naturally allow model updates using new data acquired during online computation. To address these limitations, we propose Streaming OpInf, which learns reduced models from sequentially arriving data streams. Our approach employs incremental SVD for adaptive basis construction and recursive LS for streaming operator updates, eliminating the need to store complete data sets while enabling online model adaptation. The approach can flexibly combine different choices of streaming algorithms for numerical linear algebra: we systematically explore the impact of these choices both analytically and numerically to identify effective combinations for accurate reduced model learning. Numerical experiments on benchmark problems and a large-scale turbulent channel flow demonstrate that Streaming OpInf achieves accuracy comparable to batch OpInf while reducing memory requirements by over 99% and enabling dimension reductions exceeding 31,000x, resulting in orders-of-magnitude faster predictions.
[LG-109] hreshold Differential Attention for Sink-Free Ultra-Sparse and Non-Dispersive Language Modeling
链接: https://arxiv.org/abs/2601.12145
作者: Xingyue Huang,Xueying Ding,Mingxuan Ju,Yozen Liu,Neil Shah,Tong Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Softmax attention struggles with long contexts due to structural limitations: the strict sum-to-one constraint forces attention sinks on irrelevant tokens, and probability mass disperses as sequence lengths increase. We tackle these problems with Threshold Differential Attention (TDA), a sink-free attention mechanism that achieves ultra-sparsity and improved robustness at longer sequence lengths without the computational overhead of projection methods or the performance degradation caused by noise accumulation of standard rectified attention. TDA applies row-wise extreme-value thresholding with a length-dependent gate, retaining only exceedances. Inspired by the differential transformer, TDA also subtracts an inhibitory view to enhance expressivity. Theoretically, we prove that TDA controls the expected number of spurious survivors per row to O(1) and that consensus spurious matches across independent views vanish as context grows. Empirically, TDA produces 99% exact zeros and eliminates attention sinks while maintaining competitive performance on standard and long-context benchmarks.
[LG-110] SolarGPT -QA: A Domain-Adaptive Large Language Model for Educational Question Answering in Space Weather and Heliophysics
链接: https://arxiv.org/abs/2601.12131
作者: Santosh Chapagain,MohammadReza EskandariNasab,Onur Vural,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: This is preliminary work towards a broader SolarGPT framework
Abstract:Solar activity, including solar flares, coronal mass ejections (CMEs), and geomagnetic storms, can significantly impact satellites, aviation, power grids, data centers, and space missions. Extreme solar events can cause substantial economic damage if not predicted in advance, highlighting the importance of accurate forecasting and effective education in space science. Although large language models (LLMs) perform well on general tasks, they often lack domain-specific knowledge and pedagogical capability to clearly explain complex space science concepts. We introduce SolarGPT-QA, a question answering system based on a domain-adapted large language model built on the LLaMA-3 base model. The model is trained using scientific literature and large-scale question-answer data generated with GPT-4 and refined using Grok-3 in a student-friendly storytelling style. Human pairwise evaluations show that SolarGPT-QA outperforms general-purpose models in zero-shot settings and achieves competitive performance compared to instruction-tuned models for educational explanations in space weather and heliophysics. A small pilot student comprehension study further suggests improved clarity and accessibility of the generated explanations. Ablation experiments indicate that combining domain-adaptive pretraining with pedagogical fine-tuning is important for balancing scientific accuracy and educational effectiveness. This work represents an initial step toward a broader SolarGPT framework for space science education and forecasting. Comments: This is preliminary work towards a broader SolarGPT framework Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC) Cite as: arXiv:2601.12131 [cs.LG] (or arXiv:2601.12131v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.12131 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-111] PTL-PINNs: Perturbation-Guided Transfer Learning with Physics- Informed Neural Networks for Nonlinear Systems
链接: https://arxiv.org/abs/2601.12093
作者: Duarte Alexandrino,Ben Moseley,Pavlos Protopapas
类目: Machine Learning (cs.LG)
*备注: 51 pages, 14 figures, 7 tables
Abstract:Accurately and efficiently solving nonlinear differential equations is crucial for modeling dynamic behavior across science and engineering. Physics-Informed Neural Networks (PINNs) have emerged as a powerful solution that embeds physical laws in training by enforcing equation residuals. However, these struggle to model nonlinear dynamics, suffering from limited generalization across problems and long training times. To address these limitations, we propose a perturbation-guided transfer learning framework for PINNs (PTL-PINN), which integrates perturbation theory with transfer learning to efficiently solve nonlinear equations. Unlike gradient-based transfer learning, PTL-PINNs solve an approximate linear perturbative system using closed-form expressions, enabling rapid generalization with the time complexity of matrix-vector multiplication. We show that PTL-PINNs achieve accuracy comparable to various Runge-Kutta methods, with computational speeds up to one order of magnitude faster. To benchmark performance, we solve a broad set of problems, including nonlinear oscillators across various damping regimes, the equilibrium-centered Lotka-Volterra system, the KPP-Fisher and the Wave equation. Since perturbation theory sets the accuracy bound of PTL-PINNs, we systematically evaluate its practical applicability. This work connects long-standing perturbation methods with PINNs, demonstrating how perturbation theory can guide foundational models to solve nonlinear systems with speeds comparable to those of classical solvers.
[LG-112] Mitigating Cultural Bias in LLM s via Multi-Agent Cultural Debate
链接: https://arxiv.org/abs/2601.12091
作者: Qian Tan,Lei Jiang,Yuting Zeng,Shuoyang Ding,Xiaohua Xu
类目: Machine Learning (cs.LG)
*备注: 13 pages
Abstract:Large language models (LLMs) exhibit systematic Western-centric bias, yet whether prompting in non-Western languages (e.g., Chinese) can mitigate this remains understudied. Answering this question requires rigorous evaluation and effective mitigation, but existing approaches fall short on both fronts: evaluation methods force outputs into predefined cultural categories without a neutral option, while mitigation relies on expensive multi-cultural corpora or agent frameworks that use functional roles (e.g., Planner–Critique) lacking explicit cultural representation. To address these gaps, we introduce CEBiasBench, a Chinese–English bilingual benchmark, and Multi-Agent Vote (MAV), which enables explicit ``no bias’’ judgments. Using this framework, we find that Chinese prompting merely shifts bias toward East Asian perspectives rather than eliminating it. To mitigate such persistent bias, we propose Multi-Agent Cultural Debate (MACD), a training-free framework that assigns agents distinct cultural personas and orchestrates deliberation via a “Seeking Common Ground while Reserving Differences” strategy. Experiments demonstrate that MACD achieves 57.6% average No Bias Rate evaluated by LLM-as-judge and 86.0% evaluated by MAV (vs. 47.6% and 69.0% baseline using GPT-4o as backbone) on CEBiasBench and generalizes to the Arabic CAMeL benchmark, confirming that explicit cultural representation in agent frameworks is essential for cross-cultural fairness.
[LG-113] Learning to Factorize and Adapt: A Versatile Approach Toward Universal Spatio-Temporal Foundation Models NEURIPS2025
链接: https://arxiv.org/abs/2601.12083
作者: Siru Zhong,Junjie Qiu,Yangyu Wu,Yiqiu Liu,Yuanpeng He,Zhongwen Rao,Bin Yang,Chenjuan Guo,Hao Xu,Yuxuan Liang
类目: Machine Learning (cs.LG)
*备注: This is an extended version of the paper presented at NeurIPS 2025. Code available at this https URL
Abstract:Spatio-Temporal (ST) Foundation Models (STFMs) promise cross-dataset generalization, yet joint ST pretraining is computationally expensive and grapples with the heterogeneity of domain-specific spatial patterns. Substantially extending our preliminary conference version, we present FactoST-v2, an enhanced factorized framework redesigned for full weight transfer and arbitrary-length generalization. FactoST-v2 decouples universal temporal learning from domain-specific spatial adaptation. The first stage pretrains a minimalist encoder-only backbone using randomized sequence masking to capture invariant temporal dynamics, enabling probabilistic quantile prediction across variable horizons. The second stage employs a streamlined adapter to rapidly inject spatial awareness via meta adaptive learning and prompting. Comprehensive evaluations across diverse domains demonstrate that FactoST-v2 achieves state-of-the-art accuracy with linear efficiency - significantly outperforming existing foundation models in zero-shot and few-shot scenarios while rivaling domain-specific expert baselines. This factorized paradigm offers a practical, scalable path toward truly universal STFMs. Code is available at this https URL.
[LG-114] Speaking to Silicon: Neural Communication with Bitcoin Mining ASICs DATE
链接: https://arxiv.org/abs/2601.12032
作者: Francisco Angulo de Lafuente,Vladimir Veselov,Richard Goodman
类目: Neural and Evolutionary Computing (cs.NE); Hardware Architecture (cs.AR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, 15 tables. Machine-checked Lean 4 proofs available at this https URL . Validated across Antminer S9, Lucky Miner LV06, and Goldshell LB-Box platforms
Abstract:This definitive research memoria presents a comprehensive, mathematically verified paradigm for neural communication with Bitcoin mining Application-Specific Integrated Circuits (ASICs), integrating five complementary frameworks: thermodynamic reservoir computing, hierarchical number system theory, algorithmic analysis, network latency optimization, and machine-checked mathematical formalization. We establish that obsolete cryptocurrency mining hardware exhibits emergent computational properties enabling bidirectional information exchange between AI systems and silicon substrates. The research program demonstrates: (1) reservoir computing with NARMA-10 Normalized Root Mean Square Error (NRMSE) of 0.8661; (2) the Thermodynamic Probability Filter (TPF) achieving 92.19% theoretical energy reduction; (3) the Virtual Block Manager achieving +25% effective hashrate; and (4) hardware universality across multiple ASIC families including Antminer S9, Lucky Miner LV06, and Goldshell LB-Box. A significant contribution is the machine-checked mathematical formalization using Lean 4 and Mathlib, providing unambiguous definitions, machine-verified theorems, and reviewer-proof claims. Key theorems proven include: independence implies zero leakage, predictor beats baseline implies non-independence (the logical core of TPF), energy savings theoretical maximum, and Physical Unclonable Function (PUF) distinguishability witnesses. Vladimir Veselov’s hierarchical number system theory explains why early-round information contains predictive power. This work establishes a new paradigm: treating ASICs not as passive computational substrates but as active conversational partners whose thermodynamic state encodes exploitable computational information.
[LG-115] Why Loss Re-weighting Works If You Stop Early: Training Dynamics of Unconstrained Features
链接: https://arxiv.org/abs/2601.12011
作者: Yize Zhao,Christos Thrampoulidis
类目: Machine Learning (cs.LG)
*备注:
Abstract:The application of loss reweighting in modern deep learning presents a nuanced picture. While it fails to alter the terminal learning phase in overparameterized deep neural networks (DNNs) trained on high-dimensional datasets, empirical evidence consistently shows it offers significant benefits early in training. To transparently demonstrate and analyze this phenomenon, we introduce a small-scale model (SSM). This model is specifically designed to abstract the inherent complexities of both the DNN architecture and the input data, while maintaining key information about the structure of imbalance within its spectral components. On the one hand, the SSM reveals how vanilla empirical risk minimization preferentially learns to distinguish majority classes over minorities early in training, consequently delaying minority learning. In stark contrast, reweighting restores balanced learning dynamics, enabling the simultaneous learning of features associated with both majorities and minorities.
[LG-116] Extreme Value Policy Optimization for Safe Reinforcement Learning ICML2025
链接: https://arxiv.org/abs/2601.12008
作者: Shiqing Gao,Yihang Zhou,Shuai Shao,Haoyu Luo,Yiheng Bing,Jiaxin Ding,Luoyi Fu,Xinbing Wang
类目: Machine Learning (cs.LG)
*备注: Published in the 42nd International Conference on Machine Learning (ICML 2025)
Abstract:Ensuring safety is a critical challenge in applying Reinforcement Learning (RL) to real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the learning signal from rare but high-impact extreme samples. Theoretically, we establish upper bounds on expected constraint violations during policy updates, guaranteeing strict constraint satisfaction at a zero-violation quantile level. Further, we demonstrate that EVO achieves a lower probability of constraint violations than expectation-based methods and exhibits lower variance than quantile regression methods. Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines.
[LG-117] MongoDB Injection Query Classification Model using MongoDB Log files as Training Data
链接: https://arxiv.org/abs/2601.11996
作者: Shaunak Perni,Minal Shirodkar,Ramdas Karmalli
类目: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
*备注: 24 Pages, 5 Tables, 6 Figures, Journal
Abstract:NoSQL Injection attacks are a class of cybersecurity attacks where an attacker sends a specifically engineered query to a NoSQL database which then performs an unauthorized operation. To defend against such attacks, rule based systems were initially developed but then were found to be ineffective to innovative injection attacks hence a model based approach was developed. Most model based detection systems, during testing gave exponentially positive results but were trained only on the query statement sent to the server. However due to the scarcity of data and class imbalances these model based systems were found to be not effective against all attacks in the real world. This paper explores classifying NoSQL injection attacks sent to a MongoDB server based on Log Data, and other extracted features excluding raw query statements. The log data was collected from a simulated attack on an empty MongoDB server which was then processed and explored. A discriminant analysis was carried out to determine statistically significant features to discriminate between injection and benign queries resulting in a dataset of significant features. Several Machine learning based classification models using an AutoML library, “FLAML”, as well as 6 manually programmed models were trained on this dataset , which were then trained on 50 randomized samples of data, cross validated and evaluated. The study found that the best model was the “FLAML” library’s “XGBoost limited depth” model with an accuracy of 71%.
[LG-118] Data-centric Prompt Tuning for Dynamic Graphs CIKM2025
链接: https://arxiv.org/abs/2601.11954
作者: Yufei Peng,Cheng Yang,Zhengjie Fan,Chuan Shi
类目: Machine Learning (cs.LG)
*备注: CIKM 2025
Abstract:Dynamic graphs have attracted increasing attention due to their ability to model complex and evolving relationships in real-world scenarios. Traditional approaches typically pre-train models using dynamic link prediction and directly apply the resulting node temporal embeddings to specific downstream tasks. However, the significant differences among downstream tasks often lead to performance degradation, especially under few-shot settings. Prompt tuning has emerged as an effective solution to this problem. Existing prompting methods are often strongly coupled with specific model architectures or pretraining tasks, which makes it difficult to adapt to recent or future model designs. Moreover, their exclusive focus on modifying node or temporal features while neglecting spatial structural information leads to limited expressiveness and degraded performance. To address these limitations, we propose DDGPrompt, a data-centric prompting framework designed to effectively refine pre-trained node embeddings at the input data level, enabling better adaptability to diverse downstream tasks. We first define a unified node expression feature matrix that aggregates all relevant temporal and structural information of each node, ensuring compatibility with a wide range of dynamic graph models. Then, we introduce three prompt matrices (temporal bias, edge weight, and feature mask) to adjust the feature matrix completely, achieving task-specific adaptation of node embeddings. We evaluate DDGPrompt under a strict few-shot setting on four public dynamic graph datasets. Experimental results demonstrate that our method significantly outperforms traditional methods and prompting approaches in scenarios with limited labels and cold-start conditions.
[LG-119] Controlling Underestimation Bias in Constrained Reinforcement Learning for Safe Exploration ICML2025
链接: https://arxiv.org/abs/2601.11953
作者: Shiqing Gao,Jiaxin Ding,Luoyi Fu,Xinbing Wang
类目: Machine Learning (cs.LG)
*备注: Published in the 42nd International Conference on Machine Learning (ICML 2025, Oral)
Abstract:Constrained Reinforcement Learning (CRL) aims to maximize cumulative rewards while satisfying constraints. However, existing CRL algorithms often encounter significant constraint violations during training, limiting their applicability in safety-critical scenarios. In this paper, we identify the underestimation of the cost value function as a key factor contributing to these violations. To address this issue, we propose the Memory-driven Intrinsic Cost Estimation (MICE) method, which introduces intrinsic costs to mitigate underestimation and control bias to promote safer exploration. Inspired by flashbulb memory, where humans vividly recall dangerous experiences to avoid risks, MICE constructs a memory module that stores previously explored unsafe states to identify high-cost regions. The intrinsic cost is formulated as the pseudo-count of the current state visiting these risk regions. Furthermore, we propose an extrinsic-intrinsic cost value function that incorporates intrinsic costs and adopts a bias correction strategy. Using this function, we formulate an optimization objective within the trust region, along with corresponding optimization methods. Theoretically, we provide convergence guarantees for the proposed cost value function and establish the worst-case constraint violation for the MICE update. Extensive experiments demonstrate that MICE significantly reduces constraint violations while preserving policy performance comparable to baselines.
[LG-120] rainability-Oriented Hybrid Quantum Regression via Geometric Preconditioning and Curriculum Optimization
链接: https://arxiv.org/abs/2601.11942
作者: Qingyu Meng,Yangshuai Wang
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Quantum neural networks (QNNs) have attracted growing interest for scientific machine learning, yet in regression settings they often suffer from limited trainability under noisy gradients and ill-conditioned optimization. We propose a hybrid quantum-classical regression framework designed to mitigate these bottlenecks. Our model prepends a lightweight classical embedding that acts as a learnable geometric preconditioner, reshaping the input representation to better condition a downstream variational quantum circuit. Building on this architecture, we introduce a curriculum optimization protocol that progressively increases circuit depth and transitions from SPSA-based stochastic exploration to Adam-based gradient fine-tuning. We evaluate the approach on PDE-informed regression benchmarks and standard regression datasets under a fixed training budget in a simulator setting. Empirically, the proposed framework consistently improves over pure QNN baselines and yields more stable convergence in data-limited regimes. We further observe reduced structured errors that are visually correlated with oscillatory components on several scientific benchmarks, suggesting that geometric preconditioning combined with curriculum training is a practical approach for stabilizing quantum regression.
[LG-121] Harmonica: A Self-Adaptation Exemplar for Sustainable MLOps
链接: https://arxiv.org/abs/2601.11926
作者: Ananya Halgatti,Shaunak Biswas,Hiya Bhatt,Srinivasan Rakhunathan,Karthik Vaidhyanathan
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: This paper has been accepted to SEAMS 2026 Artifact Track
Abstract:Machine learning enabled systems (MLS) often operate in settings where they regularly encounter uncertainties arising from changes in their surrounding environment. Without structured oversight, such changes can degrade model behavior, increase operational cost, and reduce the usefulness of deployed systems. Although Machine Learning Operations (MLOps) streamlines the lifecycle of ML models, it provides limited support for addressing runtime uncertainties that influence the longer term sustainability of MLS. To support continued viability, these systems need a mechanism that detects when execution drifts outside acceptable bounds and adjusts system behavior in response. Despite the growing interest in sustainable and self-adaptive MLS, there has been limited work towards exemplars that allow researchers to study these challenges in MLOps pipelines. This paper presents Harmonica, a self-adaptation exemplar built on the HarmonE approach, designed to enable the sustainable operation of such pipelines. Harmonica introduces structured adaptive control through MAPE-K loop, separating high-level adaptation policy from low-level tactic execution. It continuously monitors sustainability metrics, evaluates them against dynamic adaptation boundaries, and automatically triggers architectural tactics when thresholds are violated. We demonstrate the tool through case studies in time series regression and computer vision, examining its ability to improve system stability and reduce manual intervention. The results show that Harmonica offers a practical and reusable foundation for enabling adaptive behavior in MLS that rely on MLOps pipelines for sustained operation.
[LG-122] Communication-Corruption Coupling and Verification in Cooperative Multi-Objective Bandits
链接: https://arxiv.org/abs/2601.11924
作者: Ming Shi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study cooperative stochastic multi-armed bandits with vector-valued rewards under adversarial corruption and limited verification. In each of T rounds, each of N agents selects an arm, the environment generates a clean reward vector, and an adversary perturbs the observed feedback subject to a global corruption budget \Gamma . Performance is measured by team regret under a coordinate-wise nondecreasing, L -Lipschitz scalarization \phi , covering linear, Chebyshev, and smooth monotone utilities. Our main contribution is a communication-corruption coupling: we show that a fixed environment-side budget \Gamma can translate into an effective corruption level ranging from \Gamma to N\Gamma , depending on whether agents share raw samples, sufficient statistics, or only arm recommendations. We formalize this via a protocol-induced multiplicity functional and prove regret bounds parameterized by the resulting effective corruption. As corollaries, raw-sample sharing can suffer an N -fold larger additive corruption penalty, whereas summary sharing and recommendation-only sharing preserve an unamplified O(\Gamma) term and achieve centralized-rate team regret. We further establish information-theoretic limits, including an unavoidable additive \Omega(\Gamma) penalty and a high-corruption regime \Gamma=\Theta(NT) where sublinear regret is impossible without clean information. Finally, we characterize how a global budget \nu of verified observations restores learnability. That is, verification is necessary in the high-corruption regime, and sufficient once it crosses the identification threshold, with certified sharing enabling the team’s regret to become independent of \Gamma .
[LG-123] ask-tailored Pre-processing: Fair Downstream Supervised Learning
链接: https://arxiv.org/abs/2601.11897
作者: Jinwon Sohn,Guang Lin,Qifan Song
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Fairness-aware machine learning has recently attracted various communities to mitigate discrimination against certain societal groups in data-driven tasks. For fair supervised learning, particularly in pre-processing, there have been two main categories: data fairness and task-tailored fairness. The former directly finds an intermediate distribution among the groups, independent of the type of the downstream model, so a learned downstream classification/regression model returns similar predictive scores to individuals inputting the same covariates irrespective of their sensitive attributes. The latter explicitly takes the supervised learning task into account when constructing the pre-processing map. In this work, we study algorithmic fairness for supervised learning and argue that the data fairness approaches impose overly strong regularization from the perspective of the HGR correlation. This motivates us to devise a novel pre-processing approach tailored to supervised learning. We account for the trade-off between fairness and utility in obtaining the pre-processing map. Then we study the behavior of arbitrary downstream supervised models learned on the transformed data to find sufficient conditions to guarantee their fairness improvement and utility preservation. To our knowledge, no prior work in the branch of task-tailored methods has theoretically investigated downstream guarantees when using pre-processed data. We further evaluate our framework through comparison studies based on tabular and image data sets, showing the superiority of our framework which preserves consistent trade-offs among multiple downstream models compared to recent competing models. Particularly for computer vision data, we see our method alters only necessary semantic features related to the central machine learning task to achieve fairness.
[LG-124] From Relative Entropy to Minimax: A Unified Framework for Coverag e in MDPs
链接: https://arxiv.org/abs/2601.11890
作者: Xihe Gu,Urbashi Mitra,Tara Javidi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Targeted and deliberate exploration of state–action pairs is essential in reward-free Markov Decision Problems (MDPs). More precisely, different state-action pairs exhibit different degree of importance or difficulty which must be actively and explicitly built into a controlled exploration strategy. To this end, we propose a weighted and parameterized family of concave coverage objectives, denoted by U_\rho , defined directly over state–action occupancy measures. This family unifies several widely studied objectives within a single framework, including divergence-based marginal matching, weighted average coverage, and worst-case (minimax) coverage. While the concavity of U_\rho captures the diminishing return associated with over-exploration, the simple closed form of the gradient of U_\rho enables an explicit control to prioritize under-explored state–action pairs. Leveraging this structure, we develop a gradient-based algorithm that actively steers the induced occupancy toward a desired coverage pattern. Moreover, we show that as \rho increases, the resulting exploration strategy increasingly emphasizes the least-explored state–action pairs, recovering worst-case coverage behavior in the limit.
[LG-125] Approximation Algorithm for Constrained k-Center Clustering: A Local Search Approach AAAI-26
链接: https://arxiv.org/abs/2601.11883
作者: Chaoqi Jia,Longkun Guo,Kewen Liao,Zhigang Lu,Chao Chen,Jason Xue
类目: Machine Learning (cs.LG)
*备注: AAAI-26
Abstract:Clustering is a long-standing research problem and a fundamental tool in AI and data analysis. The traditional k-center problem, a fundamental theoretical challenge in clustering, has a best possible approximation ratio of 2, and any improvement to a ratio of 2 - \epsilon would imply P = NP. In this work, we study the constrained k-center clustering problem, where instance-level cannot-link (CL) and must-link (ML) constraints are incorporated as background knowledge. Although general CL constraints significantly increase the hardness of approximation, previous work has shown that disjoint CL sets permit constant-factor approximations. However, whether local search can achieve such a guarantee in this setting remains an open question. To this end, we propose a novel local search framework based on a transformation to a dominating matching set problem, achieving the best possible approximation ratio of 2. The experimental results on both real-world and synthetic datasets demonstrate that our algorithm outperforms baselines in solution quality.
[LG-126] RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation
链接: https://arxiv.org/abs/2601.11822
作者: Amna Masood,Pratishtha Gaur,Nuwan Jayasena
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Two widely adopted techniques for LLM inference serving systems today are hybrid batching and disaggregated serving. A hybrid batch combines prefill and decode tokens of different requests in the same batch to improve resource utilization and throughput at the cost of increased latency per token. In contrast, disaggregated serving decouples compute-bound prefill and bandwidth-bound decode phases to optimize for service level objectives (SLOs) at the cost of resource under-utilization and KV-cache transfer overheads. To address the limitations of these techniques, we propose RAPID-Serve: a technique to concurrently execute prefill and decode on the same GPU(s) to meet latency SLOs while maintaining high throughput and efficient resource utilization. Furthermore, we propose Adaptive Resource Management for runtime compute resource allocation, optionally leveraging CU masking (a fine-grained Compute Unit partitioning feature on AMD Instinct\textsuperscriptTM GPUs). RAPID-Serve provides up to 4.1x (average 1.7x) unconstrained throughput improvement and 32x and higher (average 4.9x) throughput improvement under SLO constraints, showing it as an effective strategy compared to the state-of-the-art approaches, particularly in resource-constrained environments.
[LG-127] Shapelets-Enriched Selective Forecasting using Time Series Foundation Models AAAI-26
链接: https://arxiv.org/abs/2601.11821
作者: Shivani Tomar,Seshu Tirupathi,Elizabeth Daly,Ivana Dusparic
类目: Machine Learning (cs.LG)
*备注: Accepted by the AAAI-26 Workshop on Artificial Intelligence for Time Series Analysis (AI4TS)
Abstract:Time series foundation models have recently gained a lot of attention due to their ability to model complex time series data encompassing different domains including traffic, energy, and weather. Although they exhibit strong average zero-shot performance on forecasting tasks, their predictions on certain critical regions of the data are not always reliable, limiting their usability in real-world applications, especially when data exhibits unique trends. In this paper, we propose a selective forecasting framework to identify these critical segments of time series using shapelets. We learn shapelets using shift-invariant dictionary learning on the validation split of the target domain dataset. Utilizing distance-based similarity to these shapelets, we facilitate the user to selectively discard unreliable predictions and be informed of the model’s realistic capabilities. Empirical results on diverse benchmark time series datasets demonstrate that our approach leveraging both zero-shot and full-shot fine-tuned models reduces the overall error by an average of 22.17% for zero-shot and 22.62% for full-shot fine-tuned model. Furthermore, our approach using zero-shot and full-shot fine-tuned models, also outperforms its random selection counterparts by up to 21.41% and 21.43% on one of the datasets.
[LG-128] Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis
链接: https://arxiv.org/abs/2601.11789
作者: Shenyang Deng,Boyao Liao,Zhuoli Ouyang,Tianyu Pang,Minhak Song,Yaoqing Yang
类目: Machine Learning (cs.LG)
*备注: The 37th International Conference on Algorithmic Learning Theory
Abstract:This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization, where the Hessian spectrum splits into dominant and bulk subspaces. This phenomenon describes the behavior of gradient alignment in SGD updates. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. Subsequently, it enters a rising phase and eventually stabilizes in a high-alignment phase. The alignment is considered ``suspicious’’ because, paradoxically, the projected gradient update along this highly-aligned dominant subspace proves ineffective at reducing the loss. The focus of this work is to give a fine-grained analysis in a high-dimensional quadratic setup about how step size selection produces this phenomenon. Our main contribution can be summarized as follows: We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size \eta_t^* separates alignment-decreasing ( \eta_t \eta_t^* ) from alignment-increasing ( \eta_t \eta_t^* ) regimes, whereas in high-alignment regimes, the alignment is self-correcting and decreases regardless of the step size. We further show that under sufficient ill-conditioning, a step size interval exists where projecting the SGD updates to the bulk space decreases the loss while projecting them to the dominant space increases the loss, which explains a recent empirical observation that projecting gradient updates to the dominant subspace is ineffective. Finally, based on this adaptive step-size theory, we prove that for a constant step size and large initialization, SGD exhibits this distinct two-phase behavior: an initial alignment-decreasing phase, followed by stabilization at high alignment.
[LG-129] A Proof of Concept for a Digital Twin of an Ultrasonic Fermentation System
链接: https://arxiv.org/abs/2601.11723
作者: Francesco Saverio Sconocchia Pisoni,Andrea Vitaletti,Davide Appolloni,Federico Ortenzi,Blasco Morozzo della Rocca,Mariano José Guillén,Alessandro Contaldo
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 23 pages, submitted to the 22nd International Conference on Intelligent Environments (IE 2026)
Abstract:This paper presents the design and implementation of a proof of concept digital twin for an innovative ultrasonic-enhanced beer-fermentation system, developed to enable intelligent monitoring, prediction, and actuation in yeast-growth environments. A traditional fermentation tank is equipped with a piezoelectric transducer able to irradiate the tank with ultrasonic waves, providing an external abiotic stimulus to enhance the growth of yeast and accelerate the fermentation process. At its core, the digital twin incorporates a predictive model that estimates yeast’s culture density over time based on the surrounding environmental conditions. To this end, we implement, tailor and extend the model proposed in Palacios et al., allowing us to effectively handle the limited number of available training samples by using temperature, ultrasonic frequency, and duty cycle as inputs. The results obtained along with the assessment of model performance demonstrate the feasibility of the proposed approach.
[LG-130] jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation
链接: https://arxiv.org/abs/2601.11719
作者: Ho Fung Tsoi,Dylan Rankin
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: Under review
Abstract:Self-supervised learning is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned for classification with improved performance compared to supervised models trained from scratch.
[LG-131] Machine learning model for predicting surface wettability in laser-textured metal alloys
链接: https://arxiv.org/abs/2601.11661
作者: Mohammad Mohammadzadeh Sanandaji,Danial Ebrahimzadeh,Mohammad Ikram Haider,Yaser Mike Banad,Aleksandar Poleksic,Hongtao Ding
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: This manuscript has 9 figures and contains 16 pages two column. submitted to journal of laser applications. Under review
Abstract:Surface wettability, governed by both topography and chemistry, plays a critical role in applications such as heat transfer, lubrication, microfluidics, and surface coatings. In this study, we present a machine learning (ML) framework capable of accurately predicting the wettability of laser-textured metal alloys using experimentally derived morphological and chemical features. Superhydrophilic and superhydrophobic surfaces were fabricated on AA6061 and AISI 4130 alloys via nanosecond laser texturing followed by chemical immersion treatments. Surface morphology was quantified using the Laws texture energy method and profilometry, while surface chemistry was characterized through X-ray photoelectron spectroscopy (XPS), extracting features such as functional group polarity, molecular volume, and peak area fraction. These features were used to train an ensemble neural network model incorporating residual connections, batch normalization, and dropout regularization. The model achieved high predictive accuracy (R2 = 0.942, RMSE = 13.896), outperforming previous approaches. Feature importance analysis revealed that surface chemistry had the strongest influence on contact angle prediction, with topographical features also contributing significantly. This work demonstrates the potential of artificial intelligence to model and predict wetting behavior by capturing the complex interplay of surface characteristics, offering a data-driven pathway for designing tailored functional surfaces.
[LG-132] he Llama 4 Herd: Architecture Training Evaluation and Deployment Notes
链接: https://arxiv.org/abs/2601.11659
作者: Aaron Adcock,Aayushi Srivastava,Abhimanyu Dubey,Abhinav Jauhri,Abhinav Pande,Abhinav Pandey,Abhinav Sharma,Abhishek Kadian,Abhishek Kumawat,Adam Kelsey,Adam Stelle,Adeel Cheema,Adela Kabiljo,Adina Katz,Adithya Gangidi,Aditya Tayade,Adolfo Victoria,Adrian Samatan Alastuey,Adrien Conrath,Afroz Mohiuddin,Ahmed Sharif,Ahnaf Siddiqui,Ahuva Goldstand,Aijung Li,Aidan Boyd,Aidin Kazemi Daliri,Aisha Iqbal,Ajay Menon,Ajit Mathews,Akhil Mathur,Akshat Agarwal,Alan Schelten,Alana Shine,Alejandro Castillejo Muñoz,Aleksei Guliaev,Alex Radovic,Alex Song,Alex Vaughan,Alexander Simeonov,Alexandre Rezende,Alexandre Rezende,Alexei Baevski,Alexey Roubaud,Allen Ma,Alvin Lee,Alyssa Pereira,Aman Ahmed,Aman Shankar,Amanda Kallet,Amar Budhiraja,Ameya Khandekar,Amine Benhalloum,Amir Gershman,Amit Nagpal,Amit Zohar,Amr Sharaf,Anant Desai,Anastasia Razdaibiedina,Anca Agape,Andranik Kurghinyan,Andre Perunicic,Andrea Madotto,Andrei Darabanov,Andrés Alvarado,Andrew Brown,Andrew Cohen,Andrew Fang,Andrew Freeman,Andrew Gallagher,Andrew Gu,Andrew Prasetyo Jo,Andrew Ryan,Andrew Steffen,Andrew Wei,Andrey Rusakov,Andrii Golovei,Andy Shang,Angela Fan,Angela Fan,Angela Flewellen,Animesh Pathak,Anirudh Goyal,Ankit Ramchandani,Ankur Pai,Ankur Singh,Ankush Garg,Anlu Xing,Anna Cai,Anna Grosul,Anna Prochowska,Anna Sun,Annie Dong,Annie Franco,Anqi Hu,Anshul Chawla,Anthony Hartshorn,Antonia Sheng,Antony Thomas,Anuj Goyal,Anusha De
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 15 pages
Abstract:This document consolidates publicly reported technical details about Metas Llama 4 model family. It summarizes (i) released variants (Scout and Maverick) and the broader herd context including the previewed Behemoth teacher model, (ii) architectural characteristics beyond a high-level MoE description covering routed/shared-expert structure, early-fusion multimodality, and long-context design elements reported for Scout (iRoPE and length generalization strategies), (iii) training disclosures spanning pre-training, mid-training for long-context extension, and post-training methodology (lightweight SFT, online RL, and lightweight DPO) as described in release materials, (iv) developer-reported benchmark results for both base and instruction-tuned checkpoints, and (v) practical deployment constraints observed across major serving environments, including provider-specific context limits and quantization packaging. The manuscript also summarizes licensing obligations relevant to redistribution and derivative naming, and reviews publicly described safeguards and evaluation practices. The goal is to provide a compact technical reference for researchers and practitioners who need precise, source-backed facts about Llama 4.
[LG-133] Global Optimization By Gradient from Hierarchical Score-Matching Spaces
链接: https://arxiv.org/abs/2601.11639
作者: Ming Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Gradient descent is the most commonly used optimization method, but limited to local optimality, and confined to the field of continuous differentiable problems with simple convex constraints. This work solve these limitations and restrictions by unifying all optimization problems with various complex constraints as a general hierarchical optimization objective without constraints, which is optimized by gradient obtained through score matching. By this way, global optimization by deterministic method using strict gradient is achieved for the first time, and verified through simple-constructed and complex-practical experiments. Even more importantly, it reveals the profound connection between global optimization and diffusion based generative modeling.
[LG-134] Verifying Physics-Informed Neural Network Fidelity using Classical Fisher Information from Differentiable Dynamical System
链接: https://arxiv.org/abs/2601.11638
作者: Josafat Ribeiro Leal Filho,Antônio Augusto Fröhlich
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This paper has been submitted and is currently under review at IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving differential equations and modeling physical systems by embedding physical laws into the learning process. However, rigorously quantifying how well a PINN captures the complete dynamical behavior of the system, beyond simple trajectory prediction, remains a challenge. This paper proposes a novel experimental framework to address this by employing Fisher information for differentiable dynamical systems, denoted g_F^C . This Fisher information, distinct from its statistical counterpart, measures inherent uncertainties in deterministic systems, such as sensitivity to initial conditions, and is related to the phase space curvature and the net stretching action of the state space evolution. We hypothesize that if a PINN accurately learns the underlying dynamics of a physical system, then the Fisher information landscape derived from the PINN’s learned equations of motion will closely match that of the original analytical model. This match would signify that the PINN has achieved comprehensive fidelity capturing not only the state evolution but also crucial geometric and stability properties. We outline an experimental methodology using the dynamical model of a car to compute and compare g_F^C for both the analytical model and a trained PINN. The comparison, based on the Jacobians of the respective system dynamics, provides a quantitative measure of the PINN’s fidelity in representing the system’s intricate dynamical characteristics.
[LG-135] Semantic Differentiation for Tackling Challenges in Watermarking Low-Entropy Constrained Generation Outputs
链接: https://arxiv.org/abs/2601.11629
作者: Nghia T. Le,Alan Ritter,Kartik Goyal
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures
Abstract:We demonstrate that while the current approaches for language model watermarking are effective for open-ended generation, they are inadequate at watermarking LM outputs for constrained generation tasks with low-entropy output spaces. Therefore, we devise SeqMark, a sequence-level watermarking algorithm with semantic differentiation that balances the output quality, watermark detectability, and imperceptibility. It improves on the shortcomings of the prevalent token-level watermarking algorithms that cause under-utilization of the sequence-level entropy available for constrained generation tasks. Moreover, we identify and improve upon a different failure mode we term region collapse, associated with prior sequence-level watermarking algorithms. This occurs because the pseudorandom partitioning of semantic space for watermarking in these approaches causes all high-probability outputs to collapse into either invalid or valid regions, leading to a trade-off in output quality and watermarking effectiveness. SeqMark instead, differentiates the high-probable output subspace and partitions it into valid and invalid regions, ensuring the even spread of high-quality outputs among all the regions. On various constrained generation tasks like machine translation, code generation, and abstractive summarization, SeqMark substantially improves watermark detection accuracy (up to 28% increase in F1) while maintaining high generation quality.
[LG-136] Concatenated Matrix SVD: Compression Bounds Incremental Approximation and Error-Constrained Clustering
链接: https://arxiv.org/abs/2601.11626
作者: Maksym Shamrai
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Large collections of matrices arise throughout modern machine learning, signal processing, and scientific computing, where they are commonly compressed by concatenation followed by truncated singular value decomposition (SVD). This strategy enables parameter sharing and efficient reconstruction and has been widely adopted across domains ranging from multi-view learning and signal processing to neural network compression. However, it leaves a fundamental question unanswered: which matrices can be safely concatenated and compressed together under explicit reconstruction error constraints? Existing approaches rely on heuristic or architecture-specific grouping and provide no principled guarantees on the resulting SVD approximation error. In the present work, we introduce a theory-driven framework for compression-aware clustering of matrices under SVD compression constraints. Our analysis establishes new spectral bounds for horizontally concatenated matrices, deriving global upper bounds on the optimal rank- r SVD reconstruction error from lower bounds on singular value growth. The first bound follows from Weyl-type monotonicity under blockwise extensions, while the second leverages singular values of incremental residuals to yield tighter, per-block guarantees. We further develop an efficient approximate estimator based on incremental truncated SVD that tracks dominant singular values without forming the full concatenated matrix. Therefore, we propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control. Code is available online.
[LG-137] A Review on Machine Learning Approaches for the Prediction of Glucose Levels and Hypogylcemia
链接: https://arxiv.org/abs/2601.11615
作者: Beyza Cinar,Louisa van den Boom,Maria Maleshkova
类目: Machine Learning (cs.LG)
*备注:
Abstract:Type 1 Diabetes (T1D) is an autoimmune disease leading to insulin insufficiency. Thus, patients require lifelong insulin therapy, which has a side effect of hypoglycemia. Hypoglycemia is a critical state of decreased blood glucose levels (BGL) below 70 mg/dL and is associated with increased risk of mortality. Machine learning (ML) models can improve diabetes management by predicting hypoglycemia and providing optimal prevention methods. ML models are classified into regression and classification based, that forecast glucose levels and identify events based on defined labels, respectively. This review investigates state-of-the-art models trained on data of continuous glucose monitoring (CGM) devices from patients with T1D. We compare the models’ performance across short-term (15 to 120 min) and long term (3 to more than 24 hours) prediction horizons (PHs). Particularly, we explore: 1) How much in advance can glucose values or a hypoglycemic event be accurately predicted? 2) Which models have the best performance? 3) Which factors impact the performance? and 4) Does personalization increase performance? The results show that 1) a PH of up to 1 hour provides the best results. 2) Conventional ML methods yield the best results for classification and DL for regression. A single model cannot adequately classify across multiple PHs. 3) The model performance is influenced by multivariate datasets and the input sequence length (ISL). 4) Personal data enhances performance but due to limited data quality population-based models are preferred.
[LG-138] Integrating Temporal Context into Streaming Data for Human Activity Recognition in Smart Home
链接: https://arxiv.org/abs/2601.11611
作者: Marina Vicini,Martin Rudorfer,Zhuangzhuang Dai,Luis J. Manso
类目: Machine Learning (cs.LG)
*备注: Accepted to International Conference on Ubiquitous Computing and Ambient Intelligence (UCAmI) 2024
Abstract:With the global population ageing, it is crucial to enable individuals to live independently and safely in their homes. Using ubiquitous sensors such as Passive InfraRed sensors (PIR) and door sensors is drawing increasing interest for monitoring daily activities and facilitating preventative healthcare interventions for the elderly. Human Activity Recognition (HAR) from passive sensors mostly relies on traditional machine learning and includes data segmentation, feature extraction, and classification. While techniques like Sensor Weighting Mutual Information (SWMI) capture spatial context in a feature vector, effectively leveraging temporal information remains a challenge. We tackle this by clustering activities into morning, afternoon, and night, and encoding them into the feature weighting method calculating distinct mutual information matrices. We further propose to extend the feature vector by incorporating time of day and day of week as cyclical temporal features, as well as adding a feature to track the user’s location. The experiments show improved accuracy and F1-score over existing state-of-the-art methods in three out of four real-world datasets, with highest gains in a low-data regime. These results highlight the potential of our approach for developing effective smart home solutions to support ageing in place.
[LG-139] Auxiliary-predicted Compress Memory Model(ApCM Model): A Neural Memory Storag e Model Based on Invertible Compression and Learnable Prediction
链接: https://arxiv.org/abs/2601.11609
作者: Weinuo Ou
类目: Machine Learning (cs.LG)
*备注: 9 pages, 7 figures
Abstract:Current large language models (LLMs) generally lack an effective runtime memory mechanism,making it difficult to adapt to dynamic and personalized interaction requirements. To address this issue, this paper proposes a novel neural memory storage architecture–the Auxiliary Prediction Compression Memory Model (ApCM Model).
[LG-140] A Multimodal Data Processing Pipeline for MIMIC-IV Dataset
链接: https://arxiv.org/abs/2601.11606
作者: Farzana Islam Adiba,Varsha Danduri,Fahmida Liza Piya,Ali Abbasi,Mehak Gupta,Rahmatollah Beheshti
类目: Machine Learning (cs.LG)
*备注:
Abstract:The MIMIC-IV dataset is a large, publicly available electronic health record (EHR) resource widely used for clinical machine learning research. It comprises multiple modalities, including structured data, clinical notes, waveforms, and imaging data. Working with these disjointed modalities requires an extensive manual effort to preprocess and align them for downstream analysis. While several pipelines for MIMIC-IV data extraction are available, they target a small subset of modalities or do not fully support arbitrary downstream applications. In this work, we greatly expand our prior popular unimodal pipeline and present a comprehensive and customizable multimodal pipeline that can significantly reduce multimodal processing time and enhance the reproducibility of MIMIC-based studies. Our pipeline systematically integrates the listed modalities, enabling automated cohort selection, temporal alignment across modalities, and standardized multimodal output formats suitable for arbitrary static and time-series downstream applications. We release the code, a simple UI, and a Python package for selective integration (with embedding) at this https URL.
[LG-141] Enhancing Model Context Protocol (MCP) with Context-Aware Server Collaboration
链接: https://arxiv.org/abs/2601.11595
作者: Meenakshi Amulya Jayanti,X.Y. Han
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:The Model Context Protocol (MCP) has emerged as a widely used framework for enabling LLM-based agents to communicate with external tools and services. The most common implementation of MCP, proposed by Anthropic, heavily relies on a Large Language Model (LLM) to decompose tasks and issue instructions to servers, which act as stateless executors. In particular, the agents, models, and servers are stateless and do not have access to a global context. However, in tasks involving LLM-driven coordination, it is natural that a Shared Context Store (SCS) could improve the efficiency and coherence of multi-agent workflows by reducing redundancy and enabling knowledge transfer between servers. Thus, in this work, we design and assess the performance of a Context-Aware MCP (CA-MCP) that offloads execution logic to specialized MCP servers that read from and write to a shared context memory, allowing them to coordinate more autonomously in real time. In this design, context management serves as the central mechanism that maintains continuity across task executions by tracking intermediate states and shared variables, thereby enabling persistent collaboration among agents without repeated prompting. We present experiments showing that the CA-MCP can outperform the traditional MCP by reducing the number of LLM calls required for complex tasks and decreasing the frequency of response failures when task conditions are not satisfied, thereby improving overall efficiency and responsiveness. In particular, we conducted experiments on the TravelPlanner and REALM-Bench benchmark datasets and observed statistically significant results indicating the potential advantages of incorporating a shared context store via CA-MCP in LLM-driven multi-agent systems.
[LG-142] Uniqueness ratio as a predictor of a privacy leakage
链接: https://arxiv.org/abs/2601.11550
作者: Danah A. AlSalem AlKhashti
类目: Databases (cs.DB); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Identity leakage can emerge when independent databases are joined, even when each dataset is anonymized individually. While previous work focuses on post-join detection or complex privacy models, little attention has been given to simple, interpretable pre-join indicators that can warn data engineers and database administrators before integration occurs. This study investigates the uniqueness ratio of candidate join attributes as an early predictor of re-identification risk. Using synthetic multi-table datasets, we compute the uniqueness ratio of attribute combinations within each database and examine how these ratios correlate with identity exposure after the join. Experimental results show a strong relationship between high pre-join uniqueness and increased post-join leakage, measured by the proportion of records that become uniquely identifiable or fall into very small groups. Our findings demonstrate that uniqueness ratio offers an explainable and practical signal for assessing join induced privacy risk, providing a foundation for developing more comprehensive pre-join risk estimation models.
[LG-143] Deep Learning Approaches to Quantum Error Mitigation
链接: https://arxiv.org/abs/2601.14226
作者: Leonardo Placidi,Ifan Williams,Enrico Rinaldi,Daniel Mills,Cristina Cîrstoiu,Vanya Eccles,Ross Duncan
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 48 pages
Abstract:We present a systematic investigation of deep learning methods applied to quantum error mitigation of noisy output probability distributions from measured quantum circuits. We compare different architectures, from fully connected neural networks to transformers, and we test different design/training modalities, identifying sequence-to-sequence, attention-based models as the most effective on our datasets. These models consistently produce mitigated distributions that are closer to the ideal outputs when tested on both simulated and real device data obtained from IBM superconducting quantum processing units (QPU) up to five qubits. Across several different circuit depths, our approach outperforms other baseline error mitigation techniques. We perform a series of ablation studies to examine: how different input features (circuit, device properties, noisy output statistics) affect performance; cross-dataset generalization across circuit families; and transfer learning to a different IBM QPU. We observe that generalization performance across similar devices with the same architecture works effectively, without needing to fully retrain models.
[LG-144] Intermittent time series forecasting: local vs global models
链接: https://arxiv.org/abs/2601.14031
作者: Stefano Damato,Nicolò Rubattu,Dario Azzimonti,Giorgio Corani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted to Data Mining and Knowledge Discovery
Abstract:Intermittent time series, characterised by the presence of a significant amount of zeros, constitute a large percentage of inventory items in supply chain. Probabilistic forecasts are needed to plan the inventory levels; the predictive distribution should cover non-negative values, have a mass in zero and a long upper tail. Intermittent time series are commonly forecast using local models, which are trained individually on each time series. In the last years global models, which are trained on a large collection of time series, have become popular for time series forecasting. Global models are often based on neural networks. However, they have not yet been exhaustively tested on intermittent time series. We carry out the first study comparing state-of-the-art local (iETS, TweedieGP) and global models (D-Linear, DeepAR, Transformers) on intermittent time series. For neural networks models we consider three different distribution heads suitable for intermittent time series: negative binomial, hurdle-shifted negative binomial and Tweedie. We use, for the first time, the last two distribution heads with neural networks. We perform experiments on five large datasets comprising more than 40’000 real-world time series. Among neural networks D-Linear provides best accuracy; it also consistently outperforms the local models. Moreover, it has also low computational requirements. Transformers-based architectures are instead much more computationally demanding and less accurate. Among the distribution heads, the Tweedie provides the best estimates of the highest quantiles, while the negative binomial offers overall the best performance.
[LG-145] SCG With Your Phone: Diagnosis of Rhythmic Spectrum Disorders in Field Conditions
链接: https://arxiv.org/abs/2601.13926
作者: Peter Golenderov,Yaroslav Matushenko,Anastasia Tushina,Michal Barodkin
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Aortic valve opening (AO) events are crucial for detecting frequency and rhythm disorders, especially in real-world settings where seismocardiography (SCG) signals collected via consumer smartphones are subject to noise, motion artifacts, and variability caused by device heterogeneity. In this work, we present a robust deep-learning framework for SCG segmentation and rhythm analysis using accelerometer recordings obtained with consumer smartphones. We develop an enhanced U-Net v3 architecture that integrates multi-scale convolutions, residual connections, and attention gates, enabling reliable segmentation of noisy SCG signals. A dedicated post-processing pipeline converts probability masks into precise AO timestamps, whereas a novel adaptive 3D-to-1D projection method ensures robustness to arbitrary smartphone orientation. Experimental results demonstrate that the proposed method achieves consistently high accuracy and robustness across various device types and unsupervised data-collection conditions. Our approach enables practical, low-cost, and automated cardiac-rhythm monitoring using everyday mobile devices, paving the way for scalable, field-deployable cardiovascular assessment and future multimodal diagnostic systems.
[LG-146] Unified Unbiased Variance Estimation for MMD: Robust Finite-Sample Performance with Imbalanced Data and Exact Acceleration under Null and Alternative Hypotheses
链接: https://arxiv.org/abs/2601.13874
作者: Shijie Zhong,Jiangfeng Fu,Yikun Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The maximum mean discrepancy (MMD) is a kernel-based nonparametric statistic for two-sample testing, whose inferential accuracy depends critically on variance characterization. Existing work provides various finite-sample estimators of the MMD variance, often differing under the null and alternative hypotheses and across balanced or imbalanced sampling schemes. In this paper, we study the variance of the MMD statistic through its U-statistic representation and Hoeffding decomposition, and establish a unified finite-sample characterization covering different hypotheses and sample configurations. Building on this analysis, we propose an exact acceleration method for the univariate case under the Laplacian kernel, which reduces the overall computational complexity from \mathcal O(n^2) to \mathcal O(n \log n) .
[LG-147] Co-Initialization of Control Filter and Secondary Path via Meta-Learning for Active Noise Control
链接: https://arxiv.org/abs/2601.13849
作者: Ziyi Yang,Li Rao,Zhengding Luo,Dongyuan Shi,Qirui Huang,Woon-Seng Gan
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Active noise control (ANC) must adapt quickly when the acoustic environment changes, yet early performance is largely dictated by initialization. We address this with a Model-Agnostic Meta-Learning (MAML) co-initialization that jointly sets the control filter and the secondary-path model for FxLMS-based ANC while keeping the runtime algorithm unchanged. The initializer is pre-trained on a small set of measured paths using short two-phase inner loops that mimic identification followed by residual-noise reduction, and is applied by simply setting the learned initial coefficients. In an online secondary path modeling FxLMS testbed, it yields lower early-stage error, shorter time-to-target, reduced auxiliary-noise energy, and faster recovery after path changes than a baseline without re-initialization. The method provides a simple fast start for feedforward ANC under environment changes, requiring a small set of paths to pre-train.
[LG-148] Generative Adversarial Networks for Resource State Generation
链接: https://arxiv.org/abs/2601.13708
作者: Shahbaz Shaik,Sourav Chatterjee,Sayantan Pramanik,Indranil Chakrabarty
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:We introduce a physics-informed Generative Adversarial Network framework that recasts quantum resource-state generation as an inverse-design task. By embedding task-specific utility functions into training, the model learns to generate valid two-qubit states optimized for teleportation and entanglement broadcasting. Comparing decomposition-based and direct-generation architectures reveals that structural enforcement of Hermiticity, trace-one, and positivity yields higher fidelity and training stability than loss-only approaches. The framework reproduces theoretical resource boundaries for Werner-like and Bell-diagonal states with fidelities exceeding ~98%, establishing adversarial learning as a lightweight yet effective method for constraint-driven quantum-state discovery. This approach provides a scalable foundation for automated design of tailored quantum resources for information-processing applications, exemplified with teleportation and broadcasting of entanglement, and it opens up the possibility of using such states in efficient quantum network design.
[LG-149] Sample Complexity of Averag e-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning
链接: https://arxiv.org/abs/2601.13642
作者: Yuchen Jiao,Jiin Woo,Gen Li,Gauri Joshi,Yuejie Chi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity \widetildeO\left(\frac|\mathcalS||\mathcalA||h^\star|\mathsfsp^3\varepsilon^3\right) , where |h^\star|\mathsfsp is the span norm of the bias function, improving previous results by at least a factor of \frac|h^\star|\mathsfsp^2\varepsilon^2 . In the federated setting with M agents, we prove that collaboration reduces the per-agent sample complexity to \widetildeO\left(\frac|\mathcalS||\mathcalA||h^\star|\mathsfsp^3M\varepsilon^3\right) , with only \widetildeO\left(\frac|h^\star|_\mathsfsp\varepsilon\right) communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.
[LG-150] Refined Gradient-Based Temperature Optimization for the Replica-Exchange Monte-Carlo Method
链接: https://arxiv.org/abs/2601.13542
作者: Tatsuya Miyata,Shunta Arai,Satoshi Takabe
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: 15 pages
Abstract:The replica-exchange Monte-Carlo (RXMC) method is a powerful Markov-chain Monte-Carlo algorithm for sampling from multi-modal distributions, which are challenging for conventional methods. The sampling efficiency of the RXMC method depends highly on the selection of the temperatures, and finding optimal temperatures remains a challenge. In this study, we propose a refined online temperature selection method by extending the gradient-based optimization framework proposed previously. Building upon the existing temperature update approach, we introduce a reparameterization technique to strictly enforce physical constraints, such as the monotonic ordering of inverse temperatures, which were not explicitly addressed in the original formulation. The proposed method defines the variance of acceptance rates between adjacent replicas as a loss function, estimates its gradient using differential information from the sampling process, and optimizes the temperatures via gradient descent. We demonstrate the effectiveness of our method through experiments on benchmark spin systems, including the two-dimensional ferromagnetic Ising model, the two-dimensional ferromagnetic XY model, and the three-dimensional Edwards-Anderson model. Our results show that the method successfully achieves uniform acceptance rates and reduces round-trip times across the temperature space. Furthermore, our proposed method offers a significant advantage over recently proposed policy gradient method that require careful hyperparameter tuning, while simultaneously preventing the constraint violations that destabilize optimization.
[LG-151] Small Gradient Norm Regret for Online Convex Optimization
链接: https://arxiv.org/abs/2601.13519
作者: Wenzhi Gao,Chang He,Madeleine Udell
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:This paper introduces a new problem-dependent regret measure for online convex optimization with smooth losses. The notion, which we call the G^\star regret, depends on the cumulative squared gradient norm evaluated at the decision in hindsight \sum_t=1^T |\nabla \ell(x^\star)|^2 . We show that the G^\star regret strictly refines the existing L^\star (small loss) regret, and that it can be arbitrarily sharper when the losses have vanishing curvature around the hindsight decision. We establish upper and lower bounds on the G^\star regret and extend our results to dynamic regret and bandit settings. As a byproduct, we refine the existing convergence analysis of stochastic optimization algorithms in the interpolation regime. Some experiments validate our theoretical findings.
[LG-152] Distribution-Free Confidence Ellipsoids for Ridge Regression with PAC Bounds
链接: https://arxiv.org/abs/2601.13436
作者: Szabolcs Szentpéteri,Balázs Csanád Csáji
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY); Statistics Theory (math.ST)
*备注:
Abstract:Linearly parametrized models are widely used in control and signal processing, with the least-squares (LS) estimate being the archetypical solution. When the input is insufficiently exciting, the LS problem may be unsolvable or numerically unstable. This issue can be resolved through regularization, typically with ridge regression. Although regularized estimators reduce the variance error, it remains important to quantify their estimation uncertainty. A possible approach for linear regression is to construct confidence ellipsoids with the Sign-Perturbed Sums (SPS) ellipsoidal outer approximation (EOA) algorithm. The SPS EOA builds non-asymptotic confidence ellipsoids under the assumption that the noises are independent and symmetric about zero. This paper introduces an extension of the SPS EOA algorithm to ridge regression, and derives probably approximately correct (PAC) upper bounds for the resulting region sizes. Compared with previous analyses, our result explicitly show how the regularization parameter affects the region sizes, and provide tighter bounds under weaker excitation assumptions. Finally, the practical effect of regularization is also demonstrated via simulation experiments.
[LG-153] Improving Geopolitical Forecasts with Bayesian Networks
链接: https://arxiv.org/abs/2601.13362
作者: Matthew Martin
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 34 pages, 3 figures
Abstract:This study explores how Bayesian networks (BNs) can improve forecast accuracy compared to logistic regression and recalibration and aggregation methods, using data from the Good Judgment Project. Regularized logistic regression models and a baseline recalibrated aggregate were compared to two types of BNs: structure-learned BNs with arcs between predictors, and naive BNs. Four predictor variables were examined: absolute difference from the aggregate, forecast value, days prior to question close, and mean standardized Brier score. Results indicated the recalibrated aggregate achieved the highest accuracy (AUC = 0.985), followed by both types of BNs, then the logistic regression models. Performance of the BNs was likely harmed by reduced information from the discretization process and violation of the assumption of linearity likely harmed the logistic regression models. Future research should explore hybrid approaches combining BNs with logistic regression, examine additional predictor variables, and account for hierarchical data dependencies.
[LG-154] Scaling laws for amplitude surrogates
链接: https://arxiv.org/abs/2601.13308
作者: Henning Bahl,Victor Bresó-Pla,Anja Butter,Joaquín Iturriza Ramirez
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 45 pages, 20 figures
Abstract:Scaling laws describing the dependence of neural network performance on the amount of training data, the spent compute, and the network size have emerged across a huge variety of machine learning task and datasets. In this work, we systematically investigate these scaling laws in the context of amplitude surrogates for particle physics. We show that the scaling coefficients are connected to the number of external particles of the process. Our results demonstrate that scaling laws are a useful tool to achieve desired precision targets.
[LG-155] Empirical Risk Minimization with f-Divergence Regularization
链接: https://arxiv.org/abs/2601.13191
作者: Francisco Daunas,Iñaki Esnaola,Samir M. Perlaza,H. Vincent Poor
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Information Theory. arXiv admin note: substantial text overlap with arXiv:2502.14544 , arXiv:2508.03314
Abstract:In this paper, the solution to the empirical risk minimization problem with f -divergence regularization (ERM- f DR) is presented and conditions under which the solution also serves as the solution to the minimization of the expected empirical risk subject to an f -divergence constraint are established. The proposed approach extends applicability to a broader class of f -divergences than previously reported and yields theoretical results that recover previously known results. Additionally, the difference between the expected empirical risk of the ERM- f DR solution and that of its reference measure is characterized, providing insights into previously studied cases of f -divergences. A central contribution is the introduction of the normalization function, a mathematical object that is critical in both the dual formulation and practical computation of the ERM- f DR solution. This work presents an implicit characterization of the normalization function as a nonlinear ordinary differential equation (ODE), establishes its key properties, and subsequently leverages them to construct a numerical algorithm for approximating the normalization factor under mild assumptions. Further analysis demonstrates structural equivalences between ERM- f DR problems with different f -divergences via transformations of the empirical risk. Finally, the proposed algorithm is used to compute the training and test risks of ERM- f DR solutions under different f -divergence regularizers. This numerical example highlights the practical implications of choosing different functions f in ERM- f DR problems.
[LG-156] SolARED: Solar Active Region Emergence Dataset for Machine Learning Aided Predictions
链接: https://arxiv.org/abs/2601.13145
作者: Spiridon Kasapis,Eren Dogan,Irina N. Kitiashvili,Alexander G. Kosovichev,John T. Stefan,Jake D. Butler,Jonas Tirona,Sarang Patil,Mengjia Xu
类目: olar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 15 pages, 6 figures, submitted to the Springer Nature - Solar Physics Journal
Abstract:The development of accurate forecasts of solar eruptive activity has become increasingly important for preventing potential impacts on space technologies and exploration. Therefore, it is crucial to detect Active Regions (ARs) before they start forming on the solar surface. This will enable the development of early-warning capabilities for upcoming space weather disturbances. For this reason, we prepared the Solar Active Region Emergence Dataset (SolARED). The dataset is derived from full-disk maps of the Doppler velocity, magnetic field, and continuum intensity, obtained by the Helioseismic and Magnetic Imager (HMI) onboard the Solar Dynamics Observatory (SDO). SolARED includes time series of remapped, tracked, and binned data that characterize the evolution of acoustic power of solar oscillations, unsigned magnetic flux, and continuum intensity for 50 large ARs before, during, and after their emergence on the solar surface, as well as surrounding areas observed on the solar disc between 2010 and 2023. The resulting ML-ready SolARED dataset is designed to support enhancements of predictive capabilities, enabling the development of operational forecasts for the emergence of active regions. The SolARED dataset is available at this https URL, through an interactive visualization web application.
[LG-157] Forecasting Continuum Intensity for Solar Active Region Emergence Prediction using Transformers
链接: https://arxiv.org/abs/2601.13144
作者: Jonas Tirona,Sarang Patil,Spiridon Kasapis,Eren Dogan,John Stefan,Irina N. Kitiashvili,Alexander G. Kosovichev,Mengjia Xu
类目: olar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 30 pages, 7 figures, submitted to JGR: Machine Learning and Computation
Abstract:Early and accurate prediction of solar active region (AR) emergence is crucial for space weather forecasting. Building on established Long Short-Term Memory (LSTM) based approaches for forecasting the continuum intensity decrease associated with AR emergence, this work expands the modeling with new architectures and targets. We investigate a sliding-window Transformer architecture to forecast continuum intensity evolution up to 12 hours ahead using data from 46 ARs observed by SDO/HMI. We conduct a systematic ablation study to evaluate two key components: (1) the inclusion of a temporal 1D convolutional (Conv1D) front-end and (2) a novel `Early Detection’ architecture featuring attention biases and a timing-aware loss function. Our best-performing model, combining the Early Detection architecture without the Conv1D layer, achieved a Root Mean Square Error (RMSE) of 0.1189 (representing a 10.6% improvement over the LSTM baseline) and an average advance warning time of 4.73 hours (timing difference of -4.73h), even under a stricter emergence criterion than previous studies. While the Transformer demonstrates superior aggregate timing and accuracy, we note that this high-sensitivity detection comes with increased variance compared to smoother baseline models. However, this volatility is a necessary trade-off for operational warning systems: the model’s ability to detect micro-changes in precursor signals enables significantly earlier detection, outweighing the cost of increased noise. Our results demonstrate that Transformer architectures modified with early detection biases, when used without temporal smoothing layers, provide a high-sensitivity alternative for forecasting AR emergence that prioritizes advance warning over statistical smoothness.
[LG-158] Approximate full conformal prediction in RKHS
链接: https://arxiv.org/abs/2601.13102
作者: Davidson Lova Razafindrakoto,Alain Celisse,Jérôme Lacaille
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Full conformal prediction is a framework that implicitly formulates distribution-free confidence prediction regions for a wide range of estimators. However, a classical limitation of the full conformal framework is the computation of the confidence prediction regions, which is usually impossible since it requires training infinitely many estimators (for real-valued prediction for instance). The main purpose of the present work is to describe a generic strategy for designing a tight approximation to the full conformal prediction region that can be efficiently computed. Along with this approximate confidence region, a theoretical quantification of the tightness of this approximation is developed, depending on the smoothness assumptions on the loss and score functions. The new notion of thickness is introduced for quantifying the discrepancy between the approximate confidence region and the full conformal one.
[LG-159] Polychronous Wave Computing: Timing-Native Address Selection in Spiking Networks WWW
链接: https://arxiv.org/abs/2601.13079
作者: Natalila G. Berloff
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optics (physics.optics)
*备注: 23 pages, Supplementary Materials are available at this https URL
Abstract:Spike timing offers a combinatorial address space, suggesting that timing-based spiking inference can be executed as lookup and routing rather than as dense multiply–accumulate. Yet most neuromorphic and photonic systems still digitize events into timestamps, bins, or rates and then perform selection in clocked logic. We introduce Polychronous Wave Computing (PWC), a timing-native address-selection primitive that maps relative spike latencies directly to a discrete output route in the wave domain. Spike times are phase-encoded in a rotating frame and processed by a programmable multiport interferometer that evaluates K template correlations in parallel; a driven–dissipative winner-take-all stage then performs a physical argmax, emitting a one-hot output port. We derive the operating envelope imposed by phase wrapping and mutual coherence, and collapse timing jitter, static phase mismatch, and dephasing into a single effective phase-noise budget whose induced winner–runner-up margin predicts boundary-first failures and provides an intensity-only calibration target. Simulations show that nonlinear competition improves routing fidelity compared with noisy linear intensity readout, and that hardware-in-the-loop phase tuning rescues a temporal-order gate from 55.9% to 97.2% accuracy under strong static mismatch. PWC provides a fast routing coprocessor for LUT-style spiking networks and sparse top-1 gates (e.g., mixture-of-experts routing) across polaritonic, photonic, and oscillator platforms.
[LG-160] Beyond Visual Realism: Toward Reliable Financial Time Series Generation ICASSP2026
链接: https://arxiv.org/abs/2601.12990
作者: Fan Zhang,Jiabin Luo,Zheng Zhang,Shuanghong Huang,Zhipeng Liu,Yu Chen
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: Accepted by ICASSP 2026
Abstract:Generative models for financial time series often create data that look realistic and even reproduce stylized facts such as fat tails or volatility clustering. However, these apparent successes break down under trading backtests: models like GANs or WGAN-GP frequently collapse, yielding extreme and unrealistic results that make the synthetic data unusable in practice. We identify the root cause in the neglect of financial asymmetry and rare tail events, which strongly affect market risk but are often overlooked by objectives focusing on distribution matching. To address this, we introduce the Stylized Facts Alignment GAN (SFAG), which converts key stylized facts into differentiable structural constraints and jointly optimizes them with adversarial loss. This multi-constraint design ensures that generated series remain aligned with market dynamics not only in plots but also in backtesting. Experiments on the Shanghai Composite Index (2004–2024) show that while baseline GANs produce unstable and implausible trading outcomes, SFAG generates synthetic data that preserve stylized facts and support robust momentum strategy performance. Our results highlight that structure-preserving objectives are essential to bridge the gap between superficial realism and practical usability in financial generative modeling.
[LG-161] Energy-Efficient Prediction in Textile Manufacturing: Enhancing Accuracy and Data Efficiency With Ensemble Deep Transfer Learning
链接: https://arxiv.org/abs/2601.12663
作者: Yan-Chen Chen,Wei-Yu Chiu,Qun-Yu Wang,Jing-Wei Chen,Hao-Ting Zhao
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 26 pages, 11 figures
Abstract:Traditional textile factories consume substantial energy, making energy-efficient production optimization crucial for sustainability and cost reduction. Meanwhile, deep neural networks (DNNs), which are effective for factory output prediction and operational optimization, require extensive historical data, posing challenges due to high sensor deployment and data collection costs. To address this, we propose Ensemble Deep Transfer Learning (EDTL), a novel framework that enhances prediction accuracy and data efficiency by integrating transfer learning with an ensemble strategy and a feature alignment layer. EDTL pretrains DNN models on data-rich production lines (source domain) and adapts them to data-limited lines (target domain), reducing dependency on large datasets. Experiments on real-world textile factory datasets show that EDTL improves prediction accuracy by 5.66% and enhances model robustness by 3.96% compared to conventional DNNs, particularly in data-limited scenarios (20%-40% data availability). This research contributes to energy-efficient textile manufacturing by enabling accurate predictions with fewer data requirements, providing a scalable and cost-effective solution for smart production systems.
[LG-162] Reorienting off-path Nudged Elastic Bands (RONEB) via Minimum Mode Following
链接: https://arxiv.org/abs/2601.12630
作者: Rohit Goswami(1 and 2),Miha Gunde(2 and 3),Hannes Jónsson((1) Institute IMX and Lab-COSMO, École polytechnique fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland (2) Science Institute, University of Iceland, Reykjavik, Iceland (3) Institute Ruđer Bošković, Bijenička 54, 10000 Zagreb, Croatia)
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 25 pages. 11 figures
Abstract:Accurate determination of transition states remains central to understanding reaction kinetics. Double-ended methods like the Nudged Elastic Band (NEB) ensure relevant transition states and paths, but incur high computational costs and suffer stagnation on flat or rough potential energy surfaces. Conversely, single-ended eigenmode-following techniques offer efficiency but cannot often be constrained between specific states. Here, we present the Reorienting Off-path Nudged Elastic Bands (RONEB), an adaptive hybrid algorithm that integrates the double ended nature of the NEB with the acceleration of single ended Min-Mode Following methods. RONEB provides stability based on the history of the path optimization, relative force triggering, and an alignment-based back-off penalty to dynamically decouple the climbing image from the elastic band constraints. We benchmark the method against the standard Climbing Image NEB (CI-NEB) across the Baker-Chan transition state test set using the PET-MAD machine-learned potential and the OptBench Pt(111) heptamer island surface diffusion set. A Bayesian analysis of the performance data quantifies a median reduction in gradient calls of 46.3% [95% CrI: -54.7%, -36.9%] relative to the baseline, while surface diffusion tests reveal a 28% reduction across 59 metallic rearrangement mechanisms. These results establish RONEB as a highly effective tool for high-throughput automated chemical discovery.
[LG-163] Deterministic and probabilistic neural surrogates of global hybrid-Vlasov simulations
链接: https://arxiv.org/abs/2601.12614
作者: Daniel Holmberg,Ivan Zaitsev,Markku Alho,Ioanna Bouri,Fanni Franssila,Haewon Jeong,Minna Palmroth,Teemu Roos
类目: pace Physics (physics.space-ph); Machine Learning (cs.LG); Plasma Physics (physics.plasm-ph)
*备注:
Abstract:Hybrid-Vlasov simulations resolve ion-kinetic effects for modeling the solar wind-magnetosphere interaction, but even 5D (2D + 3V) simulations are computationally expensive. We show that graph-based machine learning emulators can learn the spatiotemporal evolution of electromagnetic fields and lower order moments of ion velocity distribution in the near-Earth space environment from four 5D Vlasiator runs performed with identical steady solar wind conditions. The initial ion number density is systematically varied, while the grid spacing is held constant, to scan the ratio of the characteristic ion skin depth to the numerical grid size. Using a graph neural network architecture operating on the 2D spatial simulation grid comprising 670k cells, we demonstrate that both a deterministic forecasting model (Graph-FM) and a probabilistic ensemble forecasting model (Graph-EFM) based on a latent variable formulation are capable of producing accurate predictions of future plasma states. A divergence penalty is incorporated during training to encourage divergence-freeness in the magnetic fields and improve physical consistency. For the probabilistic model, a continuous ranked probability score objective is added to improve the calibration of the ensemble forecasts. When trained, the emulators achieve more than two orders of magnitude speedup in generating the next time step relative to the original simulation on a single GPU compared to 100 CPUs for the Vlasiator runs, while closely matching physical magnetospheric response of the different runs. These results demonstrate that machine learning offers a way to make hybrid-Vlasov simulation tractable for real-time use while providing forecast uncertainty.
[LG-164] onepot CORE – an enumerated chemical space to streamline drug discovery enabled by automated small molecule synthesis and AI
链接: https://arxiv.org/abs/2601.12603
作者: Andrei S. Tyrin,Brandon Wang,Manuel Muñoz,Samuel H. Foxman,Daniil A. Boiko
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:The design-make-test-analyze cycle in early-stage drug discovery remains constrained primarily by the “make” step: small-molecule synthesis is slow, costly, and difficult to scale or automate across diverse chemotypes. Enumerated chemical spaces aim to reduce this bottleneck by predefining synthesizable regions of chemical space from available building blocks and reliable reactions, yet existing commercial spaces are still limited by long turnaround times, narrow reaction scope, and substantial manual decision-making in route selection and execution. Here we present the first version of onepot CORE, an enumerated chemical space containing 3.4B molecules and corresponding on-demand synthesis product enabled by an automated synthesis platform and an AI chemist, Phil, that designs, executes, and analyzes experiments. onepot CORE is constructed by (i) selecting a reaction set commonly used in medicinal chemistry, (ii) sourcing and curating building blocks from supplier catalogs, (iii) enumerating candidate products, and (iv) applying ML-based feasibility assessment to prioritize compounds for robust execution. In the current release, the space is supported by seven reactions. We describe an end-to-end workflow - from route selection and automated liquid handling through workup and purification. We further report validation across operational metrics (success rate, timelines, purity, and identity), including NMR confirmation for a representative set of synthesized compounds and assay suitability demonstrated using a series of DPP4 inhibitors. Collectively, onepot CORE illustrates a path toward faster, more reliable access to diverse small molecules, supporting accelerated discovery in pharmaceuticals and beyond. Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG) Cite as: arXiv:2601.12603 [physics.chem-ph] (or arXiv:2601.12603v1 [physics.chem-ph] for this version) https://doi.org/10.48550/arXiv.2601.12603 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-165] A Theory of Diversity for Random Matrices with Applications to In-Context Learning of Schrödinger Equations
链接: https://arxiv.org/abs/2601.12587
作者: Frank Cole,Yulong Lu,Shaurya Sehgal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We address the following question: given a collection \mathbfA^(1), \dots, \mathbfA^(N)\ of independent d \times d random matrices drawn from a common distribution \mathbbP , what is the probability that the centralizer of \mathbfA^(1), \dots, \mathbfA^(N)\ is trivial? We provide lower bounds on this probability in terms of the sample size N and the dimension d for several families of random matrices which arise from the discretization of linear Schrödinger operators with random potentials. When combined with recent work on machine learning theory, our results provide guarantees on the generalization ability of transformer-based neural networks for in-context learning of Schrödinger equations.
[LG-166] A Mixture of Experts Vision Transformer for High-Fidelity Surface Code Decoding
链接: https://arxiv.org/abs/2601.12483
作者: Hoang Viet Nguyen,Manh Hung Nguyen,Hoang Ta,Van Khu Vu,Yeow Meng Chee
类目: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 16 pages, 7 figures
Abstract:Quantum error correction is a key ingredient for large scale quantum computation, protecting logical information from physical noise by encoding it into many physical qubits. Topological stabilizer codes are particularly appealing due to their geometric locality and practical relevance. In these codes, stabilizer measurements yield a syndrome that must be decoded into a recovery operation, making decoding a central bottleneck for scalable real time operation. Existing decoders are commonly classified into two categories. Classical algorithmic decoders provide strong and well established baselines, but may incur substantial computational overhead at large code distances or under stringent latency constraints. Machine learning based decoders offer fast GPU inference and flexible function approximation, yet many approaches do not explicitly exploit the lattice geometry and local structure of topological codes, which can limit performance. In this work, we propose QuantumSMoE, a quantum vision transformer based decoder that incorporates code structure through plus shaped embeddings and adaptive masking to capture local interactions and lattice connectivity, and improves scalability via a mixture of experts layer with a novel auxiliary loss. Experiments on the toric code demonstrate that QuantumSMoE outperforms state-of-the-art machine learning decoders as well as widely used classical baselines.
[LG-167] mporal Data and Short-Time Averag es Improve Multiphase Mass Flow Metering
链接: https://arxiv.org/abs/2601.12433
作者: Amanda Nyholm,Yessica Arellano,Jinyu Liu,Damian Krakowiak,Pierluigi Salvo Rossi
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures
Abstract:Reliable flow measurements are essential in many industries, but current instruments often fail to accurately estimate multiphase flows, which are frequently encountered in real-world operations. Combining machine learning (ML) algorithms with accurate single-phase flowmeters has therefore received extensive research attention in recent years. The Coriolis mass flowmeter is a widely used single-phase meter that provides direct mass flow measurements, which ML models can be trained to correct, thereby reducing measurement errors in multiphase conditions. This paper demonstrates that preserving temporal information significantly improves model performance in such scenarios. We compare a multilayer perceptron, a windowed multilayer perceptron, and a convolutional neural network (CNN) on three-phase air-water-oil flow data from 342 experiments. Whereas prior work typically compresses each experiment into a single averaged sample, we instead compute short-time averages from within each experiment and train models that preserve temporal information at several downsampling intervals. The CNN performed best at 0.25 Hz with approximately 95 % of relative errors below 13 %, a normalized root mean squared error of 0.03, and a mean absolute percentage error of approximately 4.3 %, clearly outperforming the best single-averaged model and demonstrating that short-time averaging within individual experiments is preferable. Results are consistent across multiple data splits and random seeds, demonstrating robustness.
[LG-168] BiCoLoR: Communication-Efficient Optimization with Bidirectional Compression and Local Training
链接: https://arxiv.org/abs/2601.12400
作者: Laurent Condat,Artavazd Maranjyan,Peter Richtárik
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Slow and costly communication is often the main bottleneck in distributed optimization, especially in federated learning where it occurs over wireless networks. We introduce BiCoLoR, a communication-efficient optimization algorithm that combines two widely used and effective strategies: local training, which increases computation between communication rounds, and compression, which encodes high-dimensional vectors into short bitstreams. While these mechanisms have been combined before, compression has typically been applied only to uplink (client-to-server) communication, leaving the downlink (server-to-client) side unaddressed. In practice, however, both directions are costly. We propose BiCoLoR, the first algorithm to combine local training with bidirectional compression using arbitrary unbiased compressors. This joint design achieves accelerated complexity guarantees in both convex and strongly convex heterogeneous settings. Empirically, BiCoLoR outperforms existing algorithms and establishes a new standard in communication efficiency.
[LG-169] Bone-conduction Guided Multimodal Speech Enhancement with Conditional Diffusion Models ICASSP2026
链接: https://arxiv.org/abs/2601.12354
作者: Sina Khanagha,Bunlong Lay,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to IEEE ICASSP 2026
Abstract:Single-channel speech enhancement models face significant performance degradation in extremely noisy environments. While prior work has shown that complementary bone-conducted speech can guide enhancement, effective integration of this noise-immune modality remains a challenge. This paper introduces a novel multimodal speech enhancement framework that integrates bone-conduction sensors with air-conducted microphones using a conditional diffusion model. Our proposed model significantly outperforms previously established multimodal techniques and a powerful diffusion-based single-modal baseline across a wide range of acoustic conditions.
[LG-170] Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios ICASSP
链接: https://arxiv.org/abs/2601.12345
作者: Jakob Kienegger,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted at IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Abstract:Latest advances in deep spatial filtering for Ambisonics demonstrate strong performance in stationary multi-speaker scenarios by rotating the sound field toward a target speaker prior to multi-channel enhancement. For applicability in dynamic acoustic conditions with moving speakers, we propose to automate this rotary steering using an interleaved tracking algorithm conditioned on the target’s initial direction. However, for nearby or crossing speakers, robust tracking becomes difficult and spatial cues less effective for enhancement. By incorporating the processed recording as additional guide into both algorithms, our novel joint autoregressive framework leverages temporal-spectral correlations of speech to resolve spatially challenging speaker constellations. Consequently, our proposed method significantly improves tracking and enhancement of closely spaced speakers, consistently outperforming comparable non-autoregressive methods on a synthetic dataset. Real-world recordings complement these findings in complex scenarios with multiple speaker crossings and varying speaker-to-array distances.
[LG-171] On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization
链接: https://arxiv.org/abs/2601.12238
作者: Sharan Sahu,Cameron J. Hogan,Martin T. Wells
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 70 pages, 4 figures, 2 tables
Abstract:While momentum-based acceleration has been studied extensively in deterministic optimization problems, its behavior in nonstationary environments – where the data distribution and optimal parameters drift over time – remains underexplored. We analyze the tracking performance of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak heavy-ball and Nesterov) under uniform strong convexity and smoothness in varying stepsize regimes. We derive finite-time bounds in expectation and with high probability for the tracking error, establishing a sharp decomposition into three components: a transient initialization term, a noise-induced variance term, and a drift-induced tracking lag. Crucially, our analysis uncovers a fundamental trade-off: while momentum can suppress gradient noise, it incurs an explicit penalty on the tracking capability. We show that momentum can substantially amplify drift-induced tracking error, with amplification that becomes unbounded as the momentum parameter approaches one, formalizing the intuition that using ‘stale’ gradients hinders adaptation to rapid regime shifts. Complementing these upper bounds, we establish minimax lower bounds for dynamic regret under gradient-variation constraints. These lower bounds prove that the inertia-induced penalty is not an artifact of analysis but an information-theoretic barrier: in drift-dominated regimes, momentum creates an unavoidable ‘inertia window’ that fundamentally degrades performance. Collectively, these results provide a definitive theoretical grounding for the empirical instability of momentum in dynamic environments and delineate the precise regime boundaries where SGD provably outperforms its accelerated counterparts.
[LG-172] Persistent Sheaf Laplacian Analysis of Protein Stability and Solubility Changes upon Mutation
链接: https://arxiv.org/abs/2601.12219
作者: Yiming Ren,Junjie Wee,Xi Chen,Grace Qian,Guo-Wei Wei
类目: pectral Theory (math.SP); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Genetic mutations frequently disrupt protein structure, stability, and solubility, acting as primary drivers for a wide spectrum of diseases. Despite the critical importance of these molecular alterations, existing computational models often lack interpretability, and fail to integrate essential physicochemical interaction. To overcome these limitations, we propose SheafLapNet, a unified predictive framework grounded in the mathematical theory of Topological Deep Learning (TDL) and Persistent Sheaf Laplacian (PSL). Unlike standard Topological Data Analysis (TDA) tools such as persistent homology, which are often insensitive to heterogeneous information, PSL explicitly encodes specific physical and chemical information such as partial charges directly into the topological analysis. SheafLapNet synergizes these sheaf-theoretic invariants with advanced protein transformer features and auxiliary physical descriptors to capture intrinsic molecular interactions in a multiscale and mechanistic manner. To validate our framework, we employ rigorous benchmarks for both regression and classification tasks. For stability prediction, we utilize the comprehensive S2648 and S350 datasets. For solubility prediction, we employ the PON-Sol2 dataset, which provides annotations for increased, decreased, or neutral solubility changes. By integrating these multi-perspective features, SheafLapNet achieves state-of-the-art performance across these diverse benchmarks, demonstrating that sheaf-theoretic modeling significantly enhances both interpretability and generalizability in predicting mutation-induced structural and functional changes.
[LG-173] Offline Policy Learning with Weight Clipping and Heaviside Composite Optimization
链接: https://arxiv.org/abs/2601.12117
作者: Jingren Liu,Hanzhang Qin,Junyi Liu,Mabel C. Chou,Jong-Shi Pang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Offline policy learning aims to use historical data to learn an optimal personalized decision rule. In the standard estimate-then-optimize framework, reweighting-based methods (e.g., inverse propensity weighting or doubly robust estimators) are widely used to produce unbiased estimates of policy values. However, when the propensity scores of some treatments are small, these reweighting-based methods suffer from high variance in policy value estimation, which may mislead the downstream policy optimization and yield a learned policy with inferior value. In this paper, we systematically develop an offline policy learning algorithm based on a weight-clipping estimator that truncates small propensity scores via a clipping threshold chosen to minimize the mean squared error (MSE) in policy value estimation. Focusing on linear policies, we address the bilevel and discontinuous objective induced by weight-clipping-based policy optimization by reformulating the problem as a Heaviside composite optimization problem, which provides a rigorous computational framework. The reformulated policy optimization problem is then solved efficiently using the progressive integer programming method, making practical policy learning tractable. We establish an upper bound for the suboptimality of the proposed algorithm, which reveals how the reduction in MSE of policy value estimation, enabled by our proposed weight-clipping estimator, leads to improved policy learning performance.
[LG-174] Nonlinear Dynamic Factor Analysis With a Transformer Network
链接: https://arxiv.org/abs/2601.12039
作者: Oliver Snellman
类目: Econometrics (econ.EM); Machine Learning (cs.LG)
*备注: Working paper. 88 pages, 57 figures, 14 tables. Earlier versions circulated as “Nowcasting with a Transformer Network” (first version: 26 Oct 2024)
Abstract:The paper develops a Transformer architecture for estimating dynamic factors from multivariate time series data under flexible identification assumptions. Performance on small datasets is improved substantially by using a conventional factor model as prior information via a regularization term in the training objective. The results are interpreted with Attention matrices that quantify the relative importance of variables and their lags for the factor estimate. Time variation in Attention patterns can help detect regime switches and evaluate narratives. Monte Carlo experiments suggest that the Transformer is more accurate than the linear factor model, when the data deviate from linear-Gaussian assumptions. An empirical application uses the Transformer to construct a coincident index of U.S. real economic activity.
[LG-175] A Kernel Approach for Semi-implicit Variational Inference
链接: https://arxiv.org/abs/2601.12023
作者: Longlin Yu,Ziheng Cheng,Shiyue Zhang,Cheng Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 40 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2405.18997
Abstract:Semi-implicit variational inference (SIVI) enhances the expressiveness of variational families through hierarchical semi-implicit distributions, but the intractability of their densities makes standard ELBO-based optimization biased. Recent score-matching approaches to SIVI (SIVI-SM) address this issue via a minimax formulation, at the expense of an additional lower-level optimization problem. In this paper, we propose kernel semi-implicit variational inference (KSIVI), a principled and tractable alternative that eliminates the lower-level optimization by leveraging kernel methods. We show that when optimizing over a reproducing kernel Hilbert space, the lower-level problem admits an explicit solution, reducing the objective to the kernel Stein discrepancy (KSD). Exploiting the hierarchical structure of semi-implicit distributions, the resulting KSD objective can be efficiently optimized using stochastic gradient methods. We establish optimization guarantees via variance bounds on Monte Carlo gradient estimators and derive statistical generalization bounds of order \tilde\mathcalO(1/\sqrtn) . We further introduce a multi-layer hierarchical extension that improves expressiveness while preserving tractability. Empirical results on synthetic and real-world Bayesian inference tasks demonstrate the effectiveness of KSIVI.
[LG-176] Impact of Circuit Depth versus Qubit Count on Variational Quantum Classifiers for Higgs Boson Signal Detection
链接: https://arxiv.org/abs/2601.11937
作者: Fatih Maulana
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 13 Pages, 5 Figures, Code and Data Available at: this https URL
Abstract:High-Energy Physics (HEP) experiments, such as those at the Large Hadron Collider (LHC), generate massive datasets that challenge classical computational limits. Quantum Machine Learning (QML) offers a potential advantage in processing high-dimensional data; however, finding the optimal architecture for current Noisy Intermediate-Scale Quantum (NISQ) devices remains an open challenge. This study investigates the performance of Variational Quantum Classifiers (VQC) in detecting Higgs Boson signals using the ATLAS Higgs Boson Machine Learning Challenge 2014 experiment dataset. We implemented a dimensionality reduction pipeline using Principal Component Analysis (PCA) to map 30 physical features into 4-qubit and 8-qubit latent spaces. We benchmarked three configurations: (A) a shallow 4-qubit circuit, (B) a deep 4-qubit circuit with increased entanglement layers, and © an expanded 8-qubit circuit. Experimental results demonstrate that increasing circuit depth significantly improves performance, yielding the highest accuracy of 56.2% (Configuration B), compared to a baseline of 51.9%. Conversely, simply scaling to 8 qubits resulted in a performance degradation to 50.6% due to optimization challenges associated with Barren Plateaus in the larger Hilbert space. These findings suggest that for near-term quantum hardware, prioritizing circuit depth and entanglement capability is more critical than increasing qubit count for effective anomaly detection in HEP data.
[LG-177] Adversarial Drift-Aware Predictive Transfer: Toward Durable Clinical AI
链接: https://arxiv.org/abs/2601.11860
作者: Xin Xiong,Zijian Guo,Haobo Zhu,Chuan Hong,Jordan W Smoller,Tianxi Cai,Molei Liu
类目: Applications (stat.AP); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Clinical AI systems frequently suffer performance decay post-deployment due to temporal data shifts, such as evolving populations, diagnostic coding updates (e.g., ICD-9 to ICD-10), and systemic shocks like the COVID-19 pandemic. Addressing this ``aging’’ effect via frequent retraining is often impractical due to computational costs and privacy constraints. To overcome these hurdles, we introduce Adversarial Drift-Aware Predictive Transfer (ADAPT), a novel framework designed to confer durability against temporal drift with minimal retraining. ADAPT innovatively constructs an uncertainty set of plausible future models by combining historical source models and limited current data. By optimizing worst-case performance over this set, it balances current accuracy with robustness against degradation due to future drifts. Crucially, ADAPT requires only summary-level model estimators from historical periods, preserving data privacy and ensuring operational simplicity. Validated on longitudinal suicide risk prediction using electronic health records from Mass General Brigham (2005–2021) and Duke University Health Systems, ADAPT demonstrated superior stability across coding transitions and pandemic-induced shifts. By minimizing annual performance decay without labeling or retraining future data, ADAPT offers a scalable pathway for sustaining reliable AI in high-stakes healthcare environments.
[LG-178] Gradient-based Active Learning with Gaussian Processes for Global Sensitivity Analysis
链接: https://arxiv.org/abs/2601.11790
作者: Guerlain Lambert,Céline Helbert,Claire Lauvernet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Global sensitivity analysis of complex numerical simulators is often limited by the small number of model evaluations that can be afforded. In such settings, surrogate models built from a limited set of simulations can substantially reduce the computational burden, provided that the design of computer experiments is enriched efficiently. In this context, we propose an active learning approach that, for a fixed evaluation budget, targets the most informative regions of the input space to improve sensitivity analysis accuracy. More specifically, our method builds on recent advances in active learning for sensitivity analysis (Sobol’ indices and derivative-based global sensitivity measures, DGSM) that exploit derivatives obtained from a Gaussian process (GP) surrogate. By leveraging the joint posterior distribution of the GP gradient, we develop acquisition functions that better account for correlations between partial derivatives and their impact on the response surface, leading to a more comprehensive and robust methodology than existing DGSM-oriented criteria. The proposed approach is first compared to state-of-the-art methods on standard benchmark functions, and is then applied to a real environmental model of pesticide transfers.
[LG-179] Quantum Kernel Machine Learning for Autonomous Materials Science
链接: https://arxiv.org/abs/2601.11775
作者: Felix Adams(1),Daiwei Zhu(2),David W. Steuerman(2),A. Gilad Kusne(1 and 3),Ichiro Takeuchi(1 and 4) ((1) University of Maryland College Park, (2) IonQ, (3) National Institute for Standards and Technology, (4) University of Maryland Quantum Materials Center)
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Autonomous materials science, where active learning is used to navigate large compositional phase space, has emerged as a powerful vehicle to rapidly explore new materials. A crucial aspect of autonomous materials science is exploring new materials using as little data as possible. Gaussian process-based active learning allows effective charting of multi-dimensional parameter space with a limited number of training data, and thus is a common algorithmic choice for autonomous materials science. An integral part of the autonomous workflow is the application of kernel functions for quantifying similarities among measured data points. A recent theoretical breakthrough has shown that quantum kernel models can achieve similar performance with less training data than classical models. This signals the possible advantage of applying quantum kernel machine learning to autonomous materials discovery. In this work, we compare quantum and classical kernels for their utility in sequential phase space navigation for autonomous materials science. Specifically, we compute a quantum kernel and several classical kernels for x-ray diffraction patterns taken from an Fe-Ga-Pd ternary composition spread library. We conduct our study on both IonQ’s Aria trapped ion quantum computer hardware and the corresponding classical noisy simulator. We experimentally verify that a quantum kernel model can outperform some classical kernel models. The results highlight the potential of quantum kernel machine learning methods for accelerating materials discovery and suggest complex x-ray diffraction data is a candidate for robust quantum kernel model advantage.
[LG-180] AllShowers: One model for all calorimeter showers
链接: https://arxiv.org/abs/2601.11716
作者: Thorsten Buss,Henry Day-Hall,Frank Gaede,Gregor Kasieczka,Katja Krüger
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
*备注:
Abstract:Accurate and efficient detector simulation is essential for modern collider experiments. To reduce the high computational cost, various fast machine learning surrogate models have been proposed. Traditional surrogate models for calorimeter shower modeling train separate networks for each particle species, limiting scalability and reuse. We introduce AllShowers, a unified generative model that simulates calorimeter showers across multiple particle types using a single generative model. AllShowers is a continuous normalizing flow model with a Transformer architecture, enabling it to generate complex spatial and energy correlations in variable-length point cloud representations of showers. Trained on a diverse dataset of simulated showers in the highly granular ILD detector, the model demonstrates the ability to generate realistic showers for electrons, photons, and charged and neutral hadrons across a wide range of incident energies and angles without retraining. In addition to unifying shower generation for multiple particle types, AllShowers surpasses the fidelity of previous single-particle-type models for hadronic showers. Key innovations include the use of a layer embedding, allowing the model to learn all relevant calorimeter layer properties; a custom attention masking scheme to reduce computational demands and introduce a helpful inductive bias; and a shower- and layer-wise optimal transport mapping to improve training convergence and sample quality. AllShowers marks a significant step towards a universal model for calorimeter shower simulations in collider experiments.
[LG-181] Explainable histomorphology-based survival prediction of glioblastoma IDH-wildtype
链接: https://arxiv.org/abs/2601.11691
作者: Jan-Philipp Redlich,Friedrich Feuerhake,Stefan Nikolin,Nadine Sarah Schaadt,Sarah Teuber-Hanselmann,Joachim Weis,Sabine Luttmann,Andrea Eberle,Christoph Buck,Timm Intemann,Pascal Birnstill,Klaus Kraywinkel,Jonas Ort,Peter Boor,André Homeyer
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Glioblastoma, IDH-wildtype (GBM-IDHwt) is the most common malignant brain tumor. Histomorphology is a crucial component of the integrated diagnosis of GBM-IDHwt. Artificial intelligence (AI) methods have shown promise to extract additional prognostic information from histological whole-slide images (WSI) of hematoxylin and eosin-stained glioblastoma tissue. Here, we present an explainable AI-based method to support systematic interpretation of histomorphological features associated with survival. It combines an explainable multiple instance learning (MIL) architecture with a sparse autoencoder (SAE) to relate human-interpretable visual patterns of tissue to survival. The MIL architecture directly identifies prognosis-relevant image tiles and the SAE maps these tiles post-hoc to visual patterns. The MIL method was trained and evaluated using a new real-world dataset that comprised 720 GBM-IDHwt cases from three hospitals and four cancer registries in Germany. The SAE was trained using 1878 WSIs of glioblastoma from five independent public data collections. Despite the many factors influencing survival time, our method showed some ability to discriminate between patients living less than 180 days or more than 360 days solely based on histomorphology (AUC: 0.67; 95% CI: 0.63-0.72). Cox proportional hazards regression confirmed a significant difference in survival time between the predicted groups after adjustment for established prognostic factors (hazard ratio: 1.47; 95% CI: 1.26-1.72). Our method identified multiple interpretable visual patterns associated with survival. Three neuropathologists separately found that 21 of the 24 most strongly associated patterns could be clearly attributed to seven histomorphological categories. Necrosis and hemorrhage appeared to be associated with shorter survival while highly cellular tumor areas were associated with longer survival.
[LG-182] AI Agents Need Memory Control Over More Context
链接: https://arxiv.org/abs/2601.11653
作者: Fouad Bousetouane
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 32 pages, 7 figures
Abstract:AI agents are increasingly used in long, multi-turn workflows in both research and enterprise settings. As interactions grow, agent behavior often degrades due to loss of constraint focus, error accumulation, and memory-induced drift. This problem is especially visible in real-world deployments where context evolves, distractions are introduced, and decisions must remain consistent over time. A common practice is to equip agents with persistent memory through transcript replay or retrieval-based mechanisms. While convenient, these approaches introduce unbounded context growth and are vulnerable to noisy recall and memory poisoning, leading to unstable behavior and increased drift. In this work, we introduce the Agent Cognitive Compressor (ACC), a bio-inspired memory controller that replaces transcript replay with a bounded internal state updated online at each turn. ACC separates artifact recall from state commitment, enabling stable conditioning while preventing unverified content from becoming persistent memory. We evaluate ACC using an agent-judge-driven live evaluation framework that measures both task outcomes and memory-driven anomalies across extended interactions. Across scenarios spanning IT operations, cybersecurity response, and healthcare workflows, ACC consistently maintains bounded memory and exhibits more stable multi-turn behavior, with significantly lower hallucination and drift than transcript replay and retrieval-based agents. These results show that cognitive compression provides a practical and effective foundation for reliable memory control in long-horizon AI agents.
[LG-183] Multi-Scale Negative Coupled Information Systems (MNCIS): A Unified Spectral Topology Framework for Stability in Turbulence AI and Biology
链接: https://arxiv.org/abs/2601.11594
作者: Pengyue Hou
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Biological Physics (physics.bio-ph)
*备注: Includes supplementary materials and code. Foundation and mathematical proofs can be found in the companion paper arXiv:2601.00638
Abstract:Complex dynamical systems frequently encounter a recurrent structural instability: the collapse of the spectral gap, driving the system toward a low-dimensional “Zero-Mode Attractor” (e.g., spectral pile-up or over-smoothing). Building upon recent global well-posedness estimates [Hou, arXiv:2601.00638], this work generalizes the Multi-Scale Negative Coupled Information System (MNCIS) framework. We postulate that global stability requires an active topological operator – Adaptive Spectral Negative Coupling (ASNC) – functioning as a state-dependent high-pass filter that penalizes entropy accumulation at spectral boundaries. We validate this unified framework via three implementations:(1) Hydrodynamics: In 3D Navier-Stokes turbulence ( N=256^3 ), ASNC acts as a global-enstrophy adaptive sub-grid scale (SGS) model, stabilizing the inviscid limit and preserving the Kolmogorov -5/3 inertial range without artificial hyper-viscosity.(2) Artificial Intelligence: Addressing Over-smoothing in Graph Neural Networks (GNNs), we implement ASNC as a parameter-free topological constraint. Unlike baselines (e.g., DeepGCNs) relying on dense residual connections to bypass signal decay, our framework enables the training of ultra-deep 64-layer networks without residual connections, maintaining perfectly stationary feature variance ( \sigma^2 \equiv 1.0 ) on the ogbn-arxiv benchmark. (3) Biological Physics: In reaction-diffusion morphogenesis, it stabilizes Turing patterns against diffusive washout in high-entropy regimes. Our results suggest that the MNCIS framework provides a base-independent topological condition for distinguishing viable complex systems from those collapsing into thermal equilibrium, bridging physical stability and information persistence.
信息检索
[IR-0] XR: Cross-Modal Agents for Composed Image Retrieval WWW2026
链接: https://arxiv.org/abs/2601.14245
作者: Zhongyu Yang,Wei Pang,Yingfang Yuan
类目: Information Retrieval (cs.IR)
*备注: Accepted by WWW 2026. Project: this https URL
Abstract:Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: this https URL.
[IR-1] Rerank Before You Reason : Analyzing Reranking Tradeoffs through Effective Token Cost in Deep Search Agents
链接: https://arxiv.org/abs/2601.14224
作者: Sahel Sharifymoghaddam,Jimmy Lin
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 7 figures
Abstract:Deep research agents rely on iterative retrieval and reasoning to answer complex queries, but scaling test-time computation raises significant efficiency concerns. We study how to allocate reasoning budget in deep search pipelines, focusing on the role of listwise reranking. Using the BrowseComp-Plus benchmark, we analyze tradeoffs between model scale, reasoning effort, reranking depth, and total token cost via a novel effective token cost (ETC) metric. Our results show that reranking consistently improves retrieval and end-to-end accuracy, and that moderate reranking often yields larger gains than increasing search-time reasoning, achieving comparable accuracy at substantially lower cost. All our code is available at this https URL
[IR-2] ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery
链接: https://arxiv.org/abs/2601.14176
作者: Youran Sun,Yixin Wen,Haizhao Yang
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:
Abstract:The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce \textbfReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals. These results underscore the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research. Subjects: Databases (cs.DB); Information Retrieval (cs.IR) Cite as: arXiv:2601.14176 [cs.DB] (or arXiv:2601.14176v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2601.14176 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Youran Sun [view email] [v1] Tue, 20 Jan 2026 17:27:12 UTC (154 KB) Full-text links: Access Paper: View a PDF of the paper titled ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery, by Youran Sun and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.DB prev | next new | recent | 2026-01 Change to browse by: cs cs.IR References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[IR-3] Question-Focused Filtering for Knowledge-based VQA
链接: https://arxiv.org/abs/2601.13856
作者: Wei Ye,Yixin Su,Yueguo Chen,Longxiang Gao,Jianjun Li,Ruixuan Li,Rui Zhang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Knowledge-based Visual Question Answering (KB-VQA) aims to answer questions by integrating images with external knowledge. Effective knowledge filtering is crucial for improving accuracy. Typical filtering methods use similarity metrics to locate relevant article sections from one article, leading to information selection errors at the article and intra-article levels. Although recent explorations of Multimodal Large Language Model (MLLM)-based filtering methods demonstrate superior semantic understanding and cross-article filtering capabilities, their high computational cost limits practical application. To address these issues, this paper proposes a question-focused filtering method. This approach can perform question-focused, cross-article filtering, efficiently obtaining high-quality filtered knowledge while keeping computational costs comparable to typical methods. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Multi-Article Selection (CDA) module, which collectively alleviate information selection errors at both the article and intra-article levels. Experiments show that our method outperforms current state-of-the-art models by 4.9% on E-VQA and 3.8% on InfoSeek, validating its effectiveness. The code is publicly available at: this https URL.
[IR-4] Balancing Fairness and High Match Rates in Reciprocal Recommender Systems: A Nash Social Welfare Approach
链接: https://arxiv.org/abs/2601.13609
作者: Yoji Tomita,Tomohiko Yokoyama
类目: Information Retrieval (cs.IR)
*备注: arXiv admin note: text overlap with arXiv:2409.00720
Abstract:Matching platforms, such as online dating services and job recommendations, have become increasingly prevalent. For the success of these platforms, it is crucial to design reciprocal recommender systems (RRSs) that not only increase the total number of matches but also avoid creating unfairness among users. In this paper, we investigate the fairness of RRSs on matching platforms. From the perspective of fair division, we define the users’ opportunities to be recommended and establish the fairness concept of envy-freeness in the allocation of these opportunities. We first introduce the Social Welfare (SW) method, which approximately maximizes the number of matches, and show that it leads to significant unfairness in recommendation opportunities, illustrating the trade-off between fairness and match rates. To address this challenge, we propose the Nash Social Welfare (NSW) method, which alternately optimizes two NSW functions and achieves nearly envy-free recommendations. We further generalize the SW and NSW method to the \alpha -SW method, which balances the trade-off between fairness and high match rates. Additionally, we develop a computationally efficient approximation algorithm for the SW/NSW/ \alpha -SW methods based on the Sinkhorn algorithm. Through extensive experiments on both synthetic datasets and two real-world datasets, we demonstrate the practical effectiveness of our approach.
[IR-5] More Than Efficiency: Embedding Compression Improves Domain Adaptation in Dense Retrieval
链接: https://arxiv.org/abs/2601.13525
作者: Chunsheng Zuo,Daniel Khashabi
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Dense retrievers powered by pretrained embeddings are widely used for document retrieval but struggle in specialized domains due to the mismatches between the training and target domain distributions. Domain adaptation typically requires costly annotation and retraining of query-document pairs. In this work, we revisit an overlooked alternative: applying PCA to domain embeddings to derive lower-dimensional representations that preserve domain-relevant features while discarding non-discriminative components. Though traditionally used for efficiency, we demonstrate that this simple embedding compression can effectively improve retrieval performance. Evaluated across 9 retrievers and 14 MTEB datasets, PCA applied solely to query embeddings improves NDCG@10 in 75.4% of model-dataset pairs, offering a simple and lightweight method for domain adaptation.
[IR-6] Integrating Vision-Centric Text Understanding for Conversational Recommender Systems
链接: https://arxiv.org/abs/2601.13505
作者: Wei Yuan,Shutong Qiao,Tong Chen,Quoc Viet Hung Nguyen,Zi Huang,Hongzhi Yin
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Conversational Recommender Systems (CRSs) have attracted growing attention for their ability to deliver personalized recommendations through natural language interactions. To more accurately infer user preferences from multi-turn conversations, recent works increasingly expand conversational context (e.g., by incorporating diverse entity information or retrieving related dialogues). While such context enrichment can assist preference modeling, it also introduces longer and more heterogeneous inputs, leading to practical issues such as input length constraints, text style inconsistency, and irrelevant textual noise, thereby raising the demand for stronger language understanding ability. In this paper, we propose STARCRS, a Screen-Text-AwaRe Conversational Recommender System that integrates two complementary text understanding modes: (1) a screen-reading pathway that encodes auxiliary textual information as visual tokens, mimicking skim reading on a screen, and (2) an LLM-based textual pathway that focuses on a limited set of critical content for fine-grained reasoning. We design a knowledge-anchored fusion framework that combines contrastive alignment, cross-attention interaction, and adaptive gating to integrate the two modes for improved preference modeling and response generation. Extensive experiments on two widely used benchmarks demonstrate that STARCRS consistently improves both recommendation accuracy and generated response quality.
[IR-7] Guidelines for the Creation of an Annotated Corpus
链接: https://arxiv.org/abs/2601.13353
作者: Bahdja Boudoua,Nadia Guiffant,Mathieu Roche,Maguelonne Teisseire,Annelise Tran
类目: Information Retrieval (cs.IR)
*备注: 8 pages, 3 figures
Abstract:This document, based on feedback from UMR TETIS members and the scientific literature, provides a generic methodology for creating annotation guidelines and annotated textual datasets (corpora). It covers methodological aspects, as well as storage, sharing, and valorization of the data. It includes definitions and examples to clearly illustrate each step of the process, thus providing a comprehensive framework to support the creation and use of corpora in various research contexts.
[IR-8] Rules Resources and Restrictions: A Taxonomy of Task-Based Information Request Intents SIGIR
链接: https://arxiv.org/abs/2601.12985
作者: Melanie A. Kilian,David Elsweiler
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 1 figure, to be published in: 2026 ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '26), March 22-26, 2026, Seattle, WA, USA. ACM, New York, NY, USA, 11 pages. this https URL
Abstract:Understanding and classifying query intents can improve retrieval effectiveness by helping align search results with the motivations behind user queries. However, existing intent taxonomies are typically derived from system log data and capture mostly isolated information needs, while the broader task context often remains unaddressed. This limitation becomes increasingly relevant as interactions with Large Language Models (LLMs) expand user expectations from simple query answering toward comprehensive task support, for example, with purchasing decisions or in travel planning. At the same time, current LLMs still struggle to fully interpret complex and multifaceted tasks. To address this gap, we argue for a stronger task-based perspective on query intent. Drawing on a grounded-theory-based interview study with airport information clerks, we present a taxonomy of task-based information request intents that bridges the gap between traditional query-focused approaches and the emerging demands of AI-driven task-oriented search.
[IR-9] Audit du système dinformation et du modèle de gouvernance de la Bibliothèque Numérique de lEspace universitaire Francophone (BNEUF) du projet Initiative pour le Développement du Numérique dans lEspace Universitaire Francophone (IDNEUF)
链接: https://arxiv.org/abs/2601.12902
作者: Mokhtar Ben Henda(MICA)
类目: Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: in French language
Abstract:This document provides an assessment of the overall structure of the BNEUF system and how it operates within the framework of the Initiative for Digital Development in French speaking Universities (IDNEUF). This report aims to support the AUF’s new strategy for 2021-2025, with its new structural and governance foundations for the implementation of the Francophonie scientifique project. It was therefore decided to reorganize existing and future digital resources and services with a view to incorporating them into the future global collaborative platform for integrated services. This report provides an external assessment with new forms of organization and use of the BNEUF system. The aim is to provide the AUF project team with new avenues for optimized management of the compiled digital resources and to synergize them with the related modules of the Atlas of Expertise and the Francophone Social Network.
[IR-10] he Unfairness of Multifactorial Bias in Recommendation
链接: https://arxiv.org/abs/2601.12828
作者: Masoud Mansoury,Jin Huang,Mykola Pechenizkiy,Herke van Hoof,Maarten de Rijke
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Popularity bias and positivity bias are two prominent sources of bias in recommender systems. Both arise from input data, propagate through recommendation models, and lead to unfair or suboptimal outcomes. Popularity bias occurs when a small subset of items receives most interactions, while positivity bias stems from the over-representation of high rating values. Although each bias has been studied independently, their combined effect, to which we refer to as multifactorial bias, remains underexplored. In this work, we examine how multifactorial bias influences item-side fairness, focusing on exposure bias, which reflects the unequal visibility of items in recommendation outputs. Through simulation studies, we find that positivity bias is disproportionately concentrated on popular items, further amplifying their over-exposure. Motivated by this insight, we adapt a percentile-based rating transformation as a pre-processing strategy to mitigate multifactorial bias. Experiments using six recommendation algorithms across four public datasets show that this approach improves exposure fairness with negligible accuracy loss. We also demonstrate that integrating this pre-processing step into post-processing fairness pipelines enhances their effectiveness and efficiency, enabling comparable or better fairness with reduced computational cost. These findings highlight the importance of addressing multifactorial bias and demonstrate the practical value of simple, data-driven pre-processing methods for improving fairness in recommender systems.
[IR-11] HyFormer: Revisiting the Roles of Sequence Modeling and Feature Interaction in CTR Prediction
链接: https://arxiv.org/abs/2601.12681
作者: Yunwen Huang,Shiyong Hong,Xijun Xiao,Jinqiu Jin,Xuanyuan Luo,Zhe Wang,Zheng Chai,Shikang Wu,Yuchao Zheng,Jingjian Lin
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Industrial large-scale recommendation models (LRMs) face the challenge of jointly modeling long-range user behavior sequences and heterogeneous non-sequential features under strict efficiency constraints. However, most existing architectures employ a decoupled pipeline: long sequences are first compressed with a query-token based sequence compressor like LONGER, followed by fusion with dense features through token-mixing modules like RankMixer, which thereby limits both the representation capacity and the interaction flexibility. This paper presents HyFormer, a unified hybrid transformer architecture that tightly integrates long-sequence modeling and feature interaction into a single backbone. From the perspective of sequence modeling, we revisit and redesign query tokens in LRMs, and frame the LRM modeling task as an alternating optimization process that integrates two core components: Query Decoding which expands non-sequential features into Global Tokens and performs long sequence decoding over layer-wise key-value representations of long behavioral sequences; and Query Boosting which enhances cross-query and cross-sequence heterogeneous interactions via efficient token mixing. The two complementary mechanisms are performed iteratively to refine semantic representations across layers. Extensive experiments on billion-scale industrial datasets demonstrate that HyFormer consistently outperforms strong LONGER and RankMixer baselines under comparable parameter and FLOPs budgets, while exhibiting superior scaling behavior with increasing parameters and FLOPs. Large-scale online A/B tests in high-traffic production systems further validate its effectiveness, showing significant gains over deployed state-of-the-art models. These results highlight the practicality and scalability of HyFormer as a unified modeling framework for industrial LRMs.
[IR-12] Information Farming: From Berry Picking to Berry Growing
链接: https://arxiv.org/abs/2601.12544
作者: Leif Azzopardi,Adam Roegiest
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注: ACM CHIIR 2026
Abstract:The classic paradigms of Berry Picking and Information Foraging Theory have framed users as gatherers, opportunistically searching across distributed sources to satisfy evolving information needs. However, the rise of GenAI is driving a fundamental transformation in how people produce, structure, and reuse information - one that these paradigms no longer fully capture. This transformation is analogous to the Neolithic Revolution, when societies shifted from hunting and gathering to cultivation. Generative technologies empower users to “farm” information by planting seeds in the form of prompts, cultivating workflows over time, and harvesting richly structured, relevant yields within their own plots, rather than foraging across others people’s patches. In this perspectives paper, we introduce the notion of Information Farming as a conceptual framework and argue that it represents a natural evolution in how people engage with information. Drawing on historical analogy and empirical evidence, we examine the benefits and opportunities of information farming, its implications for design and evaluation, and the accompanying risks posed by this transition. We hypothesize that as GenAI technologies proliferate, cultivating information will increasingly supplant transient, patch-based foraging as a dominant mode of engagement, marking a broader shift in human-information interaction and its study.
[IR-13] Facet-Aware Multi-Head Mixture-of-Experts Model with Text-Enhanced Pre-training for Sequential Recommendation WSDM
链接: https://arxiv.org/abs/2601.12301
作者: Mingrui Liu,Sixiao Zhang,Cheng Long
类目: Information Retrieval (cs.IR)
*备注: Extended from WSDM paper. arXiv admin note: substantial text overlap with arXiv:2411.01457
Abstract:Sequential recommendation (SR) systems excel at capturing users’ dynamic preferences by leveraging their interaction histories. Most existing SR systems assign a single embedding vector to each item to represent its features, adopting various models to combine these embeddings into a sequence representation that captures user intent. However, we argue that this representation alone is insufficient to capture an item’s multi-faceted nature (e.g., movie genres, starring actors). Furthermore, users often exhibit complex and varied preferences within these facets (e.g., liking both action and musical films within the genre facet), which are challenging to fully represent with static identifiers. To address these issues, we propose a novel architecture titled Facet-Aware Multi-Head Mixture-of-Experts Model for Sequential Recommendation (FAME). We leverage sub-embeddings from each head in the final multi-head attention layer to predict the next item separately, effectively capturing distinct item facets. A gating mechanism then integrates these predictions by dynamically determining their importance. Additionally, we introduce a Mixture-of-Experts (MoE) network within each attention head to disentangle varied user preferences within each facet, utilizing a learnable router network to aggregate expert outputs based on context. Complementing this architecture, we design a Text-Enhanced Facet-Aware Pre-training module to overcome the limitations of randomly initialized embeddings. By utilizing a pre-trained text encoder and employing an alternating supervised contrastive learning objective, we explicitly disentangle facet-specific features from textual metadata (e.g., descriptions) before sequential training begins. This ensures that the item embeddings are semantically robust and aligned with the downstream multi-facet framework.
[IR-14] Agent ic-R: Learning to Retrieve for Agent ic Search
链接: https://arxiv.org/abs/2601.11888
作者: Wenhan Liu,Xinyu Ma,Yutao Zhu,Yuchen Li,Daiting Shi,Dawei Yin,Zhicheng Dou
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Agentic search has recently emerged as a powerful paradigm, where an agent interleaves multi-step reasoning with on-demand retrieval to solve complex questions. Despite its success, how to design a retriever for agentic search remains largely underexplored. Existing search agents typically rely on similarity-based retrievers, while similar passages are not always useful for final answer generation. In this paper, we propose a novel retriever training framework tailored for agentic search. Unlike retrievers designed for single-turn retrieval-augmented generation (RAG) that only rely on local passage utility, we propose to use both local query-passage relevance and global answer correctness to measure passage utility in a multi-turn agentic search. We further introduce an iterative training strategy, where the search agent and the retriever are optimized bidirectionally and iteratively. Different from RAG retrievers that are only trained once with fixed questions, our retriever is continuously improved using evolving and higher-quality queries from the agent. Extensive experiments on seven single-hop and multi-hop QA benchmarks demonstrate that our retriever, termed \ours, consistently outperforms strong baselines across different search agents. Our codes are available at: this https URL.
[IR-15] Cultural Analytics for Good: Building Inclusive Evaluation Frameworks for Historical IR
链接: https://arxiv.org/abs/2601.11874
作者: Suchana Datta,Dwaipayan Roy,Derek Greene,Gerardine Meaney,Karen Wade,Philipp Mayr
类目: Information Retrieval (cs.IR)
*备注:
Abstract:This work bridges the fields of information retrieval and cultural analytics to support equitable access to historical knowledge. Using the British Library BL19 digital collection (more than 35,000 works from 1700-1899), we construct a benchmark for studying changes in language, terminology and retrieval in the 19th-century fiction and non-fiction. Our approach combines expert-driven query design, paragraph-level relevance annotation, and Large Language Model (LLM) assistance to create a scalable evaluation framework grounded in human expertise. We focus on knowledge transfer from fiction to non-fiction, investigating how narrative understanding and semantic richness in fiction can improve retrieval for scholarly and factual materials. This interdisciplinary framework not only improves retrieval accuracy but also fosters interpretability, transparency, and cultural inclusivity in digital archives. Our work provides both practical evaluation resources and a methodological paradigm for developing retrieval systems that support richer, historically aware engagement with digital archives, ultimately working towards more emancipatory knowledge infrastructures.
[IR-16] GPU-Resident Inverted File Index for Streaming Vector Databases
链接: https://arxiv.org/abs/2601.11808
作者: Dongfang Zhao
类目: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
*备注:
Abstract:Vector search has emerged as the computational backbone of modern AI infrastructure, powering critical systems ranging from Vector Databases to Retrieval-Augmented Generation (RAG). While the GPU-accelerated Inverted File (IVF) index acts as one of the most widely used techniques for these large-scale workloads due to its memory efficiency, its traditional architecture remains fundamentally static. Existing designs rely on rigid and contiguous memory layouts that lack native support for in-place mutation, creating a severe bottleneck for streaming scenarios. In applications requiring real-time knowledge updates, such as live recommendation engines or dynamic RAG systems, maintaining index freshness necessitates expensive CPU-GPU roundtrips that cause system latency to spike from milliseconds to seconds. In this paper, we propose SIVF (Streaming Inverted File), a new GPU-native architecture designed to empower vector databases with high-velocity data ingestion and deletion capabilities. SIVF replaces the static memory layout with a slab-based allocation system and a validity bitmap, enabling lock-free and in-place mutation directly in VRAM. We further introduce a GPU-resident address translation table (ATT) to resolve the overhead of locating vectors, providing O(1) access to physical storage slots. We evaluate SIVF against the industry-standard GPU IVF implementation on the SIFT1M and GIST1M datasets. Microbenchmarks demonstrate that SIVF reduces deletion latency by up to 13,300\times (from 11.8 seconds to 0.89 ms on GIST1M) and improves ingestion throughput by 36\times to 105\times . In end-to-end sliding window scenarios, SIVF eliminates system freezes and achieves a 161\times to 266\times speedup with single-digit millisecond latency. Notably, this performance incurs negligible storage penalty, maintaining less than 0.8% memory overhead compared to static indices.
[IR-17] From HNSW to Information-Theoretic Binarization: Rethinking the Architecture of Scalable Vector Search
链接: https://arxiv.org/abs/2601.11557
作者: Seyed Moein Abtahi,Majid Fekri,Tara Khani,Akramul Azim
类目: Databases (cs.DB); Information Retrieval (cs.IR); Information Theory (cs.IT); Performance (cs.PF)
*备注: 16 Pages, 5 Figures, 3 Tables
Abstract:Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting in escalating operational cost and inherent trade-offs between latency, throughput, and retrieval accuracy. This paper analyzes the architectural limitations of the dominant “HNSW + float32 + cosine similarity” stack and evaluates existing cost-reduction strategies, including storage disaggregation and lossy vector quantization, which inevitably sacrifice either performance or accuracy. We introduce and empirically evaluate an alternative information-theoretic architecture based on maximally informative binarization (MIB), efficient bitwise distance metrics, and an information-theoretic scoring (ITS) mechanism. Unlike conventional ANN systems, this approach enables exhaustive search over compact binary representations, allowing deterministic retrieval and eliminating accuracy degradation under high query concurrency. Using the MAIR benchmark across 14 datasets and 10,038 queries, we compare this architecture against Elasticsearch, Pinecone, PGVector, and Qdrant. Results demonstrate retrieval quality comparable to full-precision systems, while achieving substantially lower latency and maintaining constant throughput at high request rates. We show that this architectural shift enables a truly serverless, cost-per-query deployment model, challenging the necessity of large in-memory ANN indexes for high-quality semantic search.

