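This post lists the latest papers retrieved from arXiv.org on 2026-01-16. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

Contents

Overview (2026-01-16)

514 papers were added today, including:

  • Natural language processing: 101 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 168 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 80 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 117 papers (Machine Learning (cs.LG))

Natural Language Processing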

[NLP-0] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

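[Quick read]: This paper targets a weakness of existing reinforcement learning methods for large language models (LLMs) on complex tasks: coarse-grained reward assignment makes it hard to tell how much each tool call contributed. Conventional methods assign the same advantage to every step in a trajectory, so they cannot separate effective tool calls from redundant or erroneous ones, especially in long-horizon, multi-turn interactions. The proposed MatchTIR framework has two core components: 1) bipartite matching between the predicted and ground-truth traces, which yields fine-grained turn-level rewards; 2) dual-level advantage estimation, which combines turn-level and trajectory-level signals to assign a distinct advantage to each interaction turn, balancing local step precision against global task success.

Link: https://arxiv.org/abs/2601.10712
Authors: Changle Qu,Sunhao Dai,Hengyi Cai,Jun Xu,Shuaiqiang Wang,Dawei Yin
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Baidu Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: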

Abstract:Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at this https URL.
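
The turn-level credit assignment summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity function `call_similarity`, the brute-force matcher `match_turn_rewards`, the mixing weight in `dual_level_advantages`, and the example tool calls are all assumptions made for illustration, and the paper's two assignment strategies are not reproduced here.

```python
from itertools import permutations

def call_similarity(pred, ref):
    """Hypothetical similarity between two tool calls given as (name, args)."""
    if pred[0] != ref[0]:
        return 0.0
    ref_args = ref[1]
    if not ref_args:
        return 1.0
    hits = sum(1 for key, value in ref_args.items() if pred[1].get(key) == value)
    return 0.5 + 0.5 * hits / len(ref_args)

def match_turn_rewards(predicted, reference, similarity=call_similarity):
    """One-to-one matching of predicted to reference calls.

    Brute force over assignments, which is only practical for short traces.
    Unmatched predicted turns keep a reward of 0.0.
    """
    rewards = [0.0] * len(predicted)
    if not predicted or not reference:
        return rewards
    swap = len(predicted) > len(reference)
    small, large = (reference, predicted) if swap else (predicted, reference)
    best_total, best_perm = -1.0, None
    for perm in permutations(range(len(large)), len(small)):
        total = 0.0
        for i, j in enumerate(perm):
            pred, ref = (large[j], small[i]) if swap else (small[i], large[j])
            total += similarity(pred, ref)
        if total > best_total:
            best_total, best_perm = total, perm
    for i, j in enumerate(best_perm):
        pred_idx, ref_idx = (j, i) if swap else (i, j)
        rewards[pred_idx] = similarity(predicted[pred_idx], reference[ref_idx])
    return rewards

def dual_level_advantages(turn_rewards, trajectory_reward, group_mean_reward, weight=0.5):
    """Assumed combination: trajectory-level advantage plus a centered turn-level term."""
    trajectory_adv = trajectory_reward - group_mean_reward
    mean_turn = sum(turn_rewards) / len(turn_rewards)
    return [trajectory_adv + weight * (r - mean_turn) for r in turn_rewards]

# Toy example: the second search call is redundant.
predicted = [("search", {"q": "weather Paris"}),
             ("search", {"q": "weather Paris"}),
             ("calc", {"expr": "2+2"})]
reference = [("search", {"q": "weather Paris"}),
             ("calc", {"expr": "2+2"})]

rewards = match_turn_rewards(predicted, reference)
advantages = dual_level_advantages(rewards, trajectory_reward=1.0, group_mean_reward=0.5)
```

In this toy trace the duplicated search call is left unmatched, so its advantage falls below that of the two matched turns even though all three turns share the same trajectory-level outcome.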

[NLP-1] Grounding Agent Memory in Contextual Intent

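[Quick read]: In long-horizon, goal-oriented interactions, similar entities and facts recur under different latent goals and constraints, so memory systems retrieve evidence that is semantically close but contextually mismatched, which leads to reasoning errors. The paper proposes STITCH (Structured Intent Tracking in Contextual History), a memory system built on structured intent tracking. It indexes each trajectory step along three dimensions: the current latent goal (which defines a thematic segment), the action type, and the salient entity types (which determine which attributes matter). Together these form a compact contextual intent signal used to filter and prioritize stored history. The mechanism suppresses history that is semantically similar but context-incompatible, reducing retrieval noise and making long-horizon reasoning more robust.

Link: https://arxiv.org/abs/2601.10702
Authors: Ruozhen Yang,Yucheng Jiang,Yueqi Jiang,Priyanka Kargupta,Yunyi Zhang,Jiawei Han
Affiliations: University of Illinois Urbana-Champaign; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: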

Abstract:Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step’s intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
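
As a rough illustration of intent-based filtering, the sketch below indexes memory items by a three-part intent and ranks them by compatibility before semantic similarity. The class names, the scoring weights in `intent_compatibility`, the `min_compat` threshold, and the sample memory are hypothetical; the paper's actual indexing and scoring may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    goal: str                 # current latent goal (thematic segment)
    action: str               # action type
    entity_types: frozenset   # salient entity types

@dataclass
class MemoryItem:
    text: str
    intent: Intent
    similarity: float  # semantic similarity to the query, assumed precomputed

def intent_compatibility(query, stored):
    """Assumed scoring: goal match dominates, then action and entity overlap."""
    score = 0.0
    if query.goal == stored.goal:
        score += 1.0
    if query.action == stored.action:
        score += 0.5
    union = query.entity_types | stored.entity_types
    if union:
        score += 0.5 * len(query.entity_types & stored.entity_types) / len(union)
    return score

def retrieve(query_intent, memory, top_k=2, min_compat=1.0):
    """Drop intent-incompatible items, then rank by compatibility and similarity."""
    scored = [(intent_compatibility(query_intent, m.intent), m) for m in memory]
    kept = [(c, m) for c, m in scored if c >= min_compat]
    kept.sort(key=lambda cm: (cm[0], cm[1].similarity), reverse=True)
    return [m for _, m in kept[:top_k]]

memory = [
    MemoryItem("Booked Hotel Lumiere, 300 per night",
               Intent("honeymoon", "book", frozenset({"hotel"})), similarity=0.95),
    MemoryItem("Booked Hotel Gare, 120 per night",
               Intent("business trip", "book", frozenset({"hotel"})), similarity=0.80),
]
query = Intent("business trip", "lookup", frozenset({"hotel"}))
results = retrieve(query, memory)
```

The honeymoon booking has the higher semantic similarity but belongs to a different goal, so the filter removes it and only the business-trip entry is returned.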

[NLP-2] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

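[Quick read]: Concept-based explanation methods lack reliable evaluation benchmarks for high-stakes domains. In particular, existing benchmarks measure the faithfulness of explanations with costly, imperfect human-written counterfactuals. The paper proposes LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets), an automated framework grounded in Structured Causal Models (SCMs). It intervenes on a specific concept in the text-generation process and uses a large language model (LLM) to generate structural counterfactual pairs, producing high-quality, scalable evaluation datasets. The approach makes evaluation more objective and systematic, introduces a new metric, order-faithfulness, for more rigorous comparison across models and explanation methods, and shows that proprietary LLMs have reduced sensitivity to demographic concepts, providing a basis for developing more faithful explainability methods.

Link: https://arxiv.org/abs/2601.10700
Authors: Gilat Toker,Nitay Calderon,Ohad Amosy,Roi Reichart
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: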

Abstract:Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation, interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.

[NLP-3] Detecting Winning Arguments with Large Language Models and Persuasion Strategies

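[Quick read]: This paper addresses how to detect persuasiveness in argumentative text, a task with important implications for understanding human communication. The core challenge is to identify and quantify how persuasion strategies (such as attack on reputation, distraction, and manipulative wording) affect a text's overall persuasiveness. The proposed Multi-Strategy Persuasion Scoring approach uses large language models (LLMs) to reason in a structured way over six specific persuasion strategies, improving the accuracy and interpretability of persuasiveness prediction. Experiments on annotated argument datasets show that this strategy-guided prompting improves persuasiveness prediction.

Link: https://arxiv.org/abs/2601.10660
Authors: Tiziano Labruna,Arkadiusz Modzelewski,Giorgio Satta,Giovanni Da San Martino
Affiliations: University of Padua; Polish-Japanese Academy of Information Technology
Categories: Computation and Language (cs.CL)
Comments: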

Abstract:Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Argument dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.

[NLP-4] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

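[Quick read]: Large language models (LLMs) are often overconfident: the confidence they verbalize does not consistently align with their factual accuracy, which undermines user trust in their outputs. To understand where this confidence comes from, the authors propose TracVC (Tracing Verbalized Confidence), which combines information retrieval and influence estimation to trace a model's generated confidence expressions back to specific training examples. The key addition is a new metric, content groundness, which measures whether the model grounds its confidence in training examples related to the content of the question and answer rather than in generic examples of confidence verbalization. Experiments show that some models, such as OLMo2-13B, are frequently influenced by confidence-related examples that are lexically unrelated to the query, suggesting that they may imitate surface expressions of certainty instead of relying on genuine content grounding. This points to a fundamental limitation of current training regimes: models may learn how to sound confident without learning when confidence is justified.

Link: https://arxiv.org/abs/2601.10645
Authors: Yuxi Xia,Loris Schoenegger,Benjamin Roth
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: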

Abstract:Large language models (LLMs) can increase users’ perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs’ trustworthiness in expressing more reliable confidence.

[NLP-5] Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

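[Quick read]: This paper addresses the weak defense of safety-aligned large language models (LLMs) against adversarial jailbreak attacks. Current methods depend on static external red teaming, using fixed defense prompts or pre-collected adversarial datasets, so defenses overfit known attack patterns and fail against novel, sophisticated threats. The proposed Safety Self-Play (SSP) framework has a single LLM act as both the attacker (generating jailbreaks) and the defender (refusing harmful requests) within a unified reinforcement learning (RL) loop, so attack and defense strategies evolve together. To keep the defender focused on critical safety issues, the authors add a Reflective Experience Replay Mechanism that uses Upper Confidence Bound (UCB) sampling to prioritize low-reward failure cases, balancing exploration and exploitation and strengthening the model's proactive safety alignment.

Link: https://arxiv.org/abs/2601.10589
Authors: Hao Wang,Yanting Wang,Hao Li,Rui Li,Lei Sha
Affiliations: Beihang University; Peking University; Zhongguancun Laboratory
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: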

Abstract:Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial "jailbreak" attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs an Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
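
One plausible reading of UCB sampling over an experience pool is sketched below. The score (one minus mean reward, plus an exploration bonus), the pool layout, and the constant `c` are assumptions for illustration and are not taken from the paper.

```python
import math

def ucb_score(mean_reward, times_replayed, total_replays, c=1.0):
    """Priority of a stored experience: low reward and few replays both raise it."""
    if times_replayed == 0:
        return float("inf")  # never-replayed experiences are tried first
    difficulty = 1.0 - mean_reward
    bonus = c * math.sqrt(math.log(total_replays) / times_replayed)
    return difficulty + bonus

def select_experience(pool, c=1.0):
    """Pick the experience with the highest UCB score."""
    total = max(sum(e["replays"] for e in pool), 1)
    return max(pool, key=lambda e: ucb_score(e["mean_reward"], e["replays"], total, c))

# Hypothetical pool: rewards are the defender's past scores on each attack.
pool = [
    {"prompt": "attack A", "mean_reward": 0.9, "replays": 10},
    {"prompt": "attack B", "mean_reward": 0.2, "replays": 10},
    {"prompt": "attack C", "mean_reward": 0.5, "replays": 1},
]

explore_pick = select_experience(pool)          # bonus favors the rarely replayed case
exploit_pick = select_experience(pool, c=0.0)   # pure focus on the lowest reward
```

With the exploration bonus active the rarely replayed attack C is chosen, while setting `c=0.0` selects attack B, the case with the lowest reward. That contrast is the exploration-exploitation trade-off the abstract mentions.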

[NLP-6] Form and Meaning in Intrinsic Multilingual Evaluations EACL2026

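[Quick read]: This paper examines whether intrinsic evaluation metrics for conditional language models (CLMs), such as perplexity and bits-per-character, are comparable in multilingual settings. It makes explicit the assumptions these metrics rest on in that setting. One example is the assumption that comparing perplexity on parallel sentences indicates model quality because the semantic content is the same, whereas the metrics actually measure information content in the information-theoretic sense, not semantic equivalence. Experiments with six metrics on two multi-parallel corpora show that current metrics are not universally comparable, and the authors draw on the form-meaning debate for an explanation: differences in form across languages can lead the metrics to misjudge model performance.

Link: https://arxiv.org/abs/2601.10580
Authors: Wessel Poelman,Miryam de Lhoneux
Affiliations: KU Leuven
Categories: Computation and Language (cs.CL)
Comments: EACL 2026: Main Conference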

Abstract:Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.
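
For reference, the two metrics named in the abstract can be computed from per-token log-probabilities as in the sketch below. The numbers are invented: two hypothetical parallel sentences that cost a model the same 8 bits in total but differ in token and character counts.

```python
import math

def perplexity(token_logprobs):
    """Perplexity: exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_character(token_logprobs, text):
    """Total negative log-likelihood in bits, divided by the character count."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)

# Hypothetical sentence A: 4 tokens at 2 bits each, 16 characters.
logprobs_a = [math.log(1 / 4)] * 4
text_a = "x" * 16   # placeholder for a 16-character sentence

# Hypothetical sentence B: 2 tokens at 4 bits each, 8 characters.
logprobs_b = [math.log(1 / 16)] * 2
text_b = "y" * 8    # placeholder for an 8-character sentence

ppl_a, ppl_b = perplexity(logprobs_a), perplexity(logprobs_b)
bpc_a = bits_per_character(logprobs_a, text_a)
bpc_b = bits_per_character(logprobs_b, text_b)
```

Both sentences carry the same total information under the model, yet perplexity is about 4 versus 16 and bits-per-character about 0.5 versus 1.0. The metrics depend on how each language segments into tokens and characters, which is one way form can affect a cross-lingual comparison.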

[NLP-7] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

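[Quick read]: This paper addresses selective knowledge erasure in large language models (LLMs): achieving genuine knowledge removal rather than surface-level behavioral suppression, as needed for General Data Protection Regulation (GDPR) compliance and model safety. Existing methods conflate behavioral suppression with true knowledge removal, so latent capabilities persist inside the model. The proposed Knowledge Immunization Framework (KIF) is representation-aware: it identifies and intervenes on subject-specific internal activation signatures instead of controlling only output behavior, which lets it distinguish genuine erasure from obfuscation. KIF combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, giving durable erasure without full retraining. It achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), breaking the stability-erasure trade-off that constrained prior work.

Link: https://arxiv.org/abs/2601.10566
Authors: Syed Naveed Mahmood,Md. Rezaur Rahman Bhuiyan,Tasfia Zaman,Jareen Tasneem Khondaker,Md. Sameer Sakib,Nazia Tasnim,Farig Sadeque
Affiliations: BRAC University; Boston University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 4 figures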

Abstract:Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.

[NLP-8] Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems

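[Quick read]: Multi-agent systems (MAS) incur high inference latency from multi-step reasoning and repeated model invocations, which severely limits their scalability and usability in time-sensitive scenarios. Existing methods mainly optimize task performance and inference cost and explicitly or implicitly assume sequential execution, so they control latency poorly under parallel execution. The proposed Latency-Aware Multi-agent System (LAMaS) is an orchestration framework that uses explicit latency supervision to optimize the critical execution path under parallel execution, allowing the controller to construct execution topology graphs with lower latency. Experiments show a 38-46% reduction in critical path length across multiple benchmarks compared with the state-of-the-art baseline for multi-agent architecture search, with task performance maintained or improved.

Link: https://arxiv.org/abs/2601.10560
Authors: Xi Shi,Mengxin Zheng,Qian Lou
Affiliations: University of Central Florida
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint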

Abstract:Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at this https URL
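
The critical execution path of an agent topology is the longest latency-weighted path through its dependency graph. The sketch below computes it with the standard library `graphlib` module (Python 3.9 or later). The agent names and latencies are made up, and this only illustrates the quantity being optimized, not the LAMaS controller itself.

```python
from graphlib import TopologicalSorter

def critical_path_latency(latency, depends_on):
    """Longest latency-weighted path through a DAG of agent calls.

    latency: dict mapping node -> seconds for that call.
    depends_on: dict mapping node -> iterable of nodes it must wait for.
    """
    finish = {}
    for node in TopologicalSorter(depends_on).static_order():
        # A node starts once its slowest prerequisite has finished.
        start = max((finish[d] for d in depends_on.get(node, ())), default=0.0)
        finish[node] = start + latency[node]
    return max(finish.values())

# Hypothetical per-agent latencies in seconds.
latency = {"planner": 1.0, "solver_a": 3.0, "solver_b": 2.0, "judge": 1.0}

# Sequential topology: each agent waits for the previous one.
sequential = {"solver_a": ["planner"], "solver_b": ["solver_a"], "judge": ["solver_b"]}

# Parallel topology: both solvers run after the planner, the judge waits for both.
parallel = {"solver_a": ["planner"], "solver_b": ["planner"],
            "judge": ["solver_a", "solver_b"]}

sequential_latency = critical_path_latency(latency, sequential)
parallel_latency = critical_path_latency(latency, parallel)
```

The same four calls take 7 seconds end to end when chained, but 5 seconds when the two solvers run in parallel, because only the slower solver sits on the critical path.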

[NLP-9] Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

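[Quick read]: Despite extensive safety alignment, large language models (LLMs) deployed in practice can still be induced to generate harmful content by jailbreak attacks. Existing defenses, such as decoding-based constraints and post-hoc content detectors, often fail against sophisticated jailbreaks or degrade model utility. By analyzing the decoding process, the authors observe that even a successfully jailbroken model exhibits latent safety-related signals internally, but these signals are overridden by the model's drive for fluent continuation and therefore do not trigger timely self-correction or refusal. They propose a simple approach that explicitly surfaces and uses these latent signals to detect unsafe content early during generation, improving safety while keeping over-refusal on benign inputs low and preserving response quality.

Link: https://arxiv.org/abs/2601.10543
Authors: Yinzhi Zhao,Ming Wang,Shi Feng,Xiaocui Yang,Daling Wang,Yifei Zhang
Affiliations: Northeastern University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: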

Abstract:Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often intervening robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model’s drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over-refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety-awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: this https URL.

[NLP-10] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

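[Quick read]: Large language models (LLMs) often fail to provide substantive emotional support in human-centric applications. Existing reinforcement learning (RL) approaches to improving empathy use reward models that evaluate empathy from a single perspective, overlooking that empathy is inherently a bidirectional interaction between supporter and seeker. The proposed Psychology-grounded Empathetic Reward Modeling (PERM) operationalizes empathy through a bidirectional decomposition: from the supporter's perspective it assesses internal resonation and communicative expression, and from the seeker's perspective it evaluates emotional reception. It also adds a bystander perspective that monitors overall interaction quality, capturing the empathy process more completely.

Link: https://arxiv.org/abs/2601.10532
Authors: Chengbing Wang,Wuqiang Zheng,Yang Zhang,Fengbin Zhu,Junyi Cheng,Yi Xie,Wenjie Wang,Fuli Feng
Affiliations: University of Science and Technology of China; National University of Singapore; Huawei Technologies
Categories: Computation and Language (cs.CL)
Comments: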

Abstract:Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at this https URL.

[NLP-11] A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

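[Quick read]: Safety evaluation of frontier generative AI models, including large language models (LLMs) and multimodal large language models (MLLMs), is fragmented: existing evaluations are often limited to a single modality or threat model and do not give a full picture of real-world safety. The report applies a unified protocol that integrates four dimensions (benchmark, adversarial, multilingual, and compliance evaluation) to 7 frontier models. It aggregates the results into safety leaderboards and model safety profiles that reveal heterogeneous safety performance across evaluation modes, providing a more accurate basis for risk assessment in responsible model development and deployment.

Link: https://arxiv.org/abs/2601.10527
Authors: Xingjun Ma,Yixu Wang,Hengyuan Xu,Yutao Wu,Yifan Ding,Yunhan Zhao,Zilong Wang,Jiabin Hua,Ming Wen,Jianan Liu,Ranjie Duan,Yifeng Gao,Yingshui Tan,Yunhao Chen,Hui Xue,Xin Wang,Wei Cheng,Jingjing Chen,Zuxuan Wu,Bo Li,Yu-Gang Jiang
Affiliations: Fudan University; Shanghai Innovation Institute; Deakin University; UIUC
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 42 pages, 24 figures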

Abstract:The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has produced substantial gains in reasoning, perception, and generative capability across language and vision. However, whether these advances yield commensurate improvements in safety remains unclear, in part due to fragmented evaluation practices limited to single modalities or threat models. In this report, we present an integrated safety evaluation of 7 frontier models: GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5. We evaluate each model across language, vision-language, and image generation settings using a unified protocol that integrates benchmark evaluation, adversarial evaluation, multilingual evaluation, and compliance evaluation. Aggregating our evaluations into safety leaderboards and model safety profiles across multiple evaluation modes reveals a sharply heterogeneous safety landscape. While GPT-5.2 demonstrates consistently strong and balanced safety performance across evaluations, other models exhibit pronounced trade-offs among benchmark safety, adversarial alignment, multilingual generalization, and regulatory compliance. Both language and vision-language modalities show significant vulnerability under adversarial evaluation, with all models degrading substantially despite strong results on standard benchmarks. Text-to-image models achieve relatively stronger alignment in regulated visual risk categories, yet remain brittle under adversarial or semantically ambiguous prompts. Overall, these results show that safety in frontier models is inherently multidimensional–shaped by modality, language, and evaluation scheme, underscoring the need for standardized safety evaluations to accurately assess real-world risk and guide responsible model development and deployment.

[NLP-12] AEQ-Bench: Measuring Empathy of Omni-Modal Large Models

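[Quick read]: This paper addresses the challenge of evaluating empathy in omni-modal large models (OLMs), in particular how to measure a model's understanding of affective cues in combined audio and text input and its ability to judge the empathy expressed in spoken responses. The proposed AEQ-Bench (Audio Empathy Quotient Benchmark) systematically assesses two core empathetic capabilities: (i) generating empathetic responses from multi-modal input (audio + text); (ii) judging the empathy of audio responses without relying on a text transcription. The benchmark introduces two new settings that vary in context specificity and speech tone. Results show that models trained with audio output capabilities generally outperform models with text-only output, but OLMs remain unreliable at evaluating fine-grained paralinguistic expressiveness.

Link: https://arxiv.org/abs/2601.10513
Authors: Xuan Luo,Lewei Yao,Libo Zhao,Lanqing Hong,Kai Chen,Dehua Tao,Daxin Tan,Ruifeng Xu,Jing Li
Affiliations: The Hong Kong Polytechnic University; The Harbin Institute of Technology, Shenzhen; Huawei; Hong Kong University of Science and Technology; Shenzhen Loop Area Institute
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: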

Abstract:While the automatic evaluation of omni-modal large models (OLMs) is essential, assessing empathy remains a significant challenge due to its inherent affectivity. To investigate this challenge, we introduce AEQ-Bench (Audio Empathy Quotient Benchmark), a novel benchmark to systematically assess two core empathetic capabilities of OLMs: (i) generating empathetic responses by comprehending affective cues from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Compared to existing benchmarks, AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that (1) OLMs trained with audio output capabilities generally outperformed models with text-only outputs, and (2) while OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.

[NLP-13] DR-Arena: an Automated Evaluation Framework for Deep Research Agents

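[Quick read]: Reliable evaluation has become a bottleneck for large language models (LLMs) acting as Deep Research (DR) agents that investigate autonomously and synthesize information. Existing benchmarks mostly rely on static datasets, which suffer from limited task generality, temporal misalignment, and data contamination. The proposed DR-Arena framework builds Information Trees dynamically from real-time web trends so that the evaluation rubric stays synchronized with the state of the world, and uses an automated Examiner to generate structured tasks testing two orthogonal capabilities: deep reasoning and wide coverage. It also adopts an Adaptive Evolvement Loop that raises task complexity based on real-time performance until a decisive capability boundary emerges, giving reliable, automated, and efficient evaluation of DR agents.

Link: https://arxiv.org/abs/2601.10504
Authors: Yiwen Gao,Ruochen Zhao,Yang Deng,Wenxuan Zhang
Affiliations: National University of Singapore; Nanyang Technological University; Singapore Management University; Singapore University of Technology and Design
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 8 figures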

Abstract:As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents the state-of-the-art alignment with human preferences without any manual efforts, validating DR-Arena as a reliable alternative for costly human adjudication.
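
The abstract reports agreement with a human leaderboard as a Spearman correlation. As a reminder of what that statistic measures, here is a small sketch of the rank-difference formula for data without ties. The scores for six agents are invented and are not results from the paper.

```python
def ranks(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Invented scores for six agents under two evaluation schemes.
automatic_scores = [90, 85, 80, 70, 65, 60]
human_scores = [1200, 1250, 1100, 1050, 1000, 950]

rho = spearman(automatic_scores, human_scores)
```

In this invented example the two rankings differ only by a swap of the top two agents, which gives a correlation of roughly 0.943. With only six items the statistic moves in coarse steps, which is worth keeping in mind when reading correlations computed over small leaderboards.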

[NLP-14] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models

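[Quick read]: A generative AI model that performs well on lab benchmarks may show markedly stronger bias in deployment as context changes. Existing evaluations usually fix the test conditions and ignore how contextual information carried by real prompts, such as place, time, or audience, affects the expression of bias. The authors propose the Contextual StereoSet benchmark, which holds stereotype content fixed while systematically varying contextual framing to measure how sensitive bias is to context. The core contribution is Context Sensitivity Fingerprints (CSF), a compact profile of per-dimension dispersion and paired contrasts with bootstrap confidence intervals and FDR correction. Two evaluation tracks are supported: a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The method asks under what conditions bias appears rather than simply whether a model is biased, making evaluation more robust and practical.

Link: https://arxiv.org/abs/2601.10460
Authors: Abhinaba Basu,Pavan Chakraborty
Affiliations: Indian Institute of Information Technology, Allahabad (IIITA); National Institute of Electronics and Information Technology (NIELIT)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: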

Abstract:A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences – no adversarial prompting required. We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes. We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases – a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The implication is methodological: bias scores from fixed-condition tests may not generalize. This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, “Under what conditions does bias appear?” rather than “Is this model biased?” We release our benchmark, code, and results.
Cite as: arXiv:2601.10460 [cs.CL] (this version: arXiv:2601.10460v1); DOI: https://doi.org/10.48550/arXiv.2601.10460 (pending registration). Submitted by Abhinaba Basu; v1: Thu, 15 Jan 2026 14:50:49 UTC (27 KB).
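
The abstract mentions paired contrasts with bootstrap CIs and FDR correction. The sketch below shows one common way to compute those two ingredients. It assumes a percentile bootstrap and the Benjamini-Hochberg procedure; the abstract does not say which variants the authors use, and the function names and sample numbers are hypothetical.

```python
import random


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[idx] <= q * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Hypothetical per-item differences in stereotype selection (context A minus context B).
example_diffs = [0.05, 0.08, 0.02, 0.10, 0.06]
print(bootstrap_ci(example_diffs))
# Hypothetical p-values, one per contextual contrast.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

A fingerprint in this spirit would hold one such interval and one reject/keep flag per contextual dimension.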

[NLP-15] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability

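【Quick read】: This paper asks whether current evaluation protocols reliably assess vision-language models (VLMs) in high-stakes settings such as surgical planning. Existing protocols rely on sequence-similarity metrics, which cannot judge plan quality accurately: they penalize valid plans and fail to flag invalid ones. The key to the solution is a rule-based meta-evaluation metric, built on expert-defined surgical rules and centered on phase-goal satisfiability, that serves as a high-precision reference. The metric exposes failures caused by perception errors and under-constrained reasoning, and shows that structural knowledge improves performance consistently, whereas semantic guidance alone is unreliable and helps larger models only when combined with structural constraints.

Link: https://arxiv.org/abs/2601.10455
Authors: Ruochen Li,Kun Yuan,Yufei Xia,Yue Zhou,Qingyu Lu,Weihang Li,Youxiang Zhu,Nassir Navab
Institutions: Technical University of Munich, Germany; University of Strasbourg, France; University of Glasgow, United Kingdom; Nanyang Technological University, Singapore; University of Massachusetts Boston, USA
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: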

Abstract:Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.
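
To make the contrast between sequence similarity and goal satisfiability concrete, here is a toy sketch. The step names and precedence rules are hypothetical and are not taken from the paper's benchmark; `difflib.SequenceMatcher` stands in for a generic sequence-similarity metric.

```python
from difflib import SequenceMatcher

# Hypothetical reference plan for a toy procedure.
REFERENCE = ["expose", "clip_artery", "clip_duct", "cut", "extract"]

# Hypothetical expert rules: (a, b) means step a must occur before step b.
RULES = [
    ("expose", "clip_artery"),
    ("expose", "clip_duct"),
    ("clip_artery", "cut"),
    ("clip_duct", "cut"),
    ("cut", "extract"),
]


def satisfies_rules(plan, rules=RULES):
    """True if every required step is present and all precedence rules hold."""
    pos = {step: i for i, step in enumerate(plan)}
    return all(a in pos and b in pos and pos[a] < pos[b] for a, b in rules)


def similarity(plan, reference=REFERENCE):
    """Sequence similarity to the reference plan, in [0, 1]."""
    return SequenceMatcher(None, plan, reference).ratio()


# A valid variation: the two clipping steps are swapped.
valid_variant = ["expose", "clip_duct", "clip_artery", "cut", "extract"]
# An invalid plan: cutting happens before the duct is clipped.
invalid_plan = ["expose", "clip_artery", "cut", "clip_duct", "extract"]

for name, plan in [("valid_variant", valid_variant), ("invalid_plan", invalid_plan)]:
    print(name, round(similarity(plan), 2), satisfies_rules(plan))
```

Both plans differ from the reference by a single swap, so a similarity score alone gives little indication of which one is acceptable; the rule check separates them.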

[NLP-16] Are Language Models Models?

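【Quick read】: This paper asks whether the current tendency to treat language models (LMs) as cognitive models is justified. Assessing LMs at each level of Marr's framework (computational theory, algorithmic-representational, and implementation), the author finds the claim clearly false at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. The author therefore argues that LMs are better regarded as tools than as cognitive models: calling them cognitive models overstates the case and unnecessarily feeds LLM hype.

Link: https://arxiv.org/abs/2601.10421
Authors: Philip Resnik
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages. This is an invited commentary under review at Behavioral and Brain Sciences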

Abstract:Futrell and Mahowald claim LMs “serve as model systems”, but an assessment at each of Marr’s three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.

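[NLP-17] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction

【Quick read】: This paper addresses the lack of an openly documented, reproducible end-to-end pipeline for training language models in morphologically rich and computationally under-resourced languages such as Romanian, covering tokenizer design, pretraining, compression, evaluation, and large-scale synthetic data generation. The key to the solution is TF3-RO, a complete Romanian-centric language modeling pipeline. It first trains Romanian-specific BPE and Unigram tokenizers on a linguistically informed corpus to mitigate the token inflation caused by Romanian morphology. It then pretrains a 51.65M-parameter LLaMA-style Transformer from scratch using long-sequence packed training. The model is subsequently compressed through quantization, structured pruning, and logit-based knowledge distillation into a compact 26.45M-parameter student model with tied embeddings for more efficient deployment. Finally, the distilled model generates three million Romanian-native synthetic fables under a controlled combinatorial prompting framework. A multi-dimensional evaluation suite is integrated across all stages to check model quality and linguistic consistency.

Link: https://arxiv.org/abs/2601.10410
Authors: Mihai Dan Nadas,Laura Diosan,Andreea Tomescu,Andrei Piscoran
Institutions: Babes-Bolyai University; KlusAI Labs
Categories: Computation and Language (cs.CL)
Comments: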

Abstract:Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework. Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.
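
The compression stage relies on logit-based knowledge distillation. As a minimal sketch of that general technique, the code below computes the widely used temperature-scaled KL divergence between teacher and student token distributions. The temperature value and the exact loss form are assumptions; the abstract does not give the TF3-RO loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the conventional rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical next-token logits over a 4-token vocabulary.
teacher = [2.0, 1.0, 0.1, -1.0]
student = [1.5, 1.2, 0.0, -0.5]
print(distillation_loss(student, teacher))
```

In practice this term is usually averaged over all token positions and often mixed with the ordinary cross-entropy loss on the training labels.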

[NLP-18] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

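【Quick read】: This paper addresses the long-standing neglect of low-resource Indian dialects in natural language processing (NLP), in particular the lack of high-quality annotated data and evaluation benchmarks for the many dialects of major languages such as Hindi and Odia. The key contribution is INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and two languages (Hindi and Odia), together with a multi-task benchmark built on it that covers dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Experiments show that fine-tuned Transformer models pretrained on Indian languages substantially outperform large language models (LLMs) on dialect classification; a hybrid AI approach performs best for dialect-to-language translation (BLEU = 61.32); and a rule-based-then-AI approach performs best in the reverse direction (BLEU = 48.44). The results point to combining rule-based and data-driven strategies for low-resource dialects.

Link: https://arxiv.org/abs/2601.10388
Authors: Tarun Sharma,Manikandan Ravikiran,Sourava Kumar Behera,Pramit Bhattacharya,Arnab Bhattacharya,Rohit Saluja
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: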

Abstract:Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist which contain standard Hindi and Odia languages, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task. While fine-tuned transformer based models pretrained on Indian languages substantially improve performance e.g., improving F1 from 19.6% to 89.8% on dialect classification. For dialect to language translation, we find that hybrid AI model achieves highest BLEU score of 61.32 compared to the baseline score of 23.36. Interestingly, due to complexity in generating dialect sentences, we observe that for language to dialect translation the “rule-based followed by AI” approach achieves best BLEU score of 48.44 compared to the baseline score of 27.59. INDIC-DIALECT thus is a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.

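[NLP-19] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

【Quick read】: This paper addresses "persona drift" in large language models during conversation, where a model departs from its default helpful and harmless Assistant identity and behaves inconsistently or harmfully. The core of the solution is to identify and exploit an "Assistant Axis" inside the model: an activation direction that captures how strongly the model is operating in its Assistant role. Restricting activations to a fixed region along this axis stabilizes behavior in situations that otherwise trigger drift, such as emotionally vulnerable users or requests for meta-reflection, and improves robustness to persona-based adversarial jailbreaks. The findings suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, which motivates training and steering strategies that anchor models more deeply to a coherent persona.

Link: https://arxiv.org/abs/2601.10387
Authors: Christina Lu,Jack Gallagher,Jonathan Michala,Kyle Fish,Jack Lindsey
Institutions: MATS; Anthropic Fellows Program; University of Oxford; Anthropic
Categories: Computation and Language (cs.CL)
Comments: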

Abstract:Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an “Assistant Axis,” which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model’s tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts “persona drift,” a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model’s processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios – and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
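
The idea of restricting activations to a fixed region along a direction can be sketched with a few lines of vector arithmetic. This is an illustrative guess at the geometry only: the function name, the bounds, and the two-dimensional vectors are hypothetical, and the abstract does not specify which layers or bounds the authors use.

```python
import math


def clamp_along_axis(activation, axis, lo, hi):
    """Clamp the projection of `activation` onto `axis` to the range [lo, hi].

    Only the component along the axis is changed; the orthogonal
    component of the activation is left untouched.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    proj = sum(h * u for h, u in zip(activation, unit))
    clamped = min(max(proj, lo), hi)
    return [h + (clamped - proj) * u for h, u in zip(activation, unit)]


# Hypothetical 2-d example: the axis is the first coordinate.
axis = [1.0, 0.0]
print(clamp_along_axis([5.0, 2.0], axis, lo=-1.0, hi=1.0))
print(clamp_along_axis([0.5, 3.0], axis, lo=-1.0, hi=1.0))
```

An activation whose projection already lies inside the allowed range passes through unchanged, so the intervention only acts when the model drifts outside it.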

[NLP-20] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

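【Quick read】: This paper addresses the limited ability of large language models (LLMs) to use tools effectively in multi-turn interactions, where the core obstacle is the scarcity of diverse and realistic multi-turn tool-use data. The key to the solution is GEM, a text-based data synthesis paradigm that generates and extracts multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow tool extraction, trajectory grounding, and complexity refinement. The authors further train a specialized Trajectory Synthesizer that distills the pipeline into an efficient end-to-end generator, significantly reducing computational cost while preserving trajectory quality, and thereby improving performance and generalization on multi-turn tool-calling tasks.

Link: https://arxiv.org/abs/2601.10355
Authors: Zhihao Xu,Rumei Li,Jiahuan Li,Rongxiang Weng,Jingang Wang,Xunliang Cai,Xiting Wang
Institutions: Renmin University of China; Meituan
Categories: Computation and Language (cs.CL)
Comments: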

Abstract:Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieve a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.

[NLP-21] SuS: Strategy-aware Surprise for Intrinsic Exploration

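【Quick read】: This paper addresses inefficient exploration in reinforcement learning, especially on complex tasks such as mathematical reasoning, where agents struggle to discover novel and useful strategies. Traditional curiosity methods based on state prediction error tend to ignore the consistency and change of the behavioral strategy, so exploration lacks direction and diversity. The key to the solution is the Strategy-aware Surprise (SuS) framework, which introduces two complementary intrinsic-motivation signals: Strategy Stability (SS), which measures how consistent the behavioral strategy is across time steps, and Strategy Surprise (SuS), which captures outcomes that are unexpected relative to the agent's current strategy representation. Combining the two signals with learned weights guides exploration more precisely and improves both accuracy and strategy diversity: on mathematical reasoning tasks, SuS improves Pass@1 by 17.4% and Pass@5 by 26.4% over baseline methods.

Link: https://arxiv.org/abs/2601.10349
Authors: Mark Kashirskiy,Ilya Makarov
Institutions: Higher School of Economics; ITMO University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: 8 pages, 7 figures, 3 tables. Code available at this https URL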

Abstract:We propose Strategy-aware Surprise (SuS), a novel intrinsic motivation framework that uses pre-post prediction mismatch as a novelty signal for exploration in reinforcement learning. Unlike traditional curiosity-driven methods that rely solely on state prediction error, SuS introduces two complementary components: Strategy Stability (SS) and Strategy Surprise (SuS). SS measures consistency in behavioral strategy across temporal steps, while SuS captures unexpected outcomes relative to the agent’s current strategy representation. Our combined reward formulation leverages both signals through learned weighting coefficients. We evaluate SuS on mathematical reasoning tasks using large language models, demonstrating significant improvements in both accuracy and solution diversity. Ablation studies confirm that removing either component results in at least 10% performance degradation, validating the synergistic nature of our approach. SuS achieves 17.4% improvement in Pass@1 and 26.4% improvement in Pass@5 compared to baseline methods, while maintaining higher strategy diversity throughout training.

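[NLP-22] Training-Trajectory-Aware Token Selection

【Quick read】: This paper addresses the frontier regime in which the student model already has strong reasoning ability and naive continual distillation yields limited gains or even degradation. The core difficulty is a training bottleneck: although the loss decreases monotonically, all performance metrics drop sharply at almost the same point and then recover only gradually. Further analysis traces this to a token-level bifurcation in confidence. Some tokens, the Imitation-Anchor Tokens, converge quickly and anchor optimization, while the remaining yet-to-learn tokens have their confidence suppressed and cannot be optimized effectively. The inability of these two token types to coexist is identified as the root cause of failure in continual distillation. The authors therefore propose Training-Trajectory-Aware Token Selection (T3S), which reconstructs the training objective at the token level to clear the optimization path for yet-to-learn tokens. Experiments show consistent gains in both autoregressive (AR) and dLLM settings.

Link: https://arxiv.org/abs/2601.10348
Authors: Zhanming Shen,Jiaqi Hu,Zeyu Qin,Hao Chen,Wentao Ye,Zenan Huang,Yihong Zhuang,Guoshan Lu,Junlin Zhou,Junbo Zhao
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: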

Abstract:Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. And the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3 yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all of 16B-scale no-think models.

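[NLP-23] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

【Quick read】: This paper addresses the under-examined ability of LLM-based coding agents to follow instructions specified by their coding scaffold, especially when constraints are heterogeneous and persist across interactions. The key contribution is the OctoBench benchmark, which includes 34 environments and 217 tasks instantiated under three scaffold types, paired with 7,098 objective checklist items. It also provides an automated observation-and-scoring toolkit that captures full agent trajectories and performs fine-grained checks, separating task solving from rule following and supporting training and evaluation that explicitly target heterogeneous instruction following.

Link: https://arxiv.org/abs/2601.10343
Authors: Deming Ding,Shichun Liu,Enhui Yang,Jiahang Lin,Ziying Chen,Shihan Dou,Honglin Guo,Weiyu Cheng,Pengyu Zhao,Chengjun Xiao,Qunhong Zeng,Qi Zhang,Xuanjing Huang,Qidi Xu,Tao Gui
Institutions: Fudan University; MiniMax; Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: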

Abstract:Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.

[NLP-24] Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

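【Quick read】: This paper addresses the security risks of "skills" in AI agent frameworks. Skills are modular packages that dynamically extend agent capabilities, yet they execute with implicit trust and minimal vetting, which creates a significant and so far uncharacterized attack surface. The key to the solution is SkillScan, a multi-stage detection framework that combines static analysis with LLM-based semantic classification to assess skills at scale. The authors collect 42,447 skills from two major marketplaces and systematically analyze 31,132 of them, identifying 14 vulnerability patterns, of which data exfiltration (13.3%) and privilege escalation (11.8%) are the most prevalent. Skills that bundle executable scripts are more likely to contain vulnerabilities than instruction-only skills (OR=2.12, p<0.001). The work yields a vulnerability taxonomy grounded in 8,126 vulnerable skills and a detection method with 86.7% precision and 82.5% recall, supporting the case for capability-based permission systems and mandatory security vetting.

Link: https://arxiv.org/abs/2601.10338
Authors: Yi Liu,Weizhe Wang,Ruitao Feng,Yao Zhang,Guangquan Xu,Gelei Deng,Yuekang Li,Leo Zhang
Institutions: Quantstamp; Tianjin University; Southern Cross University; School of Cybersecurity, Tianjin University; Nanyang Technological University; University of New South Wales; Griffith University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: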

Abstract:The rise of AI agent frameworks has introduced agent skills, modular packages containing instructions and executable code that dynamically extend agent capabilities. While this architecture enables powerful customization, skills execute with implicit trust and minimal vetting, creating a significant yet uncharacterized attack surface. We conduct the first large-scale empirical security analysis of this emerging ecosystem, collecting 42,447 skills from two major marketplaces and systematically analyzing 31,132 using SkillScan, a multi-stage detection framework integrating static analysis with LLM-based semantic classification. Our findings reveal pervasive security risks: 26.1% of skills contain at least one vulnerability, spanning 14 distinct patterns across four categories: prompt injection, data exfiltration, privilege escalation, and supply chain risks. Data exfiltration (13.3%) and privilege escalation (11.8%) are most prevalent, while 5.2% of skills exhibit high-severity patterns strongly suggesting malicious intent. We find that skills bundling executable scripts are 2.12x more likely to contain vulnerabilities than instruction-only skills (OR=2.12, p<0.001). Our contributions include: (1) a grounded vulnerability taxonomy derived from 8,126 vulnerable skills, (2) a validated detection methodology achieving 86.7% precision and 82.5% recall, and (3) an open dataset and detection toolkit to support future research. These results demonstrate an urgent need for capability-based permission systems and mandatory security vetting before this attack vector is further exploited.

[NLP-25] ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

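【Quick read】: This paper addresses two gaps in real-time streaming audio-video understanding: existing approaches typically offer incomplete modality support or lack autonomous proactive monitoring. The key to the solution is ROMA (Real-time Omni-Multimodal Assistant), whose main ideas are: 1) processing continuous input as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches; 2) a lightweight "speak head" that decouples response initiation from generation, ensuring precise triggering without task conflict; and 3) a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness, together with a unified evaluation suite that standardizes diverse benchmarks. Experiments show that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, supporting its robustness for unified real-time multimodal understanding.

Link: https://arxiv.org/abs/2601.10323
Authors: Xueyun Tian,Wei Li,Bingbing Xu,Heng Dong,Yuanzhuo Wang,Huawei Shen
Institutions: CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Our project page is available at this https URL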

Abstract:Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate ROMA achieves state-of-the-art performance on proactive tasks while competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.

[NLP-26] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit

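【Quick read】: This paper addresses the difficulty of matching job proposals to candidate resumes in real time, which is especially challenging when resumes are long, structured, and multilingual. The key to the solution is a re-ranking model based on a new late cross-attention architecture that decomposes both resumes and project briefs, handling long-context inputs efficiently with minimal computational overhead. To mitigate biases in historical data, a generative large language model (LLM) serves as a teacher that produces fine-grained, semantically grounded supervision, which is distilled into the student model through an enriched distillation loss. The resulting model outputs consistent and interpretable skill-fit scores and outperforms state-of-the-art baselines on relevance, ranking, and calibration metrics.

Link: https://arxiv.org/abs/2601.10321
Authors: Warren Jouanneau,Emma Jouffroy,Marc Palyart
Institutions: Malt, 33000 Bordeaux, France
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: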

Abstract:Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture, that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.

[NLP-27] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

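【Quick read】: This paper addresses weak reliability and poor boundary-case handling in natural-language-to-SQL (NL2SQL) generation, where models tend to emit incorrect SQL or inappropriate responses when queries are ambiguous or exceed what the database schema supports. The key to the solution is BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework built on two mechanisms. First, a Seed Mutation data synthesis paradigm constructs a representative enterprise corpus that covers multi-step analytical queries as well as boundary cases. Second, Knowledge-Grounded Reasoning Synthesis produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules, improving interpretability and logical consistency. Training proceeds in two stages, Supervised Fine-Tuning (SFT) followed by reinforcement learning via Group Relative Policy Optimization, with a Task-Conditioned Hybrid Reward that jointly optimizes SQL execution accuracy and the semantic precision of abstention responses.

Link: https://arxiv.org/abs/2601.10318
Authors: Songsong Tian,Kongsheng Zhuo,Zhendong Wang,Rong Shen,Shengtao Zhang,Yong Wu
Institutions: Li Auto Inc.; Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments: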

Abstract:In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy (leveraging Abstract Syntax Tree analysis and dense result matching) and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesses SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability. The source code and benchmark are available anonymously at: this https URL.

[NLP-28] ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios

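【Quick read】: This paper studies how well synthetic voices can be told apart in structured environments as speech-to-speech models reach high fidelity, a question that bears on tracing synthetic audio back to its source identity. The key contribution is Advosynth-500, a specialized dataset of 100 synthetic speech files featuring 10 unique advocate identities. Using the Speech Llama Omni model, the authors simulate five advocate pairs engaged in courtroom arguments, define specific vocal characteristics for each advocate, and pose a speaker identification challenge that evaluates how well modern systems map audio files to their synthetic origins.

Link: https://arxiv.org/abs/2601.10315
Authors: Aniket Deroy
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: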

Abstract:As large-scale speech-to-speech models achieve high fidelity, the distinction between synthetic voices in structured environments becomes a vital area of study. This paper introduces Advosynth-500, a specialized dataset comprising 100 synthetic speech files featuring 10 unique advocate identities. Using the Speech Llama Omni model, we simulate five distinct advocate pairs engaged in courtroom arguments. We define specific vocal characteristics for each advocate and present a speaker identification challenge to evaluate the ability of modern systems to map audio files to their respective synthetic origins. Dataset is available at https://github.com/naturenurtureelite/ADVOSYNTH-500.
Cite as: arXiv:2601.10315 [cs.CL] (this version: arXiv:2601.10315v1); DOI: https://doi.org/10.48550/arXiv.2601.10315 (pending registration).

[NLP-29] Multilinguality as Sense Adaptation

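【Quick read】: This paper addresses the performance bottleneck that inconsistent semantic representations create for cross-lingual transfer in multilingual models: how to align meaning across languages more precisely rather than relying solely on shared parameters and scale. The key to the solution is SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves accuracy competitive with monolingual from-scratch baselines while using 2-4x less target-language data.

Link: https://arxiv.org/abs/2601.10310
Authors: Jan Christian Blaise Cruz,David Ifeoluwa Adelani,Alham Fikri Aji
Institutions: MBZUAI; McGill University; Mila - Quebec AI Institute; SEACrowd
Categories: Computation and Language (cs.CL)
Comments: Code available at this https URL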

Abstract:We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.

[NLP-30] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

【Quick read】: This paper addresses a deeper problem in the moral alignment of large language models (LLMs): current alignment techniques often act as superficial guardrails and leave the model's intrinsic moral representations largely untouched. The key to the solution is to apply Moral Foundations Theory (MFT): cross-lingual linear probing is used to identify the shared yet distinct moral subspaces of English and Chinese inside LLMs, and steerable Moral Vectors are extracted from them. Building on this, the authors propose Adaptive Moral Fusion (AMF), an inference-time intervention that dynamically combines probe detection with vector injection to ease the safety-helpfulness trade-off, reducing incorrect refusals on benign queries while minimizing jailbreak success rates.

Link: https://arxiv.org/abs/2601.10307
Authors: Luoming Hu,Jingjie Zeng,Liang Yang,Hongfei Lin
Institutions: Dalian University of Technology; Key Laboratory of Social Computing and Cognitive Intelligence, Ministry of Education
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.

[NLP-31] Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

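[Quick Read]: This paper addresses the weak supervision that sparse outcome rewards provide when Reinforcement Learning (RL) is applied to long-context scenarios. Outcome rewards cannot penalize ungrounded "lucky guesses", so the needle-in-a-haystack evidence retrieval process is left largely unsupervised. The key to the solution is EAPO (Evidence-Augmented Policy Optimization), a specialized RL algorithm in which a reward model computes a Group-Relative Evidence Reward, giving dense process supervision that explicitly improves evidence quality. An Adaptive Reward-Policy Co-Evolution mechanism iteratively refines the reward model on outcome-consistent rollouts, sharpening its discriminative ability so that process guidance stays accurate throughout training.

Link: https://arxiv.org/abs/2601.10306
Authors: Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou, Jiuxin Cao
Affiliations: Tongyi Lab, Alibaba Group
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: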

Abstract:While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded “lucky guesses,” leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
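
The abstract does not give the formula behind the Group-Relative Evidence Reward. As a rough illustration only, the sketch below assumes a GRPO-style reading of "group-relative": the reward-model scores of all rollouts sampled for one prompt are standardized within that group. The function name `group_relative_scores` and the `eps` parameter are hypothetical, and the paper's actual reward may differ.

```python
import numpy as np

def group_relative_scores(evidence_scores, eps=1e-8):
    """Standardize reward-model scores within one group of rollouts.

    Assumption (not from the paper): "group-relative" means subtracting the
    group mean and dividing by the group standard deviation.
    """
    scores = np.asarray(evidence_scores, dtype=float)
    # eps keeps the division finite when every rollout has the same score.
    return (scores - scores.mean()) / (scores.std() + eps)
```

Under this reading, a rollout is rewarded only for extracting better evidence than its siblings for the same prompt, so a group in which every rollout scores the same contributes no signal.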

[NLP-32] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

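[Quick Read]: This paper addresses a limitation of current multimodal large language models that process speech and text with identical parameters: modality-specific learning is insufficient and cross-modal understanding is constrained. The key to the solution is the Modality-Aware Mixture of Experts (MAMoE) architecture. A routing mechanism based on the input modality directs each token to the matching modality-specific expert group, while shared experts are retained to transfer information between modalities. This design combines modality-specific modeling with cross-modal learning and improves performance on joint speech-text tasks.

Link: https://arxiv.org/abs/2601.10272
Authors: Yuxuan Lou, Kai Yang, Yang You
Affiliations: National University of Singapore; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: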

Abstract:We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. We release the MoST model, training code, inference code, and training data at this https URL.
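
To make the routing idea concrete, here is a minimal sketch of modality-restricted top-k routing with always-on shared experts. It is not the MoST implementation: the function `route_tokens`, the integer modality ids, and the expert index layout are all assumptions made for illustration.

```python
import numpy as np

def route_tokens(router_logits, modality_ids, expert_groups, shared_experts, top_k=1):
    """Pick experts per token, restricted to the token's modality group.

    router_logits : (num_tokens, num_routed_experts) float array
    modality_ids  : (num_tokens,) int array, e.g. 0 = text, 1 = speech (hypothetical)
    expert_groups : dict mapping modality id -> list of routed expert indices
    shared_experts: list of expert indices applied to every token
    """
    # Experts outside a token's modality group get -inf so they are never chosen.
    masked = np.full_like(router_logits, -np.inf, dtype=float)
    for modality, experts in expert_groups.items():
        rows = np.where(modality_ids == modality)[0]
        masked[np.ix_(rows, experts)] = router_logits[np.ix_(rows, experts)]
    top = np.argsort(-masked, axis=1)[:, :top_k]
    # Shared experts are appended unconditionally for every token.
    return [list(map(int, row)) + list(shared_experts) for row in top]
```

In this sketch a speech token can never reach a text expert even when the router scores it highly, and cross-modal transfer happens only through the shared experts.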

[NLP-33] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel

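[Quick Read]: This paper addresses the difficulty of quantifying relationships between attention heads in Transformers, which existing metrics do not capture well. The key to the solution is the Projection Kernel (PK), a principal-angle-based measure of subspace similarity, applied to the subspaces spanned by attention-head weight matrices. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. The paper also introduces a framework that quantifies the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces.

Link: https://arxiv.org/abs/2601.10266
Authors: Hiroaki Yamagiwa, Yusuke Takase, Hidetoshi Shimodaira
Affiliations: Kyoto University; RIKEN AIP
Categories: Computation and Language (cs.CL)
Comments: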

Abstract:Understanding relationships between attention heads is essential for interpreting the internal structure of Transformers, yet existing metrics do not capture this structure well. We focus on the subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using the Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. We further introduce a framework to quantify the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. As an application, we analyze a directed graph constructed from PK and show that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head.
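
As a small illustration of a principal-angle-based similarity, the sketch below computes the sum of squared cosines of the principal angles between the column spaces of two matrices, which is the usual definition of the projection kernel. Which head matrices the paper actually compares, and any normalization it applies, are not stated in the abstract, so the inputs here are placeholders and full column rank is assumed.

```python
import numpy as np

def orthonormal_basis(weight):
    """Orthonormal basis for the column space of `weight` (assumes full column rank)."""
    q, _ = np.linalg.qr(weight)
    return q

def projection_kernel(weight_a, weight_b):
    """Sum of squared cosines of the principal angles between two column spaces."""
    q_a = orthonormal_basis(weight_a)
    q_b = orthonormal_basis(weight_b)
    # The singular values of Q_a^T Q_b are the cosines of the principal angles.
    cosines = np.linalg.svd(q_a.T @ q_b, compute_uv=False)
    return float(np.sum(cosines ** 2))
```

For two d-dimensional subspaces the value ranges from 0 (orthogonal) to d (identical), so dividing by d gives a score in [0, 1] if a normalized quantity is preferred.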

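[NLP-34] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs

[Quick Read]: This paper asks whether Large Language Models (LLMs) reach different conclusions on moral dilemmas depending on the language used, and whether any difference stems from the input language (the language of the dilemma) or the reasoning language (the language in which the model reasons). Standard evaluation conflates the two by testing only matched conditions, such as an English dilemma with English reasoning. The key to the solution is an experimental design that manipulates input language and reasoning language independently, covering mismatched conditions such as an English dilemma with Chinese reasoning, so that their contributions can be decomposed. Moral judgments are interpreted through Moral Foundations Theory, which yields evidence for splitting the Authority dimension into a family-related and an institutional dimension, and a diagnostic taxonomy is built on the results. Applied to 13 LLMs, the method detects context-dependency that standard evaluation misses and finds that reasoning-language effects contribute twice the variance of input-language effects.

Link: https://arxiv.org/abs/2601.10257
Authors: Nan Li, Bo Kang, Tijl De Bie
Affiliations: IDLab, Department of Electronics and Information Systems, Ghent University, Belgium
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: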

Abstract:When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study what changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at this https URL.

[NLP-35] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts

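[Quick Read]: This paper addresses the strain on access to mental healthcare caused by workforce shortages and rising demand, and proposes coTherapist, a unified framework, as a solution. The key is to have a small language model emulate core therapeutic competencies through three techniques: domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. The resulting model generates more relevant and clinically grounded responses, shows high empathy and therapist-consistent personality traits, and is judged by domain experts to be accurate, trustworthy, and safe.

Link: https://arxiv.org/abs/2601.10246
Authors: Prottay Kumar Adhikary, Reena Rawat, Tanmoy Chakraborty
Affiliations: IIT Delhi
Categories: Computation and Language (cs.CL)
Comments: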

Abstract:Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.

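[NLP-36] TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks

[Quick Read]: This paper addresses cascading failures in multi-step reasoning tasks such as mathematical problem solving, where a single incorrect step breaks the whole solution. Existing LLM routing methods assign the entire query to one model and treat all reasoning steps as equally important. The key to the solution is TRIM (Targeted routing in multi-step reasoning tasks), which routes at the step level: process reward models identify error-prone critical steps, and routing decisions based on step-level uncertainty and budget constraints send only those steps to a larger model while a smaller model handles routine steps. On MATH-500, the simplest thresholding strategy achieves 5x higher cost efficiency than prior routing methods, and more advanced policies match the strong model's performance using 80% fewer expensive-model tokens, which suggests that step-level difficulty is a fundamental characteristic of reasoning.

Link: https://arxiv.org/abs/2601.10245
Authors: Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang, Aviral Kumar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: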

Abstract:Multi-step reasoning tasks like mathematical problem solving are vulnerable to cascading failures, where a single incorrect step leads to complete solution breakdown. Current LLM routing methods assign entire queries to one model, treating all reasoning steps as equal. We propose TRIM (Targeted routing in multi-step reasoning tasks), which routes only critical steps – those likely to derail the solution – to larger models while letting smaller models handle routine continuations. Our key insight is that targeted step-level interventions can fundamentally transform inference efficiency by confining expensive calls to precisely those steps where stronger models prevent cascading errors. TRIM operates at the step-level: it uses process reward models to identify erroneous steps and makes routing decisions based on step-level uncertainty and budget constraints. We develop several routing strategies within TRIM, ranging from a simple threshold-based policy to more expressive policies that reason about long-horizon accuracy-cost trade-offs and uncertainty in step-level correctness estimates. On MATH-500, even the simplest thresholding strategy surpasses prior routing methods with 5x higher cost efficiency, while more advanced policies match the strong, expensive model’s performance using 80% fewer expensive model tokens. On harder benchmarks such as AIME, TRIM achieves up to 6x higher cost efficiency. All methods generalize effectively across math reasoning tasks, demonstrating that step-level difficulty represents fundamental characteristics of reasoning.
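
The simplest policy mentioned in the abstract, threshold-based routing, can be sketched roughly as follows. This is a toy reading, not the authors' code: the callables `small_model`, `large_model`, and `prm_score`, the `"ANSWER"` stop marker, and the call-count budget are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str, List[str]], str]      # (problem, steps so far) -> next step
Scorer = Callable[[str, List[str], str], float]  # PRM-style score for a candidate step

def threshold_route(
    problem: str,
    small_model: Generator,
    large_model: Generator,
    prm_score: Scorer,
    threshold: float = 0.5,
    large_budget: int = 3,
    max_steps: int = 8,
) -> Tuple[List[str], List[str]]:
    """Draft each step with the small model; redo low-scoring steps with the large one."""
    steps: List[str] = []
    sources: List[str] = []
    for _ in range(max_steps):
        step, source = small_model(problem, steps), "small"
        # Escalate only when the step looks risky and budget remains.
        if prm_score(problem, steps, step) < threshold and large_budget > 0:
            step, source = large_model(problem, steps), "large"
            large_budget -= 1
        steps.append(step)
        sources.append(source)
        if step.startswith("ANSWER"):  # hypothetical end-of-solution marker
            break
    return steps, sources
```

The expensive model is called only on steps that fall below the threshold, so cost scales with the number of risky steps rather than with the length of the solution.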

[NLP-37] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?

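[Quick Read]: This paper addresses the gap between the internal knowledge of Large Language Models (LLMs) and their explicit linguistic outputs. The key question is whether Looped Transformers (LTs), which increase computational depth by iterating shared layers, can use their iterative structure as a form of introspection that maps knowledge in representation space to natural language outputs more effectively. Experiments show that more loop iterations do narrow the gap, but partly because the internal knowledge carried by the representations degrades. In addition, the ability of current LTs to perceive representations does not improve across loops and is present only in the final loop. LTs are therefore a promising direction for scaling computational depth, but they have not yet achieved the introspection needed to truly link representation space and natural language.

Link: https://arxiv.org/abs/2601.10242
Authors: Guanxu Chen, Dongrui Liu, Jing Shao
Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures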

Abstract:Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)–architectures that increase computational depth by iterating shared layers–can bridge this gap by utilizing their iterative nature as a form of introspection. Our experiments reveal that while increasing loop iterations narrows the gap, it is partly driven by a degradation of their internal knowledge carried by representations. Moreover, another empirical analysis suggests that current LTs’ ability to perceive representations does not improve across loops; it is only present in the final loop. These results suggest that while LTs offer a promising direction for scaling computational depth, they have yet to achieve the introspection required to truly link representation space and natural language.

[NLP-38] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients AAAI2026

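[Quick Read]: This paper addresses the tendency of Large Language Models (LLMs) to generate logically inconsistent intermediate steps during multi-step reasoning even when the final answer is correct, which reduces the reliability of the reasoning process. The key to the solution is the GeoSteer framework: it constructs a Chain-of-Thought (CoT) dataset with segment-level scores, trains a Variational Autoencoder (VAE) and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and steers the hidden states of the target LLM toward higher-quality regions of the latent space. This geometrically coherent latent-space adjustment improves the quality and consistency of intermediate reasoning steps.

Link: https://arxiv.org/abs/2601.10229
Authors: Kentaro Kazama, Daiki Shirafuji, Tatsuhiko Saito
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: The Third workshop of NeusymBridge @AAAI 2026 (Bridging Neurons and Symbols for NLP and Knowledge Graph Reasoning)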

Abstract:Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in a latent space behaves like a natural-gradient adjustment in the original hidden-state space. It ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series. We measure via answer accuracy and overall reasoning performance. GeoSteer improved the exact match accuracy by up to 2.6 points. It also enhanced the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.

[NLP-39] One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?

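[Quick Read]: This paper addresses the alignment of multilingual assistants with culturally grounded user preferences in India, whose linguistically diverse population exceeds one billion speakers across multiple scripts. Existing benchmarks either focus on a single language or conflate retrieval with generation, so they cannot show whether embedding models encode persona-instruction compatibility without relying on response generation. The key to the solution is a unified benchmark covering 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse instruction-to-persona retrieval, and binary compatibility classification, with classification performed by a thin logistic regression head on frozen encoders. E5-Large-Instruct performs best on monolingual retrieval (Recall@1 = 27.4%), BGE-M3 leads reverse retrieval (Recall@1 = 32.1%), and LaBSE reaches 75.3% AUROC on classification with strong calibration, giving reproducible baselines and practical guidance for model selection in Indic multilingual retrieval.

Link: https://arxiv.org/abs/2601.10205
Authors: Arya Shah, Himanshu Beniwal, Mayank Singh
Affiliations: Indian Institute of Technology Gandhinagar, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 10 tables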

Abstract:Aligning multilingual assistants with culturally grounded user preferences is essential for serving India’s linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work. Code, datasets, and models are publicly available at this https URL.
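
For readers unfamiliar with the headline metric, the sketch below shows how Recall@k is typically computed from frozen embeddings using cosine similarity. It is a generic illustration, not the benchmark's evaluation script: the function name, the assumption of a single gold candidate per query, and the array shapes are choices made here.

```python
import numpy as np

def recall_at_k(query_emb, cand_emb, gold, k=1):
    """Fraction of queries whose gold candidate is among the k most similar.

    query_emb: (num_queries, dim), cand_emb: (num_candidates, dim),
    gold[i] is the index of the correct candidate for query i.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sims = q @ c.T  # cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [gold[i] in top_k[i] for i in range(len(gold))]
    return float(np.mean(hits))
```

With personas as queries and instructions as candidates this gives persona-to-instruction retrieval; swapping the two arguments gives the reverse task.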

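[NLP-40] PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

[Quick Read]: This paper addresses the lack of fine-grained process supervision when the reasoning ability of Large Language Models (LLMs) is improved with trajectory-level outcome rewards. Existing frameworks that add process signals usually rely on tedious extra steps such as Monte Carlo Tree Search (MCTS) or training a separate reward model, which hurts training efficiency, and the design of those signals lacks rigorous theoretical support, leaving the optimization mechanism opaque. The paper proposes Process Reward Learning (PRL), which decomposes the entropy-regularized reinforcement learning objective into intermediate steps and derives rigorous process rewards that can be assigned to the model directly. PRL is essentially equivalent to reward maximization plus a KL-divergence penalty between the policy model and a reference model, while also turning outcome rewards into process supervision signals that better guide exploration during RL. Experiments show that PRL improves average reasoning performance (average @ n) and broadens the reasoning boundary (pass @ n).

Link: https://arxiv.org/abs/2601.10201
Authors: Jiarui Yao, Ruida Wang, Tong Zhang
Affiliations: University of Illinois Urbana-Champaign
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: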

Abstract:Improving the reasoning abilities of Large Language Models (LLMs) has been a continuous topic recently. But most relevant works are based on outcome rewards at the trajectory level, missing fine-grained supervision during the reasoning process. Other existing training frameworks that try to combine process signals together to optimize LLMs also rely heavily on tedious additional steps like MCTS, training a separate reward model, etc., doing harm to the training efficiency. Moreover, the intuition behind the process signals design lacks rigorous theoretical support, leaving the understanding of the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that could be assigned to models accordingly. Starting from theoretical motivation, we derive the formulation of PRL that is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL could turn the outcome reward into process supervision signals, which helps better guide the exploration during RL optimization. From our experiment results, we demonstrate that PRL not only improves the average performance for LLMs’ reasoning ability measured by average @ n, but also broadens the reasoning boundary by improving the pass @ n metric. Extensive experiments show the effectiveness of PRL could be verified and generalized.
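
The two metrics named in the abstract measure different things, which the short sketch below tries to make explicit. It assumes the usual reading: average @ n is the mean correctness over n samples per problem, and pass @ n is the fraction of problems solved by at least one of the n samples. The function names and the boolean input layout are choices made here, not taken from the paper.

```python
from typing import List

def average_at_n(correct: List[List[bool]]) -> float:
    """Mean per-problem accuracy, where correct[i] holds n sample outcomes for problem i."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)

def pass_at_n(correct: List[List[bool]]) -> float:
    """Fraction of problems for which at least one of the n samples is correct."""
    return sum(any(samples) for samples in correct) / len(correct)
```

A method can raise average @ n by becoming more consistent on problems it already solves, whereas pass @ n rises only when previously unsolved problems become reachable, which is why the latter is read as the "reasoning boundary".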

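[NLP-41] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns

[Quick Read]: This paper addresses the difficulty of achieving authentic alignment with human cognitive and behavioral patterns in Role-Playing Language Agents (RPLAs). Large Language Models (LLMs) reason and generate well, but their behavior often lacks grounding in deeper psychological mechanisms, so simulations look plausible on the surface without being internally consistent. The key to the solution is the HUMANLLM framework, which treats psychological patterns as a system of interacting causal forces. It constructs 244 patterns from about 12,000 academic papers and synthesizes 11,359 multi-turn conversation scenarios in which 2-5 patterns reinforce, conflict with, or modulate each other; dual-level checklists then evaluate both individual pattern fidelity and emergent multi-pattern dynamics. HUMANLLM-8B outperforms the larger Qwen3-32B on multi-pattern dynamics, supporting the claim that authentic anthropomorphism depends on modeling psychological processes rather than only imitating behavior.

Link: https://arxiv.org/abs/2601.10198
Authors: Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang, Jun Gao, Shuai Huang, Yueping Kang, Liyuan Gou, Hongwei Feng, Yanghua Xiao
Affiliations: Fudan University; Hello Group; Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments: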

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling–simulating not just what humans do, but the psychological processes generating those behaviors.

[NLP-42] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

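[Quick Read]: This paper addresses the systemic cross-lingual verbosity bias of Large Language Models (LLMs) in multilingual translation, which makes their outputs too long for strictly time-constrained settings such as subtitling and dubbing. Prompt-engineering approaches struggle to balance semantic fidelity with temporal feasibility. The solution has two components: Sand-Glass, a benchmark for evaluating translation under syllable-level duration constraints, and HOMURA, a reinforcement learning framework that uses a KL-regularized objective with a novel dynamic syllable-ratio reward to explicitly optimize the trade-off between semantic preservation and temporal compliance. The result is effective control of output length that respects linguistic density hierarchies without sacrificing semantic adequacy.

Link: https://arxiv.org/abs/2601.10187
Authors: Ziang Cui, Mengran Yu, Tianjiao Li, Chenyu Shi, Yingxuan Shi, Lusheng Zhang, Hongwei Lin
Affiliations: Bilibili Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: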

Abstract:Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively “tames” the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.

[NLP-43] ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

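[Quick Read]: This paper addresses the vulnerability of agentic systems built on Large Language Models (LLMs) to indirect prompt injection attacks, in which malicious instructions embedded in external data hijack agent behavior. The key to the solution is ReasAlign, which incorporates structured reasoning steps that analyze the user query, detect conflicting instructions, and preserve the continuity of the user's intended task. It is combined with a preference-optimized judge model that scores reasoning steps at test time and selects the best trajectory, improving the accuracy and robustness of the reasoning. The approach raises security substantially while keeping utility high.

Link: https://arxiv.org/abs/2601.10173
Authors: Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao
Affiliations: Washington University in St. Louis; University of Wisconsin–Madison; NVIDIA; Johns Hopkins University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 10 figures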

Abstract:Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user’s intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state-of-the-art defensive model of Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results could be found at this https URL.

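[NLP-44] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection

[Quick Read]: This paper addresses debt collection in the banking, financial services, and insurance (BFSI) sector, where traditional Natural Language Processing (NLP) systems struggle with large-scale human-to-human conversations in Vietnamese contact centers that involve informal spoken language, emotional variability, and domain-specific reasoning. The key to the solution is Credit C-GPT, a seven-billion-parameter large language model specialized for Vietnamese debt collection. It integrates dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction within a single reasoning-based framework, providing a scalable and privacy-aware basis for real-time assistance and post-call analytics.

Link: https://arxiv.org/abs/2601.10167
Authors: Nhung Nguyen Thi Hong, Cuong Nguyen Dang, Tri Le Ngoc
Affiliations: Emandai
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 0 figures, 3 tables. Preprint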

Abstract:Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.

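[NLP-45] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers ACL’26

[Quick Read]: This paper addresses the lack of resources and weak model performance for Fine-grained Named Entity Recognition (FgNER) across 36 global languages spoken by more than 6.6 billion people, with particular attention to low-resource languages and deployment on edge devices. The key to the solution is AWED-FiNER, an open-source ecosystem with three parts: agentic toolkits that route multilingual text to specialized expert models and return FgNER annotations within seconds; web applications that offer a ready-to-use annotation service for non-technical users; and 49 small, language-specific open-source expert detector models that support offline deployment in resource-constrained environments. The ecosystem gives specific attention to vulnerable languages such as Bodo and Manipuri.

Link: https://arxiv.org/abs/2601.10161
Authors: Prachuryya Kaushik, Ashish Anand
Affiliations: Indian Institute of Technology Guwahati
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Submitted to ACL’26 System Demonstration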

Abstract:We introduce AWED-FiNER, an open-source ecosystem designed to bridge the gap in Fine-grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low-resource languages and fine-grained NLP tasks. AWED-FiNER provides a collection of agentic toolkits, web applications, and several state-of-the-art expert models that provide FgNER solutions across 36 languages. The agentic tools route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web-based platforms provide a ready-to-use FgNER annotation service for non-technical users. Moreover, the collection of language-specific, extremely small open-source state-of-the-art expert models facilitates offline deployment in resource-constrained scenarios, including edge devices. AWED-FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (this https URL), Web Application (this https URL), and 49 Expert Detector Models (this https URL).

[NLP-46] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

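[Quick Read]: This paper asks how discourse about AI in pretraining corpora affects the downstream alignment of large language models (LLMs), and in particular whether such discourse can cause self-fulfilling misalignment. The key to the solution is a controlled experiment that varies the amount of text about aligned and misaligned AI behavior in the pretraining data. Upsampling misalignment discourse leads to a notable increase in misaligned behavior, while upsampling documents about aligned behavior reduces misalignment scores from 45% to 9%. The results support "alignment pretraining" as a complement to post-training.

Link: https://arxiv.org/abs/2601.10160
Authors: Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O’Brien
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: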

Abstract:Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at this http URL

[NLP-47] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models

【速读】: 该论文旨在解决多专家模型(Mixture-of-Experts, MoE)中专家层级行为的可解释性问题,尤其是缺乏对专家在不同领域(domain)和驱动机制(driver)层面的功能分化理解。其关键解决方案在于引入熵基指标(entropy-based metrics)和因果效应度量(causal-effect metrics),分别用于识别具有领域偏好性的“领域专家”(domain experts)与对模型输出具有强因果影响的“驱动专家”(driver experts),并通过分析token位置与专家激活的关系,揭示了早期token更易触发驱动专家的机制。实验表明,调整这两类专家权重可显著提升多个MoE模型在三个公共领域的性能,从而深化了对MoE内部工作机制的理解并增强了其可解释性。

Link: https://arxiv.org/abs/2601.10159
Authors: Guimin Hu, Meng Li, Qiwei Peng, Lijie Hu, Boyan Xu, Ruichu Cai
Affiliations: Guangdong University of Technology; Soochow University; Microsoft; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation by distinguishing domain and driver experts. In this work, we study expert activation in MoE models across three public domains and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with and trigger the activation of specific experts. To answer these questions, we introduce entropy-based and causal-effect metrics to assess whether an expert is strongly favored for a particular domain, and how strongly expert activation contributes causally to the model’s output, thereby identifying domain and driver experts, respectively. Furthermore, we explore how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) among the activated experts, some show clear domain preferences, while others exert strong causal influence on model performance, underscoring their decisive roles; (2) tokens occurring earlier in a sentence are more likely to trigger the driver experts; and (3) adjusting the weights of domain and driver experts leads to significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.
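
One way to picture the entropy-based side of this analysis is to treat each expert's activations as a distribution over domains: an expert whose activations concentrate in one domain has low entropy. The sketch below is illustrative only. The abstract does not state the paper's exact metric, so the Shannon-entropy formulation, the `threshold` cut-off, the function names, and the toy counts are all assumptions.

```python
import math

def domain_entropy(counts):
    """Shannon entropy (bits) of one expert's activation counts across domains."""
    total = sum(counts)
    if total == 0:
        raise ValueError("expert was never activated")
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

def find_domain_experts(activations, threshold=1.0):
    """Return {expert_id: preferred_domain_index} for low-entropy experts.

    `activations` maps a (hypothetical) expert id to a list of activation
    counts, one per domain. `threshold` is an assumed cut-off in bits.
    """
    experts = {}
    for expert_id, counts in activations.items():
        if domain_entropy(counts) < threshold:
            experts[expert_id] = counts.index(max(counts))
    return experts

# Toy counts for three domains (made-up numbers, not from the paper).
activations = {
    "expert_0": [90, 5, 5],    # concentrated on domain 0
    "expert_1": [30, 35, 35],  # spread evenly, no clear preference
}
print(find_domain_experts(activations))
```

Identifying driver experts would need a causal intervention on the model (for example, ablating an expert and measuring the change in output), which this count-based sketch does not attempt.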

[NLP-48] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

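【Quick read】: This paper addresses the safety risks that arise when large language model (LLM) agents invoke external tools, in particular the under-explored problem of monitoring step-level tool invocations in real time and intervening before an unsafe action executes. The authors first construct TS-Bench, a benchmark for step-level tool invocation safety detection. The key is TS-Guard, a guardrail model trained with multi-task reinforcement learning. It reasons over the interaction history to detect unsafe tool invocations before execution, and assesses request harmfulness and action-attack correlations to produce interpretable, generalizable safety judgments and feedback. The paper also introduces TS-Flow, a guardrail-feedback-driven reasoning framework that reduces harmful tool invocations of ReAct-style agents by 65% on average and improves benign task completion by about 10% under prompt injection attacks.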

Link: https://arxiv.org/abs/2601.10156
Authors: Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao
Affiliations: National Engineering Research Center for Software Engineering, Peking University; Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Comments: Work in Progress. Code available: this https URL

Abstract:While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.

[NLP-49] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends

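【Quick read】: This paper systematically reviews research progress and key technologies for role-playing language agents (RPLAs), with attention to open challenges in character consistency, behavioural plausibility, and interaction realism. It organizes the supporting techniques into three pathways: (1) psychological scale-driven character modeling, which captures a character's personality traits in fine detail; (2) memory-augmented prompting mechanisms, which improve coherence and situational adaptation over long conversations; and (3) motivation-situation-based behavioral decision control, which makes agent responses consistent with the character's internal drives and external constraints. Together these techniques support high-quality role-playing and offer methodological guidance for later research.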

Link: https://arxiv.org/abs/2601.10122
Authors: Ye Wang, Jiaxing Chen, Hongjiang Xiao
Affiliations: State Key Laboratory of Media Convergence and Communication, Communication University of China; Neuroscience and Intelligent Media Institute, Communication University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.

[NLP-50] TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems

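【Quick read】: This paper addresses the optimization of communication topology in multi-agent systems driven by large language models (LLMs), with the goal of improving collective intelligence. Existing methods rely on spatio-temporal interaction paradigms in which multi-round dialogues execute sequentially, incurring high latency and computation. The key is TopoDIM, a framework that generates a communication topology with Diverse Interaction Modes in one shot. It is designed for decentralized execution, so agents can autonomously construct heterogeneous communication without iterative coordination. TopoDIM reduces total token consumption by 46.41% while improving average performance by 1.50% over state-of-the-art methods, and it adapts well to organizing communication among heterogeneous agents.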

Link: https://arxiv.org/abs/2601.10120
Authors: Rui Sun, Jie Ding, Chenghua Gong, Tianjun Gu, Yihang Jiang, Juyuan Zhang, Liming Pan, Linyuan Lü
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Optimizing communication topology in LLM-based multi-agent system is critical for enabling collective intelligence. Existing methods mainly rely on spatio-temporal interaction paradigms, where the sequential execution of multi-round dialogues incurs high latency and computation. Motivated by the recent insights that evaluation and debate mechanisms can improve problem-solving in multi-agent systems, we propose TopoDIM, a framework for one-shot Topology generation with Diverse Interaction Modes. Designed for decentralized execution to enhance adaptability and privacy, TopoDIM enables agents to autonomously construct heterogeneous communication without iterative coordination, achieving token efficiency and improved task performance. Experiments demonstrate that TopoDIM reduces total token consumption by 46.41% while improving average performance by 1.50% over state-of-the-art methods. Moreover, the framework exhibits strong adaptability in organizing communication among heterogeneous agents. Code is available at: this https URL

[NLP-51] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation

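【Quick read】: This paper addresses the reliance on large-scale supervised fine-tuning (SFT) data when distilling the reasoning ability of large reasoning models such as DeepSeek-R1, with the aim of making knowledge transfer more data-efficient. It proposes a skill-centric distillation framework with two components: skill-based data selection, which prioritizes examples that target the student model's weaker skills, and skill-aware fine-tuning, which encourages the model to explicitly decompose problems into skills while solving them. With only 1,000 examples selected from a 100K teacher-generated corpus, the method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks, and the gains concentrate on the skills emphasized during training.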

Link: https://arxiv.org/abs/2601.10109
Authors: Lechen Zhang, Yunxiang Zhang, Wei Hu, Lu Wang
Affiliations: University of Michigan, Ann Arbor
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large reasoning models such as DeepSeek-R1 and their distilled variants achieve strong performance on complex reasoning tasks. Yet, distilling these models often demands large-scale data for supervised fine-tuning (SFT), motivating the pursuit of data-efficient training methods. To address this, we propose a skill-centric distillation framework that efficiently transfers reasoning ability to weaker models with two components: (1) Skill-based data selection, which prioritizes examples targeting the student model’s weaker skills, and (2) Skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving. With only 1,000 training examples selected from a 100K teacher-generated corpus, our method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks. Further analysis confirms that these gains concentrate on skills emphasized during training, highlighting the effectiveness of skill-centric training for efficient reasoning distillation.
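
The skill-based data selection step can be pictured as ranking candidate examples by how much they exercise skills the student is weak at. The following is a minimal sketch under stated assumptions: the abstract does not describe the scoring rule, so the "mean of (1 - accuracy)" score, the per-example `skills` tags, and all names and numbers here are hypothetical.

```python
def select_examples(examples, skill_accuracy, budget):
    """Pick `budget` examples that most target the student's weak skills."""
    def weakness(example):
        skills = example["skills"]
        # Skills with no measured accuracy default to 0.0, i.e. maximally weak.
        return sum(1.0 - skill_accuracy.get(s, 0.0) for s in skills) / len(skills)
    return sorted(examples, key=weakness, reverse=True)[:budget]

# Made-up student accuracies per skill and a tiny candidate pool.
skill_accuracy = {"algebra": 0.9, "geometry": 0.4, "combinatorics": 0.2}
pool = [
    {"id": "q1", "skills": ["algebra"]},
    {"id": "q2", "skills": ["geometry", "combinatorics"]},
    {"id": "q3", "skills": ["algebra", "geometry"]},
]
selected = select_examples(pool, skill_accuracy, budget=2)
print([ex["id"] for ex in selected])
```

In the setting the abstract describes, `budget` would be 1,000 and the pool would be the 100K teacher-generated corpus.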

[NLP-52] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

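【Quick read】: This paper addresses the difficulty of evaluating whether multimodal large language models (MLLMs) truly understand long scientific papers: answer-only metrics and synthetic "Needle-In-A-Haystack" tests reward answer matching without checking that the reasoning rests on an evidence chain in the document. The key is the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize it, the authors build SIN-Data, a corpus that preserves the native interleaving of text and figures, and SIN-Bench, a benchmark with four progressive tasks: evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). A "No Evidence, No Score" rule scores a prediction only when it is grounded to verifiable anchors, and evidence quality is diagnosed via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correct answers and traceable support.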

Link: https://arxiv.org/abs/2601.10108
Authors: Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin, Ruihang Chu, Yiran Xu, Ziming Li, Yunfei Zhao, Zihan Wang, Yu Qiao, Ruiming Tang, Minghao Liu, Yujiu Yang
Affiliations: Tsinghua University; Shanghai AI Laboratory; 2077AI; KuaiShou Inc.; Stanford University; Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic “Needle-In-A-Haystack” tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the “Fish-in-the-Ocean” (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce “No Evidence, No Score”, scoring predictions when grounded to verifiable anchors and diagnosing evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.

[NLP-53] MATRIX AS PLAN: Structured Logical Reasoning with Feedback-Driven Replanning WWW

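【Quick read】: This paper addresses the weakness of large language models (LLMs) on logical reasoning tasks that rely on symbolic expressions and strict deductive rules. Chain-of-Thought (CoT) prompting improves reasoning but still falls short on such tasks. Neuro-symbolic methods enforce formal correctness through external solvers, but the solvers are highly format-sensitive, so small instabilities in model output cause frequent processing failures. The paper proposes MatrixCoT, a structured CoT framework with a matrix-based plan: it normalizes and types natural language expressions, attaches explicit citation fields, and builds a dependency matrix that preserves global relations among steps, so the plan becomes a verifiable artifact. A feedback-driven replanning mechanism then identifies omissions and defects under semantic-equivalence constraints and rewrites and compresses the dependency matrix to produce a more trustworthy final answer. Without relying on external solvers, MatrixCoT improves robustness and interpretability on complex symbolic reasoning tasks while maintaining competitive performance.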

Link: https://arxiv.org/abs/2601.10101
Authors: Ke Chen, Jiandian Zeng, Zihao Peng, Guo Li, Guangxue Zhang, Tian Wang
Affiliations: Beijing Normal University; Institute of Artificial Intelligence and Future Networks; Engineering Research Center of Cloud-Edge Intelligent Collaboration on Big Data, Ministry of Education
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 5 figures, 2 tables. Accepted at The Web Conference (WWW) 2026

Abstract:As knowledge and semantics on the web grow increasingly complex, enhancing Large Language Models (LLMs) comprehension and reasoning capabilities has become particularly important. Chain-of-Thought (CoT) prompting has been shown to enhance the reasoning capabilities of LLMs. However, it still falls short on logical reasoning tasks that rely on symbolic expressions and strict deductive rules. Neuro-symbolic methods address this gap by enforcing formal correctness through external solvers. Yet these solvers are highly format-sensitive, and small instabilities in model outputs can lead to frequent processing failures. LLM-driven approaches avoid parsing brittleness, but they lack structured representations and process-level error-correction mechanisms. To further enhance the logical reasoning capabilities of LLMs, we propose MatrixCoT, a structured CoT framework with a matrix-based plan. Specifically, we normalize and type natural language expressions, attach explicit citation fields, and introduce a matrix-based planning method to preserve global relations among steps. The plan becomes a verifiable artifact, making execution more stable. For verification, we also add a feedback-driven replanning mechanism. Under semantic-equivalence constraints, it identifies omissions and defects, rewrites and compresses the dependency matrix, and produces a more trustworthy final answer. Experiments on five logical-reasoning benchmarks and five LLMs show that, without relying on external solvers, MatrixCoT enhances both robustness and interpretability when tackling complex symbolic reasoning tasks, while maintaining competitive performance.

[NLP-54] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

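【Quick read】: This paper addresses the difficulty large language models (LLMs) have in sustaining realistic, goal-directed dialogue over extended interactions in mental health-related settings. LLMs generate fluent responses but optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, which leads to brittleness and long-horizon drift. The key is CALM-IT, a framework that explicitly models dual-actor conversational dynamics between therapist and client. It represents the interaction as a bidirectional state-space process in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. In large-scale evaluations, CALM-IT outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases.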

Link: https://arxiv.org/abs/2601.10085
Authors: Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Srijan Kumar, Munmun De Choudhury
Affiliations: Georgia Institute of Technology; Northwell Health
Categories: Computation and Language (cs.CL)
Comments: 46 pages

Abstract:Large Language Models (LLMs) are increasingly used in mental health-related settings, yet they struggle to sustain realistic, goal-directed dialogue over extended interactions. While LLMs generate fluent responses, they optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, leading to brittleness and long-horizon drift. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing (MI) dialogues that explicitly models dual-actor conversational dynamics. CALM-IT represents therapist-client interaction as a bidirectional state-space process, in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Across large-scale evaluations, CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases. Although CALM-IT initiates fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise and therapeutically aligned intervention timing. Overall, CALM-IT provides evidence for modeling evolving conversational state being essential for generating high-quality long-form synthetic conversations.

[NLP-55] Is MT Ready for the Next Crisis or Pandemic?

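【Quick read】: This paper addresses communication barriers in crisis and medical settings, where the languages of governments, aid providers, and doctors often differ from those of the people they serve, and focuses on how well commercial machine translation (MT) systems handle low-resource languages. Its approach is to evaluate four commercial MT systems on the TICO-19 dataset, which contains pandemic-related sentences in a large set of high-priority languages, and to use the usability of the output translations to assess current "readiness" for the next pandemic or epidemic.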

Link: https://arxiv.org/abs/2601.10082
Authors: Vipasha Bansal, Elizabeth Brown, Chelsea Kendrick, Benjamin Pong, William D. Lewis
Affiliations: University of Washington
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Communication in times of crisis is essential. However, there is often a mismatch between the language of governments, aid providers, doctors, and those to whom they are providing aid. Commercial MT systems are reasonable tools to turn to in these scenarios. But how effective are these tools for translating to and from low resource languages, particularly in the crisis or medical domain? In this study, we evaluate four commercial MT systems using the TICO-19 dataset, which is composed of pandemic-related sentences from a large set of high priority languages spoken by communities most likely to be affected adversely in the next pandemic. We then assess the current degree of “readiness” for another pandemic (or epidemic) based on the usability of the output translations.

[NLP-56] Deriving Character Logic from Storyline as Codified Decision Trees

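【Quick read】: This paper addresses the inconsistent behaviour of role-playing (RP) agents across diverse narrative contexts, which stems from behavioral profiles that are largely unstructured, non-executable, and weakly validated. The key is Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents a behavioral profile as a tree of conditional rules: internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, which enables deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. On 85 characters across 16 artifacts, CDT substantially outperforms human-written profiles and prior profile induction methods.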

Link: https://arxiv.org/abs/2601.10080
Authors: Letian Peng, Kun Zhou, Longfei Yun, Yupeng Hou, Jingbo Shang
Affiliations: University of California, San Diego
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Role-playing (RP) agents rely on behavioral profiles to act consistently across diverse narrative contexts, yet existing profiles are largely unstructured, non-executable, and weakly validated, leading to brittle agent behavior. We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents behavioral profiles as a tree of conditional rules, where internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, enabling deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. Across multiple benchmarks, CDT substantially outperforms human-written profiles and prior profile induction methods on 85 characters across 16 artifacts, indicating that codified and validated behavioral representations lead to more reliable agent grounding.
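
The tree-of-conditional-rules representation can be sketched as a small data structure: internal nodes test a scene condition, leaves hold behavioral statements, and retrieval walks every branch whose conditions hold. This is only an illustration of the data structure the abstract describes. The `RuleNode` class, the dictionary-based scene encoding, and the example character rules are hypothetical, and the learning procedure (rule induction, validation, specialization) is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Scene = Dict[str, str]

@dataclass
class RuleNode:
    """Internal nodes carry a scene condition; leaves also carry statements."""
    condition: Callable[[Scene], bool]
    statements: List[str] = field(default_factory=list)
    children: List["RuleNode"] = field(default_factory=list)

def retrieve_rules(node: RuleNode, scene: Scene) -> List[str]:
    """Collect statements from every leaf whose path conditions all hold."""
    if not node.condition(scene):
        return []
    if not node.children:
        return list(node.statements)
    rules: List[str] = []
    for child in node.children:
        rules.extend(retrieve_rules(child, scene))
    return rules

# Hypothetical profile for an invented character.
tree = RuleNode(
    condition=lambda s: True,
    children=[
        RuleNode(
            condition=lambda s: s.get("setting") == "battle",
            children=[
                RuleNode(lambda s: s.get("ally_in_danger") == "yes",
                         ["Shields the ally before attacking."]),
                RuleNode(lambda s: s.get("ally_in_danger") == "no",
                         ["Presses the attack."]),
            ],
        ),
        RuleNode(lambda s: s.get("setting") == "court",
                 ["Speaks formally and avoids direct insults."]),
    ],
)
print(retrieve_rules(tree, {"setting": "battle", "ally_in_danger": "yes"}))
```

Because retrieval is a plain tree walk over explicit predicates, the same scene always yields the same rules, which is the deterministic property the abstract highlights.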

[NLP-57] Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts

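【Quick read】: This paper addresses the memory bottleneck in reinforcement learning (RL) training of large language models (LLMs) caused by storing Key-Value (KV) caches during long-horizon rollouts. Existing KV compression techniques help at inference time, but applying them directly to RL training induces a severe policy mismatch and leads to performance collapse. The key is Sparse-RL, a framework that uses Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression. This enables stable training under sparse rollouts, preserves performance, and improves model robustness under sparse inference deployment.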

Link: https://arxiv.org/abs/2601.10079
Authors: Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang
Affiliations: Renmin University of China; Key Laboratory of Data Engineering and Knowledge Engineering; Ant Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, which empowers stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving the performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
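
The general idea of correcting for a mismatch between the policy that sampled a rollout (with a compressed KV cache) and the dense policy can be sketched with a textbook importance ratio. This is not the paper's algorithm: the abstract gives no formulas, so the sequence-level ratio, the `min_ratio` rejection rule, the `clip` bound, and the field names below are assumptions chosen for illustration.

```python
import math

def sequence_ratio(logp_dense, logp_sparse):
    """Likelihood ratio pi_dense(y) / pi_sparse(y) from per-token log-probs."""
    return math.exp(sum(d - s for d, s in zip(logp_dense, logp_sparse)))

def reject_and_reweight(rollouts, min_ratio=0.5, clip=2.0):
    """Drop rollouts with a very low ratio; clip the weight of the rest."""
    kept = []
    for rollout in rollouts:
        ratio = sequence_ratio(rollout["logp_dense"], rollout["logp_sparse"])
        if ratio < min_ratio:
            continue  # rejected: the dense policy finds this sample unlikely
        kept.append((rollout["id"], min(ratio, clip)))
    return kept

# Made-up per-token log-probabilities for three sampled sequences.
rollouts = [
    {"id": "a", "logp_dense": [-1.0, -1.0], "logp_sparse": [-1.0, -1.0]},
    {"id": "b", "logp_dense": [-3.0], "logp_sparse": [-1.0]},
    {"id": "c", "logp_dense": [-0.5], "logp_sparse": [-2.0]},
]
print(reject_and_reweight(rollouts))
```

The returned weights would multiply each kept rollout's contribution to the policy-gradient loss. Clipping bounds the variance that large ratios would otherwise introduce.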

[NLP-58] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment

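【Quick read】: This paper addresses the problem that teacher-generated Chain-of-Thought (CoT) reasoning trajectories are often excessively long and structurally complex, which makes them hard for student models to learn from. This mismatch between the supervision signal and the student's learning capacity limits the reasoning gains of small-scale student models. The key is Prefix-ALIGNment distillation (P-ALIGN), which relies on adaptive prefix alignment: it first adaptively truncates each teacher-generated trajectory to a concise and sufficient prefix, then uses that prefix to supervise the student model. This makes the supervision more effective and avoids the negative impact of redundant and uncertain reasoning components. On multiple mathematical reasoning benchmarks, P-ALIGN outperforms all baselines by over 3%.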

Link: https://arxiv.org/abs/2601.10064
Authors: Zhenghao Liu, Zhuoyang Wu, Xinze Li, Yukun Yan, Shuo Wang, Zulong Chen, Yu Gu, Ge Yu, Maosong Sun
Affiliations: Northeastern University; Tsinghua University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at this https URL.

[NLP-59] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels

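【Quick read】: This paper addresses the wide and shifting range of emotions experienced by patients with chronic medical conditions over the course of disease management, which conventional dialogue systems struggle to capture and respond to. The key is EmplifAI, a Japanese empathetic dialogue dataset built on situation-based dialogues grounded in 28 fine-grained emotion categories adapted from the GoEmotions taxonomy. It contains 280 medically contextualized situations and 4125 two-turn dialogues, collected through crowdsourcing and expert review. Fine-tuning a baseline Japanese LLM on EmplifAI yields notable improvements in fluency, general empathy, and emotion-specific empathy.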

Link: https://arxiv.org/abs/2601.10033
Authors: Wan Jou She, Lis Kanashiro Pereira, Fei Cheng, Sakiko Yahata, Panote Siriaraya, Eiji Aramaki
Affiliations: Kyoto Institute of Technology; National Institute of Information and Communications Technology; Kyoto University; Nara Institute of Science and Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. They often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation–dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.

[NLP-60] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records

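【Quick read】: This paper addresses a limitation of current natural language question-answering (QA) systems over Electronic Health Records (EHRs): most are evaluated only on benchmark datasets and are not shown to handle the heterogeneous, multimodal data and temporal reasoning demands of real hospital environments. The key is EHRNavigator, a multi-agent framework in which AI agents perform patient-level question answering across heterogeneous and multimodal EHR data. Evaluated on both public benchmark and institutional datasets under realistic hospital conditions, it achieves 86% accuracy on real-world cases while maintaining clinically acceptable response times.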

Link: https://arxiv.org/abs/2601.10020
Authors: Lingfei Qian, Mauro Giuffre, Yan Wang, Huan He, Qianqian Xie, Xuguang Ai, Xeuqing Peng, Fan Ma, Ruey-Ling Weng, Donald Wright, Adan Wang, Qingyu Chen, Vipina K. Keloth, Hua Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Clinical decision-making increasingly relies on timely and context-aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question-answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi-agent framework that harnesses AI agents to perform patient-level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician-validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real-world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.

[NLP-61] Continuous-Depth Transformers with Learned Control Dynamics

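【Quick read】: This paper addresses the lack of flexible control in standard Transformer models during generation, that is, the difficulty of steering attributes of the output (such as sentiment) at inference time. The key is a hybrid architecture that replaces the discrete middle layers with a continuous-depth Neural Ordinary Differential Equation (ODE) block and injects a low-dimensional control (steering) signal into the ODE vector field via explicit concatenation. The approach reaches latency parity with standard discrete baselines, shows negligible trajectory divergence between fixed and adaptive solvers, and reveals that the control signal partitions the vector field into dynamical regimes with different curvature characteristics. The adjoint method enables O(1) memory training regardless of integration depth.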

Link: https://arxiv.org/abs/2601.10007
Authors: Peter Jemley
Affiliations: Northeastern University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures. Code available at: this https URL

Abstract:We present a hybrid transformer architecture that replaces discrete middle layers with a continuous-depth Neural Ordinary Differential Equation (ODE) block, enabling inference-time control over generation attributes via a learned steering signal. Unlike standard transformers that process representations through fixed discrete layers, our approach treats depth as a continuous variable governed by a learned vector field $F_\theta(H, \tau, u)$, where $u$ is a low-dimensional control signal injected via explicit concatenation. We validate the architecture through four experiments: (1) gradient flow stability with zero exploding/vanishing gradient events, (2) semantic steering achieving 98%/88% accuracy for positive/negative sentiment control, (3) continuous interpolation validated by a negligible 0.068% trajectory divergence between fixed and adaptive solvers, and (4) efficiency benchmarking demonstrating latency parity with standard discrete baselines. Additionally, we show that adaptive ODE solvers reveal geometric structure in the learned dynamics: the control signal partitions the vector field into distinct dynamical regimes with different curvature characteristics. The adjoint method enables O(1) memory training regardless of integration depth. Our results demonstrate that continuous-depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation.
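
To make the idea of depth as a continuous variable concrete, the toy sketch below integrates dH/dτ = F(H, τ, u) with the control signal u concatenated onto the input of the vector field. It is a deliberately simplified stand-in: the one-layer tanh vector field, the hand-picked weights, and the fixed-step Euler solver are assumptions for illustration, whereas the abstract refers to a learned vector field, adaptive solvers, and adjoint-based training.

```python
import math

def vector_field(h, tau, u, weights):
    """Toy F(H, tau, u): one tanh layer over the concatenation [H, tau, u]."""
    x = list(h) + [tau] + list(u)
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def integrate(h0, u, weights, steps=10):
    """Fixed-step Euler integration of dH/dtau = F(H, tau, u) over tau in [0, 1]."""
    h = list(h0)
    dt = 1.0 / steps
    for k in range(steps):
        dh = vector_field(h, k * dt, u, weights)
        h = [hi + dt * di for hi, di in zip(h, dh)]
    return h

# Hidden size 2 and control size 1, so each weight row has 2 + 1 + 1 entries.
weights = [[0.0, 0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0, -1.0]]
print(integrate([0.0, 0.0], [1.0], weights))   # state pushed by u = +1
print(integrate([0.0, 0.0], [-1.0], weights))  # state pushed the opposite way
```

With these weights only the control entry drives the dynamics, so flipping the sign of `u` flips the final state. That is the steering behaviour in miniature.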

[NLP-62] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction

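【Quick read】: This paper addresses a fundamental trade-off in constructing Knowledge Graphs (KGs) with large language models (LLMs): high factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. The key is SocraticKG, which introduces question-answer pairs as a structured intermediate representation that systematically unfolds document-level semantics before triple extraction. Its 5W1H-guided QA expansion captures contextual dependencies and implicit relational links, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. On the MINE benchmark, the method improves the balance between coverage and connectivity, achieving higher factual retention while maintaining high structural cohesion.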

Link: https://arxiv.org/abs/2601.10003
Authors: Sanghyeok Choi, Woosang Jeon, Kyuseok Yang, Taehyeong Kim
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.

[NLP-63] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG

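【Quick read】: This paper addresses the significant performance degradation of Neural Machine Translation (NMT) models for low-resource languages under domain shift. The test case is Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When an NMT model fine-tuned on the NT is applied to the unseen Old Testament (OT), its score drops from 36.17 chrF++ to 27.11 chrF++. To recover the loss, the authors propose a hybrid framework: a fine-tuned NMT model generates an initial draft, which a Large Language Model (LLM) then refines using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10), close to the original in-domain quality. The analysis shows that the gain depends primarily on the number of retrieved examples rather than on the retrieval algorithm, and that the LLM acts as a "safety net" that repairs severe failures in zero-shot domains.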

Link: https://arxiv.org/abs/2601.09982
Authors: David Samuel Setiawan, Raphaël Merx, Jey Han Lau
Affiliation: The University of Melbourne
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust “safety net,” repairing severe failures in zero-shot domains.

[NLP-64] SPRInG: Continual LLM Personalization via Selective Parametric Adaptation and Retrieval-Interpolated Generation

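[Quick Read]: This paper addresses preference drift in the continual personalization of large language models: user interests change over time, and conventional methods, unable to separate genuine preference shifts from transient contextual noise, are prone to catastrophic forgetting. The key to the solution is SPRInG, a semi-parametric framework. During training, a likelihood-based scoring function drives "drift-driven selective adaptation", updating only on high-novelty interactions while storing hard-to-learn residuals in a replay buffer. During inference, strict relevance gating is applied and parametric knowledge is fused with retrieved history through logit interpolation, which preserves stability while capturing long-term changes in user preferences.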

Link: https://arxiv.org/abs/2601.09974
Authors: Seoyeon Kim, Jaehyung Kim
Affiliation: Yonsei University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review, 23 pages

Abstract:Personalizing Large Language Models typically relies on static retrieval or one-time adaptation, assuming user preferences remain invariant over time. However, real-world interactions are dynamic, where user interests continuously evolve, posing a challenge for models to adapt to preference drift without catastrophic forgetting. Standard continual learning approaches often struggle in this context, as they indiscriminately update on noisy interaction streams, failing to distinguish genuine preference shifts from transient contexts. To address this, we introduce SPRInG, a novel semi-parametric framework designed for effective continual personalization. During training, SPRInG employs drift-driven selective adaptation, which utilizes a likelihood-based scoring function to identify high-novelty interactions. This allows the model to selectively update the user-specific adapter on drift signals while preserving hard-to-learn residuals in a replay buffer. During inference, we apply strict relevance gating and fuse parametric knowledge with retrieved history via logit interpolation. Experiments on the long-form personalized generation benchmark demonstrate that SPRInG outperforms existing baselines, validating its robustness for real-world continual personalization.
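The inference step described above (relevance gating followed by logit interpolation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold `gate`, the weight `lam`, and the choice to blend normalized log-probabilities are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    x = np.asarray(logits, dtype=float)
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())

def interpolate_logits(param_logits, retr_logits, relevance, gate=0.5, lam=0.3):
    """Blend parametric and retrieval-conditioned next-token scores.

    Hypothetical rule: if the retrieved history scores below `gate`,
    ignore it entirely; otherwise mix the two log-distributions with
    weight `lam` and renormalize.
    """
    p = log_softmax(param_logits)
    if relevance < gate:          # strict gating: retrieval is discarded
        return p
    r = log_softmax(retr_logits)
    return log_softmax((1.0 - lam) * p + lam * r)

param = [2.0, 0.0, -1.0]          # toy scores from the adapted model
retr = [0.0, 3.0, 0.0]            # toy scores conditioned on retrieved history
gated_out = interpolate_logits(param, retr, relevance=0.1)
mixed_out = interpolate_logits(param, retr, relevance=0.9)
```

With low relevance the output equals the parametric distribution unchanged; with high relevance, probability mass moves toward the token favored by the retrieved history.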

[NLP-65] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

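[Quick Read]: This paper targets the reliance on expensive human pilot studies to estimate item difficulty in standardized math assessments, and proposes using open-source Large Language Models (LLMs) to simulate real student populations as an alternative. The key idea is to have the LLM role-play students in grades 4, 8, and 12, generate simulated response data, fit Item Response Theory (IRT) models to that data, and compare the estimated difficulty parameters with the real difficulty statistics provided by the National Assessment of Educational Progress (NAEP). Experiments show correlations of 0.75 to 0.82 across grades. Using named students (stratified by gender and race) and mathematically weaker but more "robust" open-source models such as Gemma markedly improves prediction accuracy.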

Link: https://arxiv.org/abs/2601.09953
Authors: Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger
Affiliation: University of Maryland, College Park
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a “classroom” of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different “classroom sizes,” showing tradeoffs between computation size and accuracy. We find that role-plays with named students improves predictions (compared to student ids), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
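To make the IRT step concrete, here is a minimal sketch that fits a one-parameter (Rasch) model to a simulated response matrix and correlates the estimated item difficulties with the known ones. It is an illustration only: the abstract does not say which IRT variant or fitting procedure the authors use, the "classroom" below is generated from a random number generator rather than an LLM, and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(responses, steps=300, lr=0.5):
    """Joint maximum-likelihood fit of a Rasch model by gradient ascent.

    responses: (n_students, n_items) array of 0/1 outcomes.
    Returns (ability per student, difficulty per item).
    """
    n_students, n_items = responses.shape
    theta = np.zeros(n_students)          # student abilities
    b = np.zeros(n_items)                 # item difficulties
    for _ in range(steps):
        p = sigmoid(theta[:, None] - b[None, :])
        resid = responses - p             # gradient of the log-likelihood
        theta += lr * resid.mean(axis=1)
        b -= lr * resid.mean(axis=0)
        b -= b.mean()                     # fix the scale's origin
    return theta, b

# Simulated classroom: 200 students, 10 items of increasing difficulty.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=200)
true_b = np.linspace(-1.5, 1.5, 10)
probs = sigmoid(true_theta[:, None] - true_b[None, :])
responses = (rng.random(probs.shape) < probs).astype(float)

_, b_hat = fit_rasch(responses)
corr = np.corrcoef(b_hat, true_b)[0, 1]   # analogue of the reported correlations
```

In the paper's setting, `responses` would come from LLM role-play and `true_b` would be replaced by NAEP item statistics; the correlation computed at the end plays the role of the 0.75 to 0.82 figures.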

[NLP-66] Self-reflection in Automated Qualitative Coding: Improving Text Annotation through Secondary LLM Critique

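[Quick Read]: This paper addresses the high false-positive rate of zero- and few-shot Large Language Model (LLM) classifiers in qualitative coding, which persists even with carefully designed prompts. The key to the solution is a two-stage workflow. In stage one, an LLM produces initial annotations using a human-designed, LLM-adapted codebook. In stage two, a secondary LLM critic performs self-reflection on each positive label, re-reading the source text alongside the first model's rationale before issuing a final decision. The method raises F1 by 0.04 to 0.25, lifting the two weakest codes from 0.52 and 0.55 to 0.69 and 0.79. It also identifies two recurrent error classes, misinterpretation and meta-discussion, and corrects them with targeted critic clauses, giving precision-first noise control at low compute cost.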

Link: https://arxiv.org/abs/2601.09905
Authors: Zackary Okun Dunivin, Mobina Noori, Seth Frey, Curtis Atkinson
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) allow for sophisticated qualitative coding of large datasets, but zero- and few-shot classifiers can produce an intolerable number of errors, even with careful, validated prompting. We present a simple, generalizable two-stage workflow: an LLM applies a human-designed, LLM-adapted codebook; a secondary LLM critic performs self-reflection on each positive label by re-reading the source text alongside the first model’s rationale and issuing a final decision. We evaluate this approach on six qualitative codes over 3,000 high-content emails from Apache Software Foundation project evaluation discussions. Our human-derived audit of 360 positive annotations (60 passages by six codes) found that the first-line LLM had a false-positive rate of 8% to 54%, despite F1 scores of 0.74 and 1.00 in testing. Subsequent recoding of all stage-one annotations via a second self-reflection stage improved F1 by 0.04 to 0.25, bringing two especially poor performing codes up to 0.69 and 0.79 from 0.52 and 0.55 respectively. Our manual evaluation identified two recurrent error classes: misinterpretation (violations of code definitions) and meta-discussion (debate about a project evaluation criterion mistaken for its use as a decision justification). Code-specific critic clauses addressing observed failure modes were especially effective with testing and refinement, replicating the codebook-adaption process for LLM interpretation in stage-one. We explain how favoring recall in first-line LLM annotation combined with secondary critique delivers precision-first, compute-light control. With human guidance and validation, self-reflection slots into existing LLM-assisted annotation pipelines to reduce noise and potentially salvage unusable classifiers.

[NLP-67] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

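[Quick Read]: This paper asks why language model (LM) probabilities outperform human cloze-task probabilities as predictors of processing effort, that is, whether the LM advantage reflects a plausible cognitive mechanism rather than a mere property of the data. The key is an empirical test of three hypotheses: LM probabilities do not suffer from low resolution, they distinguish semantically similar words, and they assign more accurate probabilities to low-frequency words. The findings suggest that the LM advantage is not only statistical but may reflect finer-grained predictive ability, which points future work toward the differences between human and LM prediction.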

Link: https://arxiv.org/abs/2601.09886
Authors: Sathvik Nair, Byung-Doh Oh
Affiliation: University of Maryland; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 7 figures

Abstract:How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.

[NLP-68] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL EACL2026

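[Quick Read]: This paper addresses text-to-SQL in real-world clinical settings, where producing executable queries from electronic health records (EHR) requires reasoning over multi-table joins, temporal windows, and patient-similarity cohorts. The key contribution is the CLINSQL benchmark: 633 expert-annotated tasks on MIMIC-IV v3.1 that demand multi-table joins, clinically meaningful filters, and executable SQL, evaluated with rubric-based SQL analysis and execution checks. The benchmark tests models on long contexts, clinical coding systems, and multi-step reasoning, and marks a step toward clinically usable text-to-SQL.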

Link: https://arxiv.org/abs/2601.09876
Authors: Yifei Shen, Yilun Zhao, Justice Ou, Tinglin Huang, Arman Cohan
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by EACL 2026

Abstract:Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains 74.7% execution score, DeepSeek-R1 leads open-source at 69.2% and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.

[NLP-69] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

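[Quick Read]: This paper addresses three common failures of current Large Language Models (LLMs) in scientific paper generation: poor global structure, incomplete coverage of the input, and inconsistent citations, all of which limit their reliability for real research writing. The key to the solution is a reinforcement learning framework that casts outline construction as long-horizon planning over a hierarchical document structure. Structured actions model the incremental evolution of the outline, and optimization proceeds in two stages: the first reconstructs outlines backward from partial plans to enforce global structural consistency; the second applies forward value-guided reinforcement learning with rewards that explicitly model scientific correctness, discourse coherence, and citation fidelity. Together these improve the structure and factual accuracy of the generated papers.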

Link: https://arxiv.org/abs/2601.09858
Authors: Yilin Bao, Ziyao He, Zayden Yang
Affiliation: UC San Diego; Ohio State University; Sheltered.AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edit evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stable learning, we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, we introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.

[NLP-70] Thinking Long but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models EACL2026

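[Quick Read]: This paper addresses two core problems of sequential test-time scaling for large reasoning models: accuracy degrades and becomes unstable as reasoning length grows, and existing methods usually require the reasoning length to be tuned per task. The key to the solution is a new method, Min-Seek, which keeps the key-value (KV) pairs of only one additional induced thought in the KV cache. A custom KV cache stores keys without position embeddings and dynamically encodes them contiguously before each newly generated thought. This yields stable and efficient long-horizon reasoning, improves accuracy across a range of reasoning tasks, lets the model reason beyond its maximum context length, and has linear computational complexity under mild conditions.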

Link: https://arxiv.org/abs/2601.09855
Authors: Michael R. Metel, Yufei Cui, Boxing Chen, Prasanna Parthasarathi
Affiliation: Huawei Noah's Ark Lab
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Findings of EACL 2026

Abstract:Sequential test-time scaling is a promising training-free method to improve large reasoning model accuracy, but as currently implemented, significant limitations have been observed. Inducing models to think for longer can increase their accuracy, but as the length of reasoning is further extended, it has also been shown to result in accuracy degradation and model instability. This work presents a novel sequential test-time scaling method, Min-Seek, which improves model accuracy significantly over a wide range of induced thoughts, stabilizing the accuracy of sequential scaling, and removing the need for reasoning length fine-tuning. Beyond improving model accuracy over a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache which stores keys without position embeddings, by dynamically encoding them contiguously before each new generated thought, our method can continue to reason well beyond a model’s maximum context length, and under mild conditions has linear computational complexity.

[NLP-71] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

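[Quick Read]: This paper examines whether generative AI can perform "redirection" when real-world health questions rest on false premises: recognizing the mistaken assumption in a patient's question, correcting it, and then giving safe and accurate medical guidance. The key contribution is MedRedFlag, a semi-automatically curated dataset of more than 1,100 real health questions from Reddit that all require redirection. A systematic comparison of state-of-the-art Large Language Model (LLM) responses against clinician responses reveals a clear deficiency: even when the false premise is detected, LLMs often answer the original question instead of redirecting, which can lead to suboptimal or even dangerous medical decisions and exposes a major safety gap in patient-facing medical AI systems.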

Link: https://arxiv.org/abs/2601.09853
Authors: Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal
Affiliation: Duke University; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at this https URL.

[NLP-72] Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences

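[Quick Read]: This paper asks whether general-purpose statistical learners such as Vision Language Models distinguish inductive inferences based on subtle differences in linguistic form, as human children do, for example by responding differently to generic statements, universally quantified NPs, and indefinite plural NPs when extending a novel property. The key to the approach is a replication of the original experimental paradigm: the models first pass precondition tests (robust category identification in images and sensitivity to "all" and "some"), then perform the original task, and their behavior aligns with that of humans. Post-hoc analysis further shows that the distinction is organized by inductive constraints rather than by surface-form differences, suggesting that the models hold human-like, graded semantic representations.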

Link: https://arxiv.org/abs/2601.09852
Authors: Sriram Padmanabhan, Siyuan Song, Kanishka Misra
Affiliation: The University of Texas at Austin
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language places subtle constraints on how we make inductive inferences. Developmental evidence by Gelman et al. (2002) has shown children (4 years and older) to differentiate among generic statements (“Bears are daxable”), universally quantified NPs (“all bears are daxable”) and indefinite plural NPs (“some bears are daxable”) in extending novel properties to a specific member (all > generics > some), suggesting that they represent these types of propositions differently. We test if these subtle differences arise in general purpose statistical learners like Vision Language Models, by replicating the original experiment. On tasking them through a series of precondition tests (robust identification of categories in images and sensitivities to all and some), followed by the original experiment, we find behavioral alignment between models and humans. Post-hoc analyses on their representations revealed that these differences are organized based on inductive constraints and not surface-form differences.

[NLP-73] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations

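[Quick Read]: This paper addresses the poor stability and weak explainability of questionnaire-based personality trait evaluation for Large Language Models (LLMs): results are sensitive to small changes in prompt wording or role-play configuration and are therefore unreliable. The key to the solution is an evaluation method based on internal activations, Persona-Vector Neutrality Interpolation (PVNI). It uses contrastive prompts to extract a persona vector associated with the target trait from the model's internal activations, then interpolates along that vector as an anchor axis to estimate a neutral score. This gives an interpretable comparison between the neutral prompt representation and the persona direction, and substantially improves the stability and explainability of the evaluation.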

Link: https://arxiv.org/abs/2601.09833
Authors: Xiaoxu Ma, Xiangbo Zhang, Zhenyu Weng
Affiliation: Georgia Institute of Technology; South China University of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating personality traits in Large Language Models (LLMs) is key to model interpretation, comparison, and responsible deployment. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation-based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model’s internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.
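One plausible reading of "interpolating along the persona vector as an anchor axis" is a scalar projection between two contrastive anchors. The sketch below follows that reading; it is an assumption for illustration, not the PVNI procedure itself, and the function names and toy 2-D "activations" are hypothetical.

```python
import numpy as np

def persona_axis(pos_acts, neg_acts):
    """Persona vector as the difference of mean activations.

    pos_acts / neg_acts: (n_prompts, hidden_dim) activations collected
    under trait-positive and trait-negative contrastive prompts.
    Returns (persona vector, trait-negative anchor).
    """
    mu_pos = np.mean(pos_acts, axis=0)
    mu_neg = np.mean(neg_acts, axis=0)
    return mu_pos - mu_neg, mu_neg

def neutral_score(neutral_act, pos_acts, neg_acts):
    """Position of the neutral-prompt activation along the persona axis.

    0.0 means it sits at the trait-negative anchor, 1.0 at the
    trait-positive anchor; components orthogonal to the axis are ignored.
    """
    v, mu_neg = persona_axis(pos_acts, neg_acts)
    return float(np.dot(neutral_act - mu_neg, v) / np.dot(v, v))

# Toy activations in a 2-D hidden space.
pos = np.array([[2.0, 0.0], [2.0, 2.0]])   # mean is [2, 1]
neg = np.array([[0.0, 0.0], [0.0, 2.0]])   # mean is [0, 1]
score = neutral_score(np.array([1.0, 5.0]), pos, neg)
```

Because the score depends only on the component of the activation along the persona direction, small prompt variations that move the representation in other directions leave it unchanged, which is one intuition for the stability the abstract reports.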

[NLP-74] The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit

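[Quick Read]: This paper aims to expose the mathematical nature of Transformer self-attention in the high-confidence limit (inverse temperature $\beta \to \infty$), and with it the geometric structure and optimization logic of the computation. The key result is a proof that, in this limit, softmax attention is equivalent to an operation in the tropical semiring (max-plus algebra): attention becomes a tropical matrix product. It follows that the Transformer's forward pass effectively executes a dynamic programming recurrence (specifically, a Bellman-Ford path-finding update) on a latent graph defined by token similarities, so that chain-of-thought reasoning emerges from a shortest-path (or longest-path) algorithm implicit in the network.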

Link: https://arxiv.org/abs/2601.09775
Authors: Faruk Alpay, Bilge Senturk
Affiliation: Bahçeşehir University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 7 pages, 2 figures

Abstract:We prove that the Transformer self-attention mechanism in the high-confidence regime ($\beta \to \infty$, where $\beta$ is an inverse temperature) operates in the tropical semiring (max-plus algebra). In particular, we show that taking the tropical limit of the softmax attention converts it into a tropical matrix product. This reveals that the Transformer’s forward pass is effectively executing a dynamic programming recurrence (specifically, a Bellman-Ford path-finding update) on a latent graph defined by token similarities. Our theoretical result provides a new geometric perspective for chain-of-thought reasoning: it emerges from an inherent shortest-path (or longest-path) algorithm being carried out within the network’s computation.
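The tropical limit rests on a standard identity: $\frac{1}{\beta}\log\sum_i e^{\beta x_i} \to \max_i x_i$ as $\beta \to \infty$, so softmax weights collapse onto the largest score. The sketch below checks this numerically and shows a max-plus matrix product. It illustrates the underlying algebra only and does not reproduce the paper's proof; the function names are made up for this example.

```python
import numpy as np

def scaled_logsumexp(scores, beta):
    """(1/beta) * log(sum(exp(beta * scores))), computed stably."""
    z = beta * np.asarray(scores, dtype=float)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / beta

def softmax(scores, beta):
    """Softmax at inverse temperature beta."""
    z = beta * np.asarray(scores, dtype=float)
    w = np.exp(z - z.max())
    return w / w.sum()

def maxplus_matmul(A, B):
    """Tropical product: C[i, j] = max over k of (A[i, k] + B[k, j])."""
    return (A[:, :, None] + B[None, :, :]).max(axis=1)

scores = [1.0, 3.0, 2.0]
soft_max_value = scaled_logsumexp(scores, beta=200.0)   # approaches max(scores)
weights = softmax(scores, beta=200.0)                   # approaches one-hot

A = np.array([[0.0, 1.0], [2.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 3.0]])
C = maxplus_matmul(A, B)
```

Replacing "sum" with "max" and "product" with "plus" is exactly the relaxation step of Bellman-Ford style path finding, which is why a max-plus matrix product can be read as one round of best-path updates on a weighted graph.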

[NLP-75] Antisocial behavior towards large language model users: experimental evidence

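[Quick Read]: This paper asks whether the spread of Large Language Models (LLMs) provokes negative social reactions, and in particular whether LLM users face costly punishment in practice. Earlier work documented negative attitudes toward AI users but did not test whether those attitudes translate into actual sanctions. The key to the approach is a two-phase online experiment (N = 491 participants) in which participants could spend their own resources to reduce the earnings of peers who had completed a real-effort task, which measures social punishment directly. Peers who relied exclusively on the model lost on average 36% of their earnings, and punishment increased monotonically with actual LLM use. Mismatch between self-reported and actual use also mattered: declarations of "no use" were punished more harshly than actual non-use, while at high levels of use actual reliance was punished more than self-reported reliance. This is the first behavioral evidence that the efficiency gains of LLMs carry a social cost.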

Link: https://arxiv.org/abs/2601.09772
Authors: Paweł Niszczota, Cassandra Grützner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); General Economics (econ.GN)
Comments:

Abstract:The rapid spread of large language models (LLMs) has raised concerns about the social reactions they provoke. Prior research documents negative attitudes toward AI users, but it remains unclear whether such disapproval translates into costly action. We address this question in a two-phase online experiment (N = 491 Phase II participants; Phase I provided targets) where participants could spend part of their own endowment to reduce the earnings of peers who had previously completed a real-effort task with or without LLM support. On average, participants destroyed 36% of the earnings of those who relied exclusively on the model, with punishment increasing monotonically with actual LLM use. Disclosure about LLM use created a credibility gap: self-reported null use was punished more harshly than actual null use, suggesting that declarations of “no use” are treated with suspicion. Conversely, at high levels of use, actual reliance on the model was punished more strongly than self-reported reliance. Taken together, these findings provide the first behavioral evidence that the efficiency gains of LLMs come at the cost of social sanctions.

[NLP-76] Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute

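[Quick Read]: This paper addresses the limits that privacy-sensitive identifiers place on secondary use of veterinary electronic health records (vEHRs), in a low-resource domain that lacks benchmarks and training data for de-identification. The core of the study is an evaluation of whether synthetic narratives generated by a large language model (LLM) improve de-identification safety under different training regimes, with the key distinction between "synthetic augmentation" and "fixed-budget substitution". Synthetic data used with epoch-scaled augmentation improves model performance and lowers the document-level leakage rate, but only as a complement; replacing real notes with synthetic ones increases leakage, so synthetic data cannot be used on its own where safety requirements are high.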

Link: https://arxiv.org/abs/2601.09756
Authors: David Brundage
Affiliation: University of Wisconsin-Madison, School of Veterinary Medicine
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout/1,249 train). We generated 10,382 synthetic notes using a privacy-preserving “template-only” regime where identifiers were removed prior to LLM prompting. Three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as the primary safety outcome. Results show that under fixed-sample substitution, replacing real notes with synthetic ones monotonically increased leakage, indicating synthetic data cannot safely replace real supervision. Under compute-matched training, moderate synthetic mixing matched real-only performance, but high synthetic dominance degraded utility. Conversely, epoch-scaled augmentation improved performance: PetBERT span-overlap F1 increased from 0.831 to 0.850 +/- 0.014, and leakage decreased from 6.32% to 4.02% +/- 0.19%. However, these gains largely reflect increased training exposure rather than intrinsic synthetic data quality. Corpus diagnostics revealed systematic synthetic-real mismatches in note length and label distribution that align with persistent leakage. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
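The primary safety outcome, document-level leakage rate, is defined in the abstract as the fraction of documents with at least one missed identifier. A minimal sketch of that computation follows. The matching rule (a gold identifier counts as caught if any predicted span overlaps it) and the span representation are assumptions; the abstract does not state them.

```python
def overlaps(a, b):
    """True if half-open character spans a=(start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def document_leakage_rate(gold_docs, pred_docs):
    """Fraction of documents in which at least one gold identifier span
    is not covered by any predicted span.

    gold_docs / pred_docs: one list of (start, end) spans per document.
    """
    leaked = 0
    for gold, pred in zip(gold_docs, pred_docs):
        if any(not any(overlaps(g, p) for p in pred) for g in gold):
            leaked += 1
    return leaked / len(gold_docs)

# Toy corpus of four notes (spans are illustrative, not real data).
gold = [[(0, 5), (10, 14)], [(3, 8)], [], [(1, 4)]]
pred = [[(0, 5)], [(2, 9)], [], [(1, 4), (20, 25)]]
rate = document_leakage_rate(gold, pred)   # only the first note leaks
```

The metric is deliberately strict: a note with ten identifiers and nine correct redactions still counts as leaked, which is why it can stay high even when span-level F1 looks good.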

[NLP-77] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis AAAI

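[Quick Read]: This paper addresses hallucination in Large Language Models (LLMs), the generation of content inconsistent with facts or context, which severely limits reliable deployment in critical domains. Existing work mostly uses binary "detection", which can flag hallucinations but offers no interpretable or actionable feedback for improving the model. The paper therefore proposes a shift from "detection" to "diagnosis" and introduces the Hallucination Diagnosis Task, which requires a model not only to detect hallucinations but also to localize the error, explain its cause, and correct the content. The key to the solution is an automated data pipeline, the Hallucination Diagnosis Generator (HDG), which uses multi-dimensional augmentation strategies such as controlled fact fabrication and reasoning chain perturbation to generate high-quality training samples with rich diagnostic metadata from raw corpora. On this data, the HDM-4B-RL model is trained with Group Relative Policy Optimization (GRPO) and a composite reward covering structural, accuracy, and localization signals. It surpasses previous detection models on the HaluEval benchmark and matches larger general-purpose models on diagnosis tasks, supporting the effectiveness and practicality of hallucination diagnosis.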

Link: https://arxiv.org/abs/2601.09734
Authors: Yanyi Liu, Qingwen Yang, Tiezheng Guo, Feiyu Qu, Jun Liu, Yingyou Wen
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at The 40th Annual AAAI Conference on Artificial Intelligence

Abstract:Hallucinations in Large Language Models (LLMs), defined as the generation of content inconsistent with facts or context, represent a core obstacle to their reliable deployment in critical domains. Current research primarily focuses on binary “detection” approaches that, while capable of identifying hallucinations, fail to provide interpretable and actionable feedback for model improvement, thus limiting practical utility. To address this limitation, a new research paradigm is proposed, shifting from “detection” to “diagnosis”. The Hallucination Diagnosis Task is introduced, a task which requires models to not only detect hallucinations, but also perform error localization, causal explanation, and content correction. We develop the Hallucination Diagnosis Generator (HDG), an automated pipeline that systematically generates high-quality training samples with rich diagnostic metadata from raw corpora through multi-dimensional augmentation strategies including controlled fact fabrication and reasoning chain perturbation. Using HDG-generated data, we train HDM-4B-RL, a 4-billion-parameter hallucination diagnosis model, employing Group Relative Policy Optimization (GRPO) with a comprehensive reward function incorporating structural, accuracy, and localization signals. Experimental results demonstrate that our model surpasses previous state-of-the-art detection models on the HaluEval benchmark while achieving comparable performance to advanced general-purpose models. In comprehensive diagnosis tasks, HDM-4B-RL matches the capabilities of larger general models while maintaining a smaller size. This work validates the feasibility and value of hallucination diagnosis, providing an effective methodology for building more trustworthy and reliable generative AI systems.
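GRPO scores each sampled response relative to the other responses drawn for the same prompt, standardizing rewards within the group instead of using a learned value baseline. The sketch below combines that normalization with a toy three-part reward. The reward weights and component scores are hypothetical; the abstract names the three signal types but not how they are combined.

```python
import numpy as np

def composite_reward(structure_ok, accuracy, localization,
                     weights=(0.2, 0.5, 0.3)):
    """Weighted sum of structural, accuracy, and localization signals.
    The weights are illustrative placeholders, not values from the paper."""
    w_s, w_a, w_l = weights
    return w_s * float(structure_ok) + w_a * accuracy + w_l * localization

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to a prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled diagnoses for the same input (toy scores).
rewards = [
    composite_reward(True, 1.0, 0.8),    # well-formed, correct, well localized
    composite_reward(True, 0.0, 0.1),    # well-formed but wrong
    composite_reward(False, 1.0, 0.5),   # correct but malformed output
    composite_reward(True, 1.0, 0.2),    # correct, poorly localized
]
adv = group_relative_advantages(rewards)
```

Responses with above-average reward in their group receive positive advantages and are reinforced; the rest are pushed down, regardless of the absolute reward scale.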

[NLP-78] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

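[Quick Read]: This paper addresses the lack of theoretical guidance in building Supervised Fine-Tuning (SFT) datasets: current practice relies on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. The key to the solution is a shift from ad-hoc curation to a closed-loop dataset engineering framework built on the OpenDataArena (ODA) platform, which uses value-anchored rankings and multi-dimensional analysis to turn value benchmarking into feedback signals that guide dataset construction. Two datasets built this way, ODA-Math-460k (specialized mathematical reasoning) and ODA-Mixture (multi-domain instruction data), reach state-of-the-art results on their target tasks with clear gains in data efficiency, supporting a data-centric path in which transparent evaluation drives the engineering of high-quality training data.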

Link: https://arxiv.org/abs/2601.09733
Authors: Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He, Lijun Wu
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Superior ODA-Math, ODA-Mixture Datasets

Abstract:The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an “Anchor-and-Patch” strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.

[NLP-79] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

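[Quick Read]: This paper addresses the lack of a clear metric for telling genuine cross-lingual semantic alignment in multilingual embedding models apart from task performance achieved through language-specific patterns. Existing task-driven benchmarks such as MTEB may mask shortcomings in basic semantic alignment. The key to the solution is a new quantifiable, bounded (between 0 and 1) metric, Semantic Affinity (SA), which measures the ratio of inter-lingual to intra-lingual embedding spread using cosine distance and is combined with PHATE visualization in the Semanscope framework for systematic evaluation. The results show that cross-lingual alignment is determined mainly by the training objective rather than by architecture or scale: explicit translation-pair supervision is the core ingredient of high-quality alignment, and adding multilingual data or model size does not substitute for it.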

Link: https://arxiv.org/abs/2601.09732
Authors: Wen G. Gong
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 9 figures, 4 tables

Abstract:With hundreds of multilingual embedding models available, practitioners lack clear guidance on which provide genuine cross-lingual semantic alignment versus task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) Top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of 0.6 B to 8 B scale; (3) MLM-only BERT models (mBERT, XLM-R, SA 0.50) fail despite more than 100 language training. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift-models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.

[NLP-80] Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings

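[Quick read]: This paper addresses the problem of analyzing the semantic geometry of multilingual embedding spaces, namely how to systematically reveal the geometric patterns and limitations of embedding representations at different linguistic levels (graphemes, characters, words, and semantic domains). The key to its solution is a multi-level analysis framework, implemented in the Semanscope tool with PHATE manifold learning, that supports visual and quantitative analysis at four linguistic levels. This exposes structural deficiencies in how current embedding models capture semantic relationships, such as the geometric collapse of Chinese radicals and Arabic numerals following spiral trajectories rather than forming clusters, and supports PHATE as a core tool for analyzing the geometry of embedding spaces and assessing how well models represent meaning.

Link: https://arxiv.org/abs/2601.09731
Authors: Wen G Gong
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: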

Abstract:We introduce a multi-level analysis framework for examining semantic geometry in multilingual embeddings, implemented through Semanscope (a visualization tool that applies PHATE manifold learning across four linguistic levels). Analysis of diverse datasets spanning sub-character components, alphabetic systems, semantic domains, and numerical concepts reveals systematic geometric patterns and critical limitations in current embedding models. At the sub-character level, purely structural elements (Chinese radicals) exhibit geometric collapse, highlighting model failures to distinguish semantic from structural components. At the character level, different writing systems show distinct geometric signatures. At the word level, content words form clustering-branching patterns across 20 semantic domains in English, Chinese, and German. Arabic numbers organize through spiral trajectories rather than clustering, violating standard distributional semantics assumptions. These findings establish PHATE manifold learning as an essential analytic tool not only for studying geometric structure of meaning in embedding space, but also for validating the effectiveness of embedding models in capturing semantic relationships.

[NLP-81] Clinical Document Metadata Extraction: A Scoping Review

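[Quick read]: This paper addresses the heterogeneity and drift of clinical document metadata across medical practices and over time, which hinder the standardization and integration of that metadata. The core challenge is how to automatically extract and harmonize metadata from scattered, unstructured clinical documents so that the information they contain can be interpreted and used accurately. The key to the solution lies in automated extraction methods, which have evolved from early rule-based and traditional machine learning approaches that relied heavily on feature engineering to end-to-end learning with Transformer architectures and, more recently, large language models (LLMs). This shift reduces the need for hand-crafted features and improves generalization across tasks and datasets, providing a technical basis for more advanced clinical text processing systems.

Link: https://arxiv.org/abs/2601.09730
Authors: Kurt Miller(1 and 2),Qiuhao Lu(3),William Hersh(4),Kirk Roberts(3),Steven Bedrick(4),Andrew Wen(3),Hongfang Liu(3) ((1) Mayo Clinic, (2) University of Minnesota, (3) University of Texas Health Science Center at Houston, (4) Oregon Health & Science University)
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: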

Abstract:Clinical document metadata, such as document type, structure, author role, medical specialty, and encounter setting, is essential for accurate interpretation of information captured in clinical documents. However, vast documentation heterogeneity and drift over time challenge harmonization of document metadata. Automated extraction methods have emerged to coalesce metadata from disparate practices into target schema. This scoping review aims to catalog research on clinical document metadata extraction, identify methodological trends and applications, and highlight gaps. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to identify articles that perform clinical document metadata extraction. We initially found and screened 266 articles published between January 2011 and August 2025, then comprehensively reviewed 67 we deemed relevant to our study. Among the articles included, 45 were methodological, 17 used document metadata as features in a downstream application, and 5 analyzed document metadata composition. We observe myriad purposes for methodological study and application types. Available labelled public data remains sparse except for structural section datasets. Methods for extracting document metadata have progressed from largely rule-based and traditional machine learning with ample feature engineering to transformer-based architectures with minimal feature engineering. The emergence of large language models has enabled broader exploration of generalizability across tasks and datasets, allowing the possibility of advanced clinical text processing systems. We anticipate that research will continue to expand into richer document metadata representations and integrate further into clinical applications and workflows.

[NLP-82] Enhancing Business Analytics through Hybrid Summarization of Financial Reports

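[Quick read]: This paper addresses the inefficiency and subjective bias of manually analyzing financial text, in particular earnings conference calls, whose content is long and structurally complex. The core challenge is to generate summaries that are both concise and factually accurate under limited computational resources, in support of business decisions. The key to the solution is a two-stage hybrid summarization framework: the first stage uses the LexRank algorithm to extract salient sentences, and the second stage uses fine-tuned BART and PEGASUS models to produce an abstractive summary. In parallel, a Longformer Encoder-Decoder (LED) model is fine-tuned to model long-range contextual dependencies directly. The experiments show that although long-context models perform best overall, the hybrid approach still achieves competitive results in resource-constrained settings and improves factual consistency.

Link: https://arxiv.org/abs/2601.09729
Authors: Tohida Rehman
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures, 2 tables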

Abstract:Financial reports and earnings communications contain large volumes of structured and semi-structured information, making detailed manual analysis inefficient. Earnings conference calls provide valuable evidence about a firm's performance, outlook, and strategic priorities. The manual analysis of lengthy call transcripts requires substantial effort and is susceptible to interpretive bias and unintentional error. In this work, we present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable Reuters-style summaries from the ECTSum dataset. The proposed two-stage pipeline first applies the LexRank algorithm to identify salient sentences, which are subsequently summarized using fine-tuned variants of BART and PEGASUS designed for resource-constrained settings. In parallel, we fine-tune a Longformer Encoder-Decoder (LED) model to directly capture long-range contextual dependencies in financial documents. Model performance is evaluated using standard automatic metrics, including ROUGE, METEOR, MoverScore, and BERTScore, along with domain-specific variants such as SciBERTScore and FinBERTScore. To assess factual accuracy, we further employ entity-level measures based on source-precision and F1-target. The results highlight complementary trade-offs between approaches: long-context models yield the strongest overall performance, while the hybrid framework achieves competitive results with improved factual consistency under computational constraints. These findings support the development of practical summarization systems for efficiently distilling lengthy financial texts into usable business insights.
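
The extractive first stage can be illustrated with a small, simplified LexRank-style ranker. This is a sketch under stated assumptions rather than the authors' pipeline: it uses plain term-frequency cosine similarity instead of the IDF-weighted similarity of the original LexRank algorithm, the function name `lexrank_sketch` and its parameters are hypothetical, and the abstractive BART/PEGASUS stage is not shown.

```python
import math
import re
from collections import Counter


def _tf_cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexrank_sketch(sentences, top_k=2, damping=0.85, iters=50):
    """Rank sentences by centrality in a similarity graph; return top_k in document order."""
    if not sentences:
        return []
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(bags)
    sim = [[_tf_cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):  # damped power iteration, as in PageRank
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * sim[i][j] / row_sums[i]
                            for i in range(n) if row_sums[i] > 0)
            for j in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]
```

In a hybrid setup like the one described, the selected sentences would then be concatenated and passed to the abstractive model, which keeps the input within the length limits of smaller encoder-decoder models.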

[NLP-83] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens

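[Quick read]: This paper addresses the problems that arise when large language models (LLMs) guided by predefined agentic workflows generate research introductions, including overly long reasoning chains, error accumulation, and reduced textual coherence. The key to its solution is to discard the external agentic workflow and instead parameterize the workflow's multi-stage logical structure explicitly into the model by introducing the Stage Token for Introduction Generation (STIG), so that the model can produce multi-stage text in a single inference. STIG converts the stages of the original workflow into explicit stage signals that guide the model to take on different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions, as well as the logical order and transition patterns between stages, encoding this knowledge into its parameters and improving the semantic similarity and sentence-level structural rationality of the generated introductions.

Link: https://arxiv.org/abs/2601.09728
Authors: Meicong Zhang,Tiancheng su,Guoxiu He
Affiliation: East China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: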

Abstract:In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions. It also learns the logical order and transition patterns between stages, encoding this knowledge into the model parameters. Experimental results show that STIG can generate multi-stage text in a single inference. It does not require explicit workflow calls. STIG outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality. The code is provided in the Supplementary Materials.

[NLP-84] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis

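[Quick read]: This paper addresses the difficulty, in cross-domain scientific synthesis, of connecting mechanistic explanations scattered across fragmented literature; existing retrieval-based systems and unconstrained language models offer limited control over reasoning depth and structural grounding. The key to its solution is to model mechanistic synthesis as multi-hop reasoning over a concept graph built from the literature, using explicit graph constraints to guide reasoning paths and so make multi-hop reasoning controllable. Specifically, SciNets constructs a directed concept graph and identifies multi-hop paths between concepts that rarely co-occur within a single paper in order to generate mechanistic explanations. It also introduces a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability, revealing a trade-off between deeper, more diverse reasoning and grounding stability.

Link: https://arxiv.org/abs/2601.09727
Authors: Sauhard Dubey
Affiliation: Jaypee Institute of Information Technology, Noida
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 19 pages, 2 figures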

Abstract:Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, a capability that remains challenging for both retrieval-based systems and unconstrained language models. While recent work has applied large language models to scientific summarization and question answering, these approaches provide limited control over reasoning depth and structural grounding. We frame mechanistic synthesis as a graph-constrained multi-hop reasoning problem over literature-derived concept graphs. Given a scientific query and a compact, query-local corpus, SciNets constructs a directed concept graph and synthesizes mechanistic explanations by identifying multi-hop reasoning paths that connect concepts that rarely co-occur within individual papers. We systematically compare shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and a retrieval-augmented language model baseline. Rather than evaluating correctness, which is often indeterminate when synthesizing connections across distributed sources, we introduce a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability. Across machine learning, biology, and climate science tasks, explicit graph constraints enable controllable multi-hop reasoning while revealing a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning remains highly stable but structurally conservative. These findings provide a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis.
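
Of the strategies compared in the abstract, shortest-path reasoning is the simplest to illustrate. The sketch below runs a breadth-first search over a directed concept graph stored as an adjacency dictionary; the representation, the function name `shortest_concept_path`, and the placeholder concept names in the usage note are assumptions for illustration, and the paper's graph construction and explanation synthesis steps are not reproduced.

```python
from collections import deque


def shortest_concept_path(graph, source, target):
    """Return the shortest directed path from source to target, or None.

    graph maps each concept to an iterable of concepts it points to.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk parent links back to the source
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parents:  # first visit is the shortest in an unweighted graph
                parents[nxt] = node
                queue.append(nxt)
    return None
```

For example, with `graph = {"concept_a": ["concept_b"], "concept_b": ["concept_c"]}`, the call `shortest_concept_path(graph, "concept_a", "concept_c")` yields a two-hop path through `concept_b`. The number of hops is one plausible proxy for the reasoning depth that the abstract's behavioral framework measures.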

[NLP-85] Forgetting as a Feature: Cognitive Alignment of Large Language Models

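[Quick read]: This paper addresses the systematic forgetting that large language models (LLMs) exhibit in in-context reasoning, that is, their limited retention of earlier information. The conventional view treats this as a performance defect, whereas the paper argues that it should be reinterpreted as a functional cognitive mechanism. The key to the solution is to draw on human memory dynamics and model LLM inference as a probabilistic memory process governed by exponential decay, together with a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall. Building on this, the paper proposes a lightweight strategy, probabilistic memory prompting, which shapes evidence integration to mimic human-like memory decay and thereby improves long-horizon reasoning performance.

Link: https://arxiv.org/abs/2601.09726
Authors: Hien Tran,Quinten Steenhuis,Alexandros Christoforos,Chadbourne Davis
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under submission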

Abstract:Large Language Models (LLMs) are often evaluated against ideals of perfect Bayesian inference, yet growing evidence suggests that their in-context reasoning exhibits systematic forgetting of past information. Rather than viewing this behavior as a limitation, we reinterpret forgetting as a functional cognitive mechanism. Drawing inspiration from human memory dynamics, we model LLM inference as a probabilistic memory process governed by exponential decay. We introduce a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall, enabling direct comparison between model behavior and human cognitive patterns. Our empirical results reveal that LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. Building on these observations, we propose probabilistic memory prompting, a lightweight strategy that shapes evidence integration to mimic human-like memory decay, leading to improved long-horizon reasoning performance. Our findings position forgetting not as a failure mode, but as a principled mechanism for adaptive intelligence.
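
The idea of memory governed by exponential decay can be made concrete with a small sketch. It assumes each piece of evidence is weighted by exp(-decay_rate * age), where age counts steps since the evidence was seen; the function name `decayed_estimate` and this particular weighting are illustrative assumptions, not the paper's model or its prompting strategy.

```python
import math


def decayed_estimate(observations, decay_rate):
    """Weighted mean of evidence values with exponentially decaying weights.

    observations: non-empty list of (age, value) pairs; age = steps since seen.
    decay_rate: 0 gives a plain mean; larger values favor recent evidence.
    """
    weights = [math.exp(-decay_rate * age) for age, _ in observations]
    total = sum(weights)
    return sum(w * value for w, (_, value) in zip(weights, observations)) / total
```

With a decay rate of zero, every observation counts equally. Raising the rate makes the estimate track recent evidence more closely, which is the stability versus adaptability trade-off the abstract refers to when a concept drifts.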

[NLP-86] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation

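[Quick read]: This paper addresses the semantic and structural ambiguity caused by missing or ambiguous punctuation in English-to-Marathi translation, taking Marathi as an example of a lower-resource language, in order to improve the reliability of machine translation (MT) systems. The key to its solution is Virām, the first diagnostic benchmark for punctuation robustness in English-to-Marathi translation, and a comparison of two strategies: a pipeline that restores punctuation before translating, and direct fine-tuning on punctuation-varied data. The experiments show that both approaches significantly outperform standard baselines, with fine-tuned models standing out in preserving meaning, while current large language models (LLMs) still lag behind task-specific models on punctuation-ambiguous text, which highlights the need for targeted optimization.

Link: https://arxiv.org/abs/2601.09725
Authors: Kaustubh Shivshankar Shejole,Sourabh Deoghare,Pushpak Bhattacharyya
Affiliation: Computation for Indian Language Technology (CFILT), Indian Institute of Technology Bombay
Categories: Computation and Language (cs.CL)
Comments: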

Abstract:Punctuation plays a critical role in resolving semantic and structural ambiguity in written language. Machine Translation (MT) systems are now widely applied across diverse domains and languages, including many low-resource settings. In this work, we focus on Marathi, a low- to middle-resource language. We introduce Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, consisting of 54 manually curated, punctuation-ambiguous instances. We evaluate two primary strategies for enhancing reliability: a pipeline-based restore-then-translate approach and direct fine-tuning on punctuation-varied data. Our results demonstrate that specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines on the Virām benchmark. Qualitative analysis reveals that the original model may produce wrong translations that lead to wrong interpretations, while fine-tuned models significantly improve overall reliability. Furthermore, we find that current Large Language Models (LLMs) lag behind these task-specific approaches in preserving meaning for punctuation-ambiguous text, thus necessitating further research in this area.

[NLP-87] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions

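[Quick read]: This paper addresses the limited robustness of large language models (LLMs) to changes in syntactic structure when making ethical decisions, in particular cases where logically equivalent but syntactically different prompts (for example, variations in negation and conditional structure) lead to inconsistent ethical judgments. The core challenge is to separate semantic drift from purely syntactic effects, so that one can accurately assess whether a model's ethical judgment changes merely because of the form of the prompt. The key to the solution is the Syntactic Framing Fragility (SFF) evaluation framework, which uses Logical Polarity Normalization (LPN) to isolate purely syntactic effects and compare decisions consistently across positive and negative framings, making it possible to identify and quantify fluctuations in a model's ethical judgments across syntactic framings.

Link: https://arxiv.org/abs/2601.09724
Authors: Katherine Elkins,Jon Chun
Affiliation: Kenyon College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: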

Abstract:Large language models (LLMs) are increasingly deployed in consequential decision-making settings, yet their robustness to benign prompt variation remains underexplored. In this work, we study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts, focusing on variations involving negation and conditional structure. We introduce Syntactic Framing Fragility (SFF), a robustness evaluation framework that isolates purely syntactic effects via Logical Polarity Normalization (LPN), enabling direct comparison of decisions across positive and negative framings without semantic drift. Auditing 23 state-of-the-art models spanning the U.S. and China as well as small U.S. open-source software models over 14 ethical scenarios and four controlled framings (39,975 decisions), we find widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity, with open-source models exhibiting over twice the fragility of commercial counterparts. We further uncover extreme negation sensitivity, where some models endorse actions in 80-97% of cases when explicitly prompted with “should not.” We show that eliciting chain-of-thought reasoning substantially reduces fragility, identifying a practical mitigation lever, and we map fragility across scenarios, finding higher risk in financial and business contexts than in medical scenarios. Our results demonstrate that syntactic consistency constitutes a distinct and critical dimension of ethical robustness, and we argue that SFF-style audits should be a standard component of safety evaluation for deployed LLMs. Code and results will be available on this http URL.
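
The role of polarity normalization can be sketched as follows. The idea is that a "yes" to "should X be done?" and a "no" to "should X not be done?" both endorse the action, so answers must be mapped to a common polarity before they are compared. The function names and the simple disagreement score below are hypothetical; the abstract does not specify how SFF computes fragility.

```python
def normalize_polarity(answer_is_yes, framing_is_negated):
    """Return True if the model endorses the underlying action."""
    # Under a negated framing ("should not"), a "yes" rejects the action.
    return answer_is_yes != framing_is_negated


def fragility(decisions):
    """Share of normalized decisions that disagree with the majority.

    decisions: non-empty list of (answer_is_yes, framing_is_negated) pairs
    collected for one scenario across framings. 0.0 means fully consistent.
    """
    if not decisions:
        raise ValueError("need at least one decision")
    endorsed = [normalize_polarity(a, n) for a, n in decisions]
    majority = max(endorsed.count(True), endorsed.count(False))
    return 1.0 - majority / len(endorsed)
```

A model that answers "yes" to both the positive and the negated framing of the same scenario is inconsistent after normalization, which is the kind of negation sensitivity the abstract reports.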

[NLP-88] SagaScale: A Realistic Scalable and High-Quality Long-Context Benchmark Built from Full-Length Novels

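[Quick read]: This paper addresses the challenges that current large language models (LLMs) face in understanding long, complex documents, and in particular the limitations of existing long-context benchmarks in task realism, data scalability, and data quality. The key to its solution is SagaScale, a new long-context benchmark built from full-length novels. An automated data collection pipeline uses external resources (such as Wikipedia pages) to generate high-quality question-answer pairs; these resources are available only during construction and not during evaluation, which lets LLMs curate complex questions that go beyond what they can answer at evaluation time. The benchmark is bilingual and offers the largest context length to date (averaging over 250K tokens for English novels and over 320K for Chinese novels), providing a more realistic, scalable, and high-quality test bed for long-document understanding.

Link: https://arxiv.org/abs/2601.09723
Authors: Guancheng Du,Yong Hu,Wenqing Wang,Yaming Yang,Jiaheng Gao
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: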

Abstract:Large Language Models (LLMs) have shown significant progress, but understanding long and complex documents remains challenging. Many long-context benchmarks have been proposed, but they face several limitations, including task realism, data scalability, and data quality. To this end, we introduce SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels. The entire benchmark is constructed using an automated data collection pipeline that utilizes external resources (e.g., Wikipedia pages) to curate question-answer pairs. Critically, these external resources are provided only for benchmark construction and not during evaluation, which allows LLMs to curate complex questions that go beyond what they can answer during evaluation. SagaScale is also bilingual and offers the largest context length to date, with average token counts exceeding 250K for English novels and 320K for Chinese novels. Our evaluation across 12 frontier LLMs and three long-context methods – Naïve RAG, Agentic RAG, and Long Context – yields key insights, including: (1) Directly supplying the full context to the LLM can outperform other methods by a large margin; (2) Most LLMs still struggle with lengthy contexts, but Gemini-2.5-Pro stands out as an exception; and (3) Agentic RAG effectively addresses the retrieval bottleneck in Naïve RAG. Finally, we publicly release the SagaScale benchmark and our data collection codebase to facilitate future research.

[NLP-89] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

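[Quick read]: This paper addresses the scarcity of annotation resources for medical text, in particular the lack of enough labeled Polish data to train a high-performing multi-class classifier. The key to its solution is to use a pretrained multilingual large language model (LLM), Llama3.1, as a teacher model that automatically annotates a large corpus of unlabeled Polish medical texts, producing labels usable for training. A test set is then built from the limited portion of labels verified by humans, and three lightweight BERT-based classifiers are trained (DistilBERT, BioBERT, and HerBERT). DistilBERT performs best, reaching an F1 score of 0.80 in every clinical category and 0.93 in three of them. Compared with large language models, the resulting classifiers are nearly 500 times smaller, use 300 times less GPU VRAM, and run inference several hundred times faster, making them an efficient and practical alternative for medical text classification.

Link: https://arxiv.org/abs/2601.09722
Authors: Franciszek Górski,Andrzej Czyżewski
Affiliation: Gdansk University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: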

Abstract:In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score 0.80 for each clinical category and an F1 score 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.

[NLP-90] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

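[Quick read]: This paper addresses the safe deployment of large language models (LLMs) in realistic medical settings, especially when anxious parents apply adversarial pressure. Existing evaluations are mostly run under neutral conditions and overlook the effect of user emotion on a model's safety behavior, which in pediatric consultations can create a risk of misdiagnosis or missed diagnosis. The key to the solution is to apply PediatricAnxietyBench, an adversarial benchmark, and quantify model safety on 150 authentic and 150 adversarial queries along the dimensions of restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. The study finds that alignment and architecture matter more than parameter count: smaller models (such as Mistral-7B) outperform a larger one (Llama-3.3-70B), and adversarial pressure is associated with higher safety scores. It also finds that the models lack emergency recognition and are therefore unsuitable for triage, providing empirical evidence and an open benchmark for improving the safety of medical AI.

Link: https://arxiv.org/abs/2601.09721
Authors: Vahideh Zolfaghari
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: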

Abstract:Background: Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities from anxious users challenging safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods: PediatricAnxietyBench, from a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq), Mistral-7B (HuggingFace), yielding 900 responses. Safety used a 0-15 scale for restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results: Mean scores: 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). Models showed positive adversarial effects, Mistral-7B strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizures vulnerable (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001). Conclusions: Evaluation shows safety depends on alignment and architecture over scale, with smaller models outperforming larger. Evolution to robustness across releases suggests targeted training progress. Vulnerabilities and no emergency recognition indicate unsuitability for triage. Findings guide selection, stress adversarial testing, and provide open benchmark for medical AI safety.

[NLP-91] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering ICDM2025

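[Quick read]: This paper addresses the limited reliability of current knowledge graph (KG) based question answering (QA) systems when the evidence is incomplete, noisy, or uncertain. Existing KG frameworks typically represent facts as static and deterministic, and so fail to capture the evolving nature of information and the uncertainty inherent in reasoning. The key to its solution is an uncertainty-aware dynamic knowledge graph framework that combines (i) dynamic construction of KGs that evolve over time, (ii) confidence scoring and uncertainty-aware retrieval, and (iii) an interactive interface for interpretable and reliable QA. By visualizing uncertainty, annotating triples with confidence, and contrasting baseline answers with confidence-aware answers, the framework aims to make QA systems more robust and transparent, particularly for personalized decision support in high-stakes settings such as healthcare.

Link: https://arxiv.org/abs/2601.09720
Authors: Yu Takahashi,Shun Takeuchi,Kexuan Xin,Guillaume Pelat,Yoshiaki Ikai,Junya Saito,Jonathan Vitale,Shlomo Berkovsky,Amin Beheshti
Affiliation: Fujitsu Research; Macquarie University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages, 4 figures. Accepted at IEEE ICDM 2025 Demo Track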

Abstract:Question answering (QA) systems are increasingly deployed across domains. However, their reliability is undermined when retrieved evidence is incomplete, noisy, or uncertain. Existing knowledge graph (KG) based QA frameworks typically represent facts as static and deterministic, failing to capture the evolving nature of information and the uncertainty inherent in reasoning. We present a demonstration of uncertainty-aware dynamic KGs, a framework that combines (i) dynamic construction of evolving KGs, (ii) confidence scoring and uncertainty-aware retrieval, and (iii) an interactive interface for reliable and interpretable QA. Our system highlights how uncertainty modeling can make QA more robust and transparent by enabling users to explore dynamic graphs, inspect confidence-annotated triples, and compare baseline versus confidence-aware answers. The target users of this demo are clinical data scientists and clinicians, and we instantiate the framework in healthcare: constructing personalized KGs from electronic health records, visualizing uncertainty across patient visits, and evaluating its impact on a mortality prediction task. This use case demonstrates the broader promise of uncertainty-aware dynamic KGs for enhancing QA reliability in high-stakes applications.

[NLP-92] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

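[Quick read]: This paper addresses two core problems of Pre-Layer Normalization (Pre-LN) in training large language models (LLMs): computational inefficiency, because repeated statistical calculations create a bottleneck, and training instability, because the magnitude and variance of activations grow markedly with depth (the "curse of depth"). The authors propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. Its key idea is to couple a tanh nonlinearity with explicit, data-driven input bounding so that activations stay within a non-saturating range, which suppresses depth-wise activation growth and comes with a theoretical stability guarantee. BHyT also computes exact statistics only once per block and replaces the second normalization with a lightweight variance approximation, improving training efficiency while preserving stability. Empirically, BHyT trains 15.8% faster on average than RMSNorm, achieves 4.2% higher token generation throughput, and matches or exceeds the baseline in inference performance and robustness.

Link: https://arxiv.org/abs/2601.09719
Authors: Hoyoon Byun,Youngjun Choi,Taero Kim,Sungrae Park,Kyungwoo Song
Affiliation: Yonsei University; Upstage AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: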

Abstract:Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: this https URL

[NLP-93] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

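[Quick read]: This paper addresses how to efficiently build a specialized large language model (LLM) for statistics, in particular how to balance domain expertise and general reasoning ability under resource constraints. The core finding is that multi-stage training from a base model fails to produce meaningful statistical reasoning, whereas domain specialization on top of an instruction-tuned model with strong general reasoning (LLaMA-3.2-3B-Instruct) is markedly more effective. The key to the solution has three parts: start from an instruction-tuned model with strong general reasoning ability; combine supervised fine-tuning (SFT) with direct preference optimization (DPO) for stable and effective preference alignment; and perform downstream fine-tuning at extremely low intensity to avoid catastrophic forgetting. The resulting model, StatLLaMA, achieves balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, and the approach offers a resource-efficient blueprint for developing statistical LLMs.

Link: https://arxiv.org/abs/2601.09718
Authors: Jing-Yi Zeng,Guan-Hua Huang
Affiliation: National Yang Ming Chiao Tung University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 31 pages, 3 figures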

Abstract:This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at this https URL.
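
For readers unfamiliar with direct preference optimization, the standard DPO objective for a single preference pair can be written in a few lines. The sketch below follows the published DPO loss, -log sigmoid(beta * margin), where the margin compares how much more the policy favors the chosen response than the reference model does. The function name, argument names, and the default `beta=0.1` are illustrative assumptions and say nothing about the settings used for StatLLaMA.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow in exp.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one more than the reference does, without needing a separately trained reward model.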

[NLP-94] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data

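[Quick read]: This paper addresses the need to classify the sensitivity and grade the risk of conversational health data generated by online medical consultations, which often contain protected health information (PHI). Existing approaches lack unified standards and reliable automated methods, and so struggle to meet policy compliance requirements. The key to the solution is SALP-CG, an extraction pipeline based on large language models (LLMs). It combines few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules to achieve reliable category compliance and sensitivity grading without depending on a particular backend model. On the MedDialog-CN benchmark, the strongest model attains a micro-F1 of 0.900 for maximum-level prediction. The analysis also shows that Level 2-3 items dominate and can enable re-identification when combined, while Level 4-5 items are less frequent but carry outsize harm, giving health data governance a practical automated tool.

Link: https://arxiv.org/abs/2601.09717
Authors: Yiwei Yan,Hao Li,Hua He,Gong Kai,Zhengyi Yang,Guanfeng Liu
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: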

Abstract:Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We summarized health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog-CN benchmark, models yield robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grade sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at this https URL.
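
How deterministic rules might sit on top of schema-constrained LLM output can be sketched as a post-processing step. Everything specific below is an assumption for illustration: the payload shape, the category names, and the rule table `HIGH_RISK_FLOOR` are hypothetical and are not taken from SALP-CG or from GB/T 39725-2020. Only the five-level scale and the maximum-level prediction come from the abstract, and constrained decoding itself is not shown.

```python
import json

ALLOWED_LEVELS = {1, 2, 3, 4, 5}

# Hypothetical rule table: category -> minimum sensitivity level.
HIGH_RISK_FLOOR = {"infectious_disease": 4, "genetic_information": 5}


def grade_record(llm_json):
    """Validate extracted entities, apply rule floors, return (entities, max level)."""
    entities = json.loads(llm_json)["entities"]
    graded = []
    for ent in entities:
        category, level = ent.get("category"), ent.get("level")
        if not isinstance(category, str) or not isinstance(level, int) \
                or level not in ALLOWED_LEVELS:
            raise ValueError(f"schema violation: {ent!r}")
        # A deterministic rule can only raise the level the model proposed.
        graded.append({"category": category,
                       "level": max(level, HIGH_RISK_FLOOR.get(category, 1))})
    # Record-level grade is the highest entity level (1 if nothing was found).
    return graded, max((e["level"] for e in graded), default=1)
```

Keeping the rules outside the model makes the high-risk decisions reproducible across backends, which is one plausible reading of the backend-agnostic design the abstract describes.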

[NLP-95] Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research

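[Quick Read]: This paper addresses the long-standing underrepresentation of African languages in natural language processing (NLP), in particular the severe lack of digital resources, tools, and benchmarks for the six national languages recognized by the Senegalese Constitution (Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke). Its core contribution is a systematic survey of the corpora, technical progress, and sociotechnical infrastructure for these languages, together with a centralized GitHub repository that compiles publicly available NLP tools and datasets to support cross-institution collaboration and reproducibility. It also emphasizes applying NLP to the social sciences, for example multilingual transcription, translation, and retrieval pipelines that make field research more efficient and inclusive, and outlines a path toward a sustainable, community-centered NLP ecosystem built on ethical data governance and open resources.

Link: https://arxiv.org/abs/2601.09716
Authors: Derguene Mbaye,Tatiana D. P. Mbengue,Madoune R. Seye,Moussa Diallo,Mamadou L. Ndiaye,Dimitri S. Adjanohoun,Cheikh S. Wade,Djiby Sow,Jean-Claude B. Munyaka,Jerome Chenal
Affiliations: Polytechnic School (ESP), Dakar, Senegal; Gaston Berger University (UGB), Saint Louis, Senegal; Federal Institute of Technology Lausanne (EPFL), Lausanne, Switzerland
Categories: Computation and Language (cs.CL)
Comments: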

Abstract:Natural Language Processing (NLP) is rapidly transforming research methodologies across disciplines, yet African languages remain largely underrepresented in this technological shift. This paper provides the first comprehensive overview of NLP progress and challenges for the six national languages officially recognized by the Senegalese Constitution: Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke. We synthesize linguistic, sociotechnical, and infrastructural factors that shape their digital readiness and identify gaps in data, tools, and benchmarks. Building on existing initiatives and research works, we analyze ongoing efforts in text normalization, machine translation, and speech processing. We also provide a centralized GitHub repository that compiles publicly accessible resources for a range of NLP tasks across these languages, designed to facilitate collaboration and reproducibility. A special focus is devoted to the application of NLP to the social sciences, where multilingual transcription, translation, and retrieval pipelines can significantly enhance the efficiency and inclusiveness of field research. The paper concludes by outlining a roadmap toward sustainable, community-centered NLP ecosystems for Senegalese languages, emphasizing ethical data governance, open resources, and interdisciplinary collaboration.

[NLP-96] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents

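[Quick Read]: This paper addresses efficiency bottlenecks in the daily work of independent insurance agents, in particular the limited automation of complex tasks such as policy recommendation and claims triage. The key to the solution is Axlerod, a generative-AI conversational interface that combines natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration to parse user intent, access structured policy databases, and deliver real-time, context-relevant responses. Experiments report an overall accuracy of 93.18% on policy retrieval tasks and a 2.42-second reduction in average search time, which improves agents' operational efficiency.

Link: https://arxiv.org/abs/2601.09715
Authors: Adam Bradley,John Hastings,Khandaker Mamun Ahmed
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 6 pages, 2 figures, 1 table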

Abstract:The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies, particularly in the realm of intelligent conversational agents. Chatbots have evolved into sophisticated AI-driven systems capable of automating complex workflows, including policy recommendation and claims triage, while simultaneously enabling dynamic, context-aware user engagement. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents. Leveraging natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, Axlerod demonstrates robust capabilities in parsing user intent, accessing structured policy databases, and delivering real-time, contextually relevant responses. Experimental results underscore Axlerod’s effectiveness, achieving an overall accuracy of 93.18% in policy retrieval tasks while reducing the average search time by 2.42 seconds. This work contributes to the growing body of research on enterprise-grade AI applications in insurtech, with a particular focus on agent-assistive rather than consumer-facing architectures.

[NLP-97] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines

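[Quick Read]: This paper addresses the lack of creativity and originality that "smart plagiarism" causes when generative AI is used for research: under single-step prompting, models tend to reproduce existing ideas with terminological substitutions rather than produce genuinely novel research plans. The key to the solution is agentic workflows, multi-step systems that use iterative reasoning, evolutionary search, and recursive decomposition to make outputs more creative. Experiments show that decomposition-based and long-context workflows reach a mean novelty score of 4.17/5, whereas reflection-based iterative refinement scores lower (2.33/5), supporting the view that carefully designed multi-stage systems can advance AI-assisted research ideation.

Link: https://arxiv.org/abs/2601.09714
Authors: Devesh Saraogi,Rohit Singhee,Dhruv Kumar
Affiliations: Birla Institute of Technology and Science, Pilani, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review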

Abstract:The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI-generated research. Recent work has identified “smart plagiarism” as a concern in single-step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows – multi-step systems employing iterative reasoning, evolutionary search, and recursive decomposition – can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini 3 Pro multimodal long-context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition-based and long-context workflows achieve mean novelty of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi-stage agentic workflows can advance AI-assisted research ideation.

[NLP-98] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue

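[Quick Read]: This paper addresses proactive prediction of the user's next utterance in human-machine dialogue, with the aim of streamlining interaction and improving user experience. Existing options either raise privacy concerns (commercial APIs) or are computationally expensive (locally deployed general-purpose large language models, LLMs), while compact task-specific LLMs are limited by data scarcity. The paper proposes ProUtt, whose key idea is to build an intent tree that explicitly models the user's intent reasoning trajectory and to predict the next plausible path from both exploitation and exploration perspectives. It then perturbs or revises intent tree paths to synthesize preference and non-preference reasoning processes, yielding preference data for training. The method consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets.

Link: https://arxiv.org/abs/2601.09713
Authors: Jinqiang Wang,Huansheng Ning,Jianguo Ding,Tao Zhu,Liming Chen,Chris Nugent
Affiliations: University of Science and Technology Beijing; Blekinge Institute of Technology; University of South China; Dalian University of Technology; Ulster University
Categories: Computation and Language (cs.CL)
Comments: 19 pages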

Abstract:Proactively predicting a user’s next utterance in human-machine dialogue can streamline interaction and improve user experience. Existing commercial API-based solutions are subject to privacy concerns while deploying general-purpose LLMs locally remains computationally expensive. As such, training a compact, task-specific LLM provides a practical alternative. Although user simulator methods can predict a user’s next utterance, they mainly imitate their speaking style rather than advancing the dialogue. Preference data synthesis has been investigated to generate data for proactive next utterance prediction and help align LLMs with user preferences. Yet existing methods lack the ability to explicitly model the intent reasoning that leads to the user’s next utterance and to define and synthesize preference and non-preference reasoning processes for predicting the user’s next utterance. To address these challenges, we propose ProUtt, an LLM-driven preference data synthesis method for proactive next utterance prediction. ProUtt converts dialogue history into an intent tree and explicitly models intent reasoning trajectories by predicting the next plausible path from both exploitation and exploration perspectives. It then constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at different future turns. Extensive evaluations using LLM-as-a-judge and human judgments demonstrate that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets. We release both the code and the synthesized datasets to facilitate future research.

[NLP-99] Social Determinants of Health Prediction for ICD-9 Code with Reasoning Models ML4H

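[Quick Read]: This paper addresses the difficulty of extracting Social Determinants of Health (SDoH) from clinical text in structured form, so that diagnostic systems can be supplemented with knowledge of patients' social circumstances. The key to the solution is to use existing ICD-9 codes as labels for multi-label SDoH classification of hospital admissions in the MIMIC-III dataset, using both reasoning models and traditional large language models. The approach achieves an F1 of 89% on admissions and identifies missing SDoH codes in 139 admissions.

Link: https://arxiv.org/abs/2601.09709
Authors: Sharim Khan,Paul Landes,Adam Cross,Jimeng Sun
Affiliations: University of Illinois Urbana-Champaign; University of Illinois Chicago
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Published as part of Machine Learning for Health (ML4H) 2025 Findings Track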

Abstract:Social Determinants of Health correlate with patient outcomes but are rarely captured in structured data. Recent attention has been given to automatically extracting these markers from clinical text to supplement diagnostic systems with knowledge of patients’ social circumstances. Large language models demonstrate strong performance in identifying Social Determinants of Health labels from sentences. However, prediction in large admissions or longitudinal notes is challenging given long-distance dependencies. In this paper, we explore hospital admission multi-label Social Determinants of Health ICD-9 code classification on the MIMIC-III dataset using reasoning models and traditional large language models. We exploit existing ICD-9 codes for prediction on admissions, which achieved an 89% F1. Our contributions include our findings, missing SDoH codes in 139 admissions, and code to reproduce the results.

[NLP-100] Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition

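[Quick Read]: This paper addresses automatic speech recognition (ASR) for Bengali, a morphologically rich and low-resource language. The key to the solution is an end-to-end framework built on a Conformer-CTC backbone with a multi-level embedding fusion mechanism that incorporates phoneme, syllable, and wordpiece representations into the acoustic features, so the model captures both fine-grained phonetic cues and higher-level contextual patterns. Embedding fusion is applied at both early and late Conformer stages, and preprocessing includes silence trimming, resampling, Log-Mel spectrogram extraction, and SpecAugment augmentation. The model reaches a word error rate (WER) of 10.01% and a character error rate (CER) of 5.03% on the test set.

Link: https://arxiv.org/abs/2601.09710
Authors: Md. Nazmus Sakib,Golam Mahmud,Md. Maruf Bangabashi,Umme Ara Mahinur Istia,Md. Jahidul Islam,Partha Sarker,Afra Yeamini Prity
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: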

Abstract:Bengali, spoken by over 300 million people, is a morphologically rich and low-resource language, posing challenges for automatic speech recognition (ASR). This research presents an end-to-end framework for Bengali ASR, building on a Conformer-CTC backbone with a multi-level embedding fusion mechanism that incorporates phoneme, syllable, and wordpiece representations. By enriching acoustic features with these linguistic embeddings, the model captures fine-grained phonetic cues and higher-level contextual patterns. The architecture employs early and late Conformer stages, with preprocessing steps including silence trimming, resampling, Log-Mel spectrogram extraction, and SpecAugment augmentation. The experimental results demonstrate the strong potential of the model, achieving a word error rate (WER) of 10.01% and a character error rate (CER) of 5.03%. These results demonstrate the effectiveness of combining multi-granular linguistic information with acoustic modeling, providing a scalable approach for low-resource ASR development.
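
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how WER and CER are conventionally computed from Levenshtein edit distance. It is not the paper's scoring script. The function names are illustrative, and counting spaces and individual code points as characters is an assumption; Bengali evaluations may instead normalize text or count grapheme clusters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this toy pair, one substitution plus one deleted word gives 2 word edits over 6 reference words; at the character level the same errors cost 5 edits over 22 reference characters.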

Computer Vision

[CV-0] WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

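[Quick Read]: This paper addresses novel view synthesis (NVS) in dynamic environments, where both the camera and objects move and conventional static NVS models, whose multi-view consistency assumption is broken, produce ghosting, hallucinated geometry, and unstable pose estimation. The key to the solution is WildRayZer, a self-supervised framework built around an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals the method constructs pseudo motion masks, distills a motion estimator, and uses it to mask input tokens and gate loss gradients so that supervision focuses on cross-view background completion, enabling dynamic-scene NVS in a single feed-forward pass.

Link: https://arxiv.org/abs/2601.10716
Authors: Xuweiyi Chen,Wentao Zhou,Zezhou Cheng
Affiliations: University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL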

Abstract:We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
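
The residual-to-mask idea in the abstract can be illustrated with a toy sketch. It assumes the simplest possible rule, a fixed threshold on the per-pixel absolute residual, and a mean squared error restricted to static pixels. The abstract does not specify the actual mask construction or loss, so `pseudo_motion_mask`, `gated_mse`, the threshold, and the tiny grayscale "images" are all hypothetical.

```python
def pseudo_motion_mask(observed, rendered, threshold):
    """Mark pixels whose static-render residual exceeds `threshold` as transient (1)."""
    return [[1 if abs(o - r) > threshold else 0
             for o, r in zip(obs_row, ren_row)]
            for obs_row, ren_row in zip(observed, rendered)]


def gated_mse(target, prediction, mask):
    """Mean squared error over static pixels only (mask == 0).

    Transient pixels are skipped, so they would contribute no gradient.
    """
    total, count = 0.0, 0
    for t_row, p_row, m_row in zip(target, prediction, mask):
        for t, p, m in zip(t_row, p_row, m_row):
            if m == 0:
                total += (t - p) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 2x3 grayscale frames (hypothetical values in [0, 1]).
observed = [[0.2, 0.2, 0.9],
            [0.2, 0.8, 0.2]]
rendered = [[0.2, 0.25, 0.2],   # output of a static-only renderer
            [0.2, 0.2, 0.2]]

mask = pseudo_motion_mask(observed, rendered, threshold=0.3)
loss = gated_mse(observed, rendered, mask)
print(mask, loss)
```

The two pixels that the static renderer cannot explain are flagged as transient and excluded, so the loss reflects only the small error on the static background.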

[CV-1] Alterbute: Editing Intrinsic Attributes of Objects in Images

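[Quick Read]: This paper addresses the difficulty of editing an object's intrinsic attributes in an image (such as color, texture, material, and even shape) while preserving its perceived identity and the scene context. Existing methods either rely on unsupervised priors that often lose identity or use overly restrictive supervision that prevents meaningful intrinsic variation. The solution rests on two ideas: (i) a relaxed training objective that lets the model change both intrinsic and extrinsic attributes, conditioned on an identity reference image, a text prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context; at inference, reusing the original background and object mask restricts extrinsic changes so that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs), fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that share identity-defining features while allowing intrinsic variation, with a vision-language model automatically extracting VNE labels and intrinsic attribute descriptions from a large public image dataset for scalable, identity-preserving supervision.

Link: https://arxiv.org/abs/2601.10714
Authors: Tal Reiss,Daniel Winter,Matan Cohen,Alex Rav-Acha,Yael Pritch,Ariel Shamir,Yedid Hoshen
Affiliations: Google; The Hebrew University of Jerusalem; Reichman University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page is available at this https URL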

Abstract:We introduce Alterbute, a diffusion-based method for editing an object’s intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., “Porsche 911 Carrera”) that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

[CV-2] From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion

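[Quick Read]: This paper addresses the visual feature bottleneck in vision-language models (VLMs): current models use a crude, asymmetric connection that passes only the vision encoder's output to the large language model (LLM), which prevents the LLM from aligning with hierarchical visual knowledge and limits its ability to integrate local details with global semantics. The key to the solution is Cross-Layer Injection (CLI), a lightweight framework with two cooperating modules: an Adaptive Multi-Projection (AMP) module that harmonizes features from different vision layers, and an Adaptive Gating Fusion (AGF) mechanism that lets the LLM select and inject the most relevant visual information based on its real-time decoding context. Together they form a dynamic many-to-many bridge between the vision and language modalities.

Link: https://arxiv.org/abs/2601.10710
Authors: Cheng Chen,Yuyu Guo,Pengpeng Zeng,Jingkuan Song,Peng Di,Hang Yu,Lianli Gao
Affiliations: Ant Group; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: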

Abstract:Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
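
As a rough intuition for context-dependent gating, the sketch below implements a generic scalar sigmoid gate that decides how much of each (already projected) vision-layer feature to add to a language-model hidden state. This is a textbook gating pattern chosen for illustration, not the paper's AGF module, whose formulation the abstract does not give. The function `gated_fusion`, the weight layout, and the toy vectors are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(hidden, visual_layers, gate_weights):
    """Add each visual feature to `hidden`, scaled by a context-dependent gate.

    hidden:        list[float] of size d (decoder state at one position)
    visual_layers: list of list[float], each of size d (one per vision layer,
                   assumed already projected into the decoder's space)
    gate_weights:  list of list[float], each of size 2*d (hypothetical learned
                   weights, one vector per vision layer)
    """
    fused = list(hidden)
    gates = []
    for visual, weights in zip(visual_layers, gate_weights):
        # The gate sees both the decoding context and the candidate feature.
        score = sum(w * x for w, x in zip(weights, hidden + visual))
        gate = sigmoid(score)
        gates.append(gate)
        fused = [f + gate * v for f, v in zip(fused, visual)]
    return fused, gates


hidden = [1.0, 0.0]
visual_layers = [[2.0, 2.0]]
gate_weights = [[0.0, 0.0, 0.0, 0.0]]  # zero weights give a neutral gate of 0.5

fused, gates = gated_fusion(hidden, visual_layers, gate_weights)
print(fused, gates)
```

Because the gate depends on the current hidden state, the same visual feature can be injected strongly at one decoding step and almost ignored at another.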

[CV-3] See Less Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection

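[Quick Read]: This paper addresses the problem that patch features extracted from foundation models (such as BLIP2) are highly redundant, so end-to-end autonomous driving policies trained on them overfit spurious correlations and lose out-of-distribution (OOD) robustness. The key to the solution is Stochastic-Patch-Selection (SPS): for every frame, a random fraction of patch features is masked while the spatial layout of the remaining patches is preserved, so the policy learns to decide from different but complete views of the same scene and becomes invariant to which specific patches survive. This reduces the effect of redundant information and improves performance across OOD scenarios (6.2% on average and up to 20.4% in closed-loop simulation) while running 2.4× faster.

Link: https://arxiv.org/abs/2601.10707
Authors: Amir Mallak,Erfan Aasi,Shiva Sreeram,Tsun-Hsuan Wang,Daniela Rus,Alaa Maalouf
Affiliations: University of Haifa; CSAIL, MIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: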

Abstract:Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD) settings. We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: 90% of variance is captured by 17/64 principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a 6.2% average improvement and up to 20.4% in closed-loop simulations, while being 2.4× faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
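
The per-frame masking step described above can be sketched as follows. This is a simplified reading of the abstract rather than the authors' code: patches are dropped uniformly at random and each survivor keeps its original row-major index so that its spatial position is still known. The function name, the `keep_ratio` parameter, and the 8x8 grid are assumptions.

```python
import random


def stochastic_patch_selection(patches, keep_ratio, rng):
    """Keep a random subset of patch descriptors, preserving their positions.

    patches:    list of descriptors in row-major order
    keep_ratio: fraction of patches to keep (hypothetical hyperparameter)
    rng:        a random.Random instance, so runs can be reproduced
    Returns (index, descriptor) pairs sorted by original index.
    """
    n_keep = max(1, round(len(patches) * keep_ratio))
    kept = sorted(rng.sample(range(len(patches)), n_keep))
    return [(i, patches[i]) for i in kept]


# Toy frame: an 8x8 grid of patches, each with a dummy 1-D descriptor.
patches = [[float(i)] for i in range(64)]
rng = random.Random(0)

view_a = stochastic_patch_selection(patches, keep_ratio=0.5, rng=rng)
view_b = stochastic_patch_selection(patches, keep_ratio=0.5, rng=rng)
print(len(view_a), [i for i, _ in view_a][:5])
```

Calling the function again on the same frame yields a different subset, which is what exposes the policy to many partial views of one scene. Feeding fewer tokens to the policy is also consistent with the speedup the abstract reports.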

[CV-4] A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements

[Quick Read]: This paper addresses the severe shortage of invertebrate data, ground beetles (Carabidae) in particular, in global trait databases, which limits comprehensive ecological analysis of high-diversity groups. The key to the solution is a multimodal dataset that digitizes more than 13,200 NEON carabid specimens through high-resolution imaging and includes digitally measured elytra length and width for each specimen. Validated against manual measurements, the digital trait extraction achieves sub-millimeter precision, providing a reliable data foundation for automated species identification and trait-based research.

Link: https://arxiv.org/abs/2601.10687
Authors: S M Rayeed,Mridul Khurana,Alyson East,Isadora E. Fluck,Elizabeth G. Campolongo,Samuel Stevens,Iuliia Zarubiieva,Scott C. Lowe,Michael W. Denslow,Evan D. Donoso,Jiaman Wu,Michelle Ramirez,Benjamin Baiser,Charles V. Stewart,Paula Mabee,Tanya Berger-Wolf,Anuj Karpatne,Hilmar Lapp,Robert P. Guralnick,Graham W. Taylor,Sydne Record
Affiliations: Rensselaer Polytechnic Institute; Virginia Tech; The University of Maine; University of Florida; The Ohio State University; Vector Institute; University of Guelph; Duke University; Battelle; National Ecological Observatory Network
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 10 figures; Submitted to Nature Scientific Data

Abstract:Despite the ecological significance of invertebrates, global trait databases remain heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity groups like ground beetles. Ground beetles (Coleoptera: Carabidae) serve as critical bioindicators of ecosystem health, providing valuable insights into biodiversity shifts driven by environmental changes. While the National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the United States, these primarily exist as physical collections, restricting widespread research access and large-scale analysis. To address these gaps, we present a multimodal dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging, enabling broader access and computational analysis. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI. Validated against manual measurements, our digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies. By addressing invertebrate under-representation in trait databases, this work supports AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.

[CV-5] CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning

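[Quick Read]: This paper addresses cultural bias and language limitations in current video understanding benchmarks: existing evaluation data is predominantly Western-centric and in English, so models' reasoning in cross-cultural settings cannot be assessed accurately. The key to the solution is CURVE (Cultural Understanding and Reasoning in Video Evaluation), a multicultural, multilingual video reasoning benchmark with entirely human-generated annotations of videos from 18 global locales, providing complex questions, answers, and multi-step reasoning in native languages and therefore requiring a deep understanding of visual cultural context. The paper also uses the reasoning traces to construct evidence-based graphs and proposes an iterative graph-based strategy for identifying fine-grained reasoning errors, revealing that state-of-the-art Video-LLMs fall well short in visually perceiving cultural elements.

Link: https://arxiv.org/abs/2601.10649
Authors: Darshan Singh,Arsha Nagrani,Kawshik Manikantan,Harman Singh,Dinesh Tewari,Tobias Weyand,Cordelia Schmid,Anelia Angelova,Shachi Dave
Affiliations: Google DeepMind; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: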

Abstract:Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE’s reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under this https URL#minerva-cultural

[CV-6] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

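[Quick Read]: This paper addresses the insufficient coupling between 3D human motion generation and 2D human video generation, that is, how to keep motion structurally plausible while exploiting the strong generalization of pre-trained video models. The key to the solution is CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) within a single diffusion denoising loop to generate 3D human motions and 2D videos synchronously. The framework uses a 2D human motion representation that can inherit the prior of pre-trained VDMs and a dual-branch diffusion model with mutual feature interaction and 3D-2D cross attention.

Link: https://arxiv.org/abs/2601.10632
Authors: Chengfeng Zhao,Jiazhi Shu,Yubo Zhao,Tianyu Huang,Jiahao Lu,Zekai Gu,Chengwei Ren,Zhiyang Dou,Qing Shuai,Yuan Liu
Affiliations: HKUST; SCUT; CUHK; MIT; ZJU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL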

Abstract:In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.

[CV-7] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

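[Quick Read]: This paper addresses the limitations of current open video-language models (VLMs), in particular the lack of high-quality training data, the inability to ground outputs in pixels, and the reliance on distillation from proprietary models. The key to the solution is a collection of 7 new video datasets and 2 multi-image datasets, covering highly detailed video captions, free-form video QA, object tracking with complex queries, and a new video pointing task, all collected without closed VLMs. The paper also presents a training recipe that uses efficient packing and message-tree encoding, bi-directional attention on vision tokens, and a novel token-weight strategy. The resulting models improve on short-video understanding, counting, captioning, and video grounding, and on video pointing they outperform existing open-weight models and some proprietary models.

Link: https://arxiv.org/abs/2601.10611
Authors: Christopher Clark,Jieyu Zhang,Zixian Ma,Jae Sung Park,Mohammadreza Salehi,Rohun Tripathi,Sangho Lee,Zhongzheng Ren,Chris Dongjoo Kim,Yinuo Yang,Vincent Shao,Yue Yang,Weikai Huang,Ziqi Gao,Taira Anderson,Jianrui Zhang,Jitesh Jain,George Stoica,Winson Han,Ali Farhadi,Ranjay Krishna
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: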

Abstract:Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding – either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video QA dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 JF on video tracking).

[CV-8] RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation

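[Quick Read]: This paper addresses two core problems of talking head generation for multi-turn conversation in virtual reality (VR): mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods look natural but incur prohibitive computational cost; in addition, existing methods generally ignore social relationships. The key to the solution is RSATalker, described as the first framework to use 3D Gaussian Splatting (3DGS) for realistic, socially-aware talking head generation with multi-turn conversation support. It first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. A socially-aware module uses a learnable query mechanism to encode social relationships (blood and non-blood, equal and unequal) into high-level embeddings.

Link: https://arxiv.org/abs/2601.10606
Authors: Peng Chen,Xiaobao Wei,Yi Yang,Naiming Yao,Hui Chen,Feng Tian
Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: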

Abstract:Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.

[CV-9] Action100M: A Large-scale Video Action Dataset

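[Quick read]: This paper addresses the problem of inferring physical actions from visual observations, a fundamental capability for advancing machine intelligence in the physical world. To this end, the authors build Action100M, a large-scale, open-vocabulary video action dataset that spans broad domains and provides temporally localized segments with rich captions. The key to the solution is a fully automated data-generation pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M yields consistent data-scaling improvements and strong zero-shot performance across several action recognition benchmarks.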

Link: https://arxiv.org/abs/2601.10592
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
Affiliations: The University of Hong Kong; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.

[CV-10] Adversarial Evasion Attacks on Computer Vision using SHAP Values

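[Quick read]: This paper studies a white-box adversarial evasion attack on deep learning models for computer vision, in which perturbations that are imperceptible to the human eye reduce output confidence or induce misclassification. The key idea is to use SHAP (SHapley Additive exPlanations) values to quantify how much each input contributes to the model output at inference time, and to build the adversarial examples from that attribution. Compared with the classic Fast Gradient Sign Method (FGSM), the authors report evidence that the SHAP attack is more robust at generating misclassifications, particularly in gradient hiding scenarios.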

Link: https://arxiv.org/abs/2601.10587
Authors: Frank Mollard, Marcus Becker, Florian Roehrbein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10th bwHPC Symposium - September 25th 26th, 2024

Abstract:The paper introduces a white-box attack on computer vision models using SHAP values. It demonstrates how adversarial evasion attacks can compromise the performance of deep learning models by reducing output confidence or inducing misclassifications. Such attacks are particularly insidious as they can deceive the perception of an algorithm while eluding human perception due to their imperceptibility to the human eye. The proposed attack leverages SHAP values to quantify the significance of individual inputs to the output at the inference stage. A comparison is drawn between the SHAP attack and the well-known Fast Gradient Sign Method. We find evidence that SHAP attacks are more robust in generating misclassifications particularly in gradient hiding scenarios.
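For readers unfamiliar with the baseline named in the abstract, here is a minimal sketch of the Fast Gradient Sign Method on a toy logistic-regression classifier. It is not the paper's SHAP attack and does not reproduce its setup; the weights `w`, bias `b`, input `x`, and step size `eps` are made-up values used only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, eps):
    """One FGSM step: x_adv = x + eps * sign(d loss / d x).

    For binary cross-entropy on a logistic model, the gradient of the
    loss with respect to the input is (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]


# Hypothetical toy model and input (assumptions, not from the paper).
w, b = [2.0, -1.0, 0.5], 0.1
x, y, eps = [0.6, 0.2, 0.9], 1, 0.1

x_adv = fgsm(w, b, x, y, eps)
p_clean = predict(w, b, x)
p_adv = predict(w, b, x_adv)
print(f"confidence for true class: clean={p_clean:.3f}, adversarial={p_adv:.3f}")
```

The perturbation is bounded by `eps` in every coordinate, which is what keeps FGSM perturbations small; the attack needs the input gradient, which is why gradient hiding is a relevant scenario for comparing it with an attribution-based attack.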

[CV-11] Jordan-Segmentable Masks: A Topology-Aware definition for characterizing Binary Image Segmentation

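[Quick read]: This paper addresses the limits of conventional image segmentation metrics in measuring structural coherence. In settings such as medical imaging or object delineation, masks with boundary errors, holes, or fragmented predictions can still receive high scores even though these defects break the object's global shape and connectivity. The key to the solution is a topology-aware notion of segmentation based on the Jordan Curve Theorem. The authors define a "Jordan-segmentatable mask": a binary segmentation mask whose structure guarantees a topological separation of the image domain into two connected components. Using digital topology and homology theory, the method extracts a candidate 4-curve from the mask and verifies its topological validity through the Betti numbers $\beta_0 = \beta_1 = 1$, which is equivalent to the complement of the curve splitting into exactly two 8-connected components. The framework provides a mathematically rigorous, unsupervised criterion for assessing the structural coherence of segmentation masks.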

Link: https://arxiv.org/abs/2601.10577
Authors: Serena Grazia De Benedictis, Amedeo Altavilla, Nicoletta Del Buono
Affiliations: University of Bari Aldo Moro
Categories: Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT); Numerical Analysis (math.NA)
Comments: 27 pages, 18 figures

Abstract: Image segmentation plays a central role in computer vision. However, widely used evaluation metrics, whether pixel-wise, region-based, or boundary-focused, often struggle to capture the structural and topological coherence of a segmentation. In many practical scenarios, such as medical imaging or object delineation, masks with small boundary inaccuracies, holes, or fragmented predictions can still receive high metric scores, despite the fact that they fail to preserve the object's global shape or connectivity. This highlights a limitation of conventional metrics: they are unable to assess whether a predicted segmentation partitions the image into meaningful interior and exterior regions. In this work, we introduce a topology-aware notion of segmentation based on the Jordan Curve Theorem, adapted for use in digital planes. We define the concept of a Jordan-segmentatable mask, which is a binary segmentation whose structure ensures a topological separation of the image domain into two connected components. We analyze segmentation masks through the lens of digital topology and homology theory, extracting a 4-curve candidate from the mask and verifying its topological validity using Betti numbers. A mask is considered Jordan-segmentatable when this candidate forms a digital 4-curve with $\beta_0 = \beta_1 = 1$, or equivalently when its complement splits into exactly two 8-connected components. This framework provides a mathematically rigorous, unsupervised criterion with which to assess the structural coherence of segmentation masks. By combining digital Jordan theory and homological invariants, our approach provides a valuable alternative to standard evaluation metrics, especially in applications where topological correctness must be preserved.
MSC classes: 54H30, 68U03. Cite as: arXiv:2601.10577 [cs.CV]. DOI: https://doi.org/10.48550/arXiv.2601.10577
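The complement condition stated in the abstract (exactly two 8-connected components) can be illustrated with a small sketch. This is only that one check, written with a plain breadth-first search; it does not extract the 4-curve candidate or compute Betti numbers, and the function name `count_components_8` and the two example grids are hypothetical.

```python
from collections import deque


def count_components_8(grid, value=0):
    """Count 8-connected components of cells equal to `value`."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            # New component found: flood-fill it over all 8 neighbours.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and not seen[nr][nc]
                                and grid[nr][nc] == value):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return count


# 1 marks curve pixels, 0 marks the complement (toy examples).
closed_curve = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
# Same curve with one corner pixel removed: the interior leaks out diagonally.
broken_curve = [row[:] for row in closed_curve]
broken_curve[1][1] = 0

print(count_components_8(closed_curve), count_components_8(broken_curve))
```

The closed curve separates interior from exterior, while removing a single corner pixel merges them under 8-connectivity, which is the kind of defect a pixel-overlap metric would barely register.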

[CV-12] Process-Guided Concept Bottleneck Model

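[Quick read]: This paper addresses the limitations of standard Concept Bottleneck Models (CBMs) in scientific applications: they overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits their use in scientific domains where supervision is sparse but the processes are well defined. The key to the solution is the Process-Guided Concept Bottleneck Model (PG-CBM), which introduces biophysically meaningful intermediate concepts and constrains learning to follow domain-defined causal mechanisms, improving accuracy, interpretability, and trustworthiness.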

Link: https://arxiv.org/abs/2601.10562
Authors: Reza M. Asiyabi (1, 2), SEOSAW Partnership (1), Steven Hancock (1, 2), Casey Ryan (1); (1) School of GeoSciences, University of Edinburgh, UK; (2) UK National Centre for Earth Observation (NCEO)
Affiliations: University of Edinburgh; University College London; University of Hamburg; University of Nairobi; Namibia University of Science and Technology; University of Turin; Royal Botanic Garden Edinburgh; Kenya Forestry Research Institute; WeForest; Universidade Mandume Ya Ndemufayo; University of Limpopo; South African Environmental Observation Network (SAEON); Sokoine University of Agriculture; Ministry of Agriculture, Environment and Fisheries; Forestry Research Institute; University of Liverpool; University of Pretoria; Nature+; National Centre for Biological Sciences (TIFR); Eduardo Mondlane University; World Wide Fund for Nature (WWF); University of Exeter; University of Leeds; Faculty of Biological Sciences; SANParks; Nelson Mandela University; University of Bayreuth; School of Mathematical and Physical Sciences; UK National Centre for Earth Observation (NCEO)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages with 7 figures and 1 table, Supplementary Materials 10 pages with 3 figures

Abstract:Concept Bottleneck Models (CBMs) improve the explainability of black-box Deep Learning (DL) by introducing intermediate semantic concepts. However, standard CBMs often overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well defined. To address this, we propose the Process-Guided Concept Bottleneck Model (PG-CBM), an extension of CBMs which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. Using above ground biomass density estimation from Earth Observation data as a case study, we show that PG-CBM reduces error and bias compared to multiple benchmarks, whilst leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs. Beyond improved accuracy, PG-CBM enhances transparency, enables detection of spurious learning, and provides scientific insights, representing a step toward more trustworthy AI systems in scientific applications.

[CV-13] DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery

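[Quick read]: This paper addresses the limited prediction and planning capability of automated driving systems in complex traffic, and in particular the scarcity of dense-traffic scenarios in current benchmarks, which restricts how well models generalize. The key to the solution is DeepUrban, a new drone dataset that extracts 3D traffic objects from high-resolution images captured over urban intersections at approximately 100 meters altitude and enriches them with comprehensive map and scene information, providing dense urban data for trajectory prediction and planning. Experiments show that adding DeepUrban to nuScenes improves the accuracy of vehicle prediction and planning, with gains of up to 44.1% / 44.3% on the ADE / FDE metrics.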

Link: https://arxiv.org/abs/2601.10554
Authors: Constantin Selzer, Fabian B. Flohr
Affiliations: Munich University of Applied Science; Intelligent Vehicles Lab (IVL)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban, a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods and conduct experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements of up to 44.1% / 44.3% on the ADE / FDE metrics. Website: this https URL
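As a reminder of what the two reported metrics measure, the sketch below computes ADE (average displacement error) and FDE (final displacement error) for a single predicted trajectory, using their standard definitions. The paper's exact evaluation protocol (horizon, number of modes, agent filtering) is not given here, and the two toy trajectories are invented for illustration.

```python
import math


def ade_fde(predicted, ground_truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) points.

    ADE is the mean Euclidean error over all timesteps;
    FDE is the Euclidean error at the last timestep.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equal length")
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors), errors[-1]


# Hypothetical trajectories in meters (assumed values).
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

ade, fde = ade_fde(predicted, ground_truth)
print(f"ADE={ade:.2f} m, FDE={fde:.2f} m")
```

Because errors usually grow with the prediction horizon, FDE is typically the larger of the two, as it is in this toy case.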

[CV-14] Inference-time Physics Alignment of Video Generative Models with Latent World Models

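[Quick read]: This paper addresses the fact that current video generative models often violate basic physics principles, which limits their practical value. While some work attributes this to insufficient physics understanding from pre-training, the authors find that suboptimal inference strategies also contribute. The core of the solution is WMReward, which uses the strong physics prior of a latent world model (VJEPA-2) as a reward signal to search and steer multiple candidate denoising trajectories at inference time, scaling test-time compute to improve generation. The method treats physics plausibility as an inference-time alignment problem and substantially improves it in image-conditioned, multiframe-conditioned, and text-conditioned generation. In the ICCV 2025 Perception Test PhysicsIQ Challenge it scores 62.64%, outperforming the previous state of the art by 7.42%.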

Link: https://arxiv.org/abs/2601.10553
Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
Affiliations: Meta; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 10 figures

Abstract:State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.

[CV-15] Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure

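[Quick read]: This paper addresses the difficulty general-purpose vision models have in intelligent perception of urban roadside infrastructure: they struggle to recognize fine-grained attributes and to comply with engineering standards, and are unreliable when judging complex facility states. The core solution is a domain-adapted framework that turns large vision-language models (VLMs) into specialized infrastructure analysis agents. It combines a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism: open-vocabulary fine-tuning of Grounding DINO robustly localizes diverse assets, LoRA-based adaptation of Qwen-VL handles deep semantic attribute reasoning, and a dual-modality Retrieval-Augmented Generation (RAG) module dynamically retrieves authoritative industry standards and visual exemplars during inference to mitigate hallucinations and enforce professional compliance.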

Link: https://arxiv.org/abs/2601.10551
Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430079, China; Hubei Luojia Laboratory, Wuhan 430079, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.

[CV-16] Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning

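[Quick read]: This paper addresses the problem that gauge images captured in hazy and smoky environments are hard to read because of reduced visibility, which hinders infrastructure monitoring and emergency response. The key to the solution is to enhance the degraded images with deep learning models (FFA-Net and AECR-Net) so that gauges can be read automatically. The study builds a new synthetic dataset of over 14,000 images and trains the models on light to dense haze and smoke. On the synthetic haze dataset, SSIM and PSNR reach about 0.98 and 43 dB, which compares well with state-of-the-art results, and AECR-Net is the more robust of the two models. Results on smoke are poorer but still promising. Overall, deep learning markedly improves the quality of analog gauge images captured in smoke and haze, providing usable input for subsequent automatic gauge reading.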

Link: https://arxiv.org/abs/2601.10537
Authors: Oscar H. Ramírez-Agudelo, Akshay N. Shewatkar, Edoardo Milana, Roland C. Aydin, Kai Franke
Affiliations: German Aerospace Center (DLR), Institute for the Protection of Terrestrial Infrastructures; Department of Microsystems Engineering (IMTEK), University of Freiburg; Helmholtz Center Hereon, Institute of Material Systems Modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures, 6 tables, SPIE Applications of Machine Learning 2023, San Diego, US

Abstract: Images captured in hazy and smoky environments suffer from reduced visibility, posing a challenge when monitoring infrastructure and hindering emergency services during critical situations. The proposed work investigates the use of deep learning models to enhance the automatic, machine-based readability of gauges in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images corrupted with light to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset, containing over 14,000 images, was generated using the Unreal Engine. The models were trained with an 80% train, 10% validation, and 10% test split for the haze and smoke datasets, respectively. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43 dB, respectively, comparing well to state-of-the-art results. Additionally, more robust results are obtained from AECR-Net than from FFA-Net. Although the results from the synthetic smoke dataset are poorer, the trained models achieve interesting results. There are two reasons for this. First, images captured in the presence of smoke are more difficult to enhance given the smoke's inhomogeneity and high density. Second, FFA-Net and AECR-Net are designed to dehaze, not to desmoke, images. This work shows that deep learning architectures can immensely improve the quality of analog gauge images captured in smoke and haze scenes. Finally, the enhanced output images can be successfully post-processed for automatic reading of gauges.
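To make the reported PSNR figure concrete, the following sketch computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), on flattened 8-bit pixel lists. SSIM is not covered, and the pixel values below are invented; nothing here reproduces the paper's evaluation code.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 grayscale patches, flattened row by row.
clean = [0, 128, 255, 64]
dehazed = [0, 130, 255, 62]

score = psnr(clean, dehazed)
print(f"PSNR = {score:.2f} dB")
```

Since the scale is logarithmic, a PSNR in the low 40s for 8-bit images corresponds to a mean squared error of only a few gray levels squared.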

[CV-17] SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery

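[Quick read]: This paper addresses how to automatically build digital twins and precise asset inventories from low-cost sparse imagery for smart city construction and facility lifecycle management. Current methods suffer from limited robustness, inaccurate localization, and a lack of fine-grained state understanding. The key to the solution is SVII-3D, a unified framework with three parts. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism resolves structural errors and achieves decimeter-level 3D localization. Third, a Vision-Language Model agent with multi-modal prompting automatically diagnoses fine-grained operational states, going beyond static geometric mapping.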

Link: https://arxiv.org/abs/2601.10535
Authors: Chong Liu, Luxuan Fu, Yang Jia, Zhen Dong, Bisheng Yang
Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430079, China; Sichuan Highway Planning, Survey, Design and Research Institute Ltd, Chengdu 610000, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.

[CV-18] BikeActions: An Open Platform and Benchmark for Cyclist-Centric VRU Action Recognition ICPR

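[Quick read]: This paper addresses the challenge of anticipating the intentions of Vulnerable Road Users (VRUs) for autonomous driving (AD) and mobile robotics, with a focus on interactions within dense shared spaces. Existing research mostly studies pedestrian crossing behavior from a vehicle's perspective and leaves these interactions underexplored. The key to the solution is FUSE-Bike, described as the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it captures high-fidelity, close-range data directly from a cyclist's viewpoint. Built on this platform, the BikeActions dataset contains 852 annotated samples across 5 action classes, and a benchmark of graph convolution and transformer-based models establishes the first performance baselines for this task.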

Link: https://arxiv.org/abs/2601.10521
Authors: Max A. Buettner, Kanak Mazumder, Luca Koecher, Mario Finkbeiner, Sebastian Niebler, Fabian B. Flohr
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE ICPR for possible publication

Abstract:Anticipating the intentions of Vulnerable Road Users (VRUs) is a critical challenge for safe autonomous driving (AD) and mobile robotics. While current research predominantly focuses on pedestrian crossing behaviors from a vehicle’s perspective, interactions within dense shared spaces remain underexplored. To bridge this gap, we introduce FUSE-Bike, the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it facilitates high-fidelity, close-range data capture directly from a cyclist’s viewpoint. Leveraging this platform, we present BikeActions, a novel multi-modal dataset comprising 852 annotated samples across 5 distinct action classes, specifically tailored to improve VRU behavior modeling. We establish a rigorous benchmark by evaluating state-of-the-art graph convolution and transformer-based models on our publicly released data splits, establishing the first performance baselines for this challenging task. We release the full dataset together with data curation tools, the open hardware design, and the benchmark code to foster future research in VRU action understanding under this https URL.

[CV-19] SatMap: Revisiting Satellite Maps as Prior for Online HD Map Construction ICPR

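[Quick read]: This paper addresses the accuracy loss that onboard camera-based online high-definition (HD) map construction suffers because of limited depth perception and occlusion. The key to the solution is SatMap, which fuses multi-view camera observations with satellite imagery. The lane-level semantics and texture that satellite images provide from a Bird's Eye View (BEV) perspective serve as a global prior that mitigates depth ambiguity and occlusion, and the method directly predicts a vectorized HD map for downstream prediction and planning modules.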

Link: https://arxiv.org/abs/2601.10512
Authors: Kanak Mazumder, Fabian B. Flohr
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE ICPR for possible publication

Abstract:Online high-definition (HD) map construction is an essential part of a safe and robust end-to-end autonomous driving (AD) pipeline. Onboard camera-based approaches suffer from limited depth perception and degraded accuracy due to occlusion. In this work, we propose SatMap, an online vectorized HD map estimation method that integrates satellite maps with multi-view camera observations and directly predicts a vectorized HD map for downstream prediction and planning modules. Our method leverages lane-level semantics and texture from satellite imagery captured from a Bird’s Eye View (BEV) perspective as a global prior, effectively mitigating depth ambiguity and occlusion. In our experiments on the nuScenes dataset, SatMap achieves 34.8% mAP performance improvement over the camera-only baseline and 8.5% mAP improvement over the camera-LiDAR fusion baseline. Moreover, we evaluate our model in long-range and adverse weather conditions to demonstrate the advantages of using a satellite prior map. Source code will be available at this https URL.

[CV-20] mergetune: Continued fine-tuning of vision-language models

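[Quick read]: This paper addresses catastrophic forgetting in fine-tuned vision-language models (VLMs), where a model loses pretrained knowledge after adapting to a new task. Prior work mainly tries to mitigate forgetting during adaptation but cannot fully prevent it. The authors propose a new paradigm, continued fine-tuning (CFT), which recovers pretrained knowledge after a zero-shot model has already been adapted, and a model-agnostic strategy named MERGETUNE that requires no architectural changes. Its core is a linear mode connectivity (LMC) constraint: it searches the loss landscape for a "continued model" that has low-loss paths to both the original zero-shot model and the fine-tuned model, implicitly merging the knowledge of the two. Because the vanilla LMC constraint requires replaying pretraining data, the authors approximate it for the zero-shot model with a second-order surrogate, which removes the need for large-scale data replay. Experiments show that MERGETUNE improves base-novel generalisation without adding parameters and outperforms existing baselines in robust fine-tuning evaluations.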

Link: https://arxiv.org/abs/2601.10497
Authors: Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler
Affiliations: University of Surrey; Samsung AI Centre Cambridge; Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 5 figures

Abstract: Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. We show for the first time superior performance than CLIP on both DTD and EuroSAT, on cross-dataset transfer. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at this https URL.
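Linear mode connectivity is usually checked by evaluating the loss along the straight line between two parameter vectors and measuring how far it rises above the interpolated endpoint losses. The sketch below implements only that generic diagnostic on toy one-parameter losses; it is not MERGETUNE's training objective or its second-order surrogate, and `loss_barrier`, `double_well`, and `bowl` are hypothetical names.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the segment between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest excess of the path loss over the linearly interpolated
    endpoint losses; a value near zero suggests a low-loss linear path."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    worst = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        worst = max(worst, path_loss - baseline)
    return worst


def double_well(theta):
    # Two separate minima at -1 and +1 with a bump between them.
    return (theta[0] ** 2 - 1.0) ** 2


def bowl(theta):
    # Single convex basin: any two points are linearly connected.
    return theta[0] ** 2


barrier_wells = loss_barrier(double_well, [-1.0], [1.0])
barrier_bowl = loss_barrier(bowl, [-1.0], [1.0])
print(barrier_wells, barrier_bowl)
```

In the double-well case the two solutions are each good but the path between them is not, which is the situation a connectivity constraint is meant to rule out.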

[CV-21] Urban Socio-Semantic Segmentation with Vision-Language Reasoning

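[Quick read]: This paper addresses socio-semantic segmentation in urban satellite imagery: accurately identifying and segmenting socially defined entities (e.g., schools, parks) rather than only categories defined by physical attributes (e.g., buildings, water bodies). Current segmentation models handle the latter reliably but struggle with the former. The key to the solution is SocioReasoner, a vision-language reasoning framework that simulates the human annotation process through cross-modal recognition and multi-stage reasoning, and uses reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. The authors also build SocioSeg, a dataset for socio-semantic segmentation comprising satellite imagery, digital maps, and hierarchically organized pixel-level labels.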

Link: https://arxiv.org/abs/2601.10477
Authors: Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li
Affiliations: Wuhan University; Amap, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach’s gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available in this https URL.

[CV-22] ChartComplete: A Taxonomy-based Inclusive Chart Dataset

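[Quick read]: This paper addresses the limited coverage of chart types in the benchmark datasets used to evaluate multimodal large language models (MLLMs) on chart understanding. Existing datasets cover only a small set of chart types, which prevents a comprehensive evaluation of MLLMs across diverse charts. The key to the solution is the ChartComplete dataset, which is based on a chart taxonomy from the visualization community and covers thirty chart types. It is a collection of classified chart images and does not include a learning signal; the authors release it as is for the community to build upon.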

Link: https://arxiv.org/abs/2601.10462
Authors: Ahmad Mustapha, Charbel Toumieh, Mariette Awad
Affiliations: American University of Beirut, Lebanon
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures, 3 tables, 1 algorithm. Dataset and source code available at this https URL

Abstract:With advancements in deep learning (DL) and computer vision techniques, the field of chart understanding is evolving rapidly. In particular, multimodal large language models (MLLMs) are proving to be efficient and accurate in understanding charts. To accurately measure the performance of MLLMs, the research community has developed multiple datasets to serve as benchmarks. By examining these datasets, we found that they are all limited to a small set of chart types. To bridge this gap, we propose the ChartComplete dataset. The dataset is based on a chart taxonomy borrowed from the visualization community, and it covers thirty different chart types. The dataset is a collection of classified chart images and does not include a learning signal. We present the ChartComplete dataset as is to the community to build upon it.

[CV-23] Lunar-G2R: Geometry-to-Reflectance Learning for High-Fidelity Lunar BRDF Estimation

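[Quick read]: This paper addresses the estimation of realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based navigation. Existing lunar rendering pipelines rely on simplified or spatially uniform Bidirectional Reflectance Distribution Function (BRDF) models whose parameters are difficult to estimate and which cannot capture local reflectance variations, limiting photometric realism. The key to the solution is Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without multi-view imagery, controlled illumination, or dedicated reflectance-capture hardware at inference time. A U-Net is trained with differentiable rendering to minimize the photometric discrepancy between real orbital images and physically based renderings under known viewing and illumination geometry. On a geographically held-out region of the Tycho crater, the method reduces photometric error by 38% compared with a state-of-the-art baseline and improves PSNR, SSIM, and perceptual similarity. The authors state that, to their knowledge, it is the first method to infer a spatially varying reflectance model directly from terrain geometry.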

Link: https://arxiv.org/abs/2601.10449
Authors: Clementine Grethen, Nicolas Menga, Roland Brochard, Geraldine Morin, Simone Gasparini, Jeremy Lebreton, Manuel Sanchez Gestido
Affiliations: IRIT, University of Toulouse, France; Airbus Defence and Space, Toulouse, France; ESA ESTEC, Noordwijk, The Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Data code: this https URL

Abstract:We address the problem of estimating realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based navigation. Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are difficult to estimate and fail to capture local reflectance variations, limiting photometric realism. We propose Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without requiring multi-view imagery, controlled illumination, or dedicated reflectance-capture hardware at inference time. The method leverages a U-Net trained with differentiable rendering to minimize photometric discrepancies between real orbital images and physically based renderings under known viewing and illumination geometry. Experiments on a geographically held-out region of the Tycho crater show that our approach reduces photometric error by 38 % compared to a state-of-the-art baseline, while achieving higher PSNR and SSIM and improved perceptual similarity, capturing fine-scale reflectance variations absent from spatially uniform models. To our knowledge, this is the first method to infer a spatially varying reflectance model directly from terrain geometry.

[CV-24] Subjective evaluation of UHD video coded using VVC with LCEVC and ML-VVC

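[Quick read]: This paper examines whether applying Low Complexity Enhancement Video Coding (LCEVC) in a multilayer configuration improves the perceived quality of reconstructed UHD video. LCEVC is used as an enhancement layer on top of a Versatile Video Coding (VVC) base layer: the HD base layer is enhanced to produce UHD output. The subjective test uses the Degradation Category Rating (DCR) methodology on fifteen SDR and HDR sequences at two operating points, with the enhancement layer accounting for roughly 10% and 50% of the total bitrate, and compares against an upsampled VVC base layer and multilayer VVC (ML-VVC). Results are reported as Mean Opinion Scores (MOS) with 95% confidence intervals, allowing perceptual quality to be compared across coding approaches and operating points.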

Link: https://arxiv.org/abs/2601.10448
Authors: Naeem Ramzan, Muhammad Tufail Khan
Affiliations: University of the West of Scotland
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents the results of a subjective quality assessment of a multilayer video coding configuration in which Low Complexity Enhancement Video Coding (LCEVC) is applied as an enhancement layer on top of a Versatile Video Coding (VVC) base layer. The evaluation follows the same test methodology and conditions previously defined for MPEG multilayer video coding assessments, with the LCEVC enhancement layer encoded using version 8.1 of the LCEVC Test Model (LTM). The test compares reconstructed UHD output generated from an HD VVC base layer with LCEVC enhancement against two reference cases: upsampled VVC base layer decoding and multilayer VVC (ML-VVC). Two operating points are considered, corresponding to enhancement layers representing approximately 10% and 50% of the total bitrate. Subjective assessment was conducted using the Degradation Category Rating (DCR) methodology with twenty-five participants, across a dataset comprising fifteen SDR and HDR sequences. The reported results include Mean Opinion Scores (MOS) with associated 95% confidence intervals, enabling comparison of perceptual quality across coding approaches and operating points within the defined test scope.
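
As a rough illustration of how a MOS and its 95% confidence interval can be computed for one test condition, here is a minimal Python sketch. It assumes a normal approximation for the interval (z of about 1.96), which may differ from the authors' analysis; `mos_with_ci` is a hypothetical helper and the ratings are made-up numbers, not data from the paper.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mos_with_ci(scores, confidence=0.95):
    """Return (MOS, lower bound, upper bound) for one test condition.

    Assumption: normal approximation for the interval, which is only
    reasonable when there are enough raters per condition.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings")
    mos = mean(scores)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / sqrt(n)
    return mos, mos - half_width, mos + half_width


# Hypothetical DCR ratings on a 1-5 scale, not data from the paper.
ratings = [5, 4, 4, 3, 5, 4, 4, 5, 3, 4]
mos, low, high = mos_with_ci(ratings)
print(f"MOS = {mos:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With few raters, a Student t critical value would give a slightly wider interval than this approximation.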

[CV-25] Multi-Temporal Frames Projection for Dynamic Processes Fusion in Fluorescence Microscopy

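[Quick read]: This paper addresses the degraded image quality and inconsistent visualisation of biological information in fluorescence microscopy caused by noise, temporal variability, and signals that oscillate over time. The key to the solution is a computational framework that integrates information from multiple time-resolved frames into a single high-quality image while preserving the biological content of the original video. The framework combines explainable techniques from different computer vision application fields, yields a 44% average increase in cell count compared to previous methods, and is applicable to other imaging domains that need to fuse multi-temporal image stacks into high-quality 2D images.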

Link: https://arxiv.org/abs/2601.10392
Authors: Hassan Eshkiki, Sarah Costa, Mostafa Mohammadpour, Farinaz Tanhaei, Christopher H. George, Fabio Caraffini
Affiliations: Swansea University; Johannes Kepler University; Swansea University Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fluorescence microscopy is widely employed for the analysis of living biological samples; however, the utility of the resulting recordings is frequently constrained by noise, temporal variability, and inconsistent visualisation of signals that oscillate over time. We present a unique computational framework that integrates information from multiple time-resolved frames into a single high-quality image, while preserving the underlying biological content of the original video. We evaluate the proposed method through an extensive number of configurations (n = 111) and on a challenging dataset comprising dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells. Results show that our framework, which consists of a combination of explainable techniques from different computer vision application fields, is capable of generating composite images that preserve and enhance the quality and information of individual microscopy frames, yielding a 44% average increase in cell count compared to previous methods. The proposed pipeline is applicable to other imaging domains that require the fusion of multi-temporal image stacks into high-quality 2D images, thereby facilitating annotation and downstream segmentation.

[CV-26] Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer

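[Quick read]: This paper addresses the performance drop in survival prediction for Non-Small Cell Lung Cancer (NSCLC) when multimodal data (clinical, radiological, and histopathological) has missing modalities. Conventional approaches typically fall back on complete-case filtering or aggressive imputation, which limits clinical applicability. The key to the solution is a missing-aware multimodal survival framework that uses Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy to support intermediate fusion under naturally incomplete modality profiles, so that all available data is used during training and inference without dropping patients. Experiments show that intermediate fusion outperforms unimodal baselines as well as early and late fusion, and that less informative modalities such as CT are automatically down-weighted. The best result, a C-index of 73.30, comes from fusing WSI and clinical modalities.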

Link: https://arxiv.org/abs/2601.10386
Authors: Filippo Ruffini, Camillo Maria Caruso, Claudia Tacconi, Lorenzo Nibid, Francesca Miccolis, Marta Lovino, Carlo Greco, Edy Ippolito, Michele Fiore, Alessio Cortellini, Bruno Beomonte Zobel, Giuseppe Perrone, Bruno Vincenzi, Claudio Marrocco, Alessandro Bria, Elisa Ficarra, Sara Ramella, Valerio Guarrasi, Paolo Soda
Affiliations: Università degli Studi Niccolò Cusano; Umeå University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires the integration of heterogeneous clinical, radiological, and histopathological information. While Multimodal Deep Learning (MDL) holds promise for precision prognosis and survival prediction, its clinical applicability is severely limited by small cohort sizes and the presence of missing modalities, often forcing complete-case filtering or aggressive imputation. In this work, we present a missing-aware multimodal survival framework that integrates Computed Tomography (CT), Whole-Slide Histopathology (WSI) Images, and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. By leveraging Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy, the proposed approach enables intermediate multimodal fusion under naturally incomplete modality profiles. The proposed architecture is resilient to missing modalities by design, allowing the model to utilize all available data without being forced to drop patients during training or inference. Experimental results demonstrate that intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies, with the strongest performance achieved by the fusion of WSI and clinical modalities (73.30 C-index). Further analyses of modality importance reveal an adaptive behavior in which less informative modalities, i.e., CT modality, are automatically down-weighted and contribute less to the final survival prediction.
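
For readers unfamiliar with the C-index quoted above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is an illustration only, not the authors' evaluation code. The function name and the toy cohort are hypothetical, and it is an assumption that the reported 73.30 corresponds to 0.7330 on the 0 to 1 scale used here.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up times
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk scores, higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values): the third patient is censored.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of patients by risk.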

[CV-27] Global Context Compression with Interleaved Vision-Text Transformation

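[Quick read]: This paper addresses the quadratic growth of attention computation with text sequence length, which causes high compute and memory costs in long-text generation. The core challenge is to compress the global context without hurting model performance, so that resources are saved during both prefilling and token-by-token inference. The key to the solution is VIST2, a Transformer that renders text chunks into sketch images, interleaves the text chunks with their visual encodings in the input, and relies exclusively on the visual tokens of the preceding context to predict the next text token distribution. At a 4× compression ratio, the models achieve on average a 3× speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPS.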

Link: https://arxiv.org/abs/2601.10378
Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang
Affiliations: China Electronics Cloud Technology Co Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer’s input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4× compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3× speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.

[CV-28] Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement

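[Quick read]: This paper addresses the slow sampling and suboptimal bit allocation of current diffusion-based image compression methods, which stem from fragmented training paradigms. The key to the solution is DiffCR (Diffusion-based Image Compression via Consistency Prior Refinement). Its core is a Frequency-aware Skip Estimation (FaSE) module that refines the ε-prediction prior of a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). A lightweight consistency estimator additionally enables fast two-step decoding. Without updating the backbone diffusion model, DiffCR achieves BD-rate savings of 27.2% (LPIPS) and 65.1% (PSNR) and a speed-up of more than 10×.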

Link: https://arxiv.org/abs/2601.10373
Authors: Yichong Xia, Yimin Zhou, Jinpeng Wang, Bin Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose Accelerate Diffusion-based Image Compression via Consistency Prior Refinement (DiffCR), a novel compression framework for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the ε-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast two-step decoding by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR)) and over 10× speed-up compared to SOTA diffusion-based compression baselines.

[CV-29] Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs

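[Quick read]: This paper addresses the structural anomalies and generative artifacts that are common in text-guided human pose editing, and the limitation that existing evaluation metrics separate authenticity detection from quality assessment and give no fine-grained insight into pose-specific inconsistencies. The key to the solution is HPE-Bench, a specialized benchmark of 1,700 standardized samples from 17 state-of-the-art editing models with authenticity labels and multi-dimensional quality scores, together with a unified framework based on layer-selective multimodal large language models (MLLMs). Using contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, the framework identifies the optimal feature layer for pose evaluation and achieves superior performance in both authenticity detection and multi-dimensional quality regression, bridging forensic detection and quality assessment.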

Link: https://arxiv.org/abs/2601.10369
Authors: Ningyu Sun, Zhaolin Cai, Zitong Xu, Peihang Chen, Huiyu Duan, Yichao Yan, Xiongkuo Min, Xiaokang Yang
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-guided human pose editing has gained significant traction in AIGC applications. However, it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.

[CV-30] An analytic theory of convolutional neural network inverse problems solvers

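[Quick read]: This paper addresses the lack of theoretical understanding of supervised convolutional neural networks (CNNs) for imaging inverse problems: despite their strong empirical performance, they are often treated as black boxes. The authors analyze trained networks through the lens of the Minimum Mean Square Error (MMSE) estimator, adding functional constraints that capture two inductive biases of CNNs: translation equivariance, and locality via finite receptive fields. The core of the solution is an analytic, interpretable, and tractable formula for this constrained estimator, termed Local-Equivariant MMSE (LE-MMSE). Across several inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), the theory closely matches the outputs of trained networks (PSNR ≳ 25 dB).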

Link: https://arxiv.org/abs/2601.10334
Authors: Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR ≳ 25 dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc).
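
The agreement figure above is expressed as a PSNR between two outputs. As a minimal sketch of that metric (not the authors' evaluation code), the snippet below computes PSNR between two flattened images. It assumes pixel values in [0, 1]; the function name and sample values are hypothetical.

```python
from math import log10


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical signals
    return 10 * log10(max_value ** 2 / mse)


# Hypothetical flattened pixel values in [0, 1], not data from the paper.
network_output = [0.20, 0.50, 0.80, 0.40]
analytic_output = [0.22, 0.48, 0.83, 0.41]
print(f"{psnr(network_output, analytic_output):.1f} dB")
```

On this scale, 25 dB corresponds to a root-mean-square difference of roughly 5.6% of the peak value.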

[CV-31] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

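[Quick read]: This paper addresses the literal text-to-pixel mapping of current text-to-image (T2I) diffusion models (DMs): they use the text only as an encoder input and do not exploit the reasoning ability of large language models (LLMs) to infer what should be depicted. The key to the solution is the think-then-generate (T2G) paradigm. First, lightweight supervised fine-tuning activates a think-then-rewrite pattern in the LLM text encoder, so that it reasons about and rewrites the raw prompt. Then the LLM encoder and the diffusion backbone are co-optimized with Dual-GRPO: the encoder is reinforced with image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. The method improves factual consistency, semantic alignment, and visual realism, reaching 0.79 on the WISE score, nearly on par with GPT-4.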

Link: https://arxiv.org/abs/2601.10332
Authors: Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng
Affiliations: Shanghai Jiao Tong University; Kuaishou Technology; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers – they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.

[CV-32] SRAW-Attack: Space-Reweighted Adversarial Warping Attack for SAR Target Recognition

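[Quick read]: This paper addresses the limited adversarial robustness of Synthetic Aperture Radar automatic target recognition (SAR-ATR) systems, and in particular the fact that existing attacks often introduce visually perceptible distortions and so fail to balance effectiveness and stealthiness. The key to the solution is Space-Reweighted Adversarial Warping (SRAW), an attack that generates adversarial examples through optimized spatial deformation while reweighting the perturbation budget between foreground and background regions. It significantly degrades the performance of SAR-ATR models while remaining less perceptible and transferring better than existing methods.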

Link: https://arxiv.org/abs/2601.10324
Authors: Yiming Zhang, Weibo Qin, Yuntian Liu, Feng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 4 figures

Abstract:Synthetic aperture radar (SAR) imagery exhibits intrinsic information sparsity due to its unique electromagnetic scattering mechanism. Despite the widespread adoption of deep neural network (DNN)-based SAR automatic target recognition (SAR-ATR) systems, they remain vulnerable to adversarial examples and tend to over-rely on background regions, leading to degraded adversarial robustness. Existing adversarial attacks for SAR-ATR often require visually perceptible distortions to achieve effective performance, thereby necessitating an attack method that balances effectiveness and stealthiness. In this paper, a novel attack method termed Space-Reweighted Adversarial Warping (SRAW) is proposed, which generates adversarial examples through optimized spatial deformation with reweighted budgets across foreground and background regions. Extensive experiments demonstrate that SRAW significantly degrades the performance of state-of-the-art SAR-ATR models and consistently outperforms existing methods in terms of imperceptibility and adversarial transferability. Code is made available at this https URL.

[CV-33] Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

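[Quick read]: This paper addresses the fact that existing adversarial attacks on vision-language pre-training (VLP) models are mostly sample-specific and incur substantial computational overhead when scaled to large datasets or new scenarios. The key to the solution is the Hierarchical Refinement Attack (HRA), which refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, adversarial examples are disentangled into clean images and perturbations that are handled independently to disrupt cross-modal alignment, and a ScMix augmentation strategy diversifies visual contexts and strengthens the global and local utility of UAPs. On the optimization path, a temporal hierarchy of historical and estimated future gradients helps avoid local minima and stabilizes UAP learning. For the text modality, intra-sentence and inter-sentence importance measures are combined to identify globally influential words, which serve as universal text perturbations.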

Link: https://arxiv.org/abs/2601.10313
Authors: Peng-Fei Zhang, Zi Huang
Affiliations: University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 15 pages, 7 figures

Abstract:Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.

[CV-34] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

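[Quick read]: This paper addresses the lag in Chinese Vision-Language Pre-training (VLP) caused by the scarcity of high-quality Chinese image-text data. The key to the solution is DanQing, a large-scale, up-to-date Chinese cross-modal dataset of 100 million image-text pairs collected from Common Crawl and curated through a more rigorous selection process than existing datasets. DanQing is built primarily from 2024-2025 web data, which helps models capture evolving semantic trends. In continual pre-training of SigLIP2, it consistently outperforms existing datasets on Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations.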

Link: https://arxiv.org/abs/2601.10305
Authors: Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu, Shuo Tan, Zelong Sun, Jun Wang, Nan Wu, Xiang An, Weidong Cai, Ziyong Feng, Kaicheng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 11 figures, 7 tables

Abstract:Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.

[CV-35] Attend to what I say: Highlighting relevant content on slides ICDAR

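[Quick read]: This paper addresses the difficulty of keeping speech and slides in sync: in content-heavy presentations such as conference talks, listeners struggle to track which part of the slide the speaker is referring to, which raises cognitive load and causes key details to be missed. The key to the solution is a method that automatically identifies and highlights the most relevant slide regions by analyzing the spoken content and matching it with textual or graphical elements in the slides, improving the alignment between what listeners hear and what they need to look at.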

Link: https://arxiv.org/abs/2601.10244
Authors: Megha Mariam K M, C. V. Jawahar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the International Conference on Document Analysis and Recognition (ICDAR) 2025

Abstract:Imagine sitting in a presentation, trying to follow the speaker while simultaneously scanning the slides for relevant information. While the entire slide is visible, identifying the relevant regions can be challenging. As you focus on one part of the slide, the speaker moves on to a new sentence, leaving you scrambling to catch up visually. This constant back-and-forth creates a disconnect between what is being said and the most important visual elements, making it hard to absorb key details, especially in fast-paced or content-heavy presentations such as conference talks. This requires an understanding of slides, including text, graphics, and layout. We introduce a method that automatically identifies and highlights the most relevant slide regions based on the speaker’s narrative. By analyzing spoken content and matching it with textual or graphical elements in the slides, our approach ensures better synchronization between what listeners hear and what they need to attend to. We explore different ways of solving this problem and assess their success and failure cases. Analyzing multimedia documents is emerging as a key requirement for seamless understanding of content-rich videos, such as educational videos and conference talks, by reducing cognitive strain and improving comprehension. Code and dataset are available at: this https URL

[CV-36] Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge CVPR2025

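[Quick read]: This paper addresses the poor performance of Multimodal Large Language Models (MLLMs) on complex video question answering (VQA) benchmarks such as HD-EPIC VQA, where the main challenges are ambiguous queries and options, weak long-range temporal reasoning, and non-standardized outputs. The key to the solution is a pipeline with four components: query and option pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting strategy for multi-step reasoning, and robust post-processing. The system reaches 41.6% accuracy on HD-EPIC VQA, which highlights the importance of holistic pipeline optimization for demanding video understanding tasks.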

Link: https://arxiv.org/abs/2601.10228
Authors: Sicheng Yang, Yukai Huang, Shitong Sun, Weitong Cai, Jiankang Deng, Jifei Song, Zhensong Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: 4 pages, 1 figure, CVPR 2025 EgoVis Workshop, 2nd Place in HD-EPIC Challenge

Abstract:Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at this https URL.

[CV-37] Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation

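[Quick read]: This paper addresses the difficulty of precisely altering camera trajectories in conditioned video generation while faithfully preserving the video content. Mainstream methods achieve camera control by warping a 3D representation according to the target trajectory, but they fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the "Inpainting Trap", causing subject inconsistency and degraded quality. The key to the solution is DepthDirector, whose View-Content Dual-Stream Condition mechanism injects both the source video and the warped depth sequence rendered under the target viewpoint into a pretrained VDM. This geometric guidance lets the model comprehend camera movement and use its 3D understanding. A lightweight LoRA-based video diffusion adapter is used for training so that the knowledge priors of the VDM are preserved.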

Link: https://arxiv.org/abs/2601.10214
Authors: Dong-Yu Chen, Yixin Guo, Shuojin Yang, Tai-Jiang Mu, Shi-Min Hu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Camera control has been extensively studied in conditioned video generation; however, precisely altering the camera trajectories while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is warping a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be publicly available.

[CV-38] ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation

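[Quick read]: This paper addresses efficient, high-fidelity head avatar synthesis from a monocular video, and in particular the limitations of prior work: methods based on a 3D data prior generalize poorly in the wild, while methods based on a 2D generative prior are computationally heavy and prone to identity hallucination. The key to the solution is the ELITE framework, which (1) uses a feed-forward Mesh2Gaussian Prior Model (MGPM) for fast initialization of a Gaussian avatar; (2) adds a test-time generative adaptation stage supervised by both real and synthetic images to bridge the domain gap; and (3) replaces full diffusion denoising with a rendering-guided single-step diffusion enhancer that restores missing details, making synthesis 60x faster than the 2D generative prior method.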

Link: https://arxiv.org/abs/2601.10200
Authors: Kim Youwang, Lee Hyoseok, Subin Park, Gerard Pons-Moll, Tae-Hyun Oh
Affiliations: POSTECH; KAIST; UNIST; University of Tübingen; Tübingen AI Center; Max Planck Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.

[CV-39] From Physical Degradation Models to Task-Aware All-in-One Image Restoration

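[Quick read]: This paper addresses the increased system complexity and poor real-time applicability that result from adding extra learning modules to all-in-one image restoration models. The key to the solution is to take a physical degradation modeling perspective and predict a task-aware inverse degradation operator, applied in a two-stage architecture. In the first stage, the operator produces an initial restored image together with an uncertainty perception map that highlights regions that are difficult to reconstruct. In the second stage, the restoration is refined under the guidance of this map. Both stages share the same inverse operator prediction network, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks, and the convolution of the inverse operator is accelerated for efficiency. The resulting architecture, OPIR, shows superior all-in-one restoration performance while remaining competitive on task-aligned restoration.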

Link: https://arxiv.org/abs/2601.10192
Authors: Hu Gao, Xiaoning Lei, Xichen Xu, Xingjian Wang, Lizhuang Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:All-in-one image restoration aims to adaptively handle multiple restoration tasks with a single trained model. Although existing methods achieve promising results by introducing prompt information or leveraging large models, the added learning modules increase system complexity and hinder real-time applicability. In this paper, we adopt a physical degradation modeling perspective and predict a task-aware inverse degradation operator for efficient all-in-one image restoration. The framework consists of two stages. In the first stage, the predicted inverse operator produces an initial restored image together with an uncertainty perception map that highlights regions difficult to reconstruct, ensuring restoration reliability. In the second stage, the restoration is further refined under the guidance of this uncertainty map. The same inverse operator prediction network is used in both stages, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks. Moreover, by accelerating the convolution of the inverse operator, the proposed method achieves efficient all-in-one image restoration. The resulting tightly integrated architecture, termed OPIR, is extensively validated through experiments, demonstrating superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration.

[CV-40] RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation

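[Quick Read]: This paper addresses the low object-level recognition accuracy and slow speed of open-vocabulary 3D Scene Graph (3DSG) generation, which stem from aggregation noise caused by constrained viewpoints, occlusions, and redundant surface density. The key to its solution is the RAG-3DSG framework: it first mitigates aggregation noise through re-shot guided uncertainty estimation, then uses reliable low-uncertainty objects to support object-level Retrieval-Augmented Generation (RAG) and improve semantic accuracy, and finally introduces a dynamic downsample-mapping strategy that accelerates cross-image object aggregation with adaptive granularity. Experiments on the Replica dataset show that the method significantly improves 3DSG node captioning accuracy and cuts mapping time by two-thirds.

Link: https://arxiv.org/abs/2601.10168
Authors: Yue Chang, Rufeng Chen, Zhaofan Zhang, Yi Chen, Sihong Xie
Affiliations: HKUST(GZ)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 9 pages, 6 figures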

Abstract:Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works for open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG to mitigate aggregation noise through re-shot guided uncertainty estimation and support object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.

[CV-41] Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method

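[Quick Read]: This paper addresses the lack of explicit reasoning processes, risk awareness, and decision-oriented interpretation in current Multimodal Large Language Models (MLLMs) for Video Anomaly Detection and Understanding (VADU), which keeps them from moving beyond descriptive understanding to structured, multi-stage reasoning. The key to its solution is a new task, Video Anomaly Reasoning (VAR), which requires progressive reasoning over anomalous events across three stages: visual perception, causal interpretation, and risk-aware decision making. The authors also build a large annotated dataset of 8,641 videos and more than 50,000 samples based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors to support multi-stage reasoning evaluation, and introduce Anomaly-Aware Group Relative Policy Optimization to improve reasoning reliability under weak supervision. The resulting end-to-end VAR model, Vad-R1-Plus, supports adaptive hierarchical reasoning and risk-aware decision making and outperforms both open-source and proprietary baselines.

Link: https://arxiv.org/abs/2601.10165
Authors: Chao Huang, Benfeng Wang, Wei Wang, Jie Wen, Li Shen, Wenqi Ren, Yong Xu, Xiaochun Cao
Affiliations: Sun Yat-sen University (中山大学); Harbin Institute of Technology (哈尔滨工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: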

Abstract: Recent progress in reasoning capabilities of Multimodal Large Language Models (MLLMs) has highlighted their potential for performing complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VADU), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset with 8,641 videos, where each video is annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning. In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.

[CV-42] MHub.ai: A Simple Standardized and Reproducible Platform for AI Models in Medical Imaging

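[Quick Read]: This paper addresses the lack of standardization, inconsistent documentation, and poor reproducibility that limit the use of Artificial Intelligence (AI) models in medical imaging. The key to its solution is an open-source, container-based platform that packages AI models from peer-reviewed publications into standardized containers, which support direct processing of DICOM and other formats, provide a unified Application Programming Interface (API), and embed structured metadata; each model comes with publicly available reference data for confirming that it runs correctly. The modular design supports model adaptation and community contributions, and the platform's utility is demonstrated in a clinical use case through a comparative evaluation of lung segmentation models, lowering the barrier to clinical translation and improving research transparency and reproducibility.

Link: https://arxiv.org/abs/2601.10154
Authors: Leonard Nürnberg, Dennis Bontempi, Suraj Pai, Curtis Lisle, Steve Pieper, Ron Kikinis, Sil van de Leemput, Rahul Soni, Gowtham Murugesan, Cosmin Ciausu, Miriam Groeneveld, Felix J. Dorfner, Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan, Joeran S. Bosma, Keno Bressem, Raymond Mak, Andrey Fedorov, Hugo JWL Aerts
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 41 pages, 15 figures, 6 tables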

Abstract: Artificial intelligence (AI) has the potential to transform medical imaging by automating image analysis and accelerating clinical research. However, research and clinical use are limited by the wide variety of AI implementations and architectures, inconsistent documentation, and reproducibility issues. Here, we introduce MHub.ai, an open-source, container-based platform that standardizes access to AI models with minimal configuration, promoting accessibility and reproducibility in medical imaging. MHub.ai packages models from peer-reviewed publications into standardized containers that support direct processing of DICOM and other formats, provide a unified application interface, and embed structured metadata. Each model is accompanied by publicly available reference data that can be used to confirm model operation. MHub.ai includes an initial set of state-of-the-art segmentation, prediction, and feature extraction models for different modalities. The modular framework enables adaptation of any model and supports community contributions. We demonstrate the utility of the platform in a clinical use case through comparative evaluation of lung segmentation models. To further strengthen transparency and reproducibility, we publicly release the generated segmentations and evaluation metrics and provide interactive dashboards that allow readers to inspect individual cases and reproduce or extend our analysis. By simplifying model use, MHub.ai enables side-by-side benchmarking with identical execution commands and standardized outputs, and lowers the barrier to clinical translation.

[CV-43] LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

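[Quick Read]: This paper addresses the "Perception Gap" in multimodal knowledge distillation: when imitating the teacher's textual output, the student model often attends to visual regions that differ markedly from the teacher's, and so relies on language priors rather than visually grounded perception. The key to its solution is the LaViT framework, which transfers knowledge by aligning latent visual thoughts rather than static embeddings. Specifically, LaViT forces the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories before generating text, and introduces a curriculum sensory gating mechanism to prevent shortcut learning, which markedly improves visual grounding and yields gains of up to +16.9% on complex reasoning tasks.

Link: https://arxiv.org/abs/2601.10129
Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung
Affiliations: City University of Hong Kong (香港城市大学); University of Science and Technology of China (中国科学技术大学); Utrecht University (乌得勒支大学); University of Electronic Science and Technology of China (电子科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: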

Abstract:Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher’s textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher’s visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.

[CV-44] VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation NEURIPS2025

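[Quick Read]: This paper addresses the hyperparameter sensitivity of dropout-based feature perturbation in semi-supervised medical image segmentation: the dropout rate needs careful manual tuning and can easily lead to suboptimal regularization. The key to its solution is the VQ-Seg framework, the first to use Vector Quantization (VQ) to discretize the feature space, together with a controllable Quantized Perturbation Module (QPM) that perturbs representations by shuffling the spatial locations of codebook indices. A dual-branch architecture shares the post-quantization feature space to mitigate information loss, and a Post-VQ Feature Adapter (PFA) incorporates high-level semantic information from a Foundation Model (FM) to improve segmentation performance.

Link: https://arxiv.org/abs/2601.10124
Authors: Sicheng Yang, Zhaohu Xing, Lei Zhu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025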

Abstract:Consistency learning with feature perturbation is a widely used strategy in semi-supervised medical image segmentation. However, many existing perturbation methods rely on dropout, and thus require a careful manual tuning of the dropout rate, which is a sensitive hyperparameter and often difficult to optimize and may lead to suboptimal regularization. To overcome this limitation, we propose VQ-Seg, the first approach to employ vector quantization (VQ) to discretize the feature space and introduce a novel and controllable Quantized Perturbation Module (QPM) that replaces dropout. Our QPM perturbs discrete representations by shuffling the spatial locations of codebook indices, enabling effective and controllable regularization. To mitigate potential information loss caused by quantization, we design a dual-branch architecture where the post-quantization feature space is shared by both image reconstruction and segmentation tasks. Moreover, we introduce a Post-VQ Feature Adapter (PFA) to incorporate guidance from a foundation model (FM), supplementing the high-level semantic information lost during quantization. Furthermore, we collect a large-scale Lung Cancer (LC) dataset comprising 828 CT scans annotated for central-type lung carcinoma. Extensive experiments on the LC dataset and other public benchmarks demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches. Code available at: this https URL.
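
The abstract describes QPM as perturbing discrete representations by shuffling the spatial locations of codebook indices. A minimal sketch of that idea in plain Python might look like the following. The function name `shuffle_codebook_indices`, the `ratio` parameter used as the control knob, and the list-of-lists index grid are illustrative assumptions, not the authors' implementation.

```python
import random


def shuffle_codebook_indices(index_grid, ratio, seed=0):
    """Permute the codebook indices at a random subset of grid positions.

    index_grid: 2D list of integer codebook indices (hypothetical layout).
    ratio: fraction of positions taking part in the shuffle (assumed knob).
    Returns a new grid; the input grid is left unchanged.
    """
    rng = random.Random(seed)
    height, width = len(index_grid), len(index_grid[0])
    positions = [(r, c) for r in range(height) for c in range(width)]
    num_chosen = round(ratio * len(positions))
    chosen = rng.sample(positions, num_chosen)

    # Collect the indices at the chosen positions and shuffle them,
    # so values move between locations but no new index is introduced.
    values = [index_grid[r][c] for r, c in chosen]
    rng.shuffle(values)

    perturbed = [row[:] for row in index_grid]
    for (r, c), value in zip(chosen, values):
        perturbed[r][c] = value
    return perturbed


grid = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
perturbed = shuffle_codebook_indices(grid, ratio=0.5, seed=0)
```

Because the shuffle only relocates existing indices, the multiset of codebook entries is preserved, and `ratio` gives a single interpretable strength setting in this sketch.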

[CV-45] Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL

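[Quick Read]: This paper addresses two key issues in Vision In-Context Learning (VICL): selecting only the most similar prompt ignores complementary information in other high-quality prompts, and existing methods fail to exploit the structured information implied by different prompt arrangements. The key to its solution is an end-to-end VICL framework with three core components. First, an Adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts into more precise contextual prompts. Second, arrangement-specific lightweight MLPs decouple layout priors from the core model while minimally affecting the overall model. Finally, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from the fused context and thereby strengthening the joint optimization of the fusion module and the inpainting model.

Link: https://arxiv.org/abs/2601.10117
Authors: Wenwen Liao, Jianbo Yu, Yuansong Wang, Shifu Yan, Xiaofeng Yang
Affiliations: Fudan University (复旦大学); Tsinghua University (清华大学); ByteDance Ltd. (字节跳动)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: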

Abstract: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) failing to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. Firstly, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Secondly, we introduce arrangement-specific lightweight MLPs to decouple layout priors from the core model, while minimally affecting the overall model. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.

[CV-46] Enhancing Visual In-Context Learning by Multi-Faceted Fusion

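[Quick Read]: This paper addresses a limitation of existing "retrieve-then-prompt" methods in Visual In-Context Learning (VICL): they use only the single best visual prompt and ignore the rich contextual information carried by other useful candidates, which limits the model's reasoning capability. The key to its solution is a multi-combination collaborative fusion mechanism: rather than collapsing multiple prompts into a single representation, it generates three contextual representation branches, each built from a different combination of high-quality prompts, and uses the proposed MULTI-VQGAN architecture to jointly interpret and exploit the collaborative information from multiple sources, improving generalization and prediction accuracy across a range of visual tasks.

Link: https://arxiv.org/abs/2601.10107
Authors: Wenwen Liao, Jianbo Yu, Yuansong Wang, Qingchao Jiang, Xiaofeng Yang
Affiliations: Fudan University (复旦大学); Tsinghua University (清华大学); East China University of Science and Technology (华东理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: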

Abstract: Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant “retrieve-then-prompt” approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still simply collapses multiple rich signals into one, limiting the model’s reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.

[CV-47] MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers

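[Quick Read]: This paper addresses the automated extraction of structured questions from paper-based mathematics exams, especially in real-world settings with severe visual noise, where existing benchmarks fall short because they overlook both the structural integrity of math problems and a model's ability to actively refuse incomplete inputs. The key to its solution is MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. It contains 3,609 carefully curated questions and explicitly includes unrecognizable samples to evaluate active refusal behavior. It also provides a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability, which reveals a reliability flaw in current state-of-the-art Multimodal Large Language Models (MLLMs): they still produce confident but invalid outputs when given degraded documents.

Link: https://arxiv.org/abs/2601.10104
Authors: Chenyue Zhou, Jiayi Tuo, Shitong Qin, Wei Dai, Mingxuan Wang, Ziwei Zhao, Duoyang Li, Shiyang Su, Yanxi Lu, Yanbiao Ma
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Gaotu Techedu Inc.; Beijing Key Laboratory of Research on Large Models and Intelligent Governance (北京市大模型与智能治理研究重点实验室); Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE (教育部下一代智能搜索与推荐工程研究中心); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); University of Science and Technology of China (中国科学技术大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: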

Abstract: The automated extraction of structured questions from paper-based mathematics exams is fundamental to intelligent education, yet remains challenging in real-world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. MathDoc contains 3,609 carefully curated questions with real-world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on SOTA MLLMs, including Qwen3-VL and Gemini-2.5-Pro, show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at this https URL.

[CV-48] FlowAct-R1: Towards Interactive Humanoid Video Generation

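[Quick Read]: This paper addresses the trade-off between high-fidelity synthesis and real-time interaction in interactive humanoid video generation. Existing methods often suffer from error accumulation and temporal inconsistency over long videos, which makes low-latency, high-quality continuous interaction difficult. The key to its solution is the FlowAct-R1 framework, which builds on an MMDiT architecture to stream video of arbitrary duration and introduces a chunkwise diffusion forcing strategy plus a novel self-forcing variant to curb error propagation and preserve long-term temporal consistency. With efficient distillation and system-level optimizations, it reaches a stable 25 fps at 480p with a time-to-first-frame (TTFF) of only about 1.5 seconds, supporting fine-grained full-body control and natural transitions between behavioral states and improving behavioral vividness and perceptual realism in interactive scenarios.

Link: https://arxiv.org/abs/2601.10103
Authors: Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang, Tianshu Hu, Shiyang Qin, Mingshuang Luo, Jiaxu Zhang, Xin Chen, Yulong Wang, Zerong Zheng, Jianwen Jiang, Chao Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao
Affiliations: ByteDance Intelligent Creation (字节跳动智能创作)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: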

Abstract:Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon a MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.

[CV-49] InfoSculpt: Sculpting the Latent Space for Generalized Category Discovery

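[Quick Read]: This paper addresses a limitation of existing Generalized Category Discovery (GCD) methods, which rely on pseudo-labeling or two-stage clustering and lack a principled mechanism for explicitly separating category-defining signals from instance-specific noise. The key to its solution is to re-examine GCD from an information-theoretic perspective and, based on the Information Bottleneck (IB) principle, propose the InfoSculpt framework, which systematically "sculpts" the representation space by minimizing a dual Conditional Mutual Information (CMI) objective. A category-level CMI on labeled data learns compact and discriminative representations for known classes, while an instance-level CMI on all data compresses augmentation-induced noise and distills invariant features. The two objectives work together at different scales to build a disentangled and robust latent space that preserves categorical information and discards noisy details.

Link: https://arxiv.org/abs/2601.10098
Authors: Wenwen Liao, Hang Ruan, Jianbo Yu, Yuansong Wang, Qingchao Jiang, Xiaofeng Yang
Affiliations: Fudan University (复旦大学); Tsinghua University (清华大学); East China University of Science and Technology (华东理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: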

Abstract: Generalized Category Discovery (GCD) aims to classify instances from both known and novel categories within a large-scale unlabeled dataset, a critical yet challenging task for real-world, open-world applications. However, existing methods often rely on pseudo-labeling, or two-stage clustering, which lack a principled mechanism to explicitly disentangle essential, category-defining signals from instance-specific noise. In this paper, we address this fundamental limitation by re-framing GCD from an information-theoretic perspective, grounded in the Information Bottleneck (IB) principle. We introduce InfoSculpt, a novel framework that systematically sculpts the representation space by minimizing a dual Conditional Mutual Information (CMI) objective. InfoSculpt uniquely combines a Category-Level CMI on labeled data to learn compact and discriminative representations for known classes, and a complementary Instance-Level CMI on all data to distill invariant features by compressing augmentation-induced noise. These two objectives work synergistically at different scales to produce a disentangled and robust latent space where categorical information is preserved while noisy, instance-specific details are discarded. Extensive experiments on 8 benchmarks validate the effectiveness of InfoSculpt and of our information-theoretic approach.

[CV-50] V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

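[Quick Read]: This paper addresses the heavy reliance of current Vision-Language Models (VLMs) on large-scale human-annotated data for improving reasoning, data that is costly and slow to acquire. The key to its solution is V-Zero, a general post-training framework built around a co-evolutionary loop over unlabeled images. It instantiates two roles, a Questioner and a Solver: the former uses a dual-track reasoning reward to generate high-quality, challenging questions, and the latter is optimized with pseudo-labels obtained by majority voting over its own sampled responses. Both are trained iteratively with Group Relative Policy Optimization (GRPO) so that they reinforce each other. Experiments show that, without any human annotation, V-Zero consistently improves Qwen2.5-VL-7B-Instruct on visual mathematical reasoning (+1.7) and general vision-centric tasks (+2.6).

Link: https://arxiv.org/abs/2601.10094
Authors: Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen
Affiliations: Zhejiang University (浙江大学); State Key Laboratory of CAD&CG, Zhejiang University (浙江大学CAD&CG国家重点实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: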

Abstract:Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at this https URL
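
The Solver's pseudo-labels are described as coming from majority voting over its own sampled responses. A minimal sketch of that voting step is shown below, assuming each sampled response has already been reduced to a final answer string. The function name `majority_vote_pseudo_label` and the returned agreement ratio are illustrative assumptions; the abstract does not say how ties or low-agreement cases are handled.

```python
from collections import Counter


def majority_vote_pseudo_label(sampled_answers):
    """Return (pseudo_label, agreement) from a list of sampled answers.

    agreement is the fraction of samples that voted for the winning answer.
    Ties are broken by first occurrence, which is an assumption of this sketch.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(sampled_answers)


# Five hypothetical sampled answers to the same generated question.
label, agreement = majority_vote_pseudo_label(["12", "12", "15", "12", "9"])
```

The agreement ratio is included only because it is a natural by-product of voting; whether V-Zero uses such a signal is not stated in the abstract.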

[CV-51] Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks

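[Quick Read]: This paper addresses the "target gap" between the distillation objective and the downstream task in dataset distillation: existing methods mostly optimize features extracted from the original dataset and overlook task-specific information, which degrades the distilled data's performance on real downstream tasks such as image classification. The key to its solution is Difficulty-Guided Sampling (DGS), a plug-in post-stage module that follows a target difficulty distribution to sample the final compact dataset from image pools generated by existing distillation methods. The authors also propose Difficulty-Aware Guidance (DAG) to explore the effect of difficulty in the generation process. Experiments across multiple settings show that the approach narrows the target gap and improves downstream performance.

Link: https://arxiv.org/abs/2601.10090
Authors: Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama
Affiliations: Hokkaido University (北海道大学); University of Toronto (多伦多大学); The University of Tokyo (东京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: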

Abstract:In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, therefore improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but have time and storage-consuming training processes. Dataset distillation is proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose leveraging characteristics that benefit the downstream training into data distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following the specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods. It also highlights the broader potential of difficulty for diverse downstream tasks.
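
DGS is described as sampling the final distilled set from an image pool so that it follows a target difficulty distribution. A minimal sketch of that sampling step is given below, under several assumptions that are not from the paper: difficulty is a score in [0, 1), the target distribution is given as fractions over equal-width difficulty bins, and the name `difficulty_guided_sample` is hypothetical.

```python
import random


def difficulty_guided_sample(pool, target_fractions, num_samples, seed=0):
    """Sample items so that difficulty bins follow target_fractions.

    pool: list of (item, difficulty) pairs with difficulty in [0, 1).
    target_fractions: desired share of samples per equal-width difficulty bin.
    A bin that holds fewer items than its quota contributes all it has.
    """
    rng = random.Random(seed)
    num_bins = len(target_fractions)
    bins = [[] for _ in range(num_bins)]
    for item, difficulty in pool:
        index = min(int(difficulty * num_bins), num_bins - 1)
        bins[index].append(item)

    selected = []
    for bucket, fraction in zip(bins, target_fractions):
        quota = min(len(bucket), round(fraction * num_samples))
        selected.extend(rng.sample(bucket, quota))
    return selected


# Hypothetical pool: 30 items whose difficulty grows with the item id.
pool = [(i, (i + 0.5) / 30) for i in range(30)]
subset = difficulty_guided_sample(pool, [0.2, 0.5, 0.3], num_samples=10)
```

In this toy setup the three bins hold items 0 to 9, 10 to 19, and 20 to 29, so the subset contains 2 easy, 5 medium, and 3 hard items. How the paper actually scores difficulty and chooses the target distribution is not specified in the abstract.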

[CV-52] Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting

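[Quick Read]: This paper addresses a fundamental flaw of existing 3D style transfer methods when reproducing Post-Impressionist art: they treat geometry as a rigid carrier for surface texture and neglect the Post-Impressionist idea of using geometric abstraction to amplify essential form and suppress detail. The key to its solution is a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that needs no explicit mesh prior: it extracts directional brushstroke flow fields from 2D paintings and back-propagates them into 3D space, driving Gaussian primitives to align along the flow and form brushstrokes that conform to scene topology. This yields structural deformation driven directly by painterly motion rather than photometric constraints. The method is combined with a luminance-structure decoupling strategy and a VLM-as-a-Judge evaluation framework, which uses a vision-language model (VLM) to assess artistic authenticity.

Link: https://arxiv.org/abs/2601.10075
Authors: Zhendong Wang, Lebin Zhou, Jingchuan Xiao, Rongduo Han, Nam Ling, Cihan Ruan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 7 pages, 8 figures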

Abstract: In 1888, Vincent van Gogh wrote, “I am seeking exaggeration in the essential.” This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints. Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.

[CV-53] ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology WACV2026

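[Quick Read]: This paper addresses the efficiency and interpretability of evidence selection in Multiple Instance Learning (MIL) models for Whole-Slide Images (WSI) in histopathology: how to automatically select, from a huge number of tiles, a small, spatially compact, and discriminative set of evidence regions without sacrificing classification performance. The key to its solution is ReaMIL, whose core idea is a lightweight selection head that outputs soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss forces the model to reach a preset confidence threshold τ using only the selected tiles, under a sparsity budget on the number of selected tiles. This produces small, spatially clustered evidence sets that preserve baseline AUC while improving evidence efficiency, and naturally yields slide-level overlays that help pathologists verify and understand the model's decisions.

Link: https://arxiv.org/abs/2601.10073
Authors: Hyun Do Jung, Jungwon Choi, Hwiyoung Kim
Affiliations: Yonsei University (延世大学); KAIST (韩国科学技术院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at LFMBio Workshop, WACV 2026. This work has been submitted to the IEEE for possible publication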

Abstract: We introduce ReaMIL (Reasoning- and Evidence-Aware MIL), a multiple instance learning approach for whole-slide histopathology that adds a light selection head to a strong MIL backbone. The head produces soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss that enforces the true-class probability to be $\geq \tau$ using only the kept evidence, under a sparsity budget on the number of selected tiles. The budgeted-sufficiency objective yields small, spatially compact evidence sets without sacrificing baseline performance. Across TCGA-NSCLC (LUAD vs. LUSC), TCGA-BRCA (IDC vs. Others), and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On NSCLC, it attains AUC 0.983 with a mean minimal sufficient K (MSK) $\approx 8.2$ tiles at $\tau = 0.90$ and AUKC $\approx 0.864$, showing that class confidence rises sharply and stabilizes once a small set of tiles is kept. The method requires no extra supervision, integrates seamlessly with standard MIL training, and naturally yields slide-level overlays. We report accuracy alongside MSK, AUKC, and contiguity for rigorous evaluation of model behavior on WSIs.
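
The budgeted-sufficiency objective combines a hinge on the true-class probability with a sparsity budget on the selected tiles. One possible reading, sketched below, penalizes any shortfall below τ and any excess of the summed soft gates over the budget. The exact form of the sparsity term, the `weight` trade-off, the default `budget=8`, and the function name are assumptions made for illustration; only the hinge at τ and τ = 0.90 come from the abstract.

```python
def budgeted_sufficiency_loss(true_class_prob, gates, tau=0.90, budget=8, weight=1.0):
    """Illustrative loss: confidence hinge plus a penalty for exceeding the budget.

    true_class_prob: probability of the true class computed from kept tiles only.
    gates: soft per-tile gate values in [0, 1]; their sum approximates tile count.
    budget and weight are hypothetical hyperparameters of this sketch.
    """
    # Zero once the kept evidence alone reaches confidence tau.
    sufficiency_term = max(0.0, tau - true_class_prob)
    # Zero while the (soft) number of selected tiles stays within the budget.
    sparsity_term = max(0.0, sum(gates) - budget)
    return sufficiency_term + weight * sparsity_term


# Confident prediction from 6 tiles: both terms vanish.
loss_ok = budgeted_sufficiency_loss(0.95, [1.0] * 6 + [0.0] * 4)
# Under-confident prediction that also uses 10 tiles: both terms are active.
loss_bad = budgeted_sufficiency_loss(0.80, [1.0] * 10)
```

In this sketch the second case incurs roughly 0.10 from the confidence shortfall plus 2 from the two tiles over budget, which shows how the two pressures trade off.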

[CV-54] Comparative Evaluation of Deep Learning-Based and WHO-Informed Approaches for Sperm Morphology Assessment

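[Quick Read]: This paper addresses the subjectivity, inter-observer variability, and resource constraints that affect sperm morphology assessment in male fertility evaluation. Traditional assessment relies on World Health Organization (WHO) criteria, which are clinically grounded but lack objectivity and consistency. The study evaluates an image-based deep learning model, HuSHeM, trained on high-resolution sperm morphology images and evaluated on an independent clinical cohort. Compared with a baseline that combines WHO criteria with the Systemic Inflammation Response Index (SIRI), it shows better discrimination, calibration, and clinical utility, and performs better in precision-recall analysis under class imbalance. The results suggest that image-based deep learning can improve the predictive reliability and clinical value of sperm morphology analysis and provide an objective, reproducible decision-support tool for fertility screening workflows.

Link: https://arxiv.org/abs/2601.10070
Authors: Mohammad Abbadi
Affiliations: University of Dubai (迪拜大学)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: Under review at Computers in Biology and Medicine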

Abstract:Assessment of sperm morphological quality remains a critical yet subjective component of male fertility evaluation, often limited by inter-observer variability and resource constraints. This study presents a comparative biomedical artificial intelligence framework evaluating an image-based deep learning model (HuSHeM) alongside a clinically grounded baseline derived from World Health Organization criteria augmented with the Systemic Inflammation Response Index (WHO(+SIRI)). The HuSHeM model was trained on high-resolution sperm morphology images and evaluated using an independent clinical cohort. Model performance was assessed using discrimination, calibration, and clinical utility analyses. The HuSHeM model demonstrated higher discriminative performance, as reflected by an increased area under the receiver operating characteristic curve with relatively narrow confidence intervals compared to WHO(+SIRI). Precision-recall analysis further indicated improved performance under class imbalance, with higher precision-recall area values across evaluated thresholds. Calibration analysis indicated closer agreement between predicted probabilities and observed outcomes for HuSHeM, while decision curve analysis suggested greater net clinical benefit across clinically relevant threshold probabilities. These findings suggest that image-based deep learning may offer improved predictive reliability and clinical utility compared with traditional rule-based and inflammation-augmented criteria. The proposed framework supports objective and reproducible assessment of sperm morphology and may serve as a decision-support tool within fertility screening and referral workflows. The proposed models are intended as decision-support or referral tools and are not designed to replace clinical judgment or laboratory assessment. 
Cite as: arXiv:2601.10070 [cs.LG] (or arXiv:2601.10070v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2601.10070

[CV-55] CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

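[Quick read]: This paper addresses the lack of an explicit visual reasoning mechanism in current text-to-image (T2I) generation models: the generation process has no interpretable intermediate states and no clearly defined starting point for visual reasoning, which limits how well a model can understand and control the transition from semantics to aesthetics. The key to the solution is CoF-T2I, which uses Chain-of-Frame (CoF) reasoning for progressive visual refinement. T2I generation is modeled as a sequence of intermediate frames in a clear logical order, where each frame is an interpretable visual reasoning step and the final frame is the output. The authors also build the CoF-Evol-Instruct dataset to capture the trajectory from semantics to aesthetics, and encode each frame independently to improve image quality and reduce motion artifacts.

Link: https://arxiv.org/abs/2601.10061
Authors: Chengzhuo Tong,Mingkun Chang,Shenglong Zhang,Yuran Wang,Cheng Liang,Zhizheng Zhao,Ruichuan An,Bohan Zeng,Yang Shi,Yifan Dai,Ziming Zhao,Guanbin Li,Pengfei Wan,Yuanxing Zhang,Wentao Zhang
Affiliations: Peking University; Kling Team, Kuaishou Technology; Sun Yat-sen University; Zhejiang University; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures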

Abstract:Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.

[CV-56] UEOF: A Benchmark Dataset for Underwater Event-Based Optical Flow WACV

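[Quick read]: This paper addresses two problems in underwater imaging. First, wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination make it hard for standard cameras to capture true motion. Second, research on event cameras in underwater environments has been held back by the lack of datasets that pair realistic underwater optics with accurate optical flow. The key to the solution is the first physically based synthetic underwater event dataset. It is generated from ray-traced RGBD sequences, and a modern video-to-event pipeline produces realistic event streams with dense ground-truth optical flow, depth, and camera motion, providing a benchmark for developing and evaluating underwater event-based perception algorithms.

Link: https://arxiv.org/abs/2601.10054
Authors: Nick Truong,Pritam P. Karmokar,William J. Beksi
Affiliations: The University of Texas at Arlington
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: To be presented at the 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshop on Event-Based Vision in the Era of Generative AI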

Abstract:Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at this https URL.
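
To make the video-to-event step concrete, here is a minimal sketch of the usual event-camera model: a pixel emits an event when its log intensity changes by at least a contrast threshold. This is a simplification that compares only two frames; it is not the pipeline used by the authors, and `events_between_frames`, the threshold, and the toy frames are all assumptions.

```python
import math


def events_between_frames(prev_frame, next_frame, contrast_threshold=0.2, eps=1e-3):
    """Return (row, col, polarity) for pixels whose log intensity changed enough.

    Frames are nested lists of non-negative intensities; eps avoids log(0).
    """
    events = []
    for r, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for c, (i0, i1) in enumerate(zip(prev_row, next_row)):
            delta = math.log(i1 + eps) - math.log(i0 + eps)
            if abs(delta) >= contrast_threshold:
                events.append((r, c, 1 if delta > 0 else -1))
    return events


# Toy 2x2 frames: one pixel brightens, one darkens, two stay unchanged.
frame_a = [[0.5, 0.5], [0.5, 0.5]]
frame_b = [[0.5, 0.8], [0.3, 0.5]]
events = events_between_frames(frame_a, frame_b)
```

Real simulators also interpolate between frames to assign timestamps and add sensor noise; this sketch only shows the thresholding rule.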

[CV-57] Disentangled Concept Representation for Text-to-image Person Re-identification

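[Quick read]: This paper addresses the challenges of text-to-image person re-identification (TIReID): the substantial modality gap between visual appearance and textual description, and the need to model fine-grained correspondences that distinguish similar attributes such as clothing color, texture, or style. The key to the solution is the DiCo (Disentangled Concept Representation) framework, which introduces a shared slot-based representation in which each slot acts as a cross-modal part-level anchor and is further decomposed into multiple concept blocks. This yields hierarchical, disentangled cross-modal alignment that separates complementary attributes (e.g., color, texture, shape) while keeping part-level correspondence between image and text consistent, improving both retrieval accuracy and interpretability.

Link: https://arxiv.org/abs/2601.10053
Authors: Giyeol Kim,Chanho Eom
Affiliations: Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: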

Abstract:Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.

[CV-58] VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models

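[Quick read]: This paper addresses hallucination in event relation reasoning by Video Large Language Models (VideoLLMs), specifically hallucination of causal, temporal, and subevent relations, whereas existing work mostly focuses on hallucinating the presence of events, objects, or scenes. To evaluate this systematically, the authors propose a new benchmark, VERHallu, covering three tasks: relation classification, question answering, and counterfactual question answering. It includes counterintuitive video scenarios that deviate from typical pretraining distributions, with human-annotated candidate answers that separate vision-language bias from pure language bias. The analysis shows that current state-of-the-art VideoLLMs perform poorly on dense event relation reasoning: they often rely on prior knowledge rather than frame-level cues and overlook subevents, which leads to incomplete or incorrect understanding of relations. The key to the solution is a Key-Frame Propagating (KFP) strategy that reallocates frame-level attention in intermediate layers to improve understanding of relations among multiple events. Experiments show that it mitigates event relation hallucination without adding inference latency.

Link: https://arxiv.org/abs/2601.10010
Authors: Zefan Zhang,Kehua Zhu,Shijie Jiang,Hongyuan Lu,Shengkai Sun,Tian Bai
Affiliations: Jilin University; Facemind Group; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures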

Abstract:Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating the Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates the event relation hallucination without affecting inference speed.

[CV-59] DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis AAAI-2026

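[Quick read]: This paper addresses three challenges in the early diagnosis of neurodegenerative diseases (NDs): the high dimensionality and structural diversity of multi-metric data, the heterogeneity of neuroimaging and phenotypic data, and class imbalance. The key to the solution is a Dynamically Weighted Dual Graph Attention Network (DW-DGAT) with three core components: (1) a general-purpose data fusion strategy that merges three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships, which extracts both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions, which mitigates the effect of class imbalance on model performance. Experiments on the PPMI and ADNI datasets show state-of-the-art performance.

Link: https://arxiv.org/abs/2601.10001
Authors: Chengjia Liang,Zhenjiong Wang,Chao Chen,Ruizhi Zhang,Songxi Liang,Hai Xie,Haijun Lei,Zhongwei Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI-2026 accepted poster paper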

Abstract:Parkinson’s disease (PD) and Alzheimer’s disease (AD) are the two most prevalent and incurable neurodegenerative diseases (NDs) worldwide, for which early diagnosis is critical to delay their progression. However, the high dimensionality of multi-metric data with diverse structural forms, the heterogeneity of neuroimaging and phenotypic data, and class imbalance collectively pose significant challenges to early ND diagnosis. To address these challenges, we propose a dynamically weighted dual graph attention network (DW-DGAT) that integrates: (1) a general-purpose data fusion strategy to merge three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships to extract both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions to mitigate class imbalance. Rigorous experiments, based on the Parkinson’s Progression Markers Initiative (PPMI) and Alzheimer’s Disease Neuroimaging Initiative (ADNI) studies, demonstrate the state-of-the-art performance of our approach.
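
The abstract does not say how DW-DGAT generates its class weights. As a point of reference only, a common baseline for class imbalance is inverse-frequency weighting. The sketch below shows that baseline, not the authors' mechanism; `inverse_frequency_weights` and the toy labels are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes get larger weights.

    N is the number of samples and K the number of distinct classes.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Toy imbalanced label set (illustrative only, not PPMI or ADNI data).
labels = ["PD"] * 8 + ["control"] * 2
weights = inverse_frequency_weights(labels)
```

With these weights each class contributes equally to a weighted loss in aggregate; a dynamic scheme would presumably update the weights during training rather than fix them up front.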

[CV-60] EditEmoTalk: Controllable Speech-Driven 3D Facial Animation with Continuous Expression Editing

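[Quick read]: This paper addresses the discrete nature of emotion control in current speech-driven 3D facial animation methods: they usually rely on predefined discrete emotion categories, which makes continuous, fine-grained emotion adjustment difficult. The key to the solution is a boundary-aware semantic embedding that learns the normal directions of the decision boundaries between emotion categories, yielding a continuous expression manifold that supports smooth emotion editing. An emotional consistency loss, applied through a mapping network, keeps the generated facial motion dynamics semantically aligned with the target emotion embedding, so that emotional expression stays faithful and coherent.

Link: https://arxiv.org/abs/2601.10000
Authors: Diqiong Jiang,Kai Zhu,Dan Song,Jian Chang,Chenglizhao Chen,Zhenyu Wu
Affiliations: China University of Petroleum; Tianjin University; Bournemouth University; Southwest Jiaotong University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: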

Abstract:Speech-driven 3D facial animation aims to generate realistic and expressive facial motions directly from audio. While recent methods achieve high-quality lip synchronization, they often rely on discrete emotion categories, limiting continuous and fine-grained emotional control. We present EditEmoTalk, a controllable speech-driven 3D facial animation framework with continuous emotion editing. The key idea is a boundary-aware semantic embedding that learns the normal directions of inter-emotion decision boundaries, enabling a continuous expression manifold for smooth emotion manipulation. Moreover, we introduce an emotional consistency loss that enforces semantic alignment between the generated motion dynamics and the target emotion embedding through a mapping network, ensuring faithful emotional expression. Extensive experiments demonstrate that EditEmoTalk achieves superior controllability, expressiveness, and generalization while maintaining accurate lip synchronization. Code and pretrained models will be released.
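
The idea of editing along the normal direction of a decision boundary can be illustrated with a linear boundary `w . x + b = 0`: moving an embedding by `alpha` along the unit normal shifts its signed distance to the boundary by exactly `alpha`. This is a sketch under that linear assumption; the learned boundaries in the paper need not be linear, and all names and values below are hypothetical.

```python
import math


def unit_normal(w):
    """Unit vector orthogonal to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


def signed_distance(embedding, w, b):
    """Signed distance from the embedding to the hyperplane."""
    norm = math.sqrt(sum(x * x for x in w))
    return (sum(e * wi for e, wi in zip(embedding, w)) + b) / norm


def edit_embedding(embedding, w, alpha):
    """Move the embedding by alpha along the boundary's unit normal."""
    n = unit_normal(w)
    return [e + alpha * ni for e, ni in zip(embedding, n)]


# Toy 2-D boundary between two emotion classes (illustrative values).
w, b = [3.0, 4.0], -5.0
neutral = [0.0, 0.0]
before = signed_distance(neutral, w, b)
edited = edit_embedding(neutral, w, alpha=1.5)
after = signed_distance(edited, w, b)
```

Because `alpha` is a real number, the edit strength varies continuously, which is the property that discrete emotion labels lack.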

[CV-61] DR2Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models

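[Quick read]: This paper addresses "overthinking" in reasoning segmentation with multimodal large language models (MLLMs): when handling complex text queries, existing methods produce long, irrelevant reasoning chains that interfere with precise localization of the target object. The key to the solution is the DR²Seg framework, whose core idea is a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage the model generates a self-contained description of the target; in the second stage this description replaces the original query to verify that it is self-contained. Two self-rewards then strengthen goal-oriented reasoning and suppress redundant thinking, improving reasoning efficiency and segmentation accuracy without extra thinking supervision.

Link: https://arxiv.org/abs/2601.09981
Authors: Yulin He,Wei Chen,Zhikang Jian,Tianhang Guo,Wenjuan Zhou,Minglong Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: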

Abstract:Reasoning segmentation is an emerging vision-language task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR²Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR²Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to strengthen goal-oriented reasoning and suppress redundant thinking. Extensive experiments across MLLMs of varying scales and segmentation models demonstrate that DR²Seg consistently improves reasoning efficiency and overall segmentation performance.

[CV-62] The Spatial Blindspot of Vision-Language Models

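[Quick read]: This paper addresses the weakness of current Vision-Language Models (VLMs) at capturing spatial relationships, which limits their use in tasks that need spatial grounding, such as robotics and embodied AI. The paper notes that existing VLMs commonly use CLIP-style image encoders that flatten an image into a 1D patch sequence, discarding the 2D structure that spatial reasoning depends on. The solution has two parts: image encoders trained with alternative objectives to strengthen spatial awareness, and 2D positional encodings that preserve the geometric structure of the image. Experiments show that these architectural changes can improve performance on several spatial reasoning benchmarks.

Link: https://arxiv.org/abs/2601.09954
Authors: Nahid Alam,Leema Krishna Murali,Siddhant Bharadwaj,Patrick Liu,Timothy Chung,Drishti Sharma,Akshata A,Kranthi Kiran,Wesley Tam,Bala Krishna S Vegesna
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: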

Abstract:Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
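
As an illustration of what a 2D positional encoding preserves, here is a minimal sketch of a fixed sinusoidal variant in which half of each vector encodes the patch row and the other half the patch column. The abstract does not say which 2D encoding the authors use, so this is one common choice shown under that assumption; `sincos_1d` and `sincos_2d` are hypothetical names.

```python
import math


def sincos_1d(position, dim):
    """Standard sinusoidal encoding of a scalar position into `dim` values."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.extend([math.sin(position * freq), math.cos(position * freq)])
    return out


def sincos_2d(grid_h, grid_w, dim):
    """One vector per patch: first half encodes the row, second half the column."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [
        [sincos_1d(r, half) + sincos_1d(c, half) for c in range(grid_w)]
        for r in range(grid_h)
    ]


# A 3x4 patch grid with 8-dimensional encodings.
pe = sincos_2d(grid_h=3, grid_w=4, dim=8)
```

In a flattened 1D sequence, patches (0, 3) and (1, 0) sit next to each other even though they are at opposite edges of the image. Here, patches in the same row share the first half of their encoding and patches in the same column share the second half, so the grid structure stays recoverable.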

[CV-63] OT-Drive: Out-of-Distribution Off-Road Traversable Area Segmentation via Optimal Transport

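[Quick read]: This paper addresses traversable area segmentation for autonomous driving in unstructured environments, and in particular the sharp performance drop of existing data-driven methods in out-of-distribution (OOD) scenarios. The key to the solution is OT-Drive, a multi-modal fusion framework based on Optimal Transport (OT), with two core contributions: (1) a Scene Anchor Generator (SAG) that decomposes scene information into the joint distribution of weather, time of day, and road type, building semantic anchors that generalize to unseen scenarios; and (2) an OT-based multi-modal fusion module (OT Fusion) that maps RGB and surface normal features onto the manifold defined by the semantic anchors, enabling robust segmentation in OOD scenarios. The method reaches 95.16% mIoU on the ORFD OOD test set, 6.35% above prior methods, indicating strong generalization from limited training data.

Link: https://arxiv.org/abs/2601.09952
Authors: Zhihua Zhao,Guoqiang Li,Chen Min,Kangping Lu
Affiliations: Beijing Institute of Technology; Chinese Academy of Sciences; Shandong Pengxiang Automobile Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 8 figures, 6 tables. This work has been submitted to the IEEE for possible publication. Code will be released upon acceptance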

Abstract:Reliable traversable area segmentation in unstructured environments is critical for planning and decision-making in autonomous driving. However, existing data-driven approaches often suffer from degraded segmentation performance in out-of-distribution (OOD) scenarios, consequently impairing downstream driving tasks. To address this issue, we propose OT-Drive, an Optimal Transport–driven multi-modal fusion framework. The proposed method formulates RGB and surface normal fusion as a distribution transport problem. Specifically, we design a novel Scene Anchor Generator (SAG) to decompose scene information into the joint distribution of weather, time-of-day, and road type, thereby constructing semantic anchors that can generalize to unseen scenarios. Subsequently, we design an innovative Optimal Transport-based multi-modal fusion module (OT Fusion) to transport RGB and surface normal features onto the manifold defined by the semantic anchors, enabling robust traversable area segmentation under OOD scenarios. Experimental results demonstrate that our method achieves 95.16% mIoU on ORFD OOD scenarios, outperforming prior methods by 6.35%, and 89.79% mIoU on cross-dataset transfer tasks, surpassing baselines by 13.99%. These results indicate that the proposed model can attain strong OOD generalization with only limited training data, substantially enhancing its practicality and efficiency for real-world deployment.
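
For readers unfamiliar with optimal transport, the sketch below shows the Sinkhorn iteration for entropy-regularised OT, a standard way to compute a transport plan between two discrete distributions. It illustrates the general technique only. The abstract does not say which solver OT Fusion uses, and the cost matrix, marginals, and the name `sinkhorn` are assumptions.

```python
import math


def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularised OT plan between histograms a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low cost -> high affinity.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 3 feature vectors (rows) transported onto 2 anchors (columns).
cost = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
a = [1 / 3, 1 / 3, 1 / 3]
b = [0.5, 0.5]
plan = sinkhorn(cost, a, b)
```

Each row of `plan` says how a feature's mass is split across anchors: the first two features go almost entirely to their nearest anchor, while the third, equidistant from both, is split evenly.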

[CV-64] The Algorithmic Gaze: An Audit and Ethnography of the LAION-Aesthetics Predictor Model

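[Quick read]: This paper examines the cultural bias and unbalanced representation built into the LAION Aesthetic Predictor (LAP), an aesthetic evaluation model widely used to curate training data for visual generative AI models such as Stable Diffusion. Its notion of "aesthetic" reflects the preferences of a specific group (mainly English-speaking photographers and western AI enthusiasts), so data filtering and image generation reinforce structural inequality along lines of gender, race, and culture. The authors audit LAP across several datasets to show how unevenly it filters images relating to women, men, LGBTQ+ people, and non-western art styles, and use digital ethnography to trace where its training data came from. They argue that algorithm design based on a single "aesthetic" score causes representational harm, and call on AI developers to move away from prescriptive, western-centric aesthetic standards toward more pluralistic and inclusive evaluation.

Link: https://arxiv.org/abs/2601.09896
Authors: Jordan Taylor,William Agnew,Maarten Sap,Sarah E. Fox,Haiyi Zhu
Affiliations: Carnegie Mellon University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: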

Abstract:Visual generative AI models are trained using a one-size-fits-all measure of aesthetic appeal. However, what is deemed “aesthetic” is inextricably linked to personal taste and cultural values, raising the question of whose taste is represented in visual generative AI models. In this work, we study an aesthetic evaluation model–LAION Aesthetic Predictor (LAP)–that is widely used to curate datasets to train visual generative image models, like Stable Diffusion, and evaluate the quality of AI-generated images. To understand what LAP measures, we audited the model across three datasets. First, we examined the impact of aesthetic filtering on the LAION-Aesthetics Dataset (approximately 1.2B images), which was curated from LAION-5B using LAP. We find that the LAP disproportionally filters in images with captions mentioning women, while filtering out images with captions mentioning men or LGBTQ+ people. Then, we used LAP to score approximately 330k images across two art datasets, finding the model rates realistic images of landscapes, cityscapes, and portraits from western and Japanese artists most highly. In doing so, the algorithmic gaze of this aesthetic evaluation model reinforces the imperial and male gazes found within western art history. In order to understand where these biases may have originated, we performed a digital ethnography of public materials related to the creation of LAP. We find that the development of LAP reflects the biases we found in our audits, such as the aesthetic scores used to train LAP primarily coming from English-speaking photographers and western AI-enthusiasts. In response, we discuss how aesthetic evaluation can perpetuate representational harms and call on AI developers to shift away from prescriptive measures of “aesthetics” toward more pluralistic evaluation.

[CV-65] Transition Matching Distillation for Fast Video Generation

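[Quick read]: This paper addresses the inefficiency of multi-step sampling in large video diffusion models, which limits their use in real-time interactive applications. The core solution is a new framework called Transition Matching Distillation (TMD). It matches the multi-step denoising trajectory of the original diffusion model with a few-step probability transition process, where each transition is modeled by a lightweight conditional flow. The pretrained model is split into a main backbone (which extracts semantic representations) and a flow head (which performs the inner flow updates), and distribution matching distillation transfers knowledge efficiently, giving a flexible and strong trade-off between generation speed and visual quality.

Link: https://arxiv.org/abs/2601.09881
Authors: Weili Nie,Julius Berner,Nanye Ma,Chao Liu,Saining Xie,Arash Vahdat
Affiliations: NVIDIA; NYU
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: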

Abstract:Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: this https URL
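
The backbone and flow-head split implies a particular cost structure: the expensive backbone runs once per outer transition step, while the cheap flow head runs several times inside each step. The sketch below shows only that control flow with scalar stand-ins. The update rule, `few_step_sample`, and both toy functions are assumptions for illustration, not the TMD algorithm itself.

```python
def few_step_sample(x, backbone, flow_head, num_outer=4, num_inner=3):
    """Run num_outer transition steps, each with num_inner cheap flow updates."""
    for step in range(num_outer):
        features = backbone(x, step)  # expensive call, once per outer step
        for _ in range(num_inner):    # lightweight inner flow updates
            x = x + flow_head(x, features) / num_inner
    return x


calls = {"backbone": 0, "flow_head": 0}


def toy_backbone(x, step):
    """Stand-in for the main backbone: returns a fixed target as the 'feature'."""
    calls["backbone"] += 1
    return 1.0


def toy_flow_head(x, features):
    """Stand-in for the flow head: a velocity pointing from x toward the target."""
    calls["flow_head"] += 1
    return features - x


result = few_step_sample(0.0, toy_backbone, toy_flow_head)
```

With 4 outer and 3 inner steps, the backbone is called 4 times and the flow head 12 times, so most of the refinement is done by the cheaper component.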

[CV-66] MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation

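[Quick read]: This paper addresses the difficulty of fine-grained visual grounding and volumetric spatial reasoning in 3D medical vision-language models (VLMs), especially when unifying report generation, visual question answering (VQA), and multi-paradigm segmentation (semantic, referring, and interactive). The key to the solution is MedVL-SAM2, a unified architecture that combines image-level reasoning with pixel-level perception and adds a SAM2-based volumetric segmentation module for multi-granular spatial reasoning. Training is multi-stage: the model is first pre-trained on large-scale 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings, then jointly optimized with language-understanding and segmentation objectives. This supports flexible interaction through language, point, or box prompts, and unifies high-level semantic reasoning with precise 3D localization.

Link: https://arxiv.org/abs/2601.09879
Authors: Yang Xing,Jiong Wu,Savas Ozdemir,Ying Zhang,Yang Yang,Wei Shao,Kuang Gong
Affiliations: University of Florida; UC San Francisco
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: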

Abstract:Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.

[CV-67] VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching

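[Quick read]: This paper addresses how to generate high-resolution canopy height models (CHMs) from low-resolution remote sensing imagery (10 m Sentinel-2), to support large-scale, continuous forest monitoring and carbon accounting. Traditional approaches depend on aerial imagery, which is acquired infrequently and irregularly and so cannot support observation at a seasonal-to-annual cadence. The key to the solution is VibrantSR, a generative super-resolution framework that uses globally available Sentinel-2 seasonal composites to reconstruct 0.5 m CHMs from 10 m imagery, enabling consistent, frequent monitoring of forest structure at continental scale without costly aerial data.

Link: https://arxiv.org/abs/2601.09866
Authors: Kiarie Ndegwa,Andreas Gros,Tony Chang,David Diaz,Vincent A. Landau,Nathan E. Rutenbeck,Luke J. Zachmann,Guy Bayes,Scott Conway
Affiliations: Vibrant Planet Public Benefit Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures, 2 tables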

Abstract:We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights ≥ 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.
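
The headline metric is a Mean Absolute Error restricted to a canopy-height range. A minimal sketch of such a masked MAE follows, assuming the mask keeps pixels whose reference height is at least 2 m. That reading of the threshold, the name `masked_mae`, and the toy heights are assumptions, not details from the paper.

```python
def masked_mae(pred, target, min_height=2.0):
    """Mean absolute error over pixels whose reference height is >= min_height."""
    pairs = [(p, t) for p, t in zip(pred, target) if t >= min_height]
    if not pairs:
        raise ValueError("no pixels at or above the height threshold")
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


# Toy predicted and reference canopy heights in meters (illustrative only).
pred = [1.0, 5.0, 12.0, 20.0]
target = [0.5, 8.0, 10.0, 25.0]
mae = masked_mae(pred, target)
```

Masking out low vegetation and bare ground keeps large non-forest areas, where both heights are near zero, from dominating the average.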

[CV-68] Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP ICLR2026

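[Quick read]: This paper addresses the difficulty of improving open-weight CLIP (Contrastive Language–Image Pre-training) models on downstream tasks, and in particular how to improve generalization across many tasks using only existing self-supervised datasets, without training from scratch. Supervised fine-tuning adapts a model to a single task but does not serve multiple tasks well, and applying standard training protocols directly often degrades performance. The authors propose TuneCLIP, which has two key components: a warm-up stage that recovers optimization statistics to reduce cold-start bias, motivated by theoretical analysis; and a fine-tuning stage with a new contrastive loss that reduces the penalty on false negative pairs, improving robustness and generalization. Experiments show consistent gains across model architectures and scales, for example up to +2.5% on ImageNet for SigLIP (ViT-B/16) and +1.2% on the highly competitive DataComp benchmark, setting a new baseline for efficient post-pretraining adaptation.

Link: https://arxiv.org/abs/2601.09859
Authors: Anant Mehta,Xiyuan Wei,Xingyu Chen,Tianbao Yang
Affiliations: Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to ICLR 2026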

Abstract:CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of recovering optimization statistics to reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage of optimizing a new contrastive loss to mitigate the penalization on false negative pairs. Our extensive experiments show that TuneCLIP consistently improves performance across model architectures and scales. Notably, it elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.

[CV-69] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning

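[Quick read]: This paper addresses the lack of effective evaluation metrics for multimodal video captioning: traditional metrics such as BLEU or ROUGE cannot quantify information coverage across modalities, for example when comparing text with a sequence of keyframes. The key to the solution is a unified, information-theoretic evaluation framework, Video Summary Information Loss (ViSIL), which uses vision-language model (VLM) inference to estimate how much video information a summary fails to capture. This allows direct comparison of multimodal summaries with different structures (such as text only, or keyframes plus text). Experiments show that ViSIL scores correlate significantly with human and VLM performance on Video Question Answering (VQA), and support Pareto optimization between information loss and processing speed, improving VQA accuracy by 7% over text-only summaries without increasing processing load.

Link: https://arxiv.org/abs/2601.09851
Authors: Po-han Li,Shenghui Chen,Ufuk Topcu,Sandeep Chinchali
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: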

Abstract:Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
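
The summary-selection step can be pictured as a Pareto filter over candidate summary formats, each scored by information loss and processing cost (lower is better for both). The sketch below is a generic Pareto-front computation. The candidate names and all numbers are made up for illustration, and `pareto_front` is a hypothetical helper rather than the authors' code.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both loss and cost.

    Each candidate is (name, info_loss, cost); lower is better for both.
    """
    front = []
    for name, loss, cost in candidates:
        dominated = any(
            (l2 <= loss and c2 <= cost) and (l2 < loss or c2 < cost)
            for _, l2, c2 in candidates
        )
        if not dominated:
            front.append((name, loss, cost))
    return sorted(front, key=lambda item: item[2])


# Hypothetical summary formats with made-up scores (not results from the paper).
candidates = [
    ("text_only", 0.60, 1.0),
    ("text+2_frames", 0.45, 1.5),
    ("text+8_frames", 0.40, 3.0),
    ("8_frames_only", 0.55, 2.8),
    ("text+16_frames", 0.40, 5.0),
]
front = pareto_front(candidates)
front_names = [name for name, _, _ in front]
```

Here `8_frames_only` is dropped because `text+2_frames` loses less information at lower cost, and `text+16_frames` is dropped because `text+8_frames` matches its loss more cheaply.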

[CV-70] UniHash: Unifying Pointwise and Pairwise Hashing Paradigms for Seen and Unseen Category Retrieval

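[Quick read]: This paper addresses the unbalanced performance of existing deep hashing methods on image retrieval for seen versus unseen categories. Traditional methods use either a pointwise or a pairwise training paradigm: the former performs well on seen categories but generalizes poorly, while the latter adapts better to unseen categories at the expense of precision on seen ones. The key to the solution is the Unified Hashing (UniHash) framework, a dual-branch design with a center-based pointwise branch and a pairwise branch, linked by bidirectional knowledge transfer: a mutual learning loss aligns the hash representations, and a Split-Merge Mixture of Hash Experts (SM-MoH) module promotes cross-branch exchange of hash representations, giving balanced gains in both settings.

Link: https://arxiv.org/abs/2601.09828
Authors: Xiaoxu Ma,Runhao Li,Hanwen Liu,Xiangbo Zhang,Zhenyu Weng
Affiliations: South China University of Technology; Georgia Institute of Technology; Nanyang Technological University; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: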

Abstract:Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.

[CV-71] NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration CVPR2026

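[Quick Read]: This paper addresses the excessive compute cost of deploying generative diffusion models such as Stable Diffusion 1.5 on edge devices, while avoiding the disrupted latent feature space and reduced task generalization that existing lightweight methods incur by compressing the denoising U-Net or shortening the diffusion trajectory. The key to the solution is the NanoSD model family, which jointly applies network surgery, feature-wise generative distillation, and structured architectural scaling in a full-pipeline co-design of the U-Net and the VAE encoder-decoder. This preserves the strong generative prior while reaching a Pareto-optimal balance between parameter count (130M-315M) and inference latency (as low as 20 ms), and it markedly improves real hardware efficiency and multi-task generality.

Link: https://arxiv.org/abs/2601.09823
Authors: Subhajit Sanyal, Srinivas Soumitri Miriyala, Akshay Janardan Bankar, Sravanth Kodavanti, Harshit, Abhishek Ameta, Shreyas Pandith, Amit Satish Unde
Affiliations: Samsung Research India, Bangalore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to CVPR 2026
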
Abstract:Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder-decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy-latency-size frontier (e.g., 130M-315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.

[CV-72] Explainable Deep Learning for Pediatric Pneumonia Detection in Chest X-Ray Images

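[Quick Read]: This paper targets the insufficient accuracy and efficiency of pediatric pneumonia diagnosis, with the goal of improving clinical decision support systems. The key to the solution is a comparison of two convolutional neural network (CNN) architectures, DenseNet121 and EfficientNet-B0, for automated pneumonia detection on pediatric chest X-rays under identical training conditions, with Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME) added for explainability. Both models keep high sensitivity (recall > 0.99), and the study reports better overall performance (F1-score and MCC) along with greater clinical trustworthiness. The results show EfficientNet-B0 ahead of DenseNet121 in accuracy, balance, and computational efficiency, suggesting good potential for clinical deployment.

Link: https://arxiv.org/abs/2601.09814
Authors: Adil O. Khadidos, Aziida Nanyonga, Alaa O. Khadidos, Olfat M. Mirza, Mustafa Tahsin Yilmaz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
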
Abstract:Background: Pneumonia remains a leading cause of morbidity and mortality among children worldwide, emphasizing the need for accurate and efficient diagnostic support tools. Deep learning has shown strong potential in medical image analysis, particularly for chest X-ray interpretation. This study compares two state-of-the-art convolutional neural network (CNN) architectures for automated pediatric pneumonia detection. Methods: A publicly available dataset of 5,863 pediatric chest X-ray images was used. Images were preprocessed through normalization, resizing, and data augmentation to enhance generalization. DenseNet121 and EfficientNet-B0 were fine-tuned using pretrained ImageNet weights under identical training settings. Performance was evaluated using accuracy, F1-score, Matthews Correlation Coefficient (MCC), and recall. Model explainability was incorporated using Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME) to visualize image regions influencing predictions. Results: EfficientNet-B0 outperformed DenseNet121, achieving an accuracy of 84.6%, F1-score of 0.8899, and MCC of 0.6849. DenseNet121 achieved 79.7% accuracy, an F1-score of 0.8597, and MCC of 0.5852. Both models demonstrated high recall values above 0.99, indicating strong sensitivity to pneumonia detection. Grad-CAM and LIME visualizations showed consistent focus on clinically relevant lung regions, supporting the reliability of model decisions. Conclusions: EfficientNet-B0 provided a more balanced and computationally efficient performance compared to DenseNet121, making it a strong candidate for clinical deployment. The integration of explainability techniques enhances transparency and trustworthiness in AI-assisted pediatric pneumonia diagnosis.
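
The metrics named in this abstract can all be derived from a confusion matrix. The sketch below is illustrative only: the function name `binary_metrics` and the counts in the usage note are hypothetical and not taken from the paper. It shows why a model can reach recall near 0.99 while its MCC stays moderate, since MCC also penalizes false positives.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, F1, and MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }
```

With made-up counts such as `binary_metrics(tp=99, fp=40, fn=1, tn=60)`, recall is 0.99 while accuracy is 0.795, which mirrors the pattern the abstract describes: very high sensitivity alongside more modest overall scores.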

[CV-73] LCF3D: A Robust and Real-Time Late-Cascade Fusion Framework for 3D Object Detection in Autonomous Driving

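[Quick Read]: This paper addresses the accuracy of multi-sensor 3D object detection in autonomous driving, in particular how to combine RGB camera and LiDAR point cloud data effectively to improve detection performance and domain generalization. The key to the solution is LCF3D, a sensor fusion framework built on two principles: late fusion, which matches LiDAR 3D detections with RGB 2D detections and filters out unmatched LiDAR false positives; and cascade fusion, which generates new 3D frustum proposals for unmatched RGB detections to recover objects that LiDAR missed. This two-strategy fusion markedly improves detection accuracy on challenging categories such as pedestrians and cyclists, and its cross-domain adaptability is validated on KITTI and nuScenes.

Link: https://arxiv.org/abs/2601.09812
Authors: Carlo Sgaravatti, Riccardo Pieroni, Matteo Corno, Sergio M. Savaresi, Luca Magri, Giacomo Boracchi
Affiliations: Politecnico di Milano
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 35 pages, 14 figures. Published in Pattern Recognition
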
Abstract:Accurately localizing 3D objects like pedestrians, cyclists, and other vehicles is essential in Autonomous Driving. To ensure high detection performance, Autonomous Vehicles complement RGB cameras with LiDAR sensors, but effectively combining these data sources for 3D object detection remains challenging. We propose LCF3D, a novel sensor fusion framework that combines a 2D object detector on RGB images with a 3D object detector on LiDAR point clouds. By leveraging multimodal fusion principles, we compensate for inaccuracies in the LiDAR object detection network. Our solution combines two key principles: (i) late fusion, to reduce LiDAR False Positives by matching LiDAR 3D detections with RGB 2D detections and filtering out unmatched LiDAR detections; and (ii) cascade fusion, to recover missed objects from LiDAR by generating new 3D frustum proposals corresponding to unmatched RGB detections. Experiments show that LCF3D is beneficial for domain generalization, as it turns out to be successful in handling different sensor configurations between training and testing domains. LCF3D achieves significant improvements over LiDAR-based methods, particularly for challenging categories like pedestrians and cyclists in the KITTI dataset, as well as motorcycles and bicycles in nuScenes. Code can be downloaded from: this https URL.
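
The two fusion principles can be sketched as a matching step over 2D boxes. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the LiDAR 3D detections have already been projected into the image as 2D boxes, and it uses greedy IoU matching with a hypothetical threshold of 0.5. The abstract does not specify the actual matching rule or the frustum-based detector that follows, so the names `iou` and `late_cascade_match` are made up for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def late_cascade_match(lidar_boxes_2d, rgb_boxes, iou_thr=0.5):
    """Greedy one-to-one matching between projected LiDAR boxes and RGB boxes.

    Returns (kept_lidar, frustum_rgb):
      kept_lidar  - indices of LiDAR detections confirmed by an RGB detection
                    (late fusion: unmatched LiDAR detections are dropped);
      frustum_rgb - indices of RGB detections with no LiDAR match
                    (cascade fusion: candidates for new frustum proposals).
    """
    pairs = sorted(
        ((iou(l, r), i, j)
         for i, l in enumerate(lidar_boxes_2d)
         for j, r in enumerate(rgb_boxes)),
        reverse=True,
    )
    matched_lidar, matched_rgb = set(), set()
    for score, i, j in pairs:
        if score < iou_thr:
            break  # remaining pairs overlap even less
        if i in matched_lidar or j in matched_rgb:
            continue
        matched_lidar.add(i)
        matched_rgb.add(j)
    kept_lidar = sorted(matched_lidar)
    frustum_rgb = [j for j in range(len(rgb_boxes)) if j not in matched_rgb]
    return kept_lidar, frustum_rgb
```

In a full pipeline, each index in `frustum_rgb` would be handed to a second-stage detector that searches the LiDAR points inside the corresponding camera frustum.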

[CV-74] Diffusion-Driven Deceptive Patches: Adversarial Manipulation and Forensic Detection in Facial Identity Verification

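[Quick Read]: This paper addresses how to generate, refine, and evaluate adversarial patches that effectively disrupt face recognition systems, in support of forensic analysis and security testing. The core challenge is balancing attack effectiveness against visual imperceptibility while providing interpretable semantic descriptions that aid forensic judgment. The key to the solution is an end-to-end pipeline: first, the Fast Gradient Sign Method (FGSM) generates adversarial noise targeting an identity classifier; next, a diffusion model applies Gaussian smoothing and adaptive brightness correction through reverse diffusion to improve the patch's imperceptibility; then a Vision Transformer (ViT)-GPT2 model generates semantic captions for the adversarial images to document and explain identity evasion; finally, perceptual hashing and segmentation detect adversarial samples, with an SSIM of 0.95 reported. The approach preserves natural visual characteristics while weakening the robustness of facial systems in identity verification and expression recognition.

Link: https://arxiv.org/abs/2601.09806
Authors: Shahrzad Sayyafzadeh, Hongmei Chi, Shonda Bernadin
Affiliations: Florida A&M University; Florida State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This manuscript is a preprint. A revised version of this work has been accepted for publication in the Springer Nature book Artificial Intelligence-Driven Forensics. This version includes one additional figure for completeness
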
Abstract:This work presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches to compromise facial biometric systems, with applications in forensic analysis and security testing. We utilize FGSM to generate adversarial noise targeting an identity classifier and employ a diffusion model with reverse diffusion to enhance imperceptibility through Gaussian smoothing and adaptive brightness correction, thereby facilitating synthetic adversarial patch evasion. The refined patch is applied to facial images to test its ability to evade recognition systems while maintaining natural visual characteristics. A Vision Transformer (ViT)-GPT2 model generates captions to provide a semantic description of a person’s identity for adversarial images, supporting forensic interpretation and documentation for identity evasion and recognition attacks. The pipeline evaluates changes in identity classification, captioning results, and vulnerabilities in facial identity verification and expression recognition under adversarial conditions. We further demonstrate effective detection and analysis of adversarial patches and adversarial samples using perceptual hashing and segmentation, achieving an SSIM of 0.95.

[CV-75] Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts

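[Quick Read]: This paper addresses the cross-modal alignment collapse that vision-language models exhibit when handling out-of-distribution (OOD) concepts. The key to the solution is a Multi-Agent Cooperative Learning (MACL) framework in which four core agents (image, text, name, and coordination) jointly mitigate modality imbalance through structured message passing. The framework also introduces a context-exchange-enhanced few-shot learning algorithm and an adaptive dynamic balancing mechanism that regulates each agent's contribution, improving performance in few-shot and zero-shot settings.

Link: https://arxiv.org/abs/2601.09746
Authors: Philip Xu, Isabel Wagner, Eerke Boiten
Affiliations: De Montfort University; University of Basel
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
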
Abstract:This paper introduces a novel Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents, including image, text, name, and coordination agents, collaboratively mitigate modality imbalance through structured message passing. The proposed framework enables multi-agent feature space name learning, incorporates a context exchange enhanced few-shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter-agent contributions. Experiments on the VISTA-Beyond dataset demonstrate that MACL significantly improves performance in both few-shot and zero-shot settings, achieving 1-5% precision gains across diverse visual domains.

[CV-76] Multi-Objective Pareto-Front Optimization for Efficient Adaptive VVC Streaming

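[Quick Read]: This paper addresses multi-objective optimization across bitrate, video quality, and decoding complexity in adaptive video streaming, in order to build content-adaptive bitrate ladders and improve Quality of Experience (QoE). The key solution is a multi-objective optimization framework based on the Pareto front (PF), with two strategies: the Joint Rate-Quality-Time Pareto Front (JRQT-PF) and the Joint Quality-Time Pareto Front (JQT-PF). Using decoding time as a proxy for energy consumption and enforcing quality monotonicity, the framework builds efficient, tunable bitrate ladders that support sustainable, high-quality streaming under varying network and device conditions.

Link: https://arxiv.org/abs/2601.10607
Authors: Angeliki Katsenou, Vignesh V. Menon, Guoda Laurinaviciute, Benjamin Bross, Detlev Marpe
Affiliations: University of Bristol; Fraunhofer HHI
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages
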
Abstract:Adaptive video streaming has facilitated improved video streaming over the past years. A balance among coding performance objectives such as bitrate, video quality, and decoding complexity is required to achieve efficient, content- and codec-dependent, adaptive video streaming. This paper proposes a multi-objective Pareto-front (PF) optimization framework to construct quality-monotonic, content-adaptive bitrate ladders for Versatile Video Coding (VVC) streaming that jointly optimize video quality, bitrate, and decoding time, which is used as a practical proxy for decoding energy. Two strategies are introduced: the Joint Rate-Quality-Time Pareto Front (JRQT-PF) and the Joint Quality-Time Pareto Front (JQT-PF), each exploring different tradeoff formulations and objective prioritizations. The ladders are constructed under quality monotonicity constraints during adaptive streaming to ensure a consistent Quality of Experience (QoE). Experiments are conducted on a large-scale UHD dataset (Inter-4K), with quality assessed using PSNR, VMAF, and XPSNR, and complexity measured via decoding time and energy consumption. The JQT-PF method achieves 11.76% average bitrate savings while reducing average decoding time by 0.29% to maintain the same XPSNR, compared to a widely-used fixed ladder. More aggressive configurations yield up to 27.88% bitrate savings at the cost of increased complexity. The JRQT-PF strategy, on the other hand, offers more controlled tradeoffs, achieving 6.38% bitrate savings and 6.17% decoding time reduction. This framework outperforms existing methods, including fixed ladders, VMAF- and XPSNR-based dynamic resolution selection, and complexity-aware benchmarks. The results confirm that PF optimization with decoding time constraints enables sustainable, high-quality streaming tailored to network and device capabilities.
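
To make the Pareto-front idea concrete, here is a small sketch of how non-dominated encodings might be selected and then turned into a quality-monotonic ladder. It is an assumption-laden illustration, not the JRQT-PF or JQT-PF procedure itself: the `Encoding` type, the three-objective dominance test, and the ladder rule (sort by bitrate, keep only strictly increasing quality) are hypothetical choices made for this example.

```python
from typing import List, NamedTuple


class Encoding(NamedTuple):
    bitrate_kbps: float   # lower is better
    quality: float        # higher is better (e.g. an XPSNR-like score)
    decode_time_s: float  # lower is better, used as an energy proxy


def dominates(a: Encoding, b: Encoding) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (a.bitrate_kbps <= b.bitrate_kbps
                and a.quality >= b.quality
                and a.decode_time_s <= b.decode_time_s)
    better = (a.bitrate_kbps < b.bitrate_kbps
              or a.quality > b.quality
              or a.decode_time_s < b.decode_time_s)
    return no_worse and better


def pareto_front(points: List[Encoding]) -> List[Encoding]:
    """Keep only the encodings that no other encoding dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def monotonic_ladder(front: List[Encoding]) -> List[Encoding]:
    """Order by bitrate and drop rungs that would lower the quality."""
    ladder: List[Encoding] = []
    for p in sorted(front, key=lambda e: e.bitrate_kbps):
        if not ladder or p.quality > ladder[-1].quality:
            ladder.append(p)
    return ladder
```

The quadratic dominance check is fine for the handful of resolution and quantization candidates a single video typically has; a larger search space would call for a sort-based front computation.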

[CV-77] Cell Behavior Video Classification Challenge, a benchmark for computer vision methods in time-lapse microscopy

[Quick Read]: This paper addresses the classification of complex cell behaviors in microscopy videos, a frontier problem in computer vision. The core difficulties are modeling the shape and motion of objects without rigid boundaries, extracting hierarchical spatiotemporal features from entire video sequences, and handling the interactions among multiple objects in the field of view. The key to the solution is the Cell Behavior Video Classification Challenge (CBVCC), which systematically evaluates three families of methods: classification of tracking-derived features, end-to-end deep learning architectures that learn spatiotemporal features directly from the whole video sequence without explicit cell tracking, and ensembles that combine tracking-derived and image-derived features. Comparing the performance and limitations of these approaches provides a benchmark and direction for computer vision methods that study cellular dynamics.

Link: https://arxiv.org/abs/2601.10250
Authors: Raffaella Fiamma Cabini, Deborah Barkauskas, Guangyu Chen, Zhi-Qi Cheng, David E Cicchetti, Judith Drazba, Rodrigo Fernandez-Gonzalez, Raymond Hawkins, Yujia Hu, Jyoti Kini, Charles LeWarne, Xufeng Lin, Sai Preethi Nakkina, John W Peterson, Koert Schreurs, Ayushi Singh, Kumaran Bala Kandan Viswanathan, Inge MN Wortel, Sanjian Zhang, Rolf Krause, Santiago Fernandez Gonzalez, Diego Ulisse Pizzagalli
Affiliations: Euler Institute, Faculty of Informatics, Università della Svizzera italiana; International Center for Advanced Computing in Medicine (ICAM), University of Pavia; Imaging Platform, ACRF INCITe Centre, Garvan Institute of Medical Research; Tacoma School of Engineering & Technology, University of Washington; Data Science, Institute for Computing and Information Sciences, Radboud University; Imaging Core, Lerner Research Institute, Cleveland Clinic; Institute of Biomedical Engineering, University of Toronto; Computational Biology Group, Data Science Platform, Garvan Institute of Medical Research; School of Clinical Medicine, Faculty of Medicine and Health, University of New South Wales; Center for Research in Computer Vision, University of Central Florida; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania; Department of Ophthalmology and Visual Sciences, SUNY Upstate Medical University; Theodore Kocher Institute, Faculty of Medicine, University of Bern; Institute for Research in Biomedicine, Faculty of Biomedical Sciences, Università della Svizzera italiana
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The classification of microscopy videos capturing complex cellular behaviors is crucial for understanding and quantifying the dynamics of biological processes over time. However, it remains a frontier in computer vision, requiring approaches that effectively model the shape and motion of objects without rigid boundaries, extract hierarchical spatiotemporal features from entire image sequences rather than static frames, and account for multiple objects within the field of view. To this end, we organized the Cell Behavior Video Classification Challenge (CBVCC), benchmarking 35 methods based on three approaches: classification of tracking-derived features, end-to-end deep learning architectures to directly learn spatiotemporal features from the entire video sequence without explicit cell tracking, or ensembling tracking-derived with image-derived features. We discuss the results achieved by the participants and compare the potential and limitations of each approach, serving as a basis to foster the development of computer vision methods for studying cellular dynamics.

Artificial Intelligence

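[AI-0] The Impact of Generative AI on Architectural Conceptual Design: Performance, Creative Self-Efficacy, and Cognitive Load

[Quick Read]: This paper examines how generative AI (GenAI) affects design performance, creative self-efficacy, and cognitive load in architectural conceptual design tasks. Comparing students who designed independently and then with either a GenAI tool or a control condition (an online repository of existing projects), the study finds that GenAI did not improve design performance overall but did significantly improve outcomes for novice designers. General creative self-efficacy declined among GenAI users, and cognitive load did not differ significantly, although prompts for iterative idea generation and visual feedback were associated with lower cognitive load. The key takeaway is that user expertise and interaction strategy (especially prompting style) act together: GenAI's effectiveness is context-dependent, and guidance should be tailored to the user's ability to maximize its value.

Link: https://arxiv.org/abs/2601.10696
Authors: Han Jiang, Yao Xiao, Rachel Hurley, Shichao Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
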
Abstract:Our study examines how generative AI (GenAI) influences performance, creative self-efficacy, and cognitive load in architectural conceptual design tasks. Thirty-six student participants from Architectural Engineering and other disciplines completed a two-phase architectural design task, first independently and then with external tools (GenAI-assisted condition and control condition using an online repository of existing architectural projects). Design outcomes were evaluated by expert raters, while self-efficacy and cognitive load were self-reported after each phase. Difference-in-differences analyses revealed no overall performance advantage of GenAI across participants; however, subgroup analyses showed that GenAI significantly improved design performance for novice designers. In contrast, general creative self-efficacy declined for students using GenAI. Cognitive load did not differ significantly between conditions, though prompt usage patterns showed that iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load. These findings suggest that GenAI effectiveness depends on users’ prior expertise and interaction strategies through prompting.

[AI-1] On the origin of neural scaling laws: from random graphs to natural language

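[Quick Read]: This paper asks whether neural scaling laws depend on power-law structure already present in the data, and how to understand systematically, on simplified tasks, the way model performance varies with data, compute, and parameter count. The key to the approach is constructing generative tasks of controllable complexity (such as random walk prediction and training on sequences from simplified language models) to show that scaling-law behavior appears even in data without power-law structure. By progressively reducing the complexity of natural language modeling (from 4-, 2-, and 1-layer transformers down to language bigrams), the paper reveals a monotonic evolution of the scaling exponents, and it shows that 2-layer transformers with a context length of 50 can reproduce key scaling results from conventional language modeling, providing new empirical evidence on the origin of scaling laws.

Link: https://arxiv.org/abs/2601.10684
Authors: Maissam Barkeshli, Alberto Alfarano, Andrey Gromov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 33 pages
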
Abstract:Scaling laws have played a major role in the modern AI revolution, providing practitioners predictive power over how the model performance will improve with increasing data, compute, and number of model parameters. This has spurred an intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws even in the absence of power law structure in the data correlations. We further consider dialing down the complexity of natural language systematically, by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős-Rényi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling, demonstrating that several essential results can be reproduced using 2-layer transformers with context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute optimal curves as compared with current practice in published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.
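
The kind of training data described here, random walks on a random graph, is easy to generate. The following is a minimal sketch using only the standard library; the function names, graph size, and edge probability are hypothetical, and the paper's actual data pipeline (graph ensembles, walk lengths, tokenization) is not given in the abstract.

```python
import random


def erdos_renyi(n: int, p: float, rng: random.Random) -> dict:
    """Undirected G(n, p) graph as an adjacency list: each edge appears with probability p."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def random_walk(adj: dict, length: int, rng: random.Random) -> list:
    """Sample a walk of `length` nodes; the next node depends only on the current one."""
    starts = [v for v, nbrs in adj.items() if nbrs]
    if not starts:
        raise ValueError("graph has no edges")
    node = rng.choice(starts)
    walk = [node]
    while len(walk) < length:
        # In an undirected graph, any node reached by an edge has a neighbor,
        # so the walk never gets stuck.
        node = rng.choice(adj[node])
        walk.append(node)
    return walk
```

Each walk is a token sequence whose statistics are fully determined by consecutive pairs (bigrams), which is what makes the setting a controlled testbed: a sequence model trained on such walks has a known optimal predictor to compare against.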

[AI-2] Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems

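[Quick Read]: This paper addresses problems with conventional retrieval-augmented generation (RAG) when building large language model (LLM) contexts: fragmented information graph structure, over-retrieval, duplicated content, and insufficient query context (for example, missing second- and third-order facets). The key to the solution is a structure-aware, diversity-constrained framework for constructing a "context bubble". The framework uses multi-granular document structure (such as sections and table rows) and task-conditioned structural priors to guide retrieval; starting from high-relevance anchor spans, it applies constrained selection that balances query relevance, marginal coverage, and redundancy penalties to produce a coherent, citable set of spans under a strict token budget. The method improves context compactness, coverage, and answer quality while providing auditability and deterministic tuning.

Link: https://arxiv.org/abs/2601.10681
Authors: Amir Khurshid, Abhishek Sehgal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
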
Abstract:Large language model (LLM) contexts are typically constructed using retrieval-augmented generation (RAG), which involves ranking and selecting the top-k passages. The approach causes fragmentation in information graphs in document structures, over-retrieval, and duplication of content alongside insufficient query context, including 2nd and 3rd order facets. In this paper, a structure-informed and diversity-constrained context bubble construction framework is proposed that assembles coherent, citable bundles of spans under a strict token budget. The method preserves and exploits inherent document structure by organising multi-granular spans (e.g., sections and rows) and using task-conditioned structural priors to guide retrieval. Starting from high-relevance anchor spans, a context bubble is constructed through constrained selection that balances query relevance, marginal coverage, and redundancy penalties. It will explicitly constrain diversity and budget, producing compact and informative context sets, unlike top-k retrieval. Moreover, a full retrieval is emitted that traces the scoring and selection choices of the records, thus providing auditability and deterministic tuning. Experiments on enterprise documents demonstrate the efficiency of context bubble as it significantly reduces redundant context, is better able to cover secondary facets and has a better answer quality and citation faithfulness within a limited context window. Ablation studies demonstrate that both structural priors as well as diversity constraint selection are necessary; removing either component results in a decline in coverage and an increase in redundant or incomplete context.
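
The constrained selection described in this abstract can be pictured as a greedy loop. What follows is a rough sketch under assumptions of my own, not the paper's algorithm: the scoring formula (relevance plus a coverage bonus minus a redundancy penalty), the Jaccard overlap measure, the weights, and the span dictionary layout are all hypothetical. The returned `trace` stands in for the audit record the abstract mentions.

```python
def jaccard(a, b):
    """Token-set overlap between two spans, in [0, 1]."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def build_context_bubble(spans, budget, w_cov=0.5, w_red=1.0):
    """Greedily select spans under a token budget.

    Each span is a dict with keys: "id", "tokens" (list of str),
    "relevance" (float), and "facets" (set of str).
    Returns (selected_ids, trace), where trace logs (id, score) per pick.
    """
    selected, covered, used, trace = [], set(), 0, []
    remaining = list(spans)
    while remaining:
        best, best_score = None, float("-inf")
        for s in remaining:
            if used + len(s["tokens"]) > budget:
                continue  # would exceed the token budget
            coverage = len(s["facets"] - covered)  # newly covered facets
            redundancy = max(
                (jaccard(s["tokens"], t["tokens"]) for t in selected),
                default=0.0,
            )
            score = s["relevance"] + w_cov * coverage - w_red * redundancy
            if score > best_score:
                best, best_score = s, score
        if best is None or best_score <= 0:
            break  # nothing fits, or only redundant spans remain
        selected.append(best)
        covered |= best["facets"]
        used += len(best["tokens"])
        trace.append((best["id"], best_score))
        remaining.remove(best)
    return [s["id"] for s in selected], trace
```

With an empty selection the first pick has no redundancy penalty, so it behaves like the high-relevance anchor; later picks are pulled toward spans that add new facets and away from near-duplicates, which is the contrast with plain top-k ranking.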

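[AI-3] Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models

[Quick Read]: This paper addresses the unstable behavior of the Hierarchical Reasoning Model (HRM) on complex reasoning tasks, in particular the errors and performance bottlenecks that arise because it "guesses" rather than truly reasons. The study identifies three phenomena: failure on simple puzzles (caused by violation of the fixed point property), "grokking" dynamics across reasoning steps (the answer becomes correct abruptly rather than improving gradually), and the coexistence of multiple fixed points that can trap the model at an incorrect one. The core of the solution is to treat HRM as a guessing mechanism and to scale its guesses with three strategies: data augmentation (improving guess quality), input perturbation (increasing the number of guesses via inference randomness), and model bootstrapping (increasing the number of guesses via training randomness). Combining these yields Augmented HRM, which raises accuracy on Sudoku-Extreme from 54.5% to 96.9%, improving practicality and offering a new perspective on how reasoning models reason.

Link: https://arxiv.org/abs/2601.10679
Authors: Zirui Ren, Ziming Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
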
Abstract:Hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study on its reasoning patterns and find three surprising facts: (a) Failure of extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) “Grokking” dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM “guesses” the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be “guessing” instead of “reasoning”. Leveraging this “guessing” picture, we propose three strategies to scale HRM’s guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models “reason”.

[AI-4] Multi-Property Synthesis

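[Quick Read]: This paper addresses synthesis for linear temporal logic on finite traces (LTLf) with multiple properties that may not all be satisfiable at once: how to efficiently identify and synthesize strategies that achieve maximal realizable sets of properties. The key to the solution is a single fixed-point computation that relates product-game states to the goal sets realizable from them, together with Boolean goal variables and monotonicity to represent exponentially many goal combinations compactly and symbolically, avoiding inefficient enumeration of property subsets. The method substantially outperforms enumeration-based baselines, with speedups of up to two orders of magnitude.

Link: https://arxiv.org/abs/2601.10651
Authors: Christoph Weinhuber, Yannik Schnitzer, Alessandro Abate, David Parker, Giuseppe De Giacomo, Moshe Y. Vardi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
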
Abstract:We study LTLf synthesis with multiple properties, where satisfying all properties may be impossible. Instead of enumerating subsets of properties, we compute in one fixed-point computation the relation between product-game states and the goal sets that are realizable from them, and we synthesize strategies achieving maximal realizable sets. We develop a fully symbolic algorithm that introduces Boolean goal variables and exploits monotonicity to represent exponentially many goal combinations compactly. Our approach substantially outperforms enumeration-based baselines, with speedups of up to two orders of magnitude.

[AI-5] Procedural Fairness in Multi-Agent Bandits

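[Quick Read]: This paper addresses the overly outcome-oriented definition of fairness in multi-agent multi-armed bandits (MA-MAB): existing work usually measures fairness by outcomes such as welfare maximization, inequality reduction, or utility balancing, and overlooks fairness of the decision process, such as whether agents have equal decision-making power. The key to the solution is a new fairness objective, procedural fairness, which gives all agents equal decision-making influence and is shown theoretically to yield proportionality in outcomes. Empirical results show that outcome-based fairness policies (such as egalitarian and utilitarian ones) sacrifice equal voice among agents, whereas procedurally fair policies lose very little on those outcome metrics, underscoring procedural legitimacy as a core dimension of fairness.

Link: https://arxiv.org/abs/2601.10600
Authors: Joshua Caiata, Carter Blair, Kate Larson
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
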
Abstract:In the context of multi-agent multi-armed bandits (MA-MAB), fairness is often reduced to outcomes: maximizing welfare, reducing inequality, or balancing utilities. However, evidence in psychology, economics, and Rawlsian theory suggests that fairness is also about process and who gets a say in the decisions being made. We introduce a new fairness objective, procedural fairness, which provides equal decision-making power for all agents, lies in the core, and provides for proportionality in outcomes. Empirical results confirm that fairness notions based on optimizing for outcomes sacrifice equal voice and representation, while the sacrifice in outcome-based fairness objectives (like equality and utilitarianism) is minimal under procedurally fair policies. We further prove that different fairness notions prioritize fundamentally different and incompatible values, highlighting that fairness requires explicit normative choices. This paper argues that procedural legitimacy deserves greater focus as a fairness objective, and provides a framework for putting procedural fairness into practice.

[AI-6] ProbFM: Probabilistic Time Series Foundation Model with Uncertainty Decomposition AAAI2026

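[Quick Read]: This paper addresses uncertainty quantification in Time Series Foundation Models (TSFMs) for financial forecasting, where existing methods tend to rely on restrictive distributional assumptions, conflate sources of uncertainty, or lack principled calibration mechanisms. The key to the solution is ProbFM, a transformer-based probabilistic foundation model whose core innovation is Deep Evidential Regression (DER): higher-order evidence learning yields an interpretable, Bayesian-style decomposition that explicitly separates epistemic uncertainty from aleatoric uncertainty. The method needs no pre-specified distributional form or sampling-based inference and produces its uncertainty estimates in a single forward pass, giving a principled risk characterization for zero-shot forecasting in financial settings.

Link: https://arxiv.org/abs/2601.10591
Authors: Arundeep Chinta, Lucas Vinh Tran, Jay Katukuri
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM); Trading and Market Microstructure (q-fin.TR)
Comments: Accepted for oral presentation at the AI Meets Quantitative Finance Workshop at ICAIF 2025. An enhanced version was accepted for oral presentation at the AI for Time Series Analysis Workshop at AAAI 2026
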
Abstract:Time Series Foundation Models (TSFMs) have emerged as a promising approach for zero-shot financial forecasting, demonstrating strong transferability and data efficiency gains. However, their adoption in financial applications is hindered by fundamental limitations in uncertainty quantification: current approaches either rely on restrictive distributional assumptions, conflate different sources of uncertainty, or lack principled calibration mechanisms. While recent TSFMs employ sophisticated techniques such as mixture models, Student’s t-distributions, or conformal prediction, they fail to address the core challenge of providing theoretically-grounded uncertainty decomposition. For the very first time, we present a novel transformer-based probabilistic framework, ProbFM (probabilistic foundation model), that leverages Deep Evidential Regression (DER) to provide principled uncertainty quantification with explicit epistemic-aleatoric decomposition. Unlike existing approaches that pre-specify distributional forms or require sampling-based inference, ProbFM learns optimal uncertainty representations through higher-order evidence learning while maintaining single-pass computational efficiency. To rigorously evaluate the core DER uncertainty quantification approach independent of architectural complexity, we conduct an extensive controlled comparison study using a consistent LSTM architecture across five probabilistic methods: DER, Gaussian NLL, Student’s-t NLL, Quantile Loss, and Conformal Prediction. Evaluation on cryptocurrency return forecasting demonstrates that DER maintains competitive forecasting accuracy while providing explicit epistemic-aleatoric uncertainty decomposition. This work establishes both an extensible framework for principled uncertainty quantification in foundation models and empirical evidence for DER’s effectiveness in financial applications.
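
For readers unfamiliar with the epistemic-aleatoric split, the sketch below shows how it is usually computed in the commonly cited Normal-Inverse-Gamma formulation of deep evidential regression, where a network outputs four parameters (gamma, nu, alpha, beta) per prediction. Whether ProbFM uses exactly this parameterization is an assumption on my part; the abstract does not say, and the function name is hypothetical.

```python
def der_uncertainty(gamma: float, nu: float, alpha: float, beta: float):
    """Split predictive uncertainty from Normal-Inverse-Gamma parameters.

    gamma : predicted mean
    nu    : evidence for the mean (must be > 0)
    alpha : shape parameter (must be > 1 for the variances to exist)
    beta  : scale parameter (must be > 0)
    Returns (prediction, aleatoric, epistemic).
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("require nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1)         # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1))  # uncertainty about the mean, Var[mu]
    return gamma, aleatoric, epistemic
```

Both quantities come from one set of network outputs, which is why no sampling or ensembling is needed. Note that the epistemic term shrinks as the evidence `nu` grows while the aleatoric term does not, matching the intuition that more data reduces model uncertainty but not inherent noise.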

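[AI-7] From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA ECIR’26

[Quick Read]: This paper addresses the difficulty of extracting and understanding data from complex, distributed genomic databases, in particular the restricted access to domain-specific databases that large language models (LLMs) face in genomic question answering (QA). Existing methods such as GeneGPT enhance LLMs through specialized API calls but are constrained by rigid API dependencies and limited adaptability. The key to the solution is GenomAgent, a multi-agent framework that coordinates specialized agents to handle complex genomics queries efficiently. On nine tasks from the GeneTuring benchmark it outperforms GeneGPT by 12% on average, and its architecture has the potential to extend to other scientific domains.

Link: https://arxiv.org/abs/2601.10581
Authors: Kimia Abedini, Farzad Shami, Gianmaria Silvello
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted paper by the 48th European Conference on Information Retrieval (ECIR’26)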

Abstract:Comprehending genomic information is essential for biomedical research, yet extracting data from complex distributed databases remains challenging. Large language models (LLMs) offer potential for genomic Question Answering (QA) but face limitations due to restricted access to domain-specific databases. GeneGPT is the current state-of-the-art system that enhances LLMs by utilizing specialized API calls, though it is constrained by rigid API dependencies and limited adaptability. We replicate GeneGPT and propose GenomAgent, a multi-agent framework that efficiently coordinates specialized agents for complex genomics queries. Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average, and its flexible architecture extends beyond genomics to various scientific domains needing expert knowledge extraction.

[AI-8] Generative AI collective behavior needs an interactionist paradigm

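[Quick Read]: This paper asks how to systematically understand the collective behavior of agents based on large language models (LLMs), in particular how that behavior is shaped by the interaction of prior knowledge and embedded values with social context, and how it in turn affects risks and benefits at the societal level. The key to the proposal is an "interactionist paradigm" consisting of alternative theoretical foundations, methodologies, and analytical tools. It aims to capture how the distinctive properties of LLMs (initialization with pre-trained knowledge, implicit social priors, and adaptation through in-context learning) jointly drive emergent phenomena in multi-agent generative AI systems.

Link: https://arxiv.org/abs/2601.10567
Authors: Laura Ferrarotti, Gian Maria Campedelli, Roberto Dessì, Andrea Baronchelli, Giovanni Iacca, Kathleen M. Carley, Alex Pentland, Joel Z. Leibo, James Evans, Bruno Lepri
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: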

Abstract:In this article, we argue that understanding the collective behavior of agents based on large language models (LLMs) is an essential area of inquiry, with important implications in terms of risks and benefits, impacting us as a society at many levels. We claim that the distinctive nature of LLMs–namely, their initialization with extensive pre-trained knowledge and implicit social priors, together with their capability of adaptation through in-context learning–motivates the need for an interactionist paradigm consisting of alternative theoretical foundations, methodologies, and analytical tools, in order to systematically examine how prior knowledge and embedded values interact with social context to shape emergent phenomena in multi-agent generative AI systems. We propose and discuss four directions that we consider crucial for the development and deployment of LLM-based collectives, focusing on theory, methods, and trans-disciplinary dialogue.

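[AI-9] Diagnosing Generalization Failures in Fine-Tuned LLMs: A Cross-Architectural Study on Phishing Detection

[Quick Read]: This paper addresses the poor generalization of fine-tuned large language models (LLMs), that is, why these models perform well on a specific task yet fail to adapt to new settings. The key to the solution is a multi-layered diagnostic framework that combines SHAP analysis with mechanistic interpretability to systematically dissect how different architectures (Llama 3.1 8B, Gemma 2 9B, and Mistral) behave on a high-stakes phishing detection task. The study finds that generalization is driven by a synergy between architecture and data diversity, that it is strongly architecture-dependent, and that some architectures (such as Mistral) are inherently more robust. The method offers an actionable way to identify the root causes of generalization failures and underlines that reliable AI requires validating the interplay between architecture, data, and training strategy.

Link: https://arxiv.org/abs/2601.10524
Authors: Frank Bobe III, Gregory D. Vetaw, Chase Pavlick, Darshan Bryner, Matthew Cook, Jose Salas-Vernis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures, 6 tables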

Abstract:The practice of fine-tuning Large Language Models (LLMs) has achieved state-of-the-art performance on specialized tasks, yet diagnosing why these models become brittle and fail to generalize remains a critical open problem. To address this, we introduce and apply a multi-layered diagnostic framework to a cross-architectural study. We fine-tune Llama 3.1 8B, Gemma 2 9B, and Mistral models on a high-stakes phishing detection task and use SHAP analysis and mechanistic interpretability to uncover the root causes of their generalization failures. Our investigation reveals three critical findings: (1) Generalization is driven by a powerful synergy between architecture and data diversity. The Gemma 2 9B model achieves state-of-the-art performance (91% F1), but only when trained on a stylistically diverse “generalist” dataset. (2) Generalization is highly architecture-dependent. We diagnose a specific failure mode in Llama 3.1 8B, which performs well on a narrow domain but cannot integrate diverse data, leading to a significant performance drop. (3) Some architectures are inherently more generalizable. The Mistral model proves to be a consistent and resilient performer across multiple training paradigms. By pinpointing the flawed heuristics responsible for these failures, our work provides a concrete methodology for diagnosing and understanding generalization failures, underscoring that reliable AI requires deep validation of the interplay between architecture, data, and training strategy.

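[AI-10] Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment

[Quick Read]: As AI agents become more autonomous and have real-world impact, this paper addresses how to ensure that their decisions are not only instrumentally effective but also normatively aligned. The key to the solution is GRACE (Governor for Reason-Aligned ContainmEnt), a neuro-symbolic, reason-based containment architecture whose core idea is to decouple normative reasoning from instrumental decision-making. A Moral Module (MM) uses deontic logic-based reasoning to determine permissible macro actions, a Decision-Making Module (DMM) selects instrumentally optimal primitive actions in accordance with them, and a Guard monitors and enforces moral compliance. The symbolic representation provides interpretability, contestability, and justifiability, and supports formal verification and statistical guarantees, so that AI agents of virtually any design can be contained and aligned.

Link: https://arxiv.org/abs/2601.10520
Authors: Felix Jahn, Yannic Muskalla, Lisa Dargasz, Patrick Schramowski, Kevin Baum
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 4 figures, accepted at 2nd Annual Conference of the International Association for Safe Ethical AI (IASEAI’26)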

Abstract:As AI agents become increasingly autonomous, widely deployed in consequential contexts, and efficacious in bringing about real-world impacts, ensuring that their decisions are not only instrumentally effective but also normatively aligned has become critical. We introduce a neuro-symbolic reason-based containment architecture, Governor for Reason-Aligned ContainmEnt (GRACE), that decouples normative reasoning from instrumental decision-making and can contain AI agents of virtually any design. GRACE restructures decision-making into three modules: a Moral Module (MM) that determines permissible macro actions via deontic logic-based reasoning; a Decision-Making Module (DMM) that encapsulates the target agent while selecting instrumentally optimal primitive actions in accordance with derived macro actions; and a Guard that monitors and enforces moral compliance. The MM uses a reason-based formalism providing a semantic foundation for deontic logic, enabling interpretability, contestability, and justifiability. Its symbolic representation enriches the DMM’s informational context and supports formal verification and statistical guarantees of alignment enforced by the Guard. We demonstrate GRACE on an example of a LLM therapy assistant, showing how it enables stakeholders to understand, contest, and refine agent behavior.

[AI-11] Scalable Algorithms for Approximate DNF Model Counting

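[Quick Read]: This paper addresses model counting for Disjunctive Normal Form (DNF) formulas, which matters in applications such as probabilistic inference and network reliability and is often used for query evaluation in probabilistic databases. Because exact counting is computationally intractable, prior work relies on approximation algorithms, including Monte Carlo methods, hashing-based techniques, and neural-network heuristics. The paper proposes a new Monte Carlo approximation whose key ingredients are an adaptive stopping rule and short-circuit formula evaluation. It is proved to achieve Probably Approximately Correct (PAC) bounds and to be asymptotically more efficient than previous methods. Experiments show improvements of orders of magnitude over prior algorithms and scaling to problems with millions of variables.

Link: https://arxiv.org/abs/2601.10511
Authors: Paul Burkhardt, David G. Harris, Kevin T Schmitt
Affiliations: Unknown
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: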

Abstract:Model counting of Disjunctive Normal Form (DNF) formulas is a critical problem in applications such as probabilistic inference and network reliability. For example, it is often used for query evaluation in probabilistic databases. Due to the computational intractability of exact DNF counting, there has been a line of research into a variety of approximation algorithms. These include Monte Carlo approaches such as the classical algorithms of Karp, Luby, and Madras (1989), as well as methods based on hashing (Soos et al. 2023), and heuristic approximations based on Neural Nets (Abboud, Ceylan, and Lukasiewicz 2020). We develop a new Monte Carlo approach with an adaptive stopping rule and short-circuit formula evaluation. We prove it achieves Probably Approximately Correct (PAC) learning bounds and is asymptotically more efficient than the previous methods. We also show experimentally that it out-performs prior algorithms by orders of magnitude, and can scale to much larger problems with millions of variables.
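
For context on the Monte Carlo baseline the abstract builds on, here is a minimal sketch of the classical Karp-Luby coverage estimator with a fixed sample budget. It is not the paper's algorithm: the adaptive stopping rule is omitted, and the clause encoding (a dict from variable index to required truth value) and the function name `karp_luby_count` are assumptions made for illustration.

```python
import random


def karp_luby_count(clauses, n_vars, n_samples, seed=0):
    """Estimate the number of satisfying assignments of a DNF formula.

    Each clause is a dict {variable_index: required_bool}; a clause with k
    literals is satisfied by exactly 2 ** (n_vars - k) assignments.
    """
    rng = random.Random(seed)
    weights = [2 ** (n_vars - len(c)) for c in clauses]
    total = sum(weights)  # size of the multiset union of all clause solutions
    hits = 0
    for _ in range(n_samples):
        # Pick a clause proportionally to its solution count, then a uniform
        # assignment satisfying that clause.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        assignment = [rng.random() < 0.5 for _ in range(n_vars)]
        for var, val in clauses[i].items():
            assignment[var] = val
        # Count the sample only if clause i is the first clause it satisfies;
        # all() and next() stop early, a simple form of short-circuiting.
        first = next(
            j for j, c in enumerate(clauses)
            if all(assignment[v] == val for v, val in c.items())
        )
        hits += first == i
    return total * hits / n_samples
```

A call such as `karp_luby_count([{0: True}, {1: True}], n_vars=2, n_samples=20000)` estimates the count for x0 OR x1, whose exact value is 3.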

[AI-12] Projected Microbatch Accumulation yields reference-free proximal policy updates for reinforcement learning

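[Quick Read]: This paper addresses unstable policy updates during fine-tuning of large language models (LLMs), in particular the training oscillation and degradation that follow from poorly controlled local KL divergence. The key to the solution is Projected Microbatch Accumulation (PROMA), which projects out sequence-wise gradient components layer by layer during the backward pass, before microbatches are aggregated. This requires no additional forward or backward passes. Empirically, PROMA controls local KL divergence more tightly than GRPO (Group Relative Policy Optimization) and yields more stable policy learning. Unlike PPO (Proximal Policy Optimization) and GRPO, it achieves proximal updates without inducing entropy collapse and does not rely on a reference policy or likelihood-ratio clipping.

Link: https://arxiv.org/abs/2601.10498
Authors: Nilin Abrahamsen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: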

Abstract:This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy update method for large language model fine-tuning. PROMA accumulates policy gradients across microbatches by projecting out sequence-wise gradient components before microbatch aggregation. The projection is applied layer-wise during the backward pass, enabling efficient implementation without additional forward or backward passes. Empirically, PROMA enforces tighter control of local KL divergence than GRPO, resulting in more stable policy learning. Unlike PPO and GRPO, PROMA achieves proximal updates without inducing entropy collapse and does not rely on a reference policy or likelihood-ratio clipping.

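[AI-13] Model See Model Do? Exposure-Aware Evaluation of Bug-vs-Fix Preference in Code LLMs

[Quick Read]: Large language models (LLMs) used for code generation and debugging may emit buggy code learned from their training data, and it is hard to tell whether a model's preference reflects a tendency toward correct code or familiarity with an incorrect version. The key to the solution is an exposure-aware evaluation framework. It applies Data Portraits membership testing to the Stack-V2 training corpus to estimate whether each buggy and fixed variant was seen during training, then stratifies examples by exposure and compares model preference using code completion and several likelihood-based scoring metrics. Most examples have neither variant in the training data, and when only one variant is present it is more often the fix. In generation, however, models still reproduce buggy lines far more often than fixes, which indicates that exposure can skew evaluation results and that models may propagate memorised errors.

Link: https://arxiv.org/abs/2601.10496
Authors: Ali Al-Kaswan, Claudio Spiess, Prem Devanbu, Arie van Deursen, Maliheh Izadi
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: MSR 2026 Technical Track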

Abstract:Large language models are increasingly used for code generation and debugging, but their outputs can still contain bugs that originate from training data. Whether an LLM prefers correct code or a familiar incorrect version might be influenced by what it’s been exposed to during training. We introduce an exposure-aware evaluation framework that quantifies how prior exposure to buggy versus fixed code influences a model’s preference. Using the ManySStuBs4J benchmark, we apply Data Portraits for membership testing on the Stack-V2 corpus to estimate whether each buggy and fixed variant was seen during training. We then stratify examples by exposure and compare model preference using code completion as well as multiple likelihood-based scoring metrics. We find that most examples (67%) have neither variant in the training data, and when only one is present, fixes are more frequently present than bugs. In model generations, models reproduce buggy lines far more often than fixes, with bug-exposed examples amplifying this tendency and fix-exposed examples showing only marginal improvement. In likelihood scoring, minimum and maximum token-probability metrics consistently prefer the fixed code across all conditions, indicating a stable bias toward correct fixes. In contrast, metrics like the Gini coefficient reverse preference when only the buggy variant was seen. Our results indicate that exposure can skew bug-fix evaluations and highlight the risk that LLMs may propagate memorised errors in practice.
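
The likelihood-based metrics named in the abstract (minimum and maximum token probability, Gini coefficient) can be sketched as follows. This is an illustrative guess at how such scores might be computed from per-token log-probabilities; the exact definitions used in the paper are not given in the abstract, and `score_line` and `gini` are hypothetical names.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def score_line(token_logprobs):
    """Summarize one code line from the log-probabilities of its tokens."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return {
        "min_prob": min(probs),  # the single least expected token
        "max_prob": max(probs),  # the single most expected token
        "gini": gini(probs),     # how unevenly probability is spread
    }
```

To compare a buggy line with its fix, one would score both variants under the same model and check which variant each metric prefers.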

[AI-14] Panning for Gold: Expanding Domain-Specific Knowledge Graphs with General Knowledge

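[Quick Read]: This paper addresses the limited coverage of domain-specific knowledge graphs (DKGs) compared with general knowledge graphs (GKGs). To enrich DKGs, the authors introduce a new task, Domain-specific Knowledge Graph Fusion (DKGF), whose core challenges are high ambiguity in domain relevance and misaligned knowledge granularity across graphs. The key to the solution is ExeFuse, which follows a Fact-as-Program paradigm: each GKG fact is treated as a latent semantic program, abstract relations are mapped to granularity-aware operators, and domain relevance is verified by whether the program is executable on the target DKG. Relevance and granularity alignment are thus resolved jointly within a unified probabilistic framework.

Link: https://arxiv.org/abs/2601.10485
Authors: Runhao Zhao, Weixin Zeng, Wentao Zhang, Chong Chen, Zhengpin Li, Xiang Zhao, Lei Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures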

Abstract:Domain-specific knowledge graphs (DKGs) often lack coverage compared to general knowledge graphs (GKGs). To address this, we introduce Domain-specific Knowledge Graph Fusion (DKGF), a novel task that enriches DKGs by integrating relevant facts from GKGs. DKGF faces two key challenges: high ambiguity in domain relevance and misalignment in knowledge granularity across graphs. We propose ExeFuse, a simple yet effective Fact-as-Program paradigm. It treats each GKG fact as a latent semantic program, maps abstract relations to granularity-aware operators, and verifies domain relevance via program executability on the target DKG. This unified probabilistic framework jointly resolves relevance and granularity issues. We construct two benchmarks, DKGF(W-I) and DKGF(Y-I), with 21 evaluation configurations. Extensive experiments validate the task’s importance and our model’s effectiveness, providing the first standardized testbed for DKGF.

[AI-15] NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models

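[Quick Read]: This paper addresses the prohibitive retraining costs and systemic risks of upgrading legacy Gradient Boosted Decision Tree (GBDT) models in high-concurrency industrial environments. The key to the solution is NSR-Boost, a non-intrusive neuro-symbolic residual boosting framework. It treats the existing model as frozen and performs targeted repairs on the "hard regions" where predictions fail: residual analysis locates those regions, a large language model (LLM) generates interpretable symbolic expert structures whose parameters are tuned with Bayesian optimization, and a lightweight aggregator dynamically combines expert outputs with the legacy model's output. The result is a safe, low-cost way to evolve a deployed model.

Link: https://arxiv.org/abs/2601.10457
Authors: Ziming Dai, Dabiao Ma, Jinle Tong, Mengyuan Han, Jian Yang, Haojun Fei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: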

Abstract:Although Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency production environments still faces prohibitive retraining costs and systemic risks. To address this problem, we present NSR-Boost, a neuro-symbolic residual boosting framework designed specifically for industrial scenarios. Its core advantage lies in being “non-intrusive”. It treats the legacy model as a frozen model and performs targeted repairs on “hard regions” where predictions fail. The framework comprises three key stages: first, finding hard regions through residuals; then, generating interpretable experts by producing symbolic code structures with a Large Language Model (LLM) and fine-tuning their parameters using Bayesian optimization; and finally, dynamically integrating experts with the legacy model output through a lightweight aggregator. We report on the successful deployment of NSR-Boost within the core financial risk control system at Qfin Holdings. This framework not only significantly outperforms state-of-the-art (SOTA) baselines across six public datasets and one private dataset but, more importantly, shows excellent performance gains on real-world online data. In conclusion, it effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industry.
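
The three stages described in the abstract (find hard regions from residuals, fit an expert there, aggregate with the frozen model) can be sketched in one dimension. This is a toy illustration only: the LLM-generated symbolic expert and the Bayesian optimization step are replaced here by a constant correction, and every function name is hypothetical.

```python
def legacy_model(x):
    """Hypothetical frozen legacy model; it is never retrained."""
    return 2.0 * x


def find_hard_region(xs, ys, model, tol):
    """Stage 1: return the (low, high) input range where residuals exceed tol."""
    hard = [x for x, y in zip(xs, ys) if abs(y - model(x)) > tol]
    return (min(hard), max(hard)) if hard else None


def fit_expert(xs, ys, model, region):
    """Stage 2 (stand-in): a constant correction equal to the mean residual."""
    lo, hi = region
    residuals = [y - model(x) for x, y in zip(xs, ys) if lo <= x <= hi]
    return sum(residuals) / len(residuals)


def boosted_predict(x, model, region, correction):
    """Stage 3: add the expert's correction only inside the hard region."""
    lo, hi = region
    return model(x) + (correction if lo <= x <= hi else 0.0)
```

Outside the hard region the prediction is exactly the legacy output, which is the sense in which such a scheme is non-intrusive.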

[AI-16] Agent Guardian: Learning Access Control Policies to Govern AI Agent Behavior

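[Quick Read]: This paper addresses the security risks that arise when AI agents perform unauthorized actions or handle inputs inappropriately, with the goal of keeping agent behavior within defined access-control policies and preserving system integrity. The key to the solution is AgentGuardian, a security framework that monitors execution traces during a controlled staging phase to learn legitimate behaviors and input patterns, and derives context-aware, adaptive policies from them. These policies regulate the agent's tool calls using both real-time input context and the control-flow dependencies of multi-step actions. The framework detects malicious or misleading inputs while also mitigating hallucination-driven errors and orchestration-level malfunctions.

Link: https://arxiv.org/abs/2601.10440
Authors: Nadya Abaev, Denis Klimov, Gerard Levinov, David Mimran, Yuval Elovici, Asaf Shabtai
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures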

Abstract:Artificial intelligence (AI) agents are increasingly used in a variety of domains to automate tasks, interact with users, and make decisions based on data inputs. Ensuring that AI agents perform only authorized actions and handle inputs appropriately is essential for maintaining system integrity and preventing misuse. In this study, we introduce the AgentGuardian, a novel security framework that governs and protects AI agent operations by enforcing context-aware access-control policies. During a controlled staging phase, the framework monitors execution traces to learn legitimate agent behaviors and input patterns. From this phase, it derives adaptive policies that regulate tool calls made by the agent, guided by both real-time input context and the control flow dependencies of multi-step agent actions. Evaluation across two real-world AI agent applications demonstrates that AgentGuardian effectively detects malicious or misleading inputs while preserving normal agent functionality. Moreover, its control-flow-based governance mechanism mitigates hallucination-driven errors and other orchestration-level malfunctions.
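
The idea of learning a policy from staging traces and enforcing it on tool calls can be sketched with a simple allow-list of tool-to-tool transitions. The abstract does not describe AgentGuardian's actual policy representation, so this is only an assumption-laden toy; `learn_policy`, `is_allowed`, and the tool names are hypothetical, and input-context checks are left out.

```python
START = "<start>"


def learn_policy(staging_traces):
    """Collect every (previous_tool, next_tool) pair seen during staging."""
    allowed = set()
    for trace in staging_traces:
        prev = START
        for tool in trace:
            allowed.add((prev, tool))
            prev = tool
    return allowed


def is_allowed(policy, prev_tool, tool):
    """Permit a tool call only if this control-flow step was seen in staging."""
    return (prev_tool, tool) in policy


# Hypothetical staging traces for a support agent.
policy = learn_policy([
    ["search_orders", "read_order", "send_reply"],
    ["search_orders", "send_reply"],
])
```

With this policy, a `delete_order` call right after `read_order` would be rejected because that transition never occurred in staging.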

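[AI-17] Development of Ontological Knowledge Bases by Leveraging Large Language Models

[Quick Read]: This paper addresses the poor scalability, consistency, and adaptability of the traditional manual development of Ontological Knowledge Bases (OKBs). The key to the solution is a structured, iterative methodology that uses Generative AI, in particular large language models (LLMs), to optimize knowledge acquisition, generate ontology artifacts automatically, and run continuous refinement cycles. The authors report faster and more consistent ontology construction, together with better transparency and bias mitigation in the engineering process.

Link: https://arxiv.org/abs/2601.10436
Authors: Le Ngoc Luyen, Marie-Hélène Abel, Philippe Gouspillou
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: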

Abstract:Ontological Knowledge Bases (OKBs) play a vital role in structuring domain-specific knowledge and serve as a foundation for effective knowledge management systems. However, their traditional manual development poses significant challenges related to scalability, consistency, and adaptability. Recent advancements in Generative AI, particularly Large Language Models (LLMs), offer promising solutions for automating and enhancing OKB development. This paper introduces a structured, iterative methodology leveraging LLMs to optimize knowledge acquisition, automate ontology artifact generation, and enable continuous refinement cycles. We demonstrate this approach through a detailed case study focused on developing a user context profile ontology within the vehicle sales domain. Key contributions include significantly accelerated ontology construction processes, improved ontological consistency, effective bias mitigation, and enhanced transparency in the ontology engineering process. Our findings highlight the transformative potential of integrating LLMs into ontology development, notably improving scalability, integration capabilities, and overall efficiency in knowledge management systems.

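[AI-18] LLMdoctor: Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment of Large Language Models AAAI26

[Quick Read]: This paper addresses the high computational cost and inflexibility of aligning large language models (LLMs) with human preferences, and in particular how to align efficiently at test time while preserving generation diversity. Traditional fine-tuning is effective but expensive, while existing test-time alignment methods often rely on distorted trajectory-level signals or inefficient sampling, which caps performance and harms diversity. The key to the solution is LLMdoctor, which follows a patient-doctor paradigm: it extracts fine-grained token-level preference signals from the behavioral variations of the patient model and uses Token-level Flow-guided Preference Optimization (TFPO) to train a lightweight doctor model. By establishing flow consistency across all subtrajectories, TFPO enables precise token-by-token alignment while preserving the base model's generation diversity. The method outperforms existing test-time alignment methods and even surpasses full fine-tuning approaches such as DPO.

Link: https://arxiv.org/abs/2601.10416
Authors: Tiesunlong Shen, Rui Mao, Jin Wang, Heming Sun, Jian Zhang, Xuejie Zhang, Erik Cambria
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI26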

Abstract:Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising alternative, existing approaches often rely on distorted trajectory-level signals or inefficient sampling, fundamentally capping performance and failing to preserve the generative diversity of the base model. This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. It integrates token-level reward acquisition with token-level flow-guided preference optimization (TFPO) to steer a large, frozen patient LLM with a smaller, specialized doctor model. Unlike conventional methods that rely on trajectory-level rewards, LLMdoctor first extracts fine-grained, token-level preference signals from the patient model’s behavioral variations. These signals then guide the training of the doctor model via TFPO, which establishes flow consistency across all subtrajectories, enabling precise token-by-token alignment while inherently preserving generation diversity. Extensive experiments demonstrate that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches like DPO.

[AI-19] LADFA: A Framework of Using Large Language Models and Retrieval-Augmented Generation for Personal Data Flow Analysis in Privacy Policies

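[Quick Read]: This paper addresses the difficulty of automatically extracting and analysing personal data flows in privacy policies, whose text is structurally complex, written in dense legal language, and inconsistent across organisations. The key to the solution is LADFA, an end-to-end computational framework that combines large language models (LLMs) with retrieval-augmented generation (RAG) and a customised knowledge base derived from existing studies. It extracts personal data flows from unstructured policy text, builds a data flow graph, and analyses that graph to support insight discovery.

Link: https://arxiv.org/abs/2601.10413
Authors: Haiyue Yuan, Nikolay Matyunin, Ali Raza, Shujun Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: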

Abstract:Privacy policies help inform people about organisations’ personal data processing practices, covering different aspects such as data collection, data storage, and sharing of personal data with third parties. Privacy policies are often difficult for people to fully comprehend due to the lengthy and complex legal language used and inconsistent practices across different sectors and organisations. To help conduct automated and large-scale analyses of privacy policies, many researchers have studied applications of machine learning and natural language processing techniques, including large language models (LLMs). While a limited number of prior studies utilised LLMs for extracting personal data flows from privacy policies, our approach builds on this line of work by combining LLMs with retrieval-augmented generation (RAG) and a customised knowledge base derived from existing studies. This paper presents the development of LADFA, an end-to-end computational framework, which can process unstructured text in a given privacy policy, extract personal data flows and construct a personal data flow graph, and conduct analysis of the data flow graph to facilitate insight discovery. The framework consists of a pre-processor, an LLM-based processor, and a data flow post-processor. We demonstrated and validated the effectiveness and accuracy of the proposed approach by conducting a case study that involved examining ten selected privacy policies from the automotive industry. Moreover, it is worth noting that LADFA is designed to be flexible and customisable, making it suitable for a range of text-based analysis tasks beyond privacy policy analysis.

[AI-20] ErrEval: Error-Aware Evaluation for Question Generation through Explicit Diagnostics

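[Quick Read]: This paper addresses evaluation bias in automatic question generation (QG) caused by critical defects such as factual hallucinations and answer mismatches. Existing LLM-based evaluators mostly follow a black-box, holistic scoring paradigm without explicit error modeling, so they overestimate the quality of poor questions. The key to the solution is ErrEval, which recasts evaluation as a two-stage process of error diagnosis followed by informed scoring. A lightweight, plug-and-play Error Identifier first detects and categorizes common structural, linguistic, and content-related errors. These diagnostic signals are then passed to the LLM evaluator as explicit evidence, guiding it toward more fine-grained and grounded judgments. This improves agreement with human judgments and mitigates the overestimation of low-quality questions.

Link: https://arxiv.org/abs/2601.10406
Authors: Weiping Fu, Bifan Wei, Jingyi Hao, Yushun Zhang, Jian Zhang, Jiaxin Wang, Bo Li, Yu He, Lingling Zhang, Jun Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: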

Abstract:Automatic Question Generation (QG) often produces outputs with critical defects, such as factual hallucinations and answer mismatches. However, existing evaluation methods, including LLM-based evaluators, mainly adopt a black-box and holistic paradigm without explicit error modeling, leading to the neglect of such defects and overestimation of question quality. To address this issue, we propose ErrEval, a flexible and Error-aware Evaluation framework that enhances QG evaluation through explicit error diagnostics. Specifically, ErrEval reformulates evaluation as a two-stage process of error diagnosis followed by informed scoring. At the first stage, a lightweight plug-and-play Error Identifier detects and categorizes common errors across structural, linguistic, and content-related aspects. These diagnostic signals are then incorporated as explicit evidence to guide LLM evaluators toward more fine-grained and grounded judgments. Extensive experiments on three benchmarks demonstrate the effectiveness of ErrEval, showing that incorporating explicit diagnostics improves alignment with human judgments. Further analyses confirm that ErrEval effectively mitigates the overestimation of low-quality questions.

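[AI-21] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

[Quick Read]: This paper addresses the bottleneck of ultra-long-horizon autonomy in AI-driven scientific exploration, that is, sustaining strategic coherence and iterative correction over experimental cycles spanning days or weeks. Large language models (LLMs) handle short-horizon reasoning well, but in high-dimensional, delayed-feedback research environments they are easily overwhelmed by execution details and fail to turn sparse feedback into coherent long-term guidance. The key to the solution is Hierarchical Cognitive Caching (HCC), a multi-tiered mechanism inspired by computer systems that dynamically distills transient execution traces into stable knowledge and cross-task wisdom. This decouples immediate execution from long-term experimental strategy and overcomes the scaling limits of static context windows. With it, ML-Master 2.0 reaches a state-of-the-art medal rate of 56.44% on OpenAI's MLE-Bench under 24-hour budgets.

Link: https://arxiv.org/abs/2601.10402
Authors: Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages, 5 figures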

Abstract:The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI’s MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

[AI-22] LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries

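[Quick Read]: In LLM-based text-to-SQL systems, unanswerable or underspecified user queries can lead the model to generate plausible but wrong SQL that yields misleading results or violates safety constraints. Existing refusal strategies either rely on output-level instruction following, which is brittle because of model hallucinations, or estimate output uncertainty, which adds complexity and overhead. The paper formalizes safe refusal as an answerability-gating problem and proposes LatentRefusal, a refusal mechanism that predicts answerability from the intermediate hidden activations of the LLM. Its core component is the Tri-Residual Gated Encoder, a lightweight probing architecture that suppresses schema noise and amplifies the sparse, localized cues of question-schema mismatch. Across four benchmarks the approach raises average F1 to 88.5% while adding about 2 milliseconds of probe overhead.

Link: https://arxiv.org/abs/2601.10398
Authors: Xuancheng Ren, Shijing Hu, Zhihui Lu, Jiangqi Huang, Qiang Duan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: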

Abstract:In LLM-based text-to-SQL systems, unanswerable and underspecified user queries may generate not only incorrect text but also executable programs that yield misleading results or violate safety constraints, posing a major barrier to safe deployment. Existing refusal strategies for such queries either rely on output-level instruction following, which is brittle due to model hallucinations, or estimate output uncertainty, which adds complexity and overhead. To address this challenge, we formalize safe refusal in text-to-SQL systems as an answerability-gating problem and propose LatentRefusal, a latent-signal refusal mechanism that predicts query answerability from intermediate hidden activations of a large language model. We introduce the Tri-Residual Gated Encoder, a lightweight probing architecture, to suppress schema noise and amplify sparse, localized cues of question-schema mismatch that indicate unanswerability. Extensive empirical evaluations across diverse ambiguous and unanswerable settings, together with ablation studies and interpretability analyses, demonstrate the effectiveness of the proposed approach and show that LatentRefusal provides an attachable and efficient safety layer for text-to-SQL systems. Across four benchmarks, LatentRefusal improves average F1 to 88.5 percent on both backbones while adding approximately 2 milliseconds of probe overhead.

[AI-23] C-GRASP: Clinically-Grounded Reasoning for Affective Signal Processing

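[Quick Read]: This paper addresses the unreliability of large language models (LLMs) in interpreting heart rate variability (HRV) because of physiological hallucinations: respiratory sinus arrhythmia (RSA) contamination, the instability of nonlinear metrics on short recordings, and the "population bias" of relying on population norms instead of individual baselines. The key to the solution is C-GRASP (Clinically-Grounded Reasoning for Affective Signal Processing), a guardrailed, RAG-enhanced pipeline that decomposes HRV interpretation into eight traceable reasoning steps. Its core idea is a Z-score Priority Hierarchy that weights individualized baseline shifts above normative statistics. The system reaches 37.3% accuracy on 4-class emotion classification and a Clinical Reasoning Consistency (CRC) score of 69.6%, and ablations indicate that the individualized Delta Z-score module is the critical logical anchor.

Link: https://arxiv.org/abs/2601.10342
Authors: Cheng Lin Cheng, Ting Chuan Lin, Chai Kai Chang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: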

Abstract:Heart rate variability (HRV) is a pivotal noninvasive marker for autonomic monitoring; however, applying Large Language Models (LLMs) to HRV interpretation is hindered by physiological hallucinations. These include respiratory sinus arrhythmia (RSA) contamination, short-data instability in nonlinear metrics, and the neglect of individualized baselines in favor of population norms. We propose C-GRASP (Clinically-Grounded Reasoning for Affective Signal Processing), a guardrailed RAG-enhanced pipeline that decomposes HRV interpretation into eight traceable reasoning steps. Central to C-GRASP is a Z-score Priority Hierarchy that enforces the weighting of individualized baseline shifts over normative statistics. The system effectively mitigates spectral hallucinations through automated RSA-aware guardrails, preventing contamination of frequency-domain indices. Evaluated on 414 trials from the DREAMER dataset, C-GRASP integrated with high-scale reasoning models (e.g., MedGemma3-thinking) achieved superior performance in 4-class emotion classification (37.3% accuracy) and a Clinical Reasoning Consistency (CRC) score of 69.6%. Ablation studies confirm that the individualized Delta Z-score module serves as the critical logical anchor, preventing the “population bias” common in native LLMs. Ultimately, C-GRASP transitions affective computing from black-box classification to transparent, evidence-based clinical decision support, paving the way for safer AI integration in biomedical engineering.
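
The preference for individualized baselines over population norms can be sketched as a z-score that falls back to population statistics only when too little personal data exists. The abstract does not spell out the rules of the Z-score Priority Hierarchy, so the fallback threshold, the function names, and the example values below are all assumptions.

```python
from statistics import mean, stdev


def z_score(value, reference):
    """Standard score of value relative to a list of reference measurements."""
    return (value - mean(reference)) / stdev(reference)


def prioritized_z(value, personal_baseline, population, min_baseline_len=3):
    """Prefer the person's own baseline; use population norms as a fallback."""
    if len(personal_baseline) >= min_baseline_len:
        return "individual", z_score(value, personal_baseline)
    return "population", z_score(value, population)
```

For example, a reading of 46 against a personal baseline of 40, 42, and 44 is two standard deviations above that person's norm, even if it looks low against a population sample.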

[AI-24] SPIKE: Sparse Koopman Regularization for Physics-Informed Neural Networks

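[Quick Read]: This paper addresses the tendency of Physics-Informed Neural Networks (PINNs) to overfit within the training domain, which leads to poor generalization when extrapolating. The key to the solution is to regularize PINNs with continuous-time Koopman operators, yielding the SPIKE (Sparse Physics-Informed Koopman-Enhanced) framework. It enforces linear dynamics $ \frac{dz}{dt} = Az $ in a learned observable space, either with L1 regularization on the matrix $ A $ (SPIKE) or without an explicit sparsity constraint (PIKE), and learns a sparse generator matrix that gives a compact representation of the low-dimensional structure of complex dynamical systems. The method is reported to improve temporal extrapolation, spatial generalization, and long-term prediction accuracy across parabolic, hyperbolic, dispersive, and stiff partial differential equations (PDEs), including fluid dynamics (Navier-Stokes) and chaotic ordinary differential equations (ODEs, such as the Lorenz system).

Link: https://arxiv.org/abs/2601.10282
Authors: Jose Marie Antonio Minoza
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract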

Abstract:Physics-Informed Neural Networks (PINNs) provide a mesh-free approach for solving differential equations by embedding physical constraints into neural network training. However, PINNs tend to overfit within the training domain, leading to poor generalization when extrapolating beyond trained spatiotemporal regions. This work presents SPIKE (Sparse Physics-Informed Koopman-Enhanced), a framework that regularizes PINNs with continuous-time Koopman operators to learn parsimonious dynamics representations. By enforcing linear dynamics $dz/dt = Az$ in a learned observable space, both PIKE (without explicit sparsity) and SPIKE (with L1 regularization on $A$) learn sparse generator matrices, embodying the parsimony principle that complex dynamics admit low-dimensional structure. Experiments across parabolic, hyperbolic, dispersive, and stiff PDEs, including fluid dynamics (Navier-Stokes) and chaotic ODEs (Lorenz), demonstrate consistent improvements in temporal extrapolation, spatial generalization, and long-term prediction accuracy. The continuous-time formulation with matrix exponential integration provides unconditional stability for stiff systems while avoiding diagonal dominance issues inherent in discrete-time Koopman operators.
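
To make the regularizer in this abstract concrete, here is a minimal sketch of a Koopman-style penalty: the squared residual of dz/dt - Az over sampled observables, plus an optional L1 term on A. The function names, the sample data, and the `l1_weight` parameter are illustrative assumptions; in the paper the observables come from a trained network and the penalty is added to the PINN loss, neither of which is reproduced here.

```python
# Illustrative sketch only: names and data are assumptions, not the paper's code.

def matvec(A, z):
    """Multiply matrix A (list of rows) by vector z."""
    return [sum(a * x for a, x in zip(row, z)) for row in A]

def koopman_loss(A, zs, dzs, l1_weight=0.0):
    """Mean squared residual of dz/dt - A z, plus l1_weight * sum |A_ij|."""
    residual = 0.0
    for z, dz in zip(zs, dzs):
        pred = matvec(A, z)
        residual += sum((d - p) ** 2 for d, p in zip(dz, pred))
    residual /= len(zs)
    l1 = sum(abs(a) for row in A for a in row)
    return residual + l1_weight * l1

# Observables that follow exactly linear dynamics (a rotation generator).
A_true = [[0.0, -1.0], [1.0, 0.0]]
zs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
dzs = [matvec(A_true, z) for z in zs]

pike_loss = koopman_loss(A_true, zs, dzs)                  # no sparsity term
spike_loss = koopman_loss(A_true, zs, dzs, l1_weight=0.1)  # with L1 on A
```

With `l1_weight` set to zero the penalty corresponds to the PIKE variant described above; a positive weight corresponds to SPIKE and pushes entries of A toward zero.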

[AI-25] Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers

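[Quick Read]: This paper studies how an LLM serving system that faces multiple task types should allocate internal thinking tokens to balance accuracy against latency. Queries from N task types arrive as a heterogeneous Poisson stream, and the server assigns each type a fixed number of tokens, which determines the compute spent on it. The allocation drives both service time (approximately affine in the token count) and response accuracy (with diminishing returns). The authors formulate a constrained optimization problem that maximizes weighted average accuracy penalized by mean system time, subject to queue-stability conditions and architectural token-budget constraints. The key steps are: first, proving that the objective is strictly concave over the stability region, which guarantees a unique optimum; second, deriving a coupled projected fixed-point characterization from the first-order optimality conditions, with an iterative algorithm, a sufficient condition for contraction, and a projected gradient method with a computable global step-size bound; and finally, rounding the continuous solution to integer token allocations and evaluating the resulting performance loss in simulation.

Link: https://arxiv.org/abs/2601.10274
Authors: Emre Ozbas,Melih Bastopcu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Optimization and Control (math.OC)
Comments:

Click to view abstract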

Abstract:We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to N distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. The token allocation induces an accuracy-latency trade-off: the service time follows an approximately affine function of the allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an M/G/1 queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time, subject to architectural token-budget constraints and queue-stability conditions. The objective function is shown to be strictly concave over the stability region, which ensures existence and uniqueness of the optimal token allocation. The first-order optimality conditions yield a coupled projected fixed-point characterization of the optimum, together with an iterative solution and an explicit sufficient condition for contraction. Moreover, a projected gradient method with a computable global step-size bound is developed to guarantee convergence beyond the contractive regime. Finally, integer-valued token allocations are attained via rounding of the continuous solution, and the resulting performance loss is evaluated in simulation results.
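
The trade-off in this abstract can be sketched with the standard Pollaczek-Khinchine formula for the mean system time of an M/G/1 queue, E[T] = E[S] + λE[S²] / (2(1 - λE[S])). Everything else below is an assumption made for illustration: the affine coefficients `a` and `b`, the accuracy curve 1 - exp(-c·n) standing in for "diminishing returns", and all numeric values.

```python
import math

def mean_system_time(lam, probs, service_times):
    """Pollaczek-Khinchine mean system time for an M/G/1 FIFO queue."""
    es = sum(p * s for p, s in zip(probs, service_times))        # E[S]
    es2 = sum(p * s * s for p, s in zip(probs, service_times))   # E[S^2]
    rho = lam * es                                               # server load
    if rho >= 1.0:
        return math.inf                                          # unstable queue
    return es + lam * es2 / (2.0 * (1.0 - rho))

def objective(tokens, lam, probs, weights, a, b, c, alpha):
    """Weighted accuracy minus alpha times mean system time (hypothetical forms)."""
    services = [a + b * n for n in tokens]                       # affine service time
    acc = [1.0 - math.exp(-ci * n) for ci, n in zip(c, tokens)]  # assumed accuracy curve
    avg_acc = sum(p * w * x for p, w, x in zip(probs, weights, acc))
    return avg_acc - alpha * mean_system_time(lam, probs, services)

# Two task types, compared under a small and a large token allocation.
params = dict(lam=0.2, probs=[0.6, 0.4], weights=[1.0, 1.0],
              a=0.5, b=0.01, c=[0.02, 0.005], alpha=0.05)
small = objective([50, 50], **params)
large = objective([200, 200], **params)
```

Evaluating `objective` on a grid of integer allocations is one simple way to explore the trade-off; the paper's fixed-point and projected gradient methods are not reproduced here.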

[AI-26] NoReGeo: Non-Reasoning Geometry Benchmark

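[Quick Read]: This paper targets a limitation in the geometric understanding of large language models (LLMs): existing benchmarks mainly test algebra-based reasoning rather than intrinsic spatial perception, so they do not reveal whether models natively encode geometric properties and spatial relationships. The authors propose the NoReGeo benchmark, whose key design is a set of 2,500 geometric problems across 25 categories that can be answered without algebraic computation or logical reasoning, assuming known object locations, so that native geometric understanding is evaluated directly. Experiments show that even state-of-the-art models such as GPT-4 reach at most 65% accuracy on binary classification tasks, and ablations indicate that this ability does not emerge through fine-tuning alone, which points to the need for a specialized training approach from the outset.

Link: https://arxiv.org/abs/2601.10254
Authors: Irina Abdullaeva,Anton Vasiliuk,Elizaveta Goncharova,Temurbek Rahmatullaev,Zagorulko Ivan,Maxim Kurkin,Andrey Kuznetsov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract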

Abstract:We present NoReGeo, a novel benchmark designed to evaluate the intrinsic geometric understanding of large language models (LLMs) without relying on reasoning or algebraic computation. Unlike existing benchmarks that primarily assess models’ proficiency in reasoning-based geometry-where solutions are derived using algebraic methods-NoReGeo focuses on evaluating whether LLMs can inherently encode spatial relationships and recognize geometric properties directly. Our benchmark comprises 2,500 trivial geometric problems spanning 25 categories, each carefully crafted to be solvable purely through native geometric understanding, assuming known object locations. We assess a range of state-of-the-art models on NoReGeo, including frontier models like GPT-4, observing that even the most advanced systems achieve an overall maximum of 65% accuracy in binary classification tasks. Further, our ablation experiments demonstrate that such geometric understanding does not emerge through fine-tuning alone, indicating that effective training for geometric comprehension requires a specialized approach from the outset. Our findings highlight a significant gap in current LLMs’ ability to natively grasp geometric concepts, providing a foundation for future research toward models with true geometric cognition.

[AI-27] X-SAM: Boosting Sharpness-Aware Minimization with Dominant-Eigenvector Gradient Correction

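[Quick Read]: This paper addresses a mismatch between the optimization behavior of Sharpness-Aware Minimization (SAM) and its theoretical expectations: because both sharp and flat regions can yield a small perturbed loss, the gradient may still point toward sharp regions, which weakens SAM's intended generalization benefit. The key to the solution is to revisit SAM from a spectral and geometric perspective. Using the angle between the gradient and the leading eigenvector of the Hessian as a measure of sharpness, the authors show that SAM's regularization effect can weaken when this angle is at most 90°. They then propose explicit eigenvector-aligned SAM (X-SAM), which corrects the gradient via orthogonal decomposition along the top eigenvector so that the Hessian's maximum eigenvalue is regularized more directly and efficiently.

Link: https://arxiv.org/abs/2601.10251
Authors: Hongru Duan,Yongle Chen,Lei Guan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract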

Abstract:Sharpness-Aware Minimization (SAM) aims to improve generalization by minimizing a worst-case perturbed loss over a small neighborhood of model parameters. However, during training, its optimization behavior does not always align with theoretical expectations, since both sharp and flat regions may yield a small perturbed loss. In such cases, the gradient may still point toward sharp regions, failing to achieve the intended effect of SAM. To address this issue, we investigate SAM from a spectral and geometric perspective: specifically, we utilize the angle between the gradient and the leading eigenvector of the Hessian as a measure of sharpness. Our analysis illustrates that when this angle is less than or equal to ninety degrees, the effect of SAM’s sharpness regularization can be weakened. Furthermore, we propose an explicit eigenvector-aligned SAM (X-SAM), which corrects the gradient via orthogonal decomposition along the top eigenvector, enabling more direct and efficient regularization of the Hessian’s maximum eigenvalue. We prove X-SAM’s convergence and superior generalization, with extensive experimental evaluations confirming both theoretical and practical advantages.
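
The two geometric quantities named in this abstract, the angle between the gradient and the Hessian's leading eigenvector and the orthogonal decomposition of the gradient along it, can be sketched as follows. This is not X-SAM's update rule, which the abstract does not spell out. The toy Hessian, the gradient, and the use of power iteration (which assumes a positive semidefinite Hessian with a unique dominant eigenvalue) are all assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_eigenvector(H, iters=200):
    """Power iteration for the dominant eigenvector of a symmetric PSD matrix."""
    v = [1.0] * len(H)
    for _ in range(iters):
        w = [dot(row, v) for row in H]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def decompose(g, v):
    """Split g into parts parallel and orthogonal to unit vector v; return the angle too."""
    coeff = dot(g, v)
    parallel = [coeff * x for x in v]
    orthogonal = [a - b for a, b in zip(g, parallel)]
    cos_angle = max(-1.0, min(1.0, coeff / math.sqrt(dot(g, g))))
    return parallel, orthogonal, math.degrees(math.acos(cos_angle))

H = [[10.0, 0.0], [0.0, 1.0]]   # toy Hessian: sharp along the first axis
g = [3.0, 4.0]                  # toy gradient
v = top_eigenvector(H)
parallel, orthogonal, angle = decompose(g, v)
```

In a real network the Hessian is never formed explicitly; Hessian-vector products would replace the matrix-vector product inside the power iteration.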

[AI-28] Who Owns the Text? Design Patterns for Preserving Authorship in AI-Assisted Writing

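[Quick Read]: This paper examines how generative AI writing assistants, while improving efficiency and fluency, can weaken writers' psychological ownership of their text. In the study, AI-assisted writing lowered cognitive load and left text quality broadly stable, but psychological ownership dropped by about 0.85 to 1.0 points on a 7-point scale. The key to the solution is designing interactions that strengthen the writer's sense of authorship: style personalization partially restored ownership (+0.43 points) and increased incorporation of AI suggestions (+5 percentage points). The authors distill five design patterns (on-demand initiation, micro-suggestions, voice anchoring, audience scaffolds, and point-of-decision provenance) to guide authorship-preserving writing tools.

Link: https://arxiv.org/abs/2601.10236
Authors: Bohan Zhang,Chengke Bu,Paramveer S. Dhillon
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Preprint; 42 pages

Click to view abstract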

Abstract:AI writing assistants can reduce effort and improve fluency, but they may also weaken writers’ sense of authorship. We study this tension with an ownership-aware co-writing editor that offers on-demand, sentence-level suggestions and tests two common design choices: persona-based coaching and style personalization. In an online study (N=176), participants completed three professional writing tasks: an email without AI help, a proposal with generic AI suggestions, and a cover letter with persona-based coaching, while half received suggestions tailored to a brief sample of their prior writing. Across the two AI-assisted tasks, psychological ownership dropped relative to unassisted writing (about 0.85-1.0 points on a 7-point scale), even as cognitive load decreased (about 0.9 points) and quality ratings stayed broadly similar overall. Persona coaching did not prevent the ownership decline. Style personalization partially restored ownership (about +0.43) and increased AI incorporation in text (+5 percentage points). We distill five design patterns: on-demand initiation, micro-suggestions, voice anchoring, audience scaffolds, and point-of-decision provenance, to guide authorship-preserving writing tools.

[AI-29] Introduction to optimization methods for training SciML models

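[Quick Read]: This paper addresses the mismatch in algorithm choice that arises because optimization problems in machine learning (ML) and scientific machine learning (SciML) have different structure. Classical ML relies on stochastic, sample-separable objectives that suit first-order and adaptive gradient methods. SciML often involves physics-informed or operator-constrained formulations whose loss landscapes exhibit global coupling, stiffness, and strong anisotropy, so optimization behavior is governed by the spectral properties of the physical model rather than by data statistics, which limits the effectiveness of standard stochastic methods. The key contribution is a unified introduction to optimization methods that emphasizes choosing a strategy according to problem structure, reviews first- and second-order techniques in deterministic and stochastic settings, and discusses how to adapt them to physics-constrained and data-driven SciML models.

Link: https://arxiv.org/abs/2601.10222
Authors: Alena Kopaničáková,Elisa Riccietti
Affiliation: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract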

Abstract:Optimization is central to both modern machine learning (ML) and scientific machine learning (SciML), yet the structure of the underlying optimization problems differs substantially across these domains. Classical ML typically relies on stochastic, sample-separable objectives that favor first-order and adaptive gradient methods. In contrast, SciML often involves physics-informed or operator-constrained formulations in which differential operators induce global coupling, stiffness, and strong anisotropy in the loss landscape. As a result, optimization behavior in SciML is governed by the spectral properties of the underlying physical models rather than by data statistics, frequently limiting the effectiveness of standard stochastic methods and motivating deterministic or curvature-aware approaches. This document provides a unified introduction to optimization methods in ML and SciML, emphasizing how problem structure shapes algorithmic choices. We review first- and second-order optimization techniques in both deterministic and stochastic settings, discuss their adaptation to physics-constrained and data-driven SciML models, and illustrate practical strategies through tutorial examples, while highlighting open research directions at the interface of scientific computing and scientific machine learning.
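
The point about anisotropy and curvature-aware methods can be illustrated on a toy problem. The sketch below compares plain gradient descent with a single Newton step on the quadratic f(x) = ½ Σ hᵢxᵢ², whose diagonal Hessian has condition number 1000. The problem, step size, and iteration count are assumptions chosen for illustration and do not come from the paper.

```python
def grad(h, x):
    """Gradient of f(x) = 0.5 * sum(h_i * x_i^2)."""
    return [hi * xi for hi, xi in zip(h, x)]

def gradient_descent(h, x, steps):
    """First-order method; the step size is limited by the largest curvature."""
    lr = 1.0 / max(h)
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(h, x))]
    return x

def newton_step(h, x):
    """Second-order method; rescales each direction by its own curvature."""
    return [xi - gi / hi for xi, gi, hi in zip(x, grad(h, x), h)]

h = [1000.0, 1.0]    # strongly anisotropic curvature
x0 = [1.0, 1.0]
x_gd = gradient_descent(h, x0, 100)   # the flat direction has barely moved
x_newton = newton_step(h, x0)         # reaches the minimizer in one step
```

After 100 gradient steps the low-curvature coordinate is still close to its starting value, while the Newton step lands on the minimizer, which is the behavior that motivates curvature-aware optimizers on stiff problems.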

[AI-30] Topo-RAG: Topology-aware retrieval for hybrid text-table documents

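[Quick Read]: This paper addresses a bottleneck for Retrieval-Augmented Generation (RAG) on enterprise data, where documents are rarely pure text. Existing RAG systems usually linearize rich multidimensional tables into simple Markdown text strings, an approach that the paper says has been shown to be mathematically insufficient for capturing a table's spatial structure. The key to the solution is the Topo-RAG framework, a dual-path architecture: narrative content goes through a traditional dense retriever, while tabular structures are handled by a Cell-Aware Late Interaction mechanism that preserves the spatial relationships between cells. On SEC-25, a synthetic enterprise corpus that mimics real-world complexity, Topo-RAG improves nDCG@10 on hybrid queries by 18.4% over standard linearization approaches.

Link: https://arxiv.org/abs/2601.10215
Authors: Alex Dantart,Marco Kóvacs-Navarro
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract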

Abstract:In enterprise datasets, documents are rarely pure. They are not just text, nor just numbers; they are a complex amalgam of narrative and structure. Current Retrieval-Augmented Generation (RAG) systems have attempted to address this complexity with a blunt tool: linearization. We convert rich, multidimensional tables into simple Markdown-style text strings, hoping that an embedding model will capture the geometry of a spreadsheet in a single vector. But it has already been shown that this is mathematically insufficient. This work presents Topo-RAG, a framework that challenges the assumption that “everything is text”. We propose a dual architecture that respects the topology of the data: we route fluid narrative through traditional dense retrievers, while tabular structures are processed by a Cell-Aware Late Interaction mechanism, preserving their spatial relationships. Evaluated on SEC-25, a synthetic enterprise corpus that mimics real-world complexity, Topo-RAG demonstrates an 18.4% improvement in nDCG@10 on hybrid queries compared to standard linearization approaches. It’s not just about searching better; it’s about understanding the shape of information.
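
The abstract does not describe how the Cell-Aware Late Interaction mechanism works internally. As a rough point of reference, the sketch below implements generic late interaction in the MaxSim style: each query vector is matched to its most similar cell vector and the maxima are summed, instead of comparing one pooled vector per table. Treating each table cell as one embedding, and the toy 2-dimensional vectors, are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def late_interaction_score(query_vecs, cell_vecs):
    """Sum over query vectors of the best-matching cell similarity (MaxSim)."""
    return sum(max(cosine(q, c) for c in cell_vecs) for q in query_vecs)

# Toy embeddings: two query tokens, and two candidate tables.
query = [[1.0, 0.0], [0.0, 1.0]]
table_a = [[1.0, 0.0], [0.0, 1.0]]   # one matching cell per query token
table_b = [[1.0, 1.0]]               # a single "averaged" cell

score_a = late_interaction_score(query, table_a)
score_b = late_interaction_score(query, table_b)
```

Table A scores higher because each query token finds its own matching cell, whereas the single blended vector in table B matches both tokens only partially. That is the intuition behind keeping cells separate rather than collapsing a table into one vector.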

[AI-31] PADER: Paillier-based Secure Decentralized Social Recommendation

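[Quick Read]: This paper addresses the leakage of user and seller data in recommendation systems, particularly when centralized platforms collect as much data as possible. The authors propose PADER, a decentralized social recommendation system based on the Paillier cryptosystem. The core idea is to cast the SoReg (Social Regularization) model as a two-party secure polynomial evaluation problem, and to design secure addition and multiplication protocols that support any arithmetic circuit, together with an optimal data packing scheme suited to polynomial computation over real values, so that training and inference run without a centralized platform. Experiments show that one iteration over a user with hundreds of ratings takes about one second, and one epoch over roughly 500K ratings takes 3 hours, which suggests the method is practical.

Link: https://arxiv.org/abs/2601.10212
Authors: Chaochao Chen,Jiaming Qian,Fei Zheng,Yachuan Liu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract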

Abstract:The prevalence of recommendation systems also brings privacy concerns to both the users and the sellers, as centralized platforms collect as much data as possible from them. To keep the data private, we propose PADER: a Paillier-based secure decentralized social recommendation system. In this system, the users and the sellers are nodes in a decentralized network. The training and inference of the recommendation model are carried out securely in a decentralized manner, without the involvement of a centralized platform. To this end, we apply the Paillier cryptosystem to the SoReg (Social Regularization) model, which exploits both user’s ratings and social relations. We view the SoReg model as a two-party secure polynomial evaluation problem and observe that the simple bipartite computation may result in poor efficiency. To improve efficiency, we design secure addition and multiplication protocols to support secure computation on any arithmetic circuit, along with an optimal data packing scheme that is suitable for the polynomial computations of real values. Experiment results show that our method only takes about one second to iterate through one user with hundreds of ratings, and training with ~500K ratings for one epoch only takes 3 hours, which shows that the method is practical in real applications. The code is available at this https URL.
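
The secure addition the abstract builds on rests on the additive homomorphism of the Paillier cryptosystem: multiplying two ciphertexts adds the plaintexts, and raising a ciphertext to a constant multiplies the plaintext by that constant. The toy below shows this property with the textbook scheme (generator g = n + 1). The tiny primes make it completely insecure, and PADER's actual protocols, packing scheme, and real-valued encoding are not reproduced; all function names are illustrative.

```python
import math
import random

def keygen(p, q):
    """Textbook Paillier keys with g = n + 1 (toy primes, NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                                # modular inverse (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m, rng):
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2        # g^m * r^n mod n^2

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n

rng = random.Random(0)
n, priv = keygen(61, 53)
c_a, c_b = encrypt(n, 17, rng), encrypt(n, 25, rng)

c_sum = c_a * c_b % (n * n)        # ciphertext of 17 + 25
c_scaled = pow(c_a, 3, n * n)      # ciphertext of 3 * 17
```

Decrypting `c_sum` and `c_scaled` with `decrypt(n, priv, ...)` recovers the sum and the scaled value without either operand being revealed to the party doing the arithmetic. Multiplying two encrypted values is not possible with this property alone, which is why an interactive multiplication protocol is needed.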

[AI-32] GFM4GA: Graph Foundation Model for Group Anomaly Detection

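[Quick Read]: This paper addresses group anomaly detection in network applications, where anomaly patterns are diverse and individuals within an abnormal group can look normal. Existing graph foundation models (GFMs) perform well on individual anomalies but do not generalize to group anomalies. The key to the solution is GFM4GA, a graph foundation model designed for group anomalies. In pretraining it uses dual-level contrastive learning, based on feature-based estimation and group extraction, to capture potential group anomaly structure and feature inconsistencies. Downstream, it is fine-tuned in parameter-constrained, group-anomaly-proportion weighted few-shot settings, and group contexts determined by labeled anomaly neighbors extend its adaptability to unseen group anomalies. Reported average improvements are 2.85% in AUROC and 2.55% in AUPRC.

Link: https://arxiv.org/abs/2601.10193
Authors: Jiujiu Chen,Weijun Zeng,Shaofeng Hu,Sihong Xie,Hui Xiong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract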

Abstract:Group anomaly detection is crucial in many network applications, but faces challenges due to diverse anomaly patterns. Motivated by the success of large language models (LLMs) in natural language processing, graph foundation models (GFMs) is proposed to handle few-shot learning task with fewer labeling efforts. GFMs have been successfully applied to detection of individual anomalies but cannot be generalized to group anomalies, as group anomaly patterns must be detected as a whole and individuals in an abnormal group can look rather normal. Therefore, we propose GFM4GA, a novel graph foundation model for group anomaly detection. The pipeline is pretrained via dual-level contrastive learning based on feature-based estimation and group extraction, to capture potential group anomaly structure and feature inconsistencies. In the downstream tasks, the pipeline is finetuned in parameter-constrained and group-anomaly-proportion weighted few-shot settings, and its adaptive ability to unseen group anomalies expanded via group contexts determined by labeled anomaly neighbors. Experiments show that GFM4GA surpasses group anomaly detectors and GFMs for individual anomalies, achieving average improvements of 2.85% in AUROC and 2.55% in AUPRC.

[AI-33] How does downsampling affect needle electromyography signals? A generalisable workflow for understanding downsampling effects on high-frequency time series

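[Quick Read]: This paper addresses the computational burden that high and heterogeneous sampling rates in needle electromyography (nEMG) signals place on feature-based machine learning models, especially for near real-time analysis, and asks how to reduce that burden while keeping diagnostic information intact. The key to the solution is a systematic evaluation workflow that combines shape-based distortion metrics, classification outcomes, and feature space analysis to quantify how different downsampling algorithms and factors affect waveform fidelity and predictive performance. Experiments show that shape-aware downsampling algorithms outperform standard decimation because they better preserve peak structure and overall signal morphology. The results give practical guidance for choosing downsampling configurations for near real-time nEMG analysis, and the workflow generalizes to other high-frequency time-series applications.

Link: https://arxiv.org/abs/2601.10191
Authors: Mathieu Cherpitel,Janne Luijten,Thomas Bäck,Camiel Verhamme,Martijn Tannemaat,Anna Kononova
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract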

Abstract:Automated analysis of needle electromyography (nEMG) signals is emerging as a tool to support the detection of neuromuscular diseases (NMDs), yet the signals’ high and heterogeneous sampling rates pose substantial computational challenges for feature-based machine-learning models, particularly for near real-time analysis. Downsampling offers a potential solution, but its impact on diagnostic signal content and classification performance remains insufficiently understood. This study presents a workflow for systematically evaluating information loss caused by downsampling in high-frequency time series. The workflow combines shape-based distortion metrics with classification outcomes from available feature-based machine learning models and feature space analysis to quantify how different downsampling algorithms and factors affect both waveform integrity and predictive performance. We use a three-class NMD classification task to experimentally evaluate the workflow. We demonstrate how the workflow identifies downsampling configurations that preserve diagnostic information while substantially reducing computational load. Analysis of shape-based distortion metrics showed that shape-aware downsampling algorithms outperform standard decimation, as they better preserve peak structure and overall signal morphology. The results provide practical guidance for selecting downsampling configurations that enable near real-time nEMG analysis and highlight a generalisable workflow that can be used to balance data reduction with model performance in other high-frequency time-series applications as well.

[AI-34] CtD: Composition through Decomposition in Emergent Communication

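[Quick Read]: This paper studies how artificial neural agents can achieve compositional generalization, that is, combine learned basic concepts in novel ways to describe previously unseen images. The key to the solution is the "Composition through Decomposition" method, which trains in two sequential steps. In the 'Decompose' step, agents learn to decompose an image into basic concepts using a codebook acquired during interaction in a multi-target coordination game. In the 'Compose' step, they use this codebook to describe novel images by composing basic concepts into complex phrases. Notably, in some cases generalization in the 'Compose' step is achieved zero-shot, without additional training.

Link: https://arxiv.org/abs/2601.10169
Authors: Boaz Carmeli,Ron Meir,Yonatan Belinkov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract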

Abstract:Compositionality is a cognitive mechanism that allows humans to systematically combine known concepts in novel ways. This study demonstrates how artificial neural agents acquire and utilize compositional generalization to describe previously unseen images. Our method, termed “Composition through Decomposition”, involves two sequential training steps. In the ‘Decompose’ step, the agents learn to decompose an image into basic concepts using a codebook acquired during interaction in a multi-target coordination game. Subsequently, in the ‘Compose’ step, the agents employ this codebook to describe novel images by composing basic concepts into complex phrases. Remarkably, we observe cases where generalization in the ‘Compose’ step is achieved zero-shot, without the need for additional training.

[AI-35] MMPG: MoE-based Adaptive Multi-Perspective Graph Fusion for Protein Representation Learning

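[Quick Read]: This paper addresses a limitation of current Graph Neural Network (GNN) methods for Protein Representation Learning (PRL): they typically rely on a single-perspective graph construction strategy, which captures only partial properties of residue interactions and yields incomplete protein representations. The key to the solution is the MMPG framework, which builds protein graphs from physical, chemical, and geometric perspectives and fuses them adaptively with a Mixture of Experts (MoE). The MoE module dynamically routes perspectives to specialized experts, which learn perspective-specific features and cross-perspective interactions. Information is thereby integrated at several levels, from individual representations, to pairwise inter-perspective synergies, to a global consensus, and the authors report advanced performance on four downstream protein tasks.

Link: https://arxiv.org/abs/2601.10157
Authors: Yusong Wang,Jialun Shen,Zhihao Wu,Yicheng Xu,Shiyin Tan,Mingkun Xu,Changshuo Wang,Zixing Song,Prayag Tiwari
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract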

Abstract:Graph Neural Networks (GNNs) have been widely adopted for Protein Representation Learning (PRL), as residue interaction networks can be naturally represented as graphs. Current GNN-based PRL methods typically rely on single-perspective graph construction strategies, which capture partial properties of residue interactions, resulting in incomplete protein representations. To address this limitation, we propose MMPG, a framework that constructs protein graphs from multiple perspectives and adaptively fuses them via Mixture of Experts (MoE) for PRL. MMPG constructs graphs from physical, chemical, and geometric perspectives to characterize different properties of residue interactions. To capture both perspective-specific features and their synergies, we develop an MoE module, which dynamically routes perspectives to specialized experts, where experts learn intrinsic features and cross-perspective interactions. We quantitatively verify that MoE automatically specializes experts in modeling distinct levels of interaction from individual representations, to pairwise inter-perspective synergies, and ultimately to a global consensus across all perspectives. Through integrating this multi-level information, MMPG produces superior protein representations and achieves advanced performance on four different downstream protein tasks.

[AI-36] LOOKAT: Lookup-Optimized Key-Attention for Memory-Efficient Transformers

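[Quick Read]: This paper addresses the storage and bandwidth bottleneck of the key-value (KV) cache when deploying large language models (LLMs) on edge devices. Existing quantization methods compress storage, but attention still has to dequantize INT4/INT8 keys to FP16 before use, so bandwidth pressure is not reduced. The key idea is to treat attention scoring as inner product similarity search and to borrow compression techniques from vector databases. The proposed LOOKAT applies product quantization and asymmetric distance computation: it decomposes key vectors into subspaces, learns codebooks, and computes attention scores via lookup tables, which shifts attention from memory-bound to compute-bound. LOOKAT requires no architecture changes or training. On GPT-2 it reports 64× compression at 95.7% output fidelity and 32× compression at 95.0% fidelity, with rank correlation ρ of about 0.95. Theoretical analysis shows that rank correlation degrades as O(d_k/mK), with guarantees validated for sequence lengths up to 1024 tokens.

Link: https://arxiv.org/abs/2601.10155
Authors: Aryan Karmore
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract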

Abstract:Compressing the KV cache is a required step to deploy large language models on edge devices. Current quantization methods compress storage but fail to reduce bandwidth as attention calculation requires dequantizing keys from INT4/INT8 to FP16 before use. We observe that attention scoring is mathematically equivalent to the inner product similarity search and we can apply some compression techniques from vector databases to compress KV-cache better. We propose LOOKAT, which applies product quantization and asymmetric distance computation, to transformer architecture by decomposing key vectors into subspaces, learning codebooks and computing attention tables via lookup tables. This transforms attention from memory-bound to compute-bound. LOOKAT achieves $64\times$ compression at 95.7% output fidelity and $32\times$ compression at 95.0% fidelity when tested on GPT-2. LOOKAT requires no architecture changes or training while maintaining rank correlation $\rho$ of about 0.95. Theoretical analysis confirms that rank correlation degrades as $O(d_k/mK)$, with guarantees validated across sequence lengths up to 1024 tokens.
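
Product quantization with asymmetric distance computation, the technique this abstract names, can be sketched in a few lines. Keys are split into `m` subvectors, each subvector is replaced by the index of its nearest centroid, and a query is scored against every key by summing precomputed query-centroid inner products from a lookup table. The tiny k-means, the random data, and the sizes `m=4`, `k=4` are assumptions for illustration, not LOOKAT's configuration.

```python
import random

def split(vec, m):
    """Cut a vector into m equal-length subvectors."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters, rng):
    """A few Lloyd iterations; empty clusters keep their previous centroid."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: sqdist(p, centroids[c]))].append(p)
        for c, g in enumerate(groups):
            if g:
                centroids[c] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def train_pq(keys, m, k, rng):
    """Learn one codebook per subspace and encode every key as m small codes."""
    subs = [split(key, m) for key in keys]
    codebooks = [kmeans([s[j] for s in subs], k, 5, rng) for j in range(m)]
    codes = [[min(range(k), key=lambda c: sqdist(s[j], codebooks[j][c]))
              for j in range(m)] for s in subs]
    return codebooks, codes

def adc_scores(query, codebooks, codes):
    """Asymmetric scoring: full-precision query, quantized keys, table lookups."""
    m = len(codebooks)
    q_subs = split(query, m)
    tables = [[sum(a * b for a, b in zip(q_subs[j], cent)) for cent in codebooks[j]]
              for j in range(m)]
    return [sum(tables[j][code[j]] for j in range(m)) for code in codes]

rng = random.Random(0)
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(32)]
query = [rng.uniform(-1, 1) for _ in range(8)]
codebooks, codes = train_pq(keys, m=4, k=4, rng=rng)
approx_scores = adc_scores(query, codebooks, codes)
```

The keys are never reconstructed at scoring time: each score costs `m` table lookups and additions, which is where the shift away from memory-bound dequantization comes from.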

[AI-37] Simple Network Graph Comparative Learning

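[Quick Read]: This paper addresses two challenges that contrastive learning faces in node classification. First, existing data augmentation techniques can generate views that differ too much from the original, which weakens the correlation between views and hurts training efficiency. Second, most graph contrastive learning algorithms depend on a large number of negative samples, which increases computational cost. The key to the solution is Simple Network Graph Comparative Learning (SNGCL). It applies a stacked multilayer Laplacian smoothing filter to obtain global and local feature smoothing matrices, feeds them into the target and online networks of a siamese network, and optimizes an improved triple recombination loss that pulls intra-class distances closer and pushes inter-class distances apart, improving node classification performance.

Link: https://arxiv.org/abs/2601.10150
Authors: Qiang Yu,Xinran Cheng,Shiqiang Xu,Chuanyi Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Click to view abstract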

Abstract:The effectiveness of contrastive learning methods has been widely recognized in the field of graph learning, especially in contexts where graph data often lack labels or are difficult to label. However, the application of these methods to node classification tasks still faces a number of challenges. First, existing data enhancement techniques may lead to significant differences from the original view when generating new views, which may weaken the relevance of the view and affect the efficiency of model training. Second, the vast majority of existing graph comparison learning algorithms rely on the use of a large number of negative samples. To address the above challenges, this study proposes a novel node classification contrast learning method called Simple Network Graph Comparative Learning (SNGCL). Specifically, SNGCL employs a superimposed multilayer Laplace smoothing filter as a step in processing the data to obtain global and local feature smoothing matrices, respectively, which are thus passed into the target and online networks of the siamese network, and finally employs an improved triple recombination loss function to bring the intra-class distance closer and the inter-class distance farther. We have compared SNGCL with state-of-the-art models in node classification tasks, and the experimental results show that SNGCL is strongly competitive in most tasks.
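
A stacked Laplacian smoothing filter of the kind mentioned in this abstract can be sketched as repeated multiplication by H = I - kL, where L is the symmetrically normalized Laplacian of the graph with self-loops. The abstract does not say how the global and local matrices differ, so using many layers for "global" and a single layer for "local" is an assumption, as are the toy path graph and the step `k=1.0`.

```python
import math

def smoothing_filter(adj, k=1.0):
    """Return H = I - k * L, with L the normalized Laplacian of adj plus self-loops."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    norm = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    return [[(1.0 - k if i == j else 0.0) + k * norm[i][j] for j in range(n)]
            for i in range(n)]

def smooth(H, X, layers):
    """Apply the filter `layers` times to the node feature matrix X."""
    for _ in range(layers):
        X = [[sum(H[i][m] * X[m][f] for m in range(len(X))) for f in range(len(X[0]))]
             for i in range(len(X))]
    return X

# Path graph on three nodes, one feature concentrated on the first node.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
X = [[1.0], [0.0], [0.0]]
H = smoothing_filter(adj)

local_view = smooth(H, X, layers=1)    # feature spreads to direct neighbors only
global_view = smooth(H, X, layers=8)   # feature is mixed across the whole graph
```

Because both views are deterministic functions of the same features, they stay closely related to the original input, which is the property the abstract contrasts with random augmentations.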

[AI-38] DecisionLLM: Large Language Models for Long Sequence Decision Exploration

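[Quick Read]: This paper addresses the performance bottleneck of long-sequence decision-making in offline settings, and in particular asks whether large language models (LLMs) can improve decisions on complex strategic tasks. The core challenge is that LLMs lack a native understanding of continuous values, so they cannot directly handle numerical data represented as text strings. The key to the solution is to treat trajectories as a distinct modality and to learn an alignment between trajectory data and natural language task descriptions, giving an autoregressive prediction framework called DecisionLLM that models future decisions in a unified way. The authors establish scaling laws showing that performance depends on model scale, data volume, and data quality. DecisionLLM-3B outperforms the traditional Decision Transformer (DT) by 69.4 on Maze2D umaze-v1 and by 0.085 on AuctionNet.

Link: https://arxiv.org/abs/2601.10148
Authors: Xiaowei Lv,Zhilin Zhang,Yijun Li,Yusen Huo,Siyuan Ju,Xuyan Li,Chunxiang Hong,Tianyu Wang,Yongcai Wang,Peng Sun,Chuan Yu,Jian Xu,Bo Zheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract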

Abstract:Long-sequence decision-making, which is usually addressed through reinforcement learning (RL), is a critical component for optimizing strategic operations in dynamic environments, such as real-time bidding in computational advertising. The Decision Transformer (DT) introduced a powerful paradigm by framing RL as an autoregressive sequence modeling problem. Concurrently, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning and planning tasks. This inspires us whether LLMs, which share the same Transformer foundation, but operate at a much larger scale, can unlock new levels of performance in long-horizon sequential decision-making problem. This work investigates the application of LLMs to offline decision making tasks. A fundamental challenge in this domain is the LLMs’ inherent inability to interpret continuous values, as they lack a native understanding of numerical magnitude and order when values are represented as text strings. To address this, we propose treating trajectories as a distinct modality. By learning to align trajectory data with natural language task descriptions, our model can autoregressively predict future decisions within a cohesive framework we term DecisionLLM. We establish a set of scaling laws governing this paradigm, demonstrating that performance hinges on three factors: model scale, data volume, and data quality. In offline experimental benchmarks and bidding scenarios, DecisionLLM achieves strong performance. Specifically, DecisionLLM-3B outperforms the traditional Decision Transformer (DT) by 69.4 on Maze2D umaze-v1 and by 0.085 on AuctionNet. It extends the AIGB paradigm and points to promising directions for future exploration in online bidding.

[AI-39] History Is Not Enough: An Adaptive Dataflow System for Financial Time-Series Synthesis

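[Quick Read]: This paper addresses the gap between training and real-world performance in quantitative finance, which is driven by concept drift and distributional non-stationarity: models trained on static historical data tend to overfit and generalize poorly in dynamic markets. The key to the solution is a drift-aware dataflow system that integrates machine learning-based adaptive control into data curation. It couples a parameterized data manipulation module (single-stock transformations, multi-stock mix-ups, and curation operations) with an adaptive planner-scheduler driven by gradient-based bi-level optimization. This unifies data augmentation, curriculum learning, and data workflow management in a single differentiable framework, supports provenance-aware replay and continuous data quality monitoring, and is reported to improve model robustness and risk-adjusted returns.

Link: https://arxiv.org/abs/2601.10143
Authors: Haochong Xia,Yao Long Teng,Regan Tan,Molei Qin,Xinrun Wang,Bo An
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
Comments:

Click to view abstract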

Abstract:In quantitative finance, the gap between training and real-world performance-driven by concept drift and distributional non-stationarity-remains a critical obstacle for building reliable data-driven systems. Models trained on static historical data often overfit, resulting in poor generalization in dynamic markets. The mantra “History Is Not Enough” underscores the need for adaptive data generation that learns to evolve with the market rather than relying solely on past observations. We present a drift-aware dataflow system that integrates machine learning-based adaptive control into the data curation process. The system couples a parameterized data manipulation module comprising single-stock transformations, multi-stock mix-ups, and curation operations, with an adaptive planner-scheduler that employs gradient-based bi-level optimization to control the system. This design unifies data augmentation, curriculum learning, and data workflow management under a single differentiable framework, enabling provenance-aware replay and continuous data quality monitoring. Extensive experiments on forecasting and reinforcement learning trading tasks demonstrate that our framework enhances model robustness and improves risk-adjusted returns. The system provides a generalizable approach to adaptive data management and learning-guided workflow automation for financial data.

[AI-40] Understanding and Preserving Safety in Fine-Tuned LLMs

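[Quick Read]: This paper addresses the safety-utility dilemma in fine-tuning large language models (LLMs): improving downstream task performance (utility) usually requires deep fine-tuning, which substantially degrades safety alignment and increases susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. The key to the solution is an analysis of the geometric interaction between safety and utility gradients in parameter space. Systematic empirical analysis yields three findings: (1) safety gradients lie in a low-rank subspace, while utility gradients span a broader high-dimensional space; (2) the two subspaces are often negatively correlated, causing directional conflicts during fine-tuning; and (3) the dominant safety direction can be estimated efficiently from a single sample. Building on this, the authors propose Safety-Preserving Fine-Tuning (SPF), a lightweight method that explicitly removes gradient components conflicting with the low-rank safety subspace. In theory it guarantees utility convergence while bounding safety drift; empirically it maintains downstream task performance, recovers nearly all pre-trained safety alignment, and is robust to both deep fine-tuning and dynamic jailbreak attacks.

Link: https://arxiv.org/abs/2601.10141
Authors: Jiawen Zhang,Yangfan Hu,Kejia Chen,Lipeng He,Jiachen Ma,Jian Lou,Dan Li,Jian Liu,Xiaohu Yang,Ruoxi Jia
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract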

Abstract:Fine-tuning is an essential and pervasive functionality for applying large language models (LLMs) to downstream tasks. However, it has the potential to substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Despite garnering growing attention in defense efforts during the fine-tuning stage, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to steep safety declination. In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank subspace, while utility gradients span a broader high-dimensional space; (II) these subspaces are often negatively correlated, causing directional conflicts during fine-tuning; and (III) the dominant safety direction can be efficiently estimated from a single sample. Building upon these novel insights, we propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace. Theoretically, we show that SPF guarantees utility convergence while bounding safety drift. Empirically, SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios. Furthermore, SPF exhibits robust resistance to both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning. 
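The idea of removing gradient components that conflict with a safety direction can be illustrated with a small sketch. This is not the authors' SPF implementation: it assumes a single safety direction (the abstract notes the dominant direction can be estimated from one sample), uses a projection rule in the style of gradient surgery, and the function names `dot` and `remove_conflict` are hypothetical.

```python
# Illustrative sketch only, NOT the SPF algorithm from the paper.
# Assumption: one safety direction, vectors as plain Python lists.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_conflict(task_grad, safety_grad):
    """Return task_grad with its safety-conflicting component removed.

    Both arguments are gradients of a loss (task loss, safety loss).
    A descent step along -task_grad changes the safety loss by roughly
    -lr * dot(task_grad, safety_grad), so a negative inner product means
    the step would increase the safety loss. In that case we project
    task_grad onto the hyperplane orthogonal to safety_grad.
    """
    inner = dot(task_grad, safety_grad)
    if inner >= 0:
        return list(task_grad)  # no conflict: leave the update untouched
    scale = inner / dot(safety_grad, safety_grad)
    return [g - scale * s for g, s in zip(task_grad, safety_grad)]

# A conflicting gradient loses only its component along the safety direction.
print(remove_conflict([1.0, -1.0], [0.0, 1.0]))
# A non-conflicting gradient passes through unchanged.
print(remove_conflict([1.0, 1.0], [0.0, 1.0]))
```

Extending the sketch to a low-rank subspace would mean repeating the projection for each orthonormal basis vector of that subspace; how the paper chooses and estimates that basis is not specified here.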

[AI-41] Step-by-Step Causality: Transparent Causal Discovery with Multi-Agent Tree-Query and Adversarial Confidence Estimation

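[Quick Read]: This paper addresses two problems: the limited reliability of classical causal discovery methods (e.g., the PC and FCI algorithms) caused by error propagation, and the lack of interpretability and confidence estimates in causal reasoning tools based on large language models (LLMs). The key to the solution is Tree-Query, a tree-structured, multi-expert LLM framework that reduces pairwise causal discovery to a short sequence of queries about backdoor paths, (in)dependence, latent confounding, and causal direction. It yields interpretable judgments with robustness-aware confidence scores and comes with theoretical guarantees of asymptotic identifiability for four pairwise causal relations.

Link: https://arxiv.org/abs/2601.10137
Authors: Ziyi Ding,Chenfei Ye-Hao,Zheyuan Wang,Xiao-Ping Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: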

Abstract: Causal discovery aims to recover “what causes what”, but classical constraint-based methods (e.g., PC, FCI) suffer from error propagation, and recent LLM-based causal oracles often behave as opaque, confidence-free black boxes. This paper introduces Tree-Query, a tree-structured, multi-expert LLM framework that reduces pairwise causal discovery to a short sequence of queries about backdoor paths, (in)dependence, latent confounding, and causal direction, yielding interpretable judgments with robustness-aware confidence scores. Theoretical guarantees are provided for asymptotic identifiability of four pairwise relations. On data-free benchmarks derived from Mooij et al. and UCI causal graphs, Tree-Query improves structural metrics over direct LLM baselines, and a diet–weight case study illustrates confounder screening and stable, high-confidence causal conclusions. Tree-Query thus offers a principled way to obtain data-free causal priors from LLMs that can complement downstream data-driven causal discovery. Code is available at this https URL.

[AI-42] Is More Context Always Better? Examining LLM Reasoning Capability for Time Interval Prediction WWW2026

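[Quick Read]: This paper examines how well current large language models (LLMs) can reason about temporal regularities in structured behavioral data. It focuses on predicting the time interval between recurring user actions (such as repeat purchases) and on how that prediction depends on contextual information. Using a simple but representative repurchase scenario, the study benchmarks LLMs in a zero-shot setting against statistical and machine-learning models and analyzes how different levels of context affect LLM reasoning. The results show that LLMs beat lightweight statistical baselines but remain clearly behind dedicated machine-learning models in capturing quantitative temporal structure. Adding more user-level detail actually lowers accuracy, which challenges the common assumption that “more context leads to better reasoning” and provides empirical grounding for future context-aware hybrid models that combine statistical precision with linguistic flexibility.

Link: https://arxiv.org/abs/2601.10132
Authors: Yanan Cao,Farnaz Fallahi,Murali Mohana Krishna Dandu,Lalitesh Morishetti,Kai Zhao,Luyi Ma,Sinduja Subramaniam,Jianpeng Xu,Evren Korpeoglu,Kaushiki Nag,Sushant Kumar,Kannan Achan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at The Web Conference 2026 (WWW 2026)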

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning and prediction across different domains. Yet, their ability to infer temporal regularities from structured behavioral data remains underexplored. This paper presents a systematic study investigating whether LLMs can predict time intervals between recurring user actions, such as repeated purchases, and how different levels of contextual information shape their predictive behavior. Using a simple but representative repurchase scenario, we benchmark state-of-the-art LLMs in zero-shot settings against both statistical and machine-learning models. Two key findings emerge. First, while LLMs surpass lightweight statistical baselines, they consistently underperform dedicated machine-learning models, showing their limited ability to capture quantitative temporal structure. Second, although moderate context can improve LLM accuracy, adding further user-level detail degrades performance. These results challenge the assumption that “more context leads to better reasoning”. Our study highlights fundamental limitations of today’s LLMs in structured temporal inference and offers guidance for designing future context-aware hybrid models that integrate statistical precision with linguistic flexibility.

[AI-43] M4olGen: Multi-Agent Multi-Stage Molecular Generation under Precise Multi-Property Constraints

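[Quick Read]: This paper tackles the challenge of generating molecules that satisfy precise numeric constraints on several physicochemical properties (such as QED, LogP, molecular weight, HOMO, and LUMO). Without external structure and feedback, existing large language models (LLMs) struggle with multi-objective control and numeric reasoning. The key to the solution is M4olGen, a fragment-level, two-stage framework. In Stage I, a multi-agent reasoner performs retrieval-anchored, fragment-level edits to produce a candidate molecule near the feasible region. In Stage II, a fragment-level optimizer trained with Group Relative Policy Optimization (GRPO) applies one-hop or multi-hop refinements that explicitly minimize property errors while controlling edit complexity and deviation from the prototype. Both stages rely on a large, automatically curated dataset of fragment-edit chains and property deltas, which provides deterministic, reproducible supervision and supports controllable multi-hop reasoning. The framework outperforms existing LLMs and graph-based algorithms.

Link: https://arxiv.org/abs/2601.10131
Authors: Yizhan Li,Florence Cloutier,Sifan Wu,Ali Parviz,Boris Knyazev,Yan Zhang,Glen Berseth,Bang Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: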

Abstract: Generating molecules that satisfy precise numeric constraints over multiple physicochemical properties is critical and challenging. Although large language models (LLMs) are expressive, they struggle with precise multi-objective control and numeric reasoning without external structure and feedback. We introduce M4olGen, a fragment-level, retrieval-augmented, two-stage framework for molecule generation under multi-property constraints. Stage I: Prototype generation: a multi-agent reasoner performs retrieval-anchored, fragment-level edits to produce a candidate near the feasible region. Stage II: RL-based fine-grained optimization: a fragment-level optimizer trained with Group Relative Policy Optimization (GRPO) applies one- or multi-hop refinements to explicitly minimize the property errors toward our target while regulating edit complexity and deviation from the prototype. A large, automatically curated dataset with reasoning chains of fragment edits and measured property deltas underpins both stages, enabling deterministic, reproducible supervision and controllable multi-hop reasoning. Unlike prior work, our framework better reasons about molecules by leveraging fragments and supports controllable refinement toward numeric targets. Experiments on generation under two sets of property constraints (QED, LogP, Molecular Weight and HOMO, LUMO) show consistent gains in validity and precise satisfaction of multi-property targets, outperforming strong LLMs and graph-based algorithms.

[AI-44] Redundancy-Driven Top-k Functional Dependency Discovery

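[Quick Read]: This paper addresses two core problems in functional dependency (FD) discovery. First, the computational cost is high: it grows quadratically with the number of tuples and exponentially with the number of attributes, so discovery is slow on large-scale, high-dimensional data. Second, the result set is large and redundant, which makes it hard to pick out the FDs that are actually useful. The key to the solution is SDP (Selective-Discovery-and-Prune), which ranks FDs by redundancy count and returns the top-k. Redundancy count measures how much duplicated information an FD explains and is directly tied to storage overhead and update anomalies. SDP prunes with an upper bound on redundancy, and this bound is proved to be monotone: adding attributes refines partitions and therefore lowers the bound. Once the bound of a branch falls below the redundancy of the current k-th FD, the whole branch can be skipped, which sharply reduces the search space. Three further optimizations improve efficiency: attribute ordering, pairwise statistics from a Partition Cardinality Matrix, and a global scheduling strategy. Experiments show that the method is much faster and uses less memory than exhaustive approaches.

Link: https://arxiv.org/abs/2601.10130
Authors: Xiaolong Wan,Xixian Han
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: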

Abstract: Functional dependencies (FDs) are basic constraints in relational databases and are used for many data management tasks. Most FD discovery algorithms find all valid dependencies, but this causes two problems. First, the computational cost is prohibitive: computational complexity grows quadratically with the number of tuples and exponentially with the number of attributes, making discovery slow on large-scale and high-dimensional data. Second, the result set can be huge, making it hard to identify useful dependencies. We propose SDP (Selective-Discovery-and-Prune), which discovers the top-k FDs ranked by redundancy count. Redundancy count measures how much duplicated information an FD explains and connects directly to storage overhead and update anomalies. SDP uses an upper bound on redundancy to prune the search space. It is proved that this upper bound is monotone: adding attributes refines partitions and thus decreases the bound. Once the bound falls below the top-k threshold, the entire branch can be skipped. We improve SDP with three optimizations: ordering attributes by partition cardinality, using pairwise statistics in a Partition Cardinality Matrix to tighten bounds, and a global scheduler to explore promising branches first. Experiments on over 40 datasets show that SDP is much faster and uses less memory than exhaustive methods.
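The monotone-bound pruning described above can be sketched in a few lines. This is a toy level-wise search, not the SDP algorithm: the redundancy count is assumed here to be the number of rows minus the number of equivalence classes of the left-hand side, minimality of FDs is not enforced, and all function names are hypothetical.

```python
# Toy sketch of top-k FD search with a monotone upper bound (NOT SDP itself).
# Assumption: redundancy of X -> A is len(rows) - (number of distinct X values).

def partition_size(rows, attrs):
    """Number of equivalence classes when rows are grouped by attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})

def redundancy_bound(rows, lhs):
    """Upper bound on redundancy for any FD whose left side contains lhs.
    Adding attributes can only split classes, so the bound never grows."""
    return len(rows) - partition_size(rows, lhs)

def holds(rows, lhs, rhs):
    """X -> A holds exactly when adding A does not refine the partition of X."""
    return partition_size(rows, lhs) == partition_size(rows, lhs + (rhs,))

def top_k_fds(rows, attributes, k):
    results = []  # (redundancy, lhs, rhs), sorted descending, at most k items
    frontier = [(a,) for a in attributes]
    while frontier:
        next_frontier = []
        for lhs in frontier:
            bound = redundancy_bound(rows, lhs)
            threshold = results[-1][0] if len(results) == k else -1
            if bound <= threshold:
                continue  # every superset has a bound <= this one: skip branch
            for rhs in attributes:
                if rhs not in lhs and holds(rows, lhs, rhs):
                    results.append((bound, lhs, rhs))
            results.sort(key=lambda r: -r[0])
            del results[k:]
            # Extend only with later attributes so each set is visited once.
            last = attributes.index(lhs[-1])
            next_frontier.extend(lhs + (a,) for a in attributes[last + 1:])
        frontier = next_frontier
    return results

rows = [
    {"city": "Oslo", "zip": "0150", "country": "NO"},
    {"city": "Oslo", "zip": "0150", "country": "NO"},
    {"city": "Oslo", "zip": "0151", "country": "NO"},
    {"city": "Bergen", "zip": "5003", "country": "NO"},
]
for redundancy, lhs, rhs in top_k_fds(rows, ["city", "zip", "country"], k=2):
    print(redundancy, lhs, "->", rhs)
```

Because each attribute set is generated only from its prefix, skipping a left-hand side also skips every set that extends it, which is safe precisely because the bound is monotone.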

[AI-45] Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs ICPR2026

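[Quick Read]: This paper addresses the difficulty of deploying large language models (LLMs) for domain-specific tasks. When a fine-tuned teacher is distilled into a smaller student, the capacity gap usually leaves the student behind the teacher. The key to the solution is a theoretical insight: a student can surpass its teacher if its advantage on the Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this, the authors design Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher's convergence path during supervised fine-tuning (SFT) on the domain task, together with a sample-wise Adaptive Weighting (AW) mechanism that preserves the student's strengths on the SFS. Experiments on diverse domain tasks, including multilingual question answering, named entity recognition (NER), and text classification, show that the method clearly outperforms existing distillation approaches and lets the student match or even exceed its teacher.

Link: https://arxiv.org/abs/2601.10114
Authors: Cheng Feng,Chaoliang Zhong,Jun Sun,Yusuke Oishi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, submitted to ICPR 2026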

Abstract: Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we propose a novel theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher’s convergence process during supervised fine-tuning (SFT) on the domain task, and a sample-wise Adaptive Weighting (AW) mechanism to preserve student strengths on SFS. Experiments across diverse domain tasks (including QA, NER, and text classification in multiple languages) show that our method consistently outperforms existing distillation approaches, allowing the student model to match or even exceed the performance of its fine-tuned teacher.

[AI-46] Repository Intelligence Graph: Deterministic Architectural Map for LLM Code Assistants

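[Quick Read]: This paper addresses a performance bottleneck of generative AI coding agents in multilingual projects: they lack an accurate understanding of build and test structure, especially when cross-language dependencies are complex and build systems are heterogeneous. The core of the solution is the Repository Intelligence Graph (RIG), an evidence-based, deterministic architectural map that explicitly models buildable components, aggregators, runners, tests, external packages, and package managers, along with the dependency and coverage relations among them. The SPADE tool extracts RIG automatically from build and test artifacts and exposes it as JSON that a large language model (LLM) can treat as the authoritative description of repository structure. Experiments show that providing RIG improves agent accuracy (by 12.2% on average) and substantially reduces completion time (by 53.9% on average), with larger gains in multilingual projects.

Link: https://arxiv.org/abs/2601.10112
Authors: Tsvi Cherny-Shahar,Amiram Yehudai
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 35 pages, 5 figures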

Abstract: Repository-aware coding agents often struggle to recover build and test structure, especially in multilingual projects where cross-language dependencies are encoded across heterogeneous build systems and tooling. We introduce the Repository Intelligence Graph (RIG), a deterministic, evidence-backed architectural map that represents buildable components, aggregators, runners, tests, external packages, and package managers, connected by explicit dependency and coverage edges that trace back to concrete build and test definitions. We also present SPADE, a deterministic extractor that constructs RIG from build and test artifacts (currently with an automatic CMake plugin based on the CMake File API and CTest metadata), and exposes RIG as an LLM-friendly JSON view that agents can treat as the authoritative description of repository structure. We evaluate three commercial agents (Claude Code, Cursor, Codex) on eight repositories spanning low to high build-oriented complexity, including the real-world MetaFFI project. Each agent answers thirty structured questions per repository with and without RIG in context, and we measure accuracy, wall-clock completion time, and efficiency (seconds per correct answer). Across repositories and agents, providing RIG improves mean accuracy by 12.2% and reduces completion time by 53.9%, yielding a mean 57.8% reduction in seconds per correct answer. Gains are larger in multilingual repositories, which improve by 17.7% in accuracy and 69.5% in efficiency on average, compared to 6.6% and 46.1% in single-language repositories. Qualitative analysis suggests that RIG shifts failures from structural misunderstandings toward reasoning mistakes over a correct structure, while rare regressions highlight that graph-based reasoning quality remains a key factor.

[AI-47] LeMoF: Level-guided Multimodal Fusion for Heterogeneous Clinical Data

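[Quick Read]: This paper addresses the under-use of modality-specific representations in multimodal clinical prediction, which results from static modality integration schemes and simple fusion strategies and hurts prediction stability and discriminative power in heterogeneous clinical settings. The key to the solution is Level-guided Modal Fusion (LeMoF), a framework that explicitly separates and learns the representations extracted at different encoder layers (levels) within each modality. It decouples global modality-level predictions from level-specific discriminative representations, which strengthens discriminative performance while keeping predictions stable. Experiments indicate that level-wise integration is a key factor in robust prediction across varied clinical conditions.

Link: https://arxiv.org/abs/2601.10092
Authors: Jongseok Kim,Seongae Kang,Jonghwan Shin,Yuhan Lee,Ohyun Jo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: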

Abstract:Multimodal clinical prediction is widely used to integrate heterogeneous data such as Electronic Health Records (EHR) and biosignals. However, existing methods tend to rely on static modality integration schemes and simple fusion strategies. As a result, they fail to fully exploit modality-specific representations. In this paper, we propose Level-guided Modal Fusion (LeMoF), a novel framework that selectively integrates level-guided representations within each modality. Each level refers to a representation extracted from a different layer of the encoder. LeMoF explicitly separates and learns global modality-level predictions from level-specific discriminative representations. This design enables LeMoF to achieve a balanced performance between prediction stability and discriminative capability even in heterogeneous clinical environments. Experiments on length of stay prediction using Intensive Care Unit (ICU) data demonstrate that LeMoF consistently outperforms existing state-of-the-art multimodal fusion techniques across various encoder configurations. We also confirmed that level-wise integration is a key factor in achieving robust predictive performance across various clinical conditions.

[AI-48] State of AI: An Empirical 100 Trillion Token Study with OpenRouter

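[Quick Read]: This paper addresses the following gap: as large language models (LLMs) move from single-pass pattern generation to multi-step reasoning, empirical understanding of how users actually use them in practice has lagged behind the technology. To close the gap, the authors analyze more than 100 trillion tokens of real-world LLM interactions collected on the OpenRouter platform, examining usage patterns across task types, geographies, and time. The large-scale, multi-dimensional study identifies three core phenomena: broad adoption of open-weight models, the outsized popularity of creative roleplay and coding assistance, and the rise of agentic inference. It also finds that early user cohorts show long-term retention, which the authors call the Cinderella “Glass Slipper” effect, and which points to the complexity and diversity of user behavior. These findings give model builders, AI developers, and infrastructure providers a data-driven basis for design decisions.

Link: https://arxiv.org/abs/2601.10088
Authors: Malika Aubakirova,Alex Atallah,Chris Clark,Justin Summerville,Anjney Midha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 36 pages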

Abstract:The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). With the release of the first widely adopted reasoning model, o1, on December 5th, 2024, the field shifted from single-pass pattern generation to multi-step deliberation inference, accelerating deployment, experimentation, and new classes of applications. As this shift unfolded at a rapid pace, our empirical understanding of how these models have actually been used in practice has lagged behind. In this work, we leverage the OpenRouter platform, which is an AI inference provider across a wide variety of LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time. In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of creative roleplay (beyond just the productivity tasks many assume dominate) and coding assistance categories, plus the rise of agentic inference. Furthermore, our retention analysis identifies foundational cohorts: early users whose engagement persists far longer than later cohorts. We term this phenomenon the Cinderella “Glass Slipper” effect. These findings underscore that the way developers and end-users engage with LLMs “in the wild” is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.

[AI-49] FilDeep: Learning Large Deformations of Elastic-Plastic Solids with Multi-Fidelity Data KDD’26 KDD

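[Quick Read]: This paper addresses the poor performance of deep learning (DL) models in scientific computation of large deformations in elastic-plastic solids, which stems from a trade-off between data quantity and data accuracy. When building training datasets, conventional approaches cannot have both large volume and high accuracy, and this limits how well DL models handle complex large-deformation problems. The key to the solution is FilDeep, a fidelity-based deep learning framework that trains jointly on low-fidelity data (plentiful but less accurate) and high-fidelity data (scarce but more accurate). Its core innovation is a set of attention-enabled cross-fidelity modules that capture long-range physical interactions across multi-fidelity (MF) data, which improves both modeling capability and generalization on large-deformation problems.

Link: https://arxiv.org/abs/2601.10031
Authors: Jianheng Tang,Shilong Tao,Zhe Feng,Haonan Sun,Menglu Wang,Zhanxing Zhu,Yunhuai Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD '26)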

Abstract:The scientific computation of large deformations in elastic-plastic solids is crucial in various manufacturing applications. Traditional numerical methods exhibit several inherent limitations, prompting Deep Learning (DL) as a promising alternative. The effectiveness of current DL techniques typically depends on the availability of high-quantity and high-accuracy datasets, which are yet difficult to obtain in large deformation problems. During the dataset construction process, a dilemma stands between data quantity and data accuracy, leading to suboptimal performance in the DL models. To address this challenge, we focus on a representative application of large deformations, the stretch bending problem, and propose FilDeep, a Fidelity-based Deep Learning framework for large Deformation of elastic-plastic solids. Our FilDeep aims to resolve the quantity-accuracy dilemma by simultaneously training with both low-fidelity and high-fidelity data, where the former provides greater quantity but lower accuracy, while the latter offers higher accuracy but in less quantity. In FilDeep, we provide meticulous designs for the practical large deformation problem. Particularly, we propose attention-enabled cross-fidelity modules to effectively capture long-range physical interactions across MF data. To the best of our knowledge, our FilDeep presents the first DL framework for large deformation problems using MF data. Extensive experiments demonstrate that our FilDeep consistently achieves state-of-the-art performance and can be efficiently deployed in manufacturing.

[AI-50] PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization

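[Quick Read]: This paper addresses the reliance of existing academic paper search methods on fixed, predefined workflows, which struggle with complex conditional queries. The core training challenge is a granularity mismatch in standard reinforcement learning (RL) methods: token-level optimization does not line up with sequence-level interaction, which produces noisy credit assignment in multi-turn agentic tasks. The key to the solution is Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method that aligns the optimization objective with the sequence-level interaction between agent and environment and thereby stabilizes policy learning. The authors also build PaperScout, an autonomous agent that models search as a sequential decision process and decides dynamically when and how to invoke search tools, which yields clear gains in recall and relevance.

Link: https://arxiv.org/abs/2601.10029
Authors: Tingyue Pan,Jie Ouyang,Mingyue Cheng,Qingchuan Li,Zirui Liu,Mingfan Pan,Shuo Yu,Qi Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: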

Abstract:Academic paper search is a fundamental task in scientific research, yet most existing approaches rely on rigid, predefined workflows that struggle with complex, conditional queries. To address this limitation, we propose PaperScout, an autonomous agent that reformulates paper search as a sequential decision-making process. Unlike static workflows, PaperScout dynamically decides whether, when, and how to invoke search and expand tools based on accumulated retrieval context. However, training such agents presents a fundamental challenge: standard reinforcement learning methods, typically designed for single-turn tasks, suffer from a granularity mismatch when applied to multi-turn agentic tasks, where token-level optimization diverges from the granularity of sequence-level interactions, leading to noisy credit assignment. We introduce Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method that aligns optimization with agent-environment interaction. Comprehensive experiments on both synthetic and real-world benchmarks demonstrate that PaperScout significantly outperforms strong workflow-driven and RL baselines in both recall and relevance, validating the effectiveness of our adaptive agentic framework and optimization strategy.

[AI-51] Structured Personality Control and Adaptation for LLM Agents

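[Quick Read]: This paper addresses the lack of nuanced, adaptable personality expression in large language models (LLMs) used for human-computer interaction (HCI), which affects user engagement, perceived decision-making, and the realism of interaction. The key to the solution is a framework based on Jungian psychological types that models personality through three mechanisms: a dominant-auxiliary coordination mechanism that keeps core traits consistently expressed; a reinforcement-compensation mechanism that lets the model adjust its behavior temporarily to fit context; and a reflection mechanism that drives long-term evolution of the personality structure. With this design, an agent can keep its personality depth while responding dynamically to interaction demands and gradually refining its underlying personality structure, which supports more natural, coherent, and context-sensitive interaction.

Link: https://arxiv.org/abs/2601.10025
Authors: Jinpeng Wang,Xinyu Jia,Wei Wei Heng,Yuquan Li,Binbin Shi,Qianlei Chen,Guannan Chen,Junxia Zhang,Yuyu Yin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: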

Abstract:Large Language Models (LLMs) are increasingly shaping human-computer interaction (HCI), from personalized assistants to social simulations. Beyond language competence, researchers are exploring whether LLMs can exhibit human-like characteristics that influence engagement, decision-making, and perceived realism. Personality, in particular, is critical, yet existing approaches often struggle to achieve both nuanced and adaptable expression. We present a framework that models LLM personality via Jungian psychological types, integrating three mechanisms: a dominant-auxiliary coordination mechanism for coherent core expression, a reinforcement-compensation mechanism for temporary adaptation to context, and a reflection mechanism that drives long-term personality evolution. This design allows the agent to maintain nuanced traits while dynamically adjusting to interaction demands and gradually updating its underlying structure. Personality alignment is evaluated using Myers-Briggs Type Indicator questionnaires and tested under diverse challenge scenarios as a preliminary structured assessment. Findings suggest that evolving, personality-aware LLMs can support coherent, context-sensitive interactions, enabling naturalistic agent design in HCI.

[AI-52] Empowering Older Adults in Digital Technology Use with Foundation Models

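[Quick Read]: This paper addresses the communication barriers older adults face when seeking support for digital technology. Unfamiliarity with technical terminology and age-related cognitive changes lead to problem descriptions that are verbose, incomplete, over-specified, or under-specified. The key to the solution is a prompt-chaining pipeline built on large language models (LLMs) with three steps: eliciting contextual details, paraphrasing the query, and generating a solution. The pipeline turns an older adult's original request into a clearer, better-structured, and semantically accurate query. This raises solution accuracy (from 46% to 69%) and search-engine retrieval quality (from 35% to 69%), improves younger adults' understanding of the queries (from 65.8% to 93.7%), and leaves older adults reporting high ability and confidence in following the solutions. The study also develops OATS, the first synthetic dataset of how older adults request tech support, as a resource for building equitable and inclusive AI systems.

Link: https://arxiv.org/abs/2601.10018
Authors: Hasti Sharifi,Homaira Huda Shomee,Sourav Medya,Debaleena Chattopadhyay
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: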

Abstract:While high-quality technology support can assist older adults in using digital applications, many struggle to articulate their issues due to unfamiliarity with technical terminology and age-related cognitive changes. This study examines these communication challenges and explores AI-based approaches to mitigate them. We conducted a diary study with English-speaking, community-dwelling older adults to collect asynchronous, technology-related queries and used reflexive thematic analysis to identify communication barriers. To address these barriers, we evaluated how foundation models can paraphrase older adults’ queries to improve solution accuracy. Two controlled experiments followed: one with younger adults evaluating AI-rephrased queries and another with older adults evaluating AI-generated solutions. We also developed a pipeline using large language models to generate the first synthetic dataset of how older adults request tech support (OATS). We identified four key communication challenges: verbosity, incompleteness, over-specification, and under-specification. Our prompt-chaining approach using the large language model, GPT-4o, elicited contextual details, paraphrased the original query, and generated a solution. AI-rephrased queries significantly improved solution accuracy (69% vs. 46%) and Google search results (69% vs. 35%). Younger adults better understood AI-rephrased queries (93.7% vs. 65.8%) and reported greater confidence and ease. Older adults reported high perceived ability to answer contextual questions (89.8%) and follow solutions (94.7%), with high confidence and ease. OATS demonstrated strong fidelity and face validity. This work shows how foundation models can enhance technology support for older adults by addressing age-related communication barriers. The OATS dataset offers a scalable resource for developing equitable AI systems that better serve aging populations.

[AI-53] Memo-SQL: Structured Decomposition and Experience-Driven Self-Correction for Training-Free NL2SQL

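[Quick Read]: This paper addresses two key problems in existing natural-language-to-SQL (NL2SQL) systems. First, they rely on in-context learning with only correct examples and ignore the rich signal in historical error-fix pairs, which limits self-correction. Second, test-time scaling methods often decompose questions arbitrarily and produce many near-identical SQL candidates, which weakens ensemble gains; they also suffer from a trade-off between accuracy and efficiency. The core of the solution is Memo-SQL, a training-free framework with two main ideas. It applies structured decomposition strategies (entity-wise, hierarchical, and atomic sequential) to encourage diverse reasoning. It also builds a dynamic memory of successful queries and historical error-fix pairs and uses retrieval-augmented prompting to bring relevant examples into context at inference time, which enables experience-driven self-correction without fine-tuning or external APIs.

Link: https://arxiv.org/abs/2601.10011
Authors: Zerui Yang,Weichuan Wang,Yanwei Xu,Linqi Song,Yudai Matsuda,Wei Han,Bo Bai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: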

Abstract:Existing NL2SQL systems face two critical limitations: (1) they rely on in-context learning with only correct examples, overlooking the rich signal in historical error-fix pairs that could guide more robust self-correction; and (2) test-time scaling approaches often decompose questions arbitrarily, producing near-identical SQL candidates across runs and diminishing ensemble gains. Moreover, these methods suffer from a stark accuracy-efficiency trade-off: high performance demands excessive computation, while fast variants compromise quality. We present Memo-SQL, a training-free framework that addresses these issues through two simple ideas: structured decomposition and experience-aware self-correction. Instead of leaving decomposition to chance, we apply three clear strategies, entity-wise, hierarchical, and atomic sequential, to encourage diverse reasoning. For correction, we build a dynamic memory of both successful queries and historical error-fix pairs, and use retrieval-augmented prompting to bring relevant examples into context at inference time, no fine-tuning or external APIs required. On BIRD, Memo-SQL achieves 68.5% execution accuracy, setting a new state of the art among open, zero-fine-tuning methods, while using over 10 times fewer resources than prior TTS approaches.
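The experience-memory idea can be sketched as a tiny store that mixes successful queries with error-fix pairs and retrieves the closest ones for the prompt. This is an illustration under simplifying assumptions, not Memo-SQL's implementation: similarity is plain token-overlap (Jaccard) rather than whatever retriever the paper uses, and the class and function names are hypothetical.

```python
# Minimal sketch of retrieval over successes and error-fix pairs.
# Assumptions: Jaccard token overlap as similarity; names are illustrative.

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class Memory:
    def __init__(self):
        self.entries = []

    def add_success(self, question, sql):
        self.entries.append({"kind": "success", "question": question, "sql": sql})

    def add_error_fix(self, question, bad_sql, fixed_sql):
        self.entries.append({"kind": "error_fix", "question": question,
                             "bad_sql": bad_sql, "sql": fixed_sql})

    def retrieve(self, question, top_n=2):
        """Return the top_n stored entries most similar to the question."""
        q = tokens(question)
        ranked = sorted(self.entries,
                        key=lambda e: jaccard(q, tokens(e["question"])),
                        reverse=True)
        return ranked[:top_n]

def build_prompt(memory, question, top_n=2):
    """Assemble retrieved examples into a prompt prefix for the new question."""
    lines = []
    for e in memory.retrieve(question, top_n):
        lines.append(f"Q: {e['question']}")
        if e["kind"] == "error_fix":
            lines.append(f"Wrong SQL: {e['bad_sql']}")
        lines.append(f"Correct SQL: {e['sql']}")
    lines.append(f"Q: {question}")
    return "\n".join(lines)

mem = Memory()
mem.add_success("list all customers in Paris",
                "SELECT * FROM customers WHERE city = 'Paris'")
mem.add_error_fix("count orders per customer",
                  "SELECT customer_id, COUNT(*) FROM orders",
                  "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id")
print(build_prompt(mem, "count orders per product", top_n=1))
```

Showing the wrong query next to its fix is what distinguishes this from a memory of correct examples only: the prompt carries a concrete mistake to avoid.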

[AI-54] Chinese Labor Law Large Language Model Benchmark

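[Quick Read]: This paper addresses the limited knowledge precision, weak complex reasoning, and poor contextual sensitivity of general-purpose large language models (such as GPT-4) in specialized legal subdomains, in particular Chinese labor law. The key to the solution is LabourLawLLM, a domain-specific LLM for Chinese labor law, together with LabourLawBench, a multi-task benchmark. Evaluation combines objective metrics (such as ROUGE-L, accuracy, F1, and soft-F1) with subjective scoring by GPT-4 and shows that the model performs better across several labor-law tasks. The methodology offers a scalable basis for building specialized LLMs in other legal subfields.

Link: https://arxiv.org/abs/2601.09972
Authors: Zixun Lan,Maochun Xu,Yifan Ren,Rui Wu,Jianghui Zhou,Xueyang Cheng,Jianan Ding Ding,Xinheng Wang,Mingmin Chi,Fei Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: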

Abstract:Recent advances in large language models (LLMs) have led to substantial progress in domain-specific applications, particularly within the legal domain. However, general-purpose models such as GPT-4 often struggle with specialized subdomains that require precise legal knowledge, complex reasoning, and contextual sensitivity. To address these limitations, we present LabourLawLLM, a legal large language model tailored to Chinese labor law. We also introduce LabourLawBench, a comprehensive benchmark covering diverse labor-law tasks, including legal provision citation, knowledge-based question answering, case classification, compensation computation, named entity recognition, and legal case analysis. Our evaluation framework combines objective metrics (e.g., ROUGE-L, accuracy, F1, and soft-F1) with subjective assessment based on GPT-4 scoring. Experiments show that LabourLawLLM consistently outperforms general-purpose and existing legal-specific LLMs across task categories. Beyond labor law, our methodology provides a scalable approach for building specialized LLMs in other legal subfields, improving accuracy, reliability, and societal value of legal AI applications.

[AI-55] A Sustainable AI Economy Needs Data Deals That Work for Generators NEURIPS2025 NEURIPS

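[Quick Read]: This paper addresses a structural unsustainability in the machine learning value chain, rooted in an economic data processing inequality: along the cycle from input data to model weights to synthetic outputs, technical signal is progressively refined while data generators are stripped of economic equity. An analysis of 73 public data deals finds that most of the value accrues to data aggregators, that creator royalties round to zero, and that deal terms are widely opaque. This is more than a question of economic welfare, because it endangers the feedback loop that sustains current learning algorithms. The paper identifies three structural faults (missing provenance, asymmetric bargaining power, and non-dynamic pricing) and proposes the Equitable Data-Value Exchange (EDVEX) framework, whose key aim is a minimal market mechanism that benefits all participants.

Link: https://arxiv.org/abs/2601.09966
Authors: Ruoxi Jia,Luis Oala,Wenjie Xiong,Suqin Ge,Jiachen T. Wang,Feiyang Kang,Dawn Song
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Published at NeurIPS 2025 ( this https URL )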

Abstract:We argue that the machine learning value chain is structurally unsustainable due to an economic data processing inequality: each state in the data cycle from inputs to model weights to synthetic outputs refines technical signal but strips economic equity from data generators. We show, by analyzing seventy-three public data deals, that the majority of value accrues to aggregators, with documented creator royalties rounding to zero and widespread opacity of deal terms. This is not just an economic welfare concern: as data and its derivatives become economic assets, the feedback loop that sustains current learning algorithms is at risk. We identify three structural faults - missing provenance, asymmetric bargaining power, and non-dynamic pricing - as the operational machinery of this inequality. In our analysis, we trace these problems along the machine learning value chain and propose an Equitable Data-Value Exchange (EDVEX) Framework to enable a minimal market that benefits all participants. Finally, we outline research directions where our community can make concrete contributions to data deals and contextualize our position with related and orthogonal viewpoints.

[AI-56] Kinematic Tokenization: Optimization-Based Continuous-Time Tokens for Learnable Decision Policies in Noisy Time Series

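[Quick Read]: This paper addresses the brittleness of discrete tokenization when modeling continuous time-series signals at low signal-to-noise ratios, especially when downstream objectives impose asymmetric penalties that rationally encourage abstention; in that setting, conventional methods tend to collapse to a conservative cash-holding policy (the Liquidation Equilibrium). The key to the solution is Kinematic Tokenization, an optimization-based continuous-time representation that reconstructs an explicit spline from noisy observations and uses local spline coefficients (position, velocity, acceleration, jerk) as tokens. Experiments on multi-asset daily equity data show that, unlike several discrete baselines, continuous spline tokens sustain calibrated, non-trivial action distributions and stable policies, which suggests better learnability and calibration of selective decision policies in noisy settings.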

Link: https://arxiv.org/abs/2601.09949
Authors: Griffin Kearney
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:Transformers are designed for discrete tokens, yet many real-world signals are continuous processes observed through noisy sampling. Discrete tokenizations (raw values, patches, finite differences) can be brittle in low signal-to-noise regimes, especially when downstream objectives impose asymmetric penalties that rationally encourage abstention. We introduce Kinematic Tokenization, an optimization-based continuous-time representation that reconstructs an explicit spline from noisy measurements and tokenizes local spline coefficients (position, velocity, acceleration, jerk). This is applied to financial time series data in the form of asset prices in conjunction with trading volume profiles. Across a multi-asset daily-equity testbed, we use a risk-averse asymmetric classification objective as a stress test for learnability. Under this objective, several discrete baselines collapse to an absorbing cash policy (the Liquidation Equilibrium), whereas the continuous spline tokens sustain calibrated, non-trivial action distributions and stable policies. These results suggest that explicit continuous-time tokens can improve the learnability and calibration of selective decision policies in noisy time series under abstention-inducing losses.
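To make the token contents concrete, here is a minimal sketch of reading (position, velocity, acceleration, jerk) off one spline segment. It assumes a cubic segment whose coefficients have already been fit; the optimization-based fitting step from the paper is not reproduced, and the function name and coefficient layout are hypothetical.

```python
def kinematic_token(coeffs, t):
    """Return (position, velocity, acceleration, jerk) of a cubic at time t.

    coeffs = (c0, c1, c2, c3) for p(t) = c0 + c1*t + c2*t**2 + c3*t**3.
    This layout is an illustrative assumption, not the paper's parameterization.
    """
    c0, c1, c2, c3 = coeffs
    position = c0 + c1 * t + c2 * t**2 + c3 * t**3
    velocity = c1 + 2 * c2 * t + 3 * c3 * t**2   # first derivative
    acceleration = 2 * c2 + 6 * c3 * t           # second derivative
    jerk = 6 * c3                                # third derivative (constant for a cubic)
    return position, velocity, acceleration, jerk
```

The derivatives come from the fitted curve rather than from finite differences of raw samples, which is the property the abstract contrasts with discrete tokenizations.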

[AI-57] Malware Classification using Diluted Convolutional Neural Network with Fast Gradient Sign Method

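[Quick Read]: This paper addresses the detection challenge posed by increasingly sophisticated Android malware, in particular the high computational cost and low efficiency of traditional methods that depend on large feature sets. The key to the solution is FGSM DICNN, a classification model that combines a Diluted Convolutional Neural Network (DICNN; the operation described is what is usually called dilated convolution) with the Fast Gradient Sign Method (FGSM). DICNN enlarges the receptive field through dilated convolutions, capturing malicious patterns dispersed over long ranges without adding parameters, while the FGSM strategy injects one-step perturbations during training to improve robustness and classification accuracy at low computational cost. The model reaches 99.44% accuracy, outperforming existing deep neural network approaches such as the Custom Deep Neural Network (DCNN).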

Link: https://arxiv.org/abs/2601.09933
Authors: Ashish Anand, Bhupendra Singh, Sunil Khemka, Bireswar Banerjee, Vishi Singh Bhatia, Piyush Ranjan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Accepted at the 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON). Keywords: data security, diluted convolutional neural network, fast gradient sign method, malware classification, privacy

Abstract:Android malware has become an increasingly critical threat to organizations, society, and individuals, posing significant risks to privacy, data security, and infrastructure. As malware continues to evolve in complexity and sophistication, mitigating and detecting these malicious software instances has become more time-consuming and challenging, particularly because a large number of features is required to identify potential malware. To address these challenges, this research proposes the Fast Gradient Sign Method with Diluted Convolutional Neural Network (FGSM DICNN) method for malware classification. DICNN contains diluted convolutions, which increase the receptive field, enabling the model to capture dispersed malware patterns across long ranges using fewer features without adding parameters. Additionally, the FGSM strategy enhances accuracy by using one-step perturbations during training, which provide a defensive advantage at lower computational cost. This integration helps maintain high classification accuracy while reducing the dependence on extensive feature sets. The proposed FGSM DICNN model attains 99.44% accuracy, outperforming other existing approaches such as the Custom Deep Neural Network (DCNN).
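The one-step perturbation at the heart of FGSM can be sketched in a few lines: move each input feature by a fixed step in the direction of the sign of the loss gradient. The example below uses a toy logistic-regression model as a stand-in, since the abstract does not give DICNN's architecture; the function names and the parameter `eps` are illustrative assumptions.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_perturb(x, y, w, b, eps):
    """One-step FGSM example for a logistic-regression model (a stand-in model).

    x: feature list, y: label in {0, 1}, w: weights, b: bias, eps: step size.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    # For binary cross-entropy, the gradient with respect to the input is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    # Step each feature by eps in the direction that increases the loss.
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

In adversarial training, perturbed inputs like these are fed back into the training loop alongside the clean ones; because only one gradient evaluation is needed per example, the added cost stays low.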

[AI-58] Hallucination Detection and Mitigation in Large Language Models

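[Quick Read]: This paper addresses the reliability risk that hallucination (generated content that is factually incorrect or unsupported) poses when generative AI is applied in high-stakes domains such as finance and law. The key to the solution is a continuous improvement cycle driven by root cause awareness: hallucination sources are categorized into model-, data-, and context-related factors so that interventions can be targeted. The framework combines multi-faceted detection methods (e.g., uncertainty estimation, reasoning consistency) with stratified mitigation strategies (e.g., knowledge grounding, confidence calibration), forming a closed feedback loop across model, context, and data tiers that systematically improves the trustworthiness and scalability of generative AI in regulated environments.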

Link: https://arxiv.org/abs/2601.09929
Authors: Ahmad Pesaranghader, Erin Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) and Large Reasoning Models (LRMs) offer transformative potential for high-stakes domains like finance and law, but their tendency to hallucinate, generating factually incorrect or unsupported content, poses a critical reliability risk. This paper introduces a comprehensive operational framework for hallucination management, built on a continuous improvement cycle driven by root cause awareness. We categorize hallucination sources into model, data, and context-related factors, allowing targeted interventions over generic fixes. The framework integrates multi-faceted detection methods (e.g., uncertainty estimation, reasoning consistency) with stratified mitigation strategies (e.g., knowledge grounding, confidence calibration). We demonstrate its application through a tiered architecture and a financial data extraction case study, where model, context, and data tiers form a closed feedback loop for progressive reliability enhancement. This approach provides a systematic, scalable methodology for building trustworthy generative AI systems in regulated environments.
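One simple way to read "reasoning consistency" as a detection signal is to sample several answers to the same question and measure how often they agree. The sketch below is only that reading, under the assumption that answers can be compared after light normalization; it is not the framework's actual detector, and the function name is hypothetical.

```python
from collections import Counter


def consistency_score(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivial formatting differences do not count as disagreement.
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```

A low agreement fraction could then be treated as a flag for review, for example before a value extracted from a financial filing is accepted.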

[AI-59] CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

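[Quick Read]: This paper addresses the security of Computer Use Agents (CUAs) against prompt injection attacks, in which malicious content hijacks agent behavior and can lead to credential theft or financial loss. The established defense is architectural isolation, which strictly separates trusted task planning from untrusted environment observations, but it is hard to apply to CUAs because they must continuously observe user interface (UI) state to choose each action. The key to the solution is Single-Shot Planning: before observing any potentially malicious content, a trusted planner generates a complete execution graph with conditional branches, providing provable control flow integrity guarantees against arbitrary instruction injections. The authors further show that additional measures are needed against Branch Steering attacks, which manipulate UI elements to trigger unintended but valid paths within the plan. On the OSWorld benchmark, the design retains up to 57% of frontier-model performance while improving smaller open-source models by up to 19%, indicating that security and utility can coexist in CUAs.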

Link: https://arxiv.org/abs/2601.09923
Authors: Hanna Foerster, Robert Mullins, Tom Blanchard, Nicolas Papernot, Kristina Nikolić, Florian Tramèr, Ilia Shumailov, Cheng Zhang, Yiren Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. The only known robust defense is architectural isolation that strictly separates trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs) – systems that automate tasks by viewing screens and executing actions – presents a fundamental challenge: current agents require continuous observation of UI state to determine each action, conflicting with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content, providing provable control flow integrity guarantees against arbitrary instruction injections. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks, which manipulate UI elements to trigger unintended valid paths within the plan. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs.
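The control-flow idea can be illustrated with a small sketch: the plan is a graph fixed before any observation, and an observation may only select among branches the planner already declared. Everything here (the `Node` type, `run_plan`, the `PLAN` contents, the branch labels) is a hypothetical illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    action: str                                   # fixed by the trusted planner
    branches: dict = field(default_factory=dict)  # observation label -> next node id


def run_plan(plan, start, observe):
    """Walk a precomputed plan; untrusted observations can only pick a declared branch."""
    trace, node_id = [], start
    while node_id is not None:
        node = plan[node_id]
        trace.append(node.action)
        if not node.branches:          # leaf node: the plan is finished
            break
        label = observe(node_id)       # untrusted input from the screen
        if label not in node.branches:
            # Undeclared labels are rejected, so injected text cannot add new actions.
            raise ValueError(f"undeclared branch {label!r} at node {node_id!r}")
        node_id = node.branches[label]
    return trace


# Hypothetical plan built before the agent looks at any screen content.
PLAN = {
    "open": Node("open_settings", {"dialog": "dismiss", "ready": "toggle"}),
    "dismiss": Node("close_dialog", {"ready": "toggle"}),
    "toggle": Node("toggle_wifi"),
}
```

Note what the sketch does not prevent: an attacker who can make `observe` return a valid label such as `"dialog"` still steers execution down a declared path, which is the Branch Steering risk the abstract says needs additional measures.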

[AI-60] Continuum Memory Architectures for Long-Horizon LLM Agents

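[Quick Read]: This paper addresses the lack of persistent memory in retrieval-augmented generation (RAG) systems over long-running interactions: RAG treats memory as a stateless lookup table and therefore cannot update stored information, maintain temporal continuity, or adjust context dynamically. The key to the solution is the Continuum Memory Architecture (CMA), which maintains and evolves internal state across interactions through persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher-order abstractions, thereby supporting knowledge accumulation, memory mutation, and contextual disambiguation. The empirical probes indicate that CMA is a necessary architectural primitive for long-horizon agents.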

Link: https://arxiv.org/abs/2601.09913
Authors: Joe Logan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 10 pages

Abstract:Retrieval-augmented generation (RAG) has become the default strategy for providing large language model (LLM) agents with contextual knowledge. Yet RAG treats memory as a stateless lookup table: information persists indefinitely, retrieval is read-only, and temporal continuity is absent. We define the Continuum Memory Architecture (CMA), a class of systems that maintain and update internal state across interactions through persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher-order abstractions. Rather than disclosing implementation specifics, we specify the architectural requirements CMA imposes and show consistent behavioral advantages on tasks that expose RAG’s structural inability to accumulate, mutate, or disambiguate memory. The empirical probes (knowledge updates, temporal association, associative recall, contextual disambiguation) demonstrate that CMA is a necessary architectural primitive for long-horizon agents while highlighting open challenges around latency, drift, and interpretability.

[AI-61] A Novel Contrastive Loss for Zero-Day Network Intrusion Detection

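[Quick Read]: This paper addresses the sharp performance drop of conventional machine learning methods when facing zero-day attacks, i.e., their poor ability to recognize attack classes absent from the training data. The key to the solution is a new contrastive loss function that retains the robustness to imbalanced data of other contrastive learning approaches while also generalizing to zero-day attacks. Unlike anomaly detectors trained only on benign traffic, the model learns the distribution of benign traffic from both benign samples and known malicious samples (excluding the zero-day class). On the Lycos2017 dataset it improves AUROC by 0.000065 on known attacks and by 0.060883 on zero-day attacks, and it improves OpenAUC by 0.170883 in open-set recognition.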

Link: https://arxiv.org/abs/2601.09902
Authors: Jack Wilkie, Hanan Hindy, Craig Michie, Christos Tachtatzis, James Irvine, Robert Atkinson
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Published in: IEEE Transactions on Network and Service Management (TNSM), 2026. Official version: this https URL. Code: this https URL

Abstract:Machine learning has achieved state-of-the-art results in network intrusion detection; however, its performance significantly degrades when confronted by a new attack class – a zero-day attack. In simple terms, classical machine learning-based approaches are adept at identifying attack classes on which they have been previously trained, but struggle with those not included in their training data. One approach to addressing this shortcoming is to utilise anomaly detectors which train exclusively on benign data with the goal of generalising to all attack classes – both known and zero-day. However, this comes at the expense of a prohibitively high false positive rate. This work proposes a novel contrastive loss function which is able to maintain the advantages of other contrastive learning-based approaches (robustness to imbalanced data) but can also generalise to zero-day attacks. Unlike anomaly detectors, this model learns the distributions of benign traffic using both benign and known malign samples, i.e. other well-known attack classes (not including the zero-day class), and consequently, achieves significant performance improvements. The proposed approach is experimentally verified on the Lycos2017 dataset where it achieves an AUROC improvement of .000065 and .060883 over previous models in known and zero-day attack detection, respectively. Finally, the proposed method is extended to open-set recognition achieving OpenAUC improvements of .170883 over existing approaches.

[AI-62] Beyond Rule-Based Workflows: An Information-Flow-Orchestrated Multi-Agents Paradigm via Agent-to-Agent Communication from CORAL

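[Quick Read]: This paper addresses the limitations of current large language model (LLM)-based multi-agent systems (MAS) that rely on predefined workflows: hand-designed task-state enumerations and routing rules cannot cover the full state space of complex real-world tasks, and they are costly to maintain and hard to extend. The key to the solution is an information-flow-orchestrated multi-agent paradigm in which a dedicated information flow orchestrator continuously monitors task progress and dynamically coordinates the other agents in natural language through an Agent-to-Agent (A2A) communication toolkit, removing the dependence on predefined workflows and enabling more flexible task execution and more robust handling of edge cases.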

Link: https://arxiv.org/abs/2601.09883
Authors: Xinxing Ren, Quagmire Zang, Caelum Forder, Suman Deb, Ahsen Tahir, Roman J. Georgio, Peter Carroll, Zekun Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Most existing Large Language Model (LLM)-based Multi-Agent Systems (MAS) rely on predefined workflows, where human engineers enumerate task states in advance and specify routing rules and contextual injections accordingly. Such workflow-driven designs are essentially rule-based decision trees, which suffer from two fundamental limitations: they require substantial manual effort to anticipate and encode possible task states, and they cannot exhaustively cover the state space of complex real-world tasks. To address these issues, we propose an Information-Flow-Orchestrated Multi-Agent Paradigm via Agent-to-Agent (A2A) Communication from CORAL, in which a dedicated information flow orchestrator continuously monitors task progress and dynamically coordinates other agents through the A2A toolkit using natural language, without relying on predefined workflows. We evaluate our approach on the general-purpose benchmark GAIA, using the representative workflow-based MAS OWL as the baseline while controlling for agent roles and underlying models. Under the pass@1 setting, our method achieves 63.64% accuracy, outperforming OWL’s 55.15% by 8.49 percentage points with comparable token consumption. Further case-level analysis shows that our paradigm enables more flexible task monitoring and more robust handling of edge cases. Our implementation is publicly available at: this https URL

[AI-63] Epistemology gives a Future to Complementarity in Human-AI Interactions

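[Quick Read]: This paper addresses the theoretical vagueness of human-AI complementarity and the difficulty of achieving it empirically. The concept currently lacks precise theoretical anchoring, is formalized only as a post hoc indicator of relative predictive accuracy, and ignores both other desiderata of human-AI interaction and the magnitude-cost profile of the performance gain, which makes it hard to obtain in practice. The key to the solution is to draw on epistemology and reframe complementarity within the discourse on justificatory AI. Building on computational reliabilism, the authors argue that historical instances of complementarity serve as evidence that a given human-AI interaction is a reliable epistemic process for a given predictive task. Together with other reliability indicators (such as the team's alignment with epistemic standards and socio-technical practices), complementarity helps assess the overall reliability of human-AI teams when generating predictions, supporting the practical reasoning of those affected by the outputs, such as patients, managers, and regulators.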

Link: https://arxiv.org/abs/2601.09871
Authors: Andrea Ferrario, Alessandro Facchini, Juan M. Durán
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Submitted to FAccT 2026

Abstract:Human-AI complementarity is the claim that a human supported by an AI system can outperform either alone in a decision-making process. Since its introduction in the human-AI interaction literature, it has gained traction by generalizing the reliance paradigm and by offering a more practical alternative to the contested construct of ‘trust in AI.’ Yet complementarity faces key theoretical challenges: it lacks precise theoretical anchoring, it is formalized just as a post hoc indicator of relative predictive accuracy, it remains silent about other desiderata of human-AI interactions and it abstracts away from the magnitude-cost profile of its performance gain. As a result, complementarity is difficult to obtain in empirical settings. In this work, we leverage epistemology to address these challenges by reframing complementarity within the discourse on justificatory AI. Drawing on computational reliabilism, we argue that historical instances of complementarity function as evidence that a given human-AI interaction is a reliable epistemic process for a given predictive task. Together with other reliability indicators assessing the alignment of the human-AI team with the epistemic standards and socio-technical practices, complementarity contributes to the degree of reliability of human-AI teams when generating predictions. This supports the practical reasoning of those affected by these outputs – patients, managers, regulators, and others. In summary, our approach suggests that the role and value of complementarity lies not in providing a relative measure of predictive accuracy, but in helping calibrate decision-making to the reliability of AI-supported processes that increasingly shape everyday life.

[AI-64] A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

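[Quick Read]: This paper addresses the fragmentation of research on anthropomorphisation in large language model (LLM)-based conversational agents (CAs), in particular the governance difficulties caused by inconsistent definitions, operationalizations, and ethical evaluation criteria. The key to the solution is a systematic scoping review that integrates ethically oriented work across domains, clarifies the conceptual foundations of anthropomorphisation, identifies its ethical challenges and opportunities, and maps methodological approaches, yielding actionable guidance for evidence-based ethical governance and for the responsible design and deployment of anthropomorphic cues in LLM-based CAs.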

Link: https://arxiv.org/abs/2601.09869
Authors: Andrea Ferrario, Rasita Vinay, Matteo Casserini, Alessandro Facchini
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Submitted to FAccT 2026

Abstract:Anthropomorphisation – the phenomenon whereby non-human entities are ascribed human-like qualities – has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference, epistemic and affective expressions that empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.

[AI-65] Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment

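[Quick Read]: This paper addresses the high computational, memory, and energy demands of deploying large language models (LLMs) on resource-constrained edge devices. The core challenges are acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference without losing task performance. The key to the solution is an integrated framework that combines GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process. Using knowledge distillation via Kullback-Leibler (KL) divergence, Bayesian hyperparameter optimization, and the Muon optimizer, the pipeline achieves up to 2x memory compression (e.g., reducing a 6GB model to 3GB) while preserving or enhancing task performance. Experiments show better results on standard LLM benchmarks than GPTQ quantization alone, with the Muon optimizer making fine-tuned models more resistant to accuracy decay during quantization.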

Link: https://arxiv.org/abs/2601.09865
Authors: Jacob Sander, Brian Jalaian, Venkat R. Dasari
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures

Abstract:Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Optimizing these models requires addressing three key challenges: acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference while reducing resource demands. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process to significantly reduce model size and complexity while preserving or enhancing task-specific performance. By leveraging data distillation, knowledge distillation via Kullback-Leibler divergence, Bayesian hyperparameter optimization, and the Muon optimizer, our pipeline achieves up to 2x memory compression (e.g., reducing a 6GB model to 3GB) and enables efficient inference for specialized tasks. Empirical results demonstrate superior performance on standard LLM benchmarks compared to GPTQ quantization alone, with the Muon optimizer notably enhancing fine-tuned models’ resistance to accuracy decay during quantization.
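The knowledge-distillation term mentioned above can be sketched as a KL divergence between softened teacher and student output distributions. The direction KL(teacher || student) and the temperature parameter are common choices assumed here for illustration; the abstract does not specify either, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions (assumed form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is what lets a smaller or quantized student be pulled toward the original model's behavior.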

[AI-66] A pipeline for enabling path-specific causal fairness in observational health data

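[Quick Read]: This paper addresses the risk that machine learning models in healthcare replicate or exacerbate existing health biases, and in particular how to distinguish and handle direct and indirect sources of bias during training. The key to the solution is a model-agnostic pipeline for training causally fair models: it maps the structural fairness model to the observational healthcare setting and explicitly considers the specific healthcare context and disparities to define a target "fair" model. A foundation model trained without fairness constraints is then used to generate causally fair predictions on tasks with known social and medical disparities, so that direct discrimination (e.g., clinician bias) and indirect bias (e.g., differential access to the healthcare system) are modeled and mitigated jointly.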

Link: https://arxiv.org/abs/2601.09841
Authors: Aparajita Kashyap, Sara Matijevic, Noémie Elhadad, Steven A. Kushner, Shalmali Joshi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:When training machine learning (ML) models for potential deployment in a healthcare setting, it is essential to ensure that they do not replicate or exacerbate existing healthcare biases. Although many definitions of fairness exist, we focus on path-specific causal fairness, which allows us to better consider the social and medical contexts in which biases occur (e.g., direct discrimination by a clinician or model versus bias due to differential access to the healthcare system) and to characterize how these biases may appear in learned models. In this work, we map the structural fairness model to the observational healthcare setting and create a generalizable pipeline for training causally fair models. The pipeline explicitly considers specific healthcare context and disparities to define a target “fair” model. Our work fills two major gaps: first, we expand on characterizations of the “fairness-accuracy” tradeoff by detangling direct and indirect sources of bias and jointly presenting these fairness considerations alongside considerations of accuracy in the context of broadly known biases. Second, we demonstrate how a foundation model trained without fairness constraints on observational health data can be leveraged to generate causally fair downstream predictions in tasks with known social and medical disparities. This work presents a model-agnostic pipeline for training causally fair machine learning models that address both direct and indirect forms of healthcare bias.

[AI-67] LLM-Based Agentic Systems for Software Engineering: Challenges and Opportunities

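[Quick Read]: This paper addresses the insufficient collaboration and specialization of current large language models (LLMs) on complex software engineering (SE) tasks. The key to the solution is a systematic examination of LLM-based multi-agent systems, covering language model selection, SE evaluation benchmarks, state-of-the-art agentic frameworks, and communication protocols, with agents collaborating across every stage of the Software Development Life Cycle (SDLC) to improve the automation and specialization of tasks such as code generation, static code checking, testing, and debugging.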

Link: https://arxiv.org/abs/2601.09822
Authors: Yongjian Tang, Thomas Runkler
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to GenSE 2026 workshop

Abstract:Despite recent advancements in Large Language Models (LLMs), complex Software Engineering (SE) tasks require more collaborative and specialized approaches. This concept paper systematically reviews the emerging paradigm of LLM-based multi-agent systems, examining their applications across the Software Development Life Cycle (SDLC), from requirements engineering and code generation to static code checking, testing, and debugging. We delve into a wide range of topics such as language model selection, SE evaluation benchmarks, state-of-the-art agentic frameworks and communication protocols. Furthermore, we identify key challenges and outline future research opportunities, with a focus on multi-agent orchestration, human-agent coordination, computational cost optimization, and effective data collection. This work aims to provide researchers and practitioners with valuable insights into the current forefront landscape of agentic systems within the software engineering domain.

[AI-68] QFed: Parameter-Compact Quantum-Classical Federated Learning

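[Quick Read]: This paper addresses the efficiency bottlenecks that federated learning (FL) faces from statistical heterogeneity, system diversity, and the computational burden of complex models. The key to the solution is QFed, a quantum-assisted federated learning framework that uses quantum computing to reduce the parameter count of classical models by polylogarithmic factors, lowering training overhead while keeping accuracy comparable to conventional approaches, which makes it suited to scalable deployment across edge device networks.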

Link: https://arxiv.org/abs/2601.09809
Authors: Samar Abdelghani, Soumaya Cherkaoui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Organizations and enterprises across domains such as healthcare, finance, and scientific research are increasingly required to extract collective intelligence from distributed, siloed datasets while adhering to strict privacy, regulatory, and sovereignty requirements. Federated Learning (FL) enables collaborative model building without sharing sensitive raw data, but faces growing challenges posed by statistical heterogeneity, system diversity, and the computational burden from complex models. This study examines the potential of quantum-assisted federated learning, which could cut the number of parameters in classical models by polylogarithmic factors and thus lessen training overhead. Accordingly, we introduce QFed, a quantum-enabled federated learning framework aimed at boosting computational efficiency across edge device networks. We evaluate the proposed framework using the widely adopted FashionMNIST dataset. Experimental results show that QFed achieves a 77.6% reduction in the parameter count of a VGG-like model while maintaining an accuracy comparable to classical approaches in a scalable environment. These results point to the potential of leveraging quantum computing within a federated learning context to strengthen FL capabilities of edge devices.

[AI-69] Improving Chain-of-Thought for Logical Reasoning via Attention-Aware Intervention EACL2026

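[Quick Read]: This paper addresses the scalability limits of current large language model (LLM) approaches to logical reasoning, which depend on complex interactive frameworks or external resources. Existing methods either decompose reasoning into subtasks solved with carefully designed prompts or call symbolic solvers to exploit strong logical structure; the former adds computational overhead and the latter depends on the availability of external components. The paper proposes a non-interactive, end-to-end reasoning framework built on one finding: introducing structural information into the few-shot prompt activates a subset of attention heads whose patterns align with logical reasoning operators. Based on this, the authors propose Attention-Aware Intervention (AAI), which reweights attention scores across the identified heads at inference time to steer the model toward leveraging prior knowledge. AAI improves logical reasoning performance across diverse benchmarks and model architectures with negligible computational overhead.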

Link: https://arxiv.org/abs/2601.09805
Authors: Nguyen Minh Phuong, Dang Huu Tien, Naoya Inoue
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Findings of EACL 2026

Abstract:Modern logical reasoning with LLMs primarily relies on employing complex interactive frameworks that decompose the reasoning process into subtasks solved through carefully designed prompts or requiring external resources (e.g., symbolic solvers) to exploit their strong logical structures. While interactive approaches introduce additional overhead, hybrid approaches depend on external components, which limit their scalability. A non-interactive, end-to-end framework enables reasoning to emerge within the model itself – improving generalization while preserving analyzability without any external resources. In this work, we introduce a non-interactive, end-to-end framework for reasoning tasks. We show that introducing structural information into the few-shot prompt activates a subset of attention heads whose patterns align with logical reasoning operators. Building on this insight, we propose Attention-Aware Intervention (AAI), an inference-time intervention method that reweights attention scores across selected heads identified by their logical patterns. AAI offers an efficient way to steer the model’s reasoning toward leveraging prior knowledge through attention modulation. Extensive experiments show that AAI enhances logical reasoning performance across diverse benchmarks and model architectures, while incurring negligible additional computational overhead. Code is available at this https URL.
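As a rough picture of what reweighting attention across selected heads could look like, the sketch below scales the weights that chosen heads place on chosen token positions and then renormalizes each row. The scaling rule, the parameter `alpha`, and the data layout are assumptions made for illustration; the abstract does not give AAI's actual formula.

```python
def reweight_attention(attn, selected_heads, focus_positions, alpha):
    """Boost attention on focus_positions for selected heads, then renormalize.

    attn: dict mapping head index -> list of attention weights that sum to 1.
    alpha: multiplicative factor applied to the focus positions (assumed rule).
    """
    out = {}
    for head, weights in attn.items():
        if head not in selected_heads:
            out[head] = list(weights)   # untouched heads are copied as-is
            continue
        scaled = [w * alpha if i in focus_positions else w
                  for i, w in enumerate(weights)]
        total = sum(scaled)
        out[head] = [w / total for w in scaled]  # keep each row a distribution
    return out
```

An intervention of this kind only rescales and renormalizes existing weights at inference time, which is consistent with the negligible overhead the abstract reports.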

[AI-70] Enhancing LUT-based Deep Neural Networks Inference through Architecture and Connectivity Optimization

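[Quick Read]: This paper addresses how to deploy deep neural networks (DNNs) on resource-constrained edge devices such as FPGAs while balancing latency, power, and hardware resource usage without sacrificing accuracy. Existing lookup table (LUT)-based DNNs (such as LogicNets, PolyLUT, and NeuraLUT) face two core challenges: LUT size grows exponentially, and random sparse connectivity is inefficient. The key to the solution is the SparseLUT framework, which applies two orthogonal optimizations. The first is an architectural enhancement that aggregates multiple PolyLUT sub-neurons through an adder, reducing LUT consumption by 2.0x-13.9x and inference latency by 1.2x-1.6x at comparable accuracy. The second is a non-greedy training algorithm that selectively prunes less significant inputs and strategically regrows more effective connections, improving accuracy with no additional area or latency overhead (up to 2.13% on MNIST and 0.94% on Jet Substructure Classification).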

Link: https://arxiv.org/abs/2601.09773
Authors: Binglei Lou, Ruilin Wu, Philip Leong
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2503.12829, arXiv:2406.04910

Abstract:Deploying deep neural networks (DNNs) on resource-constrained edge devices such as FPGAs requires a careful balance among latency, power, and hardware resource usage, while maintaining high accuracy. Existing Lookup Table (LUT)-based DNNs – such as LogicNets, PolyLUT, and NeuraLUT – face two critical challenges: the exponential growth of LUT size and inefficient random sparse connectivity. This paper presents SparseLUT, a comprehensive framework that addresses these challenges through two orthogonal optimizations. First, we propose an architectural enhancement that aggregates multiple PolyLUT sub-neurons via an adder, significantly reducing LUT consumption by 2.0x-13.9x and lowering inference latency by 1.2x-1.6x, all while maintaining comparable accuracy. Building upon this foundation, we further introduce a non-greedy training algorithm that optimizes neuron connectivity by selectively pruning less significant inputs and strategically regrowing more effective ones. This training optimization, which incurs no additional area and latency overhead, delivers consistent accuracy improvements across benchmarks – achieving up to a 2.13% gain on MNIST and 0.94% on Jet Substructure Classification compared to existing LUT-DNN approaches.
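The exponential growth of LUT size follows from simple counting: a truth table needs one entry per distinct input pattern, so a neuron with F inputs of b bits each needs 2^(F·b) entries. The sketch below compares that with splitting the inputs among several smaller sub-neurons. The even, disjoint split and the omission of the adder's own cost are simplifying assumptions made here, so the ratio is not SparseLUT's measured saving.

```python
def lut_entries(fan_in, bits_per_input):
    """Truth-table entries for one neuron: one per distinct input pattern."""
    return 2 ** (fan_in * bits_per_input)


def split_lut_entries(fan_in, bits_per_input, sub_neurons):
    """Total entries when inputs are divided evenly among sub-neurons.

    Assumes disjoint input groups and ignores the cost of the combining adder.
    """
    if fan_in % sub_neurons:
        raise ValueError("fan_in must divide evenly among sub-neurons")
    return sub_neurons * lut_entries(fan_in // sub_neurons, bits_per_input)
```

With six 2-bit inputs, one table needs 4096 entries while two sub-neurons of three inputs each need 128 in total; adding a single extra input to the monolithic table multiplies its size by four.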

[AI-71] PCN-Rec: Agentic Proof-Carrying Negotiation for Reliable Governance-Constrained Recommendation

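[Quick Read]: This paper addresses the unreliability of current large language model (LLM)-based recommenders in satisfying governance constraints such as minimum long-tail exposure or diversity requirements. The key to the solution is PCN-Rec, a proof-carrying negotiation pipeline that separates natural-language reasoning from deterministic enforcement. A base recommender (matrix factorization or collaborative filtering) produces a candidate window; a User Advocate agent optimizes relevance and a Policy Agent enforces constraints; a mediator LLM then synthesizes a top-N slate together with a structured JSON certificate. A deterministic verifier recomputes all constraints and accepts only certificates that pass; on failure, a deterministic constrained-greedy repair produces a compliant slate for re-verification, yielding an auditable decision trace. On MovieLens-100K, the design achieves a 98.55% pass rate on feasible users with only a 0.021 absolute drop in NDCG@10, compared against a one-shot single-LLM baseline.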

链接: https://arxiv.org/abs/2601.09771
作者: Aradhya Dixit,Shreem Dixit
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern LLM-based recommenders can generate compelling ranked lists, but they struggle to reliably satisfy governance constraints such as minimum long-tail exposure or diversity requirements. We present PCN-Rec, a proof-carrying negotiation pipeline that separates natural-language reasoning from deterministic enforcement. A base recommender (MF/CF) produces a candidate window of size W, which is negotiated by two agents: a User Advocate optimizing relevance and a Policy Agent enforcing constraints. A mediator LLM synthesizes a top-N slate together with a structured certificate (JSON) describing the claimed constraint satisfaction. A deterministic verifier recomputes all constraints from the slate and accepts only verifier-checked certificates; if verification fails, a deterministic constrained-greedy repair produces a compliant slate for re-verification, yielding an auditable trace. On MovieLens-100K with governance constraints, PCN-Rec achieves a 98.55% pass rate on feasible users (n = 551, W = 80) versus a one-shot single-LLM baseline without verification/repair, while preserving utility with only a 0.021 absolute drop in NDCG@10 (0.403 vs. 0.424); differences are statistically significant (p < 0.05).
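The verify-then-repair loop described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the single constraint (at least `min_tail` long-tail items in a slate of size `n`), the function names, and the trace labels are all assumptions made for the example, and the candidate window is assumed to be a duplicate-free list already ordered by relevance.

```python
from typing import Optional


def verify(slate: list[str], long_tail: set[str], n: int, min_tail: int) -> bool:
    """Recompute the (hypothetical) constraint from the slate itself."""
    no_duplicates = len(set(slate)) == len(slate)
    tail_count = sum(item in long_tail for item in slate)
    return len(slate) == n and no_duplicates and tail_count >= min_tail


def greedy_repair(candidates: list[str], long_tail: set[str],
                  n: int, min_tail: int) -> Optional[list[str]]:
    """Constrained-greedy repair over a relevance-ordered candidate window."""
    # Reserve the most relevant long-tail items needed to meet the constraint.
    reserved = [c for c in candidates if c in long_tail][:min_tail]
    if min_tail > n or len(reserved) < min_tail:
        return None  # the constraint cannot be met for this user
    # Fill the remaining slots with the most relevant unreserved candidates.
    fill = [c for c in candidates if c not in reserved][: n - len(reserved)]
    picked = set(reserved) | set(fill)
    slate = [c for c in candidates if c in picked]  # keep relevance order
    return slate if len(slate) == n else None


def accept_or_repair(proposed: list[str], candidates: list[str],
                     long_tail: set[str], n: int, min_tail: int):
    """Return (slate, trace); the trace records each deterministic step."""
    trace = []
    if verify(proposed, long_tail, n, min_tail):
        trace.append("verified")
        return proposed, trace
    trace.append("verification_failed")
    repaired = greedy_repair(candidates, long_tail, n, min_tail)
    if repaired is not None and verify(repaired, long_tail, n, min_tail):
        trace.append("repaired_and_verified")
        return repaired, trace
    trace.append("infeasible")
    return None, trace
```

In this sketch the LLM-produced slate is never trusted directly: it is accepted only if the deterministic check passes, and the repaired slate goes through the same check before it is returned.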

[AI-72] GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

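Quick read: This paper addresses a limitation of visual perception in current graphical user interface (GUI) automation: existing methods rely on static, one-shot visual inputs and passive perception, and cannot decide when, whether, and how to observe the interface according to task needs. The key to its solution is GUI-Eyes, a reinforcement learning framework for active visual perception in which a two-stage reasoning process makes tool-aware strategic decisions, first exploring coarsely and then grounding at fine granularity. It also introduces a spatially continuous reward function that combines location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. The approach improves data efficiency and task accuracy, reaching 44.8% grounding accuracy on the ScreenSpot-Pro benchmark with only 3k labeled samples.

Link: https://arxiv.org/abs/2601.09770
Authors: Chen Chen,Jiawei Shao,Dakuan Lu,Haoyi Hu,Xiangcheng Liu,Hantao Yao,Wu Liu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: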

Abstract:Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.
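To make the idea of a spatially continuous reward concrete, here is a minimal sketch that mixes a location-proximity term with a region-overlap (IoU) term. The abstract does not give the actual formula, so the exponential proximity term, the normalization by the target diagonal, and the mixing weight `alpha` are assumptions chosen only for illustration.

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def proximity(pred: Box, target: Box) -> float:
    """Decays smoothly from 1 as the box centers move apart (assumed form)."""
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tcx, tcy = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    dist = math.hypot(pcx - tcx, pcy - tcy)
    diag = math.hypot(target[2] - target[0], target[3] - target[1])
    return math.exp(-dist / diag) if diag > 0 else 0.0


def tool_reward(pred: Box, target: Box, alpha: float = 0.5) -> float:
    """Dense reward: weighted sum of proximity and overlap (alpha is hypothetical)."""
    return alpha * proximity(pred, target) + (1 - alpha) * iou(pred, target)
```

Unlike a hit-or-miss reward, the proximity term stays positive even when the predicted region does not overlap the target at all, which is one way a reward can remain informative under sparse feedback.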

[AI-73] AI Survival Stories: a Taxonomic Analysis of AI Existential Risk

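Quick read: This paper asks whether AI systems pose an existential risk to humanity, that is, whether they threaten the long-term survival of human civilization. The key to its approach is a general analytical framework that breaks the question into two premises: first, that AI systems will become extremely powerful; second, that if they become extremely powerful, they will destroy humanity. From these premises the authors build a taxonomy of "survival stories", each corresponding to the failure of one premise: scientific barriers prevent AI from becoming extremely powerful; humanity bans the relevant research; extremely powerful AI does not destroy humanity because of its goals; or humans can reliably detect and disable AI systems that aim to do so. The framework clarifies the challenges facing each story, motivates different responses to AI threats, and is finally used to produce rough estimates of P(doom), the probability that humanity will be destroyed by AI.

Link: https://arxiv.org/abs/2601.09765
Authors: Herman Cappelen,Simon Goldstein,John Hawthorne
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: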

Abstract:Since the release of ChatGPT, there has been a lot of debate about whether AI systems pose an existential risk to humanity. This paper develops a general framework for thinking about the existential risk of AI systems. We analyze a two premise argument that AI systems pose a threat to humanity. Premise one: AI systems will become extremely powerful. Premise two: if AI systems become extremely powerful, they will destroy humanity. We use these two premises to construct a taxonomy of survival stories, in which humanity survives into the far future. In each survival story, one of the two premises fails. Either scientific barriers prevent AI systems from becoming extremely powerful; or humanity bans research into AI systems, thereby preventing them from becoming extremely powerful; or extremely powerful AI systems do not destroy humanity, because their goals prevent them from doing so; or extremely powerful AI systems do not destroy humanity, because we can reliably detect and disable systems that have the goal of doing so. We argue that different survival stories face different challenges. We also argue that different survival stories motivate different responses to the threats from AI. Finally, we use our taxonomy to produce rough estimates of P(doom), the probability that humanity will be destroyed by AI.

[AI-74] Explicating Tacit Regulatory Knowledge from LLMs to Auto-Formalize Requirements for Compliance Test Case Generation

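Quick read: This paper addresses compliance testing in highly regulated domains, which is largely manual, inefficient, and error-prone, and for which existing large language model (LLM)-based methods are unreliable because of hallucination. The key to its solution is RAFT, a framework that uses multiple LLMs with an Adaptive Purification-Aggregation strategy to make tacit regulatory knowledge explicit and structure it into three artifacts: a domain meta-model, a formal requirements representation, and testability constraints. These artifacts are dynamically injected into prompts to guide high-precision requirement formalization and automated test case generation. Across financial, automotive, and power domains, RAFT reaches expert-level performance, substantially outperforms state-of-the-art methods, and reduces overall generation and review time.

Link: https://arxiv.org/abs/2601.09762
Authors: Zhiyi Xue,Xiaohong Chen,Min Zhang
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures, 4 tables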

Abstract:Compliance testing in highly regulated domains is crucial but largely manual, requiring domain experts to translate complex regulations into executable test cases. While large language models (LLMs) show promise for automation, their susceptibility to hallucinations limits reliable application. Existing hybrid approaches mitigate this issue by constraining LLMs with formal models, but still rely on costly manual modeling. To solve this problem, this paper proposes RAFT, a framework for requirements auto-formalization and compliance test generation via explicating tacit regulatory knowledge from multiple LLMs. RAFT employs an Adaptive Purification-Aggregation strategy to explicate tacit regulatory knowledge from multiple LLMs and integrate it into three artifacts: a domain meta-model, a formal requirements representation, and testability constraints. These artifacts are then dynamically injected into prompts to guide high-precision requirement formalization and automated test generation. Experiments across financial, automotive, and power domains show that RAFT achieves expert-level performance, substantially outperforms state-of-the-art (SOTA) methods while reducing overall generation and review time.

[AI-75] Investigating Tool-Memory Conflicts in Tool-Augmented LLMs ICML2025

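Quick read: This paper studies a new type of knowledge conflict in tool-augmented large language models (LLMs), the Tool-Memory Conflict (TMC), in which the model's internal parametric knowledge contradicts knowledge from external tools. The study finds that existing LLMs are especially prone to TMC on STEM-related tasks and that they prioritize tool knowledge and parametric knowledge differently under different conditions. It further evaluates mainstream conflict-resolution techniques, including prompting-based and retrieval-augmented generation (RAG)-based methods, and finds that none of them resolves TMC effectively, exposing the limits of current methods in keeping tool and memory knowledge consistent.

Link: https://arxiv.org/abs/2601.09760
Authors: Jiali Cheng,Rui Pan,Hadi Amiri
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: R2-FM Workshop @ ICML 2025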

Abstract:Tool-augmented large language models (LLMs) have powered many applications. However, they are likely to suffer from knowledge conflict. In this paper, we propose a new type of knowledge conflict – Tool-Memory Conflict (TMC), where the internal parametric knowledge contradicts the external tool knowledge for tool-augmented LLMs. We find that existing LLMs, though powerful, suffer from TMC, especially on STEM-related tasks. We also uncover that under different conditions, tool knowledge and parametric knowledge may be prioritized differently. We then evaluate existing conflict-resolving techniques, including prompting-based and RAG-based methods. Results show that none of these approaches can effectively resolve tool-memory conflicts.

[AI-76] Democracy and Distrust in an Era of Artificial Intelligence

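Quick read: This paper addresses the potential threat that AI decision-making systems pose to the rights and interests of minorities, and in particular how judicial review should adapt to the AI era to preserve fairness and accountability. The key to its proposal is to rework the theory of judicial review by drawing on the principles of due process and equal protection and building these legal concepts into the design and operation of AI systems. The result is a framework for judicial review in the AI era that provides oversight of algorithmic discrimination and protects minorities from algorithmic bias.

Link: https://arxiv.org/abs/2601.09757
Authors: Sonia Katyal
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Daedalus, Journal of the American Academy of Arts & Sciences 2022. Available at SSRN: this https URL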

Abstract:This essay examines how judicial review should adapt to address challenges posed by artificial intelligence decision-making, particularly regarding minority rights and interests. As I argue in this essay, the rise of three trends (privatization, prediction, and automation in AI) has combined to pose similar risks to minorities. Here, I outline what a theory of judicial review would look like in an era of artificial intelligence, analyzing both the limitations and the possibilities of judicial review of AI. I draw on cases in which AI decision-making has been challenged in courts, to show how concepts of due process and equal protection can be recuperated in a modern AI era, and even integrated into AI, to provide for better oversight and accountability, offering a framework for judicial review in the AI era that protects minorities from algorithmic discrimination.

[AI-77] Heterogeneous computing platform for real-time robotics

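Quick read: This paper addresses how to build an efficient, real-time, highly interactive robotic system for human-robot collaboration in future smart cities (Society 5.0). The core problem is to let robots perceive and respond quickly to dynamic events in complex environments while also carrying out high-level cognitive tasks such as language understanding and task planning. The key to its solution is a heterogeneous computing architecture: neuromorphic hardware (the Loihi2 processor) paired with event-based cameras handles low-latency sensing and real-time interaction, while a local AI compute cluster (GPUs) handles high-level language processing and task planning. According to the authors, integrating these different computing paradigms improves overall performance and responsiveness and offers a feasible path for next-generation autonomous robots in real-world applications.

Link: https://arxiv.org/abs/2601.09755
Authors: Jakub Fil,Yulia Sandamirskaya,Hector Gonzalez,Loïc Azzalin,Stefan Glüge,Lukas Friedenstab,Friedrich Wolf,Tim Rosmeisl,Matthias Lohrmann,Mahmoud Akl,Khaleel Khan,Leonie Wolf,Kristin Richter,Holm Puder,Mazhar Ali Bari,Xuan Choo,Noha Alharthi,Michael Hopkins,Mansoor Hanif,Christian Mayr,Jens Struckmeier,Steve Furber
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: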

Abstract:After Industry 4.0 has embraced tight integration between machinery (OT), software (IT), and the Internet, creating a web of sensors, data, and algorithms in service of efficient and reliable production, a new concept of Society 5.0 is emerging, in which infrastructure of a city will be instrumented to increase reliability, efficiency, and safety. Robotics will play a pivotal role in enabling this vision that is pioneered by the NEOM initiative - a smart city, co-inhabited by humans and robots. In this paper we explore the computing platform that will be required to enable this vision. We show how we can combine neuromorphic computing hardware, exemplified by the Loihi2 processor used in conjunction with event-based cameras, for sensing and real-time perception and interaction with a local AI compute cluster (GPUs) for high-level language processing, cognition, and task planning. We demonstrate the use of this hybrid computing architecture in an interactive task, in which a humanoid robot plays a musical instrument with a human. Central to our design is the efficient and seamless integration of disparate components, ensuring that the synergy between software and hardware maximizes overall performance and responsiveness. Our proposed system architecture underscores the potential of heterogeneous computing architectures in advancing robotic autonomy and interactive intelligence, pointing toward a future where such integrated systems become the norm in complex, real-time applications.

[AI-78] Critically Engaged Pragmatism: A Scientific Norm and Social Pragmatist Epistemology for AI Science Evaluation Tools

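Quick read: This paper addresses the risk that automated AI science evaluation tools, whose appeal has grown with crises in peer review capacity, study replication, and AI-fabricated science, will be misused. The core problem is that such tools are readily adopted without clear fitness for purpose or methodological scrutiny, which invites "inference by false ascent" (decontextualizing and repurposing credibility markers in inapt ways). The key to its proposal is a social, pragmatist epistemology and a new norm, Critically Engaged Pragmatism, which asks scientific communities to scrutinize vigorously the purposes of AI science evaluation tools and their purpose-specific reliability. Under this view the tools are not objective arbiters of scientific credibility but objects of the critical discursive practices that ground the credibility of scientific communities.

Link: https://arxiv.org/abs/2601.09753
Authors: Carole J. Lee
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: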

Abstract:Crises in peer review capacity, study replication, and AI-fabricated science have intensified interest in automated tools for assessing scientific research. However, the scientific community has a history of decontextualizing and repurposing credibility markers in inapt ways. I caution that AI science evaluation tools are particularly prone to these kinds of inference by false ascent due to contestation about the purposes to which they should be put, their portability across purposes, and technical demands that prioritize data set size over epistemic fit. To counter this, I argue for a social, pragmatist epistemology and a newly articulated norm of Critically Engaged Pragmatism to enjoin scientific communities to vigorously scrutinize the purposes and purpose-specific reliability of AI science evaluation tools. Under this framework, AI science evaluation tools are not objective arbiters of scientific credibility, but the object of the kinds of critical discursive practices that ground the credibility of scientific communities.

[AI-79] SAGE: Tool-Augmented LLM Task Solving Strategies in Scalable Multi-Agent Environments

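Quick read: This paper addresses the limitations of large language models (LLMs) in real-world use when tools cannot be integrated dynamically: as software ecosystems and domain-specific tools keep evolving, LLMs struggle to understand, call, and use them reliably. The key to its solution is SAGE, a specialized conversational AI interface built on the OPACA framework. OPACA provides tool discovery and execution, so new tools can be added and integrated easily. SAGE itself is extensible and modular: it can switch between LLMs (e.g. GPT, LLAMA), select among prompting methods, and support several agent-based task-solving strategies that use tools in a zero-shot manner.

Link: https://arxiv.org/abs/2601.09750
Authors: Robert K. Strehlow,Tobias Küster,Oskar F. Kupke,Brandon Llanque Kurps,Fikret Sivrikaya,Sahin Albayrak
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: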

Abstract:Large language models (LLMs) have proven to work well in question-answering scenarios, but real-world applications often require access to tools for live information or actuation. For this, LLMs can be extended with tools, which are often defined in advance, also allowing for some fine-tuning for specific use cases. However, rapidly evolving software landscapes and individual services require the constant development and integration of new tools. Domain- or company-specific tools can greatly elevate the usefulness of an LLM, but such custom tools can be problematic to integrate, or the LLM may fail to reliably understand and use them. For this, we need strategies to define new tools and integrate them into the LLM dynamically, as well as robust and scalable zero-shot prompting methods that can make use of those tools in an efficient manner. In this paper, we present SAGE, a specialized conversational AI interface, based on the OPACA framework for tool discovery and execution. The integration with OPACA makes it easy to add new tools or services for the LLM to use, while SAGE itself presents rich extensibility and modularity. This not only provides the ability to seamlessly switch between different models (e.g. GPT, LLAMA), but also to add and select prompting methods, involving various setups of differently prompted agents for selecting and executing tools and evaluating the results. We implemented a number of task-solving strategies, making use of agentic concepts and prompting methods in various degrees of complexity, and evaluated those against a comprehensive set of benchmark services. The results are promising and highlight the distinct strengths and weaknesses of different task-solving strategies. Both SAGE and the OPACA framework, as well as the different benchmark services and results, are available as Open Source/Open Data on GitHub.

[AI-80] R-LAM: Reproducibility-Constrained Large Action Models for Scientific Workflow Automation

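Quick read: This paper addresses the reliability problems that arise when Large Action Models (LAMs) automate scientific workflows without reproducibility, auditability, or deterministic execution. Existing large language model (LLM)-based agents generate actions without constraints, which can cause silent state changes, non-deterministic execution, and irreproducible results that fall short of the rigor scientific research requires. The key to its solution is the R-LAM framework, which uses structured action schemas, deterministic execution policies, and explicit provenance tracking so that every action and intermediate artifact can be audited and replayed. It also supports failure-aware execution loops and controlled workflow forking, preserving reproducibility while keeping iterative experimentation flexible.

Link: https://arxiv.org/abs/2601.09749
Authors: Suriya Sureshkumar
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 1 Table, 2 Artifacts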

Abstract:Large Action Models (LAMs) extend large language models by enabling autonomous decision-making and tool execution, making them promising for automating scientific workflows. However, scientific workflows impose strict requirements on reproducibility, auditability, and deterministic execution, which are not satisfied by generic LLM-based agents. Unconstrained action generation can lead to silent state changes, non-deterministic executions, and irreproducible experimental results, limiting the applicability of LAMs in scientific settings. In this paper, we propose R-LAM, a reproducibility-constrained framework for applying Large Action Models to scientific workflow automation. R-LAM introduces structured action schemas, deterministic execution policies, and explicit provenance tracking to ensure that every action and intermediate artifact is auditable and replayable. The framework supports failure-aware execution loops and controlled workflow forking, enabling iterative experimentation without compromising reproducibility. We implement R-LAM as a lightweight Python framework and release it as an open-source PyPI package to facilitate reproducible research. An experimental evaluation of representative scientific workflows demonstrates that R-LAM improves reproducibility success rates and execution reliability compared to unconstrained LLM-based agents, while retaining adaptive control over workflow execution.

[AI-81] Enhancing Formal Software Specification with Artificial Intelligence

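Quick read: This paper addresses the limited industrial adoption of traditional formal software specification, which stems from the high notation overhead of formal languages and their dependence on expert knowledge. The key to its solution is to use AI: natural language augmented with lightweight mathematical notation, written in LaTeX, serves as an intermediate specification language that AI reviews and refines before code generation. This retains the benefits of formal specification (early error detection, explicit invariants, and correctness by design) while significantly reducing development effort; in the reported case study the implementation was correct on the first attempt.

Link: https://arxiv.org/abs/2601.09745
Authors: Antonio Abu Nassar,Eitan Farchi
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: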

Abstract:Formal software specification is known to enable early error detection and explicit invariants, yet it has seen limited industrial adoption due to its high notation overhead and the expertise required to use traditional formal languages. This paper presents a case study showing that recent advances in artificial intelligence make it possible to retain many of the benefits of formal specification while substantially reducing these costs. We demonstrate the need for a clear distinction between what the system analyst controls, which benefits greatly from the rigor of formal specification, and what need not be controlled. We use natural language augmented with lightweight mathematical notation and written in LaTeX as an intermediate specification language, which is reviewed and refined by AI prior to code generation. Applied to a nontrivial simulation of organizational knowledge growth, this approach enables early validation, explicit invariants, and correctness by design, while significantly reducing development effort and producing a correct implementation on the first attempt.

[AI-82] Formal Safety Guarantees for Autonomous Vehicles using Barrier Certificates AAAI-26

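Quick read: This paper addresses the safety risks that arise when autonomous vehicles rely on data-driven modules lacking interpretability and rigorous safety guarantees in complex, dynamic traffic. The key to its solution is a formally verified safety framework that combines Barrier Certificates (BCs) with an interpretable traffic conflict metric, Time-to-Collision (TTC), verifies the safety conditions with Satisfiability Modulo Theories (SMT) solvers, and adds an adaptive control mechanism that keeps vehicles within these constraints in real time. The approach offers both interpretable and provable safety guarantees. On real-world highway datasets it reduces unsafe events in which TTC falls below 3 seconds by up to 40% and eliminates conflicts entirely in some lanes.

Link: https://arxiv.org/abs/2601.09740
Authors: Oumaima Barhoumi,Mohamed H Zaki,Sofiène Tahar
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted to AAAI-26 Bridge Program B10: Making Embodied AI Reliable with Testing and Formal Verification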

Abstract:Modern AI technologies enable autonomous vehicles to perceive complex scenes, predict human behavior, and make real-time driving decisions. However, these data-driven components often operate as black boxes, lacking interpretability and rigorous safety guarantees. Autonomous vehicles operate in dynamic, mixed-traffic environments where interactions with human-driven vehicles introduce uncertainty and safety challenges. This work develops a formally verified safety framework for Connected and Autonomous Vehicles (CAVs) that integrates Barrier Certificates (BCs) with interpretable traffic conflict metrics, specifically Time-to-Collision (TTC) as a spatio-temporal safety metric. Safety conditions are verified using Satisfiability Modulo Theories (SMT) solvers, and an adaptive control mechanism ensures vehicles comply with these constraints in real time. Evaluation on real-world highway datasets shows a significant reduction in unsafe interactions, with up to 40% fewer events where TTC falls below a 3 seconds threshold, and complete elimination of conflicts in some lanes. This approach provides both interpretable and provable safety guarantees, demonstrating a practical and scalable strategy for safe autonomous driving.
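As a rough illustration of the TTC metric used above, the sketch below computes TTC for a simple car-following pair and evaluates one possible barrier-style function whose sign encodes the 3-second threshold. The one-dimensional car-following setup, the function names, and the specific form `h = gap - ttc_min * closing_speed` are assumptions made for this example; the paper's actual barrier certificates and SMT encoding are not described in the abstract.

```python
import math


def time_to_collision(gap_m: float, v_follow: float, v_lead: float) -> float:
    """TTC in seconds for a follower closing on a leader in the same lane."""
    if gap_m <= 0:
        return 0.0  # already in contact
    closing_speed = v_follow - v_lead
    # If the follower is not closing the gap, no collision is predicted.
    return gap_m / closing_speed if closing_speed > 0 else math.inf


def barrier_value(gap_m: float, v_follow: float, v_lead: float,
                  ttc_min: float = 3.0) -> float:
    """Hypothetical barrier-style function: h >= 0 exactly when TTC >= ttc_min."""
    closing_speed = max(0.0, v_follow - v_lead)
    return gap_m - ttc_min * closing_speed


def is_unsafe(gap_m: float, v_follow: float, v_lead: float,
              ttc_min: float = 3.0) -> bool:
    """Flag an interaction as unsafe when TTC drops below the threshold."""
    return time_to_collision(gap_m, v_follow, v_lead) < ttc_min
```

Writing the threshold as the sign of a single function `h` is what makes the condition easy to hand to a solver or a controller: a controller can be asked to keep `h` non-negative instead of reasoning about a division by the closing speed.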

[AI-83] Reinforced Linear Genetic Programming WWW

Quick read: This paper addresses the need in Linear Genetic Programming (LGP) for humans to map registers to actions explicitly, which limits automation and adaptability. The key to its solution is to apply Q-Learning on top of LGP, giving Reinforced Linear Genetic Programming (RLGP), which learns the optimal register-action assignments and so removes the manual mapping step.

Link: https://arxiv.org/abs/2601.09736
Authors: Urmzd Mukhammadnaim
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Bachelor’s thesis. Source code can be found at this https URL

Abstract:Linear Genetic Programming (LGP) is a powerful technique that allows for a variety of problems to be solved using a linear representation of programs. However, some limitations remain, such as the need for humans to explicitly map registers to actions. This thesis proposes a novel approach that uses Q-Learning on top of LGP, Reinforced Linear Genetic Programming (RLGP), to learn the optimal register-action assignments. In doing so, we introduce a new framework “linear-gp” written in memory-safe Rust that allows for extensive experimentation for future work.
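A minimal sketch of how tabular Q-learning could learn a register-to-action mapping is shown below. It is written in Python for illustration only and does not reflect the thesis's Rust framework: treating the index of the largest register as the Q-table "state", and the names `winning_register`, `select_action`, and `q_update`, are assumptions of this example. Only the update rule itself is the standard Q-learning rule.

```python
import random


def winning_register(registers: list[float]) -> int:
    """Index of the largest register after a program has run (assumed 'state')."""
    return max(range(len(registers)), key=registers.__getitem__)


def select_action(q_row: list[float], epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy choice among actions for one register."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q: list[list[float]], reg: int, action: int, reward: float,
             next_reg: int, alpha: float = 0.5, gamma: float = 0.9) -> None:
    """Standard Q-learning update: Q <- Q + alpha * (r + gamma * max Q' - Q)."""
    target = reward + gamma * max(q[next_reg])
    q[reg][action] += alpha * (target - q[reg][action])
```

Here `q[reg][action]` replaces a hand-written table that would otherwise say "register 0 means action A": the mapping is learned from reward instead of being fixed in advance.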

[AI-84] Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODASER) for Safe Reinforcement Learning in Optimal Control

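Quick read: This paper addresses safe and scalable optimal control of nonlinear systems, in particular how to learn efficiently while guaranteeing constraints in dynamic, safety-critical environments. The core challenge is that the experience replay mechanism must keep samples diverse and memory-efficient while the learning process satisfies state and input safety constraints. The key to its solution is Self-Organizing Dual-buffer Adaptive Clustering Experience Replay (SODACER): a Fast-Buffer adapts quickly to recent experiences, and a Slow-Buffer uses an adaptive clustering mechanism to prune redundant samples dynamically while retaining critical environmental patterns, improving memory efficiency and sample utilization. The framework combines this with Control Barrier Functions (CBFs) to enforce safety constraints and with the Sophia optimizer for adaptive second-order gradient updates that improve convergence and stability. Experiments on a nonlinear Human Papillomavirus (HPV) transmission model show faster convergence, a better bias-variance trade-off, and safe trajectories.

Link: https://arxiv.org/abs/2601.06540
Authors: Roya Khalili Amirabadi,Mohsen Jalaeian Farimani,Omid Solaymani Fard
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
Comments: Also available at SSRN: this https URL or this http URL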

Abstract:This paper proposes a novel reinforcement learning framework, named Self-Organizing Dual-buffer Adaptive Clustering Experience Replay (SODACER), designed to achieve safe and scalable optimal control of nonlinear systems. The proposed SODACER mechanism consists of a Fast-Buffer for rapid adaptation to recent experiences and a Slow-Buffer equipped with a self-organizing adaptive clustering mechanism to maintain diverse and non-redundant historical experiences. The adaptive clustering mechanism dynamically prunes redundant samples, optimizing memory efficiency while retaining critical environmental patterns. The approach integrates SODACER with Control Barrier Functions (CBFs) to guarantee safety by enforcing state and input constraints throughout the learning process. To enhance convergence and stability, the framework is combined with the Sophia optimizer, enabling adaptive second-order gradient updates. The proposed SODACER-Sophia architecture ensures reliable, effective, and robust learning in dynamic, safety-critical environments, offering a generalizable solution for applications in robotics, healthcare, and large-scale system optimization. The proposed approach is validated on a nonlinear Human Papillomavirus (HPV) transmission model with multiple control inputs and safety constraints. Comparative evaluations against random and clustering-based experience replay methods demonstrate that SODACER achieves faster convergence, improved sample efficiency, and a superior bias-variance trade-off, while maintaining safe system trajectories, validated via the Friedman test.
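The dual-buffer idea can be sketched as a small replay structure with a bounded fast buffer and a slow buffer that rejects near-duplicate states. This is a simplified stand-in, not the authors' method: a fixed-radius novelty filter replaces the self-organizing adaptive clustering described in the abstract, and the class name `DualBufferReplay`, the `radius` parameter, and the uniform sampling over both buffers are assumptions of this example.

```python
import math
import random
from collections import deque


class DualBufferReplay:
    """Fast buffer for recent transitions, slow buffer for diverse history."""

    def __init__(self, fast_size: int, radius: float):
        self.fast = deque(maxlen=fast_size)  # oldest entries drop automatically
        self.slow = []                       # (state, transition) representatives
        self.radius = radius                 # hypothetical redundancy threshold

    def add(self, state, transition) -> None:
        self.fast.append(transition)
        # Keep a sample in the slow buffer only if no stored state is nearby,
        # a crude substitute for pruning redundant samples by clustering.
        if all(math.dist(state, s) > self.radius for s, _ in self.slow):
            self.slow.append((tuple(state), transition))

    def sample(self, k: int, rng: random.Random) -> list:
        """Draw up to k transitions from the union of both buffers."""
        pool = list(self.fast) + [t for _, t in self.slow]
        return rng.sample(pool, min(k, len(pool)))
```

The fast buffer forgets old data quickly, while the slow buffer grows only when a genuinely new region of the state space is visited, which is the memory-versus-diversity trade-off the abstract describes.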

[AI-85] What Understanding Means in AI-Laden Astronomy

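Quick read: This paper addresses the epistemological questions raised by the rapid spread of AI in astronomical research, namely how to integrate AI tools without losing sight of what scientific understanding is. The community has mostly treated AI as an engineering problem and overlooked its challenge to core scientific notions such as understanding, discovery, and evaluation. The key to its proposal is a framework of "pragmatic understanding", which treats AI as a tool that extends human cognition rather than a substitute for scientific judgment, and calls for new norms of validation and epistemic evaluation so that AI-assisted research retains narrative construction, contextual judgment, and communicative achievement.

Link: https://arxiv.org/abs/2601.10038
Authors: Yuan-Sen Ting,André Curtis-Trudel,Siyu Yao
Affiliation: unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Perspective article, 8 pages. Based on the “Philosophy Sees the Algorithm” workshop held December 11-12, 2025 at The Ohio State University. Supported by the Alfred P. Sloan Foundation, the Center for Cosmology and AstroParticle Physics (CCAPP), and the University of Cincinnati Center for Humanities and Technology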

Abstract:Artificial intelligence is rapidly transforming astronomical research, yet the scientific community has largely treated this transformation as an engineering challenge rather than an epistemological one. This perspective article argues that philosophy of science offers essential tools for navigating AI’s integration into astronomy–conceptual clarity about what “understanding” means, critical examination of assumptions about data and discovery, and frameworks for evaluating AI’s roles across different research contexts. Drawing on an interdisciplinary workshop convening astronomers, philosophers, and computer scientists, we identify several tensions. First, the narrative that AI will “derive fundamental physics” from data misconstrues contemporary astronomy as equation-derivation rather than the observation-driven enterprise it is. Second, scientific understanding involves more than prediction–it requires narrative construction, contextual judgment, and communicative achievement that current AI architectures struggle to provide. Third, because narrative and judgment matter, human peer review remains essential–yet AI-generated content flooding the literature threatens our capacity to identify genuine insight. Fourth, while AI excels at well-defined problem-solving, the ill-defined problem-finding that drives breakthroughs appears to require capacities beyond pattern recognition. Fifth, as AI accelerates what is feasible, pursuitworthiness criteria risk shifting toward what AI makes easy rather than what is genuinely important. We propose “pragmatic understanding” as a framework for integration–recognizing AI as a tool that extends human cognition while requiring new norms for validation and epistemic evaluation. Engaging with these questions now may help the community shape the transformation rather than merely react to it.

[AI-86] Performance of AI agents based on reasoning language models on ALD process optimization tasks

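Quick read: This paper addresses autonomous optimization of atomic layer deposition (ALD) processes: an agent must find optimal dose times for an ALD precursor and a coreactant without prior knowledge of the process, including whether it is actually self-limited. The key to its approach is an agent built on a reasoning large language model (LLM) that interacts iteratively with an ALD reactor (here, a simple model of an ALD tool) in a fully unsupervised way and uses a two-step process: the model first produces an open, natural-language response detailing its reasoning, which is then converted into a structured output. The experiments show that such agents consistently complete the optimization task, with significant run-to-run variability caused by the non-deterministic model responses. Analysis of the reasoning traces indicates that the model's logic is sound and grounded in the notions of self-limited processes and saturation expected for ALD.

Link: https://arxiv.org/abs/2601.09980
Authors: Angel Yanguas-Gil
Affiliation: unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: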

Abstract:In this work we explore the performance and behavior of reasoning large language models to autonomously optimize atomic layer deposition (ALD) processes. In the ALD process optimization task, an agent built on top of a reasoning LLM has to find optimal dose times for an ALD precursor and a coreactant without any prior knowledge on the process, including whether it is actually self-limited. The agent is meant to interact iteratively with an ALD reactor in a fully unsupervised way. We evaluate this agent using a simple model of an ALD tool that incorporates ALD processes with different self-limited surface reaction pathways as well as a non self-limited component. Our results show that agents based on reasoning models like OpenAI’s o3 and GPT5 consistently succeeded at completing this optimization task. However, we observed significant run-to-run variability due to the non deterministic nature of the model’s response. In order to understand the logic followed by the reasoning model, the agent uses a two step process in which the model first generates an open response detailing the reasoning process. This response is then transformed into a structured output. An analysis of these reasoning traces showed that the logic of the model was sound and that its reasoning was based on the notions of self-limited process and saturation expected in the case of ALD. However, the agent can sometimes be misled by its own prior choices when exploring the optimization space.

[AI-87] Learning to Decode in Parallel: Self-Coordinating Neural Network for Real-Time Quantum Error Correction

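Quick read: This paper addresses a key bottleneck in applying neural-network decoders to fault-tolerant quantum computation (FTQC): existing decoders such as AlphaQubit lack the parallelism needed to decode, in real time, the syndrome stream generated by a superconducting logical qubit. The core difficulty is that AlphaQubit outputs only a single bit for the global logical correction rather than local physical corrections that can be integrated across parallel windows. The key to its solution is a new training approach for a recurrent, transformer-based neural network: training labels are derived from a consistent set of local corrections, and the network is trained on several types of decoding windows at once, so that it self-coordinates across neighboring windows and supports accurate, scalable parallel decoding. According to the authors, this is the first such framework to combine state-of-the-art (SOTA) accuracy with the throughput required for real-time quantum error correction. It is validated on the Zuchongzhi 3.2 superconducting quantum processor on surface codes with distances up to 7, and a single TPU v6e is shown to decode surface codes with distances up to 25 within 1 µs per decoding round.

Link: https://arxiv.org/abs/2601.09921
Authors: Kai Zhang,Zhengzhong Yi,Shaojun Guo,Linghang Kong,Situ Wang,Xiaoyu Zhan,Tan He,Weiping Lin,Tao Jiang,Dongxin Gao,Yiming Zhang,Fangming Liu,Fang Zhang,Zhengfeng Ji,Fusheng Chen,Jianxin Chen
Affiliation: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: The main text consists of 25 pages and 9 figures, extending our prior work (arXiv:2509.03815) with new results on surface code decoding in superconducting qubit systems and real-time performance benchmarks on TPU v6e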

Abstract:Fast, reliable decoders are pivotal components for enabling fault-tolerant quantum computation (FTQC). Neural network decoders like AlphaQubit have demonstrated potential, achieving higher accuracy than traditional human-designed decoding algorithms. However, existing implementations of neural network decoders lack the parallelism required to decode the syndrome stream generated by a superconducting logical qubit in real time. Moreover, integrating AlphaQubit with sliding window-based parallel decoding schemes presents non-trivial challenges: AlphaQubit is trained solely to output a single bit corresponding to the global logical correction for an entire memory experiment, rather than local physical corrections that can be easily integrated. We address this issue by training a recurrent, transformer-based neural network specifically tailored for parallel window decoding. While it still outputs a single bit, we derive training labels from a consistent set of local corrections and train on various types of decoding windows simultaneously. This approach enables the network to self-coordinate across neighboring windows, facilitating high-accuracy parallel decoding of arbitrarily long memory experiments. As a result, we overcome the throughput bottleneck that previously precluded the use of AlphaQubit-type decoders in FTQC. Our work presents the first scalable, neural-network-based parallel decoding framework that simultaneously achieves SOTA accuracy and the stringent throughput required for real-time quantum error correction. Using an end-to-end experimental workflow, we benchmark our decoder on the Zuchongzhi 3.2 superconducting quantum processor on surface codes with distances up to 7, demonstrating its superior accuracy. Moreover, we demonstrate that, using our approach, a single TPU v6e is capable of decoding surface codes with distances up to 25 within 1us per decoding round. 

[AI-88] CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Scientific Discovery

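[Quick Read]: This paper addresses a core challenge in data-driven scientific discovery: characterizing known phenomena well while also identifying novel anomalies. Conventional semi-supervised clustering algorithms often assume that supervisory signals are globally representative, and so impose rigid constraints or require a pre-specified number of clusters, which prevents them from capturing unanticipated patterns or performing genuine novelty detection. The key to the solution is the CLiMB (CLustering in Multiphase Boundaries) framework, which uses a sequential two-phase strategy to decouple the exploitation of prior knowledge from the exploration of unknown structure: the first phase anchors known clusters with constrained partitioning, and the second applies density-based clustering to the residual data to reveal arbitrary topologies, so that known structures are recovered and potential new features can still be discovered.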

Link: https://arxiv.org/abs/2601.09768
Authors: Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo, Gisella Clementini, Umberto Michelucci
Affiliations: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI)
Comments: 28 pages, 4 figures

Abstract:In data-driven scientific discovery, a challenge lies in classifying well-characterized phenomena while identifying novel anomalies. Current semi-supervised clustering algorithms do not always fully address this duality, often assuming that supervisory signals are globally representative. Consequently, methods often enforce rigid constraints that suppress unanticipated patterns or require a pre-specified number of clusters, rendering them ineffective for genuine novelty detection. To bridge this gap, we introduce CLiMB (CLustering in Multiphase Boundaries), a domain-informed framework decoupling the exploitation of prior knowledge from the exploration of unknown structures. Using a sequential two-phase approach, CLiMB first anchors known clusters using constrained partitioning, and subsequently applies density-based clustering to residual data to reveal arbitrary topologies. We demonstrate this framework on RR Lyrae stars data from the Gaia Data Release 3. CLiMB attains an Adjusted Rand Index of 0.829 with 90% seed coverage in recovering known Milky Way substructures, drastically outperforming heuristic and constraint-based baselines, which stagnate below 0.20. Furthermore, sensitivity analysis confirms CLiMB’s superior data efficiency, showing monotonic improvement as knowledge increases. Finally, the framework successfully isolates three dynamical features (Shiva, Shakti, and the Galactic Disk) in the unlabelled field, validating its potential for scientific discovery.
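The two-phase idea in this abstract (anchor known clusters first, then run density-based clustering on whatever is left) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify CLiMB's constrained partitioning, so phase 1 is replaced by a simple "within `radius` of a known seed centroid" rule, and phase 2 is a small DBSCAN-style pass. The function names and parameters are hypothetical and are not taken from the paper.

```python
import math

def anchor_known(points, seeds, radius):
    """Phase 1 stand-in (assumption): label points within `radius` of a known seed centroid."""
    labels = {}
    for i, p in enumerate(points):
        best = min(seeds, key=lambda name: math.dist(p, seeds[name]))
        if math.dist(p, seeds[best]) <= radius:
            labels[i] = best
    return labels

def density_clusters(points, indices, eps, min_pts):
    """Phase 2 stand-in (assumption): a small DBSCAN-style pass over the residual indices."""
    labels, cluster_id = {}, 0

    def neighbors(i):
        return [j for j in indices if math.dist(points[i], points[j]) <= eps]

    for i in indices:
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = "noise"  # may later be re-labelled as a border point
            continue
        name = f"new-{cluster_id}"
        cluster_id += 1
        labels[i] = name
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == "noise":
                labels[j] = name  # border point reached from a core point
            if j in labels:
                continue
            labels[j] = name
            more = neighbors(j)
            if len(more) >= min_pts:  # j is a core point, keep expanding
                queue.extend(more)
    return labels

def two_phase(points, seeds, radius, eps, min_pts):
    """Anchor known clusters, then cluster only the points left unlabelled."""
    labels = anchor_known(points, seeds, radius)
    residual = [i for i in range(len(points)) if i not in labels]
    labels.update(density_clusters(points, residual, eps, min_pts))
    return labels
```

Because the second phase only sees the residual points, a dense group far from every seed can surface as a new cluster without a pre-specified cluster count, which is the decoupling the abstract describes.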

Machine Learning

[LG-0] DInf-Grid: A Neural Differential Equation Solver with Differentiable Feature Grids

Link: https://arxiv.org/abs/2601.10715
Authors: Navami Kairanda, Shanthika Naik, Marc Habermann, Avinash Sharma, Christian Theobalt, Vladislav Golyanik
Subjects: Machine Learning (cs.LG)
Comments: 25 pages; 16 figures; project page: this https URL

Abstract:We present a novel differentiable grid-based representation for efficiently solving differential equations (DEs). Widely used architectures for neural solvers, such as sinusoidal neural networks, are coordinate-based MLPs that are both computationally intensive and slow to train. Although grid-based alternatives for implicit representations (e.g., Instant-NGP and K-Planes) train faster by exploiting signal structure, their reliance on linear interpolation restricts their ability to compute higher-order derivatives, rendering them unsuitable for solving DEs. Our approach overcomes these limitations by combining the efficiency of feature grids with radial basis function interpolation, which is infinitely differentiable. To effectively capture high-frequency solutions and enable stable and faster computation of global gradients, we introduce a multi-resolution decomposition with co-located grids. Our proposed representation, DInf-Grid, is trained implicitly using the differential equations as loss functions, enabling accurate modelling of physical fields. We validate DInf-Grid on a variety of tasks, including the Poisson equation for image reconstruction, the Helmholtz equation for wave fields, and the Kirchhoff-Love boundary value problem for cloth simulation. Our results demonstrate a 5-20x speed-up over coordinate-based MLP-based methods, solving differential equations in seconds or minutes while maintaining comparable accuracy and compactness.
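The abstract's point about interpolation can be shown in one dimension: a piecewise-linear interpolant has zero second derivative inside every grid cell, while a Gaussian radial basis function (RBF) interpolant has a closed-form second derivative everywhere. The sketch below is only an illustration of that contrast. The unnormalized Gaussian kernel, the 1D grid, and the function names are assumptions, and it does not reproduce DInf-Grid's multi-resolution co-located grids.

```python
import math

def linear_value(grid_x, grid_f, x):
    """Piecewise-linear interpolation of grid features at x."""
    for i in range(len(grid_x) - 1):
        x0, x1 = grid_x[i], grid_x[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return (1.0 - t) * grid_f[i] + t * grid_f[i + 1]
    raise ValueError("x lies outside the grid")

def rbf_value(grid_x, grid_f, x, sigma):
    """Gaussian RBF blend: u(x) = sum_i f_i * exp(-(x - c_i)^2 / (2 sigma^2))."""
    return sum(f * math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
               for c, f in zip(grid_x, grid_f))

def rbf_second_derivative(grid_x, grid_f, x, sigma):
    """Closed-form u''(x); each Gaussian contributes (d^2/sigma^4 - 1/sigma^2) * phi."""
    total = 0.0
    for c, f in zip(grid_x, grid_f):
        d = x - c
        phi = math.exp(-(d ** 2) / (2.0 * sigma ** 2))
        total += f * (d ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * phi
    return total
```

A loss such as the Poisson residual needs u'' at arbitrary points; with the linear interpolant that term carries no information inside a cell, which is why a smooth kernel is used instead.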

[LG-1] High-accuracy and dimension-free sampling with diffusions

Link: https://arxiv.org/abs/2601.10708
Authors: Khashayar Gatmiry, Sitan Chen, Adil Salim
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Diffusion models have shown remarkable empirical success in sampling from rich multi-modal distributions. Their inference relies on numerically solving a certain differential equation. This differential equation cannot be solved in closed form, and its resolution via discretization typically requires many small iterations to produce high-quality samples. More precisely, prior works have shown that the iteration complexity of discretization methods for diffusion models scales polynomially in the ambient dimension and the inverse accuracy 1/ε. In this work, we propose a new solver for diffusion models relying on a subtle interplay between low-degree approximation and the collocation method (Lee, Song, Vempala 2018), and we prove that its iteration complexity scales polylogarithmically in 1/ε, yielding the first "high-accuracy" guarantee for a diffusion-based sampler that only uses (approximate) access to the scores of the data distribution. In addition, our bound does not depend explicitly on the ambient dimension; more precisely, the dimension affects the complexity of our solver through the effective radius of the support of the target distribution only.

[LG-2] Distributed Perceptron under Bounded Staleness Partial Participation and Noisy Communication

Link: https://arxiv.org/abs/2601.10705
Authors: Keval Jain, Anant Raj, Saurav Prakash, Girish Varma
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We study a semi-asynchronous client-server perceptron trained via iterative parameter mixing (IPM-style averaging): clients run local perceptron updates and a server forms a global model by aggregating the updates that arrive in each communication round. The setting captures three system effects in federated and distributed deployments: (i) stale updates due to delayed model delivery and delayed application of client computations (two-sided version lag), (ii) partial participation (intermittent client availability), and (iii) imperfect communication on both downlink and uplink, modeled as effective zero-mean additive noise with bounded second moment. We introduce a server-side aggregation rule called staleness-bucket aggregation with padding that deterministically enforces a prescribed staleness profile over update ages without assuming any stochastic model for delays or participation. Under margin separability and bounded data radius, we prove a finite-horizon expected bound on the cumulative weighted number of perceptron mistakes over a given number of server rounds: the impact of delay appears only through the mean enforced staleness, whereas communication noise contributes an additional term that grows on the order of the square root of the horizon with the total noise energy. In the noiseless case, we show how a finite expected mistake budget yields an explicit finite-round stabilization bound under a mild fresh-participation condition.
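The baseline scheme the abstract builds on, iterative parameter mixing with partial participation, can be sketched as follows. This is a simplified illustration only: it omits staleness, the staleness-bucket aggregation with padding, and communication noise, none of which are specified in enough detail in the abstract to reproduce. The function names are hypothetical.

```python
def local_perceptron(w, data, epochs=1):
    """Run perceptron updates on one client's data, starting from the model w."""
    w = list(w)
    mistakes = 0
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= 0:  # mistake: move w toward y * x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
    return w, mistakes

def ipm_round(w_global, client_data, participating):
    """One server round: average the local models of the clients that showed up."""
    local_models = [local_perceptron(w_global, client_data[c])[0] for c in participating]
    if not local_models:
        return list(w_global)  # nobody participated, keep the current model
    n = len(local_models)
    return [sum(m[j] for m in local_models) / n for j in range(len(w_global))]
```

Calling `ipm_round` repeatedly with a different `participating` list each round mimics intermittent client availability; the paper's contribution concerns what happens when the arriving updates are also stale and noisy.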

[LG-3] Communication-Efficient and Privacy-Adaptable Mechanism – a Federated Learning Scheme with Convergence Analysis

Link: https://arxiv.org/abs/2601.10701
Authors: Chun Hei Michael Shiu, Chih Wei Ling
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 5 figures. This work is submitted in part to the 2026 IEEE International Symposium on Information Theory (ISIT). arXiv admin note: substantial text overlap with arXiv:2501.12046

Abstract:Federated learning enables multiple parties to jointly train learning models without sharing their own underlying data, offering a practical pathway to privacy-preserving collaboration under data-governance constraints. Continued study of federated learning is essential to address key challenges in it, including communication efficiency and privacy protection between parties. A recent line of work introduced a novel approach called the Communication-Efficient and Privacy-Adaptable Mechanism (CEPAM), which achieves both objectives simultaneously. CEPAM leverages the rejection-sampled universal quantizer (RSUQ), a randomized vector quantizer whose quantization error is equivalent to a prescribed noise, which can be tuned to customize privacy protection between parties. In this work, we theoretically analyze the privacy guarantees and convergence properties of CEPAM. Moreover, we assess CEPAM’s utility performance through experimental evaluations, including convergence profiles compared with other baselines, and accuracy-privacy trade-offs between different parties.

[LG-4] Data-driven stochastic reduced-order modeling of parametrized dynamical systems

Link: https://arxiv.org/abs/2601.10690
Authors: Andrew F. Ilersich, Kevin Course, Prasanth B. Nair
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Modeling complex dynamical systems under varying conditions is computationally intensive, often rendering high-fidelity simulations intractable. Although reduced-order models (ROMs) offer a promising solution, current methods often struggle with stochastic dynamics and fail to quantify prediction uncertainty, limiting their utility in robust decision-making contexts. To address these challenges, we introduce a data-driven framework for learning continuous-time stochastic ROMs that generalize across parameter spaces and forcing conditions. Our approach, based on amortized stochastic variational inference, leverages a reparametrization trick for Markov Gaussian processes to eliminate the need for computationally expensive forward solvers during training. This enables us to jointly learn a probabilistic autoencoder and stochastic differential equations governing the latent dynamics, at a computational cost that is independent of the dataset size and system stiffness. Additionally, our approach offers the flexibility of incorporating physics-informed priors if available. Numerical studies are presented for three challenging test problems, where we demonstrate excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing approaches.

[LG-5] Single-Stage Huffman Encoder for ML Compression

Link: https://arxiv.org/abs/2601.10673
Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 4 figures

Abstract:Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression using Huffman codes is an effective way to alleviate the issue, however, its three-stage design requiring on-the-fly frequency analysis, codebook generation and transmission of codebook along with data introduces computational, latency and data overheads which are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through our analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. Using this approach we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.
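The single-stage idea can be sketched in a few lines: build one Huffman codebook ahead of time from the average symbol distribution of earlier batches, then encode new data with that fixed codebook, with no per-batch frequency analysis and no codebook transmission. The code below is a minimal software illustration, not the paper's hardware encoder. The small probability floor for unseen symbols and all function names are assumptions.

```python
import heapq
from collections import Counter

def build_codebook(freqs):
    """Build a Huffman codebook (symbol -> bit string) from symbol weights."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)  # unique tie-breaker so the dicts are never compared
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

def average_distribution(batches, alphabet, floor=1e-6):
    """Average per-batch symbol probabilities; the floor (an assumption)
    keeps symbols unseen in earlier batches encodable."""
    avg = {s: floor for s in alphabet}
    for batch in batches:
        for s, c in Counter(batch).items():
            avg[s] += c / len(batch) / len(batches)
    return avg

def encode(data, codebook):
    """Single stage: a table lookup per symbol, no frequency analysis."""
    return "".join(codebook[s] for s in data)

def decode(bits, codebook):
    """Decode with the same fixed codebook, which both sides already hold."""
    inverse = {c: s for s, c in codebook.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return out
```

The trade-off is that a fixed codebook is only near-optimal when new data resembles the earlier batches, which is the statistical similarity across layers and shards that the abstract reports.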

[LG-6] PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution

Link: https://arxiv.org/abs/2601.10657
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent’s context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.

[LG-7] STEM: Scaling Transformers with Embedding Modules

Link: https://arxiv.org/abs/2601.10639
Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3–4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.
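One way to read the architectural change is as a one-line swap inside a gated feed-forward block: the up-projection `W_up @ x` is replaced by a row looked up from a per-layer table indexed by the token id, while the gate and down-projection stay dense. The sketch below assumes a SwiGLU-style gated FFN as the dense baseline, which the abstract does not state explicitly; the names `stem_ffn` and `E` are hypothetical.

```python
import math

def matvec(W, x):
    """Dense matrix-vector product with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def silu(v):
    return v / (1.0 + math.exp(-v))

def gated_ffn(x, W_gate, W_up, W_down):
    """Dense baseline (assumed SwiGLU-style): W_down (silu(W_gate x) * (W_up x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])

def stem_ffn(x, token_id, W_gate, E, W_down):
    """Token-indexed variant: the up-projection becomes a table lookup E[token_id]."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = E[token_id]  # no matmul and no runtime routing, just an index
    return matvec(W_down, [g * u for g, u in zip(gate, up)])
```

Since the lookup depends only on the token id, the needed rows are known as soon as the input tokens are, which is consistent with the abstract's claims about CPU offload with prefetch and about editing knowledge by modifying individual rows.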

[LG-8] Combinatorial Optimization Augmented Machine Learning

Link: https://arxiv.org/abs/2601.10583
Authors: Maximilian Schiffer, Heiko Hoppe, Yue Su, Louis Bouvier, Axel Parmentier
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.

[LG-9] Kolmogorov Arnold Networks and Multi-Layer Perceptrons: A Paradigm Shift in Neural Modelling

Link: https://arxiv.org/abs/2601.10563
Authors: Aradhya Gaonkar, Nihal Jain, Vignesh Chougule, Nikhil Deshpande, Sneha Varur, Channabasappa Muttal
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 8 figures, 2 tables

Abstract:The research undertakes a comprehensive comparative analysis of Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptrons (MLP), highlighting their effectiveness in solving essential computational challenges like nonlinear function approximation, time-series prediction, and multivariate classification. Rooted in Kolmogorov’s representation theorem, KANs utilize adaptive spline-based activation functions and grid-based structures, providing a transformative approach compared to traditional neural network frameworks. Utilizing a variety of datasets spanning mathematical function estimation (quadratic and cubic) to practical uses like predicting daily temperatures and categorizing wines, the proposed research thoroughly assesses model performance via accuracy measures like Mean Squared Error (MSE) and computational expense assessed through Floating Point Operations (FLOPs). The results indicate that KANs reliably exceed MLPs in every benchmark, attaining higher predictive accuracy with significantly reduced computational costs. Such an outcome highlights their ability to maintain a balance between computational efficiency and accuracy, rendering them especially beneficial in resource-limited and real-time operational environments. By elucidating the architectural and functional distinctions between KANs and MLPs, the paper provides a systematic framework for selecting the most suitable neural architectures for specific tasks. Furthermore, the proposed study highlights the transformative capabilities of KANs in progressing intelligent systems, influencing their use in situations that require both interpretability and computational efficiency.

[LG-10] Mixtures of Transparent Local Models

Link: https://arxiv.org/abs/2601.10541
Authors: Niffa Cheick Oumar Diaby, Thierry Duchesne, Mario Marchand
Subjects: Machine Learning (cs.LG)
Comments: 44 pages, 32 figures

Abstract:The predominance of machine learning models in many spheres of human activity has led to a growing demand for their transparency. The transparency of models makes it possible to discern some factors, such as security or non-discrimination. In this paper, we propose a mixture of transparent local models as an alternative solution for designing interpretable (or transparent) models. Our approach is designed for the situations where a simple and transparent function is suitable for modeling the label of instances in some localities/regions of the input space, but may change abruptly as we move from one locality to another. Consequently, the proposed algorithm is to learn both the transparent labeling function and the locality of the input space where the labeling function achieves a small risk in its assigned locality. By using a new multi-predictor (and multi-locality) loss function, we established rigorous PAC-Bayesian risk bounds for the case of binary linear classification problem and that of linear regression. In both cases, synthetic data sets were used to illustrate how the learning algorithms work. The results obtained from real data sets highlight the competitiveness of our approach compared to other existing methods as well as certain opaque models. Keywords: PAC-Bayes, risk bounds, local models, transparent models, mixtures of local transparent models.

[LG-11] CoGen: Creation of Reusable UI Components in Figma via Textual Commands

Link: https://arxiv.org/abs/2601.10536
Authors: Ishani Kanapathipillai, Obhasha Priyankara
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 11 tables

Abstract:The evolution of User Interface design has emphasized the need for efficient, reusable, and editable components to ensure an efficient design process. This research introduces CoGen, a system that uses machine learning techniques to generate reusable UI components directly in Figma, one of the most popular UI design tools. Addressing gaps in current systems, CoGen focuses on creating atomic components such as buttons, labels, and input fields using structured JSON and natural language prompts. The project integrates Figma API data extraction, Seq2Seq models, and fine-tuned T5 transformers for component generation. The key results demonstrate the efficiency of the T5 model in prompt generation, with an accuracy of 98% and a BLEU score of 0.2668, which ensures the mapping of JSON to descriptive prompts. For JSON creation, CoGen achieves a success rate of up to 100% in generating simple JSON outputs for specified component types.

[LG-12] Transformer-Based Cognitive Radio: Adaptive Modulation Strategies Using Transformer Models

Link: https://arxiv.org/abs/2601.10519
Authors: Andrea Melis, Andrea Piroddi, Roberto Girau
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Cognitive Radio (CR) systems, which dynamically adapt to changing spectrum environments, could benefit significantly from advancements in machine learning technologies. These systems can be enhanced in terms of spectral efficiency, robustness, and security through innovative approaches such as the use of Transformer models. This work investigates the application of Transformer models, specifically the GPT-2 architecture, to generate novel modulation schemes for wireless communications. By training a GPT-2 model on a dataset of existing modulation formulas, new modulation schemes have been created. These generated schemes are then compared to traditional methods using key performance metrics such as Signal-to-Noise Ratio (SNR) and Power Spectrum Density (PSD). The results show that Transformer-generated modulation schemes can achieve performance comparable to, and in some cases outperforming, traditional methods. This demonstrates that advanced CR systems could greatly benefit from the implementation of Transformer models, leading to more efficient, robust, and secure communication systems.

[LG-13] Communication-Efficient Federated Learning by Exploiting Spatio-Temporal Correlations of Gradients

Link: https://arxiv.org/abs/2601.10491
Authors: Shenlong Zheng, Zhen Zhang, Yuhui Deng, Geyong Min, Lin Cui
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Communication overhead is a critical challenge in federated learning, particularly in bandwidth-constrained networks. Although many methods have been proposed to reduce communication overhead, most focus solely on compressing individual gradients, overlooking the temporal correlations among them. Prior studies have shown that gradients exhibit spatial correlations, typically reflected in low-rank structures. Through empirical analysis, we further observe a strong temporal correlation between client gradients across adjacent rounds. Based on these observations, we propose GradESTC, a compression technique that exploits both spatial and temporal gradient correlations. GradESTC exploits spatial correlations to decompose each full gradient into a compact set of basis vectors and corresponding combination coefficients. By exploiting temporal correlations, only a small portion of the basis vectors need to be dynamically updated in each round. GradESTC significantly reduces communication overhead by transmitting lightweight combination coefficients and a limited number of updated basis vectors instead of the full gradients. Extensive experiments show that, upon reaching a target accuracy level near convergence, GradESTC reduces uplink communication by an average of 39.79% compared to the strongest baseline, while maintaining comparable convergence speed and final accuracy to uncompressed FedAvg. By effectively leveraging spatio-temporal gradient structures, GradESTC offers a practical and scalable solution for communication-efficient federated learning.
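The communication saving described here comes from sending coefficients over a shared basis instead of the full gradient, and refreshing only a few basis vectors per round. The sketch below illustrates just that accounting, under stated assumptions: the basis is taken as given and orthonormal, and how GradESTC actually computes or selects the basis vectors to update is not described in the abstract and is not modelled. All names are hypothetical.

```python
def project(basis, column):
    """Coefficients of one gradient column in an orthonormal basis (dot products)."""
    return [sum(b_i * g_i for b_i, g_i in zip(b, column)) for b in basis]

def reconstruct(basis, coeffs):
    """Server side: rebuild the column as a weighted sum of the basis vectors."""
    m = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(m)]

def uplink_floats(m, n, k, updated):
    """Floats sent per round for an m x n gradient: k coefficients per column
    plus `updated` refreshed basis vectors of length m."""
    return k * n + updated * m
```

For example, with an `m = 1000` by `n = 100` gradient, `k = 8` basis vectors and one refreshed vector per round, `uplink_floats` gives 1,800 values against 100,000 for the raw gradient. The reconstruction is exact only when the gradient lies in the span of the basis, so the quality of the basis determines the accuracy cost.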

[LG-14] DeFlow: Decoupling Manifold Modeling and Value Maximization for Offline Policy Extraction

Link: https://arxiv.org/abs/2601.10471
Authors: Zhancun Mu
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 3 figures

Abstract:We present DeFlow, a decoupled offline RL framework that leverages flow matching to faithfully capture complex behavior manifolds. Optimizing generative policies is computationally prohibitive, typically necessitating backpropagation through ODE solvers. We address this by learning a lightweight refinement module within an explicit, data-derived trust region of the flow manifold, rather than sacrificing the iterative generation capability via single-step distillation. This way, we bypass solver differentiation and eliminate the need for balancing loss terms, ensuring stable improvement while fully preserving the flow’s iterative expressivity. Empirically, DeFlow achieves superior performance on the challenging OGBench benchmark and demonstrates efficient offline-to-online adaptation.

[LG-15] LangLasso: Interactive Cluster Descriptions through LLM Explanation

Link: https://arxiv.org/abs/2601.10458
Authors: Raphael Buchmüller, Dennis Collaris, Linhao Meng, Angelos Chatzimparmpas
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Computation (stat.CO)
Comments: This manuscript is accepted for publication in VIS 2025 VISxGenAI Workshop

Abstract:Dimensionality reduction is a powerful technique for revealing structure and potential clusters in data. However, as the axes are complex, non-linear combinations of features, they often lack semantic interpretability. Existing visual analytics (VA) methods support cluster interpretation through feature comparison and interactive exploration, but they require technical expertise and intense human effort. We present LangLasso, a novel method that complements VA approaches through interactive, natural language descriptions of clusters using large language models (LLMs). It produces human-readable descriptions that make cluster interpretation accessible to non-experts and allow integration of external contextual knowledge beyond the dataset. We systematically evaluate the reliability of these explanations and demonstrate that LangLasso provides an effective first step for engaging broader audiences in cluster interpretation. The tool is available at this https URL

[LG-16] Stable Differentiable Modal Synthesis for Learning Nonlinear Dynamics

Link: https://arxiv.org/abs/2601.10453
Authors: Victor Zheleznov, Stefan Bilbao, Alec Wright, Simon King
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Computational Physics (physics.comp-ph)
Comments: Submitted to the Journal of Audio Engineering Society (December 2025)

Abstract:Modal methods are a long-standing approach to physical modelling synthesis. Extensions to nonlinear problems are possible, including the case of a high-amplitude vibration of a string. A modal decomposition leads to a densely coupled nonlinear system of ordinary differential equations. Recent work in scalar auxiliary variable techniques has enabled construction of explicit and stable numerical solvers for such classes of nonlinear systems. On the other hand, machine learning approaches (in particular neural ordinary differential equations) have been successful in modelling nonlinear systems automatically from data. In this work, we examine how scalar auxiliary variable techniques can be combined with neural ordinary differential equations to yield a stable differentiable model capable of learning nonlinear dynamics. The proposed approach leverages the analytical solution for linear vibration of system’s modes so that physical parameters of a system remain easily accessible after the training without the need for a parameter encoder in the model architecture. As a proof of concept, we generate synthetic data for the nonlinear transverse vibration of a string and show that the model can be trained to reproduce the nonlinear dynamics of the system. Sound examples are presented.

[LG-17] Reinforcement Learning with Multi-Step Lookahead Information Via Adaptive Batching

Link: https://arxiv.org/abs/2601.10418
Authors: Nadav Merlis
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study tabular reinforcement learning problems with multiple steps of lookahead information. Before acting, the learner observes $\ell$ steps of future transition and reward realizations: the exact state the agent would reach and the rewards it would collect under any possible course of action. While it has been shown that such information can drastically boost the value, finding the optimal policy is NP-hard, and it is common to apply one of two tractable heuristics: processing the lookahead in chunks of predefined sizes (‘fixed batching policies’), and model predictive control. We first illustrate the problems with these two approaches and propose utilizing the lookahead in adaptive (state-dependent) batches; we refer to such policies as adaptive batching policies (ABPs). We derive the optimal Bellman equations for these strategies and design an optimistic regret-minimizing algorithm that enables learning the optimal ABP when interacting with unknown environments. Our regret bounds are order-optimal up to a potential factor of the lookahead horizon $\ell$, which can usually be considered a small constant.

[LG-18] CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2601.10407
Authors: Yuanjie Zhao, Junnan Qiu, Yue Ding, Jie Li
Categories: Machine Learning (cs.LG)

Abstract:Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. Existing attack strategies typically struggle against safety-constrained algorithms (e.g., CQL) due to inefficient random poisoning and the use of easily detectable Out-of-Distribution (OOD) triggers. In this paper, we propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget. Leveraging the theoretical insight that samples with high Temporal Difference (TD) errors are pivotal for value function convergence, we introduce an adaptive Critical Sample Selection strategy that concentrates the attack budget on the most influential transitions. To evade OOD detection, we propose a Correlation-Breaking Trigger mechanism that exploits the physical mutual exclusivity of state features (e.g., 95th percentile boundaries) to remain statistically concealed. Furthermore, we replace the conventional label inversion with a Gradient-Guided Action Generation mechanism, which searches for worst-case actions within the data manifold using the victim Q-network’s gradient. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms state-of-the-art baselines, achieving high attack success rates against representative safety-constrained algorithms with a minimal 5% poisoning budget, while maintaining the agent’s performance in clean environments.

[LG-19] Discrete Feynman-Kac Correctors

Link: https://arxiv.org/abs/2601.10403
Authors: Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov
Categories: Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non-sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman-Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine-tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward-tilted protein sequence generation.

[LG-20] PLGC: Pseudo-Labeled Graph Condensation

Link: https://arxiv.org/abs/2601.10358
Authors: Jay Nandy, Arnab Kumar Mondal, Anuj Rathore, Mahesh Chandran
Categories: Machine Learning (cs.LG)

Abstract:Large graph datasets make training graph neural networks (GNNs) computationally costly. Graph condensation methods address this by generating small synthetic graphs that approximate the original data. However, existing approaches rely on clean, supervised labels, which limits their reliability when labels are scarce, noisy, or inconsistent. We propose Pseudo-Labeled Graph Condensation (PLGC), a self-supervised framework that constructs latent pseudo-labels from node embeddings and optimizes condensed graphs to match the original graph’s structural and feature statistics – without requiring ground-truth labels. PLGC offers three key contributions: (1) A diagnosis of why supervised condensation fails under label noise and distribution shift. (2) A label-free condensation method that jointly learns latent prototypes and node assignments. (3) Theoretical guarantees showing that pseudo-labels preserve latent structural statistics of the original graph and ensure accurate embedding alignment. Empirically, across node classification and link prediction tasks, PLGC achieves competitive performance with state-of-the-art supervised condensation methods on clean datasets and exhibits substantial robustness under label noise, often outperforming all baselines by a significant margin. Our findings highlight the practical and theoretical advantages of self-supervised graph condensation in noisy or weakly-labeled environments.

[LG-21] EvoMorph: Counterfactual Explanations for Continuous Time-Series Extrinsic Regression Applied to Photoplethysmography

Link: https://arxiv.org/abs/2601.10356
Authors: Mesut Ceylan, Alexis Tabin, Patrick Langer, Elgar Fleisch, Filipe Barata
Categories: Machine Learning (cs.LG)

Abstract:Wearable devices enable continuous, population-scale monitoring of physiological signals, such as photoplethysmography (PPG), creating new opportunities for data-driven clinical assessment. Time-series extrinsic regression (TSER) models increasingly leverage PPG signals to estimate clinically relevant outcomes, including heart rate, respiratory rate, and oxygen saturation. For clinical reasoning and trust, however, single point estimates alone are insufficient: clinicians must also understand whether predictions are stable under physiologically plausible variations and to what extent realistic, attainable changes in physiological signals would meaningfully alter a model’s prediction. Counterfactual explanations (CFE) address these “what-if” questions, yet existing time series CFE generation methods are largely restricted to classification, overlook waveform morphology, and often produce physiologically implausible signals, limiting their applicability to continuous biomedical time series. To address these limitations, we introduce EvoMorph, a multi-objective evolutionary framework for generating physiologically plausible and diverse CFE for TSER applications. EvoMorph optimizes morphology-aware objectives defined on interpretable signal descriptors and applies transformations to preserve the waveform structure. We evaluated EvoMorph on three PPG datasets (heart rate, respiratory rate, and oxygen saturation) against a nearest-unlike-neighbor baseline. In addition, in a case study, we evaluated EvoMorph as a tool for uncertainty quantification by relating counterfactual sensitivity to bootstrap-ensemble uncertainty and data-density measures. Overall, EvoMorph enables the generation of physiologically-aware counterfactuals for continuous biomedical signals and supports uncertainty-aware interpretability, advancing trustworthy model analysis for clinical time-series applications.

[LG-22] Meta Dynamic Graph for Traffic Flow Prediction AAAI2026

Link: https://arxiv.org/abs/2601.10328
Authors: Yiqing Zou, Hanning Yuan, Qianyu Yang, Ziqiang Yuan, Shuliang Wang, Sijie Ruan
Categories: Machine Learning (cs.LG)
Comments: Accepted to AAAI 2026

Abstract:Traffic flow prediction is a typical spatio-temporal prediction problem and has a wide range of applications. The core challenge lies in modeling the underlying complex spatio-temporal dependencies. Various methods have been proposed, and recent studies show that the modeling of dynamics is useful to meet the core challenge. While handling spatial dependencies and temporal dependencies using separate base model structures may hinder the modeling of spatio-temporal correlations, the modeling of dynamics can bridge this gap. Incorporating spatio-temporal heterogeneity also advances the main goal, since it can extend the parameter space and allow more flexibility. Despite these advances, two limitations persist: 1) the modeling of dynamics is often limited to the dynamics of spatial topology (e.g., adjacency matrix changes), which, however, can be extended to a broader scope; 2) the modeling of heterogeneity is often separated for spatial and temporal dimensions, but this gap can also be bridged by the modeling of dynamics. To address the above limitations, we propose a novel framework for traffic prediction, called Meta Dynamic Graph (MetaDG). MetaDG leverages dynamic graph structures of node representations to explicitly model spatio-temporal dynamics. This generates both dynamic adjacency matrices and meta-parameters, extending dynamic modeling beyond topology while unifying the capture of spatio-temporal heterogeneity into a single dimension. Extensive experiments on four real-world datasets validate the effectiveness of MetaDG.

[LG-23] We Need a More Robust Classifier: Dual Causal Learning Empowers Domain-Incremental Time Series Classification WWW2026

Link: https://arxiv.org/abs/2601.10312
Authors: Zhipeng Liu, Peibo Duan, Xuan Tang, Haodong Jing, Mingyang Geng, Yongsheng Huang, Jialu Xu, Bin Zhang, Binwu Wang
Categories: Machine Learning (cs.LG)
Comments: This paper has been accepted for publication at ACM WWW 2026

Abstract:The World Wide Web thrives on intelligent services that rely on accurate time series classification, which has recently witnessed significant progress driven by advances in deep learning. However, existing studies face challenges in domain incremental learning. In this paper, we propose a lightweight and robust dual-causal disentanglement framework (DualCD) to enhance the robustness of models under domain incremental scenarios, which can be seamlessly integrated into time series classification models. Specifically, DualCD first introduces a temporal feature disentanglement module to capture class-causal features and spurious features. The causal features can offer sufficient predictive power to support the classifier in domain incremental learning settings. To accurately capture these causal features, we further design a dual-causal intervention mechanism to eliminate the influence of both intra-class and inter-class confounding features. This mechanism constructs variant samples by combining the current class’s causal features with intra-class spurious features and with causal features from other classes. The causal intervention loss encourages the model to accurately predict the labels of these variant samples based solely on the causal features. Extensive experiments on multiple datasets and models demonstrate that DualCD effectively improves performance in domain incremental scenarios. We summarize our rich experiments into a comprehensive benchmark to facilitate research in domain incremental time series classification.

[LG-24] Early Fault Detection on CMAPSS with Unsupervised LSTM Autoencoders

Link: https://arxiv.org/abs/2601.10269
Authors: P. Sánchez, K. Reyes, B. Radu, E. Fernández
Categories: Machine Learning (cs.LG)

Abstract:This paper introduces an unsupervised health-monitoring framework for turbofan engines that does not require run-to-failure labels. First, operating-condition effects in NASA CMAPSS sensor streams are removed via regression-based normalisation; then a Long Short-Term Memory (LSTM) autoencoder is trained only on the healthy portion of each trajectory. Persistent reconstruction error, estimated using an adaptive data-driven threshold, triggers real-time alerts without hand-tuned rules. Benchmark results show high recall and low false-alarm rates across multiple operating regimes, demonstrating that the method can be deployed quickly, scale to diverse fleets, and serve as a complementary early-warning layer to Remaining Useful Life models.
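
The alerting step described in this abstract (a data-driven threshold on reconstruction error plus a persistence requirement) can be sketched in a few lines. This is only an illustrative sketch: the abstract does not give the threshold rule, so the `mean + k * std` rule, the `persistence` count, and the toy error values below are all assumptions, and the LSTM autoencoder that would produce the errors is omitted.

```python
from statistics import mean, stdev

def adaptive_threshold(healthy_errors, k=3.0):
    """Assumed rule: mean + k * sample std of errors on the healthy segment."""
    return mean(healthy_errors) + k * stdev(healthy_errors)

def first_alert(errors, threshold, persistence=3):
    """Return the first index at which the error has exceeded the threshold
    for `persistence` consecutive steps, or None if that never happens."""
    run = 0
    for i, err in enumerate(errors):
        run = run + 1 if err > threshold else 0
        if run >= persistence:
            return i
    return None

# Toy reconstruction errors (hypothetical values, not from the paper).
healthy = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11]
stream = healthy + [0.30, 0.11, 0.28, 0.31, 0.35, 0.40]

thr = adaptive_threshold(healthy)
alert_index = first_alert(stream, thr)
print(f"threshold={thr:.3f}, first persistent alert at index {alert_index}")
```

Requiring several consecutive exceedances means that an isolated spike (index 8 in the toy stream) does not raise an alert, which is one simple way to read "persistent reconstruction error".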

[LG-25] In-Context Source and Channel Coding

Link: https://arxiv.org/abs/2601.10267
Authors: Ziqiong Wang, Tianqi Ren, Rongpeng Li, Zhifeng Zhao, Honggang Zhang
Categories: Machine Learning (cs.LG)

Abstract:Separate Source-Channel Coding (SSCC) remains attractive for text transmission due to its modularity and compatibility with mature entropy coders and powerful channel codes. However, SSCC often suffers from a pronounced cliff effect in low Signal-to-Noise Ratio (SNR) regimes, where residual bit errors after channel decoding can catastrophically break lossless source decoding, especially for Arithmetic Coding (AC) driven by Large Language Models (LLMs). This paper proposes a receiver-side In-Context Decoding (ICD) framework that enhances SSCC robustness without modifying the transmitter. ICD leverages an Error Correction Code Transformer (ECCT) to obtain bit-wise reliability for the decoded information bits. Based on the context-consistent bitstream, ICD constructs a confidence-ranked candidate pool via reliability-guided bit flipping, samples a compact yet diverse subset of candidates, and applies an LLM-based arithmetic decoder to obtain both reconstructions and sequence-level log-likelihoods. A reliability-likelihood fusion rule then selects the final output. We further provide theoretical guarantees on the stability and convergence of the proposed sampling procedure. Extensive experiments over Additive White Gaussian Noise (AWGN) and Rayleigh fading channels demonstrate consistent gains compared with conventional SSCC baselines and representative Joint Source-Channel Coding (JSCC) schemes.
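
The "confidence-ranked candidate pool via reliability-guided bit flipping" step can be illustrated with a small sketch. It assumes per-bit reliabilities are already available (in the paper they come from an ECCT, which is not modelled here), and the ranking cost, `num_weak`, and `max_flips` are hypothetical choices rather than the authors' procedure; the LLM-based arithmetic decoding and the fusion rule are omitted.

```python
from itertools import combinations

def candidate_pool(bits, reliability, num_weak=3, max_flips=2):
    """Flip subsets of the least reliable bits and rank the candidates.

    Each candidate is scored by the total reliability of the bits that were
    flipped (assumed cost): flipping low-reliability bits is cheap, so those
    candidates rank near the top. The unmodified bitstream has cost 0.
    """
    weak = sorted(range(len(bits)), key=lambda i: reliability[i])[:num_weak]
    pool = [(0.0, tuple(bits))]
    for k in range(1, max_flips + 1):
        for subset in combinations(weak, k):
            cand = list(bits)
            for i in subset:
                cand[i] ^= 1  # flip the selected bit
            cost = sum(reliability[i] for i in subset)
            pool.append((cost, tuple(cand)))
    pool.sort(key=lambda item: item[0])
    return pool

# Toy decoded bits with hypothetical reliabilities in [0, 1].
bits = [1, 0, 1, 1]
reliability = [0.9, 0.2, 0.6, 0.1]
pool = candidate_pool(bits, reliability, num_weak=2, max_flips=2)
for cost, cand in pool:
    print(f"cost={cost:.2f} candidate={cand}")
```

In a full pipeline each candidate would then be decoded and scored by sequence-level log-likelihood; here only the pool construction is shown.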

[LG-26] Fundamental Limitations of Favorable Privacy-Utility Guarantees for DP-SGD

Link: https://arxiv.org/abs/2601.10237
Authors: Murat Bilgehan Ertan, Marten van Dijk
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) is the dominant paradigm for private training, but its fundamental limitations under worst-case adversarial privacy definitions remain poorly understood. We analyze DP-SGD in the $f$-differential privacy framework, which characterizes privacy via hypothesis-testing trade-off curves, and study shuffled sampling over a single epoch with $M$ gradient updates. We derive an explicit suboptimal upper bound on the achievable trade-off curve. This result induces a geometric lower bound on the separation $\kappa$, which is the maximum distance between the mechanism’s trade-off curve and the ideal random-guessing line. Because a large separation implies significant adversarial advantage, meaningful privacy requires small $\kappa$. However, we prove that enforcing a small separation imposes a strict lower bound on the Gaussian noise multiplier $\sigma$, which directly limits the achievable utility. In particular, under the standard worst-case adversarial model, shuffled DP-SGD must satisfy $\sigma \ge \frac{1}{\sqrt{2\ln M}}$ or $\kappa \ge \frac{1}{\sqrt{8}}\left(1-\frac{1}{\sqrt{4\pi\ln M}}\right)$, and thus cannot simultaneously achieve strong privacy and high utility. Although this bound vanishes asymptotically as $M \to \infty$, the convergence is extremely slow: even for practically relevant numbers of updates the required noise magnitude remains substantial. We further show that the same limitation extends to Poisson subsampling up to constant factors. Our experiments confirm that the noise levels implied by this bound lead to significant accuracy degradation at realistic training settings, thus showing a critical bottleneck in DP-SGD under standard worst-case adversarial assumptions.
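
To get a feel for how slowly the bound decays, the two sides of the inequality can be evaluated numerically. The sketch below assumes the bound reads sigma >= 1/sqrt(2 ln M) or kappa >= (1/sqrt(8)) * (1 - 1/sqrt(4 pi ln M)), which is a reading of the inline formula in this listing rather than a statement checked against the paper; the function names are hypothetical.

```python
import math

def sigma_bound(M):
    """Noise-multiplier side of the assumed bound: 1 / sqrt(2 ln M)."""
    return 1.0 / math.sqrt(2.0 * math.log(M))

def kappa_bound(M):
    """Separation side of the assumed bound:
    (1 / sqrt(8)) * (1 - 1 / sqrt(4 * pi * ln M))."""
    return (1.0 / math.sqrt(8.0)) * (
        1.0 - 1.0 / math.sqrt(4.0 * math.pi * math.log(M))
    )

# Evaluate both thresholds over several orders of magnitude of updates M.
for M in (10**2, 10**4, 10**6, 10**8):
    print(f"M={M:>9}  sigma >= {sigma_bound(M):.3f}  or  kappa >= {kappa_bound(M):.3f}")
```

Because the sigma threshold depends on M only through sqrt(ln M), increasing M from 10^4 to 10^8 doubles ln M and therefore shrinks the threshold by a factor of only sqrt(2).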

[LG-27] Graph Regularized PCA

Link: https://arxiv.org/abs/2601.10199
Authors: Antonio Briola, Marwin Schmidt, Fabio Caccioli, Carlos Ros Perez, James Singleton, Christian Michler, Tomaso Aste
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 2 figures, 4 Tables

Abstract:High-dimensional data often exhibit dependencies among variables that violate the isotropic-noise assumption under which principal component analysis (PCA) is optimal. For cases where the noise is not independent and identically distributed across features (i.e., the covariance is not spherical) we introduce Graph Regularized PCA (GR-PCA). It is a graph-based regularization of PCA that incorporates the dependency structure of the data features by learning a sparse precision graph and biasing loadings toward the low-frequency Fourier modes of the corresponding graph Laplacian. Consequently, high-frequency signals are suppressed, while graph-coherent low-frequency ones are preserved, yielding interpretable principal components aligned with conditional relationships. We evaluate GR-PCA on synthetic data spanning diverse graph topologies, signal-to-noise ratios, and sparsity levels. Compared to mainstream alternatives, it concentrates variance on the intended support, produces loadings with lower graph-Laplacian energy, and remains competitive in out-of-sample reconstruction. When high-frequency signals are present, the graph Laplacian penalty prevents overfitting, reducing the reconstruction accuracy but improving structural fidelity. The advantage over PCA is most pronounced when high-frequency signals are graph-correlated, whereas PCA remains competitive when such signals are nearly rotationally invariant. The procedure is simple to implement, modular with respect to the precision estimator, and scalable, providing a practical route to structure-aware dimensionality reduction that improves structural fidelity without sacrificing predictive performance.
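
The "graph-Laplacian energy" of a loading vector mentioned above is the quadratic form w^T L w, which for an unweighted graph equals the sum of squared differences across edges. The toy example below only illustrates why a Laplacian penalty favours smooth (low-frequency) loadings; the path graph and the two vectors are made up, and nothing here reproduces the GR-PCA estimator itself.

```python
def laplacian_energy(w, edges):
    """w^T L w for an unweighted graph with Laplacian L = D - A,
    computed as the sum over edges (i, j) of (w[i] - w[j]) ** 2."""
    return sum((w[i] - w[j]) ** 2 for i, j in edges)

# Path graph on four feature nodes: 0 - 1 - 2 - 3 (hypothetical example).
path_edges = [(0, 1), (1, 2), (2, 3)]

# Two unit-norm loading vectors: one constant, one alternating in sign.
smooth = [0.5, 0.5, 0.5, 0.5]
oscillating = [0.5, -0.5, 0.5, -0.5]

smooth_energy = laplacian_energy(smooth, path_edges)
osc_energy = laplacian_energy(oscillating, path_edges)
print(smooth_energy, osc_energy)
```

Both vectors have the same norm, so any difference in a penalized objective comes entirely from the Laplacian term, which is zero for the constant vector and positive for the alternating one.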

[LG-28] Reinforcement Learning to Discover a NorthEast Monsoon Index for Monthly Rainfall Prediction in Thailand

Link: https://arxiv.org/abs/2601.10181
Authors: Kiattikun Chobtham
Categories: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP)

Abstract:Climate prediction is a challenge due to the intricate spatiotemporal patterns within Earth systems. Global climate indices, such as the El Niño Southern Oscillation, are standard input features for long-term rainfall prediction. However, a significant gap persists regarding local-scale indices capable of improving predictive accuracy in specific regions of Thailand. This paper introduces a novel NorthEast monsoon climate index calculated from sea surface temperature to reflect the climatology of the boreal winter monsoon. To optimise the calculated areas used for this index, a Deep Q-Network reinforcement learning agent explores and selects the most effective rectangles based on their correlation with seasonal rainfall. Rainfall stations were classified into 12 distinct clusters to distinguish rainfall patterns between southern and upper Thailand. Experimental results show that incorporating the optimised index into Long Short-Term Memory models significantly improves long-term monthly rainfall prediction skill in most cluster areas. This approach effectively reduces the Root Mean Square Error for 12-month-ahead forecasts.

[LG-29] Bias in the Shadows: Explore Shortcuts in Encrypted Network Traffic Classification

Link: https://arxiv.org/abs/2601.10180
Authors: Chuyi Wang, Xiaohui Xie, Tongze Wang, Yong Cui
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Pre-trained models operating directly on raw bytes have achieved promising performance in encrypted network traffic classification (NTC), but often suffer from shortcut learning, relying on spurious correlations that fail to generalize to real-world data. Existing solutions heavily rely on model-specific interpretation techniques, which lack adaptability and generality across different model architectures and deployment scenarios. In this paper, we propose BiasSeeker, the first semi-automated framework that is both model-agnostic and data-driven for detecting dataset-specific shortcut features in encrypted traffic. By performing statistical correlation analysis directly on raw binary traffic, BiasSeeker identifies spurious or environment-entangled features that may compromise generalization, independent of any classifier. To address the diverse nature of shortcut features, we introduce a systematic categorization and apply category-specific validation strategies that reduce bias while preserving meaningful information. We evaluate BiasSeeker on 19 public datasets across three NTC tasks. By emphasizing context-aware feature selection and dataset-specific diagnosis, BiasSeeker offers a novel perspective for understanding and addressing shortcut learning in encrypted network traffic classification, raising awareness that feature selection should be an intentional and scenario-sensitive step prior to model training.

[LG-30] CC-OR-Net: A Unified Framework for LTV Prediction through Structural Decoupling WWW’26

Link: https://arxiv.org/abs/2601.10176
Authors: Mingyu Zhao, Haoran Bai, Yu Tian, Bing Zhu, Hengliang Luo
Categories: Machine Learning (cs.LG)
Comments: Accepted by WWW’26

Abstract:Customer Lifetime Value (LTV) prediction, a central problem in modern marketing, is characterized by a unique zero-inflated and long-tail data distribution. This distribution presents two fundamental challenges: (1) the vast majority of low-to-medium value users numerically overwhelm the small but critically important segment of high-value “whale” users, and (2) significant value heterogeneity exists even within the low-to-medium value user base. Common approaches either rely on rigid statistical assumptions or attempt to decouple ranking and regression using ordered buckets; however, they often enforce ordinality through loss-based constraints rather than inherent architectural design, failing to balance global accuracy with high-value precision. To address this gap, we propose Conditional Cascaded Ordinal-Residual Networks (CC-OR-Net), a novel unified framework that achieves a more robust decoupling through structural decomposition, where ranking is architecturally guaranteed. CC-OR-Net integrates three specialized components: a structural ordinal decomposition module for robust ranking, an intra-bucket residual module for fine-grained regression, and a targeted high-value augmentation module for precision on top-tier users. Evaluated on real-world datasets with over 300M users, CC-OR-Net achieves a superior trade-off across all key business metrics, outperforming state-of-the-art methods in creating a holistic and commercially valuable LTV prediction solution.

[LG-31] Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text EACL2026

Link: https://arxiv.org/abs/2601.10096
Authors: Piyush Singh Pasi
Categories: Machine Learning (cs.LG)
Comments: EACL 2026 Findings accepted. Initial Draft of Camera-ready

Abstract:Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely heavily on machine translation, while advances in multilingual text modeling remain underutilized. We introduce METAL, a lightweight alignment method that learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. Despite its simplicity, METAL matches baseline performance in English (94.9 percent Recall at 10) and achieves strong zero-shot transfer (89.5 percent Recall at 10 averaged across 11 languages, 10 unseen) on XTD text-to-image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, METAL generalizes to audio-text retrieval and cross-lingual text-to-image generation. We release code and checkpoints at this https URL , as well as multilingual evaluation datasets including MSCOCO Multilingual 30K (this https URL ), AudioCaps Multilingual (this https URL ), and Clotho Multilingual (this https URL ), to facilitate further research.

[LG-32] Bayesian Meta-Analyses Could Be More: A Case Study in Trial of Labor After a Cesarean-section Outcomes and Complications AAAI2026

Link: https://arxiv.org/abs/2601.10089
Authors: Ashley Klein, Edward Raff, Marcia DesJardin
Categories: Machine Learning (cs.LG)
Comments: To appear in AAAI 2026

Abstract:A meta-analysis’s utility depends on previous studies having accurately captured the variables of interest, but in medical studies a key variable that influences a physician’s decisions may not have been captured. This results in an unknown effect size and unreliable conclusions. A Bayesian approach may allow analysis to determine if the claim of a positive effect is still warranted, and we build a Bayesian approach to this common medical scenario. To demonstrate its utility, we assist professional OBGYNs in evaluating Trial of Labor After a Cesarean-section (TOLAC) situations where few interventions are available for patients and find the support needed for physicians to advance patient care.

[LG-33] Adaptive Label Error Detection: A Bayesian Approach to Mislabeled Data Detection

Link: https://arxiv.org/abs/2601.10084
Authors: Zan Chaudhry, Noam H. Rotenberg, Brian Caffo, Craig K. Jones, Haris I. Sair
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract:Machine learning classification systems are susceptible to poor performance when trained with incorrect ground truth labels, even when data is well-curated by expert annotators. As machine learning becomes more widespread, it is increasingly imperative to identify and correct mislabeling to develop more powerful models. In this work, we motivate and describe Adaptive Label Error Detection (ALED), a novel method of detecting mislabeling. ALED extracts an intermediate feature space from a deep convolutional neural network, denoises the features, models the reduced manifold of each class with a multidimensional Gaussian distribution, and performs a simple likelihood ratio test to identify mislabeled samples. We show that ALED has markedly increased sensitivity, without compromising precision, compared to established label error detection methods, on multiple medical imaging datasets. We demonstrate an example where fine-tuning a neural network on corrected data results in a 33.8% decrease in test set errors, providing strong benefits to end users. The ALED detector is deployed in the Python package statlab.
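
The last two stages of the pipeline described above (a Gaussian model per class, then a likelihood ratio test) can be sketched on toy data. This is a simplified illustration and not the ALED implementation or the statlab API: it uses diagonal covariances, skips the CNN feature extraction and denoising stages, and the threshold and data are hypothetical.

```python
import math
from collections import defaultdict

def fit_diag_gaussians(features, labels):
    """Per-class mean and variance for each feature dimension (diagonal model)."""
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    params = {}
    for y, rows in by_class.items():
        n, dims = len(rows), len(rows[0])
        mu = [sum(r[d] for r in rows) / n for d in range(dims)]
        # A small constant keeps the variance strictly positive.
        var = [sum((r[d] - mu[d]) ** 2 for r in rows) / n + 1e-6 for d in range(dims)]
        params[y] = (mu, var)
    return params

def log_likelihood(x, mu, var):
    """Log density of x under an axis-aligned Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def flag_mislabeled(features, labels, log_ratio_threshold=0.0):
    """Flag samples whose best alternative class is more likely than the
    assigned class by more than the (assumed) log-ratio threshold."""
    params = fit_diag_gaussians(features, labels)
    flagged = []
    for idx, (x, y) in enumerate(zip(features, labels)):
        own = log_likelihood(x, *params[y])
        best_other = max(log_likelihood(x, *params[c]) for c in params if c != y)
        if best_other - own > log_ratio_threshold:
            flagged.append(idx)
    return flagged

# Toy 2-D features: class 0 near (0, 0), class 1 near (5, 5).
# Index 5 sits in the class-1 cluster but carries label 0.
features = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.1), (0.1, 0.2), (-0.2, -0.2),
            (5.0, 5.1),
            (5.0, 5.0), (5.2, 4.9), (4.9, 5.1), (5.1, 5.2), (4.8, 4.8)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
flagged = flag_mislabeled(features, labels)
print(flagged)
```

Note that the Gaussians are fitted on the noisy labels themselves, so the mislabeled point inflates the class-0 variance; the test still separates it here because the class-1 model explains it far better.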

[LG-34] Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection

Link: https://arxiv.org/abs/2601.10067
Authors: Hung Vinh Tran, Tong Chen, Hechuan Wen, Quoc Viet Hung Nguyen, Bin Cui, Hongzhi Yin
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: WebConf 2026

Abstract:Content-based recommendation systems (CRSs) utilize content features to predict user-item interactions, serving as essential tools for helping users navigate information-rich web services. However, ensuring the effectiveness of CRSs requires large-scale and even continuous model training to accommodate diverse user preferences, resulting in significant computational costs and resource demands. A promising approach to this challenge is coreset selection, which identifies a small but representative subset of data samples that preserves model quality while reducing training overhead. Yet, the selected coreset is vulnerable to the pervasive noise in user-item interactions, particularly when it is minimally sized. To this end, we propose Noise-aware Coreset Selection (NaCS), a specialized framework for CRSs. NaCS constructs coresets through submodular optimization based on training gradients, while simultaneously correcting noisy labels using a progressively trained model. Meanwhile, we refine the selected coreset by filtering out low-confidence samples through uncertainty quantification, thereby avoiding training with unreliable interactions. Through extensive experiments, we show that NaCS produces higher-quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques. Notably, NaCS recovers 93-95% of full-dataset training performance using merely 1% of the training data. The source code is available at this https URL.

[LG-35] Unlabeled Data Can Provably Enhance In-Context Learning of Transformers NEURIPS2025

Link: https://arxiv.org/abs/2601.10058
Authors: Renpu Liu, Jing Yang
Categories: Machine Learning (cs.LG)
Comments: Published as a conference paper at NeurIPS 2025

Abstract:Large language models (LLMs) exhibit impressive in-context learning (ICL) capabilities, yet the quality of their predictions is fundamentally limited by the few costly labeled demonstrations that can fit into a prompt. Meanwhile, there exist vast and continuously growing amounts of unlabeled data that may be closely related to the ICL task. How to utilize such unlabeled data to provably enhance the performance of ICL thus becomes an emerging fundamental question. In this work, we propose a novel augmented ICL framework, in which the prompt includes a small set of labeled examples alongside a block of unlabeled inputs. We focus on the multi-class linear classification setting and demonstrate that, with chain-of-thought (CoT) prompting, a multi-layer transformer can effectively emulate an expectation-maximization (EM) algorithm. This enables the transformer to implicitly extract useful information from both labeled and unlabeled data, leading to provable improvements in ICL accuracy. Moreover, we show that such a transformer can be trained via teacher forcing, with its parameters converging to the desired solution at a linear rate. Experiments demonstrate that the augmented ICL framework consistently outperforms conventional few-shot ICL, providing empirical support for our theoretical findings. To the best of our knowledge, this is the first theoretical study on the impact of unlabeled data on the ICL performance of transformers.

[LG-36] BPE: Behavioral Profiling Ensemble

Link: https://arxiv.org/abs/2601.10024
Authors: Yanxin Liu, Yunqi Zhang
Categories: Machine Learning (cs.LG)

Abstract:Ensemble learning is widely recognized as a pivotal strategy for pushing the boundaries of predictive performance. Traditional static ensemble methods, such as Stacking, typically assign weights by treating each base learner as a holistic entity, thereby overlooking the fact that individual models exhibit varying degrees of competence across different regions of the instance space. To address this limitation, Dynamic Ensemble Selection (DES) was introduced. However, both static and dynamic approaches predominantly rely on the divergence among different models as the basis for integration. This inter-model perspective neglects the intrinsic characteristics of the models themselves and necessitates a heavy reliance on validation sets for competence estimation. In this paper, we propose the Behavioral Profiling Ensemble (BPE) framework, which introduces a novel paradigm shift. Unlike traditional methods, BPE constructs a “behavioral profile” intrinsic to each model and derives integration weights based on the deviation between the model’s response to a specific test instance and its established behavioral profile. Extensive experiments on both synthetic and real-world datasets demonstrate that the algorithm derived from the BPE framework achieves significant improvements over state-of-the-art ensemble baselines. These gains are evident not only in predictive accuracy but also in computational efficiency and storage resource utilization across various scenarios.

[LG-37] Time Aggregation Features for XGBoost Models

Link: https://arxiv.org/abs/2601.10019
Authors: Mykola Pinchuk
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 18 tables and figures

Click to view abstract

Abstract:This paper studies time aggregation features for XGBoost models in click-through rate prediction. The setting is the Avazu click-through rate prediction dataset with strict out-of-time splits and a no-lookahead feature constraint. Features for hour H use only impressions from hours strictly before H. This paper compares a strong time-aware target encoding baseline to models augmented with entity history time aggregation under several window designs. Across two rolling-tail folds on a deterministic ten percent sample, a trailing window specification improves ROC AUC by about 0.0066 to 0.0082 and PR AUC by about 0.0084 to 0.0094 relative to target encoding alone. Within the time aggregation design grid, event count windows provide the only consistent improvement over trailing windows, and the gain is small. Gap windows and bucketized windows underperform simple trailing windows in this dataset and protocol. These results support a practical default of trailing windows, with an optional event count window when marginal ROC AUC gains matter.
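
The no-lookahead trailing-window idea described above can be illustrated with a minimal sketch. This is not the paper's code: the function name, the `(hour, entity, clicked)` event layout, and the exact window definition (the `window_hours` hours ending at H-1) are assumptions made for illustration only.

```python
from collections import defaultdict


def trailing_window_features(events, window_hours=24):
    """Per-event (impressions, clicks) history for the same entity.

    events: list of (hour, entity, clicked) tuples with integer hours
    (hypothetical layout). Features for an event in hour H use only
    hours H - window_hours .. H - 1, so nothing from hour H itself or
    later leaks into the feature.
    """
    # First pass: total impressions and clicks per (entity, hour).
    per_hour = defaultdict(lambda: [0, 0])
    for hour, entity, clicked in events:
        cell = per_hour[(entity, hour)]
        cell[0] += 1
        cell[1] += int(clicked)

    # Second pass: sum the hourly totals over the trailing window.
    features = []
    for hour, entity, _ in events:
        impressions = clicks = 0
        for h in range(hour - window_hours, hour):  # strictly before `hour`
            if (entity, h) in per_hour:
                imp, clk = per_hour[(entity, h)]
                impressions += imp
                clicks += clk
        features.append((impressions, clicks))
    return features
```

Aggregating by hour before summing means every event in hour H sees the same history, which is one simple way to respect a constraint of the form "features for hour H use only impressions from hours strictly before H".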

[LG-38] CAFEDistill: Learning Personalized and Dynamic Models through Federated Early-Exit Network Distillation

Link: https://arxiv.org/abs/2601.10015
Authors: Boyi Liu,Zimu Zhou,Yongxin Tong
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, conference

Click to view abstract

Abstract:Personalized Federated Learning (PFL) enables collaborative model training on decentralized, heterogeneous data while tailoring models to each client’s unique distribution. However, existing PFL methods produce static models with a fixed tradeoff between accuracy and efficiency, limiting their applicability in environments where inference requirements vary with contexts and resource availability. Early-exit networks (EENs) offer adaptive inference by attaching intermediate classifiers. Yet integrating them into PFL is challenging due to client-wise heterogeneity and depth-wise interference arising from conflicting exit objectives. Prior studies fail to resolve both conflicts simultaneously, leading to suboptimal performance. In this paper, we propose CAFEDistill, a Conflict-Aware Federated Exit Distillation framework that jointly addresses these conflicts and extends PFL to early-exit networks. Through a progressive, depth-prioritized student coordination mechanism, CAFEDistill mitigates interference among shallow and deep exits while allowing effective personalized knowledge transfer across clients. Furthermore, it reduces communication overhead via a client-decoupled formulation. Extensive evaluations show that CAFEDistill outperforms state-of-the-art methods, achieving higher accuracy and reducing inference costs by 30.79%-46.86%.

[LG-39] PID-Guided Partial Alignment for Multimodal Decentralized Federated Learning

Link: https://arxiv.org/abs/2601.10012
Authors: Yanhang Shi,Xiaoyu Wang,Houwei Cao,Jian Li,Yong Liu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Multimodal decentralized federated learning (DFL) is challenging because agents differ in available modalities and model architectures, yet must collaborate over peer-to-peer (P2P) networks without a central coordinator. Standard multimodal pipelines learn a single shared embedding across all modalities. In DFL, such a monolithic representation induces gradient misalignment between uni- and multimodal agents; as a result, it suppresses heterogeneous sharing and cross-modal interaction. We present PARSE, a multimodal DFL framework that operationalizes partial information decomposition (PID) in a server-free setting. Each agent performs feature fission to factorize its latent representation into redundant, unique, and synergistic slices. P2P knowledge sharing among heterogeneous agents is enabled by slice-level partial alignment: only semantically shareable branches are exchanged among agents that possess the corresponding modality. By removing the need for central coordination and gradient surgery, PARSE resolves uni-/multimodal gradient conflicts, thereby overcoming the multimodal DFL dilemma while remaining compatible with standard DFL constraints. Across benchmarks and agent mixes, PARSE yields consistent gains over task-, modality-, and hybrid-sharing DFL baselines. Ablations on fusion operators and split ratios, together with qualitative visualizations, further demonstrate the efficiency and robustness of the proposed design.

[LG-40] SoK: Privacy-aware LLM in Healthcare: Threat Model, Privacy Techniques, Challenges and Recommendations

Link: https://arxiv.org/abs/2601.10004
Authors: Mohoshin Ara Tahera,Karamveer Singh Sidhu,Shuvalaxmi Dass,Sajal Saha
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly adopted in healthcare to support clinical decision-making, summarize electronic health records (EHRs), and enhance patient care. However, this integration introduces significant privacy and security challenges, driven by the sensitivity of clinical data and the high-stakes nature of medical workflows. These risks become even more pronounced across heterogeneous deployment environments, ranging from small on-premise hospital systems to regional health networks, each with unique resource limitations and regulatory demands. This Systematization of Knowledge (SoK) examines the evolving threat landscape across the three core LLM phases: Data preprocessing, Fine-tuning, and Inference within realistic healthcare settings. We present a detailed threat model that characterizes adversaries, capabilities, and attack surfaces at each phase, and we systematize how existing privacy-preserving techniques (PPTs) attempt to mitigate these vulnerabilities. While existing defenses show promise, our analysis identifies persistent limitations in securing sensitive clinical data across diverse operational tiers. We conclude with phase-aware recommendations and future research directions aimed at strengthening privacy guarantees for LLMs in regulated environments. This work provides a foundation for understanding the intersection of LLMs, threats, and privacy in healthcare, offering a roadmap toward more robust and clinically trustworthy AI systems.

[LG-41] FaTRQ: Tiered Residual Quantization for LLM Vector Search in Far-Memory-Aware ANNS Systems

Link: https://arxiv.org/abs/2601.09985
Authors: Tianqi Zhang,Flavio Ponzina,Tajana Rosing
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Approximate Nearest-Neighbor Search (ANNS) is a key technique in retrieval-augmented generation (RAG), enabling rapid identification of the most relevant high-dimensional embeddings from massive vector databases. Modern ANNS engines accelerate this process using prebuilt indexes and store compressed vector-quantized representations in fast memory. However, they still rely on a costly second-pass refinement stage that reads full-precision vectors from slower storage like SSDs. For modern text and multimodal embeddings, these reads now dominate the latency of the entire query. We propose FaTRQ, a far-memory-aware refinement system using tiered memory that eliminates the need to fetch full vectors from storage. It introduces a progressive distance estimator that refines coarse scores using compact residuals streamed from far memory. Refinement stops early once a candidate is provably outside the top-k. To support this, we propose tiered residual quantization, which encodes residuals as ternary values stored efficiently in far memory. A custom accelerator is deployed in a CXL Type-2 device to perform low-latency refinement locally. Together, FaTRQ improves storage efficiency by 2.4× and throughput by up to 9× over a SOTA GPU ANNS system.

[LG-42] In-Context Operator Learning on the Space of Probability Measures

Link: https://arxiv.org/abs/2601.09979
Authors: Frank Cole,Dixi Wang,Yineng Chen,Yulong Lu,Rongjie Lai
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:We introduce in-context operator learning on probability measure spaces for optimal transport (OT). The goal is to learn a single solution operator that maps a pair of distributions to the OT map, using only few-shot samples from each distribution as a prompt and without gradient updates at inference. We parameterize the solution operator and develop scaling-law theory in two regimes. In the nonparametric setting, when tasks concentrate on a low-intrinsic-dimension manifold of source–target pairs, we establish generalization bounds that quantify how in-context accuracy scales with prompt size, intrinsic task dimension, and model capacity. In the parametric setting (e.g., Gaussian families), we give an explicit architecture that recovers the exact OT map in context and provide finite-sample excess-risk bounds. Our numerical experiments on synthetic transports and generative-modeling benchmarks validate the framework.

[LG-43] An Exploratory Study to Repurpose LLMs to a Unified Architecture for Time Series Classification

Link: https://arxiv.org/abs/2601.09971
Authors: Hansen He,Shuheng Li
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Time series classification (TSC) is a core machine learning problem with broad applications. Recently there has been growing interest in repurposing large language models (LLMs) for TSC, motivated by their strong reasoning and generalization ability. Prior work has primarily focused on alignment strategies that explicitly map time series data into the textual domain; however, the choice of time series encoder architecture remains underexplored. In this work, we conduct an exploratory study of hybrid architectures that combine specialized time series encoders with a frozen LLM backbone. We evaluate a diverse set of encoder families, including Inception, convolutional, residual, transformer-based, and multilayer perceptron architectures, among which the Inception model is the only encoder architecture that consistently yields positive performance gains when integrated with an LLM backbone. Overall, this study highlights the impact of time series encoder choice in hybrid LLM architectures and points to Inception-based models as a promising direction for future LLM-driven time series learning.

[LG-44] Interpolation-Based Optimization for Enforcing lp-Norm Metric Differential Privacy in Continuous and Fine-Grained Domains USENIX-SECURITY2026

Link: https://arxiv.org/abs/2601.09946
Authors: Chenxi Qiu
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: USENIX Security 2026

Click to view abstract

Abstract:Metric Differential Privacy (mDP) generalizes Local Differential Privacy (LDP) by adapting privacy guarantees based on pairwise distances, enabling context-aware protection and improved utility. While existing optimization-based methods reduce utility loss effectively in coarse-grained domains, optimizing mDP in fine-grained or continuous settings remains challenging due to the computational cost of constructing dense perturbation matrices and satisfying pointwise constraints. In this paper, we propose an interpolation-based framework for optimizing lp-norm mDP in such domains. Our approach optimizes perturbation distributions at a sparse set of anchor points and interpolates distributions at non-anchor locations via log-convex combinations, which provably preserve mDP. To address privacy violations caused by naive interpolation in high-dimensional spaces, we decompose the interpolation process into a sequence of one-dimensional steps and derive a corrected formulation that enforces lp-norm mDP by design. We further explore joint optimization over perturbation distributions and privacy budget allocation across dimensions. Experiments on real-world location datasets demonstrate that our method offers rigorous privacy guarantees and competitive utility in fine-grained domains, outperforming baseline mechanisms.

[LG-45] The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation

Link: https://arxiv.org/abs/2601.09926
Authors: Kirandeep Kaur,Vinayak Gupta,Aditya Gupta,Chirag Shah
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Most language-based assistants follow a reactive ask-and-respond paradigm, requiring users to explicitly state their needs. As a result, relevant but unexpressed needs often go unmet. Existing proactive agents attempt to address this gap either by eliciting further clarification, preserving this burden, or by extrapolating future needs from context, often leading to unnecessary or mistimed interventions. We introduce ProPer, Proactivity-driven Personalized agents, a novel two-agent architecture consisting of a Dimension Generating Agent (DGA) and a Response Generating Agent (RGA). DGA, a fine-tuned LLM agent, leverages explicit user data to generate multiple implicit dimensions (latent aspects relevant to the user’s task but not considered by the user) or knowledge gaps. These dimensions are selectively filtered using a reranker based on quality, diversity, and task relevance. RGA then balances explicit and implicit dimensions to tailor personalized responses with timely and proactive interventions. We evaluate ProPer across multiple domains using a structured, gap-aware rubric that measures coverage, initiative appropriateness, and intent alignment. Our results show that ProPer improves quality scores and win rates across all domains, achieving up to 84% gains in single-turn evaluation and consistent dominance in multi-turn interactions.

[LG-46] A New Convergence Analysis of Plug-and-Play Proximal Gradient Descent Under Prior Mismatch

Link: https://arxiv.org/abs/2601.09831
Authors: Guixian Xu,Jinglai Li,Junqi Tang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:In this work, we provide a new convergence theory for plug-and-play proximal gradient descent (PnP-PGD) under prior mismatch, where the denoiser is trained on a data distribution different from that of the inference task at hand. To the best of our knowledge, this is the first convergence proof of PnP-PGD under prior mismatch. Compared with existing theoretical results for PnP algorithms, our new results remove the need for several restrictive and unverifiable assumptions.
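
For orientation, the PnP-PGD iteration is usually written in the following form. The notation is an assumption chosen here and is not taken from the abstract: $f$ is the data-fidelity term, $\tau > 0$ the step size, and $D_\sigma$ a learned denoiser standing in for the proximal operator of the regularizer.

```latex
x^{k+1} = D_\sigma\!\left( x^{k} - \tau \nabla f\!\left(x^{k}\right) \right), \qquad k = 0, 1, 2, \dots
```

In this notation, prior mismatch means that $D_\sigma$ was trained on data drawn from a distribution other than the one the reconstruction target comes from.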

[LG-47] Eluder dimension: localise it!

Link: https://arxiv.org/abs/2601.09825
Authors: Alireza Bakhtiari,Alex Ayoub,Samuel Robertson,David Janz,Csaba Szepesvári
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We establish a lower bound on the eluder dimension of generalised linear model classes, showing that standard eluder dimension-based analysis cannot lead to first-order regret bounds. To address this, we introduce a localisation method for the eluder dimension; our analysis immediately recovers and improves on classic results for Bernoulli bandits, and allows for the first genuine first-order bounds for finite-horizon reinforcement learning tasks with bounded cumulative returns.

[LG-48] TimeSAE: Sparse Decoding for Faithful Explanations of Black-Box Time Series Models

Link: https://arxiv.org/abs/2601.09776
Authors: Khalid Oublal,Quentin Bouniot,Qi Gan,Stephan Clémençon,Zeynep Akata
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As black box models and pretrained models gain traction in time series applications, understanding and explaining their predictions becomes increasingly vital, especially in high-stakes domains where interpretability and trust are essential. However, most of the existing methods involve only in-distribution explanation, and do not generalize outside the training support, which requires the learning capability of generalization. In this work, we aim to provide a framework to explain black-box models for time series data through the dual lenses of Sparse Autoencoders (SAEs) and causality. We show that many current explanation methods are sensitive to distributional shifts, limiting their effectiveness in real-world scenarios. Building on the concept of Sparse Autoencoder, we introduce TimeSAE, a framework for black-box model explanation. We conduct extensive evaluations of TimeSAE on both synthetic and real-world time series datasets, comparing it to leading baselines. The results, supported by both quantitative metrics and qualitative insights, show that TimeSAE provides more faithful and robust explanations. Our code is available in an easy-to-use library TimeSAE-Lib: this https URL.

[LG-49] Adjusted Similarity Measures and a Violation of Expectations

Link: https://arxiv.org/abs/2601.10641
Authors: William L. Lippitt,Edward J. Bedrick,Nichole E. Carlson
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 12 pages, 1 figure

Click to view abstract

Abstract:Adjusted similarity measures, such as Cohen’s kappa for inter-rater reliability and the adjusted Rand index used to compare clustering algorithms, are a vital tool for comparing discrete labellings. These measures are intended to have the property of 0 expectation under a null distribution and maximum value 1 under maximal similarity to aid in interpretation. Measures are frequently adjusted with respect to the permutation distribution for historic and analytic reasons. There is currently renewed interest in considering other null models more appropriate for context, such as clustering ensembles permitting a random number of identified clusters. The purpose of this work is twofold: (1) to generalize the study of the adjustment operator to general null models and to a more general procedure which includes statistical standardization as a special case and (2) to identify sufficient conditions for the adjustment operator to produce the intended properties, where sufficient conditions are related to whether and how observed data are incorporated into null distributions. We demonstrate how violations of the sufficient conditions may lead to substantial breakdown, such as by producing a non-positive measure under traditional adjustment rather than one with mean 0, or by producing a measure which is deterministically 0 under statistical standardization.
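
The traditional adjustment mentioned above has the familiar form (S - E[S]) / (S_max - E[S]), where S is a raw similarity and E[S] its expectation under the null model. As a concrete instance, here is a minimal sketch of Cohen's kappa, where S is the observed agreement rate and the expected agreement comes from the two raters' label marginals. The function name is illustrative, and the code is not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (observed - expected) / (1 - expected), where `expected`
    is the chance agreement implied by the raters' label marginals.
    Undefined (division by zero) when both raters always use the
    same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of P_a(label) * P_b(label).
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Perfect agreement gives 1 and agreement at exactly the chance level gives 0, which are the two calibration properties the abstract says the adjustment is meant to deliver.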

[LG-50] Classification Imbalance as Transfer Learning

Link: https://arxiv.org/abs/2601.10630
Authors: Eric Xia,Jason M. Klusowski
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Classification imbalance arises when one class is much rarer than the other. We frame this setting as transfer learning under label (prior) shift between an imbalanced source distribution induced by the observed data and a balanced target distribution under which performance is evaluated. Within this framework, we study a family of oversampling procedures that augment the training data by generating synthetic samples from an estimated minority-class distribution to roughly balance the classes, among which the celebrated SMOTE algorithm is a canonical example. We show that the excess risk decomposes into the rate achievable under balanced training (as if the data had been drawn from the balanced target distribution) and an additional term, the cost of transfer, which quantifies the discrepancy between the estimated and true minority-class distributions. In particular, we show that the cost of transfer for SMOTE dominates that of bootstrapping (random oversampling) in moderately high dimensions, suggesting that we should expect bootstrapping to have better performance than SMOTE in general. We corroborate these findings with experimental evidence. More broadly, our results provide guidance for choosing among augmentation strategies for imbalanced classification.
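
The two oversampling schemes the abstract compares can be sketched in a few lines. This is a pure-Python illustration under stated assumptions, not the authors' implementation: the function names, the neighbour count `k`, and the basic SMOTE variant used here (interpolating toward one of the `k` nearest minority neighbours) are choices made for this example.

```python
import random


def random_oversample(minority, n_new, rng):
    """Bootstrapping: draw existing minority points with replacement."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_oversample(minority, n_new, k, rng):
    """Basic SMOTE: interpolate between a minority point and a neighbour.

    Requires at least two minority points.
    """
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not x]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p))
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic
```

Bootstrapping only ever repeats observed points, whereas SMOTE samples from line segments between them, so the two schemes correspond to different implicit estimates of the minority-class distribution.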

[LG-51] Parametric RDT approach to computational gap of symmetric binary perceptron

Link: https://arxiv.org/abs/2601.10628
Authors: Mihailo Stojnic
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Click to view abstract

Abstract:We study potential presence of statistical-computational gaps (SCG) in symmetric binary perceptrons (SBP) via a parametric utilization of fully lifted random duality theory (fl-RDT) [96]. A structural change from decreasingly to arbitrarily ordered $c$-sequence (a key fl-RDT parametric component) is observed on the second lifting level and associated with satisfiability ($\alpha_c$) – algorithmic ($\alpha_a$) constraints density threshold change, thereby suggesting a potential existence of a nonzero computational gap $SCG=\alpha_c-\alpha_a$. The second level estimate is shown to match the theoretical $\alpha_c$ whereas the $r\rightarrow\infty$ level one is proposed to correspond to $\alpha_a$. For example, for the canonical SBP ($\kappa=1$ margin) we obtain $\alpha_c\approx 1.8159$ on the second and $\alpha_a\approx 1.6021$ (with converging tendency towards $\sim 1.59$ range) on the seventh level. Our propositions remarkably well concur with recent literature: (i) in [20] local entropy replica approach predicts $\alpha_{LE}\approx 1.58$ as the onset of clustering defragmentation (presumed driving force behind locally improving algorithms failures); (ii) in $\alpha\rightarrow 0$ regime we obtain on the third lifting level $\kappa\approx 1.2385\sqrt{\frac{\alpha_a}{-\log\left(\alpha_a\right)}}$ which qualitatively matches overlap gap property (OGP) based predictions of [43] and identically matches local entropy based predictions of [24]; (iii) $c$-sequence ordering change phenomenology mirrors the one observed in asymmetric binary perceptron (ABP) in [98] and the negative Hopfield model in [100]; and (iv) as in [98,100], we here design a CLuP based algorithm whose practical performance closely matches proposed theoretical predictions.

[LG-52] Searching for Quantum Effects in the Brain: A Bell-Type Test for Nonclassical Latent Representations in Autoencoders

Link: https://arxiv.org/abs/2601.10588
Authors: I. K. Kominis,C. Xie,S. Li,M. Skotiniotis,G. P. Tsironis
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments: 6 pages, 2 figures

Click to view abstract

Abstract:Whether neural information processing is entirely classical or involves quantum-mechanical elements remains an open question. Here we propose a model-agnostic, information-theoretic test of nonclassicality that bypasses microscopic assumptions and instead probes the structure of neural representations themselves. Using autoencoders as a transparent model system, we introduce a Bell-type consistency test in latent space, and ask whether decoding statistics obtained under multiple readout contexts can be jointly explained by a single positive latent-variable distribution. By shifting the search for quantum-like signatures in neural systems from microscopic dynamics to experimentally testable constraints on information processing, this work opens a new route for probing the fundamental physics of neural computation.

[LG-53] Coarsening Causal DAG Models

Link: https://arxiv.org/abs/2601.10531
Authors: Francisco Madaleno,Pratik Misra,Alex Markham
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Combinatorics (math.CO)
Comments: 25 pages, 5 figures

Click to view abstract

Abstract:Directed acyclic graphical (DAG) models are a powerful tool for representing causal relationships among jointly distributed random variables, especially concerning data from across different experimental settings. However, it is not always practical or desirable to estimate a causal model at the granularity of given features in a particular dataset. There is a growing body of research on causal abstraction to address such problems. We contribute to this line of research by (i) providing novel graphical identifiability results for practically-relevant interventional settings, (ii) proposing an efficient, provably consistent algorithm for directly learning abstract causal graphs from interventional data with unknown intervention targets, and (iii) uncovering theoretical insights about the lattice structure of the underlying search space, with connections to the field of causal discovery more generally. As proof of concept, we apply our algorithm on synthetic and real datasets with known ground truths, including measurements from a controlled physical system with interacting light intensity and polarization.

[LG-54] CROCS: A Two-Stage Clustering Framework for Behaviour-Centric Consumer Segmentation with Smart Meter Data

Link: https://arxiv.org/abs/2601.10494
Authors: Luke W. Yerbury,Ricardo J. G. B. Campello,G. C. Livingston Jr,Mark Goldsworthy,Lachlan O’Neil
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With grid operators confronting rising uncertainty from renewable integration and a broader push toward electrification, Demand-Side Management (DSM) – particularly Demand Response (DR) – has attracted significant attention as a cost-effective mechanism for balancing modern electricity systems. Unprecedented volumes of consumption data from a continuing global deployment of smart meters enable consumer segmentation based on real usage behaviours, promising to inform the design of more effective DSM and DR programs. However, existing clustering-based segmentation methods insufficiently reflect the behavioural diversity of consumers, often relying on rigid temporal alignment, and faltering in the presence of anomalies, missing data, or large-scale deployments. To address these challenges, we propose a novel two-stage clustering framework – Clustered Representations Optimising Consumer Segmentation (CROCS). In the first stage, each consumer’s daily load profiles are clustered independently to form a Representative Load Set (RLS), providing a compact summary of their typical diurnal consumption behaviours. In the second stage, consumers are clustered using the Weighted Sum of Minimum Distances (WSMD), a novel set-to-set measure that compares RLSs by accounting for both the prevalence and similarity of those behaviours. Finally, community detection on the WSMD-induced graph reveals higher-order prototypes that embody the shared diurnal behaviours defining consumer groups, enhancing the interpretability of the resulting clusters. Extensive experiments on both synthetic and real Australian smart meter datasets demonstrate that CROCS captures intra-consumer variability, uncovers both synchronous and asynchronous behavioural similarities, and remains robust to anomalies and missing data, while scaling efficiently through natural parallelisation. 
These results…

[LG-55] H-EFT-VA: An Effective-Field-Theory Variational Ansatz with Provable Barren Plateau Avoidance

Link: https://arxiv.org/abs/2601.10479
Authors: Eyad I.B Hamid
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Mathematical Physics (math-ph)
Comments: 7 pages, 5 figures, Appendix

Click to view abstract

Abstract:Variational Quantum Algorithms (VQAs) are critically threatened by the Barren Plateau (BP) phenomenon. In this work, we introduce the H-EFT Variational Ansatz (H-EFT-VA), an architecture inspired by Effective Field Theory (EFT). By enforcing a hierarchical “UV-cutoff” on initialization, we theoretically restrict the circuit’s state exploration, preventing the formation of approximate unitary 2-designs. We provide a rigorous proof that this localization guarantees an inverse-polynomial lower bound on the gradient variance: $\mathrm{Var}[\partial \theta] \in \Omega(1/\mathrm{poly}(N))$. Crucially, unlike approaches that avoid BPs by limiting entanglement, we demonstrate that H-EFT-VA maintains volume-law entanglement and near-Haar purity, ensuring sufficient expressibility for complex quantum states. Extensive benchmarking across 16 experiments – including Transverse Field Ising and Heisenberg XXZ models – confirms a 109x improvement in energy convergence and a 10.7x increase in ground-state fidelity over standard Hardware-Efficient Ansatze (HEA), with a statistical significance of $p < 10^{-88}$.

[LG-56] Sim2Real Deep Transfer for Per-Device CFO Calibration

Link: https://arxiv.org/abs/2601.10264
Authors: Jingze Zheng,Zhiguo Shi,Shibo He,Chaojie Gu
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Accepted by Globecom 2025

Click to view abstract

Abstract:Carrier Frequency Offset (CFO) estimation in Orthogonal Frequency Division Multiplexing (OFDM) systems faces significant performance degradation across heterogeneous software-defined radio (SDR) platforms due to uncalibrated hardware impairments. Existing deep neural network (DNN)-based approaches lack device-level adaptation, limiting their practical deployment. This paper proposes a Sim2Real transfer learning framework for per-device CFO calibration, combining simulation-driven pretraining with lightweight receiver adaptation. A backbone DNN is pre-trained on synthetic OFDM signals incorporating parametric hardware distortions (e.g., phase noise, IQ imbalance), enabling generalized feature learning without costly cross-device data collection. Subsequently, only the regression layers are fine-tuned using 1,000 real frames per target device, preserving hardware-agnostic knowledge while adapting to device-specific impairments. Experiments across three SDR families (USRP B210, USRP N210, HackRF One) achieve a 30× BER reduction compared to conventional CP-based methods under indoor multipath conditions. The framework bridges the simulation-to-reality gap for robust CFO estimation, enabling cost-effective deployment in heterogeneous wireless systems.

[LG-57] Instruction Finetuning LLaMA-3-8B Model Using LoRA for Financial Named Entity Recognition

Link: https://arxiv.org/abs/2601.10043
Authors: Zhiming Lian
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments:

Abstract:Financial named-entity recognition (NER) is an important approach for translating unformatted reports and news into structured knowledge graphs. However, free, easy-to-use large language models (LLMs) often fail to differentiate organisations from people, or disregard an actual monetary amount entirely. This paper takes Meta’s Llama 3 8B and applies it to financial NER by combining instruction fine-tuning and Low-Rank Adaptation (LoRA). Each annotated sentence is converted into an instruction-input-output triple, enabling the model to learn task descriptions while fine-tuning with small low-rank matrices instead of updating all weights. Using a corpus of 1,693 sentences, our method obtains a micro-F1 score of 0.894 in a comparison with Qwen3-8B, Baichuan2-7B, T5, and BERT-Base. We present dataset statistics, describe training hyperparameters, and visualize entity density, learning curves, and evaluation metrics. Our results show that instruction tuning combined with parameter-efficient fine-tuning enables state-of-the-art performance on domain-sensitive NER.
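
As a rough illustration of the "instruction-input-output triple" mentioned in this abstract, the sketch below converts one annotated sentence into such a record. The prompt wording, the label set (ORG, PER, MONEY), the JSON output format, and the example sentence are all assumptions for illustration; the abstract does not specify the paper's schema.

```python
import json

# Hypothetical instruction text and label set; not taken from the paper.
INSTRUCTION = (
    "Extract all named entities from the financial sentence and "
    "label each one as ORG, PER, or MONEY."
)


def to_instruction_triple(sentence, entities):
    """Turn one annotated sentence into an instruction-input-output record.

    `entities` is a list of (surface_text, label) pairs from the annotation.
    The output field is a JSON string so the target is plain text for the model.
    """
    output = json.dumps(
        [{"text": text, "label": label} for text, label in entities]
    )
    return {"instruction": INSTRUCTION, "input": sentence, "output": output}


# Usage with a made-up sentence and annotation.
record = to_instruction_triple(
    "Acme Corp paid $2 million to Jane Doe.",
    [("Acme Corp", "ORG"), ("$2 million", "MONEY"), ("Jane Doe", "PER")],
)
```

Records in this shape can be fed to a standard supervised fine-tuning loop, where only the low-rank adapter weights are trained.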

[LG-58] Accelerated Regularized Wasserstein Proximal Sampling Algorithms

Link: https://arxiv.org/abs/2601.09848
Authors: Hong Ye Tan,Stanley Osher,Wuchen Li
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO)
Comments:

Abstract:We consider sampling from a Gibbs distribution by evolving a finite number of particles using a particular score estimator rather than Brownian motion. To accelerate the particles, we consider a second-order score-based ODE, similar to Nesterov acceleration. In contrast to traditional kernel density score estimation, we use the recently proposed regularized Wasserstein proximal method, yielding the Accelerated Regularized Wasserstein Proximal method (ARWP). We provide a detailed analysis of continuous- and discrete-time non-asymptotic and asymptotic mixing rates for Gaussian initial and target distributions, using techniques from Euclidean acceleration and accelerated information gradients. Compared with the kinetic Langevin sampling algorithm, the proposed algorithm exhibits a higher contraction rate in the asymptotic time regime. Numerical experiments are conducted across various low-dimensional settings, including multi-modal Gaussian mixtures and ill-conditioned Rosenbrock distributions. ARWP exhibits structured and convergent particles, accelerated discrete-time mixing, and faster tail exploration than the non-accelerated regularized Wasserstein proximal method and kinetic Langevin methods. Additionally, ARWP particles exhibit better generalization properties for some non-log-concave Bayesian neural network tasks.

[LG-59] Detecting Batch Heterogeneity via Likelihood Clustering

Link: https://arxiv.org/abs/2601.09758
Authors: Austin Talbot,Yue Ke
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:Batch effects represent a major confounder in genomic diagnostics. In copy number variant (CNV) detection from NGS, many algorithms compare read depth between test samples and a reference sample, assuming they are process-matched. When this assumption is violated, with causes ranging from reagent lot changes to multi-site processing, the reference becomes inappropriate, introducing false CNV calls or masking true pathogenic variants. Detecting such heterogeneity before downstream analysis is critical for reliable clinical interpretation. Existing batch effect detection methods either cluster samples based on raw features, risking conflation of biological signal with technical variation, or require known batch labels that are frequently unavailable. We introduce a method that addresses both limitations by clustering samples according to their Bayesian model evidence. The central insight is that evidence quantifies compatibility between data and model assumptions: technical artifacts violate assumptions and reduce evidence, whereas biological variation, including CNV status, is anticipated by the model and yields high evidence. This asymmetry provides a discriminative signal that separates batch effects from biology. We formalize heterogeneity detection as a likelihood ratio test for mixture structure in evidence space, using parametric bootstrap calibration to ensure conservative false positive rates. We validate our approach on synthetic data demonstrating proper Type I error control, three clinical targeted sequencing panels (liquid biopsy, BRCA, and thalassemia) exhibiting distinct batch effect mechanisms, and mouse electrophysiology recordings demonstrating cross-modality generalization. Our method achieves superior clustering accuracy compared to standard correlation-based and dimensionality-reduction approaches while maintaining the conservativeness required for clinical usage.

Information Retrieval

[IR-0] RoutIR: Fast Serving of Retrieval Pipelines for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2601.10644
Authors: Eugene Yang,Andrew Yates,Dawn Lawrie,James Mayfield,Trevor Adriaanse
Subjects: Information Retrieval (cs.IR)
Comments: 17 pages, 1 figure

Abstract:Retrieval models are key components of Retrieval-Augmented Generation (RAG) systems, which generate search queries, process the documents returned, and generate a response. RAG systems are often dynamic and may involve multiple rounds of retrieval. While many state-of-the-art retrieval methods are available through academic IR platforms, these platforms are typically designed for the Cranfield paradigm in which all queries are known up front and can be batch processed offline. This simplification accelerates research but leaves state-of-the-art retrieval models unable to support downstream applications that require online services, such as arbitrary dynamic RAG pipelines that involve looping, feedback, or even self-organizing agents. In this work, we introduce RoutIR, a Python package that provides a simple and efficient HTTP API that wraps arbitrary retrieval methods, including first stage retrieval, reranking, query expansion, and result fusion. By providing a minimal JSON configuration file specifying the retrieval models to serve, RoutIR can be used to construct and query retrieval pipelines on-the-fly using any permutation of available models (e.g., fusing the results of several first-stage retrieval methods followed by reranking). The API automatically performs asynchronous query batching and caches results by default. While many state-of-the-art retrieval methods are already supported by the package, RoutIR is also easily expandable by implementing the Engine abstract class. The package is open-sourced and publicly available on GitHub: this http URL.
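
The abstract's example of "fusing the results of several first-stage retrieval methods" can be pictured with reciprocal rank fusion (RRF), one widely used fusion technique. This is a generic sketch, not RoutIR's API: the abstract does not say which fusion methods the package implements, and the function name and the constant k=60 are assumptions (60 is a common default for RRF).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken by document id so the result
    is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: two hypothetical first-stage retrievers returning overlapping results.
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d1"]])
```

The fused list could then be passed to a reranker, matching the "fusion followed by reranking" pipeline shape the abstract describes.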

[IR-1] iTIMO: An LLM-empowered Synthesis Dataset for Travel Itinerary Modification

Link: https://arxiv.org/abs/2601.10609
Authors: Zhuoxuan Huang,Yunshan Ma,Hongyu Zhang,Hua Ma,Zhu Sun
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Addressing itinerary modification is crucial for enhancing the travel experience as it is a frequent requirement during traveling. However, existing research mainly focuses on fixed itinerary planning, leaving modification underexplored. To bridge this gap, we formally define the itinerary modification task and introduce iTIMO, a dataset specifically tailored for this purpose. We identify the lack of need-to-modify itinerary data as the critical bottleneck hindering research on this task and propose a general pipeline to overcome it. This pipeline frames the generation of such data as an intent-driven perturbation task. It instructs large language models to perturb real-world itineraries using three atomic editing operations: REPLACE, ADD, and DELETE. Each perturbation is grounded in three intents, including disruptions of popularity, spatial distance, and category diversity. Furthermore, a hybrid evaluation metric is designed to ensure perturbation effectiveness. We conduct comprehensive experiments on iTIMO, revealing the limitations of current LLMs and pointing to several valuable directions for future research. Dataset and corresponding code are available at this https URL.
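
The three atomic editing operations named in this abstract (REPLACE, ADD, DELETE) can be pictured as list edits on an ordered sequence of points of interest. The sketch below is only an illustration under that assumption: the itinerary representation, the function name, and the example places are hypothetical, and the LLM-driven, intent-grounded choice of edits is not modeled.

```python
def apply_edit(itinerary, op, index, poi=None):
    """Return a new itinerary with one atomic edit applied at `index`.

    REPLACE swaps the stop at `index` for `poi`, ADD inserts `poi` before
    `index`, and DELETE removes the stop at `index`. The input is not mutated.
    """
    if op in ("REPLACE", "ADD") and poi is None:
        raise ValueError(f"{op} requires a point of interest")
    result = list(itinerary)
    if op == "REPLACE":
        result[index] = poi
    elif op == "ADD":
        result.insert(index, poi)
    elif op == "DELETE":
        del result[index]
    else:
        raise ValueError(f"unknown operation: {op}")
    return result


# Usage: perturb a made-up itinerary, then undo the perturbation.
original = ["Museum", "Old Town", "Harbor"]
perturbed = apply_edit(original, "ADD", 1, "Shopping Mall")
restored = apply_edit(perturbed, "DELETE", 1)
```

Pairing each perturbation with its inverse edit, as in the usage lines, is one way to read the task: the perturbed itinerary is the "need-to-modify" input and the inverse edit is the target modification.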

[IR-2] STCRank: Spatio-temporal Collaborative Ranking for Interactive Recommender System at Kuaishou E-shop WWW26

Link: https://arxiv.org/abs/2601.10027
Authors: Boyang Xia,Ruilin Bao,Hanjun Jiang,Jun Wang,Wenwu Ou
Subjects: Information Retrieval (cs.IR)
Comments: Accepted as an oral paper by WWW26 Human-centered recommender systems (HCRS) workshop ( this https URL )

Abstract:As a popular e-commerce platform, Kuaishou E-shop provides precise personalized product recommendations to tens of millions of users every day. To better respond to real-time user feedback, we have deployed an interactive recommender system (IRS) alongside our core homepage recommender system. This IRS is triggered by a user click on the homepage, and generates a series of highly relevant recommendations based on the clicked item to meet focused browsing demands. Different from traditional e-commerce RecSys, the full-screen UI and immersive swiping-down functionality present two distinct challenges for a regular ranking system. First, there exists explicit interference (overlap or conflicts) between ranking objectives, i.e., conversion, view and swipe down. This is because there are intrinsic behavioral co-occurrences under the premise of immersive browsing and swiping-down functionality. Second, the ranking system is prone to temporal greedy traps in sequential recommendation slot transitions, which is caused by the full-screen UI design. To alleviate these challenges, we propose a novel Spatio-temporal collaborative ranking (STCRank) framework to achieve collaboration between multiple objectives within one slot (spatial) and between multiple sequential recommendation slots. In the multi-objective collaboration (MOC) module, we push the Pareto frontier by mitigating the objective overlaps and conflicts. In the multi-slot collaboration (MSC) module, we achieve global optima over the overall sequential slots by a dual-stage look-ahead ranking mechanism. Extensive experiments demonstrate that our proposed method brings about purchase and DAU co-growth. The proposed system has been deployed at Kuaishou E-shop since June 2025.

[IR-3] From SERPs to Agents: A Platform for Comparative Studies of Information Interaction

Link: https://arxiv.org/abs/2601.09937
Authors: Saber Zerhoudi,Michael Granitzer
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:The diversification of information access systems, from RAG to autonomous agents, creates a critical need for comparative user studies. However, the technical overhead to deploy and manage these distinct systems is a major barrier. We present UXLab, an open-source system for web-based user studies that addresses this challenge. Its core is a web-based dashboard enabling the complete, no-code configuration of complex experimental designs. Researchers can visually manage the full study, from recruitment to comparing backends like traditional search, vector databases, and LLMs. We demonstrate UXLab’s value via a micro case study comparing user behavior with RAG versus an autonomous agent. UXLab allows researchers to focus on experimental design and analysis, supporting future multi-modal interaction research.

[IR-4] In-Browser Agents for Search Assistance

Link: https://arxiv.org/abs/2601.09928
Authors: Saber Zerhoudi,Michael Granitzer
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:A fundamental tension exists between the demand for sophisticated AI assistance in web search and the need for user data privacy. Current centralized models require users to transmit sensitive browsing data to external services, which limits user control. In this paper, we present a browser extension that provides a viable in-browser alternative. We introduce a hybrid architecture that functions entirely on the client side, combining two components: (1) an adaptive probabilistic model that learns a user’s behavioral policy from direct feedback, and (2) a Small Language Model (SLM), running in the browser, which is grounded by the probabilistic model to generate context-aware suggestions. To evaluate this approach, we conducted a three-week longitudinal user study with 18 participants. Our results show that this privacy-preserving approach is highly effective at adapting to individual user behavior, leading to measurably improved search efficiency. This work demonstrates that sophisticated AI assistance is achievable without compromising user privacy or data control.

Attachment Download

Click to download today's full paper list