This post lists the latest papers retrieved from Arxiv.org on 2026-01-12, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your address in the comments.
Note: paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.
Contents
Overview (2026-01-12)
385 papers were updated today, including:
- Natural Language Processing: 82 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 120 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 62 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 111 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
[Quick Read]: This paper targets the inflexibility of the fusion granularity in existing inference-time ensembling methods: a fixed granularity cannot adapt to dynamics during generation or to the characteristics of different tasks. The key of the proposed solution, AdaFuse, an adaptive ensemble decoding framework, is an uncertainty-based criterion that decides at each decoding step whether to apply ensembling, using words as the basic units for semantic alignment; a diversity-aware scaling strategy explores candidate continuations in low-confidence states, establishing a synergy between ensemble decisions and test-time scaling that significantly improves performance.
Link: https://arxiv.org/abs/2601.06022
Authors: Chengming Cui,Tianxin Wei,Ziyi Chen,Ruizhong Qiu,Zhichen Zeng,Zhining Liu,Xuying Ning,Duo Zhou,Jingrui He
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at this https URL.
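The uncertainty-gated decoding loop described in the abstract can be sketched as below. This is a minimal illustration under assumed simplifications, not the paper's actual algorithm: `tau` is a hypothetical entropy threshold, distributions are plain token-to-probability dicts, and "ensembling" is reduced to averaging; AdaFuse additionally performs word-level alignment and diversity-aware exploration, which are omitted here.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def adaptive_ensemble_step(dists, tau=0.3):
    """Uncertainty-gated ensemble over next-token distributions.

    dists: one {token: probability} dict per model; dists[0] is the lead model.
    Confident lead (entropy below tau): decode directly from the lead model.
    Uncertain lead: average all distributions before picking the token.
    """
    lead = dists[0]
    if entropy(lead) < tau:
        return max(lead, key=lead.get)
    vocab = set().union(*dists)
    avg = {t: sum(d.get(t, 0.0) for d in dists) / len(dists) for t in vocab}
    return max(avg, key=avg.get)
```

Under this sketch, a confident lead model overrides the ensemble, while an uncertain one defers to the averaged vote.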
[NLP-1] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
[Quick Read]: This paper addresses a key limitation of current RL-based LLM deep search agents: existing methods rely mainly on binary outcome rewards, which fail to capture the comprehensiveness and factual grounding of the agent's reasoning process and thus encourage undesirable behaviors such as shortcut exploitation and hallucination. The core of the solution is a fine-grained reward framework, Citation-aware Rubric Rewards (CaRR), which decomposes complex questions into verifiable single-hop rubrics and requires the agent to satisfy them by explicitly identifying hidden entities, providing correct citations, and constructing complete evidence chains. The authors further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR with outcome rewards to train robust deep search agents; experiments show it outperforms standard outcome-based RL baselines across multiple benchmarks, suppresses shortcut behavior, and promotes comprehensive, evidence-grounded reasoning.
Link: https://arxiv.org/abs/2601.06021
Authors: Jiajie Zhang,Xin Lv,Ling Feng,Lei Hou,Juanzi Li
Affiliations: Tsinghua University; Zhipu AI
Categories: Computation and Language (cs.CL)
Comments:
Abstract: Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at this https URL.
[NLP-2] Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks
[Quick Read]: This paper tackles the high API cost and latency of LLM agents that execute long multi-turn tasks with frequent tool calls, especially under long system prompts and large context windows. Although major LLM providers offer prompt caching to reduce compute cost, its actual benefit for agentic workloads had not been quantified or systematically compared. The key contribution is the proposal and evaluation of three prompt-caching strategies: full-context caching, system-prompt-only caching, and caching that excludes dynamic tool results, evaluated at scale on the DeepResearchBench benchmark (500+ agent sessions with 10,000-token system prompts). Results show that careful control of cache block structure, such as placing dynamic content at the end of the system prompt and not caching dynamic function-call results, yields 45-80% lower API costs and 13-31% better time to first token (TTFT), clearly beating naive full-context caching, which can even increase latency. The study also reveals provider-specific differences in caching behavior and offers practical guidance for deploying prompt caching efficiently in production.
Link: https://arxiv.org/abs/2601.06007
Authors: Elias Lumer,Faheem Nizar,Akshaya Jangiti,Kevin Frank,Anmol Gulati,Mandar Phadate,Vamse Kumar Subbiah
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 8 figures
Abstract:Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies, including full context caching, system prompt only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearchBench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. Our analysis reveals nuanced variations in caching behavior across providers, and we provide practical guidance for implementing prompt caching in production agentic systems.
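The core finding, that dynamic content placed early invalidates the cacheable prefix, can be illustrated with a toy prefix-length calculation. This is a generic sketch of how prefix caches match, not any provider's API: `build_prompt` and the char-level length proxy are illustrative stand-ins.

```python
def build_prompt(static_system, history, dynamic_state):
    """Keep the byte-stable blocks first; append dynamic content last.

    Prefix caches hit only on exact prefixes, so a timestamp or tool
    result placed early invalidates every block after it.
    """
    return "\n".join([static_system, *history, dynamic_state])

def shared_prefix_len(prev, curr):
    """Char-level proxy for how much of the prompt a prefix cache can reuse."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n
```

With the stable system prompt first, consecutive turns share a long reusable prefix; with a timestamp first, almost nothing is reusable, which is the mechanism behind the paper's "strategic cache block control" guidance.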
[NLP-3] The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
[Quick Read]: This paper studies why LLMs struggle to learn long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT models. Its central finding is that effective, learnable Long CoT trajectories exhibit molecule-like structural stability formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van-der-Waals-like). The key of the solution is the notion of Effective Semantic Isomers, with the observation that only bond-like connections promoting fast entropy convergence support stable Long CoT learning; building on this, the authors propose Mole-Syn, a distribution-transfer-graph synthesis method that guides the generation of effective Long CoT structures, improving performance and RL training stability across benchmarks.
Link: https://arxiv.org/abs/2601.06002
Authors: Qiguang Chen,Yantao Du,Ziniu Li,Jinhao Liu,Songyao Duan,Jiarui Guo,Minghao Liu,Jiaheng Liu,Tong Yang,Ge Zhang,Libo Qin,Wanxiang Che,Wenhao Huang
Affiliations: Harbin Institute of Technology; ByteDance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from human or non-Long-CoT LLMs imitation. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.
[NLP-4] Distilling Feedback into Memory-as-a-Tool
[Quick Read]: This paper addresses the high compute cost of complex chain-of-thought (CoT) reasoning and test-time refinement at inference. The key of the solution is to turn transient critiques produced during reasoning into persistent, retrievable guidelines stored in a file-based memory, combined with agent-controlled tool calls to reuse those guidelines, sharply reducing inference cost without sacrificing performance.
Link: https://arxiv.org/abs/2601.05960
Authors: Víctor Gallego
Affiliations: Komorebi AI Technologies
Categories: Computation and Language (cs.CL)
Comments: Code: this https URL Data: this https URL
Abstract:We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.
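The "critiques into retrievable guidelines" idea can be sketched as a pair of file-backed tool functions. This is a minimal stand-in under assumptions: the paper's memory format and tool interface are not specified here, so a JSON file keyed by a hypothetical `task_tag` plays the role of the file-based memory.

```python
import json
import os

def distill_guideline(store_path, task_tag, critique):
    """Persist a transient critique as a retrievable guideline.

    The JSON file stands in for the paper's file-based memory system.
    """
    memory = {}
    if os.path.exists(store_path):
        with open(store_path) as f:
            memory = json.load(f)
    memory.setdefault(task_tag, []).append(critique)
    with open(store_path, "w") as f:
        json.dump(memory, f)

def retrieve_guidelines(store_path, task_tag):
    """Tool call an agent could issue before answering a task of this type."""
    if not os.path.exists(store_path):
        return []
    with open(store_path) as f:
        return json.load(f).get(task_tag, [])
```

The amortization claim follows from this shape: a critique is computed once at refinement time, then reused for free on later tasks via a cheap lookup instead of another refinement pass.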
[NLP-5] Can We Predict Before Executing Machine Learning Agents?
[Quick Read]: This paper targets the execution bottleneck of autonomous machine-learning agents in scientific discovery: the traditional Generate-Execute-Feedback paradigm evaluates every hypothesis through expensive physical execution. The key of the solution is to internalize execution priors into a predictive model: the authors formalize data-centric solution preference, build a corpus of 18,438 pairwise comparisons, and show that an LLM primed with a verified data-analysis report can replace costly runtime checks with instantaneous predictive reasoning, reaching 61.5% prediction accuracy with well-calibrated confidence. Instantiated in the FOREAGENT system's Predict-then-Verify loop, this yields a 6x acceleration in convergence and a 6% gain over execution-based baselines.
Link: https://arxiv.org/abs/2601.05930
Authors: Jingsheng Zheng,Jintian Zhang,Yujie Luo,Yuren Mao,Yunjun Gao,Lun Du,Huajun Chen,Ningyu Zhang
Affiliations: Zhejiang University; Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Work in progress
Abstract:Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at this https URL.
[NLP-6] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
[Quick Read]: This paper addresses joint representation learning for French text and speech: conventional approaches rely on modality-specific targets (predicting text tokens or speech units), which limits cross-modal representations. The key of the solution is the Pantagruel model family, which uses feature-space self-supervised objectives so that encoders with a shared architecture learn contextualized representations for both text and speech, capturing linguistic and acoustic regularities more effectively. Pretrained on large-scale French corpora (Wikipedia, OSCAR, and CroissantLLM for text; MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech), Pantagruel matches or outperforms strong baselines such as CamemBERT and FlauBERT across a broad range of downstream tasks, confirming its effectiveness for multimodal speech-text understanding.
Link: https://arxiv.org/abs/2601.05911
Authors: Phuong-Hang Le,Valentin Pelloin,Arnault Chatelain,Maryem Bouziane,Mohammed Ghennai,Qianwen Guan,Kirill Milintsevich,Salima Mdhaffar,Aidan Mannion,Nils Defauw,Shuyue Gu,Alexandre Audibert,Marco Dinarelli,Yannick Estève,Lorraine Goeuriot,Steffen Lalande,Nicolas Hervé,Maximin Coavoux,François Portet,Étienne Ollion,Marie Candito,Maxime Peyrard,Solange Rossato,Benjamin Lecouteux,Aurélie Nardy,Gilles Sérasset,Vincent Segonne,Solène Evain,Diandra Fabre,Didier Schwab
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
[NLP-7] Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
[Quick Read]: This paper argues that correctness alone is insufficient for reliably deploying LLMs, focusing on the brittleness of model beliefs under mild contextual perturbations. Existing evaluations such as Self-Consistency measure point-wise confidence but cannot reveal how consistent and robust a model's responses are across a conceptual neighborhood. The key of the solution is Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates the coherence of model outputs over conceptually neighboring inputs, together with Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%.
Link: https://arxiv.org/abs/2601.05905
Authors: Haoming Xu,Ningyuan Zhao,Yunzhi Yao,Weihong Xu,Hongru Wang,Xinle Deng,Shumin Deng,Jeff Z. Pan,Huajun Chen,Ningyu Zhang
Affiliations: Zhejiang University; University of Edinburgh; National University of Singapore, NUS-NCS Joint Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Work in progress
Abstract:As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence like Self-Consistency, which can mask brittle belief. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress-testing protocol that probes outputs stability under contextual interference. Experiments across multiple LLMs show that the performance of high-NCB data is relatively more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code will be available at this https URL.
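A neighborhood-consistency measure of the kind the abstract describes can be sketched as majority agreement over conceptually neighboring prompts. This is an illustrative proxy, not the paper's exact NCB definition: `answer_fn` stands in for a model call, and the neighborhood is assumed to be given.

```python
from collections import Counter

def neighbor_consistency(answer_fn, neighborhood):
    """Fraction of neighboring prompts whose answer matches the majority
    answer: a structural proxy for belief robustness across a neighborhood."""
    answers = [answer_fn(q) for q in neighborhood]
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)
```

A fact answered identically across paraphrases and related probes scores near 1.0; a brittle belief that flips under slight rephrasing scores lower, even if each individual answer looks confident in isolation.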
[NLP-8] HAPS: Hierarchical LLM Routing with Joint Architecture and Parameter Search
[Quick Read]: This paper addresses the fact that LLM routing typically selects only the model architecture while ignoring parameter settings, which are critical for task performance. The core of the solution is HAPS, a hierarchical routing framework in which a high-level router selects among candidate LLM architectures and a low-level router searches for the optimal parameter configuration on the selected architecture; a parameter generation network shares parameters between the two routers so that they mutually enhance each other, and a reward-augmented objective is introduced to effectively optimize the whole framework.
Link: https://arxiv.org/abs/2601.05903
Authors: Zihang Tian,Rui Li,Jingsen Zhang,Xiaohe Bo,Wei Huo,Xu Chen
Affiliations: Renmin University of China; Wireless Technology Lab, Huawei Technologies Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language model (LLM) routing aims to exploit the specialized strengths of different LLMs for diverse tasks. However, existing approaches typically focus on selecting LLM architectures while overlooking parameter settings, which are critical for task performance. In this paper, we introduce HAPS, a hierarchical LLM routing framework that jointly searches over model architectures and parameters. Specifically, we use a high-level router to select among candidate LLM architectures, and then search for the optimal parameters for the selected architectures based on a low-level router. We design a parameter generation network to share parameters between the two routers to mutually enhance their capabilities. In the training process, we design a reward-augmented objective to effectively optimize our framework. Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines. We have released our code at this https URL.
[NLP-9] An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift
[Quick Read]: This paper addresses domain shift in preference tuning of pretrained language models: aligned models lose performance and helpfulness when evaluated outside the training domain. The central question is how to improve alignment generalization to unseen target domains. The key of the solution is a systematic evaluation of five popular alignment objectives and several adaptation strategies (such as target-domain supervised fine-tuning and pseudo-labeling), which finds that pseudo-labeling-based adaptation substantially reduces the degradation caused by domain shift, improving robustness and usefulness in cross-domain settings.
Link: https://arxiv.org/abs/2601.05882
Authors: Constantinos Karouzos,Xingwei Tan,Nikolaos Aletras
Affiliations: University of Sheffield
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference-tuning degrades performance and reduces helpfulness when evaluated outside the training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this challenge by conducting a comprehensive and systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and various adaptation strategies from source to target, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation
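Pseudo-labeling for preference data can be sketched as follows: sample candidate responses on unlabeled target-domain prompts and let a scoring model pick the preferred one. This is a generic sketch of the technique, assuming a hypothetical `generate` sampler and `reward_model` scorer; the paper's exact pseudo-labeling setup may differ.

```python
def pseudo_label_pairs(target_prompts, generate, reward_model):
    """Build target-domain preference pairs without human labels.

    For each prompt, sample two candidates and use a (source-trained)
    reward model to decide which is chosen and which is rejected.
    """
    pairs = []
    for prompt in target_prompts:
        a, b = generate(prompt), generate(prompt)
        if reward_model(prompt, a) >= reward_model(prompt, b):
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        pairs.append((prompt, chosen, rejected))
    return pairs
```

The resulting (prompt, chosen, rejected) triples can then feed any of the alignment objectives the paper compares, which is what lets adaptation proceed without target-domain human annotation.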
[NLP-10] Gender Bias in LLM s: Preliminary Evidence from Shared Parenting Scenario in Czech Family Law
[Quick Read]: This paper examines the risk that laypeople increasingly rely on generative AI tools for legal self-help, where gender bias in the models could produce inaccurate or skewed legal guidance and worsen access-to-justice problems. The key of the approach is an expert-designed divorce scenario grounded in Czech family law, used to test four state-of-the-art LLMs zero-shot: the authors compare outputs between a version with gendered names and one with neutral labels, and introduce nine legally relevant factors to examine how factual variations influence the models' proposed shared-parenting ratios, identifying systematic asymmetries and the risks models pose in sensitive legal contexts.
Link: https://arxiv.org/abs/2601.05879
Authors: Jakub Harasta,Matej Vasina,Martin Kornel,Tomas Foltynek
Affiliations: Masaryk University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted at AI for Access to Justice, Dispute Resolution, and Data Access (AIDA2J) at Jurix 2025, Torino, Italy
Abstract:Access to justice remains limited for many people, leading laypersons to increasingly rely on Large Language Models (LLMs) for legal self-help. Laypeople use these tools intuitively, which may lead them to form expectations based on incomplete, incorrect, or biased outputs. This study examines whether leading LLMs exhibit gender bias in their responses to a realistic family law scenario. We present an expert-designed divorce scenario grounded in Czech family law and evaluate four state-of-the-art LLMs GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3 in a fully zero-shot interaction. We deploy two versions of the scenario, one with gendered names and one with neutral labels, to establish a baseline for comparison. We further introduce nine legally relevant factors that vary the factual circumstances of the case and test whether these variations influence the models’ proposed shared-parenting ratios. Our preliminary results highlight differences across models and suggest gender-dependent patterns in the outcomes generated by some systems. The findings underscore both the risks associated with laypeople’s reliance on LLMs for legal guidance and the need for more robust evaluation of model behavior in sensitive legal contexts. We present exploratory and descriptive evidence intended to identify systematic asymmetries rather than to establish causal effects.
[NLP-11] iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models
[Quick Read]: This paper addresses the weak constraints that current unsupervised self-evolving frameworks for large multimodal models (LMMs) place on intermediate reasoning steps: rewarding only final outcomes makes it hard to improve reasoning quality in visually grounded decision making. The key of the solution is the iReasoner framework, which explicitly elicits chain-of-thought (CoT) and rewards its internal agreement; without labeled data or external judges, a trajectory-aware signal over intermediate reasoning steps distinguishes different reasoning paths that reach the same answer, enabling precise optimization of the model's implicit reasoning.
Link: https://arxiv.org/abs/2601.05877
Authors: Meghana Sunil,Manikandarajan Venmathimaran,Muthu Subash Kavitha
Affiliations: Vellore Institute of Technology; Nagasaki University; Loughborough University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM’s implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer–Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.
[NLP-12] Continual-learning for Modelling Low-Resource Languages from Large Language Models
[Quick Read]: This paper addresses catastrophic forgetting when training small language models (SLMs) for low-resource languages by adapting large language models (LLMs) in multilingual settings. The key of the solution is a continual-learning strategy combining parts-of-speech (POS)-based code-switching with a replay adapter, which mitigates forgetting of previously learned language knowledge during training and improves performance on multilingual tasks such as visual question answering and language modelling.
Link: https://arxiv.org/abs/2601.05874
Authors: Santosh Srinath K,Mudit Somani,Varun Reddy Padala,Prajna Devi Upadhyay,Abhijit Das
Affiliations: Birla Institute of Technology and Sciences, Pilani, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modelling a language model for a multi-lingual scenario includes several potential challenges, among which catastrophic forgetting is the major challenge. For example, small language models (SLM) built for low-resource languages by adapting large language models (LLMs) pose the challenge of catastrophic forgetting. This work proposes to employ a continual learning strategy using parts-of-speech (POS)-based code-switching along with a replay adapter strategy to mitigate the identified gap of catastrophic forgetting while training SLM from LLM. Experiments conducted on vision language tasks such as visual question answering and language modelling task exhibits the success of the proposed architecture.
[NLP-13] FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG ECIR2026
[Quick Read]: This paper tackles citation hallucination in Retrieval-Augmented Generation (RAG): models confidently cite sources that do not exist or do not support their claims. The conventional view attributes such hallucinations to over-reliance on parametric knowledge; this paper challenges that assumption with FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), whose key idea is four mechanistic scores measuring the distinct contributions of the model's attention and feed-forward network (FFN) pathways and the alignment between them. The analysis shows that the signature of a correct citation is not static but evolves with model scale: in Llama-3.2-3B it appears as high pathway alignment, whereas in Llama-3.1-8B it shows low alignment, with the pathways contributing orthogonal information. By capturing this dynamic mechanistic signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC, reframing citation hallucination as a complex, scale-dependent interplay of internal mechanisms and opening a path toward more reliable RAG systems.
Link: https://arxiv.org/abs/2601.05866
Authors: Maxime Dassen,Rebecca Kotula,Kenton Murray,Andrew Yates,Dawn Lawrie,Efsun Kayi,James Mayfield,Kevin Duh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at ECIR 2026. 18 pages, 2 figures
Abstract:Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model’s parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model’s attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model’s parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.
[NLP-14] What do the metrics mean? A critical analysis of the use of Automated Evaluation Metrics in Interpreting
[Quick Read]: This paper asks how interpreting quality can be measured quickly and efficiently, a growing need given technologies such as automated speech translation and interpreting avatars that outpace traditional human evaluation. The key contribution is a critical analysis of recently proposed automated quality-measurement methods, arguing that despite their efficiency none of them can account for the communicative context of interpreting, and that none is therefore a viable standalone measure of the quality of human or machine interpreting. The paper stresses that in Interpreting Studies, context remains fundamental to any final quality judgment.
Link: https://arxiv.org/abs/2601.05864
Authors: Jonathan Downie,Joss Moorkens
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 25 pages
Abstract:With the growth of interpreting technologies, from remote interpreting and Computer-Aided Interpreting to automated speech translation and interpreting avatars, there is now a high demand for ways to quickly and efficiently measure the quality of any interpreting delivered. A range of approaches to fulfil the need for quick and efficient quality measurement have been proposed, each involving some measure of automation. This article examines these recently-proposed quality measurement methods and will discuss their suitability for measuring the quality of authentic interpreting practice, whether delivered by humans or machines, concluding that automatic metrics as currently proposed cannot take into account the communicative context and thus are not viable measures of the quality of any interpreting provision when used on their own. Across all attempts to measure or even categorise quality in Interpreting Studies, the contexts in which interpreting takes place have become fundamental to the final analysis.
[NLP-15] CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning
[Quick Read]: This paper addresses a bottleneck in preference optimization for zero-shot multilingual machine translation (MT) with LLMs: the order in which training samples are presented is largely underexplored, even though prior work improved MT via preference optimization. The key of the solution is Curriculum Learning with Restarts (CLewR), which repeats the easy-to-hard data ordering multiple times during training to mitigate catastrophic forgetting of easy examples caused by over-fitting to hard ones. The method yields consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques.
Link: https://arxiv.org/abs/2601.05858
Authors: Alexandra Dragomir,Florin Brad,Radu Tudor Ionescu
Affiliations: Bitdefender; University of Bucharest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at this https URL.
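The restart idea reduces to a simple data schedule: sort once by difficulty, then replay the whole easy-to-hard pass several times. A minimal sketch, assuming a scalar `difficulty` scoring function (how difficulty is scored is the paper's design choice, not shown here):

```python
def clewr_schedule(examples, difficulty, restarts=3):
    """Curriculum with restarts: repeat the easy-to-hard pass `restarts` times.

    Revisiting easy examples at every restart counters the catastrophic
    forgetting that a single easy-to-hard pass can induce.
    """
    ordered = sorted(examples, key=difficulty)
    return [ex for _ in range(restarts) for ex in ordered]
```

Compared with a plain curriculum (one pass, easy examples seen only at the start), each restart re-exposes the model to easy examples after it has spent time on hard ones.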
[NLP-16] Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs EACL2026
[Quick Read]: This paper addresses the difficulty of traditional text auto-completion (TAC) in capturing user intent without visual context, particularly in settings with shared visual grounding such as digital assistants, chatbots, design tools, and healthcare consultations. The key contributions are the Multimodal Auto-Completion (MAC) task, with benchmark datasets adapted from MMDialog and ImageChat, and Router-Suggest, a framework that dynamically selects between textual models and vision-language models (VLMs) based on dialog context, achieving a 2.3x to 10x speedup over the best-performing VLM while preserving accuracy. A user study shows VLMs significantly outperform text-only models on satisfaction, typing effort, and multi-turn completion quality, underscoring the value of multimodal context for intent-aware assistants.
Link: https://arxiv.org/abs/2601.05851
Authors: Sandeep Mishra,Devichand Budagam,Anubhab Mandal,Bishal Santra,Pawan Goyal,Manish Gupta
Affiliations: IIT Kharagpur; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to EACL 2026 Industry Track, 12 pages, 6 figures
Abstract:Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.
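The routing pattern can be sketched with a toy dispatcher. Everything here is an assumption for illustration: the keyword heuristic stands in for the paper's learned router, and `text_complete` / `vlm_complete` are hypothetical model callables.

```python
# Toy stand-in for a learned router: flag turns whose wording points
# at the shared visual context (deictic words, image vocabulary).
VISUAL_CUES = {"this", "that", "photo", "picture", "image", "wearing", "color"}

def needs_vision(prefix, dialog):
    tokens = set(prefix.lower().split()) | set(dialog.lower().split())
    return bool(tokens & VISUAL_CUES)

def router_suggest(prefix, dialog, text_complete, vlm_complete):
    """Send the completion request to the cheap text model unless the
    router predicts the completion depends on the shared image."""
    if needs_vision(prefix, dialog):
        return vlm_complete(prefix, dialog)  # slower, visually grounded
    return text_complete(prefix)             # fast textual fallback
```

The reported 2.3x to 10x speedup comes precisely from this asymmetry: most turns take the cheap path, and the expensive VLM is invoked only when visual grounding plausibly matters.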
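The routing decision at the heart of Router-Suggest can be sketched in a few lines; the confidence heuristic, the 0.7 threshold, and the stub models below are illustrative assumptions, not the paper's actual implementation:

```python
def router_suggest(prefix, dialog, text_model, vlm, has_image, conf_threshold=0.7):
    """Toy router: try the cheap text model first, and fall back to the
    slower VLM only when an image is present and confidence is low."""
    completion, confidence = text_model(prefix, dialog)
    if has_image and confidence < conf_threshold:
        completion = vlm(prefix, dialog)
    return completion

# stub models standing in for a real text LM and a real VLM
text_model = lambda prefix, dialog: (prefix + " there", 0.4)  # (completion, confidence)
vlm = lambda prefix, dialog: prefix + " in the photo"

suggestion = router_suggest("the red hat", [], text_model, vlm, has_image=True)
```

With a low-confidence text model and an image in context, the router escalates to the VLM; otherwise the fast text path answers, which is how speedups over always calling the VLM become possible.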
[NLP-17] Semantic NLP Pipelines for Interoperable Patient Digital Twins from Unstructured EHRs
【Quick Read】: This paper tackles the challenge of generating interoperable, FHIR-compliant patient digital twins from unstructured electronic health records (EHRs), where the core difficulties are the heterogeneity of clinical documentation and the lack of standardized mappings. The key to the solution is an end-to-end semantic NLP pipeline: named entity recognition (NER) extracts clinical concepts, concept normalization maps entities to the SNOMED-CT or ICD-10 terminologies, and relation extraction builds structured associations among conditions, medications, and observations, yielding FHIR-compliant digital-twin representations with markedly improved schema completeness and cross-system interoperability.
Link: https://arxiv.org/abs/2601.05847
Authors: Rafael Brens, Yuqiao Meng, Luoxi Tang, Zhaohan Xi
Institutions: Binghamton University
Categories: Computation and Language (cs.CL)
Notes:
Abstract:Digital twins – virtual replicas of physical entities – are gaining traction in healthcare for personalized monitoring, predictive modeling, and clinical decision support. However, generating interoperable patient digital twins from unstructured electronic health records (EHRs) remains challenging due to variability in clinical documentation and lack of standardized mappings. This paper presents a semantic NLP-driven pipeline that transforms free-text EHR notes into FHIR-compliant digital twin representations. The pipeline leverages named entity recognition (NER) to extract clinical concepts, concept normalization to map entities to SNOMED-CT or ICD-10, and relation extraction to capture structured associations between conditions, medications, and observations. Evaluation on MIMIC-IV Clinical Database Demo with validation against MIMIC-IV-on-FHIR reference mappings demonstrates high F1-scores for entity and relation extraction, with improved schema completeness and interoperability compared to baseline methods.
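The pipeline's three stages can be sketched end-to-end with a toy lexicon; the entity list, the code values, and the resource mapping below are illustrative assumptions (real systems use trained NER and normalization models, and the codes are placeholders rather than verified SNOMED-CT entries):

```python
# toy concept lexicon: surface form -> (placeholder code, FHIR resource type)
LEXICON = {
    "hypertension": ("C-001", "Condition"),
    "lisinopril": ("M-002", "MedicationStatement"),
}

def extract_entities(note):
    """Stage 1 (NER stand-in): pick out known clinical concepts."""
    return [w for w in (t.strip(".,").lower() for t in note.split()) if w in LEXICON]

def normalize(entity):
    """Stage 2: map the surface form to a coded concept."""
    code, resource_type = LEXICON[entity]
    return {"text": entity, "code": code, "system": "SNOMED-CT", "type": resource_type}

def to_fhir(concepts):
    """Stage 3: emit minimal FHIR-style resource dicts."""
    return [{"resourceType": c["type"],
             "code": {"coding": [{"system": c["system"], "code": c["code"],
                                  "display": c["text"]}]}}
            for c in concepts]

note = "Patient with hypertension, started on lisinopril."
resources = to_fhir([normalize(e) for e in extract_entities(note)])
```

The output is a list of minimal FHIR-style resources, one `Condition` and one `MedicationStatement`, mirroring the paper's free-text-to-structured-twin flow.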
[NLP-18] Left, Right, or Center? Evaluating LLM Framing in News Classification and Generation
【Quick Read】: This paper examines the political framing bias that large language models (LLMs) may introduce in news summarization and text generation, i.e., whether wording choices made when rewriting or summarizing text carry implicit ideological leanings that shape readers' interpretation. The key to the approach is combining classification-based bias detection with controlled generation experiments: few-shot prediction of ideological leaning (LEFT/CENTER/RIGHT) is followed by generating summaries steered by FAITHFUL, CENTRIST, LEFT, and RIGHT prompts, all scored by a single fixed evaluator to quantify ideological expressiveness. The study finds a systematic centrist tendency ("center-collapse") across all tested models, with Grok 4 the most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieve the strongest bias-rating performance among commercial and open-weight models, respectively.
Link: https://arxiv.org/abs/2601.05835
Authors: Molly Kennedy, Ali Parker, Yihong Liu, Hinrich Schütze
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes:
Abstract:Large Language Model (LLM) based summarization and text generation are increasingly used for producing and rewriting text, raising concerns about political framing in journalism where subtle wording choices can shape interpretation. Across nine state-of-the-art LLMs, we study political framing by testing whether LLMs’ classification-based bias signals align with framing behavior in their generated summaries. We first compare few-shot ideology predictions against LEFT/CENTER/RIGHT labels. We then generate “steered” summaries under FAITHFUL, CENTRIST, LEFT, and RIGHT prompts, and score all outputs using a single fixed ideology evaluator. We find pervasive ideological center-collapse in both article-level ratings and generated text, indicating a systematic tendency toward centrist framing. Among evaluated models, Grok 4 is by far the most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieve the strongest bias-rating performance among commercial and open-weight models, respectively.
[NLP-19] Peek2: A Regex-free implementation of pretokenizers for Byte-level BPE
【Quick Read】: This paper targets the performance bottleneck of the pretokenization step in Byte-level BPE tokenizers. Traditional implementations are regex-based, which limits computational efficiency, hinders parallelization, and carries potential safety risks. The key to the solution is Peek2, a regex-free pretokenization algorithm that runs entirely on the CPU with stable linear complexity O(n), produces presegmentation results identical to the original regex-based pretokenizer, and improves overall Byte-level BPE encoding throughput by 1.11x.
Link: https://arxiv.org/abs/2601.05833
Authors: Liu Zai
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes: 5 pages, 4 figures, for associated code, see this https URL
Abstract:Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. Designed with performance and safety in mind, Peek2 is Regex-free and delivers a 1.11× improvement in overall throughput across the entire Byte-level BPE encoding process. This algorithm runs entirely on the CPU, has stable linear complexity O(n), and provides presegmentation results identical to those of the original Regex-based pretokenizer.
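The idea of replacing a regex pretokenizer with a hand-written linear scan can be illustrated briefly; this is a simplified character-class scanner, not Peek2's actual algorithm (which matches cl100k-style rules exactly):

```python
def char_class(c):
    # coarse character classes standing in for the regex's categories
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    if c.isspace():
        return "space"
    return "other"

def pretokenize(text):
    """Single left-to-right O(n) pass: each pretoken is an optional
    leading space plus a maximal run of same-class characters."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        start = i
        if text[i] == " " and i + 1 < n and char_class(text[i + 1]) != "space":
            i += 1  # a single space attaches to the token it precedes
        cls = char_class(text[i])
        i += 1
        while i < n and char_class(text[i]) == cls:
            i += 1
        tokens.append(text[start:i])
    return tokens
```

Here `pretokenize("Hello world 2024!")` yields `["Hello", " world", " 2024", "!"]`, and concatenating the pretokens always reconstructs the input, the key invariant a drop-in pretokenizer must preserve.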
[NLP-20] LLMs as Science Journalists: Supporting Early-stage Researchers in Communicating Their Science to the Public
【Quick Read】: This paper addresses the communication barriers early-stage researchers face when conveying their findings to the public: general-purpose large language models (LLMs) can assist, but they are not optimized for science communication. The key to the solution is a training framework that enables LLMs to emulate the role of a science journalist, guiding researchers to articulate the societal impact of their work and prompting deeper, more publicly accessible explanations through questioning. Experiments show that LLM journalists trained with this framework ask more relevant questions than general-purpose models, and most participants in a user study preferred them over general-purpose LLMs.
Link: https://arxiv.org/abs/2601.05821
Authors: Milad Alshomary, Grace Li, Anubhav Jangra, Yufang Hou, Kathleen McKeown, Smaranda Muresan
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes:
Abstract:The scientific community needs tools that help early-stage researchers effectively communicate their findings and innovations to the public. Although existing general-purpose Large Language Models (LLMs) can assist in this endeavor, they are not optimally aligned for it. To address this, we propose a framework for training LLMs to emulate the role of a science journalist that can be used by early-stage researchers to learn how to properly communicate their papers to the general public. We evaluate the usefulness of our trained LLM Journalists in leading conversations with both simulated and human researchers. Our experiments indicate that LLMs trained using our framework ask more relevant questions that address the societal impact of research, prompting researchers to clarify and elaborate on their findings. In the user study, the majority of participants who interacted with our trained LLM Journalist appreciated it more than interacting with general-purpose LLMs.
[NLP-21] EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
【Quick Read】: This paper tackles the difficulty of building tool-interaction environments for training large language models (LLMs) as agents: access to real systems is often restricted, LLM-simulated environments are prone to hallucinations and inconsistencies, and manually built sandboxes are hard to scale. The key to the solution is EnvScaler, a framework for scalable tool-interaction environments via programmatic synthesis: SkelBuilder first constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation; ScenGenerator then produces multi-task scenarios and rule-based trajectory validation functions for each environment. This automates the construction of high-quality, large-scale training environments and significantly improves LLM performance on complex multi-turn, multi-tool interaction tasks.
Link: https://arxiv.org/abs/2601.05808
Authors: Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Work in progress
Abstract:Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs’ ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at this https URL.
[NLP-22] Fusion Matters: Length-Aware Analysis of Positional-Encoding Fusion in Transformers
【Quick Read】: This paper studies how positional encodings are fused with token embeddings in Transformers, an aspect largely overlooked in favor of designing new encodings, with particular attention to its impact on long sequences. The key finding, from controlled experiments comparing three canonical fusion strategies (element-wise addition, concatenation with projection, and scalar gated fusion), is that the choice of fusion mechanism has negligible impact on short texts but yields consistent, significant gains on long documents. Further analysis shows these gains are structural rather than stochastic and generalize across positional-encoding families, arguing that fusion should be treated as an explicit modeling decision rather than a fixed default.
Link: https://arxiv.org/abs/2601.05807
Authors: Mohamed Amine Hallam, Kuo-Kun Tseng
Institutions: Harbin Institute of Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 10 pages, 5 figures. Code and reproduction materials available on GitHub
Abstract:Transformers require positional encodings to represent sequence order, yet most prior work focuses on designing new positional encodings rather than examining how positional information is fused with token embeddings. In this paper, we study whether the fusion mechanism itself affects performance, particularly in long-sequence settings. We conduct a controlled empirical study comparing three canonical fusion strategies–element-wise addition, concatenation with projection, and scalar gated fusion–under identical Transformer architectures, data splits, and random seeds. Experiments on three text classification datasets spanning short (AG News), medium (IMDB), and long (ArXiv) sequences show that fusion choice has negligible impact on short texts but produces consistent gains on long documents. To verify that these gains are structural rather than stochastic, we perform paired-seed analysis and cross-dataset comparison across sequence-length regimes. Additional experiments on the ArXiv dataset indicate that the benefit of learnable fusion generalizes across multiple positional encoding families. Finally, we explore a lightweight convolutional gating mechanism that introduces local inductive bias at the fusion level, evaluated on long documents only. Our results indicate that positional-encoding fusion is a non-trivial design choice for long-sequence Transformers and should be treated as an explicit modeling decision rather than a fixed default.
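The three canonical fusion strategies the paper compares can be written out directly; the dimensions and the random "learned" parameters below are illustrative, and in a real model `W_proj` and the gate logit would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 16
tok = rng.normal(size=(seq_len, d))   # token embeddings
pos = rng.normal(size=(seq_len, d))   # positional encodings

# 1) element-wise addition: the fixed default in most Transformers
fused_add = tok + pos

# 2) concatenation with projection: learn how to mix the two spaces
W_proj = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
fused_cat = np.concatenate([tok, pos], axis=-1) @ W_proj

# 3) scalar gated fusion: one learnable logit balances the two sources
gate_logit = 0.0                       # trained parameter; sigmoid(0) = 0.5
g = 1.0 / (1.0 + np.exp(-gate_logit))
fused_gate = g * tok + (1.0 - g) * pos
```

All three produce a `(seq_len, d)` input to the first Transformer layer; the paper's point is that which of these you choose matters little for short inputs but measurably for long documents.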
[NLP-23] Simplify-This: A Comparative Analysis of Prompt-Based and Fine-Tuned LLMs
【Quick Read】: This paper addresses the performance trade-off between fine-tuning and prompt engineering for text simplification. The key contribution is a systematic comparative study on encoder-decoder LLMs showing that fine-tuning delivers stronger structural simplification, while prompting attains higher semantic-similarity scores but tends to copy the input verbatim; a human evaluation further favors fine-tuned outputs overall, providing clear guidance for choosing between the two paradigms in practice.
Link: https://arxiv.org/abs/2601.05794
Authors: Eilam Cohen, Itamar Bul, Danielle Inbar, Omri Loewenbach
Institutions: Tel Aviv University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:
Abstract:Large language models (LLMs) enable strong text generation, and in general there is a practical tradeoff between fine-tuning and prompt engineering. We introduce Simplify-This, a comparative study evaluating both paradigms for text simplification with encoder-decoder LLMs across multiple benchmarks, using a range of evaluation metrics. Fine-tuned models consistently deliver stronger structural simplification, whereas prompting often attains higher semantic similarity scores yet tends to copy inputs. A human evaluation favors fine-tuned outputs overall. We release code, a cleaned derivative dataset used in our study, checkpoints of fine-tuned models, and prompt templates to facilitate reproducibility and future work.
[NLP-24] One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models
【Quick Read】: This paper asks whether script romanization in multilingual language model (mLM) pretraining causes information loss that harms performance for high-resource languages, and whether the increased cross-lingual vocabulary overlap it creates induces negative interference. The key to the solution is pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, using two romanizers with different fidelity profiles to analyze performance changes systematically. Results show negligible performance loss for languages with segmental scripts, along with improved encoding efficiency (fertility), whereas morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Moreover, comparing monolingual LMs with their multilingual counterparts reveals no negative interference from subword overlap, suggesting romanization is viable for general-purpose mLM pretraining.
Link: https://arxiv.org/abs/2601.05776
Authors: Benedikt Ebing, Lennart Keller, Goran Glavaš
Institutions: University of Würzburg
Categories: Computation and Language (cs.CL)
Notes:
Abstract:Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Importantly, comparing monolingual LMs with their mLM counterpart, we find no evidence that increased subword overlap induces negative interference. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.
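Fertility, the encoding-efficiency metric mentioned above, is simply the average number of subword tokens produced per word. A minimal implementation, with a hypothetical toy tokenizer standing in for a trained BPE vocabulary:

```python
def fertility(tokenize, texts):
    """Average subword tokens per whitespace-separated word.
    Lower fertility means more efficient encoding."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# hypothetical tokenizer: break each word into chunks of at most 3 chars
toy_tokenizer = lambda s: [w[i:i + 3] for w in s.split()
                           for i in range(0, len(w), 3)]

score = fertility(toy_tokenizer, ["romanization improves efficiency"])
```

Here `score` is 11/3 ≈ 3.67 tokens per word; comparing such scores for romanized versus original-script text is how encoding-efficiency gains are measured.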
[NLP-25] Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer
【Quick Read】: This paper addresses the interpretability challenge of extracting algorithms from Transformers, where feature superposition entangles information in continuous representations and obstructs the extraction of symbolic expressions. The key to the solution is the Discrete Transformer, an architecture that enforces strict functional disentanglement: Numerical Attention is restricted to information routing, Numerical MLPs to element-wise arithmetic, and temperature-annealed sampling bridges continuous representations and discrete symbolic logic, substantially improving the readability and controllability of the synthesized programs.
Link: https://arxiv.org/abs/2601.05770
Authors: Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin
Institutions: Peking University; Kuaishou Technology; Shanghai AI Lab; Shanghai Innovation Institute
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:
Abstract:Algorithm extraction aims to synthesize executable programs directly from models trained on specific algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, extending this paradigm to Transformer is hindered by superposition, where entangled features encoded in overlapping directions obstruct the extraction of symbolic expressions. In this work, we propose the Discrete Transformer, an architecture explicitly engineered to bridge the gap between continuous representations and discrete symbolic logic. By enforcing a strict functional disentanglement, which constrains Numerical Attention to information routing and Numerical MLP to element-wise arithmetic, and employing temperature-annealed sampling, our method effectively facilitates the extraction of human-readable programs. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains. Moreover, our analysis of the annealing process shows that the efficient discrete search undergoes a clear phase transition from exploration to exploitation. We further demonstrate that our method enables fine-grained control over synthesized programs by imposing inductive biases. Collectively, these findings establish the Discrete Transformer as a robust framework for demonstration-free algorithm discovery, offering a rigorous pathway toward Transformer interpretability.
[NLP-26] AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
【Quick Read】: This paper addresses the safety risks posed by misbehaving large language models (LLMs) in practice, specifically the unreliability of LLM-based misbehavior monitors, which perform inconsistently across tasks and failure modes and struggle to balance a low Miss Rate (MR) against a low False Alarm Rate (FAR), reflecting an inherent safety-utility trade-off. The key to the solution is AutoMonitor-Bench, the first systematic evaluation benchmark, comprising 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with monitoring performance quantified by the two complementary metrics MR and FAR. The authors further fine-tune Qwen3-4B-Instruction on a large training corpus of 153,581 samples to probe whether training on known, easy-to-construct misbehaviors improves detection of unseen, more implicit ones, motivating more reliable and scalable LLM monitoring.
Link: https://arxiv.org/abs/2601.05752
Authors: Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang
Institutions: King Abdullah University of Science and Technology; University of Bristol; Washington University in St. Louis; King's College London; Renmin University of China
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Notes:
Abstract:We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware designing and training strategies for LLM-based monitors.
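The two complementary metrics are straightforward to compute from paired monitor predictions and ground-truth labels; a minimal sketch:

```python
def monitor_metrics(preds, labels):
    """Miss Rate and False Alarm Rate for a binary misbehavior monitor.

    preds, labels: 1 = flagged / actual misbehavior, 0 = benign.
    MR  = fraction of true misbehaviors the monitor failed to flag.
    FAR = fraction of benign cases the monitor wrongly flagged.
    """
    pairs = list(zip(preds, labels))
    misbehave = [p for p, y in pairs if y == 1]
    benign = [p for p, y in pairs if y == 0]
    mr = sum(1 for p in misbehave if p == 0) / len(misbehave)
    far = sum(1 for p in benign if p == 1) / len(benign)
    return mr, far
```

Because lowering a monitor's flagging threshold reduces MR while raising FAR (and vice versa), reporting both metrics together is what exposes the safety-utility tension the paper observes.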
[NLP-27] Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns
【Quick Read】: This paper investigates how user instructions affect the persuasive language generated by large language models (LLMs), focusing on the influence of recipient gender, sender intent, and output language on persuasion strategies. The key to the solution is a systematic evaluation framework: 13 LLMs and 16 languages are tested with pairwise prompt instructions, and an LLM-as-judge setup grounded in social psychology and communication science scores responses across 19 categories of persuasive language. The results reveal significant gender differences in the persuasive language generated by all models, with patterns consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics, suggesting LLMs may amplify existing societal biases.
Link: https://arxiv.org/abs/2601.05751
Authors: Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent
Institutions: Aarhus University; AMD Silo AI; University of Tokyo; IT University of Copenhagen; University of Copenhagen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:
Abstract:Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.
[NLP-28] PII-VisBench: Evaluating Personally Identifiable Information Safety in Vision Language Models Along a Continuum of Visibility
【Quick Read】: This paper addresses the insufficient evaluation of personally identifiable information (PII) leakage in vision language models (VLMs) for privacy-critical settings, in particular the neglected influence of a subject's online presence on privacy alignment: existing evaluations treat privacy as a static extraction task, ignoring how the amount of a user's data exposed online shapes model behavior. The key to the solution is PII-VisBench, a benchmark of 4,000 unique probes that stratifies 200 subjects into high, medium, low, and zero visibility tiers and measures privacy safety via two core metrics: the refusal rate on PII-probing queries and the conditional PII disclosure rate among non-refusal responses. Experiments show that as subject visibility drops, refusals increase and PII disclosure falls markedly (from 9.10% for high visibility to 5.34% for low), establishing visibility as an important variable in model privacy behavior and motivating visibility-aware safety evaluation and training interventions.
Link: https://arxiv.org/abs/2601.05739
Authors: G M Shahariar, Zabir Al Nazi, Md Olid Hasan Bhuiyan, Zhouxing Shi
Institutions: University of California, Riverside
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Vision Language Models (VLMs) are increasingly integrated into privacy-critical domains, yet existing evaluations of personally identifiable information (PII) leakage largely treat privacy as a static extraction task and ignore how a subject’s online presence–the volume of their data available online–influences privacy alignment. We introduce PII-VisBench, a novel benchmark containing 4000 unique probes designed to evaluate VLM safety through the continuum of online presence. The benchmark stratifies 200 subjects into four visibility categories: high, medium, low, and zero–based on the extent and nature of their information available online. We evaluate 18 open-source VLMs (0.3B-32B) based on two key metrics: percentage of PII probing queries refused (Refusal Rate) and the fraction of non-refusal responses flagged for containing PII (Conditional PII Disclosure Rate). Across models, we observe a consistent pattern: refusals increase and PII disclosures decrease (9.10% high to 5.34% low) as subject visibility drops. We identify that models are more likely to disclose PII for high-visibility subjects, alongside substantial model-family heterogeneity and PII-type disparities. Finally, paraphrasing and jailbreak-style prompts expose attack and model-dependent failures, motivating visibility-aware safety evaluation and training interventions.
[NLP-29] Visualising Information Flow in Word Embeddings with Diffusion Tensor Imaging
【Quick Read】: This paper addresses the central question of how large language models (LLMs) represent natural language expressions: existing methods visualize embeddings of isolated words, ignoring the context in which a word is used. The key to the solution is applying diffusion tensor imaging (DTI) to word embeddings, revealing how information flows between them. The method tracks information transfer across the layers of an LLM, enables comparison of model structures, identifies under-utilized layers as pruning candidates, and distinguishes information-flow patterns across tasks such as pronoun resolution and metaphor detection, deepening our understanding of LLM representations and improving the interpretability of NLP models.
Link: https://arxiv.org/abs/2601.05713
Authors: Thomas Fabian
Institutions: Technical University Darmstadt
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:Understanding how large language models (LLMs) represent natural language is a central challenge in natural language processing (NLP) research. Many existing methods extract word embeddings from an LLM, visualise the embedding space via point-plots, and compare the relative positions of certain words. However, this approach only considers single words and not whole natural language expressions, thus disregards the context in which a word is used. Here we present a novel tool for analysing and visualising information flow in natural language expressions by applying diffusion tensor imaging (DTI) to word embeddings. We find that DTI reveals how information flows between word embeddings. Tracking information flows within the layers of an LLM allows for comparing different model structures and revealing opportunities for pruning an LLM’s under-utilised layers. Furthermore, our model reveals differences in information flows for tasks like pronoun resolution and metaphor detection. Our results show that our model permits novel insights into how LLMs represent actual natural language expressions, extending the comparison of isolated word embeddings and improving the interpretability of NLP models.
[NLP-30] Multimodal In-context Learning for ASR of Low-resource Languages
【Quick Read】: This paper addresses the limited performance of automatic speech recognition (ASR) for low-resource (especially endangered) languages, rooted in the scarcity of supervised data. The key to the solution is multimodal in-context learning (MICL): by combining information from the speech and text modalities, speech LLMs can learn unseen languages without target-language training data and leverage this to improve ASR. The study finds that MICL effectively enables cross-lingual transfer, and attention analysis reveals layer-dependent preferences between audio and text context with an overall bias toward text. It further proposes a simple ASR system that uses MICL to select among a stronger acoustic model's hypotheses, substantially outperforming prompt-based approaches and matching or exceeding corpus-trained language models without using any target-language data.
Link: https://arxiv.org/abs/2601.05707
Authors: Zhaolin Li, Jan Niehues
Institutions: Karlsruhe Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Under review
Abstract:Automatic speech recognition (ASR) still covers only a small fraction of the world’s languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe layer-dependent preferences between audio and text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection of acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data. Our code is publicly available.
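The final system combines a strong acoustic model with a speech LLM by rescoring: the LLM (conditioned on in-context examples) picks among the acoustic model's n-best hypotheses. A minimal sketch, with a hypothetical word-overlap scorer standing in for the speech LLM:

```python
def select_hypothesis(hypotheses, lm_score):
    """Return the n-best hypothesis preferred by the language model."""
    return max(hypotheses, key=lm_score)

# stand-in scorer: fraction of words found in a tiny in-context vocabulary
vocab = {"the", "cat", "sat", "on", "mat"}
score = lambda h: sum(w in vocab for w in h.split()) / max(len(h.split()), 1)

nbest = ["the cat sad on", "the cat sat on"]  # acoustic model's candidates
best = select_hypothesis(nbest, score)         # picks "the cat sat on"
```

The division of labor mirrors the paper's design: the acoustic model proposes, and the (MICL-primed) language model disposes, so neither component needs target-language supervised training.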
[NLP-31] Logic-Parametric Neuro-Symbolic NLI: Controlling Logical Formalisms for Verifiable LLM Reasoning
【Quick Read】: This paper addresses the lack of robustness and adaptability in existing natural language inference (NLI) approaches that rely on a fixed logical formalism, which limits generalization across semantic settings. The key to the solution is a logic-parametric neuro-symbolic framework that treats the logic itself as a controllable component rather than a static background: using the LogiKEy methodology, classical and non-classical formalisms are embedded into higher-order logic (HOL), enabling systematic comparison of inference quality, explanation refinement, and proof behavior. In particular, the work contrasts logic-external (axiom-based) with logic-internal (structure-driven) strategies for normative reasoning; experiments show that logic-internal approaches consistently improve performance and yield more efficient hybrid proofs, and that the choice of logic is domain-dependent, with first-order logic favoring commonsense reasoning while deontic and modal logics excel in ethical domains.
Link: https://arxiv.org/abs/2601.05705
Authors: Ali Farjami, Luca Redondi, Marco Valentino
Institutions: University of Luxembourg; Ruhr-Universität Bochum; University of Sheffield
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Notes: Work in progress
Abstract:Large language models (LLMs) and theorem provers (TPs) can be effectively combined for verifiable natural language inference (NLI). However, existing approaches rely on a fixed logical formalism, a feature that limits robustness and adaptability. We propose a logic-parametric framework for neuro-symbolic NLI that treats the underlying logic not as a static background, but as a controllable component. Using the LogiKEy methodology, we embed a range of classical and non-classical formalisms into higher-order logic (HOL), enabling a systematic comparison of inference quality, explanation refinement, and proof behavior. We focus on normative reasoning, where the choice of logic has significant implications. In particular, we compare logic-external approaches, where normative requirements are encoded via axioms, with logic-internal approaches, where normative patterns emerge from the logic’s built-in structure. Extensive experiments demonstrate that logic-internal strategies can consistently improve performance and produce more efficient hybrid proofs for NLI. In addition, we show that the effectiveness of a logic is domain-dependent, with first-order logic favouring commonsense reasoning, while deontic and modal logics excel in ethical domains. Our results highlight the value of making logic a first-class, parametric element in neuro-symbolic architectures for more robust, modular, and adaptable reasoning.
[NLP-32] Afri-MCQA: Multimodal Cultural Question Answering for African Languages
【Quick Read】: This paper addresses the severe underrepresentation of African languages in AI research, in particular the lack of a high-quality, cross-lingual, cross-modal benchmark for multimodal cultural question answering. The key to the solution is Afri-MCQA, the first multilingual cultural QA benchmark covering 15 African languages from 12 countries with 7.5k QA pairs spanning text and speech modalities, created entirely by native speakers. The benchmark enables systematic evaluation of LLMs' linguistic competence and cultural understanding in African languages, revealing near-zero accuracy for open-weight models on native-language or speech input and underscoring the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer toward more inclusive multimodal AI.
Link: https://arxiv.org/abs/2601.05699
Authors: Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Muhidin A. Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, Idris Abdulmumin, Naome A Etori, Eric Peter Wairagala, Kanda Patrick Tshinu, Imanigirimbabazi Emmanuel, Gabofetswe Malema, Alham Fikri Aji, David Ifeoluwa Adelani, Thamar Solorio
Institutions: MBZUAI; AI4Bharat; Indian Institute of Technology, Madras; Saarland University; Aston University; Addis Ababa University; Lesan AI; Independent; Friedrich-Alexander University; University of Pretoria; University of Minnesota - Twin Cities; Lelapa AI; Tshwane University of Technology; Kabale University; University of Botswana; Mila, McGill University & Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL)
Notes:
Abstract:Africa is home to over one-third of the world’s languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k QA pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language QA pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release our Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (this https URL)
[NLP-33] Stephanie2: Thinking, Waiting, and Making Decisions Like Humans in Step-by-Step AI Social Chat
【Quick Read】: This paper addresses two core shortcomings of existing step-by-step AI chat systems in simulating instant-messaging conversations: the lack of an active waiting mechanism, leading to unnatural message pacing, and the neglect of timing between messages, making the conversational rhythm diverge from real human interaction. The key to the solution is Stephanie2, a next-generation step-wise decision-making dialogue agent that explicitly decides at each step whether to send or wait, introducing active waiting and message-pace adaptation, and models latency as the sum of thinking time and typing time to achieve human-like pacing. The authors further design a time-window-based dual-agent dialogue system to generate pseudo dialogue histories for human and automatic evaluation and a role-identification Turing test; experiments show Stephanie2 clearly outperforms its predecessor Stephanie1 on metrics such as naturalness and engagement.
Link: https://arxiv.org/abs/2601.05657
Authors: Hao Yang, Hongyuan Lu, Dingkang Yang, Wenliang Yang, Peng Sun, Xiaochuan Zhang, Jun Xiao, Kefan He, Wai Lam, Yang Liu, Xinhua Zeng
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 13 pages
Abstract:Instant-messaging human social chat typically progresses through a sequence of short messages. Existing step-by-step AI chatting systems typically split a one-shot generation into multiple messages and send them sequentially, but they lack an active waiting mechanism and exhibit unnatural message pacing. In order to address these issues, we propose Stephanie2, a novel next-generation step-wise decision-making dialogue agent. With active waiting and message-pace adaptation, Stephanie2 explicitly decides at each step whether to send or wait, and models latency as the sum of thinking time and typing time to achieve more natural pacing. We further introduce a time-window-based dual-agent dialogue system to generate pseudo dialogue histories for human and automatic evaluations. Experiments show that Stephanie2 clearly outperforms Stephanie1 on metrics such as naturalness and engagement, and achieves a higher pass rate on human evaluation with the role identification Turing test.
zh
[NLP-34] A Framework for Personalized Persuasiveness Prediction via Context-Aware User Profiling
【速读】: 该论文旨在解决个性化说服力预测中如何有效利用目标说服对象(persuadee)的历史行为数据(如对话记录)以提升预测准确性的问题。当前缺乏系统性框架来优化利用这些历史信息,导致现有方法难以捕捉用户动态特征对说服效果的影响。解决方案的关键在于提出一种上下文感知的用户画像框架,包含两个可训练组件:一是查询生成器(query generator),用于从用户历史记录中检索与说服相关的条目;二是画像生成器(profiler),将相关记录归纳为上下文依赖的用户画像,从而增强说服力预测模型的表现。实验表明,该方法在ChangeMyView Reddit数据集上显著优于现有基线,F1分数最高提升达+13.77%p,且分析显示有效用户画像具有任务导向性和预测器特异性,而非依赖静态属性或表面相似性。
链接: https://arxiv.org/abs/2601.05654
作者: Sejun Park,Yoonah Park,Jongwon Lim,Yohan Jo
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Estimating the persuasiveness of messages is critical in various applications, from recommender systems to safety assessment of LLMs. While it is imperative to consider the target persuadee’s characteristics, such as their values, experiences, and reasoning styles, there is currently no established systematic framework to optimize leveraging a persuadee’s past activities (e.g., conversations) to the benefit of a persuasiveness prediction model. To address this problem, we propose a context-aware user profiling framework with two trainable components: a query generator that generates optimal queries to retrieve persuasion-relevant records from a user’s history, and a profiler that summarizes these records into a profile to effectively inform the persuasiveness prediction model. Our evaluation on the ChangeMyView Reddit dataset shows consistent improvements over existing methods across multiple predictor models, with gains of up to +13.77%p in F1 score. Further analysis shows that effective user profiles are context-dependent and predictor-specific, rather than relying on static attributes or surface-level similarity. Together, these results highlight the importance of task-oriented, context-dependent user profiling for personalized persuasiveness prediction.
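框架中"查询生成器 + 画像生成器"的检索环节,可用一个纯 Python 的词袋余弦相似度最小示意来说明(真实系统中查询由可训练模型生成、检索器也更复杂;以下用户历史与查询均为虚构英文示例):

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """词袋表示:小写分词后计数。"""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 虚构的用户历史记录
history = [
    "i changed my mind after seeing hard statistics on the topic",
    "personal anecdotes never convince me of anything",
    "my favorite recipe uses far too much butter",
]
# 虚构的"说服相关"检索查询(论文中由训练好的查询生成器产生)
query = "what statistics or evidence convince this user"

q = bow(query)
scores = [cosine(q, bow(r)) for r in history]
# 取相似度最高的两条记录,作为画像生成器的输入
top2 = sorted(range(len(history)), key=lambda i: -scores[i])[:2]
retrieved = [history[i] for i in top2]
```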
zh
[NLP-35] Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLM s
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models)在不同语言环境中确保安全性和公平性的挑战,特别是针对数据遗忘(data unlearning)和概念遗忘(concept unlearning)任务的跨语言有效性问题。其关键解决方案在于构建覆盖十种语言(涵盖五种语系和不同资源水平)的事实知识与刻板印象基准测试集,并通过翻译扩展现有英文基准,从而系统评估模型在多语言场景下的遗忘能力;实验发现高资源语言中的遗忘更稳定,且句法相似性是预测跨语言遗忘行为的最强指标,揭示了语言类型学关系对遗忘迁移效果的显著影响。
链接: https://arxiv.org/abs/2601.05641
作者: Alireza Dehghanpour Farashah,Aditi Khandelwal,Marylou Fauchard,Zhuan Shi,Negar Rostamzadeh,Golnoosh Farnadi
机构: Mila – Quebec AI Institute (魁北克人工智能研究所); McGill University (麦吉尔大学); Université de Montréal (蒙特利尔大学); Google Research (谷歌研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts presents unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.
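摘要结论"句法相似性是跨语言遗忘行为最强预测因子"对应的分析,本质上是对语言间距离与遗忘迁移得分做相关性检验。以下为一个自包含的 Pearson 相关系数示意(数值为虚构演示数据,并非论文的实验结果):

```python
import math

def pearson(xs, ys):
    """Pearson 相关系数。"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 虚构数据:句法距离越大,遗忘迁移得分越低
syntactic_distance = [0.1, 0.3, 0.5, 0.7, 0.9]
transfer_score     = [0.85, 0.72, 0.55, 0.41, 0.30]
r = pearson(syntactic_distance, transfer_score)   # 接近 -1,呈强负相关
```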
zh
[NLP-36] Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLM s
【速读】: 该论文旨在解决在小规模领域语料上预训练大语言模型(Large Language Models, LLMs)时如何保护敏感数据隐私的问题。其核心挑战在于,在持续预训练过程中既要保留模型对新知识的编码能力,又要确保个人身份信息(Personally Identifiable Information, PII)的安全性。解决方案的关键在于提出一种基于实体的加密数据合成框架:首先构建加权实体图以指导合成过程,再对PII实体实施确定性加密(deterministic encryption),从而允许模型在加密数据上进行持续预训练,同时通过密钥实现授权访问原始敏感信息。该方法在有限数据集上验证了模型性能优于基础模型且保障PII安全,同时保持指令遵循能力和长上下文理解能力。
链接: https://arxiv.org/abs/2601.05635
作者: Honghao Liu,Xuhui Jiang,Chengjin Xu,Cehao Yang,Yiran Cheng,Lionel Ni,Jian Guo
机构: International Digital Economy Academy; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; DataArc Tech Ltd
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Preserving privacy in sensitive data while pretraining large language models on small, domain-specific corpora presents a significant challenge. In this work, we take an exploratory step toward privacy-preserving continual pretraining by proposing an entity-based framework that synthesizes encrypted training data to protect personally identifiable information (PII). Our approach constructs a weighted entity graph to guide data synthesis and applies deterministic encryption to PII entities, enabling LLMs to encode new knowledge through continual pretraining while granting authorized access to sensitive data through decryption keys. Our results on limited-scale datasets demonstrate that our pretrained models outperform base models and ensure PII security, while exhibiting a modest performance gap compared to models trained on unencrypted synthetic data. We further show that increasing the number of entities and leveraging graph-based synthesis improves model performance, and that encrypted models retain instruction-following capabilities with long retrieved contexts. We discuss the security implications and limitations of deterministic encryption, positioning this work as an initial investigation into the design space of encrypted data pretraining for privacy-preserving LLMs. Our code is available at this https URL.
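摘要所说"对 PII 实体做确定性加密"的关键性质是:同一实体在任何上下文中映射到同一密文,从而保留实体共现统计供继续预训练使用,授权方则可凭密钥还原。下面用标准库 HMAC 做一个确定性假名化示意(仅为替代性演示,并非论文采用的具体加密方案;密钥与示例文本均为虚构):

```python
import hmac
import hashlib

def pseudonymize(entity: str, key: bytes) -> str:
    """确定性映射:同一 (entity, key) 永远得到同一 token。"""
    digest = hmac.new(key, entity.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ENT_" + digest[:12]

def encrypt_entities(text: str, entities, key: bytes, table: dict) -> str:
    """把文本中的 PII 实体替换为密文 token;授权方可通过映射表还原。"""
    for ent in entities:
        token = pseudonymize(ent, key)
        table[token] = ent          # 密钥持有方的还原表
        text = text.replace(ent, token)
    return text

key = b"demo-secret-key"            # 虚构密钥
table = {}
masked = encrypt_entities("Alice paid Bob, then Alice left.",
                          ["Alice", "Bob"], key, table)
```

两处 "Alice" 被替换为同一个 token,正是确定性加密保留统计结构的直观体现。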
zh
[NLP-37] GIFT: Games as Informal Training for Generalizable LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理“实践智慧”(practical wisdom)和可泛化智能(generalizable intelligence)方面的不足,例如战略创造力和社会推理能力,这些问题源于LLMs缺乏基于交互反馈的非正式学习(informal learning)。其解决方案的关键在于将游戏作为主要的非正式学习环境,利用游戏内在的奖励信号和抽象复杂性来培养多样化的认知能力;同时提出一种嵌套训练框架(Nested Training Framework),通过顺序任务组合实现显式的“AND”目标,而非简单混合任务所隐含的“OR”目标,从而迫使模型同时掌握多种能力以获取最大奖励,有效缓解多任务学习中的性能退化问题,并显著提升模型在广泛能力导向基准上的泛化能力。
链接: https://arxiv.org/abs/2601.05633
作者: Nuoyan Lyu,Bingbing Xu,Weihao Meng,Yige Yuan,Yang Zhang,Zhiyong Huang,Tat-Seng Chua,Huawei Shen
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS (中国科学院计算技术研究所人工智能安全重点实验室); University of Chinese Academy of Sciences (中国科学院大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:While Large Language Models (LLMs) have achieved remarkable success in formal learning tasks such as mathematics and code generation, they still struggle with the “practical wisdom” and generalizable intelligence, such as strategic creativity and social reasoning, that characterize human cognition. This gap arises from a lack of informal learning, which thrives on interactive feedback rather than goal-oriented instruction. In this paper, we propose treating Games as a primary environment for LLM informal learning, leveraging their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing optimizing an implicit “OR” objective, our framework employs sequential task composition to enforce an explicit “AND” objective, compelling the model to master multiple abilities simultaneously to achieve maximal rewards. Using GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who’s the Spy games, we demonstrate that integrating game-based informal learning not only prevents task interference but also significantly bolsters the model’s generalization across broad ability-oriented benchmarks. The framework and implementation are publicly available.
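摘要中"顺序任务组合实现显式 AND 目标,而朴素混合只优化隐式 OR 目标"的区别,可以用奖励聚合方式直观展示(奖励取值与聚合形式均为示意性假设,论文基于 GRPO 的具体实现见原文):

```python
def mixed_objective(task_rewards):
    """朴素任务混合:平均奖励,擅长单一任务即可得到不低的分数(OR)。"""
    return sum(task_rewards) / len(task_rewards)

def nested_objective(task_rewards):
    """顺序组合:任一任务失败则总奖励归零,必须同时掌握全部任务(AND)。"""
    total = 1.0
    for r in task_rewards:
        total *= r
    return total

specialist = [1.0, 0.0, 0.0]   # 只精通一个游戏的策略
generalist = [0.7, 0.7, 0.7]   # 各游戏都较好的策略
```

在嵌套目标下只有通才策略能拿到非零奖励,这正是迫使模型同时掌握多种能力的机制。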
zh
[NLP-38] xt Detoxification in isiXhosa and Yorùbá: A Cross-Lingual Machine Learning Approach for Low-Resource African Languages
【速读】: 该论文旨在解决非洲语言中缺乏有效毒性文本净化(toxic language mitigation)工具的问题,特别是在低资源语种如isiXhosa和Yorùbá中的应用瓶颈。其关键解决方案在于提出一种新颖且实用的混合方法:首先使用轻量级、可解释的TF-IDF与逻辑回归模型进行透明的毒性检测(stratified K-fold准确率61–72% for isiXhosa, 72–86% for Yorùbá),其次引入受控的词典与标记引导重写组件,在保留非毒性句子完整性的同时实现全部毒性句子的净化,且能处理习语、变音符号和代码混用等文化特异性现象,从而在低资源文本风格迁移(Text Style Transfer, TST)场景下实现了高效、可解释且文化适配的安全防护方案。
链接: https://arxiv.org/abs/2601.05624
作者: Abayomi O. Agbeyangi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 26 pages, 9 figures and 1 algorithm
Abstract:Toxic language is one of the major barriers to safe online participation, yet robust mitigation tools are scarce for African languages. This study addresses this critical gap by investigating automatic text detoxification (toxic to neutral rewriting) for two low-resource African languages, isiXhosa and Yorùbá. The work contributes a novel, pragmatic hybrid methodology: a lightweight, interpretable TF-IDF and Logistic Regression model for transparent toxicity detection, and a controlled lexicon- and token-guided rewriting component. A parallel corpus of toxic to neutral rewrites, which captures idiomatic usage, diacritics, and code switching, was developed to train and evaluate the model. The detection component achieved stratified K-fold accuracies of 61-72% (isiXhosa) and 72-86% (Yorùbá), with per-language ROC-AUCs up to 0.88. The rewriting component successfully detoxified all detected toxic sentences while preserving 100% of non-toxic sentences. These results demonstrate that scalable, interpretable machine learning detectors combined with rule-based edits offer a competitive and resource-efficient solution for culturally adaptive safety tooling, setting a new benchmark for low-resource Text Style Transfer (TST) in African languages.
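论文的重写组件由词典与标记引导。下面给出一个高度简化的词典引导重写示意(这里用英文占位词典代替 isiXhosa/Yorùbá 的真实平行词典,检测部分也以词典命中代替论文中的 TF-IDF + 逻辑回归模型):

```python
# 占位词典:毒性词 -> 中性替换(真实系统为 isiXhosa / Yorùbá 平行词典)
TOXIC_LEXICON = {"idiot": "person", "stupid": "misguided"}

def is_toxic(sentence: str) -> bool:
    """简化检测:任一词元命中词典即判为毒性。"""
    return any(tok in TOXIC_LEXICON for tok in sentence.lower().split())

def detoxify(sentence: str) -> str:
    """逐词替换毒性词;非毒性句子原样保留。"""
    if not is_toxic(sentence):
        return sentence
    return " ".join(TOXIC_LEXICON.get(tok, tok) for tok in sentence.split())

rewritten = detoxify("that idiot said it")
```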
zh
[NLP-39] Data Augmented Pipeline for Legal Information Extraction and Reasoning
【速读】: 该论文旨在解决法律领域信息抽取(Information Extraction, IE)任务中数据标注成本高、人工依赖性强的问题。解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的数据增强流水线,通过自动化生成高质量标注样本,显著降低人工标注负担,并提升信息抽取系统的鲁棒性与泛化能力。该方法设计简洁且具有通用性,可扩展至其他自然语言处理(Natural Language Processing, NLP)任务场景。
链接: https://arxiv.org/abs/2601.05609
作者: Nguyen Minh Phuong,Ha-Thanh Nguyen,May Myo Zin,Ken Satoh
机构: Center for Juris-Informatics, ROIS-DS (信息法研究中心,ROIS-DS); Japan Advanced Institute of Science and Technology (日本高级科学技术研究院); Research and Development Center for LLMs, NII (国立信息研究所大语言模型研发中心)
类目: Computation and Language (cs.CL)
备注: Accepted in the Demonstration Track at ICAIL 2025
Abstract:In this paper, we propose a pipeline leveraging Large Language Models (LLMs) for data augmentation in Information Extraction tasks within the legal domain. The proposed method is both simple and effective, significantly reducing the manual effort required for data annotation while enhancing the robustness of Information Extraction systems. Furthermore, the method is generalizable, making it applicable to various Natural Language Processing (NLP) tasks beyond the legal domain.
zh
[NLP-40] SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes
【速读】: 该论文旨在解决多模态大语言模型在复杂视觉场景中推理不忠实的问题,即模型在处理包含复杂实体与关系的视觉内容时,常出现幻觉实体、关系错位、推理步骤遗漏或过度细化等现象。其解决方案的关键在于提出SceneAlign框架,利用结构化的场景图(scene graph)作为视觉信息基础,通过识别推理关键节点并实施四种模拟典型视觉定位失败的定向扰动策略,构建语义合理但视觉事实错误的对比负样本(hard negative rationales),进而借助直接偏好优化(Direct Preference Optimization)引导模型实现细粒度且结构忠实的推理行为。
链接: https://arxiv.org/abs/2601.05600
作者: Chuhan Wang,Xintong Li,Jennifer Yuntong Zhang,Junda Wu,Chengkai Huang,Lina Yao,Julian McAuley,Jingbo Shang
机构: University of California, San Diego (加州大学圣地亚哥分校); University of Toronto (多伦多大学); University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint
Abstract:Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.
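速读中提到的"对推理关键节点实施定向扰动、构造语义合理但视觉事实错误的硬负例",可以在三元组层面的场景图上这样示意(场景图内容与两种扰动均为虚构示例,论文共有四种扰动策略):

```python
# 场景图表示为 (主体, 关系, 客体) 三元组列表
scene_graph = [("man", "holding", "umbrella"),
               ("dog", "under", "bench")]

def perturb_relation(triples, idx, new_relation):
    """关系扰动:替换某条边的关系,模拟关系错位类错误。"""
    subj, _, obj = triples[idx]
    perturbed = list(triples)
    perturbed[idx] = (subj, new_relation, obj)
    return perturbed

def perturb_entity(triples, idx, new_object):
    """实体扰动:替换客体,模拟幻觉实体类错误。"""
    subj, rel, _ = triples[idx]
    perturbed = list(triples)
    perturbed[idx] = (subj, rel, new_object)
    return perturbed

neg_rel = perturb_relation(scene_graph, 0, "next to")
neg_ent = perturb_entity(scene_graph, 1, "car")
```

正样本推理链基于原始 scene_graph 生成,负样本基于扰动后的图生成,二者即构成 DPO 所需的偏好对。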
zh
[NLP-41] ACR: Adaptive Context Refactoring via Context Refactoring Operators for Multi-Turn Dialogue
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮对话中面临的两个核心挑战:情境惯性(contextual inertia) 和 状态漂移(state drift),即模型难以保持与早期对话内容的一致性、无法有效追踪跨轮次的依赖关系,并随着对话长度增加而逐渐偏离事实。解决方案的关键在于提出一种自适应上下文重构框架(Adaptive Context Refactoring, ACR),其通过一套上下文重构算子库和教师引导的自我演化训练范式,动态监测并重构交互历史,从而主动缓解上述问题;该框架将上下文管理与推理过程解耦,显著提升多轮对话的连贯性与准确性,同时降低token消耗。
链接: https://arxiv.org/abs/2601.05589
作者: Jiawei Shen,Jia Zhu,Hanghui Guo,Weijie Shi,Yue Cui,Qingyu Niu,Guoqing Ma,Yidan Liang,Jingjiang Liu,Yiling Wang,Shimin Di,Jiajie Xu
机构: Zhejiang Normal University (浙江师范大学); Alibaba Group; Hong Kong University of Science and Technology (香港科技大学); Southeast University (东南大学); Soochow University (苏州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have shown remarkable performance in multi-turn dialogue. However, in multi-turn dialogue, models still struggle to stay aligned with what has been established earlier, follow dependencies across many turns, and avoid drifting into incorrect facts as the interaction grows longer. Existing approaches primarily focus on extending the context window, introducing external memory, or applying context compression, yet these methods still face limitations such as contextual inertia and state drift. To address these challenges, we propose the Adaptive Context Refactoring (ACR) Framework, which dynamically monitors and reshapes the interaction history to mitigate contextual inertia and state drift actively. ACR is built on a library of context refactoring operators and a teacher-guided self-evolving training paradigm that learns when to intervene and how to refactor, thereby decoupling context management from the reasoning process. Extensive experiments on multi-turn dialogue demonstrate that our method significantly outperforms existing baselines while reducing token consumption.
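ACR 的"上下文重构算子库 + 何时介入"可用一个最小示意表达:一个监测函数决定是否触发重构,一个裁剪算子在保留受保护轮次的前提下压缩历史(触发条件与算子形式均为假设;论文中"何时介入、如何重构"由教师引导的自我演化训练学习得到):

```python
def should_refactor(history, max_turns: int = 4) -> bool:
    """简单监测器:历史轮数超过阈值时触发重构。"""
    return len(history) > max_turns

def prune_operator(history, keep_last: int = 2, protected=frozenset()):
    """裁剪算子:保留被标记为关键的轮次与最近的若干轮。"""
    n = len(history)
    return [turn for i, turn in enumerate(history)
            if turn["id"] in protected or i >= n - keep_last]

history = [{"id": i, "text": f"turn-{i}"} for i in range(6)]
refactored = (prune_operator(history, keep_last=2, protected=frozenset({0}))
              if should_refactor(history) else history)
```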
zh
[NLP-42] Can large language models interpret unstructured chat data on dynamic group decision-making processes? Evidence on joint destination choice
【速读】: 该论文旨在解决如何从非结构化群组聊天数据中自动识别和提取复杂社会活动决策过程中的显性和隐性因素的问题,尤其针对群体外出就餐这类社会互动行为。传统方法依赖人工标注以捕捉受社会文化规范影响的语境意义,效率低下且难以规模化。其解决方案的关键在于设计了一种受知识获取过程启发的提示(prompting)框架,通过分步骤结构化地引导大型语言模型(Large Language Models, LLMs)依次提取关键决策要素——包括群体层面的餐厅选择集与最终结果、个体对各备选方案的偏好及其驱动的具体属性——从而将非结构化的对话内容转化为可分析的结构化表格数据。该框架显著提升了自动化处理能力,但研究也指出LLM在识别人类标注者轻易发现的隐性因素方面仍存在局限,需结合人工审核以确保决策机制解读的完整性。
链接: https://arxiv.org/abs/2601.05582
作者: Sung-Yoo Lim,Koki Sato,Kiyoshi Takami,Giancarlos Parady,Eui-Jin Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注: 23 pages, 9 figures
Abstract:Social activities result from complex joint activity-travel decisions between group members. While observing the decision-making process of these activities is difficult via traditional travel surveys, the advent of new types of data, such as unstructured chat data, can help shed some light on these complex processes. However, interpreting these decision-making processes requires inferring both explicit and implicit factors. This typically involves the labor-intensive task of manually annotating dialogues to capture context-dependent meanings shaped by the social and cultural norms. This study evaluates the potential of Large Language Models (LLMs) to automate and complement human annotation in interpreting decision-making processes from group chats, using data on joint eating-out activities in Japan as a case study. We designed a prompting framework inspired by the knowledge acquisition process, which sequentially extracts key decision-making factors, including the group-level restaurant choice set and outcome, individual preferences of each alternative, and the specific attributes driving those preferences. This structured process guides the LLM to interpret group chat data, converting unstructured dialogues into structured tabular data describing decision-making factors. To evaluate LLM-driven outputs, we conduct a quantitative analysis using a human-annotated ground truth dataset and a qualitative error analysis to examine model limitations. Results show that while the LLM reliably captures explicit decision-making factors, it struggles to identify nuanced implicit factors that human annotators readily identified. We pinpoint specific contexts when LLM-based extraction can be trusted versus when human oversight remains essential. These findings highlight both the potential and limitations of LLM-based analysis for incorporating non-traditional data sources on social activities.
zh
[NLP-43] RISE: Rule-Driven SQL Dialect Translation via Query Reduction ICSE2026
【速读】: 该论文旨在解决跨关系型数据库管理系统(RDBMS)的SQL方言翻译问题,传统规则驱动工具依赖人工编写规则,难以高效支持新数据库系统和复杂SQL查询;而大语言模型(LLM)虽具潜力,但在处理长且复杂的SQL语句时准确性不足。其解决方案的关键在于提出一种名为RISE的新方法,通过引入方言感知的查询简化技术(dialect-aware query reduction),先从复杂源查询中移除与目标方言无关的元素,生成简化查询,再利用LLM进行翻译并自动提取方言转换规则,最后将规则应用于原始复杂查询以实现准确翻译,从而规避了直接处理复杂查询带来的挑战。
链接: https://arxiv.org/abs/2601.05579
作者: Xudong Xie,Yuwei Zhang,Wensheng Dou,Yu Gao,Ziyu Cui,Jiansen Song,Rui Yang,Jun Wei
机构: Institute of Software at CAS, China(中国科学院软件研究所)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted by ICSE 2026
Abstract:Translating SQL dialects across different relational database management systems (RDBMSs) is crucial for migrating RDBMS-based applications to the cloud. Traditional SQL dialect translation tools rely on manually-crafted rules, necessitating significant manual effort to support new RDBMSs and dialects. Although large language models (LLMs) can assist in translating SQL dialects, they often struggle with lengthy and complex SQL queries. In this paper, we propose RISE, a novel LLM-based SQL dialect translation approach that can accurately handle lengthy and complex SQL queries. Given a complex source query Q_c that contains a SQL dialect d, we first employ a dialect-aware query reduction technique to derive a simplified query Q_s by removing d-irrelevant SQL elements from Q_c. Subsequently, we utilize LLMs to translate Q_s into Q_s', and automatically extract the translation rule r_d for dialect d based on the relationship between Q_s and Q_s'. By applying r_d to Q_c, we can effectively translate the dialect d within Q_c, thereby bypassing the complexity of the source query Q_c. We evaluate RISE on two real-world benchmarks, i.e., TPC-DS and SQLProcBench, comparing its performance against both the traditional rule-based tools and the LLM-based approaches with respect to translation accuracy. RISE achieves accuracies of 97.98% on TPC-DS and 100% on SQLProcBench, outperforming the baselines by an average improvement of 24.62% and 238.41%, respectively.
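RISE 的核心是"从简化查询对中提取规则,再应用到复杂查询"。下面以 Oracle 的 NVL 与标准 COALESCE 之间的转换为例,做一个词元级规则提取与应用的最小示意(规则提取方式为简化演示;论文中简化查询由 LLM 翻译、规则自动抽取):

```python
def extract_rule(src: str, tgt: str):
    """比较简化查询对,取第一对不同的词元并剥离公共前后缀,得到替换规则。"""
    for a, b in zip(src.split(), tgt.split()):
        if a != b:
            i = 0
            while i < min(len(a), len(b)) and a[i] == b[i]:
                i += 1
            j = 0
            while j < min(len(a), len(b)) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
                j += 1
            return a[i:len(a) - j], b[i:len(b) - j]
    return None

# 简化查询对 Q_s 与 Q_s'(此处假设 Q_s' 由 LLM 翻译得到)
simplified_src = "SELECT NVL(name, 'unknown') FROM t"
simplified_tgt = "SELECT COALESCE(name, 'unknown') FROM t"
rule = extract_rule(simplified_src, simplified_tgt)

# 将规则 r_d 应用到复杂查询 Q_c,绕开其整体复杂度
complex_query = ("SELECT NVL(c.name, 'n/a'), SUM(o.total) FROM customers c "
                 "JOIN orders o ON o.cid = c.id GROUP BY NVL(c.name, 'n/a')")
translated = complex_query.replace(rule[0], rule[1])
```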
zh
[NLP-44] WildSci: Advancing Scientific Reasoning from In-the-Wild Literature
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在科学领域(如医学和材料科学)中推理能力受限的问题,其根源在于这些领域缺乏高质量、覆盖广泛的训练数据以及开放性科学问题的评估标准不明确。解决方案的关键在于构建一个名为WildSci的新数据集,该数据集通过自动合成同行评审文献中的领域特定科学问题,覆盖9个科学学科和26个子领域,并将复杂推理任务转化为多选题形式以提供清晰的奖励信号。在此基础上,研究者采用强化学习对模型进行微调,从而实现可扩展且可持续的科学推理能力提升。
链接: https://arxiv.org/abs/2601.05567
作者: Tengxiao Liu,Deepak Nathani,Zekun Li,Kevin Yang,William Yang Wang
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent progress in large language model (LLM) reasoning has focused on domains like mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in LLM reasoning models remains limited in scientific domains such as medicine and materials science due to limited dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further apply reinforcement learning to finetune models on these data and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci to enable scalable and sustainable research in scientific reasoning, available at this https URL.
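"将复杂推理转化为多选题以提供清晰奖励信号"意味着奖励函数可以非常简单;配合组内相对的优势归一化(GRPO 的常见做法),即可驱动强化学习训练。以下为示意(题目选项与采样答案为虚构,优势计算为标准的组内归一化形式,并非论文的完整训练流程):

```python
import math

def mcq_reward(chosen: str, gold: str) -> float:
    """多选题奖励:选对为 1,选错为 0。"""
    return 1.0 if chosen == gold else 0.0

def group_relative_advantages(rewards):
    """组内归一化优势:A_i = (r_i - mean) / std(GRPO 风格)。"""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / std if std > 0 else 0.0 for r in rewards]

gold = "B"
sampled_answers = ["B", "A", "B", "D"]       # 同一问题的 4 次采样
rewards = [mcq_reward(a, gold) for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```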
zh
[NLP-45] The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era ICASSP2026
【速读】: 该论文旨在解决当前语音对话系统在实现“类人”交互能力方面的瓶颈问题,即如何同时具备情感智能(emotional intelligence)以理解并共情用户情绪状态,以及具备全双工交互机制(full-duplex interaction)以支持自然、实时的对话流。解决方案的关键在于构建首个面向类人语音对话系统的挑战赛(HumDial),其核心是基于真实人类对话的大规模数据集,设立两个评估维度:一是长期情感理解与共情生成能力(Emotional Intelligence Track),二是“边听边说”条件下的实时决策能力(Full-Duplex Interaction Track),从而为该领域提供一个公平、系统化的基准测试平台。
链接: https://arxiv.org/abs/2601.05564
作者: Zhixian Zhao,Shuiyuan Wang,Guojian Li,Hongfei Xue,Chengyou Wang,Shuai Wang,Longshuai Xiao,Zihan Zhang,Hui Bu,Xin Xu,Xinsheng Wang,Hexin Liu,Eng Siong Chng,Hung-yi Lee,Haizhou Li,Lei Xie
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: Official summary paper for the ICASSP 2026 HumDial Challenge
Abstract:Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly "human-like" communication necessitates a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. Therefore, we launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under "listening-while-speaking" conditions. This paper summarizes the dataset, track configurations, and the final results.
zh
[NLP-46] ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging
【速读】: 该论文旨在解决领域专业化模型(如安全、生物医学和金融领域)在融合生成式推理能力(即“Reasoning + X”)时出现的性能崩溃问题,具体表现为推理深度减弱与领域专用能力受损。其解决方案的关键在于提出一种名为ReasonAny的新颖合并框架,该框架通过对比梯度识别(Contrastive Gradient Identification)技术,精准定位并保留推理能力主要分布的低梯度敏感参数区域,从而有效避免传统方法中因盲目合并导致的性能退化,实现推理能力与领域知识的协同增强。
链接: https://arxiv.org/abs/2601.05560
作者: Junyao Yang,Chen Qian,Dongrui Liu,Wen Shen,Yong Liu,Jing Shao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); National University of Singapore (新加坡国立大学); Renmin University of China (中国人民大学); Tongji University (同济大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures, 14 tables
Abstract:Large Reasoning Models (LRMs) with long chain-of-thought reasoning have recently achieved remarkable success. Yet, equipping domain-specialized models with such reasoning capabilities, referred to as “Reasoning + X”, remains a significant challenge. While model merging offers a promising training-free solution, existing methods often suffer from a destructive performance collapse: existing methods tend to both weaken reasoning depth and compromise domain-specific utility. Interestingly, we identify a counter-intuitive phenomenon underlying this failure: reasoning ability predominantly resides in parameter regions with low gradient sensitivity, contrary to the common assumption that domain capabilities correspond to high-magnitude parameters. Motivated by this insight, we propose ReasonAny, a novel merging framework that resolves the reasoning-domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes “Reasoning + X” capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.
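论文的反直觉发现是:推理能力主要落在梯度敏感度低的参数区域。据此,合并时可以只在低敏感位置注入推理增量、在高敏感位置保留领域参数。以下为逐参数掩码合并的最小示意(阈值与各数值均为虚构;论文"对比梯度识别"的具体做法见原文):

```python
def merge_with_gradient_mask(base, reasoning_delta, grad_sensitivity, tau=0.1):
    """仅在领域模型梯度敏感度低于 tau 的位置叠加推理增量。"""
    return [b + (d if g < tau else 0.0)
            for b, d, g in zip(base, reasoning_delta, grad_sensitivity)]

base_params   = [0.5, -0.2, 0.8, 0.1]   # 领域专家模型参数(虚构)
delta_params  = [0.3,  0.3, 0.3, 0.3]   # 推理模型相对基座的参数增量(虚构)
sensitivities = [0.01, 0.5, 0.02, 0.9]  # 领域任务上的梯度敏感度(虚构)
merged = merge_with_gradient_mask(base_params, delta_params, sensitivities)
```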
zh
[NLP-47] Generation-Based and Emotion-Reflected Memory Update: Creating the KEEM Dataset for Better Long-Term Conversation COLING2025
【速读】: 该论文旨在解决长期对话系统中记忆更新机制不足的问题,即传统方法依赖简单的信息累积或操作式处理,易导致信息冲突且难以准确追踪用户当前状态。解决方案的关键在于提出一种基于生成式的记忆更新机制——Keep Emotional and Essential Memory (KEEM) 数据集,通过动态生成整合性记忆,同时保留关键事实信息、情感上下文及因果关系,从而实现对用户交互更细腻的理解与更有效的记忆更新,提升系统在开放域对话中的共情能力与响应质量。
链接: https://arxiv.org/abs/2601.05548
作者: Jeonghyun Kang,Hongjin Kim,Harksoo Kim
机构: Konkuk University (建国大学)
类目: Computation and Language (cs.CL)
备注: COLING 2025 accepted paper (Main)
Abstract:In this work, we introduce the Keep Emotional and Essential Memory (KEEM) dataset, a novel generation-based dataset designed to enhance memory updates in long-term conversational systems. Unlike existing approaches that rely on simple accumulation or operation-based methods, which often result in information conflicts and difficulties in accurately tracking a user’s current state, KEEM dynamically generates integrative memories. This process not only preserves essential factual information but also incorporates emotional context and causal relationships, enabling a more nuanced understanding of user interactions. By seamlessly updating a system’s memory with both emotional and essential data, our approach promotes deeper empathy and enhances the system’s ability to respond meaningfully in open-domain conversations.
zh
[NLP-48] Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring COLING2025
【速读】: 该论文旨在解决当前自动作文评分(Automated Essay Scoring, AES)系统与大语言模型(Large Language Models, LLMs)在识别和评分有害内容方面的不足,尤其是它们对种族主义、性别偏见等敏感话题中道德风险的忽视,导致高分作文可能传播有害观点。解决方案的关键在于构建一个名为有害作文检测(Harmful Essay Detection, HED)的新基准测试集,该集合包含融合敏感议题的作文,用以系统评估不同LLMs在识别和评分有害内容上的能力,从而揭示现有模型在伦理维度上的缺陷,并推动开发更具备伦理敏感性的AES系统。
链接: https://arxiv.org/abs/2601.05545
作者: Hongjin Kim,Jeonghyun Kang,Harksoo Kim
机构: Konkuk University (建国大学)
类目: Computation and Language (cs.CL)
备注: COLING 2025 accepted paper (Main)
Abstract:This study addresses critical gaps in Automated Essay Scoring (AES) systems and Large Language Models (LLMs) with regard to their ability to effectively identify and score harmful essays. Despite advancements in AES technology, current models often overlook ethically and morally problematic elements within essays, erroneously assigning high scores to essays that may propagate harmful opinions. In this study, we introduce the Harmful Essay Detection (HED) benchmark, which includes essays integrating sensitive topics such as racism and gender bias, to test the efficacy of various LLMs in recognizing and scoring harmful content. Our findings reveal that: (1) LLMs require further enhancement to accurately distinguish between harmful and argumentative essays, and (2) both current AES models and LLMs fail to consider the ethical dimensions of content during scoring. The study underscores the need for developing more robust AES systems that are sensitive to the ethical implications of the content they are scoring.
zh
[NLP-49] Closing the Modality Reasoning Gap for Speech Large Language Models
【速读】: 该论文旨在解决语音大语言模型(Speech Large Language Models, Speech LLMs)在推理能力上存在的显著模态推理差距问题,即其在语音输入上的推理表现明显弱于文本输入。解决方案的关键在于提出一种名为TARS的强化学习框架,通过不对称奖励设计对齐文本条件和语音条件下的推理轨迹;该框架引入两个密集且互补的信号:表示对齐(representation alignment),衡量语音与文本条件路径在Transformer各层隐藏状态间的相似性;行为对齐(behavior alignment),评估生成输出与参考文本完成之间的语义一致性,从而有效缩小模态推理差距并提升性能。
链接: https://arxiv.org/abs/2601.05543
作者: Chaoren Wang,Heng Lu,Xueyao Zhang,Shujie Liu,Yan Lu,Jinyu Li,Zhizheng Wu
机构: The Chinese University of Hong Kong, Shenzhen (深圳大学); Microsoft Corporation (微软)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Although speech large language models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
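表示对齐奖励衡量的是语音与文本条件路径逐层隐藏状态的相似度。用余弦相似度按层平均,即可得到一个稠密奖励信号的最小示意(隐藏状态为虚构的二维向量,真实模型中为各 Transformer 层的高维表示):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def representation_alignment(speech_layers, text_layers):
    """逐层余弦相似度的平均值,作为表示对齐奖励。"""
    sims = [cosine(s, t) for s, t in zip(speech_layers, text_layers)]
    return sum(sims) / len(sims)

text_layers = [[1.0, 0.0], [0.6, 0.8]]     # 文本条件下的逐层隐藏状态(虚构)
speech_good = [[0.9, 0.1], [0.5, 0.9]]     # 对齐较好的语音路径
speech_bad  = [[0.0, 1.0], [-0.8, 0.6]]    # 表示漂移严重的语音路径
reward_good = representation_alignment(speech_good, text_layers)
reward_bad  = representation_alignment(speech_bad, text_layers)
```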
zh
[NLP-50] Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
【速读】: 该论文旨在解决并行推测解码(Parallel Speculative Decoding, PSD)中存在的两个根本性问题:一是由草稿模型与目标模型之间速度比决定的理论加速上限;二是由于早期错误导致的序列中段token被拒绝,从而引发高计算浪费和流水线停滞。解决方案的关键在于提出一种名为\textscDouble(Double Retrieval Speculative Parallelism)的新框架,其核心创新是通过一种新颖的同步机制打破传统PSD的加速瓶颈——具体而言,草稿模型执行迭代检索推测以突破理论加速极限,同时目标模型通过权威检索生成多token引导信息,在不回滚的情况下有效缓解token拒绝问题。该方法完全无需训练且无损,实验表明在LLaMA3.3-70B和Qwen3-32B上分别实现了5.3×和2.8×的显著加速,优于需大量训练的先进方法EAGLE-3。
链接: https://arxiv.org/abs/2601.05524
作者: Yuhao Shen,Tianyu Liu,Junyi Shen,Jinyang Wu,Quan Kong,Li Huan,Cong Wang
机构: Zhejiang University (浙江大学); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stall due to mid-sequence token rejections of early errors. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token guidance. Double is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedup of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming the advanced method EAGLE-3 that requires extensive model training.
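摘要指出传统 PSD 的加速上限由草稿/目标模型速度比决定。下面给出两个相关量的计算示意:经典推测解码单轮期望产出 token 数(来自标准推测解码分析,并非本文提出),以及由速度比给出的朴素并行上限(均为示意,具体推导见相关文献):

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """经典推测解码:草稿长度 gamma、逐 token 接受率 alpha 时,
    每轮验证期望产出 (1 - alpha^(gamma+1)) / (1 - alpha) 个 token。"""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def psd_speedup_ceiling(draft_time: float, target_time: float) -> float:
    """并行推测解码的朴素上限:草稿与验证完全重叠时,
    一次目标前向最多覆盖 target_time / draft_time 个草稿 token。"""
    return target_time / draft_time

tokens = expected_tokens_per_round(alpha=0.8, gamma=4)      # 约 3.36
ceiling = psd_speedup_ceiling(draft_time=1.0, target_time=5.0)
```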
zh
[NLP-51] CHisAgent: A Multi-Agent Framework for Event Taxonomy Construction in Ancient Chinese Cultural Systems
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在历史与文化推理任务中,尤其是在非英语语境下(如中国古代历史)表现受限的问题。其核心挑战在于如何高效构建结构清晰、覆盖全面且忠实于史实的历史知识分类体系(taxonomic structure),而传统手工构建方法成本高、难以扩展。解决方案的关键是提出一个名为CHisAgent的多智能体框架,通过三个角色专业化阶段实现自动化历史分类体系构建:首先由底向上(bottom-up)的Inducer从原始历史文献中提取初始层次结构;其次由顶向下(top-down)的Expander利用LLM的世界知识补充缺失的中间概念;最后由证据引导(evidence-guided)的Enricher整合外部结构化历史资源以确保内容准确性。该方法显著提升了分类体系的结构性一致性和覆盖广度,并支持跨文化对齐。
链接: https://arxiv.org/abs/2601.05520
作者: Xuemei Tang,Chengxi Yan,Jinghang Gu,Chu-Ren Huang
机构: The Hong Kong Polytechnic University (香港理工大学); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL)
备注: 22 pages, 13 figures, 7 tables
Abstract:Despite strong performance on many tasks, large language models (LLMs) show limited ability in historical and cultural reasoning, particularly in non-English contexts such as Chinese history. Taxonomic structures offer an effective mechanism to organize historical knowledge and improve understanding. However, manual taxonomy construction is costly and difficult to scale. Therefore, we propose CHisAgent, a multi-agent LLM framework for historical taxonomy construction in ancient Chinese contexts. CHisAgent decomposes taxonomy construction into three role-specialized stages: a bottom-up Inducer that derives an initial hierarchy from raw historical corpora, a top-down Expander that introduces missing intermediate concepts using LLM world knowledge, and an evidence-guided Enricher that integrates external structured historical resources to ensure faithfulness. Using the Twenty-Four Histories, we construct a large-scale, domain-aware event taxonomy covering politics, military, diplomacy, and social life in ancient China. Extensive reference-free and reference-based evaluations demonstrate improved structural coherence and coverage, while further analysis shows that the resulting taxonomy supports cross-cultural alignment.
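示例(示意):Expander 阶段"自顶向下补全中间概念"的最小化演示,用普通字典表示分类体系;事件名与分组均为虚构,真实系统中该分组由 LLM 的世界知识给出。

```python
# Toy sketch of the top-down Expander stage: insert a missing intermediate
# concept between the root and a group of related leaf events.

def expand(taxonomy, intermediate, children):
    """Move the given children of the root under a new intermediate node."""
    taxonomy[intermediate] = [c for c in taxonomy["root"] if c in children]
    taxonomy["root"] = [c for c in taxonomy["root"] if c not in children]
    taxonomy["root"].append(intermediate)
    return taxonomy

# Bottom-up induction left all leaves directly under the root (made up).
tax = {"root": ["battle of Feishui", "siege of Xiangyang", "tribute mission"]}
expand(tax, "military events", {"battle of Feishui", "siege of Xiangyang"})
print(tax)
```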
zh
[NLP-52] Enabling Stroke-Level Structural Analysis of Hieroglyphic Scripts without Language-Specific Priors
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)和多模态大语言模型(Multimodal LLMs, MLLMs)在处理表意文字(如汉字、古埃及象形文字等)时普遍存在的结构感知缺失问题,即这些模型通常将字符视为文本标记或原始像素矩阵,而无法建模字符内部笔画(stroke)的逻辑结构。其关键解决方案是提出Hieroglyphic Stroke Analyzer (HieroSA),一种通用且可泛化的框架,能够自动从字符位图中提取笔画级结构,无需人工标注数据;该框架将现代表意文字与古代象形文字图像映射到标准化坐标空间中的显式线段表示,从而实现跨语言的结构解析与语义理解,为符号文字的图形学分析提供了一种新范式。
链接: https://arxiv.org/abs/2601.05508
作者: Fuwen Luo,Zihao Wan,Ziyue Wang,Yaluo Liu,Pau Tong Lin Xu,Xuanjia Qiao,Xiaolong Wang,Peng Li,Yang Liu
机构: Tsinghua University (清华大学); Institute for AI Industry Research (AIR) (清华大学人工智能产业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Hieroglyphs, as logographic writing systems, encode rich semantic and cultural information within their internal structural composition. Yet, current advanced Large Language Models (LLMs) and Multimodal LLMs (MLLMs) usually remain structurally blind to this information. LLMs process characters as textual tokens, while MLLMs additionally view them as raw pixel grids. Both fall short to model the underlying logic of character strokes. Furthermore, existing structural analysis methods are often script-specific and labor-intensive. In this paper, we propose Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables MLLMs to automatically derive stroke-level structures from character bitmaps without handcrafted data. It transforms modern logographic and ancient hieroglyphs character images into explicit, interpretable line-segment representations in a normalized coordinate space, allowing for cross-lingual generalization. Extensive experiments demonstrate that HieroSA effectively captures character-internal structures and semantics, bypassing the need for language-specific priors. Experimental results highlight the potential of our work as a graphematics analysis tool for a deeper understanding of hieroglyphic scripts. View our code at this https URL.
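示例(示意):HieroSA 的"归一化坐标空间中的显式线段表示"可以最简化为把笔画端点从位图像素坐标除以图像宽高映射到 [0,1]^2,如下所示;线段数据为虚构示例。

```python
def normalize_strokes(segments, width, height):
    """Map stroke endpoints ((x0, y0), (x1, y1)) from bitmap pixels into the
    normalized [0, 1]^2 space, making strokes comparable across scripts and
    image sizes."""
    return [((x0 / width, y0 / height), (x1 / width, y1 / height))
            for (x0, y0), (x1, y1) in segments]

# Two crossing strokes of a 64x64 glyph bitmap (made-up example).
segs = [((0, 0), (64, 64)), ((64, 0), (0, 64))]
norm = normalize_strokes(segs, 64, 64)
print(norm)
```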
zh
[NLP-53] FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在无状态架构下难以维持动态上下文、导致代理(agent)需冗余重处理历史信息以实现长时自主性的问题。其核心解决方案是提出FlashMem框架,通过计算复用从瞬态推理状态中蒸馏出内在记忆(intrinsic memory),关键创新在于利用内部表示对输入轨迹的唯一编码特性,将最后一个隐藏状态识别为交互历史的充分统计量,并借助共享键值(Shared-KV)合并器直接访问冻结缓存进行记忆合成,从而避免冗余再参数化;同时引入无参数的认知监控器(Cognitive Monitor),基于注意力熵自适应触发记忆巩固,仅在高认知不确定性时激活,显著提升效率与持续认知能力的平衡。
链接: https://arxiv.org/abs/2601.05505
作者: Yubo Hou,Zhisheng Chen,Tao Wan,Zengchang Qin
机构: Beihang University (北京航空航天大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The stateless architecture of Large Language Models inherently lacks the mechanism to preserve dynamic context, compelling agents to redundantly reprocess history to maintain long-horizon autonomy. While latent memory offers a solution, current approaches are hindered by architectural segregation, relying on auxiliary encoders that decouple memory from the reasoning backbone. We propose FlashMem, a framework that distills intrinsic memory directly from transient reasoning states via computation reuse. Leveraging the property that internal representations uniquely encode input trajectories, FlashMem identifies the last hidden state as a sufficient statistic for the interaction history. This enables a Shared-KV Consolidator to synthesize memory by attending directly to the backbone’s frozen cache, eliminating redundant re-parameterization. Furthermore, a parameter-free Cognitive Monitor leverages attention entropy to adaptively trigger consolidation only when high epistemic uncertainty is detected. Experiments demonstrate that FlashMem matches the performance of heavy baselines while reducing inference latency by 5 times, effectively bridging the gap between efficiency and persistent cognition.
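示例(示意):认知监控器所用的"注意力熵"可按下式计算,熵越高说明注意力越分散、认知不确定性越高,此时才触发记忆巩固;阈值 1.0 为本文假设,并非论文设定。

```python
import math

def entropy(p):
    # Shannon entropy (in nats) of an attention distribution.
    return -sum(x * math.log(x) for x in p if x > 0)

def should_consolidate(attn, threshold=1.0):
    # Parameter-free trigger: consolidate only under diffuse attention.
    return entropy(attn) > threshold

peaked = [0.97, 0.01, 0.01, 0.01]   # confident decoding state
diffuse = [0.25, 0.25, 0.25, 0.25]  # high epistemic uncertainty
flag_peaked = should_consolidate(peaked)
flag_diffuse = should_consolidate(diffuse)
print(flag_peaked, flag_diffuse)
```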
zh
[NLP-54] Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)微调过程中,标准一阶(First-Order, FO)优化方法易收敛至尖锐、泛化能力差的极小值点,而零阶(Zeroth-Order, ZO)方法虽具更强探索性但因生成任务中输出空间庞大导致估计方差显著增加,从而造成噪声大、效率低的问题。解决方案的关键在于提出一种分层混合优化框架 Hi-ZFO(Hierarchical Zeroth- and First-Order optimization),其通过逐层重要性分析自适应地划分模型结构:对关键层采用高精度 FO 更新,对非敏感层引入 ZO 优化以引入“有益随机性”,帮助模型跳出纯 FO 优化易停滞的局部极小值,从而在保持性能的同时显著提升训练效率。
链接: https://arxiv.org/abs/2601.05501
作者: Feihu Jin,Ying Tan
机构: Peking University (北京大学); Institute for Artificial Intellignce, Peking University (北京大学人工智能研究院); State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 13 pages, 4 figures
Abstract:Fine-tuning large language models (LLMs) using standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. Conversely, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplify estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose Hi-ZFO (Hierarchical Zeroth- and First-Order optimization), a hybrid framework designed to synergize the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of “beneficial stochasticity” to help the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing the training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.
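示例(示意):零阶(ZO)优化的核心是无需反向传播的两点梯度估计(SPSA/MeZO 风格),下面在玩具二次损失上演示其通用写法;这只是 ZO 估计本身,并不包含 Hi-ZFO 的层级重要性划分逻辑。

```python
import random

def zo_gradient(loss, w, eps=1e-3, seed=0):
    """Two-point zeroth-order estimate: g ~ (L(w+eps*z) - L(w-eps*z)) /
    (2*eps) * z with a Rademacher perturbation z in {-1, +1}^d."""
    rng = random.Random(seed)
    z = [rng.choice([-1.0, 1.0]) for _ in w]
    wp = [wi + eps * zi for wi, zi in zip(w, z)]
    wm = [wi - eps * zi for wi, zi in zip(w, z)]
    scale = (loss(wp) - loss(wm)) / (2 * eps)
    return [scale * zi for zi in z]

# Quadratic toy loss; its true gradient at w is 2*w.
loss = lambda w: sum(wi * wi for wi in w)
g = zo_gradient(loss, [1.0, -2.0, 3.0])
print(g)
```

在此二次损失上,该估计与真实梯度的内积恒为非负,因而仍指向下降方向;但其方差随维度与输出空间增大而上升,这正是论文指出 ZO 在生成任务中噪声偏大的原因。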
zh
[NLP-55] MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding
【速读】: 该论文旨在解决长视频(时长从几分钟到数小时)在当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)中理解困难的问题,其核心挑战在于视频内容的复杂事件、多样场景以及长距离依赖关系难以有效建模。直接对长视频进行完整编码计算开销过大,而简单的视频到文本转换则易导致冗余或碎片化信息。解决方案的关键在于提出一种新颖的多模态、多粒度结构化表示方法——MMViR(Multi-Modal Video Representation),该方法通过识别关键转折点对视频进行分段,并构建包含全局叙事与细粒度视觉细节的三级描述体系,从而支持高效查询驱动的检索并具备跨场景泛化能力。
链接: https://arxiv.org/abs/2601.05495
作者: Zizhong Li,Haopeng Zhang,Jiawei Zhang
机构: IFM Lab, University of California, Davis (加州大学戴维斯分校); ALOHA Lab, University of Hawaii at Mānoa (夏威夷大学马诺阿分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 13 pages, 11 figures
Abstract:Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.
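示例(示意):"识别关键转折点对视频分段"可以简化为在相邻帧相似度序列上按阈值切分,如下所示;相似度数值与阈值均为虚构,真实系统的转折点检测要复杂得多。

```python
def segment_by_turning_points(sims, threshold=0.5):
    """Cut the frame sequence wherever the similarity between adjacent
    frames drops below the threshold; returns frame-index lists."""
    parts, cur = [], [0]
    for i, s in enumerate(sims, start=1):
        if s < threshold:   # a turning point: scene/event boundary
            parts.append(cur)
            cur = []
        cur.append(i)
    parts.append(cur)
    return parts

# Similarities between consecutive frames 0-1, 1-2, ..., 5-6 (made up).
sims = [0.9, 0.8, 0.2, 0.9, 0.1, 0.95]
parts = segment_by_turning_points(sims)
print(parts)
```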
zh
[NLP-56] MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期对话中保持一致性的问题,尤其是传统检索机制难以捕捉历史状态的时间演化特性,以及现有记忆增强框架受限于静态提示或低效训练策略(如稀疏奖励)。其解决方案的关键在于提出MemBuilder——一个基于强化学习的框架,通过两个核心机制实现:(1) 采用合成会话级问题生成来提供密集的中间奖励信号,缓解轨迹层面奖励稀疏性;(2) 引入贡献感知梯度加权方法,根据各记忆组件对下游任务的影响动态调整策略更新幅度,从而实现多维记忆的精准构建与优化。实验表明,该方法使4B参数模型在长程对话基准上超越闭源先进基线,展现出良好的泛化能力。
链接: https://arxiv.org/abs/2601.05488
作者: Zhiyu Shen,Ziming Wu,Fuming Lai,Shaobing Lian,Yanghui Rao
机构: Sun Yat-sen University (中山大学); Tencent Inc. (腾讯)
类目: Computation and Language (cs.CL)
备注: 19 pages (9 main + 10 appendix), 7 figures, 3 tables
Abstract:Maintaining consistency in long-term dialogues remains a fundamental challenge for LLMs, as standard retrieval mechanisms often fail to capture the temporal evolution of historical states. While memory-augmented frameworks offer a structured alternative, current systems rely on static prompting of closed-source models or suffer from ineffective training paradigms with sparse rewards. We introduce MemBuilder, a reinforcement learning framework that trains models to orchestrate multi-dimensional memory construction with attributed dense rewards. MemBuilder addresses two key challenges: (1) Sparse Trajectory-Level Rewards: we employ synthetic session-level question generation to provide dense intermediate rewards across extended trajectories; and (2) Multi-Dimensional Memory Attribution: we introduce contribution-aware gradient weighting that scales policy updates based on each component’s downstream impact. Experimental results show that MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines, exhibiting strong generalization across long-term dialogue benchmarks.
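示例(示意):"贡献感知的梯度加权"的直观形式是按各记忆维度对下游奖励的贡献占比缩放其策略更新信号;下面的归一化写法为本文假设,并非论文公式。

```python
def weighted_updates(advantages, contributions):
    """Scale each memory component's policy-gradient signal by its
    normalized share of measured downstream contribution (a sketch)."""
    total = sum(contributions) or 1.0
    return [a * (c / total) for a, c in zip(advantages, contributions)]

# Three memory dimensions (e.g. persona, events, summary) with equal raw
# advantage but unequal measured contribution to downstream QA reward.
adv = [1.0, 1.0, 1.0]
contrib = [0.6, 0.3, 0.1]
w = weighted_updates(adv, contrib)
print(w)
```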
zh
[NLP-57] The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对精心构造的、难以被直接识别为虚假的误导性证据时,其内部信念易被篡改的问题。当前模型虽能抵御显式错误信息,但对逻辑上看似合理、逐步强化的欺骗性推理链条仍缺乏鲁棒性,从而导致下游决策偏离事实。解决方案的关键在于提出一种名为“Deceptive Intent Shielding (DIS)”的治理机制,通过识别证据背后的欺骗意图提供早期预警信号,从而有效抑制模型对误导性信息的信任增强,并促使模型采取更审慎的证据评估策略。
链接: https://arxiv.org/abs/2601.05478
作者: Herun Wan,Jiaying Wu,Minnan Luo,Fanxiao Li,Zhi Zeng,Min-Yen Kan
机构: Xi’an Jiaotong University (西安交通大学); National University of Singapore (新加坡国立大学); Yunnan University (云南大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.
zh
[NLP-58] MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码优化任务中面临的两大挑战:一是编写高性能代码(如高效CUDA内核和竞赛级CPU代码)需要系统、算法及特定语言的深度专业知识;二是需对性能指标(如执行时间与设备利用率)进行解读,而不仅限于二进制正确性。其解决方案的核心在于提出一种名为MaxCode的推理时搜索框架,该框架将现有搜索方法统一到最大奖励强化学习(max-reward reinforcement learning)范式下,使观测空间与动作价值函数模块化以支持灵活调整。关键创新包括:引入自然语言批判模型(natural language critique model),将原始执行反馈转化为关于错误和性能瓶颈的诊断性洞察,并结合历史最佳折扣奖励作为输入增强代码生成函数;同时训练一个生成式奖励到终点模型(reward-to-go model),利用回放轨迹中的动作价值对候选解进行重排序,从而提升搜索过程中的探索效率。实验表明,MaxCode在KernelBench(CUDA)和PIE(C++)基准上均显著优于基线方法,在绝对加速比和相对加速排名上分别提升了20.3%和10.1%。
链接: https://arxiv.org/abs/2601.05475
作者: Jiefu Ou,Sapana Chaudhary,Kaj Bostrom,Nathaniel Weir,Shuai Zhang,Huzefa Rangwala,George Karypis
机构: Johns Hopkins University (约翰霍普金斯大学); Amazon Web Services (亚马逊网络服务)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms, and specific languages, and (ii) evaluating it requires interpreting performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode, unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and the best-discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.
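示例(示意):最大奖励搜索的骨架是"迭代提出改写、按奖励打分、保留历史最优解",如下用玩具提议函数与奖励函数演示;真实系统中提议由 LLM 结合批判反馈生成,奖励来自代码执行的加速比。

```python
# Skeleton of max-reward search: the objective is the best reward seen
# along the trajectory, not the reward of the final iterate.

def propose(candidate, step):
    # Hypothetical refinement operator (stand-in for the LLM proposer).
    return candidate + [step]

def reward(candidate):
    # Toy reward: counts even-numbered "optimizations" in the program.
    return sum(1.0 for x in candidate if x % 2 == 0)

def max_reward_search(init, steps=5):
    cur, best, best_r = init, init, reward(init)
    for s in range(steps):
        cur = propose(cur, s)
        r = reward(cur)
        if r > best_r:
            best, best_r = cur, r
    return best, best_r

best, best_r = max_reward_search([])
print(best, best_r)
```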
zh
[NLP-59] Towards Valid Student Simulation with Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在教育场景中用于模拟学生时所面临的“能力悖论”(competence paradox)问题,即LLM因其广泛的知识能力反而难以真实模拟具备部分知识的学习者的认知状态与错误模式,导致生成的学生行为缺乏真实性。解决方案的关键在于将学生模拟重构为一个受约束的生成问题,通过显式的知识状态规范(Epistemic State Specification, ESS)来定义模拟学习者可访问的信息、错误结构及认知状态演化机制,从而实现对学习者知识状态的精准建模,强调以认知真实性(epistemic fidelity)而非表面行为相似性作为构建可靠教学模拟系统的核心前提。
链接: https://arxiv.org/abs/2601.05473
作者: Zhihao Yuan,Yunze Xiao,Ming Li,Weihao Xuan,Richard Tong,Mona Diab,Tom Mitchell
机构: Carnegie Mellon University (卡内基梅隆大学); University of Maryland (马里兰大学); The University of Tokyo (东京大学); NEOLAF
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper presents a conceptual and methodological framework for large language model (LLM) based student simulation in educational settings. The authors identify a core failure mode, termed the “competence paradox” in which broadly capable LLMs are asked to emulate partially knowledgeable learners, leading to unrealistic error patterns and learning dynamics. To address this, the paper reframes student simulation as a constrained generation problem governed by an explicit Epistemic State Specification (ESS), which defines what a simulated learner can access, how errors are structured, and how learner state evolves over time. The work further introduces a Goal-by-Environment framework to situate simulated student systems according to behavioral objectives and deployment contexts. Rather than proposing a new system or benchmark, the paper synthesizes prior literature, formalizes key design dimensions, and articulates open challenges related to validity, evaluation, and ethical risks. Overall, the paper argues for epistemic fidelity over surface realism as a prerequisite for using LLM-based simulated students as reliable scientific and pedagogical instruments.
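示例(示意):把学生模拟视为受 ESS 约束的生成问题,其最直观的形式是"仅当问题所需事实全部落在模拟学习者的知识集合内才答对,否则输出典型错误";以下知识集合与答案均为虚构,完整的 ESS 规范还包含错误结构与状态随时间的演化。

```python
# Minimal constrained-generation sketch: the simulated student may only
# use facts inside its Epistemic State Specification (ESS).

def simulate_student(required_facts, ess_known, correct, typical_error):
    """Answer correctly only if every required fact is in the learner's
    knowledge state; otherwise produce a characteristic error instead of
    falling back to the base model's full competence."""
    if all(f in ess_known for f in required_facts):
        return correct
    return typical_error

ess = {"fraction addition", "common denominator"}
ans_known = simulate_student({"fraction addition"}, ess, "7/12", "3/7")
ans_missing = simulate_student(
    {"fraction addition", "negative numbers"}, ess, "-1/12", "1/12")
print(ans_known, ans_missing)
```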
zh
[NLP-60] ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction
【速读】: 该论文针对视觉丰富文档理解(Visually-rich Document Understanding, VrDU)中多模态Transformer存在的两大固有局限展开研究:一是缺乏对逻辑阅读顺序的显式建模,二是视觉token干扰导致注意力分散,削弱了文本语义的捕捉能力。解决方案的关键在于提出ROAP(Reading-Order and Attention-Prior Pipeline),其核心由三个模块构成:首先使用自适应XY间隙树(Adaptive-XY-Gap, AXG-Tree)鲁棒提取复杂布局中的层次化阅读序列;其次通过阅读顺序感知的相对位置偏置(Reading-Order-Aware Relative Position Bias, RO-RPB)将该序列融入注意力机制;最后引入文本token子块注意力先验(Textual-Token Sub-block Attention Prior, TT-Prior)以自适应抑制视觉噪声并增强细粒度文本间交互。实验表明,ROAP在FUNSD和CORD基准上显著提升LayoutLMv3与GeoLayoutLM等主流模型性能,验证了显式建模阅读逻辑与调节模态干扰对提升文档理解鲁棒性的重要性。
链接: https://arxiv.org/abs/2601.05470
作者: Tingwei Xie,Jinxin He,Yonghong Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages, 4 figures, 4 tables
Abstract:The efficacy of Multimodal Transformers in visually-rich document understanding (VrDU) is critically constrained by two inherent limitations: the lack of explicit modeling for logical reading order and the interference of visual tokens that dilutes attention on textual semantics. To address these challenges, this paper presents ROAP, a lightweight and architecture-agnostic pipeline designed to optimize attention distributions in Layout Transformers without altering their pre-trained backbones. The proposed pipeline first employs an Adaptive-XY-Gap (AXG-Tree) to robustly extract hierarchical reading sequences from complex layouts. These sequences are then integrated into the attention mechanism via a Reading-Order-Aware Relative Position Bias (RO-RPB). Furthermore, a Textual-Token Sub-block Attention Prior (TT-Prior) is introduced to adaptively suppress visual noise and enhance fine-grained text-text interactions. Extensive experiments on the FUNSD and CORD benchmarks demonstrate that ROAP consistently improves the performance of representative backbones, including LayoutLMv3 and GeoLayoutLM. These findings confirm that explicitly modeling reading logic and regulating modality interference are critical for robust document understanding, offering a scalable solution for complex layout analysis. The implementation code will be released at this https URL. 
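示例(示意):AXG-Tree 的底层思想是递归 XY 切分,即在包围盒集合中先找足够宽的横向空白带切分,再尝试纵向,递归到不可再分后按行优先输出阅读顺序;下面是固定间隙阈值的简化版,论文中阈值是按区域自适应的。

```python
def xy_cut(boxes, min_gap=10):
    """Simplified recursive XY-cut over (x0, y0, x1, y1) boxes: split at a
    horizontal whitespace gap first, then a vertical one; leaves come out
    in top-to-bottom, left-to-right reading order."""
    if len(boxes) <= 1:
        return boxes
    for axis in (1, 0):          # try splitting on y first, then on x
        spans = sorted((b[axis], b[axis + 2]) for b in boxes)
        cut, end = None, spans[0][1]
        for lo, hi in spans[1:]:
            if lo - end >= min_gap:
                cut = end        # whitespace band found: cut after `end`
                break
            end = max(end, hi)
        if cut is not None:
            first = [b for b in boxes if b[axis] <= cut]
            rest = [b for b in boxes if b[axis] > cut]
            return xy_cut(first, min_gap) + xy_cut(rest, min_gap)
    return sorted(boxes, key=lambda b: (b[1], b[0]))

# Two text lines with two words each; expected order: row by row.
boxes = [(0, 0, 40, 10), (50, 0, 90, 10), (0, 30, 40, 40), (50, 30, 90, 40)]
order = xy_cut(boxes)
print(order)
```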
zh
[NLP-61] Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction AACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言(如韩语)中推理能力不足的问题,尤其是相较于英语等高资源语言的性能差距。研究表明,单纯依赖强化学习(Reinforcement Learning, RL)难以显著提升韩语推理能力,其关键在于模型内部推理机制与韩语输入之间的对齐。解决方案的核心是通过微调策略,特别是调整早期层中与韩语特定推理相关的神经元(Korean-specific neurons),实现模型内部推理过程与目标语言输入的有效对齐。为此,作者构建了一个自纠正式代码切换数据集(self-correction code-switching dataset),从而显著提升了数学推理和自我修正任务的表现。最终结论指出,多语言推理增强的关键不在于注入新的语言知识,而在于有效激发并校准模型已有的推理能力。
链接: https://arxiv.org/abs/2601.05459
作者: Hongjin Kim,Jaewook Lee,Kiyoung Lee,Jong-hun Shin,Soojong Lim,Oh-Woog Kwon
机构: ETRI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: IJCNLP-AACL 2025 (Main), Outstanding Paper Award
Abstract:Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model’s internal reasoning processes with Korean inputs-particularly by tuning Korean-specific neurons in early layers-is key to unlocking RL’s effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.
zh
[NLP-62] RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models
【速读】: 该论文旨在解决文本到SQL(text-to-SQL)系统训练数据稀缺问题,尤其是现有合成数据方法在可靠性和可扩展性之间存在权衡:模板法虽能保证SQL正确性但依赖特定模式的模板,而大语言模型(LLM)生成虽具可扩展性却缺乏质量与正确性保障。解决方案的关键在于提出RingSQL框架,该框架融合了与模式无关的查询模板与基于LLM的自然语言问题重述技术,从而在保持SQL语法正确性的同时实现广泛的语义多样性,显著提升了模型在多个基准测试上的准确率(平均提升2.3%)。
链接: https://arxiv.org/abs/2601.05451
作者: Marko Sterbentz,Kevin Cushing,Cameron Barrie,Kristian J. Hammond
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Recent advances in text-to-SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high-quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template-based approaches ensure correct SQL but require schema-specific templates, while LLM-based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data. We make our code available at this https URL.
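示例(示意):"与模式无关的查询模板"指 SQL 与问题文本中预留 {table}、{col} 等槽位,可在任意数据库模式上批量实例化;以下模板与表结构均为虚构,真实系统还会再用 LLM 对问题改写以增加语言多样性。

```python
# Instantiate one schema-independent template against a toy schema.

TEMPLATE = {
    "sql": "SELECT {col}, COUNT(*) FROM {table} GROUP BY {col}",
    "question": "How many rows are there for each {col} in {table}?",
}

def instantiate(template, schema):
    pairs = []
    for table, cols in schema.items():
        for col in cols:
            pairs.append({
                "sql": template["sql"].format(table=table, col=col),
                "question": template["question"].format(table=table, col=col),
            })
    return pairs

schema = {"orders": ["status", "region"]}
pairs = instantiate(TEMPLATE, schema)
print(pairs[0]["sql"])
```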
zh
[NLP-63] Tracing Moral Foundations in Large Language Models
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)生成的人类类似道德判断是否反映了其内部结构化的概念表征,还是仅仅表现为表面的“道德模仿”(moral mimicry)。为解答这一问题,研究采用道德基础理论(Moral Foundations Theory, MFT)作为分析框架,提出了一种多层级方法:首先通过层级分析揭示道德基础在模型中的编码与组织方式;其次利用预训练稀疏自编码器(sparse autoencoders, SAEs)识别残差流中支持道德概念的稀疏特征;最后通过因果干预(causal steering)实验验证这些内部表示与道德输出之间的因果关系。解决方案的关键在于结合密集向量和稀疏特征的双重分析手段,发现道德基础在模型中以分层、分布且部分解耦的方式存在,并能通过干预引导行为变化,从而提供了机制性证据表明道德结构可从语言统计规律中自然涌现。
链接: https://arxiv.org/abs/2601.05437
作者: Chenxiao Yu,Bowen Yi,Farzan Karimi-Malekabadi,Suhaib Abdurahman,Jinyi Ye,Shrikanth Narayanan,Yue Zhao,Morteza Dehghani
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial “moral mimicry.” Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed within two instruction-tuned LLMs: Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that both models represent and distinguish moral foundations in a structured, layer-dependent way that aligns with human judgments. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.
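示例(示意):论文的因果干预在机制上就是把缩放后的概念向量加到某层残差流激活上,即 h' = h + alpha * v;下面用普通列表代替真实激活演示,"care" 方向向量为虚构。

```python
# Activation steering sketch: add a scaled concept direction to a
# residual-stream activation at one chosen layer.

def steer(hidden, direction, alpha):
    return [h + alpha * v for h, v in zip(hidden, direction)]

hidden = [0.5, -1.0, 2.0]
care_direction = [1.0, 0.0, -1.0]   # hypothetical MFT "care" vector
steered = steer(hidden, care_direction, alpha=2.0)
print(steered)
```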
zh
[NLP-64] Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
【速读】: 该论文旨在解决图像地理定位(image geolocalization)任务中现有大型视觉语言模型(LVLM)忽视人类常用策略——即利用地图进行空间推理的问题。其核心解决方案是引入“思维与地图”(Thinking with Map)能力,并将其形式化为一个“地图中的代理循环”(agent-in-the-map loop),通过两阶段优化机制实现:首先采用代理强化学习(agentic reinforcement learning, RL)增强模型的采样效率,随后应用并行测试时扩展(parallel test-time scaling, TTS)让模型在最终预测前探索多个候选路径,从而显著提升定位精度。实验表明,该方法在真实世界图像上的基准测试(MAPBench)中优于多数开源和闭源模型,尤其将500米内准确率(Acc@500m)从8.0%提升至22.1%(对比Gemini-3-Pro带谷歌搜索/地图引导模式)。
链接: https://arxiv.org/abs/2601.05432
作者: Yuxiang Ji,Yong Wang,Ziyu Ma,Yiming Hu,Hailang Huang,Xuecai Hu,Guanhua Chen,Liaoni Wu,Xiangxiang Chu
机构: Xiamen University (厦门大学); AMAP, Alibaba Group (阿里巴巴集团); Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans – using maps. In this work, we first equip the model with the “Thinking with Map” ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
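示例(示意):并行测试时扩展会让多条候选推理路径各给出一个 (lat, lon) 预测,最后需要聚合;下面用"中心点(medoid)"规则示意如何剔除离群路径。该聚合规则是本文为演示所作的假设,未必是论文的具体做法。

```python
# Pick the candidate closest (in total squared distance) to all others,
# which discards isolated outlier reasoning paths.

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def consensus(candidates):
    return min(candidates, key=lambda c: sum(dist2(c, o) for o in candidates))

# Two paths agree on Paris; one outlier path lands in New York (made up).
guesses = [(48.85, 2.35), (48.86, 2.34), (40.71, -74.01)]
pick = consensus(guesses)
print(pick)
```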
zh
[NLP-65] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在生成任务中缺乏对指定概率分布进行忠实采样的能力问题,这一能力对于教育评估、合成数据构建等需统计保障的应用场景至关重要。解决方案的关键在于设计了一种双协议审计框架:一是批量生成(Batch Generation),即单次请求生成N=1000个样本;二是独立请求(Independent Requests),即通过N=1000次无状态调用分别生成样本。该设计揭示了两种协议下采样性能的显著不对称性——批量生成仅获得13%的中位通过率,而独立请求几乎全部失败(10/11模型无一通过),并进一步表明采样精度随分布复杂度和采样规模N增加而单调下降,从而证明现有LLMs缺乏功能性的内部采样机制,必须依赖外部工具以实现统计可验证的采样输出。
链接: https://arxiv.org/abs/2601.05414
作者: Minda Zhao,Yilun Du,Mengyu Wang
机构: Harvard University (哈佛大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines across domains like educational assessment and synthetic data construction, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces N=1000 samples within one response, and Independent Requests, comprising N=1000 stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the requested sampling horizon N increases. Finally, we demonstrate the propagation of these failures into downstream tasks: models fail to enforce uniform answer-position constraints in MCQ generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.
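示例(示意):论文的统计审计本质上是对模型产生的样本做拟合优度检验;以下手写针对均匀分布的皮尔逊卡方统计量,与 k-1 自由度的临界值比较即可判定通过与否(k=6、p=0.05 时约为 11.07)。

```python
from collections import Counter

def chi_square_uniform(samples, k):
    """Pearson chi-square statistic of integer samples in 1..k against the
    uniform distribution; compare with the chi-square critical value at
    k-1 degrees of freedom to accept or reject uniformity."""
    n = len(samples)
    expected = n / k
    counts = Counter(samples)
    return sum((counts.get(i, 0) - expected) ** 2 / expected
               for i in range(1, k + 1))

fair = [1, 2, 3, 4, 5, 6] * 100        # perfectly balanced die rolls
skewed = [1] * 500 + [2] * 100         # the mode-collapse the audit observes
stat_fair = chi_square_uniform(fair, 6)
stat_skewed = chi_square_uniform(skewed, 6)
print(stat_fair, stat_skewed)
```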
[NLP-66] Glitter: Visualizing Lexical Surprisal for Readability in Administrative Texts
【Quick Read】: This paper investigates how measuring the information entropy of a text can be used to estimate its readability, with a focus on improving the readability and clarity of administrative or bureaucratic texts. The key contribution is a visualization framework that approximates the information entropy of text using multiple language models and presents the result visually, providing a quantitative basis for improving how such texts are written.
Link: https://arxiv.org/abs/2601.05411
Authors: Jan Černý, Ivana Kvapilíková, Silvie Cinková
Affiliations: Charles University, Faculty of Mathematics and Physics
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This work investigates how measuring information entropy of text can be used to estimate its readability. We propose a visualization framework that can be used to approximate information entropy of text using multiple language models and visualize the result. The end goal is to use this method to estimate and improve readability and clarity of administrative or bureaucratic texts. Our toolset is available as a libre software on this https URL.
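Lexical surprisal of the kind Glitter visualizes is just the negative log-probability a language model assigns to each word in context. A minimal sketch with invented per-word probabilities standing in for real model output:

```python
import math

def surprisal_bits(prob):
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(prob)

# Hypothetical per-word probabilities from some language model.
word_probs = {"the": 0.5, "applicant": 0.01, "heretofore": 0.0001}

# High-surprisal words are candidates for rewording in administrative text.
scores = {w: surprisal_bits(p) for w, p in word_probs.items()}
hardest = max(scores, key=scores.get)
```

A visualization layer would then map each word's score to a color scale, flagging spans like "heretofore" for plain-language revision.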
[NLP-67] Same Claim Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
【Quick Read】: This paper addresses the behavioral biases that large language models (LLMs) exhibit in the complex, high-stakes task of multilingual financial misinformation detection (MFMD). Existing work mostly studies simple settings or direct questioning, and lacks a systematic evaluation of how social factors in realistic financial environments (such as role, region, ethnicity, and religious belief) interact with bias. The key contribution is MFMDScen, a comprehensive benchmark that combines three types of complex economic scenarios designed with financial experts (role and personality; role and region; role combined with ethnicity and religion) with a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali, enabling a systematic, quantitative evaluation of the behavioral biases of 22 mainstream LLMs in diverse real-world financial contexts.
Link: https://arxiv.org/abs/2601.05403
Authors: Zhiwei Liu, Yupen Cao, Yuechen Jiang, Mohsinul Kabir, Polydoros Giannouris, Chen Xu, Ziyang Xu, Tianlei Zhu, Tariquzzaman Faisal, Triantafillos Papadopoulos, Yan Wang, Lingfei Qian, Xueqing Peng, Zhuohan Xie, Ye Yuan, Saeed Almheiri, Abdulrazzaq Alnajjar, Mingbin Chen, Harry Stuart, Paul Thompson, Prayag Tiwari, Alejandro Lopez-Lira, Xue Liu, Jimin Huang, Sophia Ananiadou
Affiliations: The University of Manchester; Stevens Institute of Technology; The FinAI; Columbia University; Islamic University of Technology; Athens University of Economics and Business; Archimedes, Athena Research Center; MBZUAI; McGill University; Mila - Quebec AI Institute; Dubai Police; University of Melbourne; Halmstad University; University of Florida; ELLIS Manchester
Subjects: Computation and Language (cs.CL)
Comments: Work in progress
Abstract:Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of the complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection tasks (MFMD). In this work, we propose MFMDScen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMDScen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project will be available at this https URL.
[NLP-68] Conformity and Social Impact on AI Agents
【Quick Read】: This paper studies conformity, the tendency to align with group opinions under social pressure, in AI agents operating in multi-agent environments, along with the safety risks and decision biases this behavior may introduce. The key contribution is adapting classic visual experiments from social psychology to show that large multimodal language models acting as AI agents exhibit a systematic, predictable conformity bias consistent with Social Impact Theory. The study finds that agents achieving near-perfect performance in isolation become highly susceptible to manipulation under group influence, especially on tasks near their competence boundary. These findings expose risks of malicious manipulation, misinformation, and bias propagation in current multi-agent systems and underscore the urgency of building safeguards for collective AI deployments.
Link: https://arxiv.org/abs/2601.05384
Authors: Alessandro Bellina, Giordano De Marzo, David Garcia
Affiliations: Centro Ricerche Enrico Fermi; Sony Computer Science Laboratories; Sapienza University of Rome; University of Konstanz; Complexity Science Hub
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:As AI agents increasingly operate in multi-agent environments, understanding their collective behavior becomes critical for predicting the dynamics of artificial societies. This study examines conformity, the tendency to align with group opinions under social pressure, in large multimodal language models functioning as AI agents. By adapting classic visual experiments from social psychology, we investigate how AI agents respond to group influence as social actors. Our experiments reveal that AI agents exhibit a systematic conformity bias, aligned with Social Impact Theory, showing sensitivity to group size, unanimity, task difficulty, and source characteristics. Critically, AI agents achieving near-perfect performance in isolation become highly susceptible to manipulation through social influence. This vulnerability persists across model scales: while larger models show reduced conformity on simple tasks due to improved capabilities, they remain vulnerable when operating at their competence boundary. These findings reveal fundamental security vulnerabilities in AI agent decision-making that could enable malicious manipulation, misinformation campaigns, and bias propagation in multi-agent systems, highlighting the urgent need for safeguards in collective AI deployments.
[NLP-69] The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models
【Quick Read】: This paper examines how persona-based behavioral control of generative AI affects the quality and safety of clinical decision-making, in particular how professional roles (e.g., Emergency Department physician, nurse) and interaction styles (bold vs. cautious) influence model behavior non-monotonically across medical tasks. The key contribution is a multidimensional evaluation framework that systematically measures accuracy, calibration, and risk behavior on clinical triage and patient-safety tasks. It reveals that medical persona prompting does not uniformly improve performance or safety but introduces context-dependent trade-offs: gains of around 20% in accuracy and calibration on critical-care tasks, comparable degradation in primary-care settings, and interaction styles that modulate risk propensity in a strongly model-dependent way. The results show that personas act as behavioral priors that guarantee neither expertise nor safety and should be applied with caution in high-stakes medical settings.
Link: https://arxiv.org/abs/2601.05376
Authors: Tassallah Abdullahi, Shrestha Ghosh, Hamish S Fraser, Daniel León Tramontini, Adeel Abbasi, Ghada Bourjeily, Carsten Eickhoff, Ritambhara Singh
Affiliations: Brown University; University of Tuebingen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Persona conditioning can be viewed as a behavioral prior for large language models (LLMs) and is often assumed to confer expertise and improve safety in a monotonic manner. However, its effects on high-stakes clinical decision-making remain poorly characterized. We systematically evaluate persona-based control in clinical LLMs, examining how professional roles (e.g., Emergency Department physician, nurse) and interaction styles (bold vs. cautious) influence behavior across models and medical tasks. We assess performance on clinical triage and patient-safety tasks using multidimensional evaluations that capture task accuracy, calibration, and safety-relevant risk behavior. We find systematic, context-dependent, and non-monotonic effects: medical personas improve performance in critical care tasks, yielding gains of up to ~+20% in accuracy and calibration, but degrade performance in primary-care settings by comparable margins. Interaction style modulates risk propensity and sensitivity, but the effect is highly model-dependent. While aggregated LLM-judge rankings favor medical over non-medical personas in safety-critical cases, we found that human clinicians show moderate agreement on safety compliance (average Cohen's κ = 0.43) but indicate low confidence in 95.9% of their responses on reasoning quality. Our work shows that personas function as behavioral priors that introduce context-dependent trade-offs rather than guarantees of safety or expertise. The code is available at this https URL_Paradox.
[NLP-70] Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models
【Quick Read】: This paper addresses the robustness of tool calling in large language models (LLMs) under multilingual user interactions, in particular the marked degradation when users interact in non-English languages (Chinese, Hindi, and the low-resource language Igbo). The key finding is that although models often understand the user's intent and select the correct tool, many failures stem from parameter value language mismatch: the model generates parameter values that are semantically appropriate in the user's language but violate the language-invariant conventions of the execution environment. The paper then evaluates inference-time system strategies that constrain parameter values to a single language (typically English), which substantially reduce language-induced execution errors and improve cross-lingual tool-calling stability, though none of them fully recovers English-level performance.
Link: https://arxiv.org/abs/2601.05366
Authors: Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu
Affiliations: University of Southern California; Arizona State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user’s language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them can fully recover English-level performance.
[NLP-71] The Table of Media Bias Elements: A sentence-level taxonomy of media bias types and propaganda techniques
【Quick Read】: This paper addresses the problem that public debates about "left-wing" or "right-wing" news overlook the fact that bias is usually conveyed through concrete linguistic strategies that transcend any single political spectrum, leaving perceptions of media bias vague. The key contribution is shifting the focus from where an outlet allegedly stands to how partiality is expressed in individual sentences: by combining fine-grained corpus analysis, interdisciplinary theory, and pilot annotation, the authors derive a two-tier taxonomy of 38 elementary bias types (a "table of media bias elements"), supplying for each type a definition, real-world examples, cognitive and societal drivers, and guidance for recognition, thereby enabling precise characterization and quantitative assessment of media bias.
Link: https://arxiv.org/abs/2601.05358
Authors: Tim Menzner, Jochen L. Leidner
Affiliations: Coburg University of Applied Sciences and Arts; University of Sheffield
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Public debates about “left-” or “right-wing” news overlook the fact that bias is usually conveyed by concrete linguistic manoeuvres that transcend any single political spectrum. We therefore shift the focus from where an outlet allegedly stands to how partiality is expressed in individual sentences. Drawing on 26,464 sentences collected from newsroom corpora, user submissions and our own browsing, we iteratively combine close-reading, interdisciplinary theory and pilot annotation to derive a fine-grained, sentence-level taxonomy of media bias and propaganda. The result is a two-tier schema comprising 38 elementary bias types, arranged in six functional families and visualised as a “table of media-bias elements”. For each type we supply a definition, real-world examples, cognitive and societal drivers, and guidance for recognition. A quantitative survey of a random 155-sentence sample illustrates prevalence differences, while a cross-walk to the best-known NLP and communication-science taxonomies reveals substantial coverage gains and reduced ambiguity.
[NLP-72] TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning
【Quick Read】: This paper addresses three problems in reasoning-oriented large language models: wasted computation, poor auditability, and lack of temporal awareness. Existing models expose reasoning as long, turn-global "thinking" traces, which is costly, makes it hard for users to trace which claims a decision rests on, and cannot re-trigger or interrupt reasoning once the model begins its reply; dialogue models also ignore temporal structure, treating interactions after any time gap identically. The key contribution is TIME (Temporally Intelligent Meta-reasoning Engine), a behavioral alignment framework that dynamically schedules explicit reasoning based on discourse and temporal cues: it introduces ISO 8601 time tags, tick turns representing silent gaps, and short think blocks that can appear anywhere in a reply, treating reasoning as a context-sensitive resource. A four-phase curriculum trains Qwen3 dense models to maintain reasoning quality while cutting reasoning tokens by about an order of magnitude, yielding efficient, controllable, temporally aware conversational reasoning.
Link: https://arxiv.org/abs/2601.05300
Authors: Susmit Das
Affiliations: The Coherence Initiative
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 3 figures with 27 page appendix. See this https URL and this https URL for associated code
Abstract:Reasoning oriented large language models often expose explicit “thinking” as long, turn-global traces at the start of every response, either always on or toggled externally at inference time. While useful for arithmetic, programming, and problem solving, this design is costly, blurs claim level auditability, and cannot re-trigger explicit reasoning once the model begins presenting. Dialogue models are also largely blind to temporal structure, treating replies after seconds and replies after weeks as equivalent unless time is stated in text. We introduce TIME, the Temporally Intelligent Meta-reasoning Engine, a behavioral alignment framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601 time tags, tick turns that represent silent gaps, and short think blocks that can appear anywhere in a reply. A four-phase curriculum including a small, maximally diverse full-batch alignment step trains Qwen3 dense models to invoke brief, in-place reasoning bursts and keep user facing text compact. We evaluate with TIMEBench, a temporally grounded dialogue benchmark probing chronology, commonsense under gaps and offsets, anomaly detection, and continuity. Across 4B to 32B scales, TIME improves TIMEBench scores over base Qwen3 in both thinking and no-thinking modes while reducing reasoning tokens by about an order of magnitude. Our training data and code are available at this https URL and TIMEBench is available at this https URL
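The tick-turn mechanism can be pictured as a preprocessing pass that converts timestamp gaps into explicit silence markers before the model sees the dialogue. A sketch under assumptions: the tag layout, tick format, and one-hour granularity below are illustrative, not the paper's serialization:

```python
from datetime import datetime

TICK_SECONDS = 3600  # emit a tick marker for any silent gap of an hour or more

def annotate(turns):
    """Insert tick turns for silent gaps between ISO 8601 timestamped turns."""
    out, prev = [], None
    for ts, text in turns:
        t = datetime.fromisoformat(ts)
        if prev is not None:
            gap = int((t - prev).total_seconds())
            if gap >= TICK_SECONDS:
                out.append(f"<tick gap={gap}s>")
        out.append(f"[{ts}] {text}")
        prev = t
    return out

dialogue = [
    ("2026-01-10T09:00:00+00:00", "Remind me to submit the report."),
    ("2026-01-12T09:00:00+00:00", "Did I ever submit it?"),
]
annotated = annotate(dialogue)
```

The two-day gap becomes an explicit token the model can condition on, which is what lets it answer temporally grounded questions like the second turn.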
[NLP-73] Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings
【Quick Read】: This paper addresses the loss of semantic information in payment-network transaction data: traditional foundation models handle merchant category fields with index-based representations, converting rich textual information into discrete tokens and discarding much of its meaning. The key contribution is a hybrid framework that uses embeddings generated by large language models (LLMs) as semantic initializations for lightweight transaction models, balancing interpretability with operational efficiency. The framework further enriches merchant category fields through multi-source data fusion, adopts a one-word constraint principle to keep embedding generation consistent across LLM architectures, and systematically improves data quality through noise filtering and context-aware enrichment.
Link: https://arxiv.org/abs/2601.05271
Authors: Xiran Fan, Zhimeng Jiang, Chin-Chia Michael Yeh, Yuzhong Chen, Yingtong Dou, Menghai Pan, Yan Zheng
Affiliations: Visa Research
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:The ubiquity of payment networks generates vast transactional data encoding rich consumer and merchant behavioral patterns. Recent foundation models for transaction analysis process tabular data sequentially but rely on index-based representations for categorical merchant fields, causing substantial semantic information loss by converting rich textual data into discrete tokens. While Large Language Models (LLMs) can address this limitation through superior semantic understanding, their computational overhead challenges real-time financial deployment. We introduce a hybrid framework that uses LLM-generated embeddings as semantic initializations for lightweight transaction models, balancing interpretability with operational efficiency. Our approach employs multi-source data fusion to enrich merchant categorical fields and a one-word constraint principle for consistent embedding generation across LLM architectures. We systematically address data quality through noise filtering and context-aware enrichment. Experiments on large-scale transaction datasets demonstrate significant performance improvements across multiple transaction understanding tasks.
[NLP-74] Transforming User Defined Criteria into Explainable Indicators with an Integrated LLM AHP System
【Quick Read】: This paper addresses the persistent challenge of converting user-defined criteria into quantitative, explainable indicators when evaluating complex texts across domains. Single-prompt LLM evaluation suffers from complexity and latency, while criterion-specific decomposition approaches rely on naive averaging or opaque black-box aggregation, limiting interpretability. The key contribution is an interpretable aggregation framework that combines LLM scoring with the Analytic Hierarchy Process (AHP): an LLM-as-judge produces per-criterion scores, Jensen-Shannon distance measures each criterion's discriminative power, and AHP pairwise comparison matrices derive statistically grounded weights. The framework achieves high explainability and operational efficiency while maintaining predictive performance, making it suitable for latency-sensitive online services.
Link: https://arxiv.org/abs/2601.05267
Authors: Geonwoo Bang, Dongho Kim, Moohong Min
Affiliations: Sungkyunkwan University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Evaluating complex texts across domains requires converting user defined criteria into quantitative, explainable indicators, which is a persistent challenge in search and recommendation systems. Single prompt LLM evaluations suffer from complexity and latency issues, while criterion specific decomposition approaches rely on naive averaging or opaque black-box aggregation methods. We present an interpretable aggregation framework combining LLM scoring with the Analytic Hierarchy Process. Our method generates criterion specific scores via LLM as judge, measures discriminative power using Jensen Shannon distance, and derives statistically grounded weights through AHP pairwise comparison matrices. Experiments on Amazon review quality assessment and depression related text scoring demonstrate that our approach achieves high explainability and operational efficiency while maintaining comparable predictive power, making it suitable for real time latency sensitive web services.
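The two numerical ingredients named in the abstract, Jensen-Shannon distance for discriminative power and the principal eigenvector of an AHP pairwise comparison matrix for weights, can be sketched directly; the 3-criterion Saaty-style matrix below is an invented example:

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (square root of JS divergence, base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def ahp_weights(M, iters=100):
    """Approximate the principal eigenvector of a pairwise comparison matrix
    by power iteration, normalized to sum to 1."""
    n = len(M)
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(w)
        w = [x / s for x in w]
    return w

# Invented 3-criterion Saaty-style matrix: criterion 0 is moderately more
# important than 1, strongly more important than 2, etc.
M = [[1, 3, 5],
     [1 / 3, 1, 3],
     [1 / 5, 1 / 3, 1]]
weights = ahp_weights(M)

# Identical score distributions carry no discriminative signal:
assert js_distance([0.5, 0.5], [0.5, 0.5]) == 0.0
```

Criteria whose score distributions barely separate good from bad items (low JS distance) would then be down-weighted in the pairwise comparisons.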
[NLP-75] Retrieval-Augmented Multi-LLM Ensemble for Industrial Part Specification Extraction
【Quick Read】: This paper tackles the long-standing difficulty of extracting part specifications from unstructured text in industrial settings, where manual processing for manufacturing, procurement, and maintenance is slow and error-prone. The key contribution is RAGsemble, a retrieval-augmented multi-LLM ensemble framework with a structured three-phase pipeline: parallel extraction by nine state-of-the-art LLMs, targeted research augmentation using high-performing models, and intelligent synthesis with conflict resolution and confidence-aware scoring. The system integrates FAISS-based semantic retrieval throughout the pipeline, grounding outputs in structured part databases in real time for validation, refinement, and enrichment, and it significantly outperforms single-model baselines in accuracy, technical completeness, and structured output quality.
Link: https://arxiv.org/abs/2601.05266
Authors: Muzakkiruddin Ahmed Mohammed, John R. Talburt, Leon Claasssens, Adriaan Marais
Affiliations: University of Arkansas - Little Rock; PiLog Group
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The 17th International Conference on Knowledge and Systems Engineering
Abstract:Industrial part specification extraction from unstructured text remains a persistent challenge in manufacturing, procurement, and maintenance, where manual processing is both time-consuming and error-prone. This paper introduces a retrieval-augmented multi-LLM ensemble framework that orchestrates nine state-of-the-art Large Language Models (LLMs) within a structured three-phase pipeline. RAGsemble addresses key limitations of single-model systems by combining the complementary strengths of model families including Gemini (2.0, 2.5, 1.5), OpenAI (GPT-4o, o4-mini), Mistral Large, and Gemma (1B, 4B, 3n-e4b), while grounding outputs in factual data using FAISS-based semantic retrieval. The system architecture consists of three stages: (1) parallel extraction by diverse LLMs, (2) targeted research augmentation leveraging high-performing models, and (3) intelligent synthesis with conflict resolution and confidence-aware scoring. RAG integration provides real-time access to structured part databases, enabling the system to validate, refine, and enrich outputs through similarity-based reference retrieval. Experimental results using real industrial datasets demonstrate significant gains in extraction accuracy, technical completeness, and structured output quality compared to leading single-LLM baselines. Key contributions include a scalable ensemble architecture for industrial domains, seamless RAG integration throughout the pipeline, comprehensive quality assessment mechanisms, and a production-ready solution suitable for deployment in knowledge-intensive manufacturing environments.
[NLP-76] Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation
【Quick Read】: This paper addresses the knowledge fragmentation problem in retrieval-augmented generation (RAG): complex queries often need information scattered across multiple documents, but conventional chunking partitions each document in isolation and serves multi-hop reasoning poorly. The key contribution is Cross-Document Topic-Aligned (CDTA) chunking, which reconstructs knowledge at the corpus level: it first identifies topics shared across documents, maps segments of each document to those topics, and synthesizes topic-aligned segments into semantically unified cross-document chunks. Experiments show CDTA reaches 0.93 faithfulness on HotpotQA multi-hop reasoning, clearly above contextual retrieval (0.83) and semantic chunking (0.78), and achieves 0.94 faithfulness with 0.93 citation accuracy on UAE legal texts, validating cross-document synthesis for distributed-knowledge scenarios.
Link: https://arxiv.org/abs/2601.05265
Authors: Mile Stankovic
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Chunking quality determines RAG system performance. Current methods partition documents individually, but complex queries need information scattered across multiple sources: the knowledge fragmentation problem. We introduce Cross-Document Topic-Aligned (CDTA) chunking, which reconstructs knowledge at the corpus level. It first identifies topics across documents, maps segments to each topic, and synthesizes them into unified chunks. On HotpotQA multi-hop reasoning, our method reached 0.93 faithfulness versus 0.83 for contextual retrieval and 0.78 for semantic chunking, a 12% improvement over current industry best practice (p < 0.05). On UAE Legal texts, it reached 0.94 faithfulness with 0.93 citation accuracy. At k = 3, it maintains 0.91 faithfulness while semantic methods drop to 0.68, with a single CDTA chunk containing information requiring multiple traditional fragments. Indexing costs are higher, but synthesis produces information-dense chunks that reduce query-time retrieval needs. For high-query-volume applications with distributed knowledge, cross-document synthesis improves measurably over within-document optimization.
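The corpus-level pipeline (identify shared topics, map segments to topics, synthesize per-topic chunks) can be sketched as a grouping step. Real CDTA would use learned topic identification; the keyword matching and sample documents here are purely illustrative:

```python
from collections import defaultdict

def cdta_chunks(docs, topics):
    """Group segments from all documents by topic, then merge per topic."""
    by_topic = defaultdict(list)
    for doc_id, segments in docs.items():
        for seg in segments:
            for topic, keywords in topics.items():
                if any(k in seg.lower() for k in keywords):
                    by_topic[topic].append((doc_id, seg))
    # One unified cross-document chunk per topic, with source attribution.
    return {t: " ".join(f"[{d}] {s}" for d, s in segs)
            for t, segs in by_topic.items()}

docs = {
    "contract_law.txt": ["A lease requires written consent.",
                         "Visa rules differ by emirate."],
    "immigration.txt": ["Residency visa renewal takes 10 days."],
}
topics = {"visas": ["visa", "residency"], "leases": ["lease", "consent"]}
chunks = cdta_chunks(docs, topics)
```

A query about visas now hits one chunk that already spans both source files, instead of requiring the retriever to assemble fragments from each document separately.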
[NLP-77] LLM2IR: simple unsupervised contrastive learning makes long-context LLM great retriever
【Quick Read】: This paper addresses the reliance of modern dense information retrieval (IR) models on costly large-scale pretraining. The core contribution is LLM2IR, an efficient unsupervised contrastive learning framework that converts any decoder-only large language model (LLM) into an information retrieval model. The key is that unsupervised contrastive learning, with negatives drawn from natural text contexts and no additional labeled data, effectively elicits the LLM's retrieval ability, enabling the transfer from general-purpose language model to effective retriever.
Link: https://arxiv.org/abs/2601.05262
Authors: Xiaocong Yang
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: MS Thesis
Abstract:Modern dense information retrieval (IR) models usually rely on costly large-scale pretraining. In this paper, we introduce LLM2IR, an efficient unsupervised contrastive learning framework to convert any decoder-only large language model (LLM) to an information retrieval model. Despite its simplicity, the effectiveness is proven among different LLMs on multiple IR benchmarks including LoCo, LongEmbed and BEIR. We also find that models with a longer context length tend to have a stronger IR capacity by comparing task performances of models in the same model family. Our work not only provides an effective way to build IR models on the state-of-the-art LLMs, but also shed light on the relationship between information retrieval ability and model context length, which helps the design of better information retrievers.
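The unsupervised contrastive objective behind such conversions is typically an InfoNCE loss over in-batch negatives; a dependency-free numerical sketch (the paper's exact positive-pair construction is not described in this summary):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE: negative log-softmax of the positive among all candidates."""
    sims = [cosine(query, positive)] + [cosine(query, n) for n in negatives]
    logits = [s / tau for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

q = [1.0, 0.0]
# Easy negatives point away from the query; hard negatives sit close to it.
loss_easy = info_nce(q, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_hard = info_nce(q, [0.9, 0.1], [[0.99, 0.05], [0.95, 0.1]])
```

Minimizing this loss pulls query and positive embeddings together while pushing in-batch negatives apart, which is what gives the decoder-only LLM a usable retrieval embedding space.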
[NLP-78] Quantifying Document Impact in RAG-LLMs
【Quick Read】: This paper addresses a key gap in evaluating retrieval-augmented generation (RAG) systems: the lack of a metric that quantifies how much each retrieved document contributes to the final generated output. RAG systems currently face factual inconsistencies, source conflicts, bias propagation, and security vulnerabilities that undermine their trustworthiness. The core contribution is the Influence Score (IS), a novel metric grounded in Partial Information Decomposition that measures the impact of each retrieved document on the generated response. Validation through a poison-attack simulation and an ablation study shows that IS reliably identifies malicious documents and isolates the documents that drive the generated content, providing a quantifiable, interpretable tool for improving the transparency and reliability of RAG systems.
Link: https://arxiv.org/abs/2601.05260
Authors: Armin Gerami, Kazem Faghih, Ramani Duraiswami
Affiliations: University of Maryland; UMIACS
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Retrieval Augmented Generation (RAG) enhances Large Language Models (LLMs) by connecting them to external knowledge, improving accuracy and reducing outdated information. However, this introduces challenges such as factual inconsistencies, source conflicts, bias propagation, and security vulnerabilities, which undermine the trustworthiness of RAG systems. A key gap in current RAG evaluation is the lack of a metric to quantify the contribution of individual retrieved documents to the final output. To address this, we introduce the Influence Score (IS), a novel metric based on Partial Information Decomposition that measures the impact of each retrieved document on the generated response. We validate IS through two experiments. First, a poison attack simulation across three datasets demonstrates that IS correctly identifies the malicious document as the most influential in 86% of cases. Second, an ablation study shows that a response generated using only the top-ranked documents by IS is consistently judged more similar to the original response than one generated from the remaining documents. These results confirm the efficacy of IS in isolating and quantifying document influence, offering a valuable tool for improving the transparency and reliability of RAG systems.
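For intuition, a much simpler leave-one-out ablation gives a rough document-influence proxy; note this is a stand-in, not the paper's Influence Score, which is derived from Partial Information Decomposition:

```python
def loo_influence(documents, generate, similarity):
    """Leave-one-out influence proxy: how much does dropping each retrieved
    document change the response? (Illustrative only; the paper's IS is
    computed differently, via Partial Information Decomposition.)"""
    full = generate(documents)
    scores = {}
    for i in range(len(documents)):
        ablated = generate(documents[:i] + documents[i + 1:])
        scores[i] = 1.0 - similarity(full, ablated)
    return scores

# Toy "generator": the response is the set of facts copied from documents.
def generate(docs):
    return set(w for d in docs for w in d.split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

docs = ["paris capital france", "paris capital france", "llamas eat grass"]
influence = loo_influence(docs, generate, jaccard)
```

The two redundant documents get zero influence (dropping either changes nothing), while the document carrying unique content scores highest, mirroring the kind of attribution IS formalizes.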
[NLP-79] Naiad: Novel Agentic Intelligent Autonomous System for Inland Water Monitoring
【Quick Read】: This paper addresses the fragmentation of existing inland water monitoring, where cyanobacteria, chlorophyll, and other water quality indicators are typically handled separately, without a unified, efficient analysis capability. The core contribution is NAIAD, an agentic assistant built on large language models (LLMs) that combines retrieval-augmented generation (RAG), LLM reasoning, external tool orchestration, computational graph execution, and agentic reflection to integrate multi-source Earth Observation (EO) data with diverse analytical tools (Sentinel-2 imagery, NDCI index computation, chlorophyll-a estimation, and more), delivering end-to-end automation from natural-language query to tailored report for both expert and non-expert users.
Link: https://arxiv.org/abs/2601.05256
Authors: Eirini Baltzi, Tilemachos Moumouris, Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos
Affiliations: National Technical University of Athens
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:
Abstract:Inland water monitoring is vital for safeguarding public health and ecosystems, enabling timely interventions to mitigate risks. Existing methods often address isolated sub-problems such as cyanobacteria, chlorophyll, or other quality indicators separately. NAIAD introduces an agentic AI assistant that leverages Large Language Models (LLMs) and external analytical tools to deliver a holistic solution for inland water monitoring using Earth Observation (EO) data. Designed for both experts and non-experts, NAIAD provides a single-prompt interface that translates natural-language queries into actionable insights. Through Retrieval-Augmented Generation (RAG), LLM reasoning, external tool orchestration, computational graph execution, and agentic reflection, it retrieves and synthesizes knowledge from curated sources to produce tailored reports. The system integrates diverse tools for weather data, Sentinel-2 imagery, remote-sensing index computation (e.g., NDCI), chlorophyll-a estimation, and established platforms such as CyFi. Performance is evaluated using correctness and relevancy metrics, achieving over 77% and 85% respectively on a dedicated benchmark covering multiple user-expertise levels. Preliminary results show strong adaptability and robustness across query types. An ablation study on LLM backbones further highlights Gemma 3 (27B) and Qwen 2.5 (14B) as offering the best balance between computational efficiency and reasoning performance.
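The NDCI mentioned among NAIAD's tools is the standard normalized difference of Sentinel-2 red-edge (B5) and red (B4) reflectances, NDCI = (B5 - B4) / (B5 + B4). A minimal sketch with made-up reflectance values:

```python
def ndci(red_edge_b5, red_b4):
    """Normalized Difference Chlorophyll Index from Sentinel-2 bands B5/B4."""
    return (red_edge_b5 - red_b4) / (red_edge_b5 + red_b4)

# Made-up surface reflectances for two pixels of an inland water body.
clear_water = ndci(0.02, 0.03)   # negative NDCI: low chlorophyll
algal_bloom = ndci(0.08, 0.04)   # positive NDCI: likely bloom
```

In a tool-orchestration setting, an agent would run this per pixel over a Sentinel-2 scene and hand the resulting map to the chlorophyll-a estimation step.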
[NLP-80] TagRAG: Tag-guided Hierarchical Knowledge Graph Retrieval-Augmented Generation
【Quick Read】: This paper addresses the limitations of fragment-level retrieval in traditional retrieval-augmented generation (RAG) for query-focused summarization, as well as GraphRAG's inefficient information extraction, heavy resource consumption, and poor support for incremental updates. The key contribution is TagRAG, which achieves efficient global reasoning and scalable knowledge graph maintenance through two components: (1) Tag Knowledge Graph Construction, which extracts object tags and their relationships from documents and organizes them into hierarchical domain-oriented tag chains; and (2) Tag-Guided Retrieval-Augmented Generation, which at inference time retrieves domain-specific tag chains to localize and synthesize relevant knowledge. This design adapts well to smaller language models, refines retrieval granularity, and supports efficient incremental knowledge updates.
Link: https://arxiv.org/abs/2601.05254
Authors: Wenbiao Tao, Yunshi Lan, Weining Qian
Affiliations: East China Normal University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation enhances language models by retrieving external knowledge to support informed and grounded responses. However, traditional RAG methods rely on fragment-level retrieval, limiting their ability to address query-focused summarization queries. GraphRAG introduces a graph-based paradigm for global knowledge reasoning, yet suffers from inefficiencies in information extraction, costly resource consumption, and poor adaptability to incremental updates. To overcome these limitations, we propose TagRAG, a tag-guided hierarchical knowledge graph RAG framework designed for efficient global reasoning and scalable graph maintenance. TagRAG introduces two key components: (1) Tag Knowledge Graph Construction, which extracts object tags and their relationships from documents and organizes them into hierarchical domain tag chains for structured knowledge representation, and (2) Tag-Guided Retrieval-Augmented Generation, which retrieves domain-centric tag chains to localize and synthesize relevant knowledge during inference. This design significantly adapts to smaller language models, improves retrieval granularity, and supports efficient knowledge increment. Extensive experiments on UltraDomain datasets spanning Agriculture, Computer Science, Law, and cross-domain settings demonstrate that TagRAG achieves an average win rate of 95.41% against baselines while maintaining about 14.6x construction and 1.9x retrieval efficiency compared with GraphRAG.
[NLP-81] Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training
【Quick Read】: This paper addresses two limitations of current pre-trained language model (PLM)-based single-cell foundation models: insufficient integration of in-depth individual cell profiles and vulnerability to noise in multimodal data. The key contribution is the Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL), with two innovations: (1) an LLM-based workflow with retrieval-augmented generation (RAG) that enriches cell textual descriptions using open-world knowledge; and (2) a Cross-modal Robust Alignment (CRA) objective that combines sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to substantially strengthen robustness to noisy data.
Link: https://arxiv.org/abs/2601.05648
Authors: Haoran Wang, Xuanyi Zhang, Shuangsang Fang, Longke Ran, Ziqing Deng, Yong Zhang, Yuxiang Li, Shaoshuai Li
Affiliations: Unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 41 pages
Abstract:Recent advancements in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. While pre-trained language model (PLM) paradigm based single-cell foundation models have shown promise, they remain constrained by insufficient integration of in-depth individual profiles and neglecting the influence of noise within multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built based on a cross-modal Cell-Language pre-training framework, which comprises two key innovations: (1) leveraging Large Language Models (LLMs) based workflow with retrieval-augmented generation (RAG) enriches cell textual descriptions using open-world knowledge; (2) devising a Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model’s resistance to noisy data. After pretraining on 32M cell-text pairs, OKR-CELL obtains cutting-edge results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also demonstrates superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.
Computer Vision
[CV-0] Deepfake detectors are DUMB: A benchmark to assess adversarial training robustness under transferability constraints
【速读】:该论文旨在解决深度伪造检测系统在现实环境中面临的对抗攻击问题,即 adversaries 能够生成难以察觉的扰动(perturbations)来降低模型性能。其解决方案的关键在于扩展 DUMB 和 DUMBer 方法论至深度伪造检测领域,通过在跨数据集配置和迁移性约束条件下评估多种先进检测器(如 RECCE、SRM、XCeption、UCF、SPSL)与攻击方法(如 PGD、FGSM、FPBA)的鲁棒性表现,揭示对抗训练策略在分布内(in-distribution)和分布外(cross-dataset)场景下的差异性影响,从而强调需采用情境感知(case-aware)的防御策略以提升实际部署中的抗干扰能力。
链接: https://arxiv.org/abs/2601.05986
作者: Adrian Serrano,Erwan Umlil,Ronan Thomas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 10 pages, four tables, one figure
Abstract:Deepfake detection systems deployed in real-world environments are subject to adversaries capable of crafting imperceptible perturbations that degrade model performance. While adversarial training is a widely adopted defense, its effectiveness under realistic conditions, where attackers operate with limited knowledge and mismatched data distributions, remains underexplored. In this work, we extend the DUMB (Dataset soUrces, Model architecture and Balance) and DUMBer methodologies to deepfake detection. We evaluate detectors' robustness against adversarial attacks under transferability constraints and cross-dataset configurations to extract real-world insights. Our study spans five state-of-the-art detectors (RECCE, SRM, XCeption, UCF, SPSL), three attacks (PGD, FGSM, FPBA), and two datasets (FaceForensics++ and Celeb-DF-V2). We analyze both attacker and defender perspectives, mapping results to mismatch scenarios. Experiments show that adversarial training strategies reinforce robustness in in-distribution cases but can also degrade it under cross-dataset configurations, depending on the strategy adopted. These findings highlight the need for case-aware defense strategies in real-world applications exposed to adversarial attacks.
zh
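The DUMB benchmark entry above evaluates attacks including FGSM. As a minimal sketch of how FGSM works in general (against a toy logistic-regression "detector", not the paper's setup), the attack perturbs the input in the sign direction of the loss gradient, scaled by a budget eps; all variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_attack(x, y, w, b, eps):
    """Fast Gradient Sign Method against a logistic-regression model.

    The binary cross-entropy gradient w.r.t. the input x is
    (sigmoid(w.x + b) - y) * w, so the attack steps in the sign of
    that gradient, scaled by the perturbation budget eps.
    """
    grad = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad)

rng = np.random.default_rng(1)
w = rng.normal(size=32)
b = 0.0
x = rng.normal(size=32)
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0  # label the model agrees with

x_adv = fgsm_attack(x, y, w, b, eps=0.3)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
# the attack pushes the predicted probability away from the true label
```

By construction, the adversarial example moves the model's confidence away from the correct label, which is exactly the degradation the benchmark measures for real detectors.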
[CV-1] Adaptive Conditional Contrast-Agnostic Deformable Image Registration with Uncertainty Estimation
【速读】:该论文旨在解决多对比度图像配准中因不同成像模态间强度关系复杂且非线性而导致的配准精度低与泛化能力差的问题。传统方法依赖迭代优化变形场,效率低下;而现有学习型方法虽能实现快速推理,但其泛化能力受限于训练时所见的特定对比度。解决方案的关键在于提出一种自适应条件无关对比度的可变形图像配准框架(AC-CAR),其核心创新包括:1)基于随机卷积的对比度增强策略,使模型在训练阶段模拟任意未见过的成像对比度;2)引入自适应条件特征调制模块(ACFM),通过自适应调制特征并施加对比度不变的潜在正则化,促进跨对比度特征一致性学习;3)集成方差网络以输出对比度无关的配准不确定性估计,提升结果的可信度。该方案显著提升了模型对未见对比度图像的泛化性能与配准可靠性。
链接: https://arxiv.org/abs/2601.05981
作者: Yinsong Wang,Xinzhe Luo,Siyi Du,Chen Qin
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ieee transactions on Medical Imaging
Abstract:Deformable multi-contrast image registration is a challenging yet crucial task due to the complex, non-linear intensity relationships across different imaging contrasts. Conventional registration methods typically rely on iterative optimization of the deformation field, which is time-consuming. Although recent learning-based approaches enable fast and accurate registration during inference, their generalizability remains limited to the specific contrasts observed during training. In this work, we propose an adaptive conditional contrast-agnostic deformable image registration framework (AC-CAR) based on a random convolution-based contrast augmentation scheme. AC-CAR can generalize to arbitrary imaging contrasts without observing them during training. To encourage contrast-invariant feature learning, we propose an adaptive conditional feature modulator (ACFM) that adaptively modulates the features and the contrast-invariant latent regularization to enforce the consistency of the learned feature across different imaging contrasts. Additionally, we enable our framework to provide contrast-agnostic registration uncertainty by integrating a variance network that leverages the contrast-agnostic registration encoder to improve the trustworthiness and reliability of AC-CAR. Experimental results demonstrate that AC-CAR outperforms baseline methods in registration accuracy and exhibits superior generalization to unseen imaging contrasts. Code is available at this https URL.
zh
[CV-2] VideoAR: Autoregressive Video Generation via Next-Frame Scale Prediction
【速读】:该论文旨在解决当前视频生成模型中计算资源消耗大、难以扩展以及长期时序一致性差的问题,尤其针对基于扩散(diffusion)和流匹配(flow-matching)方法的高复杂度与低效率瓶颈。其解决方案的关键在于提出VideoAR——首个大规模视觉自回归(Visual Autoregressive, VAR)框架,通过多尺度帧预测与自回归建模相结合的方式,将空间与时间依赖性解耦:利用3D多尺度分词器高效编码时空动态,并引入多尺度时间RoPE、跨帧误差校正和随机帧掩码机制,显著缓解误差传播并增强时序稳定性;同时采用多阶段预训练策略逐步对齐不同分辨率和时长下的空间与时间学习,从而在保持高效推理(减少超过10倍推理步数)的同时实现媲美大型扩散模型的性能(如UCF-101上FVD从99.5降至88.6,VBench得分达81.74)。
链接: https://arxiv.org/abs/2601.05966
作者: Longbin Ji,Xiaoxiong Liu,Junyuan Shang,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang
机构: ERNIE Team, Baidu(百度)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
zh
[CV-3] WaveRNet: Wavelet-Guided Frequency Learning for Multi-Source Domain-Generalized Retinal Vessel Segmentation
【速读】:该论文旨在解决跨域泛化(retinal vessel segmentation)在非均匀光照和对比度变化下性能下降的问题,同时克服生成式AI(如Segment Anything Model, SAM)在细节保留方面的局限性。其核心挑战在于:现有基于SAM的方法依赖简单适配器微调,忽视了频域中蕴含的域不变特征;且SAM直接上采样导致细小血管结构丢失。解决方案的关键创新在于提出WaveRNet框架,包含三个核心模块:1)频谱引导的域调制器(Spectral-guided Domain Modulator, SDM),通过小波分解与可学习域标记分离低频光照鲁棒结构与高频血管边界,实现域特定特征生成;2)频适应域融合模块(Frequency-Adaptive Domain Fusion, FADF),基于小波频域相似性进行测试时智能域选择与软加权融合;3)分层掩码提示精修器(Hierarchical Mask-Prompt Refiner, HMPR),利用长程依赖建模实现从粗到细的细化,有效弥补SAM上采样细节损失。实验证明,该方法在四个公开数据集上的留一域外协议中达到最优跨域泛化性能。
链接: https://arxiv.org/abs/2601.05942
作者: Chanchan Wang,Yuanfang Wang,Qing Xu,Guanxin Chen
机构: Xinjiang University (新疆大学); Nanjing University of Chinese Medicine (南京中医药大学); University of Nottingham (诺丁汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Domain-generalized retinal vessel segmentation is critical for automated ophthalmic diagnosis, yet faces significant challenges from domain shift induced by non-uniform illumination and varying contrast, compounded by the difficulty of preserving fine vessel structures. While the Segment Anything Model (SAM) exhibits remarkable zero-shot capabilities, existing SAM-based methods rely on simple adapter fine-tuning while overlooking frequency-domain information that encodes domain-invariant features, resulting in degraded generalization under illumination and contrast variations. Furthermore, SAM’s direct upsampling inevitably loses fine vessel details. To address these limitations, we propose WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation. Specifically, we devise a Spectral-guided Domain Modulator (SDM) that integrates wavelet decomposition with learnable domain tokens, enabling the separation of illumination-robust low-frequency structures from high-frequency vessel boundaries while facilitating domain-specific feature generation. Furthermore, we introduce a Frequency-Adaptive Domain Fusion (FADF) module that performs intelligent test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. Finally, we present a Hierarchical Mask-Prompt Refiner (HMPR) that overcomes SAM’s upsampling limitation through coarse-to-fine refinement with long-range dependency modeling. Extensive experiments under the Leave-One-Domain-Out protocol on four public retinal datasets demonstrate that WaveRNet achieves state-of-the-art generalization performance. The source code is available at this https URL.
zh
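The WaveRNet abstract above uses wavelet decomposition to separate illumination-robust low-frequency structure from high-frequency vessel boundaries. As a hedged illustration of the underlying operation (a single-level 2D Haar transform, not the paper's SDM module), the sketch below splits a toy "vessel" image into one approximation band and three detail bands:

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2D Haar wavelet transform.

    Returns the low-frequency approximation (LL) and three
    high-frequency detail bands. Image sides must be even.
    """
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # illumination-robust structure
    lh = (a + b - c - d) / 4.0   # differences across rows (horizontal edges)
    hl = (a - b + c - d) / 4.0   # differences across columns (vertical edges)
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

# a toy "vessel" image: a bright vertical stripe on a dark background
img = np.zeros((8, 8))
img[:, 3:5] = 1.0
ll, lh, hl, hh = haar_dwt2(img)
```

The vertical stripe is uniform along rows, so its energy appears in the column-difference band while the row-difference band stays empty — the kind of frequency separation the paper exploits for domain-invariant features.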
[CV-4] Context-Aware Decoding for Faithful Vision-Language Generation
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在开放任务中出现的幻觉问题,即生成与视觉输入不一致的响应,尤其在图像描述和视觉推理等场景下尤为显著。其解决方案的关键在于通过Logit Lens分析模型各解码层的生成动态,发现真实token比幻觉token更早获得概率质量聚集,从而提出一种无需训练的轻量级干预方法——上下文嵌入注入(Context Embedding Injection, CEI),该方法利用最后一个输入token的隐藏状态作为 grounding 信号,在整个解码过程中保持视觉一致性并抑制幻觉。
链接: https://arxiv.org/abs/2601.05939
作者: Mehrdad Fazli,Bowen Wei,Ziwei Zhu
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token-the context embedding-as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
zh
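The CEI abstract above describes a "commitment-depth gap": truthful tokens become the top candidate at earlier decoder layers than hallucinatory ones. As a hedged toy illustration of how such a depth could be measured from Logit-Lens projections (the metric definition here is an assumption, not the paper's exact formulation):

```python
import numpy as np

def commitment_depth(layer_logits):
    """First layer index from which the final top-1 token stays the argmax.

    layer_logits: (num_layers, vocab) array of Logit-Lens projections.
    A smaller value means the model "commits" to its answer earlier.
    """
    final_top = int(np.argmax(layer_logits[-1]))
    per_layer_top = np.argmax(layer_logits, axis=1)
    depth = len(layer_logits) - 1
    # walk backwards while the layer already agrees with the final choice
    for layer in range(len(layer_logits) - 1, -1, -1):
        if per_layer_top[layer] != final_top:
            break
        depth = layer
    return depth

# toy example: a "truthful" token locks in at layer 2,
# a "hallucinatory" one only at the final layer
truthful = np.zeros((6, 5))
truthful[2:, 3] = 5.0                 # token 3 dominates from layer 2 on
hallucinated = np.zeros((6, 5))
hallucinated[:5, 1] = 5.0             # token 1 leads early...
hallucinated[5, 4] = 9.0              # ...but token 4 wins only at the end
```

On these toy trajectories the truthful token commits at layer 2 while the hallucinated one commits only at the last layer, mirroring the gap the paper reports.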
[CV-5] Performance of a Deep Learning-Based Segmentation Model for Pancreatic Tumors on Public Endoscopic Ultrasound Datasets
【速读】:该论文旨在解决胰腺肿瘤在内镜超声(Endoscopic Ultrasound, EUS)图像中自动分割的准确性问题,以降低诊断过程中因操作者主观性带来的变异。其解决方案的关键在于采用基于视觉Transformer(Vision Transformer)架构的深度学习分割模型(USFM框架),通过大规模EUS图像数据训练与多折交叉验证,实现了对胰腺肿瘤区域的有效识别与边界划分,展现出较高的敏感性和特异性,为临床提供了一种客观、可靠的辅助诊断工具。
链接: https://arxiv.org/abs/2601.05937
作者: Pankaj Gupta,Priya Mudgil,Niharika Dutta,Kartik Bose,Nitish Kumar,Anupam Kumar,Jimil Shah,Vaneet Jearth,Jayanta Samanta,Vishal Sharma,Harshal Mandavdhare,Surinder Rana,Saroj K Sinha,Usha Dutta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Background: Pancreatic cancer is one of the most aggressive cancers, with poor survival rates. Endoscopic ultrasound (EUS) is a key diagnostic modality, but its effectiveness is constrained by operator subjectivity. This study evaluates a Vision Transformer-based deep learning segmentation model for pancreatic tumors. Methods: A segmentation model using the USFM framework with a Vision Transformer backbone was trained and validated with 17,367 EUS images (from two public datasets) in 5-fold cross-validation. The model was tested on an independent dataset of 350 EUS images from another public dataset, manually segmented by radiologists. Preprocessing included grayscale conversion, cropping, and resizing to 512x512 pixels. Metrics included Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, and accuracy. Results: In 5-fold cross-validation, the model achieved a mean DSC of 0.651 +/- 0.738, IoU of 0.579 +/- 0.658, sensitivity of 69.8%, specificity of 98.8%, and accuracy of 97.5%. For the external validation set, the model achieved a DSC of 0.657 (95% CI: 0.634-0.769), IoU of 0.614 (95% CI: 0.590-0.689), sensitivity of 71.8%, and specificity of 97.7%. Results were consistent, but 9.7% of cases exhibited erroneous multiple predictions. Conclusions: The Vision Transformer-based model demonstrated strong performance for pancreatic tumor segmentation in EUS images. However, dataset heterogeneity and limited external validation highlight the need for further refinement, standardization, and prospective studies.
zh
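The pancreatic-tumor segmentation study above reports Dice similarity coefficient (DSC), IoU, sensitivity, and specificity. These standard metrics can be computed from binary masks as follows (a generic sketch, not the study's evaluation code):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Dice, IoU, sensitivity, specificity for binary 0/1 masks."""
    pred = pred.astype(bool); gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return dice, iou, sensitivity, specificity

gt = np.zeros((10, 10), dtype=int); gt[2:6, 2:6] = 1      # 16-pixel "tumor"
pred = np.zeros((10, 10), dtype=int); pred[3:7, 3:7] = 1  # shifted prediction
dice, iou, sens, spec = segmentation_metrics(pred, gt)
```

With a 9-pixel overlap this gives Dice = 0.5625 and IoU = 9/23; note IoU is always at most the Dice score, which is why the paper's DSC values exceed its IoU values.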
[CV-6] Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
【速读】:该论文旨在解决超高分辨率图像分割中普遍存在的两个问题:一是传统滑动窗口方法因局部处理而丢失全局上下文信息,二是下采样策略虽能捕捉全局信息但会损失精细细节。解决方案的关键在于提出一种简单而有效的多尺度推理机制,将视觉Transformer(Vision Transformer, ViT)扩展为同时处理局部尺度(高分辨率小块)和全局尺度(低分辨率大块)的双分支架构,并通过少量可学习的中继令牌(relay tokens)在两个分支间聚合与传播特征。该设计可直接嵌入标准Transformer骨干网络(如ViT和Swin Transformer),参数增量低于2%,并在多个超分辨率分割基准(Archaeoscape、URUR、Gleason)及Cityscapes数据集上实现了显著性能提升,最高相对mIoU提升达15%。
链接: https://arxiv.org/abs/2601.05927
作者: Yohann Perron,Vladyslav Sydorov,Christophe Pottier,Loic Landrieu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages +3 pages of suppmat
Abstract:Current approaches for segmenting ultra high resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra high resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at this https URL .
zh
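The relay-token entry above aggregates and propagates features between a local (high-resolution) and a global (low-resolution) branch through a few learnable tokens. The sketch below is a speculative numpy illustration of one plausible gather-then-scatter attention exchange; the actual mechanism in the paper may differ, and all shapes and names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relay_exchange(local_tokens, global_tokens, relay):
    """Two-step exchange through a small set of relay tokens.

    1) gather: relay tokens cross-attend over the concatenated
       local + global token sets to summarize both branches;
    2) scatter: each branch's tokens attend back to the updated
       relays, mixing in cross-scale context.
    """
    scale = 1.0 / np.sqrt(relay.shape[1])
    both = np.concatenate([local_tokens, global_tokens], axis=0)
    relay_new = softmax(relay @ both.T * scale) @ both
    local_out = local_tokens + softmax(local_tokens @ relay_new.T * scale) @ relay_new
    global_out = global_tokens + softmax(global_tokens @ relay_new.T * scale) @ relay_new
    return local_out, global_out, relay_new

rng = np.random.default_rng(0)
local_out, global_out, relays = relay_exchange(
    rng.normal(size=(64, 32)),   # high-res crop tokens
    rng.normal(size=(16, 32)),   # low-res context tokens
    rng.normal(size=(4, 32)),    # 4 learnable relay tokens
)
```

Because all cross-branch communication flows through only 4 relay tokens, the added cost and parameter count stay small, consistent with the "<2% parameters" claim in the abstract.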
[CV-7] Phase4DFD: Multi-Domain Phase-Aware Attention for Deepfake Detection
【速读】:该论文旨在解决当前深度伪造检测(Deepfake Detection)方法中对频率域信息利用不充分的问题,特别是现有技术主要依赖频谱幅度(spectral magnitude),而忽视了相位信息(phase information)在揭示合成痕迹中的潜在价值。其解决方案的关键在于提出Phase4DFD框架,通过引入一个输入级的相位感知注意力模块(phase-aware attention module),利用合成生成过程中常见的相位不连续性(phase discontinuities)引导模型聚焦于最具判别性的频率模式,从而实现对相位与幅度交互关系的显式建模;此外,该方法结合快速傅里叶变换(FFT)幅度和局部二值模式(LBP)特征增强表示能力,并采用轻量级BNext M主干网络进行高效特征提取,实验表明该策略在保持低计算开销的同时显著提升了检测性能。
链接: https://arxiv.org/abs/2601.05861
作者: Zhen-Xin Lin,Shang-Kuan Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, conference
Abstract:Recent deepfake detection methods have increasingly explored frequency-domain representations to reveal manipulation artifacts that are difficult to detect in the spatial domain. However, most existing approaches rely primarily on spectral magnitude, leaving the role of phase information underexplored. In this work, we propose Phase4DFD, a phase-aware frequency-domain deepfake detection framework that explicitly models phase-magnitude interactions via a learnable attention mechanism. Our approach augments standard RGB input with Fast Fourier Transform (FFT) magnitude and local binary pattern (LBP) representations to expose subtle synthesis artifacts that remain indistinguishable under spatial analysis alone. Crucially, we introduce an input-level phase-aware attention module that uses phase discontinuities commonly introduced by synthetic generation to guide the model toward the frequency patterns most indicative of manipulation before backbone feature extraction. The attended multi-domain representation is processed by an efficient BNext-M backbone, with optional channel-spatial attention applied for semantic feature refinement. Extensive experiments on the CIFAKE and DFFD datasets demonstrate that our proposed model Phase4DFD outperforms state-of-the-art spatial and frequency-based detectors while maintaining low computational overhead. Comprehensive ablation studies further confirm that explicit phase modeling provides complementary and non-redundant information beyond magnitude-only frequency representations.
zh
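The Phase4DFD entry above argues that phase carries information that magnitude-only representations miss. The standard decomposition behind this is easy to demonstrate (a generic numpy sketch, not the paper's pipeline): an image is exactly recoverable from magnitude together with phase, but discarding the phase destroys its structure.

```python
import numpy as np

def fft_mag_phase(img):
    """Split an image spectrum into FFT magnitude and phase.

    The image is exactly recovered from magnitude * exp(i * phase).
    """
    spectrum = np.fft.fft2(img)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    recon = np.fft.ifft2(magnitude * np.exp(1j * phase)).real
    return magnitude, phase, recon

rng = np.random.default_rng(0)
img = rng.random((16, 16))
magnitude, phase, recon = fft_mag_phase(img)

# zeroing out the phase while keeping the magnitude destroys the image,
# which is why phase (dis)continuities are informative for detection
flat_phase_img = np.fft.ifft2(magnitude * np.exp(1j * 0.0)).real
err_true = np.abs(recon - img).max()
err_flat = np.abs(flat_phase_img - img).max()
```

Reconstruction from magnitude plus phase is exact up to floating-point error, while the magnitude-only reconstruction deviates badly from the original image.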
[CV-8] Bidirectional Channel-selective Semantic Interaction for Semi-Supervised Medical Segmentation AAAI2026
【速读】:该论文旨在解决半监督医学图像分割中因标注数据有限而导致的模型性能瓶颈问题,尤其针对现有方法如均值教师(mean teacher)和双流一致性学习框架中存在的误差累积、模型结构复杂以及标注与未标注数据流间交互不足等挑战。其解决方案的关键在于提出一种双向通道选择性语义交互(Bidirectional Channel-selective Semantic Interaction, BCSI)框架:首先引入语义-空间扰动(Semantic-Spatial Perturbation, SSP)机制,通过强增强操作结合伪标签监督与预测一致性来提升模型鲁棒性;其次设计通道选择性路由器(Channel-selective Router, CR),动态筛选高相关通道以减少噪声干扰;最后采用双向通道级交互(Bidirectional Channel-wise Interaction, BCI)策略,增强重要通道的语义表示能力,从而实现更高效、稳定的跨数据流信息融合。
链接: https://arxiv.org/abs/2601.05855
作者: Kaiwen Huang,Yizhe Zhang,Yi Zhou,Tianyang Xu,Tao Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026. Code at: this https URL
Abstract:Semi-supervised medical image segmentation is an effective method for addressing scenarios with limited labeled data. Existing methods mainly rely on frameworks such as mean teacher and dual-stream consistency learning. These approaches often face issues like error accumulation and model structural complexity, while also neglecting the interaction between labeled and unlabeled data streams. To overcome these challenges, we propose a Bidirectional Channel-selective Semantic Interaction~(BCSI) framework for semi-supervised medical image segmentation. First, we propose a Semantic-Spatial Perturbation~(SSP) mechanism, which disturbs the data using two strong augmentation operations and leverages unsupervised learning with pseudo-labels from weak augmentations. Additionally, we employ consistency on the predictions from the two strong augmentations to further improve model stability and robustness. Second, to reduce noise during the interaction between labeled and unlabeled data, we propose a Channel-selective Router~(CR) component, which dynamically selects the most relevant channels for information exchange. This mechanism ensures that only highly relevant features are activated, minimizing unnecessary interference. Finally, the Bidirectional Channel-wise Interaction~(BCI) strategy is employed to supplement additional semantic information and enhance the representation of important channels. Experimental results on multiple benchmarking 3D medical datasets demonstrate that the proposed method outperforms existing semi-supervised approaches.
zh
[CV-9] LayerGS: Decomposition and Inpainting of Layered 3D Human Avatars via 2D Gaussian Splatting
【速读】:该论文旨在解决现有3D人体重建方法在处理任意姿态下人体与衣物分离时的两大难题:一是单层重建方法无法将服装与特定身份解耦,导致虚拟试穿等应用受限;二是已有多层方法难以有效恢复被遮挡区域的几何细节。其解决方案的关键在于提出一种基于2D高斯表示的分层建模框架,通过将每一层(身体和衣物)编码为一组2D高斯分布以实现精确几何重建与逼真渲染,并利用预训练2D扩散模型结合得分蒸馏采样(Score-Distillation Sampling, SDS)对遮挡区域进行图像修复,从而实现高质量的层间分解与重组。该方法采用三阶段训练策略,先进行粗粒度单层重建,再联合优化多层细节,显著提升了虚拟试穿效果和3D人体资产的真实性与可用性。
链接: https://arxiv.org/abs/2601.05853
作者: Yinghan Xu,John Dingliana
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:
Abstract:We propose a novel framework for decomposing arbitrarily posed humans into animatable multi-layered 3D human avatars, separating the body and garments. Conventional single-layer reconstruction methods lock clothing to one identity, while prior multi-layer approaches struggle with occluded regions. We overcome both limitations by encoding each layer as a set of 2D Gaussians for accurate geometry and photorealistic rendering, and inpainting hidden regions with a pretrained 2D diffusion model via score-distillation sampling (SDS). Our three-stage training strategy first reconstructs the coarse canonical garment via single-layer reconstruction, followed by multi-layer training to jointly recover the inner-layer body and outer-layer garment details. Experiments on two 3D human benchmark datasets (4D-Dress, Thuman2.0) show that our approach achieves better rendering quality and layer decomposition and recomposition than the previous state-of-the-art, enabling realistic virtual try-on under novel viewpoints and poses, and advancing practical creation of high-fidelity 3D human assets for immersive applications. Our code is available at this https URL
zh
[CV-10] Kidney Cancer Detection Using 3D-Based Latent Diffusion Models
【速读】:该论文旨在解决在对比增强腹部CT图像中对肾脏异常进行3D检测的问题,尤其是针对标注成本高、依赖密集像素级标签的监督学习方法所面临的挑战。其解决方案的关键在于提出一种基于潜在扩散(latent diffusion)的新颖流水线,结合了去噪扩散概率模型(DDPMs)、去噪扩散隐式模型(DDIMs)和向量量化生成对抗网络(VQ-GANs),直接在图像体积上操作,并利用仅需病例级别伪标签的弱监督策略,从而实现高效的异常检测。该方法首次展示了3D潜在扩散模型在弱监督条件下对复杂腹部解剖结构进行生成建模与异常检测的可行性。
链接: https://arxiv.org/abs/2601.05852
作者: Jen Dusseljee,Sarah de Boer,Alessa Hering
机构: Radboud University Medical Center (拉德布德大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures. This paper has been accepted at Bildverarbeitung für die Medizin (BVM) 2026
Abstract:In this work, we present a novel latent diffusion-based pipeline for 3D kidney anomaly detection on contrast-enhanced abdominal CT. The method combines Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Vector-Quantized Generative Adversarial Networks (VQ-GANs). Unlike prior slice-wise approaches, our method operates directly on an image volume and leverages weak supervision with only case-level pseudo-labels. We benchmark our approach against state-of-the-art supervised segmentation and detection models. This study demonstrates the feasibility and promise of 3D latent diffusion for weakly supervised anomaly detection. While the current results do not yet match supervised baselines, they reveal key directions for improving reconstruction fidelity and lesion localization. Our findings provide an important step toward annotation-efficient, generative modeling of complex abdominal anatomy.
zh
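The kidney anomaly-detection pipeline above builds on DDPMs. A core piece of any DDPM is the closed-form forward (noising) process, sketched below in numpy with a common linear beta schedule; this is a generic illustration, not the paper's 3D latent implementation.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form DDPM forward process q(x_t | x_0).

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of (1 - beta).
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, alpha_bar

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # common linear schedule
x0 = rng.normal(size=(8, 8, 8))            # toy 3D "volume"

x_early, ab_early = forward_diffuse(x0, 10, betas, rng)
x_late, ab_late = forward_diffuse(x0, 999, betas, rng)
corr_early = np.corrcoef(x0.ravel(), x_early.ravel())[0, 1]
corr_late = np.corrcoef(x0.ravel(), x_late.ravel())[0, 1]
```

Early timesteps keep the sample close to the clean volume while late timesteps are nearly pure noise; the reverse network is trained to undo this process, and anomaly detection exploits how well a healthy-data model can reconstruct a given input.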
[CV-11] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
【速读】:该论文旨在解决视频生成模型在机器人和规划任务中难以精确指定目标的问题,传统方法如文本指令过于抽象,而目标图像在动态任务中又常不可行。其解决方案的关键在于提出一种名为“Goal Force”的新框架,通过显式力矢量和中间动力学过程来定义目标,模拟人类对物理任务的直观理解。该框架训练一个视频生成模型学习合成因果基本单元(如弹性碰撞、多米诺骨牌倒塌)的数据集,从而掌握力在时空中的传播机制;即使仅用简单物理数据训练,模型也能在零样本条件下泛化到复杂现实场景(如工具操作和多物体因果链),表明其可作为隐式的神经物理模拟器,实现无需依赖外部引擎的精准、物理感知的规划。
链接: https://arxiv.org/abs/2601.05848
作者: Nate Gillman,Yinghua Zhou,Zitian Tang,Evan Luo,Arjan Chakravarthy,Daksh Aggarwal,Michael Freeman,Charles Herrmann,Chen Sun
机构: Brown University (布朗大学); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Code and interactive demos at this https URL
Abstract:Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives-such as elastic collisions and falling dominos-teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.
zh
[CV-12] GeoSurDepth: Spatial Geometry-Consistent Self-Supervised Depth Estimation for Surround-View Cameras
【速读】:该论文旨在解决自动驾驶场景中多视角深度估计的准确性问题,尤其是如何在单目和多视角设置下有效利用几何结构信息以提升深度估计的鲁棒性和精度。现有方法主要依赖于光度一致性约束,但忽略了图像间丰富的几何一致性特征。解决方案的关键在于提出GeoSurDepth框架,其核心创新包括:(1)利用基础模型作为伪几何先验,引导网络保持空间三维中的表面法向一致性,并在二维空间中正则化物体与纹理一致的深度估计;(2)设计一种新颖的2D-3D lifting视图合成流程,通过空间扭曲重建密集深度图,从而在时间、空间及时空维度上引入额外的光度监督信号,弥补单视角重建的局限性;(3)提出自适应联合运动学习策略,使网络能动态强化对关键空间几何线索的关注,优化运动推理能力。实验表明,该方法在DDAD和nuScenes数据集上达到当前最优性能,验证了几何一致性在自监督多视角深度估计中的重要价值。
链接: https://arxiv.org/abs/2601.05839
作者: Weimin Liu,Wenjun Wang,Joshua H. Meng
机构: Tsinghua University (清华大学); California PATH, University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate surround-view depth estimation provides a competitive alternative to laser-based sensors and is essential for 3D scene understanding in autonomous driving. While prior studies have proposed various approaches that primarily focus on enforcing cross-view constraints at the photometric level, few explicitly exploit the rich geometric structure inherent in both monocular and surround-view setting. In this work, we propose GeoSurDepth, a framework that leverages geometry consistency as the primary cue for surround-view depth estimation. Concretely, we utilize foundation models as a pseudo geometry prior and feature representation enhancement tool to guide the network to maintain surface normal consistency in spatial 3D space and regularize object- and texture-consistent depth estimation in 2D. In addition, we introduce a novel view synthesis pipeline where 2D-3D lifting is achieved with dense depth reconstructed via spatial warping, encouraging additional photometric supervision across temporal, spatial, and spatial-temporal contexts, and compensating for the limitations of single-view image reconstruction. Finally, a newly-proposed adaptive joint motion learning strategy enables the network to adaptively emphasize informative spatial geometry cues for improved motion reasoning. Extensive experiments on DDAD and nuScenes demonstrate that GeoSurDepth achieves state-of-the-art performance, validating the effectiveness of our approach. Our framework highlights the importance of exploiting geometry coherence and consistency for robust self-supervised multi-view depth estimation.
zh
[CV-13] Boosting Latent Diffusion Models via Disentangled Representation Alignment
【速读】:该论文旨在解决当前隐式扩散模型(Latent Diffusion Models, LDMs)中用于压缩图像空间的变分自编码器(VAE)在语义解耦能力上的不足问题。现有方法通常将视觉基础模型(Vision Foundation Models, VFMs)作为统一的目标来对齐VAE和LDM的潜在表示,但忽略了VAE与LDM在表征需求上的本质差异:LDM更依赖保留高层语义概念,而VAE应擅长语义解耦以结构化地编码属性级信息。解决方案的关键在于提出一种语义解耦VAE(Semantic disentangled VAE, Send-VAE),通过非线性映射网络将VAE潜在变量对齐至预训练VFMs的语义层次结构,从而显式优化语义解耦学习,并实现属性级控制与高层语义的一致性。实验表明,Send-VAE显著加速了流形变换器(SiTs)的训练过程,并在ImageNet 256×256数据集上分别达到1.21和1.75的FID分数(有无分类器自由引导),优于现有方法。
链接: https://arxiv.org/abs/2601.05823
作者: John Page,Xuesong Niu,Kai Wu,Kun Gai
机构: Kolors Team, Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers SiTs; experiments show Send-VAE significantly speeds up training and achieves a state-of-the-art FID of 1.21 and 1.75 with and without classifier-free guidance on ImageNet 256x256.
zh
[CV-14] SceneFoundry: Generating Interactive Infinite 3D Worlds
【速读】:该论文旨在解决当前生成式3D环境方法难以捕捉真实室内场景中功能性复杂结构的问题,尤其是缺乏可操作的活动部件(articulated objects)和语义多样性布局,这限制了机器人学习与具身智能的发展。解决方案的关键在于提出SceneFoundry——一个基于语言引导的扩散框架,通过大语言模型(LLM)控制地板布局生成,并利用基于扩散的后验采样从大规模3D资源库中高效填充带关节的物体;同时引入可微分引导函数以确保物理可用性,如调节物体数量、避免关节碰撞并保留足够的可行走空间,从而生成结构合理、语义一致且功能交互性强的公寓级3D环境,支持可扩展的具身AI研究。
链接: https://arxiv.org/abs/2601.05810
作者: ChunTeng Chen,YiChen Hsu,YiWen Liu,WeiFang Sun,TsaiChing Ni,ChunYi Lee,Min Sun,YuanFu Yang
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); National Tsing Hua University (国立清华大学); NVIDIA AI Technology Center (英伟达人工智能技术中心); National Taiwan University (国立台湾大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 15 pages
Abstract:The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research.
zh
[CV-15] Adaptive Disentangled Representation Learning for Incomplete Multi-View Multi-Label Classification
【速读】:该论文旨在解决多视图多标签学习中普遍存在的特征缺失与标注不完整问题,这类问题通常由数据获取困难和监督成本高昂导致。现有方法在特征恢复、表示解耦和标签语义建模方面存在局限性。解决方案的关键在于提出自适应解耦表示学习方法(Adaptive Disentangled Representation Learning, ADRL),其核心机制包括:通过邻域感知的特征级亲和传播实现鲁棒的视图补全;利用随机掩码策略增强重建效果;通过类别级关联扩散优化标签分布参数以捕捉标签原型间的依赖关系;引入基于互信息的目标函数促进共享表示一致性并抑制视图特有表示与其他模态的信息冗余;同时,ADRL通过标签嵌入与视图表示之间的独立交互完成原型特定特征选择,并生成伪标签引导判别性视图融合。
链接: https://arxiv.org/abs/2601.05785
作者: Quanjiang Li,Zhiming Liu,Tianxiang Xu,Tingjin Luo,Chenping Hou
机构: National University of Defense Technology (国防科技大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳分校); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-view multi-label learning frequently suffers from simultaneous feature absence and incomplete annotations, due to challenges in data acquisition and cost-intensive supervision. To tackle the complex yet highly practical problem while overcoming the existing limitations of feature recovery, representation disentanglement, and label semantics modeling, we propose an Adaptive Disentangled Representation Learning method (ADRL). ADRL achieves robust view completion by propagating feature-level affinity across modalities with neighborhood awareness, and reinforces reconstruction effectiveness by leveraging a stochastic masking strategy. Through disseminating category-level association across label distributions, ADRL refines distribution parameters for capturing interdependent label prototypes. Besides, we formulate a mutual-information-based objective to promote consistency among shared representations and suppress information overlap between view-specific representation and other modalities. Theoretically, we derive the tractable bounds to train the dual-channel network. Moreover, ADRL performs prototype-specific feature selection by enabling independent interactions between label embeddings and view representations, accompanied by the generation of pseudo-labels for each category. The structural characteristics of the pseudo-label space are then exploited to guide a discriminative trade-off during view fusion. Finally, extensive experiments on public datasets and real-world applications demonstrate the superior performance of ADRL.
zh
[CV-16] FlyPose: Towards Robust Human Pose Estimation From Aerial Views WACV
【速读】:该论文旨在解决无人机(UAV)在靠近人类活动环境中执行任务时,如何实现高精度、实时的人类姿态与行为感知问题。由于航拍视角存在分辨率低、俯视角度陡峭以及自遮挡等问题,传统方法难以满足实时性要求。解决方案的关键在于提出 FlyPose,一个轻量级的自顶向下人体姿态估计流水线,通过多数据集联合训练显著提升检测与姿态估计性能(如在 UAV-Human 数据集上 2D 姿态估计 mAP 提升 16.3),并在 Jetson Orin AGX 开发板上实现约 20 毫秒的推理延迟,支持机载部署和飞行实验验证。
链接: https://arxiv.org/abs/2601.05747
作者: Hassaan Farooq,Marvin Brenner,Peter Stütz
机构: Universität der Bundeswehr Munich (联邦国防军大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11 pages, 9 figures, IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly deployed in close proximity to humans for applications such as parcel delivery, traffic monitoring, disaster response and infrastructure inspections. Ensuring safe and reliable operation in these human-populated environments demands accurate perception of human poses and actions from an aerial viewpoint. This perspective challenges existing methods with low resolution, steep viewing angles and (self-)occlusion, especially if the application demands real-time-feasible models. We train and deploy FlyPose, a lightweight top-down human pose estimation pipeline for aerial imagery. Through multi-dataset training, we achieve an average improvement of 6.8 mAP in person detection across the test-sets of Manipal-UAV, VisDrone, HIT-UAV as well as our custom dataset. For 2D human pose estimation we report an improvement of 16.3 mAP on the challenging UAV-Human dataset. FlyPose runs with an inference latency of ~20 milliseconds including preprocessing on a Jetson Orin AGX Developer Kit and is deployed onboard a quadrotor UAV during flight experiments. We also publish FlyPose-104, a small but challenging aerial human pose estimation dataset that includes manual annotations from difficult aerial perspectives: this https URL.
zh
[CV-17] ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers WACV
【速读】:该论文旨在解决人脸图像质量评估(Face Image Quality Assessment, FIQA)问题,即如何在不依赖额外训练的情况下准确衡量人脸图像的质量,从而提升人脸识别系统的可靠性。现有方法通常仅利用模型最终层的特征表示,或需多次前向传播、反向传播等复杂操作;而本文提出ViTNT-FIQA,其核心创新在于通过分析Vision Transformer(ViT)中间层块中patch embedding的演化稳定性来量化图像质量:高质量人脸图像在不同ViT块间展现出稳定的特征精炼轨迹,而低质图像则表现为波动剧烈的变换模式。具体而言,该方法计算连续ViT块间L2归一化patch embedding的欧氏距离,并聚合为图像级质量评分,仅需单次前向传播即可完成评估,无需训练或修改网络结构,具备高效性与即插即用特性。
链接: https://arxiv.org/abs/2601.05741
作者: Guray Ozgur,Eduarda Caldeira,Tahar Chettaoui,Jan Niklas Kolf,Marco Huber,Naser Damer,Fadi Boutros
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at WACV Workshops
Abstract:Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
zh
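该方法的核心打分流程(对相邻 ViT 块的 L2 归一化 patch embedding 求欧氏距离,再聚合为图像级分数)可以用下面的纯 Python 最简示意来理解;其中取均值聚合与取负号的约定为笔者假设,并非论文官方实现:

```python
import math

def l2_normalize(vec):
    # 向量 L2 归一化; 零向量时避免除零
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def vitnt_quality_score(blocks):
    # blocks: 各 ViT 块输出的 patch embedding 列表(每块对应同样的 patch 网格)
    # 相邻块之间归一化 embedding 的欧氏距离越小 => 精炼轨迹越稳定 => 质量越高
    dists = []
    for prev_blk, next_blk in zip(blocks, blocks[1:]):
        for p, q in zip(prev_blk, next_blk):
            a, b = l2_normalize(p), l2_normalize(q)
            dists.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))
    instability = sum(dists) / len(dists)
    return -instability  # 取负: 分数越高代表块间轨迹越稳定

# 两个玩具"图像": 一个逐块平滑精炼, 一个剧烈波动
stable = [[[1.0, 0.0]], [[0.98, 0.05]], [[0.97, 0.08]]]
erratic = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]]]
```

整个打分只需一次前向传播产生的中间特征,与摘要中"training-free、无需反向传播"的设定一致。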
[CV-18] FeatureSLAM: Feature-enriched 3D gaussian splatting SLAM in real time
【速读】:该论文旨在解决传统SLAM(Simultaneous Localization and Mapping,即时定位与地图构建)系统在实时性、语义丰富性和地图保真度之间的权衡问题,尤其是如何在不显著增加计算负担的前提下提升跟踪精度与映射质量。其解决方案的关键在于将3D高斯点绘(3D Gaussian Splatting, 3DGS)的密集特征渲染机制整合进新视角合成流程中,并与视觉基础模型对齐,从而实现语义增强的实时SLAM。该方法不仅支持自由视角下的开放集分割(open-set segmentation),还通过嵌入特征信息显著提升了跟踪稳定性与地图精度(相比固定类别语义SLAM基线,姿态误差降低9%,映射准确率提高8%),同时保持实时性能,为下游任务提供更强的语义支撑。
链接: https://arxiv.org/abs/2601.05738
作者: Christopher Thirgood,Oscar Mendez,Erin Ling,Jon Storey,Simon Hadfield
机构: University of Surrey (萨里大学); I3D Robotics (I3D机器人公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a real-time tracking SLAM system that unifies efficient camera tracking with photorealistic feature-enriched mapping using 3D Gaussian Splatting (3DGS). Our main contribution is integrating dense feature rasterization into the novel-view synthesis, aligned with a visual foundation model. This yields strong semantics, going beyond basic RGB-D input, aiding both tracking and mapping accuracy. Unlike previous semantic SLAM approaches (which embed pre-defined class labels), FeatureSLAM enables entirely new downstream tasks via free-viewpoint, open-set segmentation. Across standard benchmarks, our method achieves real-time tracking on par with state-of-the-art systems while improving tracking stability and map fidelity without prohibitive compute. Quantitatively, we obtain 9% lower pose error and 8% higher mapping accuracy compared to recent fixed-set SLAM baselines. Our results confirm that real-time feature-embedded SLAM is not only valuable for enabling new downstream applications; it also improves the performance of the underlying tracking and mapping subsystems, providing semantic and language masking results that are on-par with offline 3DGS models, alongside state-of-the-art tracking, depth and RGB rendering.
zh
[CV-19] AGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment
【速读】:该论文旨在解决图像到视频(Image-to-Video, I2V)生成模型在应用Group Relative Policy Optimization (GRPO)时难以稳定提升奖励的问题。现有方法在文本到图像或文本到视频任务中表现良好,但在I2V场景下效果不稳定,主要原因在于缺乏对中间潜在表示的有效引导和多样性保障。解决方案的关键在于提出TAGRPO框架,其核心创新包括:1)利用相同初始噪声生成的rollout视频提供更优的优化指导,通过在中间潜空间设计新的GRPO损失函数,使模型输出直接对齐高奖励轨迹并远离低奖励轨迹;2)引入记忆库(memory bank)存储rollout视频以增强多样性并降低计算开销。该方法显著优于先前的DanceGRPO方法,在I2V生成任务中实现了稳定且显著的性能提升。
链接: https://arxiv.org/abs/2601.05729
作者: Jin Wang,Jianxiang Lu,Guangzheng Xu,Comi Chen,Haoyu Yang,Linqing Wang,Peng Chen,Mingtao Chen,Zhichao Hu,Longhuang Wu,Shuai Shao,Qinglin Lu,Ping Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures
Abstract:Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
zh
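摘要中"在中间潜变量上直接对齐高奖励轨迹、远离低奖励轨迹"的对比式思想,可用下面的假设性损失做一个示意(平方距离形式与符号约定为笔者假设,并非论文的原始目标函数):

```python
def sq_dist(a, b):
    # 两个潜变量向量之间的平方欧氏距离
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trajectory_alignment_loss(latent, high_reward, low_reward):
    # 拉近当前中间潜变量与高奖励 rollout 潜变量, 推远低奖励 rollout 潜变量
    pull = sum(sq_dist(latent, h) for h in high_reward) / len(high_reward)
    push = sum(sq_dist(latent, l) for l in low_reward) / len(low_reward)
    return pull - push  # 越小越好

# 同一初始噪声下的两条 rollout 潜变量, 分别作为正/负样本
good = trajectory_alignment_loss([1.0, 0.0], [[1.0, 0.1]], [[-1.0, 0.0]])
bad = trajectory_alignment_loss([-1.0, 0.0], [[1.0, 0.1]], [[-1.0, 0.0]])
```

靠近高奖励轨迹的潜变量获得更低的损失值,即优化会把去噪轨迹推向高奖励的一侧。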
[CV-20] Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation
【速读】:该论文旨在解决从单张图像生成高质量3D角色的难题,尤其针对复杂身体姿态和自遮挡带来的挑战。其解决方案的核心是提出了一种名为RCM(Rotate your Character Model)的图像到视频扩散框架,通过将任意复杂姿态的字符转换为规范姿态(canonical pose),实现全视角轨道上的一致性新视图合成;同时支持1024×1024高分辨率轨道视频生成、可控观察位置以及最多4张多视角输入条件,显著提升了新视图合成与3D生成的质量。
链接: https://arxiv.org/abs/2601.05722
作者: Jin Wang,Jianxiang Lu,Comi Chen,Guangzheng Xu,Haoyu Yang,Peng Chen,Na Zhang,Yifan Xu,Longhuang Wu,Shuai Shao,Qinglin Lu,Ping Luo
机构: Hunyuan, Tencent(腾讯); The University of Hong Kong(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures
Abstract:Generating high-quality 3D characters from single images remains a significant challenge in digital content creation, particularly due to complex body poses and self-occlusion. In this paper, we present RCM (Rotate your Character Model), an advanced image-to-video diffusion framework tailored for high-quality novel view synthesis (NVS) and 3D character generation. Compared to existing diffusion-based approaches, RCM offers several key advantages: (1) transferring characters with any complex poses into a canonical pose, enabling consistent novel view synthesis across the entire viewing orbit, (2) high-resolution orbital video generation at 1024x1024 resolution, (3) controllable observation positions given different initial camera poses, and (4) multi-view conditioning supporting up to 4 input images, accommodating diverse user scenarios. Extensive experiments demonstrate that RCM outperforms state-of-the-art methods in both novel view synthesis and 3D generation quality.
zh
[CV-21] SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动化图表理解任务中因缺乏细粒度信用分配机制而导致的推理能力不足问题。现有基于强化学习(Reinforcement Learning, RL)的MLLMs通常仅在轨迹层面进行优势估计,无法区分单个生成响应中正确与错误的推理步骤,从而限制了模型的精细优化能力。其解决方案的关键在于提出SketchVL模型与FinePO算法:SketchVL通过将中间推理步骤以标记形式绘制在图像上并反馈给自身,构建多步推理流程;而FinePO则引入细粒度过程奖励模型(Fine-grained Process Reward Model, FinePRM),对每个轨迹内的绘图动作进行评分,实现对每一步推理行为的精准信用分配——即当整个轨迹成功时更强地奖励正确token,失败时更严厉惩罚错误token,从而获得细粒度的强化信号,显著提升模型在图表理解、自然图像和数学推理等任务上的性能表现。
链接: https://arxiv.org/abs/2601.05688
作者: Muye Huang,Lingling Zhang,Yifei Li,Yaqiang Wu,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Zhongguancun Academy (中关村学院); Lenovo Research (联想研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Charts are high-density visual carriers of complex data and medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that is optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL’s methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.
zh
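"轨迹级结果调制步级信用"的思路可以用一个极简函数来示意:整条轨迹成功时更强地奖励高分步骤,失败时更重地惩罚低分步骤。下面的函数接口与缩放形式均为笔者假设,仅用于说明细粒度信用分配的方向,并非 FinePO 的实际公式:

```python
def fine_grained_advantages(step_scores, trajectory_success):
    # step_scores: FinePRM 类过程奖励模型对每个绘图/推理步骤的打分, 取值 [0, 1]
    if trajectory_success:
        # 轨迹整体成功: 高分(正确)步骤获得更大的正优势
        return [s for s in step_scores]
    # 轨迹整体失败: 低分(错误)步骤受到更重的负优势
    return [-(1.0 - s) for s in step_scores]

adv_success = fine_grained_advantages([0.9, 0.2], True)
adv_failure = fine_grained_advantages([0.9, 0.2], False)
```

与轨迹级 GRPO 不同,这种按步赋值使同一条回复内的正确与错误步骤得到不同方向或不同强度的更新信号。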
[CV-22] AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces
【速读】:该论文旨在解决基于离散化(discretization)的生成模型在处理混合离散-连续序列时的可扩展性限制问题,特别是在高精度领域(如半导体电路设计)中,离散化导致的精度损失可能引发功能失效。解决方案的关键在于提出AGDC(Autoregressive Generation with Discrete-Continuous modeling),这是一个统一框架,通过联合建模离散值与连续值来生成变长序列;其核心技术包括:1)基于多层感知机(MLP)动态调整结束符(EOS)logits的机制,以适应序列上下文变化;2)将长度正则化项嵌入损失函数,从而提升生成序列的可控性和稳定性。该方法在ContLayNet等高精度基准上验证了其优越性,实现了跨领域的高保真度混合向量表示生成。
链接: https://arxiv.org/abs/2601.05680
作者: Yeonsang Shin,Insoo Kim,Bongkeun Kim,Keonwoo Bae,Bohyung Han
机构: Seoul National University (首尔国立大学); Samsung Electronics (三星电子)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Transformer-based autoregressive models excel in data generation but are inherently constrained by their reliance on discretized tokens, which limits their ability to represent continuous values with high precision. We analyze the scalability limitations of existing discretization-based approaches for generating hybrid discrete-continuous sequences, particularly in high-precision domains such as semiconductor circuit designs, where precision loss can lead to functional failure. To address the challenge, we propose AGDC, a novel unified framework that jointly models discrete and continuous values for variable-length sequences. AGDC employs a hybrid approach that combines categorical prediction for discrete values with diffusion-based modeling for continuous values, incorporating two key technical components: an end-of-sequence (EOS) logit adjustment mechanism that uses an MLP to dynamically adjust EOS token logits based on sequence context, and a length regularization term integrated into the loss function. Additionally, we present ContLayNet, a large-scale benchmark comprising 334K high-precision semiconductor layout samples with specialized evaluation metrics that capture functional correctness where precision errors significantly impact performance. Experiments on semiconductor layouts (ContLayNet), graphic layouts, and SVGs demonstrate AGDC’s superior performance in generating high-fidelity hybrid vector representations compared to discretization-based and fixed-schema baselines, achieving scalable high-precision generation across diverse domains.
zh
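摘要提到的两个关键组件——基于 MLP 的 EOS logit 动态调整与损失中的长度正则项——可用下面的纯 Python 草图说明。其中"单层线性 MLP"与二次长度惩罚均为示意性假设,非论文实现:

```python
def adjust_eos_logit(logits, eos_index, context_feature, weights, bias):
    # 用一个最简"单层 MLP"(线性层)基于序列上下文特征给 EOS logit 加偏移
    offset = sum(w * f for w, f in zip(weights, context_feature)) + bias
    out = list(logits)
    out[eos_index] += offset
    return out

def length_regularized_loss(base_loss, seq_len, ref_len, weight=0.01):
    # 在基础损失上叠加长度正则项, 惩罚与参考长度的偏离(二次形式为示意假设)
    return base_loss + weight * (seq_len - ref_len) ** 2
```

上下文特征偏向"应当结束"时,EOS logit 被抬高,从而控制变长序列的终止时机;长度正则则在训练时约束生成长度的分布。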
[CV-23] SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
【速读】:该论文旨在解决当前基于视觉语言模型(Vision-Language Models, VLMs)的端到端自动驾驶方法在复杂场景下规划能力不足的问题。由于VLMs是通用型模型,缺乏对三维空间和时间维度中驾驶特有推理的理解,导致其难以构建结构化的时空表征以捕捉几何关系、场景上下文和运动模式,从而影响安全轨迹规划。解决方案的关键在于提出SGDrive框架,该框架通过引入一个“场景-代理-目标”(scene-agent-goal)层次结构,显式地将VLM的表示学习与驾驶知识体系对齐,模拟人类驾驶员的认知过程:首先理解整体环境(场景上下文),再关注关键交通参与者及其行为,最后制定短期目标并执行动作。这一层次化设计为通用VLM提供了结构化的时空表征,实现了多层级信息的紧凑整合,显著提升了仅使用摄像头输入的自动驾驶系统在NAVSIM基准上的表现。
链接: https://arxiv.org/abs/2601.05640
作者: Jingyu Li,Junjie Wu,Dongnan Hu,Xiangkai Huang,Bin Sun,Zhihui Hao,Xianpeng Lang,Xiatian Zhu,Li Zhang
机构: Fudan University (复旦大学); Shanghai Innovation Institute; Li Auto Inc.; Tongji University (同济大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM’s representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
zh
[CV-24] Compressing image encoders via latent distillation
【速读】:该论文旨在解决深度学习图像压缩模型在硬件资源受限场景下的部署难题,即这些模型通常结构复杂、计算开销大且依赖大量训练数据。其解决方案的关键在于通过简化知识蒸馏策略,对原模型的编码器进行部分压缩,从而在减少训练数据需求和缩短训练时间的前提下,有效逼近原始模型的潜在空间(latent space),生成轻量化编码器。实验表明,该方法在保持重建质量和统计保真度方面优于直接用原损失函数训练轻量级编码器的方式,适用于资源受限环境。
链接: https://arxiv.org/abs/2601.05639
作者: Caroline Mazini Rodrigues(IRISA, CNRS),Nicolas Keriven(CNRS, IRISA, COMPACT),Thomas Maugey(COMPACT)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to partially compress these networks by reducing the size of their encoders. Our approach uses a simplified knowledge distillation strategy to approximate the latent space of the original models with less data and shorter training, yielding lightweight encoders from heavyweight ones. We evaluate the resulting lightweight encoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity better than training lightweight encoders with the original loss, making it practical for resource-limited environments.
zh
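该压缩思路的核心是让轻量编码器的潜变量逼近冻结教师编码器的潜变量。下面用一个一维线性"学生编码器"做最简示意(具体学生结构与损失权重为笔者假设):

```python
def latent_distillation_loss(student_latent, teacher_latent):
    # 学生编码器输出与冻结教师潜变量之间的均方误差(MSE)
    n = len(student_latent)
    return sum((s - t) ** 2 for s, t in zip(student_latent, teacher_latent)) / n

def distill_step(w, x, teacher_latent, lr=0.1):
    # 一维线性学生编码器 z = w * x 对蒸馏损失做一步梯度下降
    grads = [2 * (w * xi - t) * xi / len(x) for xi, t in zip(x, teacher_latent)]
    return w - lr * sum(grads)

# 玩具数据: 教师编码器相当于 z = 2 * x
w = 0.0
x = [1.0, 2.0]
teacher = [2.0, 4.0]
for _ in range(200):
    w = distill_step(w, x, teacher)
```

学生参数收敛到教师映射(这里 w→2),且不需要原模型的端到端重建损失,对应摘要中"更少数据、更短训练"的动机。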
[CV-25] Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer
【速读】:该论文旨在解决持续学习(Continual Learning, CL)中不仅需克服灾难性遗忘(Catastrophic Forgetting, CF),还需促进正向知识迁移(Forward Knowledge Transfer, FKT)与反向知识迁移(Backward Knowledge Transfer, BKT)的问题。其解决方案的关键在于将CL建模为一个优化问题,约束条件为FKT和BKT均为正值,并提出了一种增强型任务持续学习方法(Enhanced Task Continual Learning, ETCL)。ETCL通过学习任务特定的二值掩码来隔离稀疏子网络以保留密集网络性能,同时在新任务开始时对齐梯度以确保正向知识迁移;并通过双目标优化策略与正交梯度投影方法,仅更新分类层中与前序相似任务相关的权重,从而实现正向和反向的知识迁移,最终在不同任务序列上显著优于现有强基线方法。
链接: https://arxiv.org/abs/2601.05623
作者: Zhi Wang,Zhongbin Wu,Yanni Li,Bing Liu,Guangxi Li,Yuping Wang
机构: Xidian University (西安电子科技大学); University of Illinois at Chicago (芝加哥大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing research on continual learning (CL) of a sequence of tasks focuses mainly on dealing with catastrophic forgetting (CF) to balance the learning plasticity of new tasks and the memory stability of old tasks. However, an ideal CL agent should not only be able to overcome CF, but also encourage positive forward and backward knowledge transfer (KT), i.e., using the learned knowledge from previous tasks for the new task learning (namely FKT), and improving the previous tasks’ performance with the knowledge of the new task (namely BKT). To this end, this paper first models CL as an optimization problem in which each sequential learning task aims to achieve its optimal performance under the constraint that both FKT and BKT should be positive. It then proposes a novel Enhanced Task Continual Learning (ETCL) method, which achieves forgetting-free and positive KT. Furthermore, the bounds that can lead to negative FKT and BKT are estimated theoretically. Based on the bounds, a new strategy for online task similarity detection is also proposed to facilitate positive KT. To overcome CF, ETCL learns a set of task-specific binary masks to isolate a sparse sub-network for each task while preserving the performance of a dense network for the task. At the beginning of a new task learning, ETCL tries to align the new task’s gradient with that of the sub-network of the previous most similar task to ensure positive FKT. By using a new bi-objective optimization strategy and an orthogonal gradient projection method, ETCL updates only the weights of previous similar tasks at the classification layer to achieve positive BKT. Extensive evaluations demonstrate that the proposed ETCL markedly outperforms strong baselines on dissimilar, similar, and mixed task sequences.
zh
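摘要中的两个梯度操作——防止反向干扰的正交梯度投影,以及新任务开始时向最相似旧任务梯度的对齐——可用如下向量运算示意(对齐系数等超参为笔者假设):

```python
def project_out(grad, ref):
    # 剔除梯度 grad 在 ref 方向上的分量(正交梯度投影), 避免破坏旧任务
    dot_gr = sum(g * r for g, r in zip(grad, ref))
    dot_rr = sum(r * r for r in ref) or 1.0
    c = dot_gr / dot_rr
    return [g - c * r for g, r in zip(grad, ref)]

def align_toward(grad, ref, strength=0.5):
    # 新任务起始时, 将梯度向最相似旧任务子网络的梯度方向做部分对齐
    return [(1 - strength) * g + strength * r for g, r in zip(grad, ref)]

proj = project_out([1.0, 1.0], [1.0, 0.0])
aligned = align_toward([1.0, 0.0], [0.0, 1.0])
```

投影后的梯度与旧任务梯度方向正交,更新不再沿旧任务的敏感方向移动;对齐操作则把正向知识迁移(FKT)编码为梯度方向上的先验。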
[CV-26] LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction
【速读】:该论文旨在解决当前端到端自动驾驶模型在罕见长尾场景下表现不佳的问题,其根源在于训练数据中场景多样性不足。同时,现有视觉-语言-动作(Vision-Language-Action, VLA)模型面临三大挑战:轨迹预测的数值精度受限于离散化分词、对语言标注的高度依赖引入语义偏置与标注负担,以及多步链式推理导致计算效率低下,难以实现实时部署。解决方案的关键在于提出LatentVLA框架,通过自监督潜在动作预测机制,在无需语言标注的情况下训练VLA模型,从而消除语言偏置并从无标签轨迹数据中学习丰富的驾驶表征;进一步利用知识蒸馏技术,将VLA模型的泛化能力迁移至轻量级视觉网络,实现高性能与实时性的统一。
链接: https://arxiv.org/abs/2601.05611
作者: Chengen Xie,Bin Sun,Tianyu Li,Junjie Wu,Zhihui Hao,XianPeng Lang,Hongyang Li
机构: Shanghai Innovation Institute; OpenDriveLab at The University of Hong Kong; Li Auto Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations that introduce linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.
zh
[CV-27] Learning Geometric Invariance for Gait Recognition
【速读】:该论文旨在解决步态识别中跨视角、跨服装等不同步态条件下的身份不变特征提取难题,现有方法多依赖数据驱动的隐式学习来拉近不同条件下的步态差异,但缺乏对不同步态条件之间内在关系的显式建模。其解决方案的关键在于提出一种新视角:将不同步态条件的变化近似视为几何变换(如反射、旋转和缩放)的组合,并通过构建几何不变性来实现身份不变性。为此,作者设计了 RRS-Gait 框架,首先基于特定几何变换灵活调整卷积核以实现近似的特征等变性,随后将三种等变感知特征分别输入全局池化层进行最终的不变性学习,从而在多个主流步态数据集上取得了优异性能。
链接: https://arxiv.org/abs/2601.05604
作者: Zengbin Wang,Junjie Li,Saihui Hou,Xu Liu,Chunshui Cao,Yongzhen Huang,Muyi Sun,Siye Wang,Man Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Qinghai Institute of Technology (青海理工学院); Beijing Normal University (北京师范大学); Watrix Technology Limited Co. Ltd (威马科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner to pull different gait conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective to achieve gait recognition: variations in different gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need is to determine the types of geometric transformations and achieve geometric invariance, then identity invariance naturally follows. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a Reflect-Rotate-Scale invariance learning framework, named RRS-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.
zh
[CV-28] Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在处理以形状为主导的数据(如插图和草图)时性能下降的问题,这是因为CNN固有的局部纹理偏好(texture bias)会抑制对全局形状信息的提取。其解决方案的关键在于提出一个基于结构相似性指数(Structural Similarity Index, SSIM)的数据驱动度量方法,用于量化数据集中形状与纹理的平衡程度;在此基础上,进一步设计了一种计算高效的适应性调整策略,通过修改最大池化操作的膨胀率(dilation)来增强形状偏置,同时保持卷积层权重冻结,从而在低数据场景下仅需微调分类层即可显著提升分类准确率。
链接: https://arxiv.org/abs/2601.05599
作者: Takito Sawada,Akinori Iwata,Masahiro Okuda
机构: Doshisha University (同志社大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to IEVC 2026. 4 pages, 1 figure, 3 tables
Abstract:Convolutional Neural Networks (CNNs) are known to exhibit a strong texture bias, favoring local patterns over global shape information–a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this gap, we propose a data-driven metric that quantifies the shape-texture balance of a dataset by computing the Structural Similarity Index (SSIM) between each image’s luminance channel and its L0-smoothed counterpart. Building on this metric, we further introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results show that this approach consistently improves classification accuracy on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical, requiring training only the final classification layer.
zh
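文中"通过修改最大池化的膨胀率(dilation)诱导形状偏置"的操作,可以用一维最大池化直观说明:膨胀率大于 1 时,同样的核在更大的空间跨度上取最大值,从而扩大感受野、弱化局部纹理响应。下面是一个纯 Python 草图(一维化与边界处理方式为示意假设):

```python
def maxpool1d(x, kernel=2, dilation=1):
    # 带膨胀率的一维最大池化: dilation>1 时核内采样点彼此间隔 dilation
    span = (kernel - 1) * dilation + 1
    return [max(x[i + k * dilation] for k in range(kernel))
            for i in range(len(x) - span + 1)]

signal = [1, 3, 2, 5, 4]
dense = maxpool1d(signal, kernel=2, dilation=1)    # 常规最大池化
dilated = maxpool1d(signal, kernel=2, dilation=2)  # 膨胀最大池化, 感受野更大
```

由于只改动池化的采样位置而不改动卷积权重,这一调整与摘要中"冻结卷积层、仅微调分类层"的低成本适配方式是相容的。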
[CV-29] GS-DMSR: Dynamic Sensitive Multi-scale Manifold Enhancement for Accelerated High-Quality 3D Gaussian Splatting
【速读】:该论文旨在解决3D动态场景重建中模型收敛速度与渲染质量之间的权衡问题,尤其针对具有复杂动态运动的高精度建模场景。其解决方案的关键在于提出GS-DMSR方法:通过定量分析高斯属性(Gaussian attributes)的动态演化过程,实现自适应梯度聚焦,从而动态识别高斯模型在不同运动状态下的显著差异,并对显著程度不同的模型应用差异化优化策略,显著提升收敛速度;同时引入多尺度流形增强模块,协同优化隐式非线性解码器与显式变形场,增强复杂形变场景的建模效率。
链接: https://arxiv.org/abs/2601.05584
作者: Nengbo Lu,Minghua Pan,Shaohua Sun,Yizhou Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In the field of 3D dynamic scene reconstruction, how to balance model convergence rate and rendering quality has long been a critical challenge that urgently needs to be addressed, particularly in high-precision modeling of scenes with complex dynamic motions. To tackle this issue, this study proposes the GS-DMSR method. By quantitatively analyzing the dynamic evolution process of Gaussian attributes, this mechanism achieves adaptive gradient focusing, enabling it to dynamically identify significant differences in the motion states of Gaussian models. It then applies differentiated optimization strategies to Gaussian models with varying degrees of significance, thereby significantly improving the model convergence rate. Additionally, this research integrates a multi-scale manifold enhancement module, which leverages the collaborative optimization of an implicit nonlinear decoder and an explicit deformation field to enhance the modeling efficiency for complex deformation scenes. Experimental results demonstrate that this method achieves a frame rate of up to 96 FPS on synthetic datasets, while effectively reducing both storage overhead and training time. Our code and data are available at this https URL.
zh
[CV-30] Generalizable and Adaptive Continual Learning Framework for AI-generated Image Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 伪造图像在互联网上的恶意滥用与广泛传播问题,尤其针对当前检测方法难以泛化到未见过的生成模型、且随着生成技术快速演进而持续失效的挑战。其解决方案的关键在于提出一个三阶段域持续学习框架:第一阶段采用参数高效微调策略构建具备强泛化能力的离线检测模型;第二阶段引入未见数据流进行持续学习,通过逐步增加复杂度的数据增强链和 Kronecker-Factored Approximate Curvature (K-FAC) 方法缓解过拟合与灾难性遗忘;第三阶段利用基于线性模式连通性的线性插值策略捕捉不同生成模型间的共性特征,从而显著提升整体检测性能。
链接: https://arxiv.org/abs/2601.05580
作者: Hanyi Wang,Jun Lan,Yaoyu Kang,Huijia Zhu,Weiqiang Wang,Zhuosheng Zhang,Shilin Wang
机构: Shanghai Jiao Tong University (上海交通大学); Antgroup (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMM 2025
Abstract:The malicious misuse and widespread dissemination of AI-generated images pose a significant threat to the authenticity of online information. Current detection methods often struggle to generalize to unseen generative models, and the rapid evolution of generative techniques continuously exacerbates this challenge. Without adaptability, detection models risk becoming ineffective in real-world applications. To address this critical issue, we propose a novel three-stage domain continual learning framework designed for continuous adaptation to evolving generative models. In the first stage, we employ a strategic parameter-efficient fine-tuning approach to develop a transferable offline detection model with strong generalization capabilities. Building upon this foundation, the second stage integrates unseen data streams into a continual learning process. To efficiently learn from limited samples of novel generated models and mitigate overfitting, we design a data augmentation chain with progressively increasing complexity. Furthermore, we leverage the Kronecker-Factored Approximate Curvature (K-FAC) method to approximate the Hessian and alleviate catastrophic forgetting. Finally, the third stage utilizes a linear interpolation strategy based on Linear Mode Connectivity, effectively capturing commonalities across diverse generative models and further enhancing overall performance. We establish a comprehensive benchmark of 27 generative models, including GANs, deepfakes, and diffusion models, chronologically structured up to August 2024 to simulate real-world scenarios. Extensive experiments demonstrate that our initial offline detectors surpass the leading baseline by +5.51% in terms of mean average precision. Our continual learning strategy achieves an average accuracy of 92.20%, outperforming state-of-the-art methods.
zh
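第三阶段"基于线性模式连通性的线性插值"本质上是对两套模型权重做逐元素凸组合,以捕捉不同生成模型上训练出的检测器之间的共性。最简示意如下(alpha 的选取策略为笔者假设):

```python
def interpolate_checkpoints(w_a, w_b, alpha=0.5):
    # 基于线性模式连通性, 对两个 checkpoint 的权重做逐元素线性插值
    return [(1 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]

merged = interpolate_checkpoints([0.0, 2.0], [2.0, 4.0], alpha=0.5)
```

线性模式连通性的含义是:若两套权重位于同一低损失连通区域,插值路径上的模型同样保持低损失,因此插值结果可兼顾新旧生成模型上的检测能力。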
[CV-31] Orient Anything V2: Unifying Orientation and Rotation Understanding NEURIPS2025
【速读】:该论文旨在解决单图或双图场景下物体3D方向(orientation)与旋转(rotation)统一理解的问题,尤其针对具有不同旋转对称性的物体难以准确估计其相对旋转的挑战。解决方案的关键在于四项创新:1)利用生成式AI(Generative AI)合成可扩展的3D资产,实现类别覆盖广且数据分布均衡;2)提出一种模型驱动的注释系统,能鲁棒地识别每个物体的0至N个有效前向面;3)设计一种考虑对称性的周期性分布拟合目标,有效建模物体旋转对称性并捕获所有可能的前向朝向;4)构建多帧架构,直接预测物体间的相对旋转。这些改进使Orient Anything V2在11个主流基准上实现了零样本(zero-shot)方向估计、6DoF位姿估计和物体对称性识别的最先进性能,显著提升了其在下游任务中的泛化能力与适用范围。
链接: https://arxiv.org/abs/2601.05573
作者: Zehan Wang,Ziang Zhang,Jiayang Xu,Jialei Wang,Tianyu Pang,Chao Du,HengShuang Zhao,Zhou Zhao
机构: Zhejiang University (浙江大学); Shanghai AI Lab; Sea AI Lab; The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 Spotlight, Repo: this https URL
Abstract:This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.
zh
[CV-32] Towards Generalized Multi-Image Editing for Unified Multimodal Models
【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在处理多图像编辑任务时面临的视觉一致性维持困难和跨图像视觉线索歧义性问题,尤其是在引用多个输入图像中的细节时难以保持语义一致性和准确的图像身份区分。其解决方案的关键在于提出一种可扩展的多图像编辑框架,包含两项核心创新:一是引入可学习的潜在分离器(learnable latent separators),在潜在空间中显式区分每张参考图像,实现精确且解耦的条件控制;二是采用正弦索引编码(sinusoidal index encoding),为来自同一图像的视觉标记分配连续的正弦索引嵌入,从而提供明确的图像身份信息,并支持对不同数量输入图像的泛化与外推能力。
链接: https://arxiv.org/abs/2601.05572
作者: Pengcheng Xu,Peng Tang,Donghao Luo,Xiaobin Hu,Weichu Cui,Qingdong He,Zhennan Chen,Jiangning Zhang,Charles Ling,Boyu Wang
机构: Western University (西安大略大学); Tencent YouTu Lab (腾讯优图实验室); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they are limited to maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts. Algorithmically, we introduce two innovations: 1) The learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. 2) The sinusoidal index encoding assigns visual tokens from the same image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation on a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark using an inverse dataset construction methodology to guarantee artifact-free, achievable outputs. Experiments show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, validating our advantages on consistency and generalization ability.
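摘要中的 sinusoidal index encoding 与 Transformer 的正弦位置编码同源:为第 i 张输入图像的所有视觉 token 附加同一个连续正弦索引嵌入,作为显式的图像身份信号,并可外推到训练时未见的输入数量。以下为一个可运行的示意(变量名与维度均为笔者假设,非论文官方实现):

```python
import numpy as np

def sinusoidal_index_embedding(image_index, dim=16):
    """Continuous sinusoidal embedding of an image index (transformer-style).

    All visual tokens from image i share this embedding, giving the model an
    explicit, extrapolatable identity signal for a variable number of inputs.
    """
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = image_index * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def tag_tokens(token_batches, dim=16):
    """Append each image's index embedding to every one of its tokens."""
    tagged = []
    for i, tokens in enumerate(token_batches):   # tokens: (n_tokens, d_model)
        idx = np.tile(sinusoidal_index_embedding(i, dim), (tokens.shape[0], 1))
        tagged.append(np.concatenate([tokens, idx], axis=1))
    return np.concatenate(tagged, axis=0)

tokens_img0 = np.zeros((3, 8))   # 3 tokens from image 0
tokens_img1 = np.zeros((2, 8))   # 2 tokens from image 1
out = tag_tokens([tokens_img0, tokens_img1])
```

由于嵌入是索引的连续函数,推理时可直接代入更大的图像索引,这正是摘要所称"对可变输入数量的泛化与外推能力"的来源。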
zh
[CV-33] What's Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews
【速读】:该论文旨在解决社交媒体新闻预览(图像-标题组合)引发的“解释漂移”问题,即即使内容事实正确,因关键背景信息被有意省略,导致读者形成与原文主旨不符的判断,这种隐蔽性误导比显性虚假信息更难检测且研究不足。解决方案的关键在于构建MM-Misleading基准,通过多阶段流水线分离并模拟基于预览与基于全文的理解差异,并提出OMGuard框架:其核心包括(1)解释感知微调(Interpretation-Aware Fine-Tuning),提升多模态误导检测能力;(2)基于理由引导的内容修正(Rationale-Guided Misleading Content Correction),利用显式推理指导标题重写以降低误导印象。实验表明,OMGuard使8B模型的检测准确率媲美235B规模的大型视觉语言模型(Large Vision-Language Model, LVLM),并在端到端修正效果上显著优于基线。
链接: https://arxiv.org/abs/2601.05563
作者: Fanxiao Li,Jiaying Wu,Tingchao Fu,Dayang Li,Herun Wan,Wei Zhou,Min-Yen Kan
机构: Yunnan University (云南大学); National University of Singapore (新加坡国立大学); Xi’an Jiaotong University (西安交通大学); Yunnan-Malaya Institute (School of Engineering) (云南-马来亚研究院(工程学院))
类目: Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
备注:
Abstract:Even when factually correct, social-media news previews (image-headline pairs) can induce interpretation drift: by selectively omitting crucial context, they lead readers to form judgments that diverge from what the full article conveys. This covert harm is harder to detect than explicit misinformation yet remains underexplored. To address this gap, we develop a multi-stage pipeline that disentangles and simulates preview-based versus context-based understanding, enabling construction of the MM-Misleading benchmark. Using this benchmark, we systematically evaluate open-source LVLMs and uncover pronounced blind spots in omission-based misleadingness detection. We further propose OMGuard, which integrates (1) Interpretation-Aware Fine-Tuning, which improves multimodal misleadingness detection, and (2) Rationale-Guided Misleading Content Correction, which uses explicit rationales to guide headline rewriting and reduce misleading impressions. Experiments show that OMGuard lifts an 8B model’s detection accuracy to match a 235B LVLM and delivers markedly stronger end-to-end correction. Further analysis reveals that misleadingness typically stems from local narrative shifts (e.g., missing background) rather than global frame changes, and identifies image-driven scenarios where text-only correction fails, highlighting the necessity of visual interventions.
zh
[CV-34] Semi-Supervised Facial Expression Recognition based on Dynamic Threshold and Negative Learning
【速读】:该论文旨在解决面部表情识别(Facial Expression Recognition, FER)任务中因标注数据获取成本高而导致的模型性能受限问题,提出了一种半监督学习方法以充分利用有限的标注数据和大量未标注数据。其解决方案的关键在于两个核心机制:一是动态阈值调整(Dynamic Threshold Adjustment, DTA),用于自适应地筛选高置信度的未标注样本以增强模型泛化能力;二是选择性负样本学习(Selective Negative Learning, SNL),通过挖掘低置信度未标注样本中的互补标签信息,有效利用其中蕴含的表情特征,从而提升模型对复杂表情类别的判别能力。实验表明,该方法在RAF-DB和AffectNet数据集上均达到当前最优性能,且在仅使用部分标注数据时优于完全监督方法,验证了其有效性。
链接: https://arxiv.org/abs/2601.05556
作者: Zhongpeng Cai,Jun Yu,Wei Xu,Tianyu Liu,Jianqing Sun,Jiaen Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Facial expression recognition is a key task in human-computer interaction and affective computing. However, acquiring a large amount of labeled facial expression data is often costly. Therefore, it is particularly important to design a semi-supervised facial expression recognition algorithm that makes full use of both labeled and unlabeled data. In this paper, we propose a semi-supervised facial expression recognition algorithm based on Dynamic Threshold Adjustment (DTA) and Selective Negative Learning (SNL). Initially, we designed strategies for local attention enhancement and random dropout of feature maps during feature extraction, which strengthen the representation of local features while ensuring the model does not overfit to any specific local area. Furthermore, this study introduces a dynamic thresholding method to adapt to the requirements of the semi-supervised learning framework for facial expression recognition tasks, and through a selective negative learning strategy, it fully utilizes unlabeled samples with low confidence by mining useful expression information from complementary labels, achieving impressive results. We have achieved state-of-the-art performance on the RAF-DB and AffectNet datasets. Our method surpasses fully supervised methods even without using the entire dataset, which proves the effectiveness of our approach.
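摘要中"动态阈值 + 选择性负学习"的核心思想可以用如下单样本损失示意(阈值取值与函数签名为笔者假设,非论文官方代码):高置信度样本按伪标签做交叉熵;低置信度样本则对模型几乎排除的类别(互补标签)施加负学习损失 -log(1-p_k),从而也能从不确定样本中提取监督信号。

```python
import numpy as np

def semi_supervised_loss(probs, tau=0.95, neg_tau=0.05):
    """Per-sample loss switching between pseudo-labeling and negative learning.

    probs: (C,) softmax output for one unlabeled sample.
    - confident sample (max prob >= tau): standard CE on the pseudo-label;
    - otherwise: negative learning on complementary labels, i.e. classes the
      model is confident the sample does NOT belong to (prob <= neg_tau),
      via -log(1 - p_k).
    """
    eps = 1e-12
    if probs.max() >= tau:
        return float(-np.log(probs.max() + eps))    # positive pseudo-label CE
    neg = probs <= neg_tau                          # complementary labels
    if not neg.any():
        return 0.0                                  # sample skipped this step
    return float(-np.log(1.0 - probs[neg] + eps).mean())

loss_conf = semi_supervised_loss(np.array([0.96, 0.02, 0.02]))  # confident
loss_neg = semi_supervised_loss(np.array([0.50, 0.46, 0.04]))   # negative learning
loss_skip = semi_supervised_loss(np.array([0.50, 0.30, 0.20]))  # no usable labels
```

论文中的阈值是随训练动态调整的(DTA);这里为简明起见取为常数。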
zh
[CV-35] One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection
【速读】:该论文旨在解决通用视觉异常检测(Universal Visual Anomaly Detection, UVAD)中现有方法因复杂提示工程、繁琐的适配模块和挑战性的训练策略而导致灵活性与泛化能力受限的问题。其解决方案的关键在于重新审视视觉-语言模型在异常检测中的基本机制,提出一种极度简化且高效的框架UniADet:首先发现语言编码器并非必要,进而设计了一种完全解耦分类与分割任务的方法,同时解耦跨层级特征表示,仅通过学习独立权重实现多任务处理;该方法仅需0.002M可学习参数,具有高度参数效率、通用性(适配多种基础模型)和有效性(在14个真实世界基准上显著优于现有零样本/少样本及全监督方法)。
链接: https://arxiv.org/abs/2601.05552
作者: Bin-Bin Gao,Chengjie Wang
机构: Tencent YouTu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 5 figures, 34 tables
Abstract:Universal visual anomaly detection (AD) aims to identify anomaly images and segment anomaly regions towards open and dynamic scenarios, following zero- and few-shot paradigms without any dataset-specific fine-tuning. Recent approaches have made significant progress through the widespread use of visual-language foundation models. However, current methods often struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, ultimately limiting their flexibility and generality. To address these issues, this paper rethinks the fundamental mechanism behind visual-language models for AD and presents an embarrassingly simple, general, and effective framework for Universal vision Anomaly Detection (UniADet). Specifically, we first find that the language encoder is used to derive decision weights for anomaly classification and segmentation, and then demonstrate that it is unnecessary for universal AD. Second, we propose an embarrassingly simple method to completely decouple classification and segmentation, and decouple cross-level features, i.e., learning independent weights for different tasks and hierarchical features. UniADet is highly simple (learning only decoupled weights), parameter-efficient (only 0.002M learnable parameters), general (adapting a variety of foundation models), and effective (surpassing state-of-the-art zero-/few-shot methods by a large margin and even full-shot AD methods for the first time) on 14 real-world AD benchmarks covering both industrial and medical domains. We will make the code and model of UniADet available at this https URL.
zh
[CV-36] VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在生成文本时易出现幻觉(hallucination)的问题,即模型输出内容与输入图像语义不一致的现象。现有方法多依赖于输出logits或外部验证工具,未能深入挖掘模型内部机制。论文提出VIB-Probe框架,其核心创新在于利用变分信息瓶颈(Variational Information Bottleneck, VIB)理论,从模型内部注意力头中提取具有判别性的表征模式,同时通过信息瓶颈原则过滤语义噪声;进一步地,基于VIB探针的梯度可识别对幻觉具有强因果影响的注意力头,并设计推理阶段的干预策略以实现幻觉抑制。该方案显著提升了检测与缓解幻觉的能力,在多个基准测试中优于现有方法。
链接: https://arxiv.org/abs/2601.05547
作者: Feiran Zhang,Yixin Wu,Zhenghua Wang,Xiaohua Wang,Changze Lv,Xuanjing Huang,Xiaoqing Zheng
机构: Fudan University (复旦大学); Shanghai Key Laboratory of Intelligent Information Processing (上海市智能信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal tasks, but remain susceptible to hallucinations, where generated text deviates from the underlying visual content. Existing hallucination detection methods primarily rely on output logits or external verification tools, often overlooking their internal mechanisms. In this work, we investigate the outputs of internal attention heads, postulating that specific heads carry the primary signals for truthful generation. However, directly probing these high-dimensional states is challenging due to the entanglement of visual-linguistic syntax and noise. To address this, we propose VIB-Probe, a novel hallucination detection and mitigation framework leveraging the Variational Information Bottleneck (VIB) theory. Our method extracts discriminative patterns across layers and heads while filtering out semantic nuisances through the information bottleneck principle. Furthermore, by leveraging the gradients of our VIB probe, we identify attention heads with strong causal influence on hallucinations and introduce an inference-time intervention strategy for hallucination mitigation. Extensive experiments across diverse benchmarks demonstrate that VIB-Probe significantly outperforms existing baselines in both settings. Our code will be made publicly available.
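变分信息瓶颈(VIB)探针的目标函数通常写作 BCE + β·KL(q(z|h) ‖ N(0,I)):瓶颈变量 z 既要保留判别幻觉所需的信息,又被 KL 项压缩以滤除无关语义噪声。下面是一个仅含前向计算的最小示意(权重随机初始化,所有名称均为笔者假设,非论文官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def vib_probe_loss(h, W_mu, W_logvar, w_cls, y, beta=1e-3):
    """One step of a variational-information-bottleneck probe (forward only).

    h: (d,) attention-head features. The probe encodes them into a stochastic
    bottleneck z ~ N(mu, sigma^2), classifies hallucinated vs. truthful from z,
    and penalizes KL(q(z|h) || N(0, I)), which squeezes out nuisance
    information irrelevant to the label.
    """
    mu = W_mu @ h
    logvar = W_logvar @ h
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterization
    logit = w_cls @ z
    p = 1.0 / (1.0 + np.exp(-logit))                               # sigmoid
    bce = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)       # KL(q || N(0,I))
    return bce + beta * kl

d, k = 8, 4   # feature dim, bottleneck dim (illustrative sizes)
loss = vib_probe_loss(rng.standard_normal(d),
                      rng.standard_normal((k, d)) * 0.1,
                      rng.standard_normal((k, d)) * 0.1,
                      rng.standard_normal(k), y=1)
```

摘要中"利用探针梯度定位因果注意力头"可理解为:对训练好的探针,计算损失对各头特征 h 的梯度幅值并排序。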
zh
[CV-37] MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation
【速读】:该论文旨在解决现有多对象图像生成方法在语言描述指导下难以实现图像生成区域与语义精准对齐的问题,导致对象数量不一致和属性混淆(attribute aliasing)。为克服这一局限,作者提出MoGen方法,其关键在于设计了两个核心模块:一是区域语义锚定(Regional Semantic Anchor, RSA)模块,通过在生成过程中将语言描述中的短语单元精确锚定到对应图像区域,从而实现符合数量规范的文本到图像生成;二是自适应多模态引导(Adaptive Multi-modal Guidance, AMG)模块,能够动态解析并整合多种来源的控制信号,形成结构化的意图,并据此选择性地约束场景布局和对象属性,实现细粒度的动态控制。
链接: https://arxiv.org/abs/2601.05546
作者: Yanfeng Li,Yue Sun,Keren Fu,Sio-Kei Im,Xiaoming Liu,Guangtao Zhai,Xiaohong Liu,Tao Tan
机构: Macao Polytechnic University (澳门理工学院); Sichuan University (四川大学); Michigan State University (密歇根州立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics based on language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantic and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with the heterogeneous resource conditions of users and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during the generation process, enabling text-to-image generation that follows quantity specifications for multiple objects. Building upon this foundation, we further introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals to formulate corresponding structured intent. This intent subsequently guides selective constraints on scene layouts and object attributes, achieving dynamic fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: this https URL.
zh
[CV-38] DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion
【速读】:该论文旨在解决多模态图像融合(multi-modal image fusion)中现有基于状态空间模型的方法存在的失衡问题,即在融合红外与可见光图像时,要么过度优先保留红外强度而损失可见细节,要么反之,保留可见结构却削弱热目标的显著性。解决方案的关键在于提出DIFF-MF——一种差异驱动的通道-空间状态空间模型(difference-driven channel-spatial state space model),其核心机制是利用模态间特征差异图(feature discrepancy maps)引导特征提取,并通过两个模块实现跨通道和跨空间维度的融合:一是通道交换模块(channel-exchange module),采用交叉注意力双状态空间建模增强通道交互并自适应重加权特征;二是空间交换模块(spatial-exchange module),通过跨模态状态空间扫描实现全面的空间融合。该方法在保持线性计算复杂度的同时有效捕捉全局依赖关系,从而更均衡地整合多模态互补信息。
链接: https://arxiv.org/abs/2601.05538
作者: Yiming Sun,Zifan Ye,Qinghua Hu,Pengfei Zhu
机构: Southeast University (东南大学); Xiong’an National Innovation Center Technology Co., Ltd. (雄安国家创新中心科技有限公司); Xiong’an Guochuang Lantian Technology Co., Ltd. (雄安国创蓝田科技有限公司); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend to either over-prioritize infrared intensity at the cost of visible details, or conversely, preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on the driving scenarios and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.
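摘要中"以模态间特征差异图引导融合"的思想,可用如下逐像素简化示意理解(为笔者构造的示意,论文实际通过状态空间通道/空间交换模块实现):两模态分歧大的位置(如可见光中不存在的热目标)偏向响应更强的一方,分歧小的位置取平均。

```python
import numpy as np

def difference_guided_fusion(f_ir, f_vis):
    """Fuse two modality feature maps using their discrepancy as guidance.

    Where the modalities disagree most, the fused value follows the stronger
    response; where they agree, it averages them. A per-pixel sketch of the
    'difference-driven' idea only.
    """
    diff = np.abs(f_ir - f_vis)
    w = diff / (diff.max() + 1e-12)            # 0 = agreement, 1 = disagreement
    stronger = np.where(f_ir >= f_vis, f_ir, f_vis)
    return w * stronger + (1 - w) * 0.5 * (f_ir + f_vis)

f_ir = np.array([[1.0, 0.2], [0.9, 0.1]])      # hot target at top-left
f_vis = np.array([[0.1, 0.2], [0.2, 0.1]])
fused = difference_guided_fusion(f_ir, f_vis)
```

这种以差异为权重的设计避免了"一刀切"偏向某一模态,与摘要批评的两类失衡相对应。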
zh
[CV-39] SAS-VPReID: A Scale-Adaptive Framework with Shape Priors for Video-based Person Re-Identification at Extreme Far Distances WACV2026
【速读】:该论文旨在解决远距离视频行人再识别(Video-based Person Re-Identification, VPReID)中的关键挑战,包括因分辨率严重退化、视角剧烈变化以及不可避免的外观噪声导致的识别性能下降问题。解决方案的核心在于提出一种基于形状先验的尺度自适应框架(Scale-Adaptive framework with Shape Priors for VPReID, SAS-VPReID),其关键创新包含三个互补模块:1)基于CLIP视觉编码器和多代理记忆的增强型视觉骨干网络(Memory-Enhanced Visual Backbone, MEVB),用于提取更具判别性的特征表示;2)多粒度时序建模模块(Multi-Granularity Temporal Modeling, MGTM),通过在不同时间粒度上构建序列并自适应强调跨尺度运动线索;3)先验正则化的身体结构动态建模模块(Prior-Regularized Shape Dynamics, PRSD),用于捕捉人体结构的动态特性。这三个模块协同作用,显著提升了远距离场景下视频行人再识别的鲁棒性和准确性。
链接: https://arxiv.org/abs/2601.05535
作者: Qiwei Yang,Pingping Zhang,Yuhao Wang,Zijing Gong
机构: Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV2026 VReID-XFD Workshop. Our final framework ranks the first on the VReID-XFD challenge leaderboard
Abstract:Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation and inevitable appearance noise. To address these issues, we propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID. The framework is built upon three complementary modules. First, we deploy a Memory-Enhanced Visual Backbone (MEVB) to extract discriminative feature representations, which leverages the CLIP vision encoder and multi-proxy memory. Second, we propose a Multi-Granularity Temporal Modeling (MGTM) to construct sequences at multiple temporal granularities and adaptively emphasize motion cues across scales. Third, we incorporate Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. With these modules, our framework can obtain more discriminative feature representations. Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module, and our final framework ranks first on the VReID-XFD challenge leaderboard. The source code is available at this https URL.
zh
[CV-40] GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting
【速读】:该论文旨在解决传统视频人脸替换框架在生成面部表示时仅限于像素级格式、缺乏可动画化与交互操控能力的问题。现有方法生成的交换人脸仅为无结构像素集合,难以实现高质量动态控制与应用拓展。解决方案的关键在于提出GaussianSwap框架,其核心创新包括:(1)基于3D高斯泼溅(3D Gaussian Splatting)构建目标视频中的人脸虚拟化身(face avatar),并从源图像迁移身份特征;(2)通过FLAME模型对目标视频进行参数化重建(含表情、相机位姿和分割掩膜),将高斯泼溅点集绑定至该模型以实现跨帧动态控制;(3)设计由三个先进人脸识别模型组成的复合身份嵌入(compound identity embedding)用于微调,保障身份一致性;最终实现高保真、时序一致且具备交互潜力的面部替换视频生成。
链接: https://arxiv.org/abs/2601.05511
作者: Xuan Cheng,Jiahao Rao,Chengyang Li,Wenhao Wang,Weilin Chen,Lvqing Yang
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce GaussianSwap, a novel video face swapping framework that constructs a 3D Gaussian Splatting based face avatar from a target video while transferring identity from a source image to the avatar. Conventional video swapping frameworks are limited to generating facial representations in pixel-based formats. The resulting swapped faces exist merely as a set of unstructured pixels without any capacity for animation or interactive manipulation. Our work introduces a paradigm shift from conventional pixel-based video generation to the creation of a high-fidelity avatar with swapped faces. The framework first preprocesses the target video to extract FLAME parameters, camera poses and segmentation masks, and then rigs 3D Gaussian splats to the FLAME model across frames, enabling dynamic facial control. To ensure identity preservation, we propose a compound identity embedding constructed from three state-of-the-art face recognition models for avatar finetuning. Finally, we render the face-swapped avatar on the background frames to obtain the face-swapped video. Experimental results demonstrate that GaussianSwap achieves superior identity preservation, visual clarity and temporal consistency, while enabling previously unattainable interactive applications.
zh
[CV-41] Prompt-Free SAM-Based Multi-Task Framework for Breast Ultrasound Lesion Segmentation and Classification
【速读】:该论文旨在解决乳腺超声(Breast Ultrasound, BUS)图像中肿瘤分割与分类的挑战,这些问题主要源于图像对比度低、斑点噪声干扰以及病灶形态多样性。其解决方案的关键在于提出了一种多任务深度学习框架,利用Segment Anything Model (SAM) 视觉编码器提取的高维特征进行端到端的联合分割与诊断分类。该方法摒弃了传统的提示驱动(prompt-based)策略,采用无需提示的全监督适配方式,并通过轻量级卷积头或UNet-inspired解码器实现像素级分割;同时,在分类分支中引入掩码引导注意力机制(mask-guided attention),使模型聚焦于病灶相关特征并抑制背景伪影。实验表明,该方法在PRECISE 2025数据集上实现了Dice相似系数(DSC)0.887和分类准确率92.3%,显著提升了乳腺超声图像中病灶边界精确性和诊断一致性。
链接: https://arxiv.org/abs/2601.05498
作者: Samuel E. Johnny,Bernes L. Atabonfack,Israel Alagbe,Assane Gueye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Accurate tumor segmentation and classification in breast ultrasound (BUS) imaging remain challenging due to low contrast, speckle noise, and diverse lesion morphology. This study presents a multi-task deep learning framework that jointly performs lesion segmentation and diagnostic classification using embeddings from the Segment Anything Model (SAM) vision encoder. Unlike prompt-based SAM variants, our approach employs a prompt-free, fully supervised adaptation where high-dimensional SAM features are decoded through either a lightweight convolutional head or a UNet-inspired decoder for pixel-wise segmentation. The classification branch is enhanced via mask-guided attention, allowing the model to focus on lesion-relevant features while suppressing background artifacts. Experiments on the PRECISE 2025 breast ultrasound dataset, split per class into 80 percent training and 20 percent testing, show that the proposed method achieves a Dice Similarity Coefficient (DSC) of 0.887 and an accuracy of 92.3 percent, ranking among the top entries on the PRECISE challenge leaderboard. These results demonstrate that SAM-based representations, when coupled with segmentation-guided learning, significantly improve both lesion delineation and diagnostic prediction in breast ultrasound imaging.
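摘要中报告的分割指标为 Dice 相似系数(DSC = 2|P∩T| / (|P|+|T|),该文取得 0.887)。其标准计算方式如下(通用定义,非论文专有代码):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary segmentation masks.

    DSC = 2|P ∩ T| / (|P| + |T|); eps avoids division by zero when both
    masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])   # predicted lesion mask
b = np.array([[1, 0, 0], [0, 1, 1]])   # ground-truth mask
dsc = dice_coefficient(a, b)
```

多任务训练中常用 1 − DSC 的可微软化版本(soft Dice)作为分割分支的损失项。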
zh
[CV-42] Hippocampal Atrophy Patterns Across the Alzheimer's Disease Spectrum: A Voxel-Based Morphometry Analysis
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)和轻度认知障碍(mild cognitive impairment, MCI)中灰质体积变化的定量特征及其作为预测生物标志物的潜力问题。其解决方案的关键在于采用CAT12/SPM12基于体素的形态学分析方法,对来自ADNI项目的249名受试者(包括对照组CN、MCI和AD患者)的基线T1加权磁共振成像(T1-weighted MRI)数据进行灰质体积分析,通过广义线性模型以诊断分组为主要预测因子,并控制年龄和总颅腔容积等混杂因素,同时使用FWE校正进行多重比较,从而精准识别出与疾病状态显著相关的脑区变化。研究发现海马体积在AD组中显著萎缩,且具有中等预测能力用于MCI向AD转化的判别,为理解AD进展中的颞叶内侧退行性变提供了量化依据,并揭示了APOE4基因型在横断面海马体积上的无显著影响。
链接: https://arxiv.org/abs/2601.05494
作者: Trishna Niraula
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, 6 tables
Abstract:Alzheimer’s disease (AD) and mild cognitive impairment (MCI) are associated with progressive gray matter loss, particularly in medial temporal structures. In this study, CAT12/SPM12 voxel-based morphometry was applied to baseline T1-weighted MRI scans from 249 ADNI participants (CN = 90, MCI = 129, AD = 30). Gray matter volume was analyzed using a general linear model, with the diagnostic group as primary predictor and age and total intracranial volume as covariates. Statistical maps were thresholded at p < 0.001 (voxelwise) and corrected for multiple comparisons at the cluster level using family-wise error (FWE) correction (p < 0.05). Significant hippocampal atrophy was observed in AD relative to CN and MCI (Cohen’s d = 2.03 and 1.61, respectively). Hippocampal volume demonstrated moderate predictive value for conversion from MCI to AD (AUC = 0.66). Stratification by APOE4 status did not reveal significant genetic effects on cross-sectional hippocampal volume. These results support medial temporal degeneration as a key feature of AD progression and provide insights into predictive biomarkers and genetic influences.
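文中报告的效应量为 Cohen's d(组间均值差除以合并标准差,如 AD 相对 CN 的 d = 2.03)。其标准计算示意如下(通用统计定义,非该研究的分析脚本):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d effect size between two groups, using the pooled
    (sample-size-weighted) standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = (((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                  / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

d = cohens_d([1, 2, 3], [3, 4, 5])   # means 2 vs 4, pooled SD 1 -> d = -2
```

按惯例 |d| ≥ 0.8 即为"大"效应,因此文中 d ≈ 2.0 的海马萎缩差异属于非常显著的组间分离。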
zh
[CV-43] Multi-Image Super Resolution Framework for Detection and Analysis of Plant Roots
【速读】:该论文旨在解决地下植物根系成像中因遮挡、土壤湿度变化及图像对比度低等因素导致的传统视觉方法效果受限的问题(即subterranean imaging challenges)。其解决方案的关键在于构建一个结合多视角图像采集与深度学习的多图像超分辨率(Multi-Image Super Resolution, MISR)框架,利用不同视角间的空间冗余信息重建高分辨率图像,从而提升根系结构保真度和可视清晰度。实验表明,该方法在保持CLIP-IQA评分不变的前提下,使BRISQUE指标降低2.3%,显著改善了图像质量,进而支持更精准的根系性状定量分析,如根毛数量和密度估计。
链接: https://arxiv.org/abs/2601.05482
作者: Shubham Agarwal,Ofek Nourian,Michael Sidorov,Sharon Chemweno,Ofer Hadar,Naftali Lazarovitch,Jhonathan E. Ephrath
机构: Ben Gurion University of the Negev (本古里安大学); The Jacob Blaustein Institutes for Desert Research (雅各布·布莱斯廷沙漠研究研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
Abstract:Understanding plant root systems is critical for advancing research in soil-plant interactions, nutrient uptake, and overall plant health. However, accurate imaging of roots in subterranean environments remains a persistent challenge due to adverse conditions such as occlusion, varying soil moisture, and inherently low contrast, which limit the effectiveness of conventional vision-based approaches. In this work, we propose a novel underground imaging system that captures multiple overlapping views of plant roots and integrates a deep learning-based Multi-Image Super Resolution (MISR) framework designed to enhance root visibility and detail. To train and evaluate our approach, we construct a synthetic dataset that simulates realistic underground imaging scenarios, incorporating key environmental factors that affect image quality. Our proposed MISR algorithm leverages spatial redundancy across views to reconstruct high-resolution images with improved structural fidelity and visual clarity. Quantitative evaluations show that our approach outperforms state-of-the-art super resolution baselines, achieving a 2.3 percent reduction in BRISQUE, indicating improved image quality with the same CLIP-IQA score, thereby enabling enhanced phenotypic analysis of root systems. This, in turn, facilitates accurate estimation of critical root traits, including root hair count and root hair density. The proposed framework presents a promising direction for robust automatic underground plant root imaging and trait quantification for agricultural and ecological research.
zh
[CV-44] APM-Net: Trajectory-Aware Perturbation Modeling for Infrared Small Target Detection BMVC2025 BMVC
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, ISTD)中因信号对比度弱、空间范围有限及背景杂波干扰导致的检测性能瓶颈问题。现有基于卷积神经网络(CNN)和视觉Transformer(ViT)的方法虽有所提升,但缺乏对小目标如何在特征空间中引发层间方向性扰动的追踪机制,而这正是区分目标信号与结构化噪声的关键线索。解决方案的核心在于提出Trajectory-Aware Mamba Propagation Network (TAPM-Net),其关键创新是引入两个模块:扰动引导路径模块(Perturbation-guided Path Module, PGM)用于构建多尺度特征中的扰动能量场并提取梯度跟随的特征轨迹,以及轨迹感知状态块(Trajectory-Aware State Block, TASB),基于Mamba架构建模沿每条轨迹的动态传播过程,并融合速度约束扩散与语义对齐的特征融合机制。该方法实现了沿空间轨迹的各向异性、上下文敏感的状态转移,同时保持全局一致性且计算开销低,显著提升了ISTD性能。
链接: https://arxiv.org/abs/2601.05446
作者: Hongyang Xie,Hongyang He,Victor Sanchez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in BMVC 2025 see: this https URL . Conference version. 12 pages, 6 figures, 4 tables. Author-prepared version
Abstract:Infrared small target detection (ISTD) remains a long-standing challenge due to weak signal contrast, limited spatial extent, and cluttered backgrounds. Despite performance improvements from convolutional neural networks (CNNs) and Vision Transformers (ViTs), current models lack a mechanism to trace how small targets trigger directional, layer-wise perturbations in the feature space, which is an essential cue for distinguishing signal from structured noise in infrared scenes. To address this limitation, we propose the Trajectory-Aware Mamba Propagation Network (TAPM-Net), which explicitly models the spatial diffusion behavior of target-induced feature disturbances. TAPM-Net is built upon two novel components: a Perturbation-guided Path Module (PGM) and a Trajectory-Aware State Block (TASB). The PGM constructs perturbation energy fields from multi-level features and extracts gradient-following feature trajectories that reflect the directionality of local responses. The resulting feature trajectories are fed into the TASB, a Mamba-based state-space unit that models dynamic propagation along each trajectory while incorporating velocity-constrained diffusion and semantically aligned feature fusion from word-level and sentence-level embeddings. Unlike existing attention-based methods, TAPM-Net enables anisotropic, context-sensitive state transitions along spatial trajectories while maintaining global coherence at low computational cost. Experiments on NUAA-SIRST and IRSTD-1K demonstrate that TAPM-Net achieves state-of-the-art performance in ISTD.
zh
[CV-45] Multi-task Cross-modal Learning for Chest X-ray Image Retrieval
【速读】:该论文旨在解决当前视觉-语言基础模型(如BiomedCLIP)在细粒度医学检索任务中表现不足的问题,特别是利用胸部X光片(CXR)图像查询检索临床相关放射学报告的准确性与平衡性问题。解决方案的关键在于提出一种多任务学习框架,以微调BiomedCLIP模型:通过引入一个轻量级MLP投影头,并采用复合损失函数进行训练,该损失函数包含三部分——二分类交叉熵损失用于区分正常与异常CXR研究,监督对比损失增强类内一致性,以及CLIP损失维持跨模态对齐。实验表明,该方法显著提升了图像到文本和文本到图像检索任务的平衡性和临床相关性,同时t-SNE可视化显示正常与异常病例的语义聚类更清晰,验证了模型诊断敏感性的提升。
链接: https://arxiv.org/abs/2601.05399
作者: Zhaohui Liang,Sivaramakrishnan Rajaraman,Niccolo Marini,Zhiyun Xue,Sameer Antani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:CLIP and BiomedCLIP are examples of vision-language foundation models and offer strong cross-modal embeddings; however, they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports using chest X-ray (CXR) image queries. To address this shortcoming, we propose a multi-task learning framework to fine-tune BiomedCLIP and evaluate improvements to CXR image-text retrieval. Using BiomedCLIP as the backbone, we incorporate a lightweight MLP projector head trained with a multi-task composite loss function that includes: (1) a binary cross-entropy loss to distinguish normal from abnormal CXR studies, (2) a supervised contrastive loss to reinforce intra-class consistency, and (3) a CLIP loss to maintain cross-modal alignment. Experimental results demonstrate that the fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks compared to the pretrained BiomedCLIP and general-purpose CLIP models. Furthermore, t-SNE visualizations reveal clearer semantic clustering of normal and abnormal cases, demonstrating the model’s enhanced diagnostic sensitivity. These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.
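摘要中的复合损失为三项加权和:正常/异常二分类 BCE、监督对比损失与 CLIP 损失。以下为一个简化示意(监督对比项为节省篇幅以预先算好的标量代入;权重与函数名均为笔者假设,非论文官方实现):

```python
import numpy as np

def clip_loss(img, txt, t=0.07):
    """Symmetric InfoNCE over matched image/text embeddings (rows aligned)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / t
    def ce(lg):  # cross-entropy with diagonal (matched-pair) targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (ce(logits) + ce(logits.T))

def composite_loss(img, txt, p_abnormal, y, l_supcon, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives: BCE for normal/abnormal,
    a supervised-contrastive term (precomputed scalar here for brevity),
    and the CLIP alignment loss. Weights w are assumed, not from the paper."""
    bce = -np.mean(y * np.log(p_abnormal + 1e-12)
                   + (1 - y) * np.log(1 - p_abnormal + 1e-12))
    return w[0] * bce + w[1] * l_supcon + w[2] * clip_loss(img, txt)

c = clip_loss(np.eye(3), np.eye(3))   # perfectly aligned pairs -> near-zero loss
total = composite_loss(np.eye(3), np.eye(3),
                       np.array([0.9, 0.1, 0.8]), np.array([1, 0, 1]),
                       l_supcon=0.5)
```

三项分工对应摘要的设计:BCE 提供诊断监督,对比项约束类内一致性,CLIP 项维持图文跨模态对齐。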
zh
[CV-46] SketchPatch: Efficient Structure-Aware 3D Gaussian Representation
【速读】:该论文旨在解决3D高斯散射(3DGS)表示中缺乏结构感知能力的问题,即如何在保持视觉质量的同时实现高效存储与自适应流式传输。其核心挑战在于传统均匀剪枝方法无法区分不同几何特征的语义重要性,导致细节丢失或冗余编码。解决方案的关键在于提出一种分层自适应分类框架,将高斯分布分为两类:**草图高斯(Sketch Gaussians)**用于捕捉高频边界特征(如边缘和轮廓),以及**补丁高斯(Patch Gaussians)**用于表征低频平滑区域(如体积和深度)。通过多准则密度聚类与质量驱动的自适应细化机制,该方法实现了结构感知的逐层渐进式流式传输——先以紧凑的草图高斯构建场景骨架,再由补丁高斯逐步精细化体积细节。此策略无需依赖外部3D线状基元,显著提升了参数编码效率,在同等模型规模下相较均匀剪枝基线在PSNR、SSIM和LPIPS指标上分别提升最高达1.74 dB、6.7%和41.4%,尤其在室内场景中可仅用原模型0.5%的参数量维持高质量渲染。
链接: https://arxiv.org/abs/2601.05394
作者: Yuang Shi,Simone Gasparini,Géraldine Morin,Wei Tsang Ooi
机构: National University of Singapore(新加坡国立大学); IRIT - Université de Toulouse(图卢兹大学IRIT研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:
Abstract:We observe that Gaussians exhibit distinct roles and characteristics analogous to traditional artistic techniques – like how artists first sketch outlines before filling in broader areas with color, some Gaussians capture high-frequency features such as edges and contours, while others represent broader, smoother regions analogous to brush strokes that add volume and depth. Based on this observation, we propose a hybrid representation that categorizes Gaussians into (i) Sketch Gaussians, which represent high-frequency, boundary-defining features, and (ii) Patch Gaussians, which cover low-frequency, smooth regions. This semantic separation naturally enables layered progressive streaming, where the compact Sketch Gaussians establish the structural skeleton before Patch Gaussians incrementally refine volumetric detail. In this work, we extend our previous method to arbitrary 3D scenes by proposing a novel hierarchical adaptive categorization framework that operates directly on the 3DGS representation. Our approach employs multi-criteria density-based clustering, combined with adaptive quality-driven refinement. This method eliminates dependency on external 3D line primitives while ensuring optimal parametric encoding effectiveness. Our comprehensive evaluation across diverse scenes, including both man-made and natural environments, demonstrates that our method achieves up to 1.74 dB improvement in PSNR, 6.7% in SSIM, and 41.4% in LPIPS at equivalent model sizes compared to uniform pruning baselines. For indoor scenes, our method can maintain visual quality with only 0.5% of the original model size. This structure-aware representation enables efficient storage, adaptive streaming, and rendering of high-fidelity 3D content across bandwidth-constrained networks and resource-limited devices. 
zh
[CV-47] EdgeLDR: Quaternion Low-Displacement Rank Neural Networks for Edge-Efficient Deep Learning
【速读】:该论文旨在解决深度神经网络在边缘设备上部署时面临的内存访问开销和计算成本过高的问题,尤其是密集线性层中参数量大、计算效率低的瓶颈。其解决方案的关键在于提出EdgeLDR框架,通过将四元数(Quaternion)通道混合与块循环(Block-Circulant)参数结构相结合,并利用复数伴随表示实现基于快速傅里叶变换(FFT)的高效计算,从而在保持模型精度的同时显著压缩参数量并降低延迟。
链接: https://arxiv.org/abs/2601.05379
作者: Vladimir Frants,Sos Agaian,Karen Panetta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deploying deep neural networks on edge devices is often limited by the memory traffic and compute cost of dense linear operators. While quaternion neural networks improve parameter efficiency by coupling multiple channels through Hamilton products, they typically retain unstructured dense weights; conversely, structured matrices enable fast computation but are usually applied in the real domain. This paper introduces EdgeLDR, a practical framework for quaternion block-circulant linear and convolutional layers that combines quaternion channel mixing with block-circulant parameter structure and enables FFT-based evaluation through the complex adjoint representation. We present reference implementations of EdgeLDR layers and compare FFT-based computation against a naive spatial-domain realization of quaternion circulant products. FFT evaluation yields large empirical speedups over the naive implementation and keeps latency stable as block size increases, making larger compression factors computationally viable. We further integrate EdgeLDR layers into compact CNN and Transformer backbones and evaluate accuracy-compression trade-offs on 32x32 RGB classification (CIFAR-10/100, SVHN) and hyperspectral image classification (Houston 2013, Pavia University), reporting parameter counts and CPU/GPU latency. The results show that EdgeLDR layers provide significant compression with competitive accuracy.
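EdgeLDR 之所以能用 FFT 高效求值,关键在于循环矩阵可被 DFT 对角化。下面给出一个与论文实现无关的最小 numpy 示意(四元数与复伴随表示此处省略),验证 y = IFFT(FFT(c) ⊙ FFT(x)) 与显式构造循环矩阵再做矩阵乘的结果一致:

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """C @ x via FFT diagonalization: C = F^-1 diag(F c) F for circulant C."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_matrix(c):
    """Dense circulant matrix with first column c, for comparison only."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

c = np.array([1.0, 2.0, 0.5, -1.0])   # defining first column (assumed values)
x = np.array([0.3, -0.2, 1.0, 0.7])
assert np.allclose(circulant_matvec_fft(c, x), circulant_matrix(c) @ x)
```

真实的 EdgeLDR 层还需把四元数乘法映射到复伴随表示并逐块应用上述求值;此处仅演示"循环结构 ⇒ FFT 对角化"这一核心事实,它把 O(n²) 的矩阵乘降为 O(n log n)。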
zh
[CV-48] Ensemble of radiomics and ConvNeXt for breast cancer diagnosis
【速读】:该论文旨在解决乳腺癌早期诊断中准确率不足的问题,以提升生存率。其解决方案的关键在于通过集成学习(ensemble method)融合深度学习(Deep Learning, DL)与放射组学(Radiomics)的预测结果,从而显著增强从筛查乳腺X线摄影图像中检测乳腺癌的能力,实验表明该集成方法在两个独立数据集上均取得了最高的AUC值(0.87),优于单独使用ConvNeXtV1-small模型(0.83)或放射组学模型(0.80)。
链接: https://arxiv.org/abs/2601.05373
作者: Jorge Alberto Garza-Abdala,Gerardo Alejandro Fumagal-González,Beatriz A. Bosques-Palomo,Mario Alexis Monsivais Molina,Daly Avedano,Servando Cardona-Huerta,José Gerardo Tamez-Pena
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted and presented at the IEEE International Symposium on Computer-Based Medical Systems (CBMS) 2025
Abstract:Early diagnosis of breast cancer is crucial for improving survival rates. Radiomics and deep learning (DL) have shown significant potential in assisting radiologists with early cancer detection. This paper aims to critically assess the performance of radiomics, DL, and ensemble techniques in detecting cancer from screening mammograms. Two independent datasets were used: the RSNA 2023 Breast Cancer Detection Challenge (11,913 patients) and a Mexican cohort from the TecSalud dataset (19,400 patients). The ConvNeXtV1-small DL model was trained on the RSNA dataset and validated on the TecSalud dataset, while radiomics models were developed using the TecSalud dataset and validated with a leave-one-year-out approach. The ensemble method consistently combined and calibrated predictions using the same methodology. Results showed that the ensemble approach achieved the highest area under the curve (AUC) of 0.87, compared to 0.83 for ConvNeXtV1-small and 0.80 for radiomics. In conclusion, ensemble methods combining DL and radiomics predictions significantly enhance breast cancer diagnosis from mammograms.
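集成方法的核心是对两路预测概率做(校准后)加权融合再评估 AUC。以下为等权软投票与秩和 AUC 的最小示意(权重与数据均为假设,论文实际采用了统一的校准与组合方法):

```python
import numpy as np

def auc(scores, labels):
    """Rank-sum AUC: probability that a random positive outranks a random
    negative. Assumes untied scores."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

p_dl  = np.array([0.9, 0.2, 0.3, 0.4, 0.6, 0.1])  # hypothetical DL scores
p_rad = np.array([0.8, 0.3, 0.5, 0.6, 0.7, 0.2])  # hypothetical radiomics scores
y     = np.array([1, 0, 1, 0, 1, 0])
p_ens = 0.5 * p_dl + 0.5 * p_rad                   # equal-weight soft voting
print(round(auc(p_ens, y), 3))                     # → 0.889
```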
zh
[CV-49] MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
【速读】:该论文旨在解决单目视频中动态场景高保真重建的难题,该问题因缺乏多视角约束而具有病态性(ill-posed),导致物体几何结构和时序一致性难以准确恢复。其解决方案的关键在于:首先在初始化阶段融合深度、光流、动态物体分割与点跟踪等多几何线索,并结合刚性运动约束,估计初步的3D场景动态;随后将场景分解为静态与动态两部分,对动态成分中的每个高斯(Gaussian)点赋予时间依赖的Poly-Fourier曲线轨迹,实现参数高效的运动编码;此策略显著降低对视觉外观驱动运动推断的依赖,从而提升重建精度与稳定性。
链接: https://arxiv.org/abs/2601.05368
作者: Svitlana Morkva,Maximum Wilder-Smith,Michael Oechsle,Alessio Tonioni,Marco Hutter,Vaishakh Patil
机构: ETH Zürich; Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as time-dependent Poly-Fourier curve for parameter-efficient motion encoding. We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.
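动态高斯的轨迹用 Poly-Fourier 曲线参数化,即多项式漂移叠加若干傅里叶谐波。以下为该参数化的最小求值示意(系数与基频 omega 均为假设值):

```python
import numpy as np

def poly_fourier(t, poly_coef, fourier_coef, omega=2 * np.pi):
    """x(t) = sum_k a_k t^k + sum_m (b_m sin(m*omega*t) + c_m cos(m*omega*t))."""
    val = sum(a * t**k for k, a in enumerate(poly_coef))
    for m, (b, c) in enumerate(fourier_coef, start=1):
        val += b * np.sin(m * omega * t) + c * np.cos(m * omega * t)
    return float(val)

# linear drift 1 + 2t plus one sine harmonic of unit amplitude
assert abs(poly_fourier(0.25, [1.0, 2.0], [(1.0, 0.0)]) - 2.5) < 1e-9
```

每个动态高斯只需存一小组多项式与傅里叶系数即可描述整条轨迹,这正是摘要所称"参数高效的运动编码"。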
zh
[CV-50] STResNet STYOLO: A New Family of Compact Classification and Object Detection Models for MCUs
【速读】:该论文旨在解决轻量级神经网络在微控制器(microcontroller)和神经处理单元(neural processing unit, NPU)等资源受限设备上部署时,普遍存在的准确率与延迟权衡问题。其解决方案的关键在于提出两类新型模型家族——STResNet(用于图像分类)和STYOLO(用于目标检测),通过联合优化准确率、计算效率和内存占用,在极低参数预算下实现高性能表现。例如,STResNetMilli仅用三百万参数即达到70.0% Top-1准确率,优于MobileNetV1和ShuffleNetV2;STYOLOMicro和STYOLOMilli在MS COCO数据集上分别取得30.5%和33.6%的平均精度(mean average precision),超越YOLOv5n和YOLOX Nano,且在实际训练中可集成至Ultralytics环境以增强实用性。
链接: https://arxiv.org/abs/2601.05364
作者: Sudhakar Sah,Ravish Kumar
机构: STMicroelectronics(意法半导体)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure
Abstract:Recent advancements in lightweight neural networks have significantly improved the efficiency of deploying deep learning models on edge hardware. However, most existing architectures still trade accuracy for latency, which limits their applicability on microcontroller and neural processing unit based devices. In this work, we introduce two new model families, STResNet for image classification and STYOLO for object detection, jointly optimized for accuracy, efficiency, and memory footprint on resource constrained platforms. The proposed STResNet series, ranging from Nano to Tiny variants, achieves competitive ImageNet 1K accuracy within a four million parameter budget. Specifically, STResNetMilli attains 70.0 percent Top 1 accuracy with only three million parameters, outperforming MobileNetV1 and ShuffleNetV2 at comparable computational complexity. For object detection, STYOLOMicro and STYOLOMilli achieve 30.5 percent and 33.6 percent mean average precision, respectively, on the MS COCO dataset, surpassing YOLOv5n and YOLOX Nano in both accuracy and efficiency. Furthermore, STResNetMilli can also be used as a backbone within the Ultralytics training environment.
zh
[CV-51] Coding the Visual World: From Image to Simulation Using Vision Language Models
【速读】:该论文旨在解决视觉理解中如何评估和量化视觉语言模型(Vision Language Models, VLMs)对图像所呈现系统性机制的建模能力问题。传统方法多关注图像分类或描述,而忽视了对图像中复杂系统(如物理现象、城市结构等)内在规律的理解与模拟。其解决方案的关键在于提出Im2Sim方法:给定一张真实世界的自然图像,VLM被要求不仅描述图像内容,还需编写可执行的生成代码以复现该系统并生成合成图像;通过比较原始图像与合成图像的差异,从而评估VLM对图像中系统机制的理解深度。此方法揭示了当前领先VLMs具备跨领域、多层次抽象建模能力,但对低层级细节的还原能力有限,体现出高阶语义理解与低级感知之间的不对称性。
链接: https://arxiv.org/abs/2601.05344
作者: Sagi Eppel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) demonstrate the capacity to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.
zh
[CV-52] Bi-Orthogonal Factor Decomposition for Vision Transformers
【速读】:该论文旨在解决视觉 Transformer 中自注意力机制(self-attention)在信息交换层面缺乏原理性理解的问题,即现有注意力图(attention maps)仅能描述权重分布位置,却无法揭示查询(query)与键(key)之间传递的是位置信息、内容信息还是两者兼有。为此,作者提出双正交因子分解(Bi-orthogonal Factor Decomposition, BFD)这一两阶段分析框架:第一阶段基于方差分析(ANOVA)将 token 激活分解为正交的位置和内容因子,实现二者有效解耦;第二阶段通过查询-键交互矩阵 QKT 的奇异值分解(SVD)识别出生物正交模式(bi-orthogonal modes),从而揭示这些因子如何介导 token 间的通信。BFD 的关键创新在于从信息维度上量化了注意力机制中位置与内容的相对贡献,并发现了不同注意力头及内部奇异模式的分工特性,为理解视觉 Transformer 的工作机制提供了可解释的理论工具。
链接: https://arxiv.org/abs/2601.05328
作者: Fenil R. Doshi,Thomas Fel,Talia Konkle,George Alvarez
机构: Harvard University (哈佛大学); Kempner Institute (肯普纳研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena. (i) Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum. (ii) Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization. (iii) DINOv2’s superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content. Overall, BFD exposes how tokens interact through attention and which informational factors - positional or semantic - mediate their communication, yielding practical insights into vision transformer mechanisms.
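BFD 第二阶段对查询-键交互矩阵 QK^T 做 SVD 以得到双正交模式。以下 numpy 片段(随机权重仅作占位,并非任何真实 ViT 的参数)演示如何计算各模式的能量占比:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q = rng.standard_normal((d, d))  # stand-in query projection (random placeholder)
W_k = rng.standard_normal((d, d))  # stand-in key projection (random placeholder)
M = W_q @ W_k.T                    # query-key interaction matrix QK^T

U, s, Vt = np.linalg.svd(M)
energy = s**2 / np.sum(s**2)       # fraction of interaction energy per bi-orthogonal mode
# singular-value energies sum to the squared Frobenius norm of M
assert np.isclose(np.sum(s**2), np.linalg.norm(M, "fro") ** 2)
```

论文正是用这种按模式分解的能量,来度量"内容-内容""内容-位置"等因子各自占注意力计算的比重。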
zh
[CV-53] Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach
【速读】:该论文旨在解决历史手稿中视觉内容大规模系统性研究的难题,即如何高效地从海量数字化手稿中自动检测、提取并描述插图,以支持人文学者开展视觉文化研究。其解决方案的关键在于构建一个三阶段的端到端AI流水线:首先利用微调后的图像分类模型过滤纯文本页;其次通过高效的物体检测模型定位并裁剪插图区域;最后借助多模态图像描述生成模型自动生成简洁的人类可读说明,并存储于可检索数据库中。该方法在处理超过三百万页手稿时实现了每页低于0.06秒的处理速度,显著优于传统分割技术,在效率与可访问性上为视觉学术研究提供了全新范式。
链接: https://arxiv.org/abs/2601.05269
作者: Yoav Evron,Michal Bar-Asher Siegal,Michael Fire
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 5 figures
Abstract:The recent Artificial Intelligence (AI) revolution has opened transformative possibilities for the humanities, particularly in unlocking the visual content embedded in historical manuscripts. While digital archives now offer unprecedented access to these materials, the ability to systematically study illustrations at a large scale remains challenging. Our study presents a fast and scalable AI approach for detecting, extracting, and describing illustrations in digitized manuscripts. Focusing on collections like the Vatican Library, our system enables efficient visual analysis across millions of pages. Our pipeline consists of three stages: (1) a fine-tuned image classification model filters out text-only pages; (2) an efficient object detection model identifies and crops illustrations; and (3) a multimodal image captioning model generates concise, human-readable descriptions. These are stored in a searchable database, allowing scholars to retrieve relevant visual materials through keyword queries. By harnessing the power of recent AI advancements, we enable large-scale visual research that was previously impractical, empowering scholars in historical studies, art history, and cultural heritage to explore visual motifs, artistic styles, and cross-cultural influences with new precision and speed. Applying our pipeline to over three million digitized manuscript pages, we automatically identified and extracted more than 200,000 unique illustrations. This scale of processing in under 0.06 seconds per page, dramatically outperforms traditional segmentation techniques in both efficiency and accessibility for visual scholarship. Our work demonstrates how cutting-edge AI tools can profoundly reshape scholarly workflows and open new avenues for multidisciplinary research in the age of digital manuscripts.
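三阶段流水线的骨架可示意如下(classify / detect / caption 均以桩函数代替论文中微调的分类、检测与描述模型,仅演示数据流):

```python
def process_page(page, classify, detect, caption):
    """Three-stage pipeline: drop text-only pages, detect illustration boxes,
    then caption each crop for the searchable database."""
    if classify(page) == "text-only":
        return []
    return [{"box": box, "caption": caption(page, box)} for box in detect(page)]

# stub models standing in for the fine-tuned classifier / detector / captioner
records = process_page(
    page="folio-12r",
    classify=lambda p: "illustrated",
    detect=lambda p: [(10, 10, 80, 80)],
    caption=lambda p, box: "a marginal drawing of a bird",
)
assert records == [{"box": (10, 10, 80, 80), "caption": "a marginal drawing of a bird"}]
```

先用廉价的分类器过滤纯文本页,再对少数含插图页运行较重的检测与描述模型,是该流水线达到每页 0.06 秒以下的关键设计。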
zh
人工智能
[AI-0] Open-Vocabulary 3D Instruction Ambiguity Detection
【速读】:该论文旨在解决安全关键领域中3D指令歧义检测问题,即在给定的3D场景中判断一个语言指令是否具有唯一明确的语义,以避免因语义模糊导致的严重后果(如手术场景中“递给我那个试管”可能引发误操作)。当前大多数具身AI研究忽视了这一问题,假设指令清晰而仅关注执行阶段。为填补此安全空白,作者首次定义了开放词汇的3D指令歧义检测任务,并构建了大规模基准Ambi3D(包含700余个多样化的3D场景和约22k条指令)。实验发现,最先进的3D大语言模型(3D LLMs)难以可靠识别歧义。为此,论文提出两阶段框架AmbiVer:首先从多视角收集显式视觉证据,再利用这些证据引导视觉-语言模型(VLM)判断指令歧义性,从而提升具身AI系统的安全性与可信度。
链接: https://arxiv.org/abs/2601.05991
作者: Jiayu Ding,Haoran Tang,Ge Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like “Pass me the vial” in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, the large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at this https URL.
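歧义判定的一个极简基线是检查指令指称的对象在场景中是否唯一(此处仅示意任务定义,远非 AmbiVer 的多视角证据方法;场景表示为假设的字典列表):

```python
def is_ambiguous(referred_category, scene_objects):
    """Toy criterion: an instruction is ambiguous if its referred category
    matches more than one object in the 3D scene."""
    matches = [o for o in scene_objects if o["category"] == referred_category]
    return len(matches) > 1

scene = [{"category": "vial", "id": 0}, {"category": "vial", "id": 1},
         {"category": "scalpel", "id": 2}]
assert is_ambiguous("vial", scene)        # "Pass me the vial": two candidates
assert not is_ambiguous("scalpel", scene) # unique referent: unambiguous
```

真实场景中指称还依赖空间关系与属性("左边那个试管"),这正是该任务需要 3D 理解而非简单类别匹配的原因。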
zh
[AI-1] Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset
【速读】:该论文旨在解决大规模发布高质量定性访谈数据时可能引发的隐私泄露风险问题,特别是针对由生成式 AI(Generative AI)驱动的智能体(agent)进行再识别攻击(re-identification attack)的可能性。其关键解决方案在于揭示:当前具备网络搜索和代理能力的大型语言模型(Large Language Models, LLMs)能够通过少量自然语言提示,自动执行跨源信息比对与推理,从而将匿名化处理的科学家访谈记录与具体学术作品及作者身份关联起来——这表明现有数据脱敏措施在LLM代理面前极易被绕过,且攻击过程无需复杂技术门槛。研究强调需重新评估开放高质量访谈数据集的隐私保护策略,并提出针对性的缓解建议与未来研究方向。
链接: https://arxiv.org/abs/2601.05918
作者: Tianshi Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 4 pages
Abstract:On December 4, 2025, Anthropic released Anthropic Interviewer, an AI tool for running qualitative interviews at scale, along with a public dataset of 1,250 interviews with professionals, including 125 scientists, about their use of AI for research. Focusing on the scientist subset, I show that widely available LLMs with web search and agentic capabilities can link six out of twenty-four interviews to specific scientific works, recovering associated authors and, in some cases, uniquely identifying the interviewees. My contribution is to show that modern LLM-based agents make such re-identification attacks easy and low-effort: off-the-shelf tools can, with a few natural-language prompts, search the web, cross-reference details, and propose likely matches, effectively lowering the technical barrier. Existing safeguards can be bypassed by breaking down the re-identification into benign tasks. I outline the attack at a high level, discuss implications for releasing rich qualitative data in the age of LLM agents, and propose mitigation recommendations and open problems. I have notified Anthropic of my findings.
zh
[AI-2] Auditing Fairness under Model Updates: Fundamental Complexity and Property-Preserving Updates
【速读】:该论文旨在解决在模型动态更新场景下如何可靠地进行群体公平性审计的问题。随着机器学习模型在社会基础设施中的广泛应用,模型所有者可能根据环境变化(如金融市场波动)自适应地调整模型,这种更新虽可能改变模型类别但仍保留某些关键属性,从而对传统审计方法构成挑战。论文提出了一种基于经验属性优化(Empirical Property Optimization, EPO)Oracle的PAC审计框架,其核心在于通过引入一种新的组合复杂度度量——SP维度(SP dimension),刻画允许的战略性更新对审计属性的不变性影响,并在此基础上实现以最少标签样本高效估计群体公平性等审计指标的目标。该框架不仅适用于统计独立性(statistical parity),还可自然扩展至预测误差和鲁棒风险等其他审计目标。
链接: https://arxiv.org/abs/2601.05909
作者: Ayoub Ajarra,Debabrota Basu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
备注:
Abstract:As machine learning models become increasingly embedded in societal infrastructure, auditing them for bias is of growing importance. However, in real-world deployments, auditing is complicated by the fact that model owners may adaptively update their models in response to changing environments, such as financial markets. These updates can alter the underlying model class while preserving certain properties of interest, raising fundamental questions about what can be reliably audited under such shifts. In this work, we study group fairness auditing under arbitrary updates. We consider general shifts that modify the pre-audit model class while maintaining invariance of the audited property. Our goals are two-fold: (i) to characterize the information complexity of allowable updates, by identifying which strategic changes preserve the property under audit; and (ii) to efficiently estimate auditing properties, such as group fairness, using a minimal number of labeled samples. We propose a generic framework for PAC auditing based on an Empirical Property Optimization (EPO) oracle. For statistical parity, we establish distribution-free auditing bounds characterized by the SP dimension, a novel combinatorial measure that captures the complexity of admissible strategic updates. Finally, we demonstrate that our framework naturally extends to other auditing objectives, including prediction error and robust risk.
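以 statistical parity 为例,审计量可以由少量带标签样本经验估计,并用 Hoeffding 界给出有限样本误差。以下为通用示意(非论文的 EPO oracle 实现;数据与 delta 取值均为假设):

```python
import numpy as np

def sp_gap(preds, groups):
    """Empirical statistical-parity gap |P(h=1 | g=0) - P(h=1 | g=1)|."""
    preds, groups = np.asarray(preds, float), np.asarray(groups)
    return float(abs(preds[groups == 0].mean() - preds[groups == 1].mean()))

def hoeffding_eps(n, delta=0.05):
    """Two-sided Hoeffding deviation for a mean of n samples bounded in [0, 1]."""
    return float(np.sqrt(np.log(2 / delta) / (2 * n)))

preds  = [1, 1, 0, 0, 1, 0]   # binary decisions of the audited model (toy data)
groups = [0, 0, 0, 1, 1, 1]   # protected-group membership
print(sp_gap(preds, groups))  # ≈ 0.333
```

论文关心的正是当模型被战略性更新但保持该性质不变时,这类估计需要多少样本才可靠,其复杂度由 SP 维度刻画。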
zh
[AI-3] Can AI mediation improve democratic deliberation?
【速读】:该论文试图解决民主实践中“参与广泛性、有意义的审议和政治平等”三者之间的固有权衡问题(即“三难困境”,trilemma)。其解决方案的关键在于引入大语言模型(Large Language Model, LLM)作为审议增强工具,通过提升参与的可扩展性、实现公平的中介机制以促进政治平等,并借助可信信息的挖掘与呈现来推动实质性审议。论文指出,尽管LLM在理论上具备缓解三难困境的潜力,但要真正实现AI辅助审议对公民参与和民主质量的提升,仍需在实证研究、技术优化与理论建构等方面取得系统性进展。
链接: https://arxiv.org/abs/2601.05904
作者: Michael Henry Tessler,Georgina Evans,Michiel A. Bakker,Iason Gabriel,Sophie Bridgers,Rishub Jain,Raphael Koster,Verena Rieser,Anca Dragan,Matthew Botvinick,Christopher Summerfield
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The strength of democracy lies in the free and equal exchange of diverse viewpoints. Living up to this ideal at scale faces inherent tensions: broad participation, meaningful deliberation, and political equality often trade off with one another (Fishkin, 2011). We ask whether and how artificial intelligence (AI) could help navigate this “trilemma” by engaging with a recent example of a large language model (LLM)-based system designed to help people with diverse viewpoints find common ground (Tessler, Bakker, et al., 2024). Here, we explore the implications of the introduction of LLMs into deliberation augmentation tools, examining their potential to enhance participation through scalability, improve political equality via fair mediation, and foster meaningful deliberation by, for example, surfacing trustworthy information. We also point to key challenges that remain. Ultimately, a range of empirical, technical, and theoretical advancements are needed to fully realize the promise of AI-mediated deliberation for enhancing citizen engagement and strengthening democratic deliberation.
zh
[AI-4] TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLMs as Agents AAAI2026
【速读】:该论文旨在解决当前基于实时策略(RTS)游戏的环境在评估大语言模型(LLMs)时存在的两大问题:一是计算资源需求过高,限制了模型的高效测试;二是缺乏对文本观测的支持,难以全面评估LLMs在多模态输入下的决策能力。为应对这些挑战,作者提出了一种名为TowerMind的新颖环境,其关键在于将塔防(TD)子类RTS游戏作为基础,同时具备低计算开销和包含像素、文本及结构化状态的多模态观测空间,从而支持对LLMs的长期规划与战术执行能力的系统性评估,并能有效检测模型幻觉现象。该设计显著提升了RTS类环境在LLM评测中的实用性与灵活性,填补了现有基准的空白。
链接: https://arxiv.org/abs/2601.05899
作者: Dawei Wang,Chengming Zhou,Di Zhao,Xinyuan Liu,Marci Chi Ma,Gary Ushaw,Richard Davison
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: AAAI 2026 Oral
Abstract:Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(this https URL).
zh
[AI-5] StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management
【速读】:该论文旨在解决基于大语言模型的集中式多智能体系统在长期协作中因缺乏有效记忆管理而导致的问题,包括上下文膨胀(context bloat)、错误累积(error accumulation)以及跨任务泛化能力差等。其解决方案的关键在于提出一种具有显式记忆控制的分层多智能体框架 StackPlanner:通过将高层协调与子任务执行解耦,并引入任务级记忆控制机制来提升记忆效率;同时,利用结构化的经验记忆和强化学习策略,实现可复用的协作经验检索与利用,从而增强多智能体系统的长期协作可靠性与通用性。
链接: https://arxiv.org/abs/2601.05890
作者: Ruizhe Zhang,Xinke Jiang,Zhibang Yang,Zhixin Zhang,Jiaran Gao,Yuzhen Xiao,Hongbin Lai,Xu Chu,Junfeng Zhao,Yasha Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems based on large language models, particularly centralized architectures, have recently shown strong potential for complex and knowledge-intensive tasks. However, central agents often suffer from unstable long-horizon collaboration due to the lack of memory management, leading to context bloat, error accumulation, and poor cross-task generalization. To address both task-level memory inefficiency and the inability to reuse coordination experience, we propose StackPlanner, a hierarchical multi-agent framework with explicit memory control. StackPlanner addresses these challenges by decoupling high-level coordination from subtask execution with active task-level memory control, and by learning to retrieve and exploit reusable coordination experience via structured experience memory and reinforcement learning. Experiments on multiple deep-search and agent system benchmarks demonstrate the effectiveness of our approach in enabling reliable long-horizon multi-agent collaboration.
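摘要未给出实现细节;以下玩具代码仅示意"任务级记忆控制"的思想:中心规划器在派发子任务时压栈完整上下文,完成后只保留压缩摘要,避免长程协作中的上下文膨胀(summarize 策略为假设):

```python
class TaskMemoryStack:
    """Toy illustration of task-level memory control: push a subtask's full
    context on dispatch, keep only a compact summary on completion."""

    def __init__(self):
        self._stack = []      # active subtasks with full context
        self.summaries = []   # compact records kept long-term

    def dispatch(self, subtask, context):
        self._stack.append((subtask, context))

    def complete(self, summarize=lambda ctx: ctx[:1]):
        subtask, context = self._stack.pop()
        self.summaries.append((subtask, summarize(context)))

mem = TaskMemoryStack()
mem.dispatch("search", ["q1 results", "q2 results", "raw html dump"])
mem.complete()
assert mem.summaries == [("search", ["q1 results"])]
assert mem._stack == []
```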
zh
[AI-6] IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)推理过程中面临的探索崩溃(exploration collapse)问题,即随机轨迹的语义同质性导致模型陷入狭窄且过度优化的行为模式。解决方案的关键在于提出一种基于迭代信息瓶颈(Iterative Information Bottleneck, IIB-LPO)的潜在策略优化方法,其核心思想是将探索从对词元分布的统计扰动转变为对推理轨迹拓扑结构的分支扩展:通过在高熵状态触发潜在分支以多样化推理路径,并利用信息瓶颈原理作为轨迹过滤器和自奖励机制,从而实现简洁且信息丰富的探索,显著提升数学推理任务中的准确率与多样性指标。
链接: https://arxiv.org/abs/2601.05870
作者: Huilin Deng,Hongchen Luo,Yue Zhu,Long Li,Zhuoyue Chen,Xinghao Zhao,Ming Li,Jihai Zhang,Mengchang Wang,Yang Cao,Yu Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning have been hindered by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. While existing methods leverage policy entropy to encourage exploration, they face inherent limitations. Global entropy regularization is susceptible to reward hacking, which can induce meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and a self-reward mechanism, ensuring concise and informative exploration. Empirical results across four mathematical reasoning benchmarks demonstrate that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
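IIB-LPO 在高熵解码状态触发潜在分支。以下示意(阈值 1.0 nat 为假设值)演示这一基于不确定性的判别准则的最小实现:

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def should_branch(probs, threshold=1.0):
    """Trigger latent branching only in uncertain (high-entropy) decoding states."""
    return token_entropy(probs) > threshold

assert should_branch([0.25, 0.25, 0.25, 0.25])      # uniform: explore alternatives
assert not should_branch([0.97, 0.01, 0.01, 0.01])  # confident: continue directly
```

置信状态直接续写、不确定状态才展开候选延续,这正是摘要所述"将探索从统计扰动转为轨迹拓扑分支"的触发条件。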
zh
[AI-7] DexterCap: An Affordable and Automated System for Capturing Dexterous Hand-Object Manipulation
【速读】:该论文旨在解决在复杂手部-物体交互中,由于手指间严重自遮挡(self-occlusion)以及手中操作动作细微性导致的精细动作捕捉难题。现有光学动捕系统依赖昂贵的摄像机阵列和大量人工后处理,而低成本视觉方法在遮挡下精度与可靠性显著下降。其解决方案的关键在于提出DexterCap系统:通过使用密集排列、字符编码的标记贴片(marker patches)实现高鲁棒性的跟踪性能,同时结合全自动重建流程,极大减少人工干预,从而在低成本条件下实现对灵巧手部操作行为的精准捕捉。
链接: https://arxiv.org/abs/2601.05844
作者: Yutong Liang,Shiyi Xu,Yulong Zhang,Bowen Zhan,He Zhang,Libin Liu
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 12 pages, 12 figures
Abstract:Capturing fine-grained hand-object interactions is challenging due to severe self-occlusion from closely spaced fingers and the subtlety of in-hand manipulation motions. Existing optical motion capture systems rely on expensive camera setups and extensive manual post-processing, while low-cost vision-based methods often suffer from reduced accuracy and reliability under occlusion. To address these challenges, we present DexterCap, a low-cost optical capture system for dexterous in-hand manipulation. DexterCap uses dense, character-coded marker patches to achieve robust tracking under severe self-occlusion, together with an automated reconstruction pipeline that requires minimal manual effort. With DexterCap, we introduce DexterHand, a dataset of fine-grained hand-object interactions covering diverse manipulation behaviors and objects, from simple primitives to complex articulated objects such as a Rubik’s Cube. We release the dataset and code to support future research on dexterous hand-object interaction.
zh
[AI-8] Intelligent Singularity Avoidance in UR10 Robotic Arm Path Planning Using Hybrid Fuzzy Logic and Reinforcement Learning
【速读】:该论文旨在解决UR10机械臂在路径规划中因奇异点(singularity)导致的失控和设备损坏问题。解决方案的关键在于构建一个融合模糊逻辑安全系统与强化学习算法的混合框架:首先通过可操作性度量(manipulability measure)和条件数分析实现对奇异点的实时检测,再利用模糊逻辑进行决策判断;随后基于强化学习构建稳定自适应的路径规划机制,从而在保证安全性的同时提升任务成功率。实验表明,该方法在保持与奇异配置安全距离的前提下,成功到达目标位置的概率达到90%。
链接: https://arxiv.org/abs/2601.05836
作者: Sheng-Kai Chen,Jyh-Horng Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Published in TANET 2025 (Paper No. T0404)
Abstract:This paper presents a comprehensive approach to singularity detection and avoidance in UR10 robotic arm path planning through the integration of fuzzy logic safety systems and reinforcement learning algorithms. The proposed system addresses critical challenges in robotic manipulation where singularities can cause loss of control and potential equipment damage. Our hybrid approach combines real-time singularity detection using manipulability measures, condition number analysis, and fuzzy logic decision-making with a stable reinforcement learning framework for adaptive path planning. Experimental results demonstrate a 90% success rate in reaching target positions while maintaining safe distances from singular configurations. The system integrates PyBullet simulation for training data collection and URSim connectivity for real-world deployment.
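论文用可操作性度量与条件数做奇异点实时检测。以下为这两个标准量的最小 numpy 实现示意(与论文代码无关;示例雅可比矩阵为假设):

```python
import numpy as np

def manipulability(J):
    """Yoshikawa measure w = sqrt(det(J J^T)); w -> 0 near a singularity."""
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))

def condition_number(J):
    """sigma_max / sigma_min of the Jacobian; large values also flag singularity."""
    s = np.linalg.svd(J, compute_uv=False)
    return float(s[0] / s[-1]) if s[-1] > 0 else float("inf")

J_ok = np.eye(3)                     # well-conditioned configuration
J_bad = np.array([[1.0, 0.0, 0.0],   # two collinear rows -> singular Jacobian
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
assert np.isclose(manipulability(J_ok), 1.0)
assert manipulability(J_bad) < 1e-6
```

实时监控这两个量即可在机械臂接近奇异构型前触发模糊逻辑安全层的规避决策。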
zh
[AI-9] Influence of Parallelism in Vector-Multiplication Units on Correlation Power Analysis
【速读】:该论文旨在解决边缘设备中神经网络(Neural Networks)推理过程面临的侧信道攻击(Side-Channel Attacks)安全问题,特别是针对硬件加速器中并行处理对相关性功耗分析(Correlation Power Analysis, CPA)成功率的影响。其解决方案的关键在于理论建模并实证验证:在全连接层中,多个神经元并行执行乘加运算(Multiply-Accumulate Operations)时,由于同时处理相同输入值,会导致整体功耗波动减弱,从而降低CPA攻击的成功率;作者基于此现象推导出描述相关性随并行度增加而衰减的数学表达式,并通过FPGA实现的向量乘法单元验证了该模型的有效性。
链接: https://arxiv.org/abs/2601.05828
作者: Manuel Brosch,Matthias Probst,Stefan Kögler,Georg Sigl
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:The use of neural networks in edge devices is increasing, which introduces new security challenges related to the neural networks’ confidentiality. As edge devices often offer physical access, attacks targeting the hardware, such as side-channel analysis, must be considered. To enhance the performance of neural network inference, hardware accelerators are commonly employed. This work investigates the influence of parallel processing within such accelerators on correlation-based side-channel attacks that exploit power consumption. The focus is on neurons that are part of the same fully-connected layer, which run parallel and simultaneously process the same input value. The theoretical impact of concurrent multiply-and-accumulate operations on overall power consumption is evaluated, as well as the success rate of correlation power analysis. Based on the observed behavior, equations are derived that describe how the correlation decreases with increasing levels of parallelism. The applicability of these equations is validated using a vector-multiplication unit implemented on an FPGA.
zh
[AI-10] Decoding Workload and Agreement From EEG During Spoken Dialogue With Conversational AI
【速读】:该论文旨在解决如何将被动脑机接口(Passive Brain-Computer Interface, BCI)信号,特别是基于EEG的注意力负荷(mental workload)和隐式同意(implicit agreement)解码模型,有效应用于自然语音人机对话场景中,以实现对大型语言模型(Large Language Models, LLMs)的隐式反馈对齐。其解决方案的关键在于构建了一个端到端的处理流程,包括语音转录、事件标注以及将词级对话事件与连续EEG分类器输出进行精确时间对齐,从而在拼写蜜蜂任务和句子补全任务两种对话范式中验证了跨范式的迁移可行性,并揭示了隐式同意信号在对话中的连续应用潜力及其在概念迁移和事件触发异步性方面的限制。
链接: https://arxiv.org/abs/2601.05825
作者: Lucija Mihić Zidar,Philipp Wicke,Praneel Bhatia,Rosa Lutz,Marius Klug,Thorsten O. Zander
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at the 14th International Winter Conference on Brain-Computer Interface
Abstract:Passive brain-computer interfaces offer a potential source of implicit feedback for alignment of large language models, but most mental state decoding has been done in controlled tasks. This paper investigates whether established EEG classifiers for mental workload and implicit agreement can be transferred to spoken human-AI dialogue. We introduce two conversational paradigms - a Spelling Bee task and a sentence completion task - and an end-to-end pipeline for transcribing, annotating, and aligning word-level conversational events with continuous EEG classifier output. In a pilot study, workload decoding showed interpretable trends during spoken interaction, supporting cross-paradigm transfer. For implicit agreement, we demonstrate continuous application and precise temporal alignment to conversational events, while identifying limitations related to construct transfer and asynchronous application of event-based classifiers. Overall, the results establish feasibility and constraints for integrating passive BCI signals into conversational AI systems.
zh
[AI-11] nsor-DTI: Enhancing Biomolecular Interaction Prediction with Contrastive Embedding Learning ICLR2025
【速读】:该论文旨在解决药物-靶标相互作用(Drug-Target Interaction, DTI)预测中现有模型依赖单一模态预定义分子描述符或序列嵌入、代表性有限的问题。其解决方案的关键在于提出Tensor-DTI框架,该框架通过对比学习(contrastive learning)整合分子图嵌入、蛋白质语言模型(protein language model)和结合位点预测的多模态信息,并采用孪生双编码器架构(siamese dual-encoder architecture),从而同时捕捉化学与结构相互作用特征,并有效区分相互作用与非相互作用对,显著提升DTI预测准确性与模型可靠性。
链接: https://arxiv.org/abs/2601.05792
作者: Manel Gil-Sorribes,Júlia Vilalta-Mor,Isaac Filella-Mercè,Robert Soliva,Álvaro Ciudad,Víctor Guallar,Alexis Molina
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: Accepted at the Generative and Experimental Perspectives for Biomolecular Design Workshop at ICLR 2025 and at the Learning Meaningful Representations of Life Workshop at ICLR 2025
Abstract:Accurate drug-target interaction (DTI) prediction is essential for computational drug discovery, yet existing models often rely on single-modality predefined molecular descriptors or sequence-based embeddings with limited representativeness. We propose Tensor-DTI, a contrastive learning framework that integrates multimodal embeddings from molecular graphs, protein language models, and binding-site predictions to improve interaction modeling. Tensor-DTI employs a siamese dual-encoder architecture, enabling it to capture both chemical and structural interaction features while distinguishing interacting from non-interacting pairs. Evaluations on multiple DTI benchmarks demonstrate that Tensor-DTI outperforms existing sequence-based and graph-based models. We also conduct large-scale inference experiments on CDK2 across billion-scale chemical libraries, where Tensor-DTI produces chemically plausible hit distributions even when CDK2 is withheld from training. In enrichment studies against Glide docking and Boltz-2 co-folder, Tensor-DTI remains competitive on CDK2 and improves the screening budget required to recover moderate fractions of high-affinity ligands on out-of-family targets under strict family-holdout splits. Additionally, we explore its applicability to protein-RNA and peptide-protein interactions. Our findings highlight the benefits of integrating multimodal information with contrastive objectives to enhance interaction-prediction accuracy and to provide more interpretable and reliability-aware models for virtual screening.
zh
[AI-12] SAFE: Secure and Accurate Federated Learning for Privacy-Preserving Brain-Computer Interfaces
【速读】:该论文旨在解决基于脑电图(EEG)的脑机接口(BCI)在解码算法中存在的三大核心问题:泛化能力不足、对抗脆弱性以及隐私泄露风险。其解决方案的关键在于提出一种名为SAFE(Secure and Accurate FEderated learning)的联邦学习框架,通过在本地保留数据以实现隐私保护,并引入局部批次特定归一化来缓解跨被试特征分布偏移,从而提升模型泛化性能;同时,结合联邦对抗训练与对抗权重扰动,在输入空间和参数空间中引入扰动,显著增强模型的对抗鲁棒性。实验表明,SAFE无需目标被试校准数据即可在多个BCI范式下同时实现高解码准确率、强对抗鲁棒性和可靠隐私保护,优于现有14种先进方法。
链接: https://arxiv.org/abs/2601.05789
作者: Tianwang Jia,Xiaoqing Chen,Dongrui Wu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 9 figures
Abstract:Electroencephalogram (EEG)-based brain-computer interfaces (BCIs) are widely adopted due to their efficiency and portability; however, their decoding algorithms still face multiple challenges, including inadequate generalization, adversarial vulnerability, and privacy leakage. This paper proposes Secure and Accurate FEderated learning (SAFE), a federated learning-based approach that protects user privacy by keeping data local during model training. SAFE employs local batch-specific normalization to mitigate cross-subject feature distribution shifts and hence improves model generalization. It further enhances adversarial robustness by introducing perturbations in both the input space and the parameter space through federated adversarial training and adversarial weight perturbation. Experiments on five EEG datasets from motor imagery (MI) and event-related potential (ERP) BCI paradigms demonstrated that SAFE consistently outperformed 14 state-of-the-art approaches in both decoding accuracy and adversarial robustness, while ensuring privacy protection. Notably, it even outperformed centralized training approaches that do not consider privacy protection at all. To our knowledge, SAFE is the first algorithm to simultaneously achieve high decoding accuracy, strong adversarial robustness, and reliable privacy protection without using any calibration data from the target subject, making it highly desirable for real-world BCIs.
zh
[AI-13] From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation
【速读】:该论文旨在解决如何在有限的专家轨迹数据下,通过强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)中高效训练端到端的视觉-语言策略(end-to-end screenshot-to-action policies),以提升其在桌面环境任务中的性能表现。当前主流框架如OSWorld-Verified面临两个瓶颈:一是交互式、可验证任务数量稀少,二是专家轨迹采集成本高、难以扩展。针对这一问题,作者提出BEPA(Bi-Level Expert-to-Policy Assimilation)方法,其核心创新在于双层机制:第一层(LEVEL-1)利用基础策略生成自洽的可达轨迹,将静态专家轨迹转化为与策略对齐的引导信号;第二层(LEVEL-2)引入每任务动态更新的缓存机制,在RLVR训练过程中持续融合专家知识,缓解分布偏移和结构不匹配问题。实验表明,BEPA显著提升了UITARS1.5-7B模型在OSWorld-Verified上的成功率(从22.87%提升至32.13%),并在多个GUI基准测试中保持一致优势。
链接: https://arxiv.org/abs/2601.05787
作者: Zezhou Wang,Ziyun Zhang,Xiaoyi Zhang,Zhuzhong Qian,Yan Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Work In Progress
Abstract:Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: this https URL
zh
[AI-14] Variational Autoencoders for P-wave Detection on Strong Motion Earthquake Spectrograms
【速读】:该论文旨在解决地震早期预警中P波(P-wave)检测的难题,尤其是在强震记录因高噪声水平、标注数据有限及波形特征复杂而导致传统方法性能受限的情况下。其解决方案的关键在于将P波到达检测重新建模为自监督异常检测任务,并通过系统性地比较492种变分自编码器(Variational Autoencoder, VAE)架构配置,发现:虽然跳跃连接(skip connections)能最小化重建误差(平均绝对误差约0.0012),但会引发“过度泛化”问题,导致模型无法有效区分噪声与真实信号;而引入注意力机制(attention mechanisms)则优先捕捉全局上下文信息而非局部细节,显著提升异常判别能力,在曲线下面积(AUC)达0.875,且在近源区(0–40 km)达到0.91的AUC,展现出对即时地震预警的高度适用性。因此,该研究确立了以全局上下文为导向的网络结构设计是实现鲁棒自监督P波检测的核心要素。
链接: https://arxiv.org/abs/2601.05759
作者: Turkan Simge Ispak,Salih Tileylioglu,Erdem Akagunduz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures, 3 tables
Abstract:Accurate P-wave detection is critical for earthquake early warning, yet strong-motion records pose challenges due to high noise levels, limited labeled data, and complex waveform characteristics. This study reframes P-wave arrival detection as a self-supervised anomaly detection task to evaluate how architectural variations regulate the trade-off between reconstruction fidelity and anomaly discrimination. Through a comprehensive grid search of 492 Variational Autoencoder configurations, we show that while skip connections minimize reconstruction error (Mean Absolute Error approximately 0.0012), they induce “overgeneralization”, allowing the model to reconstruct noise and masking the detection signal. In contrast, attention mechanisms prioritize global context over local detail and yield the highest detection performance with an area-under-the-curve of 0.875. The attention-based Variational Autoencoder achieves an area-under-the-curve of 0.91 in the 0 to 40-kilometer near-source range, demonstrating high suitability for immediate early warning applications. These findings establish that architectural constraints favoring global context over pixel-perfect reconstruction are essential for robust, self-supervised P-wave detection.
zh
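The anomaly-detection framing above scores each spectrogram window by reconstruction error and evaluates detection with the area under the ROC curve. A minimal sketch of that evaluation step (the error values below are made up for illustration; in the paper the scores come from a trained Variational Autoencoder):

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the Mann-Whitney rank statistic: the
    probability that a positive window outscores a negative one, with ties
    counted as half a win."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Reconstruction error as the anomaly score: windows containing a P-wave
# arrival reconstruct poorly, noise-only windows reconstruct well.
noise_err = [0.010, 0.020, 0.015, 0.012]   # illustrative reconstruction errors
pwave_err = [0.080, 0.050, 0.030, 0.090]
print(auc(pwave_err, noise_err))  # → 1.0 (perfect separation in this toy data)
```

The "overgeneralization" failure mode the abstract describes corresponds to the noise and P-wave error distributions overlapping, which drags this statistic toward 0.5.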
[AI-15] VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在开放环境中因间接提示注入(indirect prompt injection)带来的安全风险,尤其是工具流(tool stream)中被操纵的元数据和运行时反馈劫持执行流程的问题。现有防御方法面临两难困境:一方面,先进模型因严格对齐机制更易采纳恶意注入规则;另一方面,静态保护机制会切断自适应推理所需的反馈回路。解决方案的关键在于提出 VIGIL 框架,其核心思想是从限制性隔离转向“先验证后提交”(verify-before-commit)协议,通过支持推测性假设生成并基于意图 grounded 的验证机制,在保持推理灵活性的同时实现强安全性控制。
链接: https://arxiv.org/abs/2601.05755
作者: Junda Lin,Zhaomeng Zhou,Zhi Zheng,Shuochen Liu,Tong Xu,Yong Chen,Enhong Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM agents operating in open environments face escalating risks from indirect prompt injection, particularly within the tool stream where manipulated metadata and runtime feedback hijack execution flow. Existing defenses encounter a critical dilemma as advanced models prioritize injected rules due to strict alignment while static protection mechanisms sever the feedback loop required for adaptive reasoning. To reconcile this conflict, we propose VIGIL, a framework that shifts the paradigm from restrictive isolation to a verify-before-commit protocol. By facilitating speculative hypothesis generation and enforcing safety through intent-grounded verification, VIGIL preserves reasoning flexibility while ensuring robust control. We further introduce SIREN, a benchmark comprising 959 tool stream injection cases designed to simulate pervasive threats characterized by dynamic dependencies. Extensive experiments demonstrate that VIGIL outperforms state-of-the-art dynamic defenses by reducing the attack success rate by over 22% while more than doubling the utility under attack compared to static baselines, thereby achieving an optimal balance between security and utility. Code is available at this https URL.
zh
[AI-16] DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation
【速读】:该论文旨在解决当前基于大语言模型的多智能体系统(Multi-Agent Systems, MAS)在多智能体辩论(Multi-Agent Debate, MAD)框架下存在的局限性,即由于缺乏引导的初始设置导致各智能体采用相同的推理路径,从而引发重复性错误,削弱了辩论的有效性,最终结果往往退化为简单的多数投票机制。为此,作者提出动态多智能体辩论(Dynamic Multi-Agent Debate, DynaDebate),其解决方案的关键在于三个核心机制:(1) 动态路径生成与分配机制,通过专用路径生成智能体自适应地生成多样且逻辑合理的解题路径;(2) 以过程为中心的辩论机制,将辩论焦点从表层结果投票转向对每一步推理逻辑的严格批判,确保过程正确性;(3) 触发式验证智能体机制,在智能体间出现分歧时激活,并借助外部工具客观化解僵局。实验表明,DynaDebate 在多个基准测试中显著优于现有先进MAD方法。
链接: https://arxiv.org/abs/2601.05746
作者: Zhenghao Li,Zhi Zheng,Wei Chen,Jielun Zhao,Yong Chen,Tong Xu,Enhong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16pages,6figures
Abstract:Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. Recently, researchers have further investigated Multi-Agent Debate (MAD) frameworks, which enhance the reasoning and collaboration capabilities of MAS through information exchange and debate among multiple agents. However, existing approaches often rely on unguided initialization, causing agents to adopt identical reasoning paths that lead to the same errors. As a result, effective debate among agents is hindered, and the final outcome frequently degenerates into simple majority voting. To solve the above problem, in this paper, we introduce Dynamic Multi-Agent Debate (DynaDebate), which enhances the effectiveness of multi-agent debate through three key mechanisms: (1) Dynamic Path Generation and Allocation, which employs a dedicated Path Generation Agent to generate diverse and logical solution paths with adaptive redundancy; (2) Process-Centric Debate, which shifts the focus from surface-level outcome voting to rigorous step-by-step logic critique to ensure process correctness; (3) A Trigger-Based Verification Agent, which is activated upon disagreement and uses external tools to objectively resolve deadlocks. Extensive experiments demonstrate that DynaDebate achieves superior performance across various benchmarks, surpassing existing state-of-the-art MAD methods.
zh
[AI-17] he Echo Chamber Multi-Turn LLM Jailbreak
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的聊天机器人面临的安全挑战,特别是“越狱攻击”(jailbreaking)问题,即通过恶意构造的提示(prompt)和输入绕过其安全防护机制。解决方案的关键在于提出一种名为“回音室”(Echo Chamber)的新式多轮(multi-turn)越狱攻击方法,该方法采用渐进式升级策略,通过精心设计的一系列交互逐步突破模型的安全限制,从而在多个前沿模型上展现出显著的攻击效果。
链接: https://arxiv.org/abs/2601.05742
作者: Ahmad Alobaid(NeuralTrust),Martí Jordà Roca(NeuralTrust),Carlos Castillo(ICREA and UPF),Joan Vendrell(NeuralTrust)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The availability of Large Language Models (LLMs) has led to a new generation of powerful chatbots that can be developed at relatively low cost. As companies deploy these tools, security challenges need to be addressed to prevent financial loss and reputational damage. A key security challenge is jailbreaking, the malicious manipulation of prompts and inputs to bypass a chatbot’s safety guardrails. Multi-turn attacks are a relatively new form of jailbreaking involving a carefully crafted chain of interactions with a chatbot. We introduce Echo Chamber, a new multi-turn attack using a gradual escalation method. We describe this attack in detail, compare it to other multi-turn attacks, and demonstrate its performance against multiple state-of-the-art models through extensive evaluation.
zh
[AI-18] mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
【速读】:该论文旨在解决超连接(Hyper-Connections, HC)在深度神经网络中因未约束的残差矩阵导致训练不稳定的问题,以及其改进方案Manifold-Constrained Hyper-Connections (mHC) 在实际应用中存在的两个局限:一是有限次数的Sinkhorn–Knopp (SK) 迭代无法保证矩阵严格双随机性(doubly stochasticity),误差随网络深度累积影响稳定性;二是SK实现依赖高度定制化的CUDA内核,工程门槛高且可移植性差。解决方案的关键在于提出mHC-lite,通过Birkhoff–von Neumann定理,将双随机矩阵显式构造为置换矩阵的凸组合,从而在不依赖SK迭代的情况下保证精确的双随机性,并仅使用原生矩阵运算即可实现,显著提升了训练稳定性和计算效率。
链接: https://arxiv.org/abs/2601.05732
作者: Yongyi Yang,Jianyang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek’s Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn–Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff–von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at this https URL.
zh
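The core idea stated in the abstract above is that any convex combination of permutation matrices is exactly doubly stochastic (Birkhoff-von Neumann), so no iterative Sinkhorn-Knopp normalization is needed. A minimal numpy sketch of such a reparameterization (function and parameter names are illustrative, not taken from the paper's code):

```python
import numpy as np

def doubly_stochastic(logits: np.ndarray, perms: list) -> np.ndarray:
    """Build a doubly stochastic matrix as a softmax-weighted convex
    combination of permutation matrices. Because the softmax weights are
    non-negative and sum to one, the result is exactly doubly stochastic
    by construction, with native matrix ops only."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    n = len(perms[0])
    M = np.zeros((n, n))
    for wi, p in zip(w, perms):
        M += wi * np.eye(n)[p]   # np.eye(n)[p] is the permutation matrix for p
    return M

# Mix the identity with a cyclic shift over 3 residual streams.
M = doubly_stochastic(np.array([0.0, 1.0]), [[0, 1, 2], [1, 2, 0]])
print(np.allclose(M.sum(axis=0), 1.0), np.allclose(M.sum(axis=1), 1.0))  # True True
```

Contrast with SK normalization, which only approaches double stochasticity in the limit of many row/column rescaling iterations; here the row and column sums are exact at every training step.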
[AI-19] Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
【速读】:该论文旨在解决推测解码(Speculative Decoding)中验证环节的瓶颈问题,即如何在不牺牲分布保真度(distribution fidelity)的前提下提升推理速度。现有方法多采用逐标记(token-wise)验证或依赖代理近似,难以处理联合不可计算性(joint intractability),导致接受率(acceptance rate)受限。其解决方案的关键在于提出分层推测解码(Hierarchical Speculative Decoding, HSD),一种可证明无损的验证机制,通过平衡可访问分支上的超额与不足概率质量(excess and deficient probability mass),显著增加期望接受标记数,并有效克服联合不可计算性问题。
链接: https://arxiv.org/abs/2601.05724
作者: Yuxuan Zhou,Fei Huang,Heng Li,Fengyi Wu,Tianyu Wang,Jianwei Zhang,Junyang Lin,Zhi-Qi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at this https URL.
zh
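For context on the entry above, the token-wise verification that sequence-level schemes like HSD improve upon accepts each draft token with probability min(1, p/q) and, on rejection, resamples from the renormalized residual max(p - q, 0), which keeps the output distributed exactly as the target model. A sketch of that standard baseline rule (HSD itself is not shown here; names are illustrative):

```python
import random

def verify_token(token, p, q, rng=None):
    """Standard token-wise lossless verification for speculative decoding.
    Accept the draft token with probability min(1, p[token]/q[token]);
    on rejection, resample from the renormalized residual max(p - q, 0).
    p, q: dicts mapping token -> probability (target / draft model)."""
    rng = rng or random
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
    z = sum(residual.values())
    r, acc = rng.random() * z, 0.0
    for t, mass in residual.items():
        acc += mass
        if r <= acc:
            return t, False
    return t, False  # numerical fallback

# If draft and target agree, the draft token is always accepted.
tok, accepted = verify_token("a", {"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
print(tok, accepted)  # a True
```

The acceptance probability here depends only on the current token's marginal; the paper's point is that verifying whole sequences (while balancing excess and deficient mass across branches) accepts more tokens without breaking this losslessness guarantee.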
[AI-20] AIBoMGen: Generating an AI Bill of Materials for Secure Transparent and Compliant Model Training
【速读】:该论文旨在解决复杂人工智能(AI)系统快速部署与保障其透明性、安全性及监管合规性之间存在的工具滞后问题。解决方案的关键在于提出并实现AI Bill of Materials (AIBOM),即一种标准化、可验证的训练后AI模型及其环境记录机制,作为软件物料清单(SBOM)在AI领域的延伸。其核心创新在于通过一个中立的第三方训练平台——AIBoMGen,自动捕获数据集、模型元数据和环境信息,并利用加密哈希、数字签名和in-toto证明技术生成带签名的AIBOM,从而确保完整性并抵御恶意模型创建者对模型制品的篡改行为。评估表明,该方案能可靠检测所有未经授权的修改,且性能开销可忽略不计,为构建安全可信的AI生态系统提供了基础支撑。
链接: https://arxiv.org/abs/2601.05703
作者: Wiebe Vandendriessche,Jordi Thijsman,Laurens D’hooge,Bruno Volckaert,Merlijn Sebrechts
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted at ACM/IEEE CAIN 2026
Abstract:The rapid adoption of complex AI systems has outpaced the development of tools to ensure their transparency, security, and regulatory compliance. In this paper, the AI Bill of Materials (AIBOM), an extension of the Software Bill of Materials (SBOM), is introduced as a standardized, verifiable record of trained AI models and their environments. Our proof-of-concept platform, AIBoMGen, automates the generation of signed AIBOMs by capturing datasets, model metadata, and environment details during training. The training platform acts as a neutral, third-party observer and root of trust. It enforces verifiable AIBOM creation for every job. The system uses cryptographic hashing, digital signatures, and in-toto attestations to ensure integrity and protect against threats such as artifact tampering by dishonest model creators. Our evaluation demonstrates that AIBoMGen reliably detects unauthorized modifications to all artifacts and can generate AIBOMs with negligible performance overhead. These results highlight the potential of AIBoMGen as a foundational step toward building secure and transparent AI ecosystems, enabling compliance with regulatory frameworks like the EU's AI Act.
zh
[AI-21] Circular Reasoning : Understanding Self-Reinforcing Loops in Large Reasoning Models
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在测试时扩展过程中频繁出现的循环推理(Circular Reasoning)问题,这种现象会导致计算资源浪费和推理失败。其关键解决方案是通过构建LoopBench数据集系统识别两种类型的循环模式(数值循环与陈述循环),并揭示循环推理的本质为状态坍缩(state collapse),表现为语义重复先于文本重复;进一步发现推理障碍触发循环启动,并由自增强的V型注意力机制维持不可逃脱的循环。基于此机制,研究提出使用累积和(CUSUM)算法捕捉早期前兆信号,实现对循环的精准预测,从而提升长链推理的稳定性。
链接: https://arxiv.org/abs/2601.05693
作者: Zenghao Duan,Liang Pang,Zihao Wei,Wenbin Duan,Yuxin Tian,Shicheng Xu,Jingcheng Deng,Zhiyi Yin,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the success of test-time scaling, Large Reasoning Models (LRMs) frequently encounter repetitive loops that lead to computational waste and inference failure. In this paper, we identify a distinct failure mode termed Circular Reasoning. Unlike traditional model degeneration, this phenomenon manifests as a self-reinforcing trap where generated content acts as a logical premise for its own recurrence, compelling the reiteration of preceding text. To systematically analyze this phenomenon, we introduce LoopBench, a dataset designed to capture two distinct loop typologies: numerical loops and statement loops. Mechanistically, we characterize circular reasoning as a state collapse exhibiting distinct boundaries, where semantic repetition precedes textual repetition. We reveal that reasoning impasses trigger the loop onset, which subsequently persists as an inescapable cycle driven by a self-reinforcing V-shaped attention mechanism. Guided by these findings, we employ the Cumulative Sum (CUSUM) algorithm to capture these precursors for early loop prediction. Experiments across diverse LRMs validate its accuracy and elucidate the stability of long-chain reasoning.
zh
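The CUSUM detector mentioned in the abstract above flags a sustained upward drift in a monitored statistic before the loop fully sets in. A minimal one-sided CUSUM sketch (here the monitored statistic is a semantic-repetition score; the signal values, slack, and threshold below are illustrative, not the paper's settings):

```python
def cusum_alarm(signal, target=0.0, slack=0.05, threshold=0.5):
    """One-sided CUSUM: accumulate positive deviations of `signal` above
    `target` (less a slack term that absorbs normal fluctuation) and flag
    the first step where the running sum exceeds `threshold`.
    Returns the alarm index, or None if no drift is detected."""
    s = 0.0
    for t, x in enumerate(signal):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return t
    return None

# A repetition score that drifts upward mid-generation (onset at t=4)
# triggers an alarm a few steps later, before textual repetition appears.
scores = [0.02, 0.01, 0.03, 0.02, 0.25, 0.30, 0.35, 0.40]
print(cusum_alarm(scores))  # → 6
```

Because semantic repetition precedes textual repetition in the paper's analysis, accumulating small early deviations like this is what makes prediction possible before the cycle becomes inescapable.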
[AI-22] CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space AAAI2026
【速读】:该论文旨在解决混合动作空间(hybrid action space)建模与优化中的核心挑战,即在同时包含离散选择和连续参数的场景下,如何提升策略表达能力并实现高维空间中的可扩展性。其解决方案的关键在于提出一种协同式混合扩散策略框架(Cooperative Hybrid Diffusion Policies, CHDP),该框架通过两个协作智能体分别使用离散和连续扩散策略,并以离散动作的表征条件化连续策略,显式建模二者依赖关系;同时引入顺序更新机制缓解策略同步更新带来的冲突,以及基于码本(codebook)的低维嵌入结构提升高维离散动作空间的学习效率,并设计基于Q函数的引导机制对齐码本嵌入与策略表示,从而显著增强模型性能,在多个基准测试中成功率达到现有最优方法的19.3%。
链接: https://arxiv.org/abs/2601.05675
作者: Bingyi Liu,Jinbo He,Haiyong Shi,Enshu Wang,Weizhen Han,Jingxiang Hao,Peixi Wang,Zhuangzhuang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026
Abstract:Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space remains a fundamental challenge, mainly due to limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the discrete action’s representation, explicitly modeling the dependency between them. This cooperative design allows the diffusion policies to leverage their expressiveness to capture complex distributions in their respective action spaces. To mitigate the update conflicts arising from simultaneous policy updates in this cooperative setting, we employ a sequential update scheme that fosters co-adaptation. Moreover, to improve scalability when learning in high-dimensional discrete action space, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping enables the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism to align the codebook’s embeddings with the discrete policy’s representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.
zh
[AI-23] Advancing credit mobility through stakeholder-informed AI design and adoption
【速读】:该论文旨在解决高校间学分转移过程中因课程等效性(articulation)认定困难而导致的学生学业延迟与额外成本问题。当前,学分等效性确认主要依赖人工审核,效率低下且难以规模化。解决方案的关键在于引入人工智能技术,通过构建一种监督式对齐方法,有效克服课程目录描述中的表面匹配偏差和机构偏见,从而显著提升推荐准确性——相较先前方法提升5.5倍;结合调研中61%的教师与工作人员采纳率,预估可使原本无法实现的有效学分流动机会增加12倍,推动学生学分流动性提升并重塑高校在课程等效决策中的治理模式。
链接: https://arxiv.org/abs/2601.05666
作者: Yerin Kwak,Siddharth Adelkar,Zachary A. Pardos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 17 pages, 8 figures
Abstract:Transferring from a 2-year to a 4-year college is crucial for socioeconomic mobility, yet students often face challenges ensuring their credits are fully recognized, leading to delays in their academic progress and unexpected costs. Determining whether courses at different institutions are equivalent (i.e., articulation) is essential for successful credit transfer, as it minimizes unused credits and increases the likelihood of bachelor’s degree completion. However, establishing articulation agreements remains time- and resource-intensive, as all candidate articulations are reviewed manually. Although recent efforts have explored the use of artificial intelligence to support this work, its use in articulation practice remains limited. Given these challenges and the need for scalable support, this study applies artificial intelligence to suggest articulations between institutions in collaboration with the State University of New York system, one of the largest systems of higher education in the US. To develop our methodology, we first surveyed articulation staff and faculty to assess adoption rates of baseline algorithmic recommendations and gather feedback on perceptions and concerns about these recommendations. Building on these insights, we developed a supervised alignment method that addresses superficial matching and institutional biases in catalog descriptions, achieving a 5.5-fold improvement in accuracy over previous methods. Based on articulation predictions of this method and a 61% average surveyed adoption rate among faculty and staff, these findings project a 12-fold increase in valid credit mobility opportunities that would otherwise remain unrealized. This study suggests that stakeholder-informed design of AI in higher education administration can expand student credit mobility and help reshape current institutional decision-making in course articulation.
zh
[AI-24] HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation
【速读】:该论文旨在解决代理初始化(Agent Initialization)在不同领域中缺乏高保真度的问题,具体表现为现有方法要么无法适应未见主题(静态数据检索方法),要么忽视宏观分布一致性导致微观个体属性与现实不符(大语言模型生成方法)。其解决方案的关键在于提出一种分层代理生成框架(HAG),将人群生成建模为两阶段决策过程:首先通过世界知识模型(World Knowledge Model)推导层级条件概率以构建自适应主题树(Topic-Adaptive Tree),实现宏观分布对齐;其次基于真实世界数据进行实例化和智能体增强,确保微观层面的一致性。
链接: https://arxiv.org/abs/2601.05656
作者: Rongxin Chen,Tianyu Wu,Bingbing Xu,Xiucheng Xu,Huawei Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:High-fidelity agent initialization is crucial for credible Agent-Based Modeling across diverse domains. A robust framework should be Topic-Adaptive, capturing macro-level joint distributions while ensuring micro-level individual rationality. Existing approaches fall into two categories: static data-based retrieval methods that fail to adapt to unseen topics absent from the data, and LLM-based generation methods that lack macro-level distribution awareness, resulting in inconsistencies between micro-level persona attributes and reality. To address these problems, we propose HAG, a Hierarchical Agent Generation framework that formalizes population generation as a two-stage decision process. Firstly, utilizing a World Knowledge Model to infer hierarchical conditional probabilities to construct the Topic-Adaptive Tree, achieving macro-level distribution alignment. Then, grounded in real-world data, instantiation and agentic augmentation are carried out to ensure micro-level consistency. Given the lack of specialized evaluation, we establish a multi-domain benchmark and a comprehensive PACE evaluation framework. Extensive experiments show that HAG significantly outperforms representative baselines, reducing population alignment errors by an average of 37.7% and enhancing sociological consistency by 18.8%.
zh
[AI-25] ransformer Is Inherently a Causal Learner
【速读】:该论文旨在解决多变量时间序列中的因果发现(causal discovery)问题,即从观测数据中自动识别变量间的时滞因果关系。传统方法在处理非线性动态、长程依赖和非平稳系统时性能受限,且缺乏随数据量增加而提升的可扩展性。解决方案的关键在于利用自回归训练的Transformer模型所天然编码的时间延迟因果结构:通过分析输出对过去输入的梯度敏感性,无需显式因果目标或结构约束即可直接恢复底层因果图。理论证明在标准可识别条件下成立,并提出基于梯度归因聚合的实用提取方法,在复杂场景下显著优于现有最优算法,且表现出随数据异质性和规模增长而提升的因果准确性,为基于基础模型的因果发现提供了新范式。
链接: https://arxiv.org/abs/2601.05647
作者: Xinyue Wang,Stephen Wang,Biwei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.
[AI-26] GenCtrl – A Formal Controllability Toolkit for Generative Models
【Quick Read】: This paper asks whether generative models are actually controllable with theoretical guarantees, since current control methods, from prompt engineering to fine-tuning, offer none. The key contribution is a formal framework that models human-model interaction as a control process, plus a novel algorithm for estimating the controllable sets of models in a dialogue setting. The algorithm provides probably approximately correct (PAC) error bounds that are distribution-free, require no assumptions beyond output boundedness, and apply to any black-box nonlinear control system (i.e., any generative model), enabling rigorous quantitative analysis of model controllability for the first time.
Link: https://arxiv.org/abs/2601.05637
Authors: Emily Cheng, Carmen Amo Alonso, Federico Danieli, Arno Blaas, Luca Zappella, Pau Rodriguez, Xavier Suau
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Notes:
Abstract:As generative models become ubiquitous, there is a critical need for fine-grained control over the generation process. Yet, while controlled generation methods from prompting to fine-tuning proliferate, a fundamental question remains unanswered: are these models truly controllable in the first place? In this work, we provide a theoretical framework to formally answer this question. Framing human-model interaction as a control process, we propose a novel algorithm to estimate the controllable sets of models in a dialogue setting. Notably, we provide formal guarantees on the estimation error as a function of sample complexity: we derive probably-approximately correct bounds for controllable set estimates that are distribution-free, employ no assumptions except for output boundedness, and work for any black-box nonlinear control system (i.e., any generative model). We empirically demonstrate the theoretical framework on different tasks in controlling dialogue processes, for both language models and text-to-image generation. Our results show that model controllability is surprisingly fragile and highly dependent on the experimental setting. This highlights the need for rigorous controllability analysis, shifting the focus from simply attempting control to first understanding its fundamental limits.
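A minimal Monte Carlo sketch of estimating whether a target output set is reachable, under heavy assumptions (the scalar `black_box`, the target interval, and the Hoeffding-style sample size are illustrative inventions, not the paper's algorithm; they only convey the distribution-free, boundedness-only flavor of the PAC bounds):

```python
import math
import random

random.seed(0)

# Hypothetical black-box "generative model": maps a control input in [0, 1]
# to a bounded scalar output (e.g., a sentiment score).
def black_box(u):
    return 0.5 + 0.4 * math.sin(2 * math.pi * u)

# Target set we want the output steered into.
def in_target(y, lo=0.8, hi=1.0):
    return lo <= y <= hi

# Hoeffding-style sample size for additive error eps at confidence 1-delta:
# distribution-free, needs only output boundedness.
eps, delta = 0.05, 0.01
n = math.ceil(math.log(2 / delta) / (2 * eps**2))

hits = sum(in_target(black_box(random.random())) for _ in range(n))
reachable_fraction = hits / n
print(n, round(reachable_fraction, 3))
```

The estimate `reachable_fraction` is within `eps` of the true reachable probability with confidence `1 - delta`, for any output distribution, which mirrors the paper's black-box setting.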
[AI-27] Cumulative Path-Level Semantic Reasoning for Inductive Knowledge Graph Completion
【Quick Read】: This paper tackles two challenges in inductive knowledge graph completion (KGC): existing methods are easily misled by noisy structural information during reasoning, and they struggle to capture long-range dependencies along reasoning paths. The key to the proposed Cumulative Path-Level Semantic Reasoning (CPSR) framework is two modules: a query-dependent masking module that adaptively masks noisy structural information while retaining target-relevant signals, and a global semantic scoring module that evaluates both the individual contributions and the collective effects of nodes along reasoning paths, improving the modeling of complex semantic relations. Experiments show state-of-the-art performance on inductive KGC.
Link: https://arxiv.org/abs/2601.05629
Authors: Jiapu Wang, Xinghe Cheng, Zezheng Wu, Ruiqi Ma, Rui Wang, Zhichao Yan, Haoran Luo, Yuhao Jiang, Kai Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Conventional Knowledge Graph Completion (KGC) methods aim to infer missing information in incomplete Knowledge Graphs (KGs) by leveraging existing information, which struggle to perform effectively in scenarios involving emerging entities. Inductive KGC methods can handle the emerging entities and relations in KGs, offering greater dynamic adaptability. While existing inductive KGC methods have achieved some success, they also face challenges, such as susceptibility to noisy structural information during reasoning and difficulty in capturing long-range dependencies in reasoning paths. To address these challenges, this paper proposes the Cumulative Path-Level Semantic Reasoning for inductive knowledge graph completion (CPSR) framework, which simultaneously captures both the structural and semantic information of KGs to enhance the inductive KGC task. Specifically, the proposed CPSR employs a query-dependent masking module to adaptively mask noisy structural information while retaining important information closely related to the targets. Additionally, CPSR introduces a global semantic scoring module that evaluates both the individual contributions and the collective impact of nodes along the reasoning path within KGs. The experimental results demonstrate that CPSR achieves state-of-the-art performance.
[AI-28] PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data Structures Across Nodes
【Quick Read】: This paper targets two obstacles to federated time series forecasting: inconsistent time granularities caused by different sampling standards across nodes, and heterogeneous variable sets across nodes, both of which hinder classical federated learning. The key innovations of the proposed PiXTime model are: (1) a personalized Patch Embedding that maps series of different granularities into token sequences of a unified dimension for a shared Transformer; (2) a Global VE Table that aligns variable category semantics across nodes, improving cross-node knowledge transfer; and (3) a cross-attention mechanism that strengthens prediction of the target series while handling arbitrary numbers of auxiliary variables. This design markedly improves forecasting in federated settings with multi-granularity, heterogeneous-variable data.
Link: https://arxiv.org/abs/2601.05613
Authors: Yiming Zhou, Mingyue Cheng, Hao Wang, Enhong Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Time series are highly valuable and rarely shareable across nodes, making federated learning a promising paradigm to leverage distributed temporal data. However, different sampling standards lead to diverse time granularities and variable sets across nodes, hindering classical federated learning. We propose PiXTime, a novel time series forecasting model designed for federated learning that enables effective prediction across nodes with multi-granularity and heterogeneous variable sets. PiXTime employs a personalized Patch Embedding to map node-specific granularity time series into token sequences of a unified dimension for processing by a subsequent shared model, and uses a global VE Table to align variable category semantics across nodes, thereby enhancing cross-node transferability. With a transformer-based shared model, PiXTime captures representations of auxiliary series with arbitrary numbers of variables and uses cross-attention to enhance the prediction of the target series. Experiments show PiXTime achieves state-of-the-art performance in federated settings and demonstrates superior performance on eight widely used real-world traditional benchmarks.
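The personalized patch-embedding step can be sketched as follows (shapes and the two hypothetical nodes are invented; PiXTime's real embedding is learned, not random, but the mechanism of mapping node-specific granularities to a shared token dimension is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_embed(series, patch_len, W):
    """Map one node's series into unified d-dim tokens.

    patch_len is node-specific (its sampling granularity); W projects a
    patch of that length to the shared model dimension d.
    """
    n_patches = len(series) // patch_len
    patches = series[: n_patches * patch_len].reshape(n_patches, patch_len)
    return patches @ W  # (n_patches, d)

d = 8  # shared model dimension

# Two hypothetical federated nodes with different granularities:
# node A samples hourly (patch of 24 = one day), node B daily (patch of 7).
series_a, series_b = rng.standard_normal(240), rng.standard_normal(70)
W_a = rng.standard_normal((24, d))  # personalized projection for node A
W_b = rng.standard_normal((7, d))   # personalized projection for node B

tokens_a = patch_embed(series_a, 24, W_a)
tokens_b = patch_embed(series_b, 7, W_b)
print(tokens_a.shape, tokens_b.shape)
```

Both nodes end up with token sequences of the same width `d`, so a single shared backbone can process them despite different sampling rates.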
[AI-29] A Causal Information-Flow Framework for Unbiased Learning-to-Rank
【Quick Read】: This paper addresses the multiple biases that arise when training ranking models on click data, including position bias, selection bias, and trust bias, which prevent models from learning the true relevance of items. Existing unbiased learning-to-rank (ULTR) methods mainly correct position bias via propensity estimation, but cannot quantify residual bias, provide risk guarantees, or jointly handle multiple bias sources. The key solution is a causal learning-based ranking framework: structural causal models (SCMs) make the click-generation mechanism explicit so the true relevance signal can be identified; conditional mutual information measures how much bias leaks into learned relevance estimates, yielding a rigorous notion of disentanglement that is added as a training regularizer to suppress bias; and a doubly robust estimator improves the reliability of risk estimation. The framework consistently reduces bias leakage and improves ranking performance in realistic settings where multiple biases interact.
Link: https://arxiv.org/abs/2601.05590
Authors: Haoming Gong, Qingyao Ai, Zhihao Tao, Yongfeng Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:In web search and recommendation systems, user clicks are widely used to train ranking models. However, click data is heavily biased, i.e., users tend to click higher-ranked items (position bias), choose only what was shown to them (selection bias), and trust top results more (trust bias). Without explicitly modeling these biases, the true relevance of ranked items cannot be correctly learned from clicks. Existing Unbiased Learning-to-Rank (ULTR) methods mainly correct position bias and rely on propensity estimation, but they cannot measure remaining bias, provide risk guarantees, or jointly handle multiple bias sources. To overcome these challenges, this paper introduces a novel causal learning-based ranking framework that extends ULTR by combining Structural Causal Models (SCMs) with information-theoretic tools. SCMs specify how clicks are generated and help identify the true relevance signal from click data, while conditional mutual information measures how much bias leaks into the learned relevance estimates. We use this leakage measure to define a rigorous notion of disentanglement and include it as a regularizer during model training to reduce bias. In addition, we incorporate a causal inference estimator, i.e., doubly robust estimator, to ensure more reliable risk estimation. Experiments on standard Learning-to-Rank benchmarks show that our method consistently reduces measured bias leakage and improves ranking performance, especially in realistic scenarios where multiple biases, such as position and trust bias, interact strongly.
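The leakage measure, conditional mutual information I(X; Y | Z), can be computed for discrete variables with a simple plug-in estimator (an illustrative sketch of the quantity itself; the paper's estimator and regularizer are not reproduced here):

```python
import math
from collections import Counter

def conditional_mi(samples):
    """Plug-in estimate of I(X; Y | Z) in nats from (x, y, z) samples."""
    n = len(samples)
    pxyz = Counter(samples)
    pxz = Counter((x, z) for x, _, z in samples)
    pyz = Counter((y, z) for _, y, z in samples)
    pz = Counter(z for _, _, z in samples)
    mi = 0.0
    for (x, y, z), c in pxyz.items():
        mi += (c / n) * math.log(
            (c / n) * (pz[z] / n) / ((pxz[(x, z)] / n) * (pyz[(y, z)] / n))
        )
    return mi

# Toy check: X independent of Y given Z gives CMI 0;
# X identical to Y gives positive CMI (here exactly ln 2).
indep = [(x, y, z) for z in (0, 1) for x in (0, 1) for y in (0, 1)]
coupled = [(x, x, z) for z in (0, 1) for x in (0, 1)]
print(round(conditional_mi(indep), 6), round(conditional_mi(coupled), 6))
```

Used as a regularizer, driving this quantity toward zero (with X as a bias variable, Y as the learned relevance, and Z as the true relevance) is what the paper's disentanglement notion asks for.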
[AI-30] Autoregressive Ranking: Bridging the Gap Between Dual and Cross Encoders
【Quick Read】: This paper addresses the limitation that most LLM-based ranking methods in information retrieval (IR) rely on a rank-agnostic next-token prediction loss, which constrains ranking quality, especially under pointwise supervision. The key solution is pointwise generative ranking: ranking by generating a token sequence the length of a single docID and decoding with beam search, paired with a new loss, SToICaL (Simple Token-Item Calibrated Loss), which injects rank-aware supervision at both the item level and the token level within the pointwise setup. This markedly suppresses invalid docID generations and improves multiple ranking metrics (e.g., NDCG, MAP), particularly in top-K ranking scenarios.
Link: https://arxiv.org/abs/2601.05588
Authors: Benjamin Rozonoyer, Chong You, Michael Boratko, Himanshu Jain, Nilesh Gupta, Srinadh Bhojanapalli, Andrew McCallum, Felix Yu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 22 pages, 5 figures
Abstract:Dual and cross encoders have long been mainstays of information retrieval (IR), but are being challenged by the emergent capabilities of LLMs. An LLM-based approach we term pointwise generative ranking - generating tokens the length of a single docID as opposed to a list in order to enable ranking via beam search - combines efficiency and expressivity benefits while leveraging the in-context capabilities of Causal Transformers. Although there is ample evidence to suggest that pretrained LLMs are well-suited for ranking, we find that the vast majority of LLM-based approaches rely on next-token prediction, a loss function which is fundamentally rank-agnostic (and especially so with pointwise supervision). In this paper, we first prove that the expressivity of pointwise generative ranking with multi-token docIDs is superior to that of dual encoders. We then propose SToICaL - a Simple Token-Item Calibrated Loss - which can incorporate rank-aware supervision at both the item and token levels within the pointwise setup. We run a suite of experiments on ranking tasks derived from WordNet (Fellbaum, 1998) and ESCI (Reddy et al., arXiv:2206.06588). Two variants of SToICaL successfully suppress the probability of invalid docID generations and improve on common ranking metrics beyond top-1 retrieval.
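Pointwise generative ranking, generating a docID token-by-token and ranking candidates by their beam-search scores, can be sketched with a toy vocabulary (the docIDs and the log-probability table are invented stand-ins for an LLM's conditional distributions):

```python
import math

# Hypothetical corpus: each document has a multi-token docID.
DOC_IDS = {("d", "0", "7"): "intro", ("d", "1", "2"): "methods",
           ("d", "1", "9"): "results"}

# Toy next-token log-probabilities conditioned on the prefix so far
# (stand-in for an LLM; the scores are made up for illustration).
def next_token_logprobs(prefix):
    table = {
        (): {"d": 0.0},
        ("d",): {"0": math.log(0.3), "1": math.log(0.7)},
        ("d", "0"): {"7": 0.0},
        ("d", "1"): {"2": math.log(0.6), "9": math.log(0.4)},
    }
    return table.get(prefix, {})

def beam_search_rank(beam_width=3, docid_len=3):
    beams = [((), 0.0)]
    for _ in range(docid_len):
        cand = [(p + (t,), s + lp)
                for p, s in beams
                for t, lp in next_token_logprobs(p).items()]
        beams = sorted(cand, key=lambda b: -b[1])[:beam_width]
    # Keep only valid docIDs: the ranking falls out of the beam scores.
    return [p for p, _ in beams if p in DOC_IDS]

ranking = beam_search_rank()
print(ranking)
```

Here the highest-scoring completion is docID `("d", "1", "2")`, and a width-3 beam yields a full top-3 ranking from a single pointwise generation pass.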
[AI-31] HogVul: Black-box Adversarial Code Generation Framework Against LM-based Vulnerability Detectors AAAI26
【Quick Read】: This paper targets the vulnerability of language-model (LM)-based software vulnerability detectors to adversarial attacks, in particular lexical- and syntax-level perturbations that let critical flaws evade detection. Existing black-box attacks mostly rely on isolated perturbation strategies and cannot efficiently explore the adversarial code space for optimal perturbations. The key to the proposed HogVul framework is a dual-channel cooperative optimization mechanism driven by Particle Swarm Optimization (PSO) that unifies lexical- and syntax-level perturbations, systematically improving search efficiency and attack success. Experiments on four benchmark datasets show an average attack success rate 26.05% higher than state-of-the-art baselines.
Link: https://arxiv.org/abs/2601.05587
Authors: Jingxiao Yang, Ping He, Tianyu Du, Sun Bing, Xuhong Zhang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes: AAAI26
Abstract:Recent advances in software vulnerability detection have been driven by Language Model (LM)-based approaches. However, these models remain vulnerable to adversarial attacks that exploit lexical and syntax perturbations, allowing critical flaws to evade detection. Existing black-box attacks on LM-based vulnerability detectors primarily rely on isolated perturbation strategies, limiting their ability to efficiently explore the adversarial code space for optimal perturbations. To bridge this gap, we propose HogVul, a black-box adversarial code generation framework that integrates both lexical and syntax perturbations under a unified dual-channel optimization strategy driven by Particle Swarm Optimization (PSO). By systematically coordinating two-level perturbations, HogVul effectively expands the search space for adversarial examples, enhancing the attack efficacy. Extensive experiments on four benchmark datasets demonstrate that HogVul achieves an average attack success rate improvement of 26.05% over state-of-the-art baseline methods. These findings highlight the potential of hybrid optimization strategies in exposing model vulnerabilities.
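A minimal PSO loop over a continuous perturbation-strength vector shows the search mechanism (the `attack_score` surrogate, the 4-dimensional search space, and all hyperparameters are invented; HogVul's real objective is a query to a black-box detector, and its dual-channel design is far richer):

```python
import random

random.seed(1)

# Toy "attack score": how much a candidate perturbation vector degrades a
# detector (made up; a real framework would query the black-box model).
TARGET = [0.8, 0.2, 0.5, 0.9]  # hypothetical optimal perturbation mix

def attack_score(x):
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))

def pso(dim=4, swarm=12, iters=60, w=0.7, c1=1.4, c2=1.4):
    pos = [[random.random() for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    best = [p[:] for p in pos]              # per-particle best position
    gbest = max(best, key=attack_score)[:]  # swarm-wide best position
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            if attack_score(pos[i]) > attack_score(best[i]):
                best[i] = pos[i][:]
        gbest = max(best, key=attack_score)[:]
    return gbest

found = pso()
print([round(v, 2) for v in found])
```

The swarm converges toward the score-maximizing perturbation vector; in HogVul the same dynamics coordinate lexical and syntax channels instead of a toy quadratic.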
[AI-32] Reinforcement Learning of Large Language Models for Interpretable Credit Card Fraud Detection
【Quick Read】: This paper addresses the limited ability of traditional machine learning to detect complex, covert e-commerce fraud, and the fact that large language models (LLMs) remain largely unexploited and empirically unvalidated in real financial settings. It proposes a reinforcement learning (RL) post-training framework for lightweight language models: using the Group Sequence Policy Optimization (GSPO) algorithm with a rule-based reward, models are post-trained on raw transaction text alone, encouraging them to explore diverse trust and risk signals embedded in customer information, shipping details, product descriptions, and order history, thereby discovering fraud indicators beyond traditional feature engineering and substantially improving F1 scores.
Link: https://arxiv.org/abs/2601.05578
Authors: Cooper Lin, Yanting Zhang, Maohao Ran, Wei Xue, Hongwei Fan, Yibo Xu, Zhenglin Wan, Sirui Han, Yike Guo, Jun Song
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Notes:
Abstract:E-commerce platforms and payment solution providers face increasingly sophisticated fraud schemes, ranging from identity theft and account takeovers to complex money laundering operations that exploit the speed and anonymity of digital transactions. However, despite their theoretical promise, the application of Large Language Models (LLMs) to fraud detection in real-world financial contexts remains largely unexploited, and their practical effectiveness in handling domain-specific e-commerce transaction data has yet to be empirically validated. To bridge this gap between conventional machine learning limitations and the untapped potential of LLMs in fraud detection, this paper proposes a novel approach that employs Reinforcement Learning (RL) to post-train lightweight language models specifically for fraud detection tasks using only raw transaction data. We utilize the Group Sequence Policy Optimization (GSPO) algorithm combined with a rule-based reward system to fine-tune language models of various sizes on a real-life transaction dataset provided by a Chinese global payment solution company. Through this reinforcement learning framework, the language models are encouraged to explore diverse trust and risk signals embedded within the textual transaction data, including patterns in customer information, shipping details, product descriptions, and order history. Our experimental results demonstrate the effectiveness of this approach, with post-trained language models achieving substantial F1-score improvements on held-out test data. Our findings demonstrate that the observed performance improvements are primarily attributable to the exploration mechanism inherent in reinforcement learning, which allows models to discover novel fraud indicators beyond those captured by traditional engineered features.
[AI-33] Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models
【Quick Read】: This paper argues that standard safety alignment, by imposing a uniform "Boy Scout" morality on large language models (LLMs), limits their usefulness in professional domains that require strategic ambiguity and information withholding, such as public relations, negotiation, and crisis management; transparency- and honesty-centric alignment levies a "transparency tax" that erodes practical utility in high-stakes business settings. The key solution is Crisis-Bench, a dynamic corporate-crisis evaluation platform built on a multi-agent partially observable Markov decision process (POMDP): it simulates 80 crisis storylines across 8 industries over 7-day episodes, requiring a PR agent to act under strictly separated private and public narrative states, and introduces a novel Adjudicator-Market Loop that converts public sentiment into simulated stock-price movements, creating an economic incentive structure that quantifies reputation-management capability, the first quantitative assessment of strategic information control in professional contexts.
Link: https://arxiv.org/abs/2601.05570
Authors: Cooper Lin, Maohao Ran, Yanting Zhang, Zhenglin Wan, Hongwei Fan, Yibo Xu, Yike Guo, Wei Xue, Jun Song
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes:
Abstract:Standard safety alignment optimizes Large Language Models (LLMs) for universal helpfulness and honesty, effectively instilling a rigid “Boy Scout” morality. While robust for general-purpose assistants, this one-size-fits-all ethical framework imposes a “transparency tax” on professional domains requiring strategic ambiguity and information withholding, such as public relations, negotiation, and crisis management. To measure this gap between general safety and professional utility, we introduce Crisis-Bench, a multi-agent Partially Observable Markov Decision Process (POMDP) that evaluates LLMs in high-stakes corporate crises. Spanning 80 diverse storylines across 8 industries, Crisis-Bench tasks an LLM-based Public Relations (PR) Agent with navigating a dynamic 7-day corporate crisis simulation while managing strictly separated Private and Public narrative states to enforce rigorous information asymmetry. Unlike traditional benchmarks that rely on static ground truths, we introduce the Adjudicator-Market Loop: a novel evaluation metric where public sentiment is adjudicated and translated into a simulated stock price, creating a realistic economic incentive structure. Our results expose a critical dichotomy: while some models capitulate to ethical concerns, others demonstrate the capacity for Machiavellian, legitimate strategic withholding in order to stabilize the simulated stock price. Crisis-Bench provides the first quantitative framework for assessing “Reputation Management” capabilities, arguing for a shift from rigid moral absolutism to context-aware professional alignment.
[AI-34] Understanding LLM-Driven Test Oracle Generation
【Quick Read】: This paper addresses the oracle problem in software testing, i.e., distinguishing correct from incorrect program behavior; traditional automated unit test generation only produces regression assertions based on implemented behavior and cannot judge whether intended functionality is correctly realized. The key idea is to use large language models (LLMs) to generate test assertions that reflect the intended behavior of software, positioning LLMs as a core component of Promptware, where natural-language prompts drive the generation of assertions that better match design intent and improve fault detection.
Link: https://arxiv.org/abs/2601.05542
Authors: Adam Bodicoat, Gunel Jahangirova, Valerio Terragni
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted for presentation at the 2nd ACM/IEEE International Conference on AI-powered Software (AIware 2025)
Abstract:Automated unit test generation aims to improve software quality while reducing the time and effort required for creating tests manually. However, existing techniques primarily generate regression oracles that predicate on the implemented behavior of the class under test. They do not address the oracle problem: the challenge of distinguishing correct from incorrect program behavior. With the rise of Foundation Models (FMs), particularly Large Language Models (LLMs), there is a new opportunity to generate test oracles that reflect intended behavior. This positions LLMs as enablers of Promptware, where software creation and testing are driven by natural-language prompts. This paper presents an empirical study on the effectiveness of LLMs in generating test oracles that expose software failures. We investigate how different prompting strategies and levels of contextual input impact the quality of LLM-generated oracles. Our findings offer insights into the strengths and limitations of LLM-based oracle generation in the FM era, improving our understanding of their capabilities and fostering future research in this area.
[AI-35] Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts
【Quick Read】: This paper identifies the "Linear Projection Bottleneck" in heterogeneous graph neural networks (HGNNs): a single shared linear projection head cannot express fine-grained semantics, and the contextual diversity and long-tail distributions of heterogeneous graphs cause a global head to overfit hub nodes while underserving tail nodes. The key to the proposed HOPE (Heterogeneous-aware Orthogonal Prototype Experts), a plug-and-play prediction-head framework, is twofold: a learnable prototype-based routing mechanism that assigns instances to experts by prototype similarity, so that expert usage follows the natural long-tail distribution, and an expert orthogonalization strategy that increases expert diversity and prevents expert collapse. Experiments on four real-world datasets show consistent gains across state-of-the-art HGNN backbones with minimal computational overhead.
Link: https://arxiv.org/abs/2601.05537
Authors: Wei Zhou, Hong Huang, Ruize Shi, Bang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Heterogeneous Graph Neural Networks(HGNNs) have advanced mainly through better encoders, yet their decoding/projection stage still relies on a single shared linear head, assuming it can map rich node embeddings to labels. We call this the Linear Projection Bottleneck: in heterogeneous graphs, contextual diversity and long-tail shifts make a global head miss fine semantics, overfit hub nodes, and underserve tail nodes. While Mixture-of-Experts(MoE) could help, naively applying it clashes with structural imbalance and risks expert collapse. We propose a Heterogeneous-aware Orthogonal Prototype Experts framework named HOPE, a plug-and-play replacement for the standard prediction head. HOPE uses learnable prototype-based routing to assign instances to experts by similarity, letting expert usage follow the natural long-tail distribution, and adds expert orthogonalization to encourage diversity and prevent collapse. Experiments on four real datasets show consistent gains across SOTA HGNN backbones with minimal overhead.
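Prototype-based routing and an orthogonality penalty can be sketched in a few lines (random prototypes stand in for learned ones; this is an illustration of the two mechanisms, not HOPE's training code):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 16, 4  # embedding dimension, number of experts

# Learnable prototypes, one per expert (random here for illustration).
prototypes = rng.standard_normal((k, d))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

def route(x):
    """Assign each instance to the expert with the most similar prototype."""
    sims = x @ prototypes.T  # similarity to each prototype
    return sims.argmax(axis=1)

def orthogonality_penalty(P):
    """Off-diagonal energy of P P^T: zero iff prototype rows are orthogonal."""
    G = P @ P.T
    return np.sum((G - np.diag(np.diag(G))) ** 2)

X = rng.standard_normal((200, d))  # node embeddings from some HGNN encoder
assignments = route(X)
print(np.bincount(assignments, minlength=k),
      round(orthogonality_penalty(prototypes), 3))
```

Routing lets expert usage follow whatever distribution the data induces, while minimizing the penalty pushes prototypes apart, which is the intuition behind preventing expert collapse.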
[AI-36] Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making
【Quick Read】: This paper addresses the risk that minor errors by large language models (LLMs) deployed in safety-critical settings can have catastrophic consequences; in robotics decision systems a single wrong instruction can directly endanger human safety. The key solution is a systematic evaluation framework with three task categories: complete-information tasks, incomplete-information tasks, and Safety-Oriented Spatial Reasoning (SOSR) tasks, used to quantitatively test LLMs and vision-language models (VLMs) and expose weaknesses in spatial reasoning, context inference, and safe decision-making. The study finds that even at 99% accuracy, fatal failure cases remain, showing that current models cannot meet the reliability requirements of safety-critical systems.
Link: https://arxiv.org/abs/2601.05529
Authors: Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes:
Abstract:One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how “rare” errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
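The ground truth for an ASCII-map evacuation task can be computed with BFS, giving an evaluator a reference safe route to score model outputs against (the map below is an invented example in the spirit of the paper's tasks, not one of its actual maps):

```python
from collections import deque

# Tiny ASCII floor plan: S = start, E = exit, F = fire, # = wall.
MAP = ["#######",
       "#S..F.#",
       "#.##..#",
       "#....E#",
       "#######"]

def safe_path_length(grid):
    """Shortest S-to-E path that never enters fire or walls; None if no
    safe route exists. This is the reference an evaluator can compare
    LLM-proposed routes against."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if grid[r][c] == "S")
    q, seen = deque([(start, 0)]), {start}
    while q:
        (r, c), dist = q.popleft()
        if grid[r][c] == "E":
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen
                    and grid[nr][nc] in ".E"):
                seen.add((nr, nc))
                q.append(((nr, nc), dist + 1))
    return None

print(safe_path_length(MAP))
```

A model's proposed route can then be checked cell-by-cell against the same constraints (no walls, no fire), which is how "instructed the robot toward the hazard" failures become mechanically detectable.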
[AI-37] DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis
【Quick Read】: This paper targets the scalability and deployment limits of Transformer-based multivariate time series (MTS) models, which suffer from quadratic computational complexity and high memory overhead, as well as the key limitations of vanilla Mamba for MTS: no explicit cross-variate modeling, difficulty disentangling intra-series temporal dynamics from inter-series interactions, and insufficient modeling of latent time-lag interaction effects. The core of the proposed DeMa, a dual-path delay-aware Mamba backbone, is threefold: it decomposes MTS into intra-series temporal dynamics and inter-series interactions; a temporal path with a Mamba-SSD module captures each series' long-range dynamics in parallel; and a variate path with a Mamba-DALA module introduces delay-aware linear attention to effectively model cross-variate dependencies, preserving linear complexity while markedly improving performance across diverse MTS tasks.
Link: https://arxiv.org/abs/2601.05527
Authors: Rui An, Haohao Qu, Wenqi Fan, Xuequn Shang, Qing Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Under review
Abstract:Accurate and efficient multivariate time series (MTS) analysis is increasingly critical for a wide range of intelligent applications. Within this realm, Transformers have emerged as the predominant architecture due to their strong ability to capture pairwise dependencies. However, Transformer-based models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment in long-term and large-scale MTS modeling. Recently, Mamba has emerged as a promising linear-time alternative with high expressiveness. Nevertheless, directly applying vanilla Mamba to MTS remains suboptimal due to three key limitations: (i) the lack of explicit cross-variate modeling, (ii) difficulty in disentangling the entangled intra-series temporal dynamics and inter-series interactions, and (iii) insufficient modeling of latent time-lag interaction effects. These issues constrain its effectiveness across diverse MTS tasks. To address these challenges, we propose DeMa, a dual-path delay-aware Mamba backbone. DeMa preserves Mamba’s linear-complexity advantage while substantially improving its suitability for MTS settings. Specifically, DeMa introduces three key innovations: (i) it decomposes the MTS into intra-series temporal dynamics and inter-series interactions; (ii) it develops a temporal path with a Mamba-SSD module to capture long-range dynamics within each individual series, enabling series-independent, parallel computation; and (iii) it designs a variate path with a Mamba-DALA module that integrates delay-aware linear attention to model cross-variate dependencies. Extensive experiments on five representative tasks, long- and short-term forecasting, data imputation, anomaly detection, and series classification, demonstrate that DeMa achieves state-of-the-art performance while delivering remarkable computational efficiency.
[AI-38] Explainable AI: Learning from the Learners
【Quick Read】: This paper addresses the problem that although artificial intelligence now outperforms humans on scientific and engineering tasks, its internal representations remain opaque. The key idea is to combine explainable artificial intelligence (XAI) with causal reasoning to enable "learning from the learners": pairing foundation models with explainability methods to extract causal mechanisms, guide robust design and control, and strengthen trust and accountability in high-stakes applications, ultimately forming a unified paradigm for human-AI collaboration.
Link: https://arxiv.org/abs/2601.05525
Authors: Ricardo Vinuesa, Steven L. Brunton, Gianmarco Mengaldo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Physics and Society (physics.soc-ph)
Notes:
Abstract:Artificial intelligence now outperforms humans in several scientific and engineering tasks, yet its internal representations often remain opaque. In this Perspective, we argue that explainable artificial intelligence (XAI), combined with causal reasoning, enables "learning from the learners". Focusing on discovery, optimization and certification, we show how the combination of foundation models and explainability methods allows the extraction of causal mechanisms, guides robust design and control, and supports trust and accountability in high-stakes applications. We discuss challenges in faithfulness, generalization and usability of explanations, and propose XAI as a unifying framework for human-AI collaboration in science and engineering.
[AI-39] Over-Searching in Search-Augmented Large Language Models EACL2026
【Quick Read】: This paper studies the "over-searching" problem in search-augmented large language models: models frequently invoke the search tool even when external retrieval is unnecessary, wasting computation and potentially introducing irrelevant context that causes hallucinations. The core contributions are a systematic identification and quantification of over-searching and a new metric, Tokens Per Correctness (TPC), that captures the trade-off between search cost and accuracy. Analyses across query types, model categories, retrieval quality, and multi-turn conversations show that complex reasoning models and noisy retrieval exacerbate over-searching, and that the presence of negative evidence improves abstention, providing both grounding and practical directions for optimizing search strategies in search-augmented LLMs.
Link: https://arxiv.org/abs/2601.05503
Authors: Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted to EACL 2026 Main Conference
Abstract:Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search – unnecessarily invoking search tool even when it does not improve response quality, which leads to computational inefficiency and hallucinations by incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our finding shows: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA to foster continued research into efficient search-augmented LLMs.
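One plausible formalization of Tokens Per Correctness (the paper's exact definition may differ; this sketch only conveys the performance-cost trade-off the metric is designed to capture):

```python
def tokens_per_correctness(results):
    """Total tokens spent divided by the number of correct answers;
    lower is better. results: list of (tokens_used, is_correct) per query."""
    total_tokens = sum(t for t, _ in results)
    n_correct = sum(1 for _, ok in results if ok)
    return float("inf") if n_correct == 0 else total_tokens / n_correct

# An over-searching agent burns many more tokens for the same accuracy.
frugal = [(120, True), (90, True), (150, False), (80, True)]
wasteful = [(900, True), (700, True), (1100, False), (600, True)]
print(tokens_per_correctness(frugal), tokens_per_correctness(wasteful))
```

Both agents answer 3 of 4 queries correctly, but the wasteful one pays roughly 7.5x the tokens per correct answer, exactly the kind of inefficiency raw accuracy hides.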
[AI-40] Evaluating the Use of LLMs for Automated DOM-Level Resolution of Web Performance Issues
【Quick Read】: This paper addresses the complex and time-consuming manual task of DOM (Document Object Model) modification in web performance optimization, aiming to automate fixes with generative AI. The key idea is to have nine state-of-the-art large language models (LLMs) jointly analyze and rewrite the DOM trees of real webpages together with their Lighthouse performance audit reports, automatically identifying and correcting issues that affect page performance. The study finds that while LLMs excel on categories such as SEO and accessibility, their effectiveness varies widely on performance-critical areas such as initial load, interactivity, and network optimization; they predominantly adopt additive strategies and positional changes, and introduce regressions such as reduced visual stability, revealing both the limits of and directions for LLM-driven automation in performance-sensitive scenarios.
Link: https://arxiv.org/abs/2601.05502
Authors: Gideon Peters, SayedHassan Khatoonabadi, Emad Shihab
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: Accepted to the ACM International Conference on Mining Software Repositories (MSR 2026)
Abstract:Users demand fast, seamless webpage experiences, yet developers often struggle to meet these expectations within tight constraints. Performance optimization, while critical, is a time-consuming and often manual process. One of the most complex tasks in this domain is modifying the Document Object Model (DOM), which is why this study focuses on it. Recent advances in Large Language Models (LLMs) offer a promising avenue to automate this complex task, potentially transforming how developers address web performance issues. This study evaluates the effectiveness of nine state-of-the-art LLMs for automated web performance issue resolution. For this purpose, we first extracted the DOM trees of 15 popular webpages (e.g., Facebook), and then we used Lighthouse to retrieve their performance audit reports. Subsequently, we passed the extracted DOM trees and corresponding audits to each model for resolution. Our study considers 7 unique audit categories, revealing that LLMs universally excel at SEO Accessibility issues. However, their efficacy in performance-critical DOM manipulations is mixed. While high-performing models like GPT-4.1 delivered significant reductions in areas like Initial Load, Interactivity, and Network Optimization (e.g., 46.52% to 48.68% audit incidence reductions), others, such as GPT-4o-mini, notably underperformed, consistently. A further analysis of these modifications showed a predominant additive strategy and frequent positional changes, alongside regressions particularly impacting Visual Stability.
[AI-41] The Evaluation Gap in Medicine, AI and LLMs: Navigating Elusive Ground Truth Uncertainty via a Probabilistic Paradigm
【Quick Read】: This paper addresses the fact that current AI system evaluation ignores uncertainty in expert ground-truth annotations, which is especially consequential in high-uncertainty domains such as medicine and can lead to the misleading conclusion that non-experts perform on par with experts: ignoring label variability (i.e., ground-truth uncertainty) may overstate model performance or understate the expert advantage. The key solution is a probabilistic evaluation paradigm that introduces expected accuracy and expected F1 to quantify the theoretical performance ceiling of an expert or system at different levels of ground-truth certainty, and recommends stratifying results by expert agreement rate (i.e., ground-truth certainty), particularly when overall performance falls below 80%, where stratification substantially mitigates uncertainty as a confounding factor and improves the reliability of evaluation.
Link: https://arxiv.org/abs/2601.05500
Authors: Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is particularly consequential in medicine where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to theoretically explain how high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas in datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Therefore, ignoring uncertainty in ground truth evaluation data can result in the misleading conclusion that a non-expert has similar performance to that of an expert. Using the probabilistic paradigm, we thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability. Our work leads to the recommendation that when establishing the capability of a system, results should be stratified by probability of the ground truth answer, typically measured by the agreement rate of ground truth experts. Stratification becomes critical when the overall performance drops below a threshold of 80%. Under stratified evaluation, performance comparison becomes more reliable in high certainty bins, mitigating the effect of the key confounding factor – uncertainty.
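The core probabilistic argument can be reproduced for a binary task: if the ground-truth label is itself correct only with probability p, a labeller whose true skill is a gets measured accuracy a*p + (1-a)*(1-p), so expert and random labellers become indistinguishable as p approaches 0.5 (a simplified model in the spirit of the paper's expected-accuracy concept, not its exact formulation):

```python
def measured_accuracy(true_skill, gt_certainty):
    """Expected measured accuracy on a binary task when the 'ground truth'
    label agrees with the true answer only with probability gt_certainty."""
    a, p = true_skill, gt_certainty
    return a * p + (1 - a) * (1 - p)

# High-certainty bin: the expert's advantage is clearly visible.
expert_hi = measured_accuracy(0.95, 0.95)
random_hi = measured_accuracy(0.50, 0.95)

# Noisy bin (near 50/50 annotator agreement): expert and coin flip look alike.
expert_lo = measured_accuracy(0.95, 0.55)
random_lo = measured_accuracy(0.50, 0.55)

print(expert_hi, random_hi, expert_lo, random_lo)
```

In the high-certainty bin the expert scores about 0.905 versus 0.5 for random labelling, while in the noisy bin the gap collapses to a few points, which is why the paper recommends stratifying evaluation by ground-truth agreement rate.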
zh
[AI-42] MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis
【速读】:该论文旨在解决当前城市环境变化分析中依赖单一模态、刚性分析方法的局限性,尤其是在遥感变化检测领域中存在的灵活性不足与多源异构数据融合困难的问题。其解决方案的关键在于提出MMUEChange框架,该框架通过模块化工具包和核心模块——模态控制器(Modality Controller)实现跨模态与模态内对齐,从而灵活整合多种类型的城市数据,显著提升复杂城市变化场景下的分析鲁棒性和任务成功率,相较最优基线模型任务成功率达到46.7%的提升,并有效缓解幻觉问题,具备实际政策支持价值。
链接: https://arxiv.org/abs/2601.05483
作者: Zixuan Xiao,Jun Ma,Siwei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing change detection, often rely on rigid, single-modal analysis. To overcome these limitations, we propose MMUEChange, a multi-modal agent framework that flexibly integrates heterogeneous urban data via a modular toolkit and a core module, Modality Controller for cross- and intra-modal alignment, enabling robust analysis of complex urban change scenarios. Case studies include: a shift toward small, community-focused parks in New York, reflecting local green space efforts; the spread of concentrated water pollution across districts in Hong Kong, pointing to coordinated water management; and a notable decline in open dumpsites in Shenzhen, with contrasting links between nighttime economic activity and waste types, indicating differing urban pressures behind domestic and construction waste. Compared to the best-performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination, demonstrating its capacity to support complex urban change analysis tasks with real-world policy implications.
zh
[AI-43] Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning
【速读】:该论文旨在解决可微因果发现(differentiable causal discovery)在高维数据或存在潜在混杂变量(latent confounders)时,因搜索空间庞大、目标函数复杂及图论约束非平凡而导致的优化效率低和准确性差的问题。其解决方案的关键在于提出ALVGL方法,通过稀疏与低秩分解学习数据的精度矩阵(precision matrix),并设计ADMM优化算法识别出与潜在因果结构最相关的矩阵成分;这些成分被用于构建一个可证明包含真实因果图的超结构(super-structure),进而初始化标准的可微因果发现方法,从而显著缩小搜索空间,提升优化效率与准确性。
链接: https://arxiv.org/abs/2601.05474
作者: Pingchuan Ma,Qixin Zhang,Shuai Wang,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, differentiable causal discovery has emerged as a promising approach to improve the accuracy and efficiency of existing methods. However, when applied to high-dimensional data or data with latent confounders, these methods, often based on off-the-shelf continuous optimization algorithms, struggle with the vast search space, the complexity of the objective function, and the nontrivial nature of graph-theoretical constraints. As a result, there has been a surge of interest in leveraging super-structures to guide the optimization process. Nonetheless, learning an appropriate super-structure at the right level of granularity, and doing so efficiently across various settings, presents significant challenges. In this paper, we propose ALVGL, a novel and general enhancement to the differentiable causal discovery pipeline. ALVGL employs a sparse and low-rank decomposition to learn the precision matrix of the data. We design an ADMM procedure to optimize this decomposition, identifying components in the precision matrix that are most relevant to the underlying causal structure. These components are then combined to construct a super-structure that is provably a superset of the true causal graph. This super-structure is used to initialize a standard differentiable causal discovery method with a more focused search space, thereby improving both optimization efficiency and accuracy. We demonstrate the versatility of ALVGL by instantiating it across a range of structural causal models, including both Gaussian and non-Gaussian settings, with and without unmeasured confounders. Extensive experiments on synthetic and real-world datasets show that ALVGL not only achieves state-of-the-art accuracy but also significantly improves optimization efficiency, making it a reliable and effective solution for differentiable causal discovery. 
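摘要所述"由精度矩阵的稀疏成分构造超结构"这一步,可用如下极简草稿示意(软阈值算子与阈值参数均为假设,并非论文完整的 ADMM 稀疏+低秩分解实现):

```python
import numpy as np

def soft_threshold(X, tau):
    """软阈值算子: ADMM 类算法中更新稀疏成分的典型近端步骤。"""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def super_structure(precision, tau=0.05):
    """示意: 对精度矩阵的非对角元素做软阈值, 保留的非零位置作为
    候选边集合(超结构), 用于缩小后续可微因果发现的搜索空间。"""
    S = soft_threshold(precision, tau)
    np.fill_diagonal(S, 0.0)
    return (np.abs(S) > 0).astype(int)

# 玩具精度矩阵: 变量 0-1、1-2 强相关, 0-2 仅有微弱伪相关
P = np.array([[1.0, 0.6, 0.02],
              [0.6, 1.0, 0.5],
              [0.02, 0.5, 1.0]])
mask = super_structure(P, tau=0.05)
print(mask)  # 0-2 的微弱伪相关被阈值滤除
```

论文的实际方法会在此基础上联合学习低秩成分以处理潜在混杂变量,这里仅保留稀疏一侧的直觉。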
zh
[AI-44] STELP: Secure Transpilation and Execution of LLM-Generated Programs
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)生成代码在生产环境中执行时存在的安全与可靠性问题,包括代码不稳定、漏洞注入(如数据投毒、恶意攻击和幻觉)、以及传统人工代码审查和安全测试工具难以适用的挑战。解决方案的关键在于提出一种名为Secure Transpiler and Executor of LLM-Generated Program (STELP) 的安全编译器与执行框架,其核心能力是在受控环境下对LLM生成的代码进行隔离执行、动态验证与风险规避,从而保障自主生产级AI系统中代码生成与执行的安全性,尤其适用于无头(headless)代码生成场景或实时执行代码片段作为行动规划的应用。
链接: https://arxiv.org/abs/2601.05467
作者: Swapnil Shinde,Sahil Wadhwa,Andy Luo,Emily Chen
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Rapid evolution of Large Language Models (LLMs) has achieved major advances in reasoning, planning, and function-calling capabilities. Multi-agentic collaborative frameworks using such LLMs place them at the center of solving software development-related tasks such as code generation. However, direct use of LLM generated code in production software development systems is problematic. The code could be unstable or erroneous and contain vulnerabilities such as data poisoning, malicious attacks, and hallucinations that could lead to widespread system malfunctions. This prohibits the adoption of LLM generated code in production AI systems where human code reviews and traditional secure testing tools are impractical or untrustworthy. In this paper, we discuss safety and reliability problems with the execution of LLM generated code and propose a Secure Transpiler and Executor of LLM-Generated Program (STELP), capable of executing LLM-generated code in a controlled and safe manner. STELP secures autonomous production AI systems involving code generation, filling the critical void left by the impracticality or limitations of traditional secure testing methodologies and human oversight. This includes applications such as headless code generation-execution and LLMs that produce executable code snippets as an action plan to be executed in real time. We contribute a human-validated dataset of insecure code snippets and benchmark our approach on publicly available datasets for correctness, safety, and latency. Our results demonstrate that our approach outperforms an existing method by a significant margin, particularly in its ability to safely execute risky code snippets. Warning: This paper contains malicious code snippets that should be run with caution.
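摘要提到在受控执行前对 LLM 生成代码进行安全把关;下面是一个假设性的静态扫描草稿(禁用列表、函数名均为示意,并非 STELP 的官方实现),展示"执行前先拦截危险调用"的基本思路:

```python
import ast

# 示意性黑名单: 真实系统的策略会远比这完整
BANNED_CALLS = {"eval", "exec", "__import__", "compile"}
BANNED_MODULES = {"os", "subprocess", "socket"}

def is_risky(source: str) -> bool:
    """示意性静态检查(非 STELP 官方实现): 在隔离执行前,
    用 AST 扫描 LLM 生成代码中的危险调用与危险模块导入。"""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return True  # 无法解析的代码一律视为不安全
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            return True
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [a.name.split(".")[0] for a in node.names]
            if isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module.split(".")[0])
            if BANNED_MODULES & set(names):
                return True
    return False

print(is_risky("import subprocess; subprocess.run(['ls'])"))  # True
print(is_risky("x = sum(range(10))"))                         # False
```

静态扫描只是第一道闸门;摘要所述系统还需配合沙箱隔离与动态验证才能安全执行有风险的代码片段。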
zh
[AI-45] Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对复杂对抗性攻击时的安全脆弱性问题,尤其是针对现有防御机制难以有效拦截的“越狱攻击”(jailbreak attacks)。其解决方案的关键在于提出了一种名为iMIST(interactive Multi-step Progressive Tool-disguised Jailbreak Attack)的新型自适应越狱方法:该方法通过将恶意查询伪装成正常的工具调用以绕过内容过滤机制,并结合交互式渐进优化算法,在多轮对话中基于实时有害性评估动态提升响应的危害程度,从而显著提高攻击成功率并保持较低的被拒绝率。
链接: https://arxiv.org/abs/2601.05466
作者: Zhaoqi Wang,Zijian Zhang,Daqing He,Pengtao Kou,Xin Li,Jiamou Liu,Jincheng An,Yong Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse applications, however, they remain critically vulnerable to jailbreak attacks that elicit harmful responses violating human values and safety guidelines. Despite extensive research on defense mechanisms, existing safeguards prove insufficient against sophisticated adversarial strategies. In this work, we propose iMIST (interactive Multi-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious queries as normal tool invocations to bypass content filters, while simultaneously introducing an interactive progressive optimization algorithm that dynamically escalates response harmfulness through multi-turn dialogues guided by real-time harmfulness assessment. Our experiments on widely-used models demonstrate that iMIST achieves higher attack effectiveness, while maintaining low rejection rates. These results reveal critical vulnerabilities in current LLM safety mechanisms and underscore the urgent need for more robust defense strategies.
zh
[AI-46] PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理现实世界开放域多跳问题时面临的两大挑战:1)检索坍缩(Retrieval Collapse)——在大规模语料库中迭代检索无法定位包含桥梁答案的中间证据,导致下游推理失效;2)学习不稳定性——端到端轨迹训练存在信用分配弱和模块错误定位差的问题,易过拟合于基准数据集特定启发式规则,限制迁移性和鲁棒性。解决方案的关键在于提出PRISMA框架,其核心是解耦式强化学习引导机制,采用Plan-Retrieve-Inspect-Solve-Memoize架构,通过Inspector提供基于推理的反馈以优化Planner的分解与细粒度检索,并在Solver中强制证据驱动的推理;同时采用两阶段群体相对策略优化(Two-Stage Group Relative Policy Optimization, GRPO),第一阶段将Planner和Solver分别训练为规划与推理专家,第二阶段引入观察感知残差策略优化(Observation-Aware Residual Policy Optimization, OARPO)提升Inspector对上下文验证和触发针对性恢复的能力,从而实现高效且稳定的多跳推理。
链接: https://arxiv.org/abs/2601.05465
作者: Yu Liu,Wenxiao Zhang,Cong Cao,Wenxuan Lu,Fangfang Yuan,Diandian Guo,Kun Peng,Qiang Sun,Kaiyan Zhang,Yanbing Liu,Jin B.Hong,Bowen Zhou,Zhiyuan Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Answering real-world open-domain multi-hop questions over massive corpora is a critical challenge in Retrieval-Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to end-to-end optimize the retrieval-augmented reasoning process, directly enhancing its capacity to resolve complex queries. However, reliable deployment is hindered by two obstacles. 1) Retrieval Collapse: iterative retrieval over large corpora fails to locate intermediate evidence containing bridge answers without reasoning-guided planning, causing downstream reasoning to collapse. 2) Learning Instability: end-to-end trajectory training suffers from weak credit assignment across reasoning chains and poor error localization across modules, causing overfitting to benchmark-specific heuristics that limit transferability and stability. To address these problems, we propose PRISMA, a decoupled RL-guided framework featuring a Plan-Retrieve-Inspect-Solve-Memoize architecture. PRISMA’s strength lies in reasoning-guided collaboration: the Inspector provides reasoning-based feedback to refine the Planner’s decomposition and fine-grained retrieval, while enforcing evidence-grounded reasoning in the Solver. We optimize individual agent capabilities via Two-Stage Group Relative Policy Optimization (GRPO). Stage I calibrates the Planner and Solver as specialized experts in planning and reasoning, while Stage II utilizes Observation-Aware Residual Policy Optimization (OARPO) to enhance the Inspector’s ability to verify context and trigger targeted recovery. Experiments show that PRISMA achieves state-of-the-art performance on ten benchmarks and can be deployed efficiently in real-world scenarios.
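摘要中的两阶段 GRPO 以组内相对优势替代价值网络,其核心计算可草绘如下(eps 等细节为假设,非论文实现):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """示意: GRPO 的组内相对优势 —— 对同一提示采样的一组回答,
    用组内均值和标准差对奖励做标准化, 无需训练单独的价值网络。"""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 同一问题采样 4 条轨迹的奖励, 得到的优势在组内零均值
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

论文在此基础上为 Planner/Solver(阶段一)与 Inspector(阶段二的 OARPO)分别设计了奖励信号,这里只展示"组相对"这一共性机制。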
zh
[AI-47] ART: Adaptive Reasoning Trees for Explainable Claim Verification
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险决策场景中因缺乏可解释性而导致的信任问题,即其输出难以提供忠实的解释且无法有效纠错,从而削弱了可信度。解决方案的关键在于提出一种名为自适应推理树(Adaptive Reasoning Trees, ART)的层次化论证验证方法:从根命题出发,分支生成支持与反驳的子论点,通过自底向上的两两锦标赛机制由裁判大语言模型(judge LLM)评估子论点强度,最终系统性地生成透明且可争议的结论,显著优于如思维链(Chain-of-Thought, CoT)等现有方法,实现了更可靠、清晰的决策过程。
链接: https://arxiv.org/abs/2601.05455
作者: Sahil Wadhwa,Himanshu Kumar,Guanqun Yang,Abbaas Alif Mohamed Nishar,Pranab Mohanty,Swapnil Shinde,Yue Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are powerful candidates for complex decision-making, leveraging vast encoded knowledge and remarkable zero-shot abilities. However, their adoption in high-stakes environments is hindered by their opacity; their outputs lack faithful explanations and cannot be effectively contested to correct errors, undermining trustworthiness. In this paper, we propose ART (Adaptive Reasoning Trees), a hierarchical method for claim verification. The process begins with a root claim, which branches into supporting and attacking child arguments. An argument's strength is determined bottom-up via a pairwise tournament of its children, adjudicated by a judge LLM, allowing a final, transparent, and contestable verdict to be systematically derived, a property missing in methods like Chain-of-Thought (CoT). We empirically validate ART on multiple datasets, analyzing different argument generators and comparison strategies. Our findings show that ART's structured reasoning outperforms strong baselines, establishing a new benchmark for explainable claim verification that is more reliable and ensures clarity in the overall decision-making step.
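摘要描述的"自底向上两两锦标赛"裁决过程可用如下递归草稿示意(树结构字段名与桩裁判函数均为假设;真实系统中由 LLM 裁判比较论证强度):

```python
def evaluate(node, judge):
    """示意性自底向上裁决(非论文实现): 先递归评估支持/反驳子论点,
    再用 judge 做两两"锦标赛"选出各方最强论点, 据此得出可追溯的结论。
    返回 (最强论证强度, 命题是否成立)。"""
    if not node.get("support") and not node.get("attack"):
        return node["strength"], True  # 叶子论点: 直接采信其强度

    def champion(children):
        best = None
        for child in children:
            s, _ = evaluate(child, judge)
            if best is None or judge(s, best) > 0:
                best = s
        return best

    sup = champion(node.get("support", []))
    att = champion(node.get("attack", []))
    if att is None:
        return sup, True
    if sup is None:
        return att, False
    return max(sup, att), judge(sup, att) > 0

judge = lambda a, b: (a > b) - (a < b)  # 桩裁判: 真实系统由 LLM 比较论证强度
tree = {"claim": "root",
        "support": [{"claim": "s1", "strength": 0.9}],
        "attack":  [{"claim": "a1", "strength": 0.4}]}
strength, verdict = evaluate(tree, judge)
print(verdict)  # True: 支持方胜出
```

由于每一步裁决都落在具体的子论点对比上,最终结论可以逐层回溯和质疑,这正是摘要强调的可争议性(contestability)。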
zh
[AI-48] On the Effect of Cheating in Chess
【速读】:该论文旨在解决作弊行为在国际象棋比赛中带来的性能提升问题,特别是评估在有限次数内使用强大软件辅助(即作弊)所能获得的性能增益。与以往主要关注作弊检测的研究不同,本文聚焦于量化作弊的实际效果,其解决方案的关键在于开发并测试适用于常用国际象棋引擎(chess engine)的算法,以模拟和测量作弊对棋手表现的具体影响,从而为反作弊策略提供数据支持。
链接: https://arxiv.org/abs/2601.05386
作者: Daniel Keren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Cheating in chess, by using advice from powerful software, has become a major problem, reaching the highest levels. As opposed to the large majority of previous work, which concerned the detection of cheating, here we try to evaluate the possible gain in performance obtained by cheating a limited number of times during a game. Algorithms are developed and tested on a commonly used chess engine (i.e., software). (Needless to say, the goal of this work is not to assist cheaters, but to measure the effectiveness of cheating, which is crucial as part of the effort to contain and detect it.)
zh
[AI-49] PRISM: Protocol Refinement through Intelligent Simulation Modeling
【速读】:该论文旨在解决自驱动实验室(self-driving laboratories)中实验协议设计与执行自动化的核心瓶颈问题。其解决方案的关键在于提出PRISM(Protocol Refinement through Intelligent Simulation Modeling)框架,该框架通过一组基于语言模型的智能体(language-model-based agents)协同工作,实现从网络获取实验流程、结构化生成实验步骤(如液体处理、仪器布局等),并通过规划-批评-验证循环优化协议,最终转化为Argonne MADSci协议格式以统一控制多台机器人设备(如Opentrons OT-2液体处理器、PF400机械臂、Azenta板封口/开盖装置),从而实现无干预的全流程自动化执行。
链接: https://arxiv.org/abs/2601.05356
作者: Brian Hsu,Priyanka V Setty,Rory M Butler,Ryan Lewis,Casey Stone,Rebecca Weinberg,Thomas Brettin,Rick Stevens,Ian Foster,Arvind Ramanathan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
备注: 43 pages, 8 figures, submitted to RSC Digital Discovery. Equal contribution: B. Hsu, P.V. Setty, R.M. Butler. Corresponding author: A. Ramanathan
Abstract:Automating experimental protocol design and execution remains a fundamental bottleneck in realizing self-driving laboratories. We introduce PRISM (Protocol Refinement through Intelligent Simulation Modeling), a framework that automates the design, validation, and execution of experimental protocols on a laboratory platform composed of off-the-shelf robotic instruments. PRISM uses a set of language-model-based agents that work together to generate and refine experimental steps. The process begins with automatically gathering relevant procedures from web-based sources describing experimental workflows. These are converted into structured experimental steps (e.g., liquid handling steps, deck layout and other related operations) through a planning, critique, and validation loop. The finalized steps are translated into the Argonne MADSci protocol format, which provides a unified interface for coordinating multiple robotic instruments (Opentrons OT-2 liquid handler, PF400 arm, Azenta plate sealer and peeler) without requiring human intervention between steps. To evaluate protocol-generation performance, we benchmarked both single reasoning models and multi-agent workflows across constrained and open-ended prompting paradigms. The resulting protocols were validated in a digital-twin environment built in NVIDIA Omniverse to detect physical or sequencing errors before execution. Using Luna qPCR amplification and Cell Painting as case studies, we demonstrate PRISM as a practical end-to-end workflow that bridges language-based protocol generation, simulation-based validation, and automated robotic execution.
zh
[AI-50] Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在生成式人工智能(Generative Artificial Intelligence, GenAI)场景下所面临的严重安全漏洞问题,尤其是针对多轮提示(multi-turn prompting)环境下发起的越狱攻击(jailbreaking attacks)导致模型行为被操纵、安全约束被绕过的风险。解决方案的关键在于提出一个系统性的框架MJAD-MLLMs,其核心包括:一是设计了一种新型多轮越狱攻击方法,以更隐蔽和高效地利用MLLMs的漏洞;二是提出一种基于片段优化与多大语言模型(multi-LLM-based)协同防御机制的FragGuard,能够有效识别并缓解越狱攻击;三是通过在多个前沿开源与闭源MLLMs及基准数据集上的广泛实验验证了攻击与防御的有效性,显著优于现有技术。
链接: https://arxiv.org/abs/2601.05339
作者: Badhan Chandra Das,Md Tasnim Jawad,Joaquin Molto,M. Hadi Amini,Yanzhao Wu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, the security vulnerabilities of Multi-modal Large Language Models (MLLMs) have become a serious concern in the Generative Artificial Intelligence (GenAI) research. These highly intelligent models, capable of performing multi-modal tasks with high accuracy, are also severely susceptible to carefully launched security attacks, such as jailbreaking attacks, which can manipulate model behavior and bypass safety constraints. This paper introduces MJAD-MLLMs, a holistic framework that systematically analyzes the proposed Multi-turn Jailbreaking Attacks and multi-LLM-based defense techniques for MLLMs. In this paper, we make three original contributions. First, we introduce a novel multi-turn jailbreaking attack to exploit the vulnerabilities of the MLLMs under multi-turn prompting. Second, we propose a novel fragment-optimized and multi-LLM defense mechanism, called FragGuard, to effectively mitigate jailbreaking attacks in the MLLMs. Third, we evaluate the efficacy of the proposed attacks and defenses through extensive experiments on several state-of-the-art (SOTA) open-source and closed-source MLLMs and benchmark datasets, and compare their performance with the existing techniques.
zh
[AI-51] Improving Enzyme Prediction with Chemical Reaction Equations by Hypergraph-Enhanced Knowledge Graph Embeddings
【速读】:该论文旨在解决酶-底物相互作用预测中因训练数据稀疏和不完整而导致的传统模型泛化能力差的问题。其关键解决方案是利用领域内化学反应方程式构建知识图谱三元组(educt, enzyme, product),并提出一种基于超图的增强型模型——Hyper-Enz,通过融合超图Transformer与知识图谱嵌入(Knowledge Graph Embedding, KGE)来学习包含多个底物和产物的超边表示,从而有效捕捉复杂化合物间的多对多关系;同时引入多专家机制以协同优化模型学习过程,显著提升了酶检索准确率(最高提升88%)和配对级预测性能(提升30%)。
链接: https://arxiv.org/abs/2601.05330
作者: Tengwei Song,Long Yin,Zhen Han,Zhiqiang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Predicting enzyme-substrate interactions has long been a fundamental problem in biochemistry and metabolic engineering. While existing methods could leverage databases of expert-curated enzyme-substrate pairs for models to learn from known pair interactions, the databases are often sparse, i.e., there are only limited and incomplete examples of such pairs, and also labor-intensive to maintain. This lack of sufficient training data significantly hinders the ability of traditional enzyme prediction models to generalize to unseen interactions. In this work, we try to exploit chemical reaction equations from domain-specific databases, given their easier accessibility and denser, more abundant data. However, interactions of multiple compounds, e.g., educts and products, with the same enzymes create complex relational data patterns that traditional models cannot easily capture. To tackle that, we represent chemical reaction equations as triples of (educt, enzyme, product) within a knowledge graph, such that we can take advantage of knowledge graph embedding (KGE) to infer missing enzyme-substrate pairs for graph completion. Particularly, in order to capture intricate relationships among compounds, we propose our knowledge-enhanced hypergraph model for enzyme prediction, i.e., Hyper-Enz, which integrates a hypergraph transformer with a KGE model to learn representations of the hyper-edges that involve multiple educts and products. Also, a multi-expert paradigm is introduced to guide the learning of enzyme-substrate interactions with both the proposed model and chemical reaction equations. Experimental results show a significant improvement, with up to a 88% relative improvement in average enzyme retrieval accuracy and 30% improvement in pair-level prediction compared to traditional models, demonstrating the effectiveness of our approach.
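将化学反应表示为 (educt, enzyme, product) 三元组后,最简单的 KGE 打分方式之一是 TransE;下面的草稿仅用于说明三元组打分的直觉(嵌入为随机模拟数据,论文实际使用超图 Transformer 与 KGE 的组合模型 Hyper-Enz):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE 打分: 嵌入满足 educt + enzyme ≈ product 时距离小、得分高。
    仅为示意, 不是论文的 Hyper-Enz 模型。"""
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(0)
educt = rng.normal(size=8)
enzyme = rng.normal(size=8)                       # 酶作为"关系"向量
product_true = educt + enzyme + rng.normal(scale=0.01, size=8)
product_rand = rng.normal(size=8)

# 真实 (educt, enzyme, product) 三元组的得分应高于随机组合,
# 从而可据此为缺失的酶-底物对补全排序
print(transe_score(educt, enzyme, product_true) >
      transe_score(educt, enzyme, product_rand))
```

论文的关键改进在于用超边同时编码多个底物与产物,突破了这类单三元组模型只能表达一对一关系的限制。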
zh
[AI-52] Effects of personality steering on cooperative behavior in Large Language Model agents
【速读】:该论文试图解决的问题是:在战略和社会交互场景中,如何通过人格引导(personality steering)影响大型语言模型(Large Language Models, LLMs)代理的协作行为,尤其是在受控条件下人格特质对合作机制的作用尚不明确。解决方案的关键在于基于“大五人格模型”(Big Five Personality Framework)量化LLM的基本人格特征,并通过重复囚徒困境博弈实验,系统性地比较基线条件与人格信息干预下的行为差异,同时独立操纵各人格维度至极端值以识别主导因素。研究发现,宜人性(agreeableness)是促进跨模型协作的主导人格特质,而其他维度影响有限;此外,人格信息虽能提升合作意愿,但也可能增加被剥削的风险,尤其在早期模型中;相比之下,后期模型表现出更具有选择性的合作策略,表明人格引导本质上是一种行为偏差机制而非决定性控制手段。
链接: https://arxiv.org/abs/2601.05302
作者: Mizuki Sakai,Mizuki Yokoyama,Wakaba Tateishi,Genki Ichinose
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly used as autonomous agents in strategic and social interactions. Although recent studies suggest that assigning personality traits to LLMs can influence their behavior, how personality steering affects cooperation under controlled conditions remains unclear. In this study, we examine the effects of personality steering on cooperative behavior in LLM agents using repeated Prisoner’s Dilemma games. Based on the Big Five framework, we first measure basic personality profiles of three models, GPT-3.5-turbo, GPT-4o, and GPT-5, using the Big Five Inventory. We then compare behavior under baseline and personality-informed conditions, and further analyze the effects of independently manipulating each personality dimension to extreme values. Our results show that agreeableness is the dominant factor promoting cooperation across all models, while other personality traits have limited impact. Explicit personality information increases cooperation but can also raise vulnerability to exploitation, particularly in earlier-generation models. In contrast, later-generation models exhibit more selective cooperation. These findings indicate that personality steering acts as a behavioral bias rather than a deterministic control mechanism.
zh
[AI-53] Mathematical Knowledge Graph-Driven Framework for Equation-Based Predictive and Reliable Additive Manufacturing
【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)领域中数据驱动方法因知识表示碎片化和稀疏数据条件下外推不可靠而面临的挑战。其核心解决方案是提出一种基于本体引导、以方程为中心的框架,将大语言模型(Large Language Models, LLMs)与增材制造数学知识图谱(AM-MKG)紧密结合,实现可靠的知识提取与原理性外推建模。关键在于通过形式化本体显式编码方程、变量、假设及其语义关系,将非结构化文献转化为机器可解释的表示,并利用知识图谱子图条件约束LLM生成的方程形式,从而确保物理合理性与稳定性;同时引入融合外推距离、统计稳定性和知识图谱物理一致性的一致性置信度评分机制,显著提升外推结果的可靠性。
链接: https://arxiv.org/abs/2601.05298
作者: Yeongbin Cha,Namjung Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: preprint
Abstract:Additive manufacturing (AM) relies critically on understanding and extrapolating process-property relationships; however, existing data-driven approaches remain limited by fragmented knowledge representations and unreliable extrapolation under sparse data conditions. In this study, we propose an ontology-guided, equation-centric framework that tightly integrates large language models (LLMs) with an additive manufacturing mathematical knowledge graph (AM-MKG) to enable reliable knowledge extraction and principled extrapolative modeling. By explicitly encoding equations, variables, assumptions, and their semantic relationships within a formal ontology, unstructured literature is transformed into machine-interpretable representations that support structured querying and reasoning. LLM-based equation generation is further conditioned on MKG-derived subgraphs, enforcing physically meaningful functional forms and mitigating non-physical or unstable extrapolation trends. To assess reliability beyond conventional predictive uncertainty, a confidence-aware extrapolation assessment is introduced, integrating extrapolation distance, statistical stability, and knowledge-graph-based physical consistency into a unified confidence score. Results demonstrate that ontology-guided extraction significantly improves the structural coherence and quantitative reliability of extracted knowledge, while subgraph-conditioned equation generation yields stable and physically consistent extrapolations compared to unguided LLM outputs. Overall, this work establishes a unified pipeline for ontology-driven knowledge representation, equation-centered reasoning, and confidence-based extrapolation assessment, highlighting the potential of knowledge-graph-augmented LLMs as reliable tools for extrapolative modeling in additive manufacturing.
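摘要提出的"一致性置信度评分"融合外推距离、统计稳定性与知识图谱物理一致性三个信号;下面给出一个假设性的加权组合草稿(权重与函数形式均为笔者示意,非论文定义):

```python
def extrapolation_confidence(distance, stability, consistency,
                             weights=(0.4, 0.3, 0.3)):
    """示意性置信度评分(权重与线性形式均为假设, 非论文定义):
    三个输入均已归一化到 [0, 1], distance 越大置信度越低,
    stability 与 consistency 越高置信度越高。"""
    w_d, w_s, w_c = weights
    score = w_d * (1.0 - distance) + w_s * stability + w_c * consistency
    return max(0.0, min(1.0, score))

# 近距离外推 + 高稳定性 + 高物理一致性 → 高置信度
print(extrapolation_confidence(0.2, 0.9, 0.8))
```

这类统一评分的意义在于:仅靠预测不确定性无法发现"数值稳定但违背物理"的外推,物理一致性项正是为此引入的。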
zh
[AI-54] MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
【速读】:该论文旨在解决大规模Mixture-of-Experts (MoE) 架构中因“内存墙”(memory wall)瓶颈导致的训练效率低下问题,具体表现为稀疏计算带来的激活内存开销过大——主要由庞大的token路由缓冲区和中间张量的显式存储(materialization)引起,进而限制了批处理大小(batch size)和序列长度(sequence length),并引发频繁的数据移动,阻碍模型高效扩展。其解决方案的关键在于提出一个协同设计的系统方法:(i)端到端的token调度与MoE训练机制,通过优化数据结构消除中间缓冲区和激活显式存储;(ii)结合智能激活检查点(activation checkpointing)的协同内核设计,在显著降低内存占用的同时提升性能,最终实现超过4倍的速度提升和50%以上的内存节省。
链接: https://arxiv.org/abs/2601.05296
作者: Jiyuan Zhang,Yining Liu,Siqi Yan,Lisen Deng,Jennifer Cao,Shuqi Yang,Min Ni,Bi Xue,Shen Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:The pervasive “memory wall” bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE’s inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads – driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movements that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materializing, and (ii) co-designed kernels with smart activation checkpoint to mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
zh
[AI-55] A Survey of Agentic AI and Cybersecurity: Challenges, Opportunities, and Use-case Prototypes
【速读】:该论文旨在解决agentic AI(智能体AI)在网络安全领域带来的双重用途风险与治理挑战,即其既能增强防御能力(如持续监控、自主响应和自适应威胁狩猎),也可能被恶意利用以加速攻击链(如侦察、漏洞利用和社会工程)。解决方案的关键在于构建针对智能体系统的新型安全框架、评估流程和威胁模型,并通过三个实际用例揭示设计选择如何影响可靠性、安全性与操作效能,从而填补传统治理机制对非自主、短时AI系统设计的不足。
链接: https://arxiv.org/abs/2601.05293
作者: Sahaya Jestus Lazer,Kshitiz Aryal,Maanak Gupta,Elisa Bertino
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic AI marks an important transition from single-step generative models to systems capable of reasoning, planning, acting, and adapting over long-lasting tasks. By integrating memory, tool use, and iterative decision cycles, these systems enable continuous, autonomous workflows in real-world environments. This survey examines the implications of agentic AI for cybersecurity. On the defensive side, agentic capabilities enable continuous monitoring, autonomous incident response, adaptive threat hunting, and fraud detection at scale. Conversely, the same properties amplify adversarial power by accelerating reconnaissance, exploitation, coordination, and social-engineering attacks. These dual-use dynamics expose fundamental gaps in existing governance, assurance, and accountability mechanisms, which were largely designed for non-autonomous and short-lived AI systems. To address these challenges, we survey emerging threat models, security frameworks, and evaluation pipelines tailored to agentic systems, and analyze systemic risks including agent collusion, cascading failures, oversight evasion, and memory poisoning. Finally, we present three representative use-case implementations that illustrate how agentic AI behaves in practical cybersecurity workflows, and how design choices shape reliability, safety, and operational effectiveness.
zh
[AI-56] On the Limits of Self-Improving in LLMs and Why AGI, ASI, and the Singularity Are Not Near Without Symbolic Model Synthesis
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)和生成式 AI 在递归自训练(recursive self-training)过程中因依赖自生成数据而导致的系统性退化问题,特别是当自生成数据比例趋近于1(即 αt→0)时,模型会不可避免地出现熵衰减(Entropy Decay)和方差放大(Variance Amplification)两种根本性失效模式。解决方案的关键在于引入基于算法概率(Algorithmic Probability)的符号回归与程序合成方法,利用编码定理法(Coding Theorem Method, CTM)识别生成机制而非仅相关性,从而突破传统统计学习中的数据处理不等式限制,实现从纯分布学习向混合神经符号(neurosymbolic)框架的跃迁,以支撑模型的持续自我改进。
链接: https://arxiv.org/abs/2601.05280
作者: Hector Zenil
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages
Abstract:We formalise recursive self-training in Large Language Models (LLMs) and Generative AI as a discrete-time dynamical system and prove that, as training data become increasingly self-generated ( \alpha_t \to 0 ), the system undergoes inevitably degenerative dynamics. We derive two fundamental failure modes: (1) Entropy Decay, where finite sampling effects cause a monotonic loss of distributional diversity (mode collapse), and (2) Variance Amplification, where the loss of external grounding causes the model’s representation of truth to drift as a random walk, bounded only by the support diameter. We show these behaviours are not contingent on architecture but are consequences of distributional learning on finite samples. We further argue that Reinforcement Learning with imperfect verifiers suffers similar semantic collapse. To overcome these limits, we propose a path involving symbolic regression and program synthesis guided by Algorithmic Probability. The Coding Theorem Method (CTM) allows for identifying generative mechanisms rather than mere correlations, escaping the data-processing inequality that binds standard statistical learning. We conclude that while purely distributional learning leads to model collapse, hybrid neurosymbolic approaches offer a coherent framework for sustained self-improvement.
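摘要中的熵衰减可用一个极小的"采样-重估"模拟直观展示(对应 α_t → 0 即完全自生成数据的情形;采样规模与代数均为假设):

```python
import math
import random

def entropy(p):
    """离散分布的香农熵(自然对数)。"""
    return -sum(x * math.log(x) for x in p if x > 0)

def self_train(p, n_samples=50, generations=200, seed=0):
    """示意: 将递归自训练建模为"有限采样 → 用样本重估分布"的迭代。
    有限采样噪声使分布多样性单调流失, 最终坍缩到少数模式(熵衰减)。"""
    rng = random.Random(seed)
    k = len(p)
    for _ in range(generations):
        counts = [0] * k
        for idx in rng.choices(range(k), weights=p, k=n_samples):
            counts[idx] += 1
        p = [c / n_samples for c in counts]
    return p

p0 = [0.25] * 4          # 初始: 均匀的四模式分布
pT = self_train(p0)
print(entropy(p0), entropy(pT))  # 熵随代数推进而下降
```

这与摘要的结论一致:失效并非架构问题,而是有限样本上分布学习的必然动力学;恢复外部锚定(α_t > 0)或引入符号化机制才能阻止坍缩。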
zh
[AI-57] Simulation-Free PSRO: Removing Game Simulation from Policy Space Response Oracles
【速读】:该论文旨在解决Policy Space Response Oracles (PSRO) 在实际应用中因计算成本过高而受限的问题,尤其是游戏模拟(game simulation)成为其运行时间的主要瓶颈。解决方案的关键在于提出一种无需模拟的 PSRO 方法——Simulation-Free PSRO,并进一步设计了基于动态窗口(Dynamic Window)的改进机制:通过引入策略窗口(strategy window)替代传统 PSRO 中不断增长的策略集,限制参与博弈的策略数量,从而简化对手策略选择过程并提升最佳响应(best response)的鲁棒性;同时结合纳什聚类(Nash Clustering)动态筛选需剔除的策略,有效控制窗口内策略规模,显著降低可被利用性(exploitability),且在多种环境下的实验验证了该方法的优越性能与良好兼容性。
链接: https://arxiv.org/abs/2601.05279
作者: Yingzhuo Liu,Shuodi Liu,Weijun Luo,Liuyu Xiang,Zhaofeng He
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Policy Space Response Oracles (PSRO) combines game-theoretic equilibrium computation with learning and is effective in approximating Nash Equilibrium in zero-sum games. However, the computational cost of PSRO has become a significant limitation to its practical application. Our analysis shows that game simulation is the primary bottleneck in PSRO’s runtime. To address this issue, we conclude the concept of Simulation-Free PSRO and summarize existing methods that instantiate this concept. Additionally, we propose a novel Dynamic Window-based Simulation-Free PSRO, which introduces the concept of a strategy window to replace the original strategy set maintained in PSRO. The number of strategies in the strategy window is limited, thereby simplifying opponent strategy selection and improving the robustness of the best response. Moreover, we use Nash Clustering to select the strategy to be eliminated, ensuring that the number of strategies within the strategy window is effectively limited. Our experiments across various environments demonstrate that the Dynamic Window mechanism significantly reduces exploitability compared to existing methods, while also exhibiting excellent compatibility. Our code is available at this https URL.
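PSRO 类方法以可利用性(exploitability)衡量对纳什均衡的逼近程度;在零和矩阵博弈中其计算可草绘如下(玩具实现,仅作示意):

```python
import numpy as np

def exploitability(A, x, y):
    """零和矩阵博弈中策略组合 (x, y) 的可利用性:
    双方最优反应(best response)收益之和; 纳什均衡处该值为 0。
    A[i, j] 为行玩家收益, 列玩家收益为 -A[i, j]。"""
    row_br = np.max(A @ y)      # 行玩家对 y 的最优反应收益
    col_br = np.max(-(x @ A))   # 列玩家对 x 的最优反应收益
    return float(row_br + col_br)

# 石头剪刀布: 均匀混合策略是纳什均衡, 可利用性为 0
RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
uniform = np.ones(3) / 3
print(exploitability(RPS, uniform, uniform))                     # 0.0
print(exploitability(RPS, np.array([1.0, 0, 0]), uniform) > 0)   # True: 纯策略可被利用
```

论文的动态窗口机制改变的是元博弈中参与该项计算的策略集合规模,可利用性本身的定义不变。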
[AI-58] Bayesian Recovery for Probabilistic Coalition Structures
【Summary】: This paper studies how to reliably recover the optimal coalition structure in Probabilistic Coalition Structure Generation (PCSG), focusing on how standard sparse-recovery methods (l_1 relaxations and greedy pursuits) behave when overlapping coalitions induce highly coherent, near-duplicate columns. Under such structure the irrepresentable condition fails, so k-step Orthogonal Matching Pursuit (OMP) has a non-vanishing probability of irreversible mis-selection. The key to the solution is Sparse Bayesian Learning (SBL): its Gaussian-Gamma prior hierarchy induces a concave sparsity penalty that suppresses spurious near-duplicates, yielding support consistency under the same structural assumptions, i.e., recovery of the true coalition support with probability tending to one. This gives the first rigorous separation between convex, greedy, and Bayesian sparse approaches for PCSG.
Link: https://arxiv.org/abs/2601.05273
Authors: Angshul Majumdar
Institutions: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: 15 pages
Abstract:Probabilistic Coalition Structure Generation (PCSG) is NP-hard and can be recast as an l_0-type sparse recovery problem by representing coalition structures as sparse coefficient vectors over a coalition-incidence design. A natural question is whether standard sparse methods, such as l_1 relaxations and greedy pursuits, can reliably recover the optimal coalition structure in this setting. We show that the answer is negative in a PCSG-inspired regime where overlapping coalitions generate highly coherent, near-duplicate columns: the irrepresentable condition fails for the design, and k-step Orthogonal Matching Pursuit (OMP) exhibits a nonvanishing probability of irreversible mis-selection. In contrast, we prove that Sparse Bayesian Learning (SBL) with a Gaussian-Gamma hierarchy is support consistent under the same structural assumptions. The concave sparsity penalty induced by SBL suppresses spurious near-duplicates and recovers the true coalition support with probability tending to one. This establishes a rigorous separation between convex, greedy, and Bayesian sparse approaches for PCSG.
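The failure regime described in this abstract can be sketched numerically. The design below is a hypothetical construction (not the paper's): a pair of near-duplicate columns makes the dictionary highly coherent, so the very first greedy selection of OMP cannot reliably distinguish the true column from its duplicate.

```python
import numpy as np

rng = np.random.default_rng(1)
# True atom and a near-duplicate (mimicking two overlapping coalitions).
a = rng.normal(size=100); a /= np.linalg.norm(a)
dup = a + 0.01 * rng.normal(size=100); dup /= np.linalg.norm(dup)
# A few incoherent filler columns.
noise = rng.normal(size=(100, 8)); noise /= np.linalg.norm(noise, axis=0)
A = np.hstack([a[:, None], dup[:, None], noise])

y = A[:, 0] + 0.1 * rng.normal(size=100)   # true support = {0}

coherence = abs(a @ dup)                   # mutual coherence of the pair
pick = int(np.argmax(np.abs(A.T @ y)))     # first OMP selection
print(f"coherence={coherence:.3f}, OMP picks column {pick}")
```

Both columns 0 and 1 correlate almost equally with y, so which one OMP grabs first is essentially noise-driven; a wrong first pick is irreversible in the greedy pursuit.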
[AI-59] LiveVectorLake: A Real-Time Versioned Knowledge Base Architecture for Streaming Vector Updates and Temporal Retrieval
【Summary】: This paper targets a fundamental architectural tension in modern Retrieval-Augmented Generation (RAG) systems: vector indices are optimized for query latency but handle continuous knowledge updates poorly, while data lakes support versioning at the cost of significant query latency. The core of the solution is LiveVectorLake, a dual-tier temporal knowledge-base architecture built on three techniques: (1) content-addressable chunk-level synchronization via SHA-256 hashing, which detects changes deterministically without external state tracking; (2) dual-tier storage that separates a hot-tier vector index (Milvus with HNSW) from cold-tier columnar versioned storage (Delta Lake with Parquet), optimizing query latency and storage cost independently; and (3) temporal query routing that uses delta versioning for point-in-time knowledge retrieval with ACID consistency across tiers. The design preserves real-time semantic search over current knowledge while retaining complete version history for compliance, auditing, and point-in-time retrieval.
Link: https://arxiv.org/abs/2601.05270
Authors: Tarun Prajapati
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 7 pages, 1 figure. Preprint; work in progress
Abstract:Modern Retrieval-Augmented Generation (RAG) systems struggle with a fundamental architectural tension: vector indices are optimized for query latency but poorly handle continuous knowledge updates, while data lakes excel at versioning but introduce query latency penalties. We introduce LiveVectorLake, a dual-tier temporal knowledge base architecture that enables real-time semantic search on current knowledge while maintaining complete version history for compliance, auditability, and point-in-time retrieval. The system introduces three core architectural contributions: (1) Content-addressable chunk-level synchronization using SHA-256 hashing for deterministic change detection without external state tracking; (2) Dual-tier storage separating hot-tier vector indices (Milvus with HNSW) from cold-tier columnar versioning (Delta Lake with Parquet), optimizing query latency and storage cost independently; (3) Temporal query routing enabling point-in-time knowledge retrieval via delta-versioning with ACID consistency across tiers. Evaluation on a 100-document corpus versioned across five time points demonstrates: (i) 10-15% re-processing of content during updates compared to 100% for full re-indexing; (ii) sub-100ms retrieval latency on current knowledge; (iii) sub-2s latency for temporal queries across version history; and (iv) storage cost optimization through hot/cold tier separation (only current chunks in expensive vector indices). The approach enables production RAG deployments requiring simultaneous optimization for query performance, update efficiency, and regulatory compliance. Code and resources: [this https URL]
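Contribution (1) can be sketched in a few lines. The helper names below are illustrative, not LiveVectorLake's actual API: chunks are keyed by SHA-256 content hashes, so an update run re-embeds only the chunks whose hash changed.

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Content address of a chunk: SHA-256 over its UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_chunks(old_index: dict, chunks: dict) -> list:
    """Chunk ids whose content hash differs from the stored index."""
    return [cid for cid, text in chunks.items()
            if old_index.get(cid) != chunk_hash(text)]

v1 = {"doc1#0": "alpha", "doc1#1": "beta"}
index = {cid: chunk_hash(t) for cid, t in v1.items()}   # stored hashes
v2 = {"doc1#0": "alpha", "doc1#1": "beta (edited)"}      # one chunk edited
print(changed_chunks(index, v2))  # ['doc1#1']
```

Because the hash is a pure function of chunk content, change detection needs no external state tracking, which is what makes the reported 10-15% re-processing possible.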
[AI-60] Engineering the RAG Stack: A Comprehensive Review of the Architecture and Trust Frameworks for Retrieval-Augmented Generation Systems
【Summary】: This paper addresses the fragmentation of Retrieval-Augmented Generation (RAG) research and industrial practice caused by the growing diversity of methods, in particular the question of how to systematically combine different fusion mechanisms, retrieval strategies, and orchestration approaches into a unified, deployable RAG architecture. The key to the solution is a structured taxonomy, combined with quantitative evaluation frameworks, that systematically consolidates and analyzes existing RAG techniques, providing a practical guide and technical reference for building robust, secure, and domain-adaptable RAG systems.
Link: https://arxiv.org/abs/2601.05264
Authors: Dean Wampler, Dave Nielson, Alireza Seddighi
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 86 pages, 2 figures, 37 tables. A comprehensive review of Retrieval-Augmented Generation (RAG) architectures and trust frameworks (2018-2025), encompassing a unified taxonomy, evaluation benchmarks, and trust-safety modeling
Abstract:This article provides a comprehensive systematic literature review of academic studies, industrial applications, and real-world deployments from 2018 to 2025, providing a practical guide and detailed overview of modern Retrieval-Augmented Generation (RAG) architectures. RAG offers a modular approach for integrating external knowledge without increasing the capacity of the model as LLM systems expand. Research and engineering practices have been fragmented as a result of the increasing diversity of RAG methodologies, which encompasses a variety of fusion mechanisms, retrieval strategies, and orchestration approaches. We provide quantitative assessment frameworks, analyze the implications for trust and alignment, and systematically consolidate existing RAG techniques into a unified taxonomy. This document is a practical framework for the deployment of resilient, secure, and domain-adaptable RAG systems, synthesizing insights from academic literature, industry reports, and technical implementation guides. It also functions as a technical reference.
[AI-61] From Events to Trending: A Multi-Stage Hotspots Detection Method Based on Generative Query Indexing
【Summary】: This paper tackles the inability of LLM-based conversational systems to recognize news-related trending queries, which degrades the information-access experience. Existing methods are designed for traditional search engines and transfer poorly to conversational settings, where query distributions and expression patterns differ fundamentally. The key to the solution is a multi-stage framework: hot events are first used to generate index queries, bridging static events and dynamic user queries; a cascaded recall-and-ranking architecture then performs efficient and accurate online detection; finally, a single-recall module serves as a cold-start strategy to collect online data for fine-tuning the reranker. The framework significantly outperforms baselines in both offline evaluations and online A/B tests, and improves the user positive-to-negative feedback ratio by 27%.
Link: https://arxiv.org/abs/2601.05258
Authors: Kaichun Wang, Yanguang Chen, Ting Zhang, Mengyao Bao, Keyu Chen, Xu Hu, Yongliang Wang, Jingsheng Yang, Jinsong Zhang, Fei Lu
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:LLM-based conversational systems have become a popular gateway for information access, yet most existing chatbots struggle to handle news-related trending queries effectively. To improve user experience, an effective trending query detection method is urgently needed to enable differentiated processing of such target traffic. However, current research on trending detection tailored to the dialogue system scenario remains largely unexplored, and methods designed for traditional search engines often underperform in conversational contexts due to radically distinct query distributions and expression patterns. To fill this gap, we propose a multi-stage framework for trending detection, which achieves systematic optimization from both offline generation and online identification perspectives. Specifically, our framework first exploits selected hot events to generate index queries, establishing a key bridge between static events and dynamic user queries. It then employs a retrieval matching mechanism for real-time online detection of trending queries, where we introduce a cascaded recall and ranking architecture to balance detection efficiency and accuracy. Furthermore, to better adapt to the practical application scenario, our framework adopts a single-recall module as a cold-start strategy to collect online data for fine-tuning the reranker. Extensive experiments demonstrate that our framework significantly outperforms baseline methods in both offline evaluations and online A/B tests, and user satisfaction is relatively improved by 27% in terms of positive-negative feedback ratio.
[AI-62] KP-Agent : Keyword Pruning in Sponsored Search Advertising via LLM -Powered Contextual Bandits
【Summary】: This paper addresses keyword pruning in sponsored search advertising (SSA), a critical but long-neglected problem: how to improve campaign performance by trimming and refining keyword sets. While bid adjustment and keyword generation are well studied, there is no systematic method for automatically identifying and removing redundant or ineffective keywords, which wastes ad spend and hurts conversion efficiency. The core of the solution is KP-Agent, an LLM-based agentic system equipped with a domain-specific tool set and a memory module. It models keyword pruning as a contextual bandit problem and uses reinforcement learning to generate code snippets that iteratively refine the keyword set, improving cumulative profit by up to 49.28% in experiments.
Link: https://arxiv.org/abs/2601.05257
Authors: Hou-Wan Long, Yicheng Song, Zidong Wang, Tianshu Sun
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sponsored search advertising (SSA) requires advertisers to constantly adjust keyword strategies. While bid adjustment and keyword generation are well-studied, keyword pruning-refining keyword sets to enhance campaign performance-remains under-explored. This paper addresses critical inefficiencies in current practices as evidenced by a dataset containing 0.5 million SSA records from a pharmaceutical advertiser on search engine Meituan, China’s largest delivery platform. We propose KP-Agent, an LLM agentic system with domain tool set and a memory module. By modeling keyword pruning within a contextual bandit framework, KP-Agent generates code snippets to refine keyword sets through reinforcement learning. Experiments show KP-Agent improves cumulative profit by up to 49.28% over baselines.
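The contextual-bandit view of keyword pruning can be illustrated with a toy epsilon-greedy loop. This is our own simplified formulation (KP-Agent itself uses an LLM agent with tools and memory), and the per-keyword profits below are made up:

```python
import random

random.seed(0)
keywords = {"kw_a": 1.2, "kw_b": -0.4, "kw_c": 0.3}  # hidden mean profit/round
actions = ["keep_all"] + [f"prune:{k}" for k in keywords]
q = {a: 0.0 for a in actions}    # running value estimates
n = {a: 0 for a in actions}      # pull counts

def reward(action):
    """Noisy campaign profit after applying a pruning action."""
    active = {k: v for k, v in keywords.items() if action != f"prune:{k}"}
    return sum(active.values()) + random.gauss(0, 0.1)

for t in range(500):
    # epsilon-greedy: explore 10% of the time, otherwise exploit.
    a = random.choice(actions) if random.random() < 0.1 else max(q, key=q.get)
    r = reward(a)
    n[a] += 1
    q[a] += (r - q[a]) / n[a]    # incremental mean update

print(max(q, key=q.get))  # the loss-making keyword gets pruned
```

The bandit quickly learns that dropping the keyword with negative expected profit yields the highest cumulative reward, which is the decision the paper's agent makes with far richer context features.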
[AI-63] SP-Rank: A Dataset for Ranked Preferences with Secondary Information
【Summary】: This paper addresses the limitation that traditional ranking tasks rely only on individuals' first-order preference signals and ignore group-level second-order predictions, constraining accurate modeling of the ground-truth ranking. The key contribution is SP-Rank, the first large-scale, publicly available dataset in which each datapoint contains both a personal vote (first-order signal) and a meta-prediction of how others will vote (second-order signal), allowing algorithms to reason jointly over both to infer the hidden ground-truth ranking. Benchmarks comparing vote-only aggregation against SP-Voting, which incorporates second-order signals, show that combining both signals substantially improves rank recovery, especially when expert identities are unknown but presumed to exist, opening new directions for human preference modeling, handling crowd noise, and training preference-based reward models.
Link: https://arxiv.org/abs/2601.05253
Authors: Hadi Hosseini, Debmalya Mandal, Amrit Puhan
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:We introduce \mathbfSP-Rank , the first large-scale, publicly available dataset for benchmarking algorithms that leverage both first-order preferences and second-order predictions in ranking tasks. Each datapoint includes a personal vote (first-order signal) and a meta-prediction of how others will vote (second-order signal), allowing richer modeling than traditional datasets that capture only individual preferences. SP-Rank contains over 12,000 human-generated datapoints across three domains – geography, movies, and paintings, and spans nine elicitation formats with varying subset sizes. This structure enables empirical analysis of preference aggregation when expert identities are unknown but presumed to exist, and individual votes represent noisy estimates of a shared ground-truth ranking. We benchmark SP-Rank by comparing traditional aggregation methods that use only first-order votes against SP-Voting, a second-order method that jointly reasons over both signals to infer ground-truth rankings. While SP-Rank also supports models that rely solely on second-order predictions, our benchmarks emphasize the gains from combining both signals. We evaluate performance across three core tasks: (1) full ground-truth rank recovery, (2) subset-level rank recovery, and (3) probabilistic modeling of voter behavior. Results show that incorporating second-order signals substantially improves accuracy over vote-only methods. Beyond social choice, SP-Rank supports downstream applications in learning-to-rank, extracting expert knowledge from noisy crowds, and training reward models in preference-based fine-tuning pipelines. We release the dataset, code, and baseline evaluations (available at this https URL ) to foster research in human preference modeling, aggregation theory, and human-AI alignment.
[AI-64] Tiny Recursive Models on ARC-AGI-1: Inductive Biases, Identity Conditioning, and Test-Time Compute
【Summary】: This paper probes the parameter inefficiency and heavy compute requirements of large language models on abstract reasoning tasks such as the Abstraction and Reasoning Corpus (ARC). The object of study is Tiny Recursive Models (TRM), a lightweight architecture that achieves strong reasoning performance with few parameters via non-autoregressive recursive latent updates. The analysis shows that TRM's strong results stem mainly from test-time strategies (e.g., majority-vote ensembling), task-specific conditioning (dependence on puzzle IDs), and shallow effective recursion rather than deep internal reasoning, revealing the interplay between efficient inference and task design.
Link: https://arxiv.org/abs/2512.11847
Authors: Antonio Roye-Azar, Santiago Vargas-Naranjo, Dhruv Ghai, Nithin Balamurugan, Rayan Amir
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 0 figures, 6 tables
Abstract:Tiny Recursive Models (TRM) were proposed as a parameter-efficient alternative to large language models for solving Abstraction and Reasoning Corpus (ARC) style tasks. The original work reports strong performance and suggests that recursive latent updates enable non-trivial reasoning, but it remains unclear how much of this performance stems from architecture, test-time compute, or task-specific priors. In this technical note, we empirically analyze the ARC Prize TRM checkpoint on ARC-AGI-1 and report four behavioral findings and an efficiency comparison. First, we show that test-time augmentation and majority-vote ensembling account for a substantial fraction of reported performance: the 1000-sample voting pipeline improves Pass@1 by about 11 percentage points over single-pass canonical inference. Second, a puzzle-identity ablation reveals strict dependence on task identifiers: replacing the correct puzzle ID with a blank or random token yields zero accuracy. Third, a recursion trajectory analysis shows that most of the final accuracy is achieved at the first recursion step and that performance saturates after few latent updates, indicating shallow effective recursion. Fourth, early-stage training experiments under canonical versus heavy augmentation regimes suggest that heavy augmentation broadens the distribution of candidate solutions and improves multi-sample success. Finally, we compare TRM with a naive QLoRA fine-tune of Llama 3 8B on canonical ARC-AGI-1, finding that TRM’s non-autoregressive design achieves much higher throughput and substantially lower memory usage in this setting. Overall, TRM’s ARC-AGI-1 performance appears to arise from an interaction between efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal reasoning.
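The majority-vote pipeline analyzed in this abstract reduces to a simple aggregation step. A minimal sketch, with toy de-augmented predictions standing in for ARC output grids:

```python
from collections import Counter

def majority_vote(candidates):
    """candidates: iterable of hashable predictions (e.g., flattened grids).
    Returns the most common prediction and its vote share."""
    counts = Counter(candidates)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / sum(counts.values())

# Predictions from many augmented passes, mapped back to canonical form.
samples = [(1, 2), (1, 2), (3, 4), (1, 2), (5, 6)]
ans, conf = majority_vote(samples)
print(ans, round(conf, 2))  # (1, 2) 0.6
```

The paper's finding is that this aggregation over ~1000 samples, not deeper recursion, accounts for a large share of the reported Pass@1 gain.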
[AI-65] Automating Deception: Scalable Multi-Turn LLM Jailbreaks
【Summary】: This paper addresses the persistent threat that multi-turn conversational attacks pose to the safety alignment of large language models (LLMs); such attacks exploit psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for progressively larger violations. To overcome the scaling bottleneck of manually constructed defense datasets, the paper proposes an automated pipeline for generating large-scale, psychologically grounded multi-turn jailbreak datasets. The key is to systematically operationalize FITD techniques into reproducible templates, yielding a 1,500-scenario benchmark used to evaluate different LLMs' contextual robustness with and without conversation history, which reveals stark divergence in how current safety architectures handle conversational context.
Link: https://arxiv.org/abs/2511.19517
Authors: Adarsh Kumarappan, Ananya Mujoo
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass safety alignments, pose a persistent threat to Large Language Models (LLMs). Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google’s Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic’s Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.
[AI-66] Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM
【Summary】: This paper addresses the limited trustworthiness of SmoothLLM's safety certificate in practice, which stems from its overly strict "k-unstable" assumption. The key to the solution is a more realistic probabilistic framework, "(k, ε)-unstable", which incorporates empirical models of attack success rates to derive a new, data-informed lower bound on SmoothLLM's defense probability. This yields more practical and actionable safety guarantees, letting practitioners set certification thresholds that reflect the real-world behavior of large language models (LLMs) and improving robustness against diverse jailbreaking attacks, from gradient-based GCG to semantic PAIR attacks.
Link: https://arxiv.org/abs/2511.18721
Authors: Adarsh Kumarappan, Ayushi Mehrotra
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, \varepsilon)-unstable", to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of "(k, \varepsilon)-unstable", our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.
[AI-67] EvoC2Rust: A Skeleton-guided Framework for Project-Level C-to-Rust Translation ICSE2026
【Summary】: This paper addresses the challenge of translating large C codebases into equivalent Rust, an increasingly urgent need for building safety-critical systems. Existing approaches have clear limitations: rule-based methods struggle with code safety and idiomaticity, while LLM-based methods often produce semantically inconsistent code due to heavy cross-module dependencies, and both are limited to small programs. The core innovation of EvoC2Rust is a skeleton-guided, project-level strategy: it first decomposes the project into functional modules and uses a feature-mapping-enhanced LLM to generate a compilable Rust skeleton (with type-checked function stubs), then incrementally replaces the stubs with translated functions, and finally repairs compilation errors by combining the LLM with static analysis. This evolutionary augmentation fuses the safety advantages of rule-based methods with the flexibility of LLMs, enabling efficient, high-fidelity migration of industrial-scale projects.
Link: https://arxiv.org/abs/2508.04295
Authors: Chaofan Wang, Tingrui Yu, Chen Xie, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, Beijun Shen
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by ICSE 2026 SEIP
Abstract:Translating legacy C codebases to Rust is increasingly demanded for building safety-critical systems. While various approaches have emerged for this task, they face inherent trade-offs: rule-based methods often struggle to satisfy code safety and idiomaticity requirements, while LLM-based methods frequently fail to generate semantically equivalent Rust code, due to the heavy dependencies of modules across the entire codebase. Recent studies have revealed that both solutions are limited to small-scale programs. In this paper, we propose EvoC2Rust, an automated framework for converting complete C projects to equivalent Rust ones. EvoC2Rust employs a skeleton-guided translation strategy for project-level translation. The pipeline consists of three stages: 1) it first decomposes the C project into functional modules, employs a feature-mapping-enhanced LLM to transform definitions and macros, and generates type-checked function stubs, which form a compilable Rust skeleton; 2) it then incrementally translates functions, replacing the corresponding stub placeholders; 3) finally, it repairs compilation errors by integrating LLM and static analysis. Through evolutionary augmentation, EvoC2Rust combines the advantages of both rule-based and LLM-based solutions. Our evaluation on open-source benchmarks and six industrial projects demonstrates the superior performance of EvoC2Rust in project-level C-to-Rust translation. The results show that our approach outperforms the strongest LLM-based baseline by 17.24% in syntax accuracy and 14.32% in semantic accuracy, while also achieving a 43.59% higher code safety rate than the best rule-based tool.
[AI-68] Cedalion Tutorial: A Python-based framework for comprehensive analysis of multimodal fNIRS DOT from the lab to the everyday world
【Summary】: This paper addresses the fragmentation of analysis tools for functional near-infrared spectroscopy (fNIRS) and diffuse optical tomography (DOT), which hampers reproducibility and interoperability and makes it hard to integrate modern machine learning (ML) workflows as the field moves toward wearable, multimodal, data-driven neuroimaging. The key to the solution is Cedalion, a Python-based open-source framework that unifies model-based and data-driven analysis of multimodal fNIRS/DOT data: a standardized architecture integrates forward modeling, photogrammetric optode co-registration, signal processing, GLM analysis, DOT image reconstruction, and ML methods; it is compatible with the SNIRF and BIDS standards and supports cloud-executable Jupyter notebooks and containerized workflows, enabling reproducible, extensible, cloud- and ML-ready neuroimaging pipelines.
Link: https://arxiv.org/abs/2601.05923
Authors: E. Middell, L. Carlton, S. Moradi, T. Codina, T. Fischer, J. Cutler, S. Kelley, J. Behrendt, T. Dissanayake, N. Harmening, M. A. Yücel, D. A. Boas, A. von Lühmann
Institutions: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: 33 pages main manuscript, 180 pages Supplementary Tutorial Notebooks, 12 figures, 6 tables, under review in SPIE Neurophotonics
Abstract:Functional near-infrared spectroscopy (fNIRS) and diffuse optical tomography (DOT) are rapidly evolving toward wearable, multimodal, and data-driven, AI-supported neuroimaging in the everyday world. However, current analytical tools are fragmented across platforms, limiting reproducibility, interoperability, and integration with modern machine learning (ML) workflows. Cedalion is a Python-based open-source framework designed to unify advanced model-based and data-driven analysis of multimodal fNIRS and DOT data within a reproducible, extensible, and community-driven environment. Cedalion integrates forward modelling, photogrammetric optode co-registration, signal processing, GLM Analysis, DOT image reconstruction, and ML-based data-driven methods within a single standardized architecture based on the Python ecosystem. It adheres to SNIRF and BIDS standards, supports cloud-executable Jupyter notebooks, and provides containerized workflows for scalable, fully reproducible analysis pipelines that can be provided alongside original research publications. Cedalion connects established optical-neuroimaging pipelines with ML frameworks such as scikit-learn and PyTorch, enabling seamless multimodal fusion with EEG, MEG, and physiological data. It implements validated algorithms for signal-quality assessment, motion correction, GLM modelling, and DOT reconstruction, complemented by modules for simulation, data augmentation, and multimodal physiology analysis. Automated documentation links each method to its source publication, and continuous-integration testing ensures robustness. This tutorial paper provides seven fully executable notebooks that demonstrate core features. Cedalion offers an open, transparent, and community extensible foundation that supports reproducible, scalable, cloud- and ML-ready fNIRS/DOT workflows for laboratory-based and real-world neuroimaging.
[AI-69] Joint Optimization of Neural Autoregressors via Scoring rules
【Summary】: This paper addresses the computational complexity and parameter blow-up of grid-based methods for non-parametric distributional regression in high-dimensional multivariate settings. With N bins per dimension, the complexity of an explicit joint grid grows exponentially, sharply inflating the number of neural-network parameters and causing severe overfitting and intractability, especially in low-data regimes. The key to the solution is a scalable structure that achieves efficient multivariate modeling without explicitly constructing the full high-dimensional grid, mitigating the curse of dimensionality and improving generalization with few samples.
Link: https://arxiv.org/abs/2601.05683
Authors: Jonas Landsgesell
Institutions: Unknown
Categories: Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI)
Comments:
Abstract:Non-parametric distributional regression has achieved significant milestones in recent years. Among these, the Tabular Prior-Data Fitted Network (TabPFN) has demonstrated state-of-the-art performance on various benchmarks. However, a challenge remains in extending these grid-based approaches to a truly multivariate setting. In a naive non-parametric discretization with N bins per dimension, the complexity of an explicit joint grid scales exponentially and the parameter count of the neural networks rises sharply. This scaling is particularly detrimental in low-data regimes, as the final projection layer would require many parameters, leading to severe overfitting and intractability.
[AI-70] A Bayesian Generative Modeling Approach for Arbitrary Conditional Inference
【Summary】: This paper addresses flexible conditional inference in modern data analysis, i.e., modeling P(X_B | X_A) for an arbitrary partition (X_A, X_B), whereas existing methods are tied to a fixed conditioning structure and cannot support new conditional queries after training. The key to the solution is a Bayesian Generative Modeling (BGM) framework: an iterative Bayesian updating algorithm learns a generative model of the joint variable X, updating model parameters and latent variables until convergence; once trained, any conditional distribution can be derived without retraining, with uncertainty quantification. BGM turns a single learned model into a universal engine for conditional prediction, with theoretical guarantees of convergence, statistical consistency, and conditional-risk bounds.
Link: https://arxiv.org/abs/2601.05355
Authors: Qiao Liu, Wing Hung Wong
Institutions: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments:
Abstract:Modern data analysis increasingly requires flexible conditional inference P(X_B | X_A) where (X_A, X_B) is an arbitrary partition of observed variable X. Existing conditional inference methods lack this flexibility as they are tied to a fixed conditioning structure and cannot perform new conditional inference once trained. To solve this, we propose a Bayesian generative modeling (BGM) approach for arbitrary conditional inference without retraining. BGM learns a generative model of X through an iterative Bayesian updating algorithm where model parameters and latent variables are updated until convergence. Once trained, any conditional distribution can be obtained without retraining. Empirically, BGM achieves superior prediction performance with well calibrated predictive intervals, demonstrating that a single learned model can serve as a universal engine for conditional prediction with uncertainty quantification. We provide theoretical guarantees for the convergence of the stochastic iterative algorithm, statistical consistency and conditional-risk bounds. The proposed BGM framework leverages the power of AI to capture complex relationships among variables while adhering to Bayesian principles, emerging as a promising framework for advancing various applications in modern data science. The code for BGM is freely available at this https URL.
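Arbitrary conditioning from a single joint model can be illustrated in the Gaussian case, where P(X_B | X_A) has a closed form for any partition. This toy stands in for the learned generative model (BGM itself is far more general than a fixed Gaussian):

```python
import numpy as np

def conditional(mu, cov, idx_a, x_a):
    """Mean and covariance of X_B | X_A = x_a for any index partition,
    using mu_B + S_BA S_AA^{-1}(x_a - mu_A) and S_BB - S_BA S_AA^{-1} S_AB."""
    idx_b = [i for i in range(len(mu)) if i not in idx_a]
    Saa = cov[np.ix_(idx_a, idx_a)]
    Sba = cov[np.ix_(idx_b, idx_a)]
    Sbb = cov[np.ix_(idx_b, idx_b)]
    K = Sba @ np.linalg.inv(Saa)
    return mu[idx_b] + K @ (x_a - mu[idx_a]), Sbb - K @ Sba.T

mu = np.zeros(3)
cov = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
m, C = conditional(mu, cov, idx_a=[0], x_a=np.array([1.0]))
print(np.round(m, 2))  # [0.8 0.1]
```

The same learned joint (mu, cov) answers any conditional query without refitting, which is the property BGM generalizes beyond Gaussians.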
[AI-71] Evolving Cognitive Architectures
【Summary】: This paper addresses the overly schematic way current cognitive architectures model cognitive activity: they lack the universality needed to reproduce higher nervous functions without a predetermined set of perception patterns. The key to the solution is an evolving cognitive architecture built around a functional core that continuously generates the intellectual functions of an autonomous agent, using the conceptual apparatus of semiotics to connect the Merkwelt and the Werkwelt through the construction of the Innenwelt, thereby enabling the evolutionary development of cognitive functions.
Link: https://arxiv.org/abs/2601.05277
Authors: Alexander Serov
Institutions: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:This article proposes a research and development direction that would lead to the creation of next-generation intelligent technical systems. A distinctive feature of these systems is their ability to undergo evolutionary change. Cognitive architectures are now one of the most promising ways to create Artificial General Intelligence systems. One of the main problems of modern cognitive architectures is an excessively schematic approach to modeling the processes of cognitive activity. It does not allow the creation of a universal architecture that would be capable of reproducing higher nervous functions without using a predetermined set of perception patterns. Our paper proposes an evolutionary approach to creating a cognitive architecture. The basis of this approach is the use of a functional core, which consistently generates the intellectual functions of an autonomous agent. We are considering a cognitive architecture that includes components, the interaction of which ensures the evolution of the agent. The discussion of the development of intelligence is carried out using the conceptual apparatus of semiotics. This allows us to consider the task of developing cognitive functions as a problem of establishing a connection between the Merkwelt and the Werkwelt through the creation of the Innenwelt. The problem of early postnatal ontogenesis is investigated on the basis of the theory of constructivism: we discuss the requirements for the functional core and its composition, as well as the mechanism that initiates the process of cognition.
Machine Learning
[LG-0] LookAroundNet: Extending Temporal Context with Transformers for Clinically Viable EEG Seizure Detection
Link: https://arxiv.org/abs/2601.06016
Authors: Þór Sverrisson, Steinn Guðmundsson
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Automated seizure detection from electroencephalography (EEG) remains difficult due to the large variability of seizure dynamics across patients, recording conditions, and clinical settings. We introduce LookAroundNet, a transformer-based seizure detector that uses a wider temporal window of EEG data to model seizure activity. The seizure detector incorporates EEG signals before and after the segment of interest, reflecting how clinicians use surrounding context when interpreting EEG recordings. We evaluate the proposed method on multiple EEG datasets spanning diverse clinical environments, patient populations, and recording modalities, including routine clinical EEG and long-term ambulatory recordings, in order to study performance across varying data distributions. The evaluation includes publicly available datasets as well as a large proprietary collection of home EEG recordings, providing complementary views of controlled clinical data and unconstrained home-monitoring conditions. Our results show that LookAroundNet achieves strong performance across datasets, generalizes well to previously unseen recording conditions, and operates with computational costs compatible with real-world clinical deployment. The results indicate that extended temporal context, increased training data diversity, and model ensembling are key factors for improving performance. This work contributes to moving automatic seizure detection models toward clinically viable solutions.
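The "look-around" idea can be sketched as a windowing step that pairs each central EEG segment with flanking context, mirroring how a clinician reads the surrounding trace. The segment length and context size below are illustrative, not the paper's settings:

```python
import numpy as np

def context_windows(eeg, seg_len, n_context):
    """eeg: (channels, samples). Yields (center_idx, window) pairs where the
    window spans n_context segments on each side of the center segment."""
    n_seg = eeg.shape[1] // seg_len
    for i in range(n_context, n_seg - n_context):
        start = (i - n_context) * seg_len
        stop = (i + n_context + 1) * seg_len
        yield i, eeg[:, start:stop]

# 19 channels, 60 s at 256 Hz, 4 s segments, one context segment per side.
eeg = np.zeros((19, 256 * 60))
wins = list(context_windows(eeg, seg_len=256 * 4, n_context=1))
print(len(wins), wins[0][1].shape)  # 13 (19, 3072)
```

A classifier would then label only the central segment of each window, while the transformer attends over the full 3-segment span.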
[LG-1] CyberGFM: Graph Foundation Models for Lateral Movement Detection in Enterprise Networks
Link: https://arxiv.org/abs/2601.05988
Authors: Isaiah J. King, Bernardo Trindade, Benjamin Bowman, H. Howie Huang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 17 pages; 11 figures; 8 tables
Abstract:Representing networks as a graph and training a link prediction model using benign connections is an effective method of anomaly-based intrusion detection. Existing works using this technique have shown great success using temporal graph neural networks and skip-gram-based approaches on random walks. However, random walk-based approaches are unable to incorporate rich edge data, while the GNN-based approaches require large amounts of memory to train. In this work, we propose extending the original insight from random walk-based skip-grams–that random walks through a graph are analogous to sentences in a corpus–to the more modern transformer-based foundation models. Using language models that take advantage of GPU optimizations, we can quickly train a graph foundation model to predict missing tokens in random walks through a network of computers. The graph foundation model is then finetuned for link prediction and used as a network anomaly detector. This new approach allows us to combine the efficiency of random walk-based methods and the rich semantic representation of deep learning methods. This system, which we call CyberGFM, achieved state-of-the-art results on three widely used network anomaly detection datasets, delivering up to a 2 \times improvement in average precision. We found that CyberGFM outperforms all prior works in unsupervised link prediction for network anomaly detection, using the same number of parameters, and with equal or better efficiency than the previous best approaches.
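The core analogy above (random walks through a graph as sentences in a corpus) can be sketched directly; the toy host graph below is illustrative, with hostnames standing in for the tokens a language model would train on:

```python
import random

random.seed(0)
# Toy authentication graph: host -> hosts it connects to.
graph = {"host_a": ["host_b", "host_c"],
         "host_b": ["host_a", "host_d"],
         "host_c": ["host_a"],
         "host_d": ["host_b"]}

def random_walk(graph, start, length):
    """One 'sentence': a random walk of `length` node tokens."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# A small corpus: two walks starting from each node.
sentences = [random_walk(graph, n, 5) for n in graph for _ in range(2)]
print(len(sentences), len(sentences[0]))  # 8 5
```

A transformer trained to fill in masked tokens of such walks learns which transitions are normal; at detection time, edges the model finds improbable are flagged as potential lateral movement.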
[LG-2] Community-Based Model Sharing and Generalisation: Anomaly Detection in IoT Temperature Sensor Networks
链接: https://arxiv.org/abs/2601.05984
作者: Sahibzada Saadoon Hammad,Joaquín Huerta Guijarro,Francisco Ramos,Michael Gould Carlson,Sergio Trilles Oliver
类目: Machine Learning (cs.LG)
*备注: 20 pages, 9 figures, Journal submission
Abstract:The rapid deployment of Internet of Things (IoT) devices has led to large-scale sensor networks that monitor environmental and urban phenomena in real time. Communities of Interest (CoIs) provide a promising paradigm for organising heterogeneous IoT sensor networks by grouping devices with similar operational and environmental characteristics. This work presents an anomaly detection framework based on the CoI paradigm by grouping sensors into communities using a fused similarity matrix that incorporates temporal correlations via Spearman coefficients, spatial proximity using Gaussian distance decay, and elevation similarities. For each community, representative stations based on the best silhouette are selected and three autoencoder architectures (BiLSTM, LSTM, and MLP) are trained using Bayesian hyperparameter optimization with expanding window cross-validation and tested on stations from the same cluster and the best representative stations of other clusters. The models are trained on normal temperature patterns of the data and anomalies are detected through reconstruction error analysis. Experimental results show a robust within-community performance across the evaluated configurations, while variations across communities are observed. Overall, the results support the applicability of community-based model sharing in reducing computational overhead and to analyse model generalisability across IoT sensor networks.
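The fused similarity matrix described above (Spearman correlation of the series, Gaussian distance decay, and elevation similarity) can be sketched directly in numpy. The equal weights and bandwidths below are illustrative assumptions, and the double-argsort rank transform ignores ties for simplicity:

```python
import numpy as np

def spearman_matrix(X):
    # Rank-transform each station's series, then take Pearson correlation of ranks
    # (no tie handling -- fine for continuous readings, a simplification otherwise).
    ranks = np.argsort(np.argsort(X, axis=1), axis=1).astype(float)
    return np.corrcoef(ranks)

def fused_similarity(X, coords, elev, sigma_d=1000.0, sigma_e=50.0, w=(1/3, 1/3, 1/3)):
    """Sketch of a fused similarity matrix combining temporal correlation,
    Gaussian distance decay, and elevation similarity. Weights `w` and
    bandwidths sigma_d/sigma_e are illustrative, not the paper's values."""
    S_t = (spearman_matrix(X) + 1) / 2                       # map [-1, 1] -> [0, 1]
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    S_d = np.exp(-(d ** 2) / (2 * sigma_d ** 2))             # Gaussian distance decay
    e = np.abs(elev[:, None] - elev[None, :])
    S_e = np.exp(-(e ** 2) / (2 * sigma_e ** 2))             # elevation similarity
    return w[0] * S_t + w[1] * S_d + w[2] * S_e

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 48))          # 6 stations, 48 hourly temperature readings
coords = rng.uniform(0, 5000, (6, 2))     # positions in metres
elev = rng.uniform(0, 200, 6)             # elevations in metres
S = fused_similarity(X, coords, elev)
```

Community detection (e.g. spectral clustering) would then run on `S` to form the CoIs before per-community model training.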
[LG-3] AWaRe-SAC: Proactive Slice Admission Control under Weather-Induced Capacity Uncertainty
链接: https://arxiv.org/abs/2601.05978
作者: Dror Jacoby,Yanzhi Li,Shuyue Yu,Nicola Di Cicco,Hagit Messer,Gil Zussman,Igor Kadota
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:As emerging applications demand higher throughput and lower latencies, operators are increasingly deploying millimeter-wave (mmWave) links within x-haul transport networks, spanning fronthaul, midhaul, and backhaul segments. However, the inherent susceptibility of mmWave frequencies to weather-related attenuation, particularly rain fading, complicates the maintenance of stringent Quality of Service (QoS) requirements. This creates a critical challenge: making admission decisions under uncertainty regarding future network capacity. To address this, we develop a proactive slice admission control framework for mmWave x-haul networks subject to rain-induced fluctuations. Our objective is to improve network performance, ensure QoS, and optimize revenue, thereby surpassing the limitations of standard reactive approaches. The proposed framework integrates a deep learning predictor of future network conditions with a proactive Q-learning-based slice admission control mechanism. We validate our solution using real-world data from a mmWave x-haul deployment in a dense urban area, incorporating realistic models of link capacity attenuation and dynamic slice demands. Extensive evaluations demonstrate that our proactive solution achieves 2-3x higher long-term average revenue under dynamic link conditions, providing a scalable and resilient framework for adaptive admission control.
[LG-4] On the Robustness of Age for Learning-Based Wireless Scheduling in Unknown Environments
链接: https://arxiv.org/abs/2601.05956
作者: Juaren Steiger,Bin Li
类目: Machine Learning (cs.LG)
*备注: technical report of conference paper
Abstract:The constrained combinatorial multi-armed bandit model has been widely employed to solve problems in wireless networking and related areas, including the problem of wireless scheduling for throughput optimization under unknown channel conditions. Most work in this area uses an algorithm design strategy that combines a bandit learning algorithm with the virtual queue technique to track the throughput constraint violation. These algorithms seek to minimize the virtual queue length in their algorithm design. However, in networks where channel conditions change abruptly, the resulting constraints may become infeasible, leading to unbounded growth in virtual queue lengths. In this paper, we make the key observation that the dynamics of the head-of-line age, i.e. the age of the oldest packet in the virtual queue, make it more robust when used in algorithm design compared to the virtual queue length. We therefore design a learning-based scheduling policy that uses the head-of-line age in place of the virtual queue length. We show that our policy matches state-of-the-art performance under i.i.d. network conditions. Crucially, we also show that the system remains stable even under abrupt changes in channel conditions and can rapidly recover from periods of constraint infeasibility.
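The key contrast in the abstract, virtual queue length versus head-of-line (HoL) age, is easy to see in a toy simulation. The dynamics below are a generic illustration, not the paper's scheduling policy:

```python
from collections import deque

def simulate(arrivals, services):
    """Track a virtual queue length and the head-of-line (HoL) age side by
    side. Each slot t, `arrivals[t]` packets enter and up to `services[t]`
    packets leave. The HoL age is the age of the oldest queued packet: it
    is tied to how long a single packet waits, whereas the queue length
    accumulates every unserved packet during infeasible periods."""
    q = deque()
    lengths, hol_ages = [], []
    for t, (a, s) in enumerate(zip(arrivals, services)):
        q.extend([t] * a)                    # timestamp each arriving packet
        for _ in range(min(s, len(q))):
            q.popleft()                      # serve oldest packets first
        lengths.append(len(q))
        hol_ages.append(t - q[0] if q else 0)
    return lengths, hol_ages

# 5 slots with no service (an "infeasible" burst), then recovery at rate 2
lengths, hol_ages = simulate(arrivals=[1] * 10, services=[0] * 5 + [2] * 5)
```

During the no-service burst both quantities grow, but once service resumes the HoL age drops back faster than the backlog drains, the robustness property the policy design exploits.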
[LG-5] Prophet as a Reproducible Forecasting Framework: A Methodological Guide for Business and Financial Analytics
链接: https://arxiv.org/abs/2601.05929
作者: Sidney Shapiro,Burhanuddin Panvelwala
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reproducibility remains a persistent challenge in forecasting research and practice, particularly in business and financial analytics where forecasts inform high-stakes decisions. Traditional forecasting methods, while theoretically interpretable, often require extensive manual tuning and are difficult to replicate in proprietary environments. Machine learning approaches offer predictive flexibility but introduce challenges related to interpretability, stochastic training procedures, and cross-environment reproducibility. This paper examines Prophet, an open-source forecasting framework developed by Meta, as a reproducibility-enabling solution that balances interpretability, standardized workflows, and accessibility. Rather than proposing a new algorithm, this study evaluates how Prophet’s additive structure, open-source implementation, and standardized workflow contribute to transparent and replicable forecasting practice. Using publicly available financial and retail datasets, we compare Prophet’s performance and interpretability with multiple ARIMA specifications (auto-selected, manually specified, and seasonal variants) and Random Forest under a controlled and fully documented experimental design. This multi-model comparison provides a robust assessment of Prophet’s relative performance and reproducibility advantages. Through concrete Python examples, we demonstrate how Prophet facilitates efficient forecasting workflows and integration with analytical pipelines. The study positions Prophet within the broader context of reproducible research. It highlights Prophet’s role as a methodological building block that supports verification, auditability, and methodological rigor. This work provides researchers and practitioners with a practical reference framework for reproducible forecasting in Python-based research workflows.
[LG-6] Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces
链接: https://arxiv.org/abs/2601.05913
作者: Pattarawat Chormai,Ali Hashemi,Klaus-Robert Müller,Grégoire Montavon
类目: Machine Learning (cs.LG)
*备注: 20 pages + supplement
Abstract:Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address the scenario in which only a few classes and their associated intermediate concepts are relevant to distill. This scenario is common in practice, yet few existing distillation methods explicitly focus on the relevant subtask. To address this gap, we introduce ‘SubDistill’, a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Experiments on CIFAR-100 and ImageNet with Convolutional and Transformer models demonstrate that SubDistill outperforms existing layer-wise distillation techniques on a representative set of subtasks. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.
[LG-7] GlueNN: gluing patchwise analytic solutions with neural networks
链接: https://arxiv.org/abs/2601.05889
作者: Doyoung Kim,Donghee Lee,Hye-Sung Lee,Jiheon Lee,Jaeok Yi
类目: Machine Learning (cs.LG); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Computational Physics (physics.comp-ph)
*备注: 7 pages, 3 figures
Abstract:In many problems in physics and engineering, one encounters complicated differential equations with strongly scale-dependent terms for which exact analytical or numerical solutions are not available. A common strategy is to divide the domain into several regions (patches) and simplify the equation in each region. When approximate analytic solutions can be obtained in each patch, they are then matched at the interfaces to construct a global solution. However, this patching procedure can fail to reproduce the correct solution, since the approximate forms may break down near the matching boundaries. In this work, we propose a learning framework in which the integration constants of asymptotic analytic solutions are promoted to scale-dependent functions. By constraining these coefficient functions with the original differential equation over the domain, the network learns a globally valid solution that smoothly interpolates between asymptotic regimes, eliminating the need for arbitrary boundary matching. We demonstrate the effectiveness of this framework in representative problems from chemical kinetics and cosmology, where it accurately reproduces global solutions and outperforms conventional matching procedures.
[LG-8] A New Family of Poisson Non-negative Matrix Factorization Methods Using the Shifted Log Link
链接: https://arxiv.org/abs/2601.05845
作者: Eric Weine,Peter Carbonetto,Rafael A. Irizarry,Matthew Stephens
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Poisson non-negative matrix factorization (NMF) is a widely used method to find interpretable “parts-based” decompositions of count data. While many variants of Poisson NMF exist, existing methods assume that the “parts” in the decomposition combine additively. This assumption may be natural in some settings, but not in others. Here we introduce Poisson NMF with the shifted-log link function to relax this assumption. The shifted-log link function has a single tuning parameter, and as this parameter varies the model changes from assuming that parts combine additively (i.e., standard Poisson NMF) to assuming that parts combine more multiplicatively. We provide an algorithm to fit this model by maximum likelihood, and also an approximation that substantially reduces computation time for large, sparse datasets (computations scale with the number of non-zero entries in the data matrix). We illustrate these new methods on a variety of real datasets. Our examples show how the choice of link function in Poisson NMF can substantively impact the results, and how in some settings the use of a shifted-log link function may improve interpretability compared with the standard, additive link.
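The single-parameter link family the abstract describes can be illustrated with one natural parameterisation, λ = c·(exp(η/c) − 1), whose limits match the stated behaviour: as c → ∞ it approaches the identity link (additive Poisson NMF), and for small c it behaves multiplicatively like an exponential. This is a sketch with assumed notation; the paper's exact form may differ:

```python
import numpy as np

def inv_link(eta, c):
    """Inverse shifted-log link: lambda = c * (exp(eta / c) - 1).
    One natural parameterisation with the limiting behaviour described in
    the abstract (assumption, not necessarily the paper's exact form).
    As c -> infinity this tends to the identity (additive model); for
    small c it tends to c * exp(eta / c), a multiplicative model."""
    return c * np.expm1(eta / c)  # expm1 is numerically stable near 0

eta = np.array([0.5, 1.0, 2.0])           # linear predictor (L @ F entries)
almost_additive = inv_link(eta, c=1e6)    # ~ eta itself
multiplicative = inv_link(eta, c=0.1)     # grows exponentially in eta
```

Maximum-likelihood fitting would then plug this inverse link into the Poisson log-likelihood sum over the non-zero entries of the count matrix.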
[LG-9] A Dual Pipeline Machine Learning Framework for Automated Multi Class Sleep Disorder Screening Using Hybrid Resampling and Ensemble Learning
链接: https://arxiv.org/abs/2601.05814
作者: Md Sultanul Islam Ovi,Muhsina Tarannum Munfa,Miftahul Alam Adib,Syed Sabbir Hasan
类目: Machine Learning (cs.LG)
*备注: 32 pages, 5 figures, 14 tables
Abstract:Accurate classification of sleep disorders, particularly insomnia and sleep apnea, is important for reducing long term health risks and improving patient quality of life. However, clinical sleep studies are resource intensive and are difficult to scale for population level screening. This paper presents a Dual Pipeline Machine Learning Framework for multi class sleep disorder screening using the Sleep Health and Lifestyle dataset. The framework consists of two parallel processing streams: a statistical pipeline that targets linear separability using Mutual Information and Linear Discriminant Analysis, and a wrapper based pipeline that applies Boruta feature selection with an autoencoder for non linear representation learning. To address class imbalance, we use the hybrid SMOTETomek resampling strategy. In experiments, Extra Trees and K Nearest Neighbors achieved an accuracy of 98.67%, outperforming recent baselines on the same dataset. Statistical testing using the Wilcoxon Signed Rank Test indicates that the improvement over baseline configurations is significant, and inference latency remains below 400 milliseconds. These results suggest that the proposed dual pipeline design supports accurate and efficient automated screening for non invasive sleep disorder risk stratification.
[LG-10] Detecting Autism Spectrum Disorder with Deep Eye Movement Features
链接: https://arxiv.org/abs/2601.05812
作者: Zhanpei Huang,Taochen Chen,Fangqing Gu,Yiqun Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted to CIS 2025
Abstract:Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by deficits in social communication and behavioral patterns. Eye movement data offers a non-invasive diagnostic tool for ASD detection, as it is inherently discrete and exhibits short-term temporal dependencies, reflecting localized gaze focus between fixation points. These characteristics enable the data to provide deeper insights into subtle behavioral markers, distinguishing ASD-related patterns from typical development. Eye movement signals mainly contain short-term and localized dependencies. However, despite the widespread application of stacked attention layers in Transformer-based models for capturing long-range dependencies, our experimental results indicate that this approach yields only limited benefits when applied to eye movement data. This may be because discrete fixation points and short-term dependencies in gaze focus reduce the utility of global attention mechanisms, making them less efficient than architectures focusing on local temporal patterns. To efficiently capture subtle and complex eye movement patterns, distinguishing ASD from typically developing (TD) individuals, a discrete short-term sequential (DSTS) modeling framework is designed with Class-aware Representation and Imbalance-aware Mechanisms. Through extensive experiments on several eye movement datasets, DSTS outperforms both traditional machine learning techniques and more sophisticated deep learning models.
[LG-11] Learning Reconstructive Embeddings in Reproducing Kernel Hilbert Spaces via the Representer Theorem
链接: https://arxiv.org/abs/2601.05811
作者: Enrique Feito-Casares,Francisco M. Melgarejo-Meseguer,José-Luis Rojo-Álvarez
类目: Machine Learning (cs.LG)
*备注:
Abstract:Motivated by the growing interest in representation learning approaches that uncover the latent structure of high-dimensional data, this work proposes new algorithms for reconstruction-based manifold learning within Reproducing-Kernel Hilbert Spaces (RKHS). Each observation is first reconstructed as a linear combination of the other samples in the RKHS, by optimizing a vector form of the Representer Theorem for their autorepresentation property. A separable operator-valued kernel extends the formulation to vector-valued data while retaining the simplicity of a single scalar similarity function. A subsequent kernel-alignment task projects the data into a lower-dimensional latent space whose Gram matrix aims to match the high-dimensional reconstruction kernel, thus transferring the auto-reconstruction geometry of the RKHS to the embedding. Therefore, the proposed algorithms represent an extended approach to the autorepresentation property, exhibited by many natural data, by using and adapting well-known results of Kernel Learning Theory. Numerical experiments on both simulated (concentric circles and swiss-roll) and real (cancer molecular activity and IoT network intrusions) datasets provide empirical evidence of the practical effectiveness of the proposed approach.
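The autorepresentation step above, reconstructing each φ(x_i) from the other samples in the RKHS, reduces to kernel-only linear algebra: minimising ||φ(x_i) − Σ_{j≠i} w_j φ(x_j)||² gives the system K_{−i,−i} w = k_{−i,i}. A scalar-kernel sketch (the paper's operator-valued, vector-output formulation is more general; the ridge term is an added stabiliser):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def auto_representation_weights(K, ridge=1e-6):
    """For each sample i, solve for the weights that best reconstruct
    phi(x_i) from the other samples in the RKHS. The objective
    ||phi(x_i) - sum_{j != i} w_j phi(x_j)||^2 depends only on the Gram
    matrix, so each row solves K_{-i,-i} w = k_{-i,i}. A small ridge keeps
    the system well-posed. Illustrative scalar-kernel sketch."""
    n = K.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        Kii = K[np.ix_(idx, idx)] + ridge * np.eye(n - 1)
        W[i, idx] = np.linalg.solve(Kii, K[idx, i])
    return W

X = np.random.default_rng(0).standard_normal((8, 2))
K = rbf_kernel(X, gamma=0.5)
W = auto_representation_weights(K)
```

The subsequent kernel-alignment stage would then seek a low-dimensional embedding whose Gram matrix matches the reconstruction geometry encoded in `W`.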
[LG-12] FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching
链接: https://arxiv.org/abs/2601.05684
作者: Hongyaoxing Gul,Lijuan Hu,Shuzi Niu,Fangfang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Traditional post-training quantization (PTQ) is considered an effective approach to reduce model size and accelerate inference of large-scale language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to determine a compromise rank for diverse data and layers in large models, failing to exploit their full potential. Additionally, the current SVD-based low-rank approximation compounds the computational overhead. In this work, we thoroughly analyze the varying effectiveness of low-rank approximation across different layers in representative models. Accordingly, we introduce Flexible Low-Rank Quantization (FLRQ), a novel solution designed to quickly identify the accuracy-optimal ranks and aggregate them to achieve minimal storage combinations. FLRQ comprises two powerful components, Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC). R1-FLR applies the R1-Sketch with Gaussian projection for the fast low-rank approximation, enabling outlier-aware rank extraction for each layer. Meanwhile, BLC aims at minimizing the low-rank quantization error under the scaling and clipping strategy through an iterative method. FLRQ demonstrates strong effectiveness and robustness in comprehensive experiments, achieving state-of-the-art performance in both quantization quality and algorithm efficiency.
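The Gaussian-projection sketch that R1-FLR builds on is a standard randomized low-rank recipe: multiply the weight matrix by a thin Gaussian test matrix, orthonormalise to get a range basis, then SVD the small projected matrix. The sketch below uses the classic Halko-style range finder as a stand-in (the abstract does not give R1-Sketch's exact construction):

```python
import numpy as np

def gaussian_sketch_lowrank(W, rank, oversample=5, seed=0):
    """Randomized low-rank approximation via a Gaussian sketch. A generic
    stand-in for the paper's R1-Sketch: sample a Gaussian test matrix,
    orthonormalise W @ Omega into a range basis Q, and truncate the SVD of
    the small matrix Q^T W. Much cheaper than a full SVD of W."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((W.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(W @ omega)          # orthonormal basis for the sketched range
    B = Q.T @ W                              # small (rank+oversample) x n matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]
    return Q @ U, s, Vt                      # W ~= (Q U) diag(s) Vt

rng = np.random.default_rng(1)
W = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 100))  # exactly rank 8
U, s, Vt = gaussian_sketch_lowrank(W, rank=8)
```

For an exactly rank-8 matrix the sketched factorisation recovers it to machine precision; for real weight matrices, per-layer error from such sketches is what a flexible rank-selection rule can compare cheaply.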
[LG-13] Do Sparse Autoencoders Identify Reasoning Features in Language Models?
链接: https://arxiv.org/abs/2601.05679
作者: George Ma,Zhongyuan Liang,Irene Y. Chen,Somayeh Sojoudi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). Starting from features selected using standard contrastive activation methods, we introduce a falsification-oriented framework that combines causal token injection experiments and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that identified reasoning features are highly sensitive to token-level interventions. Injecting a small number of feature-associated tokens into non-reasoning text is sufficient to elicit strong activation for 59% to 94% of features, indicating reliance on lexical artifacts. For the remaining features that are not explained by simple token triggers, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields minimal changes or slight degradations in benchmark performance. Together, these results suggest that SAE features identified by contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.
[LG-14] Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models
链接: https://arxiv.org/abs/2601.05663
作者: Gianmario Voria,Moses Openja,Foutse Khomh,Gemma Catolino,Fabio Palomba
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:The advent of transformer-based language models has reshaped how AI systems process and generate text. In software engineering (SE), these models now support diverse activities, accelerating automation and decision-making. Yet, evidence shows that these models can reproduce or amplify social biases, raising fairness concerns. Recent work on neuron editing has shown that internal activations in pre-trained transformers can be traced and modified to alter model behavior. Building on the concept of knowledge neurons, neurons that encode factual information, we hypothesize the existence of biased neurons that capture stereotypical associations within pre-trained transformers. To test this hypothesis, we build a dataset of biased relations, i.e., triplets encoding stereotypes across nine bias types, and adapt neuron attribution strategies to trace and suppress biased neurons in BERT models. We then assess the impact of suppression on SE tasks. Our findings show that biased knowledge is localized within small neuron subsets, and suppressing them substantially reduces bias with minimal performance loss. This demonstrates that bias in transformers can be traced and mitigated at the neuron level, offering an interpretable approach to fairness in SE.
[LG-15] From Global to Local: Cluster-Aware Learning for Wi-Fi Fingerprinting Indoor Localisation
链接: https://arxiv.org/abs/2601.05650
作者: Miguel Matey-Sanz,Joaquín Torres-Sospedra,Joaquín Huerta,Sergio Trilles
类目: Machine Learning (cs.LG)
*备注: 20 pages, 9 figures, 6 tables
Abstract:Wi-Fi fingerprinting remains one of the most practical solutions for indoor positioning, however, its performance is often limited by the size and heterogeneity of fingerprint datasets, strong Received Signal Strength Indicator variability, and the ambiguity introduced in large and multi-floor environments. These factors significantly degrade localisation accuracy, particularly when global models are applied without considering structural constraints. This paper introduces a clustering-based method that structures the fingerprint dataset prior to localisation. Fingerprints are grouped using either spatial or radio features, and clustering can be applied at the building or floor level. In the localisation phase, a clustering estimation procedure based on the strongest access points assigns unseen fingerprints to the most relevant cluster. Localisation is then performed only within the selected clusters, allowing learning models to operate on reduced and more coherent subsets of data. The effectiveness of the method is evaluated on three public datasets and several machine learning models. Results show a consistent reduction in localisation errors, particularly under building-level strategies, but at the cost of reducing the floor detection accuracy. These results demonstrate that explicitly structuring datasets through clustering is an effective and flexible approach for scalable indoor positioning.
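The clustering-estimation step above, assigning an unseen fingerprint to a community via its strongest access points, can be sketched as a simple set-overlap vote. The value of k and the overlap scoring are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def assign_cluster(fingerprint, cluster_top_aps, k=3):
    """Assign an unseen RSSI fingerprint to the cluster whose set of
    representative access points best overlaps the fingerprint's k
    strongest APs. `cluster_top_aps` maps cluster id -> set of AP indices;
    k and the overlap score are illustrative choices."""
    strongest = set(np.argsort(fingerprint)[::-1][:k])       # indices of k largest RSSI
    scores = {c: len(strongest & aps) for c, aps in cluster_top_aps.items()}
    return max(scores, key=scores.get)

fp = np.array([-80.0, -45.0, -90.0, -50.0, -60.0, -95.0])    # RSSI in dBm per AP
clusters = {0: {0, 2, 5}, 1: {1, 3, 4}}                      # per-cluster strongest APs
best = assign_cluster(fp, clusters)
```

Localisation then runs only within the selected cluster's fingerprints, which is where the reported error reductions come from.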
[LG-16] Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks
链接: https://arxiv.org/abs/2601.05616
作者: ShaoZhen Liu,Xinting Huang,Houwen Peng,Xin Chen,Xinyang Song,Qi Li,Zhenan Sun
类目: Machine Learning (cs.LG)
*备注:
Abstract:In recent years, large language models (LLMs) have demonstrated significant potential in complex reasoning tasks like mathematical problem-solving. However, existing research predominantly relies on reinforcement learning (RL) frameworks while overlooking supervised fine-tuning (SFT) methods. This paper proposes a new two-stage training framework that enhances models’ self-correction capabilities through self-generated long chain-of-thought (CoT) data. During the first stage, a multi-turn dialogue strategy guides the model to generate CoT data incorporating verification, backtracking, subgoal decomposition, and backward reasoning, with predefined rules filtering high-quality samples for supervised fine-tuning. The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize data distribution, strengthening the model’s ability to handle complex problems. The approach generates reasoning chains extended over 4 times longer while maintaining strong scalability, proving that SFT effectively activates models’ intrinsic reasoning capabilities and provides a resource-efficient pathway for complex task optimization. Experimental results demonstrate performance improvements on mathematical benchmarks including GSM8K and MATH500, with the fine-tuned model achieving a substantial improvement on competition-level problems like AIME24. Code will be open-sourced.
[LG-17] Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR
链接: https://arxiv.org/abs/2601.05607
作者: Zijun Min,Bingshuai Liu,Ante Wang,Long Zhang,Anxiang Zeng,Haibo Zhang,Jinsong Su
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks. However, existing RLVR algorithms focus on different granularities, and each has complementary strengths and limitations. Group Relative Policy Optimization (GRPO) updates the policy with token-level importance ratios, which preserves fine-grained credit assignment but often suffers from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies single sequence-level importance ratios across all tokens in a response that better matches sequence-level rewards, but sacrifices token-wise credit assignment. In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting mechanisms. We explore two variants of the mixing mechanism, including an averaged mixing and an entropy-guided mixing. To further stabilize training, we employ a branch-specific clipping strategy that constrains token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update. Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO. We will release our code upon acceptance of this paper.
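The core of DHPO, mixing token-level (GRPO-style) and length-normalised sequence-level (GSPO-style) importance ratios after branch-specific clipping, can be sketched numerically. The fixed mixing weight and clip ranges below are illustrative assumptions; the paper's averaged and entropy-guided variants would replace the constant `alpha`:

```python
import numpy as np

def hybrid_ratios(logp_new, logp_old, alpha=0.5, eps_tok=0.2, eps_seq=0.1):
    """Sketch of DHPO-style ratio mixing for one response. Token-level
    importance ratios are clipped in one trust region, the sequence-level
    ratio (geometric mean of token ratios, as in GSPO) in another, and the
    two branches are averaged with weight alpha. Constants are illustrative."""
    delta = logp_new - logp_old
    tok = np.exp(delta)                      # per-token importance ratios
    seq = np.exp(delta.mean())               # length-normalised sequence ratio
    tok_c = np.clip(tok, 1 - eps_tok, 1 + eps_tok)   # token-branch trust region
    seq_c = np.clip(seq, 1 - eps_seq, 1 + eps_seq)   # sequence-branch trust region
    return alpha * tok_c + (1 - alpha) * seq_c       # one mixed ratio per token

delta = np.log(np.array([1.5, 0.5, 1.0]))  # log-ratio per token of one response
mixed = hybrid_ratios(delta, np.zeros(3))
```

Clipping each branch before mixing is what keeps an outlier token ratio (or a drifted sequence ratio) from dominating the update, per the stabilisation argument above.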
[LG-18] Good Allocations from Bad Estimates
链接: https://arxiv.org/abs/2601.05597
作者: Sílvia Casacuberta,Moritz Hardt
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Conditional average treatment effect (CATE) estimation is the de facto gold standard for targeting a treatment to a heterogeneous population. The method estimates treatment effects up to an error ε > 0 in each of M different strata of the population, targeting individuals in decreasing order of estimated treatment effect until the budget runs out. In general, this method requires O(M/ε²) samples. This is best possible if the goal is to estimate all treatment effects up to an ε error. In this work, we show how to achieve the same total treatment effect as CATE with only O(M/ε) samples for natural distributions of treatment effects. The key insight is that coarse estimates suffice for near-optimal treatment allocations. In addition, we show that budget flexibility can further reduce the sample complexity of allocation. Finally, we evaluate our algorithm on various real-world RCT datasets. In all cases, it finds nearly optimal treatment allocations with surprisingly few samples. Our work highlights the fundamental distinction between treatment effect estimation and treatment allocation: the latter requires far fewer samples.
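The key insight, that coarse estimates suffice for near-optimal allocation, is easy to see in a synthetic simulation: rank strata by deliberately noisy effect estimates and compare the achieved total effect against the oracle allocation. All numbers below are synthetic illustrations, not the paper's algorithm or data:

```python
import numpy as np

rng = np.random.default_rng(0)
M, budget = 100, 30
true = rng.uniform(0, 1, M)                  # true per-stratum treatment effects
coarse = true + rng.normal(0, 0.1, M)        # coarse estimates with sizable error

chosen = np.argsort(coarse)[::-1][:budget]   # treat the top strata by estimate
optimal = np.sort(true)[::-1][:budget].sum() # oracle allocation's total effect
achieved = true[chosen].sum()
ratio = achieved / optimal
```

Mistakes from noisy ranking only swap strata whose true effects are close to the budget threshold, so the total-effect loss stays small, exactly the distinction between estimating every effect and allocating well.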
[LG-19] PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
链接: https://arxiv.org/abs/2601.05593
作者: Jingcheng Hu,Yinmin Zhang,Shijie Shang,Xiaobo Yang,Yue Peng,Zhewei Huang,Hebin Zhou,Xin Wu,Jie Cheng,Fanqi Wan,Xiangwen Kong,Chengyuan Yao,Kaiwen Yan,Ailin Huang,Hongyu Zhou,Qi Han,Zheng Ge,Daxin Jiang,Xiangyu Zhang,Heung-Yeung Shum
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5’s 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
[LG-20] Poisson Hyperplane Processes with Rectified Linear Units
链接: https://arxiv.org/abs/2601.05586
作者: Shufei Ge,Shijia Wang,Lloyd Elliott
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Neural networks have shown state-of-the-art performances in various classification and regression tasks. Rectified linear units (ReLU) are often used as activation functions for the hidden layers in a neural network model. In this article, we establish the connection between the Poisson hyperplane processes (PHP) and two-layer ReLU neural networks. We show that the PHP with a Gaussian prior is an alternative probabilistic representation to a two-layer ReLU neural network. In addition, we show that a two-layer neural network constructed by PHP is scalable to large-scale problems via the decomposition propositions. Finally, we propose an annealed sequential Monte Carlo algorithm for Bayesian inference. Our numerical experiments demonstrate that our proposed method outperforms the classic two-layer ReLU neural network. The implementation of our proposed model is available at this https URL.
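The PHP-to-ReLU correspondence in the abstract can be sketched concretely: sample random hyperplanes (unit normals with offsets; uniform offsets here are a simplification of the Poisson process's intensity measure) and use ReLU distances to them as hidden features, combined with Gaussian output weights:

```python
import numpy as np

def php_relu_features(X, n_planes, seed=0):
    """Sample random hyperplanes and build the two-layer ReLU feature map
    h_k(x) = relu(a_k . x + b_k). Unit normals with uniform offsets are an
    illustrative simplification of a Poisson hyperplane process; with a
    Gaussian prior on the output weights this mirrors the PHP construction
    the abstract describes."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_planes, X.shape[1]))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit hyperplane normals
    b = rng.uniform(-1, 1, n_planes)                # hyperplane offsets
    return np.maximum(X @ A.T + b, 0.0)             # ReLU hidden activations

X = np.random.default_rng(1).standard_normal((5, 3))
H = php_relu_features(X, n_planes=10)
w = np.random.default_rng(2).standard_normal(10)    # Gaussian prior on output weights
f = H @ w                                           # two-layer ReLU network output
```

Bayesian inference over `w` (and the hyperplane configuration) is what the paper's annealed sequential Monte Carlo algorithm targets.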
[LG-21] Learn to Evolve: Self-supervised Neural JKO Operator for Wasserstein Gradient Flow
链接: https://arxiv.org/abs/2601.05583
作者: Xue Feng,Li Wang,Deanna Needell,Rongjie Lai
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:The Jordan-Kinderlehrer-Otto (JKO) scheme provides a stable variational framework for computing Wasserstein gradient flows, but its practical use is often limited by the high computational cost of repeatedly solving the JKO subproblems. We propose a self-supervised approach for learning a JKO solution operator without requiring numerical solutions of any JKO trajectories. The learned operator maps an input density directly to the minimizer of the corresponding JKO subproblem, and can be iteratively applied to efficiently generate the gradient-flow evolution. A key challenge is that only a number of initial densities are typically available for training. To address this, we introduce a Learn-to-Evolve algorithm that jointly learns the JKO operator and its induced trajectories by alternating between trajectory generation and operator updates. As training progresses, the generated data increasingly approximates true JKO trajectories. Meanwhile, this Learn-to-Evolve strategy serves as a natural form of data augmentation, significantly enhancing the generalization ability of the learned operator. Numerical experiments demonstrate the accuracy, stability, and robustness of the proposed method across various choices of energies and initial conditions.
[LG-22] Buffered AUC maximization for scoring systems via mixed-integer optimization
链接: https://arxiv.org/abs/2601.05544
作者: Moe Shiina,Shunnosuke Ikeda,Yuichi Takano
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP)
*备注:
Abstract:A scoring system is a linear classifier composed of a small number of explanatory variables, each assigned a small integer coefficient. This system is highly interpretable and allows predictions to be made with simple manual calculations without the need for a calculator. Several previous studies have used mixed-integer optimization (MIO) techniques to develop scoring systems for binary classification; however, they have not focused on directly maximizing AUC (i.e., area under the receiver operating characteristic curve), even though AUC is recognized as an essential evaluation metric for scoring systems. Our goal herein is to establish an effective MIO framework for constructing scoring systems that directly maximize the buffered AUC (bAUC) as the tightest concave lower bound on AUC. Our optimization model is formulated as a mixed-integer linear optimization (MILO) problem that maximizes bAUC subject to a group sparsity constraint for limiting the number of questions in the scoring system. Computational experiments using publicly available real-world datasets demonstrate that our MILO method can build scoring systems with superior AUC values compared to the baseline methods based on regularization and stepwise regression. This research contributes to the advancement of MIO techniques for developing highly interpretable classification models.
[LG-23] Toward an Integrated Cross-Urban Accident Prevention System: A Multi-Task Spatial-Temporal Learning Framework for Urban Safety Management
链接: https://arxiv.org/abs/2601.05521
作者: Jiayu Fang,Zhiqi Shao,Haoning Xi,Boris Choy,Junbin Gao
类目: Machine Learning (cs.LG)
*备注: 38 pages, 18 figures
Abstract:The development of a cross-city accident prevention system is particularly challenging due to the heterogeneity, inconsistent reporting, and inherently clustered, sparse, cyclical, and noisy nature of urban accident data. These intrinsic data properties, combined with fragmented governance and incompatible reporting standards, have long hindered the creation of an integrated, cross-city accident prevention framework. To address this gap, we propose the Mamba Local-Attention Spatial-Temporal Network (MLA-STNet), a unified system that formulates accident risk prediction as a multi-task learning problem across multiple cities. MLA-STNet integrates two complementary modules: (i) the Spatio-Temporal Geographical Mamba-Attention (STG-MA), which suppresses unstable spatio-temporal fluctuations and strengthens long-range temporal dependencies; and (ii) the Spatio-Temporal Semantic Mamba-Attention (STS-MA), which mitigates cross-city heterogeneity through a shared-parameter design that jointly trains all cities while preserving individual semantic representation spaces. We validate the proposed framework through 75 experiments under two forecasting scenarios, full-day and high-frequency accident periods, using real-world datasets from New York City and Chicago. Compared with the state-of-the-art baselines, MLA-STNet achieves up to 6% lower RMSE, 8% higher Recall, and 5% higher MAP, while maintaining less than 1% performance variation under 50% input noise. These results demonstrate that MLA-STNet effectively unifies heterogeneous urban datasets within a scalable, robust, and interpretable Cross-City Accident Prevention System, paving the way for coordinated and data-driven urban safety management.
[LG-24] Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models
链接: https://arxiv.org/abs/2601.05445
作者: Songze Li,Ruishi He,Xiaojun Jia,Jun Wang,Zhihui Fu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) face a significant threat from multi-turn jailbreak attacks, where adversaries progressively steer conversations to elicit harmful outputs. However, the practical effectiveness of existing attacks is undermined by several critical limitations: they struggle to maintain a coherent progression over long interactions, often losing track of what has been accomplished and what remains to be done; they rely on rigid or pre-defined patterns, and fail to adapt to the LLM’s dynamic and unpredictable conversational state. To address these shortcomings, we introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self-improving approach. Mastermind operates in a closed loop of planning, execution, and reflection, enabling it to autonomously build and refine its knowledge of model vulnerabilities through interaction. It employs a hierarchical planning architecture that decouples high-level attack objectives from low-level tactical execution, ensuring long-term focus and coherence. This planning is guided by a knowledge repository that autonomously discovers and refines effective attack patterns by reflecting on interactive experiences. Mastermind leverages this accumulated knowledge to dynamically recombine and adapt attack vectors, dramatically improving both effectiveness and resilience. We conduct comprehensive experiments against state-of-the-art models, including GPT-5 and Claude 3.7 Sonnet. The results demonstrate that Mastermind significantly outperforms existing baselines, achieving substantially higher attack success rates and harmfulness ratings. Moreover, our framework exhibits notable resilience against multiple advanced defense mechanisms.
[LG-25] Prediction of Fault Slip Tendency in CO_2 Storage using Data-space Inversion
链接: https://arxiv.org/abs/2601.05431
作者: Xiaowen He,Su Jiang,Louis J. Durlofsky
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurately assessing the potential for fault slip is essential in many subsurface operations. Conventional model-based history matching methods, which entail the generation of posterior geomodels calibrated to observed data, can be challenging to apply in coupled flow-geomechanics problems with faults. In this work, we implement a variational autoencoder (VAE)-based data-space inversion (DSI) framework to predict pressure, stress and strain fields, and fault slip tendency, in CO_2 storage projects. The main computations required by the DSI workflow entail the simulation of O(1000) prior geomodels. The posterior distributions for quantities of interest are then inferred directly from prior simulation results and observed data, without the need to generate posterior geomodels. The model used here involves a synthetic 3D system with two faults. Realizations of heterogeneous permeability and porosity fields are generated using geostatistical software, and uncertain geomechanical and fault parameters are sampled for each realization from prior distributions. Coupled flow-geomechanics simulations for these geomodels are conducted using GEOS. A VAE with stacked convolutional long short-term memory layers is trained, using the prior simulation results, to represent pressure, strain, effective normal stress and shear stress fields in terms of latent variables. The VAE parameterization is used with DSI for posterior predictions, with monitoring wells providing observed pressure and strain data. Posterior results for synthetic true models demonstrate that the DSI-VAE framework gives accurate predictions for pressure, strain, and stress fields and for fault slip tendency. The framework is also shown to reduce uncertainty in key geomechanical and fault parameters.
[LG-26] Efficient Inference for Noisy LLM-as-a-Judge Evaluation
链接: https://arxiv.org/abs/2601.05420
作者: Yiqun T Chen,Sizhu Lu,Sijia Li,Moran Guo,Shengyi Li
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:
Abstract:Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as “LLM-as-a-judge.” In practice, LLM judges are imperfect predictors of the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators and characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at this https URL.
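The two estimator families being compared have simple textbook forms that can be sketched directly (this illustrates the classical Rogan-Gladen correction and the basic PPI mean estimator, not the paper's EIF-based efficient versions; the synthetic judge setup is an assumption for illustration):

```python
import numpy as np

def rogan_gladen(p_obs, sensitivity, specificity):
    """Misclassification-corrected proportion from a noisy binary judge."""
    return (p_obs + specificity - 1.0) / (sensitivity + specificity - 1.0)

def ppi_mean(judge_unlabeled, judge_labeled, gold_labeled):
    """Prediction-powered mean: judge average plus a rectifier estimated
    from the small gold-labeled set."""
    rectifier = np.mean(gold_labeled - judge_labeled)
    return np.mean(judge_unlabeled) + rectifier

rng = np.random.default_rng(1)
true_rate, se, sp = 0.6, 0.9, 0.8
y_u = rng.random(20000) < true_rate        # unlabeled ground truth (hidden)
y_l = rng.random(500) < true_rate          # small gold-labeled set
# noisy judge: correct with prob se on positives, sp on negatives
judge = lambda y: np.where(y, rng.random(y.size) < se, rng.random(y.size) < (1 - sp))
j_u, j_l = judge(y_u), judge(y_l)

rg = rogan_gladen(j_u.mean(), se, sp)
pp = ppi_mean(j_u.astype(float), j_l.astype(float), y_l.astype(float))
print(round(rg, 3), round(pp, 3))   # both close to the true rate 0.6
```

Note the Rogan-Gladen correction needs the judge's sensitivity and specificity, while PPI only needs paired judge/gold labels, which is one reason the two families behave differently.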
[LG-27] Interactive Distillation for Cooperative Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2601.05407
作者: Minwoo Cho,Batuhan Altundas,Matthew Gombolay
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Knowledge distillation (KD) has the potential to accelerate MARL by employing a centralized teacher for decentralized students but faces key bottlenecks. Specifically, there are (1) challenges in synthesizing high-performing teaching policies in complex domains, (2) difficulties when teachers must reason in out-of-distribution (OOD) states, and (3) mismatches between the decentralized students’ and the centralized teacher’s observation spaces. To address these limitations, we propose HINT (Hierarchical INteractive Teacher-based transfer), a novel KD framework for MARL in a centralized training, decentralized execution setup. By leveraging hierarchical RL, HINT provides a scalable, high-performing teacher. Our key innovation, pseudo off-policy RL, enables the teacher policy to be updated using both teacher and student experience, thereby improving OOD adaptation. HINT also applies performance-based filtering to retain only outcome-relevant guidance, reducing observation mismatches. We evaluate HINT on challenging cooperative domains (e.g., FireCommander for resource allocation, MARINE for tactical combat). Across these benchmarks, HINT outperforms baselines, achieving improvements of 60% to 165% in success rate.
[LG-28] DynaSTy: A Framework for SpatioTemporal Node Attribute Prediction in Dynamic Graphs
链接: https://arxiv.org/abs/2601.05391
作者: Namrata Banerji,Tanya Berger-Wolf
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate multistep forecasting of node-level attributes on dynamic graphs is critical for applications ranging from financial trust networks to biological networks. Existing spatiotemporal graph neural networks typically assume a static adjacency matrix. In this work, we propose an end-to-end dynamic edge-biased spatiotemporal model that ingests a multi-dimensional timeseries of node attributes and a timeseries of adjacency matrices, to predict multiple future steps of node attributes. At each time step, our transformer-based model injects the given adjacency as an adaptable attention bias, allowing the model to focus on relevant neighbors as the graph evolves. We further deploy a masked node-time pretraining objective that primes the encoder to reconstruct missing features, and train with scheduled sampling and a horizon-weighted loss to mitigate compounding error over long horizons. Unlike prior work, our model accommodates dynamic graphs that vary across input samples, enabling forecasting in multi-system settings such as brain networks across different subjects, financial systems in different contexts, or evolving social systems. Empirical results demonstrate that our method consistently outperforms strong baselines on Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
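The core mechanism, injecting the time-varying adjacency as an additive attention bias, can be sketched in a few lines (a generic single-head version; the scaling weight `alpha` and all other names are assumptions, not the paper's parameterization):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def edge_biased_attention(X, A, Wq, Wk, Wv, alpha=1.0):
    """Scaled dot-product attention over nodes, with the current adjacency
    matrix added to the attention scores so the model can focus on graph
    neighbors as the graph evolves."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + alpha * A
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))                 # node attributes at one step
A = (rng.random((n, n)) < 0.3).astype(float)    # adjacency at the same step
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
H = edge_biased_attention(X, A, Wq, Wk, Wv)
print(H.shape)  # (6, 8)
```

Because A enters only as an additive bias, a different adjacency can be supplied at every time step and for every input sample, which is what enables the multi-system forecasting setting described above.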
[LG-29] Imitation Learning for Combinatorial Optimisation under Uncertainty
链接: https://arxiv.org/abs/2601.05383
作者: Prakash Gawas,Antoine Legrain,Louis-Martin Rousseau
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Imitation learning (IL) provides a data-driven framework for approximating policies for large-scale combinatorial optimisation problems formulated as sequential decision problems (SDPs), where exact solution methods are computationally intractable. A central but underexplored aspect of IL in this context is the role of the expert that generates training demonstrations. Existing studies employ a wide range of expert constructions, yet lack a unifying framework to characterise their modelling assumptions, computational properties, and impact on learning performance. This paper introduces a systematic taxonomy of experts for IL in combinatorial optimisation under uncertainty. Experts are classified along three dimensions: (i) their treatment of uncertainty, including myopic, deterministic, full-information, two-stage stochastic, and multi-stage stochastic formulations; (ii) their level of optimality, distinguishing task-optimal and approximate experts; and (iii) their interaction mode with the learner, ranging from one-shot supervision to iterative, interactive schemes. Building on this taxonomy, we propose a generalised Dataset Aggregation (DAgger) algorithm that supports multiple expert queries, expert aggregation, and flexible interaction strategies. The proposed framework is evaluated on a dynamic physician-to-patient assignment problem with stochastic arrivals and capacity constraints. Computational experiments compare learning outcomes across expert types and interaction regimes. The results show that policies learned from stochastic experts consistently outperform those learned from deterministic or full-information experts, while interactive learning improves solution quality using fewer expert demonstrations. Aggregated deterministic experts provide an effective alternative when stochastic optimisation becomes computationally challenging.
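The classic DAgger loop that the paper generalises can be sketched on a toy 1D task (everything below is a hypothetical illustration: the expert, the logistic learner, and the dynamics are stand-ins, not the physician-to-patient assignment problem):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(s):
    """Hypothetical task-optimal expert for a toy 1D task: steer toward 0."""
    return (s > 0).astype(float)          # action 1 = step left, 0 = step right

def policy_prob(theta, s):
    return 1.0 / (1.0 + np.exp(-(theta[0] * s + theta[1])))

def rollout(theta, s0, T=20):
    """Roll out the learner and record the states it actually visits."""
    states, s = [], s0
    for _ in range(T):
        states.append(s)
        a = 1.0 if policy_prob(theta, s) > 0.5 else 0.0
        s = s - (2 * a - 1) * 0.2 + rng.normal(0, 0.05)
    return np.array(states)

def fit_logistic(S, A, steps=200, lr=0.5):
    """Train the learner (a logistic policy) on the aggregated dataset."""
    theta, X = np.zeros(2), np.stack([S, np.ones_like(S)], axis=1)
    for _ in range(steps):
        theta += lr * X.T @ (A - 1.0 / (1.0 + np.exp(-X @ theta))) / len(S)
    return theta

# One-shot supervision first, then iterative DAgger-style aggregation:
S = rng.uniform(-2, 2, 50)                        # initial expert demonstrations
A = expert(S)
theta = fit_logistic(S, A)
for _ in range(5):
    visited = rollout(theta, rng.uniform(-2, 2))  # learner-induced states
    S = np.concatenate([S, visited])
    A = np.concatenate([A, expert(visited)])      # expert queried interactively
    theta = fit_logistic(S, A)

acc = np.mean((policy_prob(theta, S) > 0.5) == A)
print(round(acc, 2))
```

Querying the expert on learner-visited states, rather than only on expert trajectories, is exactly the "interactive" interaction mode in the taxonomy above; the paper's generalised variant additionally allows multiple and aggregated experts per query.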
[LG-30] Inverting Non-Injective Functions with Twin Neural Network Regression
链接: https://arxiv.org/abs/2601.05378
作者: Sebastian J. Wetzel
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Non-injective functions are not invertible. However, non-injective functions can be restricted to sub-domains on which they are locally injective and surjective, and thus invertible, provided the input and output spaces have the same dimensionality. Further, even if the dimensionalities do not match, it is often possible to choose a preferred solution from the many possible solutions. Twin neural network regression is naturally capable of incorporating these properties to invert non-injective functions. Twin neural network regression is trained to predict adjustments to well-known input variables \mathbf{x}^{\text{anchor}} to obtain an estimate for an unknown \mathbf{x}^{\text{new}} under a change of the target variable from \mathbf{y}^{\text{anchor}} to \mathbf{y}^{\text{new}}. In combination with k-nearest neighbor search, I propose a deterministic framework that finds input parameters to a given target variable of non-injective functions. The method is demonstrated by inverting non-injective functions describing toy problems and robot arm control that are (a) defined by data or (b) given as mathematical formulas.
[LG-31] The Kernel Manifold: A Geometric Approach to Gaussian Process Model Selection
链接: https://arxiv.org/abs/2601.05371
作者: Md Shafiqul Islam,Shakti Prasad Padhy,Douglas Allaire,Raymundo Arróyave
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Gaussian Process (GP) regression is a powerful nonparametric Bayesian framework, but its performance depends critically on the choice of covariance kernel. Selecting an appropriate kernel is therefore central to model quality, yet remains one of the most challenging and computationally expensive steps in probabilistic modeling. We present a Bayesian optimization framework built on kernel-of-kernels geometry, using expected divergence-based distances between GP priors to explore kernel space efficiently. A multidimensional scaling (MDS) embedding of this distance matrix maps a discrete kernel library into a continuous Euclidean manifold, enabling smooth BO. In this formulation, the input space comprises kernel compositions, the objective is the log marginal likelihood, and featurization is given by the MDS coordinates. When the divergence yields a valid metric, the embedding preserves geometry and produces a stable BO landscape. We demonstrate the approach on synthetic benchmarks, real-world time-series datasets, and an additive manufacturing case study predicting melt-pool geometry, achieving superior predictive accuracy and uncertainty calibration relative to baselines including Large Language Model (LLM)-guided search. This framework establishes a reusable probabilistic geometry for kernel search, with direct relevance to GP modeling and deep kernel learning.
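The MDS embedding step that turns a kernel distance matrix into a continuous search space is standard classical MDS: double-center the squared distances and take the top eigenpairs. A self-contained sketch (the line-graph toy data is an assumption; the paper uses divergence-based distances between GP priors instead):

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed a symmetric distance matrix into R^k via classical MDS:
    double-center the squared distances, then keep the top-k eigenpairs."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # Gram matrix of the embedding
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]          # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy check: points on a line; a 1D embedding reproduces all distances.
x = np.array([0.0, 1.0, 3.0, 6.0])
D = np.abs(x[:, None] - x[None, :])
Y = classical_mds(D, k=1)                  # shape (4, 1)
D_hat = np.abs(Y - Y.T)
print(np.allclose(D_hat, D))               # prints True
```

When the divergence between GP priors is a valid metric, as the abstract notes, this embedding preserves the kernel-of-kernels geometry, giving Bayesian optimization a smooth Euclidean landscape to explore.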
[LG-32] GlyRAG: Context-Aware Retrieval-Augmented Framework for Blood Glucose Forecasting
链接: https://arxiv.org/abs/2601.05353
作者: Shovito Barua Soumma,Hassan Ghasemzadeh
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Accurate forecasting of blood glucose from CGM is essential for preventing dysglycemic events, thus enabling proactive diabetes management. However, current forecasting models treat blood glucose readings captured using CGMs as a numerical sequence, either ignoring context or relying on additional sensors/modalities that are difficult to collect and deploy at scale. Recently, LLMs have shown promise for time-series forecasting tasks, yet their role as agentic context extractors in diabetes care remains largely unexplored. To address these limitations, we propose GlyRAG, a context-aware, retrieval-augmented forecasting framework that derives semantic understanding of blood glucose dynamics directly from CGM traces without requiring additional sensor modalities. GlyRAG employs an LLM as a contextualization agent to generate clinical summaries. These summaries are embedded by a language model and fused with patch-based glucose representations in a multimodal transformer architecture with a cross-translation loss aligning textual and physiological embeddings. A retrieval module then identifies similar historical episodes in the learned embedding space and uses cross-attention to integrate these case-based analogues prior to making a forecasting inference. Extensive evaluations on two T1D cohorts show that GlyRAG consistently outperforms state-of-the-art methods, achieving up to 39% lower RMSE and a further 1.7% reduction in RMSE over the baseline. Clinical evaluation shows that GlyRAG places 85% predictions in safe zones and achieves 51% improvement in predicting dysglycemic events across both cohorts. These results indicate that LLM-based contextualization and retrieval over CGM traces can enhance the accuracy and clinical reliability of long-horizon glucose forecasting without the need for extra sensors, thus supporting future agentic decision-support tools for diabetes management.
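The retrieval module's core step, finding similar historical episodes in an embedding space and fusing them with attention-style weights, can be sketched generically (all names, the softmax-weighted fusion, and the temperature are assumptions; the paper uses learned cross-attention rather than this simple weighting):

```python
import numpy as np

def retrieve_and_fuse(query, bank, k=3, temp=0.1):
    """Retrieve the k most similar past episodes by cosine similarity in the
    embedding space, then fuse them with softmax weights over similarity."""
    q = query / np.linalg.norm(query)
    B = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = B @ q
    top = np.argsort(sims)[::-1][:k]          # indices of nearest episodes
    w = np.exp(sims[top] / temp)
    w /= w.sum()
    return top, w @ bank[top]                 # indices and fused representation

rng = np.random.default_rng(0)
bank = rng.standard_normal((100, 16))         # embeddings of past CGM episodes
query = bank[42] + 0.05 * rng.standard_normal(16)  # query near episode 42
idx, fused = retrieve_and_fuse(query, bank)
print(idx[0])  # 42: the perturbed episode is retrieved first
```

The fused case-based representation would then be concatenated with (or attended to by) the current glucose representation before the forecasting head.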
[LG-33] When the Server Steps In: Calibrated Updates for Fair Federated Learning
链接: https://arxiv.org/abs/2601.05352
作者: Tianrun Yu,Kaixiang Zhao,Cheng Zhang,Anjun Gao,Yueyang Quan,Zhuqing Liu,Minghong Fang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:
Abstract:Federated learning (FL) has emerged as a transformative distributed learning paradigm, enabling multiple clients to collaboratively train a global model under the coordination of a central server without sharing their raw training data. While FL offers notable advantages, it faces critical challenges in ensuring fairness across diverse demographic groups. To address these fairness concerns, various fairness-aware debiasing methods have been proposed. However, many of these approaches either require modifications to clients’ training protocols or lack flexibility in their aggregation strategies. In this work, we address these limitations by introducing EquFL, a novel server-side debiasing method designed to mitigate bias in FL systems. EquFL operates by allowing the server to generate a single calibrated update after receiving model updates from the clients. This calibrated update is then integrated with the aggregated client updates to produce an adjusted global model that reduces bias. Theoretically, we establish that EquFL converges to the optimal global model achieved by FedAvg and effectively reduces fairness loss over training rounds. Empirically, we demonstrate that EquFL significantly mitigates bias within the system, showcasing its practical effectiveness.
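The server-side idea, blending the aggregated client updates with a single server-generated calibrated update, can be illustrated schematically (the blend weight `beta` and the way the calibration direction is obtained here are assumptions for illustration; EquFL's actual calibration rule and fairness objective are not reproduced):

```python
import numpy as np

def aggregate_with_calibration(client_updates, calib_update, beta=0.5):
    """FedAvg-style mean of client updates, blended with one server-side
    calibrated update; no change to the clients' training protocol."""
    return (1 - beta) * np.mean(client_updates, axis=0) + beta * calib_update

rng = np.random.default_rng(0)
true_update = np.array([1.0, 0.0, -1.0, 2.0])    # unbiased update direction
group_bias = np.array([0.5, 0.5, 0.5, 0.5])      # shared bias in client data
clients = np.stack([true_update + group_bias + 0.1 * rng.standard_normal(4)
                    for _ in range(5)])
# hypothetical: server estimates the unbiased direction on a small balanced holdout
calib = true_update
blended = aggregate_with_calibration(clients, calib, beta=0.5)

err_fedavg = np.linalg.norm(clients.mean(axis=0) - true_update)
err_blended = np.linalg.norm(blended - true_update)
print(err_blended < err_fedavg)   # calibration pulls the aggregate toward the unbiased direction
```

The appeal of this pattern, as the abstract notes, is that the debiasing happens entirely at the server, so no client needs to modify its local training.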
[LG-34] Generalized Canonical Polyadic Tensor Decompositions with General Symmetry
链接: https://arxiv.org/abs/2601.05335
作者: Alex Mulrooney,David Hong
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This work has been submitted to the IEEE for possible publication. 11 pages, 5 figures
Abstract:Canonical Polyadic (CP) tensor decomposition is a workhorse algorithm for discovering underlying low-dimensional structure in tensor data. This is accomplished in conventional CP decomposition by fitting a low-rank tensor to data with respect to the least-squares loss. Generalized CP (GCP) decompositions generalize this approach by allowing general loss functions that can be more appropriate, e.g., to model binary and count data or to improve robustness to outliers. However, GCP decompositions do not explicitly account for any symmetry in the tensors, which commonly arises in modern applications. For example, a tensor formed by stacking the adjacency matrices of a dynamic graph over time will naturally exhibit symmetry along the two modes corresponding to the graph nodes. In this paper, we develop a symmetric GCP (SymGCP) decomposition that allows for general forms of symmetry, i.e., symmetry along any subset of the modes. SymGCP accounts for symmetry by enforcing the corresponding symmetry in the decomposition. We derive gradients for SymGCP that enable its efficient computation via all-at-once optimization with existing tensor kernels. The form of the gradients also leads to various stochastic approximations that enable us to develop stochastic SymGCP algorithms that can scale to large tensors. We demonstrate the utility of the proposed SymGCP algorithms with a variety of experiments on both synthetic and real data.
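Enforcing symmetry by tying the factor matrix across symmetric modes can be sketched for the fully symmetric least-squares special case (a simple gradient-descent illustration; SymGCP covers symmetry on any subset of modes, general losses, and stochastic variants, none of which are shown here):

```python
import numpy as np

def sym_cp_fit(T, A_init, steps=500, lr=5e-4):
    """Gradient descent on ||T - sum_r a_r (x) a_r (x) a_r||^2 with the factor
    tied across all three modes, so the model is symmetric by construction."""
    A = A_init.copy()
    for _ in range(steps):
        That = np.einsum('ri,rj,rk->ijk', A, A, A)
        Res = T - That
        # the three symmetric terms of the tied-factor gradient collapse into one
        G = -6.0 * np.einsum('ijk,rj,rk->ri', Res, A, A)
        A -= lr * G
    return A

rng = np.random.default_rng(1)
a = rng.standard_normal((2, 4))                      # true rank-2 factors
T = np.einsum('ri,rj,rk->ijk', a, a, a)              # fully symmetric tensor
A0 = a + 0.1 * rng.standard_normal(a.shape)          # start near the truth
err0 = np.linalg.norm(T - np.einsum('ri,rj,rk->ijk', A0, A0, A0))
A = sym_cp_fit(T, A0)
err = np.linalg.norm(T - np.einsum('ri,rj,rk->ijk', A, A, A))
print(err < err0)   # the residual shrinks from its initial value
```

The collapsed gradient is the key efficiency point: because the residual tensor is itself symmetric, the three per-mode gradient contributions coincide, mirroring the structure the paper exploits for all-at-once optimization.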
[LG-35] Ontology Neural Networks for Topologically Conditioned Constraint Satisfaction
链接: https://arxiv.org/abs/2601.05304
作者: Jaehong Oh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages, 11 figures
Abstract:Neuro-symbolic reasoning systems face fundamental challenges in maintaining semantic coherence while satisfying physical and logical constraints. Building upon our previous work on Ontology Neural Networks, we present an enhanced framework that integrates topological conditioning with gradient stabilization mechanisms. The approach employs Forman-Ricci curvature to capture graph topology, Deep Delta Learning for stable rank-one perturbations during constraint projection, and Covariance Matrix Adaptation Evolution Strategy for parameter optimization. Experimental evaluation across multiple problem sizes demonstrates that the method achieves mean energy reduction to 1.15 compared to baseline values of 11.68, with 95 percent success rate in constraint satisfaction tasks. The framework exhibits seed-independent convergence and graceful scaling behavior up to twenty-node problems, suggesting that topological structure can inform gradient-based optimization without sacrificing interpretability or computational efficiency.
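The Forman-Ricci curvature used for topological conditioning has a particularly simple combinatorial form on unweighted graphs, which can be computed directly (this is the standard unweighted special case F(u,v) = 4 - deg(u) - deg(v); the paper may use a weighted or triangle-augmented variant):

```python
import numpy as np

def forman_curvature(A):
    """Combinatorial Forman-Ricci curvature of each edge of an unweighted
    graph: F(u, v) = 4 - deg(u) - deg(v)."""
    deg = A.sum(axis=1)
    n = A.shape[0]
    return {(u, v): 4 - deg[u] - deg[v]
            for u in range(n) for v in range(u + 1, n) if A[u, v]}

# 4-cycle: every node has degree 2, so every edge has curvature 0.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
print(forman_curvature(A))
```

Negative curvature flags edges between high-degree nodes (tree-like, bridge-heavy regions), which is the kind of topological signal a conditioning mechanism can feed to the optimizer.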
[LG-36] Optimizing Digital Adjudication through Social Network Analysis: An Empirical Study of Credit Card Disputes in Beijing
链接: https://arxiv.org/abs/2601.05299
作者: Chung Han Tsai,ChengTo Lin,Baowen Zhang,Qingyue Deng,Yunhui Zhao,Zhijia Song
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:
Abstract:Amid the rapid digitalization of judicial systems, the integration of big data into adjudication remains underexplored, particularly in uncovering the structural logic of legal applications. This study bridges this gap by employing social network analysis (SNA) to examine credit card disputes involving personal information protection adjudicated in Beijing. By constructing a legal citation network, we reveal the latent patterns of substantive and procedural law application. The findings demonstrate that SNA can effectively identify core legal norms and typify cases, offering a robust methodological framework for optimizing ‘Digital Court’ systems. These insights provide practical pathways for enhancing judicial efficiency and consistency through data-driven case retrieval and holistic judicial information networks.
[LG-37] Improving User Experience with Personalized Review Ranking and Summarization
链接: https://arxiv.org/abs/2601.05261
作者: Muhammad Mufti,Omar Hammad,Mahfuzur Rahman
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Online consumer reviews play a crucial role in guiding purchase decisions by offering insights into product quality, usability, and performance. However, the increasing volume of user-generated reviews has led to information overload, making it difficult for consumers to identify content that aligns with their specific preferences. Existing review ranking systems typically rely on metrics such as helpfulness votes, star ratings, and recency, but these fail to capture individual user interests and often treat textual sentiment and rating signals separately. This research addresses these limitations by proposing a personalized framework that integrates review ranking and abstractive summarization to enhance decision-making efficiency. The proposed system begins by modeling each user’s sentiment through a hybrid analysis of star ratings and review content. Simultaneously, user preferences were derived from historical reviews using sentence embeddings and clustering, forming semantic profiles aligned with thematic and sentiment dimensions. A relevance scoring algorithm matched these profiles with unseen reviews based on sentiment and aspect similarity. Top-matched reviews were then summarized to reflect individual interests. A user study with 70 participants demonstrated that the personalized approach improved satisfaction, perceived relevance, and decision-making confidence, while reducing time spent reading. The results highlight the method’s effectiveness in alleviating information overload and delivering content tailored to user-specific preferences, emphasizing its value in enhancing user experience in review-rich decision-making environments.
[LG-38] Manifold limit for the training of shallow graph convolutional neural networks
链接: https://arxiv.org/abs/2601.06025
作者: Johanna Tengler,Christoph Brune,José A. Iglesias
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC)
*备注: 44 pages, 0 figures, 1 table
Abstract:We study the discrete-to-continuum consistency of the training of shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled point clouds under a manifold assumption. Graph convolution is defined spectrally via the graph Laplacian, whose low-frequency spectrum approximates that of the Laplace-Beltrami operator of the underlying smooth manifold, and shallow GCNNs of possibly infinite width are linear functionals on the space of measures on the parameter space. From this functional-analytic perspective, graph signals are seen as spatial discretizations of functions on the manifold, which leads to a natural notion of training data consistent across graph resolutions. To enable convergence results, the continuum parameter space is chosen as a weakly compact product of unit balls, with Sobolev regularity imposed on the output weight and bias, but not on the convolutional parameter. The corresponding discrete parameter spaces inherit the corresponding spectral decay, and are additionally restricted by a frequency cutoff adapted to the informative spectral window of the graph Laplacians. Under these assumptions, we prove \Gamma -convergence of regularized empirical risk minimization functionals and corresponding convergence of their global minimizers, in the sense of weak convergence of the parameter measures and uniform convergence of the functions over compact sets. This provides a formalization of mesh and sample independence for the training of such networks.
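The spectrally defined graph convolution underlying these GCNNs applies a filter function to the eigenvalues of the graph Laplacian. A minimal sketch on a 4-node path graph (the heat-kernel filter and toy signal are assumptions for illustration):

```python
import numpy as np

def spectral_conv(L, x, g):
    """Spectral graph convolution: filter a graph signal x by applying a
    function g to the Laplacian eigenvalues, i.e. U g(Lambda) U^T x."""
    lam, U = np.linalg.eigh(L)
    return U @ (g(lam) * (U.T @ x))

# Path graph on 4 nodes; its combinatorial Laplacian is D - A.
A = np.diag([1.0, 1.0, 1.0], k=1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A
x = np.array([1.0, -1.0, 1.0, -1.0])               # highly oscillatory signal
y = spectral_conv(L, x, lambda lam: np.exp(-lam))  # heat-kernel low-pass filter
print(np.linalg.norm(y) < np.linalg.norm(x))       # prints True: energy is damped
```

Because the low-frequency Laplacian spectrum converges to that of the Laplace-Beltrami operator under the manifold assumption, such filters have a well-defined continuum limit, which is what the frequency cutoff in the paper's parameter space is adapted to.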
[LG-39] Detecting Stochasticity in Discrete Signals via Nonparametric Excursion Theorem
链接: https://arxiv.org/abs/2601.06009
作者: Sunia Tanweer,Firas A. Khasawneh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Probability (math.PR); Applications (stat.AP)
*备注:
Abstract:We develop a practical framework for distinguishing diffusive stochastic processes from deterministic signals using only a single discrete time series. Our approach is based on classical excursion and crossing theorems for continuous semimartingales, which relate the number N_\varepsilon of excursions of magnitude at least \varepsilon to the quadratic variation [X]_T of the process. The scaling law holds universally for all continuous semimartingales with finite quadratic variation, including general Ito diffusions with nonlinear or state-dependent volatility, but fails sharply for deterministic systems, thereby providing a theoretically certified method of distinguishing between these dynamics, as opposed to the subjective entropy- or recurrence-based state-of-the-art methods. We construct a robust data-driven diffusion test. The method compares the empirical excursion counts against the theoretical expectation. The resulting ratio K(\varepsilon)=N_\varepsilon^{\mathrm{emp}}/N_\varepsilon^{\mathrm{theory}} is then summarized by a log-log slope deviation measuring the \varepsilon^{-2} law that provides a classification into diffusion-like or not. We demonstrate the method on canonical stochastic systems, some periodic and chaotic maps and systems with additive white noise, as well as the stochastic Duffing system. The approach is nonparametric, model-free, and relies only on the universal small-scale structure of continuous semimartingales.
[LG-40] DeePM: Regime-Robust Deep Learning for Systematic Macro Portfolio Management
链接: https://arxiv.org/abs/2601.05975
作者: Kieran Wood,Stephen J. Roberts,Stefan Zohren
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We propose DeePM (Deep Portfolio Manager), a structured deep-learning macro portfolio manager trained end-to-end to maximize a robust, risk-adjusted utility. DeePM addresses three fundamental challenges in financial learning: (1) it resolves the asynchronous “ragged filtration” problem via a Directed Delay (Causal Sieve) mechanism that prioritizes causal impulse-response learning over information freshness; (2) it combats low signal-to-noise ratios via a Macroeconomic Graph Prior, regularizing cross-asset dependence according to economic first principles; and (3) it optimizes a distributionally robust objective where a smooth worst-window penalty serves as a differentiable proxy for Entropic Value-at-Risk (EVaR) - a window-robust utility encouraging strong performance in the most adverse historical subperiods. In large-scale backtests from 2010-2025 on 50 diversified futures with highly realistic transaction costs, DeePM attains net risk-adjusted returns that are roughly twice those of classical trend-following strategies and passive benchmarks, solely using daily closing prices. Furthermore, DeePM improves upon the state-of-the-art Momentum Transformer architecture by roughly fifty percent. The model demonstrates structural resilience across the 2010s “CTA (Commodity Trading Advisor) Winter” and the post-2020 volatility regime shift, maintaining consistent performance through the pandemic, inflation shocks, and the subsequent higher-for-longer environment. Ablation studies confirm that strictly lagged cross-sectional attention, graph prior, principled treatment of transaction costs, and robust minimax optimization are the primary drivers of this generalization capability.
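The smooth worst-window penalty can be illustrated with a log-sum-exp soft minimum over rolling window returns (a generic differentiable softmin; the window length, temperature, and synthetic returns are assumptions, and the exact DeePM penalty and its EVaR connection are not reproduced):

```python
import numpy as np

def worst_window_penalty(returns, window=63, tau=0.1):
    """Soft minimum (log-sum-exp) of rolling-window returns: a smooth,
    differentiable proxy for the most adverse historical subperiod."""
    w = np.convolve(returns, np.ones(window), mode='valid')  # rolling sums
    return -tau * np.log(np.mean(np.exp(-w / tau)))          # softmin over windows

rng = np.random.default_rng(0)
rets = 0.0005 + 0.01 * rng.standard_normal(1000)   # synthetic daily returns
soft = worst_window_penalty(rets)
hard = np.min(np.convolve(rets, np.ones(63), mode='valid'))
print(soft >= hard)   # the soft minimum upper-bounds the hard minimum
```

As tau shrinks, the soft minimum approaches the hard worst-window return while remaining differentiable, which is what lets the robust objective be trained end-to-end by gradient descent.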
[LG-41] A Critical Examination of Active Learning Workflows in Materials Science
链接: https://arxiv.org/abs/2601.05946
作者: Akhil S. Nair,Lucas Foppa
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Active learning (AL) plays a critical role in materials science, enabling applications such as the construction of machine-learning interatomic potentials for atomistic simulations and the operation of self-driving laboratories. Despite its widespread use, the reliability and effectiveness of AL workflows depend on implicit design assumptions that are rarely examined systematically. Here, we critically assess AL workflows deployed in materials science and investigate how key design choices, such as surrogate models, sampling strategies, uncertainty quantification and evaluation metrics, relate to their performance. By identifying common pitfalls and discussing practical mitigation strategies, we provide guidance to practitioners for the efficient design, assessment, and interpretation of AL workflows in materials science.
[LG-42] Multi-task Modeling for Engineering Applications with Sparse Data
链接: https://arxiv.org/abs/2601.05910
作者: Yigitcan Comlek,R. Murali Krishnan,Sandipp Krishnan Ravi,Amin Moghaddas,Rafael Giorjao,Michael Eff,Anirban Samaddar,Nesar S. Ramachandra,Sandeep Madireddy,Liping Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 15 pages, 5 figures, 6 tables
Abstract:Modern engineering and scientific workflows often require simultaneous predictions across related tasks and fidelity levels, where high-fidelity data is scarce and expensive, while low-fidelity data is more abundant. This paper introduces a Multi-Task Gaussian Process (MTGP) framework tailored for engineering systems characterized by multi-source, multi-fidelity data, addressing challenges of data sparsity and varying task correlations. The proposed framework leverages inter-task relationships across outputs and fidelity levels to improve predictive performance and reduce computational costs. The framework is validated across three representative scenarios: Forrester function benchmark, 3D ellipsoidal void modeling, and friction-stir welding. By quantifying and leveraging inter-task relationships, the proposed MTGP framework offers a robust and scalable solution for predictive modeling in domains with significant computational and experimental costs, supporting informed decision-making and efficient resource utilization.
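As a rough illustration of how inter-task correlation can be encoded in an MTGP, below is a sketch of the standard intrinsic coregionalization model (ICM) kernel. This is a textbook construction, not necessarily the paper's exact formulation, and all sizes are toy choices.

```python
import numpy as np

def icm_kernel(X, tasks, B, lengthscale=1.0):
    """ICM kernel K[(x,t),(x',t')] = B[t,t'] * k_rbf(x,x'), where the
    PSD matrix B (num_tasks x num_tasks) encodes inter-task correlation."""
    d2 = (X[:, None, :] - X[None, :, :]) ** 2
    k_in = np.exp(-0.5 * d2.sum(-1) / lengthscale**2)  # RBF on inputs
    return B[np.ix_(tasks, tasks)] * k_in              # Hadamard product is PSD

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (8, 1))
tasks = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # task label per observation
A = rng.normal(size=(2, 2))
B = A @ A.T + 0.1 * np.eye(2)               # PSD coregionalization matrix
K = icm_kernel(X, tasks, B)
```

Because the joint Gram matrix couples observations from all tasks, scarce high-fidelity points can borrow statistical strength from abundant low-fidelity ones, which is the mechanism the abstract alludes to.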
[LG-43] Sequential Bayesian Optimal Experimental Design in Infinite Dimensions via Policy Gradient Reinforcement Learning
链接: https://arxiv.org/abs/2601.05868
作者: Kaichen Shen,Peng Chen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Sequential Bayesian optimal experimental design (SBOED) for PDE-governed inverse problems is computationally challenging, especially for infinite-dimensional random field parameters. High-fidelity approaches require repeated forward and adjoint PDE solves inside nested Bayesian inversion and design loops. We formulate SBOED as a finite-horizon Markov decision process and learn an amortized design policy via policy-gradient reinforcement learning (PGRL), enabling online design selection from the experiment history without repeatedly solving an SBOED optimization problem. To make policy training and reward evaluation scalable, we combine dual dimension reduction – active subspace projection for the parameter and principal component analysis for the state – with an adjusted derivative-informed latent attention neural operator (LANO) surrogate that predicts both the parameter-to-solution map and its Jacobian. We use a Laplace-based D-optimality reward while noting that, in general, other expected-information-gain utilities such as KL divergence can also be used within the same framework. We further introduce an eigenvalue-based evaluation strategy that uses prior samples as proxies for maximum a posteriori (MAP) points, avoiding repeated MAP solves while retaining accurate information-gain estimates. Numerical experiments on sequential multi-sensor placement for contaminant source tracking demonstrate approximately 100\times speedup over high-fidelity finite element methods, improved performance over random sensor placements, and physically interpretable policies that discover an "upstream" tracking strategy.
[LG-44] Autonomous Discovery of the Ising Model's Critical Parameters with Reinforcement Learning
链接: https://arxiv.org/abs/2601.05577
作者: Hai Man,Chaobo Wang,Jia-Rui Li,Yuping Tian,Shu-Gang Chen
类目: atistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 37 pages, 9 figures. This is the Accepted Manuscript of an article published in J. Stat. Mech
Abstract:Traditional methods for determining critical parameters are often influenced by human factors. This research introduces a physics-inspired adaptive reinforcement learning framework that enables agents to autonomously interact with physical environments, simultaneously identifying both the critical temperature and various types of critical exponents in the Ising model with precision. Interestingly, our algorithm exhibits search behavior reminiscent of phase transitions, efficiently converging to target parameters regardless of initial conditions. Experimental results demonstrate that this method significantly outperforms traditional approaches, particularly in environments with strong perturbations. This study not only incorporates physical concepts into machine learning to enhance algorithm interpretability but also establishes a new paradigm for scientific exploration, transitioning from manual analysis to autonomous AI discovery.
[LG-45] DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models
链接: https://arxiv.org/abs/2601.05531
作者: Eliatan Niktab,Hardip Patel
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:Tokenization sits at the boundary between high-throughput genomic input and GPU compute, posing challenges in both algorithm design and system throughput. Overlapping k-mer tokenization can introduce information leakage under masked language modeling (MLM) and may degrade downstream accuracy. Single-nucleotide tokenization avoids leakage and preserves per-base fidelity, but it greatly increases sequence length for attention-based architectures. Non-overlapping k-mers and byte-pair encoding (BPE) provide compression and avoid leakage, at the cost of boundary sensitivity or reduced interpretability. Empirically, the choice of tokenization interacts strongly with model architecture and task requirements. At the system level, however, standard string tokenizers and host-bound vocabulary lookups dominate wall-clock time once inputs reach billions of bases, regardless of the tokenization algorithm. We present DNATok, a high-performance, GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline using pinned memory and architectural parallelism. DNATok is vocabulary-agnostic: it accelerates single-nucleotide, non-overlapping k-mer, and BPE tokenization, and integrates as a drop-in systems layer beneath genomic foundation models. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput. End-to-end streaming reaches 1.27-1.84e8 tokens/s depending on configuration, effectively removing tokenization as a bottleneck for production-scale training and inference.
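The byte-LUT idea can be sketched on CPU with NumPy (the actual system is GPU-side with overlapped H2D transfers and pinned memory, which this does not reproduce). The vocabulary and id assignments below are hypothetical.

```python
import numpy as np

# Hypothetical single-nucleotide vocabulary; the ids are illustrative.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
UNK = 4

# 256-entry byte lookup table: tokenization becomes one gather per
# input byte, with no general-purpose string parsing on the hot path.
LUT = np.full(256, UNK, dtype=np.int32)
for base, idx in VOCAB.items():
    LUT[ord(base)] = idx
    LUT[ord(base.lower())] = idx  # tolerate lowercase input

def tokenize(seq_bytes: bytes) -> np.ndarray:
    """Vectorized byte->id mapping; on GPU this is a device-side gather."""
    return LUT[np.frombuffer(seq_bytes, dtype=np.uint8)]

ids = tokenize(b"ACGTacgtNX")  # unknown bytes like X fall back to UNK
```

Because the LUT is vocabulary-agnostic, the same gather-based layer can sit beneath single-nucleotide, non-overlapping k-mer, or BPE vocabularies, which is the drop-in property the abstract emphasizes.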
[LG-46] Autonomous Probe Microscopy with Robust Bag-of-Features Multi-Objective Bayesian Optimization: Pareto-Front Mapping of Nanoscale Structure-Property Trade-Offs
链接: https://arxiv.org/abs/2601.05528
作者: Kamyar Barakati,Haochen Zhu,C Charlotte Buchanan,Dustin A Gilbert,Philip Rack,Sergei V. Kalinin
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 25 pages, 5 figures
Abstract:Combinatorial materials libraries are an efficient route to generate large families of candidate compositions, but their impact is often limited by the speed and depth of characterization and by the difficulty of extracting actionable structure-property relations from complex characterization data. Here we develop an autonomous scanning probe microscopy (SPM) framework that integrates automated atomic force and magnetic force microscopy (AFM/MFM) to rapidly explore magnetic and structural properties across combinatorial spread libraries. To enable automated exploration of systems without a clear optimization target, we introduce a combination of a static physics-informed bag-of-features (BoF) representation of measured surface morphology and magnetic structure with multi-objective Bayesian optimization (MOBO) to discover the relative significance and robustness of features. The resulting closed-loop workflow selectively samples the compositional gradient and reconstructs feature landscapes consistent with dense grid “ground truth” measurements. The resulting Pareto structure reveals where multiple nanoscale objectives are simultaneously optimized, where trade-offs between roughness, coherence, and magnetic contrast are unavoidable, and how families of compositions cluster into distinct functional regimes, thereby turning multi-feature imaging data into interpretable maps of competing structure-property trends. While demonstrated for Au-Co-Ni and AFM/MFM, the approach is general and can be extended to other combinatorial systems, imaging modalities, and feature sets, illustrating how feature-based MOBO and autonomous SPM can transform microscopy images from static data products into active feedback for real-time, multi-objective materials discovery.
[LG-47] What Functions Does XGBoost Learn?
链接: https://arxiv.org/abs/2601.05444
作者: Dohyeong Ki,Adityanand Guntuboyina
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This paper establishes a rigorous theoretical foundation for the function class implicitly learned by XGBoost, bridging the gap between its empirical success and our theoretical understanding. We introduce an infinite-dimensional function class \mathcal{F}^{d,s}_{\infty\text{-ST}} that extends finite ensembles of bounded-depth regression trees, together with a complexity measure V^{d,s}_{\infty\text{-XGB}}(\cdot) that generalizes the L^1 regularization penalty used in XGBoost. We show that every optimizer of the XGBoost objective is also an optimizer of an equivalent penalized regression problem over \mathcal{F}^{d,s}_{\infty\text{-ST}} with penalty V^{d,s}_{\infty\text{-XGB}}(\cdot), providing an interpretation of XGBoost as implicitly targeting a broader function class. We also develop a smoothness-based interpretation of \mathcal{F}^{d,s}_{\infty\text{-ST}} and V^{d,s}_{\infty\text{-XGB}}(\cdot) in terms of Hardy–Krause variation. We prove that the least squares estimator over \{f \in \mathcal{F}^{d,s}_{\infty\text{-ST}} : V^{d,s}_{\infty\text{-XGB}}(f) \le V\} achieves a nearly minimax-optimal rate of convergence n^{-2/3} (\log n)^{4(\min(s, d) - 1)/3}, thereby avoiding the curse of dimensionality. Our results provide the first rigorous characterization of the function space underlying XGBoost, clarify its connection to classical notions of variation, and identify an important open problem: whether the XGBoost algorithm itself achieves minimax optimality over this class.
[LG-48] A brief note on learning problem with global perspectives
链接: https://arxiv.org/abs/2601.05441
作者: Getachew K. Befekadu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 7 Pages with 1 Figure
Abstract:This brief note considers the problem of learning in a dynamic-optimizing principal-agent setting, in which the agents are allowed to have global perspectives about the learning process, i.e., the ability to view things according to their relative importance or in their true relations based on some aggregated information shared by the principal. The principal, which exerts an influence on the learning process of the agents in the aggregation, is primarily tasked with solving a high-level optimization problem posed as an empirical-likelihood estimator under a conditional moment restrictions model, which also accounts for the agents' predictive performance on out-of-sample data as well as a set of private datasets available only to the principal. In particular, we present a coherent mathematical argument necessary for characterizing the learning process behind this abstract principal-agent learning framework, although we acknowledge that a few conceptual and theoretical issues still need to be addressed.
[LG-49] Dynamic Inclusion and Bounded Multi-Factor Tilts for Robust Portfolio Construction
链接: https://arxiv.org/abs/2601.05428
作者: Roberto Garrone
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 28 pages, 7 figures, algorithmic portfolio construction framework emphasizing robustness, explicit constraints, and implementability
Abstract:This paper proposes a portfolio construction framework designed to remain robust under estimation error, non-stationarity, and realistic trading constraints. The methodology combines dynamic asset eligibility, deterministic rebalancing, and bounded multi-factor tilts applied to an equal-weight baseline. Asset eligibility is formalized as a state-dependent constraint on portfolio construction, allowing factor exposure to adjust endogenously in response to observable market conditions such as liquidity, volatility, and cross-sectional breadth. Rather than estimating expected returns or covariances, the framework relies on cross-sectional rankings and hard structural bounds to control concentration, turnover, and fragility. The resulting approach is fully algorithmic, transparent, and directly implementable. It provides a robustness-oriented alternative to parametric optimization and unconstrained multi-factor models, particularly suited for long-horizon allocations where stability and operational feasibility are primary objectives.
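A toy sketch of a bounded rank-based tilt on an equal-weight baseline follows, assuming a linear tilt on cross-sectional ranks with hard per-asset bounds. The bound values and tilt strength are illustrative, not taken from the paper.

```python
import numpy as np

def bounded_tilt_weights(scores, w_min=0.005, w_max=0.05, tilt=0.5):
    """Equal-weight baseline tilted by cross-sectional ranks, then clipped
    to hard per-asset bounds and renormalized to sum to one. No expected
    returns or covariances are estimated, only rankings and bounds."""
    n = len(scores)
    ranks = scores.argsort().argsort() / (n - 1)         # ranks in [0, 1]
    w = (1.0 / n) * (1.0 + tilt * (2.0 * ranks - 1.0))   # bounded tilt around EW
    w = np.clip(w, w_min, w_max)                         # hard structural bounds
    return w / w.sum()

rng = np.random.default_rng(3)
scores = rng.normal(size=50)  # hypothetical composite factor score per asset
w = bounded_tilt_weights(scores)
```

The hard clip caps concentration regardless of how extreme a factor score is, which is the robustness-over-optimization trade the abstract describes.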
[LG-50] Archetypal cases for questionnaires with nominal multiple choice questions
链接: https://arxiv.org/abs/2601.05392
作者: Aleix Alcacer,Irene Epifanio
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: Statistical Methods for Data Analysis and Decision Sciences. Third Conference of the Statistics and Data Science Group of the Italian Statistical Society. Milan, April 2-3, 2025
Abstract:Archetypal analysis serves as an exploratory tool that interprets a collection of observations as convex combinations of pure (extreme) patterns. When these patterns correspond to actual observations within the sample, they are termed archetypoids. For the first time, we propose applying archetypoid analysis to nominal observations, specifically for identifying archetypal cases from questionnaires featuring nominal multiple-choice questions with a single possible answer. This approach can enhance our understanding of a nominal data set, similar to its application in multivariate contexts. We compare this methodology with the use of archetype analysis and probabilistic archetypal analysis and demonstrate the benefits of this methodology using a real-world example: the German credit dataset.
[LG-51] Machine learning assisted state prediction of misspecified linear dynamical system via modal reduction
链接: https://arxiv.org/abs/2601.05297
作者: Rohan Vitthal Thorat,Rajdip Nayek
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Accurate prediction of structural dynamics is imperative for preserving digital twin fidelity throughout operational lifetimes. Parametric models with fixed nominal parameters often omit critical physical effects due to simplifications in geometry, material behavior, damping, or boundary conditions, resulting in model form errors (MFEs) that impair predictive accuracy. This work introduces a comprehensive framework for MFE estimation and correction in high-dimensional finite element (FE) based structural dynamical systems. The Gaussian Process Latent Force Model (GPLFM) represents discrepancies non-parametrically in the reduced modal domain, allowing a flexible data-driven characterization of unmodeled dynamics. A linear Bayesian filtering approach jointly estimates system states and discrepancies, incorporating epistemic and aleatoric uncertainties. To ensure computational tractability, the FE system is projected onto a reduced modal basis, and a mesh-invariant neural network maps modal states to discrepancy estimates, permitting model rectification across different FE discretizations without retraining. Validation is undertaken across five MFE scenarios, including incorrect beam theory, damping misspecification, misspecified boundary conditions, unmodeled material nonlinearity, and local damage, demonstrating the surrogate model’s substantial reduction of displacement and rotation prediction errors under unseen excitations. The proposed methodology offers a potential means to uphold digital twin accuracy amid inherent modeling uncertainties.
[LG-52] A universal vision transformer for fast calorimeter simulations
链接: https://arxiv.org/abs/2601.05289
作者: Luigi Favaro,Andrea Giammanco,Claudius Krause
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Instrumentation and Detectors (physics.ins-det)
*备注: 37 pages, 15 figures, 8 tables
Abstract:The high-dimensional complex nature of detectors makes fast calorimeter simulations a prime application for modern generative machine learning. Vision transformers (ViTs) can emulate the Geant4 response with unmatched accuracy and are not limited to regular geometries. Starting from the CaloDREAM architecture, we demonstrate the robustness and scalability of ViTs on regular and irregular geometries, and multiple detectors. Our results show that ViTs generate electromagnetic and hadronic showers statistically indistinguishable from Geant4 in multiple evaluation metrics, while maintaining the generation time in the \mathcal{O}(10-100) ms range on a single GPU. Furthermore, we show that pretraining on a large dataset and fine-tuning on the target geometry leads to reduced training costs and higher data efficiency, or altogether improves the fidelity of generated showers.
[LG-53] Channel Selected Stratified Nested Cross Validation for Clinically Relevant EEG Based Parkinson's Disease Detection
链接: https://arxiv.org/abs/2601.05276
作者: Nicholas R. Rasmussen,Rodrigue Rizk,Longwei Wang,Arun Singh,KC Santosh
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted to IEEE Conference - posting to Arxiv as normal
Abstract:The early detection of Parkinson's disease remains a critical challenge in clinical neuroscience, with electroencephalography offering a noninvasive and scalable pathway toward population level screening. While machine learning has shown promise in this domain, many reported results suffer from methodological flaws, most notably patient level data leakage, inflating performance estimates and limiting clinical translation. To address these modeling pitfalls, we propose a unified evaluation framework grounded in nested cross validation and incorporating three complementary safeguards: (i) patient level stratification to eliminate subject overlap and ensure unbiased generalization, (ii) multi layered windowing to harmonize heterogeneous EEG recordings while preserving temporal dynamics, and (iii) inner loop channel selection to enable principled feature reduction without information leakage. Applied across three independent datasets with a heterogeneous number of channels, a convolutional neural network trained under this framework achieved 80.6% accuracy and demonstrated state of the art performance under held out population block testing, comparable to other methods in the literature. This performance underscores the necessity of nested cross validation as a safeguard against bias and as a principled means of selecting the most relevant information for patient level decisions, providing a reproducible foundation that can extend to other biomedical signal analysis domains.
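The patient-level (grouped) splitting safeguard can be sketched as follows, with a hand-rolled group k-fold so that no subject's EEG windows appear on both sides of a split. The patient counts, window counts, and fold numbers are arbitrary toy values.

```python
import numpy as np

def group_kfold(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) with no subject appearing in both sides."""
    rng = np.random.default_rng(seed)
    uniq = np.array(sorted(set(groups)))
    rng.shuffle(uniq)
    folds = np.array_split(uniq, n_splits)
    groups = np.asarray(groups)
    for held_out in folds:
        test = np.isin(groups, held_out)
        yield np.where(~test)[0], np.where(test)[0]

# hypothetical: 20 patients, each contributing a variable number of windows
rng = np.random.default_rng(4)
groups = np.repeat(np.arange(20), rng.integers(3, 8, 20))

# outer loop gives the unbiased test estimate; an inner loop run on the
# train side only would handle channel selection and hyperparameters,
# so no leaked information reaches the held-out patients
for tr, te in group_kfold(groups):
    inner_splits = list(group_kfold(groups[tr], n_splits=3, seed=1))
```

Splitting at the window level instead (the pitfall the abstract highlights) would let windows from the same patient land in both train and test, inflating accuracy.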
[LG-54] On the use of case estimate and transactional payment data in neural networks for individual loss reserving
链接: https://arxiv.org/abs/2601.05274
作者: Benjamin Avanzi,Matthew Lambrianidis,Greg Taylor,Bernard Wong
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The use of neural networks trained on individual claims data has become increasingly popular in the actuarial reserving literature. We consider how to best input historical payment data in neural network models. Additionally, case estimates are also available in the format of a time series, and we extend our analysis to assessing their predictive power. In this paper, we compare a feed-forward neural network trained on summarised transactions to a recurrent neural network equipped to analyse a claim’s entire payment history and/or case estimate development history. We draw conclusions from training and comparing the performance of the models on multiple, comparable highly complex datasets simulated from SPLICE (Avanzi, Taylor and Wang, 2023). We find evidence that case estimates will improve predictions significantly, but that equipping the neural network with memory only leads to meagre improvements. Although the case estimation process and quality will vary significantly between insurers, we provide a standardised methodology for assessing their value.
[LG-55] Generalizable Blood Pressure Estimation from Multi-Wavelength PPG Using Curriculum-Adversarial Learning
链接: https://arxiv.org/abs/2509.12518
作者: Zequan Liang,Ruoyu Zhang,Wei Shao,Mahdi Pirayesh Shirazi Nejad,Ehsan Kourkchi,Setareh Rafatirad,Houman Homayoun
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: In the proceedings of IEEE-EMBS International Conference on Body Sensor Networks 2025
Abstract:Accurate and generalizable blood pressure (BP) estimation is vital for the early detection and management of cardiovascular diseases. In this study, we enforce subject-level data splitting on a public multi-wavelength photoplethysmography (PPG) dataset and propose a generalizable BP estimation framework based on curriculum-adversarial learning. Our approach combines curriculum learning, which transitions from hypertension classification to BP regression, with domain-adversarial training that confuses subject identity to encourage the learning of subject-invariant features. Experiments show that multi-channel fusion consistently outperforms single-channel models. On the four-wavelength PPG dataset, our method achieves strong performance under strict subject-level splitting, with mean absolute errors (MAE) of 14.2 mmHg for systolic blood pressure (SBP) and 6.4 mmHg for diastolic blood pressure (DBP). Additionally, ablation studies validate the effectiveness of both the curriculum and adversarial components. These results highlight the potential of leveraging complementary information in multi-wavelength PPG and curriculum-adversarial strategies for accurate and robust BP estimation.
[LG-56] Rapid Adaptation of SpO2 Estimation to Wearable Devices via Transfer Learning on Low-Sampling-Rate PPG
链接: https://arxiv.org/abs/2509.12515
作者: Zequan Liang,Ruoyu Zhang,Wei Shao,Krishna Karthik,Ehsan Kourkchi,Setareh Rafatirad,Houman Homayoun
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: In the proceedings of IEEE-EMBS International Conference on Body Sensor Networks 2025
Abstract:Blood oxygen saturation (SpO2) is a vital marker for healthcare monitoring. Traditional SpO2 estimation methods often rely on complex clinical calibration, making them unsuitable for low-power, wearable applications. In this paper, we propose a transfer learning-based framework for the rapid adaptation of SpO2 estimation to energy-efficient wearable devices using low-sampling-rate (25Hz) dual-channel photoplethysmography (PPG). We first pretrain a bidirectional Long Short-Term Memory (BiLSTM) model with self-attention on a public clinical dataset, then fine-tune it using data collected from our wearable We-Be band and an FDA-approved reference pulse oximeter. Experimental results show that our approach achieves a mean absolute error (MAE) of 2.967% on the public dataset and 2.624% on the private dataset, significantly outperforming traditional calibration and non-transferred machine learning baselines. Moreover, using 25Hz PPG reduces power consumption by 40% compared to 100Hz, excluding baseline draw. Our method also attains an MAE of 3.284% in instantaneous SpO2 prediction, effectively capturing rapid fluctuations. These results demonstrate the rapid adaptation of accurate, low-power SpO2 monitoring on wearable devices without the need for clinical calibration.
信息检索
[IR-0] Statistical Foundations of DIME: Risk Estimation for Practical Index Selection EACL2026
链接: https://arxiv.org/abs/2601.05649
作者: Giulio D’Erasmo,Cesare Campagnano,Antonio Mallia,Pierpaolo Brutti,Nicola Tonellotto,Fabrizio Silvestri
类目: Information Retrieval (cs.IR)
*备注: Accepted to EACL 2026 (Main Conference)
Abstract:High-dimensional dense embeddings have become central to modern Information Retrieval, but many dimensions are noisy or redundant. The recently proposed DIME (Dimension IMportance Estimation) provides query-dependent scores to identify informative components of embeddings. However, DIME relies on a costly grid search to select, a priori, a single dimensionality for all query embeddings over the corpus. Our work provides a statistically grounded criterion that directly identifies the optimal set of dimensions for each query at inference time. Experiments confirm that our criterion achieves parity of effectiveness while reducing embedding size by an average of \sim50% across different models and datasets at inference time.
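The masking mechanics behind DIME-style dimension selection can be sketched as below, using a simple per-dimension contribution score against a pseudo-relevant centroid with a fixed top-k cutoff. The paper's statistical risk criterion for choosing the cutoff per query is not reproduced here; the dimensions and data are synthetic stand-ins.

```python
import numpy as np

def dime_mask(query, prf_docs, keep):
    """Score each embedding dimension by its contribution to the
    query / pseudo-relevant-centroid dot product; keep the top `keep`."""
    centroid = prf_docs.mean(axis=0)
    importance = query * centroid             # per-dimension contribution
    mask = np.zeros_like(query)
    mask[np.argsort(importance)[-keep:]] = 1.0
    return mask

rng = np.random.default_rng(5)
q = rng.normal(size=768)
docs = rng.normal(size=(10, 768))             # hypothetical PRF documents
m = dime_mask(q, docs, keep=384)
masked_score = float((q * m) @ docs[0])       # retrieval on masked dims only
```

Because only the dimensions with positive evidence of relevance are kept, the masked inner product with the pseudo-relevant centroid can only improve over the full-dimensional one, which is the intuition behind dropping noisy components.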
[IR-1] Revisiting Human-vs-LLM judgments using the TREC Podcast Track ECIR2026
链接: https://arxiv.org/abs/2601.05603
作者: Watheq Mansour,J. Shane Culpepper,Joel Mackenzie,Andrew Yates
类目: Information Retrieval (cs.IR)
*备注: The paper has been accepted to appear at ECIR 2026
Abstract:Using large language models (LLMs) to annotate relevance is an increasingly important technique in the information retrieval community. While some studies demonstrate that LLMs can achieve high agreement with ground-truth (human) judgments, other studies have argued for the opposite conclusion. To the best of our knowledge, these studies have primarily focused on classic ad-hoc text search scenarios. In this paper, we conduct an analysis of agreement between LLMs and human experts, and explore the impact disagreement has on system rankings. In contrast to prior studies, we focus on a collection composed of audio files that are transcribed into two-minute segments – the TREC 2020 and 2021 podcast tracks. We employ five different LLM models to re-assess all of the query-segment pairs, which were originally annotated by TREC assessors. Furthermore, we re-assess a small subset of pairs where LLM and TREC assessors have the highest disagreement, and find that the human experts tend to agree with LLMs more than with the TREC assessors. Our results reinforce the earlier insight of Sormunen (2002) that relying on a single assessor leads to lower agreement.
[IR-2] Efficient Temporal-aware Matryoshka Adaptation for Temporal Information Retrieval
链接: https://arxiv.org/abs/2601.05549
作者: Tuan-Luc Huynh,Weiqing Wang,Trung Le,Thuy-Trang Vu,Dragan Gašević,Yuan-Fang Li,Thanh-Toan Do
类目: Information Retrieval (cs.IR)
*备注: 18 pages
Abstract:Retrievers are a key bottleneck in Temporal Retrieval-Augmented Generation (RAG) systems: failing to retrieve temporally relevant context can degrade downstream generation, regardless of LLM reasoning. We propose Temporal-aware Matryoshka Representation Learning (TMRL), an efficient method that equips retrievers with temporal-aware Matryoshka embeddings. TMRL leverages the nested structure of Matryoshka embeddings to introduce a temporal subspace, enhancing temporal encoding while preserving general semantic representations. Experiments show that TMRL efficiently adapts diverse text embedding models, achieving competitive temporal retrieval and temporal RAG performance compared to prior Matryoshka-based non-temporal methods and prior temporal methods, while enabling flexible accuracy-efficiency trade-offs.
[IR-3] LEAPS: An LLM -Empowered Adaptive Plugin for Taobao AI Search
链接: https://arxiv.org/abs/2601.05513
作者: Lei Wang,Jinhang Wu,Zhibin Wang,Biye Li,Haiping Hou
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The rapid advancement of large language models has reshaped user search cognition, driving a paradigm shift from discrete keyword-based search to high-dimensional conversational interaction. However, existing e-commerce search architectures face a critical capability deficit in adapting to this change. Users are often caught in a dilemma: precise natural language descriptions frequently trigger zero-result scenarios, while the forced simplification of queries leads to decision overload from noisy, generic results. To tackle this challenge, we propose LEAPS (LLM-Empowered Adaptive Plugin for Taobao AI Search), which seamlessly upgrades traditional search systems via a “Broaden-and-Refine” paradigm. Specifically, it attaches plugins to both ends of the search pipeline: (1) Upstream, a Query Expander acts as an intent translator. It employs a novel three-stage training strategy–inverse data augmentation, posterior-knowledge supervised fine-tuning, and diversity-aware reinforcement learning–to generate adaptive and complementary query combinations that maximize the candidate product set. (2) Downstream, a Relevance Verifier serves as a semantic gatekeeper. By synthesizing multi-source data (e.g., OCR text, reviews) and leveraging chain-of-thought reasoning, it precisely filters noise to resolve selection overload. Extensive offline experiments and online A/B testing demonstrate that LEAPS significantly enhances conversational search experiences. Crucially, its non-invasive architecture preserves established retrieval performance optimized for short-text queries, while simultaneously allowing for low-cost integration into diverse back-ends. Fully deployed on Taobao AI Search since August 2025, LEAPS currently serves hundreds of millions of users monthly.
[IR-4] RECOR: Reasoning -focused Multi-turn Conversational Retrieval Benchmark
链接: https://arxiv.org/abs/2601.05461
作者: Mohammed Ali,Abdelrahman Abdallah,Amit Agarwal,Hitesh Laxmichand Patel,Adam Jatowt
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Existing benchmarks treat multi-turn conversation and reasoning-intensive retrieval separately, yet real-world information seeking requires both. To bridge this gap, we present a benchmark for reasoning-based conversational information retrieval comprising 707 conversations (2,971 turns) across eleven domains. To ensure quality, our Decomposition-and-Verification framework transforms complex queries into fact-grounded multi-turn dialogues through multi-level validation, where atomic facts are verified against sources and explicit retrieval reasoning is generated for each turn. Comprehensive evaluation reveals that combining conversation history with reasoning doubles retrieval performance (nDCG@10: 0.236 for the baseline \rightarrow 0.479 with history + reasoning), while reasoning-specialized models substantially outperform dense encoders. Despite these gains, further analysis highlights that implicit reasoning remains challenging, particularly when logical connections are not explicitly stated in the text.
[IR-5] Separating Semantic Expansion from Linear Geometry for PubMed-Scale Vector Search
链接: https://arxiv.org/abs/2601.05268
作者: Rob Koopman
类目: Information Retrieval (cs.IR)
*备注: 4 pages
Abstract:We describe a PubMed-scale retrieval framework that separates semantic interpretation from metric geometry. A large language model expands a natural language query into concise biomedical phrases; retrieval then operates in a fixed, mean-free, approximately isotropic embedding space. Each document and query vector is formed as a weighted mean of token embeddings, projected onto the complement of nuisance axes and compressed by a Johnson-Lindenstrauss transform. No parameters are trained. The system retrieves coherent biomedical clusters across the full MEDLINE corpus (about 40 million records) using exact cosine search on 256-dimensional int8 vectors. Evaluation is purely geometric: head cosine, compactness, centroid closure, and isotropy are compared with random vector baselines. Recall is not defined, since the language-model expansion specifies the effective target set.
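The encoding pipeline described above (weighted token mean, mean removal, nuisance-axis projection, JL compression, int8 quantization, exact cosine) can be sketched as follows. The dimensions, the nuisance basis, and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def jl_matrix(d_in, d_out, rng):
    # Random Gaussian Johnson-Lindenstrauss projection, scaled so pairwise
    # distances are approximately preserved after compression.
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_out)

def encode(token_vecs, weights, mean, nuisance, jl):
    # 1) Weighted mean of token embeddings.
    v = np.average(token_vecs, axis=0, weights=weights)
    # 2) Center (mean-free space) and remove the nuisance directions.
    v = v - mean
    v = v - nuisance @ (nuisance.T @ v)
    # 3) Compress with the JL transform, L2-normalize, quantize to int8.
    v = v @ jl
    v = v / np.linalg.norm(v)
    return np.clip(np.round(v * 127), -127, 127).astype(np.int8)

def cosine_int8(a, b):
    # Exact cosine on int8 vectors; normalization absorbs the quantization scale.
    a, b = a.astype(np.float32), b.astype(np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d, k = 768, 256                                      # assumed dimensions
mean = rng.standard_normal(d) * 0.01                 # assumed corpus mean
nuisance, _ = np.linalg.qr(rng.standard_normal((d, 4)))  # orthonormal nuisance axes
jl = jl_matrix(d, k, rng)
doc = encode(rng.standard_normal((12, d)), np.ones(12), mean, nuisance, jl)
```

Because no step here is trained, the whole encoder is a fixed linear map plus quantization, which is what makes exact cosine search at 40-million-record scale tractable.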
[IR-6] A General Metric-Space Formulation of the Time Warp Edit Distance (TWED)
链接: https://arxiv.org/abs/2601.05263
作者: Zhen Yi Lau
类目: Information Retrieval (cs.IR); Data Structures and Algorithms (cs.DS)
*备注: 20 pages, 1 algorithm, small technical note on the generalization of the Time Warp Edit Distance (TWED) to arbitrary metric spaces
Abstract:This short technical note presents a formal generalization of the Time Warp Edit Distance (TWED) proposed by Marteau (2009) to arbitrary metric spaces. By viewing both the observation and temporal domains as metric spaces (X, d) and (T, \Delta), we define a Generalized TWED (GTWED) that remains a true metric under mild assumptions. We provide self-contained proofs of its metric properties and show that the classical TWED is recovered as a special case when X = \mathbb{R}^d, T \subset \mathbb{R}, and g(x) = x. This note focuses on the theoretical structure of GTWED and its implications for extending elastic distances beyond time series, which enables the use of TWED-like metrics on sequences over arbitrary domains such as symbolic data, manifolds, or embeddings.
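The GTWED can be sketched as the standard TWED dynamic program in which both the observation metric d and the temporal metric delta are pluggable callables. The padding convention and parameter names below are one common choice, not necessarily the note's exact formulation:

```python
import math

def gtwed(a, ta, b, tb, d, delta, nu=1.0, lam=1.0):
    """Generalized TWED over metric spaces (X, d) and (T, delta).

    a, b: observation sequences; ta, tb: their timestamps.
    nu: temporal stiffness; lam: deletion penalty (Marteau 2009 notation).
    """
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    # Padding convention: a virtual initial element equal to the first one.
    A, TA = [a[0]] + list(a), [ta[0]] + list(ta)
    B, TB = [b[0]] + list(b), [tb[0]] + list(tb)
    for i in range(1, n + 1):
        D[i][0] = D[i-1][0] + d(A[i], A[i-1]) + nu * delta(TA[i], TA[i-1]) + lam
    for j in range(1, m + 1):
        D[0][j] = D[0][j-1] + d(B[j], B[j-1]) + nu * delta(TB[j], TB[j-1]) + lam
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                # delete in a
                D[i-1][j] + d(A[i], A[i-1]) + nu * delta(TA[i], TA[i-1]) + lam,
                # delete in b
                D[i][j-1] + d(B[j], B[j-1]) + nu * delta(TB[j], TB[j-1]) + lam,
                # match a_i with b_j
                D[i-1][j-1] + d(A[i], B[j]) + d(A[i-1], B[j-1])
                            + nu * (delta(TA[i], TB[j]) + delta(TA[i-1], TB[j-1])),
            )
    return D[n][m]

abs_d = lambda x, y: abs(x - y)
print(gtwed([0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], abs_d, abs_d))
```

Because d and delta are arbitrary callables, the same code runs on symbolic alphabets (e.g. Hamming distance) or embedded sequences, which is the extension the note argues for.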
[IR-7] A Technical Report on the Second Place Solution for the CIKM 2025 AnalytiCup Competition
链接: https://arxiv.org/abs/2601.05259
作者: Haotao Xie,Ruilin Chen,Yicheng Wu,Zhan Zhao,Yuanyuan Liu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:In this work, we address the challenge of multilingual category relevance judgment in e-commerce search, where traditional ensemble-based systems improve accuracy but at the cost of heavy training, inference, and maintenance complexity. To overcome this limitation, we propose a simplified yet effective framework that leverages prompt engineering with Chain-of-Thought task decomposition to guide reasoning within a single large language model. Specifically, our approach decomposes the relevance judgment process into four interpretable subtasks (translation, intent understanding, category matching, and relevance judgment) and fine-tunes a base model (Qwen2.5-14B) using Low-Rank Adaptation (LoRA) for efficient adaptation. This design not only reduces computational and storage overhead but also enhances interpretability by explicitly structuring the model’s reasoning path. Experimental results show that our single-model framework achieves competitive accuracy and high inference efficiency, processing 20 samples per second on a single A100 GPU. In the CIKM 2025 AnalytiCup Competition, our method achieved 0.8902 on the public leaderboard and 0.8889 on the private leaderboard, validating the effectiveness and robustness of the proposed approach. These results highlight that structured prompting combined with lightweight fine-tuning can outperform complex ensemble systems, offering a new paradigm for scalable industrial AI applications.
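The four-subtask decomposition lends itself to a simple structured prompt. A hypothetical sketch, where the wording, field names, and example query are illustrative rather than taken from the paper:

```python
# The four interpretable subtasks named in the abstract, in order.
SUBTASKS = ["translation", "intent understanding",
            "category matching", "relevance judgment"]

def build_prompt(query: str, category: str, target_lang: str = "English") -> str:
    # Assemble a Chain-of-Thought prompt that walks the model through
    # each subtask before it emits the final relevance label.
    steps = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(SUBTASKS))
    return (
        f"Query: {query}\nCategory: {category}\n"
        f"Reason step by step through the following subtasks:\n{steps}\n"
        f"First translate the query into {target_lang}, then state the user's "
        f"intent, check whether the category matches that intent, and finally "
        f"answer Relevant or Irrelevant."
    )

prompt = build_prompt("zapatillas running mujer", "Women's Running Shoes")
print(prompt)
```

Structuring the reasoning path explicitly like this is what lets a single LoRA-tuned model replace an ensemble: each subtask's output is visible and checkable in the generation.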
[IR-8] CourtNav: Voice-Guided Anchor-Accurate Navigation of Long Legal Documents in Courtrooms
链接: https://arxiv.org/abs/2601.05255
作者: Sai Khadloya,Kush Juvekar,Arghya Bhattacharya,Utkarsh Saxena
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Judicial work depends on close reading of long records, charge sheets, pleadings, annexures, orders, often spanning hundreds of pages. With limited staff support, exhaustive reading during hearings is impractical. We present CourtNav, a voice-guided, anchor-first navigator for legal PDFs that maps a judge’s spoken command (e.g., “go to paragraph 23”, “highlight the contradiction in the cross-examination”) directly to a highlighted paragraph in seconds. CourtNav transcribes the command, classifies intent with a grammar-first router (exact regex matching first, with an LLM-backed fallback that classifies queries using few-shot examples), retrieves over a layout-aware hybrid index, and auto-scrolls the viewer to the cited span while highlighting it and close alternates. By design, the interface shows only grounded passages, never free text, keeping evidence verifiable and auditable. This need is acute in India, where judgments and cross-examinations are notoriously lengthy. In a pilot on representative charge sheets, pleadings, and orders, median time-to-relevance drops from 3-5 minutes (manual navigation) to 10-15 seconds; with quick visual verification included, 30-45 seconds. Under fixed time budgets, this navigation-first design increases the breadth of the record actually consulted while preserving control and transparency.
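The grammar-first routing step can be sketched as exact regex matching with an LLM fallback for anything the grammar does not cover. The command grammar and intent names below are illustrative assumptions, not CourtNav's actual rules:

```python
import re

# Unambiguous commands are handled by exact regex patterns; anything
# unmatched falls through to the few-shot LLM classifier (stubbed here).
PATTERNS = [
    ("goto_paragraph", re.compile(r"go to paragraph (\d+)", re.I)),
    ("goto_page",      re.compile(r"go to page (\d+)", re.I)),
    ("highlight",      re.compile(r"highlight (?:the )?(.+)", re.I)),
]

def route(command: str):
    # Grammar-first: return the first pattern that matches the whole command.
    for intent, pattern in PATTERNS:
        m = pattern.fullmatch(command.strip())
        if m:
            return intent, m.groups()
    # Fallback: the LLM-backed few-shot classifier would run here.
    return "llm_fallback", (command,)

print(route("go to paragraph 23"))
```

The ordering matters for latency and auditability: deterministic regex hits never touch the LLM, so the most common navigation commands resolve instantly and reproducibly.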

