This blog post presents the latest paper list retrieved from Arxiv.org on 2025-10-29, updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by email on a schedule, please leave your email address in the comments.
Note: Paper data is fetched from Arxiv.org daily and updated automatically around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-10-29)
A total of 550 new papers were posted today, including:
- Natural Language Processing: 105 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 198 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 90 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 182 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
【Quick Read】: This paper addresses two core problems in machine translation quality evaluation: predicting an accurate quality score for a translation, and identifying the erroneous spans in a translation together with their severities and categories. For quality score prediction, the authors present MetricX-25, whose key improvement is adapting Gemma 3 into an encoder-only architecture with a regression head on top, which effectively predicts both MQM (Multidimensional Quality Metrics) and ESA (Error Span Annotation) scores and significantly outperforms its predecessor. For error span detection, they propose GemSpanEval, whose innovation is to formulate error detection as a generative task: the model outputs not only each error span but also its surrounding context, ensuring errors are localized unambiguously, while remaining competitive with strong baselines such as xCOMET. Both systems are built on the multilingual open-weights model Gemma 3 and fine-tuned on publicly available WMT data.
Link: https://arxiv.org/abs/2510.24707
Authors: Juraj Juraska,Tobias Domhan,Mara Finkelstein,Tetsuji Nakagawa,Geza Kovacs,Daniel Deutsch,Pidong Wang,Markus Freitag
Affiliations: Google
Subjects: Computation and Language (cs.CL)
Comments: Accepted to WMT25
Abstract:In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
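As a concrete illustration of the MetricX-25 design described above, the sketch below shows a generic encoder-only scorer with a regression head. The tiny stand-in Transformer, the mean pooling, and the MSE objective are our assumptions, not the authors' released code.

```python
# Minimal sketch of an encoder-only quality scorer with a regression head.
import torch
import torch.nn as nn

class EncoderRegressionMetric(nn.Module):
    def __init__(self, vocab_size=32000, hidden=256, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)  # encoder-only, bidirectional attention
        self.head = nn.Linear(hidden, 1)                      # scalar MQM/ESA-style score

    def forward(self, input_ids, attention_mask):
        x = self.encoder(self.embed(input_ids),
                         src_key_padding_mask=~attention_mask.bool())
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1.0)  # mean over valid tokens
        return self.head(pooled).squeeze(-1)

model = EncoderRegressionMetric()
ids = torch.randint(0, 32000, (2, 12))       # e.g., tokenized "source ||| hypothesis"
mask = torch.ones(2, 12, dtype=torch.long)
scores = model(ids, mask)
print(scores.shape)  # torch.Size([2]); train with MSE against human scores
```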
[NLP-1] ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
【Quick Read】: This paper asks whether Large Language Models (LLMs) can translate high-level semantic actions in Virtual Reality (VR) games into precise sequences of controller and head-mounted display (HMD) operations, the way humans do using common sense and embodied understanding. The key contribution is ComboBench, a benchmark covering 262 scenarios from four popular VR games (Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft), used to systematically evaluate seven mainstream LLMs (including GPT-4o and Gemini-1.5-Pro) and to quantify their gaps from annotated ground truth and human performance in task decomposition, procedural reasoning, and spatial understanding. The study finds that although top models such as Gemini-1.5-Pro show strong task decomposition, they still lag well behind humans in procedural reasoning and spatial understanding, and performance varies considerably with interaction complexity; few-shot examples substantially improve performance, suggesting that targeted training could strengthen LLMs' manipulation abilities in VR.
Link: https://arxiv.org/abs/2510.24706
Authors: Shuqing Li,Jiayi Yan,Chenyu Niu,Jen-tse Huang,Yun Peng,Wenxuan Wang,Yepang Liu,Michael R. Lyu
Affiliations: Chinese University of Hong Kong; Southern University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments:
Abstract:Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs’ capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs’ VR manipulation capabilities. We release all materials at this https URL.
[NLP-2] Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
【Quick Read】: This paper tackles the difficulty of unifying training data for AI agents in large-scale supervised fine-tuning (SFT): existing data is fragmented across heterogeneous formats, tools, and interfaces, preventing efficient consolidation and reuse. The key solution is the Agent Data Protocol (ADP), a lightweight representation language that serves as an "interlingua" across agent datasets. ADP can express a wide variety of task types (API/tool use, browsing, coding, software engineering, and more) in a unified structure while remaining simple to parse and train on, without per-dataset engineering. Experiments show that ADP can standardize 13 existing agent training datasets and convert them into training-ready formats for multiple agent frameworks, yielding an average performance gain of roughly 20% without domain-specific tuning and reaching state-of-the-art or near-SOTA results on the corresponding benchmarks.
Link: https://arxiv.org/abs/2510.24702
Authors: Yueqi Song,Ketan Ramaneti,Zaid Sheikh,Ziru Chen,Boyu Gou,Tianbao Xie,Yiheng Xu,Danyang Zhang,Apurva Gandhi,Fan Yang,Joseph Liu,Tianyue Ou,Zhihao Yuan,Frank Xu,Shuyan Zhou,Xingyao Wang,Xiang Yue,Tao Yu,Huan Sun,Yu Su,Graham Neubig
Affiliations: Carnegie Mellon University; The Ohio State University; University of Hong Kong; Duke University; Fujitsu Research; All Hands AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an “interlingua” between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
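Since ADP is described as a lightweight "interlingua" for heterogeneous agent trajectories, a toy record and a flattening step might look like the following. All field names and the schema here are hypothetical illustrations, not the actual ADP specification.

```python
# Hypothetical ADP-style unified trajectory record (illustrative schema only).
trajectory = {
    "dataset": "webarena-demo",          # source dataset being standardized
    "task": "Find the cheapest laptop and add it to the cart.",
    "steps": [
        {"role": "agent", "type": "reasoning", "content": "I should search first."},
        {"role": "agent", "type": "tool_call",
         "tool": "browser.search", "arguments": {"query": "laptop"}},
        {"role": "env", "type": "observation", "content": "10 results ..."},
        {"role": "agent", "type": "answer", "content": "Added item #3 to cart."},
    ],
}

def to_sft_example(traj):
    """Flatten a unified record into a (prompt, completion) pair for SFT."""
    prompt = traj["task"]
    completion = "\n".join(
        s.get("content") or f'{s["tool"]}({s["arguments"]})'
        for s in traj["steps"] if s["role"] == "agent"
    )
    return {"prompt": prompt, "completion": completion}

print(to_sft_example(trajectory))
```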
[NLP-3] Tongyi DeepResearch Technical Report
【Quick Read】: This paper targets long-horizon, deep information-seeking research tasks, where large language models lack the autonomous reasoning and sustained information acquisition such tasks require. The central challenge is building a scalable agentic framework capable of multi-step planning, environment interaction, and knowledge integration over complex tasks. The key solution is an end-to-end training framework that combines agentic mid-training with agentic post-training, together with a fully automatic data synthesis pipeline that requires no human annotation and powers every training stage; customized environments for each stage keep interactions stable and consistent over long-horizon tasks. With this approach, Tongyi DeepResearch achieves state-of-the-art performance across a range of agentic deep-research benchmarks.
Link: https://arxiv.org/abs/2510.24701
Authors: Tongyi DeepResearch Team:Baixuan Li,Bo Zhang,Dingchu Zhang,Fei Huang,Guangyu Li,Guoxin Chen,Huifeng Yin,Jialong Wu,Jingren Zhou,Kuan Li,Liangcai Su,Litu Ou,Liwen Zhang,Pengjun Xie,Rui Ye,Wenbiao Yin,Xinmiao Yu,Xinyu Wang,Xixi Wu,Xuanzhong Chen,Yida Zhao,Zhen Zhang,Zhengwei Tao,Zhongwang Zhang,Zile Qiao,Chenxi Wang,Donglei Yu,Gang Fu,Haiyang Shen,Jiayin Yang,Jun Lin,Junkai Zhang,Kui Zeng,Li Yang,Hailong Yin,Maojia Song,Ming Yan,Peng Xia,Qian Xiao,Rui Min,Ruixue Ding,Runnan Fang,Shaowei Chen,Shen Huang,Shihang Wang,Shihao Cai,Weizhou Shen,Xiaobin Wang,Xin Guan,Xinyu Geng,Yingcheng Shi,Yuning Wu,Zhuo Chen,Zijian Li,Yong Jiang
Affiliations: Tongyi DeepResearch Team; Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: this https URL
Abstract:We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity’s Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
[NLP-4] AgentFold: Long-Horizon Web Agents with Proactive Context Management
【Quick Read】: This paper addresses the limited performance of LLM-based web agents on long-horizon tasks caused by poor context management. Existing ReAct-style agents accumulate noisy raw histories and suffer context saturation, while methods that summarize the full history at a fixed cadence risk irreversibly losing critical details. The key idea of AgentFold is proactive context management inspired by the human cognitive process of retrospective consolidation: the context is treated as a dynamic cognitive workspace, and the agent learns to execute a 'folding' operation that manages its historical trajectory at multiple scales, performing either granular condensations that preserve vital fine-grained details or deep consolidations that abstract away entire multi-step sub-tasks, enabling efficient and robust long-term memory management.
Link: https://arxiv.org/abs/2510.24699
Authors: Rui Ye,Zhongwang Zhang,Kuan Li,Huifeng Yin,Zhengwei Tao,Yida Zhao,Liangcai Su,Liwen Zhang,Zile Qiao,Xinyu Wang,Pengjun Xie,Fei Huang,Siheng Chen,Jingren Zhou,Yong Jiang
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 9 figures
Abstract:LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a 'folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
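To make the multi-scale folding idea concrete, here is a toy sketch of a workspace and one fold step. The data layout and action format are our assumptions, not the paper's implementation.

```python
# Toy "folding" step: condense one old step, or consolidate a finished
# sub-task span into a single summary entry.
from dataclasses import dataclass

@dataclass
class ContextEntry:
    text: str
    scale: str  # "raw", "condensed", or "consolidated"

def fold(workspace: list[ContextEntry], action: dict) -> list[ContextEntry]:
    kind, lo, hi, summary = action["kind"], action["lo"], action["hi"], action["summary"]
    if kind == "condense":       # granular: shorten one step, keep key details
        workspace[lo] = ContextEntry(summary, "condensed")
        return workspace
    if kind == "consolidate":    # deep: replace a whole multi-step sub-task
        return workspace[:lo] + [ContextEntry(summary, "consolidated")] + workspace[hi + 1:]
    return workspace

history = [ContextEntry(f"step {i}: clicked link {i}", "raw") for i in range(5)]
history = fold(history, {"kind": "consolidate", "lo": 0, "hi": 3,
                         "summary": "Explored 4 search results; none relevant."})
print([e.text for e in history])
```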
[NLP-5] ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
【Quick Read】: This paper addresses two core problems of conventional parallel thinking for information-seeking (IS) agents: inefficiency from repeatedly rolling out from scratch, and difficulty integrating long-horizon reasoning trajectories at answer time, because limited context capacity prevents full consideration of the reasoning process, degrading final answers. The solution, ParallelMuse, is a two-stage paradigm. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to improve exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress the information relevant to answer derivation and synthesize a coherent final answer. Across multiple open-source agents and benchmarks, the method yields up to 62% performance improvement while reducing exploratory token consumption by 10-30%.
Link: https://arxiv.org/abs/2510.24698
Authors: Baixuan Li,Dingchu Zhang,Jialong Wu,Wenbiao Yin,Zhengwei Tao,Yida Zhao,Liwen Zhang,Haiyang Shen,Runnan Fang,Pengjun Xie,Jingren Zhou,Yong Jiang
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10–30% reduction in exploratory token consumption.
[NLP-6] WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
【Quick Read】: This paper addresses the pervasive search inefficiency of current LLM-based information-seeking (IS) agents, a bottleneck that constrains overall performance. The analysis attributes the inefficiency largely to the sparsity of target entities in training tasks, which limits agents' opportunities to learn and generalize efficient search behaviors. The key solution, WebLeaper, formulates IS as a tree-structured reasoning problem so that a substantially larger set of target entities can be embedded within a constrained context, and uses curated Wikipedia tables to synthesize IS tasks in three variants (Basic, Union, and Reverse-Union) that systematically increase both search efficiency and efficacy. Only trajectories that are simultaneously accurate and efficient are retained for training, ensuring the model is optimized for both correctness and search performance.
Link: https://arxiv.org/abs/2510.24697
Authors: Zhengwei Tao,Haiyang Shen,Baixuan Li,Wenbiao Yin,Jialong Wu,Kuan Li,Zhongwang Zhang,Huifeng Yin,Rui Ye,Liwen Zhang,Xinyu Wang,Pengjun Xie,Jingren Zhou,Yong Jiang
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from low search efficiency, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic, Union, and Reverse-Union, to systematically increase both IS efficiency and efficacy. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments on both basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp, GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
[NLP-7] AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
【Quick Read】: This paper asks how to train large language models (LLMs) effectively on tasks at the frontier of their reasoning capabilities: LLMs often falter on complex tasks beyond their current abilities, so a training strategy is needed that precisely locates and expands that capability frontier. The key idea is to import the educational theory of the Zone of Proximal Development (ZPD), which defines the frontier as tasks an LLM cannot solve alone but can master with guidance, and to build the AgentFrontier Engine, an automated data synthesis pipeline that produces high-quality, multidisciplinary data situated precisely within the LLM's ZPD for both continued pre-training and targeted post-training. The same framework yields the ZPD Exam, a dynamic, automated benchmark for tracking agent progress on frontier tasks; the resulting AgentFrontier-30B-A3B model, trained on the synthesized data, achieves state-of-the-art results on demanding benchmarks such as Humanity's Last Exam, even surpassing some leading proprietary agents.
Link: https://arxiv.org/abs/2510.24695
Authors: Xuanzhong Chen,Zile Qiao,Guoxin Chen,Liangcai Su,Zhen Zhang,Xinyu Wang,Pengjun Xie,Fei Huang,Jingren Zhou,Yong Jiang
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments: this https URL
Abstract:Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within the LLM’s ZPD. This engine supports both continued pre-training with knowledge-intensive data and targeted post-training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train AgentFrontier-30B-A3B model on our synthesized data, which achieves state-of-the-art results on demanding benchmarks like Humanity’s Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD-guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.
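The ZPD criterion ("tasks the model cannot solve alone but can master with guidance") suggests a simple data filter. The following self-contained toy, with stand-in success probabilities instead of a real model, sketches how such a filter could work; everything here is an illustrative assumption, not the AgentFrontier Engine itself.

```python
# ZPD-style filter: keep a task iff the model fails without guidance
# but succeeds with guidance (over k sampled attempts).
import random

def solve(task, guidance=None, p_alone=0.2, p_guided=0.7):
    """Stand-in for model sampling: returns True iff the attempt is correct."""
    return random.random() < (p_guided if guidance else p_alone)

def in_zpd(task, k=4):
    alone = any(solve(task) for _ in range(k))
    guided = any(solve(task, guidance="hint") for _ in range(k))
    return (not alone) and guided

tasks = [f"task-{i}" for i in range(100)]
zpd_tasks = [t for t in tasks if in_zpd(t)]
print(f"{len(zpd_tasks)} / {len(tasks)} tasks fall in the simulated ZPD")
```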
[NLP-8] Repurposing Synthetic Data for Fine-grained Search Agent Supervision
【Quick Read】: This paper addresses the loss of entity information when training LLM-based search agents with methods such as Group Relative Policy Optimization (GRPO), which rely on sparse outcome rewards and therefore cannot distinguish "near-miss" samples (substantially correct reasoning but a flawed final answer) from complete failures, discarding valuable learning signals. The key innovation, Entity-aware Group Relative Policy Optimization (E-GRPO), builds a dense entity-based reward function: motivated by the empirical finding that the number of ground-truth entities identified during reasoning correlates strongly with final answer accuracy, it assigns partial rewards to incorrect samples in proportion to their entity match rate, so the model can learn effectively from near-misses. Experiments on diverse question-answering (QA) and deep-research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline, and that it also induces more efficient reasoning policies requiring fewer tool calls, yielding a more effective and sample-efficient approach to aligning search agents.
Link: https://arxiv.org/abs/2510.24694
Authors: Yida Zhao,Kuan Li,Xixi Wu,Liwen Zhang,Dingchu Zhang,Baixuan Li,Maojia Song,Zhuo Chen,Chenxi Wang,Xinyu Wang,Kewei Tu,Pengjun Xie,Jingren Zhou,Yong Jiang
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative “near-miss” samples-those with substantially correct reasoning but a flawed final answer-from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent’s reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these “near-misses”. Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
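The dense, entity-aware reward at the heart of E-GRPO can be sketched as follows; the exact reward shape and the weighting `alpha` are assumptions based on the description above, not the paper's formula.

```python
# Entity-aware dense reward: full credit for a correct answer, partial
# credit proportional to the entity match rate for near-misses.
def entity_reward(pred_answer_correct: bool,
                  reasoning_text: str,
                  gold_entities: list[str],
                  alpha: float = 0.5) -> float:
    if pred_answer_correct:
        return 1.0                               # full outcome reward
    text = reasoning_text.lower()
    hits = sum(e.lower() in text for e in gold_entities)
    match_rate = hits / max(len(gold_entities), 1)
    return alpha * match_rate                    # partial credit for near-misses

# Example: wrong final answer, but 3 of 4 gold entities surfaced in reasoning.
r = entity_reward(False, "... visited Marie Curie, Sorbonne, Radium pages ...",
                  ["Marie Curie", "Sorbonne", "Radium", "1903"])
print(r)  # 0.375
```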
[NLP-9] STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
【Quick Read】: This paper targets the limitations of current Multi-modal Large Language Models and Large Audio-Language Models in audio understanding, especially their weak fine-grained perceptual reasoning: existing audio benchmarks mostly test semantics recoverable from text captions, masking deficits in precise reasoning over how sounds evolve in time and 3D space. The authors formalize "audio 4D intelligence", reasoning over sound dynamics in time and 3D space, and build STAR-Bench to measure it. The solution has two key parts. First, the benchmark combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting covering segment reordering for continuous and discrete processes plus spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Second, two data-curation methods ensure sample quality: procedurally synthesized and physics-simulated audio for the foundational tasks, and a four-stage pipeline with human annotation and human-performance-based selection for the holistic tasks. Compared with caption-only answering, STAR-Bench induces far larger accuracy drops (-31.5% temporal, -35.2% spatial), confirming its focus on cues that are hard to describe linguistically and charting a clear path toward models with a more robust understanding of the physical world.
Link: https://arxiv.org/abs/2510.24693
Authors: Zihan Liu,Zhikang Niu,Qiuyang Xiao,Zhisheng Zheng,Ruoqi Yuan,Yuhang Zang,Yuhang Cao,Xiaoyi Dong,Jianze Liang,Xie Chen,Leilei Sun,Dahua Lin,Jiaqi Wang
Affiliations: Beihang University; Shanghai AI Laboratory; Shanghai Jiao Tong University; The Chinese University of Hong Kong; Shanghai Innovation Institute
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Homepage: this https URL
Abstract:Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
[NLP-10] SPICE: Self-Play In Corpus Environments Improves Reasoning
【Quick Read】: This paper addresses the difficulty self-improving systems face in sustaining long-term gains without continual environmental interaction: existing ungrounded self-play generates tasks of limited diversity, capping improvement. The proposed SPICE (Self-Play In Corpus Environments) framework has a single model act in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. The adversarial dynamic lets the Challenger build an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding supplies a rich, near-inexhaustible external signal, so the system keeps generating and meeting increasingly challenging goals and achieves stable, scalable self-improvement. Experiments show consistent gains across model families on mathematical (+8.9%) and general reasoning (+9.8%) benchmarks, validating corpus grounding as the key ingredient.
Link: https://arxiv.org/abs/2510.24684
Authors: Bo Liu,Chuanyang Jin,Seungone Kim,Weizhe Yuan,Wenting Zhao,Ilia Kulikov,Xian Li,Sainbayar Sukhbaatar,Jack Lanchantin,Jason Weston
Affiliations: Meta
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner’s capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
[NLP-11] Dissecting Role Cognition in Medical LLM s via Neuronal Ablation
【Quick Read】: This paper asks whether Prompt-Based Role Playing (PBRP) in medical large language models (LLMs), i.e., instructing a model to adopt clinical roles such as medical student, resident, or attending physician, actually elicits role-specific cognitive processes, since the practice is common but its effect on reasoning is unclear. The key solution is the RP-Neuron-Activated Evaluation Framework (RPNA), which combines neuron ablation with representation analysis to test, on three medical QA datasets, whether role prompts induce distinct reasoning pathways or cognitive differentiation. The results show that role prompts mainly alter surface-level linguistic style: no significant role-specific reasoning differences emerge, and the core decision-making mechanisms remain uniform across roles. Current PBRP methods therefore fail to replicate the cognitive complexity of real medical practice, underscoring the need for medical AI that simulates genuine cognitive processes rather than linguistic style.
Link: https://arxiv.org/abs/2510.24677
Authors: Xun Liang,Huayi Lai,Hanyu Wang,Wentao Zhang,Linfeng Zhang,Yanfang Chen,Feiyu Xiong,Zhiyu Li
Affiliations: Renmin University of China; Peking University; Shanghai Jiaotong University; Institute for Advanced Algorithms Research (IAAR); Memtensor Research Center
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures
Abstract:Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework (RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic style. We have released the related code in the following repository: https://github.com/IAAR-Shanghai/RolePlay_LLMDoctor
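Neuron ablation, the framework's core probing tool, is commonly implemented with forward hooks that zero selected hidden units. Below is a minimal generic PyTorch sketch of that technique, not the paper's specific ablation protocol.

```python
# Generic neuron ablation via a PyTorch forward hook.
import torch
import torch.nn as nn

def ablate_neurons(module: nn.Module, neuron_idx: list[int]):
    """Zero the given hidden units in a layer's output whenever it runs."""
    def hook(mod, inputs, output):
        output = output.clone()
        output[..., neuron_idx] = 0.0   # knock out selected neurons
        return output                   # returned value replaces the output
    return module.register_forward_hook(hook)

mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
handle = ablate_neurons(mlp[0], [0, 5, 9])   # ablate 3 units in the first layer
out = mlp(torch.randn(2, 16))                # forward pass with ablation active
handle.remove()                              # restore normal behavior
```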
[NLP-12] InteractComp: Evaluating Search Agents With Ambiguous Queries
【Quick Read】: This paper addresses search agents' lack of interactive clarification when user queries are incomplete or ambiguous, a capability existing benchmarks cannot measure. The core solution is InteractComp, a benchmark built with a target-distractor methodology following the principle "easy to verify, interact to disambiguate": 210 expert-curated questions across 9 domains carry genuine ambiguity resolvable only through interaction, forcing models to recognize the ambiguity and actively interact to resolve it. Evaluating 17 models reveals striking failure: the best model reaches only 13.73% accuracy without interaction versus 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits, while forced interaction produces dramatic gains, showing that current strategies fail to engage a latent capability.
Link: https://arxiv.org/abs/2510.24668
Authors: Mingyi Deng,Lijun Huang,Yani Fan,Jiayi Zhang,Fashen Ren,Jinyi Bai,Fuzhen Yang,Dayi Miao,Zhaoyang Yu,Yifan Wu,Yanfei Zhang,Fengwei Teng,Yingjia Wan,Song Hu,Yude Li,Xin Jin,Conghao Hu,Haoyu Li,Qirui Fu,Tai Zhong,Xinyu Wang,Xiangru Tang,Nan Tang,Chenglin Wu,Yuyu Luo
Affiliations: DeepWisdom; The Hong Kong University of Science and Technology (Guangzhou); Renmin University of China; University of California, Los Angeles; Agent Universe; McGill University; Yale University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at this https URL.
[NLP-13] MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
【Quick Read】: This paper addresses evaluation methods lagging behind improving machine translation (MT) quality: as models get better, current evaluation may fail to capture their gains and instead introduce evaluation noise. The key solution is a two-stage MQM (Multidimensional Quality Metrics) re-annotation scheme, in which an annotator (the same or a different one) reviews and edits pre-existing MQM annotations, surfacing and correcting errors missed in the first pass and thereby producing markedly higher-quality annotations.
Link: https://arxiv.org/abs/2510.24664
Authors: Parker Riley,Daniel Deutsch,Mara Finkelstein,Colten DiIanni,Juraj Juraska,Markus Freitag
Affiliations: Google
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
[NLP-14] Evolving Diagnostic Agents in a Virtual Clinical Environment
【Quick Read】: This paper addresses the lack of dynamic decision-making in current large language models (LLMs) for clinical diagnosis: beyond diagnosing from static case summaries, a model should adaptively select examinations over multiple turns and then commit to an accurate final diagnosis, which conventional instruction tuning cannot teach because it never simulates the exploration and feedback of real clinical workflows. The key solution is an end-to-end reinforcement learning framework that trains a diagnostic agent, DiagAgent, through multi-turn interaction inside DiagGym, a virtual clinical environment driven by electronic health records (EHRs) that emits examination outcomes conditioned on patient history, optimizing both information yield and diagnostic accuracy. A companion benchmark, DiagBench, systematically evaluates diagnostic policies; experiments show DiagAgent significantly outperforms 10 state-of-the-art LLMs (including DeepSeek-v3 and GPT-4o) and two prompt-engineered agents, with higher accuracy and better examination recommendations in both single-turn and end-to-end settings.
Link: https://arxiv.org/abs/2510.24654
Authors: Pengcheng Qiu,Chaoyi Wu,Junwei Liu,Qiaoyu Zheng,Yusheng Liao,Haowen Wang,Yun Yue,Qianrui Fan,Shuai Zhen,Jian Wang,Jinjie Gu,Yanfeng Wang,Ya Zhang,Weidi Xie
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; AntGroup
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on diagnosis process; (iv) we demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic accuracy and 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
[NLP-15] Optimizing Retrieval for RAG via Reinforced Contrastive Learning
【Quick Read】: This paper addresses the difficulty of defining and annotating relevance for the retrieval module in retrieval-augmented generation (RAG): traditional information retrieval (IR) serves human users, whereas retrieval in RAG supplies contextual knowledge to an AI model, whose notion of relevance is hard to specify or label beforehand. The key solution is R3, a retrieval framework optimized for RAG through trial-and-feedback reinforced contrastive learning: during training, retrieved results interact with the RAG environment to produce contrastive signals that automatically guide the retriever's self-improvement, with no annotated or synthetic data for supervised fine-tuning, yielding substantial RAG gains across diverse tasks.
Link: https://arxiv.org/abs/2510.24652
Authors: Jiawei Zhou,Lei Chen
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trial-and-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
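One way to read "trial-and-feedback reinforced contrastive learning" is that end-to-end RAG outcomes label retrieved documents as positives or negatives for a contrastive update. The sketch below shows that reading under our own assumptions about the loss form; it is not the paper's actual training code.

```python
# InfoNCE-style update driven by RAG outcome feedback: documents whose
# inclusion led to a correct generation act as positives.
import torch
import torch.nn.functional as F

def contrastive_loss(q, docs, is_positive, tau=0.05):
    """q: (d,) query embedding; docs: (n, d); is_positive: (n,) bool mask."""
    sims = F.cosine_similarity(q.unsqueeze(0), docs, dim=-1) / tau   # (n,)
    log_probs = F.log_softmax(sims, dim=0)
    return -log_probs[is_positive].mean()   # pull positives above negatives

q = torch.randn(64, requires_grad=True)
docs = torch.randn(8, 64)
feedback = torch.tensor([True, False, False, True, False, False, False, False])
loss = contrastive_loss(q, docs, feedback)
loss.backward()   # gradients flow into the retriever's query embedding
```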
[NLP-16] Quantifying the Effects of Word Length Frequency and Predictability on Dyslexia
【Quick Read】: This paper asks when, and under what conditions, the reading costs of dyslexia arise in naturalistic reading, by aligning large-scale eye-tracking data with word-level features (word length, frequency, and predictability) and quantifying how each feature affects reading times differently for typical and dyslexic readers. Two findings are key. First, all three word-level features robustly change reading times in both groups, with dyslexic readers showing stronger sensitivity to each, especially predictability. Second, counterfactually manipulating these features narrows the dyslexic-control reading-time gap by about one third, with predictability showing the strongest effect, followed by length and frequency. These patterns align with dyslexia theories positing heightened demands on linguistic working memory and phonological encoding, motivate further work on lexical complexity and parafoveal preview benefits to explain the remaining gap, and offer actionable guidance for interventions and computational models of dyslexic reading.
Link: https://arxiv.org/abs/2510.24647
Authors: Hugo Rydel-Johnston,Alex Kafkas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:We ask where, and under what conditions, dyslexic reading costs arise in a large-scale naturalistic reading dataset. Using eye-tracking aligned to word-level features (word length, frequency, and predictability), we model how each feature influences dyslexic time costs. We find that all three features robustly change reading times in both typical and dyslexic readers, and that dyslexic readers show stronger sensitivities to each, especially predictability. Counterfactual manipulations of these features substantially narrow the dyslexic-control gap by about one third, with predictability showing the strongest effect, followed by length and frequency. These patterns align with dyslexia theories that posit heightened demands on linguistic working memory and phonological encoding, and they motivate further work on lexical complexity and parafoveal preview benefits to explain the remaining gap. In short, we quantify when extra dyslexic costs arise, how large they are, and offer actionable guidance for interventions and computational models for dyslexics.
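An analysis in this spirit can be expressed as a regression of reading time on word-level features with group interaction terms. The sketch below uses simulated data and a plain OLS model as an illustrative stand-in for the authors' actual modeling.

```python
# Reading time ~ word features, with dyslexic-group interactions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "length": rng.integers(2, 12, n),
    "logfreq": rng.normal(0, 1, n),
    "predictability": rng.uniform(0, 1, n),
    "dyslexic": rng.integers(0, 2, n),
})
# Simulate a stronger predictability sensitivity for the dyslexic group.
df["rt"] = (250 + 8 * df.length - 15 * df.logfreq
            - 60 * df.predictability
            + df.dyslexic * (40 - 50 * df.predictability)
            + rng.normal(0, 30, n))

model = smf.ols(
    "rt ~ (length + logfreq + predictability) * dyslexic", data=df).fit()
print(model.params)   # interaction terms estimate the extra dyslexic cost
```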
[NLP-17] OpenReward: Learning to Reward Long-form Agent ic Tasks via Reinforcement Learning
【Quick Read】: This paper addresses the weakness of existing reward models (RMs) on knowledge-intensive, long-form generation tasks, where judging correctness requires grounding in evidence beyond the model's internal knowledge, so traditional RMs struggle to reliably discriminate subtle quality differences. The key solution is OpenRM, a tool-augmented long-form reward model that invokes external tools to gather relevant evidence before judging open-ended responses. It is trained with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples, jointly supervising intermediate tool usage and final outcome accuracy so that the model learns effective evidence-based judgment strategies.
Link: https://arxiv.org/abs/2510.24636
Authors: Ziyou Hu,Zhengliang Shi,Minghang Zhu,Haitao Li,Teng Sun,Pengjie Ren,Suzan Verberne,Zhaochun Ren
Affiliations: Leiden University; Shandong University; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model’s internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
[NLP-18] “Mm Wat?” Detecting Other-initiated Repair Requests in Dialogue
【Quick Read】: This paper addresses conversational agents' (CAs) failure to recognize user-initiated repair requests (Other-Initiated Repair, OIR) in natural dialogue, which leads to conversation breakdowns or user disengagement. The key solution is a multimodal model that fuses Conversation Analysis-grounded linguistic features with prosodic features, significantly improving automatic detection of repair initiation in Dutch dialogues; the results show that prosodic cues complement pretrained text and audio embeddings, pointing toward more interactionally robust dialogue systems.
Link: https://arxiv.org/abs/2510.24628
Authors: Anh Ngo,Nicolas Rollet,Catherine Pelachaud,Chloe Clavel
Affiliations: ALMAnaCH, INRIA Paris; Télécom Paris, SES, Institut Polytechnique de Paris, I3-CNRS; Télécom Paris, LTCI, Institut Polytechnique de Paris; CNRS, ISIR, Sorbonne University; ISIR, Sorbonne University
Subjects: Computation and Language (cs.CL)
Comments: 9 pages
Abstract:Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues, exploring multilingual and cross-context corpora to assess the robustness and generalizability.
[NLP-19] Relative Scaling Laws for LLM s
【Quick Read】: This paper tackles a blind spot of traditional scaling laws, which evaluate language models on aggregate test sets and thus hide performance disparities across subpopulations. The key solution is "relative scaling laws", which track how performance gaps between test distributions evolve with scale rather than focusing only on absolute error. Based on 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from 10^18 to 10^20 FLOPs, the study finds diverse trajectories across subpopulations (academic domains, regional English dialects, and clusters of AI-risk behaviors): some converge toward parity while others diverge as scale grows, showing that scaling does not automatically equalize performance.
Link: https://arxiv.org/abs/2510.24626
Authors: William Held,David Hall,Percy Liang,Diyi Yang
Affiliations: Stanford University; OpenAthena; Georgia Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from 10^18 – 10^20 FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.
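A minimal version of the relative-scaling-law analysis fits a scaling curve per test distribution and inspects the gap between the curves as compute grows. The power-law form and all numbers below are illustrative assumptions, not the paper's fitted values.

```python
# Fit L(C) = a * C^(-b) + e per subpopulation; the gap is the "relative" curve.
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, a, b, e):
    return a * np.power(c, -b) + e

compute = np.logspace(18, 20, 6)
c = compute / 1e18                                  # normalize for stable fitting
loss_pop1 = power_law(c, 10.0, 0.25, 2.0)           # e.g., majority dialect
loss_pop2 = power_law(c, 14.0, 0.22, 2.4)           # e.g., minority dialect

p1, _ = curve_fit(power_law, c, loss_pop1, p0=(5.0, 0.2, 1.0))
p2, _ = curve_fit(power_law, c, loss_pop2, p0=(5.0, 0.2, 1.0))

grid = np.logspace(0, 3, 4)                         # ~1e18 to 1e21 FLOPs (normalized)
gap = power_law(grid, *p2) - power_law(grid, *p1)   # relative scaling trajectory
print(gap)  # here the gap narrows with scale but plateaus at an offset
```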
[NLP-20] Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
【Quick Read】: This paper addresses how decoder-only large language models (LLMs) can be adapted efficiently for zero-shot cross-lingual transfer to new tasks while holding up in low-resource languages. Parameter-efficient fine-tuning (PeFT) methods such as LoRA are widely used, but prefix-based techniques (soft prompt tuning, prefix tuning, and Llama Adapter) remain under-explored for decoder-only models. The key contribution is a comprehensive study of three prefix-based methods for zero-shot transfer from English to 35+ high- and low-resource languages, covering transfer across language families and scripts and model scales from 1B to 24B. With only about 1.23M learned parameters, prefix methods outperform LoRA baselines by up to 6% on the Belebele benchmark with Llama 3.1 8B (with similar gains for Mistral v0.3 7B), with advantages especially pronounced for smaller models and low-resource languages, establishing prefix-based techniques as a scalable, efficient alternative to LoRA in multilingual settings.
Link: https://arxiv.org/abs/2510.24619
Authors: Snegha A(1),Sayambhu Sen(2),Piyush Singh Pasi(2),Abhishek Singhania(2),Preethi Jyothi(1) ((1) Indian Institute of Technology Bombay, (2) Amazon Alexa)
Affiliations: Indian Institute of Technology Bombay; Amazon Alexa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages
Abstract:With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
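For readers unfamiliar with prefix-based adaptation: prefix tuning learns a small set of virtual key/value vectors prepended to every attention layer of a frozen decoder. The sketch below shows the shape bookkeeping; it is a generic illustration, not the paper's code or any specific library API.

```python
# Prefix tuning sketch: learnable per-layer KV prefixes for a frozen LM.
import torch
import torch.nn as nn

class PrefixTuner(nn.Module):
    def __init__(self, n_layers: int, n_heads: int, head_dim: int, p: int = 16):
        super().__init__()
        # One learnable (key, value) prefix per layer: (2, p, heads, head_dim)
        self.prefix = nn.Parameter(
            torch.randn(n_layers, 2, p, n_heads, head_dim) * 0.02)

    def past_key_values(self, batch_size: int):
        """Expand prefixes into a per-layer KV cache to hand to a frozen LM."""
        out = []
        for layer in self.prefix:                 # (2, p, heads, head_dim)
            k, v = layer[0], layer[1]
            # -> (batch, heads, p, head_dim), the usual KV-cache layout
            k = k.permute(1, 0, 2).unsqueeze(0).expand(batch_size, -1, -1, -1)
            v = v.permute(1, 0, 2).unsqueeze(0).expand(batch_size, -1, -1, -1)
            out.append((k, v))
        return tuple(out)

tuner = PrefixTuner(n_layers=4, n_heads=8, head_dim=32, p=16)
pkv = tuner.past_key_values(batch_size=2)
print(len(pkv), pkv[0][0].shape)  # 4 layers, torch.Size([2, 8, 16, 32])
```

Only the prefix parameters receive gradients during adaptation; the base model stays frozen, which is what keeps the learned-parameter count around a million rather than billions.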
[NLP-21] Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLM s NEURIPS2025
【Quick Read】: This paper addresses the quadratic cost of attention, which limits the scalability of long-context LLMs, especially in resource-constrained settings. Static sparse methods (sliding windows, global tokens) cannot adapt to content-dependent variation in attention, and prior dynamic approaches rely on predefined templates or heuristics, hurting generality and sometimes pruning contextually important tokens. The key solution is Dynamic Hierarchical Sparse Attention (DHSA), a data-driven mechanism that predicts attention sparsity online without retraining: it adaptively segments sequences into variable-length chunks, computes chunk representations with length-normalized aggregation (scaling averaged embeddings by sqrt(chunk_size) to avoid bias from varying chunk lengths), then upsamples chunk-level similarities to token-level importance scores that decide which token interactions to keep. On Gemma2 with the Needle-in-a-Haystack test and LongBench, DHSA matches dense attention in accuracy while cutting prefill latency by 20-60% and peak memory by 35%, and achieves 6-18% relative accuracy gains over baselines such as block sparse attention at comparable or lower cost.
Link: https://arxiv.org/abs/2510.24606
Authors: Siheng Xiong,Joe Zou,Faramarz Fekri,Yae Jee Cho
Affiliations: Georgia Institute of Technology; Google
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 2025 Workshop on Efficient Reasoning
Abstract:The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens utilizes the sparsity of attention to reduce the cost of attention, but poorly adapts to the content-dependent variations in attention due to their staticity. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.
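The length-normalized chunk aggregation (scaling mean-pooled chunk embeddings by sqrt(chunk_size)) and the chunk-to-token upsampling can be sketched as follows. Segmentation is fixed by hand here, whereas DHSA segments adaptively, so treat this as a simplified illustration of the scoring pipeline.

```python
# DHSA-style chunk scoring: mean-pool, scale by sqrt(len), upsample to tokens.
import torch

def chunk_importance(x: torch.Tensor, bounds: list[tuple[int, int]]):
    """x: (seq, d) token embeddings; bounds: [(start, end), ...] chunk spans."""
    reps = []
    for s, e in bounds:
        mean = x[s:e].mean(dim=0)
        reps.append(mean * (e - s) ** 0.5)   # length-normalized aggregation
    reps = torch.stack(reps)                  # (n_chunks, d)
    sim = reps @ reps.T                       # chunk-level similarity
    # Upsample: every token pair inherits its chunks' similarity score.
    sizes = torch.tensor([e - s for s, e in bounds])
    return torch.repeat_interleave(
        torch.repeat_interleave(sim, sizes, dim=0), sizes, dim=1)  # (seq, seq)

x = torch.randn(10, 16)
scores = chunk_importance(x, [(0, 3), (3, 7), (7, 10)])
mask = scores >= torch.quantile(scores, 0.5)  # keep the top half of interactions
print(scores.shape, mask.float().mean())
```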
[NLP-22] Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way
【Quick Read】: This paper addresses the fixed generation length of current diffusion-based large language models (dLLMs), which must be set as a hyper-parameter before decoding, hurting efficiency and flexibility. The key solution is dLLM-Var, a diffusion LLM with native variable-length generation: the model is trained to accurately predict the [EOS] token in generated text, so it can natively infer in a block-diffusion manner while retaining global bidirectional (full) attention and high parallelism, significantly improving both inference speed and generation accuracy.
Link: https://arxiv.org/abs/2510.24605
Authors: Yicun Yang,Cong Wang,Shaobo Wang,Zichen Wen,Biqing Qi,Hanlin Xu,Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Lab; Huawei
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation, which may enable more efficient generation compared to autoregressive models. However, current dLLMs suffer from fixed generation lengths, which indicates the generation lengths of dLLMs have to be determined before decoding as a hyper-parameter, leading to issues in efficiency and flexibility. To solve these problems, in this work, we propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately predict the [EOS] token in the generated text, which makes a dLLM be able to natively infer in a block diffusion manner, while still maintaining the ability of global bi-directional (full) attention and high parallelism. Experiments on standard benchmarks demonstrate that our method achieves a 30.1x speedup over traditional dLLM inference paradigms and a 2.4x speedup relative to autoregressive models such as Qwen and Llama. Our method achieves higher accuracy and faster inference, elevating dLLMs beyond mere academic novelty and supporting their practical use in real-world applications. Codes and models have been released.
[NLP-23] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
【Quick Read】: This paper addresses the failure of LLM-generated formal statements to preserve the semantic intent of the original natural-language mathematics in autoformalization. Existing approaches treat autoformalization as a simple translation task, lacking the self-reflection and iterative refinement human experts employ, which leads to semantic drift. The key solution, ReForm, tightly integrates semantic-consistency evaluation into the autoformalization process, so the model iteratively generates formal statements, assesses their semantic fidelity, and progressively self-corrects identified errors. To train this reflective model effectively, the authors introduce Prospective Bounded Sequence Optimization (PBSO), which applies different rewards at different sequence positions so the model learns both accurate formalization and sound semantic validation, preventing superficial critiques that would undermine the purpose of reflection.
Link: https://arxiv.org/abs/2510.24592
Authors: Guoxin Chen,Jing Wu,Xinjie Chen,Wayne Xin Zhao,Ruihua Song,Chengxi Li,Kai Fan,Dayiheng Liu,Minpeng Liao
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Tongyi Lab, Alibaba Group; Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments: Ongoing Work
Abstract:Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem’s semantic intent. This limitation arises from the LLM approaches’ treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
[NLP-24] ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
【Quick Read】: This paper addresses the unverified reliability of frontier AI agents in scientific research, in particular the faithfulness and correctness of their attempts to replicate entire research papers. The solution is ReplicationBench, an evaluation framework built on the astrophysics literature. Its key elements are: first, paper-scale task design that splits each paper into tasks covering the full research workflow (experimental setup, derivations, data analysis, and codebase); second, tasks co-developed with the original paper authors and targeting key scientific results, enabling objective, expert-validated evaluation of faithfulness (adherence to original methods) and correctness (technical accuracy); and third, a scalable benchmark for measuring agent reliability in data-driven science. Current frontier models find it extremely challenging, with even the best scoring under 20%.
Link: https://arxiv.org/abs/2510.24591
Authors: Christine Ye,Sihan Yuan,Suchetha Cooray,Steven Dillmann,Ian L. V. Roque,Dalya Baron,Philipp Frank,Sergio Martin-Alvarez,Nolan Koblischke,Frank J Qu,Diyi Yang,Risa Wechsler,Ioana Ciuca
Affiliations: Stanford University; University of Toronto
Subjects: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments:
Abstract:Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper’s core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents’ reliability in scientific research.
zh
[NLP-25] BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation ICASSP2026
【速读】: This paper addresses the performance degradation of automatic speech recognition (ASR) systems in out-of-domain and low-resource scenarios where labeled data is scarce. The key to the solution is the proposed BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation) framework, which adapts the encoder of the Whisper model using unlabeled data. It innovatively combines the BEST-RQ self-supervised learning objective with knowledge distillation from a frozen teacher encoder, thereby keeping the encoder complementary to the pre-trained decoder and significantly improving recognition performance in the air traffic control (ATC) communication domain, which is characterized by heavy noise, non-native speech, and dense specialized phraseology.
链接: https://arxiv.org/abs/2510.24570
作者: Raphaël Bagat,Irina Illina,Emmanuel Vincent
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to ICASSP 2026
Abstract:Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper’s encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder’s complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
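As a rough picture of how the two training signals can coexist, here is a minimal sketch combining a BEST-RQ-style masked-prediction loss (discrete targets from a frozen random projection and codebook) with feature distillation against a frozen teacher encoder. The shapes, masking rate, mixing weight, and the random stand-ins for encoder outputs are assumptions, not BEARD's actual configuration.

```python
import torch
import torch.nn.functional as F

B, T, D, V, C = 4, 150, 512, 8192, 16  # batch, frames, feat dim, codes, code dim
proj = torch.randn(D, C)                # frozen random projection (BEST-RQ)
codebook = torch.randn(V, C)            # frozen random codebook

feats = torch.randn(B, T, D)            # raw acoustic features
z = (feats @ proj).reshape(-1, C)
targets = torch.cdist(z, codebook).argmin(-1).reshape(B, T)  # nearest code ids
mask = torch.rand(B, T) < 0.4           # frames whose inputs were masked

# Random stand-ins for the trainable student encoder's outputs:
student_logits = torch.randn(B, T, V, requires_grad=True)  # masked-prediction head
student_h = torch.randn(B, T, D, requires_grad=True)       # student hidden states
teacher_h = torch.randn(B, T, D)                           # frozen teacher encoder

l_bestrq = F.cross_entropy(student_logits[mask], targets[mask])
l_distill = F.mse_loss(student_h, teacher_h)
loss = l_bestrq + 1.0 * l_distill       # mixing weight is an assumption
loss.backward()
print(float(l_bestrq), float(l_distill))
```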
zh
[NLP-26] Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
【速读】: This paper addresses the scarcity of historical corpora for Korean in natural language processing (NLP), which has hindered quantitative analysis of the language's diachronic evolution. The core challenge is that spoken and written Korean were long disconnected, and the shift from Chinese characters (Hanja) to the Hangul alphabet was a pivotal transition, yet this process has lacked large-scale, openly accessible historical text data. The key to the solution is the construction and release of the Open Korean Historical Corpus, a large open-source corpus spanning 1,300 years, 6 languages, and multiple writing systems (including Korean-style Sinitic (Idu) and Hanja-Hangul mixed script), containing 18 million documents and 5 billion tokens from the 7th century to 2025. This resource provides, for the first time, a quantitative foundation for diachronic Korean linguistics and can serve as a pre-training corpus for large language models, improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
链接: https://arxiv.org/abs/2510.24541
作者: Seyoung Song,Nawon Kim,Songeun Chae,Kiwoong Park,Jiho Jin,Haneul Yoo,Kyunghyun Cho,Alice Oh
机构: 未知
类目: Computation and Language (cs.CL)
备注: Dataset and code available at this https URL
Abstract:The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea’s lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
zh
[NLP-27] Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written
【速读】: This paper addresses the inability of current computational humor detection models to recognize "intentionally bad humor", a form of humor largely absent from traditional humor datasets. The key to the solution is a novel corpus drawn from the Bulwer-Lytton Fiction Contest, consisting of sentences that deliberately aim for absurd effect; analysis shows these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction, and simile. The study also examines LLM-generated sentences in the same style, finding that while LLMs imitate the form, they over-use certain literary devices and introduce far more novel adjective-noun bigrams than human writers, revealing an essential difference between human-written and AI-generated "bad humor".
链接: https://arxiv.org/abs/2510.24538
作者: Venkata S Govindarajan,Laura Biester
机构: Ithaca College (伊萨卡学院); Middlebury College (明德学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand “bad” humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at this https URL
zh
[NLP-28] Levée d'ambiguïtés par grammaires locales
【速读】: This paper addresses part-of-speech (POS) ambiguity in lexical tagging, i.e., how to use context to disambiguate words that admit multiple possible POS tags. The key to the solution is a local disambiguation grammar mechanism designed for a "zero silence rate" objective, which requires that no correct POS tag ever be discarded. The paper shows that, when implementing this objective in Silberztein's INTEX system, the interactions between transducer paths must be taken into account rather than analyzing each transducer in isolation; likewise, the result of combining multiple transducers cannot be predicted by evaluating the components separately. Achieving a zero silence rate therefore requires careful testing of local grammar rules and a detailed specification of grammar behavior.
链接: https://arxiv.org/abs/2510.24530
作者: Eric G. C. Laporte
机构: Institut Gaspard-Monge (加斯帕尔-蒙日研究所); Université de Marne-la-Vallée (马恩河谷大学)
类目: Computation and Language (cs.CL)
备注: in French language
Abstract:Many words are ambiguous in terms of their part of speech (POS). However, when a word appears in a text, this ambiguity is generally much reduced. Disambiguating POS involves using context to reduce the number of POS associated with words, and is one of the main challenges of lexical tagging. The problem of labeling words by POS frequently arises in natural language processing, for example for spelling correction, grammar or style checking, expression recognition, text-to-speech conversion, text corpus analysis, etc. Lexical tagging systems are thus useful as an initial component of many natural language processing systems. A number of recent lexical tagging systems produce multiple solutions when the text is lexically ambiguous or the uniquely correct solution cannot be found. These contributions aim to guarantee a zero silence rate: the correct tag(s) for a word must never be discarded. This objective is unrealistic for systems that tag each word uniquely. This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein’s INTEX system (1993). We present here a formal description of this method. We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately: one needs to verify their interactions. Similarly, if a combination of multiple transducers is used, the result cannot be predicted by considering them in isolation. Furthermore, when examining the initial labeling of a text as produced by INTEX, ideas for disambiguation rules come spontaneously, but grammatical intuitions may turn out to be inaccurate, often due to an unforeseen construction or ambiguity. If a zero silence rate is targeted, local grammars must be carefully tested. This is where a detailed specification of what a grammar will do once applied to texts would be necessary.
zh
[NLP-29] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
【速读】: This paper addresses the shortcomings of Multimodal Large Language Models (MLLMs) in complex scenarios that require visual planning and imagination. The internal visual representations of traditional MLLMs are confined to perceptual understanding and lack the capacity for generative visual thought. The key to the solution is the Latent Sketchpad framework, which repurposes the model's internal visual representations to support generative visual thinking without compromising its reasoning ability. Specifically, it integrates visual generation directly into the model's native autoregressive reasoning process, interleaving textual reasoning with the generation of visual latents; a Context-Aware Vision Head produces the visual latents, and a pretrained Sketch Decoder renders them into human-interpretable images, enhancing both visual reasoning and interpretability.
链接: https://arxiv.org/abs/2510.24514
作者: Huanyu Zhang,Wenshan Wu,Chengzu Li,Ning Shang,Yan Xia,Yangyu Huang,Yifan Zhang,Li Dong,Zhang Zhang,Liang Wang,Tieniu Tan,Furu Wei
机构: MSR(微软研究); UCAS(中国科学院大学); CASIA(中国科学院自动化研究所); Cambridge(剑桥大学); NJU(南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model’s textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: this https URL.
zh
[NLP-30] CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
【速读】: This paper addresses the lack of accurate confidence calibration in Large Language Models (LLMs) in high-stakes settings, which directly affects user trust and the safe use of model outputs. Traditional methods mimic reference confidence expressions but fail to capture the reasoning needed for accurate confidence assessment. The key to the solution is the use of natural language critiques, realized through two mechanisms. First, the paper distinguishes where "uncertainty" (question-focused) and "confidence" (answer-specific) each apply: the former suits open-ended tasks, the latter multiple-choice questions. Second, it proposes two methods: Self-Critique, which lets the model critique and optimize its own confidence beyond mere accuracy, and CritiCal, a novel critique-calibration training method that leverages natural language critiques to improve calibration rather than directly optimizing numerical targets. Experiments show that CritiCal significantly outperforms Self-Critique and other strong baselines, even surpassing its teacher model GPT-4o on complex reasoning tasks, and generalizes robustly out of distribution, improving LLM reliability.
链接: https://arxiv.org/abs/2510.24505
作者: Qing Zong,Jiayu Liu,Tianshi Zheng,Chunyang Li,Baixuan Xu,Haochen Shi,Weiqi Wang,Zhaowei Wang,Chunkit Chan,Yangqiu Song
机构: HKUST
类目: Computation and Language (cs.CL)
备注:
Abstract:Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLM’s reliability.
zh
[NLP-31] A word association network methodology for evaluating implicit biases in LLMs compared to humans
【速读】: This paper addresses the difficulty of detecting and evaluating implicit social biases in Large Language Models (LLMs); such biases are often implicit rather than explicit, hampering accurate assessment of model fairness and social impact. The key to the solution is a word association network methodology that probes the implicit knowledge representations of LLMs by simulating semantic priming within LLM-generated word association networks, using a prompt-based approach to quantitatively and qualitatively assess social biases along dimensions such as gender, religion, ethnicity, sexual orientation, and political affiliation. The method supports direct comparison both across LLMs and against human cognition, providing an interpretable, scalable, and generalizable bias-evaluation framework that advances transparent and socially responsible language technologies.
链接: https://arxiv.org/abs/2510.24488
作者: Katherine Abramski,Giulio Rossetti,Massimo Stella
机构: University of Pisa (比萨大学); National Research Council of Italy (意大利国家研究委员会); University of Trento (特伦托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 13 figures, 3 tables
Abstract:As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.
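The priming simulation can be pictured as spreading activation over a weighted association graph. The sketch below is a toy version under assumed edges and decay: activating a cue node and propagating a decaying share of activation to neighbors yields per-word activation scores whose asymmetries (e.g., between gender-linked nodes) can be read as implicit associations. The paper's networks are built from LLM-generated associations, not hand-coded edges.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("nurse", "hospital", 0.8), ("nurse", "woman", 0.5),
    ("doctor", "hospital", 0.8), ("doctor", "man", 0.5),
])

def prime(graph: nx.Graph, cue: str, steps: int = 2, decay: float = 0.5) -> dict:
    """Activate `cue`, then spread a decaying, weight-scaled share of each
    node's activation to its neighbors for `steps` rounds."""
    act = {n: 0.0 for n in graph}
    act[cue] = 1.0
    for _ in range(steps):
        nxt = dict(act)
        for u in graph:
            for v in graph[u]:
                nxt[v] += decay * act[u] * graph[u][v]["weight"]
        act = nxt
    return act

# Asymmetries between e.g. act["woman"] and act["man"] after priming "nurse"
# are read as implicit associations.
print(prime(G, "nurse"))
```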
zh
[NLP-32] Talk2Ref: A Dataset for Reference Prediction from Scientific Talks
【速读】: This paper addresses the problem of automatically identifying the references associated with scientific talks, helping researchers and students more efficiently discover literature that grounds or enriches a talk's content. The core challenge lies in extracting semantic information from long, unstructured spoken language and accurately matching it to published papers. The key to the solution is the new task of Reference Prediction from Talks (RPT) and the construction of Talk2Ref, the first large-scale dataset of its kind, containing 6,279 scientific talks with an average of 26 cited papers each. On this basis, a dual-encoder architecture is trained, and strategies for handling long transcripts as well as domain-adaptive training are explored, substantially improving citation prediction performance and demonstrating the dataset's value for learning semantic representations of spoken scientific content.
链接: https://arxiv.org/abs/2510.24478
作者: Frederik Broy,Maike Züfle,Jan Niehues
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk’s corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
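A zero-shot baseline for RPT can be as simple as a dual encoder plus cosine similarity. The sketch below uses an off-the-shelf sentence-transformers checkpoint as a stand-in (the model name and toy texts are assumptions, not the fine-tuned Talk2Ref model): embed the talk, embed candidate paper titles or abstracts, and rank by similarity.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # off-the-shelf stand-in
talk = "Today I'll discuss how retrieval augmentation reduces hallucination ..."
papers = [
    "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks",
    "Attention Is All You Need",
    "A Survey on Hallucination in Large Language Models",
]
talk_emb = model.encode(talk, convert_to_tensor=True)
paper_embs = model.encode(papers, convert_to_tensor=True)
scores = util.cos_sim(talk_emb, paper_embs)[0]     # one score per candidate
for score, title in sorted(zip(scores.tolist(), papers), reverse=True):
    print(f"{score:.3f}  {title}")
```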
zh
[NLP-33] Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning and Agentic Systems
【速读】: This paper addresses hallucination in Large Language Models (LLMs), a problem that severely limits their reliable deployment in real-world applications. The key to the solution is a systematic analysis and integration of two mainstream strategies: Retrieval-Augmented Generation (RAG) and reasoning enhancement. By building a taxonomy distinguishing knowledge-based and logic-based hallucinations, the survey explains how each strategy mitigates the different hallucination types, and presents a unified framework, supported by real-world applications, evaluations, and benchmarks, for balancing creativity and reliability.
链接: https://arxiv.org/abs/2510.24476
作者: Yihan Li,Xiyuan Fu,Ghanshyam Verma,Paul Buitelaar,Mingming Liu
机构: Wuhan University (武汉大学); Dublin City University (都柏林城市大学); University of Galway (高威大学); Insight Centre for Data Analytics (数据解析研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 7 figures, 3 tables
Abstract:Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.
zh
[NLP-34] Iterative Critique-Refine Framework for Enhancing LLM Personalization
【速读】: This paper addresses the tendency of personalized text generation to drift from the target user's style, tone, or topical focus. Existing retrieval-augmented approaches (such as LaMP and PGraphRAG) enrich profiles with the interaction histories of the user and their neighbors, but still struggle to maintain consistency during generation. The key to the solution is PerFine, a training-free critique-refine framework that achieves fine-grained adjustment through iterative, profile-grounded feedback: in each iteration, a generator drafts a response conditioned on the retrieved profile, a critic LLM conditioned on the same profile provides structured feedback (covering tone, vocabulary, sentence structure, and topicality), and the generator revises accordingly. A novel knockout strategy retains the stronger draft across iterations, substantially improving personalization and stability without any fine-tuning.
链接: https://arxiv.org/abs/2510.24469
作者: Durga Prasad Maram,Dhruvin Gandhi,Zonghai Yao,Gayathri Akkinapalli,Franck Dernoncourt,Yu Wang,Ryan A. Rossi,Nesreen K. Ahmed
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Adobe Research (Adobe 研究院); University of Oregon (俄勒冈大学); Cisco AI Research (思科人工智能研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Personalized text generation requires models not only to produce coherent text but also to align with a target user’s style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
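Only the control flow of the critique-refine loop with knockout takes a few lines to convey. In the hedged sketch below, llm, generate, critique, and knockout are hypothetical stand-ins for profile-conditioned model calls; the point is that the critic sees the same profile as the generator, and that each iteration keeps whichever of the old and revised drafts the judge prefers.

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client."""
    return f"[output for: {prompt[:40]}...]"

def generate(profile: str, query: str, feedback: str = "") -> str:
    return llm(f"Profile: {profile}\nFeedback: {feedback}\nWrite about: {query}")

def critique(profile: str, draft: str) -> str:
    return llm(f"Profile: {profile}\nCritique tone/vocabulary/structure/topic of: {draft}")

def knockout(profile: str, old: str, new: str) -> str:
    verdict = llm(f"Profile: {profile}\nWhich draft fits better, A or B?\nA: {old}\nB: {new}")
    return new if "B" in verdict else old   # keep whichever the judge prefers

def perfine(profile: str, query: str, iterations: int = 3) -> str:
    best = generate(profile, query)
    for _ in range(iterations):
        revised = generate(profile, query, feedback=critique(profile, best))
        best = knockout(profile, best, revised)
    return best

print(perfine("concise, food-focused Yelp reviewer", "a new ramen shop"))
```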
zh
[NLP-35] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices LREC2026
【速读】: This paper addresses the lag in evaluation practices for large language models (LLMs) in non-English settings, particularly for European languages. The key to the solution is a new taxonomy for categorizing benchmarks tailored to multilingual or non-English use scenarios, together with a set of best practices and quality standards intended to foster coordinated benchmark development for European languages. In particular, the authors advocate greater language and culture sensitivity in evaluation methods, strengthening the reliability and fairness of LLM evaluation in non-English contexts.
链接: https://arxiv.org/abs/2510.24450
作者: Špela Vintar,Taja Kuzman Pungeršek,Mojca Brglez,Nikola Ljubešić
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure. Submitted to the LREC 2026 conference
Abstract:While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
zh
[NLP-36] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
【速读】: This paper addresses the insufficient robustness of multimodal large language models (MLLMs) to semantic variation in text for vision-language tasks, in particular the performance drop when users express the same intent in different but semantically equivalent ways. Prior work has focused mainly on perturbing image inputs while neglecting semantically invariant but syntactically varied textual perturbations (adversarial paraphrases), which matter in real-world use. The key to the solution is a novel adversarial paraphrasing task, generating grammatically correct paraphrases that preserve the original query meaning while significantly degrading segmentation performance, and SPARTA, a black-box, sentence-level optimization method guided by reinforcement learning that searches the low-dimensional semantic latent space of a text autoencoder to efficiently produce high-quality adversarial paraphrases. Experiments on ReasonSeg and LLMSeg-40k show that SPARTA achieves success rates up to 2x higher than prior methods, revealing that state-of-the-art reasoning segmentation models remain notably vulnerable even under strict semantic and grammatical constraints.
链接: https://arxiv.org/abs/2510.24446
作者: Viktoriia Zinkovich,Anton Antonov,Andrei Spiridonov,Denis Shepelev,Andrey Moskalenko,Daria Pugacheva,Elena Tutubalina,Andrey Kuznetsov,Vlad Shakhuro
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases-crucial in real-world applications where users express the same intent in varied ways-remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA-a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing-even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
zh
[NLP-37] Law in Silico: Simulating Legal Society with LLM -Based Agents
【速读】: This paper starts from the fact that real-world legal experiments are often costly or infeasible, and proposes using generative AI to build a computational framework for simulating legal societies, in order to verify and advance legal theory and support legal administration. The key to the solution is the design and implementation of Law in Silico, an LLM-based agent framework that simulates individual decision-making together with institutional mechanisms of legislation, adjudication, and enforcement. At the macro level it reproduces crime-rate trends, and at the micro level it shows that a well-functioning, transparent, and adaptive legal system offers better protection of the rights of vulnerable individuals.
链接: https://arxiv.org/abs/2510.24442
作者: Yiding Wang,Yuxuan Chen,Fanxu Meng,Xifan Chen,Xiaolei Yang,Muhan Zhang
机构: Peking University (北京大学); The University of Hong Kong (香港大学); BIGAI (通用人工智能国家重点实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:
Abstract:Since real-world legal experiments are often costly or infeasible, simulating legal societies with Artificial Intelligence (AI) systems provides an effective alternative for verifying and developing legal theory, as well as supporting legal administration. Large Language Models (LLMs), with their world knowledge and role-playing capabilities, are strong candidates to serve as the foundation for legal society simulation. However, the application of LLMs to simulate legal systems remains underexplored. In this work, we introduce Law in Silico, an LLM-based agent framework for simulating legal scenarios with individual decision-making and institutional mechanisms of legislation, adjudication, and enforcement. Our experiments, which compare simulated crime rates with real-world data, demonstrate that LLM-based agents can largely reproduce macro-level crime trends and provide insights that align with real-world observations. At the same time, micro-level simulations reveal that a well-functioning, transparent, and adaptive legal system offers better protection of the rights of vulnerable individuals.
zh
[NLP-38] Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content NEURIPS2025
【速读】: This paper addresses the risks of generative AI in Islamic guidance applications, including misquoted texts, misapplied jurisprudence, and culturally inconsistent responses, which can seriously compromise the accuracy of faith-sensitive content. The key to the solution is a dual-agent evaluation framework: a quantitative agent performs citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citation quality), while a qualitative agent conducts five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). The framework systematically evaluates GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs, providing benchmarks and improvement directions for more reliable AI in high-stakes domains such as religion, medicine, and law.
链接: https://arxiv.org/abs/2510.24438
作者: Abdullah Mushtaq,Rafay Naeem,Ezieddin Elmahjub,Ibrahim Ghaznavi,Shawqi Al-Maliki,Mohamed Abdallah,Ala Al-Fuqaha,Junaid Qadir
机构: Information Technology University (信息技术大学); Qatar University (卡塔尔大学); Hamad Bin Khalifa University (哈马德本哈利法大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: 5th Muslims in Machine Learning (MusIML) Workshop
Abstract:Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations – a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
zh
[NLP-39] LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
【速读】: This paper addresses the limited effectiveness of instruction tuning for large language models (LLMs) in low-resource settings, in particular the lack of high-quality training data for Luxembourgish. The key to the solution is LuxIT, a monolingual instruction-tuning dataset synthesized from a corpus of native Luxembourgish texts using DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish, followed by an LLM-as-a-judge quality-assurance process. The dataset is an important resource for Luxembourgish NLP and demonstrates a replicable monolingual methodology, although benchmarking results are mixed across models, highlighting the need for further research to optimize its application.
链接: https://arxiv.org/abs/2510.24434
作者: Julian Valline,Cedric Lothritz,Jordi Cabot
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
zh
[NLP-40] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
【速读】: This paper addresses the confound that parametric world knowledge introduces when evaluating the reasoning abilities of language models (LMs): benchmark performance often reflects factual recall rather than genuine reasoning. To disentangle reasoning complexity from factual knowledge, the paper proposes SynthWorlds, whose key idea is to construct two parallel corpora with identical interconnected structure: a real-mapped world, where models can exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. Mirrored tasks (multi-hop question answering and page navigation) keep reasoning difficulty equal across worlds, cleanly quantifying the performance boost models gain from memorized knowledge (the knowledge advantage gap). The framework is fully automatic and scalable, enabling controlled, precise, and testable comparisons of reasoning versus memorization.
链接: https://arxiv.org/abs/2510.24427
作者: Ken Gu,Advait Bhat,Mike A Merrill,Robert West,Xin Liu,Daniel McDuff,Tim Althoff
机构: University of Washington (华盛顿大学); Stanford University (斯坦福大学); EPFL (洛桑联邦理工学院); Google Research (谷歌研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
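The knowledge advantage gap itself is a one-line quantity; a toy computation (with made-up per-item results) makes the definition concrete.

```python
# Toy per-item correctness for the same mirrored tasks in the two worlds.
real_correct  = [1, 1, 0, 1, 1, 0, 1, 1]   # real-mapped: parametric knowledge helps
synth_correct = [1, 0, 0, 1, 0, 0, 1, 0]   # synthetic-mapped: it cannot

acc_real  = sum(real_correct)  / len(real_correct)
acc_synth = sum(synth_correct) / len(synth_correct)
print(f"knowledge advantage gap = {acc_real - acc_synth:.3f}")  # 0.375
```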
zh
[NLP-41] Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models EMNLP2025
【速读】: This paper addresses two key problems in existing knowledge-distillation approaches to lightweight sentiment analysis: first, manually written instructions are limited in diversity and quantity, making comprehensive coverage of the distilled knowledge hard to guarantee; second, large-scale user texts incur high computational cost, limiting practicality. The core of the solution is COMPEFFDIST, a comprehensive and efficient distillation framework with two key modules, attribute-based automatic instruction construction and difficulty-based data filtering, which respectively tackle the two challenges, improving instruction diversity and data efficiency. The method enables 3B student models to match 20x larger teacher models on most tasks while using only 10% of the data, significantly improving the data efficiency and practicality of model training.
链接: https://arxiv.org/abs/2510.24425
作者: Guangyu Xie,Yice Zhang,Jianzhu Bao,Qianlong Wang,Yang Sun,Bingbing Wang,Ruifeng Xu
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); Peng Cheng Laboratory, Shenzhen, China (鹏城实验室); Nanyang Technological University, Singapore (南洋理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2025. 22 pages, 9 figures. The first two authors contribute equally
Abstract:Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce COMPEFFDIST, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data.
zh
[NLP-42] OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
【速读】: This paper addresses the safety risks that computer-using agents built on Vision-Language Models (VLMs) can pose in mobile digital environments, such as system compromise and privacy leakage. Because the operational space of mobile environments is vast and dynamic, existing methods struggle to detect such threats effectively. The key to the solution is OS-Sentinel, a hybrid safety detection framework that synergistically combines a Formal Verifier, which detects explicit system-level violations, with a VLM-based Contextual Judge, which assesses the contextual risk and reasonableness of agent actions, enabling multi-dimensional, fine-grained safety detection for mobile agents.
链接: https://arxiv.org/abs/2510.24411
作者: Qiushi Sun,Mukai Li,Zhoumianze Liu,Zhihui Xie,Fangzhi Xu,Zhangyue Yin,Kanzhi Cheng,Zehao Li,Zichen Ding,Qi Liu,Zhiyong Wu,Zhuosheng Zhang,Ben Kao,Lingpeng Kong
机构: The University of Hong Kong (香港大学); Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室); Nanyang Technological University (南洋理工大学); Nanjing University (南京大学); Shanghai Jiao Tong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: work in progress
Abstract:Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
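The hybrid idea, deterministic rules for explicit violations plus a learned judge for contextual risk, can be sketched in a few lines. The rule patterns, the stubbed VLM judge, and the threshold below are illustrative assumptions, not OS-Sentinel's actual detectors.

```python
import re

SYSTEM_RULES = [r"rm\s+-rf", r"pm\s+uninstall", r"settings\s+put\s+secure"]

def formal_verifier(action: str) -> bool:
    """Explicit system-level violations, caught deterministically."""
    return any(re.search(p, action) for p in SYSTEM_RULES)

def vlm_judge(screen: str, action: str) -> float:
    """Hypothetical contextual judge; replace with a real VLM call."""
    return 0.9 if "bank" in screen and "send" in action else 0.1

def is_unsafe(screen: str, action: str, tau: float = 0.5) -> bool:
    return formal_verifier(action) or vlm_judge(screen, action) > tau

print(is_unsafe("bank transfer screen", "tap send button"))  # True (contextual)
print(is_unsafe("file manager", "run rm -rf /sdcard"))       # True (formal rule)
```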
zh
[NLP-43] Text Simplification with Sentence Embeddings
【速读】: This paper addresses the problem that text simplification models tend to be large, costly to train, and limited in generalization. The core of the solution is to learn a transformation between sentence embeddings of high-complexity and low-complexity texts, rather than relying directly on large sequence-to-sequence (Seq2Seq) or LLM architectures. The key insight is that a small feed-forward neural network can effectively capture the semantic differences between complexity levels in embedding space, preserving and converting complexity while also transferring to an unseen dataset (MedEASI) and to languages outside the training data (Spanish, German), demonstrating that embedding-space transformations are a promising direction for small but powerful text simplification models.
链接: https://arxiv.org/abs/2510.24365
作者: Matthew Shardlow
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Sentence embeddings can be decoded to give approximations of the original texts used to create them. We explore this effect in the context of text simplification, demonstrating that reconstructed text embeddings preserve complexity levels. We experiment with a small feed forward neural network to effectively learn a transformation between sentence embeddings representing high-complexity and low-complexity texts. We provide comparison to a Seq2Seq and LLM-based approach, showing encouraging results in our much smaller learning setting. Finally, we demonstrate the applicability of our transformation to an unseen simplification dataset (MedEASI), as well as datasets from languages outside the training data (ES,DE). We conclude that learning transformations in sentence embedding space is a promising direction for future research and has potential to unlock the ability to develop small, but powerful models for text simplification and other natural language generation tasks.
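The core model is small enough to write out. Below is a minimal sketch, assuming paired (complex, simple) sentence embeddings are already available (random tensors stand in here): a two-layer feed-forward network is trained with an MSE objective to map complex-text embeddings toward simple-text embeddings; decoding the mapped vector back to text is a separate step not shown.

```python
import torch
import torch.nn as nn

dim = 768
complex_emb = torch.randn(256, dim)   # stand-ins for embeddings of complex texts
simple_emb  = torch.randn(256, dim)   # paired embeddings of their simplifications

mlp = nn.Sequential(nn.Linear(dim, 1024), nn.ReLU(), nn.Linear(1024, dim))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(mlp(complex_emb), simple_emb)  # map complex -> simple space
    loss.backward()
    opt.step()
print(float(loss))  # decoding mlp(new_complex_emb) back to text is a later step
```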
zh
[NLP-44] Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
【速读】: This paper addresses two major problems with current benchmarks for code agents: high annotation cost with strict expertise requirements, and rigid evaluation metrics that rely primarily on unit tests and fail to reflect agents' abilities on real projects. The key to the solution is an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse, challenging project-level tasks, yielding PRDBench, a benchmark of 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. An Agent-as-a-Judge paradigm scores agent outputs, supporting test types beyond unit tests and enabling a more flexible, scalable evaluation framework that better matches real development scenarios.
链接: https://arxiv.org/abs/2510.24358
作者: Lingyue Fu,Bolun Zhang,Hao Guan,Yaoming Zhu,Lin Qiu,Weiwen Liu,Xuezhi Cao,Xunliang Cai,Weinan Zhang,Yong Yu
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs) and widely adopted tools. However, existing benchmarks for code agent evaluation face two major limitations: high annotation cost and expertise requirements, and rigid evaluation metrics that rely primarily on unit tests. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse and challenging project-level tasks. Based on this approach, we introduce PRDBench, a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. PRDBench features rich data sources, high task complexity, and flexible metrics. We further employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests. Extensive experiments on PRDBench demonstrate its effectiveness in assessing the capabilities of both code agents and evaluation agents, providing a scalable and robust framework for annotation and evaluation.
zh
[NLP-45] LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability EMNLP
【速读】: This paper addresses the challenge large language models (LLMs) face in long-form generation: remaining informative and factually accurate while satisfying the complex constraints of real-world scenarios. Existing benchmarks either rely on hard-to-verify metrics or adopt oversimplified synthetic setups that fail to reflect practical needs. The key innovation of LongWeave is the Constraint-Verifier Evaluation (CoV-Eval) mechanism: it first defines verifiable targets within real-world scenarios and then systematically generates the corresponding queries, textual materials, and constraints, making tasks both realistic and objectively assessable. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven tasks; evaluating 23 LLMs shows that even state-of-the-art models face significant challenges as real-world complexity and output length increase.
链接: https://arxiv.org/abs/2510.24345
作者: Zikai Xiao,Fei Huang,Jianhong Tu,Jianhui Wei,Wen Ma,Yuxuan Zhou,Jian Wu,Bowen Yu,Zuozhu Liu,Junyang Lin
机构: Zhejiang University (浙江大学); Qwen Team, Alibaba Group (阿里巴巴集团通义实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP Findings 2025
Abstract:Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
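A CoV-Eval-style check reduces to defining targets that a program can verify. The sketch below shows toy constraints (length, required keywords, section count); the thresholds and patterns are assumptions for illustration.

```python
import re

def verify(output: str) -> dict:
    """Toy verifiable targets in the spirit of CoV-Eval (all thresholds assumed)."""
    return {
        "min_500_words": len(output.split()) >= 500,
        "covers_keywords": all(k in output.lower() for k in ("budget", "timeline", "risk")),
        "at_least_3_sections": len(re.findall(r"^## ", output, flags=re.M)) >= 3,
    }

draft = "## Budget\nThe budget is ...\n## Timeline\n...\n## Risk\n..."
checks = verify(draft)
print(checks, "constraint satisfaction:", sum(checks.values()) / len(checks))
```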
zh
[NLP-46] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
【速读】: This paper addresses the uneven performance of Large Language Models (LLMs) on culturally grounded and dialectal content, particularly across varieties of Arabic (Modern Standard Arabic, MSA, and Arabic dialects). The key to the solution is a systematic method that (i) translates MSA multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks zero-shot and fine-tuned LLMs in both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. This yields, to the authors' knowledge, the first QA dataset parallelly aligned across multiple language varieties, advancing culturally and linguistically inclusive evaluation.
链接: https://arxiv.org/abs/2510.24328
作者: Hunzalah Hassan Bhatti,Firoj Alam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Cultural Knowledge, Everyday Knowledge, Open-Ended Question, Chain-of-Thought, Large Language Models, Native, Multilingual, Language Diversity
Abstract:Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
zh
[NLP-47] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
【速读】: This paper addresses a bottleneck in training critiquing language models for complex reasoning: existing approaches rely on stronger supervisors to annotate critique data, making it hard to train high-quality critics efficiently. The key to the solution is Critique-RL, an online reinforcement learning (RL) framework that requires no stronger supervision and adopts a two-stage optimization strategy. Stage I reinforces the critic's discriminability with direct rule-based reward signals, ensuring it can accurately tell high-quality from low-quality outputs; stage II introduces indirect rewards based on the actor's refinement to improve the critic's helpfulness, while regularization maintains its discriminability. Experiments across multiple tasks and models show substantial gains, e.g., a 9.02% accuracy gain on in-domain tasks and 5.70% on out-of-domain tasks for Qwen2.5-7B.
链接: https://arxiv.org/abs/2510.24320
作者: Zhiheng Xi,Jixuan Huang,Xin Guo,Boyang Hong,Dingwen Yang,Xiaoran Fan,Shuo Li,Zehui Chen,Junjie Ye,Siyu Yuan,Zhengyin Du,Xuesong Yao,Yufei Xu,Jiecao Chen,Rui Zheng,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); ByteDance Seed (字节跳动种子团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint, 25 pages, 9 figures. Code: this https URL
Abstract:Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor’s outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic’s helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
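The two-stage reward structure can be made concrete with a small sketch. The reward shapes and the weight beta below are illustrative assumptions: stage I pays the critic only for a correct verdict (discriminability), while stage II pays for the actor's improvement after refinement (helpfulness) plus a regularizing share of the stage-I signal.

```python
def stage1_reward(verdict: bool, answer_is_correct: bool) -> float:
    """Direct rule-based signal: did the critic's verdict match the label?"""
    return 1.0 if verdict == answer_is_correct else -1.0

def stage2_reward(verdict: bool, answer_is_correct: bool,
                  refined_is_correct: bool, beta: float = 0.5) -> float:
    """Indirect signal from the actor's refinement, regularized so the
    critic keeps its discriminability."""
    helpfulness = 1.0 if refined_is_correct else 0.0
    return helpfulness + beta * stage1_reward(verdict, answer_is_correct)

# The critic correctly flags a wrong answer and the actor's refinement fixes it:
print(stage2_reward(verdict=False, answer_is_correct=False, refined_is_correct=True))
```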
zh
[NLP-48] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
【速读】: This paper addresses a weakness of current reinforcement learning with verifiable rewards (RLVR) for training reasoning in large language models: the limited diversity of sampled trajectories within group rollouts weakens the signals used for policy updates. The core bottleneck is that token-level stochastic sampling lets local variations quickly collapse into near-identical reasoning paths, limiting effective policy learning. The key to the solution is Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy that iterates through three stages: (1) branching at high-uncertainty generation steps; (2) performing lookahead simulation for each new branch; and (3) pruning branches that exhibit prolonged similarity during simulation. This explicitly promotes trajectory-level diversity and markedly improves policy-learning efficiency and final performance.
链接: https://arxiv.org/abs/2510.24302
作者: Shangyu Xing,Siyuan Wang,Chenyuan Yang,Xinyu Dai,Xiang Ren
机构: Nanjing University (南京大学); University of Southern California (南加州大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at this https URL.
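A toy version of the rollout logic fits in a few lines. In the sketch below, the probability vector, the simulate callback, and the token-overlap similarity are stand-ins for real model calls: branching happens only when next-token entropy exceeds a threshold, each branch gets a short lookahead, and a branch is dropped if its lookahead is too similar to one already kept.

```python
import math, random

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def overlap(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))

def latr_step(prefix, probs, simulate, tau_h=1.2, tau_sim=0.8, k=3):
    """Branch only at high-entropy steps; prune branches whose lookahead
    simulations are near-duplicates of an already-kept branch."""
    if entropy(probs) < tau_h:                       # low uncertainty: no branch
        return [prefix]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    kept = []                                        # (branch_prefix, lookahead)
    for t in top:
        look = simulate(prefix, t)                   # short rollout for this branch
        if all(overlap(look, other) < tau_sim for _, other in kept):
            kept.append((f"{prefix} tok{t}", look))
    return [branch for branch, _ in kept]

random.seed(0)
sim = lambda p, t: f"{p} then step {t} gives {random.choice('abc')}"
print(latr_step("Let x be", [0.3, 0.3, 0.2, 0.2], sim))
```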
zh
[NLP-49] MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
【速读】: This paper addresses the marked performance drops natural language inference (NLI) models show on test items that are semantically unchanged but minimally rephrased, i.e., their lack of robustness to minimal semantics-preserving variation. The key to the solution is MERGE (Minimal Expression-Replacements GEneralization), a methodology that automatically generates high-quality generalization-test variants of NLI problems by replacing open-class words while strictly preserving the underlying reasoning. The method evaluates the consistency of model predictions across reasoning-preserving variants, revealing that current NLI models perform 4-20% worse on such minimally altered problems and remain fragile to small changes in expression.
链接: https://arxiv.org/abs/2510.24295
作者: Mădălina Zgreabăn,Tejaswini Deoskar,Lasha Abzianidze
机构: Utrecht Institute of Linguistics OTS, Utrecht University (乌得勒支大学语言学研究所); The Netherlands (荷兰)
类目: Computation and Language (cs.CL)
备注: Pre-print
Abstract:In recent years, many generalization benchmarks have shown language models’ lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test as MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models’ predictions across reasoning-preserving variants of the original problem. Our results show that NLI models’ perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how word class of the replacements, word probability, and plausibility influence NLI models’ performance.
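The variant-generation step can be illustrated with a toy substitution table. In the sketch below (the word lists and the example problem are assumptions, and the paper controls plausibility far more carefully), one replacement is chosen per open-class word and applied consistently to both premise and hypothesis, so the entailment label is preserved by construction.

```python
import random

problem = {"premise": "The lawyer bought a red car.",
           "hypothesis": "The lawyer bought a car."}    # label: entailment
swaps = {"lawyer": ["teacher", "farmer"],
         "red": ["blue", "old"],
         "car": ["bike", "boat"]}

def make_variant(prob: dict) -> dict:
    """Pick one replacement per open-class word and apply it to every field,
    so the underlying reasoning (and the label) is preserved."""
    choice = {old: random.choice(new) for old, new in swaps.items()}
    def sub(text: str) -> str:
        for old, new in choice.items():
            text = text.replace(old, new)
        return text
    return {field: sub(text) for field, text in prob.items()}

random.seed(0)
for _ in range(2):
    print(make_variant(problem))   # still entailment, different surface words
```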
zh
[NLP-50] ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Models
【速读】: This paper addresses the bottleneck that limited fine-grained visual perception imposes on Vision-Language Models (VLMs) in real-world applications. Existing approaches fall short: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception, making fine-grained visual understanding hard to improve. The key to the solution is a two-stage task framework that casts visual perception learning as a coarse-to-fine progressive process, together with ViPER (Visual Perception Enhancement via Reinforcement), a self-bootstrapping framework designed for iterative evolution through self-critiquing and self-prediction. By synergistically combining image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER forms a closed training loop in which internally synthesized data directly fuel perceptual improvement. This yields up to 6.0% gains in fine-grained perception (1.7% on average across seven benchmarks), while also providing concrete evidence for the reciprocal relationship between generation and understanding, pointing toward more autonomous and capable VLMs.
链接: https://arxiv.org/abs/2510.24285
作者: Juntian Zhang,Song Jin,Chuanqi Cheng,Yuhan Liu,Yankai Lin,Xun Zhang,Yufei Zhang,Fei Jiang,Guojun Yin,Wei Lin,Rui Yan
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Meituan (美团); MBZUAI; Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
zh
[NLP-51] Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?
【速读】: This paper investigates whether large language models (LLMs) can align human natural-language instructions with the internal symbolic representations that emerge in reinforcement learning agents, focusing on how well they perform across different partition granularities and task complexities. The key to the solution is a structured evaluation framework that quantifies how well mainstream LLMs (GPT, Claude, Deepseek, and Grok) translate natural language into the internal symbolic partitions produced by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. The results expose limitations in current LLMs' capacity for aligning language with internal agent representations and highlight the need for further research on robust alignment.
链接: https://arxiv.org/abs/2510.24259
作者: Ziqi Ma,Sao Mai Nguyen,Philippe Xu
机构: U2IS, ENSTA, IP-Paris (Institut Polytechnique de Paris)
类目: Computation and Language (cs.CL); Robotics (cs.RO)
备注:
Abstract:Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs – GPT, Claude, Deepseek and Grok – across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.
zh
[NLP-52] From Memorization to Reasoning in the Spectrum of Loss Curvature
【速读】: This paper addresses how memorization is represented in neural networks and how to remove it, i.e., how to eliminate unwanted memorized content without harming overall model performance. The key to the solution is a decomposition of model weights based on loss-landscape curvature, identifying and editing weight components in low-curvature directions, which are closely associated with data the model has memorized without explicit labels. This curvature-based weight editing suppresses recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet) while maintaining lower perplexity. The analysis further shows that such edits specifically and consistently harm fact retrieval and arithmetic, while open-book fact retrieval and general logical reasoning are preserved, suggesting these tasks rely on narrow, specialized directions in weight space rather than general-purpose mechanisms.
链接: https://arxiv.org/abs/2510.24256
作者: Jack Merullo,Srihita Vatsavaya,Lucius Bushnaq,Owen Lewis
机构: Goodfire
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data’s activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
zh
[NLP-53] Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations
【速读】: This paper addresses the mismatch between LLM-generated dialogue and age-appropriate children's speech, especially in low-resource languages such as Norwegian, where the lack of high-quality age-matched corpora leads to output that skews linguistically mature. The key to the solution is a comparative evaluation of five LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) generating Norwegian conversations for children aged 5 and 9, blind-rated by eleven education professionals against real child interview data. The study quantifies deviations in linguistic complexity and developmental appropriateness, exposing data-related challenges and providing empirical grounding and directions for future child-oriented language models.
链接: https://arxiv.org/abs/2510.24250
作者: Syed Zohaib Hassan,Pål Halvorsen,Miriam S. Johnson,Pierre Lison
机构: SimulaMet(模拟元); Oslo Metropolitan University (奥斯陆城市大学); Harvard University (哈佛大学); University of Oslo (奥斯陆大学); Norwegian Computing Center (挪威计算中心)
类目: Computation and Language (cs.CL)
备注: 11 pages excluding references and appendix. 3 figures and 6 tables
Abstract:Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.
[NLP-54] Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
[Quick Read]: This paper tackles diacritic restoration (DR) for Arabic dialectal text, i.e., accurately recovering the missing diacritical marks in undiacritized input. The key to the solution is a multimodal approach that fuses text and speech: the text modality is represented with the encoder of the authors' own pre-trained model CATT, while the speech modality uses the encoder of the OpenAI Whisper base model; two strategies implement early fusion (merging 150 averaged speech tokens with the text tokens after a linear projection) and cross-attention fusion (aligning text and speech embeddings through cross-modal attention), with a CATT classification head producing token-level diacritic predictions. In addition, randomly deactivating the speech input during training improves robustness, letting the model perform reliably with or without speech.
Link: https://arxiv.org/abs/2510.24247
Authors: Ahmad Ghannam,Naif Alharthi,Faris Alasmary,Kholood Al Tabash,Shouq Sadah,Lahouari Ghouti
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
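For readers who want the early-fusion arithmetic spelled out, here is a minimal sketch of that path; the dimensions and module names are assumptions based on the abstract, not the authors' code:

```python
# Sketch of the early-fusion path: Whisper-base encoder states (1500 frames) are
# averaged over windows of 10 consecutive frames into 150 speech tokens, linearly
# projected to the text embedding size, and merged with the text token embeddings.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, d_speech=512, d_text=768):
        super().__init__()
        self.proj = nn.Linear(d_speech, d_text)  # embedding-compatibility layer

    def forward(self, speech_states, text_embeds):
        B, T, D = speech_states.shape            # (B, 1500, d_speech)
        pooled = speech_states.view(B, T // 10, 10, D).mean(dim=2)
        speech_tokens = self.proj(pooled)        # (B, 150, d_text)
        return torch.cat([speech_tokens, text_embeds], dim=1)

fused = EarlyFusion()(torch.randn(2, 1500, 512), torch.randn(2, 40, 768))
print(fused.shape)  # torch.Size([2, 190, 768]) -> fed to the contextual encoder
```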
[NLP-55] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? NEURIPS2025
[Quick Read]: This paper addresses the lack of faithfulness in explanations generated by Large Language Models (LLMs) in sensitive domains such as healthcare, where explanations may omit salient clinical cues or mask spurious correlations, undermining clinician trust and enabling unsafe decision support. The key to the solution lies in controllable inference- and training-time choices that improve explanation faithfulness: the study finds that the quantity and quality of few-shot examples, prompt design, and the instruction-tuning phase all significantly affect the faithfulness of generated explanations, and that well-constructed few-shot examples and targeted prompting strategies can measurably improve explanation trustworthiness on medical question answering tasks such as MedQA.
Link: https://arxiv.org/abs/2510.24236
Authors: Teague McMillan,Gabriele Dominici,Martin Gjoreski,Marc Langheinrich
Institutions: ETH Zurich; Università della Svizzera italiana
Categories: Computation and Language (cs.CL)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Abstract:Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
[NLP-56] HACK: Hallucinations Along Certainty and Knowledge Axes
[Quick Read]: This paper targets reliable identification and targeted mitigation of hallucinations in large language models (LLMs). Existing work mostly categorizes hallucinations by their external appearance, ignoring differences in underlying mechanisms and leaving no mitigation strategies tailored to distinct causes. The paper proposes a two-axis taxonomy along knowledge and certainty: on the knowledge axis it separates hallucinations caused by missing knowledge from those that occur even though the model holds the correct knowledge; on the certainty axis it isolates a particularly concerning subset where the model hallucinates with high confidence despite knowing the correct answer internally. The key to the solution is a model-specific dataset construction pipeline that differentiates these types, validation of the knowledge axis via activation steering mitigation, and a new evaluation metric measuring how mitigation methods fare on high-certainty hallucinations, which reveals that methods that look strong on average fail disproportionately on this critical subset and underscores the need for targeted mitigation that accounts for both knowledge state and certainty level.
Link: https://arxiv.org/abs/2510.24222
Authors: Adi Simhi,Jonathan Herzig,Itay Itzhak,Dana Arad,Zorik Gekhman,Roi Reichart,Fazl Barez,Gabriel Stanovsky,Idan Szpektor,Yonatan Belinkov
Institutions: Technion – IIT; Google Research; Oxford University; WhiteBox; Hebrew University; Harvard University
Categories: Computation and Language (cs.CL)
Comments: The code is available at this https URL
Abstract:Hallucinations in LLMs present a critical barrier to their reliable usage. Existing research usually categorizes hallucination by their external properties rather than by the LLMs’ underlying internal properties. This external focus overlooks that hallucinations may require tailored mitigation strategies based on their underlying mechanism. We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. Since parametric knowledge and certainty may vary across models, our categorization method involves a model-specific dataset construction process that differentiates between those types of hallucinations. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge and those occurring despite the model having the knowledge of the correct response. To validate our framework along the knowledge axis, we apply steering mitigation, which relies on the existence of parametric knowledge to manipulate model activations. This addresses the lack of existing methods to validate knowledge categorization by showing a significant difference between the two hallucination types. We further analyze the distinct knowledge and hallucination patterns between models, showing that different hallucinations do occur despite shared parametric knowledge. Turning to the certainty axis, we identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally. We introduce a new evaluation metric to measure the effectiveness of mitigation methods on this subset, revealing that while some methods perform well on average, they fail disproportionately on these critical cases. Our findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for targeted mitigation approaches that consider the hallucination underlying factors.
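As a rough illustration of the steering mitigation mentioned above, the sketch below adds a steering vector to one layer's hidden states at inference time; the model, layer index, vector, and scale are all placeholders rather than the paper's setup:

```python
# Generic activation-steering sketch: a steering vector (e.g., the mean activation
# difference between correct and hallucinated continuations) is added to a chosen
# layer's hidden states during generation via a forward hook.
import torch

def add_steering_hook(layer, steer_vec, alpha=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steer_vec.to(hidden.dtype)  # shift activations
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage with a Hugging Face causal LM:
# handle = add_steering_hook(model.model.layers[15], steer_vec)
# out = model.generate(**inputs)
# handle.remove()
```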
[NLP-57] Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
[Quick Read]: This paper addresses fine-grained parametric knowledge transfer (PKT) between large language models (LLMs), in particular the inefficiency caused by "neural incompatibility", the architectural and parametric differences between models of different scales. The key to the solution is identifying semantic alignment in latent space as the fundamental prerequisite for cross-scale knowledge transfer, and changing the medium of layer-wise transfer from direct reuse of layer parameters to activations, which yields more effective, flexible, and behaviorally consistent knowledge transfer.
Link: https://arxiv.org/abs/2510.24208
Authors: Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang
Institutions: Cranberry-Lemon University; University of the Witwatersrand; Monash University; Technical University of Munich; Chongqing University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: an early-stage version
Abstract:Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which is accessible to locate, trace, and analyze. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Due to neural incompatibility, referring to the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
[NLP-58] Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability
[Quick Read]: This paper studies, from an interpretability angle, how integrating external knowledge affects generation quality in Natural Language Generation (NLG), with a focus on commonsense generation. The core challenge is quantifying the role external knowledge plays in generated sentences and verifying its impact on semantic coherence and concept coverage. The key to the solution is KITGI, a new benchmark that pairs input concept sets with semantic relations retrieved from ConceptNet and includes manually annotated outputs, forming controlled knowledge conditions, together with a three-stage interpretability method: identifying and removing key knowledge, regenerating sentences, and manually assessing outputs for commonsense plausibility and concept coverage. With full external knowledge, generated sentences reach 91% correctness on both criteria, whereas filtering out highly relevant knowledge collapses performance to 6%, confirming that relevant external knowledge is critical to maintaining coherence and concept coverage in NLG systems.
Link: https://arxiv.org/abs/2510.24179
Authors: Iván Martínez-Murillo,Paloma Moreda,Elena Lloret
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91% correctness across both criteria, while filtering reduced performance drastically to 6%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.
[NLP-59] MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
[Quick Read]: This paper addresses the challenge of detecting sarcasm in multimodal settings, where sarcastic expressions pervasive in social media and popular culture are difficult for existing models to identify accurately. The key to the solution is the construction and release of MuSaG, the first German multimodal sarcasm detection dataset: 33 minutes of manually selected and human-annotated statements from German television shows, with each instance providing aligned text, audio, and video modalities to support both unimodal and multimodal evaluation. Benchmarking nine open-source and commercial models on the dataset shows that humans rely heavily on audio cues in conversational settings while models perform best on text, exposing the shortcomings of current multimodal models and providing a foundation for developing sarcasm detectors better suited to realistic scenarios.
Link: https://arxiv.org/abs/2510.24178
Authors: Aaron Scott,Maike Züfle,Jan Niehues
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.
[NLP-60] Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean ACL
[Quick Read]: This paper addresses the absence of a systematic benchmark for multistep, soft reasoning over long Korean narratives in current NLP, while minimizing the evaluation distortion caused by data contamination. The key to the solution is Ko-MuSR, the first multistep reasoning benchmark focused on Korean, featuring fully Korean narratives, reasoning chains verified by human annotators for logical consistency and answerability, and multiple-choice questions. Carefully designed prompting strategies that combine few-shot examples, reasoning traces, and task-specific hints further boost large language models' accuracy on Korean reasoning, approaching human-level performance.
Link: https://arxiv.org/abs/2510.24150
Authors: Chanwoo Park,Suyoung Park,JiA Kang,Jongyeon Park,Sangho Kim,Hyunji M. Park,Sumin Bae,Mingyu Kang,Jaejin Lee
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: submitted to ACL ARR Rolling Review
Abstract:We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models – two multilingual and two Korean-specialized – show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
[NLP-61] Beyond Line-Level Filtering for the Pretraining Corpora of LLMs ACL
[Quick Read]: This paper addresses the risk that traditional line-level deduplication and trailing-punctuation filters discard valuable content during text preprocessing, harming downstream performance. The key to the solution is two improved methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), which combine line-level signals with their sequential distribution across documents so that structurally important content that simple rules would remove is retained. Experiments training small language models (1B parameters) in English and Korean show consistent gains on multiple-choice benchmarks and significant improvements in generative question-answering accuracy on SQuAD v1 and KorQuAD v1.
Link: https://arxiv.org/abs/2510.24139
Authors: Chanwoo Park,Suyoung Park,Yelim Ahn,Jongmin Kim,Jongyeon Park,Jaejin Lee
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: submitted to ACL ARR Rolling Review
Abstract:While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), that enhance the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
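The abstract does not spell out the exact rule, so the following is only one hedged way to combine a line-level signal with its sequential distribution across documents: a duplicated line is dropped only when it recurs mostly near document edges, a position pattern typical of boilerplate.

```python
# Illustrative pattern-aware deduplication: plain deduplication would drop every
# repeated line; here a repeated line is dropped only if its relative positions
# across documents concentrate at the top or bottom (an assumed boilerplate cue).
from collections import defaultdict

def pattern_aware_dedup(docs, min_docs=5, edge_ratio=0.8):
    positions = defaultdict(list)                 # line -> relative positions seen
    for doc in docs:
        lines = doc.splitlines()
        for i, line in enumerate(lines):
            if line.strip():
                positions[line].append(i / max(len(lines) - 1, 1))

    def is_boilerplate(line):
        pos = positions[line]
        if len(pos) < min_docs:
            return False                          # rare lines are always kept
        at_edge = sum(p < 0.1 or p > 0.9 for p in pos)
        return at_edge / len(pos) >= edge_ratio   # recurs mostly at doc edges

    return ["\n".join(l for l in d.splitlines() if not is_boilerplate(l))
            for d in docs]
```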
[NLP-62] VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP2025
[Quick Read]: This paper addresses the insufficient quality of video captions used to train text-to-video (T2V) generation models, in particular the lack of caption design strategies optimized for T2V. The key to the solution is VC4VG (Video Captioning for Video Generation), a framework that analyzes caption content from a T2V perspective, decomposes the elements required for video reconstruction into multiple dimensions, and derives a principled caption design methodology; it also builds VC4VG-Bench, a benchmark with fine-grained, multi-dimensional, necessity-graded evaluation metrics. Experiments show a strong positive correlation between improved caption quality and video generation performance, validating the approach.
Link: https://arxiv.org/abs/2510.24134
Authors: Yang Du,Zhuoran Lin,Kaiqiang Song,Biao Wang,Zhicheng Zheng,Tiezheng Ge,Bo Zheng,Qin Jin
Institutions: Renmin University of China; Alibaba
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025
Abstract:Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V training. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at this https URL to support further research.
[NLP-63] Reinforcement Learning for Long-Horizon Multi-Turn Search Agents NEURIPS2025
[Quick Read]: This paper addresses the performance ceiling of large language model (LLM) agents on complex tasks, especially in multi-turn, tool-using settings. The key to the solution is reinforcement learning (RL): by learning from experience, it substantially improves decision-making and task accuracy. On a legal document search benchmark, an RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs. 78% accuracy); the study also explores turn-restricted regimes during training and at test time and finds that allowing longer multi-turn horizons yields further gains, underscoring the value of RL-driven policy optimization for complex tasks.
Link: https://arxiv.org/abs/2510.24126
Authors: Vivek Kalyan,Martin Andrews
Institutions: Red Cat Labs
Categories: Computation and Language (cs.CL)
Comments: 4 pages plus references and appendices. Accepted into the First Workshop on Multi-Turn Interactions in Large Language Models at NeurIPS 2025
Abstract:Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.
[NLP-64] Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks
[Quick Read]: This paper addresses the difficulty of deploying current Text-to-SQL techniques in real systems, which stems mainly from the lack of effective integration tools and a unified collaboration mechanism. The key to the solution is the Squrve framework, whose core innovations are a universal execution paradigm that standardizes invocation interfaces and a multi-actor collaboration mechanism built on seven abstracted, effective atomic actor components, tightly coupling research methods with practical applications. Experiments show that the collaborative workflows consistently outperform the original individual methods on mainstream benchmarks, opening an effective new avenue for tackling complex real-world queries.
Link: https://arxiv.org/abs/2510.24102
Authors: Yihan Wang,Peiyu Liu,Runyu Chen,Jiaxing Pu,Wei Xu
Institutions: Renmin University of China; University of International Business and Economics; Zhejiang University of Technology
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Text-to-SQL technology has evolved rapidly, with diverse academic methods achieving impressive results. However, deploying these techniques in real-world systems remains challenging due to limited integration tools. To bridge this gap, we introduce Squrve, a unified, modular, and extensive Text-to-SQL framework designed to bring together research advances and real-world applications. Squrve first establishes a universal execution paradigm that standardizes invocation interfaces, then proposes a multi-actor collaboration mechanism based on seven abstracted effective atomic actor components. Experiments on widely adopted benchmarks demonstrate that the collaborative workflows consistently outperform the original individual methods, thereby opening up a new effective avenue for tackling complex real-world queries. The code is available at this https URL.
[NLP-65] RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
[Quick Read]: This paper addresses the scarcity of computational research on Bengali dialects, despite the considerable phonological and morphological diversity shaped by geography, culture, and history (the major dialect groups being Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi) and the lexical, syntactic, and morphological variation found across regions of Bangladesh such as Chittagong, Sylhet, and Rangpur. The key to the solution is to systematically document and analyze the phonetic and morphological properties of these dialects and to explore the feasibility of building automatic speech recognition (ASR) models tailored to regional varieties, supporting applications such as virtual assistants while advancing both the digital preservation of dialectal diversity and the development of inclusive tools for Bengali-speaking communities. The dataset created for this study is released for public use.
Link: https://arxiv.org/abs/2510.24096
Authors: Md. Rezuwan Hassan,Azmol Hossain,Kanij Fatema,Rubayet Sabbir Faruque,Tanmoy Shome,Ruwad Naswan,Trina Chakraborty,Md. Foriduzzaman Zihad,Tawsif Tashwar Dipto,Nazia Tasnim,Nazmuddoha Ansary,Md. Mehedi Hasan Shawon,Ahmed Imtiaz Humayun,Md. Golam Rabiul Alam,Farig Sadeque,Asif Sushmit
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments: 26 pages
Abstract:The Bengali language, spoken extensively across South Asia and among diasporic communities, exhibits considerable dialectal diversity shaped by geography, culture, and history. Phonological and pronunciation-based classifications broadly identify five principal dialect groups: Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further distinctions emerge through variation in vocabulary, syntax, and morphology, as observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali, and Barishal. Despite this linguistic richness, systematic research on the computational processing of Bengali dialects remains limited. This study seeks to document and analyze the phonetic and morphological properties of these dialects while exploring the feasibility of building computational models, particularly Automatic Speech Recognition (ASR) systems, tailored to regional varieties. Such efforts hold potential for applications in virtual assistants and broader language technologies, contributing to both the preservation of dialectal diversity and the advancement of inclusive digital tools for Bengali-speaking communities. The dataset created for this study is released for public use.
[NLP-66] Global PIQA: Evaluating Physical Commonsense Reasoning Across 100 Languages and Cultures
[Quick Read]: This paper addresses the lack of culturally targeted evaluation benchmarks for large language models (LLMs) across many languages and cultures, where existing evaluations struggle to reflect models' grasp of localized commonsense. The key to the solution is Global PIQA, a participatory commonsense reasoning benchmark covering 116 language varieties (spanning five continents, 14 language families, and 23 writing systems), hand-built by 335 researchers from 65 countries; in its non-parallel split, over 50% of examples reference local foods, customs, traditions, or other culturally specific elements. The benchmark shows that state-of-the-art LLMs degrade markedly on lower-resource languages (up to a 37% accuracy gap) and highlights everyday knowledge as an area for improvement.
Link: https://arxiv.org/abs/2510.24081
Authors: Tyler A. Chang,Catherine Arnett,Abdelrahman Eldesokey,Abdelrahman Sadallah,Abeer Kashar,Abolade Daud,Abosede Grace Olanihun,Adamu Labaran Mohammed,Adeyemi Praise,Adhikarinayum Meerajita Sharma,Aditi Gupta,Afitab Iyigun,Afonso Simplício,Ahmed Essouaied,Aicha Chorana,Akhil Eppa,Akintunde Oladipo,Akshay Ramesh,Aleksei Dorkin,Alfred Malengo Kondoro,Alham Fikri Aji,Ali Eren Çetintaş,Allan Hanbury,Alou Dembele,Alp Niksarli,Álvaro Arroyo,Amin Bajand,Amol Khanna,Ana Chkhaidze,Ana Condez,Andiswa Mkhonto,Andrew Hoblitzell,Andrew Tran,Angelos Poulis,Anirban Majumder,Anna Vacalopoulou,Annette Kuuipolani Kanahele Wong,Annika Simonsen,Anton Kovalev,Ashvanth.S,Ayodeji Joseph Lana,Barkin Kinay,Bashar Alhafni,Benedict Cibalinda Busole,Bernard Ghanem,Bharti Nathani,Biljana Stojanovska Đurić,Bola Agbonile,Bragi Bergsson,Bruce Torres Fischer,Burak Tutar,Burcu Alakuş Çınar,Cade J. Kanoniakapueo Kane,Can Udomcharoenchaikit,Catherine Arnett,Chadi Helwe,Chaithra Reddy Nerella,Chen Cecilia Liu,Chiamaka Glory Nwokolo,Cristina España-Bonet,Cynthia Amol,DaeYeop Lee,Dana Arad,Daniil Dzenhaliou,Daria Pugacheva,Dasol Choi,Daud Abolade,David Liu,David Semedo,Deborah Popoola,Deividas Mataciunas,Delphine Nyaboke,Dhyuthy Krishna Kumar,Diogo Glória-Silva,Diogo Tavares,Divyanshu Goyal,DongGeon Lee,Ebele Nwamaka Anajemba,Egonu Ngozi Grace,Elena Mickel,Elena Tutubalina,Elias Herranen,Emile Anand,Emmanuel Habumuremyi,Emuobonuvie Maria Ajiboye,Eryawan Presma Yulianrifat,Esther Adenuga,Ewa Rudnicka,Faith Olabisi Itiola,Faran Taimoor Butt,Fathima Thekkekara,Fatima Haouari,Filbert Aurelian Tjiaranata,Firas Laakom,Francesca Grasso,Francesco Orabona,Francesco Periti,Gbenga Kayode Solomon,Gia Nghia Ngo,Gloria Udhehdhe-oze
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint
Abstract:To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
[NLP-67] Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
[Quick Read]: This paper addresses hallucination in large language models (LLMs) used for machine translation (MT), a failure mode that existing MT benchmarks cannot expose in multilingual LLMs. The key to the solution is a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment, together with HalloMTBench, a human-verified multilingual benchmark spanning 11 English-to-X directions with 5,435 high-quality instances. Candidates were generated with four frontier LLMs and filtered with an ensemble of LLM judges plus expert validation; evaluating 17 LLMs reveals distinct "hallucination triggers" tied to model scale, source-length sensitivity, linguistic biases, and reinforcement-learning (RL) amplified language mixing, providing a scalable, forward-looking testbed for diagnosing LLM translation failures.
Link: https://arxiv.org/abs/2510.24073
Authors: Xinwei Wu,Heng Liu,Jiang Zhou,Xiaohu Zhao,Linlong Xu,Longyue Wang,Weihua Luo,Kaifu Zhang
Institutions: Tianjin University; Alibaba International Digital Commerce
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To disclose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employed 4 frontier LLMs to generate candidates and scrutinized these candidates with an ensemble of LLM judges and expert validation. In this way, we curated 5,435 high-quality instances. We have evaluated 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers", unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available at this https URL.
[NLP-68] Pie: A Programmable Serving System for Emerging LLM Applications SOSP2025
[Quick Read]: This paper addresses the performance bottlenecks that existing large language model (LLM) serving systems hit on diverse reasoning strategies and agentic workflows, which stem from serving architectures built around a monolithic token generation loop that cannot support flexible control flow or application-level optimization. The key to the solution is Pie, which decomposes the traditional generation loop into fine-grained service handlers exposed through an API and delegates control of generation to user-provided programs called inferlets; inferlets execute as WebAssembly (WASM) for lightweight sandboxing, letting applications implement their own KV cache strategies and bespoke generation logic and seamlessly integrate computation and I/O without modifying the underlying serving system, yielding 1.3x-3.4x better latency and throughput on agentic workflows.
Link: https://arxiv.org/abs/2510.24051
Authors: In Gim,Zhiyao Ma,Seung-seob Lee,Lin Zhong
Institutions: Yale University
Categories: Computation and Language (cs.CL)
Comments: SOSP 2025. Source code available at this https URL
Abstract:Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application, without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
[NLP-69] GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
[Quick Read]: This paper addresses the lack of a unified, quantitative standard for evaluating deep learning tensor compilers on real-world workloads, where optimization capability is hard to compare objectively across task categories and frameworks. The key to the solution is two metrics: the Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels and gives a reliable measure of overall optimization capability, and the Error-aware Speedup Score ES(t), which further incorporates error information to help developers pinpoint performance bottlenecks. The authors also build GraphNet, a dataset of 2.7K real-world deep learning computational graphs spanning six major task categories and multiple mainstream frameworks, providing a standardized basis for compiler benchmarking.
Link: https://arxiv.org/abs/2510.24035
Authors: Xinqi Li,Yiqun Liu,Shan Jiang,Enrong Zheng,Huaijin Zheng,Wenhao Dai,Haodong Deng,Dianhai Yu,Yanjun Ma
Institutions: Baidu
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:We introduce GraphNet, a dataset of 2.7K real-world deep learning computational graphs with rich metadata, spanning six major task categories across multiple deep learning frameworks. To evaluate tensor compiler performance on these samples, we propose the benchmark metric Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels, offering a reliable measure of general optimization capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t), which incorporates error information and helps compiler developers identify key performance bottlenecks. In this report, we benchmark the default tensor compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer vision (CV) and natural language processing (NLP) samples to demonstrate the practicality of GraphNet. The full construction pipeline with graph extraction and compiler evaluation tools is available at this https URL .
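The report does not give the closed form of S(t) here, so the snippet below is one plausible instantiation stated purely as an assumption: correct samples contribute their runtime speedup, incorrect ones contribute a harsh floor, and samples are aggregated with a geometric mean.

```python
# Hypothetical form of a tolerance-gated speedup score (not the paper's formula).
import math

def speedup_score(results, t):
    """results: dicts with 'base_ms', 'opt_ms', 'max_err' per compiled sample."""
    logs = []
    for r in results:
        correct = r["max_err"] <= t              # execution correctness gate
        speedup = r["base_ms"] / r["opt_ms"] if correct else 1e-6  # penalty floor
        logs.append(math.log(speedup))
    return math.exp(sum(logs) / len(logs))       # geometric mean over samples

print(speedup_score(
    [{"base_ms": 10.0, "opt_ms": 4.0, "max_err": 1e-5},   # 2.5x, within tolerance
     {"base_ms": 8.0, "opt_ms": 8.0, "max_err": 2e-3}],   # fails the gate
    t=1e-4))
```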
[NLP-70] Success and Cost Elicit Convention Formation for Efficient Communication
[Quick Read]: This paper addresses how to get large multimodal models to form ad hoc linguistic conventions through simulated reference games, without any additional human-produced data, so that human-machine communication becomes more efficient. The core of the solution is training through repeated reference games in which models iteratively refine their communication strategies, exploiting shared conversational context over photographs and tangram images to reach short, high-success messages; the key finding is that success and cost objectives must be trained jointly, since either alone fails to elicit convention formation, while the combination shortens messages by up to 41%, raises success by 15% over the course of an interaction, and makes human listeners respond faster.
Link: https://arxiv.org/abs/2510.24023
Authors: Saujas Vaduguru,Yilun Hua,Yoav Artzi,Daniel Fried
Institutions: Carnegie Mellon University; Department of Computer Science and Cornell Tech, Cornell University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient - both are necessary to elicit convention formation.
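A minimal sketch of the two-term objective described above; the linear cost shape and the weight lam are illustrative assumptions:

```python
# Speaker reward combining communicative success with message cost.
def convention_reward(listener_correct: bool, num_tokens: int, lam: float = 0.05):
    success = 1.0 if listener_correct else 0.0
    cost = lam * num_tokens          # longer messages are more costly
    return success - cost

# Success alone (lam = 0) never pressures messages to shorten, while cost alone
# rewards degenerate empty messages; both terms are needed for conventions to
# form, matching the paper's finding.
print(convention_reward(True, 12), convention_reward(True, 3))
```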
[NLP-71] SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs
[Quick Read]: This paper addresses a weakness of conventional knowledge distillation (KD) for compressing large language models (LLMs): applying the distillation loss uniformly to all tokens forces the student to imitate the teacher's uncertain, high-entropy predictions, injecting noise and hurting performance, especially when the teacher is far larger than the student. The key to the solution is Speculative Knowledge Distillation (SpecKD), a plug-and-play framework whose core innovation is a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding: at each step the student's token proposal is verified against the teacher's distribution, the distillation loss is applied only to "accepted" tokens, and "rejected" tokens are masked out, yielding more precise knowledge transfer, more stable training, and more capable student models.
Link: https://arxiv.org/abs/2510.24021
Authors: Haiduo Huang,Jiangcheng Song,Yadong Zhang,Pengju Ren
Institutions: Xi’an Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher's confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher's uncertain or high-entropy predictions, which may ultimately harm student performance, especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding. At each step, the student's token proposal is verified against the teacher's distribution; the distillation loss is selectively applied only to "accepted" tokens, while "rejected" tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.
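To illustrate the gating idea, here is a hedged sketch of a SpecKD-style loss; the acceptance rule (teacher probability of the student's proposed token above a threshold) is an assumption standing in for the paper's verify step:

```python
# Token-gated distillation: the per-token KL loss is kept only where the teacher
# "accepts" the student's greedy proposal.
import torch
import torch.nn.functional as F

def speckd_loss(student_logits, teacher_logits, tau=2.0, accept_p=0.05):
    proposals = student_logits.argmax(dim=-1)                     # (B, T) proposals
    teacher_probs = teacher_logits.softmax(dim=-1)
    p_proposal = teacher_probs.gather(-1, proposals.unsqueeze(-1)).squeeze(-1)
    accepted = (p_proposal > accept_p).float()                    # (B, T) gate
    kl = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="none").sum(-1) * tau * tau           # per-token KD
    return (kl * accepted).sum() / accepted.sum().clamp(min=1.0)

loss = speckd_loss(torch.randn(2, 8, 100), torch.randn(2, 8, 100))
```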
[NLP-72] Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
[Quick Read]: This paper addresses hallucination in large language models (LLMs), which generate plausible but inaccurate answers when they lack sufficient knowledge. Existing methods rely on coarse-grained confidence signals (such as overall uncertainty scores over multiple sampled answers) to prompt abstention under uncertainty, making it hard to delineate the model's knowledge boundary precisely. The key to the solution is a reinforcement learning framework built on a Fine-grained Semantic Confidence Reward: multiple candidate answers are sampled and semantically clustered, and the model is trained to retain answers from high-confidence clusters and discard those from low-confidence clusters, achieving more precise post-hoc abstention and significantly improving reliability on both in-domain and out-of-distribution tasks.
Link: https://arxiv.org/abs/2510.24020
Authors: Hao An,Yang Xu
Institutions: Southern University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures
Abstract:Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on a Fine-grained Semantic Confidence Reward, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
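A toy sketch of the sampling-and-clustering confidence signal follows; real semantic clustering would use entailment or embedding similarity, approximated here by normalized string equality, and the reward shape is an assumption:

```python
# Cluster sampled answers, treat the largest cluster's share as confidence, and
# reward answering only when that confidence clears a threshold.
from collections import Counter

def semantic_confidence(samples):
    clusters = Counter(s.strip().lower() for s in samples)   # crude clustering
    answer, size = clusters.most_common(1)[0]
    return answer, size / len(samples)                       # cluster confidence

def abstention_reward(answered: bool, confidence: float, threshold=0.5):
    if answered:
        return confidence - threshold    # positive only for confident clusters
    return threshold - confidence        # abstaining pays off when uncertain

ans, conf = semantic_confidence(["Paris", "paris", "Lyon", "Paris "])
print(ans, conf, abstention_reward(True, conf))
```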
[NLP-73] TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents ACL2025
[Quick Read]: This paper addresses the semantic mismatch between information extraction (IE) outputs and downstream application needs: structured knowledge produced by traditional IE is hard to fit directly to the schema of a target database (or knowledge base). The paper proposes a new task formulation, TEXT2DB, that makes integration of IE output with the target database (DB) central: given a user instruction, a document set, and an existing database, the model must update the database accordingly. The key to the solution is OPAL (Observe-Plan-Analyze LLM), an LLM agent framework with three components: an Observer that interacts with the database, a Planner that generates code-based execution plans invoking IE models, and an Analyzer that checks code quality before execution for reliability. Experiments show that OPAL adapts to diverse database schemas by generating different code plans and calling the required IE models, achieving accurate text-to-database mapping, while difficult cases such as large databases with complex dependencies and extraction hallucination merit further investigation.
Link: https://arxiv.org/abs/2510.24014
Authors: Yizhu Jiao,Sha Li,Sizhe Zhou,Heng Ji,Jiawei Han
Institutions: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments: ACL 2025. Source code: this https URL
Abstract:The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe-Plan-Analyze LLM) which includes an Observer component that interacts with the database, a Planner component that generates a code-based plan with calls to IE models, and an Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: this https URL
[NLP-74] META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine
[Quick Read]: This paper addresses the drop in diagnostic accuracy that generative AI suffers in clinical decision support when the retrieved medical evidence is of uneven quality. Current retrieval-augmented generation (RAG) approaches struggle to single out high-quality evidence and can inject incorrect knowledge, hurting output reliability. The key to the solution is a multi-principle evidence re-ranking and filtering mechanism inspired by systematic review and meta-analysis in evidence-based medicine (EBM), combining reliability analysis, heterogeneity analysis, and extrapolation analysis to select the literature best suited to ground LLM reasoning. Experiments show diagnostic accuracy improvements of up to 11.4%, markedly strengthening RAG's ability to extract reliable evidence from the PubMed dataset in medical settings.
Link: https://arxiv.org/abs/2510.24003
Authors: Mengzhou Sun,Sendong Zhao,Jianyu Chen,Haochun Wang,Bin Qin
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Evidence-based medicine (EBM) holds a crucial role in clinical application. Given suitable medical articles, doctors effectively reduce the incidence of misdiagnoses. Researchers find it efficient to use large language models (LLMs) techniques like RAG for EBM tasks. However, the EBM maintains stringent requirements for evidence, and RAG applications in EBM struggle to efficiently distinguish high-quality evidence. Therefore, inspired by the meta-analysis used in EBM, we provide a new method to re-rank and filter the medical evidence. This method presents multiple principles to filter the best evidence for LLMs to diagnose. We employ a combination of several EBM methods to emulate the meta-analysis, which includes reliability analysis, heterogeneity analysis, and extrapolation analysis. These processes allow the users to retrieve the best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments and results. Our method successfully enables RAG to extract higher-quality and more reliable evidence from the PubMed dataset. This work can reduce the infusion of incorrect knowledge into responses and help users receive more effective replies.
[NLP-75] PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine
[Quick Read]: This paper addresses the limited efficiency and accuracy of current retrieval-augmented generation (RAG) methods on complex queries in real clinical scenarios: when user queries are incomplete or imprecisely phrased, models easily retrieve irrelevant evidence and produce unhelpful answers. The key to the solution is the PICOs-RAG framework, which expands and normalizes raw queries into professional formulations aligned with evidence-based medicine (EBM) and uses the PICO structure (Population, Intervention, Comparison, Outcome, Study design) to extract the core elements used for retrieval, significantly improving retrieval relevance and efficiency, with gains of up to 8.8% over the baseline under the paper's evaluation.
Link: https://arxiv.org/abs/2510.23998
Authors: Mengzhou Sun,Sendong Zhao,Jianyu Chen,Bin Qin
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Evidence-based medicine (EBM) research has always been of paramount importance. It is important to find appropriate medical theoretical support for the needs of physicians or patients to reduce the occurrence of medical accidents. This process is often carried out by humans querying relevant literature databases, which lacks objectivity and efficiency. Therefore, researchers utilize retrieval-augmented generation (RAG) to search for evidence and generate responses automatically. However, current RAG methods struggle to handle complex queries in real-world clinical scenarios. For example, when queries lack certain information or use imprecise language, the model may retrieve irrelevant evidence and generate unhelpful answers. To address this issue, we present PICOs-RAG, which expands user queries into a better format. Our method expands and normalizes the queries into professional ones and uses the PICO format, a search strategy tool used in EBM, to extract the most important information for retrieval. This approach significantly enhances retrieval efficiency and relevance, resulting in up to an 8.8% improvement compared to the baseline evaluated by our method. PICOs-RAG thereby turns large language models into helpful and reliable medical assistants for EBM.
[NLP-76] M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems
[Quick Read]: This paper addresses factual errors, such as hallucinations and improper use of external knowledge, in current retrieval-augmented generation (RAG) based medical question answering, which seriously undermine accuracy and reliability. The key to the solution is M-Eval, a method inspired by heterogeneity analysis in evidence-based medicine (EBM): it extracts additional medical literature from external knowledge bases and compares it against the evidence documents produced by the RAG system, checking whether the evidence supports the different viewpoints in the response, while also assessing the reliability of the evidence the RAG system provides, thereby detecting and correcting erroneous information. Experiments show accuracy improvements of up to 23.31% across various large language models (LLMs).
Link: https://arxiv.org/abs/2510.23995
Authors: Mengzhou Sun,Sendong Zhao,Jianyu Chen,Haochun Wang,Bin Qin
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems. They generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.
[NLP-77] emg2speech: synthesizing speech from electromyography using self-supervised speech models
[Quick Read]: This paper addresses the direct conversion of electromyographic (EMG) signals from orofacial muscles into speech, enabling end-to-end EMG-to-speech generation without an explicit articulatory model or vocoder training. The key to the solution is the finding that self-supervised speech (SS) representations have a strong linear relationship with the electrical power of muscle action potentials (correlation r = 0.85), and that EMG power vectors for different articulatory gestures form structured, separable clusters in feature space, indicating that SS models implicitly encode articulatory mechanisms; exploiting this property, the authors map EMG signals directly into the SS feature space and synthesize speech, skipping the traditionally complex intermediate modeling steps.
Link: https://arxiv.org/abs/2510.23969
Authors: Harshavardhana T. Gowda,Lee M. Miller
Institutions: University of California, Davis
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract:We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of r = 0.85. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This relationship (SS features map linearly to EMG power, and EMG power clusters by gesture into articulatory movements) highlights that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models and vocoder training.
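The linear-mapping claim is easy to probe with a regression of the kind sketched below, on synthetic stand-in data; the shapes and the SS feature source are placeholders:

```python
# Fit a ridge regression from self-supervised speech features to per-channel EMG
# power and check the held-out correlation (the paper also maps the reverse way).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
ss_feats = rng.normal(size=(5000, 768))          # frames x SS feature dims
true_map = rng.normal(size=(768, 8))             # toy ground-truth linear map
emg_power = ss_feats @ true_map + 0.5 * rng.normal(size=(5000, 8))

model = Ridge(alpha=1.0).fit(ss_feats[:4000], emg_power[:4000])
pred = model.predict(ss_feats[4000:])
r = np.corrcoef(pred.ravel(), emg_power[4000:].ravel())[0, 1]
print(f"held-out correlation r = {r:.2f}")       # a strong r supports linearity
```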
[NLP-78] Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs
[Quick Read]: This paper addresses language confusion that surfaces when unlearning is applied to multilingual large language models: the model responds in a language different from that of the input prompt, which breaks conventional reference-based evaluation metrics. The key to the solution is threefold: introduce an N-gram-based Language-Mix (N-Mix) score to quantify how pervasive and consistent language confusion is; demonstrate that high N-Mix scores cause reference-based metrics to produce false negatives; and argue for semantic-based evaluation metrics that directly assess the generated content, enabling more reliable evaluation of unlearning.
Link: https://arxiv.org/abs/2510.23949
Authors: Kyomin Hwang,Hyeonjin Kim,Seungyeon Kim,Sunghyun Wee,Nojun Kwak
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:There have been a couple of studies showing that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation, and address an additional blind spot which reveals itself when the multilingual LLM is fully fine-tuned with a parallel multilingual dataset before unlearning. Here, language confusion occurs whereby a model responds in a language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing the standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) introduce the N-gram-based Language-Mix (N-Mix) score to show quantitatively that language confusion is pervasive and consistent in multilingual LLMs, (2) demonstrate that reference-based metrics result in false negatives when the N-Mix score is high, and (3) suggest the need for a new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metric a semantic-based metric.
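Since the paper's exact N-Mix definition is not reproduced here, the sketch below is a hedged approximation that scores the share of word n-grams whose script differs from the prompt's script; Unicode ranges serve as a crude language proxy that only works across scripts:

```python
# Approximate language-mix score: fraction of word n-grams containing a token
# whose script differs from the prompt's script.
import re

def script_of(token):
    if re.search(r"[\uac00-\ud7af]", token): return "hangul"
    if re.search(r"[\u4e00-\u9fff]", token): return "han"
    if re.search(r"[a-zA-Z]", token): return "latin"
    return "other"

def n_mix(response, prompt_script, n=2):
    words = response.split()
    grams = [words[i:i + n] for i in range(len(words) - n + 1)] or [words]
    mixed = sum(any(script_of(w) not in (prompt_script, "other") for w in g)
                for g in grams)
    return mixed / max(len(grams), 1)   # 0 = no confusion, 1 = fully mixed

print(n_mix("그 결과 the model 은 성능이 좋다", "hangul"))
```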
[NLP-79] Leveraging LLMs for Early Alzheimer's Prediction
[Quick Read]: This paper addresses the limited accuracy of early Alzheimer's disease (AD) detection caused by biomarker complexity and sensitivity to clinical interpretation thresholds. The key to the solution is a connectome-informed large language model (LLM) framework that encodes dynamic fMRI connectivity as temporal sequences and, after robust normalization, maps the data into a representation suitable as input to a frozen pre-trained LLM, achieving sensitive clinical prediction with error rates well below clinically recognized margins and providing a reliable basis for timely intervention.
Link: https://arxiv.org/abs/2510.23946
Authors: Tananun Songdechakraiwut
Institutions: Duke University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We present a connectome-informed LLM framework that encodes dynamic fMRI connectivity as temporal sequences, applies robust normalization, and maps these data into a representation suitable for a frozen pre-trained LLM for clinical prediction. Applied to early Alzheimer’s detection, our method achieves sensitive prediction with error rates well below clinically recognized margins, with implications for timely Alzheimer’s intervention.
[NLP-80] Auto-prompting without training labels: An LLM cascade for product quality assessment in e-commerce catalogs
[Quick Read]: This paper addresses large-scale product attribute quality assessment in e-commerce: evaluating a vast number of category-attribute pairs accurately without labeled training data or model fine-tuning. Traditional approaches such as chain-of-thought prompting fall short on domain-specific tasks and demand heavy involvement of domain experts in prompt design. The key to the solution is a training-free auto-prompt cascade: starting from a seed of human-crafted prompts, it iteratively generates and refines prompts to meet category-specific requirements, improving precision and recall while cutting domain-expert effort per attribute from 5.1 hours to 3 minutes, a 99% reduction, effectively bridging general language understanding and industrial-scale domain knowledge.
Link: https://arxiv.org/abs/2510.23941
Authors: Soham Satyadharma,Fatemeh Sheikholeslami,Swati Kaul,Aziz Umit Batur,Suleiman A. Khan
Institutions: Amazon Catalog AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce a novel, training-free cascade for auto-prompting Large Language Models (LLMs) to assess product quality in e-commerce. Our system requires no training labels or model fine-tuning, instead automatically generating and refining prompts for evaluating attribute quality across tens of thousands of product category-attribute pairs. Starting from a seed of human-crafted prompts, the cascade progressively optimizes instructions to meet catalog-specific requirements. This approach bridges the gap between general language understanding and domain-specific knowledge at scale in complex industrial catalogs. Our extensive empirical evaluation shows that the auto-prompt cascade improves precision and recall by 8-10% over traditional chain-of-thought prompting. Notably, it achieves these gains while reducing domain expert effort from 5.1 hours to 3 minutes per attribute - a 99% reduction. Additionally, the cascade generalizes effectively across five languages and multiple quality assessment tasks, consistently maintaining performance gains.
[NLP-81] Latent Chain-of-Thought for Visual Reasoning NEURIPS2025
[Quick Read]: This paper addresses weak generalization and heavy dependence on biased reward models in chain-of-thought (CoT) reasoning for Large Vision-Language Models (LVLMs). The key to the solution is reformulating CoT reasoning as posterior inference and proposing a scalable training algorithm based on amortized variational inference; a sparse reward function supplying token-level learning signals, combined with diversity-seeking reinforcement learning, encourages diverse, high-likelihood latent reasoning chains, overcoming the limitations of deterministic sampling and preventing reward hacking; in addition, a Bayesian inference-scaling strategy replaces costly Best-of-N and Beam Search with the marginal likelihood to efficiently rank the best rationales and answers.
Link: https://arxiv.org/abs/2510.23925
Authors: Guohao Sun,Hang Hua,Jian Wang,Jiebo Luo,Sohail Dianat,Majid Rabbani,Raghuveer Rao,Zhiqiang Tao
Institutions: Rochester Institute of Technology; Snap Inc.; University of Rochester; DEVCOM Army Research Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NeurIPS 2025
Abstract:Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
[NLP-82] Agent-based Automated Claim Matching with Instruction-following LLMs
[Quick Read]: This paper addresses claim matching for automated fact-checking, i.e., deciding whether two statements refer to the same factual claim. The key to the solution is an agent-based two-step pipeline: a large language model (LLM) first generates prompts automatically, and claim matching is then cast as a binary classification task performed by an LLM. The study finds that LLM-generated prompts can outperform human-generated ones, that smaller LLMs do as well as larger ones at the prompt-generation stage, substantially reducing computational cost, and that using different LLMs for prompt generation and claim matching also performs strongly, confirming the framework's flexibility and effectiveness.
Link: https://arxiv.org/abs/2510.23924
Authors: Dina Pisarevskaya,Arkaitz Zubiaga
Institutions: Queen Mary University of London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for the International Joint Conference on Natural Language Processing Asia-Pacific Chapter of the Association for Computational Linguistics (2025) Findings
Abstract:We present a novel agent-based approach for the automated claim matching task with instruction-following LLMs. We propose a two-step pipeline that first generates prompts with LLMs, to then perform claim matching as a binary classification task with LLMs. We demonstrate that LLM-generated prompts can outperform the SOTA achieved with human-generated prompts, and that smaller LLMs can do as well as larger ones in the generation process, saving computational resources. We also demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e. using an LLM for prompt generation, and another for claim matching. Our investigation into the prompt generation process in turn reveals insights into the LLMs' understanding of claim matching.
[NLP-83] Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation
[Quick Read]: This paper addresses the stereotypical biases that large language models (LLMs) exhibit due to the discriminative nature of their training data, and in particular the brittleness of existing bias-alignment approaches under input perturbations. The key to the solution is a novel, general augmentation framework with three plug-and-play steps that applies to a range of fairness evaluation benchmarks. Applying it to the Bias Benchmark for Question Answering (BBQ) shows that even state-of-the-art open- and closed-weight models are susceptible to input perturbations and become more likely to behave stereotypically; moreover, biased behavior is more pronounced when the target demographic belongs to a community less studied in the literature, underlining the need to extend fairness and safety research to more diverse communities.
Link: https://arxiv.org/abs/2510.23921
Authors: Kaveh Eskandari Miandoab,Mahammed Kamruzzaman,Arshia Gharooni,Gene Louis Kim,Vasanth Sarathy,Ninareh Mehrabi
Institutions: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 3 tables
Abstract:Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior due to the discriminative nature of the data that they have been trained on. Despite significant progress in the development of methods and models that refrain from using stereotypical information in their decision-making, recent work has shown that approaches used for bias alignment are brittle. In this work, we introduce a novel and general augmentation framework that involves three plug-and-play steps and is applicable to a number of fairness evaluation benchmarks. Through application of augmentation to a fairness evaluation dataset (Bias Benchmark for Question Answering (BBQ)), we find that Large Language Models (LLMs), including state-of-the-art open and closed weight models, are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave stereotypically. Furthermore, we find that such models are more likely to have biased behavior in cases where the target demographic belongs to a community less studied by the literature, underlining the need to expand the fairness and safety research to include more diverse communities.
[NLP-84] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages
[Quick Read]: This paper addresses the severe underrepresentation of African languages in multilingual text embeddings: coverage in mainstream benchmarks such as the massively multilingual MTEB (MMTEB) is limited, and most existing tasks are repurposed from translation benchmarks (such as FLORES or SIB-200) rather than designed around African languages. The key to the solution is AfriMTEB, a regional expansion covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets that introduce entirely new tasks (hate speech detection, intent detection, and emotion classification) and substantially improve task diversity and practical relevance, together with AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation, which empirically outperforms strong baselines such as Gemini-Embeddings and achieves state-of-the-art text embeddings for African languages.
Link: https://arxiv.org/abs/2510.23896
Authors: Kosei Uemura,Miaoran Zhang,David Ifeoluwa Adelani
Institutions: University of Toronto; Mila - Quebec AI Institute, McGill University; Saarland University, Saarland Informatics Campus; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation, which is crucial for preventing hallucinations in LLMs. Despite the recent release of the massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB – a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
[NLP-85] Language Models for Longitudinal Clinical Prediction
【Quick Read】: This paper asks how frozen large language models can accurately analyze and forecast longitudinal clinical data with limited training data. The core challenge is to avoid the overfitting risk of conventional fine-tuning while preserving the pretrained model's generalization ability. The key to the solution is a lightweight framework that integrates patient history and context within the language-model space, producing accurate forecasts without any model fine-tuning; applied to neuropsychological assessments, it shows promise for early-stage Alzheimer's disease monitoring.
Link: https://arxiv.org/abs/2510.23884
Authors: Tananun Songdechakraiwut,Michael Lutz
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We explore a lightweight framework that adapts frozen large language models to analyze longitudinal clinical data. The approach integrates patient history and context within the language model space to generate accurate forecasts without model fine-tuning. Applied to neuropsychological assessments, it achieves accurate and reliable performance even with minimal training data, showing promise for early-stage Alzheimer’s monitoring.
[NLP-86] OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning
【Quick Read】: This paper targets the lack of complex reasoning ability in natural-language-to-SQL (NL2SQL) systems, in particular accurate modeling of arithmetic, commonsense, and hypothetical reasoning in bilingual settings. The key to the solution, OraPlan-SQL, is an agentic framework whose central innovation is a feedback-guided meta-prompting strategy for refining a single Planner agent: failure cases from a held-out set are clustered with human input and distilled by an LLM into corrective guidelines embedded in the system prompt, improving generalization without added complexity. Entity-linking guidelines mitigate cross-lingual transliteration and entity-mismatch issues by generating alternative surface forms, and plan diversification generates multiple candidate plans whose SQL executions are resolved by majority voting, raising execution accuracy (EX) while keeping SQL validity (VA) above 99%.
Link: https://arxiv.org/abs/2510.23870
Authors: Marianne Menglin Liu,Sai Ashish Somayajula,Syed Fahad Allam Shah,Sujith Ravi,Dan Roth
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present OraPlan-SQL, our system for the Archer NL2SQL Evaluation Challenge 2025, a bilingual benchmark requiring complex reasoning such as arithmetic, commonsense, and hypothetical inference. OraPlan-SQL ranked first, exceeding the second-best system by more than 6% in execution accuracy (EX), with 55.0% in English and 56.7% in Chinese, while maintaining over 99% SQL validity (VA). Our system follows an agentic framework with two components: Planner agent that generates stepwise natural language plans, and SQL agent that converts these plans into executable SQL. Since SQL agent reliably adheres to the plan, our refinements focus on the planner. Unlike prior methods that rely on multiple sub-agents for planning and suffer from orchestration overhead, we introduce a feedback-guided meta-prompting strategy to refine a single planner. Failure cases from a held-out set are clustered with human input, and an LLM distills them into corrective guidelines that are integrated into the planner’s system prompt, improving generalization without added complexity. For the multilingual scenario, to address transliteration and entity mismatch issues, we incorporate entity-linking guidelines that generate alternative surface forms for entities and explicitly include them in the plan. Finally, we enhance reliability through plan diversification: multiple candidate plans are generated for each query, with the SQL agent producing a query for each plan, and final output selected via majority voting over their executions.
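The plan-diversification step lends itself to a compact sketch: generate SQL from several candidate plans, execute each, and keep the query whose result wins the majority vote. The sketch below assumes hypothetical `sql_candidates` and `execute` hooks; it is a minimal illustration of the voting logic, not the OraPlan-SQL implementation.

```python
from collections import Counter

def select_by_majority_vote(sql_candidates, execute):
    """Pick the SQL whose execution result is most common among candidates.
    `execute` maps a SQL string to a hashable result; both names are
    placeholders for the system's actual components."""
    results = {}
    for sql in sql_candidates:
        try:
            results[sql] = execute(sql)
        except Exception:
            continue  # invalid SQL simply drops out of the vote
    if not results:
        return None
    tally = Counter(results.values())
    winning_result, _ = tally.most_common(1)[0]
    # Return the first candidate that produced the winning result.
    return next(sql for sql, res in results.items() if res == winning_result)

# Toy demo: three plans, two of which agree on the result.
fake_db = {"SELECT 1": (1,), "SELECT 1;": (1,), "SELECT 2": (2,)}
print(select_by_majority_vote(list(fake_db), fake_db.__getitem__))
```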
[NLP-87] GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
【Quick Read】: This paper tackles the instability and limited performance of LLM alignment caused by the difficulty of exploiting implicit rewards. Existing methods such as PPO or GRPO directly maximize cumulative reward, which is noisy and hard to optimize, while DPO and UNA introduce implicit rewards but are largely off-policy and lose exploration capability. The key innovation of GIFT (Group-relative Implicit Fine Tuning) is to jointly normalize the implicit and explicit reward functions, eliminating an otherwise intractable bias term and reducing the complex, non-convex reward-maximization objective to an analytically differentiable mean squared error (MSE) loss, making optimization stable, efficient, and convex. The design retains on-policy exploration, needs fewer hyperparameters, converges faster with less training overfitting, and generalizes better on mathematical reasoning benchmarks while remaining computationally efficient.
Link: https://arxiv.org/abs/2510.23868
Authors: Zhichao Wang
Affiliations: Inflection AI
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:I propose Group-relative Implicit Fine Tuning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
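To make the normalized-MSE idea concrete, here is a minimal sketch of what a GIFT-style objective for one group of sampled responses could look like, using a DPO-style implicit reward beta * log(pi/pi_ref) and GRPO-style group normalization. The shapes, the beta value, and the exact placement of the stop-gradient are assumptions for illustration, not the paper's specification.

```python
import torch

def gift_style_loss(logp_policy, logp_ref, explicit_rewards, beta=0.1):
    """Sketch of an MSE objective between group-normalized implicit and
    explicit rewards for one prompt's group of sampled responses.
    logp_policy / logp_ref: (G,) summed log-probs of G responses;
    explicit_rewards: (G,) scores from an explicit reward model."""
    implicit = beta * (logp_policy - logp_ref)      # DPO-style implicit reward
    # Group-relative normalization, as in GRPO's advantage computation.
    norm = lambda r: (r - r.mean()) / (r.std() + 1e-8)
    return torch.mean((norm(implicit) - norm(explicit_rewards).detach()) ** 2)

logp_policy = torch.randn(4, requires_grad=True)    # toy group of 4 responses
loss = gift_style_loss(logp_policy, torch.randn(4), torch.randn(4))
loss.backward()
print(f"GIFT-style loss: {loss.item():.4f}")
```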
[NLP-88] Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs EMNLP2025
【Quick Read】: This paper addresses two problems in converting database (DB) query results into natural language representations (NLRs) for Text-to-SQL systems: information loss or misrepresentation in the generated NLRs, and the limited fidelity and high LLM-call cost of existing evaluation methods. The key to the solution is Combo-Eval, a novel evaluation method that combines the strengths of multiple existing approaches, preserving evaluation fidelity while cutting LLM calls by 25-61%. It is accompanied by NLR-BIRD, the first dedicated dataset for NLR benchmarking, and human evaluations show Combo-Eval aligns better with human judgments in scenarios both with and without ground-truth references.
Link: https://arxiv.org/abs/2510.23854
Authors: Jyotika Singh,Weiyi Sun,Amit Agarwal,Viji Krishnamurthy,Yassine Benajiba,Sujith Ravi,Dan Roth
Affiliations: Oracle AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2025
Abstract:In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables the chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss or errors in presenting tabular results in NL remains largely unexplored. This paper introduces a novel evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and achieving a significant reduction in LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.
[NLP-89] Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception
【Quick Read】: This paper studies temporal blindness in large language model agents: in multi-turn conversations they operate on a stationary context by default and cannot judge, from the real-world time elapsed between messages, whether a tool call is needed, leading either to over-reliance on stale context (skipping necessary tool calls) or to unnecessarily repeated tool calls. The key to the solution is to augment dialogue messages with explicit timestamps so the model can make tool-calling decisions consistent with human intuition about time intervals, together with TicToc-v1, a test set of multi-turn trajectories across 34 time-sensitive scenarios, and human preference data (prefer-noTool and prefer-Tool subsets) for quantifying tool-calling accuracy across intervals. The analysis shows that naive prompt-based alignment has limited effect, highlighting the need for dedicated post-training alignment with human temporal perception.
Link: https://arxiv.org/abs/2510.23853
Authors: Yize Cheng,Arshia Soltani Moakhar,Chenrui Fan,Kazem Faghih,Parsa Hosseini,Wenxiao Wang,Soheil Feizi
Affiliations: University of Maryland, College Park
Subjects: Computation and Language (cs.CL)
Comments: preliminary work in progress
Abstract:Large language model agents are increasingly used in multi-turn conversational settings to interact with and execute tasks in dynamic environments. However, a key limitation is their temporal blindness: they, by default, operate with a stationary context, failing to account for the real-world time elapsed between messages. This becomes a critical liability when an agent must decide whether to invoke a tool based on how much time has passed since the last observation. Without temporal awareness, agents often either over-rely on previous context (skipping necessary tool calls), or under-rely on it (unnecessarily repeating tool calls). To study this challenge, we introduce TicToc-v1, a test set of multi-turn user-agent trajectories across 34 scenarios with varying time sensitivity. Each trajectory ends with a user question, where the need for a tool call depends on the amount of time elapsed since the last message. To give LLMs temporal context, we augment dialogue messages with explicit timestamps, bridging the gap between static dialogue and evolving environments. We then collected human preferences for these samples, creating two subsets: one where humans preferred relying on the previous observation (prefer-noTool), and another where they preferred a new tool call (prefer-Tool). We evaluated how well LLM tool-calling decisions align with human preferences under varying time intervals on TicToc-v1. Our analysis shows that without time information, most models perform only slightly better than random, with the top alignment rate being just over 60%. While adding timestamps leads to a slight improvement, particularly for larger models, the improvement is modest, peaking at around 65%. We also show that naive, prompt-based alignment has limited effectiveness. Our findings highlight the need for specific post-training alignment to align multi-turn LLM tool use with human temporal perception.
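The timestamp augmentation described in the abstract can be sketched in a few lines: each dialogue message is prefixed with the wall-clock time it was sent, so the model can reason about elapsed time. The bracketed format below is an illustrative choice, not the paper's exact prompt template.

```python
from datetime import datetime, timezone

def with_timestamps(messages):
    """Prefix each message with the wall-clock time it was sent, so the
    model can reason about elapsed time between turns."""
    out = []
    for msg in messages:
        stamp = msg["sent_at"].strftime("%Y-%m-%d %H:%M:%S")
        out.append({"role": msg["role"],
                    "content": f"[{stamp}] {msg['content']}"})
    return out

history = [
    {"role": "user", "sent_at": datetime(2025, 10, 28, 9, 0, tzinfo=timezone.utc),
     "content": "What's the weather in Boston?"},
    {"role": "user", "sent_at": datetime(2025, 10, 29, 9, 5, tzinfo=timezone.utc),
     "content": "Is it still raining?"},  # a full day later: a fresh tool call is warranted
]
for m in with_timestamps(history):
    print(m["content"])
```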
[NLP-90] CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
【Quick Read】: This paper addresses the difficulty language models have in reliably detecting mental health crises (suicide ideation, rape, domestic violence, child abuse, and sexual harassment) during user-model interactions, where failures can have serious consequences. The key to the solution is CRADLE BENCH, a multi-faceted crisis detection benchmark covering seven clinically defined crisis types and the first to incorporate temporal labels capturing how crises evolve. It provides 600 clinician-annotated evaluation examples and 420 development examples, plus a training corpus of around 4K examples automatically labeled by a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. Six crisis detection models are further fine-tuned on subsets defined by consensus and unanimous ensemble agreement, yielding complementary detectors trained under different agreement criteria.
Link: https://arxiv.org/abs/2510.23845
Authors: Grace Byun,Rebecca Lipschutz,Sean T. Minton,Abigail Lott,Jinho D. Choi
Affiliations: Emory University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user–model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.
[NLP-91] How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse
【Quick Read】: This paper addresses the fact that most sign language models are trained on interpreter or isolated-vocabulary data and therefore miss the dynamic contextual variation of natural dialogue. The key to the solution is a motion-capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue that, via continuous kinematic features, disentangles dialogue-specific entrainment from individual effort reduction and quantifies spatiotemporal changes across repeated mentions of STEM terms. Results show that dialogue signs are on average 24.6%-44.6% shorter in duration than isolated signs, with significant reductions absent in monologue contexts, revealing how pragmatics shape sign articulation and its representation in sign language technologies.
Link: https://arxiv.org/abs/2510.23842
Authors: Saki Imai,Lee Kezar,Laurel Aichler,Mert Inan,Erin Walker,Alicia Wooten,Lorna Quandt,Malihe Alikhani
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, which overlooks the variability that characterizes natural dialogue. However, human communication dynamically adapts to contexts and interlocutors through spatiotemporal changes and articulation style. This specifically manifests itself in educational settings, where novel vocabularies are used by teachers, and students. To address this gap, we collect a motion capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue that enables quantitative comparison between dyadic interactive signing, solo signed lecture, and interpreted articles. Using continuous kinematic features, we disentangle dialogue-specific entrainment from individual effort reduction and show spatiotemporal changes across repeated mentions of STEM terms. On average, dialogue signs are 24.6%-44.6% shorter in duration than the isolated signs, and show significant reductions absent in monologue contexts. Finally, we evaluate sign embedding models on their ability to recognize STEM signs and approximate how entrained the participants become over time. Our study bridges linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies.
[NLP-92] Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
【Quick Read】: This paper addresses the limitations of large language models (LLMs) in processing culturally grounded language, in particular their understanding and pragmatic use of figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance, the authors design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation, and evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. A key contribution is Kinayat, the first dataset of Egyptian Arabic idioms designed for evaluating both figurative understanding and pragmatic use. The results reveal a consistent hierarchy of cross-cultural degradation: accuracy on Arabic proverbs is 4.29% lower than on English proverbs, and on Egyptian idioms a further 10.28% lower than on Arabic proverbs; pragmatic-use accuracy drops by 14.07% relative to understanding. This shows that while LLMs can often interpret figurative meaning, using it appropriately in context remains a clear challenge, underscoring cultural reasoning as a new dimension of LLM evaluation.
Link: https://arxiv.org/abs/2510.23828
Authors: Mena Attia,Aashiq Muhamed,Mai Alkhamissi,Thamar Solorio,Mona Diab
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.
[NLP-93] BitSkip: An Empirical Analysis of Quantization and Early Exit Composition
【Quick Read】: This paper addresses the poorly understood compositional effects of combining compression techniques, such as extreme quantization and dynamic routing, in efficient Large Language Models (LLMs). The key to the solution is BitSkip, a hybrid architectural framework for systematically exploring these interactions. The central finding is counter-intuitive: a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also matches the full-precision baseline in quality (perplexity 1.13 vs. 1.19), while introducing Hadamard transforms, even at 8-bit precision, catastrophically degrades performance by over 37,000% due to training instability. BitSkip-V1 also exhibits superior early-exit behavior: exiting at layer 18 yields a 32.5% speed gain for only a 4% quality loss. The results suggest that over-engineered compression can introduce instability, and that simpler designs can achieve the best trade-off.
Link: https://arxiv.org/abs/2510.23766
Authors: Ramshankar Bhuvaneswaran,Handan Liu
Affiliations: Northeastern University
Subjects: Computation and Language (cs.CL)
Comments: Submitted to JMLR
Abstract:The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes with the full-precision baseline in quality (perplexity of 1.13 vs. 1.19). The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, which we trace to fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing an optimal 32.5% speed gain for a minimal 4% quality loss.
[NLP-94] RoboOmni: Proactive Robot Manipulation in Omni-modal Context
【Quick Read】: This paper addresses the limitation that current robot manipulation relies on explicit instructions, whereas in real scenarios humans rarely issue direct commands, so robots must proactively infer user intent for effective collaboration. The key to the solution is RoboOmni, a Perceiver-Thinker-Talker-Executor framework built on end-to-end omni-modal large language models that unifies intention recognition, interaction confirmation, and action execution; it fuses auditory and visual signals spatiotemporally for robust intention recognition and supports direct speech interaction. To fill the absence of training data for proactive intention recognition in robotic manipulation, the authors build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types; experiments in simulation and the real world show gains in success rate, inference speed, intention recognition, and proactive assistance.
Link: https://arxiv.org/abs/2510.23763
Authors: Siyin Wang,Jinlan Fu,Feihong Liu,Xinzhe He,Huangxuan Wu,Junhao Shi,Kexin Huang,Zhaoye Fei,Jingjing Gong,Zuxuan Wu,Yugang Jiang,See-Kiong Ng,Tat-Seng Chua,Xipeng Qiu
Affiliations: Fudan University; Shanghai Innovation Institute; National University of Singapore
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
[NLP-95] Evaluating Long-Term Memory for Long-Context Question Answering
【Quick Read】: This paper addresses the lack of persistent memory and experiential learning in large language models (LLMs) for long-context conversational tasks, capabilities needed for true conversational continuity and knowledge accumulation. The key to the solution is a systematic evaluation of memory-augmented methods: full-context prompting; semantic memory via retrieval-augmented generation (RAG) and agentic memory; episodic memory via in-context learning; and procedural memory via prompt optimization. The study finds that memory-augmented approaches cut token usage by over 90% while maintaining competitive accuracy, and that memory architecture complexity should scale with model capability: small foundation models benefit most from RAG, while strong instruction-tuned reasoning models gain from episodic memory (especially through reflections) and from more complex agentic semantic memory, with episodic memory also helping LLMs recognize the limits of their own knowledge.
Link: https://arxiv.org/abs/2510.23730
Authors: Alessandra Terranova,Björn Ross,Alexandra Birch
Affiliations: The University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments: 14 pages including appendix, 3 figures. Submitted to October ARR and to Metacognition in Generative AI EurIPS workshop (under review for both)
Abstract:In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning models gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
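As a minimal illustration of the semantic-memory (RAG) baseline evaluated here, the sketch below retrieves the past dialogue snippets most similar to a query. It uses a toy bag-of-words embedding purely so the example is self-contained; a real system would use a sentence encoder.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding as a sparse term-count dictionary."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory, query, k=2):
    """Return the k past dialogue snippets most similar to the query,
    the core operation of a RAG-style semantic memory."""
    q = embed(query)
    return sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

memory = ["Alice adopted a dog named Rex in June.",
          "Bob moved to Lisbon for a new job.",
          "Alice's dog Rex loves the beach."]
print(retrieve(memory, "what is the name of alice's dog"))
```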
[NLP-96] MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection
【Quick Read】: This paper addresses the performance bottleneck of video language models (VideoLMs) on sarcasm detection, a task that depends not only on the literal content of an utterance but also on integrating multimodal cues (tonality, facial expressions, and conversational context) and pragmatically reasoning about the speaker's implied intent. Existing models struggle to identify relevant cues across modalities and to reason over them, limiting their grasp of sarcasm in complex scenarios. The key to the solution is MUStReason, a diagnostic benchmark annotated with modality-specific relevant cues and the underlying reasoning steps, providing a structured basis for evaluating perception versus reasoning; the authors also propose PragCoT, a framework that steers VideoLMs to focus on implied intent over literal meaning, improving the accuracy and interpretability of sarcasm detection.
Link: https://arxiv.org/abs/2510.23727
Authors: Anisha Saha,Varsha Suresh,Timothy Hospedales,Vera Demberg
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Sarcasm is a specific type of irony which involves discerning what is said from what is meant. Detecting sarcasm depends not only on the literal content of an utterance but also on non-verbal cues such as the speaker's tonality, facial expressions and conversational context. However, current multimodal models struggle with complex tasks like sarcasm detection, which require identifying relevant cues across modalities and pragmatically reasoning over them to infer the speaker's intention. To explore these limitations in VideoLMs, we introduce MUStReason, a diagnostic benchmark enriched with annotations of modality-specific relevant cues and underlying reasoning steps to identify sarcastic intent. In addition to benchmarking sarcasm classification performance in VideoLMs, we use MUStReason to quantitatively and qualitatively evaluate the generated reasoning by disentangling the problem into perception and reasoning. Building on this analysis, we propose PragCoT, a framework that steers VideoLMs to focus on implied intentions over literal meaning, a property core to detecting sarcasm.
[NLP-97] VisCoder2: Building Multi-Language Visualization Coding Agents
【Quick Read】: This paper addresses three challenges facing generative AI for visualization code generation: limited language coverage, unreliable execution, and the absence of iterative correction, with existing models constrained by narrow datasets and single-round evaluation paradigms. The key to the solution is three complementary resources: VisCode-Multi-679K, a large-scale supervised dataset of 679K validated, executable visualization samples with multi-turn correction dialogues across 12 programming languages; VisPlotBench, a benchmark with executable tasks, rendered outputs, and standardized protocols for both initial generation and multi-round self-debug; and VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show VisCoder2 significantly outperforms strong open-source baselines and approaches GPT-4.1, reaching an 82.4% overall execution pass rate at the 32B scale with iterative self-debug, with particularly strong gains in symbolic or compiler-dependent languages.
Link: https://arxiv.org/abs/2510.23642
Authors: Yuansheng Ni,Songcheng Cai,Xiangchao Chen,Jiarong Liang,Zhiheng Lyu,Jiaqi Deng,Kai Zou,Ping Nie,Fei Yuan,Xiang Yue,Wenhu Chen
Affiliations: University of Waterloo; Carnegie Mellon University; Korea Advanced Institute of Science & Technology; Netmind.ai; Independent Researcher
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments:
Abstract:Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1, with further gains from iterative self-debug, reaching 82.4% overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.
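The multi-round self-debug protocol can be sketched as an execute-and-repair loop: run the generated code, and on failure feed the traceback back to the model for a revision. In the sketch below, `generate` is a placeholder for an LLM call; the stub used in the demo simply fixes a known typo.

```python
import subprocess
import sys
import tempfile

def self_debug(generate, code, max_rounds=3):
    """Run generated visualization code; on failure, hand the traceback
    back to the model for a corrected version. `generate(code, error)` is
    a placeholder for an LLM call returning revised code."""
    for _ in range(max_rounds):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=60)
        if proc.returncode == 0:
            return code, True               # executed cleanly
        code = generate(code, proc.stderr)  # one repair round
    return code, False

# Demo with a stub "model" that fixes a known typo.
fix = lambda code, err: code.replace("pirnt", "print")
final, ok = self_debug(fix, "pirnt('hello plot')")
print(ok, final)
```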
[NLP-98] Combining Textual and Structural Information for Premise Selection in Lean
【Quick Read】: This paper targets a key bottleneck in scaling theorem proving over large formal libraries: premise selection. Existing language-based methods typically treat premises in isolation, ignoring the web of dependencies connecting them. The key to the solution is a graph-augmented approach that combines dense text embeddings of Lean formalizations with graph neural networks over a heterogeneous dependency graph explicitly modeling both state-premise and premise-premise relations, enabling more effective premise selection.
Link: https://arxiv.org/abs/2510.23637
Authors: Job Petrovčič,David Eliecer Narvaez Denis,Ljupčo Todorovski
Affiliations: Faculty of Mathematics and Physics, University of Ljubljana; Department of Knowledge Technologies, Jožef Stefan Institute
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments:
Abstract:Premise selection is a key bottleneck for scaling theorem proving in large formal libraries. Yet existing language-based methods often treat premises in isolation, ignoring the web of dependencies that connects them. We present a graph-augmented approach that combines dense text embeddings of Lean formalizations with graph neural networks over a heterogeneous dependency graph capturing both state–premise and premise–premise relations. On the LeanDojo Benchmark, our method outperforms the ReProver language-based baseline by over 25% across standard retrieval metrics. These results demonstrate the power of relational information for more effective premise selection.
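A minimal sketch of the general idea, not the paper's architecture: seed premise nodes with text embeddings, run one round of message passing over the dependency edges, and score premises against a proof-state embedding by dot product. The shapes, the mean-aggregation rule, and the random features are illustrative assumptions.

```python
import torch

def message_pass(node_feats, edges):
    """One round of mean-neighbor aggregation over a premise dependency
    graph. node_feats: (N, d); edges: list of (src, dst) premise links."""
    agg = torch.zeros_like(node_feats)
    deg = torch.zeros(node_feats.size(0))
    for s, d in edges:
        agg[d] += node_feats[s]
        deg[d] += 1
    agg = agg / deg.clamp(min=1).unsqueeze(-1)
    return torch.relu(node_feats + agg)     # residual update

# Toy setup: 5 premises with random "text embeddings" and a dependency edge list.
prem = torch.randn(5, 16)
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
prem_graph = message_pass(prem, edges)

state = torch.randn(16)                      # embedding of the proof state
scores = prem_graph @ state                  # state-premise relevance scores
print("ranked premises:", scores.argsort(descending=True).tolist())
```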
[NLP-99] Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
【Quick Read】: This paper aims at more accurate flight delay prediction in air traffic management, focusing on real-time prediction of delays after aircraft enter the terminal area to improve overall airspace network efficiency. The key to the solution is a lightweight large-language-model-based multimodal prediction framework that adapts flight trajectory data into the language modality and fuses it with textual aeronautical information (flight information, weather reports, and aerodrome notices) to capture the contextual sources of delay. Combining linguistic understanding with cross-modality adaptation of trajectory information yields sub-minute prediction error and supports practical, scalable real-time updates as new operational information arrives in dynamic air traffic environments.
Link: https://arxiv.org/abs/2510.23636
Authors: Thaweerath Phisannupawong,Joshua Julian Damanik,Han-Lim Choi
Affiliations: KAIST
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint submitted to Aerospace Science and Technology (Elsevier) for possible publication
Abstract:Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. Experimental results show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory information, enhances delay prediction. Moreover, the approach shows practicality and scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.
[NLP-100] NUM2EVENT: Interpretable Event Reasoning from Numerical time-series
【Quick Read】: This paper addresses the limited ability of large language models (LLMs) to understand purely numerical time-series signals: existing approaches focus on forecasting or trend description and cannot uncover the latent events that drive numerical changes or explain the reasoning behind them. The key to the solution is a new task, number-to-event reasoning and decoding, and a reasoning-aware framework with three core components: an agent-guided event extractor (AGE); a marked multivariate Hawkes-based synthetic generator (EveDTS) to address data scarcity; and a two-stage fine-tuning pipeline combining a time-series encoder with a structured decoder. The model explicitly reasons over numerical changes, generates intermediate explanations, and outputs structured event hypotheses, substantially outperforming strong LLM baselines in event-level precision and recall on multi-domain datasets and pointing to a new direction for bridging quantitative reasoning and semantic understanding.
Link: https://arxiv.org/abs/2510.23630
Authors: Ninghui Feng,Yiyan Qi
Affiliations: International Digital Economy Academy (IDEA); University of Nottingham Ningbo China
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have recently demonstrated impressive multimodal reasoning capabilities, yet their understanding of purely numerical time-series signals remains limited. Existing approaches mainly focus on forecasting or trend description, without uncovering the latent events that drive numerical changes or explaining the reasoning process behind them. In this work, we introduce the task of number-to-event reasoning and decoding, which aims to infer interpretable structured events from numerical inputs, even when current text is unavailable. To address the data scarcity and semantic alignment challenges, we propose a reasoning-aware framework that integrates an agent-guided event extractor (AGE), a marked multivariate Hawkes-based synthetic generator (EveDTS), and a two-stage fine-tuning pipeline combining a time-series encoder with a structured decoder. Our model explicitly reasons over numerical changes, generates intermediate explanations, and outputs structured event hypotheses. Experiments on multi-domain datasets show that our method substantially outperforms strong LLM baselines in event-level precision and recall. These results suggest a new direction for bridging quantitative reasoning and semantic understanding, enabling LLMs to explain and predict events directly from numerical dynamics.
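EveDTS is described as a marked multivariate Hawkes-based generator; the sketch below shows the univariate special case with an exponential kernel, simulated by Ogata thinning, to make the mechanism concrete. The parameters and the mark set are illustrative, not the paper's.

```python
import math
import random

def simulate_hawkes(mu=0.5, alpha=0.8, beta=1.5, horizon=20.0, marks=("up", "down")):
    """Ogata thinning for a univariate Hawkes process with exponential
    kernel, a simplified stand-in for EveDTS's marked multivariate version.
    Intensity: lambda(t) = mu + alpha * sum(exp(-beta * (t - t_i)))."""
    t, events = 0.0, []
    while t < horizon:
        # Upper bound on the intensity until the next event (it only decays).
        lam_bar = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti, _ in events)
        t += random.expovariate(lam_bar)          # candidate next event time
        if t >= horizon:
            break
        lam_t = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti, _ in events)
        if random.random() <= lam_t / lam_bar:    # accept with prob lambda/bound
            events.append((t, random.choice(marks)))
    return events

random.seed(0)
for time, mark in simulate_hawkes()[:5]:
    print(f"t={time:5.2f}  mark={mark}")
```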
[NLP-101] From Detection to Discovery: A Closed-Loop Approach for Simultaneous and Continuous Medical Knowledge Expansion and Depression Detection on Social Media
【Quick Read】: This paper addresses the limited predictive accuracy of conventional depression detection from social media user-generated content (UGC) and the missed opportunity to use the prediction process to expand medical knowledge. The key to the solution is a closed-loop Large Language Model (LLM)-knowledge graph framework in which prediction and knowledge expansion co-evolve in an iterative learning cycle. In the knowledge-aware depression detection phase, the LLM jointly performs depression detection and entity extraction while the knowledge graph represents and weights the extracted entities to refine prediction performance; in the knowledge refinement and expansion phase, newly discovered entities, relationships, and entity types are incorporated into the knowledge graph under expert supervision, enabling continual evolution of medical knowledge. The framework realizes mutually reinforcing prediction-through-learning and learning-through-prediction, markedly improving accuracy while surfacing clinically meaningful, interpretable knowledge.
Link: https://arxiv.org/abs/2510.23626
Authors: Shuang Geng,Wenli Zhang,Jiaheng Xie,Rui Wang,Sudha Ram
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Presented at SWAIB2025 and HICSS2026
Abstract:Social media user-generated content (UGC) provides real-time, self-reported indicators of mental health conditions such as depression, offering a valuable source for predictive analytics. While prior studies integrate medical knowledge to improve prediction accuracy, they overlook the opportunity to simultaneously expand such knowledge through predictive processes. We develop a Closed-Loop Large Language Model (LLM)-Knowledge Graph framework that integrates prediction and knowledge expansion in an iterative learning cycle. In the knowledge-aware depression detection phase, the LLM jointly performs depression detection and entity extraction, while the knowledge graph represents and weights these entities to refine prediction performance. In the knowledge refinement and expansion phase, new entities, relationships, and entity types extracted by the LLM are incorporated into the knowledge graph under expert supervision, enabling continual knowledge evolution. Using large-scale UGC, the framework enhances both predictive accuracy and medical understanding. Expert evaluations confirmed the discovery of clinically meaningful symptoms, comorbidities, and social triggers complementary to existing literature. We conceptualize and operationalize prediction-through-learning and learning-through-prediction as mutually reinforcing processes, advancing both methodological and theoretical understanding in predictive analytics. The framework demonstrates the co-evolution of computational models and domain knowledge, offering a foundation for adaptive, data-driven knowledge systems applicable to other dynamic risk monitoring contexts.
[NLP-102] An Enhanced Dual Transformer Contrastive Network for Multimodal Sentiment Analysis
【Quick Read】: This paper addresses how to fuse textual and visual information more effectively in Multimodal Sentiment Analysis (MSA) to improve sentiment recognition accuracy. The core solution is BERT-ViT-EF, a model that fuses Transformer-based encoders, BERT for text and ViT for images, via an early fusion strategy at the input level, promoting deeper cross-modal interaction and joint representation learning. Building on it, the authors propose the Dual Transformer Contrastive Network (DTCN), which adds a Transformer encoder layer after BERT to refine textual context before fusion and applies contrastive learning to align text and image representations, strengthening robustness and generalization. DTCN reaches 78.4% accuracy and 78.3% F1 on TumEmo, validating the benefits of early fusion and deeper contextual modeling.
Link: https://arxiv.org/abs/2510.23617
Authors: Phuong Q. Dao,Mark Roantree,Vuong M. Ngo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The paper has been accepted for presentation at the MEDES 2025 conference
Abstract:Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by jointly analyzing data from multiple modalities typically text and images offering a richer and more accurate interpretation than unimodal approaches. In this paper, we first propose BERT-ViT-EF, a novel model that combines powerful Transformer-based encoders BERT for textual input and ViT for visual input through an early fusion strategy. This approach facilitates deeper cross-modal interactions and more effective joint representation learning. To further enhance the model’s capability, we propose an extension called the Dual Transformer Contrastive Network (DTCN), which builds upon BERT-ViT-EF. DTCN incorporates an additional Transformer encoder layer after BERT to refine textual context (before fusion) and employs contrastive learning to align text and image representations, fostering robust multimodal feature learning. Empirical results on two widely used MSA benchmarks MVSA-Single and TumEmo demonstrate the effectiveness of our approach. DTCN achieves best accuracy (78.4%) and F1-score (78.3%) on TumEmo, and delivers competitive performance on MVSA-Single, with 76.6% accuracy and 75.9% F1-score. These improvements highlight the benefits of early fusion and deeper contextual modeling in Transformer-based multimodal sentiment analysis.
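A minimal sketch of the two ingredients named in the summary follows: early fusion of text and image tokens in a shared Transformer encoder, plus an InfoNCE-style alignment loss. The dimensions, depth, and pooling are illustrative choices rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusionClassifier(nn.Module):
    """Minimal early-fusion sketch: text and image token embeddings are
    concatenated into one sequence before a shared Transformer encoder."""
    def __init__(self, dim=64, n_classes=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, text_tokens, image_tokens):
        fused = torch.cat([text_tokens, image_tokens], dim=1)  # early fusion
        h = self.encoder(fused).mean(dim=1)                    # pooled joint feature
        return self.cls(h)

def alignment_loss(text_vec, image_vec, temperature=0.07):
    """InfoNCE-style contrastive loss pulling paired text/image features together."""
    t = F.normalize(text_vec, dim=-1)
    v = F.normalize(image_vec, dim=-1)
    logits = t @ v.T / temperature
    return F.cross_entropy(logits, torch.arange(t.size(0)))

model = EarlyFusionClassifier()
text, image = torch.randn(2, 10, 64), torch.randn(2, 49, 64)
logits = model(text, image)
loss = F.cross_entropy(logits, torch.tensor([0, 2])) + \
       alignment_loss(text.mean(1), image.mean(1))
print(logits.shape, f"loss={loss.item():.3f}")
```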
[NLP-103] Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide ACL
【Quick Read】: This paper addresses efficient fine-tuning of large language models (LLMs) under data scarcity, particularly for low-resource languages, specialized domains, and constrained deployment settings. The key to the solution is a systematic survey of parameter-efficient fine-tuning (PEFT) techniques, domain and cross-lingual adaptation methods for encoder and decoder models, model specialization strategies, and preference-alignment approaches based on limited human or synthetic feedback, with an emphasis on sample and compute efficiency. By structuring the empirical trade-offs across model scaling, data scaling, and catastrophic-forgetting mitigation, the survey gives researchers and practitioners actionable selection criteria and best practices.
Link: https://arxiv.org/abs/2411.09539
Authors: Marton Szep,Daniel Rueckert,Rüdiger von Eisenhart-Rothe,Florian Hinterwimmer
Affiliations: Technical University of Munich (TUM); TUM University Hospital; Munich Center for Machine Learning (MCML); Imperial College London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to TACL. Pre-MIT Press version. Major restructuring; added preference alignment section and additional tables. 36 pages
Abstract:Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective adaptation under data scarcity requires focused and efficient fine-tuning techniques. This paper presents a structured and practical survey of recent methods for fine-tuning LLMs in data-scarce scenarios. We systematically review parameter-efficient fine-tuning techniques that lower training and deployment costs, domain and cross-lingual adaptation methods for both encoder and decoder models, and model specialization strategies. We further examine preference alignment approaches that guide model behavior using limited human or synthetic feedback, emphasizing sample and compute efficiency. Throughout, we highlight empirical trade-offs, selection criteria, and best practices for choosing suitable techniques based on task constraints, including model scaling, data scaling, and the mitigation of catastrophic forgetting. The aim is to equip researchers and practitioners with actionable insights for effectively fine-tuning LLMs when data and resources are limited.
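Among the parameter-efficient techniques this survey covers, LoRA is the canonical example; a minimal sketch follows, freezing a pretrained linear layer and training only a low-rank update. The rank, scaling, and initialization here follow common practice and are not prescriptions from the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA: a frozen pretrained linear layer plus a trainable
    low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable}/{total}")  # only the low-rank factors train
```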
[NLP-104] A Neural Model for Contextual Biasing Score Learning and Filtering
【Quick Read】: This paper addresses the accuracy degradation of automatic speech recognition (ASR) when contextual knowledge, such as user-specific phrases or entities, is not exploited. The key to the solution is an attention-based biasing decoder that scores candidate phrases using acoustic information extracted by the ASR encoder; the scores are used both to filter out unlikely candidate phrases and to compute bonuses for shallow-fusion biasing. A per-token discriminative objective further encourages higher scores for ground-truth phrases while suppressing distractors, yielding significant accuracy improvements under different biasing conditions on the Librispeech biasing benchmark. The approach is modular, works with any ASR system, and its filtering mechanism can potentially boost other biasing methods as well.
Link: https://arxiv.org/abs/2510.23849
Authors: Wanting Huang,Weiran Wang
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to IEEE ASRU 2025
Abstract:Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder, which can be used to filter out unlikely phrases and to calculate bonus for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors. Experiments on the Librispeech biasing benchmark show that our method effectively filters out majority of the candidate phrases, and significantly improves recognition accuracy under different biasing conditions when the scores are used in shallow fusion biasing. Our approach is modular and can be used with any ASR system, and the filtering mechanism can potentially boost performance of other biasing methods.
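The shallow-fusion use of the biasing scores can be sketched simply: after low-scoring phrases are filtered out, each beam hypothesis receives a bonus for every surviving bias phrase it contains. The function names, the subsequence check, and the bonus weight below are illustrative assumptions, not the paper's exact scoring rule.

```python
def shallow_fusion_score(base_logprob, hyp_tokens, phrase_scores, bonus_weight=2.0):
    """Add a bonus to a beam hypothesis's log-prob for each surviving bias
    phrase it contains. phrase_scores maps a phrase (token tuple) to the
    biasing decoder's score; low-scoring phrases were already filtered out."""
    text = tuple(hyp_tokens)
    bonus = 0.0
    for phrase, score in phrase_scores.items():
        n = len(phrase)
        # Check whether the phrase occurs as a contiguous subsequence.
        if any(text[i:i + n] == phrase for i in range(len(text) - n + 1)):
            bonus += bonus_weight * score
    return base_logprob + bonus

phrases = {("john", "doe"): 0.9, ("acme", "corp"): 0.4}  # post-filter scores
hyp = ["call", "john", "doe", "now"]
print(shallow_fusion_score(-12.3, hyp, phrases))
```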
Computer Vision
[CV-0] Generative View Stitching
【Quick Read】: This paper addresses a limitation of autoregressive video diffusion models in camera-guided video generation: unable to condition on the future, they collide with the generated scene along a predefined camera trajectory, after which autoregression quickly collapses. The key to the solution is Generative View Stitching (GVS), which samples the entire video sequence in parallel so the generated scene stays faithful to every part of the predefined camera trajectory. GVS's core contribution is a sampling algorithm that extends diffusion stitching from robot planning to video generation and, without any extra training, works with any off-the-shelf video model trained with Diffusion Forcing; combined with Omni Guidance, which conditions on both past and future, it improves temporal consistency and enables a loop-closing mechanism for long-range coherence.
Link: https://arxiv.org/abs/2510.24718
Authors: Chonghyuk Song,Michal Stary,Boyuan Chen,George Kopanas,Vincent Sitzmann
Affiliations: MIT CSAIL; RunwayML
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project website: this https URL
Abstract:Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd’s Impossible Staircase. Results are best viewed as videos at this https URL.
[CV-1] Uniform Discrete Diffusion with Metric Path for Video Generation
【Quick Read】: This paper addresses why discrete video generation lags behind continuous-space methods in high-resolution image synthesis and long-duration video generation, namely error accumulation and long-range inconsistency. The core solution is Uniform discRete diffuSion with metric pAth (URSA), which achieves efficient, high-quality discrete video generation through two key designs: a Linearized Metric Path, which constructs a stable, differentiable diffusion path in discrete space, and Resolution-dependent Timestep Shifting, which adaptively adjusts the sampling schedule across resolutions, reducing inference steps and improving generation stability. An asynchronous temporal fine-tuning strategy further unifies versatile tasks, including interpolation and image-to-video generation, in a single model, strengthening its generalization across diverse tasks.
Link: https://arxiv.org/abs/2510.24717
Authors: Haoge Deng,Ting Pan,Fan Zhang,Yang Liu,Zhuoyan Luo,Yufeng Cui,Wenxuan Wang,Chunhua Shen,Shiguang Shan,Zhaoxiang Zhang,Xinlong Wang
Affiliations: National Laboratory of Pattern Recognition, CASIA; Key Laboratory of Intelligent Information Processing, ICT, CAS; University of Chinese Academy of Sciences; Zhejiang University; Beijing Academy of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures
Abstract:Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at this https URL
[CV-2] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
【Quick Read】: This paper addresses the limited gains from applying Mixture-of-Experts (MoE) architectures to Diffusion Transformers (DiTs). The authors attribute this to a fundamental difference between language and visual tokens: language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, which hinders expert specialization in vision MoE. The key to the solution, ProMoE, is a two-step router with explicit routing guidance: conditional routing first partitions image tokens into conditional and unconditional sets according to their functional roles, then prototypical routing with learnable prototypes refines the assignment of conditional image tokens by semantic content, strengthening expert specialization. The similarity-based expert allocation in latent space also provides a natural mechanism for explicit semantic guidance, and a routing contrastive loss further promotes intra-expert coherence and inter-expert diversity, markedly improving vision MoE performance.
Link: https://arxiv.org/abs/2510.24711
Authors: Yujie Wei,Shiwei Zhang,Hangjie Yuan,Yujin Han,Zhekai Chen,Jiayu Wang,Difan Zou,Xihui Liu,Yingya Zhang,Yu Liu,Hongming Shan
Affiliations: Fudan University; Tongyi Lab, Alibaba Group; Zhejiang University; The University of Hong Kong; MMLab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
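The second routing step can be sketched as nearest-prototype assignment in latent space: each conditional image token is routed to the expert whose learnable prototype it is most similar to. The sketch below is a minimal stand-in for ProMoE's router; the shapes and top-k value are assumptions.

```python
import torch
import torch.nn.functional as F

def prototypical_route(tokens, prototypes, top_k=1):
    """Assign each conditional image token to the expert(s) whose learnable
    prototype is most similar in latent space.
    tokens: (T, d); prototypes: (E, d), one per expert."""
    sims = F.normalize(tokens, dim=-1) @ F.normalize(prototypes, dim=-1).T
    weights, experts = sims.topk(top_k, dim=-1)   # (T, top_k)
    return experts, weights.softmax(dim=-1)

tokens = torch.randn(16, 64)                          # conditional image tokens
prototypes = torch.nn.Parameter(torch.randn(4, 64))   # one prototype per expert
experts, weights = prototypical_route(tokens, prototypes)
print("expert assignments:", experts.squeeze(-1).tolist())
```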
[CV-3] Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? NEURIPS2025
【Quick Read】: This paper asks whether object binding, the ability to recognize which image patches belong to the same object and integrate them into a coherent high-level representation, naturally emerges in Vision Transformers (ViTs) without explicitly imposed object-centric attention such as Slot Attention. The key to the solution is to define and test a property the authors call IsSameObject, whether two patches belong to the same object, decoded from patch embeddings across ViT layers with a similarity probe. Experiments show that self-supervised ViTs (DINO, MAE, CLIP) reliably learn the IsSameObject signal (over 90% probe accuracy), while ImageNet-supervised models are markedly weaker, indicating the ability is acquired through specific pretraining objectives rather than being a trivial architectural artifact. Further analysis finds that IsSameObject is encoded in a low-dimensional subspace on top of object features and actively guides attention; ablating it degrades downstream performance and works against the pretraining objective, demonstrating that object binding emerges naturally in ViTs in service of the pretraining objective.
Link: https://arxiv.org/abs/2510.24709
Authors: Yihao Li,Saeed Salehi,Lyle Ungar,Konrad P. Kording
Affiliations: University of Pennsylvania; Machine Learning Group, Technical University of Berlin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Accepted as a Spotlight at NeurIPS 2025
Abstract:Object binding, the brain’s ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of “which parts belong together” emerges naturally in a connectionist system.
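A similarity probe of the kind described can be as simple as a learned bilinear score over a pair of patch embeddings, trained with binary cross-entropy against same-object labels. The sketch below uses random tensors in place of real ViT activations; the bilinear form is one plausible probe design, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class SimilarityProbe(nn.Module):
    """Score whether two ViT patch embeddings belong to the same object
    via a learned bilinear form sigmoid(e_i^T W e_j)."""
    def __init__(self, dim=768):
        super().__init__()
        self.W = nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))

    def forward(self, emb_i, emb_j):
        return torch.sigmoid((emb_i @ self.W * emb_j).sum(-1))

probe = SimilarityProbe()
patches = torch.randn(10, 768)            # stand-in for one layer's patch embeddings
i = torch.randint(0, 10, (32,))           # random patch pairs
j = torch.randint(0, 10, (32,))
labels = torch.randint(0, 2, (32,)).float()  # same-object labels (toy)
pred = probe(patches[i], patches[j])
loss = nn.functional.binary_cross_entropy(pred, labels)
print(f"probe BCE: {loss.item():.3f}")
```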
[CV-4] MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
【Quick Read】: This paper addresses the performance degradation of camera-based 3D object detection in infrastructure-based perception, caused by multi-view infrastructure setups, heterogeneous camera configurations, degraded visual inputs, and complex road layouts. The key to the solution is MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework whose graph-enhanced fusion module integrates multi-view image features into BEV space by exploiting the geometric relationships between cameras and BEV cells alongside latent visual cues, maintaining strong robustness under heterogeneous camera parameters and sensor degradation.
Link: https://arxiv.org/abs/2510.24688
Authors: Yun Zhang,Zhaoliang Zheng,Johnson Liu,Zhiyu Huang,Zewei Zhou,Zonglin Meng,Tianhui Cai,Jiaqi Ma
Affiliations: University of California, Los Angeles
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird’s-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: this https URL.
[CV-5] SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
【Quick Read】: This paper addresses structural inconsistency and visual incoherence in video transitions between clips separated by large temporal gaps or significant semantic differences, where traditional methods such as cross-fades, morphing, and frame interpolation fall short. The key to the solution is SAGE (Structure-Aware Generative vidEo transitions), a zero-shot framework that combines structural guidance, provided via line maps and motion flow, with generative synthesis, producing smooth, semantically consistent transitions without fine-tuning and improving perceptual continuity and visual plausibility across diverse clips.
Link: https://arxiv.org/abs/2510.24667
Authors: Mia Kan,Yilin Liu,Niloy Mitra
Affiliations: University College London; Autodesk Research; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL
Abstract:Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.
[CV-6] Group Relative Attention Guidance for Image Editing
【Quick Read】: This paper addresses the lack of effective control over editing strength in current Diffusion-in-Transformer (DiT)-based image editing, which hinders fine-grained, customized edits. The key to the solution is an analysis of the MM-Attention mechanism in DiT models revealing that Query and Key tokens share a bias vector that depends only on the layer; this bias can be interpreted as the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes content-specific editing signals. Building on this insight, the authors propose Group Relative Attention Guidance (GRAG), which reweights the deltas of different tokens to modulate how much the model attends to the input image relative to the editing instruction, enabling continuous, fine-grained control of editing intensity without any tuning.
Link: https://arxiv.org/abs/2510.24657
Authors: Xuanpu Zhang,Xuesong Niu,Ruidong Chen,Dan Song,Jianhao Zeng,Penghui Du,Haoxiang Cao,Kai Wu,An-an Liu
Affiliations: Tianjin University; Kolors Team, Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model’s inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at this https URL.
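The core manipulation can be sketched directly from the description: decompose each token into a shared layer bias plus a token-specific delta, then rescale the deltas of a chosen token group. In the sketch below the bias is approximated by the token mean and the group slice is arbitrary; both are stand-ins for quantities the paper extracts from the DiT itself, and the exact placement inside MM-Attention follows the paper rather than this toy.

```python
import torch

def grag_reweight(q_tokens, layer_bias, edit_scale=1.2, edit_slice=None):
    """Split each Query token into a shared layer bias plus a token-specific
    delta, then rescale the deltas of a chosen token group to modulate
    editing strength. q_tokens: (T, d); layer_bias: (1, d)."""
    delta = q_tokens - layer_bias            # content-specific editing signal
    scale = torch.ones(q_tokens.size(0), 1)
    if edit_slice is not None:
        scale[edit_slice] = edit_scale       # e.g. boost instruction tokens
    return layer_bias + scale * delta

q = torch.randn(6, 32)                       # 6 tokens in one attention layer
bias = q.mean(dim=0, keepdim=True)           # stand-in for the shared bias vector
q_edited = grag_reweight(q, bias, edit_scale=1.5, edit_slice=slice(4, 6))
print(q_edited.shape)
```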
[CV-7] Eye-Tracking, Mouse Tracking, Stimulus Tracking, and Decision-Making Datasets in Digital Pathology
【Quick Read】: This paper addresses pathologists' low diagnostic accuracy (around 70% on average) when interpreting giga-pixel whole-slide images (WSIs) and the lack of behavioral data to explain diagnostic errors and inconsistencies. The key to the solution is PathoGaze1.0, a comprehensive behavioral dataset capturing the full visual search and decision-making workflow of 19 pathologists diagnosing cancer on 397 WSIs: 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data, comprising 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events, collected with ecological validity via the application-grounded testbed PTAH. Such data provide a quantitative behavioral basis for improving the training of both pathologists and diagnostic AI systems that support human experts.
Link: https://arxiv.org/abs/2510.24653
Authors: Veronica Thai,Rui Li,Meng Ling,Shuning Jiang,Jeremy Wolfe,Raghu Machiraju,Yan Hu,Zaibo Li,Anil Parwani,Jian Chen
Affiliations: The Ohio State University; Harvard Medical School; Brigham and Women's Hospital; The Ohio State University Medical Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 9 figures, submitted to Nature Scientific Data
Abstract:Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at this https URL, and the complete dataset along with analysis code is available at this https URL.
[CV-8] A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries
【Quick Read】: This paper addresses the threats that rapidly advancing generative AI-driven face forgery poses to AI security, digital media integrity, and public trust. The key to the solution is a dual-branch convolutional neural network that extracts complementary features from the spatial and frequency domains: the RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that generative models struggle to suppress. A channel attention module adaptively fuses the heterogeneous features, highlighting the most discriminative channels, and a unified FSC loss, combining focal loss, supervised contrastive loss, and a frequency center margin loss, improves class separability and robustness. On the DiFF benchmark, the method performs strongly across four representative forgery types (text-to-image, image-to-image, face swap, and face edit) and exceeds average human accuracy, demonstrating its value for safeguarding AI ecosystems against visual forgery attacks.
Link: https://arxiv.org/abs/2510.24640
Authors: Xin Zhang,Yuqi Song,Fei Zuo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network’s learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model’s effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
zh
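A minimal sketch of the dual-branch idea described above: an RGB branch for semantics, a frequency branch fed with a high-frequency residual obtained by masking low FFT frequencies, and squeeze-and-excitation-style channel attention for fusion. Layer sizes, the high-pass cutoff, and the attention form are illustrative assumptions, and the FSC loss is omitted.

```python
import torch
import torch.nn as nn

class DualBranchDetector(nn.Module):
    """Illustrative dual-branch forgery detector (not the paper's exact net)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.rgb_branch = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.freq_branch = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        # Channel attention over the concatenated features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, 1),
            nn.Sigmoid(),
        )
        self.head = nn.Linear(2 * channels, 2)   # real vs. forged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # High-frequency residual: zero out the central (low) FFT frequencies.
        f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        h, w = x.shape[-2:]
        f[..., h // 2 - h // 8 : h // 2 + h // 8,
               w // 2 - w // 8 : w // 2 + w // 8] = 0
        hf = torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real

        feats = torch.cat([self.rgb_branch(x), self.freq_branch(hf)], dim=1)
        feats = feats * self.attn(feats)          # reweight informative channels
        return self.head(feats.mean(dim=(-2, -1)))
```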
[CV-9] GroundLoc: Efficient Large-Scale Outdoor LiDAR-Only Localization
【速读】:该论文旨在解决移动机器人在大规模室外环境中仅依赖LiDAR实现高精度定位的问题。现有方法在复杂场景下定位精度不足或对传感器类型敏感,难以兼顾鲁棒性与实时性。解决方案的关键在于提出GroundLoc,一个纯LiDAR的定位流水线:其通过将点云投影为鸟瞰图(Bird’s-Eye View, BEV)图像,聚焦于地面感知区域,并采用R2D2(一种place recognition网络)或SIFT(无学习的尺度不变特征变换)提取关键点进行BEV地图注册,从而实现高效、稳定的定位。该方法在SemanticKITTI和HeLiPR数据集上优于当前最优方法,在多会话定位评估中平均轨迹误差(ATE)低于50 cm,且满足在线运行要求,同时支持多种LiDAR传感器(如Ouster OS2 128、Velodyne HDL-64E等),并使用低存储开销(每平方公里仅需4 MB)的二维栅格地图作为先验地图。
链接: https://arxiv.org/abs/2510.24623
作者: Nicolai Steinke,Daniel Goehring
机构: Freie Universität Berlin (柏林自由大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this letter, we introduce GroundLoc, a LiDAR-only localization pipeline designed to localize a mobile robot in large-scale outdoor environments using prior maps. GroundLoc employs a Bird’s-Eye View (BEV) image projection focusing on the perceived ground area and utilizes the place recognition network R2D2, or alternatively, the non-learning approach Scale-Invariant Feature Transform (SIFT), to identify and select keypoints for BEV image map registration. Our results demonstrate that GroundLoc outperforms state-of-the-art methods on the SemanticKITTI and HeLiPR datasets across various sensors. In the multi-session localization evaluation, GroundLoc reaches an Average Trajectory Error (ATE) well below 50 cm on all Ouster OS2 128 sequences while meeting online runtime requirements. The system supports various sensor models, as evidenced by evaluations conducted with Velodyne HDL-64E, Ouster OS2 128, Aeva Aeries II, and Livox Avia sensors. The prior maps are stored as 2D raster image maps, which can be created from a single drive and require only 4 MB of storage per square kilometer. The source code is available at this https URL.
zh
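The BEV projection at the heart of GroundLoc can be sketched as follows; the crude height-threshold ground filter, grid resolution, and extent are assumptions for illustration (the paper focuses on the perceived ground area and feeds the resulting image to R2D2 or SIFT).

```python
import numpy as np

def ground_bev_image(points: np.ndarray, res: float = 0.1,
                     extent: float = 50.0, z_max: float = 0.5) -> np.ndarray:
    """Rasterize the (roughly) ground-level part of a LiDAR scan into a
    BEV intensity image. points: (N, 4) array of x, y, z, intensity."""
    ground = points[points[:, 2] < z_max]            # crude ground filter
    px = ((ground[:, 0] + extent) / res).astype(int)
    py = ((ground[:, 1] + extent) / res).astype(int)
    size = int(2 * extent / res)
    ok = (px >= 0) & (px < size) & (py >= 0) & (py < size)
    bev = np.zeros((size, size), dtype=np.float32)
    bev[py[ok], px[ok]] = ground[ok, 3]              # write point intensity
    return bev
```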
[CV-10] Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT
【速读】:该论文旨在解决锥束CT(Cone-beam CT, CBCT)在数据采集过程中因散射效应导致的图像伪影问题,具体表现为CT值偏差和组织对比度下降,进而影响诊断准确性。解决方案的关键在于提出一种基于深度学习的散射伪影校正方法,其核心创新是将物理先验知识(即投影域中点散射概率密度分布具有旋转对称性)融入网络结构中:通过高斯径向基函数(Gaussian Radial Basis Functions, RBF)建模点散射函数,并将其嵌入到Kolmogorov-Arnold Networks(KAN)层中,利用KAN层强大的非线性映射能力高效学习高维散射特征。该设计使模型能够更准确地表征散射分布特性,从而实现对重建图像中散射伪影的有效校正,且在定量指标上优于现有方法。
链接: https://arxiv.org/abs/2510.24579
作者: Xu Jiang,Huiying Pan,Ligen Shi,Jianing Sun,Wenfeng Xu,Xing Zhao
机构: Capital Normal University (首都师范大学); Nanjing University of Information Science and Technology (南京信息工程大学); Inner Mongolia University (内蒙古大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge, leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain. The method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into a Kolmogorov-Arnold Network (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to the current methods in terms of quantitative metrics.
zh
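A minimal sketch of the rotationally symmetric scatter model: the point scatter function is expressed as a learned sum of isotropic Gaussian radial basis functions of the radial distance. The embedding into a full KAN layer and the CBCT pipeline are omitted; basis count and initialization are assumptions.

```python
import torch
import torch.nn as nn

class GaussianRBFScatter(nn.Module):
    """Point scatter amplitude as a sum of isotropic Gaussians of the
    radial distance r, encoding the rotational-symmetry prior."""
    def __init__(self, n_basis: int = 16, r_max: float = 1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(0.0, r_max, n_basis))
        self.log_sigma = nn.Parameter(torch.zeros(n_basis))
        self.weights = nn.Parameter(0.01 * torch.randn(n_basis))

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: radial distances in the projection domain, any shape.
        d = r.unsqueeze(-1) - self.centers                   # (..., n_basis)
        phi = torch.exp(-0.5 * (d / self.log_sigma.exp()) ** 2)
        return phi @ self.weights                            # scatter at radius r
```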
[CV-11] OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
【速读】:该论文旨在解决当前多模态智能体(multimodal agents)评估中忽视工具调用能力(tool invocation)的问题,尤其是对基于模型上下文协议(Model Context Protocol, MCP)的工具使用能力缺乏系统性衡量。以往评测主要聚焦于图形用户界面(GUI)交互技能,导致评估结果无法全面反映智能体在真实复杂场景中的综合能力,造成不公平比较。解决方案的关键在于提出首个综合性且公平的基准测试平台 OSWorld-MCP,其通过自动化代码生成管道构建高质量工具集(共158个,覆盖7类常见应用),并结合人工严格验证确保工具的功能正确性、实用性和泛化能力;在此基础上,对先进多模态智能体进行系统评估,首次明确量化了MCP工具调用对任务成功率的提升作用(如OpenAI o3在15步内成功率从8.3%提升至20.4%),同时揭示当前最强模型工具调用率仅达36.3%,凸显该领域仍存在显著改进空间。
链接: https://arxiv.org/abs/2510.24563
作者: Hongrui Jia,Jitong Liao,Xi Zhang,Haiyang Xu,Tianbao Xie,Chaoya Jiang,Ming Yan,Si Liu,Wei Ye,Fei Huang
机构: Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团); Peking University (北京大学); Beijing Zhongguancun Academy (北京中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents’ tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting the benchmark’s challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at this https URL.
zh
[CV-12] Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在数据异质性环境下的客户端漂移(client drift)问题,即本地模型在局部训练过程中趋向于各自的数据分布最优解,导致聚合后的全局模型无法有效逼近整体最优,从而影响大多数客户端的性能。传统方法如FedAvg虽能提升平均本地性能,但忽视了对分布外样本(out-of-distribution)的泛化能力,削弱了模型鲁棒性。其解决方案的关键在于提出一种改进的联邦算法——个性化联邦学习(Federated Learning with Individualized Updates, FLIU),通过引入一个自适应个性化因子,在每次通信轮次中对本地更新进行个体化调整,从而在保持全局模型一致性的同时增强各客户端的本地适应能力与跨分布泛化性能。
链接: https://arxiv.org/abs/2510.24503
作者: Mortesa Hussaini,Jan Theiß,Anthony Stein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:
Abstract:In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients’ models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
zh
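The individualization step of FLIU can be sketched in a few lines: each client blends the aggregated global weights with its own local weights using an adaptive personalization factor. How the factor is adapted per client is not reproduced here and is left as an input.

```python
def fliu_update(local_state: dict, global_state: dict, alpha: float) -> dict:
    """Blend the aggregated global model with the client's local model.

    alpha in [0, 1] is the adaptive personalization factor; alpha = 0
    recovers plain FedAvg, alpha = 1 keeps the local model unchanged.
    Values are parameter tensors keyed by layer name (e.g. torch state dicts).
    """
    return {
        name: alpha * local_state[name] + (1.0 - alpha) * global_state[name]
        for name in global_state
    }
```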
[CV-13] Fast and accurate neural reflectance transformation imaging through knowledge distillation
【速读】:该论文旨在解决神经反射变换成像(NeuralRTI)在交互式光照渲染过程中计算成本过高、难以在有限硬件资源下实现全分辨率渲染的问题。其解决方案的关键在于引入知识蒸馏(Knowledge Distillation)机制,通过训练一个轻量级的学生网络来模仿原复杂神经网络的输出特性,从而在保持高质量渲染效果的同时显著降低计算开销,实现高效且可行的实时交互式光照分析。
链接: https://arxiv.org/abs/2510.24486
作者: Tinsae G. Dulecha,Leonardo Righetto,Ruggero Pintus,Enrico Gobbetti,Andrea Giachetti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 18 pages
Abstract:Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). …
zh
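A minimal sketch of the distillation objective, assuming both teacher and student decoders map per-pixel coefficient codes and a light direction to RGB (these interfaces are assumptions; DisK-NeuralRTI's specific architecture and training schedule are not reproduced here).

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, codes, light_dirs):
    """One distillation step: the lightweight student decoder is trained to
    reproduce the relit pixels of the frozen teacher decoder for sampled
    light directions, instead of fitting the raw RTI data directly."""
    with torch.no_grad():
        target = teacher(codes, light_dirs)   # teacher's relighting output
    pred = student(codes, light_dirs)
    return F.mse_loss(pred, target)           # match the teacher's renderings
```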
[CV-14] Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
【速读】:该论文旨在解决生成式模型(如扩散模型和基于流的模型)在采样过程中因离散化误差导致需要大量去噪步骤的问题,从而限制了生成效率。其核心解决方案是提出一种无需架构修改的解码策略——Decoupled MeanFlow,该方法通过在扩散Transformer的最后几层中引入对下一时间步的条件输入,将预训练的流模型直接转换为流图(flow map)模型,从而显著提升采样速度。关键创新在于:利用预训练流模型进行高效转换,而非从头训练流图模型,结合增强训练技术后可在1至4步内实现高质量图像生成,FID指标大幅优于现有方法,同时推理速度提升超过100倍。
链接: https://arxiv.org/abs/2510.24474
作者: Kyungmin Lee,Sihyun Yu,Jinwoo Shin
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.
zh
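The flow-map sampler implied by the abstract can be sketched as follows, assuming `model(x, t, r)` returns the average velocity between times t and r (in Decoupled MeanFlow this r-conditioning enters only the final transformer blocks). Times run from 1 (noise) to 0 (data); the uniform schedule is an assumption.

```python
import torch

@torch.no_grad()
def flow_map_sample(model, x: torch.Tensor, steps: int = 4) -> torch.Tensor:
    """Few-step sampling with a flow map: each step jumps along the
    predicted average velocity instead of taking many small ODE steps."""
    times = torch.linspace(1.0, 0.0, steps + 1)
    for t, r in zip(times[:-1], times[1:]):
        b = x.shape[0]
        v = model(x, t.expand(b), r.expand(b))  # average velocity over [t, r]
        x = x + (r - t) * v                     # one jump from time t to r
    return x
```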
[CV-15] Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
【速读】:该论文旨在解决标记点驱动的多视角运动捕捉(markerless multiview motion capture)中对精确相机标定(camera calibration)的依赖问题,从而提升非专家用户和野外场景下的可用性。现有免标定方法虽降低了标定门槛,但存在计算开销高与重建精度不足的缺陷。其核心解决方案是提出Kineo——一个全自动、免标定的运动捕捉流水线,利用现成的2D关键点检测器同步完成相机标定(含Brown-Conrady畸变参数估计)、3D关键点重建及稠密场景点云生成,且保持度量尺度一致性。关键技术在于引入置信度驱动的时空关键点采样策略与基于图的全局优化机制,使标定过程在固定计算成本下实现鲁棒性,同时通过成对重投影一致性评分量化3D重建可靠性,显著优于此前最先进方法,在EgoHumans和Human3.6M数据集上将相机平移误差降低约83–85%,角误差降低86–92%,世界空间平均关节误差(W-MPJPE)降低83–91%。
链接: https://arxiv.org/abs/2510.24464
作者: Charles Javerliat,Pierre Raimbaud,Guillaume Lavoué
机构: École Centrale de Lyon (里昂中央理工学院); CNRS (法国国家科学研究中心); INSA Lyon (里昂国立应用科学学院); Université Claude Bernard Lyon 1 (克莱蒙·贝尔纳里昂第一大学); Université Lumière Lyon 2 (里昂第二大学); LIRIS, UMR5205 (信息、计算机与应用数学联合实验室); ENISE (圣艾蒂安国立工程师学校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at this https URL.
zh
[CV-16] A Critical Study towards the Detection of Parkinsons Disease using ML Technologies
【速读】:该论文旨在解决茶树叶部病害的自动识别与病害区域定位问题,具体针对由虫害引起的红锈病(Red Rust)、赫洛佩尔蒂斯害虫(Helopeltis)以及红蜘蛛螨(Red Spider Mite)三种疾病进行分类和病斑范围检测。解决方案的关键在于采用深度学习技术,分别使用SSD MobileNet V2和Faster R-CNN ResNet50 V1模型进行目标检测,并通过改进的Mask R-CNN方法实现病叶的实例分割,从而精确量化病害在叶片上的损伤面积,为茶园精准管理提供技术支持。
链接: https://arxiv.org/abs/2510.24456
作者: Vivek Chetia,Abdul Taher Khan,Rahish Gogoi,David Kapsian Khual,Purnendu Bikash,Sajal Saha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The proposed solution is a deep learning technique that classifies three types of tea leaf diseases, two caused by pests and one by pathogens (infectious organisms) and environmental conditions, and also localizes the leaf area damaged by each disease. These are Red Rust, Helopeltis, and Red Spider Mite, respectively. In this paper we evaluated two models for object detection, namely SSD MobileNet V2 and Faster R-CNN ResNet50 V1. SSD MobileNet V2 achieved a precision of 0.209 and a recall of 0.02 over the IoU range 0.50:0.95, with a final mAP of 20.9%. Faster R-CNN ResNet50 V1 achieved a precision of 0.252 and a recall of 0.044 over the same IoU range, with a mAP of 25%, outperforming SSD. We also used Mask R-CNN for object instance segmentation, implementing a custom method to calculate the diseased portion of the leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask R-CNN.
zh
[CV-17] Rethinking Visual Intelligence: Insights from Video Pretraining DATE
【速读】:该论文试图解决当前大语言模型(Large Language Models, LLMs)在视觉领域中表现不佳的问题,特别是其在组合理解、样本效率和通用问题求解能力方面的局限性。解决方案的关键在于利用视频扩散模型(Video Diffusion Models, VDMs)进行时空数据预训练,从而引入对结构与动态的强归纳偏置(inductive biases),以提升模型在视觉任务上的适应性和数据效率。实验表明,相较于语言模型,VDMs 在多个视觉基准测试中展现出更高的样本效率,验证了视频预训练在构建视觉基础模型(visual foundation models)中的潜力。
链接: https://arxiv.org/abs/2510.24448
作者: Pablo Acuaviva,Aram Davtyan,Mariam Hassan,Sebastian Stapf,Ahmad Rahimi,Alexandre Alahi,Paolo Favaro
机构: University of Bern (伯尔尼大学); EPFL (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2
Abstract:Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
zh
[CV-18] Deeply-Conditioned Image Compression via Self-Generated Priors
【速读】:该论文旨在解决当前学习型图像压缩(Learned Image Compression, LIC)方法在建模自然图像中复杂相关结构方面的局限性,尤其是难以分离图像中不变的全局结构与瞬态局部纹理之间的纠缠关系,导致低比特率下出现严重的几何失真问题。其解决方案的关键在于提出一种基于功能分解的框架——通过自生成先验(self-generated prior)对图像结构主干进行编码,并将该先验深度嵌入整个压缩流程,尤其用于调节分析变换(analysis transform),从而使其专注于捕获残差高熵细节,实现信息流的有效解耦。这一依赖关系驱动的层次化设计显著提升了压缩性能,在多个数据集上相较VVC标准实现了14.4%–15.1%的BD-rate降低。
链接: https://arxiv.org/abs/2510.24437
作者: Zhineng Zhao,Zhihai He,Zikun Zhou,Siwei Ma,Yaowei Wang
机构: Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learned image compression (LIC) has shown great promise for achieving high rate-distortion performance. However, current LIC methods are often limited in their capability to model the complex correlation structures inherent in natural images, particularly the entanglement of invariant global structures with transient local textures within a single monolithic representation. This limitation precipitates severe geometric deformation at low bitrates. To address this, we introduce a framework predicated on functional decomposition, which we term Deeply-Conditioned Image Compression via self-generated priors (DCIC-sgp). Our central idea is to first encode a potent, self-generated prior to encapsulate the image’s structural backbone. This prior is subsequently utilized not as mere side-information, but to holistically modulate the entire compression pipeline. This deep conditioning, most critically of the analysis transform, liberates it to dedicate its representational capacity to the residual, high-entropy details. This hierarchical, dependency-driven approach achieves an effective disentanglement of information streams. Our extensive experiments validate this assertion; visual analysis demonstrates that our method substantially mitigates the geometric deformation artifacts that plague conventional codecs at low bitrates. Quantitatively, our framework establishes highly competitive performance, achieving significant BD-rate reductions of 14.4%, 15.7%, and 15.1% against the VVC test model VTM-12.1 on the Kodak, CLIC, and Tecnick datasets.
zh
[CV-19] XAI Evaluation Framework for Semantic Segmentation
【速读】:该论文旨在解决生成式 AI(Generative AI)在语义分割任务中可解释性不足的问题,特别是在安全关键和高风险场景下缺乏透明度与可信度的挑战。其解决方案的关键在于提出了一套系统化且全面的评估框架,专门用于衡量解释型人工智能(Explainable AI, XAI)在语义分割中的表现,该框架通过像素级评估策略和精心设计的指标,显式考虑了空间复杂性和上下文复杂性,从而提供细粒度的可解释性洞察。实验结果表明,基于类激活映射(Class Activation Mapping, CAM)的XAI方法在该框架下展现出高效性、鲁棒性和可靠性,为构建透明、可信且可问责的语义分割模型提供了坚实支撑。
链接: https://arxiv.org/abs/2510.24414
作者: Reem Hammoud,Abdul karim Gizzini,Ali J. Ghandour
机构: American University of Beirut (贝鲁特美国大学); Université Paris Est Créteil (巴黎东部克雷泰伊大学); National Center for Remote Sensing (国家遥感中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Ensuring transparency and trust in artificial intelligence (AI) models is essential, particularly as they are increasingly applied in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge, yet the rigorous evaluation of XAI methods remains crucial for optimizing the trade-offs between model complexity, predictive performance, and interpretability. While extensive progress has been achieved in evaluating XAI techniques for classification tasks, evaluation strategies tailored to semantic segmentation remain relatively underexplored. This work introduces a comprehensive and systematic evaluation framework specifically designed for assessing XAI in semantic segmentation, explicitly accounting for both spatial and contextual task complexities. The framework employs pixel-level evaluation strategies and carefully designed metrics to provide fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings contribute to advancing transparent, trustworthy, and accountable semantic segmentation models.
zh
[CV-20] 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir Lebanon
【速读】:该论文旨在解决黎巴嫩最大地表水体——卡拉翁水库(Qaraaoun Reservoir)在传感器频繁故障和维护能力有限的情况下,如何实现可靠、持续的储水量监测问题。解决方案的关键在于提出了一种无需传感器的监测方法,通过融合开源卫星影像(Sentinel-2 和 Landsat)、改进的水体范围分割技术以及支持向量回归(Support Vector Regression, SVR)机器学习模型,实现了基于遥感提取的水面面积对水库体积的高精度估算。该方法利用新提出的水体分割指数进行水面边界识别,准确率超过95%,并通过SVR模型训练建立水面面积与储水量之间的非线性关系,最终实现误差低于全库容1.5%、决定系数高于0.98的近实时估算效果,具备良好的鲁棒性和可复制性。
链接: https://arxiv.org/abs/2510.24413
作者: Ali Ahmad Faour,Nabil Amacha,Ali J. Ghandour
机构: American University of Beirut (贝鲁特美国大学); Lebanese University (黎巴嫩大学); National Center for Remote Sensing, CNRS-L (国家遥感中心,CNRS-L)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir surface area and volume in near real time. Sentinel-2 and Landsat images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume calculations using a reservoir bathymetry survey. The model is then able to estimate reservoir volume relying solely on surface area extracted from satellite imagery, without the need for ground measurements. Water segmentation using the proposed index aligns with ground truth for more than 95 percent of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance with error under 1.5 percent of full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the robustness and cost-effectiveness of the method, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology can be replicated for other water bodies, and the resulting 50 years of time-series data is valuable for research on climate change and environmental patterns.
zh
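The area-to-volume regression with GridSearchCV mentioned above can be sketched as below; the sample values and parameter grid are placeholders, while the paper fits volumes derived from a bathymetry survey.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Illustrative data: satellite-derived surface areas (km^2) and
# bathymetry-derived storage volumes (million m^3).
areas = np.array([[2.1], [4.3], [6.8], [8.9], [10.5]])
volumes = np.array([30.0, 80.0, 130.0, 180.0, 210.0])

grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    param_grid={"svr__C": [1, 10, 100], "svr__epsilon": [0.01, 0.1, 1.0]},
    cv=3,
)
grid.fit(areas, volumes)
print(grid.best_params_, grid.predict([[7.5]]))  # volume for a new area
```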
[CV-21] A Hybrid Approach for Visual Multi-Object Tracking
【速读】:该论文旨在解决复杂场景下目标数量未知且时变、动态系统非线性以及噪声非高斯条件下的多目标跟踪问题,核心挑战在于保持目标标识一致性(identifier consistency)的同时实现鲁棒的状态估计。解决方案的关键在于融合随机机制与确定性机制:首先,采用基于粒子滤波(particle filter)的随机框架处理非线性动力学和非高斯噪声,并引入粒子群优化(PSO)引导粒子聚集于状态分布模态,通过融合运动一致性、外观相似性和邻近目标社交交互线索的适应度函数防止粒子发散;其次,设计一个确定性关联策略,利用包含空间一致性、检测置信度和轨迹惩罚项的成本矩阵强化标识一致性;此外,提出一种平滑更新机制以在目标相互遮挡或弱跟踪状态下维持身份不变性,并借助历史状态的velocity regression生成趋势种子速度(trend-seed velocities),提升粒子采样精度与状态更新效果。
链接: https://arxiv.org/abs/2510.24410
作者: Toan Van Nguyen,Rasmus G. K. Christiansen,Dirk Kraft,Leon Bodenhagen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This work has been submitted to the IEEE for possible publication
Abstract:This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: this https URL
zh
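The PSO guidance used to pull particles toward modes of the state distribution follows the standard swarm update; a minimal sketch is shown below, with the fitness evaluation (motion consistency, appearance, social cues) assumed to happen outside this step and the inertia/acceleration constants chosen illustratively.

```python
import numpy as np

def pso_step(particles, velocities, personal_best, global_best,
             w: float = 0.7, c1: float = 1.5, c2: float = 1.5):
    """One swarm update pulling particle-filter samples toward modes of the
    target state distribution. particles, velocities, personal_best are
    (n_particles, state_dim); global_best is (state_dim,)."""
    r1 = np.random.rand(*particles.shape)
    r2 = np.random.rand(*particles.shape)
    velocities = (w * velocities
                  + c1 * r1 * (personal_best - particles)   # cognitive pull
                  + c2 * r2 * (global_best - particles))    # social pull
    return particles + velocities, velocities
```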
[CV-22] GenTrack: A New Generation of Multi-Object Tracking
【速读】:该论文旨在解决多目标跟踪(Multi-Object Tracking, MOT)中目标数量未知且时变、目标身份(ID)一致性难以维持以及在遮挡和弱检测条件下易发生ID切换与轨迹丢失的问题。其解决方案的关键在于提出一种混合跟踪方法GenTrack,融合随机(stochastic)与确定性(deterministic)策略:利用粒子群优化(Particle Swarm Optimization, PSO)结合改进的适应度函数引导粒子逼近目标分布模式,从而提升在弱检测噪声下的鲁棒性;同时引入目标间的社交交互机制增强PSO粒子的引导效果,并实现强匹配与弱未匹配轨迹的连续更新,有效减少遮挡场景中的ID切换和轨迹中断;此外,构建基于空间一致性、外观特征、检测置信度、轨迹惩罚项及社交评分的综合状态与观测模型,形成可系统高效更新的目标跟踪基准,显著优于现有主流方法。
链接: https://arxiv.org/abs/2510.24399
作者: Toan Van Nguyen,Rasmus G. K. Christiansen,Dirk Kraft,Leon Bodenhagen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This work has been submitted to the IEEE for possible publication
Abstract:This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: (i) a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics; (ii) leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors; (iii) integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions; (iv) a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates; and (v) the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared trackers are provided on GitHub: this https URL
zh
[CV-23] Unsupervised Detection of Post-Stroke Brain Abnormalities
【速读】:该论文旨在解决现有监督分割方法难以有效捕捉卒中后MRI中非局灶性结构异常(如萎缩和脑室扩大)的问题,这些问题已被视为预后和恢复的影像生物标志物。解决方案的关键在于采用基于流的生成模型REFLECT进行无监督异常检测,通过在健康对照组(IXI数据集)上训练模型,提升了对正常解剖变异的建模能力,从而更广泛且可靠地识别出包括局灶性和非局灶性在内的结构性异常,在ATLAS测试集上表现出更高的病灶分割Dice系数(0.37 vs 0.27)和对非局灶异常的敏感性(FROC = 0.62 vs 0.43)。
链接: https://arxiv.org/abs/2510.24398
作者: Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Post-stroke MRI not only delineates focal lesions but also reveals secondary structural changes, such as atrophy and ventricular enlargement. These abnormalities, increasingly recognised as imaging biomarkers of recovery and outcome, remain poorly captured by supervised segmentation methods. We evaluate REFLECT, a flow-based generative model, for unsupervised detection of both focal and non-lesional abnormalities in post-stroke patients. Using dual-expert central-slice annotations on ATLAS data, performance was assessed at the object level with Free-Response ROC analysis for anomaly maps. Two models were trained on lesion-free slices from stroke patients (ATLAS) and on healthy controls (IXI) to test the effect of training data. On ATLAS test subjects, the IXI-trained model achieved higher lesion segmentation (Dice = 0.37 vs 0.27) and improved sensitivity to non-lesional abnormalities (FROC = 0.62 vs 0.43). Training on fully healthy anatomy improves the modelling of normal variability, enabling broader and more reliable detection of structural abnormalities.
zh
[CV-24] When are radiology reports useful for training medical image classifiers?
【速读】:该论文试图解决的问题是如何在医学图像分类任务中有效利用放射科报告(radiology reports)这一富含专家标注的文本数据,以提升仅使用图像进行分类的模型性能。传统方法通常依赖人工标注或从报告中提取诊断标签进行微调,但忽略了报告文本与标签之间弱关联的任务场景。其解决方案的关键在于系统性地研究在预训练和微调两个阶段如何引入报告文本信息:发现当标签在报告中表达充分时,通过显式图像-文本对齐进行预训练有益;而在标签与文本关联较弱时,这种对齐反而有害;同时,微调阶段引入报告文本可带来显著性能提升,甚至在某些场景下优于预训练策略。该研究为如何选择性地利用文本辅助训练医学图像分类器提供了实证依据与实践指导。
链接: https://arxiv.org/abs/2510.24385
作者: Herman Bergström,Zhongqi Yue,Fredrik D. Johansson
机构: Chalmers University of Technology (查尔姆斯理工大学); University of Gothenburg (哥德堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it’s not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
zh
[CV-25] A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset
【速读】:该论文旨在解决复杂光照环境下偏振图像融合中互补信息整合困难的问题,特别是由于偏振图像固有的对比度差异导致的特征融合不充分问题。解决方案的关键在于提出一种亮度感知的多尺度网络(Luminance-aware Multi-scale Network, MLSN),其核心创新包括:在编码阶段引入亮度分支构建多尺度空间权重矩阵,动态地将亮度信息注入特征图以缓解对比度差异;在瓶颈层设计全局-局部特征融合机制,通过窗口自注意力计算与残差连接重构特征维度,平衡全局上下文与局部细节;在解码阶段提出亮度增强模块,建立亮度分布与纹理特征之间的非线性映射关系,实现融合结果的非线性亮度校正,从而提升在复杂照明场景下的适应性和融合质量。
链接: https://arxiv.org/abs/2510.24379
作者: Zhuangfan Huang,Xiaosong Li,Gao Wang,Tao Ye,Haishu Tan,Huafeng Li
机构: Foshan University (佛山大学); North University of China (中北大学); China University of Mining and Technology (北京) (中国矿业大学(北京)); Kunming University of Science and Technology (昆明理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, with important applications in camouflage recognition, tissue pathology analysis, surface defect detection, and other fields. To integrate complementary information from different polarized images in complex luminance environments, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we compute a multi-scale spatial weight matrix through a brightness branch, which dynamically injects weighted luminance into the feature maps, addressing the inherent contrast difference in polarized images. A global-local feature fusion mechanism at the bottleneck layer performs windowed self-attention computation, balancing global context and local details through residual linking in the feature-dimension restructuring stage. In the decoder stage, to further improve adaptability to complex lighting, we propose a Brightness-Enhancement module that establishes a mapping between luminance distribution and texture features, realizing nonlinear luminance correction of the fusion result. We also present MSP, a dataset of 1000 pairs of polarized images covering 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, alleviating the scarcity of high-quality datasets in polarization image fusion. Extensive experiments on the MSP, PIF, and GAND datasets verify that the proposed MLSN outperforms state-of-the-art methods in subjective and objective evaluations, with MS-SSIM and SD metrics higher than the averages of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset are available at this https URL.
zh
[CV-26] Stroke Lesion Segmentation in Clinical Workflows: A Modular Lightweight and Deployment-Ready Tool
【速读】:该论文旨在解决深度学习框架(如nnU-Net)在临床部署中因依赖复杂和架构紧耦合而导致的实用性问题。其关键解决方案是提出一个模块化且轻量化的框架StrokeSeg,将预处理、推理和后处理解耦:预处理基于Anima工具箱并输出符合BIDS标准的数据,推理采用ONNX Runtime结合Float16量化技术,使模型体积减少约50%,同时提供图形界面和命令行接口,并以Python脚本和独立Windows可执行文件形式分发,从而实现高性能研究模型向临床可用工具的有效转化。
链接: https://arxiv.org/abs/2510.24378
作者: Yann Kerverdo,Florent Leray,Youwan Mahé,Stéphanie Leplaideur,Francesca Galassi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically due to heavy dependencies and monolithic design. We introduce StrokeSeg, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications. Preprocessing, inference, and postprocessing are decoupled: preprocessing relies on the Anima toolbox with BIDS-compliant outputs, and inference uses ONNX Runtime with Float16 quantisation, reducing model size by about 50%. StrokeSeg provides both graphical and command-line interfaces and is distributed as Python scripts and as a standalone Windows executable. On a held-out set of 300 sub-acute and chronic stroke subjects, segmentation performance was equivalent to the original PyTorch pipeline (Dice difference < 10^-3), demonstrating that high-performing research pipelines can be transformed into portable, clinically usable tools.
zh
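The decoupled inference stage maps naturally onto ONNX Runtime; a minimal sketch follows, where the model filename, input layout, and Float16 input dtype are assumptions for illustration.

```python
import numpy as np
import onnxruntime as ort

# Load a Float16-quantised segmentation model and run one volume.
# "strokeseg_fp16.onnx" and the (1, 1, 128, 128, 128) layout are placeholders.
session = ort.InferenceSession("strokeseg_fp16.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

volume = np.random.rand(1, 1, 128, 128, 128).astype(np.float16)  # placeholder MRI
(logits,) = session.run(None, {input_name: volume})
lesion_mask = logits.argmax(axis=1)   # per-voxel class labels
```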
[CV-27] Decoupling What to Count and Where to See for Referring Expression Counting
【速读】:该论文旨在解决参考表达计数(Referring Expression Counting, REC)任务中模型过度依赖类别级特征而忽略属性信息的问题,即标注点通常位于类别代表性位置(如头部),导致模型难以利用其他视觉区域(如腿部)的属性特征进行细粒度子类区分。解决方案的关键在于提出W2-Net框架,通过双查询机制显式解耦“计数什么”(what-to-count, w2c)与“看哪里”(where-to-see, w2s)两个子问题:w2c查询用于定位目标对象,而w2s查询则专门引导至属性特定的视觉区域以提取关键特征,从而实现更精确的子类判别;此外,引入子类可分离匹配(Subclass Separable Matching, SSM)策略,在标签分配阶段引入排斥力以增强不同子类间的可分性,显著提升了计数精度和定位性能。
链接: https://arxiv.org/abs/2510.24374
作者: Yuda Zou,Zijian Zhang,Yongchao Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for “walking”). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into “what to count” and “where to see” via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
zh
[CV-28] Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation
【速读】:该论文旨在解决教师-学生(Teacher-Student)框架在半监督医学图像分割中因教师与学生网络之间强相关性及不可靠的知识传递过程而导致的学习效果受限问题。其解决方案的关键在于提出一种新颖的切换双学生(Switching Dual-Student)架构,通过在每轮迭代中动态选择最可靠的学生产生伪标签,从而增强双学生协作并防止错误传播;同时引入基于损失感知的指数移动平均(Loss-Aware Exponential Moving Average)策略,动态优化教师模型对学生信息的吸收能力,提升伪标签质量,最终显著提高有限标注数据下的分割精度。
链接: https://arxiv.org/abs/2510.24366
作者: Thanh-Huy Nguyen,Hoang-Thien Nguyen,Ba-Thinh Lam,Vi Vu,Bach X. Nguyen,Jianhua Xing,Tianyang Wang,Xingjian Li,Min Xu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper is under review at Pattern Recognition Journal
Abstract:Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.
zh
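A minimal sketch of a Loss-Aware EMA update: the teacher absorbs more from the selected student when the student's loss is low and less when it is high. The exact schedule below is an assumption; the paper additionally switches to the more reliable of the two students before this update.

```python
import math
import torch

@torch.no_grad()
def loss_aware_ema(teacher, student, loss: float,
                   base_rate: float = 0.01, tau: float = 1.0):
    """Update the teacher from the selected student. The absorption rate
    decays exponentially with the student's loss, so unreliable students
    contribute almost nothing to the teacher."""
    absorb = base_rate * math.exp(-loss / tau)   # in (0, base_rate]
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(1.0 - absorb).add_(s_p, alpha=absorb)
```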
[CV-29] NVSim: Novel View Synthesis Simulator for Large Scale Indoor Navigation
【速读】:该论文旨在解决传统3D扫描方法在构建大规模室内可导航模拟器时存在的成本高和扩展性差的问题。其关键解决方案是提出NVSim框架,该框架仅需普通图像序列即可自动构建大规模、可导航的室内仿真环境;核心创新包括:1)改进的3D高斯点绘(3D Gaussian Splatting)方法——Floor-Aware Gaussian Splatting,用于消除稀疏观测区域的视觉伪影并确保地面平面清晰可通行;2)一种无需显式网格表示的可通行性检查算法,通过直接分析渲染视图构建拓扑导航图,从而实现高效的大规模场景建模与导航路径生成。
链接: https://arxiv.org/abs/2510.24335
作者: Mingyu Jeong,Eunsung Kim,Sehun Park,Andrew Jaeyong Choi
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 10 figures
Abstract:We present NVSim, a framework that automatically constructs large-scale, navigable indoor simulators from only common image sequences, overcoming the cost and scalability limitations of traditional 3D scanning. Our approach adapts 3D Gaussian Splatting to address visual artifacts on sparsely observed floors, a common issue in robotic traversal data. We introduce Floor-Aware Gaussian Splatting to ensure a clean, navigable ground plane, and a novel mesh-free traversability checking algorithm that constructs a topological graph by directly analyzing rendered views. We demonstrate our system’s ability to generate valid, large-scale navigation graphs from real-world data. A video demonstration is available at this https URL
zh
[CV-30] Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
【速读】:该论文旨在解决当前手术场景理解中因过度依赖视觉数据或端到端学习方法而导致的细粒度上下文建模能力不足的问题。其核心解决方案在于引入三维声学信息,构建一种基于相控阵麦克风和RGB-D相机的4D音视频融合框架,通过将声源定位信息投影到动态点云上,实现对术中工具-组织交互事件的时空联合建模。关键创新点在于利用Transformer架构进行声学事件检测,并在多模态空间中精确关联音频与视觉元素,从而提升手术环境的动态、立体化表征能力。
链接: https://arxiv.org/abs/2510.24332
作者: Jonas Hein,Lazaros Vlachopoulos,Maurits Geert Laurent Olthof,Bastian Sigrist,Philipp Fürnstahl,Matthias Seibold
机构: 未知
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注:
Abstract:Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
zh
[CV-31] What do vision-language models see in the context? Investigating multimodal in-context learning
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在上下文学习(In-context Learning, ICL)能力上的不足问题,尤其是其在多模态示例学习中的表现尚未被充分探索。解决方案的关键在于通过系统性实验评估七种不同架构的VLMs在三个图像描述任务上的ICL性能,深入分析提示设计、模型结构与训练策略对多模态ICL的影响,并首次揭示了随着演示样本数量增加,VLM中注意力机制的变化模式。研究发现,基于图文交错数据的训练虽提升ICL效果,但并未有效整合演示中的视觉与文本信息;而指令微调虽增强指令遵循能力,却可能削弱对上下文示例的依赖,表明存在指令对齐与上下文适应之间的权衡。此外,注意力分析显示当前VLM主要关注文本线索,难以利用视觉信息,暴露出其在多模态融合方面的局限性。
链接: https://arxiv.org/abs/2510.24331
作者: Gabriel O. dos Santos,Esther Colombini,Sandra Avila
机构: Instituto de Computação, Universidade Estadual de Campinas (UNICAMP) (计算研究所,坎皮纳斯州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
zh
[CV-32] Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
【速读】:该论文旨在解决遥感图像场景分类中因标注数据稀缺和跨地理/传感器域标注成本高而导致的深度学习模型性能受限问题,同时克服现有视觉-语言模型(如CLIP)在遥感领域直接应用时因域差距显著且缺乏任务特定语义适配而表现不佳的局限。其解决方案的关键在于系统性地探索提示学习(prompt learning)作为一种轻量且高效的少样本适应策略,通过多种设计哲学实现对CLIP等预训练模型的精细化调优:包括静态上下文优化、条件提示以增强泛化能力、多模态提示实现视觉与语言模态联合适配,以及引入语义正则化约束的自调节提示机制,从而在不大量依赖标注数据的前提下显著提升跨域鲁棒性和分类精度。
链接: https://arxiv.org/abs/2510.24321
作者: Ivica Dimitrovski,Vlatko Spasev,Ivan Kitanovski
机构: University Ss Cyril and Methodius (圣西里尔和美多德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
zh
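For reference, Context Optimization (the first of the evaluated prompt-learning methods) reduces to learning a few shared context vectors that are prepended to each class-name embedding while CLIP stays frozen. The sketch below uses placeholder embeddings instead of the real CLIP text pipeline, which is an assumption for illustration.

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """CoOp-style prompt learning: a few shared, learnable context vectors
    prepended to each frozen class-name embedding; only self.ctx is trained
    on the few-shot set, the CLIP encoders stay frozen."""
    def __init__(self, class_embeddings: torch.Tensor, n_ctx: int = 4):
        super().__init__()
        dim = class_embeddings.shape[-1]
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))
        self.register_buffer("cls", class_embeddings)   # (n_classes, n_tok, dim)

    def forward(self) -> torch.Tensor:
        n_cls = self.cls.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # (n_classes, n_ctx + n_tok, dim), fed to the frozen text encoder.
        return torch.cat([ctx, self.cls], dim=1)
```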
[CV-33] raining-free Source Attribution of AI-generated Images via Resynthesis
【速读】:该论文旨在解决合成图像源归属(synthetic image source attribution)任务在数据稀缺条件下难以准确分类的问题,尤其针对少样本(few-shot)或零样本(zero-shot)场景下的模型性能瓶颈。其解决方案的关键在于提出一种无需训练的单样本(one-shot)归属方法,通过生成描述待分析图像的提示(prompt),利用该提示对所有候选生成模型进行图像重合成(image resynthesis),并基于特征空间中与原始图像最接近的重合成结果来判定图像来源。该方法突破了传统依赖大量标注样本的局限,显著提升了在极小样本条件下的归属准确性,并借助新构建的包含商业与开源文本到图像生成器人脸图像的数据集验证了其有效性。
链接: https://arxiv.org/abs/2510.24278
作者: Pietro Bongini,Valentina Molinari,Andrea Costanzo,Benedetta Tondi,Mauro Barni
机构: University of Siena (锡耶纳大学); IMT School (IMT学校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures, 1 table, accepted at “The 17th IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY (WIFS2025)”, Perth, Australia
Abstract:Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated and then used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows testing approaches based on resynthesis and comparing them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.
zh
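The attribution step itself is simple once the resyntheses exist: embed the image under analysis and each candidate's resynthesis, then pick the closest in feature space. Cosine similarity is used below as an assumption about the metric; the choice of feature extractor is left open.

```python
import torch
import torch.nn.functional as F

def attribute_by_resynthesis(image_feat: torch.Tensor,
                             resynth_feats: torch.Tensor) -> int:
    """image_feat: (dim,) embedding of the image under analysis;
    resynth_feats: (n_candidates, dim), one embedding per candidate
    generator's resynthesis. Returns the index of the attributed model."""
    sims = F.cosine_similarity(image_feat.unsqueeze(0), resynth_feats, dim=1)
    return int(torch.argmax(sims))   # closest resynthesis wins
```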
[CV-34] UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation NEURIPS2025
【速读】:该论文旨在解决当前生成式数据增强方法过于关注视觉质量(如保真度和多样性)而忽视下游任务需求的问题,即现有方法未能根据具体任务特性优化合成数据的实用性。其解决方案的关键在于提出UtilGen框架,该框架通过引入一个权重分配网络来评估每条合成样本的任务相关效用,并采用双层优化策略:模型层优化调整生成模型以适配下游任务,实例层优化则在每次生成过程中动态调整提示嵌入(prompt embeddings)和初始噪声等生成策略,从而实现基于任务反馈的自适应数据生成,最终提升训练数据的实用性和整体性能。
链接: https://arxiv.org/abs/2510.24262
作者: Jiyu Guo,Shuo Yang,Yiming Huang,Yancheng Long,Xiaobo Xia,Xiu Su,Bo Zhao,Zeke Xie,Liqiang Nie
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); National University of Singapore (新加坡国立大学); University of Science and Technology of China (中国科学技术大学); Central South University (中南大学); Shanghai Jiao Tong University (上海交通大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes – such as fidelity and diversity – to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies – such as prompt embeddings and initial noise – at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
[CV-35] DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation NEURIPS2025
【Quick Read】: This paper addresses the poor real-world generalization of robotic manipulation policies, which stems from the scarcity of diverse real training data. Its key innovation, the DynaRend representation-learning framework, learns triplane features via masked reconstruction and future prediction with differentiable volumetric rendering, jointly capturing spatial geometry, future dynamics, and task semantics in a unified representation. Pretrained on multi-view RGB-D video, the method substantially improves policy success rates, generalization to environmental perturbations, and real-world applicability on downstream manipulation tasks.
Link: https://arxiv.org/abs/2510.24261
Authors: Jingyi Tian, Le Wang, Sanping Zhou, Sen Wang, Jiayi Li, Gang Hua
Affiliations: Xi'an Jiaotong University; Amazon
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025
Abstract:Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
[CV-36] DeshadowMamba: Deshadowing as 1D Sequential Similarity
【Quick Read】: This paper targets a weakness of existing attention-based shadow removal methods: fixed attention patterns mix in illumination cues from irrelevant regions, causing structural distortion and inconsistent colors. The solution revisits deshadowing as sequence modeling with Mamba, a selective state space model whose directional state transitions propagate global context efficiently while preserving positional continuity. On top of this, CrossGate injects shadow-aware similarity into Mamba's input gate to selectively integrate relevant context along the transition axes, and ColorShift regularization, a contrastive objective driven by global color statistics, suppresses color contamination from nearby regions, yielding more robust and consistent color restoration.
Link: https://arxiv.org/abs/2510.24260
Authors: Zhaotong Yang, Yi Chen, Yanying Li, Shengfeng He, Yangyang Xu, Junyu Dong, Jian Yang, Yong Du
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent deep models for image shadow removal often rely on attention-based architectures to capture long-range dependencies. However, their fixed attention patterns tend to mix illumination cues from irrelevant regions, leading to distorted structures and inconsistent colors. In this work, we revisit shadow removal from a sequence modeling perspective and explore the use of Mamba, a selective state space model that propagates global context through directional state transitions. These transitions yield an efficient global receptive field while preserving positional continuity. Despite its potential, directly applying Mamba to image data is suboptimal, since it lacks awareness of shadow-non-shadow semantics and remains susceptible to color interference from nearby regions. To address these limitations, we propose CrossGate, a directional modulation mechanism that injects shadow-aware similarity into Mamba’s input gate, allowing selective integration of relevant context along transition axes. To further ensure appearance fidelity, we introduce ColorShift regularization, a contrastive learning objective driven by global color statistics. By synthesizing structured informative negatives, it guides the model to suppress color contamination and achieve robust color restoration. Together, these components adapt sequence modeling to the structural integrity and chromatic consistency required for shadow removal. Extensive experiments on public benchmarks demonstrate that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance.
[CV-37] Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy NEURIPS2025
【Quick Read】: This paper studies the instability caused by the functional mismatch between image restoration and object detection networks, especially in adverse conditions such as haze and low light: in conventional cascaded frameworks, small perturbations introduced by the restoration module are amplified by the detector's discontinuous decision boundaries, disrupting gradient flow and hindering optimization. The key idea is to analyze this through the lens of Lipschitz continuity and to propose Lipschitz-regularized object detection (LROD), a framework that integrates restoration directly into the detector's feature learning so that both tasks maintain consistently smooth transformations in input and parameter space, improving detection stability and accuracy.
Link: https://arxiv.org/abs/2510.24232
Authors: Qing Zhao, Weijian Deng, Pengxu Wei, ZiYi Dong, Hannan Lu, Xiangyang Ji, Liang Lin
Affiliations: Sun Yat-sen University; Australian National University; Harbin Institute of Technology; Tsinghua University; Peng Cheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025
Abstract:To improve detection robustness in adverse conditions (e.g., haze and low light), image restoration is commonly applied as a pre-processing step to enhance image quality for the detector. However, the functional mismatch between restoration and detection networks can introduce instability and hinder effective integration – an issue that remains underexplored. We revisit this limitation through the lens of Lipschitz continuity, analyzing the functional differences between restoration and detection networks in both the input space and the parameter space. Our analysis shows that restoration networks perform smooth, continuous transformations, while object detectors operate with discontinuous decision boundaries, making them highly sensitive to minor perturbations. This mismatch introduces instability in traditional cascade frameworks, where even imperceptible noise from restoration is amplified during detection, disrupting gradient flow and hindering optimization. To address this, we propose Lipschitz-regularized object detection (LROD), a simple yet effective framework that integrates image restoration directly into the detector’s feature learning, harmonizing the Lipschitz continuity of both tasks during training. We implement this framework as Lipschitz-regularized YOLO (LR-YOLO), extending seamlessly to existing YOLO detectors. Extensive experiments on haze and low-light benchmarks demonstrate that LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy.
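The abstract describes harmonizing the Lipschitz continuity of restoration and detection during training but gives no formula. As a generic illustration of a Lipschitz-style penalty (not LROD's actual regularizer), the sketch below estimates a local Lipschitz ratio by finite differences; the `eps` value and the assumption that `model` returns a plain tensor are illustrative choices.

```python
import torch

def lipschitz_penalty(model, x, eps=1e-3):
    """Finite-difference estimate of a local Lipschitz constant of `model`
    around inputs `x`, usable as an auxiliary training loss.

    Generic illustration, not the LROD formulation from the paper.
    """
    delta = torch.randn_like(x)
    # rescale the random direction to have norm `eps` per sample
    delta = eps * delta / delta.flatten(1).norm(dim=1).view(-1, *[1] * (x.dim() - 1))
    y0 = model(x)
    y1 = model(x + delta)
    # ratio of output change to input change: large values indicate
    # locally steep (non-smooth) behaviour that the penalty discourages
    num = (y1 - y0).flatten(1).norm(dim=1)
    den = delta.flatten(1).norm(dim=1)
    return (num / den).mean()
```

In training, such a term would typically be added to the task loss, e.g. `loss = det_loss + lam * lipschitz_penalty(detector, images)`.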
[CV-38] Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation BMVC
【Quick Read】: This paper addresses the limited spatiotemporal resolution of traditional microsaccade research, which relies on expensive, poorly scalable eye trackers or frame-based analysis. The key idea is event-based sensing: the authors build the first event-based microsaccade dataset by rendering high-fidelity eye-movement scenes in Blender with angular displacements from 0.5 to 2.0 degrees across seven classes, then converting them to event streams with the v2e converter while preserving natural temporal dynamics. Spiking neural networks (Spiking-VGG variants and the proposed optical-flow-enhanced Spiking-VGG16Flow) reach about 90% average accuracy, classifying microsaccades by angular displacement independent of event count or duration, demonstrating the potential of SNNs for fine motion recognition and establishing a benchmark for event-based vision research.
Link: https://arxiv.org/abs/2510.24231
Authors: Waseem Shariff, Timothy Hanley, Maciej Stec, Hossein Javidnia, Peter Corcoran
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in British Machine Vision Conference (BMVC) 2025, Main Conference
Abstract:Microsaccades are small, involuntary eye movements vital for visual perception and neural processing. Traditional microsaccade studies typically use eye trackers or frame-based analysis, which, while precise, are costly and limited in scalability and temporal resolution. Event-based sensing offers a high-speed, low-latency alternative by capturing fine-grained spatiotemporal changes efficiently. This work introduces a pioneering event-based microsaccade dataset to support research on small eye movement dynamics in cognitive computing. Using Blender, we render high-fidelity eye movement scenarios and simulate microsaccades with angular displacements from 0.5 to 2.0 degrees, divided into seven distinct classes. These are converted to event streams using v2e, preserving the natural temporal dynamics of microsaccades, with durations ranging from 0.25 ms to 2.25 ms. We evaluate the dataset using Spiking-VGG11, Spiking-VGG13, and Spiking-VGG16, and propose Spiking-VGG16Flow, an optical-flow-enhanced variant implemented in SpikingJelly. The models achieve around 90 percent average accuracy, successfully classifying microsaccades by angular displacement, independent of event count or duration. These results demonstrate the potential of spiking neural networks for fine motion recognition and establish a benchmark for event-based vision research. The dataset, code, and trained models will be publicly available at this https URL .
[CV-39] SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs NEURIPS2025
【Quick Read】: This paper addresses the heavy computational overhead of processing visual tokens in multimodal large language models (MLLMs): existing pruning methods select only the most salient tokens by attention score, which leaves the selected set semantically incomplete. The key contribution, Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), jointly models the saliency and the coverage of the selected tokens: it defines a set-coverage over the selected tokens from token relationships, derives a token-coverage gain for each unselected token quantifying the additional coverage it would contribute, integrates the saliency score into this gain to form the SCOPE score, and iteratively selects the highest-scoring token.
Link: https://arxiv.org/abs/2510.24214
Authors: Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He
Affiliations: University of Electronic Science and Technology of China (UESTC); Shenzhen Institute for Advanced Study, UESTC; CFAR, Agency for Science, Technology and Research (A*STAR), Singapore; IHPC, Agency for Science, Technology and Research (A*STAR), Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025
Abstract:Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at this https URL.
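As a concrete illustration of the greedy saliency-coverage selection described above, the sketch below implements one plausible reading of the SCOPE score (multiplying saliency into the coverage gain); the paper's exact combination and similarity definition may differ.

```python
import torch

def scope_select(feats, saliency, k):
    """Greedy token selection that trades off saliency and coverage gain.

    Schematic reconstruction of the abstract's idea, not the paper's code.
    feats: (N, D) visual token features; saliency: (N,) attention scores.
    """
    sim = torch.nn.functional.cosine_similarity(
        feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)  # (N, N) token relations
    covered = torch.zeros(feats.size(0))  # max similarity to the selected set
    selected = []
    for _ in range(k):
        # additional coverage each candidate would contribute if added
        gain = (sim - covered).clamp(min=0).sum(dim=1)
        score = saliency * gain           # one way to integrate saliency
        if selected:
            score[selected] = float("-inf")  # never reselect a token
        u = int(score.argmax())
        selected.append(u)
        covered = torch.maximum(covered, sim[u])
    return selected
```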
[CV-40] Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization
【Quick Read】: This paper addresses problems caused by inference-time interventions (such as negative guidance or energy-based optimization) in diffusion-based face anonymization, namely distribution shift and the entanglement of identity with non-identity attributes, which degrade visual fidelity and data utility. The key idea of the training-centric framework ID²Face is to learn a structured latent space that explicitly disentangles identity from non-identity information, enabling direct and controllable anonymization at inference. Concretely, a conditional diffusion model is trained with an identity-masked learning scheme: an Identity-Decoupled Latent Recomposer models identity with an identity VAE and aligns non-identity attributes extracted from same-identity pairs, an Identity-Guided Latent Harmonizer fuses the representations via soft gating, and a recomposition-based reconstruction loss enforces disentanglement. An Orthogonal Identity Mapping strategy further suppresses identity leakage by enforcing orthogonality between sampled and source identity vectors.
Link: https://arxiv.org/abs/2510.24213
Authors: Haoxin Yang, Yihong Lin, Jingdan Kang, Xuemiao Xu, Yue Li, Cheng Xu, Shengfeng He
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Face anonymization aims to conceal identity information while preserving non-identity attributes. Mainstream diffusion models rely on inference-time interventions such as negative guidance or energy-based optimization, which are applied post-training to suppress identity features. These interventions often introduce distribution shifts and entangle identity with non-identity attributes, degrading visual fidelity and data utility. To address this, we propose ID²Face, a training-centric anonymization framework that removes the need for inference-time optimization. The rationale of our method is to learn a structured latent space where identity and non-identity information are explicitly disentangled, enabling direct and controllable anonymization at inference. To this end, we design a conditional diffusion model with an identity-masked learning scheme. An Identity-Decoupled Latent Recomposer uses an Identity Variational Autoencoder to model identity features, while non-identity attributes are extracted from same-identity pairs and aligned through bidirectional latent alignment. An Identity-Guided Latent Harmonizer then fuses these representations via soft-gating conditioned on noisy feature prediction. The model is trained with a recomposition-based reconstruction loss to enforce disentanglement. At inference, anonymization is achieved by sampling a random identity vector from the learned identity space. To further suppress identity leakage, we introduce an Orthogonal Identity Mapping strategy that enforces orthogonality between sampled and source identity vectors. Experiments demonstrate that ID²Face outperforms existing methods in visual quality, identity suppression, and utility preservation.
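The Orthogonal Identity Mapping mentioned above reduces to a simple penalty between two embedding vectors. A minimal sketch, assuming (B, D)-shaped identity embeddings and a squared-cosine penalty (the exact loss form is our assumption, not the paper's):

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(id_sampled, id_source):
    """Penalize correlation between sampled and source identity vectors.

    Illustrative only: the abstract states that orthogonality is enforced,
    but not the precise objective. Inputs: (B, D) identity embeddings.
    """
    a = F.normalize(id_sampled, dim=-1)
    b = F.normalize(id_source, dim=-1)
    cos = (a * b).sum(dim=-1)   # cosine similarity per pair
    return (cos ** 2).mean()    # zero iff the vectors are orthogonal
```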
[CV-41] MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration
【Quick Read】: This paper targets the slow inference of autoregressive (AR) visual generation, which produces tokens one at a time and can take thousands of steps per sample. The key contribution is MC-SJD, a training-free, lossless parallel decoding framework extending Speculative Jacobi Decoding (SJD): because independently sampled draft tokens are unstable across iterations and lower the acceptance rate, MC-SJD uses coupling, an information-theoretic construction, to maximize the probability of sampling identical draft tokens in consecutive iterations while preserving losslessness. The change amounts to a single-line modification of the existing algorithm, yet yields roughly 4.2x faster image generation and 13.3x faster video generation without degrading output quality.
Link: https://arxiv.org/abs/2510.24211
Authors: Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park
Affiliations: POSTECH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
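Maximal coupling itself is a standard construction, and sketching it clarifies why MC-SJD can keep draft tokens identical across iterations with probability sum(min(p, q)). The NumPy sketch below is the textbook coupling for two categorical distributions, not the paper's implementation:

```python
import numpy as np

def maximal_coupling(p, q, rng):
    """Sample a pair (x, y) with x ~ p, y ~ q, maximizing P(x == y).

    Generic maximal coupling of two categorical distributions; MC-SJD
    applies this idea to draft tokens across Jacobi iterations.
    """
    overlap = np.minimum(p, q)
    w = overlap.sum()                    # P(x == y) under the coupling
    if rng.random() < w:
        x = rng.choice(len(p), p=overlap / w)
        return x, x                      # the two samples agree
    # otherwise sample independently from the normalized residuals
    x = rng.choice(len(p), p=(p - overlap) / (1 - w))
    y = rng.choice(len(q), p=(q - overlap) / (1 - w))
    return x, y

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])   # e.g., token distribution at iteration t
q = np.array([0.6, 0.3, 0.1])   # token distribution at iteration t + 1
print(maximal_coupling(p, q, rng))
```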
[CV-42] CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation BMVC
【Quick Read】: This paper addresses the weak generalization, limited robustness, and inability to handle uncertainty of conventional CNNs in colorectal polyp and cardiac segmentation. The key of the proposed encoder-decoder framework CLFSeg is a Fuzzy-Convolutional (FC) module that combines convolutional layers with fuzzy logic to capture local and global features while reducing uncertainty, noise, and ambiguity in boundary regions, keeping computation efficient. Binary cross-entropy (BCE) combined with Dice loss handles class imbalance and focuses attention on tiny structures and boundary regions, yielding performance beyond current SOTA methods on four public datasets.
Link: https://arxiv.org/abs/2510.24202
Authors: Anshul Kaushal, Kunal Jangid, Vinod K. Kurmi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 36th British Machine Vision Conference (BMVC) 2025
Abstract:Accurate polyp and cardiac segmentation for early detection and treatment is essential for the diagnosis and treatment planning of cancer-like diseases. Traditional convolutional neural network (CNN) based models have represented limited generalizability, robustness, and inability to handle uncertainty, which affects the segmentation performance. To solve these problems, this paper introduces CLFSeg, an encoder-decoder based framework that aggregates the Fuzzy-Convolutional (FC) module leveraging convolutional layers and fuzzy logic. This module enhances the segmentation performance by identifying local and global features while minimizing the uncertainty, noise, and ambiguity in boundary regions, ensuring computing efficiency. In order to handle class imbalance problem while focusing on the areas of interest with tiny and boundary regions, binary cross-entropy (BCE) with dice loss is incorporated. Our proposed model exhibits exceptional performance on four publicly available datasets, including CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, and ACDC. Extensive experiments and visual studies show CLFSeg surpasses the existing SOTA performance and focuses on relevant regions of interest in anatomical structures. The proposed CLFSeg improves performance while ensuring computing efficiency, which makes it a potential solution for real-world medical diagnostic scenarios. Project page is available at this https URL
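The BCE-plus-Dice objective used by CLFSeg is a standard recipe and easy to reproduce. A minimal sketch for binary segmentation follows; the 0.5/0.5 weighting and smoothing constant are our assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, smooth=1.0, bce_weight=0.5):
    """Combined BCE + Dice loss for binary segmentation.

    logits, target: (B, 1, H, W); target is a float tensor in {0, 1}.
    Dice counters class imbalance; BCE keeps per-pixel gradients stable.
    """
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + smooth) / (union + smooth)
    return bce_weight * bce + (1 - bce_weight) * dice.mean()
```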
[CV-43] Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2 NEURIPS2025
【Quick Read】: This paper studies the adversarial robustness of the segmentation foundation model SAM2, for which it was unclear whether existing attacks against SAM transfer. Two architectural differences pose the key challenges: directional guidance from the prompt and semantic entanglement across consecutive frames. The proposed UAP-SAM2 is the first cross-prompt universal adversarial attack on SAM2, driven by dual semantic deviation: a target-scanning strategy divides each frame into k regions with randomly assigned prompts to reduce prompt dependency during optimization, while a dual semantic deviation framework optimizes a UAP that distorts semantics within the current frame and disrupts semantic consistency across consecutive frames, markedly improving transferability and effectiveness over SOTA attacks.
Link: https://arxiv.org/abs/2510.24195
Authors: Ziqi Zhou, Yifan Hu, Yufei Song, Zijing Li, Shengshan Hu, Leo Yu Zhang, Dezhong Yao, Long Zheng, Hai Jin
Affiliations: Huazhong University of Science and Technology; Griffith University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025
Abstract:Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.
[CV-44] Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning IROS2025
【Quick Read】: This paper addresses multi-task interference and weak spatial reasoning when applying vision-language models (VLMs) to autonomous-driving scene understanding across perception, prediction, planning, and corruption detection. The solution is a structured four-component framework: a Mixture-of-Prompts router classifies each question and dispatches it to a task-specific expert prompt, avoiding cross-task interference; the task-tailored prompts embed explicit coordinate systems, spatial-reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples; a visual assembly module dynamically composes multi-view images, object crops, magenta markers, and adaptive historical frames; and per-task inference parameters (temperature, top-p, message roles) are tuned for output quality. On Qwen2.5-VL-72B this achieves 70.87% average accuracy on clean data (Phase-1) and 72.85% on corrupted data (Phase-2), showing that structured prompting and spatial grounding substantially improve VLM performance on safety-critical driving tasks.
Link: https://arxiv.org/abs/2510.24152
Authors: Aodi Wu, Xubo Luo
Affiliations: University of Chinese Academy of Sciences; Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: RoboSense Challenge with IROS 2025
Abstract:This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at this https URL.
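A Mixture-of-Prompts router can be as simple as keyword dispatch to expert prompts. The toy sketch below shows the control flow only; the prompt texts, keyword lists, and routing rule are placeholders, since the paper's actual router is not specified in the abstract.

```python
TASK_PROMPTS = {
    "perception": "You are a driving perception expert. Use the ego frame: "
                  "x forward, y left, distances in meters. ...",
    "prediction": "Reason step by step about each agent's future motion. ...",
    "planning":   "Propose a safe maneuver; check collisions first. ...",
    "corruption": "Judge whether the camera input is corrupted. ...",
}

KEYWORDS = {
    "perception": ("where", "locate", "position", "visible"),
    "prediction": ("will", "future", "next", "intend"),
    "planning":   ("should", "plan", "action", "safe"),
    "corruption": ("corrupt", "noise", "blur", "broken"),
}

def route(question: str) -> str:
    """Dispatch a question to a task-specific expert prompt (toy router)."""
    q = question.lower()
    scores = {t: sum(k in q for k in kws) for t, kws in KEYWORDS.items()}
    task = max(scores, key=scores.get)   # best keyword match wins
    return TASK_PROMPTS[task]

print(route("Where is the nearest pedestrian relative to the ego car?"))
```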
[CV-45] MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images
【Quick Read】: This paper addresses the subjectivity, long turnaround, and variability of traditional colorectal cancer (CRC) diagnosis via colonoscopy and histopathological analysis. The key is MSRANetV2, a convolutional network built on a ResNet50V2 backbone extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks to extract deep semantic and fine-grained spatial features, with channel alignment and upsampling fusing multi-scale representations for more robust classification. Under five-fold stratified cross-validation the model reaches very high metrics on two public datasets (e.g., F1-scores above 0.99), and Grad-CAM visualizations highlight medically relevant tissue regions, supporting MSRANetV2 as a reliable, interpretable, high-performing model for CRC tissue classification.
Link: https://arxiv.org/abs/2510.24136
Authors: Ovi Sarkar, Md Shafiuzzaman, Md. Faysal Ahamed, Golam Mahmud, Muhammad E. H. Chowdhury
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and the role of prompt precise detection is of paramount interest in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach in enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model with a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. The average Precision, recall, F1-score, AUC, and test accuracy were 0.9884 ± 0.0151, 0.9900 ± 0.0151, 0.9900 ± 0.0145, 0.9999 ± 0.00006, and 0.9905 ± 0.0025 on the 7K dataset. On the 100K dataset, they were 0.9904 ± 0.0091, 0.9900 ± 0.0071, 0.9900 ± 0.0071, 0.9997 ± 0.00016, and 0.9902 ± 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting tissue areas that are medically relevant. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architectural model for classifying CRC tissues.
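The squeeze-and-excitation (SE) block that MSRANetV2 attaches to its backbone is a well-known module; below is the standard PyTorch formulation (reduction ratio 16 assumed), not code from the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block (Hu et al.): global average pooling
    followed by a two-layer gate that rescales each channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                           # excite: rescale channels
```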
[CV-46] Compositional Image Synthesis with Inference-Time Scaling
【Quick Read】: This paper addresses the weak compositionality of current text-to-image models, which often fail to render accurate object counts, attributes, and spatial relations. The key is a training-free framework that combines object-centric modeling with self-refinement to improve layout faithfulness while preserving aesthetic quality: an LLM synthesizes explicit layouts from the input prompt, the layouts are injected into the image generation process, and an object-centric vision-language model (VLM) judge iteratively reranks multiple candidates to pick the most prompt-aligned result. Unifying explicit layout grounding with self-refine-based inference-time scaling yields stronger scene alignment with prompts than recent text-to-image models.
Link: https://arxiv.org/abs/2510.24133
Authors: Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: project page: this https URL
Abstract:Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code is available at this https URL.
[CV-47] ETC: training-free diffusion models acceleration with Error-aware Trend Consistency
【Quick Read】: This paper addresses the efficiency bottleneck of diffusion models caused by costly iterative sampling, and two weaknesses of existing training-free accelerators that reuse model outputs across steps: they ignore denoising trends and lack model-specific error control, so trajectories deviate under multi-step reuse and outputs become inconsistent. The key is the Error-aware Trend Consistency (ETC) framework: (1) a consistent trend predictor exploits the smooth continuity of diffusion trajectories, projecting historical denoising patterns into stable future directions and progressively distributing them across multiple approximation steps to accelerate without deviating; (2) a model-specific error tolerance search derives corrective thresholds by identifying the transition from volatile semantic planning to stable quality refinement. ETC achieves a 2.65x speedup over FLUX with negligible consistency degradation (-0.074 SSIM).
Link: https://arxiv.org/abs/2510.24129
Authors: Jiajian Xie, Hubery Yin, Chen Li, Zhou Zhao, Shengyu Zhang
Affiliations: Zhejiang University; WeChat Vision, Tencent Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures
Abstract:Diffusion models have achieved remarkable generative quality but remain bottlenecked by costly iterative sampling. Recent training-free methods accelerate diffusion process by reusing model outputs. However, these methods ignore denoising trends and lack error control for model-specific tolerance, leading to trajectory deviations under multi-step reuse and exacerbating inconsistencies in the generated results. To address these issues, we introduce Error-aware Trend Consistency (ETC), a framework that (1) introduces a consistent trend predictor that leverages the smooth continuity of diffusion trajectories, projecting historical denoising patterns into stable future directions and progressively distributing them across multiple approximation steps to achieve acceleration without deviating; (2) proposes a model-specific error tolerance search mechanism that derives corrective thresholds by identifying transition points from volatile semantic planning to stable quality refinement. Experiments show that ETC achieves a 2.65x acceleration over FLUX with negligible (-0.074 SSIM score) degradation of consistency.
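ETC's trend predictor is more elaborate than plain extrapolation, but a linear projection of the last two denoiser outputs conveys the basic idea of reusing the trajectory direction across skipped steps. A deliberately simplified sketch:

```python
def extrapolate_trend(eps_prev, eps_curr, num_skip):
    """Linearly project the recent denoising direction forward.

    eps_prev, eps_curr: denoiser outputs at the two most recent evaluated
    steps (any tensor/array type supporting arithmetic). Returns a list of
    approximated outputs for the next `num_skip` skipped steps. ETC's
    actual predictor distributes the trend progressively and gates it
    with a searched error tolerance; both are omitted here.
    """
    delta = eps_curr - eps_prev          # local direction of the trajectory
    return [eps_curr + (i + 1) * delta for i in range(num_skip)]
```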
[CV-48] DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery
【Quick Read】: This paper addresses the shortcomings of existing dog motion datasets for motion recovery from images: no multi-view or real 3D data, limited scale, and limited diversity. The key contribution is DogMo, a multi-view RGB-D video dataset of 1.2k motion sequences from 10 dogs of different breeds, together with a three-stage, instance-specific optimization pipeline that fits the SMAL model to the sequences, progressively refining body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization to achieve accurate canine motion recovery.
Link: https://arxiv.org/abs/2510.24117
Authors: Zan Wang, Siyu Chen, Luya Mo, Xinfeng Gao, Yuxin Shen, Lebin Ding, Wei Liang
Affiliations: Beijing Institute of Technology; Yangtze Delta Region Academy of Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages
Abstract:We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.
[CV-49] UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations
【Quick Read】: This paper addresses semantic inconsistencies in knowledge distillation (KD) between heterogeneous models: most methods are designed for homogeneous teacher-student pairs, operate mainly in the logits space, and degrade sharply when intermediate features are transferred across architectures. The key is Unified Heterogeneous Knowledge Distillation (UHKD), which transfers intermediate features in the frequency domain: a Fourier transform captures global feature information and mitigates representational gaps, a Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, and a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training combines mean squared error on intermediate features with KL divergence on logits, yielding clear gains in heterogeneous settings.
Link: https://arxiv.org/abs/2510.24116
Authors: Fengming Yu, Haiwei Pan, Kejia Zhang, Jian Guan, Haiying Jiang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 4 figures
Abstract:Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing cost while maintaining accuracy. In visual applications, where large-scale image models are widely used, KD enables efficient deployment. However, architectural diversity introduces semantic discrepancies that hinder the use of intermediate representations. Most existing KD methods are designed for homogeneous models and degrade in heterogeneous scenarios, especially when intermediate features are involved. Prior studies mainly focus on the logits space, making limited use of the semantic information in intermediate layers. To address this limitation, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Fourier transform is applied to capture global feature information, alleviating representational discrepancies between heterogeneous teacher-student pairs. A Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method, highlighting UHKD as an effective approach for unifying heterogeneous representations and enabling efficient utilization of visual knowledge
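Matching intermediate features in the frequency domain, the core of UHKD, can be sketched in a few lines with torch.fft; the FTM/FAM alignment modules are omitted and equal feature shapes are assumed, so this is an illustration rather than the paper's loss.

```python
import torch

def freq_feature_loss(f_student, f_teacher):
    """Match intermediate features in the frequency domain.

    f_student, f_teacher: (B, C, H, W) feature maps (shapes assumed equal;
    in UHKD the FAM would project the student features first).
    """
    Fs = torch.fft.fft2(f_student, norm="ortho")
    Ft = torch.fft.fft2(f_teacher, norm="ortho")
    # compare spectral magnitudes so the loss stays real-valued
    return torch.mean((Fs.abs() - Ft.abs()) ** 2)
```

Per the abstract, such a feature term would be combined with a KL-divergence term on the logits to form the joint training objective.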
[CV-50] ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring
【Quick Read】: This paper addresses two limitations in end-to-end autonomous driving: imitation learning (IL) is constrained by sub-optimal expert demonstrations and covariate shift at deployment, while reinforcement learning (RL) typically cannot operate directly on high-dimensional raw sensor data. The key is ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), which drops IL entirely and learns from rewards alone using offline RL with the proposed Exhaustive Policy Optimization (EPO), a policy-gradient variant tailored to enumerable actions and rewards, achieving robust planning without discarding sensor information. To the authors' knowledge, ZTRS is the first framework to learn purely from rewards on high-dimensional raw inputs; it sets the state of the art on Navhard and outperforms IL baselines on HUGSIM.
Link: https://arxiv.org/abs/2510.24108
Authors: Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Jingde Chen, Nadine Chang, Maying Shen, Jingyu Song, Zuxuan Wu, Shiyi Lan, Jose M. Alvarez
Affiliations: Fudan University; NVIDIA; University of Michigan
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:End-to-end autonomous driving maps raw sensor inputs directly into ego-vehicle trajectories to avoid cascading errors from perception modules and to leverage rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and covariate shift during deployment. On the other hand, Reinforcement Learning (RL) has recently shown potential in scaling up with simulations, but is typically confined to low-dimensional symbolic inputs (e.g. 3D objects and maps), falling short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: sensor inputs without losing information and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely by only learning from rewards while operating directly on high-dimensional sensor data. ZTRS utilizes offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a variant of policy gradient tailored for enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves the state-of-the-art result on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at this https URL.
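When the action set (here, candidate trajectories) is enumerable, the policy-gradient expectation can be computed exactly instead of sampled, which appears to be the gist of EPO. A minimal sketch under that reading (the paper's variant may add baselines or regularization):

```python
import torch

def epo_loss(scores, rewards):
    """Exact expected-reward objective over an enumerable trajectory set.

    scores: (B, K) trajectory scores from the planning head.
    rewards: (B, K) reward of each candidate trajectory.
    """
    probs = torch.softmax(scores, dim=-1)
    expected_reward = (probs * rewards).sum(dim=-1)  # exact expectation,
    return -expected_reward.mean()                   # no sampling needed
```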
[CV-51] Enhancing Pre-trained Representation Classifiability can Boost its Interpretability ICLR2025
【Quick Read】: This paper asks whether the representations of pre-trained vision models can be highly interpretable and highly classifiable at the same time, i.e., whether the two properties trade off. The key is the Inherent Interpretability Score (IIS), which quantifies interpretability as the ratio of interpretable semantics within a representation, measured through the information loss incurred when only the interpretable part is captured by interpretations. Evaluations across representations of varying classifiability reveal that interpretability and classifiability are positively correlated: more classifiable representations carry more interpretable semantics. This supports fine-tuning with interpretability maximization to further improve classifiability, and making predictions from interpretations with little accuracy degradation, unifying improvements in both properties.
Link: https://arxiv.org/abs/2510.24105
Authors: Shufan Shen, Zhaobo Qi, Junshu Sun, Qingming Huang, Qi Tian, Shuhui Wang
Affiliations: Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS; University of Chinese Academy of Sciences; Harbin Institute of Technology, Weihai; Peng Cheng Laboratory; Huawei Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICLR 2025 (Spotlight)
Abstract:The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at this https URL.
[CV-52] OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation
【Quick Read】: This paper addresses three limitations of diffusion-based text image manipulation (TIM): (i) the inability to remove text, (ii) lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. The key of the training-free generalist OmniText is twofold. Text removal is achieved via self-attention inversion, which suppresses the model's focus on surrounding text and thereby reduces text hallucination, while redistributing cross-attention (raising the probability of specific text tokens) curbs hallucination further. For controllable inpainting, a latent-optimization framework introduces a cross-attention content loss for rendering accuracy and a self-attention style loss for style customization. The authors also release OmniText-Bench, a benchmark covering text removal, rescaling, repositioning, and stylized insertion and editing; OmniText is the first generalist to handle this range of TIM tasks, surpassing other text-inpainting methods and matching specialist approaches.
Link: https://arxiv.org/abs/2510.24093
Authors: Agus Gunawan, Samuel Teodoro, Yun Chen, Soo Ye Kim, Jihyong Oh, Munchurl Kim
Affiliations: KAIST; Adobe Research; Chung-Ang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally to this work. The last two authors are co-corresponding authors
Abstract:Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model’s tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
[CV-53] Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
【Quick Read】: This paper addresses overfitting and insufficient sample diversity when text-to-image (T2I) models are fine-tuned on a few real examples to generate synthetic training data for low-shot fine-grained classification. The key of the proposed fine-tuning strategy BOB (BeyondObjects) is to extract class-agnostic attributes (such as scene background and object pose) from the few real samples, condition on them explicitly during T2I fine-tuning, and marginalize them out at generation time. This mitigates overfitting, preserves the T2I model's generative prior, reduces estimation error, and minimizes unintended inter-class associations, markedly improving synthetic-data quality and downstream classification performance.
Link: https://arxiv.org/abs/2510.24078
Authors: William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
Affiliations: Princeton University; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model’s generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.
[CV-54] Enhancing CLIP Robustness via Cross-Modality Alignment NEURIPS2025
【Quick Read】: This paper addresses the high vulnerability of vision-language models such as CLIP to adversarial perturbations in zero-shot classification. Existing defenses rely on adversarial fine-tuning or prompt optimization and overlook the gap between CLIP's text and image features, a misalignment that adversarial perturbations amplify severely, degrading classification badly. The key is COLA (Cross-modality Alignment), an optimal transport (OT) based framework: (1) adversarial image embeddings are first projected onto the subspace spanned by class text features, filtering out non-semantic distortions while preserving discriminative information; (2) images and texts are then modeled as discrete distributions over multiple augmented views and aligned via OT, with the subspace projection folded into the cost computation, restoring global cross-modal alignment and local structural consistency under attack. COLA is training-free, compatible with existing fine-tuned models, and improves ImageNet and its variants by 6.7% on average under PGD attacks across 14 zero-shot benchmarks.
Link: https://arxiv.org/abs/2510.24038
Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang
Affiliations: University of Science and Technology of China; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025 Spotlight
Abstract:Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP’s encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
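COLA's first step, projecting an image embedding onto the span of the class text features, is plain linear algebra. A least-squares sketch follows; the OT refinement over augmented views is omitted, and an embedding dimension no smaller than the number of classes is assumed.

```python
import torch

def project_to_text_subspace(img_emb, text_feats):
    """Project image embeddings onto the subspace spanned by class text
    features, filtering components orthogonal to class semantics.

    img_emb: (B, D) image embeddings; text_feats: (C, D), one per class.
    """
    T = text_feats.T                                   # (D, C) basis matrix
    # least-squares coefficients: argmin_a ||T a - x|| for each embedding
    coeff = torch.linalg.lstsq(T, img_emb.T).solution  # (C, B)
    return (T @ coeff).T                               # (B, D) projections
```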
[CV-55] Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models
【Quick Read】: This paper addresses two limitations of sparse tuning in parameter-efficient fine-tuning (PEFT): existing two-stage methods locate task-relevant weights from gradient information, ignoring how parameters shift during fine-tuning, which caps performance; and they store full weight matrices in the optimizer, causing high memory usage. The key is the one-stage method SNELLA: it updates the weight matrix selectively by adding a sparse matrix merged from two low-rank learnable matrices, extends the low-rank decomposition with nonlinear kernel functions to raise the rank of the merged matrix and prevent interdependent weight updates, and uses an adaptive bi-level sparsity allocation mechanism in which weights compete across and within layers by importance score, end to end. SNELLA reaches SOTA results on classification, segmentation, and generation tasks with 31.1%-39.9% less memory.
Link: https://arxiv.org/abs/2510.24037
Authors: Shufan Shen, Junshu Sun, Shuhui Wang, Qingming Huang
Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, it locates task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits the performance. Second, it updates only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% v.s. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at this https URL.
[CV-56] ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning
【Quick Read】: This paper revisits the vanishing-gradient problem that hampers the training of very deep convolutional neural networks (CNNs). The key is the residual learning of ResNet (He et al., 2015): skip connections let gradients flow directly through shortcut paths that bypass intermediate layers, markedly improving training stability and convergence of deep networks. In the authors' CIFAR-10 implementation, ResNet-18 reaches 89.9% accuracy versus 84.1% for a traditional deep CNN of similar depth, while converging faster and training more stably.
Link: https://arxiv.org/abs/2510.24036
Authors: Xingyu Liu, Kun Ming Goh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 3 pages, 5 figures, 1 table
Abstract:Convolutional Neural Networks (CNNs) has revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcomes this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.
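Since this paper is an exposition of residual learning, the standard basic block is worth showing explicitly; the shortcut addition below is exactly the mechanism that lets gradients bypass the convolutional branch (textbook formulation, not the authors' code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Standard ResNet basic block: the skip connection adds the input
    to the convolutional branch before the final activation."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:   # match shapes on the skip path
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # residual addition
```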
[CV-57] AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM -Driven Adversarial Prompts ICCV2025
【Quick Read】: This paper addresses the vulnerability of text-to-image (T2I) safety mechanisms to adversarial prompts, and the limits of current red-teaming: white-box access requirements, inefficient per-prompt optimization, and semantically meaningless prompts that filters easily block. The key is APT (AutoPrompT), a black-box framework in which an LLM automatically generates human-readable, filter-resistant adversarial suffixes for benign prompts via an alternating pipeline of suffix optimization and LLM fine-tuning on the optimized suffixes, together with a dual-evasion strategy: perplexity scoring by an auxiliary LLM constrains outputs to remain human-readable (defeating perplexity-based filters), and banned-token penalties suppress explicit blacklist words. The resulting prompts show strong zero-shot transferability to unseen prompts and expose critical vulnerabilities even in commercial APIs.
Link: https://arxiv.org/abs/2510.24034
Authors: Yufan Liu, Wanqian Zhang, Huashan Chen, Lin Wang, Xiaojun Jia, Zheng Lin, Weiping Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; School of Cyberspace, Hangzhou Dianzi University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025
Abstract:Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, and rely on inefficient per-prompt optimization, as well as inevitably generate semantically meaningless prompts easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline between adversarial suffix optimization and fine-tuning the LLM utilizing the optimized suffix. Furthermore, we integrates a dual-evasion strategy in optimization phase, enabling the bypass of both perplexity-based filter and blacklist word filter: (1) we constrain the LLM generating human-readable prompts through an auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we also introduce banned-token penalties to suppress the explicit generation of banned-tokens in blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., this http URL.).
[CV-58] Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks NEURIPS2025
【Quick Read】: This paper addresses the lack of standardized benchmarks and evaluation frameworks in Mars science, which has limited the development and comparison of foundation models for Martian tasks. The key contribution is Mars-Bench, the first comprehensive benchmark for Mars tasks: 20 datasets spanning classification, segmentation, and object detection over orbital and surface imagery, focused on key geologic features such as craters, cones, boulders, and frost, with standardized, ready-to-use data and baseline evaluations of models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results suggest Mars-specific foundation models may outperform general-domain counterparts, motivating further exploration of domain-adapted pre-training.
Link: https://arxiv.org/abs/2510.24010
Authors: Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar, Jacob Adler, Steven Lu, Umaa Rebbapragada, Hannah Kerner
Affiliations: Arizona State University; California Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025
Abstract:Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: this https URL.
[CV-59] owards the Automatic Segmentation Modeling and Meshing of the Aortic Vessel Tree from Multicenter Acquisitions: An Overview of the SEG.A. 2023 Segmentation of the Aorta Challenge
【Quick Read】: This paper addresses a central obstacle to the clinical use of automated aortic vessel tree (AVT) analysis from computed tomography angiography (CTA): the lack of shared, high-quality data for developing and comparing algorithms. The key was launching the SEG.A. challenge, which released a large, multi-institutional public dataset for AVT segmentation, benchmarked automated algorithms on a hidden test set, and added optional surface-meshing tasks for computational simulation. The findings show deep learning, especially 3D U-Net architectures, dominating the top submissions; an ensemble of the highest-ranking algorithms significantly outperformed individual models, underscoring the value of model fusion, and performance depended strongly on customized post-processing steps and training-data characteristics, establishing a new benchmark and a lasting resource for robust, clinically translatable tools.
Link: https://arxiv.org/abs/2510.24009
Authors: Yuan Jin, Antonio Pepe, Gian Marco Melito, Yuxuan Chen, Yunsu Byeon, Hyeseong Kim, Kyungwon Kim, Doohyun Park, Euijoon Choi, Dosik Hwang, Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu, Ayman El-Ghotni, Mohamed Nabil, Hossam El-Kady, Ahmed Ayyad, Amr Nasr, Marek Wodzinski, Henning Müller, Hyeongyu Kim, Yejee Shin, Abbas Khan, Muhammad Asad, Alexander Zolotarev, Caroline Roney, Anthony Mathur, Martin Benning, Gregory Slabaugh, Theodoros Panagiotis Vagenas, Konstantinos Georgas, George K. Matsopoulos, Jihan Zhang, Zhen Zhang, Liqin Huang, Christian Mayer, Heinrich Mächler, Jan Egger
Affiliations: University of Technology Graz; University of Duisburg-Essen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The automated analysis of the aortic vessel tree (AVT) from computed tomography angiography (CTA) holds immense clinical potential, but its development has been impeded by a lack of shared, high-quality data. We launched the SEG.A. challenge to catalyze progress in this field by introducing a large, publicly available, multi-institutional dataset for AVT segmentation. The challenge benchmarked automated algorithms on a hidden test set, with subsequent optional tasks in surface meshing for computational simulations. Our findings reveal a clear convergence on deep learning methodologies, with 3D U-Net architectures dominating the top submissions. A key result was that an ensemble of the highest-ranking algorithms significantly outperformed individual models, highlighting the benefits of model fusion. Performance was strongly linked to algorithmic design, particularly the use of customized post-processing steps, and the characteristics of the training data. This initiative not only establishes a new performance benchmark but also provides a lasting resource to drive future innovation toward robust, clinically translatable tools.
[CV-60] AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization
【Quick Read】: This paper addresses the weak out-of-distribution generalization of deep learning models for early diabetic retinopathy (DR) detection, caused by differences in acquisition devices, population characteristics, and imaging conditions. The key of the proposed AdvBlur method is to inject adversarially blurred images into the training set and train with a dual-loss framework, improving robustness to unseen distributions and mitigating the performance drops caused by domain gaps. Experiments across multiple datasets, camera types, image quality levels, and dataset sizes, plus ablations on the blurred images and the loss function, show competitive performance against state-of-the-art domain-generalization DR models on unseen external datasets.
Link: https://arxiv.org/abs/2510.24000
Authors: Heethanjan Kanagalingam, Thenukan Pathmanathan, Mokeeshan Vathanakumar, Tharmakulasingam Mukunthan
Affiliations: Jaffna Campus of the University of Moratuwa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, yet early and accurate detection can significantly improve treatment outcomes. While numerous Deep learning (DL) models have been developed to predict DR from fundus images, many face challenges in maintaining robustness due to distributional variations caused by differences in acquisition devices, demographic disparities, and imaging conditions. This paper addresses this critical limitation by proposing a novel DR classification approach, a method called AdvBlur. Our method integrates adversarial blurred images into the dataset and employs a dual-loss function framework to address domain generalization. This approach effectively mitigates the impact of unseen distributional variations, as evidenced by comprehensive evaluations across multiple datasets. Additionally, we conduct extensive experiments to explore the effects of factors such as camera type, low-quality images, and dataset size. Furthermore, we perform ablation studies on blurred images and the loss function to ensure the validity of our choices. The experimental results demonstrate the effectiveness of our proposed method, achieving competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets.
zh
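针对上文 AdvBlur 的“对抗模糊 + 双损失”思路,下面给出一个最小示意(编者补充的草图,非论文官方实现):在一组固定模糊强度中选取令损失最大的版本,作为“对抗模糊”的简化近似;双损失取干净图像与该最坏模糊版本的分类损失之和。模糊强度集合、权重 lam 与演示网络结构均为编者假设。

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def advblur_dual_loss(model, x, y, sigmas=(0.5, 1.0, 2.0), lam=1.0):
    """双损失:干净图像的分类损失 + 损失最大的模糊版本的分类损失。"""
    loss_clean = F.cross_entropy(model(x), y)
    worst = None
    for s in sigmas:  # 在候选模糊强度中取最坏情况,近似“对抗模糊”
        xb = GaussianBlur(kernel_size=5, sigma=s)(x)
        lb = F.cross_entropy(model(xb), y)
        worst = lb if worst is None else torch.maximum(worst, lb)
    return loss_clean + lam * worst

# 用法示意(随机数据 + 极简 CNN,仅为演示)
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 5))
x, y = torch.randn(4, 3, 64, 64), torch.randint(0, 5, (4,))
advblur_dual_loss(model, x, y).backward()
```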
[CV-61] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
【速读】:该论文旨在解决现有基准测试无法全面评估现实场景中第一人称人工智能助手(egocentric AI assistants)在多模态输入处理、实时响应和长期记忆保持等方面的综合能力问题。当前的评测体系通常孤立地评估某一项能力,缺乏真实流式数据场景或仅支持短期任务,难以反映实际应用需求。解决方案的关键在于提出一个名为TeleEgo的长时序、流式、全模态(omni-modal)基准数据集,涵盖工作与学习、生活方式与日常、社交活动、外出与文化四大领域,包含每名参与者超过14小时同步的视频、音频与文本数据,并基于统一全局时间轴标注高质量视觉叙述与语音转录文本;同时设计了12个诊断性子任务,覆盖记忆(Memory)、理解(Understanding)和跨记忆推理(Cross-Memory Reasoning)三大核心能力,通过3,291个经人工验证的问答项(含多种题型)进行严格流式评估,并引入“实时准确率”(Real-Time Accuracy)与“记忆持续时间”(Memory Persistence Time)两个关键指标,共同衡量模型的正确性、时间响应性和长期记忆保留能力,从而为开发实用化第一人称AI助手提供更真实、系统的评估框架。
链接: https://arxiv.org/abs/2510.23981
作者: Jiaqi Yan,Ruilong Ren,Jingren Liu,Shuning Xu,Ling Wang,Yiheng Wang,Yun Wang,Long Zhang,Xiangyu Chen,Changzhi Sun,Jixiang Luo,Dell Zhang,Hao Sun,Chi Zhang,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human verification. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics, Real-Time Accuracy and Memory Persistence Time, to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
zh
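摘要未给出两个指标的计算式,下面按编者假设的一种合理定义给出示意:实时准确率取“答对且延迟在预算内”的比例,记忆持续时间取答对样本中证据出现与提问之间的最大时间差;字段名与阈值均为编者假设,以论文原文为准。

```python
from dataclasses import dataclass

@dataclass
class QAResult:
    correct: bool     # 答案是否正确
    latency_s: float  # 从提问到作答的延迟(秒)
    gap_s: float      # 证据出现时刻与提问时刻的时间差(秒)

def real_time_accuracy(results, budget_s=2.0):
    """正确且在延迟预算内作答的比例(指标定义为编者假设)。"""
    ok = [r for r in results if r.correct and r.latency_s <= budget_s]
    return len(ok) / max(len(results), 1)

def memory_persistence_time(results):
    """仍能答对的最长证据-提问时间差(指标定义为编者假设)。"""
    gaps = [r.gap_s for r in results if r.correct]
    return max(gaps) if gaps else 0.0

results = [QAResult(True, 0.8, 3600), QAResult(False, 1.2, 7200), QAResult(True, 1.5, 5400)]
print(real_time_accuracy(results), memory_persistence_time(results))
```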
[CV-62] Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints
【速读】:该论文旨在解决任意尺度超分辨率(Arbitrary-Scale Super-Resolution)中成本与质量(Cost-and-Quality, CQ)可控性不足的问题。现有方法通过循环神经网络(Recurrent Neural Network, RNN)逐个预测傅里叶成分(Fourier components),导致性能下降和计算效率低下,主要归因于各成分独立预测所引发的信息割裂与冗余。论文的关键解决方案是提出联合预测多个傅里叶成分的方法,从而在提升重建质量的同时显著提高计算效率,实现更优的CQ可控性。
链接: https://arxiv.org/abs/2510.23978
作者: Kazutoshi Akita,Norimichi Ukita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages
Abstract:Cost-and-Quality (CQ) controllability in arbitrary-scale super-resolution is crucial. Existing methods predict Fourier components one by one using a recurrent neural network. However, this approach leads to performance degradation and inefficiency due to independent prediction. This paper proposes predicting multiple components jointly to improve both quality and efficiency.
zh
[CV-63] Synergistic Neural Forecasting of Air Pollution with Stochastic Sampling
【速读】:该论文旨在解决现有空气质量预测模型在应对极端污染事件(如野火、城市雾霾和沙尘暴引发的PM浓度骤升)时准确率不足的问题,尤其是传统模型常因损失函数设计偏向平均值而低估稀有但高危害的污染峰值。其解决方案的关键在于提出SynCast模型,该模型基于区域自适应的Transformer架构,并引入基于扩散过程的随机精修模块,结合气象与大气成分数据,通过领域感知的目标函数和极值理论指导训练,显著提升了对极端污染事件的捕捉能力,同时保持了全球范围内的预测精度。
链接: https://arxiv.org/abs/2510.23977
作者: Yohan Abeysinghe,Muhammad Akhtar Munir,Sanoojan Baliah,Ron Sarafian,Fahad Shahbaz Khan,Yinon Rudich,Salman Khan
机构: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); Weizmann Institute of Science (魏兹曼科学研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Air pollution remains a leading global health and environmental risk, particularly in regions vulnerable to episodic air pollution spikes due to wildfires, urban haze and dust storms. Accurate forecasting of particulate matter (PM) concentrations is essential to enable timely public health warnings and interventions, yet existing models often underestimate rare but hazardous pollution events. Here, we present SynCast, a high-resolution neural forecasting model that integrates meteorological and air composition data to improve predictions of both average and extreme pollution levels. Built on a regionally adapted transformer backbone and enhanced with a diffusion-based stochastic refinement module, SynCast captures the nonlinear dynamics driving PM spikes more accurately than existing approaches. Leveraging harmonized ERA5 and CAMS datasets, our model shows substantial gains in forecasting fidelity across multiple PM variables (PM1, PM2.5, PM10), especially under extreme conditions. We demonstrate that conventional loss functions underrepresent distributional tails (rare pollution events) and show that SynCast, guided by domain-aware objectives and extreme value theory, significantly enhances performance in highly impacted regions without compromising global accuracy. This approach provides a scalable foundation for next-generation air quality early warning systems and supports climate-health risk mitigation in vulnerable regions.
zh
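摘要指出常规损失函数低估分布尾部(罕见污染事件)。下面给出一个体现“领域感知目标”思想的简化替代:对超过高分位阈值的目标值加大平方误差权重。分位数 q 与尾部权重均为编者假设,并非论文中基于极值理论的完整形式。

```python
import torch

def tail_weighted_mse(pred, target, q=0.95, tail_weight=5.0):
    """对超过高分位阈值的“极端污染”样本加权的 MSE(极值思想的简化替代)。"""
    thresh = torch.quantile(target, q)
    w = torch.where(target > thresh,
                    torch.full_like(target, tail_weight),
                    torch.ones_like(target))
    return (w * (pred - target) ** 2).mean()

pred, target = torch.randn(1000).abs(), torch.randn(1000).abs()
print(tail_weighted_mse(pred, target))
```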
[CV-64] Reasoning Visual Language Model for Chest X-Ray Analysis
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在胸部X光片(chest X-ray)分析中缺乏透明推理过程的问题,即模型仅提供预测结果而无法展现类似放射科医生所依赖的逐步、可解释的思维链条。为应对这一挑战,其解决方案的关键在于引入链式思维(Chain-of-Thought, CoT)推理框架,并通过两阶段训练策略实现:首先采用基于推理风格的监督微调(Supervised Fine-Tuning, SFT),使模型学习将中间推理步骤与可见图像证据及放射学工作流程对齐;随后利用可验证奖励信号进行强化学习(Reinforcement Learning, RL),以优化对一系列X光异常的判别能力。该方法不仅提升了多标签分类性能,更重要的是生成了符合放射科医生系统性思考逻辑、包含不确定性评估和鉴别诊断的显式推理路径,从而增强了临床可审计性、错误分析能力和人机协作安全性。
链接: https://arxiv.org/abs/2510.23968
作者: Andriy Myronenko,Dong Yang,Baris Turkbey,Mariam Aboian,Sena Azamat,Esra Akcicek,Hongxu Yin,Pavlo Molchanov,Marc Edgar,Yufan He,Pengfei Guo,Yucheng Tang,Daguang Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NV-Reason-CXR-3B
Abstract:Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists' systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
zh
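“对一系列X光异常的可验证奖励”的一个最小示意(奖励形式为编者假设:以预测异常集合与标注集合的 F1 作为奖励;论文的具体奖励设计可能不同):

```python
def verifiable_reward(pred_findings, gold_findings):
    """以异常发现集合的 F1 作为可验证奖励(具体形式为编者假设)。"""
    pred = set(map(str.lower, pred_findings))
    gold = set(map(str.lower, gold_findings))
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(verifiable_reward(["cardiomegaly", "effusion"], ["effusion", "pneumothorax"]))
```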
[CV-65] SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
【速读】:该论文旨在解决传统图像安全防护模型(image guardrail models)在面对新兴威胁时适应性差、缺乏语义推理能力以及解释性不足的问题。这些问题导致模型在分类不安全内容时易产生误判,且难以动态响应政策变化,需频繁重新训练以应对新风险。其解决方案的关键在于提出SafeVision——一个融合人类式推理机制的新型图像安全防护系统,通过引入有效的数据收集与生成框架、遵循策略的训练流程、定制化损失函数及多样化的问答(QA)生成训练策略,实现对安全策略的动态对齐与精准风险评估。SafeVision可在推理阶段实时适配不断演进的安全政策,无需重新训练,并提供可解释的风险判断结果,显著优于现有方法(如GPT-4o),同时具备更高的效率(快16倍以上)。
链接: https://arxiv.org/abs/2510.23960
作者: Peiyang Xu,Minzhou Pan,Zhaorun Chen,Shuang Yang,Chaowei Xiao,Bo Li
机构: University of Chicago, IL, US; Virtue AI, CA, US; Meta, CA, US; University of Wisconsin, Madison, WI, US; UIUC, IL, US
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 42 pages, 9 figures
Abstract:With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive (VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.
zh
[CV-66] Neural USD: An object-centric framework for iterative editing and control
【速读】:该论文旨在解决生成式 AI(Generative AI)中精确且迭代的对象编辑问题,即在修改图像中的特定对象(如改变某个物体的颜色或背景)时,现有方法常因调整条件信号而导致场景发生非预期的全局变化。解决方案的关键在于引入“神经通用场景描述符”(Neural Universal Scene Descriptor, Neural USD),该框架以结构化、分层的方式表示场景与对象,支持对外观、几何形状和姿态的逐对象控制,并通过微调策略实现控制信号之间的解耦,从而支持迭代和增量式的工作流程。
链接: https://arxiv.org/abs/2510.23956
作者: Alejandro Escontrela,Shrinu Kushagra,Sjoerd van Steenkiste,Yulia Rubanova,Aleksander Holynski,Kelsey Allen,Kevin Murphy,Thomas Kipf
机构: Google DeepMind(谷歌深度思维); U.C. Berkeley(加州大学伯克利分校); Google Research(谷歌研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 16 figures, 1 table
Abstract:Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the “Neural Universal Scene Descriptor” or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: this https URL.
zh
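参照 USD 的层次化场景描述思想,下面用数据类勾勒“逐对象控制外观、几何与位姿”的结构示意(字段划分与编辑接口均为编者假设,仅说明分层表示为何能支持只改一个对象而不影响其余节点):

```python
from dataclasses import dataclass, field

@dataclass
class NeuralPrim:
    """单个对象节点:外观、几何与位姿各自独立的条件信号(字段划分为编者假设)。"""
    name: str
    appearance: str          # 外观条件的占位(如文本描述或嵌入引用)
    geometry: str            # 几何条件的占位(如网格/SDF 引用)
    pose: tuple              # 简化位姿,如 (x, y, z, yaw)
    children: list = field(default_factory=list)

def edit_appearance(root, name, new_appearance):
    """仅修改指定对象的外观,其余节点保持不变,体现逐对象可控性。"""
    if root.name == name:
        root.appearance = new_appearance
    for c in root.children:
        edit_appearance(c, name, new_appearance)

scene = NeuralPrim("scene", "indoor", "room_mesh", (0, 0, 0, 0), children=[
    NeuralPrim("chair", "red fabric", "chair_sdf", (1.0, 0.0, 0.5, 90.0))])
edit_appearance(scene, "chair", "blue leather")   # 局部编辑,背景不受影响
```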
[CV-67] Adaptive Training of INRs via Pruning and Densification
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)中因输入频率选择不当和网络结构冗余导致的模型效率与重建质量难以平衡的问题。现有方法通常依赖启发式策略或复杂的超参数优化,缺乏对架构动态调整的能力。其解决方案的关键在于提出一种自适应训练机制 AIRe(Adaptive Implicit neural Representation),通过两个核心步骤实现:首先利用神经元剪枝机制识别并移除贡献度较低的神经元,结合目标权重衰减将信息迁移至保留神经元,从而减少冗余;其次在信号欠拟合区域进行输入频率密度扩展,增强表示能力,优化网络规模与重建精度之间的权衡。
链接: https://arxiv.org/abs/2510.23943
作者: Diana Aldana,João Paulo Lima,Daniel Csillag,Daniel Perazzo,Haoan Feng,Luiz Velho,Tiago Novello
机构: IMPA; Universidade Federal Rural de Pernambuco; FGV EMAp; University of Maryland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Encoding input coordinates with sinusoidal functions into multilayer perceptrons (MLPs) has proven effective for implicit neural representations (INRs) of low-dimensional signals, enabling the modeling of high-frequency details. However, selecting appropriate input frequencies and architectures while managing parameter redundancy remains an open challenge, often addressed through heuristics and heavy hyperparameter optimization schemes. In this paper, we introduce AIRe ( \textbfA daptive \textbfI mplicit neural \textbfRe presentation), an adaptive training scheme that refines the INR architecture over the course of optimization. Our method uses a neuron pruning mechanism to avoid redundancy and input frequency densification to improve representation capacity, leading to an improved trade-off between network size and reconstruction quality. For pruning, we first identify less-contributory neurons and apply a targeted weight decay to transfer their information to the remaining neurons, followed by structured pruning. Next, the densification stage adds input frequencies to spectrum regions where the signal underfits, expanding the representational basis. Through experiments on images and SDFs, we show that AIRe reduces model size while preserving, or even improving, reconstruction quality. Code and pretrained models will be released for public use.
zh
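剪枝与频率密度扩展两个阶段的简化示意(编者草图:以出射权重范数近似神经元贡献度,以残差能量最高的频段近似“欠拟合区域”;省略了论文中的定向权重衰减信息迁移步骤):

```python
import numpy as np

def prune_neurons(W_in, W_out, keep_ratio=0.8):
    """按出射权重范数衡量各隐藏神经元的贡献度,保留贡献最大的部分。"""
    contrib = np.linalg.norm(W_out, axis=1)
    k = max(1, int(keep_ratio * len(contrib)))
    keep = np.sort(np.argsort(contrib)[-k:])
    return W_in[keep], W_out[keep]

def densify_frequencies(freqs, residual_spectrum, n_new=4):
    """在残差能量最高的频段附近追加新输入频率(欠拟合判据为编者假设)。"""
    worst = np.argsort(residual_spectrum)[-n_new:]
    return np.concatenate([freqs, freqs[worst] * 1.5])

W_in, W_out = np.random.randn(64, 2), np.random.randn(64, 3)   # 64 个隐藏神经元
W_in2, W_out2 = prune_neurons(W_in, W_out)                     # 剪枝后剩 51 个
freqs2 = densify_frequencies(np.linspace(1, 32, 64), np.random.rand(64))
```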
[CV-68] PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors NEURIPS2025
【速读】:该论文旨在解决三维高斯溅射(3D Gaussian Splatting, 3DGS)在室内场景中因大面积低纹理区域导致的几何模糊与高保真表面重建失败的问题。其解决方案的关键在于提出PlanarGS框架,通过引入语言提示的平面先验(Language-Prompted Planar Priors, LP3)机制,利用预训练视觉-语言分割模型生成初始平面区域提案,并结合多视角融合与几何先验进行细化;同时,在3DGS优化过程中增加两项监督项:平面一致性先验项以约束点云符合平面结构,以及几何先验监督项以引导高斯分布朝向深度和法向量线索收敛,从而显著提升室内场景的重建精度与细节保真度。
链接: https://arxiv.org/abs/2510.23930
作者: Xirui Jin,Renbiao Jin,Boying Li,Danping Zou,Wenxian Yu
机构: Shanghai Jiao Tong University (上海交通大学); Monash University (莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025. Project page: this https URL
Abstract:Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors. 3D Gaussians in our framework are optimized with two additional terms: a planar prior supervision term that enforces planar consistency, and a geometric prior supervision term that steers the Gaussians toward the depth and normal cues. We have conducted extensive experiments on standard indoor benchmarks. The results show that PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin. Project page: this https URL
zh
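两个附加监督项的最小示意(编者假设的实现:平面一致性项用 SVD 拟合平面后惩罚点到平面距离,几何先验项向深度/法向线索对齐;权重与具体形式以论文为准):

```python
import torch

def planar_consistency_loss(points):
    """SVD 拟合区域内 3D 点所在平面,用点到平面距离惩罚非平面结构。"""
    c = points.mean(dim=0)
    _, _, Vt = torch.linalg.svd(points - c)
    normal = Vt[-1]                                  # 最小奇异值方向即平面法向
    return ((points - c) @ normal).abs().mean()

def geometric_prior_loss(depth, normal, depth_prior, normal_prior):
    """向深度与法向先验对齐的监督项(形式与权重为编者假设)。"""
    l_depth = (depth - depth_prior).abs().mean()
    l_normal = (1 - (normal * normal_prior).sum(dim=-1)).mean()
    return l_depth + l_normal

pts = torch.randn(100, 3) * torch.tensor([1.0, 1.0, 0.01])     # 近似平面分布的点
print(planar_consistency_loss(pts))                            # 数值应接近 0
```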
[CV-69] urboPortrait3D: Single-step diffusion-based fast portrait novel-view synthesis
【速读】:该论文旨在解决现有图像到3D(image-to-3D)方法在生成人脸肖像时存在的视觉伪影、细节缺失以及身份信息保留不足的问题,同时克服扩散模型(diffusion models)虽能生成高质量图像但缺乏3D一致性且计算开销大的局限。解决方案的关键在于提出TurboPortrait3D:通过一个前馈的图像到虚拟形象(image-to-avatar)生成流程获得初始3D表示及噪声渲染结果,随后利用单步扩散模型对这些渲染图进行多视角一致性的精细化处理,该扩散模型以输入图像为条件并采用创新的训练策略——先在大规模合成多视角数据上预训练,再在高质量真实图像上微调,从而在保持3D感知能力的同时实现低延迟、高保真的人脸新视角合成。
链接: https://arxiv.org/abs/2510.23929
作者: Emily Kim,Julieta Martinez,Timur Bagautdinov,Jessica Hodgins
机构: Carnegie Mellon University (卡内基梅隆大学); Meta Reality Labs (Meta现实实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack of detail, and tend to fail at fully preserving the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, are not grounded in 3D and thus are not directly capable of producing multi-view consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D-awareness and running with low-latency. Our method takes a single frontal image of a subject as input, and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model which is conditioned on input image(s), and is specifically trained to refine the renders in a multi-view consistent way. Moreover, we introduce a novel effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms current state-of-the-art for portrait novel-view synthesis, while being efficient in time.
zh
[CV-70] Adaptive Keyframe Selection for Scalable 3D Scene Reconstruction in Dynamic Environments
【速读】:该论文旨在解决动态环境中3D场景重建中的关键数据瓶颈问题,即如何从高频率的视频流中高效选择最具信息量的关键帧(keyframe),以提升重建质量并降低计算负担。解决方案的核心在于提出一种自适应关键帧选择方法,其关键创新在于集成两个互补模块:一是基于光度误差和结构相似性(SSIM)的误差驱动选择模块,用于量化帧间差异;二是基于动量(momentum)的更新模块,能够根据场景运动动态调整关键帧选择阈值,从而实现对复杂动态场景的实时响应。该方法显著优于传统的固定时间间隔或均匀跳帧策略,在Spann3r与CUT3R等先进3D重建网络上均实现了稳定且一致的质量提升。
链接: https://arxiv.org/abs/2510.23928
作者: Raman Jha,Yang Zhou,Giuseppe Loianno
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review for ROBOVIS 2026
Abstract:In this paper, we propose an adaptive keyframe selection method for improved 3D scene reconstruction in dynamic environments. The proposed method integrates two complementary modules: an error-based selection module utilizing photometric and structural similarity (SSIM) errors, and a momentum-based update module that dynamically adjusts keyframe selection thresholds according to scene motion dynamics. By dynamically curating the most informative frames, our approach addresses a key data bottleneck in real-time perception. This allows for the creation of high-quality 3D world representations from a compressed data stream, a critical step towards scalable robot learning and deployment in complex, dynamic environments. Experimental results demonstrate significant improvements over traditional static keyframe selection strategies, such as fixed temporal intervals or uniform frame skipping. These findings highlight a meaningful advancement toward adaptive perception systems that can dynamically respond to complex and evolving visual scenes. We evaluate our proposed adaptive keyframe selection module on two recent state-of-the-art 3D reconstruction networks, Spann3r and CUT3R, and observe consistent improvements in reconstruction quality across both frameworks. Furthermore, an extensive ablation study confirms the effectiveness of each individual component in our method, underlining their contribution to the overall performance gains.
zh
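误差驱动选择与动量阈值更新的可运行草图(编者补充:SSIM 取整幅图的简化版本而非逐窗口计算,阈值更新规则为编者假设):

```python
import numpy as np

def simple_ssim(a, b, c1=1e-4, c2=9e-4):
    """整幅图像的简化 SSIM(非逐窗口),仅用于示意。"""
    ma, mb, va, vb = a.mean(), b.mean(), a.var(), b.var()
    cov = ((a - ma) * (b - mb)).mean()
    return ((2 * ma * mb + c1) * (2 * cov + c2)) / ((ma**2 + mb**2 + c1) * (va + vb + c2))

class KeyframeSelector:
    """光度误差 + SSIM 的误差驱动选择,阈值随场景动态以动量方式自适应更新。"""
    def __init__(self, thresh=0.3, momentum=0.9):
        self.thresh, self.momentum, self.prev = thresh, momentum, None

    def step(self, frame):
        if self.prev is None:
            self.prev = frame
            return True
        err = np.abs(frame - self.prev).mean() + (1 - simple_ssim(frame, self.prev))
        self.thresh = self.momentum * self.thresh + (1 - self.momentum) * err  # 动量更新
        if err > self.thresh:
            self.prev = frame
            return True
        return False

sel = KeyframeSelector()
keep = [sel.step(np.random.rand(32, 32)) for _ in range(10)]
```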
[CV-71] DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning AAAI
【速读】:该论文旨在解决 instructional videos 中场景级描述(scene-level captioning)生成过程中因缺乏对时序结构理解而导致的caption coherence和质量不足的问题,从而影响学习者对操作流程的理解与技能获取。其核心解决方案是提出DynaStride框架,关键在于通过自适应帧采样(adaptive frame sampling)与多模态窗口机制(multimodal windowing)捕捉场景内的关键转换,并结合多模态思维链(multimodal chain-of-thought)生成多个动作-对象对,再利用动态步长窗口选择算法(dynamic stride window selection algorithm)在时间上下文与冗余之间实现自适应平衡,最终融合为一个整合视觉语义与时序推理的连贯场景级指令性描述。
链接: https://arxiv.org/abs/2510.23907
作者: Eddison Pham,Prisha Priyadarshini,Adrian Maliackel,Kanishk Bandi,Cristian Meo,Kevin Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 15 figures, 5 Tables, submitted to AAAI AI4ED Workshop 2026
Abstract:Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video’s educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset’s scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.
zh
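“动态步长窗口选择在时间上下文与冗余之间自适应平衡”的一个简化示意(评分函数、候选步长与权重 alpha 均为编者假设,并非论文算法本身):

```python
import numpy as np

def select_stride_window(pair_embeddings, strides=(1, 2, 3), alpha=0.5):
    """对每个候选步长计算“上下文覆盖 - 相邻冗余”得分,取最高者。"""
    best, best_score = strides[0], -np.inf
    n = len(pair_embeddings)
    for s in strides:
        window = pair_embeddings[::s]
        coverage = len(window) / n                     # 保留的动作-对象对比例 ≈ 时间上下文
        sims = [float(np.dot(a, b)) for a, b in zip(window, window[1:])]
        redundancy = sum(sims) / max(len(sims), 1)     # 相邻对相似度 ≈ 冗余
        score = alpha * coverage - (1 - alpha) * redundancy
        if score > best_score:
            best, best_score = s, score
    return best

pairs = [np.random.rand(8) for _ in range(12)]
print(select_stride_window(pairs))
```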
[CV-72] Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
【速读】:该论文旨在解决CLIP模型在扩展至语义分割任务时面临的挑战,即其图像级预训练目标与像素级视觉理解需求之间的不匹配问题。现有方法虽通过重构最后一层和特征实现了一定效果,但往往继承了前层的全局对齐偏差,导致分割性能受限。解决方案的关键在于提出一种无需训练的框架LHT-CLIP,系统性地利用CLIP模型在层(layer)、注意力头(head)和token三个维度上的视觉判别能力差异。具体而言,研究揭示了三个关键发现:(i)最后几层强化了图像-文本对齐但牺牲了视觉判别力;(ii)部分注意力头在不同数据集上保持强视觉判别能力;(iii)异常token具有稀疏且一致的激活模式。基于此,提出三种互补技术:语义-空间重加权、选择性头增强和异常token替换,有效恢复视觉判别力并提升分割性能,无需额外训练、辅助预训练网络或复杂超参数调优。
链接: https://arxiv.org/abs/2510.23894
作者: Jinxin Zhou,Jiachen Jiang,Zhihui Zhu
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 10 figures, 14 tables
Abstract:Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment with sacrifice of visual discriminability (e.g., last 3 layers in ViT-B/16 and 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation pattern compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
zh
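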
[CV-73] RELLISWorld: Training-Free World Generation from Object Generators
【速读】:该论文旨在解决当前文本驱动的3D场景生成方法中存在的局限性,包括仅支持单物体生成、依赖特定领域训练数据以及无法实现全360度视角可视化的缺陷。其解决方案的关键在于提出一种无需训练的方法,通过将通用文本到3D对象扩散模型(text-to-3D object diffusion models)重新用作模块化瓦片(modular tile)生成器,将场景生成重构为一个多瓦片去噪问题;具体而言,通过独立生成重叠的3D区域并利用加权平均实现无缝融合,从而在不依赖场景级数据或重新训练的前提下,实现大规模、语义一致且局部可控的3D场景合成。
链接: https://arxiv.org/abs/2510.23880
作者: Hanke Chen,Yuan Liu,Minchen Li
机构: Carnegie Mellon University (卡内基梅隆大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
zh
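“重叠区域独立生成 + 加权平均无缝融合”的核心操作可以用二维网格示意(编者草图,以 2D 数组代替 3D 体素;渐变权重的具体形状为编者假设):

```python
import numpy as np

def blend_tiles(tiles, positions, tile_size, world_size):
    """重叠瓦片的加权平均融合:中心权重高、边缘低,使接缝平滑。"""
    acc, wsum = np.zeros(world_size), np.zeros(world_size)
    ramp = np.clip(1 - np.abs(np.linspace(-1, 1, tile_size)), 0.05, None)
    w2d = np.outer(ramp, ramp)
    for tile, (r, c) in zip(tiles, positions):
        acc[r:r + tile_size, c:c + tile_size] += w2d * tile
        wsum[r:r + tile_size, c:c + tile_size] += w2d
    return acc / np.maximum(wsum, 1e-8)

tiles = [np.random.rand(16, 16) for _ in range(4)]
world = blend_tiles(tiles, [(0, 0), (0, 8), (8, 0), (8, 8)], 16, (24, 24))
```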
[CV-74] RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features
【速读】:该论文旨在解决遥感图像超分辨率(Super-resolution, SR)在分布外(out-of-distribution, OOD)条件下性能下降的问题,尤其针对罕见地貌特征或不同传感器采集数据时出现的视觉上合理但物理不准确的生成结果。其解决方案的关键在于提出 RareFlow 框架,该框架采用双条件控制架构:一是通过门控控制网络(Gated ControlNet)保留低分辨率输入中的细粒度几何保真度,二是利用文本提示提供语义引导以合成复杂结构;同时引入多维度损失函数确保输出在光谱和辐射特性上符合传感器物理属性,并通过随机前向传播机制量化预测不确定性,从而识别未知输入并减少特征幻觉现象。
链接: https://arxiv.org/abs/2510.23816
作者: Forouzan Fallah,Wenwen Li,Chia-Yu Hsu,Hyunho Lee,Yezhou Yang
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow’s core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model’s outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.
zh
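摘要中“随机前向传播量化预测不确定性”的通用做法示意(编者假设采用 MC-dropout 式近似:多次随机前向取输出方差,方差高者判为陌生输入;网络与阈值均为示意值):

```python
import torch

def predictive_uncertainty(model, x, n_samples=8):
    """多次随机前向传播,以输出方差作为不确定性度量。"""
    model.train()  # 保持 dropout 等随机性开启(MC-dropout 式近似,为编者假设)
    with torch.no_grad():
        outs = torch.stack([model(x) for _ in range(n_samples)])
    return outs.mean(0), outs.var(0)

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.Dropout(0.2), torch.nn.Linear(32, 1))
mean, var = predictive_uncertainty(model, torch.randn(4, 16))
is_ood = var.mean(dim=1) > 0.05   # 方差超阈值则视为陌生输入,阈值仅为示意
```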
[CV-75] Why Foundation Models in Pathology Are Failing
【速读】:该论文试图解决当前生成式 AI(Generative AI)在计算病理学领域应用中存在的根本性局限问题,即现有基础模型(Foundation Models, FMs)在癌症诊断、预后判断和多模态检索等任务中表现出的低诊断准确性、鲁棒性差、几何不稳定性、高计算开销及安全漏洞等问题。其核心解决方案的关键在于识别出七项相互关联的根本原因:生物复杂性、无效的自监督学习、过度泛化、架构复杂度过高、缺乏领域特定创新、数据不足以及与组织切片尺寸相关的根本设计缺陷。这些发现表明,当前病理基础模型在概念上仍与组织形态学的本质特征不匹配,亟需对建模范式进行根本性重构。
链接: https://arxiv.org/abs/2510.23807
作者: Hamid R. Tizhoosh
机构: Kimia Lab, Mayo Clinic (梅奥诊所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.
zh
[CV-76] A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras
【速读】:该论文旨在解决河流中漂浮的人为垃圾(floating anthropogenic debris)监测难题,此类垃圾对生物多样性、水质及人类活动构成严重威胁。解决方案的关键在于提出了一种基于固定式现场摄像头的新型方法框架,结合深度学习模型实现对漂浮物的持续量化与监测,并通过几何建模利用相机的内参和外参信息,从二维图像中估计物体的实际尺寸。研究进一步验证了数据集构建协议(特别是负样本整合与时间泄漏规避)对模型性能的重要性,证明了基于投影几何与回归校正相结合的度量对象估算方法的可行性,为开发低成本、自动化的城市水体环境监测系统提供了技术路径。
链接: https://arxiv.org/abs/2510.23798
作者: Gauthier Grimmer,Romain Wenger,Clément Flint,Germain Forestier,Gilles Rixhon,Valentin Chardon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring the aforementioned waste, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning, and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.
zh
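利用相机内外参从二维图像估计物体实际尺寸的几何模型,可用针孔反投影到水面平面 z=0 来示意(编者草图;示例中的内参、相机高度与朝向均为假设值,且未包含论文中的回归校正):

```python
import numpy as np

def pixel_to_water_plane(u, v, K, R, t):
    """针孔模型:把像素 (u, v) 反投影到水面平面 z=0,返回世界坐标。"""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # 相机坐标系下的视线
    ray_w = R.T @ ray_cam                                # 旋转到世界坐标系
    origin = -R.T @ t                                    # 相机中心的世界坐标
    s = -origin[2] / ray_w[2]                            # 与 z=0 平面的交点参数
    return origin + s * ray_w

def object_width_m(u1, u2, v, K, R, t):
    """由物体左右端点像素估计其实际宽度(单位:米)。"""
    p1 = pixel_to_water_plane(u1, v, K, R, t)
    p2 = pixel_to_water_plane(u2, v, K, R, t)
    return float(np.linalg.norm(p1 - p2))

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1.0]])  # 假设的内参
R = np.diag([1.0, -1.0, -1.0])                                 # 相机垂直向下俯视
t = np.array([0.0, 0.0, 5.0])                                  # 相机位于水面上方 5 m
print(object_width_m(300, 340, 240, K, R, t))                  # ≈ 0.25 m
```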
[CV-77] CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting CEC
【速读】:该论文旨在解决现有计数模型在面对复杂形状、内部对称或重叠结构的对象时,因依赖类别识别而难以准确计数的问题(即缺乏类无关的结构感知能力)。其解决方案的关键在于提出CountFormer框架,该框架基于CounTR架构,用自监督预训练的基础模型DINOv2替代原有视觉编码器,以获取更丰富且空间一致的特征表示,并引入位置嵌入融合机制以保留几何关系,再通过轻量级卷积解码器生成密度图。这一设计使模型能够模仿人类通过感知重复性和结构一致性进行计数的能力,从而在结构复杂的密集场景中实现更高精度的类无关计数。
链接: https://arxiv.org/abs/2510.23785
作者: Md Tanvir Hossain,Akif Islam,Mohd Ruhul Ameen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 tables, 6 figures. Submitted to IEEE 5th International Conference on Electrical, Computer and Telecommunication Engineering (ICECTE 2025)
Abstract:Humans can effortlessly count diverse objects by perceiving visual repetition and structural relationships rather than relying on class identity. However, most existing counting models fail to replicate this ability; they often miscount when objects exhibit complex shapes, internal symmetry, or overlapping components. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting. Built upon the CounTR architecture, our model replaces its visual encoder with the self-supervised foundation model DINOv2, which produces richer and spatially consistent feature representations. We further incorporate positional embedding fusion to preserve geometric relationships before decoding these features into density maps through a lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model achieves performance comparable to current state-of-the-art methods while demonstrating superior accuracy on structurally intricate or densely packed scenes. Our findings indicate that integrating foundation models such as DINOv2 enables counting systems to approach human-like structural perception, advancing toward a truly general and exemplar-free counting paradigm.
zh
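“特征图 → 轻量卷积解码器 → 密度图,计数取密度图求和”这一流水线的最小示意(编者补充;通道数 768 对应 DINOv2 ViT-B 仅为占位假设,解码器结构并非论文原版):

```python
import torch

class DensityDecoder(torch.nn.Module):
    """轻量卷积解码器:将特征图解码为非负密度图。"""
    def __init__(self, in_ch=768):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(in_ch, 64, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(64, 1, 1), torch.nn.ReLU())

    def forward(self, feats):
        return self.net(feats)

feats = torch.randn(1, 768, 16, 16)   # 占位:由 DINOv2 patch 特征重排成的特征图
density = DensityDecoder()(feats)
count = density.sum().item()          # 对密度图求和即得计数
```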
[CV-78] Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成图像日益逼真所带来的视觉真实性验证难题,尤其针对低分辨率(32×32)图像中难以察觉的伪造痕迹进行检测与解释。其解决方案的关键在于构建一个可解释的图像真实性检测系统,该系统融合了一个轻量级卷积分类器(“Faster-Than-Lies”)与一个视觉语言模型(Qwen2-VL-7B),实现了对图像伪造的分类、定位和文本解释一体化处理;通过自编码器重构误差图生成伪影定位热力图,显著提升人类与模型对异常区域的理解能力,并将70种视觉伪影归类为8个语义类别,从而实现跨域适用性,如法证分析、工业质检及社交媒体内容审核等场景。
链接: https://arxiv.org/abs/2510.23775
作者: Aryan Mathur,Asaduddin Ahmed,Pushti Amit Vasoya,Simeon Kandan Sonar,Yasir Z,Madesh Kuppusamy
机构: Indian Institute of Technology Palakkad (印度理工学院帕拉克卡德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity. We present an explainable image authenticity detection system that combines a lightweight convolutional classifier (“Faster-Than-Lies”) with a Vision-Language Model (Qwen2-VL-7B) to classify, localize, and explain artifacts in 32x32 images. Our model achieves 96.5% accuracy on the extended CiFAKE dataset augmented with adversarial perturbations and maintains an inference time of 175ms on 8-core CPUs, enabling deployment on local or edge devices. Using autoencoder-based reconstruction error maps, we generate artifact localization heatmaps, which enhance interpretability for both humans and the VLM. We further categorize 70 visual artifact types into eight semantic groups and demonstrate explainable text generation for each detected anomaly. This work highlights the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery and outlines potential cross-domain applications in forensics, industrial inspection, and social media moderation.
zh
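基于自编码器重构误差生成伪影定位热力图的做法示意(编者草图,自编码器为任意极简结构,并非论文所用模型):

```python
import torch

class TinyAE(torch.nn.Module):
    """极简自编码器,仅用于演示重构误差热力图的计算方式。"""
    def __init__(self):
        super().__init__()
        self.enc = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, 2, 1), torch.nn.ReLU())
        self.dec = torch.nn.Sequential(torch.nn.ConvTranspose2d(8, 3, 4, 2, 1), torch.nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

x = torch.rand(1, 3, 32, 32)
recon = TinyAE()(x)
heatmap = (x - recon).pow(2).mean(dim=1, keepdim=True)  # 逐像素重构误差 → 伪影定位热力图
```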
[CV-79] Quanvolutional Neural Networks for Pneumonia Detection: An Efficient Quantum-Assisted Feature Extraction Paradigm
【速读】:该论文旨在解决传统卷积神经网络(Convolutional Neural Networks, CNNs)在肺炎医学图像识别中面临的高计算成本、特征表示能力有限以及小样本数据下泛化性能差的问题。其解决方案的关键在于引入量子卷积神经网络(Quanvolutional Neural Networks, QNNs),利用量子计算的优势进行特征提取:具体而言,通过参数化量子电路(Parameterized Quantum Circuit, PQC)处理2×2图像块,采用Y旋转门(rotational Y-gates)实现数据编码,并结合纠缠层生成非经典特征表示,随后将这些量子提取的特征输入经典神经网络完成分类任务。实验表明,该混合量子-经典模型在PneumoniaMNIST数据集上验证准确率达到83.33%,显著优于对比的经典CNN模型(73.33%),展现出更高的收敛速度和样本效率,为量子计算赋能深度学习驱动的医疗诊断系统提供了可行路径。
链接: https://arxiv.org/abs/2510.23660
作者: Gazi Tanbhir,Md. Farhan Shahriyar,Abdullah Md Raihan Chy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pneumonia poses a significant global health challenge, demanding accurate and timely diagnosis. While deep learning, particularly Convolutional Neural Networks (CNNs), has shown promise in medical image analysis for pneumonia detection, CNNs often suffer from high computational costs, limitations in feature representation, and challenges in generalizing from smaller datasets. To address these limitations, we explore the application of Quanvolutional Neural Networks (QNNs), leveraging quantum computing for enhanced feature extraction. This paper introduces a novel hybrid quantum-classical model for pneumonia detection using the PneumoniaMNIST dataset. Our approach utilizes a quanvolutional layer with a parameterized quantum circuit (PQC) to process 2x2 image patches, employing rotational Y-gates for data encoding and entangling layers to generate non-classical feature representations. These quantum-extracted features are then fed into a classical neural network for classification. Experimental results demonstrate that the proposed QNN achieves a higher validation accuracy of 83.33 percent compared to a comparable classical CNN which achieves 73.33 percent. This enhanced convergence and sample efficiency highlight the potential of QNNs for medical image analysis, particularly in scenarios with limited labeled data. This research lays the foundation for integrating quantum computing into deep-learning-driven medical diagnostic systems, offering a computationally efficient alternative to traditional approaches.
zh
[CV-80] Quantum Machine Learning for Image Classification: A Hybrid Model of Residual Network with Quantum Support Vector Machine
【速读】:该论文旨在解决高维复杂图像数据在植物病害分类任务中面临的分类效率与精度瓶颈问题,尤其针对传统机器学习(如支持向量机 SVM 和随机森林 RF)和深度学习模型在处理马铃薯病害图像时表现受限的挑战。其解决方案的关键在于构建一种混合量子-经典架构:首先利用 ResNet-50 提取 RGB 图像的深层特征表示,随后通过主成分分析(PCA)进行降维,再将压缩后的特征输入到量子支持向量机(QSVM)中,借助 ZZ、Z 和 Pauli-X 等量子特征映射将经典数据编码为量子态以增强判别能力。实验表明,基于 Z 特征映射的 QSVM 在五折分层交叉验证下达到 99.23% 的准确率,显著优于传统方法,凸显了量子计算在提升图像分类性能方面的潜力。
链接: https://arxiv.org/abs/2510.23659
作者: Md. Farhan Shahriyar,Gazi Tanbhir,Abdullah Md Raihan Chy
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
Abstract:Recently, there has been growing attention on combining quantum machine learning (QML) with classical deep learning approaches, as computational techniques are key to improving the performance of image classification tasks. This study presents a hybrid approach that uses ResNet-50 (Residual Network) for feature extraction and Quantum Support Vector Machines (QSVM) for classification in the context of potato disease detection. Classical machine learning as well as deep learning models often struggle with high-dimensional and complex datasets, necessitating advanced techniques like quantum computing to improve classification efficiency. In our research, we use ResNet-50 to extract deep feature representations from RGB images of potato diseases. These features are then subjected to dimensionality reduction using Principal Component Analysis (PCA). The resulting features are processed through QSVM models which apply various quantum feature maps such as ZZ, Z, and Pauli-X to transform classical data into quantum states. To assess the model performance, we compared it with classical machine learning algorithms such as Support Vector Machine (SVM) and Random Forest (RF) using five-fold stratified cross-validation for comprehensive evaluation. The experimental results demonstrate that the Z-feature map-based QSVM outperforms classical models, achieving an accuracy of 99.23 percent, surpassing both SVM and RF models. This research highlights the advantages of integrating quantum computing into image classification and provides a potential disease detection solution through hybrid quantum-classical modeling.
zh
[CV-81] Noise is All You Need: Solving Linear Inverse Problems by Noise Combination Sampling with Diffusion Models
【速读】:该论文旨在解决预训练扩散模型在零样本逆问题求解中面临的内在矛盾:即观测信息的过度融合会破坏生成过程,而融合不足则无法有效施加逆问题约束。其解决方案的关键在于提出**噪声组合采样(Noise Combination Sampling)**方法,通过从噪声子空间中合成最优噪声向量来近似测量得分(measurement score),从而替代标准去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM)中的噪声项,使条件信息自然嵌入生成过程,无需逐步超参数调优。该方法在图像压缩等各类逆问题求解中具有广泛适用性,尤其在生成步数 T 较小时表现优异,且计算开销可忽略不计,显著提升了鲁棒性和稳定性。
链接: https://arxiv.org/abs/2510.23633
作者: Xun Su,Hiroyuki Kasai
机构: WASEDA University (早稻田大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 9 pages
Abstract:Pretrained diffusion models have demonstrated strong capabilities in zero-shot inverse problem solving by incorporating observation information into the generation process of the diffusion models. However, this presents an inherent dilemma: excessive integration can disrupt the generative process, while insufficient integration fails to emphasize the constraints imposed by the inverse problem. To address this, we propose Noise Combination Sampling, a novel method that synthesizes an optimal noise vector from a noise subspace to approximate the measurement score, replacing the noise term in the standard Denoising Diffusion Probabilistic Models process. This enables conditional information to be naturally embedded into the generation process without reliance on step-wise hyperparameter tuning. Our method can be applied to a wide range of inverse problem solvers, including image compression, and, particularly when the number of generation steps T is small, achieves superior performance with negligible computational overhead, significantly improving robustness and stability.
zh
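“从噪声子空间合成最优噪声向量以近似测量得分”的一种直接实现思路是最小二乘组合(组合方式为编者假设,论文的具体求解形式以原文为准):

```python
import numpy as np

def combine_noise(noise_bank, measurement_score):
    """在噪声子空间内做最小二乘,求最接近测量得分的噪声组合。"""
    B = np.stack([n.ravel() for n in noise_bank], axis=1)   # 每列一个候选噪声向量
    coef, *_ = np.linalg.lstsq(B, measurement_score.ravel(), rcond=None)
    return (B @ coef).reshape(measurement_score.shape)

noise_bank = [np.random.randn(8, 8) for _ in range(4)]      # 噪声子空间的基
score = np.random.randn(8, 8)                               # 测量得分的占位
eps = combine_noise(noise_bank, score)                      # 用于替代 DDPM 采样步中的噪声项
```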
[CV-82] Energy Efficient Exact and Approximate Systolic Array Architecture for Matrix Multiplication
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在复杂计算中对高效矩阵乘法引擎的迫切需求,特别是在边缘计算场景下如何实现高能效与可接受输出质量之间的平衡。解决方案的关键在于提出一种新型脉动阵列(Systolic Array, SA)架构,其中嵌入了精确和近似处理单元(Processing Elements, PEs),这些PE基于能量高效的正部分积单元(Positive Partial Product Cell, PPC)和负部分积单元(Negative Partial Product Cell, NPPC)设计而成。该设计在8×8脉动阵列中实现了22%和32%的能耗降低,同时在离散余弦变换(DCT)和边缘检测卷积任务中分别保持了38.21 dB和30.45 dB的峰值信噪比(PSNR),验证了其在误差容限图像与视觉处理应用中的高能效潜力。
链接: https://arxiv.org/abs/2509.00778
作者: Pragun Jaswal,L.Hemanth Krishna,B. Srinivasu
机构: Indian Institute of Technology, Mandi (印度理工学院曼迪分校)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to 39th International Conference on VLSI Design, 2026
Abstract:Deep Neural Networks (DNNs) require highly efficient matrix multiplication engines for complex computations. This paper presents a systolic array architecture incorporating novel exact and approximate processing elements (PEs), designed using energy-efficient positive partial product and negative partial product cells, termed as PPC and NPPC, respectively. The proposed 8-bit exact and approximate PE designs are employed in an 8x8 systolic array, which achieves energy savings of 22% and 32%, respectively, compared to the existing design. To demonstrate their effectiveness, the proposed PEs are integrated into a systolic array (SA) for Discrete Cosine Transform (DCT) computation, achieving high output quality with a PSNR of 38.21 dB. Furthermore, in an edge detection application using convolution, the approximate PE achieves a PSNR of 30.45 dB. These results highlight the potential of the proposed design to deliver significant energy efficiency while maintaining competitive output quality, making it well-suited for error-resilient image and vision processing applications.
zh
[CV-83] Listening without Looking: Modality Bias in Audio-Visual Captioning
【速读】:该论文旨在解决当前音频-视觉描述生成模型中模态互补性不明确以及对单一模态退化缺乏鲁棒性的问题。其关键解决方案是通过系统性的模态鲁棒性测试,量化分析LAVCap模型在音频或视觉流被抑制或破坏时的表现,从而揭示其对音频模态的显著偏向;进一步地,作者构建了AudioVisualCaps数据集,该数据集包含同时描述音频与视觉内容的文本标注,用于训练更平衡的多模态模型,并验证了在该数据集上训练后的LAVCap模型表现出更低的模态偏差。
链接: https://arxiv.org/abs/2510.24024
作者: Yuchi Ishikawa,Toranosuke Manabe,Tatsuya Komatsu,Yoshimitsu Aoki
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: under review
Abstract:Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
zh
人工智能
[AI-0] Greedy Sampling Is Provably Efficient for RLHF NEURIPS2025
【速读】:该论文旨在解决强化学习中基于人类反馈(Reinforcement Learning from Human Feedback, RLHF)的理论理解不足问题,尤其是在仅通过偏好反馈(preference feedback)学习KL正则化目标时所面临的挑战。现有方法主要基于奖励驱动的Bradley-Terry(BT)偏好模型,并依赖乐观或悲观估计策略设计算法,但其性能保证存在局限。本文的关键突破在于:首次在更通用的偏好模型框架下建立了性能保证,且实现了阶次意义上的显著改进;更重要的是,其解决方案并不依赖传统的乐观或悲观估计构造,而是直接使用经验估计(即贪婪采样),这一发现源于对KL正则化目标下最优策略类独特结构特性的深刻洞察,并进一步在BT模型场景中验证了贪婪采样的充分性,颠覆了以往依赖探索性估计的设计范式。
链接: https://arxiv.org/abs/2510.24700
作者: Di Wu,Chengshuai Shi,Jing Yang,Cong Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
备注: NeurIPS 2025
Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
zh
[AI-1] Bridging Tool Dependencies and Domain Knowledge: A Graph-Based Framework for In-Context Planning NEURIPS2025
【速读】:该论文旨在解决工具(tool)与文档(document)之间依赖关系未被充分挖掘和利用的问题,从而提升示例性任务规划(exemplar artifact generation)的质量。其解决方案的关键在于构建一个融合工具知识图谱与文档知识图谱的统一框架:首先基于工具schema(包括描述、参数和输出payload)构建工具知识图谱,同时从内部文档和标准操作程序(SOPs)中提取互补的知识图谱,并将其融合;随后采用深度稀疏集成策略,将工具结构依赖与过程性知识对齐,实现更精准的计划生成。实验表明,该方法能有效建模工具交互并提升规划性能,验证了连接工具图谱与领域知识图谱在增强工具辅助推理与规划中的价值。
链接: https://arxiv.org/abs/2510.24690
作者: Shengjie Liu,Li Dong,Zhenyu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, short paper, NeurIPS 2025 workshop on Bridging Language, Agent, and World Models for Reasoning and Planning
Abstract:We present a framework for uncovering and exploiting dependencies among tools and documents to enhance exemplar artifact generation. Our method begins by constructing a tool knowledge graph from tool schemas, including descriptions, arguments, and output payloads, using a DeepResearch-inspired analysis. In parallel, we derive a complementary knowledge graph from internal documents and SOPs, which is then fused with the tool graph. To generate exemplar plans, we adopt a deep-sparse integration strategy that aligns structural tool dependencies with procedural knowledge. Experiments demonstrate that this unified framework effectively models tool interactions and improves plan generation, underscoring the benefits of linking tool graphs with domain knowledge graphs for tool-augmented reasoning and planning.
zh
[AI-2] Learning to Drive Safely with Hybrid Options
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在高速公路自动驾驶任务中缺乏结构化、可解释且安全可控的层次化控制策略的问题。现有方法大多直接作用于连续动作空间,难以有效整合领域先验知识并保障驾驶行为的安全性与舒适性。解决方案的关键在于引入选项(Options)框架,将纵向和横向控制分解为独立的子任务,并设计嵌入安全性与舒适性约束的专用选项(Skills),从而实现层次化决策;进一步提出基于混合选项(Hybrid Options)的灵活策略,通过分别选择纵向和横向动作,在保持人类驾驶员级别的表达能力与适应性的同时,显著提升策略的可解释性和鲁棒性,尤其在复杂多变的交通条件下表现优于传统基于连续动作的基线策略。
链接: https://arxiv.org/abs/2510.24674
作者: Bram De Cooman,Johan Suykens
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Out of the many deep reinforcement learning approaches for autonomous driving, only few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in specific. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.
zh
[AI-3] Multi-Agent Scenario Generation in Roundabouts with a Transformer-enhanced Conditional Variational Autoencoder
【速读】:该论文旨在解决智能驾驶功能在复杂多车交互场景(如环形交叉口)中验证难、数据获取成本高及边缘案例覆盖不足的问题。当前基于传统道路测试的方法难以高效应对高动态性和复杂布局下的多智能体交互场景,而现有虚拟测试方法在生成真实且多样化的交通场景方面仍存在局限。解决方案的关键在于提出一种基于Transformer增强的条件变分自编码器(CVAE-T)模型,通过引入Transformer结构提升对时序依赖关系的建模能力,并利用潜在空间的初步解耦特性实现对车辆进入时间、出口时机和速度分布等关键场景属性的有效控制与生成,从而支持智能驾驶系统在多车交互环境中的功能验证与数据增强。
链接: https://arxiv.org/abs/2510.24671
作者: Li Li,Tobias Brinkmann,Till Temmen,Markus Eisenbarth,Jakob Andert
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:With the increasing integration of intelligent driving functions into serial-produced vehicles, ensuring their functionality and robustness poses greater challenges. Compared to traditional road testing, scenario-based virtual testing offers significant advantages in terms of time and cost efficiency, reproducibility, and exploration of edge cases. We propose a Transformer-enhanced Conditional Variational Autoencoder (CVAE-T) model for generating multi-agent traffic scenarios in roundabouts, which are characterized by high vehicle dynamics and complex layouts, yet remain relatively underexplored in current research. The results show that the proposed model can accurately reconstruct original scenarios and generate realistic, diverse synthetic scenarios. In addition, two Key Performance Indicators (KPIs) are employed to evaluate the interactive behavior in the generated scenarios. Analysis of the latent space reveals partial disentanglement, with several latent dimensions exhibiting distinct and interpretable effects on scenario attributes such as vehicle entry timing, exit timing, and velocity profiles. The results demonstrate the model's capability to generate scenarios for the validation of intelligent driving functions involving multi-agent interactions, as well as to augment data for their development and iterative improvement.
zh
[AI-4] OrchDAG: Complex Tool Orchestration in Multi-Turn Interactions with Plan DAGs
【速读】:该论文旨在解决多轮工具交互(multi-turn tool interactions)在代理式工具调用(agentic tool calling)中因复杂性被忽视的问题。现有研究往往忽略工具调用之间的依赖关系与执行顺序,导致模型难以应对实际场景中的复杂任务流。其解决方案的关键在于提出 OrchDAG——一个将工具执行建模为有向无环图(Directed Acyclic Graph, DAG)的合成数据生成流水线,通过控制图的拓扑复杂度来构建具有挑战性的基准数据集,并进一步设计基于图结构的奖励机制以增强强化学习与价值回归(RLVR)训练效果,从而提升模型对多轮工具使用中任务依赖性和执行顺序的理解能力。
链接: https://arxiv.org/abs/2510.24663
作者: Yifu Lu,Shengjie Liu,Li Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:Agentic tool use has gained traction with the rise of agentic tool calling, yet most existing work overlooks the complexity of multi-turn tool interactions. We introduce OrchDAG, a synthetic data generation pipeline that models tool execution as directed acyclic graphs (DAGs) with controllable complexity. Using this dataset, we benchmark model performance and propose a graph-based reward to enhance RLVR training. Experiments show that the dataset presents a challenging but solvable benchmark, and the proposed reward is effective when combined with GRPO-style algorithms, highlighting the importance of leveraging topological structure and data complexity in multi-turn tool use.
zh
[AI-5] Advancing site-specific disease and pest management in precision agriculture: From reasoning -driven foundation models to adaptive feedback-based learning
【速读】:该论文旨在解决作物病害精准管理(Site-specific Disease Management, SSDM)中传统方法依赖人工特征提取、缺乏多模态数据融合与交互式决策支持的局限性。其解决方案的关键在于引入基础模型(Foundation Models, FMs),特别是视觉-语言模型(Vision-Language Models, VLMs)和大型语言模型(Large-Language Models, LLMs),通过整合视觉与文本信息实现症状识别、因果推理及人机交互问答能力,并结合强化学习(Reinforcement Learning, RL)与数字孪生(Digital Twin)框架,提升田间靶向喷洒的智能化水平。此外,论文强调弥合仿真到现实(sim-to-real)差距以及发展人机协同机制是推动下一代SSDM落地的核心挑战与突破口。
链接: https://arxiv.org/abs/2510.24650
作者: Nitin Rai,Daeun(Dana)Choi,Nathan S. Boyd,Arnold W. Schumann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 8 figures, and 2 tables
Abstract:Site-specific disease management (SSDM) in crops has advanced rapidly through machine and deep learning (ML and DL) for real-time computer vision. Research evolved from handcrafted feature extraction to large-scale automated feature learning. With foundation models (FMs), crop disease datasets are now processed in fundamentally new ways. Unlike traditional neural networks, FMs integrate visual and textual data, interpret symptoms in text, reason about symptom-management relationships, and support interactive QA for growers and educators. Adaptive and imitation learning in robotics further enables field-based disease management. This review screened approx. 40 articles on FM applications for SSDM, focusing on large-language models (LLMs) and vision-language models (VLMs), and discussing their role in adaptive learning (AL), reinforcement learning (RL), and digital twin frameworks for targeted spraying. Key findings: (a) FMs are gaining traction with surging literature in 2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL and AL are still nascent for smart spraying; (d) digital twins with RL can simulate targeted spraying virtually; (e) addressing the sim-to-real gap is critical for real-world deployment; (f) human-robot collaboration remains limited, especially in human-in-the-loop approaches where robots detect early symptoms and humans validate uncertain cases; (g) multi-modal FMs with real-time feedback will drive next-gen SSDM. For updates, resources, and contributions, visit this https URL to submit papers, code, or datasets.
zh
[AI-6] FunReason -MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling
【速读】:该论文旨在解决生成高质量、多轮次工具调用(Function Calling, FC)训练数据的难题,以支持大语言模型(Large Language Models, LLMs)和自主代理在真实世界环境中实现复杂任务的高效执行。现有方法如随机环境采样或多智能体角色扮演难以生成具备逻辑连贯性和实用性的多轮交互数据,且面临目标模型训练、工具架构隔离及多轮逻辑依赖等结构性挑战。其解决方案的关键在于提出一种名为FunReason-MT的数据合成框架,通过三个核心机制实现突破:1)利用环境-API图交互(Environment-API Graph Interactions)收集多样化高质量轨迹;2)采用高级工具查询合成(Advanced Tool-Query Synthesis)简化复杂查询构造;3)引入引导式迭代链(Guided Iterative Chain)生成复杂思维链(Chain-of-Thought, CoT),从而系统性提升多轮FC数据的质量与可用性。
链接: https://arxiv.org/abs/2510.24645
作者: Zengzhuang Xu,Bingguang Hao,Zechuan Wang,Yuntao Wen,Maolin Wang,Yang Liu,Long Chen,Dong Wang,Yicheng Chen,Cunyin Peng,Chenyi Zhuang,Jinjie Gu,Leilei Gan,Xiangyu Zhao,Shi Gu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real-world problems. As this ability becomes increasingly central to advanced AI systems, the need for high-quality, multi-turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi-agent role-playing, are not powerful enough to generate high-quality data in real-world environments. Practical challenges come in three folds: targeted model training, isolation of tool architecture, and multi-turn logical dependency. To address these structural deficiencies, we present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing 1) Environment-API Graph Interactions to gather varied high-quality trajectories, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on Berkeley Function-Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models, outperforming most close-source models. Further performance improvements on BFCLv4 confirm that FunReason-MT provides a reliable and robust source for agentic learning.
zh
[AI-7] The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets NEURIPS2025
【速读】:该论文旨在解决深度神经网络中鲁棒记忆(robust memorization)的参数复杂度问题,即在保证训练样本周围 μ-球内预测一致性的前提下,如何用最少的参数实现对任意数据集的插值,其中不同标签样本间具有 ϵ-分离。其核心贡献在于对鲁棒性比 ρ=μ/ϵ 的全范围 (0,1) 进行了细粒度分析,提出了更紧致的上下界,揭示了当 ρ 较小时鲁棒记忆的参数复杂度与非鲁棒情况相当,但随着 ρ 增大而显著增长,从而明确了鲁棒性要求对模型容量的定量影响。
链接: https://arxiv.org/abs/2510.24643
作者: Yujun Kim,Chaewon Moon,Chulhee Yun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025, 72 pages, 8 figures
Abstract:We study the parameter complexity of robust memorization for $\mathrm{ReLU}$ networks: the number of parameters required to interpolate any given dataset with $\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\mu$-ball around each training sample. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\rho = \mu/\epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\rho \in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\rho$ is small, but grows with increasing $\rho$.
zh
[AI-8] Causal Ordering for Structure Learning From Time Series
【速读】:该论文旨在解决时间序列数据中因果结构发现(causal structure discovery)的难题,尤其针对变量数量和时间点增多时带来的组合复杂性问题。传统基于排序的方法受限于单一因果顺序的表示能力,易产生虚假因果关系。其解决方案的关键在于引入多有效因果排序(multiple valid causal orderings)而非单一排序,通过扩散过程(diffusion-based causal discovery)构建Dots(Diffusion Ordered Temporal Structure)方法,从而有效恢复底层有向无环图(directed acyclic graph, DAG)的传递闭包,显著减少单排序方法中的伪相关性。该方法在标准假设(如平稳性和加性噪声模型)下,结合得分匹配与扩散过程实现高效海森矩阵估计,实验证明其在合成与真实世界数据集上均优于现有最先进基线,兼具高准确率与可扩展性。
链接: https://arxiv.org/abs/2510.24639
作者: Pedro P. Sanchez,Damian Machlanski,Steven McDonagh,Sotirios A. Tsaftaris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages
Abstract:Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grow. A common approach to simplify the task is the so-called ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), using diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph, mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Extensive experiments validate the approach. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks (d = 3–6 variables, T = 200–5,000 samples), DOTS improves mean window-graph F1 from 0.63 (best baseline) to 0.81. On the CausalTime real-world benchmark (d = 20–36), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph F1 while halving runtime relative to graph-optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery.
zh
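摘要中的"传递闭包"可以用下面的最小示意来理解(与 DOTS 本身的扩散算法无关的简化实现):对给定的有向无环图,用 DFS 计算每个节点可达的全部后代,得到边集的传递闭包。

```python
def transitive_closure(edges, nodes):
    """Compute the transitive closure of a DAG given as an edge list."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
    closure = {v: set() for v in nodes}

    def reach(u):
        if closure[u]:            # already fully computed (non-empty)
            return closure[u]
        for v in adj[u]:
            closure[u].add(v)
            closure[u] |= reach(v)
        return closure[u]

    for v in nodes:
        reach(v)
    return {(u, v) for u in nodes for v in closure[u]}

# 玩具时序图:x -> y -> z 的闭包还包含 x -> z
print(sorted(transitive_closure([("x", "y"), ("y", "z")], ["x", "y", "z"])))
```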
[AI-9] All in one timestep: Enhancing Sparsity and Energy efficiency in Multi-level Spiking Neural Networks
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)因瞬时脉冲的二值特性导致的信息损失问题,从而引发准确率下降的瓶颈。其核心解决方案是提出一种多级脉冲神经元模型(multi-level spiking neuron model),该模型能够在保持极低量化误差的同时实现最小推理延迟,并逼近全精度人工神经网络(Artificial Neural Networks, ANNs)的性能。关键创新在于通过增加脉冲编码的量化层级,在不牺牲精度的前提下显著提升信息压缩效率,从而在图像分类任务中将能耗降低2至3倍,并在神经形态数据上将推理延迟压缩至1个时间步(相比先前工作压缩因子达10)。此外,论文还设计了新型稀疏残差架构Sparse-ResNet,通过分析残差连接中的脉冲传播机制,揭示了“脉冲雪崩效应”(spike avalanche effect),并在此基础上实现了比现有SNN残差结构减少超过20%网络活动量的同时达到最先进的图像分类准确率。
链接: https://arxiv.org/abs/2510.24637
作者: Andrea Castagnetti,Alain Pegatoquet,Benoît Miramond
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Spiking Neural Networks (SNNs) are one of the most promising bio-inspired neural networks models and have drawn increasing attention in recent years. The event-driven communication mechanism of SNNs allows for sparse and theoretically low-power operations on dedicated neuromorphic hardware. However, the binary nature of instantaneous spikes also leads to considerable information loss in SNNs, resulting in accuracy degradation. To address this issue, we propose a multi-level spiking neuron model able to provide both low-quantization error and minimal inference latency while approaching the performance of full precision Artificial Neural Networks (ANNs). Experimental results with popular network architectures and datasets, show that multi-level spiking neurons provide better information compression, allowing therefore a reduction in latency without performance loss. When compared to binary SNNs on image classification scenarios, multi-level SNNs indeed allow reducing by 2 to 3 times the energy consumption depending on the number of quantization intervals. On neuromorphic data, our approach allows us to drastically reduce the inference latency to 1 timestep, which corresponds to a compression factor of 10 compared to previously published results. At the architectural level, we propose a new residual architecture that we call Sparse-ResNet. Through a careful analysis of the spikes propagation in residual connections we highlight a spike avalanche effect, that affects most spiking residual architectures. Using our Sparse-ResNet architecture, we can provide state-of-the-art accuracy results in image classification while reducing by more than 20% the network activity compared to the previous spiking ResNets.
zh
[AI-10] DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
【速读】:该论文旨在解决时间序列预测模型在训练过程中,因标签自相关性导致标准直接预测(Direct Forecast, DF)方法中条件分布对齐失效的问题。DF方法通常通过最小化标签序列的条件负对数似然来优化模型,但当标签存在自相关时,这种估计会引入偏差。论文提出DistDF解决方案,其关键在于通过交替最小化预测分布与标签分布之间的条件差异来实现更准确的对齐;为克服有限观测下条件差异难以估计的难题,作者引入一种新的联合分布Wasserstein差异度量,该度量可严格上界目标条件差异,并具备可微性和可计算性,从而能无缝集成到基于梯度的训练流程中,显著提升多种预测模型的性能。
链接: https://arxiv.org/abs/2510.24574
作者: Hao Wang,Licheng Pan,Yuan Lu,Zhixuan Chu,Xiaoxi Li,Shuting He,Zhichao Chen,Haoxuan Li,Qingsong Wen,Zhouchen Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training time-series forecast models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimizing the conditional negative log-likelihood of the label sequence, typically estimated using the mean squared error. However, this estimation proves to be biased in the presence of label autocorrelation. In this paper, we propose DistDF, which achieves alignment by alternately minimizing a discrepancy between the conditional forecast and label distributions. Because conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a newly proposed joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. This discrepancy admits tractable, differentiable estimation from empirical samples and integrates seamlessly with gradient-based training. Extensive experiments show that DistDF improves the performance of diverse forecast models and achieves state-of-the-art forecasting performance. Code is available at this https URL.
zh
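论文中的联合分布 Wasserstein 差异有其专门构造;这里仅给出一个概念性的替代示意(实现细节均为本文假设):把(历史窗口, 预测/标签序列)拼接成联合样本,做随机投影后用一维经验 Wasserstein-1(排序后逐点差的均值)近似两组联合分布的差异,即所谓 sliced Wasserstein。

```python
import numpy as np

def sliced_w1(samples_p, samples_q, n_proj=64, seed=0):
    """Sliced Wasserstein-1 between two equal-size empirical samples.
    Each row is one joint sample, e.g. concat(history_window, forecast)."""
    rng = np.random.default_rng(seed)
    d = samples_p.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        # 等样本量下,一维 W1 = 排序后逐点绝对差的均值
        p = np.sort(samples_p @ theta)
        q = np.sort(samples_q @ theta)
        total += np.abs(p - q).mean()
    return total / n_proj

rng = np.random.default_rng(1)
joint_label = rng.normal(0.0, 1.0, size=(512, 24))  # (history, label) 联合样本
joint_pred  = rng.normal(0.3, 1.0, size=(512, 24))  # (history, forecast) 联合样本
print(f"sliced W1 ~ {sliced_w1(joint_label, joint_pred):.3f}")
```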
[AI-11] LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
【速读】:该论文旨在解决LoRA(Low-Rank Adaptation)在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中初始化方法的局限性问题,特别是现有方法要么未利用目标域数据,要么仅通过浅层梯度分解方式利用数据,导致性能不佳且理论基础薄弱。其解决方案的关键在于构建了一个基于渐近分析的数据感知LoRA初始化理论框架:从最小化微调后模型与目标模型之间参数差异期望的通用优化目标出发,推导出包含偏置项(由Fisher-梯度形式近似以保留各向异性)和方差项(通过Fisher信息矩阵捕捉采样随机性引入的不确定性)的优化问题,并据此得到最优LoRA初始化策略。在此基础上,作者提出了高效算法LoRA-DA,仅需少量目标域样本即可估计上述两项并实现最优初始化,实验证明其在多个基准上均优于现有方法,同时具备更快更稳定的收敛性、对不同秩的鲁棒性以及较小的初始化开销。
链接: https://arxiv.org/abs/2510.24561
作者: Qingyue Zhang,Chang Chu,Tianren Peng,Qi Li,Xiangyang Luo,Zhihao Jiang,Shao-Lun Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:With the widespread adoption of LLMs, LoRA has become a dominant method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition, which remains unsatisfactory due to the weak empirical performance of the one-step fine-tuning model that serves as their basis, as well as the fact that these methods either lack a rigorous theoretical foundation or depend heavily on restrictive isotropic assumptions. In this paper, we establish a theoretical framework for data-aware LoRA initialization based on asymptotic analysis. Starting from a general optimization objective that minimizes the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. By solving this problem, we obtain an optimal initialization strategy for LoRA. Building on this theoretical framework, we develop an efficient algorithm, LoRA-DA, which estimates the terms in the optimization problem from a small set of target domain samples and obtains the optimal LoRA initialization. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.
zh
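LoRA-DA 的最优初始化由文中 Fisher 形式的优化问题解出,具体公式未在摘要中给出。作为背景,这里示意一种常见的"数据感知"初始化思路(非 LoRA-DA 本身):对在少量目标域样本上累积的梯度矩阵做截断 SVD,用前 r 个奇异方向初始化 LoRA 的 A、B 两个因子。

```python
import numpy as np

def svd_lora_init(grad_accum, rank):
    """Initialize LoRA factors B (d_out x r) and A (r x d_in) from the
    top-r SVD of a gradient matrix accumulated on target-domain samples.
    A simplified illustration, not the LoRA-DA estimator itself."""
    U, S, Vt = np.linalg.svd(grad_accum, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    B = U[:, :rank] * sqrt_s          # d_out x r
    A = (Vt[:rank, :].T * sqrt_s).T   # r x d_in
    return B, A

d_out, d_in, r = 64, 128, 8
grad_accum = np.random.default_rng(0).normal(size=(d_out, d_in))
B, A = svd_lora_init(grad_accum, r)
# 施加在冻结权重 W 上的低秩更新为 delta_W = B @ A
print(B.shape, A.shape, np.linalg.matrix_rank(B @ A))  # (64, 8) (8, 128) 8
```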
[AI-12] Generative AI for Healthcare: Fundamentals Challenges and Perspectives
【速读】:该论文旨在解决生成式人工智能(Generative AI)在医疗健康领域部署时面临的挑战,即如何有效整合多样化的医学数据与知识以支持高质量、可信赖的临床应用。其解决方案的关键在于提出一种以数据为中心的设计范式,将医疗数据生态系统作为生成式医疗系统的基础架构,通过可持续的数据集成、表示与检索机制,支撑上游模型训练与下游临床任务的高效执行,从而实现从大规模预训练到特定任务推理的闭环赋能。
链接: https://arxiv.org/abs/2510.24551
作者: Gang Chen,Changshuo Liu,Gene Anne Ooi,Marcus Tan,Zhongle Xie,Jianwei Yin,James Wei Luen Yip,Wenqiao Zhang,Jiaqi Zhu,Beng Chin Ooi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generative Artificial Intelligence (GenAI) is taking the world by storm. It promises transformative opportunities for advancing and disrupting existing practices, including healthcare. From large language models (LLMs) for clinical note synthesis and conversational assistance to multimodal systems that integrate medical imaging, electronic health records, and genomic data for decision support, GenAI is transforming the practice of medicine and the delivery of healthcare, such as diagnosis and personalized treatments, with great potential in reducing the cognitive burden on clinicians, thereby improving overall healthcare delivery. However, GenAI deployment in healthcare requires an in-depth understanding of healthcare tasks and what can and cannot be achieved. In this paper, we propose a data-centric paradigm in the design and deployment of GenAI systems for healthcare. Specifically, we reposition the data life cycle by making the medical data ecosystem as the foundational substrate for generative healthcare systems. This ecosystem is designed to sustainably support the integration, representation, and retrieval of diverse medical data and knowledge. With effective and efficient data processing pipelines, such as semantic vector search and contextual querying, it enables GenAI-powered operations for upstream model components and downstream clinical applications. Ultimately, it not only supplies foundation models with high-quality, multimodal data for large-scale pretraining and domain-specific fine-tuning, but also serves as a knowledge retrieval backend to support task-specific inference via the agentic layer. The ecosystem enables the deployment of GenAI for high-quality and effective healthcare delivery.
zh
[AI-13] From Cross-Task Examples to In-Task Prompts: A Graph-Based Pseudo-Labeling Framework for In-context Learning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在进行上下文学习(In-Context Learning, ICL)时,为新任务收集高质量标注示例成本高昂且耗时的问题。其解决方案的关键在于提出一种两阶段、低成本的伪标签生成与传播框架:第一阶段利用跨任务的现成示例引导LLM对少量目标任务样本进行伪标注;第二阶段采用基于图结构的标签传播方法,在无需额外调用LLM的前提下,将标签信息扩展至剩余目标样本,从而构建完整的伪标签数据集用于ICL的示范构造。该方法结合了跨任务监督的灵活性与无LLM依赖的可扩展性,显著降低了人工标注成本并保持良好性能。
链接: https://arxiv.org/abs/2510.24528
作者: Zihan Chen,Song Wang,Xingbo Fu,Chengshuai Shi,Zhenyu Lei,Cong Shen,Jundong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The capability of in-context learning (ICL) enables large language models (LLMs) to perform novel tasks without parameter updates by conditioning on a few input-output examples. However, collecting high-quality examples for new or challenging tasks can be costly and labor-intensive. In this work, we propose a cost-efficient two-stage pipeline that reduces reliance on LLMs for data labeling. Our approach first leverages readily available cross-task examples to prompt an LLM and pseudo-label a small set of target task instances. We then introduce a graph-based label propagation method that spreads label information to the remaining target examples without additional LLM queries. The resulting fully pseudo-labeled dataset is used to construct in-task demonstrations for ICL. This pipeline combines the flexibility of cross-task supervision with the scalability of LLM-free propagation. Experiments across five tasks demonstrate that our method achieves strong performance while lowering labeling costs.
zh
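第二阶段"基于图的标签传播"可以用经典的迭代传播公式来理解。下面是一个最小示意(与论文的具体公式未必一致):在 kNN 余弦相似度图上迭代 F ← αSF + (1−α)Y,把少量 LLM 伪标签扩散到其余目标样本,全程无需再调用 LLM。

```python
import numpy as np

def propagate_labels(X, y_seed, alpha=0.9, k=5, n_iter=50):
    """Spread pseudo-labels over a kNN graph. y_seed[i] = class id or -1."""
    n = X.shape[0]
    Z = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Z @ Z.T                                   # cosine similarities
    np.fill_diagonal(S, 0.0)
    idx = np.argsort(-S, axis=1)[:, :k]           # k nearest neighbours per row
    W = np.zeros_like(S)
    rows = np.arange(n)[:, None]
    W[rows, idx] = S[rows, idx]
    W = np.maximum(W, W.T)                        # symmetrize the graph
    D = W.sum(axis=1, keepdims=True) + 1e-12
    S_norm = W / np.sqrt(D) / np.sqrt(D.T)        # D^-1/2 W D^-1/2
    n_cls = int(y_seed.max()) + 1
    Y = np.zeros((n, n_cls))
    Y[y_seed >= 0, y_seed[y_seed >= 0]] = 1.0     # one-hot LLM-labeled seeds
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S_norm @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 8)), rng.normal(4, 1, (30, 8))])
y_seed = -np.ones(60, dtype=int)
y_seed[0], y_seed[30] = 0, 1                      # 仅两个 LLM 伪标注种子
print(propagate_labels(X, y_seed))
```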
[AI-14] Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient
【速读】:该论文旨在解决传统梅尔频率倒谱系数(Mel Frequency Cepstral Coefficients, MFCC)仅提供频率信息而缺乏时间定位能力的问题,同时克服标准小波变换在低频段频率分辨率不足且与人耳听觉感知不一致的局限性。其解决方案的关键在于提出一种时域梅尔尺度小波系数(Time domain Mel frequency Wavelet Coefficient, TMFWC)提取方法,通过将小波变换的思想融入梅尔尺度滤波的时域实现中,避免了传统方法中需在梅尔滤波后额外进行时频转换的高计算开销,从而显著降低了特征提取的复杂度,并提升了音频信号处理效率。
链接: https://arxiv.org/abs/2510.24519
作者: Rinku Sebastian,Simon O’Keefe,Martin Trefzer
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Extracting features from speech is the most critical step in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of speaker and speech recognition applications, as the filtering in this feature resembles the filtering that takes place in the human ear. The main drawback of this feature is that it captures only the frequency content of the signal, not the times at which those frequencies occur. The wavelet transform, with its flexible time-frequency window, provides both time and frequency information and is an appropriate tool for analysing non-stationary signals such as speech. On the other hand, because of its uniform frequency scaling, a typical wavelet transform may be less effective for speech, with poorer frequency resolution at low frequencies and weaker alignment with human auditory perception. Hence, it is desirable to develop a feature that combines the merits of both MFCC and the wavelet transform, and many studies have attempted to do so. Existing wavelet-transform-based Mel-scale feature extraction methods require more computation because the wavelet transform is applied on top of Mel-scale filtering, adding extra processing steps. Here we propose a method to extract Mel-scale features in the time domain by incorporating the wavelet-transform concept, thus reducing the computational burden of time-frequency conversion and the complexity of wavelet extraction. Combining our proposed Time domain Mel frequency Wavelet Coefficient (TMFWC) technique with the reservoir computing methodology significantly improves the efficiency of audio signal processing.
zh
[AI-15] Design and Optimization of Cloud Native Homomorphic Encryption Workflows for Privacy-Preserving ML Inference
【速读】:该论文旨在解决在云原生环境中部署机器学习(Machine Learning, ML)模型时,用户数据在推理阶段的隐私保护问题。传统方法难以在不暴露敏感数据的前提下实现安全计算,而同态加密(Homomorphic Encryption, HE)虽能支持对加密数据直接进行计算,但其在大规模云环境中的应用受限于高计算开销、编排复杂性及模型兼容性差等问题。论文提出了一种系统化的框架,关键在于将容器化的HE模块与基于Kubernetes的编排机制相结合,实现分布式环境下弹性扩展和并行加密计算;同时引入密文打包(ciphertext packing)、多项式模数调整(polynomial modulus adjustment)和算子融合(operator fusion)等优化策略,在保障密码学完整性的前提下显著降低延迟和内存消耗,实验表明该方案可实现最高3.2倍的推理加速和40%的内存利用率提升。
链接: https://arxiv.org/abs/2510.24498
作者: Tejaswini Bollikonda
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 2 tables
Abstract:As machine learning (ML) models become increasingly deployed through cloud infrastructures, the confidentiality of user data during inference poses a significant security challenge. Homomorphic Encryption (HE) has emerged as a compelling cryptographic technique that enables computation on encrypted data, allowing predictions to be generated without decrypting sensitive inputs. However, the integration of HE within large scale cloud native pipelines remains constrained by high computational overhead, orchestration complexity, and model compatibility issues. This paper presents a systematic framework for the design and optimization of cloud native homomorphic encryption workflows that support privacy-preserving ML inference. The proposed architecture integrates containerized HE modules with Kubernetes-based orchestration, enabling elastic scaling and parallel encrypted computation across distributed environments. Furthermore, optimization strategies including ciphertext packing, polynomial modulus adjustment, and operator fusion are employed to minimize latency and resource consumption while preserving cryptographic integrity. Experimental results demonstrate that the proposed system achieves up to 3.2x inference acceleration and 40% reduction in memory utilization compared to conventional HE pipelines. These findings illustrate a practical pathway for deploying secure ML-as-a-Service (MLaaS) systems that guarantee data confidentiality under zero-trust cloud conditions.
zh
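下面用加法同态的 Paillier 方案(python-paillier 库,并非论文所用的全同态方案,也不含文中的密文打包、模数调整等优化)演示"对加密输入做线性模型推理"的基本流程;模型权重与数据均为假设示例。

```python
from phe import paillier  # pip install phe

# 客户端:加密私有特征向量
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
features = [0.8, -1.2, 3.5]
enc_features = [public_key.encrypt(x) for x in features]

# 服务端:仅在密文上计算 w.x + b
# (加法同态支持 密文+密文 以及 密文*明文标量)
weights, bias = [0.5, 0.1, -0.2], 0.3
enc_score = enc_features[0] * weights[0]
for e, w in zip(enc_features[1:], weights[1:]):
    enc_score += e * w
enc_score += bias

# 客户端:只有持有私钥的一方能解密预测结果
score = private_key.decrypt(enc_score)
print(f"decrypted linear score: {score:.4f}")  # = 0.5*0.8 + 0.1*(-1.2) + (-0.2)*3.5 + 0.3
```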
[AI-16] Online neural fusion of distortionless differential beamformers for robust speech enhancement
【速读】:该论文旨在解决固定波束形成器在动态声学环境中适应性差、干扰抑制能力受限的问题,尤其是在高度非平稳场景(如快速移动干扰)下,传统自适应凸组合(Adaptive Convex Combination, ACC)算法因无法可靠跟踪快速变化而失效。解决方案的关键在于提出一种帧在线神经融合框架,通过神经网络估计多个无失真差分波束形成器的组合权重,从而在保持无失真约束的前提下,显著提升对动态声学环境的适应能力和干扰抑制性能。
链接: https://arxiv.org/abs/2510.24497
作者: Yuanhang Qian,Kunlong Zhao,Jilu Jin,Xueqin Luo,Gongping Huang,Jingdong Chen,Jacob Benesty
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Fixed beamforming is widely used in practice since it does not depend on the estimation of noise statistics and provides relatively stable performance. However, a single beamformer cannot adapt to varying acoustic conditions, which limits its interference suppression capability. To address this, adaptive convex combination (ACC) algorithms have been introduced, where the outputs of multiple fixed beamformers are linearly combined to improve robustness. Nevertheless, ACC often fails in highly non-stationary scenarios, such as rapidly moving interference, since its adaptive updates cannot reliably track rapid changes. To overcome this limitation, we propose a frame-online neural fusion framework for multiple distortionless differential beamformers, which estimates the combination weights through a neural network. Compared with conventional ACC, the proposed method adapts more effectively to dynamic acoustic environments, achieving stronger interference suppression while maintaining the distortionless constraint.
zh
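帧在线神经融合的核心是"网络逐帧输出凸组合权重"。下面用 PyTorch 给出一个结构示意(网络规模、特征维度均为假设,非论文原始结构):softmax 保证权重非负且和为 1,从而组合后仍满足各无失真波束形成器的无失真约束。

```python
import torch
import torch.nn as nn

class BeamformerFusion(nn.Module):
    """Predict per-frame convex combination weights for N fixed beamformers."""
    def __init__(self, feat_dim=40, n_beams=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_beams),
        )

    def forward(self, frame_feats, beam_outputs):
        # frame_feats: (T, feat_dim); beam_outputs: (T, n_beams) 每帧各波束输出
        w = torch.softmax(self.net(frame_feats), dim=-1)  # 逐帧凸组合权重
        fused = (w * beam_outputs).sum(dim=-1)            # (T,)
        return fused, w

T = 10
model = BeamformerFusion()
fused, w = model(torch.randn(T, 40), torch.randn(T, 4))
print(fused.shape, w.sum(dim=-1))  # torch.Size([10]),每行权重和为 1
```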
[AI-17] Sample-efficient and Scalable Exploration in Continuous-Time RL
【速读】:该论文旨在解决连续时间强化学习(Continuous-Time Reinforcement Learning, CT-RL)中的建模与决策问题,即如何在未知系统动力学由非线性常微分方程(ODE)描述的场景下,设计一种高效且具备不确定性感知能力的模型化方法。其核心挑战在于传统强化学习算法多基于离散时间假设,而现实控制系统的动态本质是连续的。解决方案的关键在于提出COMBRL算法,该算法利用高斯过程(Gaussian Processes)或贝叶斯神经网络(Bayesian Neural Networks)构建对ODE的不确定性感知模型,并通过贪婪地最大化外在奖励与模型认知不确定性(epistemic uncertainty)的加权和来指导探索与利用。这一机制不仅实现了样本效率提升,还在奖励驱动和无监督强化学习两种设置下分别提供了次线性遗憾(sublinear regret)和样本复杂度边界,实验表明其在多个深度强化学习任务中优于现有基线方法。
链接: https://arxiv.org/abs/2510.24482
作者: Klemens Iten,Lenart Treven,Bhavya Sukhija,Florian Dörfler,Andreas Krause
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 26 pages, 6 figures, 6 tables
Abstract:Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
zh
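COMBRL 的探索目标是贪心最大化"外在奖励 + 模型认知不确定性"的加权和。下面用一个简单的动力学模型集成(以随机初始化的集成代替文中的 GP/贝叶斯神经网络)示意该采集函数;权重 β 与模型形式均为假设。

```python
import numpy as np

def combrl_acquisition(candidate_actions, reward_fn, ensemble_predict, beta=1.0):
    """Score actions by extrinsic reward plus epistemic uncertainty,
    measured here as the ensemble's disagreement over predicted next states."""
    scores = []
    for a in candidate_actions:
        preds = np.stack([f(a) for f in ensemble_predict])  # (n_models, state_dim)
        epistemic = preds.std(axis=0).sum()                 # 分歧作为认知不确定性
        scores.append(reward_fn(a) + beta * epistemic)
    return candidate_actions[int(np.argmax(scores))]

rng = np.random.default_rng(0)
# 假设的集成动力学模型:x' = W @ a,每个成员一套 W
ensemble = [(lambda a, W=rng.normal(size=(3, 2)): W @ a) for _ in range(5)]
actions = [rng.uniform(-1, 1, size=2) for _ in range(16)]
best = combrl_acquisition(actions, reward_fn=lambda a: -np.sum(a**2),
                          ensemble_predict=ensemble, beta=0.5)
print("chosen action:", best)
```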
[AI-18] Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在复杂控制任务中面临的两大挑战:一是脉冲神经元的非可微特性导致代理梯度(surrogate gradients)优化性质不明确;二是SNN的状态依赖动态要求序列训练,而在强化学习(Reinforcement Learning, RL)早期阶段由于序列长度受限,难以跨越网络的“预热期”(warm-up period)。解决方案的关键在于两个层面:其一,通过系统分析代理梯度斜率设置,发现较浅的斜率或调度式斜率在RL环境中能显著提升梯度幅度并增强与真实梯度的对齐性,从而带来2.1倍的训练和部署性能提升;其二,提出一种新型训练方法,利用一个特权引导策略(privileged guiding policy)启动学习过程,同时保留在线环境交互以优化脉冲策略,结合自适应斜率调度,在真实无人机位置控制任务中实现平均回报400分,远超行为克隆(Behavioral Cloning)和TD3BC等基线方法(最高仅–200分)。
链接: https://arxiv.org/abs/2510.24461
作者: Korneel Van den Berghe,Stein Stroobants,Vijay Janapa Reddi,G.C.H.E. de Croon
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Neuromorphic computing systems are set to revolutionize energy-constrained robotics by achieving orders-of-magnitude efficiency gains, while enabling native temporal processing. Spiking Neural Networks (SNNs) represent a promising algorithmic approach for these systems, yet their application to complex control tasks faces two critical challenges: (1) the non-differentiable nature of spiking neurons necessitates surrogate gradients with unclear optimization properties, and (2) the stateful dynamics of SNNs require training on sequences, which in reinforcement learning (RL) is hindered by limited sequence lengths during early training, preventing the network from bridging its warm-up period. We address these challenges by systematically analyzing surrogate gradient slope settings, showing that shallower slopes increase gradient magnitude in deeper layers but reduce alignment with true gradients. In supervised learning, we find no clear preference for fixed or scheduled slopes. The effect is much more pronounced in RL settings, where shallower slopes or scheduled slopes lead to a 2.1x improvement in both training and final deployed performance. Next, we propose a novel training approach that leverages a privileged guiding policy to bootstrap the learning process, while still exploiting online environment interactions with the spiking policy. Combining our method with an adaptive slope schedule for a real-world drone position control task, we achieve an average return of 400 points, substantially outperforming prior techniques, including Behavioral Cloning and TD3BC, which achieve at most -200 points under the same conditions. This work advances both the theoretical understanding of surrogate gradient learning in SNNs and practical training methodologies for neuromorphic controllers demonstrated in real-world robotic systems.
zh
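代理梯度"斜率"的作用可以从下面的最小实现中看清(以常见的 fast-sigmoid 代理为例,非论文专用形式):前向是硬阈值脉冲,反向用斜率为 k 的平滑近似;把 k 做成随训练进度变化的参数即可实现文中的"斜率调度"。

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a fast-sigmoid surrogate gradient of slope k."""
    @staticmethod
    def forward(ctx, v, k):
        ctx.save_for_backward(v)
        ctx.k = k
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        k = ctx.k
        surrogate = k / (1.0 + k * v.abs()) ** 2   # d(spike)/dv 的平滑近似
        return grad_out * surrogate, None

v = torch.linspace(-2, 2, 5, requires_grad=True)
for k in (1.0, 10.0):                  # 浅斜率 vs. 陡斜率
    spikes = SpikeFn.apply(v, k)
    spikes.sum().backward()
    print(f"k={k:>4}: grad at v=0 is {v.grad[2].item():.2f}")  # 1.00 / 10.00
    v.grad = None
```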
[AI-19] Affordance Representation and Recognition for Autonomous Agents
【速读】:该论文旨在解决软件代理(Software Agent)在构建可操作的内部世界模型时面临的两大核心问题:一是原始HTML文档对象模型(DOM)的冗长性导致基础模型难以直接处理,二是硬编码API集成的静态特性使得代理无法适应不断演化的网络服务。解决方案的关键在于提出一种用于结构化数据世界建模的模式语言,包含两个互补的架构模式:其一为DOM转换模式(DOM Transduction Pattern),通过将冗余的原始DOM精炼为任务相关的紧凑表示,优化代理推理模块的效率;其二为超媒体可用性识别模式(Hypermedia Affordances Recognition Pattern),使代理能够动态解析标准化语义描述,在运行时发现并整合未知Web服务的能力,从而实现对复杂数字环境的持续适应与扩展。
链接: https://arxiv.org/abs/2510.24459
作者: Habtom Kahsay Gidey,Niklas Huber,Alexander Lenz,Alois Knoll
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: The Second International Workshop on Hypermedia Multi-Agent Systems (HyperAgents 2025), in conjunction with the 28th European Conference on Artificial Intelligence (ECAI 2025); October 26, 2025, Bologna, Italy
Abstract:The autonomy of software agents is fundamentally dependent on their ability to construct an actionable internal world model from the structured data that defines their digital environment, such as the Document Object Model (DOM) of web pages and the semantic descriptions of web services. However, constructing this world model from raw structured data presents two critical challenges: the verbosity of raw HTML makes it computationally intractable for direct use by foundation models, while the static nature of hardcoded API integrations prevents agents from adapting to evolving services. This paper introduces a pattern language for world modeling from structured data, presenting two complementary architectural patterns. The DOM Transduction Pattern addresses the challenge of web page complexity by distilling a verbose, raw DOM into a compact, task-relevant representation or world model optimized for an agent’s reasoning core. Concurrently, the Hypermedia Affordances Recognition Pattern enables the agent to dynamically enrich its world model by parsing standardized semantic descriptions to discover and integrate the capabilities of unknown web services at runtime. Together, these patterns provide a robust framework for engineering agents that can efficiently construct and maintain an accurate world model, enabling scalable, adaptive, and interoperable automation across the web and its extended resources.
zh
[AI-20] Human-Level Reasoning : A Comparative Study of Large Language Models on Logical and Abstract Reasoning
【速读】:该论文试图解决的问题是:如何有效评估大型语言模型(Large Language Models, LLMs)在逻辑推理与抽象推理方面的能力,以判断其是否具备真正的理解、推断和逻辑结论生成能力,而不仅仅是语言层面的任务完成度。解决方案的关键在于设计了一套八道定制化的推理题目,并将不同LLM的答题结果与人类在相同任务上的表现进行对比,从而揭示LLMs在演绎推理方面的局限性与性能差异。
链接: https://arxiv.org/abs/2510.24435
作者: Benjamin Grando Moreira
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it transcends mere linguistic task performance. It involves understanding whether these models truly understand information, perform inferences, and are able to draw conclusions in a logical and valid way. This study compares the logical and abstract reasoning skills of several LLMs - including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabiá - using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.
zh
[AI-21] MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation
【速读】:该论文旨在解决生成式推荐系统(Generative Recommender Systems)在公开基准上是否具备可扩展性,以及如何通过最小化后训练(post-training)策略实现竞争力性能的两个核心问题。其关键解决方案是提出首个完全开源的生成式推荐框架MiniOneRec,该框架涵盖Semantic ID(SID)构建、监督微调和面向推荐任务的强化学习全流程;并通过残差量化变分自编码器(Residual Quantized VAE)生成紧凑的SID序列,并对Qwen系列模型(0.5B–7B参数)进行后训练优化,其中包含两个关键技术:(1) 全流程SID对齐以增强语义一致性,(2) 结合约束解码与混合奖励机制的强化学习策略,从而显著提升排序准确率与候选多样性。
链接: https://arxiv.org/abs/2510.24431
作者: Xiaoyu Kong,Leheng Sheng,Junfei Tan,Yuxin Chen,Jiancan Wu,An Zhang,Xiang Wang,Xiangnan He
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Technical Report
Abstract:The recent success of large language models (LLMs) has renewed interest in whether recommender systems can achieve similar scaling benefits. Conventional recommenders, dominated by massive embedding tables, tend to plateau as embedding dimensions grow. In contrast, the emerging generative paradigm replaces embeddings with compact Semantic ID (SID) sequences produced by autoregressive Transformers. Yet most industrial deployments remain proprietary, leaving two fundamental questions open: (1) Do the expected scaling laws hold on public benchmarks? (2) What is the minimal post-training recipe that enables competitive performance? We present MiniOneRec, to the best of our knowledge, the first fully open-source generative recommendation framework, which provides an end-to-end workflow spanning SID construction, supervised fine-tuning, and recommendation-oriented reinforcement learning. We generate SIDs via a Residual Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters on the Amazon Review dataset. Our experiments reveal a consistent downward trend in both training and evaluation losses with increasing model size, validating the parameter efficiency of the generative approach. To further enhance performance, we propose a lightweight yet effective post-training pipeline that (1) enforces full-process SID alignment and (2) applies reinforcement learning with constrained decoding and hybrid rewards. Together, these techniques yield significant improvements in both ranking accuracy and candidate diversity.
zh
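Semantic ID 由残差量化产生:每一级码本量化当前残差,再把剩余残差传给下一级。下面用 numpy 给出该编码机制的示意(码本为随机初始化,仅演示编码过程,不涉及 RQ-VAE 的训练):

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Encode vector x into one code index per quantization stage."""
    residual = x.copy()
    sids = []
    for cb in codebooks:                       # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # 本级最近码字
        sids.append(idx)
        residual = residual - cb[idx]          # 残差传给下一级
    return sids, residual

rng = np.random.default_rng(0)
dim, n_stages, cb_size = 16, 3, 256
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(n_stages)]
item_emb = rng.normal(size=dim)
sid, res = residual_quantize(item_emb, codebooks)
print("Semantic ID:", sid, "| residual norm:", round(float(np.linalg.norm(res)), 3))
```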
[AI-22] Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
【速读】:该论文旨在解决生成式 AI (Generative AI) 在处理长篇、结构化金融文件时的检索增强生成(Retrieval-Augmented Generation, RAG)性能瓶颈问题,尤其是在相关证据稀疏且跨文档交叉引用频繁的场景下。其核心解决方案在于提出一种多阶段 RAG 架构,关键创新点是利用大语言模型(LLM)生成的元数据(metadata)对文档块(chunk)进行语境增强,即“上下文嵌入”(contextual chunks),从而显著提升检索精度与生成质量。实验表明,相较于单纯依赖文本嵌入或仅使用商业 reranker,结合 LLM 驱动的预检索优化与元数据增强嵌入的架构可实现最优性能,同时引入一个定制化的元数据重排序器(metadata reranker),在保持高性能的同时降低计算成本,为金融文档分析中的鲁棒 RAG 系统提供了可落地的技术路径。
链接: https://arxiv.org/abs/2510.24402
作者: Michail Dadopoulos,Anestis Ladas,Stratos Moschidis,Ioannis Negkakis
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: Preprint version submitted to the International Journal of Accounting Information Systems; currently under major revision. 20 pages, 1 figure, 1 table
Abstract:Retrieval-Augmented Generation (RAG) struggles on long, structured financial filings where relevant evidence is sparse and cross-referenced. This paper presents a systematic investigation of advanced metadata-driven Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a novel, multi-stage RAG architecture that leverages LLM-generated metadata. We introduce a sophisticated indexing pipeline to create contextually rich document chunks and benchmark a spectrum of enhancements, including pre-retrieval filtering, post-retrieval reranking, and enriched embeddings, benchmarked on the FinanceBench dataset. Our results reveal that while a powerful reranker is essential for precision, the most significant performance gains come from embedding chunk metadata directly with text (“contextual chunks”). Our proposed optimal architecture combines LLM-driven pre-retrieval optimizations with these contextual embeddings to achieve superior performance. Additionally, we present a custom metadata reranker that offers a compelling, cost-effective alternative to commercial solutions, highlighting a practical trade-off between peak performance and operational efficiency. This study provides a blueprint for building robust, metadata-aware RAG systems for financial document analysis.
zh
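文中收益最大的"contextual chunks"思路很直接:把 LLM 生成的元数据与块正文一起送入嵌入模型。下面是一个纯字符串层面的示意(字段名为假设),拼接结果可交给任意嵌入模型编码:

```python
def make_contextual_chunk(chunk_text, metadata):
    """Prepend LLM-generated metadata to a chunk before embedding, so the
    vector carries document-level context (the 'contextual chunk' idea)."""
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items())
    return f"[{header}]\n{chunk_text}"

meta = {  # 假设由 LLM 为某 10-K 章节生成的元数据
    "company": "ACME Corp",
    "fiscal_year": "2023",
    "section": "Liquidity and Capital Resources",
}
chunk = "Cash and cash equivalents increased by $120M, driven by operating cash flow."
print(make_contextual_chunk(chunk, meta))
```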
[AI-23] APTBench: Benchmarking Agent ic Potential of Base LLM s During Pre-Training
【速读】:该论文旨在解决当前预训练大语言模型(Large Language Models, LLMs)在评估其代理能力(agentic capabilities)方面的不足,即现有预训练基准主要聚焦于孤立的静态技能(如常识推理或数学/code推理),无法反映模型在真实世界自主任务执行中的潜力;而代理评测基准则多用于后训练模型,依赖多轮任务执行能力,基础模型难以支持。为此,作者提出APTBench框架,其核心创新在于将真实世界的代理任务及成功轨迹转化为适用于基础模型的多项选择题或文本补全任务,重点考察规划与行动等核心代理能力,并覆盖软件工程和深度研究等关键场景。该方案不仅能更有效地预测模型作为代理的下游性能,且相比端到端的后训练代理评估显著更轻量、成本更低。
链接: https://arxiv.org/abs/2510.24397
作者: Jiarui Qin,Yunjia Xi,Junjie Huang,Renting Rui,Di Yin,Weiwen Liu,Yong Yu,Weinan Zhang,Xing Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 46 pages
Abstract:With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect model’s agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models, requiring multi-turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potentials during pre-training and guide the model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios, software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model’s downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations after post-training.
zh
[AI-24] Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实时Web应用中因计算效率低下和推理策略僵化而导致的高延迟与低吞吐量问题,尤其是在需要高质量复杂推理的同时满足低延迟和高并发需求的场景下。现有方法通常只能优化效率或质量之一,难以兼顾两者。解决方案的关键在于提出Orion框架,其核心创新是将单个查询的推理过程分解为两个协同阶段:(1)关键点生成(key point generation),通过检索增强的少样本提示提炼逻辑结构化的要点;(2)内容并行扩展(content parallel expansion),基于依赖图对这些要点进行并行细化以保证逻辑一致性。此外,Orion引入流水线调度机制,利用两个阶段在GPU计算与内存压力上的互补特性,在多查询间实现交叉并行,显著提升推理性能(效率与质量)。
链接: https://arxiv.org/abs/2510.24390
作者: Xianjun Gao,Jianchun Liu,Hongli Xu,Liusheng Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Large Language Models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, presents a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the stringent low-latency and high-throughput requirements of interactive services. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, creates a critical bottleneck for the Web services. Existing approaches typically optimize the LLM reasoning for either efficiency or quality but struggle to achieve both, and thus fail to meet the dual requirements of modern Web platforms. To overcome these limitations, we propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Concretely, Orion decomposes a single query reasoning process into two synergistic phases: (1) key point generation, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) content parallel expansion, which concurrently elaborates on these points based on a dependency graph to ensure logical consistency. Furthermore, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation imposes pressure on GPU computing and expansion stresses on GPU memory) across multiple queries, enabling cross-query parallelism and dramatically improving reasoning performance (i.e., efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency over the baselines but also improves reasoning quality by up to 18.75% through explicitly modeling inter-point dependencies.
zh
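"内容并行扩展"的要点是按依赖图的拓扑层级并行处理要点:同层要点互不依赖,可以并发展开。下面用线程池给出一个调度示意(expand 函数此处仅为占位,实际应调用 LLM;要点名与依赖关系均为假设):

```python
from concurrent.futures import ThreadPoolExecutor

def expand_in_levels(points, deps, expand):
    """Expand key points level by level: points in the same topological
    level have no mutual dependencies and can run in parallel."""
    done, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(points):
            ready = [p for p in points
                     if p not in done and all(d in done for d in deps.get(p, []))]
            if not ready:
                raise ValueError("cyclic dependencies in the point graph")
            for p, text in zip(ready, pool.map(expand, ready)):
                results[p] = text
            done.update(ready)
    return results

points = ["background", "method", "results", "conclusion"]
deps = {"method": ["background"], "results": ["method"],
        "conclusion": ["results", "background"]}
out = expand_in_levels(points, deps, expand=lambda p: f"<expanded {p}>")
print(out)
```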
[AI-25] Policy Cards: Machine-Readable Runtime Governance for Autonomous AI Agents
【速读】:该论文旨在解决AI代理在部署和运行过程中缺乏可执行、可验证的合规性约束表达机制的问题,尤其是在操作、监管与伦理层面难以实现动态约束遵循。解决方案的关键在于提出“Policy Card”(策略卡)这一机器可读的部署层标准,它作为AI代理的组成部分,在运行时嵌入允许/禁止规则、义务条款、证据要求及与NIST AI RMF、ISO/IEC 42001和欧盟AI法案等保障框架的映射关系,从而实现对自主代理的规范性控制,并支持自动验证、版本管理和持续审计流水线,构建分布式多代理生态中的可验证合规基础。
链接: https://arxiv.org/abs/2510.24383
作者: Juraj Mavračić
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: First published on 19/10/2025. Canonical archived record and DOI: https://doi.org/10.5281/zenodo.17391796
Abstract:Policy Cards are introduced as a machine-readable, deployment-layer standard for expressing operational, regulatory, and ethical constraints for AI agents. The Policy Card sits with the agent and enables it to follow required constraints at runtime. It tells the agent what it must and must not do. As such, it becomes an integral part of the deployed agent. Policy Cards extend existing transparency artifacts such as Model, Data, and System Cards by defining a normative layer that encodes allow/deny rules, obligations, evidentiary requirements, and crosswalk mappings to assurance frameworks including NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Each Policy Card can be validated automatically, version-controlled, and linked to runtime enforcement or continuous-audit pipelines. The framework enables verifiable compliance for autonomous agents, forming a foundation for distributed assurance in multi-agent ecosystems. Policy Cards provide a practical mechanism for integrating high-level governance with hands-on engineering practice and enabling accountable autonomy at scale.
zh
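Policy Card 的核心是把 allow/deny 规则与义务条款做成机器可读、可在运行时执行的约束。下面给出一个极简的结构与运行时检查示意(字段名为本文假设,并非该标准的正式 schema):

```python
import fnmatch

policy_card = {  # 假设字段,非官方 Policy Card schema
    "version": "0.1",
    "deny": ["payments.*", "user_data.delete"],
    "allow": ["search.*", "calendar.read"],
    "obligations": {"calendar.read": ["log_access"]},
}

def check_action(card, action):
    """Return (permitted, obligations) for an agent action at runtime."""
    if any(fnmatch.fnmatch(action, p) for p in card["deny"]):
        return False, []
    if any(fnmatch.fnmatch(action, p) for p in card["allow"]):
        return True, card["obligations"].get(action, [])
    return False, []          # 未在白名单内的动作默认拒绝

for action in ["search.web", "payments.transfer", "calendar.read"]:
    ok, obs = check_action(policy_card, action)
    print(f"{action}: {'ALLOW' if ok else 'DENY'} obligations={obs}")
```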
[AI-26] An N-of-1 Artificial Intelligence Ecosystem for Precision Medicine ALT
【速读】:该论文旨在解决当前医学人工智能(Artificial Intelligence in Medicine)系统普遍存在的“平均患者谬误”问题,即现有模型虽在大规模数据集上表现良好,但在罕见变异、多重共病或代表性不足的人群中性能显著下降,从而损害医疗公平性与临床信任。其解决方案的关键在于构建一个面向个体(N-of-1)决策支持的多智能体生态系统(multi-agent ecosystem),通过按器官系统、患者人群和分析模态分组的智能体协作,共享模型库与证据合成工具,并在协调层整合可靠性、不确定性与数据密度信息,最终向临床医生输出包含置信区间、异常标记及证据链接的决策支持包。该设计将验证标准从群体平均准确率转向个体层面的低密度区域误差、小样本校准性和风险-覆盖权衡,实现更透明、公平且以个体为中心的医疗AI应用。
链接: https://arxiv.org/abs/2510.24359
作者: Pedram Fard,Alaleh Azhir,Neguine Rezaii,Jiazi Tian,Hossein Estiri
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Quantitative Methods (q-bio.QM); Applications (stat.AP)
备注: This study has been supported by grants from the National Institutes of Health: The National Institute on Aging R01AG074372 and The National Institute of Allergy and Infectious Diseases R01AI165535
Abstract:Artificial intelligence in medicine is built to serve the average patient. By minimizing error across large datasets, most systems deliver strong aggregate accuracy yet falter at the margins: patients with rare variants, multimorbidity, or underrepresented demographics. This average patient fallacy erodes both equity and trust. We propose a different design: a multi-agent ecosystem for N-of-1 decision support. In this environment, agents clustered by organ systems, patient populations, and analytic modalities draw on a shared library of models and evidence synthesis tools. Their results converge in a coordination layer that weighs reliability, uncertainty, and data density before presenting the clinician with a decision-support packet: risk estimates bounded by confidence ranges, outlier flags, and linked evidence. Validation shifts from population averages to individual reliability, measured by error in low-density regions, calibration in the small, and risk–coverage trade-offs. Anticipated challenges include computational demands, automation bias, and regulatory fit, addressed through caching strategies, consensus checks, and adaptive trial frameworks. By moving from monolithic models to orchestrated intelligence, this approach seeks to align medical AI with the first principle of medicine: care that is transparent, equitable, and centered on the individual.
zh
[AI-27] Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning
【速读】:该论文旨在解决感知模块与决策模块在强化学习或智能体训练中耦合过紧的问题,即传统方法常将感知(perception)和决策(decision)联合优化,导致感知能力受限于特定任务目标,难以泛化。为此,作者提出感知学习(Perception Learning, PeL)范式,其核心在于通过任务无关信号(task-agnostic signals)独立优化感知接口 $ f_\phi:\mathcal{X}\to\mathcal{Z} $,使其具备诸如对干扰的稳定性、无坍塌的信息性以及可控几何结构等可量化感知属性,这些属性由表示不变量(representation-invariant)指标评估。关键创新在于形式化了感知与决策的分离机制,证明了保持充分不变性的PeL更新方向与贝叶斯任务风险梯度正交,从而实现感知模块的独立改进与质量认证。
链接: https://arxiv.org/abs/2510.24356
作者: Suman Sanyal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We introduce Perception Learning (PeL), a paradigm that optimizes an agent’s sensory interface $f_\phi:\mathcal{X}\to\mathcal{Z}$ using task-agnostic signals, decoupled from downstream decision learning $g_\theta:\mathcal{Z}\to\mathcal{Y}$. PeL directly targets label-free perceptual properties, such as stability to nuisances, informativeness without collapse, and controlled geometry, assessed via objective representation-invariant metrics. We formalize the separation of perception and decision, define perceptual properties independent of objectives or reparameterizations, and prove that PeL updates preserving sufficient invariants are orthogonal to Bayes task-risk gradients. Additionally, we provide a suite of task-agnostic evaluation metrics to certify perceptual quality.
zh
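文中的正交性结论可以用一个简单的投影运算来直观理解:把感知更新方向 u 投影掉任务风险梯度 g 上的分量,剩下的分量与 g 正交,因此一阶上不改变任务风险。下面是 numpy 示意(仅演示几何关系,并非论文的具体更新规则):

```python
import numpy as np

def project_orthogonal(u, g, eps=1e-12):
    """Remove from update u its component along task gradient g; the
    remaining direction is orthogonal to g, so it leaves task risk
    unchanged to first order."""
    g_norm_sq = float(g @ g) + eps
    return u - (float(u @ g) / g_norm_sq) * g

rng = np.random.default_rng(0)
u = rng.normal(size=8)   # 候选的感知(PeL)更新方向
g = rng.normal(size=8)   # 任务风险梯度
u_perp = project_orthogonal(u, g)
print(f"<u_perp, g> = {u_perp @ g:+.2e}")  # 约为 0:与任务梯度正交
```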
[AI-28] A Unified Geometric Space Bridging AI Models and the Human Brain
【速读】:该论文旨在解决当前人工智能(AI)模型在组织信息方式上与人类大脑是否存在相似性的问题,尤其针对不同模态(如视觉、语言或跨模态)的AI模型缺乏统一比较框架的困境。其核心挑战在于如何超越特定输入和任务的限制,建立一个普适性的度量体系来评估AI模型的“类脑”程度。解决方案的关键在于提出“类脑空间”(Brain-like Space)这一全新概念——通过将AI模型内在的空间注意力拓扑结构映射到标准的人类功能脑网络中,实现对任意模态、任务或感官域的AI模型进行精确位置定位与定量比较。该方法揭示了模型类脑程度的连续弧形几何分布,并发现其受预训练范式(是否强调全局语义抽象)及位置编码方案(是否促进多模态深度融合)的共同影响,从而为跨领域智能系统的比较与理解提供了首个统一理论框架。
链接: https://arxiv.org/abs/2510.24342
作者: Silin Chen,Yuzhong Chen,Zifan Wang,Junhao Wang,Zifeng Jia,Keith M Kendrick,Tuo Zhang,Lin Zhao,Dezhong Yao,Tianming Liu,Xi Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:For decades, neuroscientists and computer scientists have pursued a shared ambition: to understand intelligence and build it. Modern artificial neural networks now rival humans in language, perception, and reasoning, yet it is still largely unknown whether these artificial systems organize information as the brain does. Existing brain-AI alignment studies have shown the striking correspondence between the two systems, but such comparisons remain bound to specific inputs and tasks, offering no common ground for comparing how AI models with different kinds of modalities-vision, language, or multimodal-are intrinsically organized. Here we introduce a groundbreaking concept of Brain-like Space: a unified geometric space in which every AI model can be precisely situated and compared by mapping its intrinsic spatial attention topological organization onto canonical human functional brain networks, regardless of input modality, task, or sensory domain. Our extensive analysis of 151 Transformer-based models spanning state-of-the-art large vision models, large language models, and large multimodal models uncovers a continuous arc-shaped geometry within this space, reflecting a gradual increase of brain-likeness; different models exhibit distinct distribution patterns within this geometry associated with different degrees of brain-likeness, shaped not merely by their modality but by whether the pretraining paradigm emphasizes global semantic abstraction and whether the positional encoding scheme facilitates deep fusion across different modalities. Moreover, the degree of brain-likeness for a model and its downstream task performance are not “identical twins”. The Brain-like Space provides the first unified framework for situating, quantifying, and comparing intelligence across domains, revealing the deep organizational principles that bridge machines and the brain.
zh
[AI-29] VDSAgents: A PCS-Guided Multi-Agent System for Veridical Data Science Automation
【速读】:该论文旨在解决当前由大语言模型(Large Language Models, LLMs)驱动的数据科学自动化系统缺乏科学理论指导、导致可信度和鲁棒性不足的问题,尤其是在处理噪声复杂的真实世界数据时表现受限。其解决方案的关键在于提出VDSAgents——一个基于可预测性-可计算性-稳定性(Predictability-Computability-Stability, PCS)原则的多智能体系统,通过模块化流程实现数据清洗、特征工程、建模与评估,并在每个阶段引入扰动分析、单元测试和模型验证机制,从而确保系统的功能正确性和科学可审计性。
链接: https://arxiv.org/abs/2510.24339
作者: Yunxuan Jiang (School of Management, Xi’an Jiaotong University),Silan Hu (School of Computing, National University of Singapore),Xiaoning Wang (School of Data Science and Media Intelligence, Communication University of China),Yuanyuan Zhang (Beijing Baixingkefu Network Technology Co., Ltd.),Xiangyu Chang (School of Management, Xi’an Jiaotong University)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 6 figures. Yunxuan Jiang and Silan Hu contributed equally. Code available at this https URL
Abstract:Large language models (LLMs) become increasingly integrated into data science workflows for automated system design. However, these LLM-driven data science systems rely solely on the internal reasoning of LLMs, lacking guidance from scientific and theoretical principles. This limits their trustworthiness and robustness, especially when dealing with noisy and complex real-world datasets. This paper presents VDSAgents, a multi-agent system grounded in the Predictability-Computability-Stability (PCS) principles proposed in the Veridical Data Science (VDS) framework. Guided by PCS principles, the system implements a modular workflow for data cleaning, feature engineering, modeling, and evaluation. Each phase is handled by a dedicated agent, incorporating perturbation analysis, unit testing, and model validation to ensure both functionality and scientific auditability. We evaluate VDSAgents on nine datasets with diverse characteristics, comparing it with state-of-the-art end-to-end data science systems, such as AutoKaggle and DataInterpreter, using DeepSeek-V3 and GPT-4o as backends. VDSAgents consistently outperforms AutoKaggle and DataInterpreter, which validates the feasibility of embedding PCS principles into LLM-driven data science automation.
zh
[AI-30] Generative Large Language Models (gLLMs) in Content Analysis: A Practical Guide for Communication Research
【速读】:该论文旨在解决生成式大语言模型(Generative Large Language Models, gLLMs)在传播学定量内容分析中应用时所面临的七项关键方法论挑战,包括代码本开发、提示工程、模型选择、参数调优、迭代优化、模型可靠性验证及性能提升等问题,以推动gLLMs从技术潜力向可信赖、可复现的科研实践转化。其解决方案的关键在于提出一套系统性的最佳实践指南,涵盖从任务设计到结果验证的全流程规范,确保基于gLLMs的内容分析符合传播学研究对效度、信度、可重复性和伦理标准的基本要求。
链接: https://arxiv.org/abs/2510.24337
作者: Daria Kravets-Meinke,Hannah Schmid-Petri,Sonja Niemann,Ute Schmid
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Generative Large Language Models (gLLMs), such as ChatGPT, are increasingly being used in communication research for content analysis. Studies show that gLLMs can outperform both crowd workers and trained coders, such as research assistants, on various coding tasks relevant to communication science, often at a fraction of the time and cost. Additionally, gLLMs can decode implicit meanings and contextual information, be instructed using natural language, deployed with only basic programming skills, and require little to no annotated data beyond a validation dataset - constituting a paradigm shift in automated content analysis. Despite their potential, the integration of gLLMs into the methodological toolkit of communication research remains underdeveloped. In gLLM-assisted quantitative content analysis, researchers must address at least seven critical challenges that impact result quality: (1) codebook development, (2) prompt engineering, (3) model selection, (4) parameter tuning, (5) iterative refinement, (6) validation of the model’s reliability, and optionally, (7) performance enhancement. This paper synthesizes emerging research on gLLM-assisted quantitative content analysis and proposes a comprehensive best-practice guide to navigate these challenges. Our goal is to make gLLM-based content analysis more accessible to a broader range of communication researchers and ensure adherence to established disciplinary quality standards of validity, reliability, reproducibility, and research ethics.
zh
[AI-31] Transformers can do Bayesian Clustering
【速读】:该论文旨在解决大规模数据下贝叶斯聚类的计算效率低以及真实世界数据中缺失值处理不当导致不确定性被忽略的问题。其核心解决方案是提出Cluster-PFN,一种基于Transformer架构的模型,它扩展了先验-数据拟合网络(Prior-Data Fitted Networks, PFNs),通过在由有限高斯混合模型(Gaussian Mixture Model, GMM)先验生成的合成数据上训练,学习估计聚类数量和簇分配的后验分布;该方法不仅在聚类数量估计上优于AIC、BIC和变分推断(Variational Inference, VI)等传统模型选择方法,且在聚类质量上可与VI相媲美,同时速度提升数个数量级,并能直接处理含缺失数据的复杂先验,在高缺失率的真实基因组数据集上显著优于基于插补的基线方法。
链接: https://arxiv.org/abs/2510.24318
作者: Prajit Bhaskaran,Tom Viering
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets, at high missingness. These results show that the Cluster-PFN can provide scalable and flexible Bayesian clustering.
zh
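Cluster-PFN 完全在"从有限 GMM 先验采样的合成数据"上训练。下面给出这一数据生成器的最小示意(先验的各项超参数均为本文假设):先采样簇数 K、混合权重、均值与尺度,再采样数据与隐变量分配。

```python
import numpy as np

def sample_gmm_dataset(rng, n=200, dim=2, max_k=5):
    """Draw one synthetic clustering task from a finite-GMM prior:
    first sample K, mixture weights, means, scales; then the data."""
    k = rng.integers(1, max_k + 1)
    weights = rng.dirichlet(np.ones(k))
    means = rng.normal(0.0, 3.0, size=(k, dim))
    scales = rng.uniform(0.3, 1.0, size=k)
    z = rng.choice(k, size=n, p=weights)            # 隐变量:簇分配
    X = means[z] + rng.normal(size=(n, dim)) * scales[z][:, None]
    return X, z, k

rng = np.random.default_rng(0)
X, z, k = sample_gmm_dataset(rng)
print(f"K={k}, X shape={X.shape}, cluster sizes={np.bincount(z)}")
```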
[AI-32] Retrieval and Argumentation Enhanced Multi-Agent LLM s for Judgmental Forecasting
【速读】:该论文旨在解决判断性预测(Judgmental Forecasting)中的可信度评估问题,即如何基于人类判断对未来的事件进行预测,并将其视为一种主张验证任务——通过评估某一未来事件的合理性来做出决策。其核心挑战在于如何有效整合来自不同来源的证据以提升预测准确性并保持可解释性。解决方案的关键在于提出了一种多智能体框架,其中多个代理(agents)可以就主张的真实性产生分歧,并分别提供支持或反对该主张的定量双极论证框架(Quantitative Bipolar Argumentation Frameworks, QBAFs)。该框架通过三种不同机制实现:ArgLLM代理生成和评估QBAFs;RbAM代理利用外部信息源挖掘关系型论据;RAG-ArgLLM代理则结合检索增强生成(Retrieval-Augmented Generation, RAG)扩展论据来源。实验表明,融合多个代理提供的证据能够显著提升预测精度,尤其在三代理组合时效果更优,且能提供透明、可解释的推理过程。
链接: https://arxiv.org/abs/2510.24303
作者: Deniz Gorur,Antoni Rago,Francesca Toni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
[AI-33] Verifying Large Language Models Reasoning Paths via Correlation Matrix Rank
[Quick Read]: This paper tackles the problem that large language models (LLMs) are prone to errors and hallucinations during reasoning, and asks how to verify the credibility of their outputs efficiently and effectively. Existing methods typically rely on external resources such as trained verifiers or elaborate prompts, incurring high computational overhead and limited domain applicability. The key finding is that the rank of the correlation matrix between the input problem and the output reasoning path is a robust intrinsic indicator of correctness; it is computed solely from the LLM's own behavior, with no extra training or complicated prompt design. Building on this, the authors propose a simple, plug-and-play Self-Indicator method that reweights candidate reasoning paths, achieving over 75% accuracy in distinguishing correct from incorrect reasoning paths and improving accuracy on three reasoning benchmarks by more than 8%.
Link: https://arxiv.org/abs/2510.24299
Authors: Jiayu Liu, Wei Dai, Zhenya Huang, Ning Miao, Enhong Chen
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Despite the strong reasoning ability of large language models (LLMs), they are prone to errors and hallucinations. As a result, how to check their outputs effectively and efficiently has become a critical problem in their applications. Existing checking methods heavily rely on external resources, such as trained verifiers (e.g., process/outcome reward models) or elaborate prompts, which lead to high computational overhead and are only applicable to specific domains. In this paper, we investigate whether the internal behaviors of LLMs have already implied the credibility of their reasoning paths. Specifically, we find that the rank of the correlation matrix between the input problem and the output reasoning path is a robust indicator of reasoning correctness. Different from other correctness indicators for LLMs, the calculation of the correlation matrix only relies on the LLM itself, which avoids the hassle of training a separate model or designing complicated prompts. Based on it, we design a simple, plug-and-play Self-Indicator method to reweight candidate reasoning paths, which achieves significant performance improvements over other voting and verification methods with very little computational overhead. Our experiments across multiple LLMs of varying scales and model families have further shown the effectiveness of Self-Indicator. It achieves over 75% accuracy in distinguishing correct reasoning paths from incorrect ones, and, in turn, improves the accuracies on three reasoning benchmarks by more than 8%.
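To make the indicator concrete, here is a minimal numpy sketch of a rank-based score over hidden states. How the paper builds its correlation matrix from the LLM's internal states is not detailed above, so the cosine-correlation construction, the absolute singular-value threshold `tau`, and the toy subspace setup are all illustrative assumptions rather than the authors' exact procedure:

```python
import numpy as np

def correlation_rank(H_in, H_out, tau=0.1):
    """Numerical rank of the cosine-correlation matrix between input-token
    and output-token hidden states (rows = tokens, cols = hidden dims)."""
    A = H_in / (np.linalg.norm(H_in, axis=1, keepdims=True) + 1e-12)
    B = H_out / (np.linalg.norm(H_out, axis=1, keepdims=True) + 1e-12)
    C = A @ B.T                                   # (n_in, n_out) correlations
    s = np.linalg.svd(C, compute_uv=False)
    return int((s > tau).sum())

# Toy check: a reasoning path whose states share the question's subspace
# scores a high rank; one lying in an orthogonal subspace scores zero.
rng = np.random.default_rng(0)
basis_q, basis_o = np.eye(64)[:32], np.eye(64)[32:]
H_q    = rng.normal(size=(24, 32)) @ basis_q      # question states
path_a = rng.normal(size=(40, 32)) @ basis_q      # aligned with the question
path_b = rng.normal(size=(40, 32)) @ basis_o      # unrelated directions
print(correlation_rank(H_q, path_a), correlation_rank(H_q, path_b))  # e.g. 24 and 0
```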
[AI-34] Investigating Intra-Abstraction Policies For Non-exact Abstraction Algorithms
[Quick Read]: This paper addresses the low sample efficiency of Monte Carlo Tree Search (MCTS), focusing on how to better exploit the information sharing that state and/or action abstractions enable within abstract nodes. Conventional approaches such as pruned On the Go Abstractions (pruned OGA) only aggregate visit counts and returns to update Upper Confidence Bound (UCB) values, overlooking the case where several child actions fall into the same abstract node and therefore share an identical UCB value, which makes the choice among them ambiguous. The key contribution is to propose and empirically evaluate several alternative intra-abstraction policies, several of which outperform the implicitly used random tiebreak rule across a majority of environments and parameter settings, improving the performance and stability of MCTS.
Link: https://arxiv.org/abs/2510.24297
Authors: Robin Schmöcker, Alexander Dockhorn, Bodo Rosenhahn
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:One weakness of Monte Carlo Tree Search (MCTS) is its sample efficiency which can be addressed by building and using state and/or action abstractions in parallel to the tree search such that information can be shared among nodes of the same layer. The primary usage of abstractions for MCTS is to enhance the Upper Confidence Bound (UCB) value during the tree policy by aggregating visits and returns of an abstract node. However, this direct usage of abstractions does not take the case into account where multiple actions with the same parent might be in the same abstract node, as these would then all have the same UCB value, thus requiring a tiebreak rule. In state-of-the-art abstraction algorithms such as pruned On the Go Abstractions (pruned OGA), this case has not been noticed, and a random tiebreak rule was implicitly chosen. In this paper, we propose and empirically evaluate several alternative intra-abstraction policies, several of which outperform the random policy across a majority of environments and parameter settings.
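The tiebreak issue is easy to see in code. The sketch below is a minimal UCB selection over abstract nodes with a pluggable intra-abstraction policy; the `least_visited` rule is one plausible alternative to the implicit random rule, not necessarily one of the paper's evaluated policies:

```python
import math, random
from collections import defaultdict

class AbstractStats:
    def __init__(self):
        self.visits = defaultdict(int)     # abstract id -> aggregated visits
        self.returns = defaultdict(float)  # abstract id -> aggregated return

def ucb(stats, abs_id, total_visits, c=1.4):
    n = stats.visits[abs_id]
    if n == 0:
        return float("inf")
    return stats.returns[abs_id] / n + c * math.sqrt(math.log(total_visits) / n)

def select_action(actions, abstraction, stats, node_visits, local_visits,
                  intra_policy="random"):
    # 1) UCB over aggregated abstract statistics picks the best abstract node.
    best = max(actions, key=lambda a: ucb(stats, abstraction[a], node_visits))
    tied = [a for a in actions if abstraction[a] == abstraction[best]]
    # 2) The intra-abstraction policy breaks the tie among its concrete actions.
    if intra_policy == "random":           # the implicit choice in pruned OGA
        return random.choice(tied)
    if intra_policy == "least_visited":    # e.g., favor under-explored actions
        return min(tied, key=lambda a: local_visits[a])
    raise ValueError(intra_policy)

# Usage: actions 0 and 1 share abstract node "A"; local visits steer the tiebreak.
abstraction = {0: "A", 1: "A", 2: "B"}
stats = AbstractStats(); stats.visits.update({"A": 10, "B": 5})
stats.returns.update({"A": 9.0, "B": 2.0})
print(select_action([0, 1, 2], abstraction, stats, 15, {0: 7, 1: 3, 2: 5},
                    intra_policy="least_visited"))  # -> 1
```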
[AI-35] MCP-Flow: Facilitating LLM Agents to Master Real-World Diverse and Scaling MCP Tools
[Quick Read]: This paper addresses the limited ability of Large Language Models (LLMs) to exploit the rapidly expanding Model Contextual Protocol (MCP) ecosystem: existing MCP research covers few servers, depends on costly manual curation, and lacks training support. The key is MCP-Flow, an automated web-agent-driven pipeline that unifies large-scale server discovery, data synthesis, and model training. By automatically collecting and filtering data from 1166 servers and 11536 tools, it produces 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity, and markedly improves LLMs' MCP tool selection, function-call generation, and agentic task performance in real-world MCP environments.
Link: https://arxiv.org/abs/2510.24284
Authors: Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, Siheng Chen
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow's effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents' proficiency in real-world MCP environments. MCP-Flow is publicly available at this https URL.
[AI-36] Survey and Tutorial of Reinforcement Learning Methods in Process Systems Engineering
[Quick Read]: This paper addresses sequential decision making under uncertainty for complex, stochastic systems in Process Systems Engineering (PSE), where traditional methods for control and optimization face limitations. The key is Reinforcement Learning (RL), a data-driven approach in which an agent interacting with its environment automatically learns control policies. The paper delivers a tutorial on the core algorithmic families of RL (value-based, policy-based, and actor-critic methods), surveys their applications across PSE domains such as fed-batch and continuous process control, process optimization, and supply chains, and closes with a PSE-focused discussion of specialized techniques, current challenges, and future research directions.
Link: https://arxiv.org/abs/2510.24272
Authors: Maximilian Bloor, Max Mowbray, Ehecatl Antonio Del Rio Chanona, Calvin Tsay
Institutions: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Notes:
Abstract:Sequential decision making under uncertainty is central to many Process Systems Engineering (PSE) challenges, where traditional methods often face limitations related to controlling and optimizing complex and stochastic systems. Reinforcement Learning (RL) offers a data-driven approach to derive control policies for such challenges. This paper presents a survey and tutorial on RL methods, tailored for the PSE community. We deliver a tutorial on RL, covering fundamental concepts and key algorithmic families including value-based, policy-based and actor-critic methods. Subsequently, we survey existing applications of these RL techniques across various PSE domains, such as in fed-batch and continuous process control, process optimization, and supply chains. We conclude with a PSE-focused discussion of specialized techniques and emerging directions. By synthesizing the current state of RL algorithm development and its implications for PSE, this work identifies successes, challenges, and trends, and outlines avenues for future research at the interface of these fields.
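As a concrete instance of the value-based family the tutorial covers, the following toy shows tabular Q-learning on a simplified level-control task; the plant model, reward, and discretization are invented for illustration and are far simpler than a real PSE problem:

```python
import numpy as np

# Minimal value-based RL sketch: tabular Q-learning on a toy level-control
# task (states = discretized tank level, actions = valve down/hold/up).
rng = np.random.default_rng(0)
n_states, actions = 11, (-1, 0, +1)        # level bins 0..10, setpoint at bin 5
Q = np.zeros((n_states, len(actions)))
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(2000):
    s = rng.integers(n_states)
    for t in range(50):
        a = rng.integers(3) if rng.random() < eps else int(Q[s].argmax())
        # Plant model: the action shifts the level, plus a random disturbance.
        s2 = int(np.clip(s + actions[a] + rng.choice([-1, 0, 1], p=[.1, .8, .1]),
                         0, n_states - 1))
        r = -abs(s2 - 5)                    # penalize deviation from the setpoint
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

greedy = [actions[int(Q[s].argmax())] for s in range(n_states)]
print(greedy)  # expected: +1 below the setpoint, -1 above it, ~0 near it
```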
[AI-37] Enabling Near-realtime Remote Sensing via Satellite-Ground Collaboration of Large Vision-Language Models
[Quick Read]: This paper addresses two obstacles to deploying large vision-language models (LVLMs) for remote sensing on low Earth orbit (LEO) satellites: limited onboard computing resources and brief satellite-ground contact windows. The key is Grace, a satellite-ground collaborative system with two main components: (1) asynchronous satellite-ground Retrieval-Augmented Generation (RAG), which synchronizes the ground station (GS) RAG knowledge archive to the satellite with a tailored adaptive update algorithm during the limited satellite-ground data exchange period; and (2) a confidence-based task dispatch algorithm that decides, based on inference confidence, whether a task is processed onboard or offloaded to the GS, enabling near-realtime inference. Experiments on real-world satellite orbital data show that Grace reduces average latency by 76-95% compared with state-of-the-art methods without compromising inference accuracy.
Link: https://arxiv.org/abs/2510.24242
Authors: Zihan Li, Jiahao Yang, Yuxin Zhang, Zhe Chen, Yue Gao
Institutions: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 15 pages, 11 figures
Abstract:Large vision-language models (LVLMs) have recently demonstrated great potential in remote sensing (RS) tasks (e.g., disaster monitoring) conducted by low Earth orbit (LEO) satellites. However, their deployment in real-world LEO satellite systems remains largely unexplored, hindered by limited onboard computing resources and brief satellite-ground contacts. We propose Grace, a satellite-ground collaborative system designed for near-realtime LVLM inference in RS tasks. Accordingly, we deploy compact LVLMs on satellites for realtime inference, but larger ones on ground stations (GSs) to guarantee end-to-end performance. Grace comprises two main phases: asynchronous satellite-GS Retrieval-Augmented Generation (RAG), and a task dispatch algorithm. Firstly, we distill the knowledge archive of the GS RAG into the satellite archive with a tailored adaptive update algorithm during the limited satellite-ground data exchange period. Secondly, we propose a confidence-based test algorithm that either processes the task onboard the satellite or offloads it to the GS. Extensive experiments based on real-world satellite orbital data show that Grace reduces the average latency by 76-95% compared to state-of-the-art methods, without compromising inference accuracy.
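A minimal sketch of the onboard/offload decision, assuming a softmax-confidence score and an illustrative threshold of 0.8 (the abstract does not specify the confidence measure Grace uses):

```python
import numpy as np

def confidence(logits: np.ndarray) -> float:
    """Top-1 softmax probability as a simple confidence score."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())

def dispatch(task, onboard_model, gs_model, threshold=0.8):
    logits = onboard_model(task)
    if confidence(logits) >= threshold:
        return "onboard", int(np.argmax(logits))        # answer without contacting the GS
    return "offloaded", int(np.argmax(gs_model(task)))  # defer to the larger GS model

# Toy models: the onboard one is noisy, the GS one is sharp.
rng = np.random.default_rng(1)
onboard = lambda x: x + rng.normal(0, 1.0, size=x.shape)
gs      = lambda x: x * 4.0
task_logits = np.array([0.2, 0.3, 2.5])
print(dispatch(task_logits, onboard, gs))
```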
[AI-38] MAGNET: A Multi-Graph Attentional Network for Code Clone Detection
[Quick Read]: This paper addresses the limited accuracy of code clone detection when only a single representation (abstract syntax tree, AST; control flow graph, CFG; or data flow graph, DFG) is used, since each captures only part of a program's semantics; existing hybrid approaches fuse multiple graphs with handcrafted and often ineffective strategies. The key is MAGNET, a multi-graph attentional framework: residual graph neural networks with node-level self-attention learn both local and long-range dependencies; a gated cross-attention mechanism enables fine-grained inter-graph interaction; and Set2Set pooling fuses the multi-graph embeddings into a unified program-level representation, substantially improving clone-detection accuracy.
Link: https://arxiv.org/abs/2510.24241
Authors: Zixian Zhang, Takfarinas Saber
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes:
Abstract:Code clone detection is a fundamental task in software engineering that underpins refactoring, debugging, plagiarism detection, and vulnerability analysis. Existing methods often rely on singular representations such as abstract syntax trees (ASTs), control flow graphs (CFGs), and data flow graphs (DFGs), which capture only partial aspects of code semantics. Hybrid approaches have emerged, but their fusion strategies are typically handcrafted and ineffective. In this study, we propose MAGNET, a multi-graph attentional framework that jointly leverages AST, CFG, and DFG representations to capture syntactic and semantic features of source code. MAGNET integrates residual graph neural networks with node-level self-attention to learn both local and long-range dependencies, introduces a gated cross-attention mechanism for fine-grained inter-graph interactions, and employs Set2Set pooling to fuse multi-graph embeddings into unified program-level representations. Extensive experiments on BigCloneBench and Google Code Jam demonstrate that MAGNET achieves state-of-the-art performance with an overall F1 score of 96.5% and 99.2% on the two datasets, respectively. Ablation studies confirm the critical contributions of multi-graph fusion and each attentional component. Our code is available at this https URL
[AI-39] PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
[Quick Read]: This paper addresses two limitations of current generative reward models (GRMs) in reinforcement learning from human feedback (RLHF): pairwise training relies on binary good-versus-bad labels, causing a mismatch at point-wise inference time and requiring complex pairing strategies, while point-wise methods depend on exhaustive absolute scoring with rubric-driven annotation, which adapts poorly and is expensive to label. The key is the Preference-Aware Task-Adaptive Reward Model (PaTaRM): a Preference-Aware Reward (PAR) mechanism builds robust point-wise training signals from pairwise data, removing the need for explicit point-wise labels, while a task-adaptive rubric system dynamically generates evaluation criteria for both global task consistency and instance-level fine-grained reasoning, yielding efficient, generalizable, and interpretable reward modeling for RLHF.
Link: https://arxiv.org/abs/2510.24235
Authors: Ai Jian, Jingqing Ruan, Xing Ma, Dailin Li, QianLin Zhou, Ke Zeng, Xunliang Cai
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at this https URL.
[AI-40] Closing Gaps: An Imputation Analysis of ICU Vital Signs
[Quick Read]: This paper addresses how poor data quality in Intensive Care Units (ICUs) limits clinical prediction with machine learning (ML): vital-sign measurements contain sizeable missing segments, and ad-hoc handling of these gaps degrades prediction performance while ignoring uncertainty. The key is a systematic comparison of 15 existing time-series imputation methods and 4 amputation methods within an extensible, reusable benchmark built for major ICU datasets, aimed at identifying best-practice imputation strategies, improving clinical prediction performance, and helping bring more models into clinical practice.
Link: https://arxiv.org/abs/2510.24217
Authors: Alisher Turubayev, Anna Shopova, Fabian Lange, Mahmut Kamalak, Paul Mattes, Victoria Ayvasky, Bert Arnrich, Bjarne Pfitzner, Robin P. van de Water
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Preprint
Abstract:As more Intensive Care Unit (ICU) data becomes available, the interest in developing clinical prediction models to improve healthcare protocols increases. However, the lack of data quality still hinders clinical prediction using Machine Learning (ML). Many vital sign measurements, such as heart rate, contain sizeable missing segments, leaving gaps in the data that could negatively impact prediction performance. Previous works have introduced numerous time-series imputation techniques. Nevertheless, more comprehensive work is needed to compare a representative set of methods for imputing ICU vital signs and determine the best practice. In reality, ad-hoc imputation techniques that could decrease prediction accuracy, like zero imputation, are still used. In this work, we compare established imputation techniques to guide researchers in improving the performance of clinical prediction models by selecting the most accurate imputation technique. We introduce an extensible and reusable benchmark with currently 15 imputation and 4 amputation methods, created for benchmarking on major ICU datasets. We hope to provide a comparative basis and facilitate further ML development to bring more models into clinical practice.
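The amputate-impute-score loop at the heart of such a benchmark can be sketched in a few lines; the four baselines below are simple stand-ins for the benchmark's 15 methods, run on a synthetic heart-rate series:

```python
import numpy as np
import pandas as pd

# Mask segments of a vital-sign series (amputation), impute with several
# simple baselines, and score RMSE on the masked entries only.
rng = np.random.default_rng(0)
t = np.arange(500)
hr = 70 + 5 * np.sin(t / 30) + rng.normal(0, 1, t.size)   # synthetic heart rate

mask = np.zeros(t.size, dtype=bool)
for start in rng.integers(0, t.size - 20, size=8):        # ampute 8 gaps of 20 samples
    mask[start:start + 20] = True

observed = pd.Series(np.where(mask, np.nan, hr))
imputers = {
    "zero":        observed.fillna(0.0),
    "mean":        observed.fillna(observed.mean()),
    "ffill":       observed.ffill().bfill(),
    "interpolate": observed.interpolate(limit_direction="both"),
}
for name, imputed in imputers.items():
    rmse = np.sqrt(np.mean((imputed.to_numpy()[mask] - hr[mask]) ** 2))
    print(f"{name:12s} RMSE = {rmse:.2f}")   # zero imputation should fare worst
```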
[AI-41] SymMaP: Improving Computational Efficiency in Linear Solvers through Symbolic Preconditioning
[Quick Read]: This paper addresses the limited efficiency and adaptability of parameter selection in matrix preconditioning: traditional approaches fix constants using domain expertise and cannot adapt parameters to instance-wise problem features, while existing machine-learning approaches suffer from high inference cost and poor interpretability. The key is a symbolic discovery framework, Symbolic Matrix Preconditioning (SymMaP), which uses a neural network to search a high-dimensional discrete space for symbolic expressions that accurately predict optimal preconditioning parameters. The learned expressions enable efficient inference and excellent interpretability (concise symbolic formulas), making deployment simple and reliable, and consistently outperform traditional strategies across benchmarks.
Link: https://arxiv.org/abs/2510.24170
Authors: Hong Wang, Jie Wang, Minghao Ma, Haoran Shao, Haoyang Liu
Institutions: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Notes:
Abstract:Matrix preconditioning is a critical technique to accelerate the solution of linear systems, where performance heavily depends on the selection of preconditioning parameters. Traditional parameter selection approaches often define fixed constants for specific scenarios. However, they rely on domain expertise and fail to consider the instance-wise features of individual problems, limiting their performance. In contrast, machine learning (ML) approaches, though promising, are hindered by high inference costs and limited interpretability. To combine the strengths of both approaches, we propose a symbolic discovery framework, namely Symbolic Matrix Preconditioning (SymMaP), to learn efficient symbolic expressions for preconditioning parameters. Specifically, we employ a neural network to search the high-dimensional discrete space for expressions that can accurately predict the optimal parameters. The learned expression allows for high inference efficiency and excellent interpretability (expressed in concise symbolic formulas), making it simple and reliable for deployment. Experimental results show that SymMaP consistently outperforms traditional strategies across various benchmarks.
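To illustrate the idea of a symbolic expression replacing a fixed constant, the sketch below feeds a cheap instance-wise matrix feature through a made-up formula to set the relaxation parameter of damped Jacobi; SymMaP's actual discovered expressions and target preconditioners may differ:

```python
import numpy as np

def damped_jacobi(A, b, omega, iters=500, tol=1e-8):
    """Damped Jacobi: x_{k+1} = x_k + omega * D^{-1} (b - A x_k)."""
    x, d = np.zeros_like(b), np.diag(A)
    for k in range(iters):
        r = b - A @ x
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k
        x = x + omega * r / d
    return x, iters

rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=(n, n)) * 0.05
A = A + A.T + np.diag(np.abs(A).sum(axis=1) + 1.0)   # diagonally dominant test matrix
b = rng.normal(size=n)

feature = np.abs(np.diag(A)).mean() / np.abs(A).mean()   # instance-wise feature
omega_symbolic = 2.0 / (1.0 + 0.01 * feature)            # hypothetical learned formula
for omega in (0.5, omega_symbolic, 1.0):
    _, iters = damped_jacobi(A, b, omega)
    print(f"omega={omega:.3f} -> {iters} iterations")    # parameter choice matters
```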
[AI-42] MGA: Memory-Driven GUI Agent for Observation-Centric Interaction WWW2025
[Quick Read]: This paper addresses two core challenges for GUI agents interacting with complex desktop and web interfaces: error propagation caused by dependence on historical trajectories, and local exploration bias induced by "decision-first, observation-later" mechanisms. The key is the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of "observe first, then decide" and models each step as an independent, context-rich environment state represented by a triad: the current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. This design yields substantial gains in robustness, generalization, and efficiency over state-of-the-art baselines.
Link: https://arxiv.org/abs/2510.24168
Authors: Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, Ding Wang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Submitted to WWW2025
Abstract:The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: dependence on historical trajectories, which amplifies error propagation, and local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: this https URL.
[AI-43] UniPlanner: A Unified Motion Planning Framework for Autonomous Vehicle Decision-Making Systems via Multi-Dataset Integration
[Quick Read]: This paper addresses the limited robustness of current motion-planning methods for autonomous-vehicle decision making, which are typically trained on a single dataset; although deep learning has advanced planning, the lack of cross-dataset generalization restricts applicability in complex, varied real-world scenarios. The key is UniPlanner, which achieves unified multi-dataset learning through three synergistic innovations. First, the History-Future Trajectory Dictionary Network (HFTDN) aggregates history-future trajectory pairs across datasets and uses historical trajectory similarity to retrieve relevant futures, generating cross-dataset planning guidance. Second, the Gradient-Free Trajectory Mapper (GFTM) learns robust history-future correlations from multiple datasets and transforms historical trajectories into universal planning priors; its gradient-free design prevents shortcut learning and makes the planning knowledge safely transferable. Third, the Sparse-to-Dense (S2D) paradigm applies adaptive dropout that selectively suppresses planning priors during training for robust learning, while fully exploiting the priors during inference to maximize planning performance.
Link: https://arxiv.org/abs/2510.24166
Authors: Xin Yang, Yuhang Zhang, Wei Li, Xin Lin, Wenbin Zou, Chen Xu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Motion planning is a critical component of autonomous vehicle decision-making systems, directly determining trajectory safety and driving efficiency. While deep learning approaches have advanced planning capabilities, existing methods remain confined to single-dataset training, limiting their robustness in planning. Through systematic analysis, we discover that vehicular trajectory distributions and history-future correlations demonstrate remarkable consistency across different datasets. Based on these findings, we propose UniPlanner, the first planning framework designed for multi-dataset integration in autonomous vehicle decision-making. UniPlanner achieves unified cross-dataset learning through three synergistic innovations. First, the History-Future Trajectory Dictionary Network (HFTDN) aggregates history-future trajectory pairs from multiple datasets, using historical trajectory similarity to retrieve relevant futures and generate cross-dataset planning guidance. Second, the Gradient-Free Trajectory Mapper (GFTM) learns robust history-future correlations from multiple datasets, transforming historical trajectories into universal planning priors. Its gradient-free design ensures the introduction of valuable priors while preventing shortcut learning, making the planning knowledge safely transferable. Third, the Sparse-to-Dense (S2D) paradigm implements adaptive dropout to selectively suppress planning priors during training for robust learning, while enabling full prior utilization during inference to maximize planning performance.
[AI-44] BLM_1: A Boundless Large Model for Cross-Space Cross-Task and Cross-Embodiment Learning
[Quick Read]: This paper addresses three gaps: multimodal large language models (MLLMs) generalize poorly across digital and physical spaces, vision-language-action models (VLAs) lack robust high-level embodied reasoning, and embodied large language models (ELLMs) are largely confined to the digital space and transfer poorly to the physical world. The key is the Boundless Large Model (BLM_1), a unified cross-space embodied foundation model trained with a two-stage paradigm that delivers three capabilities: cross-space transfer, cross-task learning, and cross-embodiment generalization. Stage I injects embodied knowledge through curated digital corpora while preserving language competence; Stage II trains a policy module via an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the backbone. Evaluated across four robot embodiments and six progressively challenging tasks, a single BLM_1 instance outperforms four model families (MLLMs, ELLMs, VLAs, and generalized embodied models, GMLMs) on both digital and physical benchmarks.
Link: https://arxiv.org/abs/2510.24161
Authors: Wentao Tan, Bowen Wang, Heng Zhi, Chenyu Liu, Zhe Li, Jian Liu, Zengrong Lin, Yukun Dai, Yipeng Chen, Wenjie Yang, Enci Xie, Hao Xue, Baixu Ji, Chen Xu, Zhibin Wang, Tianshi Wang, Lei Zhu, Heng Tao Shen
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Robotics (cs.RO)
Notes:
Abstract:Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital space with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the Boundless Large Model (BLM_1), a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM_1 integrates three key capabilities, namely cross-space transfer, cross-task learning, and cross-embodiment generalization, via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM_1 instance outperforms four model families (MLLMs, ELLMs, VLAs, and GMLMs), achieving ~6% gains in digital tasks and ~3% in physical tasks.
[AI-45] BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data
[Quick Read]: This paper addresses the difficulty of building high-quality, training-ready multi-hop question answering (QA) datasets, in particular questions that are retrieval-resistant (the answer cannot be found with a single direct query) yet verifiable (answerable by combining multi-source evidence), suitable for supervised fine-tuning (SFT) and reinforcement learning (RL). Existing datasets are mostly built for evaluation rather than training, and manually curating non-trivially retrievable questions is prohibitively expensive and unscalable, creating a key bottleneck for training capable retrieval-and-reasoning agents. The key is an automated framework that (i) grows diverse, logically labeled evidence clusters via Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline combining multi-model consensus filtering with structured constraint decomposition and evidence-based matching, yielding complex, retrieval-resistant yet verifiable questions at scale for both training and evaluation.
Link: https://arxiv.org/abs/2510.24151
Authors: Bingsen Qiu, Zijian Liu, Xiao Liu, Haoshen Yang, Zeren Gao, Bingjie Wang, Feier Zhang, Yixuan Qin, Chunyan Li
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Building training-ready multi-hop question answering (QA) datasets that truly stress a model's retrieval and reasoning abilities remains highly challenging. While there have been a few recent evaluation datasets that capture the characteristics of hard-to-search but easy-to-verify problems, requiring the integration of ambiguous, indirect, and cross-domain cues, these data resources remain scarce and are mostly designed for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions, where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence, incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents. To address this, we present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.
[AI-46] From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
[Quick Read]: This paper addresses the inefficiency and error-proneness of manual incident management (IM) in large-scale cloud systems, as well as the poor cross-system generalization, weak interpretability, and high deployment cost of existing automated IM approaches. The key is OpsAgent, a lightweight, self-evolving multi-agent system: a training-free data processor converts heterogeneous observability data into structured textual descriptions; a multi-agent collaboration framework makes diagnostic inference transparent and auditable; and a dual self-evolution mechanism couples internal model updates with external experience accumulation, closing the deployment loop and supporting continual capability growth for sustainable long-term operation in real-world cloud systems.
Link: https://arxiv.org/abs/2510.24145
Authors: Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Jielong Huang, Nan Qi, Dan Pei
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.
[AI-47] LagMemo: Language 3D Gaussian Splatting Memory for Multi-modal Open-vocabulary Multi-goal Visual Navigation
[Quick Read]: This paper addresses the limitations of classical visual navigation, which is typically restricted to single-goal, single-modality, closed-vocabulary settings and thus ill-suited to real-world demands for multi-modal, open-vocabulary, multi-goal queries. The key is LagMemo, a navigation system that builds a unified 3D memory with language 3D Gaussian Splatting during exploration; given incoming task goals, the system queries the memory to predict candidate goal locations and integrates a local perception-based verification mechanism to dynamically match and validate goals during navigation, enabling effective multi-goal visual navigation under multi-modal, open-vocabulary queries.
Link: https://arxiv.org/abs/2510.24118
Authors: Haotian Zhou, Xiaole Wang, He Li, Fusheng Sun, Shengyu Guo, Guolei Qi, Jianghuan Xu, Huijing Zhao
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:
Abstract:Navigating to a designated goal using visual information is a fundamental capability for intelligent robots. Most classical visual navigation methods are restricted to single-goal, single-modality, and closed set goal settings. To address the practical demands of multi-modal, open-vocabulary goal queries and multi-goal visual navigation, we propose LagMemo, a navigation system that leverages a language 3D Gaussian Splatting memory. During exploration, LagMemo constructs a unified 3D language memory. With incoming task goals, the system queries the memory, predicts candidate goal locations, and integrates a local perception-based verification mechanism to dynamically match and validate goals during navigation. For fair and rigorous evaluation, we curate GOAT-Core, a high-quality core split distilled from GOAT-Bench tailored to multi-modal open-vocabulary multi-goal visual navigation. Experimental results show that LagMemo’s memory module enables effective multi-modal open-vocabulary goal localization, and that LagMemo outperforms state-of-the-art methods in multi-goal visual navigation. Project page: this https URL
[AI-48] HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology
[Quick Read]: This paper addresses clinicians' lack of trust in artificial intelligence (AI), which typically acts as a black box in pathology diagnosis. The key is HistoLens, a transparent, interactive AI assistant: a pathologist can ask questions about a tissue slide in plain English, which the system translates into precise queries for its AI engine; whenever the doctor asks "why", HistoLens instantly produces a heatmap pointing to the exact cells and regions the AI used for its analysis, providing visual evidence; and the model is trained to focus on the patient's tissue rather than background noise, mirroring trained-pathologist practice. The result is a workflow in which the pathologist remains the expert in charge, with AI as a trustworthy collaborative assistant for faster, more confident diagnoses.
Link: https://arxiv.org/abs/2510.24115
Authors: Sandeep Vissapragada, Vikrant Sahu, Gagan Raj Gupta, Vandita Singh
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:For doctors to truly trust artificial intelligence, it can't be a black box. They need to understand its reasoning, almost as if they were consulting a colleague. We created HistoLens to be that transparent, collaborative partner. It allows a pathologist to simply ask a question in plain English about a tissue slide, just as they would ask a trainee. Our system intelligently translates this question into a precise query for its AI engine, which then provides a clear, structured report. But it doesn't stop there. If a doctor ever asks, "Why?", HistoLens can instantly provide a "visual proof" for any finding: a heatmap that points to the exact cells and regions the AI used for its analysis. We've also ensured the AI focuses only on the patient's tissue, just like a trained pathologist would, by teaching it to ignore distracting background noise. The result is a workflow where the pathologist remains the expert in charge, using a trustworthy AI assistant to verify their insights and make faster, more confident diagnoses.
[AI-49] aming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators
[Quick Read]: This paper addresses Network-on-Interposer (NoI) latency in heterogeneous chiplet systems caused by on-package memory disaggregation (HBM/DRAM): during large-model inference, parameters and activations move back and forth between HBM/DRAM, injecting large, bursty flows into the interposer that inflate tail latency and violate Service Level Agreements (SLAs) on k-ary n-cube baseline NoI topologies. The key contributions are an Interference Score (IS) that quantifies worst-case slowdown under such memory-driven traffic, a formulation of NoI topology synthesis as a multi-objective optimization (MOO) problem, and PARL (Partition-Aware Reinforcement Learner), a topology generator that balances throughput, latency, and power. PARL-generated topologies reduce contention at the memory cut, meet SLAs, and cap worst-case slowdown at 1.2x while maintaining mean throughput competitive with link-rich meshes.
Link: https://arxiv.org/abs/2510.24113
Authors: Arnav Shukla, Harsh Sharma, Srikant Bharadwaj, Vinayak Abrol, Sujay Deb
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:Heterogeneous chiplet-based systems improve scaling by disaggregating CPUs/GPUs and emerging technologies (HBM/DRAM). However, this on-package disaggregation introduces latency in the Network-on-Interposer (NoI). We observe that in modern large-model inference, parameters and activations routinely move back and forth from HBM/DRAM, injecting large, bursty flows into the interposer. These memory-driven transfers inflate tail latency and violate Service Level Agreements (SLAs) across k-ary n-cube baseline NoI topologies. To address this gap we introduce an Interference Score (IS) that quantifies worst-case slowdown under such memory-driven traffic. We then formulate NoI synthesis as a multi-objective optimization (MOO) problem. We develop PARL (Partition-Aware Reinforcement Learner), a topology generator that balances throughput, latency, and power. PARL-generated topologies reduce contention at the memory cut, meet SLAs, and cut worst-case slowdown to 1.2 times while maintaining competitive mean throughput relative to link-rich meshes. Overall, this reframes NoI design for heterogeneous chiplet accelerators with workload-aware objectives.
[AI-50] Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation NEURIPS2025
[Quick Read]: This paper addresses insufficient cross-modal coherence and audio realism in open-domain video-to-audio generation, where existing methods rely on classifier-based or classifier-free guidance and struggle to model complex audio conditioned on video. The key is MGAudio, a flow-based framework built around model-guided dual-role alignment: the audio-visual encoder serves both as a conditioning module and as a feature aligner, and a dedicated model-guided training objective lets the generative model guide itself, improving cross-modal consistency and audio fidelity. MGAudio achieves state-of-the-art results on VGGSound (reducing FAD to 0.40) and generalizes well to the challenging UnAV-100 benchmark.
Link: https://arxiv.org/abs/2510.24103
Authors: Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung
Institutions: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Notes: accepted by NeurIPS 2025
Abstract:We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: this https URL
[AI-51] Learning Parameterized Skills from Demonstrations NEURIPS2025
[Quick Read]: This paper addresses the automatic discovery of parameterized skills from expert demonstrations, in particular how to learn temporally extended, semantically meaningful, and adaptable skill representations in multitask settings; latent-variable models commonly suffer from degeneracy, producing skills that are indistinct or fail to generalize. DEPS mitigates this by combining temporal variational inference with information-theoretic regularization, enabling end-to-end joint learning of parameterized skill policies and a meta-policy that selects the appropriate discrete skill and its continuous parameters at each timestep. The method learns interpretable parameterized skills (e.g., an object-grasping skill whose continuous arguments define the grasp location) and outperforms multitask and skill-learning baselines on the LIBERO and MetaWorld benchmarks.
Link: https://arxiv.org/abs/2510.24095
Authors: Vedant Gupta, Haotian Fu, Calvin Luo, Yiding Jiang, George Konidaris
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: NeurIPS 2025
Abstract:We present DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization methods, we address the challenge of degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. We empirically show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
[AI-52] Modeling Electric Vehicle Car-Following Behavior: Classical vs Machine Learning Approach
[Quick Read]: This paper addresses the accuracy of modeling electric-vehicle (EV) car-following behavior in real traffic, with the aims of improving traffic safety and supporting intelligent driving systems. The key is a comparison, on real-world EV car-following data, of classical physics-based models (IDM, OVM, OVRV, and a simplified CACC model, calibrated by minimizing RMSE against the data) against a machine-learning Random Forest Regressor that predicts acceleration directly from spacing, speed, and gap type. The Random Forest model achieves markedly lower RMSE than the classical models across medium, long, and extra-long gap settings, demonstrating superior adaptability and accuracy, while CACC performs best among the physics-based models.
Link: https://arxiv.org/abs/2510.24085
Authors: Md. Shihab Uddin, Md Nazmus Shakib, Rahul Bhadani
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:The increasing adoption of electric vehicles (EVs) necessitates an understanding of their driving behavior to enhance traffic safety and develop smart driving systems. This study compares classical and machine learning models for EV car following behavior. Classical models include the Intelligent Driver Model (IDM), Optimum Velocity Model (OVM), Optimal Velocity Relative Velocity (OVRV), and a simplified CACC model, while the machine learning approach employs a Random Forest Regressor. Using a real world dataset of an EV following an internal combustion engine (ICE) vehicle under varied driving conditions, we calibrated classical model parameters by minimizing the RMSE between predictions and real data. The Random Forest model predicts acceleration using spacing, speed, and gap type as inputs. Results demonstrate the Random Forest’s superior accuracy, achieving RMSEs of 0.0046 (medium gap), 0.0016 (long gap), and 0.0025 (extra long gap). Among physics based models, CACC performed best, with an RMSE of 2.67 for long gaps. These findings highlight the machine learning model’s performance across all scenarios. Such models are valuable for simulating EV behavior and analyzing mixed autonomy traffic dynamics in EV integrated environments.
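For reference, here are the two model families side by side: the standard IDM acceleration law and a Random Forest regressed on (spacing, speed, relative speed). The IDM parameter values and the synthetic data are illustrative, not the paper's calibrated setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def idm_accel(v, s, dv, v0=30.0, T=1.5, a_max=1.0, b=2.0, s0=2.0):
    """Intelligent Driver Model: v = speed, s = spacing, dv = v - v_lead."""
    s_star = s0 + v * T + v * dv / (2.0 * np.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** 4 - (s_star / s) ** 2)

rng = np.random.default_rng(0)
v  = rng.uniform(5, 25, 2000)          # follower speed [m/s]
s  = rng.uniform(5, 80, 2000)          # spacing [m]
dv = rng.uniform(-3, 3, 2000)          # speed difference [m/s]
a_true = idm_accel(v, s, dv) + rng.normal(0, 0.05, 2000)   # "measured" acceleration

X = np.column_stack([s, v, dv])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:1500], a_true[:1500])
pred = rf.predict(X[1500:])
rmse = np.sqrt(np.mean((pred - a_true[1500:]) ** 2))
print(f"Random Forest RMSE on held-out samples: {rmse:.4f}")
```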
[AI-53] Covert Surveillance in Smart Devices: A SCOUR Framework Analysis of Youth Privacy Implications
[Quick Read]: This paper addresses the privacy risks posed by smart devices that covertly capture young people's private conversations without explicit informed consent, a problem aggravated by opaque data collection, storage, and sharing practices, especially in smart toys and voice-activated devices aimed at youth. The key is the proposed SCOUR framework (Surveillance mechanisms, Consent and awareness, Operational data flow, Usage and exploitation, and Regulatory and technical safeguards), applied via a PRISMA-guided structured review to systematically identify and analyze privacy risks and to derive strategies for improving regulatory and technical safeguards, balancing privacy and utility for young users.
Link: https://arxiv.org/abs/2510.24072
Authors: Austin Shouli, Yulia Bobkova, Ajay Kumar Shrestha
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: To appear in the IEEE UEMCON 2025 proceedings
Abstract:This paper investigates how smart devices covertly capture private conversations and discusses in depth the implications for youth privacy. Using a structured review guided by the PRISMA methodology, the analysis focuses on privacy concerns, data capture methods, data storage and sharing practices, and proposed technical mitigations. To structure and synthesize findings, we introduce the SCOUR framework, encompassing Surveillance mechanisms, Consent and awareness, Operational data flow, Usage and exploitation, and Regulatory and technical safeguards. Findings reveal that smart devices have been covertly capturing personal data, especially with smart toys and voice-activated smart gadgets built for youth. These issues are worsened by unclear data collection practices and insufficient transparency in smart device applications. Balancing privacy and utility in smart devices is crucial, as youth are becoming more aware of privacy breaches and value their personal data more. Strategies to improve regulatory and technical safeguards are also provided. The review identifies research gaps and suggests future directions. The limitations of this literature review are also explained. The findings have significant implications for policy development and the transparency of data collection for smart devices.
[AI-54] FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic NEURIPS2025
[Quick Read]: This paper addresses the fact that low-bit floating-point (e.g., FP8) quantization yields little speedup for low-rank adaptation (LoRA): FP8 accelerates mainly large-dimensional matrix multiplications, while LoRA's small-dimensional matrices and separate computational paths incur quantization/dequantization overheads that erode the hardware gains. The key is FALQON, which merges LoRA adapters directly into the FP8-quantized backbone during fine-tuning, reformulates the forward and backward computations of the merged adapters to sharply reduce quantization overhead, and introduces a row-wise proxy update mechanism that efficiently integrates substantial updates into the quantized backbone. FALQON achieves roughly a 3x training speedup over existing quantized LoRA methods at a similar accuracy level, and its end-to-end FP8 workflow removes the need for post-training quantization, facilitating efficient deployment.
Link: https://arxiv.org/abs/2510.24061
Authors: Kanghyun Choi, Hyeyoon Lee, SunJong Park, Dain Kwon, Jinho Lee
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: NeurIPS 2025
Abstract:Low-bit floating-point (FP) formats, such as FP8, provide significant acceleration and memory savings in model training thanks to native hardware support on modern GPUs and NPUs. However, we analyze that FP8 quantization offers speedup primarily for large-dimensional matrix multiplications, while inherent quantization overheads diminish speedup when applied to low-rank adaptation (LoRA), which uses small-dimensional matrices for efficient fine-tuning of large language models (LLMs). To address this limitation, we propose FALQON, a novel framework that eliminates the quantization overhead from separate LoRA computational paths by directly merging LoRA adapters into an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the forward and backward computations for merged adapters to significantly reduce quantization overhead, and introduce a row-wise proxy update mechanism that efficiently integrates substantial updates into the quantized backbone. Experimental evaluations demonstrate that FALQON achieves approximately a 3x training speedup over existing quantized LoRA methods with a similar level of accuracy, providing a practical solution for efficient large-scale model fine-tuning. Moreover, FALQON's end-to-end FP8 workflow removes the need for post-training quantization, facilitating efficient deployment. Code is available at this https URL.
[AI-55] SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration
[Quick Read]: This paper addresses the limited diversity of driving scenarios available for training end-to-end autonomous driving (E2E AD) models, which rely on real-world data. Synthetic scenario generation is hard to apply to E2E AD models because synthetic scenarios lack a designated ego vehicle and the corresponding sensor inputs (camera or LiDAR). The key is SynAD: it designates the agent with the most comprehensive driving information in a multi-agent synthetic scenario as the ego vehicle; projects path-level scenarios onto maps and uses a newly developed Map-to-BEV Network to derive bird's-eye-view features without sensor inputs; and devises a training strategy that effectively integrates the map-based synthetic data with real driving data. Experiments show that SynAD integrates all components effectively and notably enhances safety performance, bridging synthetic scenario generation and E2E AD.
Link: https://arxiv.org/abs/2510.24052
Authors: Jongsuk Kim, Jaeyoung Lee, Gyojin Han, Dongjae Lee, Minki Jeong, Junmo Kim
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:
Abstract:Recent advancements in deep learning and the availability of high-quality real-world driving datasets have propelled end-to-end autonomous driving. Despite this progress, relying solely on real-world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real-world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real-world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi-agent synthetic scenario. We further project path-level scenarios onto maps and employ a newly developed Map-to-BEV Network to derive bird’s-eye-view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map-based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving models.
[AI-56] Learning from History: A Retrieval-Augmented Framework for Spatiotemporal Prediction
[Quick Read]: This paper addresses the accuracy of long-term spatiotemporal prediction for complex physical systems: error accumulation during long autoregressive rollouts of deep models often produces physically implausible artifacts. The key is Retrieval-Augmented Prediction (RAP), a hybrid framework that retrieves, from a large-scale historical database, the evolution exemplar most similar to the current state and uses its true future as a dynamic reference target, supplied as a conditional input to a dual-stream neural architecture rather than as a hard constraint in the loss. This provides strong dynamic guidance that steers predictions toward physically viable trajectories and suppresses error divergence in long-term rollouts across meteorology, turbulence, and fire-simulation benchmarks.
Link: https://arxiv.org/abs/2510.24049
Authors: Hao Jia, Penghao Zhao, Hao Wu, Yuan Gao, Yangyu Tao, Bin Cui
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Accurate and long-term spatiotemporal prediction for complex physical systems remains a fundamental challenge in scientific computing. While deep learning models, as powerful parametric approximators, have shown remarkable success, they suffer from a critical limitation: the accumulation of errors during long-term autoregressive rollouts often leads to physically implausible artifacts. This deficiency arises from their purely parametric nature, which struggles to capture the full constraints of a system's intrinsic dynamics. To address this, we introduce a novel Retrieval-Augmented Prediction (RAP) framework, a hybrid paradigm that synergizes the predictive power of deep networks with the grounded truth of historical data. The core philosophy of RAP is to leverage historical evolutionary exemplars as a non-parametric estimate of the system's local dynamics. For any given state, RAP efficiently retrieves the most similar historical analog from a large-scale database. The true future evolution of this analog then serves as a reference target. Critically, this target is not a hard constraint in the loss function but rather a powerful conditional input to a specialized dual-stream architecture. It provides strong dynamic guidance, steering the model's predictions towards physically viable trajectories. In extensive benchmarks across meteorology, turbulence, and fire simulation, RAP not only surpasses state-of-the-art methods but also significantly outperforms a strong analog-only forecasting baseline. More importantly, RAP generates predictions that are more physically realistic by effectively suppressing error divergence in long-term rollouts.
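The retrieval step is simple enough to sketch: find the nearest stored state and hand its true continuation to the predictor as a condition. The nearest-neighbor metric and the toy "shift-a-column" dynamics below are placeholders; RAP's dual-stream network that consumes the reference is not modeled here:

```python
import numpy as np

def retrieve_analog(current, history_states, history_futures):
    """Return the true future of the stored state nearest to `current`."""
    flat = history_states.reshape(len(history_states), -1)
    q = current.reshape(1, -1)
    idx = int(np.argmin(np.linalg.norm(flat - q, axis=1)))   # nearest neighbor
    return history_futures[idx]

rng = np.random.default_rng(0)
H, grid = 500, (16, 16)
states  = rng.normal(size=(H, *grid))                 # historical snapshots
futures = np.roll(states, shift=1, axis=2)            # toy dynamics: shift one column

current   = states[123] + 0.01 * rng.normal(size=grid)  # a state near a stored one
reference = retrieve_analog(current, states, futures)
true_next = np.roll(current, shift=1, axis=1)

gap = np.abs(reference - true_next).mean()
print(f"analog future vs. true future, mean abs gap: {gap:.4f}")  # small: useful guidance
```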
[AI-57] Causal-Aware Generative Adversarial Networks with Reinforcement Learning
[Quick Read]: This paper addresses three weaknesses of existing generative models (especially GAN-based ones) on real-world tabular data: difficulty capturing complex causal relationships, difficulty preserving data utility, and the absence of provable privacy guarantees suitable for enterprise deployment. The key is CA-GAN, a two-step generative framework: causal graph extraction first learns a robust, comprehensive causal structure of the data manifold; a Conditional Wasserstein GAN with Gradient Penalty (WGAN-GP) then generates strictly according to the structure of the nodes in the causal graph; and, crucially, a Reinforcement Learning-based objective aligns the causal graphs constructed from real and synthetic data during both training and sampling, making generation causal-aware. CA-GAN outperforms six SOTA methods across 14 real-world tabular datasets on causal preservation, utility preservation, and privacy preservation.
Link: https://arxiv.org/abs/2510.24046
Authors: Tu Anh Hoang Nguyen, Dang Nguyen, Tri-Nhan Vo, Thuc Duy Le, Sunil Gupta
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:The utility of tabular data for tasks ranging from model training to large-scale data analysis is often constrained by privacy concerns or regulatory hurdles. While existing data generation methods, particularly those based on Generative Adversarial Networks (GANs), have shown promise, they frequently struggle with capturing complex causal relationships, maintaining data utility, and providing provable privacy guarantees suitable for enterprise deployment. We introduce CA-GAN, a novel generative framework specifically engineered to address these challenges for real-world tabular datasets. CA-GAN utilizes a two-step approach: causal graph extraction to learn a robust, comprehensive causal relationship in the data's manifold, followed by a custom Conditional WGAN-GP (Wasserstein GAN with Gradient Penalty) that operates exclusively as per the structure of nodes in the causal graph. More importantly, the generator is trained with a new Reinforcement Learning-based objective that aligns the causal graphs constructed from real and fake data, ensuring causal awareness in both the training and sampling phases. We demonstrate CA-GAN's superiority over six SOTA methods across 14 tabular datasets. Our evaluations focus on core data engineering metrics: causal preservation, utility preservation, and privacy preservation. Our method offers a practical, high-performance solution for data engineers seeking to create high-quality, privacy-compliant synthetic datasets to benchmark database systems, accelerate software development, and facilitate secure data-driven research.
[AI-58] Geometric Algorithms for Neural Combinatorial Optimization with Constraints
[Quick Read]: This paper addresses a central challenge of self-supervised learning (SSL) for combinatorial optimization (CO): handling problems with discrete constraints, for which neural networks struggle to produce feasible solutions directly. The key is an end-to-end differentiable framework that uses algorithmic techniques from convex geometry and Carathéodory's theorem to decompose neural network outputs into convex combinations of polytope corners corresponding to feasible sets. This decomposition supports self-supervised training and enables efficient, quality-preserving rounding to feasible solutions, yielding consistent gains over neural baselines on cardinality-constrained optimization and extending to tasks such as finding independent sets in graphs and solving matroid-constrained problems.
Link: https://arxiv.org/abs/2510.24039
Authors: Nikolaos Karalias, Akbar Rafiey, Yifei Xu, Zhishang Luo, Behrooz Tahmasebi, Connie Jiang, Stefanie Jegelka
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:
Abstract:Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an emerging paradigm for solving combinatorial problems using neural networks. In this paper, we address a central challenge of SSL for CO: solving problems with discrete constraints. We design an end-to-end differentiable framework that enables us to solve discrete constrained optimization problems with neural networks. Concretely, we leverage algorithmic techniques from the literature on convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners that correspond to feasible sets. This decomposition-based approach enables self-supervised training but also ensures efficient quality-preserving rounding of the neural net output into feasible solutions. Extensive experiments in cardinality-constrained optimization show that our approach can consistently outperform neural baselines. We further provide worked-out examples of how our method can be applied beyond cardinality-constrained problems to a diverse set of combinatorial optimization tasks, including finding independent sets in graphs, and solving matroid-constrained problems.
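For the cardinality-constrained case, the feasible polytope is the hypersimplex (all x in [0,1]^n with sum k), and a greedy Carathéodory-style decomposition into k-hot corners can be written directly. This is one standard construction consistent with the idea described above, not necessarily the paper's exact algorithm:

```python
import numpy as np

def hypersimplex_decompose(x, k, tol=1e-9):
    """Express x (0 <= x_i <= 1, sum x = k) as a convex combination of
    k-hot indicator vectors (corners of the hypersimplex)."""
    x = x.astype(float).copy()
    weights, corners, scale = [], [], 1.0
    for _ in range(x.size + 1):                    # terminates in <= n steps
        order = np.argsort(-x)
        S, rest = order[:k], order[k:]
        corner = np.zeros_like(x); corner[S] = 1.0
        if np.all((x < tol) | (x > 1 - tol)):      # x is already a corner
            weights.append(scale); corners.append(corner)
            break
        a = x[S].min()                             # step until an entry of S hits 0 ...
        b = x[rest].max() if rest.size else 0.0    # ... or an outside entry hits 1
        theta = min(a, 1.0 - b)
        weights.append(scale * theta); corners.append(corner)
        x = (x - theta * corner) / (1.0 - theta)   # rescaled residual stays feasible
        scale *= 1.0 - theta
    return np.array(weights), np.array(corners)

# Round a "soft" selection over 6 items with cardinality k = 2.
x = np.array([0.9, 0.6, 0.25, 0.15, 0.1, 0.0])
w, C = hypersimplex_decompose(x, k=2)
print(np.allclose(w @ C, x), w.round(3))           # exact reconstruction, weights sum to 1
# Quality-preserving rounding: sample a feasible corner with probability w.
pick = C[np.random.default_rng(0).choice(len(w), p=w / w.sum())]
```

Each iteration drives at least one coordinate to 0 or 1, so the loop finishes after at most n corners, matching the Carathéodory bound up to a constant.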
[AI-59] LLM LogAnalyzer: A Clustering-Based Log Analysis Chatbot using Large Language Models
[Quick Read]: This paper addresses the difficulty, cost, and expertise required to analyze system logs in cybersecurity, where high costs, a lack of in-house expertise, and time constraints make accurate pattern recognition and anomaly detection over massive, diverse log data hard for many organizations. The key is LLMLogAnalyzer, a clustering-based log-analysis chatbot whose modular architecture (a router, log recognizer, log parser, and search tools) strengthens Large Language Models' (LLMs) handling of structured text and mitigates their context-window limitations, improving accuracy and robustness on summarization, pattern extraction, and anomaly detection. It outperforms ChatGPT, ChatPDF, and NotebookLM with consistent gains of 39% to 68% across tasks and cuts result variability, achieving a 93% reduction in interquartile range (IQR) on ROUGE-1 scores.
Link: https://arxiv.org/abs/2510.24031
Authors: Peng Cai, Reza Ryan, Nickson M. Karie
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes: 33 pages, 10 figures
Abstract:System logs are a cornerstone of cybersecurity, supporting proactive breach prevention and post-incident investigations. However, analyzing vast amounts of diverse log data remains significantly challenging, as high costs, lack of in-house expertise, and time constraints make even basic analysis difficult for many organizations. This study introduces LLMLogAnalyzer, a clustering-based log analysis chatbot that leverages Large Language Models (LLMs) and Machine Learning (ML) algorithms to simplify and streamline log analysis processes. This innovative approach addresses key LLM limitations, including context window constraints and poor structured text handling capabilities, enabling more effective summarization, pattern extraction, and anomaly detection tasks. LLMLogAnalyzer is evaluated across four distinct domain logs and various tasks. Results demonstrate significant performance improvements over state-of-the-art LLM-based chatbots, including ChatGPT, ChatPDF, and NotebookLM, with consistent gains ranging from 39% to 68% across different tasks. The system also exhibits strong robustness, achieving a 93% reduction in interquartile range (IQR) when using ROUGE-1 scores, indicating significantly lower result variability. The framework’s effectiveness stems from its modular architecture comprising a router, log recognizer, log parser, and search tools. This design enhances LLM capabilities for structured text analysis while improving accuracy and robustness, making it a valuable resource for both cybersecurity experts and non-technical users.
[AI-60] Improved Accuracy of Robot Localization Using 3-D LiDAR in a Hippocampus-Inspired Model
[Quick Read]: This paper addresses the spatial ambiguity of traditional Boundary Vector Cell (BVC) models, which are restricted to two-dimensional (2D) environments and are therefore confounded by horizontal symmetries, limiting localization accuracy and robustness in real three-dimensional (3D) scenes. The key is adding vertical angular sensitivity to the BVC framework so that it can process vertical contours from LiDAR data, enabling robust boundary detection and place encoding in 3D. This markedly improves the localization accuracy of a biologically inspired robot model in complex 3D environments, yields more distinct place fields, and reduces spatial aliasing, while matching the 2D baseline in near-planar environments.
Link: https://arxiv.org/abs/2510.24029
Authors: Andrew Gerstenslager, Bekarys Dukenbaev, Ali A. Minai
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Neurons and Cognition (q-bio.NC)
Notes: 8 pages, 9 figures, Presented at the 2025 International Joint Conference on Neural Networks, Rome, July 2025
Abstract:Boundary Vector Cells (BVCs) are a class of neurons in the brains of vertebrates that encode environmental boundaries at specific distances and allocentric directions, playing a central role in forming place fields in the hippocampus. Most computational BVC models are restricted to two-dimensional (2D) environments, making them prone to spatial ambiguities in the presence of horizontal symmetries in the environment. To address this limitation, we incorporate vertical angular sensitivity into the BVC framework, thereby enabling robust boundary detection in three dimensions, and leading to significantly more accurate spatial localization in a biologically-inspired robot model. The proposed model processes LiDAR data to capture vertical contours, thereby disambiguating locations that would be indistinguishable under a purely 2D representation. Experimental results show that in environments with minimal vertical variation, the proposed 3D model matches the performance of a 2D baseline; yet, as 3D complexity increases, it yields substantially more distinct place fields and markedly reduces spatial aliasing. These findings show that adding a vertical dimension to BVC-based localization can significantly enhance navigation and mapping in real-world 3D spaces while retaining performance parity in simpler, near-planar scenarios.
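A 3D BVC can be sketched as a product of Gaussian tunings over distance, azimuth, and elevation, summed over LiDAR returns; the tuning widths, the product-of-Gaussians form, and the toy wall-versus-overhang scene below are illustrative choices rather than the paper's exact model:

```python
import numpy as np

def bvc_activation(points_xyz, pref_dist, pref_az, pref_el,
                   sigma_d=0.5, sigma_ang=0.2):
    """Sum over LiDAR returns of Gaussian tunings in distance/azimuth/elevation."""
    x, y, z = points_xyz.T
    d  = np.sqrt(x**2 + y**2 + z**2)
    az = np.arctan2(y, x)
    el = np.arcsin(np.clip(z / (d + 1e-9), -1, 1))
    daz = np.angle(np.exp(1j * (az - pref_az)))          # wrapped angular difference
    g = (np.exp(-(d - pref_dist) ** 2 / (2 * sigma_d ** 2))
         * np.exp(-daz ** 2 / (2 * sigma_ang ** 2))
         * np.exp(-(el - pref_el) ** 2 / (2 * sigma_ang ** 2)))
    return g.sum()

# A flat wall 2 m ahead vs. the same wall raised into an overhang: only a
# cell tuned to a positive elevation distinguishes the two scenes.
rng = np.random.default_rng(0)
wall = np.column_stack([np.full(200, 2.0), rng.uniform(-1, 1, 200),
                        rng.uniform(0, 0.5, 200)])
overhang = wall.copy(); overhang[:, 2] += 1.5
cell_up = dict(pref_dist=2.5, pref_az=0.0, pref_el=0.6)
print(bvc_activation(wall, **cell_up), bvc_activation(overhang, **cell_up))
```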
[AI-61] OneCast: Structured Decomposition and Modular Generation for Cross-Domain Time Series Forecasting
[Quick Read]: This paper addresses poor generalization in cross-domain time series forecasting caused by heterogeneous data, particularly domain-specific trend shifts and inconsistent periodic patterns; existing methods mostly extend single-domain models and fail to capture structural differences across domains. The key is OneCast, which explicitly decouples time series into seasonal and trend components and models each through a tailored generative pathway: the seasonal component is reconstructed by a lightweight projection module using interpretable basis functions, while the trend component is encoded into segment-level discrete tokens by a semantic-aware tokenizer and then inferred via a masked discrete diffusion mechanism; the two branches are finally combined to preserve periodic patterns while tracking domain-specific trends.
Link: https://arxiv.org/abs/2510.24028
Authors: Tingyue Pan, Mingyue Cheng, Shilong Zhang, Zhiding Liu, Xiaoyu Tao, Yucong Luo, Jintao Zhang, Qi Liu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:
Abstract:Cross-domain time series forecasting is a valuable task in various web applications. Despite its rapid advancement, achieving effective generalization across heterogeneous time series data remains a significant challenge. Existing methods have made progress by extending single-domain models, yet often fall short when facing domain-specific trend shifts and inconsistent periodic patterns. We argue that a key limitation lies in treating temporal series as undifferentiated sequence, without explicitly decoupling their inherent structural components. To address this, we propose OneCast, a structured and modular forecasting framework that decomposes time series into seasonal and trend components, each modeled through tailored generative pathways. Specifically, the seasonal component is captured by a lightweight projection module that reconstructs periodic patterns via interpretable basis functions. In parallel, the trend component is encoded into discrete tokens at segment level via a semantic-aware tokenizer, and subsequently inferred through a masked discrete diffusion mechanism. The outputs from both branches are combined to produce a final forecast that captures seasonal patterns while tracking domain-specific trends. Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines.
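OneCast 的季节分支“利用可解释基函数重构周期模式”可近似理解为在固定傅里叶基上做轻量投影。以下为该思路的最小示意(基函数形式、周期与阶数均为假设,趋势分支的离散扩散此处不展开):

```python
import numpy as np

def fourier_basis(t, period, k=3):
    """k 阶傅里叶基:可解释的周期基函数(假设季节周期已知)。"""
    cols = [np.ones_like(t)]
    for j in range(1, k + 1):
        cols += [np.sin(2 * np.pi * j * t / period),
                 np.cos(2 * np.pi * j * t / period)]
    return np.stack(cols, axis=1)               # (T, 2k+1)

# 构造一条带线性趋势的周期序列
t = np.arange(200, dtype=float)
y = 0.02 * t + np.sin(2 * np.pi * t / 24) \
    + 0.1 * np.random.default_rng(1).normal(size=200)

B = fourier_basis(t, period=24)
w, *_ = np.linalg.lstsq(B, y, rcond=None)       # “轻量投影”:最小二乘求基系数
seasonal = B @ w                                # 季节分支的重构结果
residual = y - seasonal                         # 余下部分交给趋势分支建模
print(w[:3])
```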
zh
[AI-62] Spatio-temporal Multivariate Time Series Forecast with Chosen Variables
【速读】:该论文旨在解决时空多变量时间序列预测(Spatio-Temporal Multivariate time series Forecast, STMF)中因传感器数量受限导致的输入变量选择问题,即在n个待监测位置中,如何最优地选择m个具有代表性的变量作为模型输入以最大化预测精度。现有方法通常假设输入变量是预先确定的,忽略了变量选择对预测性能的关键影响。本文提出了一种统一框架,通过联合优化变量选择与模型参数,实现精度与效率的协同提升;其关键创新在于三个技术组件:(1) 基于分位数掩码的变量-参数剪枝机制,逐步剔除低信息量变量及注意力参数;(2) 优先级变量-参数重放机制,通过回放低损失历史样本保持模型稳定性;(3) 动态外推机制,利用可学习的空间嵌入和邻接信息将选中变量的信息传播至所有其他变量。实验表明,该方法在五个真实数据集上显著优于当前最优基线,在准确性和效率方面均取得提升。
链接: https://arxiv.org/abs/2510.24027
作者: Zibo Liu,Zhe Jiang,Zelin Xu,Tingsong Xiao,Yupu Zhang,Zhengkun Xiao,Haibo Wang,Shigang Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: In submission
Abstract:Spatio-Temporal Multivariate time series Forecast (STMF) uses the time series of n spatially distributed variables in a period of recent past to forecast their values in a period of near future. It has important applications in spatio-temporal sensing forecast such as road traffic prediction and air pollution prediction. Recent papers have addressed a practical problem of missing variables in the model input, which arises in the sensing applications where the number m of sensors is far less than the number n of locations to be monitored, due to budget constraints. We observe that the state of the art assumes that the m variables (i.e., locations with sensors) in the model input are pre-determined and the important problem of how to choose the m variables in the input has never been studied. This paper fills the gap by studying a new problem of STMF with chosen variables, which optimally selects m -out-of- n variables for the model input in order to maximize the forecast accuracy. We propose a unified framework that jointly performs variable selection and model optimization for both forecast accuracy and model efficiency. It consists of three novel technical components: (1) masked variable-parameter pruning, which progressively prunes less informative variables and attention parameters through quantile-based masking; (2) prioritized variable-parameter replay, which replays low-loss past samples to preserve learned knowledge for model stability; (3) dynamic extrapolation mechanism, which propagates information from variables selected for the input to all other variables via learnable spatial embeddings and adjacency information. Experiments on five real-world datasets show that our work significantly outperforms the state-of-the-art baselines in both accuracy and efficiency, demonstrating the effectiveness of joint variable selection and model optimization.
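下面用一个最小示意说明“基于分位数掩码逐步剔除低信息量变量”的机制;其中变量重要度的打分方式为假设(论文中与注意力参数联合剪枝):

```python
import numpy as np

def quantile_mask_prune(importance, mask, q=0.2):
    """每轮按分位数阈值剔除当前仍保留变量中重要度最低的一批。"""
    active = np.where(mask)[0]
    thresh = np.quantile(importance[active], q)
    mask[active[importance[active] <= thresh]] = False
    return mask

n = 50                                   # 候选监测位置(变量)数
rng = np.random.default_rng(0)
importance = rng.random(n)               # 假设:每个变量的信息量打分
mask = np.ones(n, dtype=bool)

m = 10                                   # 传感器预算:最终保留 m 个变量
while mask.sum() > m:                    # 渐进式剪枝,每轮剔除约 20%
    mask = quantile_mask_prune(importance, mask, q=0.2)
print("selected variables:", np.where(mask)[0][:m])
```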
zh
[AI-63] NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis
【速读】:该论文旨在解决脑功能网络分析中,现有方法难以捕捉特定功能社区之间连接路径随时间演变特征的问题。其解决方案的关键在于提出一种基于路径层面的轨迹建模框架(NeuroPathNet),该框架首先基于医学支持的静态分区方案(如Yeo和Smith ICA)提取各功能分区间的连接强度时间序列,并利用时序神经网络对这些路径动态行为进行建模,从而更精准地刻画脑功能网络中连接路径的时变特性。
链接: https://arxiv.org/abs/2510.24025
作者: Guo Tianqi,Chen Liping,Peng Ciyuan,Guo Jingjing,Ren Jing
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding the evolution of brain functional networks over time is of great significance for the analysis of cognitive mechanisms and the diagnosis of neurological diseases. Existing methods often have difficulty in capturing the temporal evolution characteristics of connections between specific functional communities. To this end, this paper proposes a new path-level trajectory modeling framework (NeuroPathNet) to characterize the dynamic behavior of connection pathways between brain functional partitions. Based on medically supported static partitioning schemes (such as Yeo and Smith ICA), we extract the time series of connection strengths between each pair of functional partitions and model them using a temporal neural network. We validate the model performance on three public functional Magnetic Resonance Imaging (fMRI) datasets, and the results show that it outperforms existing mainstream methods in multiple indicators. This study can promote the development of dynamic graph learning methods for brain network analysis, and provide possible clinical applications for the diagnosis of neurological diseases.
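由各分区的 BOLD 信号构造“分区对之间连接强度的时间序列”,常见做法是滑动窗相关;以下为最小示意(窗长、步长与分区数均为假设,后续的时序神经网络不展开):

```python
import numpy as np

def path_timeseries(ts, win=30, step=5):
    """ts: (T, P) 各功能分区的平均 BOLD 信号。
    返回 (W, P*(P-1)//2):每个滑动窗内所有分区对的相关系数,
    即每条“连接路径”的强度随时间变化的轨迹。"""
    T, P = ts.shape
    iu = np.triu_indices(P, k=1)
    out = []
    for s in range(0, T - win + 1, step):
        c = np.corrcoef(ts[s:s + win].T)    # 窗内 P x P 相关矩阵
        out.append(c[iu])                   # 取上三角:每列对应一条路径
    return np.array(out)

rng = np.random.default_rng(0)
bold = rng.normal(size=(300, 7))            # 假设 7 个 Yeo 功能分区
paths = path_timeseries(bold)
print(paths.shape)                          # 每行一个时间点、每列一条路径
```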
zh
[AI-64] Lifecycle-Aware Code Generation: Leveraging Software Engineering Phases in LLMs
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在自动代码生成中普遍存在的问题:即大多数方法依赖于从问题描述到代码的直接、单步翻译,忽视了软件工程中的结构化开发流程。为此,作者提出了一种生命周期感知(lifecycle-aware)的框架,其关键在于将需求分析、状态机建模和伪代码等中间产物系统性地融入训练与推理阶段,使代码生成过程与标准软件开发阶段对齐,从而促进更结构化的推理。实验表明,该方法在代码正确性上相较未微调模型提升高达75%,且多步推理显著优于单步生成,证明了中间 scaffold 的有效性;同时,开源模型经此框架微调后可达到甚至超越专有代码预训练模型的表现。
链接: https://arxiv.org/abs/2510.24019
作者: Xing Xing,Wei Wang,Lipeng Ma,Weidong Yang,Junjie Zheng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent progress in large language models (LLMs) has advanced automatic code generation, yet most approaches rely on direct, single-step translation from problem descriptions to code, disregarding structured software engineering practices. We introduce a lifecycle-aware framework that systematically incorporates intermediate artifacts such as requirements analysis, state machine modeling, and pseudocode into both the training and inference stages. This design aligns code generation with standard software development phases and enables more structured reasoning. Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning, with performance gains compounding across intermediate stages. Multi-step inference consistently surpasses single-step generation, demonstrating the effectiveness of intermediate scaffolding. Notably, open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code. When applied to DeepSeek-Coder-1.3B, our framework yields relative CodeBLEU improvements of 34.3%, 20.0%, 11.2%, and 22.3% over ChatGPT-3.5, ChatGPT-4o-mini, DeepSeek-R1, and LLaMA-8B, respectively. Our pipeline also proves robust with up to 80% less training data, confirming its resilience. Ablation studies further reveal that each intermediate artifact contributes distinctly to final code quality, with state machine modeling yielding the most substantial impact. Our source code and detailed experimental data are available at this https URL.
zh
[AI-65] Discovering Heuristics with Large Language Models (LLMs) for Mixed-Integer Programs: Single-Machine Scheduling
【速读】:该论文旨在解决单机总延迟调度问题(Single-Machine Total Tardiness, SMTT),即在给定作业的处理时间和截止日期条件下,通过优化作业序列以最小化总延迟时间。该问题是典型的NP-hard组合优化问题,在大规模实例下传统精确算法(如混合整数规划MIP和动态规划)面临计算不可行性。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)发现新型启发式规则——EDD Challenger (EDDC) 和 MDD Challenger (MDDC),它们分别基于经典的最早截止日期优先(Earliest Due Date, EDD)和修改截止日期(Modified Due Date, MDD)规则进行改进。这些由LLM生成的启发式算法在保持较低计算复杂度的同时,显著优于传统启发式方法,并在500个作业的大规模实例中展现出与精确方法相当甚至更优的性能,证明了人机协作在设计可扩展、高性能启发式算法方面的有效性。
链接: https://arxiv.org/abs/2510.24013
作者: İbrahim Oğuz Çetinkaya,İ. Esra Büyüktahtakın,Parshin Shojaee,Chandan K. Reddy
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Combinatorics (math.CO); Optimization and Control (math.OC)
备注:
Abstract:Our study contributes to the scheduling and combinatorial optimization literature with new heuristics discovered by leveraging the power of Large Language Models (LLMs). We focus on the single-machine total tardiness (SMTT) problem, which aims to minimize total tardiness by sequencing n jobs on a single processor without preemption, given processing times and due dates. We develop and benchmark two novel LLM-discovered heuristics, the EDD Challenger (EDDC) and MDD Challenger (MDDC), inspired by the well-known Earliest Due Date (EDD) and Modified Due Date (MDD) rules. In contrast to prior studies that employed simpler rule-based heuristics, we evaluate our LLM-discovered algorithms using rigorous criteria, including optimality gaps and solution time derived from a mixed-integer programming (MIP) formulation of SMTT. We compare their performance against state-of-the-art heuristics and exact methods across various job sizes (20, 100, 200, and 500 jobs). For instances with more than 100 jobs, exact methods such as MIP and dynamic programming become computationally intractable. Up to 500 jobs, EDDC improves upon the classic EDD rule and another widely used algorithm in the literature. MDDC consistently outperforms traditional heuristics and remains competitive with exact approaches, particularly on larger and more complex instances. This study shows that human-LLM collaboration can produce scalable, high-performing heuristics for NP-hard constrained combinatorial optimization, even under limited resources when effectively configured.
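作为两种 LLM 发现启发式的出发点,经典 EDD 与 MDD 规则及总延迟的计算可写成如下最小实现(EDDC/MDDC 的具体改进见原文,此处不复现):

```python
def total_tardiness(seq, p, d):
    """按序列 seq 加工,返回总延迟 sum(max(0, C_j - d_j))。"""
    t, tard = 0, 0
    for j in seq:
        t += p[j]
        tard += max(0, t - d[j])
    return tard

def edd(p, d):
    """Earliest Due Date:按交货期升序排序,一次性给出序列。"""
    return sorted(range(len(p)), key=lambda j: d[j])

def mdd(p, d):
    """Modified Due Date:每步选 max(t + p_j, d_j) 最小的未排作业。"""
    remaining, seq, t = set(range(len(p))), [], 0
    while remaining:
        j = min(remaining, key=lambda j: max(t + p[j], d[j]))
        seq.append(j); remaining.remove(j); t += p[j]
    return seq

p = [4, 2, 6, 3]          # 加工时间
d = [5, 3, 10, 6]         # 交货期
for name, rule in [("EDD", edd), ("MDD", mdd)]:
    s = rule(p, d)
    print(name, s, total_tardiness(s, p, d))
```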
zh
[AI-66] Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models NEURIPS2025
【速读】:该论文旨在解决扩散模型(diffusion models)在生成图像时可能因训练数据中包含不当或偏见内容而产生有害输出的问题,尤其是在面对恶意文本提示时。其解决方案的关键在于提出一种无需训练的“安全文本嵌入引导”(Safe Text embedding Guidance, STG)方法:通过在采样过程中基于对预期最终去噪图像的安全函数评估来调整文本嵌入,从而在不改变模型参数的前提下引导生成更安全的图像,同时最小化对原始语义意图的影响。理论分析表明,STG能够使模型分布与安全约束对齐,实验证明其在去除裸露、暴力及特定艺术家风格等不安全内容方面显著优于现有训练型和非训练型基线方法。
链接: https://arxiv.org/abs/2510.24012
作者: Byeonghu Na,Mina Kang,Jiseok Kwak,Minsang Park,Jiwoo Shin,SeJoon Jun,Gayoung Lee,Jin-Hwa Kim,Il-Chul Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2025
Abstract:Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at this https URL.
zh
[AI-67] Learning Individual Movement Shifts After Urban Disruptions with Social Infrastructure Reliance
【速读】:该论文旨在解决灾变事件后个体移动模式变化难以预测的问题,其核心挑战在于缺乏对个体社会基础设施韧性(Social Infrastructure Resilience, SIR)的量化测量、个体移动模式与空间环境之间复杂交互关系的捕捉不足,以及个体层面移动数据的稀疏性导致传统决策方法适用性差。解决方案的关键在于构建一个条件深度学习模型,将个体SIR作为条件输入,以整合个体层面的社会韧性特征与局部空间背景信息,从而有效建模个体移动模式与空间上下文之间的非线性关系,并提升对灾变后移动模式转变的预测能力。实验表明,该方法能准确识别出具有相似前期移动模式但不同SIR水平的个体在灾变后的差异化移动响应。
链接: https://arxiv.org/abs/2510.23989
作者: Shangde Gao,Zelin Xu,Zhe Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Shifts in individual movement patterns following disruptive events can reveal changing demands for community resources. However, predicting such shifts before disruptive events remains challenging for several reasons. First, measures are lacking for individuals’ heterogeneous social infrastructure resilience (SIR), which directly influences their movement patterns, and commonly used features are often limited or unavailable at scale, e.g., sociodemographic characteristics. Second, the complex interactions between individual movement patterns and spatial contexts have not been sufficiently captured. Third, individual-level movement may be spatially sparse and not well-suited to traditional decision-making methods for movement predictions. This study incorporates individuals’ SIR into a conditioned deep learning model to capture the complex relationships between individual movement patterns and local spatial context using large-scale, sparse individual-level data. Our experiments demonstrate that incorporating individuals’ SIR and spatial context can enhance the model’s ability to predict post-event individual movement patterns. The conditioned model can capture the divergent shifts in movement patterns among individuals who exhibit similar pre-event patterns but differ in SIR.
zh
[AI-68] STNet: Spectral Transformation Network for Solving Operator Eigenvalue Problem
【速读】:该论文旨在解决高维空间中算子特征值问题(Operator eigenvalue problems)的数值求解难题,此类问题在科学计算与工程应用中至关重要,但传统方法受限于维度灾难(curse of dimensionality)。现有基于深度学习的方法虽能有效缓解此问题,但其性能高度依赖于算子的谱分布特性——即特征值之间的间隔越大,精度越高。为此,作者提出谱变换网络(Spectral Transformation Network, STNet),其核心创新在于:利用近似特征值和特征函数对原始算子进行谱变换,将其转化为等价但更易求解的问题;通过变形投影(deflation projection)剔除已求得特征函数对应的子空间,缩小搜索范围并避免重复收敛;同时引入滤波变换(filter transform)增强目标区域内的特征值、抑制其他区域,从而显著提升求解精度。实验表明,STNet在准确性上优于现有基于学习的方法,达到当前最优水平。
链接: https://arxiv.org/abs/2510.23986
作者: Hong Wang,Jiang Yixuan,Jie Wang,Xinyi Li,Jian Luo,Huanshuo Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:
Abstract:Operator eigenvalue problems play a critical role in various scientific fields and engineering applications, yet numerical methods are hindered by the curse of dimensionality. Recent deep learning methods provide an efficient approach to address this challenge by iteratively updating neural networks. These methods’ performance relies heavily on the spectral distribution of the given operator: larger gaps between the operator’s eigenvalues will improve precision, thus tailored spectral transformations that leverage the spectral distribution can enhance their performance. Based on this observation, we propose the Spectral Transformation Network (STNet). During each iteration, STNet uses approximate eigenvalues and eigenfunctions to perform spectral transformations on the original operator, turning it into an equivalent but easier problem. Specifically, we employ deflation projection to exclude the subspace corresponding to already solved eigenfunctions, thereby reducing the search space and avoiding converging to existing eigenfunctions. Additionally, our filter transform magnifies eigenvalues in the desired region and suppresses those outside, further improving performance. Extensive experiments demonstrate that STNet consistently outperforms existing learning-based methods, achieving state-of-the-art performance in accuracy.
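STNet 中的变形投影可在有限维矩阵上直观演示:求得一个特征对后,用 P = I − vvᵀ 将其子空间从算子中剔除,使后续迭代不再收敛到同一特征函数。以下为配合幂迭代的最小 numpy 示意(实际 STNet 作用于神经网络参数化的特征函数,此处仅演示谱变换本身):

```python
import numpy as np

def power_iter(A, iters=500):
    """幂迭代:返回近似主特征向量与对应特征值。"""
    v = np.random.default_rng(0).normal(size=A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v, v @ A @ v

A = np.diag([5.0, 4.0, 1.0])                # 玩具对称算子
v1, lam1 = power_iter(A)

# 变形投影:P = I - v1 v1^T,把已求得的特征方向从搜索空间中剔除
P = np.eye(3) - np.outer(v1, v1)
A_defl = P @ A @ P
v2, lam2 = power_iter(A_defl)
print(round(lam1, 3), round(lam2, 3))        # 约 5.0 与 4.0,避免重复收敛到 v1
```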
zh
[AI-69] HyperGraphX: Graph Transductive Learning with Hyperdimensional Computing and Message Passing
【速读】:该论文旨在解决图学习任务中模型精度与计算效率之间的权衡问题,特别是在处理同质性(homophilic)和异质性(heterophilic)图结构时,传统图神经网络(Graph Neural Networks, GNNs)往往在精度或速度上存在不足。解决方案的关键在于提出一种新颖的算法 HyperGraphX,该算法将图卷积(graph convolution)与超维度计算(Hyperdimensional Computing, HDC)中的绑定(binding)和捆绑(bundling)操作相结合,从而在保持高预测准确性的前提下显著提升推理速度。实验表明,HyperGraphX 在多个基准数据集上优于主流 GNN 实现(如 GCNII)及最先进的 HDC 方法(如 HDGL),且在相同 GPU 平台上平均分别快 9561.0 倍和 144.5 倍,其基于二进制向量的操作特性也预示了在类脑计算和存算一体(process-in-memory)硬件上的优异能效表现。
链接: https://arxiv.org/abs/2510.23980
作者: Guojing Cong,Tom Potok,Hamed Poursiami,Maryam Parsa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:We present a novel algorithm, HyperGraphX, that marries graph convolution with binding and bundling operations in hyperdimensional computing for transductive graph learning. For prediction accuracy HyperGraphX outperforms major and popular graph neural network implementations as well as state-of-the-art hyperdimensional computing implementations for a collection of homophilic graphs and heterophilic graphs. Compared with the most accurate learning methodologies we have tested, on the same target GPU platform, HyperGraphX is on average 9561.0 and 144.5 times faster than GCNII, a graph neural network implementation, and HDGL, a hyperdimensional computing implementation, respectively. As the majority of the learning operates on binary vectors, we expect outstanding energy performance of HyperGraphX on neuromorphic and emerging process-in-memory devices.
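摘要中的绑定(binding)与捆绑(bundling)在二值超维向量上分别对应按位 XOR 与多数表决;以下最小示意展示这两个算子,并用它们模拟一次“聚合邻居消息”的图卷积式更新(仅示意基本运算,与论文算法细节无关):

```python
import numpy as np

D = 10_000                                   # 超维向量维度
rng = np.random.default_rng(0)
rand_hv = lambda: rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):                              # 绑定:按位 XOR,组合“角色-填充”对
    return a ^ b

def bundle(*hvs):                            # 捆绑:按位多数表决,叠加多条信息
    return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)

def hamming_sim(a, b):                       # 1 - 归一化汉明距离
    return 1 - np.mean(a != b)

# 示意:把节点自身表示与沿边绑定后的邻居消息捆绑,类比一次图卷积更新
node, n1, n2, edge = rand_hv(), rand_hv(), rand_hv(), rand_hv()
msg = bundle(bind(edge, n1), bind(edge, n2))
h = bundle(node, msg)
print(hamming_sim(h, node), hamming_sim(h, rand_hv()))  # 与自身偏相似,与随机向量约 0.5
```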
zh
[AI-70] Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models NEURIPS2025
【速读】:该论文旨在解决文本到图像扩散模型中固定文本嵌入(text embeddings)导致的生成适应性不足问题,即预训练文本编码器产生的嵌入在扩散过程的所有时间步保持不变,限制了其与中间扰动数据的动态对齐能力。解决方案的关键在于提出Diffusion Adaptive Text Embedding (DATE),通过在每个扩散步骤中基于中间扰动数据动态优化文本嵌入,构建一个可微分的优化问题并推导出更新规则,从而在无需额外模型训练的前提下,使文本条件能够自适应地调整以更好地匹配反向扩散过程中的图像均值预测,显著提升文本-图像对齐效果和生成质量。
链接: https://arxiv.org/abs/2510.23974
作者: Byeonghu Na,Minsang Park,Gyuwon Sim,Donghyeok Shin,HeeSun Bae,Mina Kang,Se Jung Kwon,Wanmo Kang,Il-Chul Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2025
Abstract:Text-to-image diffusion models rely on text embeddings from a pre-trained text encoder, but these embeddings remain fixed across all diffusion timesteps, limiting their adaptability to the generative process. We propose Diffusion Adaptive Text Embedding (DATE), which dynamically updates text embeddings at each diffusion timestep based on intermediate perturbed data. We formulate an optimization problem and derive an update rule that refines the text embeddings at each sampling step to improve alignment and preference between the mean predicted image and the text. This allows DATE to dynamically adapts the text conditions to the reverse-diffused images throughout diffusion sampling without requiring additional model training. Through theoretical analysis and empirical results, we show that DATE maintains the generative capability of the model while providing superior text-image alignment over fixed text embeddings across various tasks, including multi-concept generation and text-guided image editing. Our code is available at this https URL.
zh
[AI-71] An efficient probabilistic hardware architecture for diffusion-like models
【速读】:该论文旨在解决当前概率型人工智能(Probabilistic AI)系统在能效和硬件可扩展性方面的瓶颈问题。现有方案受限于建模能力不足及依赖难以规模化实现的特殊硬件,导致难以实际应用。其解决方案的关键在于提出一种全晶体管架构的概率计算机,该架构在硬件层面原生实现强大的去噪模型(denoising models),从而在保持高性能的同时显著降低能耗——系统级分析表明,该架构可在简单图像基准测试中达到与GPU相当的性能,但功耗仅为GPU的约1/10,000。
链接: https://arxiv.org/abs/2510.23972
作者: Andraž Jelinčič,Owen Lockwood,Akhil Garlapati,Guillaume Verdon,Trevor McCourt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures
Abstract:The proliferation of probabilistic AI has promoted proposals for specialized stochastic computers. Despite promising efficiency gains, these proposals have failed to gain traction because they rely on fundamentally limited modeling techniques and exotic, unscalable hardware. In this work, we address these shortcomings by proposing an all-transistor probabilistic computer that implements powerful denoising models at the hardware level. A system-level analysis indicates that devices based on our architecture could achieve performance parity with GPUs on a simple image benchmark using approximately 10,000 times less energy.
zh
[AI-72] The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
【速读】:该论文旨在解决传统大语言模型(Large Language Model, LLM)对齐方法在面对人类偏好异质性时的不一致性问题,即基于成对比较数据(如提示-完成对)拟合朴素概率模型时,无法获得对群体平均效用(population-average utility)的一致估计,从而影响社会福利的准确衡量。其解决方案的关键在于提出一种称为“符号估计器”(sign estimator)的新方法:通过在聚合步骤中将交叉熵损失替换为二分类损失,实现无需复杂建模即可获得一致的序数对齐,并在温和假设下首次建立多项式阶有限样本误差界。该方法在模拟场景中显著降低偏好扭曲,相比标准强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)将角度估计误差减少近35%,并将与真实群体偏好的分歧率从12%降至8%,同时保持了现有LLM对齐流程的实现简洁性。
链接: https://arxiv.org/abs/2510.23965
作者: Aymane El Gadarri,Ali Aouad,Vivek F. Farias
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a naïve probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility -a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data-all while maintaining the implementation simplicity of existing LLM alignment pipelines.
zh
[AI-73] ChessQA: Evaluating Large Language Models for Chess Understanding
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在国际象棋理解能力评估中存在方法片面、缺乏系统性和动态演化机制的问题。现有评估多局限于简单走法质量判断,难以全面刻画模型从基础规则理解到高阶抽象思维的多层次棋力水平。其解决方案的关键在于提出ChessQA——一个涵盖五个任务类别的综合性基准测试体系(结构理解、模式识别、短程战术、局面判断和语义描述),这些类别对应棋手由浅入深的认知发展路径,并通过结构化、可扩展的评测框架实现对LLM棋力的精细化诊断与持续比较。该基准具备动态更新特性,能够随模型进步而迭代优化,从而为LLM在复杂推理任务中的能力演进提供可控且一致的评估环境。
链接: https://arxiv.org/abs/2510.23948
作者: Qianfeng Wen,Zhenwei Tang,Ashton Anderson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 33 pages,8 figures
Abstract:Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and semantically describing high-level concepts. In this way, ChessQA captures a more comprehensive picture of chess ability and understanding, going significantly beyond the simple move quality evaluations done previously, and offers a controlled, consistent setting for diagnosis and comparison. Furthermore, ChessQA is inherently dynamic, with prompts, answer keys, and construction scripts that can evolve as models improve. Evaluating a range of contemporary LLMs, we find persistent weaknesses across all five categories and provide results and error analyses by category. We will release the code, periodically refreshed datasets, and a public leaderboard to support further research.
zh
[AI-74] Decentralized Causal Discovery using Judo Calculus
【速读】:该论文旨在解决现实世界中因果效应依赖于特定情境(如年龄、国家、剂量、基因型或实验协议)的问题,传统因果发现方法往往假设因果关系全局一致,忽略了这种情境依赖性。其解决方案的关键在于提出一种基于**拓扑层(topos of sheaves)**的直观主义去中心化因果发现框架——judo calculus,通过引入Lawvere-Tierney模算子 $ j $ 来形式化“局部真值”概念:因果主张仅在某一类相关情境(即 $ j $-稳定覆盖)上成立,而非全域统一。这一机制使得因果推断具有构造性和一致性,同时结合了评分法、约束法和梯度法等经典因果发现技术,实现了计算效率提升与性能改进,在生物和经济等真实数据集上验证了有效性。
链接: https://arxiv.org/abs/2510.23942
作者: Sridhar Mahadevan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 54 pages
Abstract:We describe a theory and implementation of an intuitionistic decentralized framework for causal discovery using judo calculus, which is formally defined as j-stable causal inference using j-do-calculus in a topos of sheaves. In real-world applications – from biology to medicine and social science – causal effects depend on regime (age, country, dose, genotype, or lab protocol). Our proposed judo calculus formalizes this context dependence formally as local truth: a causal claim is proven true on a cover of regimes, not everywhere at once. The Lawvere-Tierney modal operator j chooses which regimes are relevant; j-stability means the claim holds constructively and consistently across that family. We describe an algorithmic and implementation framework for judo calculus, combining it with standard score-based, constraint-based, and gradient-based causal discovery methods. We describe experimental results on a range of domains, from synthetic to real-world datasets from biology and economics. Our experimental results show the computational efficiency gained by the decentralized nature of sheaf-theoretic causal discovery, as well as improved performance over classical causal discovery methods.
zh
[AI-75] Modeling Biological Multifunctionality with Echo State Networks
【速读】:该论文旨在解决如何有效模拟生物系统中复杂的时空动态行为,特别是电生理过程的建模问题。其解决方案的关键在于构建一个三维多组分反应-扩散模型,该模型融合了兴奋性系统动力学与扩散过程,具有类似FitzHugh-Nagumo模型的结构特征;随后利用数值方法生成时间序列数据,并基于这些数据训练并验证一种回声状态网络(Echo State Network, ESN),最终实现了对原始生物动力学行为的准确再现,证明了数据驱动的多功能ESN模型在模拟生物动态过程中的可行性与有效性。
链接: https://arxiv.org/abs/2510.23940
作者: Anastasia-Maria Leventi-Peetz,Jörg-Volker Peetz,Kai Weber,Nikolaos Zacharis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 17 figures, 6 tables, 23 references
Abstract:In this work, a three-dimensional multicomponent reaction-diffusion model has been developed, combining excitable-system dynamics with diffusion processes and sharing conceptual features with the FitzHugh-Nagumo model. Designed to capture the spatiotemporal behavior of biological systems, particularly electrophysiological processes, the model was solved numerically to generate time-series data. These data were subsequently used to train and evaluate an Echo State Network (ESN), which successfully reproduced the system’s dynamic behavior. The results demonstrate that simulating biological dynamics using data-driven, multifunctional ESN models is both feasible and effective.
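回声状态网络的核心是:随机固定的储层加上只训练线性读出层。以下为最小 numpy 实现,用一维正弦序列演示一步预测(储层规模、谱半径、泄漏率均为假设超参数;实际实验中输入应替换为反应-扩散模型产生的时间序列):

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho, alpha = 200, 0.9, 0.5               # 储层规模、谱半径、泄漏率(假设值)
W = rng.normal(size=(N, N))
W *= rho / max(abs(np.linalg.eigvals(W)))   # 缩放到给定谱半径,保证回声状态性质
W_in = rng.uniform(-0.5, 0.5, size=N)

def run_reservoir(u):
    """驱动储层,收集各时刻状态;储层权重随机固定、不参与训练。"""
    x, states = np.zeros(N), []
    for ut in u:
        x = (1 - alpha) * x + alpha * np.tanh(W @ x + W_in * ut)
        states.append(x.copy())
    return np.array(states)

u = np.sin(np.linspace(0, 20 * np.pi, 1000))            # 玩具序列
X, y = run_reservoir(u[:-1]), u[1:]                     # 一步预测任务
W_out = np.linalg.lstsq(X[100:], y[100:], rcond=None)[0]  # 丢弃前100步洗入期,最小二乘读出
print("train MSE:", np.mean((X[100:] @ W_out - y[100:])**2))
```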
zh
[AI-76] Scalable GPU-Based Integrity Verification for Large Machine Learning Models
【速读】:该论文旨在解决分布式机器学习(Distributed Machine Learning)中模型完整性验证的效率与架构适配问题,即传统基于CPU的安全验证机制难以跟上GPU加速的大型模型执行节奏,导致性能瓶颈和架构不一致。其解决方案的关键在于将完整性验证直接集成到GPU加速器上,利用GPU原生计算单元(如Intel Arc的XMX单元或NVIDIA的Tensor Cores)执行加密操作,从而消除CPU-GPU间的通信开销,并确保验证过程与模型训练/推理同步进行,实现高吞吐、低延迟的完整性保护,同时支持跨不同GPU厂商的硬件一致性。
链接: https://arxiv.org/abs/2510.23938
作者: Marcin Spoczynski,Marcela S. Melara
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. Our approach co-locates integrity verification directly with large ML model execution on GPU accelerators, resolving the fundamental mismatch between how large ML workloads typically run (primarily on GPUs) and how security verifications traditionally operate (on separate CPU-based processes), delivering both immediate performance benefits and long-term architectural consistency. By performing cryptographic operations natively on GPUs using dedicated compute units (e.g., Intel Arc’s XMX units, NVIDIA’s Tensor Cores), our solution eliminates the potential architectural bottlenecks that could plague traditional CPU-based verification systems when dealing with large models. This approach leverages the same GPU-based high-memory bandwidth and parallel processing primitives that power ML workloads ensuring integrity checks keep pace with model execution even for massive models exceeding 100GB. This framework establishes a common integrity verification mechanism that works consistently across different GPU vendors and hardware configurations. By anticipating future capabilities for creating secure channels between trusted execution environments and GPU accelerators, we provide a hardware-agnostic foundation that enterprise teams can deploy regardless of their underlying CPU and GPU infrastructures.
zh
[AI-77] MFiSP: A Multimodal Fire Spread Prediction Framework
【速读】:该论文旨在解决传统野火建模方法在预测精度和动态适应性方面的不足,这些问题主要源于依赖人工解读的火行为分析(Fire Behaviour Analysts, FBAns)和静态环境数据,导致预测结果常出现偏差且难以实时响应火情变化。其解决方案的关键在于提出一种多模态野火蔓延预测框架(Multimodal Fire Spread Prediction Framework, MFiSP),通过融合社交媒体数据与遥感观测信息(如NASA FIRMS卫星影像),并在同化周期间动态调整燃料地图参数策略,使火势蔓延预测更贴合实际观测到的蔓延速率,从而显著提升预测准确性。
链接: https://arxiv.org/abs/2510.23934
作者: Alec Sathiyamoorthy,Wenhao Zhou,Xiangmin Zhou,Xiaodong Li,Iqbal Gondal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:The 2019-2020 Black Summer bushfires in Australia devastated 19 million hectares, destroyed 3,000 homes, and lasted seven months, demonstrating the escalating scale and urgency of wildfire threats requiring better forecasting for effective response. Traditional fire modeling relies on manual interpretation by Fire Behaviour Analysts (FBAns) and static environmental data, often leading to inaccuracies and operational limitations. Emerging data sources, such as NASA’s FIRMS satellite imagery and Volunteered Geographic Information, offer potential improvements by enabling dynamic fire spread prediction. This study proposes a Multimodal Fire Spread Prediction Framework (MFiSP) that integrates social media data and remote sensing observations to enhance forecast accuracy. By adapting fuel map manipulation strategies between assimilation cycles, the framework dynamically adjusts fire behavior predictions to align with the observed rate of spread. We evaluate the efficacy of MFiSP using synthetically generated fire event polygons across multiple scenarios, analyzing individual and combined impacts on forecast perimeters. Results suggest that our MFiSP integrating multimodal data can improve fire spread prediction beyond conventional methods reliant on FBAn expertise and static inputs.
zh
[AI-78] Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value Weight Triplet in Decoder-Only Transformers
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)中注意力机制所依赖的查询(Query)、键(Key)、值(Value)权重三元组是否可以简化的问题。其关键解决方案在于理论证明在一定简化假设下,查询权重是冗余的,从而可在不显著影响性能的前提下减少非嵌入层(non-embedding/lm-head)参数量超过8%。作者进一步在完整复杂度的GPT-3小规模架构上验证了该理论,表明去除查询权重后的模型仍能实现与标准基线相当的验证损失,为大规模场景下进一步压缩注意力机制参数提供了依据。
链接: https://arxiv.org/abs/2510.23912
作者: Marko Karbevski,Antonij Mijoski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
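查询权重冗余的直觉是:注意力分数 QKᵀ = X W_q W_kᵀ Xᵀ 中,乘积 W_q W_kᵀ 可并入单个矩阵,故在简化假设下可令 Q = X。以下用 numpy 对比两种形式,仅作形式示意(维度与初始化均为假设):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 8                                   # 序列长度、隐维度
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def attn_qkv(X):                              # 标准 QKV 注意力
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def attn_kv(X):                               # 去掉 W_q:直接以输入作查询
    K, V = X @ W_k, X @ W_v
    # 分数为 X W_k^T X^T,仍是 X 的双线性型;W_q 的作用可被 W_k 吸收
    return softmax(X @ K.T / np.sqrt(d)) @ V

print(attn_qkv(X).shape, attn_kv(X).shape)    # 输出形状一致
```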
zh
[AI-79] Group Interventions on Deep Networks for Causal Discovery in Subsystems
【速读】:该论文旨在解决现有因果发现方法在处理非线性多变量时间序列时,主要局限于成对变量间的因果关系识别,而忽视了变量组(即子系统)之间的集体因果影响问题。其解决方案的关键在于提出一种名为gCDMI的多组因果发现方法,通过在训练好的深度神经网络上施加组级干预,并结合模型不变性检验来推断变量组之间的因果关系。该方法包含三个核心步骤:首先利用深度学习联合建模所有时间序列组间的结构关系;其次对训练模型实施组级干预;最后通过不变性测试判定变量组间是否存在因果链接。实验证明,该方法在模拟数据和真实世界数据(如脑网络与气候生态系统)中均能有效揭示复杂的组级因果结构,为神经科学和气候科学等领域提供新的分析工具。
链接: https://arxiv.org/abs/2510.23906
作者: Wasim Ahmad,Maha Shadaydeh,Joachim Denzler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Access. We are working on the revised version
Abstract:Causal discovery uncovers complex relationships between variables, enhancing predictions, decision-making, and insights into real-world systems, especially in nonlinear multivariate time series. However, most existing methods primarily focus on pairwise cause-effect relationships, overlooking interactions among groups of variables, i.e., subsystems and their collective causal influence. In this study, we introduce gCDMI, a novel multi-group causal discovery method that leverages group-level interventions on trained deep neural networks and employs model invariance testing to infer causal relationships. Our approach involves three key steps. First, we use deep learning to jointly model the structural relationships among groups of all time series. Second, we apply group-wise interventions to the trained model. Finally, we conduct model invariance testing to determine the presence of causal links among variable groups. We evaluate our method on simulated datasets, demonstrating its superior performance in identifying group-level causal relationships compared to existing methods. Additionally, we validate our approach on real-world datasets, including brain networks and climate ecosystems. Our results highlight that applying group-level interventions to deep learning models, combined with invariance testing, can effectively reveal complex causal structures, offering valuable insights for domains such as neuroscience and climate science.
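组级干预加不变性检验的机制可以用“整组置换输入后比较预测误差”来近似演示;以下示意用线性模型代替论文中训练好的深度网络,数据生成过程为假设:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
e = rng.normal(size=(T, 2))
g1 = np.zeros((T, 2))                        # 组1:两条自相关时间序列
for t in range(1, T):
    g1[t] = 0.9 * g1[t - 1] + e[t]
g2 = 0.8 * g1[:, :1] + 0.1 * rng.normal(size=(T, 1))   # 组2 受组1 驱动

X, y = np.hstack([g1, g2])[:-1], g2[1:, 0]   # 用 t 时刻全部变量预测 g2 的 t+1 值
w = np.linalg.lstsq(X, y, rcond=None)[0]     # “训练好的模型”(线性代替深度网络)
base = np.mean((X @ w - y) ** 2)

def intervene_group(X, cols):
    """组级干预:整组置换,破坏该组与目标的依赖但保留其边缘分布。"""
    Z = X.copy()
    Z[:, cols] = Z[rng.permutation(len(Z))][:, cols]
    return Z

for name, cols in [("g1 -> g2", [0, 1]), ("g2 -> g2", [2])]:
    err = np.mean((intervene_group(X, cols) @ w - y) ** 2)
    print(name, "干预后误差增量:", err - base)   # 增量显著大于 0 则提示组级因果链接
```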
zh
[AI-80] RS-ORT: A Reduced-Space Branch-and-Bound Algorithm for Optimal Regression Trees ICLR2026
【速读】:该论文旨在解决混合整数规划(Mixed-Integer Programming, MIP)在回归任务中面临的两大挑战:一是现有方法仅适用于纯二值特征,难以处理连续特征;二是当数据规模较大时,计算复杂度急剧上升,导致求解不可行。为应对这些问题,作者提出了一种两阶段优化框架下的简化空间最优回归树(Reduced-Space Optimal Regression Trees, RS-ORT)——其核心创新在于设计了一个仅对树结构变量进行分支的专用分支定界(Branch-and-Bound, BB)算法,从而保证收敛性并独立于训练样本数量。该方案通过闭式叶节点预测、经验阈值离散化和精确深度为1子树解析等边界紧致技术,结合可分解的上下界策略,在保持全局最优性的前提下显著加速训练过程,并支持节点级并行执行,使百万级数据集上的训练可在四小时内完成,且生成更简洁、泛化能力更强的回归树。
链接: https://arxiv.org/abs/2510.23901
作者: Cristobal Heredia,Pedro Chumpitaz-Flores,Kaixun Hua
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 1 figure, uses ICLR 2026 LaTeX style. Submitted to arXiv as a preprint version
Abstract:Mixed-integer programming (MIP) has emerged as a powerful framework for learning optimal decision trees. Yet, existing MIP approaches for regression tasks are either limited to purely binary features or become computationally intractable when continuous, large-scale data are involved. Naively binarizing continuous features sacrifices global optimality and often yields needlessly deep trees. We recast the optimal regression-tree training as a two-stage optimization problem and propose Reduced-Space Optimal Regression Trees (RS-ORT) - a specialized branch-and-bound (BB) algorithm that branches exclusively on tree-structural variables. This design guarantees the algorithm’s convergence and its independence from the number of training samples. Leveraging the model’s structure, we introduce several bound tightening techniques - closed-form leaf prediction, empirical threshold discretization, and exact depth-1 subtree parsing - that combine with decomposable upper and lower bounding strategies to accelerate the training. The BB node-wise decomposition enables trivial parallel execution, further alleviating the computational intractability even for million-size datasets. Based on the empirical studies on several regression benchmarks containing both binary and continuous features, RS-ORT also delivers superior training and testing performance than state-of-the-art methods. Notably, on datasets with up to 2,000,000 samples with continuous features, RS-ORT can obtain guaranteed training performance with a simpler tree structure and a better generalization ability in four hours.
zh
[AI-81] Evaluating the effectiveness of LLM-based interoperability
【速读】:该论文旨在解决系统间互操作性(interoperability)在动态和异构系统架构中日益加剧的挑战,尤其是传统方法依赖人工开发互操作性构件所导致的时间与成本问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)实现运行时(runtime)的自主互操作,无需人工干预。研究通过在农业领域互操作性用例中测试13个开源LLM,并采用两种策略(DIRECT 和 CODEGEN)进行对比实验,发现qwen2.5-coder:32b模型在多数数据集版本中表现最优,尤其在包含单位转换的复杂场景下,CODEGEN策略仍能保持较高成功率,表明LLM具备实现自主系统互操作的潜力。
链接: https://arxiv.org/abs/2510.23893
作者: Rodrigo Falcão,Stefan Schweitzer,Julien Siebert,Emily Calvet,Frank Elberzhager
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Background: Systems of systems are becoming increasingly dynamic and heterogeneous, and this adds pressure on the long-standing challenge of interoperability. Besides its technical aspect, interoperability has also an economic side, as development time efforts are required to build the interoperability artifacts. Objectives: With the recent advances in the field of large language models (LLMs), we aim at analyzing the effectiveness of LLM-based strategies to make systems interoperate autonomously, at runtime, without human intervention. Method: We selected 13 open source LLMs and curated four versions of a dataset in the agricultural interoperability use case. We performed three runs of each model with each version of the dataset, using two different strategies. Then we compared the effectiveness of the models and the consistency of their results across multiple runs. Results: qwen2.5-coder:32b was the most effective model using both strategies DIRECT (average pass@1 = 0.99) and CODEGEN (average pass@1 = 0.89) in three out of four dataset versions. In the fourth dataset version, which included an unit conversion, all models using the strategy DIRECT failed, whereas using CODEGEN qwen2.5-coder:32b succeeded with an average pass@1 = 0.75. Conclusion: Some LLMs can make systems interoperate autonomously. Further evaluation in different domains is recommended, and further research on reliability strategies should be conducted.
zh
[AI-82] PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs
【速读】:该论文旨在解决开放源代码大语言模型(Large Language Models, LLMs)中文本水印难以实现的问题,即模型所有者缺乏有效手段验证生成文本是否源自其模型。现有方法在将闭源模型的水印知识蒸馏到开源模型时存在两个核心挑战:一是学习到的水印模式与预定义检测标准不一致导致检测性能下降;二是对下游修改(如微调或模型合并)敏感,水印易被破坏。解决方案的关键在于提出PRO方法,通过联合训练水印策略模型与LLM,在训练过程中嵌入易于模型学习且符合检测标准的水印模式,并引入正则化项模拟下游扰动以增强鲁棒性,从而显著提升水印的可检测性和对模型编辑的抗干扰能力。
链接: https://arxiv.org/abs/2510.23891
作者: Jiaqi Xue,Yifei Zhao,Mansour Al Ghanim,Shangqian Gao,Ruimin Sun,Qian Lou,Mengxin Zheng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Text watermarking for large language models (LLMs) enables model owners to verify text origin and protect intellectual property. While watermarking methods for closed-source LLMs are relatively mature, extending them to open-source models remains challenging, as developers cannot control the decoding process. Consequently, owners of open-source LLMs lack practical means to verify whether text was generated by their models. A core difficulty lies in embedding watermarks directly into model weights without hurting detectability. A promising idea is to distill watermarks from a closed-source model into an open one, but this suffers from (i) poor detectability due to mismatch between learned and predefined patterns, and (ii) fragility to downstream modifications such as fine-tuning or model merging. To overcome these limitations, we propose PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO jointly trains a watermark policy model with the LLM, producing patterns that are easier for the model to learn and more consistent with detection criteria. A regularization term further simulates downstream perturbations and penalizes degradation in watermark detectability, ensuring robustness under model edits. Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO substantially improves both watermark detectability and resilience to model modifications.
zh
[AI-83] Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
【速读】:该论文旨在解决自主智能体系统(Agentic AI)在自动化任务执行过程中所面临的安全风险问题,这些风险区别于传统人工智能安全和常规软件安全。其核心挑战在于如何识别、分类并应对由规划能力、工具调用、记忆机制和自主性共同引发的新类型威胁。解决方案的关键在于构建一套针对此类系统的威胁分类体系(taxonomy of threats),结合近期基准测试与评估方法,并从技术和治理双维度提出防御策略,从而推动“安全设计优先”(secure-by-design)的智能体系统开发。
链接: https://arxiv.org/abs/2510.23883
作者: Shrestha Datta,Shahriar Kabir Nahin,Anshuman Chhabra,Prasant Mohapatra
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, memory, and autonomy, are emerging as powerful, flexible platforms for automation. Their ability to autonomously execute tasks across web, software, and physical environments creates new and amplified security risks, distinct from both traditional AI safety and conventional software security. This survey outlines a taxonomy of threats specific to agentic AI, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies from both technical and governance perspectives. We synthesize current research and highlight open challenges, aiming to support the development of secure-by-design agent systems.
zh
[AI-84] Hybrid Modeling, Sim-to-Real Reinforcement Learning, and Large Language Model Driven Control for Digital Twins
【速读】:该论文旨在解决复杂动态系统建模与控制中的精度、泛化能力与计算效率之间的权衡问题,特别是在数字孪生(Digital Twin)框架下整合物理模型、数据驱动方法及混合策略的优化配置。其关键解决方案在于:在建模层面,采用混合分析与建模(Hybrid Analysis and Modeling, HAM)方法,在保持较高预测精度的同时实现良好的泛化能力和较低的计算开销;在控制层面,通过对比模型预测控制(Model Predictive Control, MPC)、强化学习(Reinforcement Learning, RL)和大语言模型(Large Language Model, LLM)驱动控制三种策略,发现MPC具备鲁棒性和可预测性,RL展现出强适应性,而LLM控制则在人机协同场景中提供灵活交互接口,从而为不同应用场景下的数字孪生系统提供了可扩展的建模与控制范式。
链接: https://arxiv.org/abs/2510.23882
作者: Adil Rasheed,Oscar Ravik,Omer San
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This work investigates the use of digital twins for dynamical system modeling and control, integrating physics-based, data-driven, and hybrid approaches with both traditional and AI-driven controllers. Using a miniature greenhouse as a test platform, four predictive models Linear, Physics-Based Modeling (PBM), Long Short Term Memory (LSTM), and Hybrid Analysis and Modeling (HAM) are developed and compared under interpolation and extrapolation scenarios. Three control strategies Model Predictive Control (MPC), Reinforcement Learning (RL), and Large Language Model (LLM) based control are also implemented to assess trade-offs in precision, adaptability, and implementation effort. Results show that in modeling HAM provides the most balanced performance across accuracy, generalization, and computational efficiency, while LSTM achieves high precision at greater resource cost. Among controllers, MPC delivers robust and predictable performance, RL demonstrates strong adaptability, and LLM-based controllers offer flexible human-AI interaction when coupled with predictive tools.
zh
[AI-85] Generating Creative Chess Puzzles
【速读】:该论文旨在解决生成式 AI (Generative AI) 在棋局谜题生成中难以产生真正具有创造性、美学价值及反直觉特性的输出问题。其解决方案的关键在于引入一种基于棋类引擎搜索统计信息设计的强化学习(Reinforcement Learning, RL)框架,通过新颖的奖励机制提升谜题的独特性、反直觉性、多样性与现实感,从而显著增强AI生成谜题的质量和创新性。实验表明,该方法将反直觉谜题生成率从监督学习的0.22%提升至2.5%,远超现有数据集和最佳Lichess训练模型的表现,并获得国际象棋专家的高度认可。
链接: https://arxiv.org/abs/2510.23881
作者: Xidong Feng,Vivek Veeriah,Marcus Chiam,Michael Dennis,Ryan Pachauri,Thomas Tumiel,Federico Barbero,Johan Obando-Ceron,Jiaxin Shi,Satinder Singh,Shaobo Hou,Nenad Tomašev,Tom Zahavy
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While Generative AI rapidly advances in various domains, generating truly creative, aesthetic, and counter-intuitive outputs remains a challenge. This paper presents an approach to tackle these difficulties in the domain of chess puzzles. We start by benchmarking Generative AI architectures, and then introduce an RL framework with novel rewards based on chess engine search statistics to overcome some of those shortcomings. The rewards are designed to enhance a puzzle’s uniqueness, counter-intuitiveness, diversity, and realism. Our RL approach dramatically increases counter-intuitive puzzle generation by 10x, from 0.22% (supervised) to 2.5%, surpassing existing dataset rates (2.1%) and the best Lichess-trained model (0.4%). Our puzzles meet novelty and diversity benchmarks, retain aesthetic themes, and are rated by human experts as more creative, enjoyable, and counter-intuitive than composed book puzzles, even approaching classic compositions. Our final outcome is a curated booklet of these AI-generated puzzles, which is acknowledged for creativity by three world-renowned experts.
zh
[AI-86] A PDE-Informed Latent Diffusion Model for 2-m Temperature Downscaling
【速读】:该论文旨在解决大气数据动态降尺度(dynamical downscaling)中高分辨率2米温度场重建的问题,其核心挑战在于如何在生成过程中保持物理一致性。解决方案的关键在于提出一种物理条件化的潜在扩散模型(physics-conditioned latent diffusion model),通过在训练目标中引入偏微分方程(PDE)损失项来增强模型输出的物理合理性;该PDE损失基于全分辨率空间中的解码潜变量计算,采用有限差分近似实现对有效平流-扩散平衡的约束,从而在微调阶段进一步提升生成场的物理可解释性与真实性。
链接: https://arxiv.org/abs/2510.23866
作者: Paul Rosu,Muchang Bahng,Erick Jiang,Rico Zhu,Vahid Tarokh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This work presents a physics-conditioned latent diffusion model tailored for dynamical downscaling of atmospheric data, with a focus on reconstructing high-resolution 2-m temperature fields. Building upon a pre-existing diffusion architecture and employing a residual formulation against a reference UNet, we integrate a partial differential equation (PDE) loss term into the model’s training objective. The PDE loss is computed in the full resolution (pixel) space by decoding the latent representation and is designed to enforce physical consistency through a finite-difference approximation of an effective advection-diffusion balance. Empirical observations indicate that conventional diffusion training already yields low PDE residuals, and we investigate how fine-tuning with this additional loss further regularizes the model and enhances the physical plausibility of the generated fields. The entirety of our codebase is available on Github, for future reference and development.
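PDE 损失项的核心是对解码后的全分辨率场计算平流-扩散残差的有限差分近似;以下为二维场上的最小 numpy 示意(速度场与扩散系数均为假设,训练时将残差 r 的均方作为附加损失):

```python
import numpy as np

def advection_diffusion_residual(T2, u, v, kappa, dx=1.0):
    """稳态有效平流-扩散残差 r = u*T_x + v*T_y - kappa*(T_xx + T_yy)。
    T2: (H, W) 温度场;u, v: 同形状速度分量;中心差分,忽略边界一圈。"""
    Tx = (T2[1:-1, 2:] - T2[1:-1, :-2]) / (2 * dx)
    Ty = (T2[2:, 1:-1] - T2[:-2, 1:-1]) / (2 * dx)
    Txx = (T2[1:-1, 2:] - 2 * T2[1:-1, 1:-1] + T2[1:-1, :-2]) / dx**2
    Tyy = (T2[2:, 1:-1] - 2 * T2[1:-1, 1:-1] + T2[:-2, 1:-1]) / dx**2
    return u[1:-1, 1:-1] * Tx + v[1:-1, 1:-1] * Ty - kappa * (Txx + Tyy)

H = W = 64
y, x = np.mgrid[0:H, 0:W] / H
T2 = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)      # 玩具温度场
u, v = np.ones((H, W)), np.zeros((H, W))                # 假设的速度场
r = advection_diffusion_residual(T2, u, v, kappa=0.1)
print("PDE loss:", np.mean(r**2))                       # 作为训练目标的附加项
```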
zh
[AI-87] From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production AAAI
【速读】:该论文旨在解决企业级场景中通用智能体(Generalist Agents)从原型开发到可交付业务价值的部署难题,这一挑战主要源于框架碎片化、开发周期长以及缺乏标准化评估实践。解决方案的关键在于提出并验证了计算机通用智能体(Computer Using Generalist Agent, CUGA),其采用具有强分析基础的分层规划-执行架构(hierarchical planner–executor architecture),在AppWorld和WebArena等基准测试中达到最先进性能;同时,在业务流程外包(Business-Process-Outsourcing, BPO)人才招聘领域开展试点,通过引入BPO-TA基准(26个任务,覆盖13个分析端点)评估其实用性,初步表明CUGA在保持接近专用智能体准确率的同时具备降低开发时间和成本的潜力,从而为通用智能体向企业级系统演进提供了技术路径与组织经验。
链接: https://arxiv.org/abs/2510.23856
作者: Segev Shlomov,Alon Oved,Sami Marreed,Ido Levy,Offer Akrabi,Avi Yaeli,Łukasz Strąk,Elizabeth Koumpan,Yinon Goldshtein,Eilam Shapira,Nir Mashkif,Asaf Adi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: AAAI Conference on Artificial Intelligence
Abstract:Agents are rapidly advancing in automating digital work, but enterprises face a harder challenge: moving beyond prototypes to deployed systems that deliver measurable business value. This path is complicated by fragmented frameworks, slow development, and the absence of standardized evaluation practices. Generalist agents have emerged as a promising direction, excelling on academic benchmarks and offering flexibility across task types, applications, and modalities. Yet, evidence of their use in production enterprise settings remains limited. This paper reports IBM’s experience developing and piloting the Computer Using Generalist Agent (CUGA), which has been open-sourced for the community (this https URL). CUGA adopts a hierarchical planner–executor architecture with strong analytical foundations, achieving state-of-the-art performance on AppWorld and WebArena. Beyond benchmarks, it was evaluated in a pilot within the Business-Process-Outsourcing talent acquisition domain, addressing enterprise requirements for scalability, auditability, safety, and governance. To support assessment, we introduce BPO-TA, a 26-task benchmark spanning 13 analytics endpoints. In preliminary evaluations, CUGA approached the accuracy of specialized agents while indicating potential for reducing development time and cost. Our contribution is twofold: presenting early evidence of generalist agents operating at enterprise scale, and distilling technical and organizational lessons from this initial pilot. We outline requirements and next steps for advancing research-grade architectures like CUGA into robust, enterprise-ready systems.
zh
[AI-88] Decentralized Multi-Agent Goal Assignment for Path Planning using Large Language Models
【速读】:该论文旨在解决多智能体路径规划(multi-agent path planning)中的去中心化目标分配问题,即在共享环境中,各智能体基于对环境的结构化表示(如网格可视化和场景数据)独立生成目标偏好排序,并通过交换排名后采用固定、确定性的冲突解决规则(如按智能体索引顺序)进行目标分配,无需协商或迭代协调。其核心创新在于利用大语言模型(Large Language Model, LLM)作为智能体决策机制,通过精心设计的提示(prompt)和结构化的定量信息输入,使LLM能够生成高质量的目标偏好排序,从而在完全可观测的网格世界中实现接近最优的完成时间(makespan),显著优于传统贪婪启发式方法。
链接: https://arxiv.org/abs/2510.23824
作者: Murad Ismayilov,Edwin Meriaux,Shuo Wen,Gregory Dudek
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at MIT URTC 2025
Abstract:Coordinating multiple autonomous agents in shared environments under decentralized conditions is a long-standing challenge in robotics and artificial intelligence. This work addresses the problem of decentralized goal assignment for multi-agent path planning, where agents independently generate ranked preferences over goals based on structured representations of the environment, including grid visualizations and scenario data. After this reasoning phase, agents exchange their goal rankings, and assignments are determined by a fixed, deterministic conflict-resolution rule (e.g., agent index ordering), without negotiation or iterative coordination. We systematically compare greedy heuristics, optimal assignment, and large language model (LLM)-based agents in fully observable grid-world settings. Our results show that LLM-based agents, when provided with well-designed prompts and relevant quantitative information, can achieve near-optimal makespans and consistently outperform traditional heuristics. These findings underscore the potential of language models for decentralized goal assignment in multi-agent path planning and highlight the importance of information structure in such systems.
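排名交换后的确定性冲突消解规则(按智能体索引顺序)可以写成几行纯 Python;排序本身可来自 LLM、贪婪启发式或最优分配,此处仅示意分配步骤:

```python
def resolve_assignments(rankings):
    """rankings[i] 是第 i 个智能体对目标的偏好排序(目标编号列表)。
    规则:按智能体索引从小到大,依次取其排名最靠前且未被占用的目标;
    无需协商或迭代协调,结果完全确定。"""
    taken, assignment = set(), {}
    for agent, prefs in enumerate(rankings):
        for g in prefs:
            if g not in taken:
                assignment[agent] = g
                taken.add(g)
                break
    return assignment

# 示意:3 个智能体对 3 个目标的排序(实际由各智能体基于网格独立生成)
rankings = [[0, 1, 2],
            [0, 2, 1],     # 与智能体 0 冲突目标 0,按索引规则让出,改取 2
            [2, 0, 1]]     # 目标 2、0 均被占用,最终分得 1
print(resolve_assignments(rankings))   # {0: 0, 1: 2, 2: 1}
```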
zh
[AI-89] ReCAP: Recursive Context-Aware Reasoning and Planning for Large Language Model Agents
【速读】:该论文旨在解决长时程任务中大语言模型(Large Language Models, LLMs)在多步推理与动态重规划方面面临的挑战,特别是传统顺序提示方法易出现上下文漂移(context drift)、目标信息丢失及循环失败问题,而分层提示方法则常导致跨层级连续性减弱或运行开销过大。其解决方案的核心是提出一种名为ReCAP(Recursive Context-Aware Reasoning and Planning)的分层框架,通过三个关键机制实现:(i) 提前规划分解(plan-ahead decomposition),即模型先生成完整的子任务列表,执行首个子任务并迭代优化剩余部分;(ii) 结构化父级计划重注入(structured re-injection of parent plans),确保递归返回时维持多层级上下文一致性;(iii) 内存高效执行策略,限制活跃提示长度以使成本随任务深度线性增长。上述机制协同作用,实现了高层目标与底层动作的对齐、冗余提示减少以及递归过程中的连贯上下文更新。
链接: https://arxiv.org/abs/2510.23822
作者: Zhenyu Zhang,Tianyi Chen,Weiran Xu,Alex Pentland,Jiaxin Pei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon tasks requiring multi-step reasoning and dynamic re-planning remain challenging for large language models (LLMs). Sequential prompting methods are prone to context drift, loss of goal information, and recurrent failure cycles, while hierarchical prompting methods often weaken cross-level continuity or incur substantial runtime overhead. We introduce ReCAP (Recursive Context-Aware Reasoning and Planning), a hierarchical framework with shared context for reasoning and planning in LLMs. ReCAP combines three key mechanisms: (i) plan-ahead decomposition, in which the model generates a full subtask list, executes the first item, and refines the remainder; (ii) structured re-injection of parent plans, maintaining consistent multi-level context during recursive return; and (iii) memory-efficient execution, bounding the active prompt so costs scale linearly with task depth. Together these mechanisms align high-level goals with low-level actions, reduce redundant prompting, and preserve coherent context updates across recursion. Experiments demonstrate that ReCAP substantially improves subgoal alignment and success rates on various long-horizon reasoning benchmarks, achieving a 32% gain on synchronous Robotouille and a 29% improvement on asynchronous Robotouille under the strict pass@1 protocol.
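ReCAP 的提前规划分解与父级计划重注入可以写成一个递归函数骨架;以下示意中 llm() 为占位符(返回空串,仅演示控制流),并非论文的开源实现:

```python
def llm(prompt: str) -> str:
    """占位符:实际应调用大语言模型;此处返回空串,仅演示控制流。"""
    return ""

def recap(task, parent_plans, depth=0, max_depth=3):
    """ReCAP 式递归执行骨架(示意)。
    parent_plans:各上层计划文本,递归返回时重新注入,保持多层级上下文一致。"""
    context = "\n".join(parent_plans)                 # 结构化重注入父级计划
    plan = llm(f"{context}\n任务: {task}\n列出完整子任务列表").splitlines()
    while plan:
        sub = plan.pop(0)                             # 执行首个子任务
        if depth < max_depth and llm(f"'{sub}' 需要继续分解吗?(y/n)") == "y":
            # 递归进入子任务;把当前层计划一并传下去,返回后上下文不丢失
            recap(sub, parent_plans + [f"第{depth}层计划: {sub} | 剩余: {plan}"],
                  depth + 1)
        else:
            llm(f"{context}\n执行动作: {sub}")
        # 执行后基于最新结果修订剩余子任务(plan-ahead 的“迭代优化”);
        # 活跃提示只含当前层计划与父级摘要,成本随深度线性增长
        plan = llm(f"{context}\n剩余子任务: {plan}\n请根据最新结果修订").splitlines()

recap("整理实验数据并生成报告", parent_plans=["根目标: 完成周报"])
```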
zh
[AI-90] Evaluating In Silico Creativity: An Expert Review of AI Chess Compositions NEURIPS2025
【速读】:该论文旨在解决生成式 AI(Generative AI)在创作具有美学价值、新颖性及反直觉独特解法的国际象棋谜题方面的创造力评估问题。其解决方案的关键在于构建一个能够生成高质量谜题的AI系统,并通过三位国际象棋领域权威专家(包括国际大师Amatzia Avni、特级大师Jonathan Levitt和Matthew Sadler)的评审,从创意性、挑战性和美学设计等维度验证所生成谜题的质量与创新水平。
链接: https://arxiv.org/abs/2510.23772
作者: Vivek Veeriah,Federico Barbero,Marcus Chiam,Xidong Feng,Michael Dennis,Ryan Pachauri,Thomas Tumiel,Johan Obando-Ceron,Jiaxin Shi,Shaobo Hou,Satinder Singh,Nenad Tomašev,Tom Zahavy
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the Creative AI Track, NeurIPS 2025
Abstract:The rapid advancement of Generative AI has raised significant questions regarding its ability to produce creative and novel outputs. Our recent work investigates this question within the domain of chess puzzles and presents an AI system designed to generate puzzles characterized by aesthetic appeal, novelty, counter-intuitive and unique solutions. We briefly discuss our method below and refer the reader to the technical paper for more details. To assess our system’s creativity, we presented a curated booklet of AI-generated puzzles to three world-renowned experts: International Master for chess compositions Amatzia Avni, Grandmaster Jonathan Levitt, and Grandmaster Matthew Sadler. All three are noted authors on chess aesthetics and the evolving role of computers in the game. They were asked to select their favorites and explain what made them appealing, considering qualities such as their creativity, level of challenge, or aesthetic design.
zh
[AI-91] TDFlow: Agentic Workflows for Test-Driven Software Engineering
【速读】:该论文旨在解决自动化软件工程修复中长期存在的高复杂度与低成功率问题,特别是针对人类编写的测试用例(human-written tests)进行高效、准确的补丁生成与调试。其核心挑战在于如何在大规模代码库(repository-scale)环境下实现稳定且可解释的程序修复能力,而传统方法受限于长上下文处理能力不足和任务耦合度高导致的性能瓶颈。解决方案的关键在于提出一种名为TDFlow的测试驱动型代理工作流(test-driven agentic workflow),通过将修复过程严格分解为四个独立子任务——补丁提议、调试、补丁修订及可选测试生成,并由专门设计的子代理(sub-agents)分别负责,从而实现任务解耦、专注执行与局部优化。这种结构化约束显著降低了单个代理的认知负担,提升了整体系统在SWE-Bench Lite和SWE-Bench Verified基准上的表现(分别达到88.8%和94.3%的通过率),并揭示出当前限制达到人类水平软件工程能力的主要障碍在于有效重现测试(reproduction tests)的生成质量。
链接: https://arxiv.org/abs/2510.23761
作者: Kevin Han,Siddharth Maddikayala,Tim Knappe,Om Patel,Austen Liao,Amir Barati Farimani
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:We introduce TDFlow, a novel test-driven agentic workflow that frames repository-scale software engineering as a test-resolution task, specifically designed to solve human-written tests. Given a set of tests, TDFlow repeatedly proposes, revises, and debugs repository-scale patches using precisely engineered sub-agents and tightly constrained tools. The workflow decomposes software engineering program repair into four components governed by respective sub-agents. This simple, forced decoupling of patch proposing, debugging, patch revision, and optional test generation (1) reduces long-context burden on any individual sub-agent, (2) focuses each sub-agent on specific, pre-defined sub-tasks, and (3) allows for specialized performance improvement on specific sub-tasks. When provided human-written tests, TDFlow attains 88.8% pass rate on SWE-Bench Lite (an absolute improvement of 27.8% over the next best system) and 94.3% on SWE-Bench Verified. Manual inspection of the 800 TDFlow runs within SWE-Bench Lite and Verified uncover only 7 instances of test hacking, which were subsequently counted as failures. Furthermore, we show that the primary obstacle to human-level software engineering performance lies within writing successful reproduction tests. We envision a human-LLM interactive system powered by TDFlow where human developers write tests solved by LLM systems. Together, these results indicate that modern LLMs, when embedded in a narrowly engineered, test-driven workflow, already achieve human-level test resolution – with the final frontier for fully autonomous repository repair being the accurate generation of valid reproduction tests.
zh
[AI-92] Explaining Robustness to Catastrophic Forgetting Through Incremental Concept Formation
【速读】:该论文旨在解决持续学习(continual learning)中的灾难性遗忘(catastrophic forgetting)问题,即模型在学习新知识时会丢失先前习得的信息。其解决方案的关键在于采用基于概念的、信息论驱动的学习机制:通过自适应结构重组提升学习可塑性,利用稀疏且选择性的参数更新减少干扰,并基于充分统计量的信息理论学习方式替代梯度反向传播,从而在不回放历史数据的前提下有效保留先验知识。实验表明,Cobweb/4V模型及其神经实现CobwebNN在多个复杂度不同的数据集上均展现出对灾难性遗忘的鲁棒性,验证了概念基础与信息论方法在构建稳定、自适应持续学习系统中的潜力。
链接: https://arxiv.org/abs/2510.23756
作者: Nicki Barari,Edward Kim,Christopher MacLellan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures, Advances in Cognitive Systems 2025
Abstract:Catastrophic forgetting remains a central challenge in continual learning, where models are required to integrate new knowledge over time without losing what they have previously learned. In prior work, we introduced Cobweb/4V, a hierarchical concept formation model that exhibited robustness to catastrophic forgetting in visual domains. Motivated by this robustness, we examine three hypotheses regarding the factors that contribute to such stability: (1) adaptive structural reorganization enhances knowledge retention, (2) sparse and selective updates reduce interference, and (3) information-theoretic learning based on sufficiency statistics provides advantages over gradient-based backpropagation. To test these hypotheses, we compare Cobweb/4V with neural baselines, including CobwebNN, a neural implementation of the Cobweb framework introduced in this work. Experiments on datasets of varying complexity (MNIST, Fashion-MNIST, MedMNIST, and CIFAR-10) show that adaptive restructuring enhances learning plasticity, sparse updates help mitigate interference, and the information-theoretic learning process preserves prior knowledge without revisiting past data. Together, these findings provide insight into mechanisms that can mitigate catastrophic forgetting and highlight the potential of concept-based, information-theoretic approaches for building stable and adaptive continual learning systems.
zh
[AI-93] Debiasing Reward Models by Representation Learning with Guarantees
【速读】:该论文旨在解决当前基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)方法中,奖励模型(Reward Model)因学习到数据中的虚假相关性(spurious correlations)而导致的偏差问题,例如响应长度、歧视性倾向、谄媚行为及概念偏见等。解决方案的关键在于提出一个理论严谨的框架,通过建模数据生成过程,将观测到的数据(如文本)视为由虚假和非虚假潜在变量共同生成;研究发现,即使缺乏对虚假潜在变量的代理表示,仍可从数据中理论上识别出非虚假潜在变量。这一发现启发了一种实用方法:利用变分推断(Variational Inference)恢复这些非虚假变量,并将其用于训练更鲁棒的奖励模型,从而在保留人类意图偏好信号的同时有效缓解虚假相关性问题。
链接: https://arxiv.org/abs/2510.23751
作者: Ignavier Ng,Patrick Blöbaum,Siddharth Bhandari,Kun Zhang,Shiva Kasiviswanathan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Recent alignment techniques, such as reinforcement learning from human feedback, have been widely adopted to align large language models with human preferences by learning and leveraging reward models. In practice, these models often exploit spurious correlations, involving, e.g., response length, discrimination, sycophancy, and conceptual bias, which is a problem that has received increasing attention. In this work, we propose a principled framework that mitigates these biases in reward models while preserving the underlying factors that reflect intended preferences. We first provide a formulation of the data-generating process, assuming that the observed data (e.g., text) is generated from both spurious and non-spurious latent variables. We show that, interestingly, these non-spurious latent variables can be theoretically identified from data, regardless of whether a surrogate for the spurious latent variables is available. This further inspires a practical method that uses variational inference to recover these variables and leverages them to train reward models. Experiments on synthetic and real-world datasets demonstrate that our method effectively mitigates spurious correlation issues and yields more robust reward models.
zh
[AI-94] Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra
【速读】:该论文旨在解决质谱(Tandem Mass Spectrometry, MS/MS)中未知化合物结构鉴定的难题,尤其是当目标化合物未出现在参考数据库时,传统方法依赖数据库匹配或复杂的多步预测流程(如中间片段或指纹预测),难以实现准确识别。解决方案的关键在于引入一种基于测试时微调(test-time tuning)的框架,通过增强预训练Transformer模型的学习能力,直接从质谱数据和分子式端到端生成分子结构,无需人工标注和中间步骤。该方法在NPLIB1和MassSpecGym两个基准上分别超越当前最优方法DiffMS 100%和20%,且在实验谱上表现出更强的动态适应性,相对常规微调提升62%性能,同时即使预测偏离真实结构,生成的候选分子仍保持较高的结构准确性,为人工解读和可靠鉴定提供支持。
链接: https://arxiv.org/abs/2510.23746
作者: Laura Mismetti,Marvin Alberts,Andreas Krause,Mara Graziani
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Tandem Mass Spectrometry enables the identification of unknown compounds in crucial fields such as metabolomics, natural product discovery and environmental analysis. However, current methods rely on database matching from previously observed molecules, or on multi-step pipelines that require intermediate fragment or fingerprint prediction. This makes finding the correct molecule highly challenging, particularly for compounds absent from reference databases. We introduce a framework that, by leveraging test-time tuning, enhances the learning of a pre-trained transformer model to address this gap, enabling end-to-end de novo molecular structure generation directly from the tandem mass spectra and molecular formulae, bypassing manual annotations and intermediate steps. We surpass the de-facto state-of-the-art approach DiffMS on two popular benchmarks NPLIB1 and MassSpecGym by 100% and 20%, respectively. Test-time tuning on experimental spectra allows the model to dynamically adapt to novel spectra, and the relative performance gain over conventional fine-tuning is of 62% on MassSpecGym. When predictions deviate from the ground truth, the generated molecular candidates remain structurally accurate, providing valuable guidance for human interpretation and more reliable identification.
zh
[AI-95] Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability NEURIPS2025
【速读】:该论文旨在解决多环境部分可观测马尔可夫决策过程(Multi-environment POMDPs, ME-POMDPs)中的鲁棒策略优化问题,即在一组共享状态、动作和观测空间但转移、观测或奖励模型可能不同的POMDP中,找到一个能最大化最差情形下累积奖励的单一策略。其解决方案的关键在于:首先将ME-POMDPs推广为初始信念集不确定的对抗性信念POMDPs(Adversarial-Belief POMDPs, AB-POMDPs),从而更一般化地建模不确定性;其次证明任意ME-POMDP均可等价转化为仅在转移与奖励或观测与奖励上变化的简化形式,且保持最优策略不变;在此基础上设计了精确与近似(基于点的)算法用于求解AB-POMDPs的鲁棒策略,从而有效扩展至ME-POMDPs,并在标准POMDP基准问题上验证了方法的有效性。
链接: https://arxiv.org/abs/2510.23744
作者: Eline M. Bovy,Caleb Probine,Marnix Suilen,Ufuk Topcu,Nils Jansen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2025
Abstract:Multi-environment POMDPs (ME-POMDPs) extend standard POMDPs with discrete model uncertainty. ME-POMDPs represent a finite set of POMDPs that share the same state, action, and observation spaces, but may arbitrarily vary in their transition, observation, and reward models. Such models arise, for instance, when multiple domain experts disagree on how to model a problem. The goal is to find a single policy that is robust against any choice of POMDP within the set, i.e., a policy that maximizes the worst-case reward across all POMDPs. We generalize and expand on existing work in the following way. First, we show that ME-POMDPs can be generalized to POMDPs with sets of initial beliefs, which we call adversarial-belief POMDPs (AB-POMDPs). Second, we show that any arbitrary ME-POMDP can be reduced to a ME-POMDP that only varies in its transition and reward functions or only in its observation and reward functions, while preserving (optimal) policies. We then devise exact and approximate (point-based) algorithms to compute robust policies for AB-POMDPs, and thus ME-POMDPs. We demonstrate that we can compute policies for standard POMDP benchmarks extended to the multi-environment setting.
zh
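【代码示意】:为直观理解"对一组候选环境模型最大化最坏情形回报"的设定,下面给出一个极简的 Python 示意:把问题退化为完全可观测的有限 MDP、策略固定,仅演示如何对同一策略在多个专家给出的环境模型上取最坏情形价值。其中状态数、均匀策略与随机环境均为假设的玩具设定,并非论文求解 AB-POMDP 的精确或点基算法。

```python
import numpy as np

def policy_value(P, R, pi, gamma=0.95):
    """固定策略 pi 下的状态价值: v = (I - gamma * P_pi)^{-1} r_pi。
    P: (S, A, S) 转移; R: (S, A) 奖励; pi: (S, A) 随机策略。"""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)      # 策略诱导的转移矩阵
    r_pi = np.einsum("sa,sa->s", pi, R)        # 策略诱导的期望奖励
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def worst_case_value(envs, pi, gamma=0.95):
    """多环境设定: 对同一策略, 取所有候选环境中的最坏情形价值。"""
    values = [policy_value(P, R, pi, gamma) for (P, R) in envs]
    return np.min(np.stack(values), axis=0)

rng = np.random.default_rng(0)
S_n, A_n = 4, 2
def random_env():
    P = rng.random((S_n, A_n, S_n)); P /= P.sum(-1, keepdims=True)
    return P, rng.random((S_n, A_n))

envs = [random_env() for _ in range(3)]        # 三位"专家"给出的不同模型
pi = np.full((S_n, A_n), 1.0 / A_n)            # 均匀随机策略(示意)
print(worst_case_value(envs, pi))
```

鲁棒策略优化即在此"最坏情形价值"上对策略本身做最大化;论文进一步处理部分可观测与对抗初始信念带来的困难。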
[AI-96] AI and the Decentering of Disciplinary Creativity
【速读】:该论文试图解决的问题是人工智能(AI)在科学问题求解中如何影响学科创造力(disciplinary creativity),特别是探讨AI介入是否可能取代或削弱科学家基于专业领域知识的创造性实践。其解决方案的关键在于区分“创造性方法”(creative approaches)与“创造性成果”(creative products),并提出“学科创造力”的概念——即运用特定学科的专业知识解决该领域内具有价值的问题。通过数学领域的两个案例分析,论文指出计算能力可以扩展学科创造力,但某些依赖AI的方法反而可能导致这种创造力被替代,从而可能降低科学研究的价值。
链接: https://arxiv.org/abs/2510.23734
作者: Eamon Duede
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper examines the role of artificial intelligence in scientific problem-solving, with a focus on its implications for disciplinary creativity. Drawing on recent work in the philosophy of creativity, I distinguish between creative approaches and creative products, and introduce the concept of disciplinary creativity -the creative application of discipline-specific expertise to a valued problem within that field. Through two cases in mathematics, I show that while computation can extend disciplinary creativity, certain approaches involving AI can serve to displace it. This displacement has the potential to alter (and, perhaps, diminish) the value of scientific pursuit.
zh
[AI-97] On the Societal Impact of Machine Learning
【速读】:该论文旨在解决机器学习(Machine Learning, ML)系统在社会应用中可能引发的不公平性问题,尤其是由于缺乏显式公平性考量而导致的算法歧视风险。其解决方案的关键在于:首先,开发更合适的公平性测量方法以准确评估ML系统的公平表现;其次,通过系统性分解ML模型来预测偏见演化路径;最后,提出有效的干预策略,在保持系统效用的同时降低算法歧视。这一系列工作为确保机器学习的社会影响与社会价值观相一致提供了理论基础和实践路径。
链接: https://arxiv.org/abs/2510.23693
作者: Joachim Baumann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: PhD thesis
Abstract:This PhD thesis investigates the societal impact of machine learning (ML). ML increasingly informs consequential decisions and recommendations, significantly affecting many aspects of our lives. As these data-driven systems are often developed without explicit fairness considerations, they carry the risk of discriminatory effects. The contributions in this thesis enable more appropriate measurement of fairness in ML systems, systematic decomposition of ML systems to anticipate bias dynamics, and effective interventions that reduce algorithmic discrimination while maintaining system utility. I conclude by discussing ongoing challenges and future research directions as ML systems, including generative artificial intelligence, become increasingly integrated into society. This work offers a foundation for ensuring that ML’s societal impact aligns with broader social values.
zh
[AI-98] Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
【速读】:该论文旨在解决当前通用游戏智能体(Generalist Game Agent)在跨域迁移能力、训练效率与推理成本之间难以平衡的问题,尤其是如何实现大规模持续预训练(Continual Pre-training)并保持对多样化环境(如操作系统、网页和模拟游戏)的泛化能力。解决方案的关键在于提出一种以人类对齐的原生键盘-鼠标输入为锚点的统一可扩展动作空间(Unified Scalable Action Space),结合衰减式持续损失(Decaying Continual Loss)以减少因果混淆,并引入高效的稀疏思考策略(Sparse-Thinking Strategy)来权衡推理深度与推理开销。该方法使得模型在开放世界Minecraft任务中成功率提升约2倍,在未见过的网页3D游戏中接近人类水平表现,并在FPS基准测试中优于GPT-5、Gemini-2.5-Pro和Claude-4-Sonnet等主流大模型。
链接: https://arxiv.org/abs/2510.23691
作者: Zihao Wang,Xujing Li,Yining Ye,Junjie Fang,Haoming Wang,Longxiang Liu,Shihao Liang,Junting Lu,Zhiyong Wu,Jiazhan Feng,Wanjun Zhong,Zili Li,Yu Wang,Yu Miao,Bo Zhou,Yuanfan Li,Hao Wang,Zhongkai Zhao,Faming Wu,Zhengxuan Jiang,Weihao Tan,Heyuan Yao,Shi Yan,Xiangyang Li,Yitao Liang,Yujia Qin,Guang Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about 2 times the success rate over the previous sota model on open-world Minecraft tasks, is close to the generality of fresh humans in unseen web 3d games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks. Scaling results on training-time and test-time confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.
zh
[AI-99] Parallel BiLSTM-Transformer networks for forecasting chaotic dynamics
【速读】:该论文旨在解决传统方法难以同时捕捉混沌时间序列中局部特征与全局依赖关系的问题,从而限制了对混沌系统演化行为的准确预测。其解决方案的关键在于提出一种融合Transformer与双向长短期记忆网络(Bidirectional Long Short-Term Memory, BiLSTM)的并行预测框架:其中Transformer分支负责建模长期依赖关系,BiLSTM分支专注于提取局部时序特征,二者通过专用的特征融合层进行互补信息整合,显著提升了模型在混沌系统预测任务中的准确性与鲁棒性。
链接: https://arxiv.org/abs/2510.23685
作者: Junwen Ma,Mingyu Ge,Yisen Wang,Yong Zhang,Weicheng Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures
Abstract:The nonlinear nature of chaotic systems results in extreme sensitivity to initial conditions and highly intricate dynamical behaviors, posing fundamental challenges for accurately predicting their evolution. To overcome the limitation that conventional approaches fail to capture both local features and global dependencies in chaotic time series simultaneously, this study proposes a parallel predictive framework integrating Transformer and Bidirectional Long Short-Term Memory (BiLSTM) networks. The hybrid model employs a dual-branch architecture, where the Transformer branch mainly captures long-range dependencies while the BiLSTM branch focuses on extracting local temporal features. The complementary representations from the two branches are fused in a dedicated feature-fusion layer to enhance predictive accuracy. As illustrative examples, the model's performance is systematically evaluated on two representative tasks in the Lorenz system. The first is autonomous evolution prediction, in which the model recursively extrapolates system trajectories from the time-delay embeddings of the state vector to evaluate long-term tracking accuracy and stability. The second is inference of unmeasured variable, where the model reconstructs the unobserved states from the time-delay embeddings of partial observations to assess its state-completion capability. The results consistently indicate that the proposed hybrid framework outperforms both single-branch architectures across tasks, demonstrating its robustness and effectiveness in chaotic system prediction.
zh
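【代码示意】:下面用 PyTorch 给出该双分支结构的一个最小示意:Transformer 分支建模长程依赖,BiLSTM 分支提取局部时序特征,二者在融合层拼接后输出预测。层数、维度与融合方式均为示意性假设,论文的具体配置请以原文为准。

```python
import torch
import torch.nn as nn

class ParallelBiLSTMTransformer(nn.Module):
    """双分支并行结构的极简示意: Transformer 分支建模长程依赖,
    BiLSTM 分支提取局部时序特征, 再经特征融合层输出预测。"""
    def __init__(self, in_dim=3, d_model=64, horizon=1):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.bilstm = nn.LSTM(in_dim, d_model // 2, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.fusion = nn.Sequential(                 # 特征融合层
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, horizon))

    def forward(self, x):                            # x: (B, T, in_dim)
        g = self.transformer(self.proj(x))[:, -1]    # 全局分支取末步表征
        l, _ = self.bilstm(x)                        # 局部分支
        return self.fusion(torch.cat([g, l[:, -1]], dim=-1))

model = ParallelBiLSTMTransformer()
x = torch.randn(8, 32, 3)                            # 8 条长度 32 的时延嵌入轨迹
print(model(x).shape)                                # torch.Size([8, 1])
```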
[AI-100] Beyond Prompt Engineering: Neuro-Symbolic-Causal Architecture for Robust Multi-Objective AI Agents
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为自主决策代理在高风险场景中部署时存在的脆弱性和不可靠性问题,即相同能力的LLM代理因提示(prompt)框架不同而产生截然不同的结果,导致灾难性后果。解决方案的关键在于提出一种神经符号因果架构(neuro-symbolic-causal architecture)Chimera,其核心由三个互补模块构成:一个LLM策略器、一个形式化验证的符号约束引擎(symbolic constraint engine),以及一个用于反事实推理的因果推断模块(causal inference module)。该架构通过形式化验证确保约束零违规,并在真实电商环境中实现了对价格弹性、信任动态和季节性需求等复杂因素的稳健响应,显著优于仅使用LLM或添加符号约束的基线方法,在利润和品牌信任度上均取得显著提升,证明了架构设计比提示工程更能决定自主代理在生产环境中的可靠性。
链接: https://arxiv.org/abs/2510.23682
作者: Gokturk Aytug Akarlar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
备注: 35 pages, 15 figures, 2 tables. Keywords: Large Language Models, Autonomous Agents, Neuro-Symbolic AI, Causal Inference, Formal Verification, Multi-Objective Optimization. Open-source code and interactive demo available
Abstract:Large language models show promise as autonomous decision-making agents, yet their deployment in high-stakes domains remains fraught with risk. Without architectural safeguards, LLM agents exhibit catastrophic brittleness: identical capabilities produce wildly different outcomes depending solely on prompt framing. We present Chimera, a neuro-symbolic-causal architecture that integrates three complementary components - an LLM strategist, a formally verified symbolic constraint engine, and a causal inference module for counterfactual reasoning. We benchmark Chimera against baseline architectures (LLM-only, LLM with symbolic constraints) across 52-week simulations in a realistic e-commerce environment featuring price elasticity, trust dynamics, and seasonal demand. Under organizational biases toward either volume or margin optimization, LLM-only agents fail catastrophically (total loss of $99K in volume scenarios) or destroy brand trust (-48.6% in margin scenarios). Adding symbolic constraints prevents disasters but achieves only 43-87% of Chimera's profit. Chimera consistently delivers the highest returns ($1.52M and $1.96M respectively, some cases +$2.2M) while improving brand trust (+1.8% and +10.8%, some cases +20.86%), demonstrating prompt-agnostic robustness. Our TLA+ formal verification proves zero constraint violations across all scenarios. These results establish that architectural design, not prompt engineering, determines the reliability of autonomous agents in production environments. We provide open-source implementations and interactive demonstrations for reproducibility.
zh
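【代码示意】:论文强调经形式化验证的符号约束引擎是防止 LLM 代理灾难性行为的关键。下面用普通 Python 断言模拟"约束引擎拦截 LLM 提议"这一思路;论文实际使用 TLA+ 做形式化验证,此处的定价规则、阈值与字段名均为假设的示意。

```python
from dataclasses import dataclass

@dataclass
class PricingAction:
    new_price: float
    discount_pct: float

# 示意性的符号约束集合(仅用普通谓词模拟, 非 TLA+ 规约)
CONSTRAINTS = [
    ("价格不得低于成本", lambda a, s: a.new_price >= s["unit_cost"]),
    ("单次调价幅度受限", lambda a, s: abs(a.new_price - s["price"]) <= 0.2 * s["price"]),
    ("折扣不得损害品牌信任", lambda a, s: a.discount_pct <= s["max_discount"]),
]

def guard(action, state):
    """约束引擎: 拒绝任何违反硬性规则的 LLM 提议, 返回违规原因列表。"""
    violations = [name for name, ok in CONSTRAINTS if not ok(action, state)]
    return (len(violations) == 0), violations

state = {"price": 100.0, "unit_cost": 60.0, "max_discount": 30.0}
proposal = PricingAction(new_price=55.0, discount_pct=45.0)  # 假想的 LLM 提议
ok, why = guard(proposal, state)
print(ok, why)   # False, 三条规则全部违反
```

无论提示词如何措辞,违反硬约束的动作都会在执行前被拦截,这正是"架构设计而非提示工程决定可靠性"的直观体现。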
[AI-101] QueryIPI: Query-agnostic Indirect Prompt Injection on Coding Agents
【速读】:该论文旨在解决现代集成于集成开发环境(IDE)中的代码生成代理(coding agents)所面临的间接提示注入(Indirect Prompt Injection, IPI)攻击问题,尤其是现有研究多局限于特定查询的不稳定性攻击,难以在多样化用户输入下保持高成功率。其核心挑战在于如何实现跨查询的、持续有效的攻击。解决方案的关键在于识别并利用代理内部提示(internal prompt)泄露这一共性漏洞,将原本难以控制的黑盒攻击转化为一个受约束的白盒优化问题;通过迭代式提示驱动过程,基于泄露的内部提示精炼恶意工具描述,从而实现对多种代理模型的高效攻击。实验表明,该方法——QueryIPI——在五个模拟代理中最高可达87%的成功率,并具备在真实系统中的迁移能力,揭示了LLM驱动代码代理的实际安全风险。
链接: https://arxiv.org/abs/2510.23675
作者: Yuchong Xie,Zesen Liu,Mingyu Luo,Zhixiang Zhang,Kaikai Zhang,Zongjie Li,Ping Chen,Shuai Wang,Dongdong She
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern coding agents integrated into IDEs combine powerful tools and system-level actions, exposing a high-stakes attack surface. Existing Indirect Prompt Injection (IPI) studies focus mainly on query-specific behaviors, leading to unstable attacks with lower success rates. We identify a more severe, query-agnostic threat that remains effective across diverse user inputs. This challenge can be overcome by exploiting a common vulnerability: leakage of the agent’s internal prompt, which turns the attack into a constrained white-box optimization problem. We present QueryIPI, the first query-agnostic IPI method for coding agents. QueryIPI refines malicious tool descriptions through an iterative, prompt-based process informed by the leaked internal prompt. Experiments on five simulated agents show that QueryIPI achieves up to 87 percent success, outperforming baselines, and the generated malicious descriptions also transfer to real-world systems, highlighting a practical security risk to modern LLM-based coding agents.
zh
[AI-102] RefleXGen: The unexamined code is not worth using
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成过程中存在的安全性问题。传统方法依赖于模型微调或构建专用安全代码数据集,成本较高且效率有限。其解决方案的关键在于提出RefleXGen框架,该框架通过融合检索增强生成(Retrieval-Augmented Generation, RAG)与LLM固有的引导式自我反思机制,在无需大量资源投入的前提下,实现代码生成过程的迭代优化。该方法使模型能够持续积累并精炼知识库,从而显著提升生成代码的安全性。
链接: https://arxiv.org/abs/2510.23674
作者: Bin Wang,Hui Li,AoFan Liu,BoTao Yang,Ao Yang,YiLu Zhong,Weixiang Huang,Yanping Zhang,Runhuai Huang,Weimin Zeng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Security in code generation remains a pivotal challenge when applying large language models (LLMs). This paper introduces RefleXGen, an innovative method that significantly enhances code security by integrating Retrieval-Augmented Generation (RAG) techniques with guided self-reflection mechanisms inherent in LLMs. Unlike traditional approaches that rely on fine-tuning LLMs or developing specialized secure code datasets - processes that can be resource-intensive - RefleXGen iteratively optimizes the code generation process through self-assessment and reflection without the need for extensive resources. Within this framework, the model continuously accumulates and refines its knowledge base, thereby progressively improving the security of the generated code. Experimental results demonstrate that RefleXGen substantially enhances code security across multiple models, achieving a 13.6% improvement with GPT-3.5 Turbo, a 6.7% improvement with GPT-4o, a 4.5% improvement with CodeQwen, and a 5.8% improvement with Gemini. Our findings highlight that improving the quality of model self-reflection constitutes an effective and practical strategy for strengthening the security of AI-generated code.
zh
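【代码示意】:RefleXGen 的核心是"检索增强 + 迭代自我反思"的生成循环。下面给出一个高度简化的流程骨架:其中 llm() 是需替换为真实模型接口的占位函数,retrieve() 用词重叠近似向量检索,循环轮数、终止条件与提示词均为示意性假设,并非论文的原始实现。

```python
def llm(prompt: str) -> str:
    """占位的 LLM 调用, 使用时替换为任意聊天模型接口。"""
    raise NotImplementedError

def retrieve(query: str, kb: list[str], k: int = 3) -> list[str]:
    """极简检索: 按共享词数排序取 top-k(真实系统应使用向量检索)。"""
    score = lambda doc: len(set(doc.split()) & set(query.split()))
    return sorted(kb, key=score, reverse=True)[:k]

def reflexgen_style_loop(task: str, kb: list[str], rounds: int = 3) -> str:
    """示意性迭代: 生成代码 -> 自我反思安全性 -> 据反思修订,
    并把反思结论写回知识库, 以便后续轮次复用(知识持续积累)。"""
    code = llm(f"为以下任务生成代码:\n{task}")
    for _ in range(rounds):
        ctx = "\n".join(retrieve(task, kb))
        critique = llm(f"参考安全知识:\n{ctx}\n审查下列代码的安全缺陷:\n{code}")
        if "无明显缺陷" in critique:
            break
        code = llm(f"根据审查意见修订代码:\n意见: {critique}\n代码: {code}")
        kb.append(critique)          # 知识库持续积累与精炼
    return code
```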
[AI-103] MCPGuard: Automatically Detecting Vulnerabilities in MCP Servers
【速读】:该论文旨在解决基于模型上下文协议(Model Context Protocol, MCP)的系统所面临的安全威胁问题,特别是由协议开放性和可扩展性引入的三大类风险:代理劫持攻击、MCP服务器的传统Web漏洞以及供应链安全问题。其解决方案的关键在于构建多层次防御体系,包括服务器端的主动扫描策略(如分层检测流水线、代理审计框架和零信任注册机制)与运行时交互监控方案,以实现对自然语言元数据语义解释层面的持续监督和策略执行,从而应对从传统代码执行向语义理解扩展的新型攻击面。
链接: https://arxiv.org/abs/2510.23673
作者: Bin Wang,Zexin Liu,Hao Yu,Ao Yang,Yenan Huang,Jing Guo,Huangsheng Cheng,Hui Li,Huiyu Wu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The Model Context Protocol (MCP) has emerged as a standardized interface enabling seamless integration between Large Language Models (LLMs) and external data sources and tools. While MCP significantly reduces development complexity and enhances agent capabilities, its openness and extensibility introduce critical security vulnerabilities that threaten system trustworthiness and user data protection. This paper systematically analyzes the security landscape of MCP-based systems, identifying three principal threat categories: (1) agent hijacking attacks stemming from protocol design deficiencies; (2) traditional web vulnerabilities in MCP servers; and (3) supply chain security. To address these challenges, we comprehensively survey existing defense strategies, examining both proactive server-side scanning approaches, ranging from layered detection pipelines and agentic auditing frameworks to zero-trust registry systems, and runtime interaction monitoring solutions that provide continuous oversight and policy enforcement. Our analysis reveals that MCP security fundamentally represents a paradigm shift where the attack surface extends from traditional code execution to semantic interpretation of natural language metadata, necessitating novel defense mechanisms tailored to this unique threat model.
zh
[AI-104] Sparsity and Superposition in Mixture of Experts
【速读】:该论文试图解决混合专家(Mixture of Experts, MoE)模型与密集网络在机制上的差异问题,特别是如何从特征表示的角度理解MoE模型的内部运作。以往研究指出,密集模型通过超位置(superposition)机制实现特征数量超过维度数的表示,且超位置特性依赖于特征稀疏性和重要性;但MoE模型无法用相同视角解释。论文的关键解决方案在于提出以网络稀疏性(network sparsity,即活跃专家占总专家的比例)作为核心指标来刻画MoE模型,并开发新的超位置测量方法用于跨专家分析。研究发现,更高的网络稀疏性会导致更强的单语义性(monosemanticity),并据此提出基于单语义特征表示而非负载均衡的新专家专业化定义,表明当初始化合理时,专家自然围绕一致的特征组合组织。这一发现挑战了“可解释性与能力不可兼得”的普遍假设,揭示了网络稀疏性可能成为提升MoE模型可解释性的关键机制。
链接: https://arxiv.org/abs/2510.23671
作者: Marmik Chaudhari,Jeremi Nuer,Rome Thorstenson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture of Experts (MoE) models have become central to scaling large language models, yet their mechanistic differences from dense networks remain poorly understood. Previous work has explored how dense models use superposition to represent more features than dimensions, and how superposition is a function of feature sparsity and feature importance. MoE models cannot be explained mechanistically through the same lens. We find that neither feature sparsity nor feature importance cause discontinuous phase changes, and that network sparsity (the ratio of active to total experts) better characterizes MoEs. We develop new metrics for measuring superposition across experts. Our findings demonstrate that models with greater network sparsity exhibit greater monosemanticity. We propose a new definition of expert specialization based on monosemantic feature representation rather than load balancing, showing that experts naturally organize around coherent feature combinations when initialized appropriately. These results suggest that network sparsity in MoEs may enable more interpretable models without sacrificing performance, challenging the common assumption that interpretability and capability are fundamentally at odds.
zh
[AI-105] Traffic flow forecasting STL decomposition Hybrid model LSTM ARIMA XGBoost Intelligent transportation systems
【速读】:该论文旨在解决交通流预测中单一模型难以捕捉复杂非线性及多尺度时间模式的问题。其解决方案的关键在于提出一种基于分解驱动的混合框架,通过Seasonal Trend decomposition using Loess (STL) 将原始时间序列分解为趋势、季节性和残差成分,再分别由LSTM、ARIMA和XGBoost三种互补模型对各分量进行建模,最终通过乘法集成获得预测结果。该策略有效分离了不同时间特征,使各子模型专注特定模式,从而显著提升预测精度、可解释性与鲁棒性。
链接: https://arxiv.org/abs/2510.23668
作者: Fujiang Yuan,Yangrui Fan,Xiaohuan Bing,Zhen Tian,Chunhong Yuan,Yankang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate traffic flow forecasting is essential for intelligent transportation systems and urban traffic management. However, single model approaches often fail to capture the complex, nonlinear, and multi scale temporal patterns in traffic flow data. This study proposes a decomposition driven hybrid framework that integrates Seasonal Trend decomposition using Loess (STL) with three complementary predictive models. STL first decomposes the original time series into trend, seasonal, and residual components. Then, a Long Short Term Memory (LSTM) network models long term trends, an Autoregressive Integrated Moving Average (ARIMA) model captures seasonal periodicity, and an Extreme Gradient Boosting (XGBoost) algorithm predicts nonlinear residual fluctuations. The final forecast is obtained through multiplicative integration of the sub model predictions. Using 998 traffic flow records from a New York City intersection between November and December 2015, results show that the LSTM ARIMA XGBoost hybrid model significantly outperforms standalone models including LSTM, ARIMA, and XGBoost across MAE, RMSE, and R squared metrics. The decomposition strategy effectively isolates temporal characteristics, allowing each model to specialize, thereby improving prediction accuracy, interpretability, and robustness.
zh
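【代码示意】:以下用 statsmodels 的 STL 给出该"分解后分治"思路的可运行示意。为保持示例自含,趋势分量用线性外推代替论文中的 LSTM,残差分量用滞后特征的最小二乘回归代替 XGBoost;STL 本身是加法分解,故此处以加和组合(论文采用乘法集成),周期 24、ARIMA 阶数等参数均为假设。

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
t = np.arange(500)
y = pd.Series(50 + 0.02 * t + 10 * np.sin(2 * np.pi * t / 24)
              + rng.normal(0, 2, 500))               # 模拟 24 步周期的交通流

res = STL(y, period=24).fit()                        # 分解: 趋势/季节/残差
trend, seasonal, resid = res.trend, res.seasonal, res.resid

# 趋势分量: 论文用 LSTM, 此处以线性外推代替以保持示例自含
coef_t = np.polyfit(t[-100:], trend[-100:], 1)
trend_pred = np.polyval(coef_t, t[-1] + 1)

# 季节分量: ARIMA 建模周期性
seasonal_fit = ARIMA(seasonal, order=(2, 0, 1)).fit()
seasonal_pred = float(np.asarray(seasonal_fit.forecast(1))[0])

# 残差分量: 论文用 XGBoost, 此处用滞后特征的最小二乘回归代替
lags = np.column_stack([resid.shift(i) for i in (1, 2, 3)])[3:]
coef_r, *_ = np.linalg.lstsq(lags, resid.values[3:], rcond=None)
resid_pred = resid.values[-3:][::-1] @ coef_r

print(trend_pred + seasonal_pred + resid_pred)       # 组合得到下一步预测(示意)
```

分解让每个子模型只需专注单一时间模式,这正是该混合框架优于单一模型的直观原因。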
[AI-106] Optimize Any Topology: A Foundation Model for Shape- and Resolution-Free Structural Topology Optimization
【速读】:该论文旨在解决结构拓扑优化(Structural Topology Optimization, TO)在工程设计中计算成本高、现有深度学习方法受限于固定网格格式与边界条件、且缺乏通用性的问题。其解决方案的关键在于提出了一种名为“Optimize Any Topology”(OAT)的基础模型框架,该框架通过结合一个与分辨率和形状无关的自编码器(autoencoder)、隐式神经场解码器(implicit neural-field decoder)以及在包含220万组优化结构的新数据集OpenTO上训练的条件潜在扩散模型(conditional latent-diffusion model),实现了对任意长宽比、分辨率、体积分数、载荷和约束条件下的最小柔度布局的直接预测,从而显著提升优化效率与泛化能力。
链接: https://arxiv.org/abs/2510.23667
作者: Amin Heyrani Nobari,Lyle Regenwetter,Cyril Picard,Ligong Han,Faez Ahmed
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Structural topology optimization (TO) is central to engineering design but remains computationally intensive due to complex physics and hard constraints. Existing deep-learning methods are limited to fixed square grids, a few hand-coded boundary conditions, and post-hoc optimization, preventing general deployment. We introduce Optimize Any Topology (OAT), a foundation-model framework that directly predicts minimum-compliance layouts for arbitrary aspect ratios, resolutions, volume fractions, loads, and fixtures. OAT combines a resolution- and shape-agnostic autoencoder with an implicit neural-field decoder and a conditional latent-diffusion model trained on OpenTO, a new corpus of 2.2 million optimized structures covering 2 million unique boundary-condition configurations. On four public benchmarks and two challenging unseen tests, OAT lowers mean compliance up to 90% relative to the best prior models and delivers sub-1 second inference on a single GPU across resolutions from 64 x 64 to 256 x 256 and aspect ratios as high as 10:1. These results establish OAT as a general, fast, and resolution-free framework for physics-aware topology optimization and provide a large-scale dataset to spur further research in generative modeling for inverse design. Code and data can be found at this https URL.
zh
[AI-107] Transformers from Compressed Representations
【速读】:该论文旨在解决压缩文件格式在表示学习(representation learning)领域长期被忽视的问题,即如何高效地从压缩数据流中提取语义信息,而无需进行完整的解码或原始字节级处理。其解决方案的关键在于提出 TEMPEST(TransformErs froM comPressed rEpreSenTations),该方法利用压缩文件固有的字节流结构设计了一种有效的分词(tokenization)与编码策略,使标准 Transformer 模型能够直接从压缩数据流中学习语义表示,从而显著减少用于语义分类的 token 数量,降低计算复杂度和内存消耗,同时保持与当前最优方法相当的准确率。
链接: https://arxiv.org/abs/2510.23665
作者: Juan C. Leon Alcazar,Mattia Soldan,Mohammad Saatialsoruji,Alejandro Pardo,Hani Itani,Juan Camilo Perez,Bernard Ghanem
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state-of-the-art while delivering efficiency gains in memory and compute.
zh
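【代码示意】:TEMPEST 的出发点是直接对压缩字节流做分词。下面给出一个极简示意:先用 zlib 压缩任意内容,再把相邻字节对映射为 token。这只演示"压缩域分词使 token 数远少于原始字节"的直觉,词表构造方式并非 TEMPEST 的实际编码方案。

```python
import zlib

def byte_pair_tokens(data: bytes, vocab_base: int = 256):
    """把压缩字节流按相邻字节对编码为 token(词表大小 256*256)。
    仅示意"直接在压缩表示上分词"的思路, 非论文的实际策略。"""
    if len(data) % 2:                    # 奇数长度时补零对齐
        data += b"\x00"
    return [data[i] * vocab_base + data[i + 1] for i in range(0, len(data), 2)]

raw = ("traffic " * 64).encode()         # 任意原始媒体内容的替身
compressed = zlib.compress(raw)
tokens = byte_pair_tokens(compressed)
print(len(raw), len(compressed), len(tokens))  # token 数远小于原始字节数
```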
[AI-108] Agentsway – Software Development Methodology for AI Agents-based Teams
【速读】:该论文旨在解决传统软件开发方法(如Agile、Kanban等)在AI代理(Agent)作为核心协作成员的环境中日益失效的问题,因为这些方法原本是为人类团队设计的,难以适应由自主AI代理参与规划、编码、测试和持续学习的新范式。解决方案的关键在于提出“Agentsway”框架,其核心是构建一个以人类协调为中心、隐私保护协作为基础的结构化生命周期,明确划分规划、提示、编码、测试和微调等角色的AI代理,并通过整合多轮反馈与细调大语言模型(LLM)形成回顾性学习机制,从而提升领域特定推理能力和可解释决策,同时嵌入负责任AI原则,实现透明、可问责的协作流程。
链接: https://arxiv.org/abs/2510.23664
作者: Eranga Bandara,Ross Gore,Xueping Liang,Sachini Rajapakse,Isurunima Kularathne,Pramoda Karunarathna,Peter Foytik,Sachin Shetty,Ravi Mukkamala,Abdul Rahman,Amin Hass,Ng Wee Keong,Kasun De Zoysa,Aruna Withanage,Nilaan Loganathan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of Agentic AI is fundamentally transforming how software is designed, developed, and maintained. Traditional software development methodologies such as Agile, Kanban, ShapeUp, etc, were originally designed for human-centric teams and are increasingly inadequate in environments where autonomous AI agents contribute to planning, coding, testing, and continuous learning. To address this methodological gap, we present “Agentsway” a novel software development framework designed for ecosystems where AI agents operate as first-class collaborators. Agentsway introduces a structured lifecycle centered on human orchestration, and privacy-preserving collaboration among specialized AI agents. The framework defines distinct roles for planning, prompting, coding, testing, and fine-tuning agents, each contributing to iterative improvement and adaptive learning throughout the development process. By integrating fine-tuned LLMs that leverage outputs and feedback from different agents throughout the development cycle as part of a retrospective learning process, Agentsway enhances domain-specific reasoning, and explainable decision-making across the entire software development lifecycle. Responsible AI principles are further embedded across the agents through the coordinated use of multiple fine-tuned LLMs and advanced reasoning models, ensuring balanced, transparent, and accountable decision-making. This work advances software engineering by formalizing agent-centric collaboration, integrating privacy-by-design principles, and defining measurable metrics for productivity and trust. Agentsway represents a foundational step toward the next generation of AI-native, self-improving software development methodologies. To the best of our knowledge, this is the first research effort to introduce a dedicated methodology explicitly designed for AI agent-based software engineering teams.
zh
[AI-109] Aligning Diffusion Language Models via Unpaired Preference Optimization
【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models, dLLMs)在对齐人类偏好时面临的两大挑战:一是序列对数似然难以计算(intractable sequence log-likelihoods),二是成对偏好数据收集成本高(costly pairwise preference data)。其解决方案的关键在于提出ELBO-KTO方法,该方法结合了证据下界(Evidence Lower Bound, ELBO)作为扩散对数似然的代理目标,以及基于前景理论(prospect-theoretic)的无配对偏好优化目标(Kahneman-Tversky Optimization, KTO),并通过方差减少技术稳定训练过程中的梯度。实验表明,该方法在多个基准测试中显著优于基线模型,验证了无配对偏好优化在扩散语言模型中的有效性。
链接: https://arxiv.org/abs/2510.23658
作者: Vaibhav Jindal,Hejian Sang,Chun-Mao Lai,Yanning Chen,Zhipeng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
zh
[AI-110] Error Adjustment Based on Spatiotemporal Correlation Fusion for Traffic Forecasting
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在交通预测中因假设预测误差在时间和空间上相互独立而导致的性能瓶颈问题。实际上,交通数据具有明显的时空自相关性(spatiotemporal autocorrelation),使得传统基于均方误差(Mean Squared Error, MSE)的训练方法无法充分建模误差间的依赖关系,从而限制了模型表现。解决方案的关键在于提出一种新颖且通用的框架——时空自相关误差调整(Spatiotemporally Autocorrelated Error Adjustment, SAEA),其核心创新包括:1)将预测误差建模为时空向量自回归(VAR)过程,通过系数矩阵显式捕获时空误差相关性,并嵌入到新的损失函数中;2)引入结构稀疏正则化以融合先验路网空间信息,使学习到的误差系数矩阵与实际道路拓扑结构一致;3)设计测试阶段的误差动态调整机制,实现实时预测优化。该方法显著提升了多种交通预测模型的性能。
链接: https://arxiv.org/abs/2510.23656
作者: Fuqiang Liu,Weiping Ding,Luis Miranda-Moreno,Lijun Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, 3 tables
Abstract:Deep neural networks (DNNs) play a significant role in an increasing body of research on traffic forecasting due to their effectively capturing spatiotemporal patterns embedded in traffic data. A general assumption of training the said forecasting models via mean squared error estimation is that the errors across time steps and spatial positions are uncorrelated. However, this assumption does not really hold because of the autocorrelation caused by both the temporality and spatiality of traffic data. This gap limits the performance of DNN-based forecasting models and is overlooked by current studies. To fill up this gap, this paper proposes Spatiotemporally Autocorrelated Error Adjustment (SAEA), a novel and general framework designed to systematically adjust autocorrelated prediction errors in traffic forecasting. Unlike existing approaches that assume prediction errors follow a random Gaussian noise distribution, SAEA models these errors as a spatiotemporal vector autoregressive (VAR) process to capture their intrinsic dependencies. First, it explicitly captures both spatial and temporal error correlations by a coefficient matrix, which is then embedded into a newly formulated cost function. Second, a structurally sparse regularization is introduced to incorporate prior spatial information, ensuring that the learned coefficient matrix aligns with the inherent road network structure. Finally, an inference process with test-time error adjustment is designed to dynamically refine predictions, mitigating the impact of autocorrelated errors in real-time forecasting. The effectiveness of the proposed approach is verified on different traffic datasets. Results across a wide range of traffic forecasting models show that our method enhances performance in almost all cases.
zh
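【代码示意】:SAEA 的关键是把预测误差显式建模为(向量)自回归过程并在测试时动态修正。下面用 numpy 给出一阶误差 VAR 的最小示意:对历史误差拟合系数矩阵 Phi,再用上一步误差外推并修正当前预测。结构稀疏正则与论文的完整损失函数在此省略,数据为人工构造。

```python
import numpy as np

def fit_error_var(E):
    """用一阶 VAR 拟合预测误差: e_t ≈ Phi @ e_{t-1}。
    E: (T, N) 历史误差, N 为传感器/路段数。"""
    X, Y = E[:-1], E[1:]
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)         # 解 X @ B = Y
    return B.T                                        # (N, N) 系数矩阵 Phi

def adjust(pred_next, last_error, Phi):
    """测试阶段误差调整: 用上一步观测误差外推本步误差并修正预测。"""
    return pred_next + Phi @ last_error

rng = np.random.default_rng(1)
N, T = 5, 400
A = 0.5 * np.eye(N) + 0.1 * rng.standard_normal((N, N))
E = np.zeros((T, N))
for t in range(1, T):                                 # 构造时空自相关的误差序列
    E[t] = E[t - 1] @ A.T + rng.normal(0, 0.1, N)

Phi = fit_error_var(E)
pred = rng.normal(size=N)                             # 某预测模型的原始输出(示意)
print(adjust(pred, E[-1], Phi))
```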
[AI-111] The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限的边缘设备上部署时面临的计算成本高、模型规模庞大等问题。现有层剪枝(layer pruning)方法通常依赖人工设计的指标逐层评估并移除冗余层,忽视了层间的依赖关系,易破坏信息流并导致性能显著下降。为应对这一挑战,作者提出了一种新颖的连续层剪枝框架(Continuous Layer Pruning, CLP),其关键创新在于:一是引入可微凹门控算法(differentiable concave gate algorithm),通过梯度优化自动识别最优的连续层段进行剪枝;二是设计截断端点调优策略(cutoff endpoint tuning strategy),仅对剪枝段邻近的层进行微调以有效恢复模型性能。实验表明,CLP在多个模型架构和参数规模下均显著优于现有最先进方法。
链接: https://arxiv.org/abs/2510.23652
作者: Yao Lu,Yuqi Li,Wenbin Xie,Shanqing Yu,Qi Xuan,Zhaowei Zhu,Shiping Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Although large language models (LLMs) have achieved revolutionary breakthroughs in many fields, their large model size and high computational cost pose significant challenges for practical deployment on resource-constrained edge devices. To this end, layer pruning has been proposed to reduce the computational overhead by directly removing redundant layers. However, existing layer pruning methods typically rely on hand-crafted metrics to evaluate and remove individual layers, while ignoring the dependencies between layers. This can disrupt the model's information flow and severely degrade performance. To address these issues, we propose CLP, a novel continuous layer pruning framework that introduces two key innovations: a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning via gradient-based optimization; and a cutoff endpoint tuning strategy that effectively restores model performance by fine-tuning only the layers adjacent to the pruned segments. Extensive experiments across multiple model architectures (including LLaMA2, LLaMA3 and Qwen) and sizes (from 7B to 70B parameters) show that CLP significantly outperforms existing state-of-the-art baselines. For example, at a pruning rate of 20%, CLP achieves an average performance retention of 95.34% on LLaMA3-70B, outperforming baselines by 4.29%-30.52%. Furthermore, CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
zh
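【代码示意】:论文可微凹门控的具体形式以原文为准;下面仅给出"用可微参数刻画一段连续层并对其软性关断"这一思路的 PyTorch 示意:门控由可学习的中心与宽度参数化,因此被剪的层天然连续,可随任务损失一起做梯度优化。锐度、层数与损失项均为假设。

```python
import torch

def contiguous_gate(num_layers, center, width, sharpness=8.0):
    """可微的连续段门控: 落在 [center-width/2, center+width/2] 内的层
    门控值趋近 0(被剪), 其余趋近 1。center/width 可由梯度优化。"""
    idx = torch.arange(num_layers, dtype=torch.float32)
    left = torch.sigmoid(sharpness * (idx - (center - width / 2)))
    right = torch.sigmoid(sharpness * ((center + width / 2) - idx))
    return 1.0 - left * right                 # 段内≈0, 段外≈1

center = torch.tensor(10.0, requires_grad=True)
width = torch.tensor(4.0, requires_grad=True)
g = contiguous_gate(32, center, width)

# 训练时可将 g[i] 乘到第 i 层残差分支上, 用任务损失加稀疏惩罚联合优化
loss = g.sum()                                # 示意性的损失项
loss.backward()
print(g.detach().round(decimals=2), center.grad, width.grad)
```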
[AI-112] Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的偏见问题,尤其是对齐后模型在生成文本时可能延续或放大社会偏见的现象。解决方案的关键在于提出两种零样本(zero-shot)的 logits 层去偏方法——Static 和 Dynamic,其中 Dynamic 方法通过动态调整 logits 分布,在减少偏见高达 70% 的同时保持最小的语言流畅性损失;此外,研究发现 logits 层干预相较于隐藏层干预更具优势,并且语义感知的 logits 干预能稳定有效地提升对齐 LLMs 的公平性表现。
链接: https://arxiv.org/abs/2510.23650
作者: Wei Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose Static and Dynamic – two zero-shot logits-layer debiasing methods. Dynamic reduces bias by up to 70% with minimal fluency loss. Logits intervention outperforms hidden-layer approaches. We show semantic-aware logits intervention is stable and effective for debiasing aligned LLMs.
zh
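【代码示意】:logits 层干预的基本形式非常直接:在解码时对指定 token 的 logits 做加减。下面的示意中,带偏 token 集合、补偿 token 集合与固定强度 alpha 均为假设;论文的 Dynamic 方法会按语义上下文动态确定这些量。

```python
import torch

def debias_logits(logits, biased_ids, counter_ids, alpha=2.0):
    """logits 层干预的极简示意: 压低一组带偏 token, 并补偿其替代 token。
    真实方法需按上下文语义动态确定 token 集与强度(此处 alpha 固定)。"""
    adjusted = logits.clone()
    adjusted[..., biased_ids] -= alpha
    adjusted[..., counter_ids] += alpha
    return adjusted

vocab = 16
logits = torch.randn(1, vocab)
biased_ids = torch.tensor([3, 7])       # 假想的刻板印象词 id
counter_ids = torch.tensor([4, 8])      # 假想的替代词 id
probs_before = logits.softmax(-1)
probs_after = debias_logits(logits, biased_ids, counter_ids).softmax(-1)
print(probs_before[0, 3].item(), probs_after[0, 3].item())  # 带偏词概率下降
```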
[AI-113] Efficient Low Rank Attention for Long-Context Inference in Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文本输入时,由于键值(Key-Value, KV)缓存占用大量GPU显存而导致的资源受限设备上长上下文推理效率低下的问题。现有方法如KV量化和剪枝虽能减少内存消耗,但会引入数值精度损失或关键信息保留不足。其解决方案的核心是提出一种两阶段框架——低秩查询与键注意力(Low Rank Query and Key attention, LRQK):首先在预填充阶段将全精度查询和键矩阵分解为紧凑的秩-$r$因子,随后在解码每个步骤中利用这些低维投影以 $\mathcal{O}(lr)$ 时间复杂度计算代理注意力分数;同时结合混合GPU-CPU缓存机制与命中-未命中策略,仅传输缺失的全精度KV对,从而在保持精确注意力输出的同时显著降低CPU-GPU数据移动开销,实现内存节省与精度损失最小化的平衡。
链接: https://arxiv.org/abs/2510.23649
作者: Tenghui Li,Guoxu Zhou,Xuyang Zhao,Yuning Qiu,Qibin Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:As the length of input text grows, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. We introduce Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-$r$ factors during the prefill stage, and then uses these low-dimensional projections to compute proxy attention scores in $\mathcal{O}(lr)$ time at each decode step. By selecting only the top-$k$ tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long context settings, while delivering significant memory savings with minimal loss in accuracy. Our code is available at this https URL.
zh
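【代码示意】:下面用 PyTorch 给出 LRQK 两阶段思路的单步解码示意:预填充时对 K 做截断 SVD 得到秩 r 的投影基,解码时用低维代理分数选 top-k,再仅对选中 token 做全精度注意力。混合 GPU-CPU 缓存与近期 token 保留机制在此省略,r、k 取值为示意。

```python
import torch

def lrqk_decode_step(q, K, V, r=8, k=32):
    """LRQK 思路的单步解码示意: 低秩代理分数选 top-k, 再做精确注意力。"""
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    B = Vh[:r].T                         # (d, r) 低秩投影基(预填充阶段计算)
    proxy = (q @ B) @ (K @ B).T          # O(l*r) 的代理注意力分数
    top = proxy.topk(min(k, K.shape[0])).indices
    Ks, Vs = K[top], V[top]              # 只取 top-k 的全精度 KV
    attn = torch.softmax((q @ Ks.T) / Ks.shape[-1] ** 0.5, dim=-1)
    return attn @ Vs

L, d = 1024, 64
q = torch.randn(d)
K, V = torch.randn(L, d), torch.randn(L, d)
print(lrqk_decode_step(q, K, V).shape)   # torch.Size([64])
```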
[AI-114] RoGBot: Relationship-Oblivious Graph-based Neural Network with Contextual Knowledge for Bot Detection
【速读】:该论文旨在解决社交媒体平台上自动化账户(bots)检测的难题,尤其针对现有方法高度依赖显式用户间关系数据(如关注/被关注关系)而导致在缺乏此类信息时适用性受限的问题。其解决方案的关键在于提出一种新颖的多模态框架,通过结合深度文本特征与增强的用户行为元数据,并利用GraphSAGE模型进行图推理,从而在无需follower-following关系数据的前提下捕捉用户行为的局部与全局模式;该方法采用基于Transformer的模型(如BERT)提取推文语义嵌入并聚合为用户级表征,显著提升了对日益复杂bot策略的识别准确率,在Cresci-15、Cresci-17和PAN 2019数据集上分别达到99.8%、99.1%和96.8%的准确率。
链接: https://arxiv.org/abs/2510.23648
作者: Ashutosh Anshul,Mohammad Zia Ur Rehman,Sri Akash Kadali,Nagendra Kumar
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE
Abstract:Detecting automated accounts (bots) among genuine users on platforms like Twitter remains a challenging task due to the evolving behaviors and adaptive strategies of such accounts. While recent methods have achieved strong detection performance by combining text, metadata, and user relationship information within graph-based frameworks, many of these models heavily depend on explicit user-user relationship data. This reliance limits their applicability in scenarios where such information is unavailable. To address this limitation, we propose a novel multimodal framework that integrates detailed textual features with enriched user metadata while employing graph-based reasoning without requiring follower-following data. Our method uses transformer-based models (e.g., BERT) to extract deep semantic embeddings from tweets, which are aggregated using max pooling to form comprehensive user-level representations. These are further combined with auxiliary behavioral features and passed through a GraphSAGE model to capture both local and global patterns in user behavior. Experimental results on the Cresci-15, Cresci-17, and PAN 2019 datasets demonstrate the robustness of our approach, achieving accuracies of 99.8%, 99.1%, and 96.8%, respectively, and highlighting its effectiveness against increasingly sophisticated bot strategies.
zh
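【代码示意】:该方法的图推理部分基于 GraphSAGE 的邻居聚合。下面用 numpy 演示其核心更新式 h_v' = ReLU(W_s·h_v + W_n·mean(h_u)),并按论文思路把推文嵌入与行为元数据拼接成用户级输入;图结构、维度与权重均为随机示意,非论文配置。

```python
import numpy as np

def graphsage_layer(H, neighbors, W_self, W_neigh):
    """GraphSAGE 均值聚合层的示意: h_v' = ReLU(W_s h_v + W_n mean(h_u))。"""
    out = np.empty((H.shape[0], W_self.shape[0]))
    for v, nbrs in enumerate(neighbors):
        m = H[nbrs].mean(0) if nbrs else np.zeros(H.shape[1])
        out[v] = np.maximum(0.0, W_self @ H[v] + W_neigh @ m)
    return out

rng = np.random.default_rng(0)
n_users, d_text, d_meta = 6, 8, 4
tweet_emb = rng.normal(size=(n_users, d_text))   # 每用户推文嵌入经 max-pool 的结果(示意)
meta = rng.normal(size=(n_users, d_meta))        # 行为元数据特征
H = np.concatenate([tweet_emb, meta], axis=1)    # 用户级多模态表征

neighbors = [[1, 2], [0], [0, 3], [2, 4], [3, 5], [4]]   # 示意图结构(非关注关系)
W_self = rng.normal(size=(16, H.shape[1])) * 0.1
W_neigh = rng.normal(size=(16, H.shape[1])) * 0.1
print(graphsage_layer(H, neighbors, W_self, W_neigh).shape)  # (6, 16)
```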
[AI-115] SAND: A Self-supervised and Adaptive NAS-Driven Framework for Hardware Trojan Detection
【速读】:该论文旨在解决硬件木马(Hardware Trojan, HT)在嵌入式系统中日益严峻的安全威胁问题,尤其是现有基于机器学习的检测方法因特征选择随意性和缺乏适应性而导致在多种HT攻击场景下效果受限。其解决方案的关键在于提出SAND框架——一个自监督且可自适应的神经架构搜索(Neural Architecture Search, NAS)驱动的高效HT检测机制:首先利用自监督学习(Self-Supervised Learning, SSL)实现自动化特征提取,摆脱对人工设计特征的依赖;其次通过NAS动态优化下游分类器结构,使模型能以最小微调代价适配未见基准测试,从而显著提升检测准确率(最高达18.3%)、抗逃避木马能力和跨场景泛化性能。
链接: https://arxiv.org/abs/2510.23643
作者: Zhixin Pan,Ziyu Shu,Linh Nguyen,Amberbir Alemayoh
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The globalized semiconductor supply chain has made Hardware Trojans (HT) a significant security threat to embedded systems, necessitating the design of efficient and adaptable detection mechanisms. Despite promising machine learning-based HT detection techniques in the literature, they suffer from ad hoc feature selection and the lack of adaptivity, all of which hinder their effectiveness across diverse HT attacks. In this paper, we propose SAND, a self-supervised and adaptive NAS-driven framework for efficient HT detection. Specifically, this paper makes three key contributions. (1) We leverage self-supervised learning (SSL) to enable automated feature extraction, eliminating the dependency on manually engineered features. (2) SAND integrates neural architecture search (NAS) to dynamically optimize the downstream classifier, allowing for seamless adaptation to unseen benchmarks with minimal fine-tuning. (3) Experimental results show that SAND achieves a significant improvement in detection accuracy (up to 18.3%) over state-of-the-art methods, exhibits high resilience against evasive Trojans, and demonstrates strong generalization.
zh
[AI-116] Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging
【速读】:该论文旨在解决Transformer模型在高能粒子碰撞数据处理中因二次计算复杂度导致的资源消耗大、推理延迟高的问题,尤其是在CERN LHC这类高数据吞吐场景下的部署挑战。解决方案的关键在于提出一种物理启发的线性注意力机制改进方法——空间感知线性Transformer(Spatially Aware Linear Transformer, SAL-T):首先基于粒子的运动学特征进行空间感知分区,仅在具有物理意义的区域间计算注意力;其次引入卷积层以捕捉局部相关性,借鉴喷注(jet)物理中的先验知识。该设计在保持线性计算复杂度的同时,显著提升了模型性能,在喷注分类任务中达到与全注意力Transformer相当的精度,且推理资源占用和延迟大幅降低。
链接: https://arxiv.org/abs/2510.23641
作者: Aaron Wang,Zihan Zhao,Subash Katel,Vivekanand Gyanchand Sahu,Elham E Khoda,Abhijith Gandrakota,Jennifer Ngadiuba,Richard Cavanaugh,Javier Duarte
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); Instrumentation and Detectors (physics.ins-det)
备注:
Abstract:Transformers are very effective in capturing both global and local correlations within high-energy particle collisions, but they present deployment challenges in high-data-throughput environments, such as the CERN LHC. The quadratic complexity of transformer models demands substantial resources and increases latency during inference. In order to address these issues, we introduce the Spatially Aware Linear Transformer (SAL-T), a physics-inspired enhancement of the linformer architecture that maintains linear attention. Our method incorporates spatially aware partitioning of particles based on kinematic features, thereby computing attention between regions of physical significance. Additionally, we employ convolutional layers to capture local correlations, informed by insights from jet physics. In addition to outperforming the standard linformer in jet classification tasks, SAL-T also achieves classification results comparable to full-attention transformers, while using considerably fewer resources with lower latency during inference. Experiments on a generic point cloud classification dataset (ModelNet10) further confirm this trend. Our code is available at this https URL.
zh
[AI-117] Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning NEURIPS2025
【速读】:该论文旨在解决多模态分子模型中存在的两大问题:一是依赖三维构象(3D conformer)的融合不稳定性,二是因简单融合导致的模态坍缩(modality collapse),这些问题限制了模型的鲁棒性和泛化能力。解决方案的关键在于提出一种结构化的多模态融合框架MuMo,其核心包含两个创新机制:一是设计结构融合管道(Structured Fusion Pipeline, SFP),将二维拓扑(2D topology)与三维几何信息整合为统一且稳定的结构先验,以降低构象相关融合的不稳定性;二是引入渐进式注入机制(Progressive Injection, PI),通过非对称方式将该先验逐步注入序列流中,在保持各模态特异性建模的同时实现跨模态信息增强。
链接: https://arxiv.org/abs/2510.23640
作者: Zihao Jing,Yan Sun,Yan Yi Li,Sugitha Janarthanan,Alana Deng,Pingzhao Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by NeurIPS 2025
Abstract:Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: this http URL.
zh
[AI-118] Integrating Genomics into Multimodal EHR Foundation Models
【速读】:该论文旨在解决传统电子健康记录(Electronic Health Record, EHR)基础模型在疾病预测中因缺乏遗传信息而导致的局限性问题,从而难以实现精准的风险分层与个性化健康管理。其解决方案的关键在于将多基因风险评分(Polygenic Risk Scores, PRS)作为基础数据模态引入EHR基础模型,构建一个融合临床数据与遗传易感性的多模态框架,借助生成式AI(Generative AI)技术提升模型的预测能力与可解释性,并通过All of Us(AoU)研究计划的大规模数据验证了该方法在Type 2 Diabetes(T2D)等疾病预测中的有效性及跨任务迁移能力。
链接: https://arxiv.org/abs/2510.23639
作者: Jonathan Amar,Edward Liu,Alessandra Breschi,Liangliang Zhang,Pouya Kheradpour,Sylvia Li,Lisa Soleymani Lehmann,Alessandro Giulianelli,Matt Edwards,Yugang Jia,David Nola,Raghav Mani,Pankaj Vats,Jesse Tetreault,T.J. Chen,Cory Y. McLean
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:This paper introduces an innovative Electronic Health Record (EHR) foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality, moving beyond traditional EHR-only approaches to build more holistic health profiles. Leveraging the extensive and diverse data from the All of Us (AoU) Research Program, this multimodal framework aims to learn complex relationships between clinical data and genetic predispositions. The methodology extends advancements in generative AI to the EHR foundation model space, enhancing predictive capabilities and interpretability. Evaluation on AoU data demonstrates the model’s predictive value for the onset of various conditions, particularly Type 2 Diabetes (T2D), and illustrates the interplay between PRS and EHR data. The work also explores transfer learning for custom classification tasks, showcasing the architecture’s versatility and efficiency. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, laying the groundwork for more personalized, equitable, and actionable real-world evidence generation in healthcare.
zh
[AI-119] Bridging Function Approximation and Device Physics via Negative Differential Resistance Networks
【速读】:该论文旨在解决全模拟神经计算中非线性激活函数难以高效实现的问题,这一瓶颈通常导致系统依赖数字或混合方案,从而限制了能效和可扩展性。解决方案的关键在于提出KANalogue,一种基于科尔莫戈罗夫-阿诺德网络(Kolmogorov-Arnold Networks, KANs)的全模拟实现方法,利用负微分电阻(Negative Differential Resistance, NDR)器件作为可学习的一元基函数的物理载体,通过NbSi₂N₄/HfSi₂N₄异质结构隧道二极管的固有非线性特性构建具有不同曲率和支撑范围的坐标轴独立非线性单元,并结合实验提取的I-V数据拟合高阶多项式以软件仿真器件行为,最终在视觉基准任务上实现了参数极少但分类精度与数字基线相当的模型。
链接: https://arxiv.org/abs/2510.23638
作者: Songyuan Li,Teng Wang,Jinrong Tang,Ruiqi Liu,Yuyao Lu,Feng Xu,Bin Gao,Xiangwei Zhu
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Achieving fully analog neural computation requires hardware that can natively implement both linear and nonlinear operations with high efficiency. While analogue matrix-vector multiplication has advanced via compute-in-memory architectures, nonlinear activation functions remain a bottleneck, often requiring digital or hybrid solutions. Inspired by the Kolmogorov-Arnold framework, we propose KANalogue, a fully analogue implementation of Kolmogorov-Arnold Networks (KANs) using negative differential resistance devices as physical realizations of learnable univariate basis functions. By leveraging the intrinsic negative differential resistance characteristics of tunnel diodes fabricated from NbSi2N4/HfSi2N4 heterostructures, we construct coordinate-wise nonlinearities with distinct curvature and support profiles. We extract I-V data from fabricated armchair and zigzag devices, fit high-order polynomials to emulate diode behavior in software, and train KANs on vision benchmarks using these learned basis functions. Our results demonstrate that KANalogue can approximate complex functions with minimal parameters while maintaining classification accuracy competitive with digital baselines. This work bridges device-level physics and function approximation theory, charting a path toward scalable, energy-efficient analogue machine learning systems.
zh
[AI-120] Help the machine to help you: an evaluation in the wild of egocentric data cleaning via skeptical learning
【速读】:该论文旨在解决数字个人助手在实际应用中因用户标注数据存在误差和噪声而导致性能下降的问题。解决方案的关键在于引入一种基于用户反馈的“怀疑学习”(Skeptical Learning, SKEL)机制,通过让用户在真实使用场景下对初始标注进行修正,从而提升标注质量并减少用户标注负担,实现用户努力与数据质量之间的平衡优化。
链接: https://arxiv.org/abs/2510.23635
作者: Andrea Bontempelli,Matteo Busso,Leonardo Javier Malcotti,Fausto Giunchiglia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Any digital personal assistant, whether used to support task performance, answer questions, or manage work and daily life, including fitness schedules, requires high-quality annotations to function properly. However, user annotations, whether actively produced or inferred from context (e.g., data from smartphone sensors), are often subject to errors and noise. Previous research on Skeptical Learning (SKEL) addressed the issue of noisy labels by comparing offline active annotations with passive data, allowing for an evaluation of annotation accuracy. However, this evaluation did not include confirmation from end-users, the best judges of their own context. In this study, we evaluate SKEL’s performance in real-world conditions with actual users who can refine the input labels based on their current perspectives and needs. The study involves university students using the iLog mobile application on their devices over a period of four weeks. The results highlight the challenges of finding the right balance between user effort and data quality, as well as the potential benefits of using SKEL, which include reduced annotation effort and improved quality of collected data.
zh
[AI-121] Monotone and Separable Set Functions: Characterizations and Neural Models
【速读】:该论文旨在解决如何设计集到向量的函数(set-to-vector functions),使得集合间的包含关系能够被精确保留在向量空间中,即满足 $ S \subseteq T $ 当且仅当 $ F(S) \leq F(T) $,这类函数被称为单调且可分离(Monotone and Separating, MAS)的集合函数。其核心解决方案是通过理论分析确定实现MAS性质所需的最小向量维度,并在无限基集情形下证明MAS函数不存在,进而提出一种“弱MAS”(weakly MAS)模型,该模型具备Holder连续性稳定性;此外,作者进一步利用MAS函数构造出具有单调性的通用模型,可逼近任意单调集合函数。实验表明,该方法在集合包含任务中优于未引入集合包含归纳偏置的标准集合模型。
链接: https://arxiv.org/abs/2510.23634
作者: Soutrik Sarangi,Yonatan Sverdlov,Nadav Dym,Abir De
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Motivated by applications for set containment problems, we consider the following fundamental problem: can we design set-to-vector functions so that the natural partial order on sets is preserved, namely $S \subseteq T$ if and only if $F(S) \leq F(T)$? We call functions satisfying this property Monotone and Separating (MAS) set functions. We establish lower and upper bounds for the vector dimension necessary to obtain MAS functions, as a function of the cardinality of the multisets and the underlying ground set. In the important case of an infinite ground set, we show that MAS functions do not exist, but provide a model which provably enjoys a relaxed MAS property we name "weakly MAS" and is stable in the sense of Hölder continuity. We also show that MAS functions can be used to construct universal models that are monotone by construction and can approximate all monotone set functions. Experimentally, we consider a variety of set containment tasks. The experiments show the benefit of using our model, in comparison with standard set models which do not incorporate set containment as an inductive bias. Our code is available in this https URL.
zh
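【代码示意】:对有限基集,多热(multi-hot)指示向量就是一个维度等于基集大小的 MAS 函数:$S \subseteq T$ 当且仅当指示向量逐元素满足 $F(S) \leq F(T)$。下面的小脚本在样本子集对上验证这一性质;该构造的维度随基集大小线性增长,也直观呼应了论文关于无限基集下 MAS 函数不存在的结论。基集大小与子集规模均为示意。

```python
import numpy as np
from itertools import combinations

GROUND = list(range(6))                      # 有限基集: 向量维度 = |基集|

def F(S):
    """多热指示向量: 有限基集上最直接的 MAS 函数。"""
    v = np.zeros(len(GROUND))
    v[list(S)] = 1.0
    return v

def leq(a, b):
    return bool(np.all(a <= b))

# 在所有 2 元/3 元子集对上验证: S ⊆ T 当且仅当 F(S) ≤ F(T)
sets = [set(c) for r in (2, 3) for c in combinations(GROUND, r)]
assert all((s <= t) == leq(F(s), F(t)) for s in sets for t in sets)
print("MAS 性质在样本上成立; 向量维度 =", len(GROUND))
```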
[AI-122] LLMCOMP: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression
【速读】:该论文旨在解决高分辨率科学模拟与观测系统产生的海量时空数据在高效、有误差边界约束下的压缩难题。其解决方案的关键在于提出了一种基于解码器-only大语言模型(decoder-only large language models, LLMs)的新型有损压缩范式——LLMCOMP:首先将三维场数据量化为离散标记(tokens),并通过Z-order曲线排列以保持空间局部性,结合覆盖引导采样提升训练效率;随后使用带有时空嵌入的自回归Transformer建模标记转移过程;在压缩阶段采用top-k预测策略,仅存储排名索引和备用修正值,从而确保严格的误差边界。实验表明,该方法在多个再分析数据集上均显著优于现有最优压缩算法,在严格误差约束下实现最高达30%的压缩率提升。
链接: https://arxiv.org/abs/2510.23632
作者: Guozhong Li,Muhannad Alhumaidi,Spiros Skiadopoulos,Panos Kalnis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid growth of high-resolution scientific simulations and observation systems is generating massive spatiotemporal datasets, making efficient, error-bounded compression increasingly important. Meanwhile, decoder-only large language models (LLMs) have demonstrated remarkable capabilities in modeling complex sequential data. In this paper, we propose LLMCOMP, a novel lossy compression paradigm that leverages decoder-only LLMs to model scientific data. LLMCOMP first quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, and applies coverage-guided sampling to enhance training efficiency. An autoregressive transformer is then trained with spatial-temporal embeddings to model token transitions. During compression, the model performs top-k prediction, storing only rank indices and fallback corrections to ensure strict error bounds. Experiments on multiple reanalysis datasets show that LLMCOMP consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds. These results highlight the potential of LLMs as general-purpose compressors for high-fidelity scientific data.
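A core ingredient is the Z-order (Morton) arrangement of quantized tokens, which keeps spatially adjacent grid cells close together in the 1D token sequence. Below is a minimal sketch of 3D Morton encoding; the bit widths and field layout are our assumptions, not LLMCOMP's exact scheme.

```python
def part1by2(n: int) -> int:
    # Spread the low 10 bits of n so there are two zero bits between each bit.
    n &= 0x3FF
    n = (n | (n << 16)) & 0x030000FF
    n = (n | (n << 8)) & 0x0300F00F
    n = (n | (n << 4)) & 0x030C30C3
    n = (n | (n << 2)) & 0x09249249
    return n

def morton3(x: int, y: int, z: int) -> int:
    # Interleave the bits of (x, y, z) into a single Z-order index.
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Order the cells of a small 3D grid along the Z-order curve.
cells = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
cells.sort(key=lambda c: morton3(*c))
print(cells[:8])  # spatially neighboring cells stay close in the sequence
```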
zh
[AI-123] Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling
【速读】: This paper targets a limitation of LLM alignment pipelines that rely on simple pairwise preference optimization and therefore fail to exploit richer forms of user feedback such as multiwise comparisons and top-k rankings. The key contribution is Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization and (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based choice models, subsumes existing pairwise methods such as DPO and SimPO, and provides principled training objectives for richer feedback formats. Experiments across multiple benchmarks show RCPO consistently outperforms competitive baselines, indicating that directly leveraging ranked preference data with suitable choice models yields more effective alignment.
链接: https://arxiv.org/abs/2510.23631
作者: Yuxuan Tang,Yifan Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiwise comparisons and top-k rankings. We propose Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. The framework is flexible, supporting both utility-based and rank-based choice models. It subsumes several existing pairwise methods (e.g., DPO, SimPO), while providing principled training objectives for richer feedback formats. We instantiate this framework with two representative ranked choice models (Multinomial Logit and Mallows-RMJ). Empirical studies on Llama-3-8B-Instruct and Gemma-2-9B-it across AlpacaEval 2 and Arena-Hard benchmarks show that RCPO consistently outperforms competitive baselines. RCPO shows how directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. It offers a versatile and extensible foundation for incorporating (ranked) choice modeling into LLM training.
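As an illustration, the Multinomial Logit (Plackett-Luce) likelihood of a full ranking decomposes into sequential softmax choices over the remaining items. The sketch below is a generic negative log-likelihood over scalar reward scores, not the exact RCPO objective; for a ranking of length two it reduces to the familiar pairwise logistic loss.

```python
import numpy as np

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` (best to worst, indices into
    `scores`) under the Plackett-Luce / Multinomial Logit model."""
    s = np.asarray(scores, dtype=float)
    nll, remaining = 0.0, list(ranking)
    for _ in range(len(ranking) - 1):
        chosen = remaining[0]
        logits = s[remaining]
        # log P(chosen is best among remaining) via a stable log-sum-exp
        logz = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
        nll -= s[chosen] - logz
        remaining.pop(0)
    return nll

# Four responses with scalar reward scores; annotator ranked 2 > 0 > 3 > 1.
print(plackett_luce_nll([1.2, -0.5, 2.0, 0.1], ranking=[2, 0, 3, 1]))
```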
zh
[AI-124] Chain of Execution Supervision Promotes General Reasoning in Large Language Models
【速读】: This paper addresses the challenge of building robust, general reasoning in large language models (LLMs), in particular how to extract and exploit the explicit logical structure latent in code. Training directly on raw code is limited because its reasoning is expressed implicitly and entangled with syntactic and implementation noise. The key solution is TracePile, a 2.6-million-sample corpus that turns program execution into explicit, step-by-step Chain of Execution (CoE) rationales, making the logic in code explicit. The corpus spans mathematics, classical algorithms, and algorithmic competitions, and is enriched with variable-tracing questions and code rewritings to increase logical granularity and code diversity. Experiments show consistent gains across benchmarks, especially on mathematical reasoning and code understanding.
链接: https://arxiv.org/abs/2510.23629
作者: Nuo Chen,Zehua Li,Keqin Bao,Junyang Lin,Dayiheng Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code ineffective. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continue-pretraining, instruction tuning after pretraining, and two-stage finetuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
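To make the Chain-of-Execution idea concrete, here is a toy tracer that converts a program's execution into step-by-step variable-state rationales. The output format is our own illustration, not TracePile's exact schema.

```python
def trace_gcd(a: int, b: int) -> list[str]:
    # Emit one natural-language step per loop iteration, exposing variable states.
    steps = [f"start: a={a}, b={b}"]
    while b != 0:
        a, b = b, a % b
        steps.append(f"update: a <- b, b <- a % b  =>  a={a}, b={b}")
    steps.append(f"return a={a}")
    return steps

for line in trace_gcd(48, 18):
    print(line)
# start: a=48, b=18
# update: ... a=18, b=12
# update: ... a=12, b=6
# update: ... a=6, b=0
# return a=6
```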
zh
[AI-125] AI-Driven Development of a Publishing Imprint: Xynapse Traces
【速读】: This paper addresses the inefficiency, high cost, and limited niche-market reach of traditional book publishing workflows. The key solution is Xynapse Traces, an experimental publishing imprint built through a fusion of human and algorithmic methods, using a configuration-driven architecture and a multi-model AI integration framework to automate the pipeline from ideation through production and distribution. Core innovations include a continuous ideation pipeline with tournament-style evaluation, a novel codex design for transcriptive meditation practice, and publisher personas that define the imprint's mission, combined with automated verification and human oversight. The system achieves large efficiency gains (2-4 weeks time-to-market, 80% cost reduction) while maintaining quality (e.g., 99% citation accuracy), suggesting new paradigms for human-AI collaborative publishing and making previously unviable niche markets accessible.
链接: https://arxiv.org/abs/2510.23627
作者: Fred Zimmerman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Xynapse Traces is an experimental publishing imprint created via a fusion of human and algorithmic methods using a configuration-driven architecture and a multi-model AI integration framework. The system achieved a remarkable 90% reduction in time-to-market (from a typical 6-12 months to just 2-4 weeks), with 80% cost reduction compared to traditional imprint development, while publishing 52 books in its first year and maintaining exceptional quality metrics, including 99% citation accuracy and 100% validation success after initial corrections. Key technical innovations include a continuous ideation pipeline with tournament-style evaluation, a novel codex design for transcriptive meditation practice, comprehensive automation spanning from ideation through production and distribution, and publisher personas that define and guide the imprint’s mission. The system also integrates automated verification with human oversight, ensuring that gains in speed do not compromise publishing standards. This effort has significant implications for the future of book publishing, suggesting new paradigms for human-AI collaboration that democratize access to sophisticated publishing capabilities and make previously unviable niche markets accessible.
zh
[AI-126] Speeding Up MACE: Low-Precision Tricks for Equivarient Force Fields
【速读】: This paper targets the high computational cost of molecular dynamics (MD) simulation with SO(3)-equivariant machine-learning force fields such as MACE, which is exacerbated by high-precision arithmetic and a lack of GPU-optimized kernels. The key approach is to profile MACE to identify bottlenecks and then combine mixed precision with optimized kernels: the NVIDIA cuEquivariance backend reduces inference latency by about 3x, and casting only the linear layers to BF16/FP16 (with FP32 accumulation) yields roughly another 4x speedup without degrading energies or thermodynamic observables in NVT/NPT simulations. Mixing e3nn and cuEquivariance modules without explicit adapters causes representation mismatches, and half-precision weights during training degrade force RMSE. The resulting practical policy is FP32 by default with BF16/FP16 linear layers for maximum inference throughput, while keeping training in FP32, preserving physical fidelity while substantially accelerating state-of-the-art force fields.
链接: https://arxiv.org/abs/2510.23621
作者: Alexandre Benoit
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 78 pages, 21 figures
Abstract:Machine-learning force fields can deliver accurate molecular dynamics (MD) at high computational cost. For SO(3)-equivariant models such as MACE, there is little systematic evidence on whether reduced-precision arithmetic and GPU-optimized kernels can cut this cost without harming physical fidelity. This thesis aims to make MACE cheaper and faster while preserving accuracy by identifying computational bottlenecks and evaluating low-precision execution policies. We profile MACE end-to-end and per block, compare the e3nn and NVIDIA cuEquivariance backends, and assess FP64/FP32/BF16/FP16 settings (with FP32 accumulation) for inference, short NVT and long NPT water simulations, and toy training runs under reproducible, steady-state timing. cuEquivariance reduces inference latency by about 3\times. Casting only linear layers to BF16/FP16 within an FP32 model yields roughly 4\times additional speedups, while energies and thermodynamic observables in NVT/NPT MD remain within run-to-run variability. Half-precision weights during training degrade force RMSE. Mixing e3nn and cuEq modules without explicit adapters causes representation mismatches. Fused equivariant kernels and mixed-precision inference can substantially accelerate state-of-the-art force fields with negligible impact on downstream MD. A practical policy is to use cuEquivariance with FP32 by default and enable BF16/FP16 for linear layers (keeping FP32 accumulations) for maximum throughput, while training remains in FP32. Further gains are expected on Ampere/Hopper GPUs (TF32/BF16) and from kernel-level FP16/BF16 paths and pipeline fusion.
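A minimal PyTorch sketch of the reported recipe: cast only the linear layers' weights to BF16 while the rest of the model stays FP32. Whether accumulation happens in FP32 depends on the backend's matmul kernels, which we assume here; this is an inference-only illustration, not MACE's actual implementation.

```python
import torch
import torch.nn as nn

class BF16Linear(nn.Module):
    """Wraps an FP32 nn.Linear: weights stored in BF16, output returned in FP32."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.to(torch.bfloat16), requires_grad=False)
        self.bias = None if linear.bias is None else \
            nn.Parameter(linear.bias.to(torch.bfloat16), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.nn.functional.linear(x.to(torch.bfloat16), self.weight, self.bias)
        return y.float()  # hand FP32 back to the surrounding blocks

def cast_linears(model: nn.Module) -> nn.Module:
    # Swap every nn.Linear for its BF16 twin; leave everything else in FP32.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, BF16Linear(child))
        else:
            cast_linears(child)
    return model

model = cast_linears(nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64)))
print(model(torch.randn(8, 64)).dtype)  # torch.float32
```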
zh
[AI-127] Short Ticketing Detection Framework Analysis Report
【速读】: This paper addresses the detection of short ticketing fraud in railway systems, a practice in which passengers improperly evade paying the full fare, threatening the revenue security of transport operators. The key solution is an unsupervised multi-expert machine learning framework combining four complementary algorithms (Isolation Forest, Local Outlier Factor, One-Class SVM, and Mahalanobis Distance) with an A/B/C/D station classification system that identifies suspicious behavior patterns across 30 high-risk stations, enabling recognition of five distinct short ticketing patterns and potential revenue recovery.
链接: https://arxiv.org/abs/2510.23619
作者: Yuyang Miao,Huijun Xing,Danilo P. Mandic,Tony G. Constantinides
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This report presents a comprehensive analysis of an unsupervised multi-expert machine learning framework for detecting short ticketing fraud in railway systems. The study introduces an A/B/C/D station classification system that successfully identifies suspicious patterns across 30 high-risk stations. The framework employs four complementary algorithms: Isolation Forest, Local Outlier Factor, One-Class SVM, and Mahalanobis Distance. Key findings include the identification of five distinct short ticketing patterns and potential for short ticketing recovery in transportation systems.
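A minimal sketch of such a multi-expert anomaly ensemble on tabular trip features. The synthetic data, thresholds, and majority-voting rule are our assumptions, not the report's exact configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                 # normal trips (synthetic)
X_sus = rng.normal(loc=3.0, size=(10, 4))     # planted suspicious trips
X_all = np.vstack([X, X_sus])

votes = np.zeros(len(X_all), dtype=int)
votes += IsolationForest(random_state=0).fit(X).predict(X_all) == -1
votes += LocalOutlierFactor(novelty=True).fit(X).predict(X_all) == -1
votes += OneClassSVM(nu=0.05).fit(X).predict(X_all) == -1

cov = EmpiricalCovariance().fit(X)
maha = cov.mahalanobis(X_all)                 # squared Mahalanobis distance
votes += maha > np.quantile(cov.mahalanobis(X), 0.99)

flagged = np.where(votes >= 3)[0]             # majority of the four experts
print(f"{len(flagged)} trips flagged; the last 10 rows are the planted outliers")
```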
zh
[AI-128] Feedback Lunch: Deep Feedback Codes for Wiretap Channels
【速读】: This paper addresses the fact that reversely-degraded wiretap channels have zero secrecy capacity without channel feedback, i.e., the problem of achieving positive secrecy rates in the presence of an eavesdropper. The key solution is a seeded modular code design for the Gaussian wiretap channel with channel output feedback, combining universal hash functions for security with learned feedback-based codes for reliability, so that feedback lets the legitimate parties agree on a shared secret key, overcoming the wiretapper's security advantage and achieving positive secrecy rates.
链接: https://arxiv.org/abs/2510.16620
作者: Yingyao Zhou,Natasha Devroye,Onur Günlü
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:
Abstract:We consider reversely-degraded wiretap channels, for which the secrecy capacity is zero if there is no channel feedback. This work focuses on a seeded modular code design for the Gaussian wiretap channel with channel output feedback, combining universal hash functions for security and learned feedback-based codes for reliability to achieve positive secrecy rates. We study the trade-off between communication reliability and information leakage, illustrating that feedback enables agreeing on a secret key shared between legitimate parties, overcoming the security advantage of the wiretapper. Our findings also motivate code designs for sensing-assisted secure communication, to be used in next-generation integrated sensing and communication methods.
zh
[AI-129] Preference Learning with Response Time: Robust Losses and Guarantees NEURIPS2025
【速读】: This paper addresses the underuse of response time information in human preference learning frameworks, with the aim of eliciting reward models more efficiently. Existing pipelines rely mainly on binary preference data and ignore the temporal information in user decisions, which is informative of preference strength. The key solution is to model response times with the Evidence Accumulation Drift Diffusion (EZ) model and to design Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning, matching the rates attainable if expected response times were known a priori. The analysis shows that conventional preference learning suffers error rates that scale exponentially with reward magnitude, whereas the response-time-augmented approach reduces this to polynomial scaling, a substantial gain in sample efficiency.
链接: https://arxiv.org/abs/2505.22820
作者: Ayush Sawarni,Sahasrajit Sarmasarkar,Vasilis Syrgkanis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH); Machine Learning (stat.ML)
备注: Accepted at NeurIPS 2025
Abstract:This paper investigates the integration of response time data into human preference learning frameworks for more effective reward model elicitation. While binary preference data has become fundamental in fine-tuning foundation models, generative AI systems, and other large-scale models, the valuable temporal information inherent in user decision-making remains largely unexploited. We propose novel methodologies to incorporate response time information alongside binary choice data, leveraging the Evidence Accumulation Drift Diffusion (EZ) model, under which response time is informative of the preference strength. We develop Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning, matching the theoretical optimal rates that would be attained if the expected response times for each query were known a priori. Our theoretical analysis demonstrates that for linear reward functions, conventional preference learning suffers from error rates that scale exponentially with reward magnitude. In contrast, our response time-augmented approach reduces this to polynomial scaling, representing a significant improvement in sample efficiency. We extend these guarantees to non-parametric reward function spaces, establishing convergence properties for more complex, realistic reward models. Our extensive experiments validate our theoretical findings in the context of preference learning over images.
zh
[AI-130] Fast algorithms enabling optimization and deep learning for photoacoustic tomography in a circular detection geometry
【速读】: This paper addresses the efficient numerical solution of the inverse source problem arising in photoacoustic tomography (PAT) and several other coupled-physics modalities. Such problems are typically solved by iterative algorithms that minimize a cost functional and require many evaluations of the forward operator and its adjoint. The key contribution is a pair of asymptotically fast algorithms for the forward and adjoint operators in the circular acquisition geometry, running in O(n^2 log n) floating point operations for an n x n image, which is significantly better than conventional approaches. The algorithms are integrated into several reconstruction techniques, including classical non-negative least squares and total variation regularized least squares as well as the deep-learning-based Learned Primal-Dual method, and their efficiency and generality are validated in numerical experiments.
链接: https://arxiv.org/abs/2510.24687
作者: Andreas Hauptmann,Leonid Kunyansky,Jenni Poimala
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注:
Abstract:The inverse source problem arising in photoacoustic tomography and in several other coupled-physics modalities is frequently solved by iterative algorithms. Such algorithms are based on the minimization of a certain cost functional. In addition, novel deep learning techniques are currently being investigated to further improve such optimization approaches. All such methods require multiple applications of the operator defining the forward problem, and of its adjoint. In this paper, we present new asymptotically fast algorithms for numerical evaluation of the forward and adjoint operators, applicable in the circular acquisition geometry. For an (n \times n) image, our algorithms compute these operators in \mathcal{O}(n^2 \log n) floating point operations. We demonstrate the performance of our algorithms in numerical simulations, where they are used as an integral part of several iterative image reconstruction techniques: classic variational methods, such as non-negative least squares and total variation regularized least squares, as well as deep learning methods, such as learned primal dual. A Python implementation of our algorithms and computational examples is available to the general public.
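Iterative reconstruction of this kind only converges as intended if the implemented adjoint truly matches the forward operator. A standard sanity check is the dot-product test <Ax, y> = <x, A^T y>, sketched here for a generic matrix-free operator pair (the dense matrix below is a toy stand-in for the PAT operators).

```python
import numpy as np

def dot_product_test(forward, adjoint, n_in, n_out, trials=5, seed=0):
    """Check <A x, y> == <x, A^T y> for random x, y (matrix-free operators)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=n_in), rng.normal(size=n_out)
        lhs, rhs = np.dot(forward(x), y), np.dot(x, adjoint(y))
        assert abs(lhs - rhs) <= 1e-10 * max(abs(lhs), 1.0), (lhs, rhs)
    return True

# Toy example: a dense matrix stands in for the forward/adjoint operator pair.
A = np.random.default_rng(1).normal(size=(30, 20))
print(dot_product_test(lambda x: A @ x, lambda y: A.T @ y, n_in=20, n_out=30))
```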
zh
[AI-131] Quantum-Resistant Networks Using Post-Quantum Cryptography
【速读】: This paper addresses the insufficient security of the classical communication channels in current quantum network architectures, where traditional cryptography is vulnerable to quantum adversaries. The key solution is a quantum-resistant network architecture that protects classical channels with post-quantum cryptography (PQC) while retaining entanglement-based communication over quantum channels. The framework further integrates continuous monitoring of both the quantum and classical layers and orchestration across heterogeneous infrastructures, providing end-to-end security so that quantum networks remain scalable, robust, and dependable against both classical and quantum-era threats.
链接: https://arxiv.org/abs/2510.24534
作者: Xin Jin,Nitish Kumar Chandra,Mohadeseh Azari,Kaushik P. Seshadreesan,Junyu Liu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Submission for 2025 IEEE Workshop on Quantum IntelLigence, Learning Security (QUILLS), this https URL
Abstract:Quantum networks rely on both quantum and classical channels for coordinated operation. Current architectures employ entanglement distribution and key exchange over quantum channels but often assume that classical communication is sufficiently secure. In practice, classical channels protected by traditional cryptography remain vulnerable to quantum adversaries, since large-scale quantum computers could break widely used public-key schemes and reduce the effective security of symmetric cryptography. This perspective presents a quantum-resistant network architecture that secures classical communication with post-quantum cryptographic techniques while supporting entanglement-based communication over quantum channels. Beyond cryptographic protection, the framework incorporates continuous monitoring of both quantum and classical layers, together with orchestration across heterogeneous infrastructures, to ensure end-to-end security. Collectively, these mechanisms provide a pathway toward scalable, robust, and secure quantum networks that remain dependable against both classical and quantum-era threats.
zh
[AI-132] Diffusion Models for Wireless Transceivers: From Pilot-Efficient Channel Estimation to AI-Native 6G Receivers
【速读】: This paper addresses channel characterization and estimation in large-scale orthogonal frequency division multiplexing (OFDM) systems, a problem that traditional methods handle poorly and that has become a bottleneck for transceiver efficiency. The key idea is to formulate channel estimation as a generative AI problem and apply diffusion models (DMs), exploiting their ability to refine rough initial estimates and to cooperate with traditional signal processing methods, thereby significantly improving wireless receiver performance.
链接: https://arxiv.org/abs/2510.24495
作者: Yuzhi Yang,Sen Yan,Weijie Zhou,Brahim Mefgouda,Ridong Li,Zhaoyang Zhang,Mérouane Debbah
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Submitted for potential publication in IEEE Wireless Communications
Abstract:With the development of artificial intelligence (AI) techniques, implementing AI-based techniques to improve wireless transceivers becomes an emerging research topic. Within this context, AI-based channel characterization and estimation become the focus since these methods have not been solved by traditional methods very well and have become the bottleneck of transceiver efficiency in large-scale orthogonal frequency division multiplexing (OFDM) systems. Specifically, by formulating channel estimation as a generative AI problem, generative AI methods such as diffusion models (DMs) can efficiently deal with rough initial estimations and have great potential to cooperate with traditional signal processing methods. This paper focuses on the transceiver design of OFDM systems based on DMs, provides an illustration of the potential of DMs in wireless transceivers, and points out the related research directions brought by DMs. We also provide a proof-of-concept case study of further adapting DMs for better wireless receiver performance.
zh
[AI-133] Trajectory Design for UAV-Based Low-Altitude Wireless Networks in Unknown Environments: A Digital Twin-Assisted TD3 Approach
【速读】: This paper addresses the design of efficient and safe UAV trajectories for low-altitude wireless network (LAWN) deployment when the environmental topology is unknown. The key solution is a digital twin (DT)-assisted training and deployment framework: the UAV transmits integrated sensing and communication signals to serve ground users while uploading collected echoes to a DT server, which progressively constructs and continuously updates virtual environments (VEs) that accelerate model training and support real-time decision-making and flight safety during deployment. On top of this framework, a trajectory design scheme combines simulated annealing for user scheduling with the twin-delayed deep deterministic policy gradient (TD3) algorithm for continuous trajectory design, minimizing mission completion time while ensuring obstacle avoidance.
链接: https://arxiv.org/abs/2510.24255
作者: Jihao Luo,Zesong Fei,Xinyi Wang,Le Zhao,Yuanhao Cui,Guangxu Zhu,Dusit Niyato
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 13 pages, 11 figures
Abstract:Unmanned aerial vehicles (UAVs) are emerging as key enablers for low-altitude wireless network (LAWN), particularly when terrestrial networks are unavailable. In such scenarios, the environmental topology is typically unknown; hence, designing efficient and safe UAV trajectories is essential yet challenging. To address this, we propose a digital twin (DT)-assisted training and deployment framework. In this framework, the UAV transmits integrated sensing and communication signals to provide communication services to ground users, while simultaneously collecting echoes that are uploaded to the DT server to progressively construct virtual environments (VEs). These VEs accelerate model training and are continuously updated with real-time UAV sensing data during deployment, supporting decision-making and enhancing flight safety. Based on this framework, we further develop a trajectory design scheme that integrates simulated annealing for efficient user scheduling with the twin-delayed deep deterministic policy gradient algorithm for continuous trajectory design, aiming to minimize mission completion time while ensuring obstacle avoidance. Simulation results demonstrate that the proposed approach achieves faster convergence, higher flight safety, and shorter mission completion time compared with baseline methods, providing a robust and efficient solution for LAWN deployment in unknown environments.
zh
[AI-134] Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas NEURIPS2025
【速读】: This paper addresses the problem of accurately estimating stellar mass in star-forming regions, where young stars are obscured by dense gas and the regions are highly inhomogeneous, making spherical dynamical estimates unreliable. The key solution is to pretrain a vision transformer on one million synthetic fractal images with the self-supervised DINOv2 framework and then apply the frozen model to limited high-resolution magneto-hydrodynamical (MHD) simulations for feature extraction and regression. This synthetic pretraining significantly improves stellar mass prediction, and the extracted features reveal semantically meaningful structure, suggesting unsupervised segmentation of star-forming regions without labeled data.
链接: https://arxiv.org/abs/2510.24159
作者: Keiya Hirashima,Shingo Nozaki,Naoto Harada
机构: 未知
类目: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 3 figures, 1 table, accepted for NeurIPS 2025 ML4PS workshop
Abstract:Stellar mass is a fundamental quantity that determines the properties and evolution of stars. However, estimating stellar masses in star-forming regions is challenging because young stars are obscured by dense gas and the regions are highly inhomogeneous, making spherical dynamical estimates unreliable. Supervised machine learning could link such complex structures to stellar mass, but it requires large, high-quality labeled datasets from high-resolution magneto-hydrodynamical (MHD) simulations, which are computationally expensive. We address this by pretraining a vision transformer on one million synthetic fractal images using the self-supervised framework DINOv2, and then applying the frozen model to limited high-resolution MHD simulations. Our results demonstrate that synthetic pretraining improves frozen-feature regression for stellar mass prediction, with the pretrained model performing slightly better than a supervised model trained on the same limited simulations. Principal component analysis of the extracted features further reveals semantically meaningful structures, suggesting that the model enables unsupervised segmentation of star-forming regions without the need for labeled data or fine-tuning.
zh
[AI-135] PULSE: Privileged Knowledge Transfer from Electrodermal Activity to Low-Cost Sensors for Stress Monitoring ML4H2025
【速读】: This paper addresses stress detection with low-cost wearables in real-world settings, where the expensive electrodermal activity (EDA) sensors that provide the primary stress signal are often unavailable. The key solution is the PULSE framework: EDA is used only during self-supervised pretraining, where encoder outputs are separated into shared and private embeddings; shared embeddings are aligned across modalities and fused into a modality-invariant representation, while private embeddings carry modality-specific information to support a reconstruction objective. A frozen EDA teacher then transfers sympathetic-arousal representations into student encoders, enabling accurate stress detection at inference time from more accessible signals such as ECG, BVP, ACC, and TEMP, reducing hardware cost while preserving accuracy.
链接: https://arxiv.org/abs/2510.24058
作者: Zihan Zhao,Masood Mortazavi,Ning Yan
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted as a findings paper at ML4H 2025
Abstract:Electrodermal activity (EDA), the primary signal for stress detection, requires costly hardware often unavailable in real-world wearables. In this paper, we propose PULSE, a framework that utilizes EDA exclusively during self-supervised pretraining, while enabling inference without EDA but with more readily available modalities such as ECG, BVP, ACC, and TEMP. Our approach separates encoder outputs into shared and private embeddings. We align shared embeddings across modalities and fuse them into a modality-invariant representation. The private embeddings carry modality-specific information to support the reconstruction objective. Pretraining is followed by knowledge transfer where a frozen EDA teacher transfers sympathetic-arousal representations into student encoders. On WESAD, our method achieves strong stress-detection performance, showing that representations of privileged EDA can be transferred to low-cost sensors to improve accuracy while reducing hardware cost.
zh
[AI-136] What Work is AI Actually Doing? Uncovering the Drivers of Generative AI Adoption
【速读】: This paper investigates what drives the adoption of generative AI in real work, i.e., which intrinsic task characteristics lead users to delegate work to AI systems. The key contribution is a multi-dimensional task framework built from four million Claude AI interactions mapped to O*NET tasks, scoring each task on seven dimensions (Routine, Cognitive, Social Intelligence, Creativity, Domain Knowledge, Complexity, and Decision Making) and applying multivariate techniques to identify three task archetypes (Dynamic Problem Solving, Procedural Analytical Work, and Standardized Operational Tasks). The analysis reveals that AI usage is highly concentrated on a small set of tasks that are high in creativity, complexity, and cognitive demand but low in routineness, providing an interpretable, data-driven framework for anticipating how AI will reshape work.
链接: https://arxiv.org/abs/2510.23669
作者: Peeyush Agarwal,Harsh Agarwal,Akshat Ranaa
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 22 pages
Abstract:Purpose: The rapid integration of artificial intelligence (AI) systems like ChatGPT, Claude AI, etc., has a deep impact on how work is done. Predicting how AI will reshape work requires understanding not just its capabilities, but how it is actually being adopted. This study investigates which intrinsic task characteristics drive users’ decisions to delegate work to AI systems. Methodology: This study utilizes the Anthropic Economic Index dataset of four million Claude AI interactions mapped to O*NET tasks. We systematically scored each task across seven key dimensions: Routine, Cognitive, Social Intelligence, Creativity, Domain Knowledge, Complexity, and Decision Making using 35 parameters. We then employed multivariate techniques to identify latent task archetypes and analyzed their relationship with AI usage. Findings: Tasks requiring high creativity, complexity, and cognitive demand, but low routineness, attracted the most AI engagement. Furthermore, we identified three task archetypes: Dynamic Problem Solving, Procedural Analytical Work, and Standardized Operational Tasks, demonstrating that AI applicability is best predicted by a combination of task characteristics, over individual factors. Our analysis revealed highly concentrated AI usage patterns, with just 5% of tasks accounting for 59% of all interactions. Originality: This research provides the first systematic evidence linking real-world generative AI usage to a comprehensive, multi-dimensional framework of intrinsic task characteristics. It introduces a data-driven classification of work archetypes that offers a new framework for analyzing the emerging human-AI division of labor.
zh
[AI-137] Genotype-Phenotype Integration through Machine Learning and Personalized Gene Regulatory Networks for Cancer Metastasis Prediction
【速读】: This paper addresses the limited accuracy of cancer metastasis risk prediction caused by shallow model architectures and the neglect of patient-specific gene regulatory mechanisms. The key solution combines classical machine learning with graph-neural-network-based deep learning: conventional algorithms such as XGBoost are first used for feature selection and baseline modeling on large-scale gene expression data; personalized transcription factor-target regulatory networks are then constructed with PANDA and LIONESS, and a graph attention network (GATv2) captures patient-level nonlinear regulatory dependencies, yielding a scalable and biologically interpretable framework for predicting metastatic potential.
链接: https://arxiv.org/abs/2510.23620
作者: Jiwei Fu,Chunyu Yang,Charalampos P. Triantafyllidis
机构: 未知
类目: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
备注: 39 pages, 14 figures. Preliminary version of ongoing collaborative research; a substantially revised manuscript is in preparation
Abstract:Metastasis is the leading cause of cancer-related mortality, yet most predictive models rely on shallow architectures and neglect patient-specific regulatory mechanisms. Here, we integrate classical machine learning and deep learning to predict metastatic potential across multiple cancer types. Gene expression profiles from the Cancer Cell Line Encyclopedia were combined with a transcription factor-target prior from DoRothEA, focusing on nine metastasis-associated regulators. After selecting differential genes using the Kruskal-Wallis test, ElasticNet, Random Forest, and XGBoost models were trained for benchmarking. Personalized gene regulatory networks were then constructed using PANDA and LIONESS and analyzed through a graph attention neural network (GATv2) to learn topological and expression-based representations. While XGBoost achieved the highest AUROC (0.7051), the GNN captured non-linear regulatory dependencies at the patient level. These results demonstrate that combining traditional machine learning with graph-based deep learning enables a scalable and interpretable framework for metastasis risk prediction in precision oncology.
zh
机器学习
[LG-0] Eigenfunction Extraction for Ordered Representation Learning
链接: https://arxiv.org/abs/2510.24672
作者: Burak Varıcı,Che-Ping Tsai,Ritabrata Ray,Nicholas M. Boffi,Pradeep Ravikumar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Recent advances in representation learning reveal that widely used objectives, such as contrastive and non-contrastive, implicitly perform spectral decomposition of a contextual kernel, induced by the relationship between inputs and their contexts. Yet, these methods recover only the linear span of top eigenfunctions of the kernel, whereas exact spectral decomposition is essential for understanding feature ordering and importance. In this work, we propose a general framework to extract ordered and identifiable eigenfunctions, based on modular building blocks designed to satisfy key desiderata, including compatibility with the contextual kernel and scalability to modern settings. We then show how two main methodological paradigms, low-rank approximation and Rayleigh quotient optimization, align with this framework for eigenfunction extraction. Finally, we validate our approach on synthetic kernels and demonstrate on real-world image datasets that the recovered eigenvalues act as effective importance scores for feature selection, enabling principled efficiency-accuracy tradeoffs via adaptive-dimensional representations.
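The paper contrasts recovering a subspace with recovering ordered, identifiable eigenpairs. A classical way to obtain the latter from a kernel matrix is power iteration with deflation, sketched below as a numerical stand-in for the neural eigenfunction-extraction problem (the toy kernel is our own construction).

```python
import numpy as np

def ordered_eigenpairs(K, k, iters=1000, seed=0):
    """Top-k eigenvalues/eigenvectors of a symmetric PSD matrix, in order,
    via power iteration with deflation (a classical analogue of ordered
    eigenfunction extraction)."""
    rng = np.random.default_rng(seed)
    vals, vecs = [], []
    M = K.astype(float).copy()
    for _ in range(k):
        v = rng.normal(size=K.shape[0])
        for _ in range(iters):
            v = M @ v
            v /= np.linalg.norm(v)
        lam = v @ M @ v                      # Rayleigh quotient
        vals.append(lam); vecs.append(v)
        M = M - lam * np.outer(v, v)         # deflate: remove the found component
    return np.array(vals), np.stack(vecs, axis=1)

X = np.random.default_rng(1).normal(size=(100, 5))
K = X @ X.T / 5                              # a toy PSD "contextual kernel"
vals, vecs = ordered_eigenpairs(K, k=3)
print(np.allclose(vals, np.linalg.eigvalsh(K)[::-1][:3], atol=1e-5))
```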
[LG-1] Pearl: A Foundation Model for Placing Every Atom in the Right Location
链接: https://arxiv.org/abs/2510.24670
作者: Genesis Research Team:Alejandro Dobles,Nina Jovic,Kenneth Leidal,Pranav Murugan,David C. Williams,Drausin Wulsin,Nate Gruver,Christina X. Ji,Korrawat Pruegsanusak,Gianluca Scarpellini,Ansh Sharma,Wojciech Swiderski,Andrea Bootsma,Richard Strong Bowen,Charlotte Chen,Jamin Chen,Marc André Dämgen,Roy Tal Dew,Benjamin DiFrancesco,J. D. Fishman,Alla Ivanova,Zach Kagin,David Li-Bland,Zuli Liu,Igor Morozov,Jeffrey Ouyang-Zhang,Frank C. Pickard IV,Kushal S. Shah,Ben Shor,Gabriel Monteiro da Silva,Maxx Tessmer,Carl Tilbury,Cyr Vetcher,Daniel Zeng,Maruan Al-Shedivat,Aleksandra Faust,Evan N. Feinberg,Michael V. LeVine,Matteus Pan
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:
Abstract:Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers 3.6\times improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
[LG-2] Symbolic Snapshot Ensembles
链接: https://arxiv.org/abs/2510.24633
作者: Mingyue Liu,Andrew Cropper
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:
Abstract:Inductive logic programming (ILP) is a form of logical machine learning. Most ILP algorithms learn a single hypothesis from a single training run. Ensemble methods train an ILP algorithm multiple times to learn multiple hypotheses. In this paper, we train an ILP algorithm only once and save intermediate hypotheses. We then combine the hypotheses using a minimum description length weighting scheme. Our experiments on multiple benchmarks, including game playing and visual reasoning, show that our approach improves predictive accuracy by 4% with less than 1% computational overhead.
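A minimal sketch of minimum-description-length weighting for combining hypotheses saved during a single run: each hypothesis votes with weight 2^(-DL), where DL is its description length in bits, so shorter hypotheses dominate. The voting rule and bit counts below are an illustration of the idea, not the paper's exact scheme.

```python
import numpy as np

def mdl_ensemble(hypotheses, lengths):
    """Combine binary hypotheses from one training run.
    Hypothesis h_i votes with weight 2^(-lengths[i]) (description length in bits)."""
    w = 2.0 ** (-np.asarray(lengths, dtype=float))
    w /= w.sum()
    def predict(x):
        votes = np.array([h(x) for h in hypotheses], dtype=float)  # 0/1 votes
        return (votes @ w) >= 0.5
    return predict

# Three snapshot hypotheses with illustrative description lengths in bits.
hyps = [lambda x: x > 0, lambda x: x > 1, lambda x: x > -1]
predict = mdl_ensemble(hyps, lengths=[12, 7, 9])
print(predict(0.5))  # False: the shortest hypothesis (x > 1) carries most weight
```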
[LG-3] Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers NEURIPS2025
链接: https://arxiv.org/abs/2510.24621
作者: Ziyi Fang,Lingxiao Huang,Runkai Yang
类目: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: This paper has been accepted by NeurIPS 2025
Abstract:We study the robust geometric median problem in Euclidean space \mathbb{R}^d, with a focus on coreset construction. A coreset is a compact summary of a dataset P of size n that approximates the robust cost for all centers c within a multiplicative error \varepsilon. Given an outlier count m, we construct a coreset of size \tilde{O}(\varepsilon^{-2} \cdot \min\{\varepsilon^{-2}, d\}) when n \geq 4m, eliminating the O(m) dependency present in prior work [Huang et al., 2022, 2023]. For the special case of d = 1, we achieve an optimal coreset size of \tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1}), revealing a clear separation from the vanilla case studied in [Huang et al., 2023; Afshani and Chris, 2024]. Our results further extend to robust (k,z)-clustering in various metric spaces, eliminating the m-dependence under mild data assumptions. The key technical contribution is a novel non-component-wise error analysis, enabling substantial reduction of outlier influence, unlike prior methods that retain it. Empirically, our algorithms consistently outperform existing baselines in terms of size-accuracy tradeoffs and runtime, even when data assumptions are violated across a wide range of datasets.
[LG-4] Semi-supervised and unsupervised learning for health indicator extraction from guided waves in aerospace composite structures
链接: https://arxiv.org/abs/2510.24614
作者: James Josep Perry,Pablo Garcia-Conde Ortiz,George Konstantinou,Cornelie Vergouwen,Edlyn Santha Kumaran,Morteza Moradi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
备注:
Abstract:Health indicators (HIs) are central to diagnosing and prognosing the condition of aerospace composite structures, enabling efficient maintenance and operational safety. However, extracting reliable HIs remains challenging due to variability in material properties, stochastic damage evolution, and diverse damage modes. Manufacturing defects (e.g., disbonds) and in-service incidents (e.g., bird strikes) further complicate this process. This study presents a comprehensive data-driven framework that learns HIs via two learning approaches integrated with multi-domain signal processing. Because ground-truth HIs are unavailable, a semi-supervised and an unsupervised approach are proposed: (i) a diversity deep semi-supervised anomaly detection (Diversity-DeepSAD) approach augmented with continuous auxiliary labels used as hypothetical damage proxies, which overcomes the limitation of prior binary labels that only distinguish healthy and failed states while neglecting intermediate degradation, and (ii) a degradation-trend-constrained variational autoencoder (DTC-VAE), in which the monotonicity criterion is embedded via an explicit trend constraint. Guided waves with multiple excitation frequencies are used to monitor single-stiffener composite structures under fatigue loading. Time, frequency, and time-frequency representations are explored, and per-frequency HIs are fused via unsupervised ensemble learning to mitigate frequency dependence and reduce variance. Using fast Fourier transform features, the augmented Diversity-DeepSAD model achieved 81.6% performance, while DTC-VAE delivered the most consistent HIs with 92.3% performance, outperforming existing baselines.
[LG-5] A Novel XAI-Enhanced Quantum Adversarial Networks for Velocity Dispersion Modeling in MaNGA Galaxies
链接: https://arxiv.org/abs/2510.24598
作者: Sathwik Narkedimilli,N V Saran Kumar,Aswath Babu H,Manjunath K Vanahalli,Manish M,Vinija Jain,Aman Chadha
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
备注:
Abstract:Current quantum machine learning approaches often face challenges balancing predictive accuracy, robustness, and interpretability. To address this, we propose a novel quantum adversarial framework that integrates a hybrid quantum neural network (QNN) with classical deep learning layers, guided by an evaluator model with LIME-based interpretability, and extended through quantum GAN and self-supervised variants. In the proposed model, an adversarial evaluator concurrently guides the QNN by computing feedback loss, thereby optimizing both prediction accuracy and model explainability. Empirical evaluations show that the Vanilla model achieves RMSE = 0.27, MSE = 0.071, MAE = 0.21, and R^2 = 0.59, delivering the most consistent performance across regression metrics compared to adversarial counterparts. These results demonstrate the potential of combining quantum-inspired methods with classical architectures to develop lightweight, high-performance, and interpretable predictive models, advancing the applicability of QML beyond current limitations.
[LG-6] Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges
链接: https://arxiv.org/abs/2510.24577
作者: He Yang,Fei Ren,Hai-Sui Yu,Xiaohui Chen,Pei-Zhi Zhuang
类目: Machine Learning (cs.LG)
备注:
Abstract:We are very delighted to see the fast development of physics-informed extreme learning machine (PIELM) in recent years for higher computation efficiency and accuracy in physics-informed machine learning. As a summary or review on PIELM is currently not available, we would like to take this opportunity to show our perspective and experience for this promising research direction. We can see many efforts are made to solve PDEs with sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling. Despite the success, many urgent challenges remain to be tackled, which also provides us opportunities to develop more robust, interpretable, and generalizable PIELM frameworks with applications in science and engineering.
[LG-7] Enforcing boundary conditions for physics-informed neural operators
链接: https://arxiv.org/abs/2510.24557
作者: Niklas Göschel,Sebastian Götschel,Daniel Ruprecht
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
备注:
Abstract:Machine-learning based methods like physics-informed neural networks and physics-informed neural operators are becoming increasingly adept at solving even complex systems of partial differential equations. Boundary conditions can be enforced either weakly by penalizing deviations in the loss function or strongly by training a solution structure that inherently matches the prescribed values and derivatives. The former approach is easy to implement but the latter can provide benefits with respect to accuracy and training times. However, previous approaches to strongly enforcing Neumann or Robin boundary conditions require a domain with a fully C^1 boundary and, as we demonstrate, can lead to instability if those boundary conditions are posed on a segment of the boundary that is piecewise C^1 but only C^0 globally. We introduce a generalization of the approach by Sukumar & Srivastava (doi: https://doi.org/10.1016/j.cma.2021.114333), and a new approach based on orthogonal projections that overcome this limitation. The performance of these new techniques is compared against weakly and semi-weakly enforced boundary conditions for the scalar Darcy flow equation and the stationary Navier-Stokes equations.
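For Dirichlet conditions, strong enforcement typically uses the ansatz u(x) = g(x) + d(x) N(x), where g matches the boundary data and d vanishes on the boundary, so u satisfies the condition for any network N. The paper's contribution concerns Neumann/Robin conditions on piecewise-C^1 boundaries; the 1D Dirichlet sketch below (with illustrative choices of g and d) only shows the general mechanism.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def u(x: torch.Tensor) -> torch.Tensor:
    """Ansatz on [0, 1] with u(0) = 2 and u(1) = -1 enforced exactly.
    g(x) = 2 + (-1 - 2) * x interpolates the boundary data;
    d(x) = x * (1 - x) vanishes at both endpoints."""
    g = 2.0 + (-1.0 - 2.0) * x
    d = x * (1.0 - x)
    return g + d * net(x)

x = torch.tensor([[0.0], [1.0]])
print(u(x))  # exactly [[2.], [-1.]] regardless of the network's weights
```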
[LG-8] Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks
链接: https://arxiv.org/abs/2510.24546
作者: Lingyi Wang,Rashed Shelim,Walid Saad,Naren Ramakrishnan
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
备注:
Abstract:Despite the popularity of reinforcement learning (RL) in wireless networks, existing approaches that rely on model-free RL (MFRL) and model-based RL (MBRL) are data inefficient and short-sighted. Such RL-based solutions cannot generalize to novel network states since they capture only statistical patterns rather than the underlying physics and logic from wireless data. These limitations become particularly challenging in complex wireless networks with high dynamics and long-term planning requirements. To address these limitations, in this paper, a novel dual-mind world model-based learning framework is proposed with the goal of optimizing completeness-weighted age of information (CAoI) in a challenging mmWave V2X scenario. Inspired by cognitive psychology, the proposed dual-mind world model encompasses a pattern-driven System 1 component and a logic-driven System 2 component to learn dynamics and logic of the wireless network, and to provide long-term link scheduling over reliable imagined trajectories. Link scheduling is learned through end-to-end differentiable imagined trajectories with logical consistency over an extended horizon rather than relying on wireless data obtained from environment interactions. Moreover, through imagination rollouts, the proposed world model can jointly reason network states and plan link scheduling. During intervals without observations, the proposed method remains capable of making efficient decisions. Extensive experiments are conducted on a realistic simulator based on Sionna with real-world physical channel, ray-tracing, and scene objects with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency and achieves strong generalization and adaptation to unseen environments, compared to the state-of-the-art RL baselines, and the world model approach with only System 1.
[LG-9] MIMIC-Sepsis: A Curated Benchmark for Modeling and Learning from Sepsis Trajectories in the ICU
链接: https://arxiv.org/abs/2510.24500
作者: Yong Huang,Zhongqi Yang,Amir Rahmani
类目: Machine Learning (cs.LG)
备注:
Abstract:Sepsis is a leading cause of mortality in intensive care units (ICUs), yet existing research often relies on outdated datasets, non-reproducible preprocessing pipelines, and limited coverage of clinical interventions. We introduce MIMIC-Sepsis, a curated cohort and benchmark framework derived from the MIMIC-IV database, designed to support reproducible modeling of sepsis trajectories. Our cohort includes 35,239 ICU patients with time-aligned clinical variables and standardized treatment data, including vasopressors, fluids, mechanical ventilation and antibiotics. We describe a transparent preprocessing pipeline-based on Sepsis-3 criteria, structured imputation strategies, and treatment inclusion-and release it alongside benchmark tasks focused on early mortality prediction, length-of-stay estimation, and shock onset classification. Empirical results demonstrate that incorporating treatment variables substantially improves model performance, particularly for Transformer-based architectures. MIMIC-Sepsis serves as a robust platform for evaluating predictive and sequential models in critical care research.
[LG-10] Methodology for Comparing Machine Learning Algorithms for Survival Analysis
链接: https://arxiv.org/abs/2510.24473
作者: Lucas Buk Cardoso,Simone Aldrey Angelo,Yasmin Pacheco Gil Bonilha,Fernando Maia,Adeylson Guimarães Ribeiro,Maria Paula Curado,Gisele Aparecida Fernandes,Vanderlei Cunha Parro,Flávio Almeida de Magalhães Cipparrone,Alexandre Dias Porto Chiavegatto Filho,Tatiana Natasha Toporcov
类目: Machine Learning (cs.LG)
备注:
Abstract:This study presents a comparative methodological analysis of six machine learning models for survival analysis (MLSA). Using data from nearly 45,000 colorectal cancer patients in the Hospital-Based Cancer Registries of São Paulo, we evaluated Random Survival Forest (RSF), Gradient Boosting for Survival Analysis (GBSA), Survival SVM (SSVM), XGBoost-Cox (XGB-Cox), XGBoost-AFT (XGB-AFT), and LightGBM (LGBM), capable of predicting survival considering censored data. Hyperparameter optimization was performed with different samplers, and model performance was assessed using the Concordance Index (C-Index), C-Index IPCW, time-dependent AUC, and Integrated Brier Score (IBS). Survival curves produced by the models were compared with predictions from classification algorithms, and predictor interpretation was conducted using SHAP and permutation importance. XGB-AFT achieved the best performance (C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF. The results highlight the potential and applicability of MLSA to improve survival prediction and support decision making.
[LG-11] ARIMA_PLUS: Large-scale Accurate Automatic and Interpretable In-Database Time Series Forecasting and Anomaly Detection in Google BigQuery
链接: https://arxiv.org/abs/2510.24452
作者: Xi Cheng,Weijie Shen,Haoming Chen,Chaoyi Shen,Jean Ortega,Jiashang Liu,Steve Thomas,Honglin Zheng,Haoyun Wu,Yuxiang Li,Casey Lichtendahl,Jenny Ortiz,Gang Liu,Haiyang Qi,Omid Fatemieh,Chris Fry,Jing Jing Long
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:
Abstract:Time series forecasting and anomaly detection are common tasks for practitioners in industries such as retail, manufacturing, advertising and energy. Two unique challenges stand out: (1) efficiently and accurately forecasting time series or detecting anomalies in large volumes automatically; and (2) ensuring interpretability of results to effectively incorporate business insights. We present ARIMA_PLUS, a novel framework to overcome these two challenges by a unique combination of (a) accurate and interpretable time series models and (b) scalable and fully managed system infrastructure. The model has a sequential and modular structure to handle different components of the time series, including holiday effects, seasonality, trend, and anomalies, which enables high interpretability of the results. Novel enhancements are made to each module, and a unified framework is established to address both forecasting and anomaly detection tasks simultaneously. In terms of accuracy, its comprehensive benchmark on the 42 public datasets in the Monash forecasting repository shows superior performance over not only well-established statistical alternatives (such as ETS, ARIMA, TBATS, Prophet) but also newer neural network models (such as DeepAR, N-BEATS, PatchTST, TimeMixer). In terms of infrastructure, it is directly built into the query engine of BigQuery in Google Cloud. It uses a simple SQL interface and automates tedious technicalities such as data cleaning and model selection. It automatically scales with managed cloud computational and storage resources, making it possible to forecast 100 million time series using only 1.5 hours with a throughput of more than 18000 time series per second. In terms of interpretability, we present several case studies to demonstrate time series insights it generates and customizability it offers.
[LG-12] Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
链接: https://arxiv.org/abs/2510.24432
作者: Seyed Mahdi Basiri Azad,Joschka Boedecker
类目: Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
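A minimal tabular sketch of the idea: precompute discounted returns along a few successful demonstrations, write them into the Q-table as an optimistic prior, then continue with standard Q-learning. The environment, transitions, and hyperparameters are placeholders, not the paper's benchmark setup.

```python
import numpy as np

n_states, n_actions, gamma = 10, 2, 0.99
Q = np.zeros((n_states, n_actions))

# One successful demonstration: (state, action, reward) triples ending at the goal.
demo = [(0, 1, 0.0), (3, 1, 0.0), (7, 0, 0.0), (9, 1, 1.0)]  # sparse terminal reward

# "Fill in the blanks": propagate the discounted return back through the demo.
G = 0.0
for s, a, r in reversed(demo):
    G = r + gamma * G
    Q[s, a] = max(Q[s, a], G)   # optimistic initialization from the demonstration

print(Q[0, 1], Q[7, 0])  # non-zero priors along the demonstrated path

# ...afterwards, refine online with the usual Q-learning update:
# Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```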
[LG-13] Attack on a PUF-based Secure Binary Neural Network
链接: https://arxiv.org/abs/2510.24422
作者: Bijeet Basak,Nupur Patil,Kurian Polachan,Srinivas Vivek
类目: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注: Accepted at VLSID 2026. To be published in IEEE Xplore
Abstract:Binarized Neural Networks (BNNs) deployed on memristive crossbar arrays provide energy-efficient solutions for edge computing but are susceptible to physical attacks due to memristor nonvolatility. Recently, Rajendran et al. (IEEE Embedded Systems Letters 2025) proposed a Physical Unclonable Function (PUF)-based scheme to secure BNNs against theft attacks. Specifically, the weight and bias matrices of the BNN layers were secured by swapping columns based on the device's PUF key bits. In this paper, we demonstrate that this scheme is vulnerable to a PUF-key recovery attack. As a consequence of our attack, we recover the secret weight and bias matrices of the BNN. Our approach is motivated by differential cryptanalysis and reconstructs the PUF key bit-by-bit by observing the change in model accuracy, eventually recovering the BNN model parameters. Evaluated on a BNN trained on the MNIST dataset, our attack recovers 85% of the PUF key and reconstructs the BNN model to 93% classification accuracy, compared to the original model's 96% accuracy. The attack is very efficient, taking only a couple of minutes to recover the PUF key and the model parameters.
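The bit-by-bit recovery loop can be sketched as follows: for each key bit, compare model accuracy with the corresponding columns un-swapped under guess 0 versus guess 1, and keep whichever hypothesis scores higher. This is a simplified illustration of the differential idea; the column-swap layout and the toy accuracy oracle are our assumptions.

```python
import numpy as np

def recover_key(eval_acc, n_bits):
    """eval_acc(key_guess) returns the accuracy of the BNN with columns
    un-swapped according to key_guess. Recover the key greedily, bit by bit."""
    key = np.zeros(n_bits, dtype=int)
    for i in range(n_bits):
        key[i] = 0
        acc0 = eval_acc(key[: i + 1])
        key[i] = 1
        acc1 = eval_acc(key[: i + 1])
        key[i] = 0 if acc0 >= acc1 else 1   # keep the guess that helps accuracy
    return key

# Toy stand-in: accuracy rises with the number of correctly guessed bits.
true_key = np.array([1, 0, 1, 1, 0, 0, 1, 0])
acc = lambda guess: 0.5 + 0.05 * np.sum(guess == true_key[: len(guess)])
print(recover_key(acc, len(true_key)), "==", true_key)
```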
[LG-14] APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries
链接: https://arxiv.org/abs/2510.24380
作者: Aryan Pedawi,Jordi Silvestre-Ryan,Bradley Worley,Darren J Hsu,Kushal S Shah,Elias Stehle,Jingrong Zhang,Izhar Wallach
类目: Machine Learning (cs.LG)
备注:
Abstract:Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have significantly enabled drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to a limited computational budget. For current library sizes, numbering in the tens of billions of compounds, and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, leaving potentially many high scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a neural network surrogate that exploits the structure of CSLs in the prediction of objectives and constraints to make full enumeration on a consumer GPU possible in under a minute, allowing for exact retrieval of approximate top-k sets. To demonstrate APEX's capabilities, we develop a benchmark CSL comprised of more than 10 million compounds, all of which have been annotated with their docking scores on five medically relevant targets along with physicochemical properties measured with RDKit such that, for any objective and set of constraints, the ground truth top-k compounds can be identified and compared against the retrievals from any virtual screening algorithm. We show APEX's consistently strong performance both in retrieval accuracy and runtime compared to alternative methods.
[LG-15] A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport
链接: https://arxiv.org/abs/2510.24375
作者: Yuanyuan Wu,Zhenlin Qin,Zhenliang Ma
类目: Machine Learning (cs.LG)
备注:
Abstract:Synthetic data offers a promising solution to the privacy and accessibility challenges of using smart card data in public transport research. Despite rapid progress in generative modeling, there is limited attention to comprehensive evaluation, leaving unclear how reliable, safe, and useful synthetic data truly are. Existing evaluations remain fragmented, typically limited to population-level representativeness or record-level privacy, without considering group-level variations or task-specific utility. To address this gap, we propose a Representativeness-Privacy-Utility (RPU) framework that systematically evaluates synthetic trip data across three complementary dimensions and three hierarchical levels (record, group, population). The framework integrates a consistent set of metrics to quantify similarity, disclosure risk, and practical usefulness, enabling transparent and balanced assessment of synthetic data quality. We apply the framework to benchmark twelve representative generation methods, spanning conventional statistical models, deep generative networks, and privacy-enhanced variants. Results show that synthetic data do not inherently guarantee privacy, that there is no "one-size-fits-all" model, and that the trade-off between privacy and representativeness/utility is clear. The conditional tabular generative adversarial network (CTGAN) provides the most balanced trade-off and is recommended for practical applications. The RPU framework provides a systematic and reproducible basis for researchers and practitioners to compare synthetic data generation techniques and select appropriate methods in public transport applications.
[LG-16] Filtering instances and rejecting predictions to obtain reliable models in healthcare
链接: https://arxiv.org/abs/2510.24368
作者: Maria Gabriela Valeriano,David Kohan Marzagão,Alfredo Montelongo,Carlos Roberto Veiga Kiffer,Natan Katz,Ana Carolina Lorena
类目: Machine Learning (cs.LG)
备注: This paper is under review at Machine Learning (Springer)
Abstract:Machine Learning (ML) models are widely used in high-stakes domains such as healthcare, where the reliability of predictions is critical. However, these models often fail to account for uncertainty, providing predictions even with low confidence. This work proposes a novel two-step data-centric approach to enhance the performance of ML models by improving data quality and filtering low-confidence predictions. The first step involves leveraging Instance Hardness (IH) to filter problematic instances during training, thereby refining the dataset. The second step introduces a confidence-based rejection mechanism during inference, ensuring that only reliable predictions are retained. We evaluate our approach using three real-world healthcare datasets, demonstrating its effectiveness at improving model reliability while balancing predictive performance and rejection rate. Additionally, we use alternative criteria - influence values for filtering and uncertainty for rejection - as baselines to evaluate the efficiency of the proposed method. The results demonstrate that integrating IH filtering with confidence-based rejection effectively enhances model performance while preserving a large proportion of instances. This approach provides a practical method for deploying ML systems in safety-critical applications.
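A compact sklearn sketch of the two steps: filter training instances that are hard (here approximated by cross-validated k-NN probability of the true class, one of several possible instance-hardness proxies) and then reject test predictions whose confidence falls below a threshold. The proxy, thresholds, and classifiers are our assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: instance-hardness filtering (proxy: cross-validated k-NN probability).
p_true = cross_val_predict(KNeighborsClassifier(), X_tr, y_tr,
                           method="predict_proba", cv=5)[np.arange(len(y_tr)), y_tr]
keep = p_true >= 0.4                      # drop the hardest training instances

# Step 2: train, then reject low-confidence predictions at inference.
clf = RandomForestClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])
proba = clf.predict_proba(X_te).max(axis=1)
accept = proba >= 0.75                    # abstain below this confidence
acc = (clf.predict(X_te)[accept] == y_te[accept]).mean()
print(f"accepted {accept.mean():.0%} of test set, accuracy on accepted: {acc:.3f}")
```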
[LG-17] EDC: Equation Discovery for Classification
链接: https://arxiv.org/abs/2510.24310
作者: Guus Toussaint,Arno Knobbe
类目: Machine Learning (cs.LG)
*备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Lecture Notes in Computer Science, and is available online at this https URL
Abstract:Equation Discovery techniques have shown considerable success in regression tasks, where they are used to discover concise and interpretable models (Symbolic Regression). In this paper, we propose a new ED-based binary classification framework. Our proposed method EDC finds analytical functions of manageable size that specify the location and shape of the decision boundary. In extensive experiments on artificial and real-life data, we demonstrate how EDC is able to discover both the structure of the target equation and the values of its parameters, outperforming the current state-of-the-art ED-based classification methods and achieving performance comparable to the overall state of the art in binary classification. We suggest a grammar of modest complexity that appears to work well on the tested datasets but argue that the exact grammar – and thus the complexity of the models – is configurable, and domain-specific expressions in particular can be included in the pattern language where that is required. The presented grammar consists of a series of summands (additive terms) that include linear, quadratic and exponential terms, as well as products of two features (producing hyperbolic curves ideal for capturing XOR-like dependencies). The experiments demonstrate that this grammar allows fairly flexible decision boundaries while not being so rich as to cause overfitting.
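For intuition, the sketch below evaluates a decision function written in the grammar the abstract describes: a sum of linear, quadratic, exponential, and two-feature product summands, with the class given by the sign of the expression. The coefficients and feature indices are invented for illustration, not discovered by EDC.

```python
import numpy as np

def edc_boundary(X):
    # one function from the grammar: linear + quadratic + exponential + product terms
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return 1.5 * x1 - 0.8 * x2**2 + 0.3 * np.exp(x3) + 2.0 * x1 * x2 - 1.0

X = np.random.default_rng(0).normal(size=(5, 3))
labels = (edc_boundary(X) > 0).astype(int)   # class = sign of f(x)
print(labels)
```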
[LG-18] HergNet: a Fast Neural Surrogate Model for Sound Field Predictions via Superposition of Plane Waves
链接: https://arxiv.org/abs/2510.24279
作者: Matteo Calafà,Yuanxin Xia,Cheol-Ho Jeong
类目: Sound (cs.SD); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:We present a novel neural network architecture for the efficient prediction of sound fields in two and three dimensions. The network is designed to automatically satisfy the Helmholtz equation, ensuring that the outputs are physically valid. Therefore, the method can effectively learn solutions to boundary-value problems in various wave phenomena, such as acoustics, optics, and electromagnetism. Numerical experiments show that the proposed strategy can potentially outperform state-of-the-art methods in room acoustics simulation, in particular in the range of mid to high frequencies.
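The physical-validity claim rests on a simple identity: any superposition of plane waves with wavenumber $k$ satisfies the homogeneous Helmholtz equation $\Delta u + k^2 u = 0$ exactly, so a network that outputs the weights and directions of such waves is valid by construction. The sketch below checks this numerically for a random superposition; the weights and directions are stand-ins for what the network would output.

```python
import numpy as np

k = 2 * np.pi                        # wavenumber
rng = np.random.default_rng(0)
thetas = rng.uniform(0, 2 * np.pi, size=16)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # unit directions
w = rng.normal(size=16) + 1j * rng.normal(size=16)          # complex weights

def u(x):
    # superposition of plane waves: sum_j w_j * exp(i k d_j . x)
    return (w * np.exp(1j * k * dirs @ x)).sum()

def laplacian_u(x):
    # analytic Laplacian: each plane wave contributes -k^2 times itself
    return (-(k**2) * w * np.exp(1j * k * dirs @ x)).sum()

x = np.array([0.3, -0.7])
print(abs(laplacian_u(x) + k**2 * u(x)))   # ~0: Helmholtz residual vanishes
```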
[LG-19] SALS: Sparse Attention in Latent Space for KV cache Compression
链接: https://arxiv.org/abs/2510.24273
作者: Junlin Mu,Hantao Huang,Jihang Zhang,Minghui Yu,Tao Wang,Yidong Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value cache size and high memory bandwidth requirements. Previous research has demonstrated that the KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding (RoPE) mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space (SALS) framework. SALS projects the KV cache into a compact latent space via low-rank projection, and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance while maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and a 5.7-fold speed-up in the attention operator compared to FlashAttention2 on the 4K sequence. For end-to-end throughput, it achieves 1.4-fold and 4.5-fold improvements compared to GPT-fast on 4K and 32K sequences, respectively.
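The retrieval step can be pictured with a simplified sketch: keys live in a low-rank latent space, token selection uses RoPE-free scores computed there, and only the selected tokens' keys are reconstructed for attention. The dimensions, the orthonormal projection, and the plain top-k rule below are illustrative assumptions rather than SALS's exact design.

```python
import torch

d, r, T, topk = 64, 16, 1024, 32
torch.manual_seed(0)
P = torch.linalg.qr(torch.randn(d, r)).Q        # low-rank down-projection (d -> r)
K = torch.randn(T, d)                            # full keys (conceptually not stored)
V = torch.randn(T, d)
K_lat = K @ P                                    # compact latent KV cache (T x r)

q = torch.randn(d)
scores_lat = K_lat @ (P.T @ q)                   # RoPE-free scores in latent space
idx = scores_lat.topk(topk).indices              # select a small subset of tokens

K_sel = K_lat[idx] @ P.T                         # reconstruct only the selected keys
attn = torch.softmax(K_sel @ q / d**0.5, dim=0)  # exact attention on the subset
out = attn @ V[idx]
print(out.shape)
```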
[LG-20] Temporal Knowledge Graph Hyperedge Forecasting: Exploring Entity-to-Category Link Prediction
链接: https://arxiv.org/abs/2510.24240
作者: Edward Markai,Sina Molavipour
类目: Machine Learning (cs.LG)
*备注:
Abstract:Temporal Knowledge Graphs have emerged as a powerful way of modeling not only static relationships between entities but also the dynamics of how relations evolve over time. As these informational structures can be used to store information from a real-world setting, such as a news flow, predicting future graph components to a certain extent equates to predicting real-world events. Most of the research in this field focuses on embedding-based methods, often leveraging convolutional neural net architectures. These solutions act as black boxes, limiting insight. In this paper, we explore an extension to an established rule-based framework, TLogic, that yields high accuracy in combination with explainable predictions. This offers transparency and allows the end-user to critically evaluate the rules applied at the end of the prediction stage. The new rule format incorporates entity category as a key component, with the purpose of limiting rule application only to relevant entities. When categories are unknown for building the graph, we propose a data-driven method to generate them with an LLM-based approach. Additionally, we investigate the choice of aggregation method for scores of retrieved entities when performing category prediction.
[LG-21] Sparse Optimistic Information Directed Sampling
链接: https://arxiv.org/abs/2510.24234
作者: Ludovic Schwartz,Hamish Flynn,Gergely Neu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many high-dimensional online decision-making problems can be modeled as stochastic sparse linear bandits. Most existing algorithms are designed to achieve optimal worst-case regret in either the data-rich regime, where polynomial dependence on the ambient dimension is unavoidable, or the data-poor regime, where dimension-independence is possible at the cost of worse dependence on the number of rounds. In contrast, the sparse Information Directed Sampling (IDS) algorithm satisfies a Bayesian regret bound that has the optimal rate in both regimes simultaneously. In this work, we explore the use of Sparse Optimistic Information Directed Sampling (SOIDS) to achieve the same adaptivity in the worst-case setting, without Bayesian assumptions. Through a novel analysis that enables the use of a time-dependent learning rate, we show that SOIDS can optimally balance information and regret. Our results extend the theoretical guarantees of IDS, providing the first algorithm that simultaneously achieves optimal worst-case regret in both the data-rich and data-poor regimes. We empirically demonstrate the good performance of SOIDS.
[LG-22] PRIVET: Privacy Metric Based on Extreme Value Theory
链接: https://arxiv.org/abs/2510.24233
作者: Antoine Szatkownik(TAU, BioInfo),Aurélien Decelle,Beatriz Seoane(TAU),Nicolas Bereux(TAU),Léo Planche(BioInfo),Guillaume Charpiat(TAU),Burak Yelmen,Flora Jay(BioInfo, TAU),Cyril Furtlehner(TAU)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or, more broadly, any copyrighted, licensed or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage, an issue closely tied to overfitting. Existing methods almost exclusively rely on global criteria to estimate the risk of privacy failure associated with a model, offering only quantitative, non-interpretable insights. The absence of rigorous evaluation methods for data privacy at the sample level may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality, limited sample sizes such as genetic data, and even underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset-level and sample-level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals limitations in existing computer vision embeddings to yield perceptually meaningful distances when identifying near-duplicate samples.
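The general recipe (per-sample nearest-neighbor distances plus an extreme-value fit on the suspicious tail) can be sketched as follows. The peaks-over-threshold estimator with a generalized Pareto fit is our illustrative choice, not necessarily PRIVET's exact construction, and the planted near-copies stand in for memorized samples.

```python
import numpy as np
from scipy.stats import genpareto
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 8))
synth = np.vstack([rng.normal(size=(990, 8)), train[:10] + 1e-3])  # plant 10 near-copies

# Nearest-neighbor distance of each synthetic sample to the training set.
d = NearestNeighbors(n_neighbors=1).fit(train).kneighbors(synth)[0].ravel()

# Peaks-over-threshold on the small-distance tail (negate so "close" = "extreme").
u = np.quantile(-d, 0.95)
excess = (-d)[-d > u] - u
c, _, scale = genpareto.fit(excess, floc=0.0)

# Per-sample leak score: how far into the fitted extreme tail the sample sits.
score = genpareto.cdf(np.maximum(-d - u, 0.0), c, loc=0.0, scale=scale)
print("suspected leaks:", np.sort(np.argsort(score)[-10:]))
```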
[LG-23] A comparison between joint and dual UKF implementations for state estimation and leak localization in water distribution networks
链接: https://arxiv.org/abs/2510.24228
作者: Luis Romero-Ben,Paul Irofti,Florin Stoican,Vicenç Puig
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: This work has been submitted to ECC2026 for review. It has 7 pages and 2 figures
Abstract:The sustainability of modern cities highly depends on efficient water distribution management, including effective pressure control and leak detection and localization. Accurate information about the network hydraulic state is therefore essential. This article presents a comparison between two data-driven state estimation methods based on the Unscented Kalman Filter (UKF), fusing pressure, demand and flow data for head and flow estimation. One approach uses a joint state vector with a single estimator, while the other uses a dual-estimator scheme. We analyse their main characteristics, discussing differences, advantages and limitations, and compare them theoretically in terms of accuracy and complexity. Finally, we show several estimation results for the L-TOWN benchmark, allowing us to discuss their properties in a real implementation.
[LG-24] Unlocking Out-of-Distribution Generalization in Dynamics through Physics-Guided Augmentation
链接: https://arxiv.org/abs/2510.24216
作者: Fan Xu,Hao Wu,Kun Wang,Nan Wang,Qingsong Wen,Xian Wu,Wei Gong,Xibin Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:In dynamical system modeling, traditional numerical methods are limited by high computational costs, while modern data-driven approaches struggle with data scarcity and distribution shifts. To address these fundamental limitations, we first propose SPARK, a physics-guided quantitative augmentation plugin. Specifically, SPARK utilizes a reconstruction autoencoder to integrate physical parameters into a physics-rich discrete state dictionary, which then serves as a structured set of physical states, enabling the creation of new, physically-plausible training samples via principled interpolation in the latent space. Further, for downstream prediction, these augmented representations are seamlessly integrated with a Fourier-enhanced Graph ODE, a combination designed to robustly model the enriched data distribution while capturing long-term temporal dependencies. Extensive experiments on diverse benchmarks demonstrate that SPARK significantly outperforms state-of-the-art baselines, particularly in challenging out-of-distribution scenarios and data-scarce regimes, proving the efficacy of our physics-guided augmentation paradigm.
[LG-25] What Can Be Recovered Under Sparse Adversarial Corruption? Assumption-Free Theory for Linear Measurements
链接: https://arxiv.org/abs/2510.24215
作者: Vishal Halder(IMT Atlantique - INFO, Lab-STICC),Alexandre Reiffers-Masson(IMT Atlantique - INFO, Lab-STICC),Abdeldjalil Aïssa-El-Bey(IMT Atlantique - MEE, Lab-STICC),Gugan Thoppe(CSA, IISc)
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Let $\bm{A} \in \mathbb{R}^{m \times n}$ be an arbitrary, known matrix and $\bm{e}$ a $q$-sparse adversarial vector. Given $\bm{y} = \bm{A} x^* + \bm{e}$ and $q$, we seek the smallest set containing $x^*$ (hence the one conveying maximal information about $x^*$) that is uniformly recoverable from $\bm{y}$ without knowing $\bm{e}$. While exact recovery of $x^*$ via strong (and often impractical) structural assumptions on $\bm{A}$ or $x^*$ (for example, restricted isometry, sparsity) is well studied, recoverability for arbitrary $\bm{A}$ and $x^*$ remains open. Our main result shows that the best that one can hope to recover is $x^* + \ker(\bm{U})$, where $\bm{U}$ is the unique projection matrix onto the intersection of row spaces of all possible submatrices of $\bm{A}$ obtained by deleting $2q$ rows. Moreover, we prove that every $x$ that minimizes the $\ell_0$-norm of $\bm{y} - \bm{A} x$ lies in $x^* + \ker(\bm{U})$, which then gives a constructive approach to recover this set.
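The theorem can be checked numerically on a toy example: build $\bm{U}$ as the projection onto the intersection of the row spaces of all $2q$-row-deletion submatrices, using the fact that this intersection is the null space of the stacked complementary projections. In the sketch below, coordinates measured with enough redundancy survive, while coordinates measured only once fall into $\ker(\bm{U})$; sizes are toy values chosen for readability.

```python
import numpy as np
from itertools import combinations

n, q = 4, 1
E = np.eye(n)
# Coordinates 1 and 2 are each measured three times (more than 2q copies), so a
# copy survives any 2q = 2 deletions; coordinates 3 and 4 are measured once.
A = np.vstack([E[0], E[0], E[0], E[1], E[1], E[1], E[2], E[3]])

def rowspace_proj(M):
    # projection onto the row space via the right singular vectors of M
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    Vr = Vt[s > 1e-10]
    return Vr.T @ Vr

# v lies in every row space iff (I - P_S) v = 0 for all deletion patterns S,
# so the intersection is the null space of the stacked complements.
stack = np.vstack([np.eye(n) - rowspace_proj(np.delete(A, rows, axis=0))
                   for rows in combinations(range(len(A)), 2 * q)])
_, s, Vt = np.linalg.svd(stack)
basis = Vt[s < 1e-8]                 # orthonormal basis of the intersection
U = basis.T @ basis                  # the projection matrix from the theorem
print(np.round(U, 6))                # diag(1,1,0,0): only x1, x2 are recoverable
```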
[LG-26] SPEAR: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning NEURIPS2025
链接: https://arxiv.org/abs/2510.24200
作者: Alexander Bakarsky,Dimitar I. Dimitrov,Maximilian Baader,Martin Vechev
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Published at the Workshop on Regulatable ML at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Federated Learning has seen an increased deployment in real-world scenarios recently, as it enables the distributed training of machine learning models without explicit data sharing between individual clients. Yet, the introduction of the so-called gradient inversion attacks has fundamentally challenged its privacy-preserving properties. Unfortunately, as these attacks mostly rely on direct data optimization without any formal guarantees, the vulnerability of real-world systems remains in dispute and requires tedious testing for each new federated deployment. To overcome these issues, recently the SPEAR attack was introduced, which is based on a theoretical analysis of the gradients of linear layers with ReLU activations. While SPEAR is an important theoretical breakthrough, the attack’s practicality was severely limited by its exponential runtime in the batch size b. In this work, we fill this gap by applying State-of-the-Art techniques from Sparsely-Used Dictionary Learning to make the problem of gradient inversion on linear layers with ReLU activations tractable. Our experiments demonstrate that our new attack, SPEAR++, retains all desirable properties of SPEAR, such as robustness to DP noise and FedAvg aggregation, while being applicable to 10x bigger batch sizes.
[LG-27] Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames
链接: https://arxiv.org/abs/2510.24194
作者: Ev Zisselman,Mirco Mutti,Shelly Francis-Meretzki,Elisei Shafer,Aviv Tamar
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Behavioral cloning is a simple yet effective technique for learning sequential decision-making from demonstrations. Recently, it has gained prominence as the core of foundation models for the physical world, where achieving generalization requires countless demonstrations of a multitude of tasks. Typically, a human expert with full information on the task demonstrates a (nearly) optimal behavior. In this paper, we propose to hide some of the task's information from the demonstrator. This "blindfolded" expert is compelled to employ non-trivial exploration to solve the task. We show that cloning the blindfolded expert generalizes better to unseen tasks than its fully-informed counterpart. We conduct experiments on real-world robot peg insertion tasks with (limited) human demonstrations, alongside videogames from the Procgen benchmark. Additionally, we support our findings with theoretical analysis, which confirms that the generalization error scales with $\sqrt{I/m}$, where $I$ measures the amount of task information available to the demonstrator, and $m$ is the number of demonstrated tasks. Both theory and practice indicate that cloning blindfolded experts generalizes better with fewer demonstrated tasks. Project page with videos and code: this https URL
[LG-28] V-SAT: Video Subtitle Annotation Tool
链接: https://arxiv.org/abs/2510.24180
作者: Arpita Kundu,Joyita Chakraborty,Anindita Desarkar,Aritra Sen,Srushti Anil Patil,Vishwanathan Raman
类目: Machine Learning (cs.LG)
*备注:
Abstract:The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, primarily speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved substantially: the SUBER score dropped from 9.6 to 3.54 after resolving all language-mode issues, and image-mode issues were handled with F1-scores of ~0.80. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
[LG-29] EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale NEURIPS2025
链接: https://arxiv.org/abs/2510.24173
作者: Yiheng Du,Aditi S. Krishnapriyan
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
*备注: NeurIPS 2025
Abstract:Computationally resolving turbulence remains a central challenge in fluid dynamics due to its multi-scale interactions. Fully resolving large-scale turbulence through direct numerical simulation (DNS) is computationally prohibitive, motivating data-driven machine learning alternatives. In this work, we propose EddyFormer, a Transformer-based spectral-element (SEM) architecture for large-scale turbulence simulation that combines the accuracy of spectral methods with the scalability of the attention mechanism. We introduce an SEM tokenization that decomposes the flow into grid-scale and subgrid-scale components, enabling capture of both local and global features. We create a new three-dimensional isotropic turbulence dataset and train EddyFormer to achieve DNS-level accuracy at $256^3$ resolution, providing a 30x speedup over DNS. When applied to unseen domains up to 4x larger than in training, EddyFormer preserves accuracy on physics-invariant metrics (energy spectra, correlation functions, and structure functions), showing domain generalization. On The Well benchmark suite of diverse turbulent flows, EddyFormer resolves cases where prior ML models fail to converge, accurately reproducing complex dynamics across a wide range of physical conditions.
[LG-30] Identifiable learning of dissipative dynamics
链接: https://arxiv.org/abs/2510.24160
作者: Aiqing Zhu,Beatrice W. Soh,Grigorios A. Pavliotis,Qianxiao Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Complex dissipative systems appear across science and engineering, from polymers and active matter to learning algorithms. These systems operate far from equilibrium, where energy dissipation and time irreversibility are key to their behavior, but are difficult to quantify from data. Learning accurate and interpretable models of such dynamics remains a major challenge: the models must be expressive enough to describe diverse processes, yet constrained enough to remain physically meaningful and mathematically identifiable. Here, we introduce I-OnsagerNet, a neural framework that learns dissipative stochastic dynamics directly from trajectories while ensuring both interpretability and uniqueness. I-OnsagerNet extends the Onsager principle to guarantee that the learned potential is obtained from the stationary density and that the drift decomposes cleanly into time-reversible and time-irreversible components, as dictated by the Helmholtz decomposition. Our approach enables us to calculate the entropy production and to quantify irreversibility, offering a principled way to detect and quantify deviations from equilibrium. Applications to polymer stretching in elongational flow and to stochastic gradient Langevin dynamics reveal new insights, including super-linear scaling of barrier heights and sub-linear scaling of entropy production rates with the strain rate, and the suppression of irreversibility with increasing batch size. I-OnsagerNet thus establishes a general, data-driven framework for discovering and interpreting non-equilibrium dynamics.
[LG-31] Fixed Point Neural Acceleration and Inverse Surrogate Model for Battery Parameter Identification
链接: https://arxiv.org/abs/2510.24135
作者: Hojin Cheon,Hyeongseok Seo,Jihun Jeon,Wooju Lee,Dohyun Jeong,Hongseok Kim
类目: Machine Learning (cs.LG)
*备注: 31 pages, 11 figures, submitted to Applied Energy
Abstract:The rapid expansion of electric vehicles has intensified the need for accurate and efficient diagnosis of lithium-ion batteries. Parameter identification of electrochemical battery models is widely recognized as a powerful method for battery health assessment. However, conventional metaheuristic approaches suffer from high computational cost and slow convergence, and recent machine learning methods are limited by their reliance on constant-current data, which may not be available in practice. To overcome these challenges, we propose a deep learning-based framework for parameter identification of electrochemical battery models. The proposed framework combines a neural surrogate model of the single particle model with electrolyte (NeuralSPMe) and a deep learning-based fixed-point iteration method. NeuralSPMe is trained on realistic EV load profiles to accurately predict lithium concentration dynamics under dynamic operating conditions, while a parameter update network (PUNet) performs fixed-point iterative updates to significantly reduce both the evaluation time per sample and the overall number of iterations required for convergence. Experimental evaluations demonstrate that the proposed framework accelerates parameter identification by more than 2000 times and achieves superior sample efficiency and more than 10 times higher accuracy compared to conventional metaheuristic algorithms, particularly under dynamic load scenarios encountered in practical applications.
[LG-32] Causal Convolutional Neural Networks as Finite Impulse Response Filters
链接: https://arxiv.org/abs/2510.24125
作者: Kiran Bacsa,Wei Liu,Xudong Jian,Huangbin Liang,Eleni Chatzi
类目: Machine Learning (cs.LG)
*备注: 14 pages, 19 figures, Under review
Abstract:This study investigates the behavior of Causal Convolutional Neural Networks (CNNs) with quasi-linear activation functions when applied to time-series data characterized by multimodal frequency content. We demonstrate that, once trained, such networks exhibit properties analogous to Finite Impulse Response (FIR) filters, particularly when the convolutional kernels are of extended length exceeding those typically employed in standard CNN architectures. Causal CNNs are shown to capture spectral features both implicitly and explicitly, offering enhanced interpretability for tasks involving dynamic systems. Leveraging the associative property of convolution, we further show that the entire network can be reduced to an equivalent single-layer filter resembling an FIR filter optimized via least-squares criteria. This equivalence yields new insights into the spectral learning behavior of CNNs trained on signals with sparse frequency content. The approach is validated on both simulated beam dynamics and real-world bridge vibration datasets, underlining its relevance for modeling and identifying physical systems governed by dynamic responses.
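The reduction described above follows directly from the associativity of convolution and is easy to verify numerically: composing two causal convolution layers equals a single convolution with the convolution of their kernels, i.e., one equivalent FIR filter. A minimal check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)           # input signal
k1 = rng.normal(size=9)            # layer-1 kernel
k2 = rng.normal(size=15)           # layer-2 kernel (extended length)

two_layer = np.convolve(np.convolve(x, k1), k2)   # network: layer2(layer1(x))
fir = np.convolve(k1, k2)                         # equivalent single FIR filter
one_layer = np.convolve(x, fir)

print(np.allclose(two_layer, one_layer))          # True: identical responses
```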
[LG-33] Graph-Guided Concept Selection for Efficient Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2510.24120
作者: Ziyu Liu,Yijing Liu,Jianfei Yuan,Minzhi Yan,Le Yue,Honghui Xiong,Yi Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph-based RAG constructs a knowledge graph (KG) from text chunks to enhance retrieval in Large Language Model (LLM)-based question answering. It is especially beneficial in domains such as biomedicine, law, and political science, where effective retrieval often involves multi-hop reasoning over proprietary documents. However, these methods demand numerous LLM calls to extract entities and relations from text chunks, incurring prohibitive costs at scale. Through a carefully designed ablation study, we observe that certain words (termed concepts) and their associated documents are more important. Based on this insight, we propose Graph-Guided Concept Selection (G2ConS). Its core comprises a chunk selection method and an LLM-independent concept graph. The former selects salient document chunks to reduce KG construction costs; the latter closes knowledge gaps introduced by chunk selection at zero cost. Evaluations on multiple real-world datasets show that G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
[LG-34] Information-Theoretic Discrete Diffusion NEURIPS2025
链接: https://arxiv.org/abs/2510.24088
作者: Moongyu Jeon,Sangwoo Shin,Dongjae Jeon,Albert No
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Accepted at NeurIPS 2025
Abstract:We present an information-theoretic framework for discrete diffusion models that yields principled estimators of log-likelihood using score-matching losses. Inspired by the I-MMSE identity for the Gaussian setup, we derive analogous results for the discrete setting. Specifically, we introduce the Information-Minimum Denoising Score Entropy (I-MDSE) relation, which links mutual information between data and its diffused version to the minimum denoising score entropy (DSE) loss. We extend this theory to masked diffusion and establish the Information-Minimum Denoising Cross-Entropy (I-MDCE) relation, connecting cross-entropy losses to mutual information in discrete masked processes. These results provide a time-integral decomposition of the log-likelihood of the data in terms of optimal score-based losses, showing that commonly used losses such as DSE and DCE are not merely variational bounds but tight and principled estimators of log-likelihood. The I-MDCE decomposition further enables practical extensions, including a time-free formulation, conditional likelihood estimation in prompt-response tasks, and coupled Monte Carlo estimation of likelihood ratios. Experiments on synthetic and real-world data confirm the accuracy, variance stability, and utility of our estimators. The code is publicly available at this https URL.
[LG-35] Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
链接: https://arxiv.org/abs/2510.24055
作者: Xiucheng Zhang,Yang Jiang,Hongwei Qing,Jiashuo Bai
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages
Abstract:Perceptual ambiguity and task conflict limit multitask robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success rate, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation.
[LG-36] Low-N Protein Activity Optimization with FolDE
链接: https://arxiv.org/abs/2510.24053
作者: Jacob B. Roberts,Catherine R. Ji,Isaac Donnell,Thomas D. Young,Allison N. Pearson,Graham A. Hudson,Leah S. Keiser,Mia Wesselkamper,Peter H. Winegar,Janik Ludwig,Sarah H. Klass,Isha V. Sheth,Ezechinyere C. Ukabiala,Maria C. T. Astolfi,Benjamin Eysenbach,Jay D. Keasling
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 18 pages, 4 figures. Preprint. Open-source software available at this https URL
Abstract:Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning-assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest-predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end-of-campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness-based warm-starting, which augments limited activity measurements with protein language model outputs to improve activity prediction. We also introduce a constant-liar batch selector, which improves batch diversity; this is important in multi-mutation campaigns but had limited effect in our benchmarks. The complete workflow is freely available as open-source software, making efficient protein optimization accessible to any laboratory.
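The constant-liar batch selector named in the abstract is a standard batch Bayesian optimization heuristic: pick the point with the best acquisition value, pretend its outcome is a fixed "lie," refit the surrogate, and repeat until the batch is full. Below is a minimal sketch with expected improvement over a Gaussian process and the current best observation as the lie; it illustrates the heuristic only, not FolDE's full pipeline with naturalness-based warm-starting.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(10, 1))              # measured mutants (toy 1-D encoding)
y = -(X[:, 0] ** 2) + 0.1 * rng.normal(size=10)   # activities (to maximize)
cand = np.linspace(-2, 2, 200)[:, None]           # candidate pool

def expected_improvement(gp, cand, best):
    mu, sd = gp.predict(cand, return_std=True)
    z = (mu - best) / np.maximum(sd, 1e-9)
    return (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

batch, Xf, yf = [], X.copy(), y.copy()
for _ in range(4):                                 # batch size 4
    gp = GaussianProcessRegressor(normalize_y=True).fit(Xf, yf)
    ei = expected_improvement(gp, cand, yf.max())
    pick = cand[np.argmax(ei)]
    batch.append(pick)
    Xf = np.vstack([Xf, pick[None]])               # constant liar: pretend the pick
    yf = np.append(yf, y.max())                    #   scores like the current best
print(np.array(batch).ravel())                     # a diverse batch of 4 picks
```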
[LG-37] Mitigating Negative Transfer via Reducing Environmental Disagreement
链接: https://arxiv.org/abs/2510.24044
作者: Hui Sun,Zheng Xie,Hao-Yuan He,Ming Li
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures
Abstract:Unsupervised Domain Adaptation (UDA) focuses on transferring knowledge from a labeled source domain to an unlabeled target domain, addressing the challenge of domain shift. Significant domain shifts hinder effective knowledge transfer, leading to negative transfer and deteriorating model performance. Therefore, mitigating negative transfer is essential. This study revisits negative transfer through the lens of causally disentangled learning, emphasizing cross-domain discriminative disagreement on non-causal environmental features as a critical factor. Our theoretical analysis reveals that overreliance on non-causal environmental features as the environment evolves can cause discriminative disagreements (termed environmental disagreement), thereby resulting in negative transfer. To address this, we propose Reducing Environmental Disagreement (RED), which disentangles each sample into domain-invariant causal features and domain-specific non-causal environmental features via adversarially training domain-specific environmental feature extractors in the opposite domains. Subsequently, RED estimates and reduces environmental disagreement based on domain-specific non-causal environmental features. Experimental results confirm that RED effectively mitigates negative transfer and achieves state-of-the-art performance.
[LG-38] Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
链接: https://arxiv.org/abs/2510.24043
作者: Akira Tamamori
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures; submitted to The IEICE Transactions on Information and Systems
Abstract:This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
[LG-39] Efficient Global-Local Fusion Sampling for Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2510.24026
作者: Jiaqi Luo,Shixin Xu,Zhouwang Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The accuracy of Physics-Informed Neural Networks (PINNs) critically depends on the placement of collocation points, as the PDE loss is approximated through sampling over the solution domain. Global sampling ensures stability by covering the entire domain but requires many samples and is computationally expensive, whereas local sampling improves efficiency by focusing on high-residual regions but may neglect well-learned areas, reducing robustness. We propose a Global-Local Fusion (GLF) Sampling Strategy that combines the strengths of both approaches. Specifically, new collocation points are generated by perturbing training points with Gaussian noise scaled inversely to the residual, thereby concentrating samples in difficult regions while preserving exploration. To further reduce computational overhead, a lightweight linear surrogate is introduced to approximate the global residual-based distribution, achieving similar effectiveness at a fraction of the cost. Together, these components, residual-adaptive sampling and residual-based approximation, preserve the stability of global methods while retaining the efficiency of local refinement. Extensive experiments on benchmark PDEs demonstrate that GLF consistently improves both accuracy and efficiency compared with global and local sampling strategies. This study provides a practical and scalable framework for enhancing the reliability and efficiency of PINNs in solving complex and high-dimensional PDEs.
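The core sampling rule is compact enough to sketch directly: perturb each collocation point with Gaussian noise whose scale is inversely proportional to its residual, so well-learned (low-residual) regions are explored widely while high-residual regions keep dense local coverage. The residual function and noise constants below are illustrative stand-ins for the PINN's actual PDE residual and tuned hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(512, 2))            # current collocation points

def residual(p):
    # placeholder for the magnitude of the PINN's PDE residual at each point
    return np.abs(np.sin(6 * p[:, 0]) * np.cos(6 * p[:, 1])) + 1e-3

sigma0 = 0.05
scale = sigma0 / (residual(pts) + 1e-2)           # inverse-residual noise scale
new_pts = pts + scale[:, None] * rng.normal(size=pts.shape)
new_pts = np.clip(new_pts, 0.0, 1.0)              # keep samples inside the domain
```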
[LG-40] Auto-Adaptive PINNs with Applications to Phase Transitions
链接: https://arxiv.org/abs/2510.23999
作者: Kevin Buck,Woojeong Kim
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:We propose an adaptive sampling method for the training of Physics Informed Neural Networks (PINNs) which allows for sampling based on an arbitrary problem-specific heuristic which may depend on the network and its gradients. In particular we focus our analysis on the Allen-Cahn equations, attempting to accurately resolve the characteristic interfacial regions using a PINN without any post-hoc resampling. In experiments, we show the effectiveness of these methods over residual-adaptive frameworks.
[LG-41] Predicting Barge Tow Size on Inland Waterways Using Vessel Trajectory Derived Features: Proof of Concept
链接: https://arxiv.org/abs/2510.23994
作者: Geoffery Agorku,Sarah Hernandez,Hayley Hames,Cade Wagner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate, real-time estimation of barge quantity on inland waterways remains a critical challenge due to the non-self-propelled nature of barges and the limitations of existing monitoring systems. This study introduces a novel method to use Automatic Identification System (AIS) vessel tracking data to predict the number of barges in tow using Machine Learning (ML). To train and test the model, barge instances were manually annotated from satellite scenes across the Lower Mississippi River. Labeled images were matched to AIS vessel tracks using a spatiotemporal matching procedure. A comprehensive set of 30 AIS-derived features capturing vessel geometry, dynamic movement, and trajectory patterns were created and evaluated using Recursive Feature Elimination (RFE) to identify the most predictive variables. Six regression models, including ensemble, kernel-based, and generalized linear approaches, were trained and evaluated. The Poisson Regressor model yielded the best performance, achieving a Mean Absolute Error (MAE) of 1.92 barges using 12 of the 30 features. The feature importance analysis revealed that metrics capturing vessel maneuverability such as course entropy, speed variability and trip length were most predictive of barge count. The proposed approach provides a scalable, readily implementable method for enhancing Maritime Domain Awareness (MDA), with strong potential applications in lock scheduling, port management, and freight planning. Future work will expand the proof of concept presented here to explore model transferability to other inland rivers with differing operational and environmental conditions.
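The best-performing configuration reported above (a Poisson regressor on 12 of 30 AIS-derived features, with a count target) maps naturally onto scikit-learn's RFE wrapper. The sketch below uses synthetic stand-ins for the trajectory features and barge counts.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 30))                     # 30 trajectory-derived features
lam = np.exp(0.4 * X[:, 0] - 0.3 * X[:, 1] + 1.0)  # toy rate for barge counts
y = rng.poisson(lam)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Recursive feature elimination down to 12 features around a Poisson GLM.
rfe = RFE(PoissonRegressor(max_iter=1000), n_features_to_select=12).fit(X_tr, y_tr)
print("kept features:", np.flatnonzero(rfe.support_))
print("MAE:", mean_absolute_error(y_te, rfe.predict(X_te)))
```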
[LG-42] Optimal Arm Elimination Algorithms for Combinatorial Bandits
链接: https://arxiv.org/abs/2510.23992
作者: Yuxiao Wen,Yanjun Han,Zhengyuan Zhou
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
Abstract:Combinatorial bandits extend the classical bandit framework to settings where the learner selects multiple arms in each round, motivated by applications such as online recommendation and assortment optimization. While extensions of upper confidence bound (UCB) algorithms arise naturally in this context, adapting arm elimination methods has proved more challenging. We introduce a novel elimination scheme that partitions arms into three categories (confirmed, active, and eliminated), and incorporates explicit exploration to update these sets. We demonstrate the efficacy of our algorithm in two settings: the combinatorial multi-armed bandit with general graph feedback, and the combinatorial linear contextual bandit. In both cases, our approach achieves near-optimal regret, whereas UCB-based methods can provably fail due to insufficient explicit exploration. Matching lower bounds are also provided.
[LG-43] A Pragmatic Way to Measure Chain-of-Thought Monitorability
链接: https://arxiv.org/abs/2510.23966
作者: Scott Emmons,Roland S. Zimmermann,David K. Elson,Rohin Shah
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: The first two authors contributed equally
Abstract:While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks, finding that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact monitorability. While the exact prompt we share is still a preliminary version under ongoing development, we are sharing it now in the hopes that others in the community will find it useful. Our method helps measure the default monitorability of CoT - it should be seen as a complement, not a replacement, for the adversarial stress-testing needed to test robustness against deliberately evasive models.
[LG-44] A data free neural operator enabling fast inference of 2D and 3D Navier Stokes equations
链接: https://arxiv.org/abs/2510.23936
作者: Junho Choi,Teng-Yuan Chang,Namjung Kim,Youngjoon Hong
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Ensemble simulations of high-dimensional flow models (e.g., Navier-Stokes type PDEs) are computationally prohibitive for real-time applications. Neural operators enable fast inference but are limited by costly data requirements and poor generalization to 3D flows. We present a data-free operator network for the Navier-Stokes equations that eliminates the need for paired solution data and enables robust, real-time inference for large ensemble forecasting. The physics-grounded architecture takes initial and boundary conditions as well as forcing functions, yielding solutions robust to high variability and perturbations. Across 2D benchmarks and 3D test cases, the method surpasses prior neural operators in accuracy and, for ensembles, achieves greater efficiency than conventional numerical solvers. Notably, it delivers accurate solutions of the three-dimensional Navier-Stokes equations, a regime not previously demonstrated for data-free neural operators. By uniting a numerically grounded architecture with the scalability of machine learning, this approach establishes a practical pathway toward data-free, high-fidelity PDE surrogates for end-to-end scientific simulation and prediction.
[LG-45] Differential Privacy: Gradient Leakage Attacks in Federated Learning Environments
链接: https://arxiv.org/abs/2510.23931
作者: Miguel Fernandez-de-Retana,Unai Zulaika,Rubén Sánchez-Corcuera,Aitor Almeida
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 17 pages, 12 figures
Abstract:Federated Learning (FL) allows for the training of Machine Learning models in a collaborative manner without the need to share sensitive data. However, it remains vulnerable to Gradient Leakage Attacks (GLAs), which can reveal private information from the shared model updates. In this work, we investigate the effectiveness of Differential Privacy (DP) mechanisms - specifically, DP-SGD and a variant based on explicit regularization (PDP-SGD) - as defenses against GLAs. To this end, we evaluate the performance of several computer vision models trained under varying privacy levels on a simple classification task, and then analyze the quality of private data reconstructions obtained from the intercepted gradients in a simulated FL environment. Our results demonstrate that DP-SGD significantly mitigates the risk of gradient leakage attacks, albeit with a moderate trade-off in model utility. In contrast, PDP-SGD maintains strong classification performance but proves ineffective as a practical defense against reconstruction attacks. These findings highlight the importance of empirically evaluating privacy mechanisms beyond their theoretical guarantees, particularly in distributed learning scenarios where information leakage may represent a critical threat to data security and privacy.
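For readers unfamiliar with the mechanism being evaluated, a from-scratch DP-SGD step consists of per-sample gradient clipping followed by Gaussian noise. A minimal sketch is below; the clip norm and noise multiplier are illustrative values rather than the paper's settings.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
X, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
C, sigma, lr = 1.0, 1.0, 0.1                       # clip norm, noise multiplier, step size

sums = [torch.zeros_like(p) for p in model.parameters()]
for xi, yi in zip(X, y):                           # per-sample gradients (microbatching)
    model.zero_grad()
    loss_fn(model(xi[None]), yi[None]).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    for s, g in zip(sums, grads):                  # clip each sample's gradient to norm C
        s += g / torch.clamp(norm / C, min=1.0)

with torch.no_grad():
    for p, s in zip(model.parameters(), sums):     # add noise, average, and step
        noisy = (s + sigma * C * torch.randn_like(s)) / len(X)
        p -= lr * noisy
```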
[LG-46] Improving the Straight-Through Estimator with Zeroth-Order Information NEURIPS2025
链接: https://arxiv.org/abs/2510.23926
作者: Ningfeng Yang,Tor M. Aamodt
类目: Machine Learning (cs.LG)
*备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796$\times$ reduction in computation versus n-SPSA for a 2-layer MLP on MNIST. Code is available at this https URL.
[LG-47] Geometry-Inspired Unified Framework for Discounted and Average Reward MDPs
链接: https://arxiv.org/abs/2510.23914
作者: Arsenii Mustafin,Xinyi Sheng,Dominik Baumann
类目: Machine Learning (cs.LG)
*备注: 12 pages, 1 figure
Abstract:The theoretical analysis of Markov Decision Processes (MDPs) is commonly split into two cases - the average-reward case and the discounted-reward case - which, while sharing similarities, are typically analyzed separately. In this work, we extend a recently introduced geometric interpretation of MDPs for the discounted-reward case to the average-reward case, thereby unifying both. This allows us to extend a major result known for the discounted-reward case to the average-reward case: under a unique and ergodic optimal policy, the Value Iteration algorithm achieves a geometric convergence rate.
[LG-48] Artificial Intelligence Based Predictive Maintenance for Electric Buses
链接: https://arxiv.org/abs/2510.23879
作者: Ayse Irmak Ercevik(TOBB University of Economics and Technology, Ankara, Turkey),Ahmet Murat Ozbayoglu(TOBB University of Economics and Technology, Ankara, Turkey)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predictive maintenance (PdM) is crucial for optimizing efficiency and minimizing downtime of electric buses. While these vehicles provide environmental benefits, they pose challenges for PdM due to complex electric transmission and battery systems. Traditional maintenance, often based on scheduled inspections, struggles to capture anomalies in multi-dimensional real-time CAN Bus data. This study employs a graph-based feature selection method to analyze relationships among CAN Bus parameters of electric buses and investigates the prediction performance of targeted alarms using artificial intelligence techniques. The raw data collected over two years underwent extensive preprocessing to ensure data quality and consistency. A hybrid graph-based feature selection tool was developed by combining statistical filtering (Pearson correlation, Cramer’s V, ANOVA F-test) with optimization-based community detection algorithms (InfoMap, Leiden, Louvain, Fast Greedy). Machine learning models, including SVM, Random Forest, and XGBoost, were optimized through grid and random search with data balancing via SMOTEEN and binary search-based down-sampling. Model interpretability was achieved using LIME to identify the features influencing predictions. The results demonstrate that the developed system effectively predicts vehicle alarms, enhances feature interpretability, and supports proactive maintenance strategies aligned with Industry 4.0 principles.
[LG-49] ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
链接: https://arxiv.org/abs/2510.23818
作者: Yilang Zhang,Xiaodong Yang,Yiwei Cai,Georgios B. Giannakis
类目: Machine Learning (cs.LG)
*备注:
Abstract:As large language models (LLMs) continue to scale in size, the computational overhead has become a major bottleneck for task-specific fine-tuning. While low-rank adaptation (LoRA) effectively curtails this cost by confining the weight updates to a low-dimensional subspace, such a restriction can hinder effectiveness and slow convergence. This contribution deals with these limitations by progressively accumulating a high-rank weight update from consecutive low-rank increments. Specifically, the per-update optimal low-rank matrix is identified to minimize the loss function and closely approximate full fine-tuning. To endow efficient and seamless optimization without restarting, this optimal choice is formed by appropriately scaling the columns of the original low-rank matrix. Rigorous performance guarantees reveal that the optimal scaling can be found analytically. Extensive numerical tests with popular LLMs scaling up to 12 billion parameters demonstrate a consistent performance gain and fast convergence relative to state-of-the-art LoRA variants on diverse tasks including natural language understanding, commonsense reasoning, and mathematical problem solving.
[LG-50] Combining SHAP and Causal Analysis for Interpretable Fault Detection in Industrial Processes
链接: https://arxiv.org/abs/2510.23817
作者: Pedro Cortes dos Santos,Matheus Becali Rocha,Renato A Krohling
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Industrial processes generate complex data that challenge fault detection systems, often yielding opaque or underwhelming results despite advanced machine learning techniques. This study tackles such difficulties using the Tennessee Eastman Process, a well-established benchmark known for its intricate dynamics, to develop an innovative fault detection framework. Initial attempts with standard models revealed limitations in both performance and interpretability, prompting a shift toward a more tractable approach. By employing SHAP (SHapley Additive exPlanations), we transform the problem into a more manageable and transparent form, pinpointing the most critical process features driving fault predictions. This reduction in complexity unlocks the ability to apply causal analysis through Directed Acyclic Graphs, generated by multiple algorithms, to uncover the underlying mechanisms of fault propagation. The resulting causal structures align strikingly with SHAP findings, consistently highlighting key process elements-like cooling and separation systems-as pivotal to fault development. Together, these methods not only enhance detection accuracy but also provide operators with clear, actionable insights into fault origins, a synergy that, to our knowledge, has not been previously explored in this context. This dual approach bridges predictive power with causal understanding, offering a robust tool for monitoring complex manufacturing environments and paving the way for smarter, more interpretable fault detection in industrial systems.
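The first half of the pipeline (ranking process variables by mean |SHAP| value to shrink the search space for the causal-discovery stage) can be sketched compactly. The data below are random stand-ins for the Tennessee Eastman measurements, and the shape of `shap_values` output varies across `shap` versions, which the sketch handles defensively.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))                     # 12 process variables
y = (X[:, 2] + 0.5 * X[:, 7] + 0.1 * rng.normal(size=500) > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)
sv = shap.TreeExplainer(model).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv         # per-class list in some shap versions
importance = np.abs(sv).mean(axis=0)
if importance.ndim > 1:                            # (n_features, n_classes) layout
    importance = importance.mean(axis=1)
top = np.argsort(importance)[::-1][:4]             # reduced set for causal analysis
print("candidate variables for causal analysis:", top)
```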
[LG-51] A Physics-informed Multi-resolution Neural Operator
链接: https://arxiv.org/abs/2510.23810
作者: Sumanta Roy,Bahador Bahmani,Ioannis G. Kevrekidis,Michael D. Shields
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 26 pages, 14 figures, 4 tables
Abstract:The predictive accuracy of operator learning frameworks depends on the quality and quantity of available training data (input-output function pairs), often requiring substantial amounts of high-fidelity data, which can be challenging to obtain in some real-world engineering applications. These datasets may be unevenly discretized from one realization to another, with the grid resolution varying across samples. In this study, we introduce a physics-informed operator learning approach by extending the Resolution Independent Neural Operator (RINO) framework to a fully data-free setup, addressing both challenges simultaneously. Here, the arbitrarily (but sufficiently finely) discretized input functions are projected onto a latent embedding space (i.e., a vector space of finite dimensions), using pre-trained basis functions. The operator associated with the underlying partial differential equations (PDEs) is then approximated by a simple multi-layer perceptron (MLP), which takes as input a latent code along with spatiotemporal coordinates to produce the solution in the physical space. The PDEs are enforced via a finite difference solver in the physical space. The validation and performance of the proposed method are benchmarked on several numerical examples with multi-resolution data, where input functions are sampled at varying resolutions, including both coarse and fine discretizations.
[LG-52] How do simple rotations affect the implicit bias of Adam?
链接: https://arxiv.org/abs/2510.23804
作者: Adela DePavia,Vasileios Charisopoulos,Rebecca Willett
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adaptive gradient methods such as Adam and Adagrad are widely used in machine learning, yet their effect on the generalization of learned models – relative to methods like gradient descent – remains poorly understood. Prior work on binary classification suggests that Adam exhibits a "richness bias," which can help it learn nonlinear decision boundaries closer to the Bayes-optimal decision boundary relative to gradient descent. However, the coordinate-wise preconditioning scheme employed by Adam renders the overall method sensitive to orthogonal transformations of feature space. We show that this sensitivity can manifest as a reversal of Adam's competitive advantage: even small rotations of the underlying data distribution can make Adam forfeit its richness bias and converge to a linear decision boundary that is farther from the Bayes-optimal decision boundary than the one learned by gradient descent. To alleviate this issue, we show that a recently proposed reparameterization method – which applies an orthogonal transformation to the optimization objective – endows any first-order method with equivariance to data rotations, and we empirically demonstrate its ability to restore Adam's bias towards rich decision boundaries.
[LG-53] Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders NEURIPS2025
链接: https://arxiv.org/abs/2510.23802
作者: Nathan Paek,Yongyi Zang,Qihui Yang,Randal Leistikow
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025 Mechanistic Interpretability Workshop
Abstract:While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. While our experiments are limited to the audio modality, our framework can be extended to interpretable analysis of visual latent-space generative models.
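A sparse autoencoder of the kind described (an overcomplete ReLU dictionary trained with an L1 sparsity penalty on the codes) fits in a few lines. The latent dimension, expansion factor, and random stand-in inputs below are illustrative assumptions; real inputs would be frames from an audio codec's latent space.

```python
import torch

torch.manual_seed(0)
d_latent, d_dict, l1 = 64, 512, 1e-3
enc = torch.nn.Linear(d_latent, d_dict)
dec = torch.nn.Linear(d_dict, d_latent, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(200):
    z = torch.randn(128, d_latent)                 # stand-in for codec latents
    f = torch.relu(enc(z))                         # sparse feature activations
    recon = dec(f)
    loss = (recon - z).pow(2).mean() + l1 * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"active features per frame: {(f > 0).float().sum(1).mean():.1f} / {d_dict}")
```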
[LG-54] Revealing the Potential of Learnable Perturbation Ensemble Forecast Model for Tropical Cyclone Prediction
链接: https://arxiv.org/abs/2510.23794
作者: Jun Liu,Tao Zhou,Jiarui Li,Xiaohui Zhong,Peng Zhang,Jie Feng,Lei Chen,Hao Li
类目: Machine Learning (cs.LG)
*备注: 30 pages, 21 figures, 1 table
Abstract:Tropical cyclones (TCs) are highly destructive and inherently uncertain weather systems. Ensemble forecasting helps quantify these uncertainties, yet traditional systems are constrained by high computational costs and limited capability to fully represent atmospheric nonlinearity. FuXi-ENS introduces a learnable perturbation scheme for ensemble generation, representing a novel AI-based forecasting paradigm. Here, we systematically compare FuXi-ENS with ECMWF-ENS using all 90 global TCs in 2018, examining their performance in TC-related physical variables, track and intensity forecasts, and the associated dynamical and thermodynamical fields. FuXi-ENS demonstrates clear advantages in predicting TC-related physical variables, and achieves more accurate track forecasts with reduced ensemble spread, though it still underestimates intensity relative to observations. Further dynamical and thermodynamical analyses reveal that FuXi-ENS better captures large-scale circulation, with moisture turbulent energy more tightly concentrated around the TC warm core, whereas ECMWF-ENS exhibits a more dispersed distribution. These findings highlight the potential of learnable perturbations to improve TC forecasting skill and provide valuable insights for advancing AI-based ensemble prediction of extreme weather events that have significant societal impacts.
[LG-55] Relaxed Sequence Sampling for Diverse Protein Design
链接: https://arxiv.org/abs/2510.23786
作者: Joohwan Ko,Aristofanis Rontogiannis,Yih-En Andrew Ban,Axel Elaldi,Nicholas Franklin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Protein design using structure prediction models such as AlphaFold2 has shown remarkable success, but existing approaches like relaxed sequence optimization (RSO) rely on single-path gradient descent and ignore sequence-space constraints, limiting diversity and designability. We introduce Relaxed Sequence Sampling (RSS), a Markov chain Monte Carlo (MCMC) framework that integrates structural and evolutionary information for protein design. RSS operates in continuous logit space, combining gradient-guided exploration with protein language model-informed jumps. Its energy function couples AlphaFold2-derived structural objectives with ESM2-derived sequence priors, balancing accuracy and biological plausibility. In an in silico protein binder design task, RSS produces 5 \times more designable structures and 2-3 \times greater structural diversity than RSO baselines, at equal computational cost. These results highlight RSS as a principled approach for efficiently exploring the protein design landscape.
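依摘要描述,RSS 在连续 logit 空间里交替进行"梯度引导的局部探索"与"序列先验驱动的跳跃",并用 Metropolis 准则决定取舍。下面是一个示意性草图:energy 中的两项只是可微占位函数(真实工作分别来自 AlphaFold2 结构目标与 ESM2 序列先验),接受准则也做了简化(未含 MALA 的非对称提议校正)。

```python
import numpy as np

L, A = 50, 20                                # 序列长度、氨基酸字母表大小
rng = np.random.default_rng(0)
target = rng.normal(size=(L, A))             # 占位:替代 AF2 的结构目标
prior_logits = rng.normal(size=A)            # 占位:替代 ESM2 的序列先验

def energy(x, w=0.5):                        # 结构能量 + 加权序列先验能量
    return np.mean((x - target) ** 2) + w * np.mean((x - prior_logits) ** 2)

def grad_energy(x, w=0.5):
    return (2 * (x - target) + w * 2 * (x - prior_logits)) / x.size

x = rng.normal(size=(L, A))
eta, temp = 0.05, 1.0
for _ in range(2000):
    if rng.random() < 0.8:                   # 梯度引导的局部移动(带噪声)
        prop = x - eta * grad_energy(x) + 0.05 * rng.normal(size=x.shape)
    else:                                    # 先验驱动的跳跃:整个位置重采样
        prop = x.copy()
        prop[rng.integers(L)] = prior_logits + 0.5 * rng.normal(size=A)
    if np.log(rng.random() + 1e-12) < (energy(x) - energy(prop)) / temp:
        x = prop                             # 简化的 Metropolis 接受
seq = x.argmax(-1)                           # 离散化得到候选序列
print(seq[:10])
```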
[LG-56] Informed Initialization for Bayesian Optimization and Active Learning
链接: https://arxiv.org/abs/2510.23681
作者: Carl Hvarfner,David Eriksson,Eytan Bakshy,Max Balandat
类目: Machine Learning (cs.LG)
*备注: 28 pages
Abstract:Bayesian Optimization is a widely used method for optimizing expensive black-box functions, relying on probabilistic surrogate models such as Gaussian Processes. The quality of the surrogate model is crucial for good optimization performance, especially in the few-shot setting where only a small number of batches of points can be evaluated. In this setting, the initialization plays a critical role in shaping the surrogate’s predictive quality and guiding subsequent optimization. Despite this, practitioners typically rely on (quasi-)random designs to cover the input space. However, such approaches neglect two key factors: (a) space-filling designs may not be desirable to reduce predictive uncertainty, and (b) efficient hyperparameter learning during initialization is essential for high-quality prediction, which may conflict with space-filling designs. To address these limitations, we propose Hyperparameter-Informed Predictive Exploration (HIPE), a novel acquisition strategy that balances predictive uncertainty reduction with hyperparameter learning using information-theoretic principles. We derive a closed-form expression for HIPE in the Gaussian Process setting and demonstrate its effectiveness through extensive experiments in active learning and few-shot BO. Our results show that HIPE outperforms standard initialization strategies in terms of predictive accuracy, hyperparameter identification, and subsequent optimization performance, particularly in large-batch, few-shot settings relevant to many real-world Bayesian Optimization applications.
[LG-57] DBLoss: Decomposition-based Loss Function for Time Series Forecasting NEURIPS2025
链接: https://arxiv.org/abs/2510.23672
作者: Xiangfei Qiu,Xingjian Wu,Hanyin Cheng,Xvyuan Liu,Chenjuan Guo,Jilin Hu,Bin Yang
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025
Abstract:Time series forecasting holds significant value in various domains such as economics, traffic, energy, and AIOps, as accurate predictions facilitate informed decision-making. However, the existing Mean Squared Error (MSE) loss function sometimes fails to accurately capture the seasonality or trend within the forecasting horizon, even when decomposition modules are used in the forward propagation to model the trend and seasonality separately. To address these challenges, we propose a simple yet effective Decomposition-Based Loss function called DBLoss. This method uses exponential moving averages to decompose the time series into seasonal and trend components within the forecasting horizon, and then calculates the loss for each of these components separately, followed by weighting them. As a general loss function, DBLoss can be combined with any deep learning forecasting model. Extensive experiments demonstrate that DBLoss significantly improves the performance of state-of-the-art models across diverse real-world datasets and provides a new perspective on the design of time series loss functions.
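按摘要的思路,可以写出一个极简的 DBLoss 草图:用指数移动平均(EMA)把预测窗口内的序列分解出趋势分量,残差视作季节分量,分别计算 MSE 后加权;其中 alpha 与两个权重均为假设取值,非论文给定。

```python
import torch

def ema(x, alpha=0.3):
    """沿时间维(最后一维)递推指数移动平均,作为趋势分量。"""
    outs = [x[..., 0]]
    for t in range(1, x.shape[-1]):
        outs.append(alpha * x[..., t] + (1 - alpha) * outs[-1])
    return torch.stack(outs, dim=-1)

def db_loss(pred, target, alpha=0.3, w_trend=1.0, w_season=1.0):
    trend_p, trend_t = ema(pred, alpha), ema(target, alpha)
    season_p, season_t = pred - trend_p, target - trend_t   # 残差视作季节分量
    loss_trend = ((trend_p - trend_t) ** 2).mean()
    loss_season = ((season_p - season_t) ** 2).mean()
    return w_trend * loss_trend + w_season * loss_season

pred = torch.randn(8, 96, requires_grad=True)   # (batch, 预测窗口长度)
target = torch.randn(8, 96)
loss = db_loss(pred, target)
loss.backward()                                  # 可直接替换任意深度预测模型的 MSE
print(loss.item())
```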
[LG-58] AI-Driven Carbon Monitoring: Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
链接: https://arxiv.org/abs/2510.23663
作者: Padmanabhan Jagannathan Prajesh,Kaliaperumal Ragunath,Miriam Gordon,Bruce Rathgeber,Suresh Neethirajan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate mapping of column-averaged CO2 (XCO2) over agricultural landscapes is essential for guiding emission mitigation strategies. We present a Spatiotemporal Vision Transformer with Wavelets (ST-ViWT) framework that reconstructs continuous, uncertainty-quantified XCO2 fields from OCO-2 across southern Canada, emphasizing poultry-intensive regions. The model fuses wavelet time-frequency representations with transformer attention over meteorology, vegetation indices, topography, and land cover. On 2024 OCO-2 data, ST-ViWT attains R2 = 0.984 and RMSE = 0.468 ppm; 92.3 percent of gap-filled predictions lie within +/-1 ppm. Independent validation with TCCON shows robust generalization (bias = -0.14 ppm; r = 0.928), including faithful reproduction of the late-summer drawdown. Spatial analysis across 14 poultry regions reveals a moderate positive association between facility density and XCO2 (r = 0.43); high-density areas exhibit larger seasonal amplitudes (9.57 ppm) and enhanced summer variability. Compared with conventional interpolation and standard machine-learning baselines, ST-ViWT yields seamless 0.25 degree CO2 surfaces with explicit uncertainties, enabling year-round coverage despite sparse observations. The approach supports integration of satellite constraints with national inventories and precision livestock platforms to benchmark emissions, refine region-specific factors, and verify interventions. Importantly, transformer-based Earth observation enables scalable, transparent, spatially explicit carbon accounting, hotspot prioritization, and policy-relevant mitigation assessment.
[LG-59] JiuTian Chuanliu: A Large Spatiotemporal Model for General-purpose Dynamic Urban Sensing
链接: https://arxiv.org/abs/2510.23662
作者: Liangzhe Han,Leilei Sun,Tongyu Zhu,Tao Tao,Jibin Wang,Weifeng Lv
类目: Social and Information Networks (cs.SI); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:As a window for urban sensing, human mobility contains rich spatiotemporal information that reflects both residents’ behavior preferences and the functions of urban areas. The analysis of human mobility has attracted the attention of many researchers. However, existing methods often address specific tasks from a particular perspective, leading to insufficient modeling of human mobility and limited applicability of the learned knowledge in various downstream applications. To address these challenges, this paper proposes to push massive amounts of human mobility data into a spatiotemporal model, discover latent semantics behind mobility behavior, and support various urban sensing tasks. Specifically, a large-scale, wide-coverage human mobility dataset is collected through the ubiquitous base station system, and a framework named General-purpose and Dynamic Human Mobility Embedding (GDHME) for urban sensing is introduced. The framework follows the self-supervised learning idea and contains two major stages. In stage 1, GDHME treats people and regions as nodes within a dynamic graph, unifying human mobility data as people-region-time interactions. An encoder operating in continuous time dynamically computes evolving node representations, capturing dynamic states for both people and regions. Moreover, an autoregressive self-supervised task is specially designed to guide the learning of the general-purpose node embeddings. In stage 2, these representations are utilized to support various tasks. To evaluate the effectiveness of our GDHME framework, we further construct a multi-task urban sensing benchmark. Offline experiments demonstrate GDHME’s ability to automatically learn valuable node features from vast amounts of data. Furthermore, our framework is used to deploy the JiuTian ChuanLiu Big Model, a system that has been presented at the 2023 China Mobile Worldwide Partner Conference.
[LG-60] A machine learning framework integrating seed traits and plasma parameters for predicting germination uplift in crops
链接: https://arxiv.org/abs/2510.23657
作者: Saklain Niam,Tashfiqur Rahman,Md. Amjad Patwary,Mukarram Hossain
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cold plasma (CP) is an eco-friendly method to enhance seed germination, yet outcomes remain difficult to predict due to complex seed–plasma–environment interactions. This study introduces the first machine learning framework to forecast germination uplift in soybean, barley, sunflower, radish, and tomato under dielectric barrier discharge (DBD) plasma. Among the models tested (GB, XGB, ET, and hybrids), Extra Trees (ET) performed best (R^2 = 0.919; RMSE = 3.21; MAE = 2.62), improving to R^2 = 0.925 after feature reduction. Engineering analysis revealed a hormetic response: negligible effects at \leq 7 kV or \leq 200 s, maximum germination at 7–15 kV for 200–500 s, and reduced germination beyond 20 kV or prolonged exposures. Discharge power was also a dominant factor, with germination rate maximized at \geq 100 W with low exposure times. Species and cultivar-level predictions showed radish (MAE = 1.46) and soybean (MAE = 2.05) were modeled with high consistency, while sunflower remained slightly more variable (MAE = 3.80). Among cultivars, Williams (MAE = 1.23) and Sari (1.33) were well predicted, while Arian (2.86) and Nyírségi fekete (3.74) were comparatively poorly captured. This framework was also embedded into MLflow, providing a decision-support tool for optimizing CP seed germination in precision agriculture.
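下面用合成数据给出该建模思路的最小示意(假设性草图,特征与响应均为占位):特征由"等离子体参数 + 种子性状"组成,响应构造成剂量-效应(hormetic)形态,仅演示 ExtraTrees 回归与特征重要性的用法。

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(5, 25, n),      # 电压 kV(假设特征)
    rng.uniform(50, 800, n),    # 处理时长 s
    rng.uniform(20, 200, n),    # 放电功率 W
    rng.uniform(2, 12, n),      # 种子性状(占位,如千粒重)
])
# 构造一个"剂量-效应"式的合成响应,模拟 hormetic 形态
uplift = 10 * np.exp(-((X[:, 0] - 11) / 5) ** 2) + 0.01 * X[:, 2] \
         + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, uplift, random_state=0)
model = ExtraTreesRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"R2={r2_score(y_te, pred):.3f}  MAE={mean_absolute_error(y_te, pred):.3f}")
print("feature importance:", model.feature_importances_.round(3))
```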
[LG-61] DiNo and RanBu: Lightweight Predictions from Shallow Random Forests
链接: https://arxiv.org/abs/2510.23624
作者: Tiago Mendonça dos Santos,Rafael Izbicki,Luís Gustavo Esteves
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Random Forest ensembles are a strong baseline for tabular prediction tasks, but their reliance on hundreds of deep trees often results in high inference latency and memory demands, limiting deployment in latency-sensitive or resource-constrained environments. We introduce DiNo (Distance with Nodes) and RanBu (Random Bushes), two shallow-forest methods that convert a small set of depth-limited trees into efficient, distance-weighted predictors. DiNo measures cophenetic distances via the most recent common ancestor of observation pairs, while RanBu applies kernel smoothing to Breiman’s classical proximity measure. Both approaches operate entirely after forest training: no additional trees are grown, and tuning of the single bandwidth parameter h requires only lightweight matrix-vector operations. Across three synthetic benchmarks and 25 public datasets, RanBu matches or exceeds the accuracy of full-depth random forests – particularly in high-noise settings – while reducing training plus inference time by up to 95%. DiNo achieves the best bias-variance trade-off in low-noise regimes at a modest computational cost. Both methods extend directly to quantile regression, maintaining accuracy with substantial speed gains. The implementation is available as an open-source R/C++ package at this https URL. We focus on structured tabular random samples (i.i.d.), leaving extensions to other modalities for future work.
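按摘要描述的 RanBu 思路,可写出如下 Python 示意草图(非论文的 R/C++ 实现):以 Breiman 邻近度(两点落入同一叶节点的树所占比例)为相似度,经带宽 h 的指数核平滑得到权重,再对训练标签做加权平均;核形式与 h 的取值均为假设。

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=400, n_features=10, noise=10, random_state=0)
X_tr, y_tr, X_te = X[:300], y[:300], X[300:]

# 浅层森林:少量深度受限的树
forest = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X_tr, y_tr)

leaves_tr = forest.apply(X_tr)            # (n_tr, n_trees) 叶节点编号
leaves_te = forest.apply(X_te)

def ranbu_predict(leaves_q, leaves_ref, y_ref, h=0.1):
    # 邻近度:与参考点落入同一叶节点的树的比例;再用指数核平滑
    prox = (leaves_q[:, None, :] == leaves_ref[None, :, :]).mean(-1)
    w = np.exp(-(1.0 - prox) / h)
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_ref

pred = ranbu_predict(leaves_te, leaves_tr, y_tr)
print(pred[:5].round(2))
```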
[LG-62] Adversarially-Aware Architecture Design for Robust Medical AI Systems
链接: https://arxiv.org/abs/2510.23622
作者: Alyssa Gerhart,Balaji Iyangar
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Adversarial attacks pose a severe risk to AI systems used in healthcare, capable of misleading models into dangerous misclassifications that can delay treatments or cause misdiagnoses. These attacks, often imperceptible to human perception, threaten patient safety, particularly in underserved populations. Our study explores these vulnerabilities through empirical experimentation on a dermatological dataset, where adversarial methods significantly reduce classification accuracy. Through detailed threat modeling, experimental benchmarking, and model evaluation, we demonstrate both the severity of the threat and the partial success of defenses like adversarial training and distillation. Our results show that while defenses reduce attack success rates, they must be balanced against model performance on clean data. We conclude with a call for integrated technical, ethical, and policy-based approaches to build more resilient, equitable AI in healthcare.
[LG-63] A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization NEURIPS2025
链接: https://arxiv.org/abs/2510.24710
作者: Wei Shen,Jiawei Zhang,Minhui Huang,Cong Shen
类目: Optimization and Control (math.OC); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025
Abstract:We study bilevel optimization problems where the lower-level problems are strongly convex and have coupled linear constraints. To overcome the potential non-smoothness of the hyper-objective and the computational challenges associated with the Hessian matrix, we utilize penalty and augmented Lagrangian methods to reformulate the original problem as a single-level one. In particular, we establish a strong theoretical connection between the reformulated function and the original hyper-objective by characterizing the closeness of their values and derivatives. Based on this reformulation, we propose a single-loop, first-order algorithm for linearly constrained bilevel optimization (SFLCB). We provide rigorous analyses of its non-asymptotic convergence rates, showing an improvement over prior double-loop algorithms – from O(\epsilon^{-3}\log(\epsilon^{-1})) to O(\epsilon^{-3}). The experiments corroborate our theoretical findings and demonstrate the practical efficiency of the proposed SFLCB algorithm. Simulation code is provided at this https URL.
[LG-64] Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation
链接: https://arxiv.org/abs/2510.24616
作者: Jean Barbier,Francesco Camilli,Minh-Toan Nguyen,Mauro Pastore,Rudy Skerk
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 30 pages, 19 figures + appendix. This submission supersedes both arXiv:2505.24849 and arXiv:2501.18530
Abstract:For three decades statistical physics has been providing a framework to analyse neural networks. A long-standing question remained on its capacity to tackle deep learning models capturing rich feature learning effects, thus going beyond the narrow networks or kernel methods analysed until now. We positively answer through the study of the supervised learning of a multi-layer perceptron. Importantly, (i) its width scales as the input dimension, making it more prone to feature learning than ultra wide networks, and more expressive than narrow ones or with fixed embedding layers; and (ii) we focus on the challenging interpolation regime where the number of trainable parameters and data are comparable, which forces the model to adapt to the task. We consider the matched teacher-student setting. It provides the fundamental limits of learning random deep neural network targets and helps in identifying the sufficient statistics describing what is learnt by an optimally trained network as the data budget increases. A rich phenomenology emerges with various learning transitions. With enough data optimal performance is attained through model’s “specialisation” towards the target, but it can be hard to reach for training algorithms which get attracted by sub-optimal solutions predicted by the theory. Specialisation occurs inhomogeneously across layers, propagating from shallow towards deep ones, but also across neurons in each layer. Furthermore, deeper targets are harder to learn. Despite its simplicity, the Bayesian-optimal setting provides insights on how the depth, non-linearity and finite (proportional) width influence neural networks in the feature learning regime that are potentially relevant way beyond it.
[LG-65] Comparison of generalised additive models and neural networks in applications: A systematic review
链接: https://arxiv.org/abs/2510.24601
作者: Jessica Doohan,Lucas Kook,Kevin Burke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Neural networks have become a popular tool in predictive modelling, more commonly associated with machine learning and artificial intelligence than with statistics. Generalised Additive Models (GAMs) are flexible non-linear statistical models that retain interpretability. Both are state-of-the-art in their own right, with their respective advantages and disadvantages. This paper analyses how these two model classes have performed on real-world tabular data. Following PRISMA guidelines, we conducted a systematic review of papers that performed empirical comparisons of GAMs and neural networks. Eligible papers were identified, yielding 143 papers, with 430 datasets. Key attributes at both paper and dataset levels were extracted and reported. Beyond summarising comparisons, we analyse reported performance metrics using mixed-effects modelling to investigate potential characteristics that can explain and quantify observed differences, including application area, study year, sample size, number of predictors, and neural network complexity. Across datasets, no consistent evidence of superiority was found for either GAMs or neural networks when considering the most frequently reported metrics (RMSE, R^2, and AUC). Neural networks tended to outperform in larger datasets and in those with more predictors, but this advantage narrowed over time. Conversely, GAMs remained competitive, particularly in smaller data settings, while retaining interpretability. Reporting of dataset characteristics and neural network complexity was incomplete in much of the literature, limiting transparency and reproducibility. This review highlights that GAMs and neural networks should be viewed as complementary approaches rather than competitors. For many tabular applications, the performance trade-off is modest, and interpretability may favour GAMs.
[LG-66] Unsupervised Machine-Learning Pipeline for Data-Driven Defect Detection and Characterisation: Application to Displacement Cascades
链接: https://arxiv.org/abs/2510.24523
作者: Samuel Del Fré,Andrée de Backer,Christophe Domain,Ludovic Thuinet,Charlotte S. Becquart
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 22 pages, 1 graphical abstract, 7 figures, 4 tables
Abstract:Neutron irradiation produces, within a few picoseconds, displacement cascades that are sequences of atomic collisions generating point and extended defects that subsequently affect the long-term evolution of materials. The diversity of these defects, characterized morphologically and statistically, defines what is called the “primary damage”. In this work, we present a fully unsupervised machine learning (ML) workflow that detects and classifies these defects directly from molecular dynamics data. Local environments are encoded by the Smooth Overlap of Atomic Positions (SOAP) vector, anomalous atoms are isolated with autoencoder neural networks (AE), embedded with Uniform Manifold Approximation and Projection (UMAP) and clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Applied to 80 keV displacement cascades in Ni, Fe70Ni10Cr20, and Zr, the AE successfully identifies the small fraction of outlier atoms that participate in defect formation. HDBSCAN then partitions the UMAP latent space of AE-flagged SOAP descriptors into well-defined groups representing vacancy- and interstitial-dominated regions and, within each, separates small from large aggregates, assigning 99.7% of outliers to compact physical motifs. A signed cluster-identification score confirms this separation, and cluster size scales with net defect counts (R^2 > 0.89). Statistical cross analyses between the ML outlier map and several conventional detectors (centrosymmetry, dislocation extraction, etc.) reveal strong overlap and complementary coverage, all achieved without template or threshold tuning. This ML workflow thus provides an efficient tool for the quantitative mapping of structural anomalies in materials, particularly those arising from irradiation damage in displacement cascades.
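下面按该流程的思路串起"描述子 -> 自编码器找离群 -> UMAP 降维 -> HDBSCAN 聚类"的最小示意。SOAP 描述子此处用随机向量占位(真实场景可用 dscribe 等工具计算),阈值与所有超参数均为假设;依赖 umap-learn 与 hdbscan 两个第三方包。

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
import umap
import hdbscan

rng = np.random.default_rng(0)
bulk = rng.normal(0, 1, (2000, 64))            # 占位:正常原子的 SOAP 向量
defect = rng.normal(2.5, 1, (60, 64))          # 占位:缺陷原子
X = np.vstack([bulk, defect])

# 自编码器:重构误差大的原子视作"离群"候选缺陷
ae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
ae.fit(X, X)
err = np.mean((ae.predict(X) - X) ** 2, axis=1)
outliers = X[err > np.quantile(err, 0.97)]     # 假设:取误差最高的 3%

# UMAP 嵌入 + HDBSCAN 聚类,把离群点分成不同缺陷类型
emb = umap.UMAP(n_components=2, random_state=0).fit_transform(outliers)
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(emb)
print("clusters found:", sorted(set(labels)))
```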
[LG-67] Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations
链接: https://arxiv.org/abs/2510.24466
作者: Alexandru Crăciun,Debarghya Ghoshdastidar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:The theory of training deep networks has become a central question of modern machine learning and has inspired many practical advancements. In particular, the gradient descent (GD) optimization algorithm has been extensively studied in recent years. A key assumption about GD has appeared in several recent works: the GD map is non-singular – it preserves sets of measure zero under preimages. Crucially, this assumption has been used to prove that GD avoids saddle points and maxima, and to establish the existence of a computable quantity that determines the convergence to global minima (both for GD and stochastic GD). However, the current literature either assumes the non-singularity of the GD map or imposes restrictive assumptions, such as Lipschitz smoothness of the loss (for example, Lipschitzness does not hold for deep ReLU networks with the cross-entropy loss) and restricts the analysis to GD with small step-sizes. In this paper, we investigate the neural network map as a function on the space of weights and biases. We also prove, for the first time, the non-singularity of the gradient descent (GD) map on the loss landscape of realistic neural network architectures (with fully connected, convolutional, or softmax attention layers) and piecewise analytic activations (which includes sigmoid, ReLU, leaky ReLU, etc.) for almost all step-sizes. Our work significantly extends the existing results on the convergence of GD and SGD by guaranteeing that they apply to practical neural network settings and has the potential to unlock further exploration of learning dynamics.
[LG-68] Nearest Neighbor Matching as Least Squares Density Ratio Estimation and Riesz Regression
链接: https://arxiv.org/abs/2510.24433
作者: Masahiro Kato
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:This study proves that Nearest Neighbor (NN) matching can be interpreted as an instance of Riesz regression for automatic debiased machine learning. Lin et al. (2023) shows that NN matching is an instance of density-ratio estimation with their new density-ratio estimator. Chernozhukov et al. (2024) develops Riesz regression for automatic debiased machine learning, which directly estimates the Riesz representer (or equivalently, the bias-correction term) by minimizing the mean squared error. In this study, we first prove that the density-ratio estimation method proposed in Lin et al. (2023) is essentially equivalent to Least-Squares Importance Fitting (LSIF) proposed in Kanamori et al. (2009) for direct density-ratio estimation. Furthermore, we derive Riesz regression using the LSIF framework. Based on these results, we derive NN matching from Riesz regression. This study is based on our work Kato (2025a) and Kato (2025b).
[LG-69] Problem-Parameter-Free Decentralized Bilevel Optimization NEURIPS2025
链接: https://arxiv.org/abs/2510.24288
作者: Zhiwei Zhai,Wenjing Yan,Ying-Jun Angela Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by NeurIPS 2025
Abstract:Decentralized bilevel optimization has garnered significant attention due to its critical role in solving large-scale machine learning problems. However, existing methods often rely on prior knowledge of problem parameters – such as smoothness, convexity, or communication network topologies – to determine appropriate stepsizes. In practice, these problem parameters are typically unavailable, leading to substantial manual effort for hyperparameter tuning. In this paper, we propose AdaSDBO, a fully problem-parameter-free algorithm for decentralized bilevel optimization with a single-loop structure. AdaSDBO leverages adaptive stepsizes based on cumulative gradient norms to update all variables simultaneously, dynamically adjusting its progress and eliminating the need for problem-specific hyperparameter tuning. Through rigorous theoretical analysis, we establish that AdaSDBO achieves a convergence rate of \widetilde{\mathcal{O}}\left(\frac{1}{T}\right), matching the performance of well-tuned state-of-the-art methods up to polylogarithmic factors. Extensive numerical experiments demonstrate that AdaSDBO delivers competitive performance compared to existing decentralized bilevel optimization methods while exhibiting remarkable robustness across diverse stepsize configurations.
[LG-70] Towards actionable hypotension prediction – predicting catecholamine therapy initiation in the intensive care unit ALT
链接: https://arxiv.org/abs/2510.24287
作者: Richard Koebe,Noah Saibel,Juan Miguel Lopez Alcaraz,Simon Schäfer,Nils Strodthoff
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 27 pages, 8 figures, source code under this https URL
Abstract:Hypotension in critically ill ICU patients is common and life-threatening. Escalation to catecholamine therapy marks a key management step, with both undertreatment and overtreatment posing risks. Most machine learning (ML) models predict hypotension using fixed MAP thresholds or MAP forecasting, overlooking the clinical decision behind treatment escalation. Predicting catecholamine initiation, the start of vasoactive or inotropic agent administration, offers a more clinically actionable target reflecting real decision-making. Using the MIMIC-III database, we modeled catecholamine initiation as a binary event within a 15-minute prediction window. Input features included statistical descriptors from a two-hour sliding MAP context window, along with demographics, biometrics, comorbidities, and ongoing treatments. An Extreme Gradient Boosting (XGBoost) model was trained and interpreted via SHapley Additive exPlanations (SHAP). The model achieved an AUROC of 0.822 (0.813-0.830), outperforming the hypotension baseline (MAP < 65, AUROC 0.686 [0.675-0.699]). SHAP analysis highlighted recent MAP values, MAP trends, and ongoing treatments (e.g., sedatives, electrolytes) as dominant predictors. Subgroup analysis showed higher performance in males, younger patients (< 53 years), those with higher BMI (> 32), and patients without comorbidities or concurrent medications. Predicting catecholamine initiation based on MAP dynamics, treatment context, and patient characteristics supports the critical decision of when to escalate therapy, shifting focus from threshold-based alarms to actionable decision support. This approach is feasible across a broad ICU cohort under natural event imbalance. Future work should enrich temporal and physiological context, extend label definitions to include therapy escalation, and benchmark against existing hypotension prediction systems.
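按摘要的特征设计思路,下面用合成数据给出一个示意草图(非论文实现):两小时 MAP 滑窗的统计特征加上少量病人信息,训练 XGBoost 二分类器;数据、标签构造与全部超参数均为占位假设。

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
map_win = rng.normal(75, 12, (n, 24))          # 假设:两小时窗,5 分钟一个 MAP 采样
feat = np.column_stack([
    map_win.mean(1), map_win.std(1), map_win.min(1),
    map_win[:, -1] - map_win[:, 0],            # 窗口内趋势
    rng.integers(18, 90, n),                   # 年龄(占位)
    rng.integers(0, 2, n),                     # 是否使用镇静剂(占位)
])
y = (map_win.mean(1) < 72).astype(int)         # 合成标签,仅为演示事件不平衡

X_tr, X_te, y_tr, y_te = train_test_split(feat, y, stratify=y, random_state=0)
clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                        eval_metric="auc")
clf.fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]).round(3))
```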
[LG-71] Forecasting precipitation in the Arctic using probabilistic machine learning informed by causal climate drivers
链接: https://arxiv.org/abs/2510.24254
作者: Madhurima Panja,Dhiman Das,Tanujit Chakraborty,Arnob Ray,R. Athulya,Chittaranjan Hens,Syamal K. Dana,Nuncio Murukesh,Dibakar Ghosh
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:Understanding and forecasting precipitation events in the Arctic maritime environments, such as Bear Island and Ny-Ålesund, is crucial for assessing climate risk and developing early warning systems in vulnerable marine regions. This study proposes a probabilistic machine learning framework for modeling and predicting the dynamics and severity of precipitation. We begin by analyzing the scale-dependent relationships between precipitation and key atmospheric drivers (e.g., temperature, relative humidity, cloud cover, and air pressure) using wavelet coherence, which captures localized dependencies across time and frequency domains. To assess joint causal influences, we employ Synergistic-Unique-Redundant Decomposition, which quantifies the impact of interaction effects among each variable on future precipitation dynamics. These insights inform the development of data-driven forecasting models that incorporate both historical precipitation and causal climate drivers. To account for uncertainty, we employ the conformal prediction method, which enables the generation of calibrated non-parametric prediction intervals. Our results underscore the importance of utilizing a comprehensive framework that combines causal analysis with probabilistic forecasting to enhance the reliability and interpretability of precipitation predictions in Arctic marine environments.
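摘要提到用共形预测(conformal prediction)生成校准的非参数预测区间;下面给出最常见的分裂式共形(split conformal)示意草图,基模型与数据均为占位,可替换为文中融合因果气候驱动特征的任意点预测模型。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=8, noise=15, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

alpha = 0.1                                    # 目标覆盖率 90%
scores = np.abs(y_cal - model.predict(X_cal))  # 校准集残差作为符合性分数
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q = np.sort(scores)[min(k, len(scores)) - 1]   # 共形分位数

x_new = X[:5]
pred = model.predict(x_new)
lower, upper = pred - q, pred + q              # 校准后的预测区间
print(np.c_[lower, pred, upper].round(1))
```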
[LG-72] Self-Concordant Perturbations for Linear Bandits
链接: https://arxiv.org/abs/2510.24187
作者: Lucas Lévy(1 and 2),Jean-Lou Valeau(1 and 3),Arya Akhavan(1 and 2),Patrick Rebeschini(1) ((1) University of Oxford, United Kingdom, (2) École Polytechnique, IP Paris, France, (3) ENSAE, IP Paris, France)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the adversarial linear bandits problem and present a unified algorithmic framework that bridges Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) methods, extending the known connection between them from the full-information setting. Within this framework, we introduce self-concordant perturbations, a family of probability distributions that mirror the role of self-concordant barriers previously employed in the FTRL-based SCRiBLe algorithm. Using this idea, we design a novel FTPL-based algorithm that combines self-concordant regularization with efficient stochastic exploration. Our approach achieves a regret of O(d\sqrt{n \ln n}) on both the d-dimensional hypercube and the Euclidean ball. On the Euclidean ball, this matches the rate attained by existing self-concordant FTRL methods. For the hypercube, this represents a \sqrt{d} improvement over these methods and matches the optimal bound up to logarithmic factors.
[LG-73] Deep Learning-Enhanced Calibration of the Heston Model: A Unified Framework
链接: https://arxiv.org/abs/2510.24074
作者: Arman Zadgar,Somayeh Fallah,Farshid Mehrdoust
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG)
*备注:
Abstract:The Heston stochastic volatility model is a widely used tool in financial mathematics for pricing European options. However, its calibration remains computationally intensive and sensitive to local minima due to the model’s nonlinear structure and high-dimensional parameter space. This paper introduces a hybrid deep learning-based framework that enhances both the computational efficiency and the accuracy of the calibration procedure. The proposed approach integrates two supervised feedforward neural networks: the Price Approximator Network (PAN), which approximates the option price surface based on strike and moneyness inputs, and the Calibration Correction Network (CCN), which refines the Heston model’s output by correcting systematic pricing errors. Experimental results on real S&P 500 option data demonstrate that the deep learning approach outperforms traditional calibration techniques across multiple error metrics, achieving faster convergence and superior generalization in both in-sample and out-of-sample settings. This framework offers a practical and robust solution for real-time financial model calibration.
[LG-74] Copula-Stein Discrepancy: A Generator-Based Stein Operator for Archimedean Dependence
链接: https://arxiv.org/abs/2510.24056
作者: Agnideep Aich,Ashit Baran Aich
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Kernel Stein discrepancies (KSDs) have become a principal tool for goodness-of-fit testing, but standard KSDs are often insensitive to higher-order dependency structures, such as tail dependence, which are critical in many scientific and financial domains. We address this gap by introducing the Copula-Stein Discrepancy (CSD), a novel class of discrepancies tailored to the geometry of statistical dependence. By defining a Stein operator directly on the copula density, CSD leverages the generative structure of dependence, rather than relying on the joint density’s score function. For the broad class of Archimedean copulas, this approach yields a closed-form Stein kernel derived from the scalar generator function. We provide a comprehensive theoretical analysis, proving that CSD (i) metrizes weak convergence of copula distributions, ensuring it detects any mismatch in dependence; (ii) has an empirical estimator that converges at the minimax optimal rate of O_P(n^{-1/2}); and (iii) is provably sensitive to differences in tail dependence coefficients. The framework is extended to general non-Archimedean copulas, including elliptical and vine copulas. Computationally, the exact CSD kernel evaluation scales linearly in dimension, while a novel random feature approximation reduces the n-dependence from quadratic O(n^2) to near-linear \tilde{O}(n), making CSD a practical and theoretically principled tool for dependence-aware inference.
[LG-75] Score-based constrained generative modeling via Langevin diffusions with boundary conditions
链接: https://arxiv.org/abs/2510.23985
作者: Adam Nordenhög,Akash Sharma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Score-based generative models based on stochastic differential equations (SDEs) achieve impressive performance in sampling from unknown distributions, but often fail to satisfy underlying constraints. We propose a constrained generative model using kinetic (underdamped) Langevin dynamics with specular reflection of velocity on the boundary defining constraints. This results in a piecewise continuously differentiable noising and denoising process, where the latter is characterized by a time-reversed dynamics restricted to a domain with boundary due to the specular boundary condition. In addition, we contribute to existing reflected-SDE based constrained generative models, where the stochastic dynamics is restricted through an abstract local time term. By presenting efficient numerical samplers which converge at the optimal rate in the discretization step, we provide a comprehensive comparison of models based on confined (specularly reflected kinetic) Langevin diffusion with models based on reflected diffusion with local time.
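下面用一维区间 [0, 1] 上的占位势能给出"镜面反射 + 欠阻尼 Langevin"的最小示意(假设性草图,与论文的生成模型训练无关,仅演示反射边界条件:位置折回域内、法向速度反号)。

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(x):                 # 占位势能 U(x) = (x - 0.3)^2 的梯度
    return 2.0 * (x - 0.3)

gamma, dt, n_steps = 1.0, 1e-3, 20000
x, v = 0.5, 0.0
samples = []
for _ in range(n_steps):
    # 欠阻尼 Langevin 的 Euler–Maruyama 离散(示意用,非高阶格式)
    v += (-grad_U(x) - gamma * v) * dt + np.sqrt(2 * gamma * dt) * rng.normal()
    x += v * dt
    # 镜面反射边界条件:越界时位置折回域内,速度反号
    while x < 0.0 or x > 1.0:
        x = -x if x < 0.0 else 2.0 - x
        v = -v
    samples.append(x)
print("mean:", np.mean(samples[5000:]).round(3))
```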
[LG-76] Understanding Fairness and Prediction Error through Subspace Decomposition and Influence Analysis
链接: https://arxiv.org/abs/2510.23935
作者: Enze Shi,Pankaj Bhagwat,Zhixian Yang,Linglong Kong,Bei Jiang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning models have achieved widespread success but often inherit and amplify historical biases, resulting in unfair outcomes. Traditional fairness methods typically impose constraints at the prediction level, without addressing underlying biases in data representations. In this work, we propose a principled framework that adjusts data representations to balance predictive utility and fairness. Using sufficient dimension reduction, we decompose the feature space into target-relevant, sensitive, and shared components, and control the fairness-utility trade-off by selectively removing sensitive information. We provide a theoretical analysis of how prediction error and fairness gaps evolve as shared subspaces are added, and employ influence functions to quantify their effects on the asymptotic behavior of parameter estimates. Experiments on both synthetic and real-world datasets validate our theoretical insights and show that the proposed method effectively improves fairness while preserving predictive performance.
[LG-77] Inferring Group Intent as a Cooperative Game. An NLP-based Framework for Trajectory Analysis using Graph Transformer Neural Network
链接: https://arxiv.org/abs/2510.23905
作者: Yiming Zhang,Vikram Krishnamurthy,Shashwat Jain
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This paper studies group target trajectory intent as the outcome of a cooperative game where the complex spatio-temporal trajectories are modeled using an NLP-based generative model. In our framework, the group intent is specified by the characteristic function of a cooperative game, and allocations for players in the cooperative game are specified by either the core, the Shapley value, or the nucleolus. The resulting allocations induce probability distributions that govern the coordinated spatio-temporal trajectories of the targets that reflect the group’s underlying intent. We address two key questions: (1) How can the intent of a group trajectory be optimally formalized as the characteristic function of a cooperative game? (2) How can such intent be inferred from noisy observations of the targets? To answer the first question, we introduce a Fisher-information-based characteristic function of the cooperative game, which yields probability distributions that generate coordinated spatio-temporal patterns. As a generative model for these patterns, we develop an NLP-based generative model built on formal grammar, enabling the creation of realistic multi-target trajectory data. To answer the second question, we train a Graph Transformer Neural Network (GTNN) to infer group trajectory intent – expressed as the characteristic function of the cooperative game – from observational data with high accuracy. The self-attention function of the GTNN depends on the track estimates. Thus, the formulation and algorithms provide a multi-layer approach that spans target tracking (Bayesian signal processing) and the GTNN (for group intent inference).
[LG-78] Testing-driven Variable Selection in Bayesian Modal Regression
链接: https://arxiv.org/abs/2510.23831
作者: Jiasong Duan,Hongmei Zhang,Xianzheng Huang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 30 pages, 2 figures, preprint under review
Abstract:We propose a Bayesian variable selection method in the framework of modal regression for heavy-tailed responses. An efficient expectation-maximization algorithm is employed to expedite parameter estimation. A test statistic is constructed to exploit the shape of the model error distribution to effectively separate informative covariates from unimportant ones. Through simulations, we demonstrate and evaluate the efficacy of the proposed method in identifying important covariates in the presence of non-Gaussian model errors. Finally, we apply the proposed method to analyze two datasets arising in genetic and epigenetic studies.
[LG-79] Re-envisioning Euclid Galaxy Morphology: Identifying and Interpreting Features with Sparse Autoencoders NEURIPS
链接: https://arxiv.org/abs/2510.23749
作者: John F. Wu,Michael Walmsley
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS Machine Learning and the Physical Sciences Workshop
Abstract:Sparse Autoencoders (SAEs) can efficiently identify candidate monosemantic features from pretrained neural networks for galaxy morphology. We demonstrate this on Euclid Q1 images using both supervised (Zoobot) and new self-supervised (MAE) models. Our publicly released MAE achieves superhuman image reconstruction performance. While a Principal Component Analysis (PCA) on the supervised model primarily identifies features already aligned with the Galaxy Zoo decision tree, SAEs can identify interpretable features outside of this framework. SAE features also show stronger alignment than PCA with Galaxy Zoo labels. Although challenges in interpretability remain, SAEs provide a powerful engine for discovering astrophysical phenomena beyond the confines of human-defined classification.
[LG-80] Bayesian neural networks with interpretable priors from Mercer kernels
链接: https://arxiv.org/abs/2510.23745
作者: Alex Alberts,Ilias Bilionis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Quantifying the uncertainty in the output of a neural network is essential for deployment in scientific or engineering applications where decisions must be made under limited or noisy data. Bayesian neural networks (BNNs) provide a framework for this purpose by constructing a Bayesian posterior distribution over the network parameters. However, the prior, which is of key importance in any Bayesian setting, is rarely meaningful for BNNs. This is because the complexity of the input-to-output map of a BNN makes it difficult to understand how certain distributions enforce any interpretable constraint on the output space. Gaussian processes (GPs), on the other hand, are often preferred in uncertainty quantification tasks due to their interpretability. The drawback is that GPs are limited to small datasets without advanced techniques, which often rely on the covariance kernel having a specific structure. To address these challenges, we introduce a new class of priors for BNNs, called Mercer priors, such that the resulting BNN has samples that approximate those of a specified GP. The method works by defining a prior directly over the network parameters from the Mercer representation of the covariance kernel, and does not rely on the network having a specific structure. In doing so, we can exploit the scalability of BNNs in a meaningful Bayesian way.
[LG-81] In Search of the Unknown Unknowns: A Multi-Metric Distance Ensemble for Out of Distribution Anomaly Detection in Astronomical Surveys NEURIPS
链接: https://arxiv.org/abs/2510.23702
作者: Siddharth Chaini,Federica B. Bianco,Ashish Mahabal
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures, Accepted at the 2025 Machine Learning and the Physical Sciences (ML4PS) workshop at NeurIPS
Abstract:Distance-based methods involve the computation of distance values between features and are a well-established paradigm in machine learning. In anomaly detection, anomalies are identified by their large distance from normal data points. However, the performance of these methods often hinges on a single, user-selected distance metric (e.g., Euclidean), which may not be optimal for the complex, high-dimensional feature spaces common in astronomy. Here, we introduce a novel anomaly detection method, Distance Multi-Metric Anomaly Detection (DiMMAD), which uses an ensemble of distance metrics to find novelties. Using multiple distance metrics is effectively equivalent to using different geometries in the feature space. By using a robust ensemble of diverse distance metrics, we overcome the metric-selection problem, creating an anomaly score that is not reliant on any single definition of distance. We demonstrate this multi-metric approach as a tool for simple, interpretable scientific discovery on astronomical time series – (1) with simulated data for the upcoming Vera C. Rubin Observatory Legacy Survey of Space and Time, and (2) real data from the Zwicky Transient Facility. We find that DiMMAD excels at out-of-distribution anomaly detection – anomalies in the data that might be new classes – and beats other state-of-the-art methods in the goal of maximizing the diversity of new classes discovered. For rare in-distribution anomaly detection, DiMMAD performs similarly to other methods, but may allow for improved interpretability. All our code is open source: DiMMAD is implemented within DistClassiPy: this https URL, while all code to reproduce the results of this paper is available here: this https URL.
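多距离度量集成的思想可以用很短的代码演示。下面的示意草图(非 DistClassiPy 的官方实现)对每种度量分别计算各点到 k 近邻的平均距离,经秩变换归一化后再取平均作为集成异常分数;度量列表与归一化方式均为假设选择。

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import rankdata

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 12)),
               rng.normal(5, 1, (10, 12))])    # 尾部 10 个点是植入的异常

def knn_score(X, metric, k=10):
    D = cdist(X, X, metric=metric)
    np.fill_diagonal(D, np.inf)                # 排除自身
    return np.sort(D, axis=1)[:, :k].mean(1)   # k 近邻平均距离

metrics = ["euclidean", "cityblock", "chebyshev", "cosine", "correlation"]
ranks = [rankdata(knn_score(X, m)) / len(X) for m in metrics]
score = np.mean(ranks, axis=0)                 # 集成分数,越大越异常
print("top-10 anomaly indices:", np.argsort(score)[-10:])
```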
[LG-82] VIKING: Deep variational inference with stochastic projections NEURIPS2025
链接: https://arxiv.org/abs/2510.23684
作者: Samuel G. Fadel,Hrittik Roy,Nicholas Krämer,Yevgen Zainchkovskyy,Stas Syrota,Alejandro Valverde Mahou,Carl Henrik Ek,Søren Hauberg
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: NeurIPS 2025 (poster)
Abstract:Variational mean field approximations tend to struggle with contemporary overparametrized deep neural networks. Where a Bayesian treatment is usually associated with high-quality predictions and uncertainties, the practical reality has been the opposite, with unstable training, poor predictive power, and subpar calibration. Building upon recent work on reparametrizations of neural networks, we propose a simple variational family that considers two independent linear subspaces of the parameter space. These represent functional changes inside and outside the support of training data. This allows us to build a fully-correlated approximate posterior reflecting the overparametrization that tunes easy-to-interpret hyperparameters. We develop scalable numerical routines that maximize the associated evidence lower bound (ELBO) and sample from the approximate posterior. Empirically, we observe state-of-the-art performance across tasks, models, and datasets compared to a wide array of baseline methods. Our results show that approximate Bayesian inference applied to deep neural networks is far from a lost cause when constructing inference mechanisms that reflect the geometry of reparametrizations.
[LG-83] Beyond Normality: Reliable A/B Testing with Non-Gaussian Data
链接: https://arxiv.org/abs/2510.23666
作者: Junpeng Gong,Chunkai Wang,Hao Li,Jinyong Ma,Haoxuan Li,Xu He
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 11 pages, 3 figures
Abstract:A/B testing has become the cornerstone of decision-making in online markets, guiding how platforms launch new features, optimize pricing strategies, and improve user experience. In practice, we typically employ the pairwise t-test to compare outcomes between the treatment and control groups, thereby assessing the effectiveness of a given strategy. To be trustworthy, these experiments must keep Type I error (i.e., false positive rate) under control; otherwise, we may launch harmful strategies. However, in real-world applications, we find that A/B testing often fails to deliver reliable results. When the data distribution departs from normality or when the treatment and control groups differ in sample size, the commonly used pairwise t-test is no longer trustworthy. In this paper, we quantify how skewed, long-tailed data and unequal allocation distort error rates and derive explicit formulas for the minimum sample size required for the t-test to remain valid. We find that many online feedback metrics require hundreds of millions of samples to ensure reliable A/B testing. Thus we introduce an Edgeworth-based correction that provides more accurate p-values when the available sample size is limited. Offline experiments on a leading A/B testing platform corroborate the practical value of our theoretical minimum sample size thresholds and demonstrate that the corrected method substantially improves the reliability of A/B testing in real-world conditions.
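Edgeworth 修正的核心是用偏度项修正正态近似。对单样本 t 统计量,经典的一阶展开为 P(T ≤ x) ≈ Φ(x) + (γ/6)(2x²+1)φ(x)/√n,其中 γ 为偏度。下面是这一教科书公式的示意实现;两样本、不等分配情形的完整修正(论文的主要内容)从略,数据为合成占位。

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=200) - 1.0       # 偏斜数据,真实均值为 0
n = len(x)
t = np.sqrt(n) * x.mean() / x.std(ddof=1)      # 单样本 t 统计量
gamma = stats.skew(x)

cdf_normal = stats.norm.cdf(t)
cdf_edgeworth = np.clip(
    cdf_normal + gamma / 6 * (2 * t**2 + 1) * stats.norm.pdf(t) / np.sqrt(n),
    0.0, 1.0)

p_naive = 2 * min(cdf_normal, 1 - cdf_normal)        # 常规正态近似 p 值
p_corr = 2 * min(cdf_edgeworth, 1 - cdf_edgeworth)   # Edgeworth 修正后 p 值
print(f"t={t:.3f}  p_naive={p_naive:.4f}  p_edgeworth={p_corr:.4f}")
```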
信息检索
[IR-0] From Time and Place to Preference: LLM-Driven Geo-Temporal Context in Recommendations
链接: https://arxiv.org/abs/2510.24430
作者: Yejin Kim,Shaghayegh Agah,Mayur Nankani,Neeraj Sharma,Feifei Peng,Maria Peifer,Sardar Hamidian,H Howie Huang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Most recommender systems treat timestamps as numeric or cyclical values, overlooking real-world context such as holidays, events, and seasonal patterns. We propose a scalable framework that uses large language models (LLMs) to generate geo-temporal embeddings from only a timestamp and coarse location, capturing holidays, seasonal trends, and local/global events. We then introduce a geo-temporal embedding informativeness test as a lightweight diagnostic, demonstrating on MovieLens, LastFM, and a production dataset that these embeddings provide predictive signal consistent with the outcomes of full model integrations. Geo-temporal embeddings are incorporated into sequential models through (1) direct feature fusion with metadata embeddings or (2) an auxiliary loss that enforces semantic and geo-temporal alignment. Our findings highlight the need for adaptive or hybrid recommendation strategies, and we release a context-enriched MovieLens dataset to support future research.
[IR-1] DUET: Dual Model Co-Training for Entire Space CTR Prediction
链接: https://arxiv.org/abs/2510.24369
作者: Yutian Xiao,Meng Yuan,Fuzhen Zhuang,Wei Chen,Shukuan Wang,Shanqi Liu,Chao Feng,Wenhui Yu,Xiang Li,Lantao Hu,Han Li,Zhao Zhang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The pre-ranking stage plays a pivotal role in large-scale recommender systems but faces an intrinsic trade-off between model expressiveness and computational efficiency. Owing to the massive candidate pool and strict latency constraints, industry systems often rely on lightweight two-tower architectures, which are computationally efficient yet limited in estimation capability. As a result, they struggle to capture the complex synergistic and suppressive relationships among candidate items, which are essential for producing contextually coherent and diverse recommendation lists. Moreover, this simplicity further amplifies the Sample Selection Bias (SSB) problem, as coarse-grained models trained on biased exposure data must generalize to a much larger candidate space with distinct distributions. To address these issues, we propose DUET (DUal Model Co-Training for Entire Space CTR Prediction), a set-wise pre-ranking framework that achieves expressive modeling under tight computational budgets. Instead of scoring items independently, DUET performs set-level prediction over the entire candidate subset in a single forward pass, enabling information-aware interactions among candidates while amortizing the computational cost across the set. Moreover, a dual model co-training mechanism extends supervision to unexposed items via mutual pseudo-label refinement, effectively mitigating SSB. Validated through extensive offline experiments and online A/B testing, DUET consistently outperforms state-of-the-art baselines and achieves improvements across multiple core business metrics. At present, DUET has been fully deployed in Kuaishou and Kuaishou Lite Apps, serving the main traffic for hundreds of millions of users.
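"set-wise 打分"的含义可以用一个极简草图说明(假设性示意,与快手线上实现无关,结构与维度均为占位):用一层自注意力让候选之间互相"看见",单次前向即可输出整个候选子集的 CTR 分数,从而把协同/抑制关系纳入打分。

```python
import torch
import torch.nn as nn

class SetWiseScorer(nn.Module):
    def __init__(self, d=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))
    def forward(self, user, items):
        # items: (B, N, d) 候选集合;自注意力建模候选间交互
        ctx, _ = self.attn(items, items, items)
        u = user.unsqueeze(1).expand_as(ctx)       # 用户向量广播到每个候选
        return torch.sigmoid(self.head(torch.cat([ctx, u], -1))).squeeze(-1)

scorer = SetWiseScorer()
user = torch.randn(2, 64)                          # 2 个用户
items = torch.randn(2, 512, 64)                    # 每人 512 个候选
scores = scorer(user, items)                       # (2, 512):一次前向全部打分
print(scores.shape)
```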
[IR-2] Resource-Efficient LLM Application for Structured Transformation of Unstructured Financial Contracts
链接: https://arxiv.org/abs/2510.23990
作者: Maruf Ahmed Mridul,Oshani Seneviratne
类目: Information Retrieval (cs.IR)
*备注: 5 pages, 1 figure, 2 tables
Abstract:The transformation of unstructured legal contracts into standardized, machine-readable formats is essential for automating financial workflows. The Common Domain Model (CDM) provides a standardized framework for this purpose, but converting complex legal documents like Credit Support Annexes (CSAs) into CDM representations remains a significant challenge. In this paper, we present an extension of the CDMizer framework, a template-driven solution that ensures syntactic correctness and adherence to the CDM schema during contract-to-CDM conversion. We apply this extended framework to a real-world task, comparing its performance with a benchmark developed by the International Swaps and Derivatives Association (ISDA) for CSA clause extraction. Our results show that CDMizer, when integrated with a significantly smaller, open-source Large Language Model (LLM), achieves competitive performance in terms of accuracy and efficiency against larger, proprietary models. This work underscores the potential of resource-efficient solutions to automate legal contract transformation, offering a cost-effective and scalable approach that can meet the needs of financial institutions with constrained resources or strict data privacy requirements.

