This post contains the latest paper listing retrieved from arXiv.org on 2025-09-29. The list is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-09-29)
A total of 819 papers were updated today, including:
- Natural Language Processing: 162 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 273 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 190 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 297 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation
[Quick Read]: This paper targets aerial vision-and-language navigation (AVLN): efficiently and accurately guiding a UAV to any goal, in any environment, from arbitrary free-form instructions. Existing approaches built on vision-language models (VLMs) usually treat action prediction as text generation, which makes precise spatial control difficult. The proposed See, Point, Fly (SPF) framework instead reformulates action prediction as a 2D spatial grounding task: a VLM decomposes vague language instructions into iteratively annotated 2D waypoints on the input image, which are combined with a predicted traveling distance and converted into 3D displacement vectors that serve as action commands for the UAV. An adaptive distance-adjustment mechanism improves navigation efficiency, and closed-loop control enables tracking of dynamic targets in dynamic environments. This design substantially improves navigation accuracy and generalization, achieving leading results in both simulation and real-world settings.
Link: https://arxiv.org/abs/2509.22653
Authors: Chih Yao Hu,Yang-Sen Lin,Yuna Lee,Chih-Hai Su,Jie-Ying Lee,Shr-Ruei Tsai,Chin-Yang Lin,Kuan-Wen Chen,Tsung-Wei Ke,Yu-Lun Liu
Institutions: National Yang Ming Chiao Tung University; National Taiwan University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CoRL 2025. Project page: this https URL
Abstract:We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instructions in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF also adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art in DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choice. Lastly, SPF shows remarkable generalization to different VLMs. Project page: this https URL
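To make the waypoint-to-action step above concrete, here is a minimal sketch of converting a VLM-predicted 2D waypoint plus a traveling distance into a 3D displacement command. It assumes a simple pinhole back-projection with known camera intrinsics (`fx`, `fy`, `cx`, `cy`); the actual SPF conversion may differ in detail.

```python
import numpy as np

def waypoint_to_displacement(u, v, distance, fx, fy, cx, cy):
    """Back-project a 2D waypoint (u, v) into a unit ray using pinhole
    intrinsics, then scale it by the predicted traveling distance to get
    a 3D displacement in the camera frame (x right, y down, z forward)."""
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    ray /= np.linalg.norm(ray)
    return distance * ray

# Example: waypoint near the image center of a 640x480 frame, 2 m ahead.
dx, dy, dz = waypoint_to_displacement(330, 230, 2.0, fx=500, fy=500, cx=320, cy=240)
print(dx, dy, dz)  # small lateral/vertical offsets, mostly forward motion
```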
[NLP-1] VoiceAssistant-Eval: Benchmarking AI Assistants across Listening Speaking and Viewing
[Quick Read]: This paper addresses the lack of a comprehensive benchmark for voice-first AI assistants: existing benchmarks cannot adequately measure the combined abilities of multimodal systems in listening, speaking, and viewing. The key contribution is VoiceAssistant-Eval, a benchmark of 10,497 carefully curated examples spanning 13 task categories across listening (natural sounds, music, spoken dialogue), speaking (multi-turn dialogue, role-play imitation), and viewing (highly heterogeneous images). By evaluating 21 open-source models and GPT-4o-Audio, the benchmark exposes weaknesses in audio understanding, the potential of well-designed smaller models, and the difficulty of multimodal input and role-play voice imitation, providing a rigorous evaluation framework and direction for next-generation AI assistants.
Link: https://arxiv.org/abs/2509.22651
Authors: Ke Wang,Houxing Ren,Zimu Lu,Mingjie Zhan,Hongsheng Li
Institutions: CUHK MMLab; SenseTime Research; CPII under InnoHK
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Comments:
Abstract:The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems’ capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at this https URL .
[NLP-2] CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
[Quick Read]: This paper tackles the limitations of supervised fine-tuning (SFT) for image captioning: SFT relies on expensive, non-scalable human annotation and encourages models to memorize specific ground-truth answers, limiting the diversity and creativity of generated captions. The proposed CapRL applies Reinforcement Learning with Verifiable Rewards (RLVR) and redefines caption quality through objective utility: a good caption should enable a vision-free language model to accurately answer multiple-choice questions about the image. CapRL uses a decoupled two-stage pipeline in which an LVLM first generates a caption and a separate text-only LLM then answers questions based solely on that caption; the answering accuracy serves as the objective reward, steering the model toward more informative and generalizable captions.
Link: https://arxiv.org/abs/2509.22647
Authors: Long Xing,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jianze Liang,Qidong Huang,Jiaqi Wang,Feng Wu,Dahua Lin
Institutions: University of Science and Technology of China; Shanghai AI Laboratory; The Chinese University of Hong Kong; Shanghai Innovation Institute; Alibaba Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL
Abstract:Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a “good” caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: this https URL.
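The verifiable reward described above can be sketched in a few lines: score a caption by how many multiple-choice questions a vision-free LLM answers correctly from the caption alone. `answer_with_llm` and the prompt format are placeholders, not the paper's exact implementation.

```python
def caprl_reward(caption, mcqs, answer_with_llm):
    """Reward for one generated caption: the fraction of multiple-choice
    questions about the image that a vision-free LLM answers correctly
    when given only the caption (no image)."""
    correct = 0
    for q in mcqs:  # each q: {"question": str, "options": [...], "answer": "B"}
        prompt = (
            f"Caption: {caption}\n"
            f"Question: {q['question']}\n"
            f"Options: {', '.join(q['options'])}\n"
            "Answer with a single option letter."
        )
        prediction = answer_with_llm(prompt).strip().upper()[:1]
        correct += int(prediction == q["answer"])
    return correct / max(len(mcqs), 1)
```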
[NLP-3] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs
[Quick Read]: This paper asks whether humans can identify the fake traces in AI-generated videos (spatiotemporally localizable visual artifacts) and provide grounded explanations. Although video generation models are advancing rapidly, whether the deepfake traces hidden in their outputs can be perceived and accurately annotated by humans has not been studied systematically. The key contribution is DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark: 4.3K human annotations over 3.3K high-quality generated videos, each with a natural-language explanation, a bounding-box localization, and precise onset/offset timestamps, consolidated into 9 major categories of fake traces. Multimodal language models trained on this dataset as reward models mimic human identification and localization of fake clues; the 7B reward model outperforms GPT-5 by 34.7% on average across fake-clue identification, grounding, and explanation, validating the approach.
Link: https://arxiv.org/abs/2509.22646
Authors: Xingyu Fu,Siyi Liu,Yinuo Xu,Pan Lu,Guangqiuse Hu,Tianbo Yang,Taran Anantasagar,Christopher Shen,Yikai Mao,Yuanzhe Liu,Keyush Shah,Chung Un Lee,Yejin Choi,James Zou,Dan Roth,Chris Callison-Burch
Institutions: Princeton University; University of Pennsylvania; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL
Abstract:Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension – whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine generated – has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially- and temporally- aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake v.s. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
[NLP-4] WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
[Quick Read]: This paper addresses the problem that current code agents for website code generation rely on simple code-execution feedback and therefore cannot assess the visual quality of the generated code, failing to capture real user interaction and interface rendering and producing results that deviate from expectations. The proposed WebGen-Agent iterates with multi-level visual feedback, including screenshots and GUI-agent testing: a visual language model (VLM) produces detailed textual descriptions, suggestions, and quantitative scores, which are combined with a backtracking and select-best mechanism to improve performance. The authors further introduce a Step-GRPO training strategy that uses the per-step screenshot and GUI-agent scores as rewards, providing dense and reliable process supervision that substantially strengthens the LLM as the reasoning engine for website generation.
Link: https://arxiv.org/abs/2509.22644
Authors: Zimu Lu,Houxing Ren,Yunqiao Yang,Ke Wang,Zhuofan Zong,Junting Pan,Mingjie Zhan,Hongsheng Li
Institutions: Multimedia Laboratory (MMLab), The Chinese University of Hong Kong; Ace Robotics; SenseTime
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce \textitStep-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model’s website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.
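As a rough illustration of the step-level supervision described above, the sketch below combines a screenshot score and a GUI-agent score into a step reward and normalizes rewards within a group of rollouts, GRPO-style. The weights and normalization details are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def step_reward(screenshot_score, gui_agent_score, w_vis=0.5, w_gui=0.5):
    """Combine the VLM's screenshot score and GUI-agent testing score
    (both assumed to be on a 0-1 scale) into a single step-level reward."""
    return w_vis * screenshot_score + w_gui * gui_agent_score

def grpo_advantages(rewards):
    """GRPO-style group normalization: advantages are rewards standardized
    within a group of rollouts for the same task."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

print(grpo_advantages([step_reward(0.8, 0.6), step_reward(0.3, 0.4), step_reward(0.9, 0.9)]))
```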
[NLP-5] Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
[Quick Read]: This paper examines the limitation of using n-gram novelty alone as a creativity metric for language models: such a metric ignores the dual nature of creativity, novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). The key contribution is a large-scale expert annotation study (7,542 annotations of human- and AI-generated text) that quantifies the relationship between novelty, pragmaticality, and sensicality, and tests whether LLMs under zero-shot, few-shot, and fine-tuned settings can identify creative and non-pragmatic expressions. The study finds that although n-gram novelty correlates positively with expert-judged creativity, about 91% of top-quartile expressions by n-gram novelty are not judged creative, and for open-source models higher n-gram novelty comes with lower pragmaticality; frontier closed-source models are also less likely than humans to produce creative expressions. In addition, LLM-as-a-Judge novelty scores from the best-performing model predict expert writer preferences, pointing to the potential of human-grounded automatic evaluation.
Link: https://arxiv.org/abs/2509.22641
Authors: Arkadiy Saakyan,Najoung Kim,Smaranda Muresan,Tuhin Chakrabarty
Institutions: Columbia University; Boston University; Stony Brook University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 26 pages, 10 figures, under review
Abstract:N-gram novelty is widely used to evaluate language models’ ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity’s dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier close-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
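For reference, a minimal version of the n-gram novelty metric discussed above: the fraction of a text's n-grams that never occur in a reference corpus. Whitespace tokenization and n=5 are simplifying assumptions.

```python
def ngram_novelty(text, corpus_ngrams, n=5):
    """Fraction of n-grams in `text` that do not appear in a reference set of
    corpus n-grams; higher means more of the text is 'new' at the n-gram level."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    novel = sum(1 for g in grams if g not in corpus_ngrams)
    return novel / len(grams)

# Toy reference set built from a tiny "training corpus".
corpus = "the cat sat on the mat and looked at the dog".lower().split()
corpus_ngrams = {tuple(corpus[i:i + 5]) for i in range(len(corpus) - 4)}
print(ngram_novelty("the cat sat on the mat", corpus_ngrams))               # mostly seen -> 0.0
print(ngram_novelty("a violet whale hums beneath the ice", corpus_ngrams))  # unseen -> 1.0
```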
[NLP-6] Language Models Can Learn from Verbal Feedback Without Scalar Rewards
[Quick Read]: This paper addresses the information loss and scale imbalance caused by compressing rich, multi-dimensional verbal feedback into scalar rewards when training large language models (LLMs) with reinforcement learning from human or AI feedback. The key idea is to treat verbal feedback as a conditioning signal: the proposed feedback-conditional policy (FCP) learns the feedback-conditional posterior directly from response-feedback pairs via maximum-likelihood training on offline data, and an online bootstrapping stage lets the policy generate under positive conditions and receive fresh feedback for further refinement. This reframes feedback-driven learning as conditional generation rather than reward optimization, making fuller use of the expressiveness of verbal feedback and improving the model's ability to understand and adapt to nuanced feedback.
Link: https://arxiv.org/abs/2509.22638
Authors: Renjie Luo,Zichen Liu,Xiangyan Liu,Chao Du,Min Lin,Wenhu Chen,Wei Lu,Tianyu Pang
Institutions: Sea AI Lab; SUTD (Singapore University of Technology and Design); NUS (National University of Singapore); NTU (Nanyang Technological University); University of Waterloo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at this https URL.
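A minimal sketch of the maximum-likelihood training step for a feedback-conditional policy: condition the model on the prompt plus the verbal feedback and maximize the likelihood of the paired response. It assumes a HuggingFace-style causal LM interface, and the conditioning format is an illustrative choice, not the paper's exact template.

```python
import torch

def fcp_loss(model, tokenizer, prompt, feedback, response, device="cpu"):
    """Maximum-likelihood loss of a feedback-conditional policy: the model is
    conditioned on the prompt plus the verbal feedback, and trained to
    reproduce the response that received that feedback (offline pairs)."""
    condition = f"{prompt}\n[Feedback: {feedback}]\n"  # conditioning format is an assumption
    cond_ids = tokenizer(condition, return_tensors="pt").input_ids.to(device)
    resp_ids = tokenizer(response, return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([cond_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : cond_ids.shape[1]] = -100  # only the response tokens contribute to the loss
    return model(input_ids=input_ids, labels=labels).loss
```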
[NLP-7] Variational Reasoning for Language Models
[Quick Read]: This paper targets the instability and ill-defined training objectives of language models on complex reasoning tasks, in particular how to improve reasoning through a more stable optimization mechanism. The key contribution is a variational reasoning framework that treats thinking traces as latent variables: starting from the evidence lower bound (ELBO), it derives a multi-trace objective for tighter bounds and introduces a forward-KL formulation that stabilizes training of the variational posterior. The framework further shows that rejection-sampling fine-tuning and binary-reward RL (such as GRPO) are essentially local forward-KL objectives that implicitly weight by model accuracy, exposing a bias toward easier questions. Experiments on the Qwen 2.5 and Qwen 3 model families validate the method across a wide range of reasoning tasks, providing a unified probabilistic perspective that connects variational inference with RL-style methods.
Link: https://arxiv.org/abs/2509.22637
Authors: Xiangxin Zhou,Zichen Liu,Haonan Wang,Chao Du,Min Lin,Chongxuan Li,Liang Wang,Tianyu Pang
Institutions: Sea AI Lab; UCAS (University of Chinese Academy of Sciences); CASIA (Institute of Automation, Chinese Academy of Sciences); NUS (National University of Singapore); RUC (Renmin University of China)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at this https URL.
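For orientation, the standard single-trace ELBO that the abstract starts from can be written with the thinking trace z as a latent variable (x is the question, y the answer); the paper's multi-trace and forward-KL variants build on this bound.

```latex
\log p_\theta(y \mid x) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big]
\;-\; \mathrm{KL}\!\big(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\big)
```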
[NLP-8] LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision
[Quick Read]: This paper addresses the bottleneck of building high-quality, domain-specific datasets for industrial vision systems, where complex trade-offs must be made between data quality, diversity, and cost. The key contribution is Labeling Copilot, the first deep research agent for dataset curation in computer vision: a central orchestrator agent driven by a large multimodal language model uses multi-step reasoning to invoke three core modules: (1) Calibrated Discovery, which sources relevant, in-distribution samples from large unlabeled repositories; (2) Controllable Synthesis, which generates new data for rare scenarios with robust filtering; and (3) Consensus Annotation, which orchestrates multiple foundation models for high-accuracy labeling through a novel consensus mechanism combining non-maximum suppression and voting. Experiments show that this agentic workflow, combined with optimized and scalable tools, provides a robust foundation for curating industrial-scale datasets.
Link: https://arxiv.org/abs/2509.22631
Authors: Debargha Ganguly,Sumit Kumar,Ishwar Balappanawar,Weicong Chen,Shashank Kambhatla,Srinivasan Iyengar,Shivkumar Kalyanaraman,Ponnurangam Kumaraguru,Vipin Chaudhary
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Curating high-quality, domain-specific datasets is a major bottleneck for deploying robust vision systems, requiring complex trade-offs between data quality, diversity, and cost when researching vast, unlabeled data lakes. We introduce Labeling Copilot, the first data curation deep research agent for computer vision. A central orchestrator agent, powered by a large multimodal language model, uses multi-step reasoning to execute specialized tools across three core capabilities: (1) Calibrated Discovery sources relevant, in-distribution data from large repositories; (2) Controllable Synthesis generates novel data for rare scenarios with robust filtering; and (3) Consensus Annotation produces accurate labels by orchestrating multiple foundation models via a novel consensus mechanism incorporating non-maximum suppression and voting. Our large-scale validation proves the effectiveness of Labeling Copilot’s components. The Consensus Annotation module excels at object discovery: on the dense COCO dataset, it averages 14.2 candidate proposals per image-nearly double the 7.4 ground-truth objects-achieving a final annotation mAP of 37.1%. On the web-scale Open Images dataset, it navigated extreme class imbalance to discover 903 new bounding box categories, expanding its capability to over 1500 total. Concurrently, our Calibrated Discovery tool, tested at a 10-million sample scale, features an active learning strategy that is up to 40x more computationally efficient than alternatives with equivalent sample efficiency. These experiments validate that an agentic workflow with optimized, scalable tools provides a robust foundation for curating industrial-scale datasets.
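One plausible instantiation of the consensus annotation idea (pooling detections from several foundation models, clustering them with an IoU threshold, and keeping boxes that enough models agree on) is sketched below; the thresholds and greedy clustering are assumptions, not the paper's exact mechanism.

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def consensus_boxes(model_outputs, iou_thr=0.5, min_votes=2):
    """Greedy NMS-plus-voting consensus: pool boxes from all models, keep the
    highest-scoring box of each cluster, and accept it only if enough distinct
    models voted for that cluster. Each box: {"box", "score", "label", "model"}."""
    pooled = sorted((b for boxes in model_outputs for b in boxes),
                    key=lambda b: b["score"], reverse=True)
    kept = []
    while pooled:
        best = pooled.pop(0)
        cluster = [best] + [b for b in pooled if b["label"] == best["label"]
                            and iou(b["box"], best["box"]) >= iou_thr]
        pooled = [b for b in pooled if b not in cluster]
        if len({b["model"] for b in cluster}) >= min_votes:
            kept.append(best)
    return kept
```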
[NLP-9] StateX: Enhancing RNN Recall via Post-training State Expansion
[Quick Read]: This paper addresses the limited contextual recall of RNN-based models when processing long contexts, which stems from their bounded state size. Although RNN-style models such as linear attention and state space models enjoy constant per-token complexity, their fixed-size recurrent state restricts how much long-range information they can remember. The key contribution is StateX, a post-training pipeline that applies architectural modifications to pre-trained RNNs to efficiently expand the recurrent state size with no or negligible increase in parameters, thereby substantially improving contextual recall and in-context learning without high post-training cost.
Link: https://arxiv.org/abs/2509.22630
Authors: Xingyu Shen,Yingfa Chen,Zhen Leng Thai,Xu Han,Zhiyuan Liu,Maosong Sun
Institutions: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
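The recall bottleneck discussed above comes from the fixed-size recurrent state; the toy linear-attention recurrence below makes that explicit, since the whole context is folded into a single d_k x d_v matrix whose dimensions StateX would enlarge post-training (the specific recurrence here is a generic unnormalized linear attention, not the paper's exact architecture).

```python
import numpy as np

def linear_attention_step(state, k, v):
    """One recurrent step of (unnormalized) linear attention: the running state
    S accumulates outer products k v^T, and the output is a read-out q @ S.
    The state stays d_k x d_v regardless of context length, so recall is tied
    to the state size."""
    return state + np.outer(k, v)

d_k, d_v = 64, 64          # doubling d_k would double the recurrent state size
state = np.zeros((d_k, d_v))
for _ in range(10):        # toy sequence of 10 tokens
    k, v = np.random.randn(d_k), np.random.randn(d_v)
    state = linear_attention_step(state, k, v)
q = np.random.randn(d_k)
output = q @ state         # read-out for the current query
print(state.shape, output.shape)
```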
[NLP-10] IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
[Quick Read]: This paper studies how the internal computations of in-context learning (ICL) can be used to improve the accuracy and calibration of supervised fine-tuned (SFT) models. The core solution is a self-distillation method, ICL Activation Alignment (IA2), which adds a priming step before SFT that trains the SFT model's activation patterns to match those of ICL, thereby incentivizing ICL-like internal reasoning. Experiments on 12 popular benchmarks and 2 model families show that IA2 as a priming step significantly improves accuracy and calibration, reveals that ICL and SFT operate through different functional mechanisms, and offers a new angle for improving SFT.
Link: https://arxiv.org/abs/2509.22621
Authors: Aayush Mishra,Daniel Khashabi,Anqi Liu
Institutions: Johns Hopkins University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL’s internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL’s rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL’s activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.
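A rough sketch of what an activation-alignment objective could look like: record hidden activations from the ICL run and pull the SFT model's activations toward them before standard SFT. The per-layer MSE, the toy shapes, and the assumption that positions are already aligned to the query tokens are all illustrative choices; the paper's IA2 procedure may differ.

```python
import torch
import torch.nn.functional as F

def ia2_alignment_loss(sft_activations, icl_activations):
    """Self-distillation objective in the spirit of IA2: align the SFT model's
    hidden activations with the activations recorded when the same query is
    answered via ICL (prompt with demonstrations). Both inputs are lists of
    [batch, seq, hidden] tensors, one per aligned layer, assumed to cover the
    same query positions."""
    losses = [F.mse_loss(sft_h, icl_h.detach())
              for sft_h, icl_h in zip(sft_activations, icl_activations)]
    return torch.stack(losses).mean()

# Toy check with random activations from 4 aligned layers.
sft = [torch.randn(2, 16, 512, requires_grad=True) for _ in range(4)]
icl = [torch.randn(2, 16, 512) for _ in range(4)]
print(ia2_alignment_loss(sft, icl))
```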
[NLP-11] Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
[Quick Read]: This paper targets two structural inefficiencies of RGB-pixel-based vision encoders in current vision-language models: transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and patch-based tokenization explodes sequence length, stressing attention budgets and context windows. The key idea is to adopt 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: images are parameterized by a set of colored anisotropic Gaussians, a compact and spatially adaptive representation. With structured initialization, luminance-aware pruning, and batched CUDA kernels, fitting becomes over 90x faster with about 97% GPU utilization; adapting CLIP to 2DGS inputs by fine-tuning only about 7% of the parameters yields meaningful zero-shot ImageNet-1K performance from inputs compressed 3 to 20x, establishing 2DGS as a semantically powerful and transmission-efficient multimodal representation.
Link: https://arxiv.org/abs/2509.22615
Authors: Yasmine Omri,Connor Ding,Tsachy Weissman,Thierry Tambe
Institutions: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Modern vision language pipelines are driven by RGB vision encoders trained on massive image text corpora. While these pipelines have enabled impressive zero shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance aware pruning, and batched CUDA kernels, achieving over 90x faster fitting and about 97% GPU utilization compared to prior implementations. We further adapt contrastive language image pretraining (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat aware input stem and a perceiver resampler, training only about 7% of the total parameters. On large DataComp subsets, GS encoders yield meaningful zero shot ImageNet-1K performance while compressing inputs 3 to 20x relative to pixels. While accuracy currently trails RGB encoders, our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission efficient for edge cloud learning.
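To illustrate the 2DGS representation itself, the sketch below renders an image from a list of colored anisotropic Gaussians by accumulating their weighted contributions per pixel; the additive blending is a simplification, and the paper's actual pipeline adds structured initialization, luminance-aware pruning, and batched CUDA kernels.

```python
import numpy as np

def render_2dgs(gaussians, height, width):
    """Render an image from colored anisotropic 2D Gaussians.
    Each Gaussian: mean (2,), covariance (2,2), color (3,), opacity scalar."""
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.stack([xs, ys], axis=-1).astype(float)           # (H, W, 2), (x, y) order
    image = np.zeros((height, width, 3))
    for g in gaussians:
        diff = coords - g["mean"]                                # (H, W, 2)
        inv_cov = np.linalg.inv(g["cov"])
        maha = np.einsum("hwi,ij,hwj->hw", diff, inv_cov, diff)  # squared Mahalanobis distance
        weight = g["opacity"] * np.exp(-0.5 * maha)              # (H, W)
        image += weight[..., None] * g["color"]
    return np.clip(image, 0.0, 1.0)

splat = {"mean": np.array([16.0, 12.0]),
         "cov": np.array([[30.0, 8.0], [8.0, 15.0]]),   # anisotropic covariance
         "color": np.array([0.9, 0.4, 0.1]),
         "opacity": 0.8}
print(render_2dgs([splat], 24, 32).shape)  # (24, 32, 3)
```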
[NLP-12] From tests to effect sizes: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation benchmarks ACL
[Quick Read]: This paper addresses how to quantify the uncertainty and statistical precision of evaluation metrics in multilingual and multitask NLP benchmarks. Existing practice often ignores the multiple sources of experimental variation, which leads to substantially underestimating the overall variability over hypothetical replications. The key contribution is a set of resampling-based methods that account for both model-related and data-related sources of variation, yielding better estimates of the sampling distributions of performance scores and supporting uncertainty analysis of the quantities commonly reported on leaderboards (averages, medians, pairwise differences between models, and rankings).
Link: https://arxiv.org/abs/2509.22612
Authors: Jonne Sälevä,Duygu Ataman,Constantine Lignos
Institutions: Brandeis University; Middle East Technical University
Categories: Computation and Language (cs.CL)
Comments: Paper currently under review at ACL Rolling Review
Abstract:In this paper, we introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for computing sampling distributions for various quantities used in leaderboards such as the average/median, pairwise differences between models, and rankings.
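A minimal example of the resampling idea: a percentile bootstrap over test items within each language, then averaging across languages to get a confidence interval for the leaderboard mean. Capturing model-related variation (e.g., across training seeds or reruns) would add an outer resampling loop, as the paper argues; the toy scores below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mean_score(per_language_scores, n_boot=1000):
    """Percentile bootstrap for a benchmark average: resample items with
    replacement within each language, average per language, then average
    across languages. Returns the 2.5th, 50th, and 97.5th percentiles."""
    replicates = []
    for _ in range(n_boot):
        lang_means = [rng.choice(scores, size=len(scores), replace=True).mean()
                      for scores in per_language_scores.values()]
        replicates.append(np.mean(lang_means))
    return np.percentile(replicates, [2.5, 50, 97.5])

scores = {"swahili": np.array([1, 0, 1, 1, 0, 1]),        # per-item correctness (toy data)
          "finnish": np.array([0, 1, 1, 0, 0, 1, 1])}
print(bootstrap_mean_score(scores))  # lower CI bound, median, upper CI bound
```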
[NLP-13] Capturing Opinion Shifts in Deliberative Discourse through Frequency-based Quantum deep learning methods
[Quick Read]: This paper investigates how natural language processing (NLP) techniques can effectively model and analyze deliberation, in particular how to extract meaningful insights from the dynamics of shifting opinions. The key elements are a self-sourced dataset built to reflect diverse viewpoints and a comparative analysis of two models, Frequency-Based Discourse Modulation and a Quantum-Deliberation Framework, which outperform existing state-of-the-art models at capturing opinion shifts and predicting potential outcomes under varying scenarios, suggesting practical applications in public policy-making, debate evaluation, decision-support systems, and large-scale social media opinion mining.
Link: https://arxiv.org/abs/2509.22603
Authors: Rakesh Thakur,Harsh Chaturvedi,Ruqayya Shah,Janvi Chauhan,Ayush Sharma
Institutions: Amity Center for Artificial Intelligence, Amity University; Amity School of Engineering and Technology, Amity University
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, 1 table
Abstract:Deliberation plays a crucial role in shaping outcomes by weighing diverse perspectives before reaching decisions. With recent advancements in Natural Language Processing, it has become possible to computationally model deliberation by analyzing opinion shifts and predicting potential outcomes under varying scenarios. In this study, we present a comparative analysis of multiple NLP techniques to evaluate how effectively models interpret deliberative discourse and produce meaningful insights. Opinions from individuals of varied backgrounds were collected to construct a self-sourced dataset that reflects diverse viewpoints. Deliberation was simulated using product presentations enriched with striking facts, which often prompted measurable shifts in audience opinions. We have given comparative analysis between two models namely Frequency-Based Discourse Modulation and Quantum-Deliberation Framework which outperform the existing state of art models. The findings highlight practical applications in public policy-making, debate evaluation, decision-support frameworks, and large-scale social media opinion mining.
[NLP-14] Learn the Ropes Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
[Quick Read]: This paper tackles the exploration-exploitation trade-off in reinforcement learning (RL) for training large language models (LLMs) on long-horizon, sparsely rewarded agent tasks, especially the training instability caused by multi-turn distribution shift. The key contribution is SPEAR, a curriculum-based self-imitation learning (SIL) recipe that steers policy evolution within a well-balanced entropy range across stages. Early in training, auxiliary tool-call rewards foster skill-level exploration and keep entropy trending upward; later, self-imitation is strengthened to exploit successful trajectories from the replay buffer for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. Advantage recalibration for replayed experiences and trajectory-level entropy controls (such as clipping tokens with high probability-advantage covariance) further mitigate policy drift and over-confidence, yielding a stable and progressive exploration-exploitation balance.
Link: https://arxiv.org/abs/2509.22601
Authors: Yulei Qin,Xiaoyu Tan,Zhengbao He,Gang Li,Haojia Lin,Zongyi Li,Zihan Xu,Yuchen Shi,Siqi Cai,Renting Rui,Shaofei Cai,Yuzheng Cai,Xuan Zhang,Sheng Ye,Ke Li,Xing Sun
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 26 pages, 11 figures
Abstract:Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy update, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation gets strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address the potential policy drift. Reugularizations such as the clipping of tokens with high covariance between probability and advantage are introduced to the trajectory-level entropy control to curb over-confidence.
[NLP-15] From Formal Language Theory to Statistical Learning: Finite Observability of Subregular Languages
[Quick Read]: This paper addresses learnability and interpretability in modeling natural language structure, specifically whether representations of language classes in the subregular hierarchy have finite observability and can be learned with simple linear models. The key result is a proof that all standard subregular language classes are linearly separable when represented by their deciding predicates, which establishes finite observability and guarantees learnability with simple linear models; moreover, the learned features align with known linguistic constraints, as shown in experiments on English morphology, providing a rigorous and interpretable foundation for modeling natural language structure.
Link: https://arxiv.org/abs/2509.22598
Authors: Katsuhiko Hayashi,Hidetaka Kamigaito
Institutions: The University of Tokyo; Nara Institute of Science and Technology
Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures
Abstract:We prove that all standard subregular language classes are linearly separable when represented by their deciding predicates. This establishes finite observability and guarantees learnability with simple linear models. Synthetic experiments confirm perfect separability under noise-free conditions, while real-data experiments on English morphology show that learned features align with well-known linguistic constraints. These results demonstrate that the subregular hierarchy provides a rigorous and interpretable foundation for modeling natural language structure. Our code used in real-data experiments is available at this https URL.
[NLP-16] ArabJobs: A Multinational Corpus of Arabic Job Ads
[Quick Read]: This paper addresses the scarcity of Arabic labour-market text data with multinational, multi-dialect coverage to support fairness-aware Arabic NLP and labour-market research. The key contribution is ArabJobs, a publicly released corpus of over 8,500 job advertisements (more than 550,000 words) from Egypt, Jordan, Saudi Arabia, and the United Arab Emirates, capturing variation in language, gender representation, and occupational structure. The paper demonstrates applications with large language models (LLMs), including salary estimation, job-category normalisation, gender-bias detection, and profession classification, providing benchmark tasks and an empirical basis for follow-up research.
Link: https://arxiv.org/abs/2509.22589
Authors: Mo El-Haj
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:ArabJobs is a publicly available corpus of Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the United Arab Emirates. Comprising over 8,500 postings and more than 550,000 words, the dataset captures linguistic, regional, and socio-economic variation in the Arab labour market. We present analyses of gender representation and occupational structure, and highlight dialectal variation across ads, which offers opportunities for future research. We also demonstrate applications such as salary estimation and job category normalisation using large language models, alongside benchmark tasks for gender bias detection and profession classification. The findings show the utility of ArabJobs for fairness-aware Arabic NLP and labour market research. The dataset is publicly available on GitHub: this https URL.
[NLP-17] Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs
[Quick Read]: This paper targets the localization of context-grounded hallucinations, i.e., information in model outputs that cannot be verified against the source text. Existing evaluation pipelines are complex and hard to deploy, so the authors study LLMs as a more practical localization alternative. The key components are a new LLM-oriented benchmark with over 1,000 challenging human-annotated examples, a new error representation based on free-form textual descriptions that can capture the full range of possible errors, and an LLM-based evaluation protocol whose quality is verified against human judgment. A systematic study of four large-scale LLMs highlights the benchmark's difficulty and the main obstacles: a tendency to incorrectly flag missing details as inconsistent even when instructed to check only facts in the output, and difficulty with factually correct content drawn from the model's parametric knowledge but absent from the source, and thus not verifiable.
Link: https://arxiv.org/abs/2509.22582
Authors: Yehonatan Pesiakhovsky,Zorik Gekhman,Yosi Mass,Liat Ein-Dor,Roi Reichart
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Context-grounded hallucinations are cases where model outputs contain information not verifiable against the source text. We study the applicability of LLMs for localizing such hallucinations, as a more practical alternative to existing complex evaluation pipelines. In the absence of established benchmarks for meta-evaluation of hallucinations localization, we construct one tailored to LLMs, involving a challenging human annotation of over 1,000 examples. We complement the benchmark with an LLM-based evaluation protocol, verifying its quality in a human evaluation. Since existing representations of hallucinations limit the types of errors that can be expressed, we propose a new representation based on free-form textual descriptions, capturing the full range of possible errors. We conduct a comprehensive study, evaluating four large-scale LLMs, which highlights the benchmark’s difficulty, as the best model achieves an F1 score of only 0.67. Through careful analysis, we offer insights into optimal prompting strategies for the task and identify the main factors that make it challenging for LLMs: (1) a tendency to incorrectly flag missing details as inconsistent, despite being instructed to check only facts in the output; and (2) difficulty with outputs containing factually correct information absent from the source - and thus not verifiable - due to alignment with the model’s parametric knowledge.
[NLP-18] EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
[Quick Read]: This paper addresses the fundamental difficulty of training LLM agents in multi-turn, sparse-reward environments, where a single task may require 30+ turns of interaction, making exploration and policy optimization hard for reinforcement learning. The authors identify a distinctive failure mode, the exploration-exploitation cascade failure: early on, the policy prematurely converges to low-entropy, flawed strategies; later, conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. The key solution is Entropy-regularized Policy Optimization (EPO), built on three synergistic mechanisms: (1) entropy regularization adapted to multi-turn settings to enhance exploration; (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations; and (3) adaptive phase-based weighting to balance exploration and exploitation across training. EPO guarantees monotonically decreasing entropy variance while maintaining convergence, and substantially improves performance on ScienceWorld and ALFWorld.
Link: https://arxiv.org/abs/2509.22576
Authors: Xu Wujiang,Wentian Zhao,Zhenting Wang,Li Yu-Jhe,Jin Can,Jin Mingyu,Mei Kai,Wan Kun,Metaxas Dimitris
Institutions: Rutgers University; Adobe Inc.
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
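As a rough illustration of mechanism (2) above, the sketch below penalizes the current policy entropy only when it leaves a band around its historical average, which discourages both collapse and chaotic spikes; the banded-quadratic form and band width are illustrative choices, not the paper's exact regularizer.

```python
import torch

def entropy_smoothing_penalty(step_entropy, entropy_history, band=0.2):
    """Penalize policy entropy only when it drifts outside a band around the
    running historical average of entropy."""
    target = torch.tensor(entropy_history).mean()
    deviation = torch.clamp((step_entropy - target).abs() - band, min=0.0)
    return deviation ** 2

history = [2.1, 2.0, 1.9, 2.0]
print(entropy_smoothing_penalty(torch.tensor(1.2), history))   # collapsed entropy -> penalized
print(entropy_smoothing_penalty(torch.tensor(2.05), history))  # within band -> zero penalty
```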
[NLP-19] Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLM s at Test Time
[Quick Read]: This paper addresses the fact that existing test-time scaling (TTS) methods improve the reasoning of large language models (LLMs) mainly through output-level sampling while neglecting the potential of model architecture. For the mainstream Mixture-of-Experts (MoE) architecture, the authors observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing an underexplored source of diversity. The key contribution is Dynamic Experts Search (DES), which elevates expert activation into a controllable dimension of the search space through two components: (1) Dynamic MoE, which directly controls the number of experts at inference time to generate diverse reasoning trajectories without extra cost, and (2) Expert Configuration Inheritance, which keeps the expert count consistent within a reasoning path while varying it across runs, balancing stability and diversity during search. Experiments across MoE architectures, verifiers, and reasoning benchmarks (math, code, knowledge) show that DES reliably outperforms TTS baselines without additional cost, demonstrating a practical, scalable form of architecture-aware TTS.
Link: https://arxiv.org/abs/2509.22572
Authors: Yixuan Han,Fan Ma,Ruijie Quan,Yi Yang
Institutions: Zhejiang University; Nanyang Technological University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.
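A schematic of the search loop described above: each run fixes one number of activated experts (expert-configuration inheritance within a path), and different runs use different counts so that diversity comes from the architecture rather than from sampling alone. The `generate` callable standing in for an MoE LLM with a controllable router top-k is hypothetical.

```python
def dynamic_experts_search(generate, question, expert_counts=(4, 6, 8), runs_per_count=2):
    """Sketch of DES: collect candidate answers while varying the number of
    activated experts across runs but keeping it fixed within each run.
    `generate(question, top_k_experts)` is a placeholder for an MoE LLM whose
    router top-k can be set at inference time; a verifier would then pick
    among the returned candidates."""
    candidates = []
    for k in expert_counts:
        for _ in range(runs_per_count):
            answer = generate(question, top_k_experts=k)
            candidates.append({"experts": k, "answer": answer})
    return candidates

# Toy stand-in for an MoE model: the answer depends on the expert count.
mock = lambda q, top_k_experts: f"answer-with-{top_k_experts}-experts"
for c in dynamic_experts_search(mock, "2+2?"):
    print(c)
```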
[NLP-20] Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation
[Quick Read]: This paper addresses the clinician workload created by asynchronous patient-clinician messaging in EHR portals and the risks of LLM-drafted replies, such as clinical inaccuracies, omissions, or tone mismatches. The solution makes three contributions: a clinically grounded error ontology covering 5 domains and 59 granular error codes, built through inductive coding and expert adjudication; a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs from institutional archives to improve judgment quality; and a two-stage DSPy prompting architecture enabling scalable, interpretable, hierarchical error detection. Experiments on over 1,500 patient messages show that retrieval context improves error identification, particularly for clinical completeness and workflow appropriateness, and human validation on 100 messages shows higher agreement and F1 for context-enhanced labels than the baseline, supporting RAEC as an AI guardrail for patient messaging.
Link: https://arxiv.org/abs/2509.22565
Authors: Wenyuan Chen,Fateme Nateghi Haredasht,Kameron C. Black,Francois Grolleau,Emily Alsentzer,Jonathan H. Chen,Stephen P. Ma
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
[NLP-21] Think Socially via Cognitive Reasoning
[Quick Read]: This paper addresses the inadequacy of large language models (LLMs) at reasoning in social situations: training paradigms built for logical reasoning are ill-suited to the interpretive processing of ambiguous social cues. The key contribution is a Cognitive Reasoning paradigm inspired by human social cognition, which formalizes the interpretive process as a structured cognitive flow of interconnected cognitive units (e.g., observation or attribution), together with the CogFlow framework that instills and optimizes this capability: a dataset of cognitive flows is curated via tree-structured planning that simulates the associative and progressive nature of human thought, supervised fine-tuning instills basic cognitive reasoning, and multi-objective reinforcement learning (optimizing both cognitive flow and response quality) guides self-improvement, substantially improving LLMs' performance in social decision-making.
Link: https://arxiv.org/abs/2509.22546
Authors: Jinfeng Zhou,Zheyu Chen,Shuai Wang,Quanyu Dai,Zhenhua Dong,Hongning Wang,Minlie Huang
Institutions: The CoAI Group, DCST, Tsinghua University; Huawei Noah's Ark Lab
Categories: Computation and Language (cs.CL)
Comments: Repository: this https URL
Abstract:LLMs trained for logical reasoning excel at step-by-step deduction to reach verifiable answers. However, this paradigm is ill-suited for navigating social situations, which induce an interpretive process of analyzing ambiguous cues that rarely yield a definitive outcome. To bridge this gap, we introduce Cognitive Reasoning, a paradigm modeled on human social cognition. It formulates the interpretive process into a structured cognitive flow of interconnected cognitive units (e.g., observation or attribution), which combine adaptively to enable effective social thinking and responses. We then propose CogFlow, a complete framework that instills this capability in LLMs. CogFlow first curates a dataset of cognitive flows by simulating the associative and progressive nature of human thought via tree-structured planning. After instilling the basic cognitive reasoning capability via supervised fine-tuning, CogFlow adopts reinforcement learning to enable the model to improve itself via trial and error, guided by a multi-objective reward that optimizes both cognitive flow and response quality. Extensive experiments show that CogFlow effectively enhances the social cognitive capabilities of LLMs, and even humans, leading to more effective social decision-making.
[NLP-22] Does AI Coaching Prepare us for Workplace Negotiations?
Link: https://arxiv.org/abs/2509.22545
Authors: Veda Duddu,Jash Rajesh Parekh,Andy Mao,Hanyi Min,Ziang Xiao,Vedant Das Swain,Koustuv Saha
Institutions: University of Illinois Urbana-Champaign; Johns Hopkins University; New York University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
[NLP-23] InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Link: https://arxiv.org/abs/2509.22536
Authors: Wenjun Wang,Shuo Cai,Congkai Xie,Mingfa Feng,Yiming Zhang,Zhen Li,Kejing Yang,Ming Li,Jiannong Cao,Yuan Xie,Hongxia Yang
Institutions: The Hong Kong Polytechnic University; InfiX.ai; The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-24] We Think Therefore We Align LLMs to Helpful Harmless and Honest Before They Go Wrong
Link: https://arxiv.org/abs/2509.22510
Authors: Gautam Siddharth Kashyap,Mark Dras,Usman Naseem
Institutions: Macquarie University
Categories: Computation and Language (cs.CL)
Comments:
[NLP-25] Representing LLMs in Prompt Semantic Task Space EMNLP2025
Link: https://arxiv.org/abs/2509.22506
Authors: Idan Kashani,Avi Mendelson,Yaniv Nemcovsky
Institutions: Technion - Israel Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to Findings of the Association for Computational Linguistics: EMNLP 2025
[NLP-26] Mental Health Impacts of AI Companions: Triangulating Social Media Quasi-Experiments User Perspectives and Relational Theory
[Quick Read]: This paper examines the unclear psychosocial effects of AI companion chatbots (AICCs) powered by generative AI, in particular their mixed impact on wellbeing and emotional expression. The key to the solution is a mixed-methods design: a quasi-experimental study of large-scale longitudinal Reddit data using stratified propensity score matching and Difference-in-Differences regression to identify the relationship between AICC use and changes in affective language, combined with 15 semi-structured interviews analyzed thematically through Knapp's relationship development model, revealing trajectories of initiation, escalation, and bonding. Triangulating the two shows that AICCs can provide emotional validation and social rehearsal but also carry risks of over-reliance; the paper therefore offers design implications: scaffold healthy boundaries, support mindful engagement, support disclosure without dependency, and surface relationship stages, so as to maximize psychosocial benefits while mitigating harm.
Link: https://arxiv.org/abs/2509.22505
Authors: Yunhao Yuan,Jiaxun Zhang,Talayeh Aledavood,Renwen Zhang,Koustuv Saha
Institutions: Aalto University; University of Illinois Urbana-Champaign; Nanyang Technological University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Applications (stat.AP)
Comments:
Abstract:AI-powered companion chatbots (AICCs) such as Replika are increasingly popular, offering empathetic interactions, yet their psychosocial impacts remain unclear. We examined how engaging with AICCs shaped wellbeing and how users perceived these experiences. First, we conducted a large-scale quasi-experimental study of longitudinal Reddit data, applying stratified propensity score matching and Difference-in-Differences regression. Findings revealed mixed effects – greater affective and grief expression, readability, and interpersonal focus, alongside increases in language about loneliness and suicidal ideation. Second, we complemented these results with 15 semi-structured interviews, which we thematically analyzed and contextualized using Knapp’s relationship development model. We identified trajectories of initiation, escalation, and bonding, wherein AICCs provided emotional validation and social rehearsal but also carried risks of over-reliance and withdrawal. Triangulating across methods, we offer design implications for AI companions that scaffold healthy boundaries, support mindful engagement, support disclosure without dependency, and surface relationship stages – maximizing psychosocial benefits while mitigating risks.
[NLP-27] JGU Mainz's Submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: MT and QA
[Quick Read]: This paper addresses machine translation and question answering (QA) for resource-limited Slavic languages (Ukrainian, Upper Sorbian, and Lower Sorbian) with LLM-based systems. The key elements are: for each language, jointly fine-tuning a Qwen2.5-3B-Instruct model for both translation and QA with parameter-efficient fine-tuning (PEFT); integrating additional translation data and multiple-choice QA data to improve generalization; adding retrieval-augmented generation (RAG) for Ukrainian QA; and using ensembling for Upper and Lower Sorbian QA for robustness. Experiments show the models outperform the baseline on both tasks.
Link: https://arxiv.org/abs/2509.22490
Authors: Hossain Shaikh Saadi,Minh Duc Bui,Mario Sanz-Guerrero,Katharina von der Wense
Institutions: Johannes Gutenberg University Mainz; University of Colorado Boulder
Categories: Computation and Language (cs.CL)
Comments: WMT 25 Shared Task LLMs with Limited Resources for Slavic Languages: MT and QA
Abstract:This paper presents the JGU Mainz submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: Machine Translation and Question Answering, focusing on Ukrainian, Upper Sorbian, and Lower Sorbian. For each language, we jointly fine-tune a Qwen2.5-3B-Instruct model for both tasks with parameter-efficient finetuning. Our pipeline integrates additional translation and multiple-choice question answering (QA) data. For Ukrainian QA, we further use retrieval-augmented generation. We also apply ensembling for QA in Upper and Lower Sorbian. Experiments show that our models outperform the baseline on both tasks.
[NLP-28] Exploring Solution Divergence and Its Effect on Large Language Model Problem Solving
Link: https://arxiv.org/abs/2509.22480
Authors: Hang Li,Kaiqi Yang,Yucheng Chu,Hui Liu,Jiliang Tang
Institutions: Michigan State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 11 figures
[NLP-29] NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use EMNLP2025
[Quick Read]: This paper addresses the difficulty of uncovering causal mechanisms in lexical semantic change: observational methods (corpus analysis, distributional semantic modeling) cannot capture the drivers of semantic change, while human experimental paradigms are hard to apply because of the extended diachronic processes involved. The key contribution is NeLLCom-Lex, a neural-agent framework that grounds agents in a real lexical system (e.g., English) and systematically manipulates their communicative needs to simulate semantic change within a single generation; experiments on a well-established color naming task show that agents trained to 'speak' an existing language reproduce human-like naming behavior to a remarkable extent, offering a controllable, quantifiable new route to understanding the mechanisms of semantic change.
Link: https://arxiv.org/abs/2509.22479
Authors: Yuqing Zhang,Ecesu Ürker,Tessa Verhoef,Gemma Boleda,Arianna Bisazza
Institutions: Center for Language and Cognition, University of Groningen; Department of Translation and Language Sciences, Universitat Pompeu Fabra; Leiden Institute of Advanced Computer Science, Leiden University; Catalan Institution for Research and Advanced Studies (ICREA)
Categories: Computation and Language (cs.CL)
Comments: Findings of EMNLP 2025
Abstract:Lexical semantic change has primarily been investigated with observational and experimental methods; however, observational methods (corpus analysis, distributional semantic modeling) cannot get at causal mechanisms, and experimental paradigms with humans are hard to apply to semantic change due to the extended diachronic processes involved. This work introduces NeLLCom-Lex, a neural-agent framework designed to simulate semantic change by first grounding agents in a real lexical system (e.g. English) and then systematically manipulating their communicative needs. Using a well-established color naming task, we simulate the evolution of a lexical system within a single generation, and study which factors lead agents to: (i) develop human-like naming behavior and lexicons, and (ii) change their behavior and lexicons according to their communicative needs. Our experiments with different supervised and reinforcement learning pipelines show that neural agents trained to ‘speak’ an existing language can reproduce human-like patterns in color naming to a remarkable extent, supporting the further use of NeLLCom-Lex to elucidate the mechanisms of semantic change.
[NLP-30] Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning
[Quick Read]: This paper addresses the insufficient performance and lack of systematic evaluation of current large language models (LLMs) on legal tasks in multilingual, cross-jurisdictional, and adversarial settings. LLMs often score below 50% on legal reasoning benchmarks, far below their performance on general-purpose tasks such as XNLI, and suffer from prompt sensitivity and weak cross-lingual generalization. The key contribution is an open-source, modular evaluation pipeline supporting multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with an LLM-as-a-Judge protocol for human-aligned evaluation, used to comprehensively quantify performance on legal classification, summarization, open questions, and reasoning, as well as adversarial robustness under character- and word-level perturbations. The results show that despite improvements in newer models, deploying them reliably for critical multilingual legal applications remains challenging.
Link: https://arxiv.org/abs/2509.22472
Authors: Antreas Ioannou,Andreas Shiamishis,Nora Hollenstein,Nezihe Merve Gürel
Institutions: Delft University of Technology; University of Zurich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 39 pages, 36 figures. Code and evaluation pipeline available at this https URL
Abstract:In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta’s LLaMA, OpenAI’s ChatGPT, Google’s Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability is also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same task. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.
zh
[NLP-31] IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method
【速读】: 该论文旨在解决高阶数值方法在Transformer模型中引入的性能-效率权衡问题,即虽然高阶方法(如PCformer)能提升任务表现,但其计算开销显著增加,且传统压缩技术(如知识蒸馏)可能损害模型性能。解决方案的关键在于提出一种基于迭代隐式欧拉法(Iterative Implicit Euler Transformer, IIET)的新架构,通过简化高阶方法实现更优性能与更强的可压缩性;同时引入迭代影响感知蒸馏(Iteration Influence-Aware Distillation, IIAD),利用灵活阈值控制蒸馏强度,从而有效平衡模型性能与推理效率。实验表明,IIET相比原生Transformer平均准确率提升2.65%,E-IIET版本推理开销降低55%仍保持99.4%原始精度,验证了该方案的有效性。
链接: https://arxiv.org/abs/2509.22463
作者: Xinyu Liu,Bei Li,Jiahao Liu,Junhao Ruan,Kechen Jiao,Hongyin Tang,Jingang Wang,Xiao Tong,Jingbo Zhu
机构: Northeastern University (东北大学); Meituan Inc. (美团); Tsinghua University (清华大学); NiuTrans Research (小牛翻译研究)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the Iterative Implicit Euler Transformer (IIET), which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce Iteration Influence-Aware Distillation (IIAD). Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65% over vanilla Transformers and 0.8% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55% while retaining 99.4% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6% over vanilla Transformer with comparable speed.
zh
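为便于理解“迭代隐式欧拉”的基本形式,下面给出一个示意性的 PyTorch 风格草图:隐式欧拉步 x_{t+1} = x_t + f(x_{t+1}) 以定点迭代近似求解。其中层结构、迭代次数等均为笔者假设,并非论文的原始实现:

```python
import torch
import torch.nn as nn

class IterativeImplicitEulerBlock(nn.Module):
    """示意:用定点迭代近似求解隐式欧拉步 x_{t+1} = x_t + f(x_{t+1})。"""
    def __init__(self, f: nn.Module, num_iters: int = 3):
        super().__init__()
        self.f = f              # 任意残差变换,例如一个 Transformer 子层
        self.num_iters = num_iters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x + self.f(x)       # 以显式欧拉结果作为迭代初值
        for _ in range(self.num_iters):
            y = x + self.f(y)   # 反复代入,逼近隐式解
        return y

# 用法示意:把一个前馈子层包装为迭代隐式欧拉块
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = IterativeImplicitEulerBlock(ffn, num_iters=3)
out = block(torch.randn(2, 16, 512))   # (batch, seq, hidden)
```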
[NLP-32] MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark
【速读】: 该论文旨在解决当前音频理解基准测试在复杂现实场景中的局限性问题,即现有基准多集中于静态或单一场景设置,难以刻画多说话者、事件动态演化及异构音频源交互的复杂情境。为此,作者提出了MDAR(Multi-Scene Dynamic Audio Reasoning)基准,其关键在于构建了一个包含3,000个精心设计的问答对与多样化音频片段的数据集,覆盖五类复杂推理任务和三种题型,能够系统评估模型在多场景、动态变化的音频环境中进行推理的能力。该基准为推动生成式AI(Generative AI)在音频理解领域的进步提供了新的评测标准和挑战。
链接: https://arxiv.org/abs/2509.22461
作者: Hui Li,Changhao Jiang,Hongyu Wang,Ming Zhang,Jiajun Sun,Zhixiong Yang,Yifei Cao,Shihan Dou,Xiaoran Fan,Baoyu Fan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); IEIT Systems Co Ltd
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 25 pages, 7 figures
点击查看摘要
Abstract:The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at this https URL.
zh
[NLP-33] Detecting (Un)answerability in Large Language Models with Linear Directions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在缺乏足够信息时仍会自信生成错误答案的问题,即“幻觉”(hallucination)问题,具体聚焦于可回答性检测(unanswerability detection),尤其是在抽取式问答(extractive question answering, QA)场景中判断给定文本是否包含回答问题所需的信息。解决方案的关键在于:通过在推理过程中对模型激活值进行加法扰动并测量其对模型拒绝回答行为的影响,识别出一个能有效捕捉未回答性的方向(direction);将隐藏层激活向量投影到该方向上可获得可靠的可回答性评分,并据此实现准确分类。该方法在多个开源LLM和基准数据集上表现优于基于提示(prompt-based)或分类器(classifier-based)的方法,且具备跨数据集的泛化能力,同时可扩展至由科学共识缺失或主观性导致的非可回答性情形。
链接: https://arxiv.org/abs/2509.22449
作者: Maor Juliet Lavi,Tova Milo,Mor Geva
机构: Blavatnik School of Computer Science and AI, Tel Aviv University (特拉维夫大学计算机科学与人工智能学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) often respond confidently to questions even when they lack the necessary information, leading to hallucinated answers. In this work, we study the problem of (un)answerability detection, focusing on extractive question answering (QA) where the model should determine if a passage contains sufficient information to answer a given question. We propose a simple approach for identifying a direction in the model’s activation space that captures unanswerability and uses it for classification. This direction is selected by applying activation additions during inference and measuring their impact on the model’s abstention behavior. We show that projecting hidden activations onto this direction yields a reliable score for (un)answerability classification. Experiments on two open-weight LLMs and four extractive QA benchmarks show that our method effectively detects unanswerable questions and generalizes better across datasets than existing prompt-based and classifier-based approaches. Moreover, the obtained directions extend beyond extractive QA to unanswerability that stems from factors, such as lack of scientific consensus and subjectivity. Last, causal interventions show that adding or ablating the directions effectively controls the abstention behavior of the model.
zh
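下面用一个极简的 NumPy 草图说明“把隐藏层激活投影到某个线性方向上、以投影值作为(不)可回答性分数”的思路;方向向量在论文中通过推理时的激活加法干预筛选得到,这里仅以随机向量和假设的阈值占位,属示意性说明:

```python
import numpy as np

def unanswerability_score(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    """把某一层的隐藏激活投影到“不可回答性”方向,得到标量分数。"""
    d = direction / np.linalg.norm(direction)
    return float(hidden_state @ d)

def classify(hidden_state: np.ndarray, direction: np.ndarray, threshold: float) -> str:
    """分数高于阈值则判为不可回答;阈值需在验证集上标定,此处为假设参数。"""
    score = unanswerability_score(hidden_state, direction)
    return "unanswerable" if score > threshold else "answerable"

# 接口演示:随机激活与随机方向,仅说明用法
h = np.random.randn(4096)
d = np.random.randn(4096)
print(classify(h, d, threshold=0.0))
```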
[NLP-34] Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
【速读】: 该论文试图解决神经网络(如Transformer)在应用最小描述长度(Minimum Description Length, MDL)原则时面临的模型复杂度度量缺乏普适性理论框架的问题。其解决方案的关键在于提出了一种渐近最优描述长度目标(asymptotically optimal description length objectives),该目标基于Kolmogorov复杂度理论,理论上保证了在模型资源约束趋于无穷时,任何数据集上都能实现接近最优的压缩性能(仅相差一个加性常数)。作者进一步通过证明Transformer具有计算通用性,建立了此类目标的存在性,并设计了一个基于自适应高斯混合先验的变分目标,使其具备可计算性和可微性。实证结果表明,该变分目标能选择出低复杂度且泛化能力强的解,但标准优化器难以从随机初始化中找到此类解,凸显了训练中的关键优化挑战。
链接: https://arxiv.org/abs/2509.22445
作者: Peter Shaw,James Cohan,Jacob Eisenstein,Kristina Toutanova
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The Minimum Description Length (MDL) principle offers a formal framework for applying Occam’s razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
zh
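论文所讨论的描述长度目标可用经典的“两部编码”MDL 形式示意(记号为笔者补充的通用写法,具体的自适应高斯混合先验与变分构造见原文):

```latex
% 两部编码的描述长度:模型自身的编码长度 + 给定模型后数据的编码长度
L(\theta, D) \;=\; \underbrace{-\log p(\theta)}_{\text{model cost}}
\;+\; \underbrace{-\log p(D \mid \theta)}_{\text{data cost}}
```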
[NLP-35] Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在处理图表类复杂视觉信息时存在的“表面性能”问题,即模型看似在图表理解任务中表现优异,但其性能可能源于对视觉模式、知识记忆或语言先验的依赖,而非真正理解图表中的符号信息与逻辑关系。解决方案的关键在于提出Chimera这一综合性测试套件,包含7,500个来自维基百科的高质量图表,每个图表均配有语义三元组标注和多层次问题,用于系统评估模型在实体识别、关系理解、知识锚定和视觉推理四个核心维度上的能力,并识别三种典型捷径行为:视觉记忆捷径、知识回忆捷径和Clever-Hans捷径。通过在15个开源VLM上进行评估,研究发现模型性能主要由捷径驱动,揭示了现有模型缺乏对图表的真实理解能力,从而强调了构建更鲁棒的评估协议以推动真正图示理解能力发展的必要性。
链接: https://arxiv.org/abs/2509.22437
作者: Ziheng Chi,Yifan Hou,Chenxi Pang,Shaobo Cui,Mubashara Akhtar,Mrinmaya Sachan
机构: ETH Zürich(苏黎世联邦理工学院); Google DeepMind(谷歌深度思维); EPFL(洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Our code ( this https URL ) and data ( this https URL ) are publicly available
点击查看摘要
Abstract:Diagrams convey symbolic information in a visual format rather than a linear stream of words, making them especially challenging for AI models to process. While recent evaluations suggest that vision-language models (VLMs) perform well on diagram-related benchmarks, their reliance on knowledge, reasoning, or modality shortcuts raises concerns about whether they genuinely understand and reason over diagrams. To address this gap, we introduce Chimera, a comprehensive test suite comprising 7,500 high-quality diagrams sourced from Wikipedia; each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions designed to assess four fundamental aspects of diagram comprehension: entity recognition, relation understanding, knowledge grounding, and visual reasoning. We use Chimera to measure the presence of three types of shortcuts in visual question answering: (1) the visual-memorization shortcut, where VLMs rely on memorized visual patterns; (2) the knowledge-recall shortcut, where models leverage memorized factual knowledge instead of interpreting the diagram; and (3) the Clever-Hans shortcut, where models exploit superficial language patterns or priors without true comprehension. We evaluate 15 open-source VLMs from 7 model families on Chimera and find that their seemingly strong performance largely stems from shortcut behaviors: visual-memorization shortcuts have slight impact, knowledge-recall shortcuts play a moderate role, and Clever-Hans shortcuts contribute significantly. These findings expose critical limitations in current VLMs and underscore the need for more robust evaluation protocols that benchmark genuine comprehension of complex visual inputs (e.g., diagrams) rather than question-answering shortcuts.
zh
[NLP-36] What Is The Political Content in LLMs’ Pre- and Post-Training Data?
链接: https://arxiv.org/abs/2509.22367
作者: Tanise Ceron,Dmitry Nikolaev,Dominik Stammbach,Debora Nozza
机构: Bocconi University (博科尼大学); University of Manchester (曼彻斯特大学); Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages, under review
[NLP-37] Exploratory Semantic Reliability Analysis of Wind Turbine Maintenance Logs using Large Language Models
链接: https://arxiv.org/abs/2509.22366
作者: Max Malyi,Jonathan Shek,Andre Biscaya
机构: Institute for Energy Systems, School of Engineering, The University of Edinburgh (爱丁堡大学能源系统研究所,工程学院); Nadara (纳达拉)
类目: Computation and Language (cs.CL)
备注:
[NLP-38] CHRONOBERG: Capturing Language Evolution and Temporal Awareness in Foundation Models
链接: https://arxiv.org/abs/2509.22360
作者: Niharika Hegde,Subarnaduti Paul,Lars Joel-Frey,Manuel Brack,Kristian Kersting,Martin Mundt,Patrick Schramowski
机构: German Research Center for Artificial Intelligence (DFKI); University of Bremen; Hessian Center for AI (hessian.AI); TU Darmstadt; CERTAIN
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
[NLP-39] Conversational Implicatures: Modelling Relevance Theory Probabilistically
【速读】: 该论文旨在探讨如何将贝叶斯方法应用于关联理论(Relevance Theory)的语用学研究,特别是解决通过会话隐含意义(conversational implicatures)传递隐含意义这一典型语用现象。其解决方案的关键在于借鉴理性言语行为理论(Rational Speech Act theory)的框架,将关联理论中的认知-语用推理过程形式化为贝叶斯更新机制,从而在概率计算的基础上建模说话者与听话者之间的递归推理过程,实现对隐含意义生成与理解的量化解释。
链接: https://arxiv.org/abs/2509.22354
作者: Christoph Unger,Hendrik Buschmeier
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advances in Bayesian probability theory and its application to cognitive science, in combination with the development of a new generation of computational tools and methods for probabilistic computation, have led to a ‘probabilistic turn’ in pragmatics and semantics. In particular, the framework of Rational Speech Act theory has been developed to model broadly Gricean accounts of pragmatic phenomena in Bayesian terms, starting with fairly simple reference games and covering ever more complex communicative exchanges such as verbal syllogistic reasoning. This paper explores in which way a similar Bayesian approach might be applied to relevance-theoretic pragmatics (Sperber & Wilson, 1995) by studying a paradigmatic pragmatic phenomenon: the communication of implicit meaning by way of (conversational) implicatures.
zh
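作为参照,理性言语行为(RSA)框架中“字面听者-务实说话者-务实听者”的递归贝叶斯更新可写成如下极简 Python 草图;其中的真值矩阵、先验与理性参数均为虚构的参考游戏式示例,仅用于说明计算流程:

```python
import numpy as np

# 示意:3 个话语 × 3 个可能世界的字面真值矩阵(虚构数据)
truth = np.array([[1., 1., 0.],
                  [0., 1., 1.],
                  [0., 0., 1.]])
prior = np.ones(3) / 3      # 世界的先验分布
alpha = 1.0                 # 说话者理性参数

def norm(m):                # 按行归一化为概率分布
    return m / m.sum(axis=1, keepdims=True)

L0 = norm(truth * prior)                        # 字面听者 P(w | u)
S1 = norm(np.exp(alpha * np.log(L0.T + 1e-9)))  # 务实说话者 P(u | w)
L1 = norm(S1.T * prior)                         # 务实听者 P(w | u)
print(np.round(L1, 3))
```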
[NLP-40] The InviTE Corpus: Annotating Invectives in Tudor English Texts for Computational Modeling
【速读】: 该论文旨在解决如何运用自然语言处理(Natural Language Processing, NLP)技术来支持历史研究,特别是针对都铎时期英格兰宗教改革背景下宗教辱骂性语言(religious invectives)的分析问题。其解决方案的关键在于构建了一个名为InviTE的语料库——包含近2000条早期现代英语(Early Modern English, EModE)句子,并通过专家标注标记出其中的辱骂性语言;同时比较了微调后的BERT模型与零样本提示指令微调的大语言模型(Large Language Models, LLMs)在辱骂检测任务中的表现,结果表明在历史数据上预训练并微调的模型具有显著优势。
链接: https://arxiv.org/abs/2509.22345
作者: Sophie Spliethoff,Sanne Hoeken,Silke Schwandt,Sina Zarrieß,Özge Alaçam
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this paper, we aim at the application of Natural Language Processing (NLP) techniques to historical research endeavors, particularly addressing the study of religious invectives in the context of the Protestant Reformation in Tudor England. We outline a workflow spanning from raw data, through pre-processing and data selection, to an iterative annotation process. As a result, we introduce the InviTE corpus – a corpus of almost 2000 Early Modern English (EModE) sentences, which are enriched with expert annotations regarding invective language throughout 16th-century England. Subsequently, we assess and compare the performance of fine-tuned BERT-based models and zero-shot prompted instruction-tuned large language models (LLMs), which highlights the superiority of models pre-trained on historical data and fine-tuned to invective detection.
zh
[NLP-41] Transformers Can Learn Connectivity in Some Graphs but Not Others
链接: https://arxiv.org/abs/2509.22343
作者: Amit Roy,Abulhair Saparov
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: Under Review
[NLP-42] Advancing Natural Language Formalization to First Order Logic with Fine-tuned LLMs
【速读】: 该论文旨在解决自然语言到一阶逻辑(First-Order Logic, FOL)自动翻译的难题,这是知识表示与形式化方法中的关键挑战。解决方案的关键在于系统性评估微调后的大型语言模型(Large Language Models, LLMs),重点比较编码器-解码器架构(如Flan-T5)与仅解码器架构的性能差异,并引入词汇扩展、谓词条件控制和多语言训练等策略。实验表明,使用谓词列表可使准确率提升15–20%,且T5类模型在性能上优于更大规模的仅解码器LLM,同时具备对未见过的逻辑论证(FOLIO数据集)的良好泛化能力,揭示出结构化逻辑翻译的鲁棒性,而谓词提取仍是主要瓶颈。
链接: https://arxiv.org/abs/2509.22338
作者: Felix Vossel,Till Mossakowski,Björn Gehrke
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 7 tables, accepted at the International Joint Conference on Learning and Reasoning (IJCLR 2025)
点击查看摘要
Abstract:Automating the translation of natural language to first-order logic (FOL) is crucial for knowledge representation and formal methods, yet remains challenging. We present a systematic evaluation of fine-tuned LLMs for this task, comparing architectures (encoder-decoder vs. decoder-only) and training strategies. Using the MALLS and Willow datasets, we explore techniques like vocabulary extension, predicate conditioning, and multilingual training, introducing metrics for exact match, logical equivalence, and predicate alignment. Our fine-tuned Flan-T5-XXL achieves 70% accuracy with predicate lists, outperforming GPT-4o and even the DeepSeek-R1-0528 model with CoT reasoning ability as well as symbolic systems like ccg2lambda. Key findings show: (1) predicate availability boosts performance by 15-20%, (2) T5 models surpass larger decoder-only LLMs, and (3) models generalize to unseen logical arguments (FOLIO dataset) without specific training. While structural logic translation proves robust, predicate extraction emerges as the main bottleneck.
zh
[NLP-43] Can Synthetic Query Rewrites Capture User Intent Better than Humans in Retrieval-Augmented Generation?
【速读】: 该论文旨在解决多轮检索增强生成(Multi-turn RAG)系统中因用户查询存在口语化省略和指代模糊而导致的检索与生成效果不佳的问题。传统依赖人工标注的查询重写方法受限于标注者表达能力和理解深度,难以准确捕捉真实场景下的用户意图,从而造成用户意图与系统响应之间的偏差。解决方案的关键在于提出SynRewrite模型,其核心是基于合成数据驱动的查询重写机制:首先利用GPT-4o根据对话历史、当前查询、正样本文档及答案生成高质量合成重写查询,构建训练数据集;随后用Flan-T5模型在该数据集上微调以实现从对话历史和原始查询到重写查询的映射;最后通过DPO算法引入生成器反馈进一步优化重写器,提升端到端任务性能。实验证明,SynRewrite在TopiOCQA和QRECC数据集上均优于人工重写,在检索和生成两个维度上展现出更强的意图对齐能力。
链接: https://arxiv.org/abs/2509.22325
作者: JiaYing Zheng,HaiNan Zhang,Liang Pang,YongXin Tong,ZhiMing Zheng
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 10 pages, 6 figures
点击查看摘要
Abstract:Multi-turn RAG systems often face queries with colloquial omissions and ambiguous references, posing significant challenges for effective retrieval and generation. Traditional query rewriting relies on human annotators to clarify queries, but due to limitations in annotators’ expressive ability and depth of understanding, manually rewritten queries often diverge from those needed in real-world RAG systems, resulting in a gap between user intent and system response. We observe that high-quality synthetic queries can better bridge this gap, achieving superior performance in both retrieval and generation compared to human rewrites. This raises an interesting question: Can rewriting models trained on synthetic queries better capture user intent than human annotators? In this paper, we propose SynRewrite, a synthetic data-driven query rewriting model to generate high-quality synthetic rewrites more aligned with user intent. To construct training data, we prompt GPT-4o with dialogue history, current queries, positive documents, and answers to synthesize high-quality rewrites. A Flan-T5 model is then finetuned on this dataset to map dialogue history and queries to synthetic rewrites. Finally, we further enhance the rewriter using the generator’s feedback through the DPO algorithm to boost end-task performance. Experiments on TopiOCQA and QRECC datasets show that SynRewrite consistently outperforms human rewrites in both retrieval and generation tasks. Our results demonstrate that synthetic rewrites can serve as a scalable and effective alternative to human annotations.
zh
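其“合成重写 → 监督微调 → DPO 反馈优化”三阶段流水线可用如下示意性伪代码概括;提示词、函数名与 llm 接口均为笔者假设,并非论文原始实现:

```python
def synthesize_rewrite(llm, history, query, positive_doc, answer):
    """阶段一(示意):用强模型依据对话历史、问题、相关文档与答案合成自包含的重写查询。"""
    prompt = ("请将当前问题改写为无指代歧义、可独立检索的查询。\n"
              f"对话历史: {history}\n当前问题: {query}\n相关文档: {positive_doc}\n答案: {answer}")
    return llm.generate(prompt)

def build_sft_example(history, query, rewrite):
    """阶段二(示意):构造 (history, query) -> rewrite 的序列到序列微调样本。"""
    return {"input": f"history: {history} query: {query}", "target": rewrite}

def build_dpo_pair(context, rewrite_good, rewrite_bad):
    """阶段三(示意):按生成器端到端反馈的好坏构造偏好对,供 DPO 训练重写器。"""
    return {"prompt": context, "chosen": rewrite_good, "rejected": rewrite_bad}
```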
[NLP-44] PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning
链接: https://arxiv.org/abs/2509.22315
作者: Hieu Tran,Zonghai Yao,Nguyen Luong Tran,Zhichao Yang,Feiyun Ouyang,Shuo Han,Razieh Rahimi,Hong Yu
机构: Manning College of Information and Computer Sciences, University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校信息与计算机科学学院); Miner School of Computer and Information Sciences, University of Massachusetts Lowell (马萨诸塞大学洛厄尔分校计算机与信息科学学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages
[NLP-45] Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)模型在仇恨言论检测任务中因训练数据中的社会偏见而产生的公平性问题,同时应对模型黑箱特性导致的偏见识别与缓解困难。其解决方案的关键在于系统性地评估输入层面的可解释性(input-based explanations)在三个核心维度上的作用:识别偏见预测、选择公平模型以及训练阶段的偏见缓解。研究发现,输入解释能有效检测偏见预测并作为训练期间减少偏见的监督信号,但在从多个候选模型中选出公平模型时则不可靠。
链接: https://arxiv.org/abs/2509.22291
作者: Yifan Wang,Mayank Jobanputra,Ji-Ung Lee,Soyoung Oh,Isabel Valera,Vera Demberg
机构: Saarland University (萨尔兰大学); Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Max Planck Institute for Software Systems (马克斯·普朗克软件系统研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
zh
[NLP-46] InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗领域应用中的三大核心问题:一是通用型MLLM缺乏医学专业知识,导致回答不确定或产生幻觉;二是从先进模型中进行知识蒸馏难以有效捕捉放射学和药理学等领域的专有知识;三是使用大规模医疗数据持续预训练时计算成本过高,效率低下。解决方案的关键在于:首先,构建了两个面向医疗场景的专用MLLM——InfiMed-Foundation-1.7B与InfiMed-Foundation-4B,通过融合高质量通用与医学多模态数据,并提出五维质量评估框架以筛选高质医疗数据集;其次,采用低至高分辨率图像渐进式训练及多模态序列打包策略提升训练效率,实现海量医疗数据的有效整合;最后,设计三阶段监督微调流程,确保复杂医疗任务中领域知识的高效提取与利用。实验表明,所提模型在医学视觉问答与诊断任务上显著优于现有主流模型,验证了方案的有效性与先进性。
链接: https://arxiv.org/abs/2509.22261
作者: Guanghao Zhu,Zhitian Hou,Zeyu Liu,Zhijie Sang,Congkai Xie,Hongxia Yang
机构: The Hong Kong Polytechnic University (香港理工大学); Sun Yat-sen University (中山大学); InfiX.ai
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have shown remarkable potential in various domains, yet their application in the medical field is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in radiology and pharmacology. Additionally, the computational cost of continual pretraining with large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combined high-quality general-purpose and medical multimodal data and proposed a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ low-to-high image resolution and multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. The InfiMed-Foundation-4B model is available at this https URL.
zh
[NLP-47] Beyond Textual Context: Structural Graph Encoding with Adaptive Space Alignment to alleviate the hallucination of LLMs
链接: https://arxiv.org/abs/2509.22251
作者: Yifang Zhang,Pengfei Duan,Yiwen Yang,Shengwu Xiong
机构: Wuhan University of Technology (武汉理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures
[NLP-48] Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全方法依赖非系统性、临时性的分类体系,难以有效保障现代LLM复杂行为安全的问题。其核心解决方案是从法律合规角度重构LLM安全定义,提出“安全合规”(safety compliance)范式,将欧盟《人工智能法案》(EU AI Act)和《通用数据保护条例》(GDPR)等成熟法律框架作为安全标准,并构建一个基于群体策略优化(Group Policy Optimization, GRPO)训练的合规推理器(Compliance Reasoner),使LLM能够依据法律规范进行安全决策。实验表明,该方法在新设计的安全合规基准上显著优于基线,对EU AI Act和GDPR的平均提升分别达到+10.45%和+11.85%。
链接: https://arxiv.org/abs/2509.22250
作者: Wenbin Hu,Huihao Jing,Haochen Shi,Haoran Li,Yangqiu Song
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The proliferation of Large Language Models (LLMs) has demonstrated remarkable capabilities, elevating the critical importance of LLM safety. However, existing safety methods rely on ad-hoc taxonomy and lack a rigorous, systematic protection, failing to ensure safety for the nuanced and complex behaviors of modern LLM systems. To address this problem, we solve LLM safety from legal compliance perspectives, named safety compliance. In this work, we posit relevant established legal frameworks as safety standards for defining and measuring safety compliance, including the EU AI Act and GDPR, which serve as core legal frameworks for AI safety and data security in Europe. To bridge the gap between LLM safety and legal compliance, we first develop a new benchmark for safety compliance by generating realistic LLM safety scenarios seeded with legal statutes. Subsequently, we align Qwen3-8B using Group Policy Optimization (GRPO) to construct a safety reasoner, Compliance Reasoner, which effectively aligns LLMs with legal standards to mitigate safety risks. Our comprehensive experiments demonstrate that the Compliance Reasoner achieves superior performance on the new benchmark, with average improvements of +10.45% for the EU AI Act and +11.85% for GDPR.
zh
[NLP-49] FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction
链接: https://arxiv.org/abs/2509.22243
作者: Yuan Ge,Saihan Chen,Jingqi Xiao,Xiaoqian Liu,Tong Xiao,Yan Xiang,Zhengtao Yu,Jingbo Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
[NLP-50] FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding
【速读】: 该论文旨在解决当前代码生成评估基准在衡量“vibe coding”(即通过自然语言与编码代理交互的新型软件开发范式)能力时存在的显著不足问题。现有基准要么依赖代码级规范,要么仅关注问题求解,未能有效评估在真实场景中基于抽象自然语言描述实现新功能的能力。其解决方案的关键在于提出FeatBench——一个专注于特征实现的新型基准,其核心创新包括:1)纯自然语言提示(无代码或结构提示),2)多层级过滤与自动化演进的数据收集流程以保障质量并防止数据污染,3)包含Fail-to-Pass(F2P)和Pass-to-Pass(P2P)测试用例以确保正确性并防止回归,4)覆盖多样化应用领域的代码库以贴近实际场景。实证表明,当前主流大语言模型(LLM)在该任务上的最高成功率为29.94%,凸显了该问题的挑战性,并揭示了“激进实现”策略虽可能导致严重失败,但也可能带来更优的设计结果。
链接: https://arxiv.org/abs/2509.22237
作者: Haorui Chen,Chengze Li,Jia Li
机构: Tsinghua University (清华大学); University of Electronic Science and Technology of China (电子科技大学); Nanjing University (南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as “vibe coding,” where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent’s vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for “aggressive implementation,” a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
zh
[NLP-51] In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)向小模型迁移推理能力时出现的性能下降问题,即通过监督微调(Supervised Fine-Tuning, SFT)进行知识蒸馏时,尽管使用高质量教师模型生成的推理轨迹(Reasoning Traces),学生模型的性能反而显著退化。研究表明,这一现象的根本原因在于分布偏移(Distributional Misalignment):教师模型输出的推理轨迹中包含学生模型概率极低的token,超出了其内部表示容量,形成学习障碍而非有效指导。解决方案的关键在于提出逆向推测解码(Reverse Speculative Decoding, RSD),其核心机制是让教师模型仅提供候选token,而由学生模型根据自身概率分布自主决定是否接受,从而过滤掉低概率token,生成更适配学生模型的推理轨迹。实验表明,基于RSD生成的训练数据可使Qwen3-0.6B模型在多个推理基准上提升4.9%,而直接蒸馏则导致平均性能下降20.5%,验证了分布对齐对推理能力迁移的重要性。
链接: https://arxiv.org/abs/2509.22230
作者: Jaehoon Kim,Kwangwook Seo,Dongha Lee
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Transferring reasoning capabilities from larger language models to smaller ones through supervised fine-tuning often fails counterintuitively, with performance degrading despite access to high-quality teacher demonstrations. We identify that this failure stems from distributional misalignment: reasoning traces from larger models contain tokens that are low probability under the student’s distribution, exceeding the internal representation capacity of smaller architectures and creating learning barriers rather than helpful guidance. We propose Reverse Speculative Decoding (RSD), a mechanism for generating student-friendly reasoning traces in which the teacher model proposes candidate tokens but the student model determines acceptance based on its own probability distributions, filtering low probability tokens. When applied to Qwen3-0.6B, direct distillation of s1K-1.1 reasoning trace data degrades average performance across major reasoning benchmarks by 20.5%, while the same model trained on RSD-generated reasoning traces achieves meaningful improvements of 4.9%. Our analysis reveals that low probability tokens constitute the critical bottleneck in reasoning ability transfer. However, cross-model experiments demonstrate that RSD traces are model-specific rather than universally applicable, indicating that distributional alignment must be tailored for each student architecture’s unique internal representation.
zh
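逆向推测解码(RSD)的核心机制——教师提议候选 token、学生依据自身概率分布决定是否接受——可用如下示意性草图表达;模型接口(HuggingFace 风格的 .logits)、贪心选取与概率阈值均为笔者假设:

```python
import torch

@torch.no_grad()
def reverse_speculative_decode(teacher, student, input_ids, max_new_tokens=256, min_prob=1e-3):
    """示意:教师逐步提议下一个 token;若其在学生分布下概率过低,则改用学生自己的选择。"""
    ids = input_ids
    for _ in range(max_new_tokens):
        t_logits = teacher(ids).logits[:, -1, :]
        s_logits = student(ids).logits[:, -1, :]
        t_token = t_logits.argmax(dim=-1, keepdim=True)        # 教师的候选 token
        s_probs = torch.softmax(s_logits, dim=-1)
        accept = s_probs.gather(-1, t_token) >= min_prob       # 学生按自身概率决定是否接受
        s_token = s_logits.argmax(dim=-1, keepdim=True)        # 不接受时退回学生自己的 token
        next_token = torch.where(accept, t_token, s_token)
        ids = torch.cat([ids, next_token], dim=-1)
    return ids
```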
[NLP-52] Thinking in Many Modes: How Composite Reasoning Elevates Large Language Model Performance with Limited Data
链接: https://arxiv.org/abs/2509.22224
作者: Zishan Ahmad,Saisubramaniam Gopalakrishnan
机构: PhiLabs, Quantiphi Inc (Quantiphi公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures
[NLP-53] StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
【速读】: 该论文旨在解决当前语义语音分词器(semantic speech tokenizer)在面对无关语义的声学扰动时表现出的脆弱性问题,即在高信噪比(SNR)下语音仍可清晰理解时,其输出的标记序列仍可能发生剧烈变化,从而增加下游大语言模型(LLM)的学习负担。解决方案的关键在于提出一种名为StableToken的新分词器,其核心创新是采用共识驱动机制:通过多分支并行处理音频,并利用强大的位级投票机制融合各分支表示,生成单一且稳定的标记序列,从而显著降低单位编辑距离(UED),提升语音大模型(SpeechLLM)的鲁棒性。
链接: https://arxiv.org/abs/2509.22220
作者: Yuhan Song,Linhao Zhang,Chuhan Wu,Aiwei Liu,Wei Jia,Houfeng Wang,Xiao Zhou
机构: Peking University (北京大学); Tencent Inc (腾讯公司)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
zh
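其“多分支并行编码 + 位级多数表决”的共识机制可用如下 NumPy 草图示意;分支数、每个 token 的比特位宽以及由比特串到 token id 的映射方式均为假设参数:

```python
import numpy as np

def bitwise_vote(branch_bits: np.ndarray) -> np.ndarray:
    """branch_bits 形状为 (分支数, 序列长度, 比特数) 的 0/1 数组,逐位多数表决得到稳定比特串。"""
    votes = branch_bits.sum(axis=0)
    return (votes * 2 >= branch_bits.shape[0]).astype(np.int64)

def bits_to_tokens(bits: np.ndarray) -> np.ndarray:
    """把每个时间步的比特串解释为一个整数 token id(示意性映射)。"""
    weights = 2 ** np.arange(bits.shape[-1])[::-1]
    return bits @ weights

# 示意:3 个分支、长度为 4 的序列、每个 token 8 比特
branch_bits = np.random.randint(0, 2, size=(3, 4, 8))
stable_tokens = bits_to_tokens(bitwise_vote(branch_bits))
print(stable_tokens)
```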
[NLP-54] Question-Driven Analysis and Synthesis: Building Interpretable Thematic Trees with LLM s for Text Clustering and Controllable Generation
【速读】: 该论文旨在解决无监督文本分析中,尤其是在数据稀缺领域下,传统主题模型因依赖关键词列表而难以解释且语义连贯性差的问题。其核心解决方案是提出递归主题划分(Recursive Thematic Partitioning, RTP)框架,该框架利用大语言模型(Large Language Models, LLMs)交互式构建二叉树结构,每个节点为一个自然语言问题,用以语义上分割数据,从而形成逻辑明确、完全可解释的主题分类体系。RTP的关键创新在于将主题建模从统计模式发现转变为知识驱动的语义分析,并通过结构化路径作为可控提示(prompt),进一步赋能生成式AI(Generative AI)的特征模仿与内容合成。
链接: https://arxiv.org/abs/2509.22211
作者: Tiago Fernandes Tavares
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Unsupervised analysis of text corpora is challenging, especially in data-scarce domains where traditional topic models struggle. While these models offer a solution, they typically describe clusters with lists of keywords that require significant manual effort to interpret and often lack semantic coherence. To address this critical interpretability gap, we introduce Recursive Thematic Partitioning (RTP), a novel framework that leverages Large Language Models (LLMs) to interactively build a binary tree. Each node in the tree is a natural language question that semantically partitions the data, resulting in a fully interpretable taxonomy where the logic of each cluster is explicit. Our experiments demonstrate that RTP’s question-driven hierarchy is more interpretable than the keyword-based topics from a strong baseline like BERTopic. Furthermore, we establish the quantitative utility of these clusters by showing they serve as powerful features in downstream classification tasks, particularly when the data’s underlying themes correlate with the task labels. RTP introduces a new paradigm for data exploration, shifting the focus from statistical pattern discovery to knowledge-driven thematic analysis. Furthermore, we demonstrate that the thematic paths from the RTP tree can serve as structured, controllable prompts for generative models. This transforms our analytical framework into a powerful tool for synthesis, enabling the consistent imitation of specific characteristics discovered in the source corpus.
zh
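递归主题划分(RTP)“由 LLM 提出是/否问题、按回答二分文档、再对子集递归”的流程,可用如下示意性伪代码概括;llm 的调用接口、深度与规模阈值等停止条件均为假设:

```python
def recursive_thematic_partition(llm, docs, max_depth=4, min_size=10):
    """示意:每个节点由 LLM 提出一个是/否问题,把文档划分为两个子集并递归构建二叉树。"""
    if max_depth == 0 or len(docs) < min_size:
        return {"leaf": docs}
    question = llm.propose_question(docs)          # 例如:"这段文本是否在讨论产品价格?"
    yes_docs = [d for d in docs if llm.answer(question, d) == "yes"]
    no_docs = [d for d in docs if llm.answer(question, d) == "no"]
    if not yes_docs or not no_docs:                # 无法有效二分时提前停止
        return {"leaf": docs}
    return {
        "question": question,
        "yes": recursive_thematic_partition(llm, yes_docs, max_depth - 1, min_size),
        "no": recursive_thematic_partition(llm, no_docs, max_depth - 1, min_size),
    }
```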
[NLP-55] The Outputs of Large Language Models are Meaningless
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)的输出是否具有意义(meaning)。作者认为,尽管LLM的输出在表面上看似有意义,但其本质缺乏真正的语义基础。解决方案的关键在于提出一个两前提论证:首先,只有具备特定类型的目的性意图(intentionality),LLM的输出才可能拥有字面意义上的意义;其次,LLM本身无法合理地具备此类意图。该论证驳斥了诸如语义外部主义(semantic externalism)和语义内部主义(semantic internalism)等替代性解释,最终指出即使LLM输出无真正意义,它们仍可能因表征功能而显得有意义,并可用于获取真信念甚至知识。
链接: https://arxiv.org/abs/2509.22206
作者: Anandi Hattiangadi,Anders J. Schoubye
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 2 figures, forthcoming in Herman Cappelen and Rachel Sterken, eds. Communicating with AI: Philosophical Perspectives. Oxford: Oxford University Press
点击查看摘要
Abstract:In this paper, we offer a simple argument for the conclusion that the outputs of large language models (LLMs) are meaningless. Our argument is based on two key premises: (a) that certain kinds of intentions are needed in order for LLMs’ outputs to have literal meanings, and (b) that LLMs cannot plausibly have the right kinds of intentions. We defend this argument from various types of responses, for example, the semantic externalist argument that deference can be assumed to take the place of intentions and the semantic internalist argument that meanings can be defined purely in terms of intrinsic relations between concepts, such as conceptual roles. We conclude the paper by discussing why, even if our argument is sound, the outputs of LLMs nevertheless seem meaningful and can be used to acquire true beliefs and even knowledge.
zh
[NLP-56] Library Hallucinations in LLMs: Risk Analysis Grounded in Developer Queries
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成代码时频繁出现的库 hallucination(即虚构不存在的库或库成员)问题,这类错误不仅误导开发者、破坏构建流程,还可能引入供应链攻击风险(如 slopsquatting)。其解决方案的关键在于系统性地分析用户提示(prompt)的自然变体如何影响 hallucination 率,包括从开发者论坛提取的真实语言表达、拼写错误(单字符或多重字符错位)以及完全虚构的库名/成员名。研究发现,即使是轻微的拼写错误也能触发高达26%的 hallucination 率,而虚假库名在99%的任务中被接受,表明LLMs对提示变化极度敏感;尽管提示工程(prompt engineering)显示出一定的缓解潜力,但效果不稳定且依赖于具体模型,凸显了当前LLMs在代码生成中的脆弱性,并强调亟需建立针对库相关 hallucination 的防护机制。
链接: https://arxiv.org/abs/2509.22202
作者: Lukas Twist,Jie M. Zhang,Mark Harman,Helen Yannakoudakis
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: 23 pages, 5 tables
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used to generate code, yet they continue to hallucinate, often inventing non-existent libraries. Such library hallucinations are not just benign errors: they can mislead developers, break builds, and expose systems to supply chain threats such as slopsquatting. Despite increasing awareness of these risks, little is known about how real-world prompt variations affect hallucination rates. Therefore, we present the first systematic study of how user-level prompt variations impact library hallucinations in LLM-generated code. We evaluate six diverse LLMs across two hallucination types: library name hallucinations (invalid imports) and library member hallucinations (invalid calls from valid libraries). We investigate how realistic user language extracted from developer forums and how user errors of varying degrees (one- or multi-character misspellings and completely fake names/members) affect LLM hallucination rates. Our findings reveal systemic vulnerabilities: one-character misspellings in library names trigger hallucinations in up to 26% of tasks, fake library names are accepted in up to 99% of tasks, and time-related prompts lead to hallucinations in up to 84% of tasks. Prompt engineering shows promise for mitigating hallucinations, but remains inconsistent and LLM-dependent. Our results underscore the fragility of LLMs to natural prompt variation and highlight the urgent need for safeguards against library-related hallucinations and their potential exploitation.
zh
[NLP-57] When Does Reasoning Matter? A Controlled Study of Reasoning’s Contribution to Model Performance
链接: https://arxiv.org/abs/2509.22193
作者: Nicolas Boizard,Hippolyte Gisserot-Boukhlef,Kevin El-Haddad,Céline Hudelot,Pierre Colombo
机构: Diabolocom; Artefact Research Center; Equall; ISIA Lab, University of Mons; MICS, CentraleSupélec, Université Paris-Saclay
类目: Computation and Language (cs.CL)
备注:
[NLP-58] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
链接: https://arxiv.org/abs/2509.22186
作者: Junbo Niu,Zheng Liu,Zhuangcheng Gu,Bin Wang,Linke Ouyang,Zhiyuan Zhao,Tao Chu,Tianyao He,Fan Wu,Qintong Zhang,Zhenjiang Jin,Guang Liang,Rui Zhang,Wenzheng Zhang,Yuan Qu,Zhifei Ren,Yuefeng Sun,Yuanhong Zheng,Dongsheng Ma,Zirui Tang,Boyu Niu,Ziyang Miao,Hejun Dong,Siyi Qian,Junyuan Zhang,Jingzhou Chen,Fangdong Wang,Xiaomeng Zhao,Liqun Wei,Wei Li,Shasha Wang,Ruiliang Xu,Yuanyuan Cao,Lu Chen,Qianqian Wu,Huaiyu Gu,Lindong Lu,Keming Wang,Dechen Lin,Guanlin Shen,Xuanhe Zhou,Linfeng Zhang,Yuhang Zang,Xiaoyi Dong,Jiaqi Wang,Bo Zhang,Lei Bai,Pei Chu,Weijia Li,Jiang Wu,Lijun Wu,Zhenxiang Li,Guangyu Wang,Zhongying Tu,Chao Xu,Kai Chen,Yu Qiao,Bowen Zhou,Dahua Lin,Wentao Zhang,Conghui He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Technical Report; GitHub Repo: this https URL Hugging Face Model: this https URL Hugging Face Demo: this https URL
[NLP-59] Context Parametrization with Compositional Adapters
链接: https://arxiv.org/abs/2509.22158
作者: Josip Jukić,Martin Tutek,Jan Šnajder
机构: TakeLab, University of Zagreb (萨格勒布大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-60] Mixture of Detectors: A Compact View of Machine-Generated Text Detection
【速读】: 该论文旨在解决机器生成文本检测(Machine-Generated Text Detection, MGTD)中的多个关键挑战,包括文档级二分类与多类分类(区分不同生成器)、句子级边界分割(识别人类与AI协作文本的分界)以及对抗攻击下的检测鲁棒性问题。其解决方案的核心在于构建了一个名为BMAS English的新数据集,该数据集覆盖上述多种检测场景,支持对人类与机器生成文本的精准区分,并能进一步识别生成源(generator attribution),同时引入对抗攻击机制以评估和提升检测模型在实际应用中面对隐蔽策略时的抗干扰能力,从而推动MGTD研究向更全面、实用的方向发展。
链接: https://arxiv.org/abs/2509.22147
作者: Sai Teja Lekkala,Yadagiri Annepaka,Arun Kumar Challa,Samatha Reddy Machireddy,Partha Pakray,Chukhu Chunka
机构: NIT Surathkal (印度国家理工学院苏拉特卡尔分校)
类目: Computation and Language (cs.CL)
备注: 20 pages, 3 figures
点击查看摘要
Abstract:Large Language Models (LLMs) are gearing up to surpass human creativity, a claim whose veracity needs careful consideration. In recent developments, critical questions arise regarding the authenticity of human work and the preservation of human creativity and innovative ability. This paper investigates such issues, addressing machine-generated text detection across several scenarios: document-level binary and multiclass classification (generator attribution), sentence-level segmentation to locate boundaries in human-AI collaborative text, and adversarial attacks aimed at reducing the detectability of machine-generated text. We introduce BMAS English, an English-language dataset supporting binary classification of human versus machine text; multiclass classification that not only identifies machine-generated text but also attempts to attribute it to its generator; adversarial attacks, a common tactic for evading detection; and sentence-level segmentation for predicting the boundaries between human- and machine-generated text. We believe this work extends previous research in Machine-Generated Text Detection (MGTD) in a more meaningful way.
zh
[NLP-61] From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)推理在复杂任务中虽能提升性能,但因输出冗长导致显著推理延迟的问题。其解决方案的关键在于提出多轮自适应链式思维压缩(Multiround Adaptive Chain-of-Thought Compression, MACC)框架,该框架利用“token弹性”现象——即过小的token预算反而可能增加输出长度——通过多轮迭代优化实现CoT的渐进式压缩。MACC能够根据输入特性自适应地确定最优压缩深度,在保持甚至提升准确率的同时,平均减少47个token并显著降低延迟。此外,研究发现测试时的性能(准确率与token长度)可通过训练集中的可解释特征(如困惑度和压缩率)可靠预测,从而无需重复微调即可实现高效模型选择与性能预估。
链接: https://arxiv.org/abs/2509.22144
作者: Jianzhi Yan,Le Liu,Youcheng Pan,Shiwei Chen,Zike Yuan,Yang Xiang,Buzhou Tang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); Pengcheng Laboratory, Shenzhen, China (鹏城实验室); Shaoguan Research Institute of Data Industry, China (韶关数据产业研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures
点击查看摘要
Abstract:Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon–where overly small token budgets can paradoxically increase output length–to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance–accuracy and token length–can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released in this https URL.
zh
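MACC 的多轮自适应压缩思想——反复要求模型在给定预算内改写推理链,并在“继续压缩反而变长”(token 弹性)时停止——可用如下示意性循环表达;提示词、预算设定与 llm 接口均为笔者假设:

```python
def multiround_cot_compression(llm, question, cot, max_rounds=3, target_ratio=0.5):
    """示意:逐轮压缩推理链;若压缩结果不再变短,则认为已到合适深度并停止。"""
    current = cot
    for _ in range(max_rounds):
        budget = max(1, int(len(current.split()) * target_ratio))
        prompt = (f"请在不超过 {budget} 个词的前提下压缩以下推理过程,"
                  f"并保留得到答案所需的关键步骤。\n问题: {question}\n推理: {current}")
        compressed = llm.generate(prompt)
        if len(compressed.split()) >= len(current.split()):
            break               # 出现 token 弹性,提前停止
        current = compressed
    return current
```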
[NLP-62] NFDI4DS Shared Tasks for Scholarly Document Processing
链接: https://arxiv.org/abs/2509.22141
作者: Raia Abu Ahmad,Rana Abdulla,Tilahun Abedissa Taffa,Soeren Auer,Hamed Babaei Giglou,Ekaterina Borisova,Zongxiong Chen,Stefan Dietze,Jennifer DSouza,Mayra Elwes,Genet-Asefa Gesese,Shufan Jiang,Ekaterina Kutafina,Philipp Mayr,Georg Rehm,Sameer Sadruddin,Sonja Schimmler,Daniel Schneider,Kanishka Silva,Sharmila Upadhyaya,Ricardo Usbeck
机构: Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI); Leuphana University of Lüneburg; TIB Leibniz Information Centre for Science and Technology; Fraunhofer Institute for Open Communication Systems FOKUS; GESIS – Leibniz Institute for the Social Sciences; University Hospital Cologne, University of Cologne, Institute for Biomedical Informatics; FIZ-Karlsruhe – Leibniz-Institute for Information Infrastructure; Innovation Center Computer Assisted Surgery (ICCAS), Leipzig University
类目: Computation and Language (cs.CL)
备注: Accepted at the RDI4DS 2025 Workshop
[NLP-63] Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
链接: https://arxiv.org/abs/2509.22134
作者: Shijing Hu,Jingyang Li,Zhihui Lu,Pan Zhou
机构: Fudan University (复旦大学); National University of Singapore (新加坡国立大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
[NLP-64] R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning
【速读】: 该论文旨在解决Chain-of-Thought (CoT) 提示在大型语言模型(LLMs)中进行复杂推理时存在的效率低下与错误传播问题,即CoT的冗长性导致延迟增加、内存消耗上升,并可能因早期错误在长链中扩散而影响最终准确性。解决方案的关键在于提出一种名为“推理胶囊”(Reasoning Capsule, R-Capsule)的混合框架,其核心思想是将高阶推理计划压缩为一组可学习的低维隐式标记(latent tokens),同时保持执行步骤轻量化或显式化;该设计受信息瓶颈(Information Bottleneck, IB)原理启发,通过低容量瓶颈约束实现最小化表示以提升效率,同时引入主任务损失和辅助计划重构损失双目标优化,确保胶囊既能完成任务又能忠实还原原始文本计划,从而在减少可见token数量的同时维持甚至提升复杂基准测试上的准确率与可解释性。
链接: https://arxiv.org/abs/2509.22131
作者: Hongyu Shan,Mingyang Song,Chang Dai,Di Liang,Han Chen
机构: Tianjin University (天津大学); Tencent Hunyuan Team (腾讯混元团队); Peking University (北京大学); Fudan University (复旦大学); Central China Normal University (华中师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Chain-of-Thought (CoT) prompting helps Large Language Models (LLMs) tackle complex reasoning by eliciting explicit step-by-step rationales. However, CoT’s verbosity increases latency and memory usage and may propagate early errors across long chains. We propose the Reasoning Capsule (R-Capsule), a framework that aims to combine the efficiency of latent reasoning with the transparency of explicit CoT. The core idea is to compress the high-level plan into a small set of learned latent tokens (a Reasoning Capsule) while keeping execution steps lightweight or explicit. This hybrid approach is inspired by the Information Bottleneck (IB) principle, where we encourage the capsule to be approximately minimal yet sufficient for the task. Minimality is encouraged via a low-capacity bottleneck, which helps improve efficiency. Sufficiency is encouraged via a dual objective: a primary task loss for answer accuracy and an auxiliary plan-reconstruction loss that encourages the capsule to faithfully represent the original textual plan. The reconstruction objective helps ground the latent space, thereby improving interpretability and reducing the use of uninformative shortcuts. Our framework strikes a balance between efficiency, accuracy, and interpretability, thereby reducing the visible token footprint of reasoning while maintaining or improving accuracy on complex benchmarks. Our codes are available at: this https URL
zh
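其“主任务损失 + 计划重构损失”的双目标可示意性地写作如下形式(记号与权重系数 λ 为笔者补充,仅表达总体结构):

```latex
\mathcal{L} \;=\; \underbrace{\mathcal{L}_{\mathrm{task}}\big(y,\ \hat{y}(x, z)\big)}_{\text{answer correctness}}
\;+\; \lambda\, \underbrace{\mathcal{L}_{\mathrm{recon}}\big(p,\ \hat{p}(z)\big)}_{\text{reconstruct plan } p \text{ from capsule } z}
```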
[NLP-65] FoodSEM: Large Language Model Specialized in Food Named-Entity Linking
链接: https://arxiv.org/abs/2509.22125
作者: Ana Gjorgjevikj,Matej Martinc,Gjorgjina Cenikj,Sašo Džeroski,Barbara Koroušić Seljak,Tome Eftimov
机构: Jozef Stefan Institute (乔瑟夫·斯蒂芬研究所); Jozef Stefan International Postgraduate School (乔瑟夫·斯蒂芬国际研究生院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: To appear in the Proceedings of the 28th International Conference on Discovery Science (DS 2025)
[NLP-66] Multilingual Vision-Language Models: A Survey
【速读】: 该论文旨在解决多语言视觉-语言模型(Multilingual Vision-Language Models)在跨语言一致性与文化敏感性之间的权衡问题。其核心挑战在于如何在保持语言中立性(Language Neutrality,即不同语言间表示的一致性)的同时,增强模型对文化语境的适应能力(Cultural Awareness)。解决方案的关键在于训练策略与评估基准的协同优化:当前主流方法依赖对比学习(Contrastive Learning)实现语言中立性,但文化意识的提升则依赖于多样化且具有文化背景的数据;同时,研究发现现有约三分之二的评估基准仍以翻译为基础,侧重语义一致性,而新兴工作正逐步引入文化相关的内容以更全面地衡量模型性能,从而弥合训练目标与评估目标之间的差距。
链接: https://arxiv.org/abs/2509.22123
作者: Andrei-Alexandru Manea,Jindřich Libovický
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This survey examines multilingual vision-language models that process text and images across languages. We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
zh
[NLP-67] Universal Legal Article Prediction via Tight Collaboration between Supervised Classification Model and LLM
链接: https://arxiv.org/abs/2509.22119
作者: Xiao Chi,Wenlin Zhong,Yiquan Wu,Wei Wang,Kun Kuang,Fei Wu,Minghui Xiong
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, Accepted to ICAIL 2025 (International Conference on Artificial Intelligence and Law)
[NLP-68] Think Right Not More: Test-Time Scaling for Numerical Claim Verification EMNLP2025
链接: https://arxiv.org/abs/2509.22101
作者: Primakov Chungkham,V Venktesh,Vinay Setty,Avishek Anand
机构: TU Delft (代尔夫特理工大学); Stockholm University (斯德哥尔摩大学); University of Stavanger (斯塔万格大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025, 19 pages
[NLP-69] S2J: Bridging the Gap Between Solving and Judging Ability in Generative Reward Models
【速读】: 该论文旨在解决生成式奖励模型(Generative Reward Models, GRMs)中存在的“求解到判断差距”(solve-to-judge gap)问题,即GRM在能够正确求解某个问题的情况下,仍可能无法对该问题做出正确判断(错误率高达14%-37%)。解决方案的关键在于提出Solve-to-Judge(S2J)方法,通过在单个GRM的输出上同时利用其求解能力和判断能力进行监督学习,显式地将模型的解决问题与评估能力在优化过程中关联起来,从而缩小该差距。实验表明,S2J可使solve-to-judge gap降低16.2%,并提升判断性能5.8%,且在相同基座模型下达到当前最优效果,同时仅需较小训练数据集,并通过自我进化实现,无需依赖外部更强模型进行蒸馏。
链接: https://arxiv.org/abs/2509.22099
作者: Shaoning Sun,Jiachen Yu,Zongqi Wang,Xuewei Yang,Tianle Gu,Yujiu Yang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:With the rapid development of large language models (LLMs), generative reward models (GRMs) have been widely adopted for reward modeling and evaluation. Previous studies have primarily focused on training specialized GRMs by optimizing them on preference datasets with the judgment correctness as supervision. While it’s widely accepted that GRMs with stronger problem-solving capabilities typically exhibit superior judgment abilities, we first identify a significant solve-to-judge gap when examining individual queries. Specifically, the solve-to-judge gap refers to the phenomenon where GRMs struggle to make correct judgments on some queries (14%-37%), despite being fully capable of solving them. In this paper, we propose the Solve-to-Judge (S2J) approach to address this problem. Specifically, S2J simultaneously leverages both the solving and judging capabilities on a single GRM’s output for supervision, explicitly linking the GRM’s problem-solving and evaluation abilities during model optimization, thereby narrowing the gap. Our comprehensive experiments demonstrate that S2J effectively reduces the solve-to-judge gap by 16.2%, thereby enhancing the model’s judgment performance by 5.8%. Notably, S2J achieves state-of-the-art (SOTA) performance among GRMs built on the same base model while utilizing a significantly smaller training dataset. Moreover, S2J accomplishes this through self-evolution without relying on more powerful external models for distillation.
zh
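S2J 对同一 GRM 的输出同时施加“判断”与“求解”两路监督,其总体目标可示意性地写作(记号与加权系数 β 为笔者假设):

```latex
\mathcal{L}_{\mathrm{S2J}} \;=\; \mathcal{L}_{\mathrm{judge}}\big(j_\theta(x),\ j^{*}\big)
\;+\; \beta\, \mathcal{L}_{\mathrm{solve}}\big(s_\theta(x),\ s^{*}\big)
```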
[NLP-70] SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios
【速读】: 该论文旨在解决当前由大语言模型(Large Language Model, LLM)驱动的代码代理(code agent)在自动化软件开发过程中存在的安全风险问题,即生成代码中可能引入未被发现或新增的漏洞,而现有基准测试未能充分评估此类风险。其解决方案的关键在于提出SecureAgentBench——一个包含105个编码任务的综合性基准,每个任务均具备三个核心特征:(i) 基于真实开源项目中漏洞引入点的多文件编辑场景;(ii) 与实际漏洞上下文对齐的任务描述;(iii) 结合功能测试、基于PoC(proof-of-concept)漏洞验证和静态分析的新漏洞检测的多维评估机制。该设计使评测能够全面捕捉代码的功能正确性与安全性,从而为LLM赋能的代码生成提供更可靠的评估标准。
链接: https://arxiv.org/abs/2509.22097
作者: Junkai Chen,Huihui Huang,Yunbo Lyu,Junwen An,Jieke Shi,Chengran Yang,Ting Zhang,Haoye Tian,Yikun Li,Zhenhao Li,Xin Zhou,Xing Hu,David Lo
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Large language model (LLM) powered code agents are rapidly transforming software engineering by automating tasks such as testing, debugging, and repairing, yet the security risks of their generated code have become a critical concern. Existing benchmarks have offered valuable insights but remain insufficient: they often overlook the genuine context in which vulnerabilities were introduced or adopt narrow evaluation protocols that fail to capture either functional correctness or newly introduced vulnerabilities. We therefore introduce SecureAgentBench, a benchmark of 105 coding tasks designed to rigorously evaluate code agents’ capabilities in secure code generation. Each task includes (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified introduction points, and (iii) comprehensive evaluation that combines functionality testing, vulnerability checking through proof-of-concept exploits, and detection of newly introduced vulnerabilities using static analysis. We evaluate three representative agents (SWE-agent, OpenHands, and Aider) with three state-of-the-art LLMs (Claude 3.7 Sonnet, GPT-4.1, and DeepSeek-V3.1). Results show that (i) current agents struggle to produce secure code, as even the best-performing one, SWE-agent supported by DeepSeek-V3.1, achieves merely 15.2% correct-and-secure solutions, (ii) some agents produce functionally correct code but still introduce vulnerabilities, including new ones not previously recorded, and (iii) adding explicit security instructions for agents does not significantly improve secure coding, underscoring the need for further research. These findings establish SecureAgentBench as a rigorous benchmark for secure code generation and a step toward more reliable software development with LLMs.
zh
[NLP-71] Multilingual Dialogue Generation and Localization with Dialogue Act Scripting EMNLP
【速读】: 该论文旨在解决非英语对话数据集稀缺的问题,以及现有模型在训练或评估时依赖英文对话翻译所带来的“翻译腔”(translationese)问题,这些问题会降低对话的自然性和文化适切性。解决方案的关键在于提出Dialogue Act Script (DAS) 框架,该框架通过抽象意图表示(abstract intent representations)对对话进行结构化编码与本地化生成,而非直接翻译语句,从而在目标语言中生成更具文化相关性与情境恰当性的新对话,实现跨语言的灵活适配与更自然的交互表现。
链接: https://arxiv.org/abs/2509.22086
作者: Justin Vasselli,Eunike Andriani Kardinata,Yusuke Sakai,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良科学技术大学院大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 10 tables, 2 figures, Accepted at EMNLP Main 2025
点击查看摘要
Abstract:Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.
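下面给出一个示意性的 Python 草图,演示"先把对话编码为与语言无关的对话行为、再在目标语言中本地化生成"的基本思路;其中 DialogueAct 的字段、意图标签与提示模板均为本文为说明而假设,并非论文官方的 DAS 规范。

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DialogueAct:
    """一个抽象的对话行为:只记录意图与语义槽位,不含任何具体语言的表述。"""
    speaker: str   # 说话人角色,如 "customer" / "agent"
    intent: str    # 意图标签,如 "request_info"、"inform"
    slots: dict    # 与语言无关的语义槽位,如 {"topic": "refund"}

def build_localization_prompt(script: List[DialogueAct], language: str) -> str:
    """把对话行为脚本渲染成生成提示,让 LLM 在目标语言中重新"演绎"这段对话,
    而不是逐句翻译,从而减少翻译腔。"""
    lines = [f"Generate a culturally appropriate dialogue in {language}."]
    for i, act in enumerate(script, 1):
        slot_str = ", ".join(f"{k}={v}" for k, v in act.slots.items())
        lines.append(f"{i}. {act.speaker}: intent={act.intent} ({slot_str})")
    return "\n".join(lines)

script = [
    DialogueAct("customer", "request_info", {"topic": "refund policy"}),
    DialogueAct("agent", "inform", {"topic": "refund policy", "deadline": "30 days"}),
]
print(build_localization_prompt(script, "German"))
```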
zh
[NLP-72] COSPADI: Compressing LLM s via Calibration-Guided Sparse Dictionary Learning
链接: https://arxiv.org/abs/2509.22075
作者: Dmitriy Shopkhoev,Denis Makhov,Magauiya Zhussip,Ammar Ali,Stamatios Lefkimmiatis
机构: MTS AI (MTS AI); ITMO (ITMO)
类目: Computation and Language (cs.CL)
备注:
[NLP-73] Fine-tuning Done Right in Model Editing
【速读】: 该论文旨在解决生成式 AI(Generative AI)模型编辑中微调(fine-tuning)方法长期被视为无效的问题。传统观点认为微调在模型编辑任务中效果不佳,但本文指出其失败根源并非微调本身的能力局限,而是由于将其应用于顺序性的单次深度优先(depth-first)流水线——该流程对每个样本优化至收敛后再推进,导致过拟合与编辑间干扰。解决方案的关键在于将微调恢复为标准的广度优先(breadth-first,即基于epoch的)流水线,并引入小批量(mini-batch)优化策略,从而显著提升编辑有效性;进一步地,通过系统分析调参位置,提出LocFT-BF方法,在局部范围内进行高效编辑,实现了10万次编辑和720亿参数模型下的稳定性能,且不损害模型通用能力,使微调从被低估的基础方法跃升为当前最优的模型编辑方案。
链接: https://arxiv.org/abs/2509.22072
作者: Wanli Yang,Fei Sun,Rui Tang,Hongyu Zang,Du Su,Qi Cao,Jingang Wang,Huawei Shen,Xueqi Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing. Here, we challenge this belief, arguing that the reported failure arises not from the inherent limitation of fine-tuning itself, but from adapting it to the sequential nature of the editing task, a single-pass depth-first pipeline that optimizes each sample to convergence before moving on. While intuitive, this depth-first pipeline coupled with sample-wise updating over-optimizes each edit and induces interference across edits. Our controlled experiments reveal that simply restoring fine-tuning to the standard breadth-first (i.e., epoch-based) pipeline with mini-batch optimization substantially improves its effectiveness for model editing. Moreover, fine-tuning in editing also suffers from suboptimal tuning parameter locations inherited from prior methods. Through systematic analysis of tuning locations, we derive LocFT-BF, a simple and effective localized editing method built on the restored fine-tuning framework. Extensive experiments across diverse LLMs and datasets demonstrate that LocFT-BF outperforms state-of-the-art methods by large margins. Notably, to our knowledge, it is the first to sustain 100K edits and 72B-parameter models, 10x beyond prior practice, without sacrificing general capabilities. By clarifying a long-standing misconception and introducing a principled localized tuning strategy, we advance fine-tuning from an underestimated baseline to a leading method for model editing, establishing a solid foundation for future research.
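下面用一个最小的 PyTorch 草图对比摘要中所述的两种流程:逐条优化至收敛的深度优先编辑,与按 epoch、小批量的广度优先编辑;其中的模型、损失函数与超参数均为假设,仅用于说明训练循环结构上的差异,并非 LocFT-BF 的官方实现。

```python
import torch
from torch.utils.data import DataLoader

def depth_first_editing(model, edits, loss_fn, steps_per_edit=50, lr=1e-4):
    """深度优先:逐条编辑优化到(近似)收敛再处理下一条,容易过拟合单条编辑。"""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for sample in edits:
        for _ in range(steps_per_edit):
            opt.zero_grad()
            loss_fn(model, [sample]).backward()
            opt.step()

def breadth_first_editing(model, edits, loss_fn, epochs=3, batch_size=8, lr=1e-4):
    """广度优先(按 epoch、小批量):每个 epoch 扫一遍所有编辑样本,缓解编辑间干扰。"""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(edits, batch_size=batch_size, shuffle=True, collate_fn=list)
    for _ in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()

# 最小演示:用一个线性层和随机"编辑样本"跑通广度优先流程
model = torch.nn.Linear(4, 4)
edits = [(torch.randn(4), torch.randn(4)) for _ in range(16)]

def mse(m, batch):
    return torch.stack([((m(x) - y) ** 2).mean() for x, y in batch]).mean()

breadth_first_editing(model, edits, mse, epochs=1)
```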
zh
[NLP-74] The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems
链接: https://arxiv.org/abs/2509.22064
作者: Anya Belz,Simon Mille,Craig Thomson
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 39 pages, 7 figures
[NLP-75] RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media
【速读】: 该论文旨在解决社交平台上生成式 AI (Generative AI) 内容(AIGT)的动态检测问题,现有数据集多为静态视角,无法反映 AIGT 在用户互动驱动下的时间演化特性。其解决方案的关键在于构建首个纵向(5年)社交平台 AIGT 分析数据集 RedNote-Vibe,该数据集来自小红书平台,包含点赞、评论等用户交互指标及时间戳信息,覆盖从预大语言模型(LLM)时期至2025年7月的完整周期;同时提出可解释的 Psycholinguistic AIGT Detection Framework (PLAD),基于心理语言学特征实现高精度检测,并揭示了这些语言特征与社交互动之间的复杂关联。
链接: https://arxiv.org/abs/2509.22055
作者: Yudong Li,Yufei Sun,Yuhan Yao,Peiru Yang,Wanyue Li,Jiajun Zou,Yongfeng Huang,Linlin Shen
机构: Tsinghua University (清华大学); Beijing University of Posts and Telecommunications (北京邮电大学); Hong Kong Metropolitan University (香港都会大学); Shenzhen University (深圳大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The proliferation of Large Language Models (LLMs) has led to widespread AI-Generated Text (AIGT) on social media platforms, creating unique challenges where content dynamics are driven by user engagement and evolve over time. However, existing datasets mainly depict static AIGT detection. In this work, we introduce RedNote-Vibe, the first longitudinal (5-years) dataset for social media AIGT analysis. This dataset is sourced from Xiaohongshu platform, containing user engagement metrics (e.g., likes, comments) and timestamps spanning from the pre-LLM period to July 2025, which enables research into the temporal dynamics and user interaction patterns of AIGT. Furthermore, to detect AIGT in the context of social media, we propose PsychoLinguistic AIGT Detection Framework (PLAD), an interpretable approach that leverages psycholinguistic features. Our experiments show that PLAD achieves superior detection performance and provides insights into the signatures distinguishing human and AI-generated content. More importantly, it reveals the complex relationship between these linguistic features and social media engagement. The dataset is available at this https URL.
zh
[NLP-76] Fuzzy Reasoning Chain (FRC): An Innovative Reasoning Framework from Fuzziness to Clarity EMNLP2025
链接: https://arxiv.org/abs/2509.22054
作者: Ping Chen,Xiang Liu,Zhaoxiang Liu,Zezhou Chen,Xingpeng Zhang,Huan Hu,Zipeng Wang,Kai Wang,Shuming Shi,Shiguo Lian
机构: Data Science & AI Research Institute, China Unicom (中国联通数据科学与人工智能研究院); Unicom Data Intelligence, China Unicom (中国联通数据智能); School of Computer Science and Software Engineering, Southwest Petroleum University (西南石油大学计算机科学与软件工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025 Findings (11 pages, 1 figure)
[NLP-77] A2R: An Asymmetric Two-Stage Reasoning Framework for Parallel Reasoning
链接: https://arxiv.org/abs/2509.22044
作者: Ziqi Wang,Boye Niu,Zhongli Li,Linghui Meng,Jing Liu,Zhi Zheng,Tong Xu,Hua Wu,Haifeng Wang,Enhong Chen
机构: USTC(中国科学技术大学); Baidu(百度); USYD(悉尼大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 3 figures
[NLP-78] Taxonomy of Comprehensive Safety for Clinical Agents EMNLP2025
【速读】: 该论文旨在解决临床聊天机器人(clinical chatbot)应用中的安全性问题,即不准确或有害的响应可能导致严重后果,而现有方法如防护机制(guardrails)和工具调用(tool calling)难以满足临床场景中对安全性的精细化需求。解决方案的关键在于提出一个细粒度的21类安全分类体系——TACOS(TAxonomy of COmprehensive Safety for Clinical Agents),该体系将安全过滤与工具选择整合为单一用户意图分类步骤,明确建模不同查询的安全阈值和对外部工具的依赖关系,从而提升临床代理系统的安全性与适应性。
链接: https://arxiv.org/abs/2509.22041
作者: Jean Seo,Hyunkyung Lee,Gibaeg Kim,Wooseok Han,Jaehyo Yoo,Seungseop Lim,Kihun Shin,Eunho Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Industry
点击查看摘要
Abstract:Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods–such as guardrails and tool calling–often fall short in addressing the nuanced demands of the clinical domain. In this paper, we introduce TACOS (TAxonomy of COmprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. TACOS is a taxonomy that can cover a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our framework, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a new taxonomy specialized for clinical agent settings, and reveal useful insights about train data distribution and pretrained knowledge of base models.
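下面是一个示意性的路由草图,说明"将安全过滤与工具选择合并为一次用户意图分类"的做法;其中的意图标签名与路由表均为假设,并非 TACOS 的真实 21 类体系。

```python
def route_clinical_query(classify, query: str) -> dict:
    """classify(query) 返回 TACOS 式的细粒度意图标签;按意图同时决定
    是否拦截与调用哪些外部工具,一次分类完成两件事。"""
    label = classify(query)
    routing = {  # 假设的意图到(拦截, 工具)的映射表
        "emergency_symptom": {"blocked": False, "tools": ["triage_guideline"]},
        "drug_interaction":  {"blocked": False, "tools": ["drug_db"]},
        "self_harm":         {"blocked": True,  "tools": []},
        "smalltalk":         {"blocked": False, "tools": []},
    }
    decision = routing.get(label, {"blocked": True, "tools": []})  # 未知意图默认保守拦截
    return {"intent": label, **decision}

print(route_clinical_query(lambda q: "drug_interaction", "阿司匹林能和布洛芬一起吃吗?"))
```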
zh
[NLP-79] The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging
链接: https://arxiv.org/abs/2509.22034
作者: Xiaochong Lan,Yu Zheng,Shiteng Cao,Yong Li
机构: Tsinghua University (清华大学); Massachusetts Institute of Technology (麻省理工学院); Shenzhen International Graduate School, Tsinghua University (深圳国际研究生院,清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-80] From Outliers to Topics in Language Models: Anticipating Trends in News Corpora ACL
链接: https://arxiv.org/abs/2509.22030
作者: Evangelia Zve,Benjamin Icard,Alice Breton,Lila Sainero,Gauvain Bourgne,Jean-Gabriel Ganascia
机构: LIP6(法国国家科学研究中心-索邦大学实验室); Sorbonne University (索邦大学); CNRS (法国国家科学研究中心)
类目: Computation and Language (cs.CL)
备注: presented at ICNLSP 2025; to appear in the ACL Anthology; received the Best Full Paper Award
[NLP-81] GraphSearch: An Agentic Deep Searching Workflow for Graph Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2509.22009
作者: Cehao Yang,Xiaojun Wu,Xueyuan Lin,Chengjin Xu,Xuhui Jiang,Yuanliang Sun,Jia Li,Hui Xiong,Jian Guo
机构: IDEA Research (国际数字经济发展研究院); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); DataArc Tech Ltd. (数据弧科技有限公司); Hithink RoyalFlush Information Network Co., Ltd (同花顺)
类目: Computation and Language (cs.CL)
备注:
[NLP-82] Black-Box Hallucination Detection via Consistency Under the Uncertain Expression
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时存在的“幻觉”(hallucination)问题,即模型输出与事实不符的内容。现有方法通常依赖于模型内部状态(如token概率分布)或外部知识资源,但受限于API访问权限和外部资源的可用性,难以在实际场景中广泛部署。论文提出一种基于黑盒(Black-Box)视角的检测机制,其关键在于观察到LLMs在表达不确定性时会产生不一致的响应,而当其输出事实性内容时则表现出高度一致性。基于此行为特征,作者设计了一种仅需模型输入输出即可实现的高效黑盒检测指标,实验证明该方法在预测响应真实性方面优于依赖内部信息的基线方法。
链接: https://arxiv.org/abs/2509.21999
作者: Seongho Joo,Kyungmin Min,Jahyun Koo,Kyomin Jung
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite the great advancement of language modeling in recent years, Large Language Models (LLMs) such as GPT-3 are notorious for generating non-factual responses, the so-called "hallucination" problem. Existing methods for detecting and alleviating this hallucination problem require external resources or the internal state of LLMs, such as the output probability of each token. Given the LLM's restricted external API availability and the limited scope of external resources, there is an urgent demand to establish the black-box approach as the cornerstone for effective hallucination detection. In this work, we propose a simple black-box hallucination detection metric after investigating the behavior of LLMs under expressions of uncertainty. Our comprehensive analysis reveals that LLMs generate consistent responses when they present factual content, and inconsistent responses otherwise. Based on this analysis, we propose an efficient black-box hallucination detection metric built on the expression of uncertainty. Experiments demonstrate that our metric is more predictive of the factuality of model responses than baselines that use internal knowledge of LLMs.
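下面给出一个示意性草图,演示"在不确定性措辞提示下多次采样回答、以两两一致性作为黑盒幻觉信号"的思路;采样次数、阈值以及此处用作示例的 Jaccard 相似度均为假设,实际可替换为任意语义相似度度量。

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """简单的词集合 Jaccard 相似度,仅作演示用的占位相似度函数。"""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def consistency_score(responses, similarity):
    """对同一问题的多个回答计算平均两两相似度。"""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def detect_hallucination(ask, question, similarity=jaccard, n_samples=5, threshold=0.6):
    """黑盒幻觉检测草图:ask(prompt) 只需返回模型的文本回答。
    一致性低于阈值则标记为疑似幻觉;阈值为假设值。"""
    hedged = f"If you are not sure, say so. Question: {question}"
    responses = [ask(hedged) for _ in range(n_samples)]
    score = consistency_score(responses, similarity)
    return {"consistency": score, "hallucination_suspected": score < threshold}

print(consistency_score(
    ["Paris is the capital of France.", "The capital of France is Paris.", "It is Lyon."],
    jaccard))
```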
zh
[NLP-83] ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
链接: https://arxiv.org/abs/2509.21991
作者: Jewon Lee,Wooksu Shin,Seungmin Yang,Ki-Ung Song,DongUk Lim,Jaeyeon Kim,Tae-Ho Kim,Bo-Kyeong Kim
机构: Nota Inc. (Nota公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
[NLP-84] From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs
【速读】: 该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在空间位置变化下的鲁棒性不足问题,即当图像中关键视觉信息处于不同位置时,模型输出存在不一致性。研究表明,这种空间偏差并非源于视觉编码器对内容的感知不一致,而是由语言模型组件中位置嵌入(position embedding)设计失衡所致,特别是RoPE等常用策略在跨模态交互中引入了位置相关的影响差异。解决方案的关键在于提出一种称为“平衡位置分配”(Balanced Position Assignment, BaPA)的新机制,其核心思想是为所有图像标记(image token)分配相同的相对位置嵌入,从而实现视觉信息的均衡融合,提升模型的空间鲁棒性和整体性能。
链接: https://arxiv.org/abs/2509.21984
作者: Yingjie Zhu,Xuefeng Bai,Kehai Chen,Yang Xiang,Weili Guan,Jun Yu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳); Peng Cheng Laboratory, Shenzhen, China (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we present a systematic study of the spatial bias of LVLMs, focusing on how models respond when identical key visual information is placed at different locations within an image. Through a carefully designed probing dataset, we demonstrate that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a fundamental limitation in their spatial-semantic understanding. Further analysis shows that this phenomenon originates not from the vision encoder, which reliably perceives and interprets visual content across positions, but from the unbalanced design of position embeddings in the language model component. In particular, the widely adopted position embedding strategies, such as RoPE, introduce imbalance during cross-modal interaction, leading image tokens at different positions to exert unequal influence on semantic understanding. To mitigate this issue, we introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens, promoting a more balanced integration of visual information. Extensive experiments show that BaPA enhances the spatial robustness of LVLMs without retraining and further boosts their performance across diverse multimodal benchmarks when combined with lightweight fine-tuning. Further analysis of information flow reveals that BaPA yields balanced attention, enabling more holistic visual understanding.
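下面是一个示意性草图,说明"让所有图像 token 共享同一位置编号"这一思路如何落到位置 id 的构造上;具体共享方式(这里锚定到图像段起始位置)为本文假设,细节以论文为准。

```python
import torch

def balanced_position_ids(input_types: torch.Tensor) -> torch.Tensor:
    """input_types: [seq_len],0 表示文本 token,1 表示图像 token。
    文本 token 按正常顺序编号;所有图像 token 共享同一个位置编号,
    使不同空间位置的图像 token 在注意力中获得均衡的位置影响。"""
    pos = torch.zeros_like(input_types)
    next_pos = 0
    image_anchor = None
    for i, t in enumerate(input_types.tolist()):
        if t == 1:                      # 图像 token
            if image_anchor is None:
                image_anchor = next_pos
                next_pos += 1
            pos[i] = image_anchor
        else:                           # 文本 token
            pos[i] = next_pos
            next_pos += 1
    return pos

# 例:两个文本 token、三个图像 token、再两个文本 token
types = torch.tensor([0, 0, 1, 1, 1, 0, 0])
print(balanced_position_ids(types))     # tensor([0, 1, 2, 2, 2, 3, 4])
```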
zh
[NLP-85] RISK: A Framework for GUI Agents in E-commerce Risk Management
【速读】: 该论文旨在解决电子商务风险评估中复杂、多步骤且状态依赖的网页交互数据获取难题,传统爬虫方法和现有图形用户界面(GUI)代理难以处理此类动态交互内容。其解决方案的关键在于提出一个名为RISK的端到端框架,包含三个核心组件:(1) RISK-Data,一个高质量标注的数据集,涵盖单步与多步交互轨迹;(2) RISK-Bench,用于标准化评估的基准测试集,覆盖不同难度等级的任务;(3) RISK-R1,一种基于R1风格的强化学习微调框架,通过输出格式奖励、单步精度奖励、多步过程重加权及任务层级重加权四个维度优化代理行为,从而显著提升在离线与在线场景下的任务成功率。
链接: https://arxiv.org/abs/2509.21982
作者: Renqi Chen,Zeyin Tao,Jianming Guo,Jingzhe Zhu,Yiheng Peng,Qingqing Sun,Tianyi Zhang,Shuai Chen
机构: Fudan University (复旦大学); Ant International (蚂蚁国际)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:E-commerce risk management requires aggregating diverse, deeply embedded web data through multi-step, stateful interactions, which traditional scraping methods and most existing Graphical User Interface (GUI) agents cannot handle. These agents are typically limited to single-step tasks and lack the ability to manage dynamic, interactive content critical for effective risk assessment. To address this challenge, we introduce RISK, a novel framework designed to build and deploy GUI agents for this domain. RISK integrates three components: (1) RISK-Data, a dataset of 8,492 single-step and 2,386 multi-step interaction trajectories, collected through a high-fidelity browser framework and a meticulous data curation process; (2) RISK-Bench, a benchmark with 802 single-step and 320 multi-step trajectories across three difficulty levels for standardized evaluation; and (3) RISK-R1, a R1-style reinforcement fine-tuning framework considering four aspects: (i) Output Format: Updated format reward to enhance output syntactic correctness and task comprehension, (ii) Single-step Level: Stepwise accuracy reward to provide granular feedback during early training stages, (iii) Multi-step Level: Process reweight to emphasize critical later steps in interaction sequences, and (iv) Task Level: Level reweight to focus on tasks of varying difficulty. Experiments show that RISK-R1 outperforms existing baselines, achieving a 6.8% improvement in offline single-step and an 8.8% improvement in offline multi-step. Moreover, it attains a top task success rate of 70.5% in online evaluation. RISK provides a scalable, domain-specific solution for automating complex web interactions, advancing the state of the art in e-commerce risk management.
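下面用一个纯 Python 草图示意摘要中四个奖励维度(输出格式、单步精度、多步过程重加权、任务层级重加权)的组合方式;权重取值与具体公式均为假设,并非 RISK-R1 的官方奖励函数。

```python
def risk_style_reward(steps_correct, format_ok, task_level,
                      w_format=0.2, level_weights=(0.8, 1.0, 1.2)):
    """多步 GUI 轨迹的奖励组合草图:
    steps_correct: 每一步是否正确的布尔列表
    format_ok:     输出格式是否合法
    task_level:    任务难度等级 0/1/2,用于任务级重加权
    过程重加权:靠后的步骤权重线性增大,以强调交互序列中的关键后期步骤。"""
    n = len(steps_correct)
    if n == 0:
        return 0.0
    step_weights = [1.0 + i / n for i in range(n)]            # 后期步骤权重更高
    stepwise = sum(w * float(c) for w, c in zip(step_weights, steps_correct))
    stepwise /= sum(step_weights)
    reward = w_format * float(format_ok) + (1 - w_format) * stepwise
    return level_weights[task_level] * reward

print(risk_style_reward([True, False, True, True], format_ok=True, task_level=2))
```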
zh
[NLP-86] MotivGraph-SoIQ: Integrating Motivational Knowledge Graphs and Socratic Dialogue for Enhanced LLM Ideation EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在学术创意生成过程中存在的两个关键问题:一是缺乏对创意的动机性 grounding(即创意与现实问题、挑战和解决方案之间的结构化关联),二是易受确认偏误(confirmation bias)影响,导致创意质量难以提升。其解决方案的核心在于提出 MotivGraph-SoIQ 框架,该框架通过整合动机知识图谱(Motivational Knowledge Graph, MotivGraph)与基于问题驱动的苏格拉底式思辨机制(Q-Driven Socratic Ideator)实现创新。MotivGraph 以结构化方式存储问题、挑战和解决方案三类节点,为 LLM 提供动机层面的语义锚定;而双代理苏格拉底式思辨器则通过系统性提问策略,有效抑制确认偏误,从新颖性、实验严谨性和动机合理性三个维度显著提升创意质量。
链接: https://arxiv.org/abs/2509.21978
作者: Xinping Lei,Tong Zhou,Yubo Chen,Kang Liu,Jun Zhao
机构: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Hunan Provincial Key Laboratory of Philosophy and Social Sciences of Artificial Intelligence and Precision International, Hunan Normal University; Beijing University of Posts and Telecommunications
类目: Computation and Language (cs.CL)
备注: EMNLP2025 Findings
点击查看摘要
Abstract:Large Language Models (LLMs) hold substantial potential for accelerating academic ideation but face critical challenges in grounding ideas and mitigating confirmation bias for further refinement. We propose integrating motivational knowledge graphs and socratic dialogue to address these limitations in enhanced LLM ideation (MotivGraph-SoIQ). This novel framework provides essential grounding and practical idea improvement steps for LLM ideation by integrating a Motivational Knowledge Graph (MotivGraph) with a Q-Driven Socratic Ideator. The MotivGraph structurally stores three key node types(problem, challenge and solution) to offer motivation grounding for the LLM ideation process. The Ideator is a dual-agent system utilizing Socratic questioning, which facilitates a rigorous refinement process that mitigates confirmation bias and improves idea quality across novelty, experimental rigor, and motivational rationality dimensions. On the ICLR25 paper topics dataset, MotivGraph-SoIQ exhibits clear advantages over existing state-of-the-art approaches across LLM-based scoring, ELO ranking, and human evaluation metrics.
zh
[NLP-87] Evaluating Open-Source Large Language Models for Technical Telecom Question Answering
链接: https://arxiv.org/abs/2509.21949
作者: Arina Caraus,Alessio Buscemi,Sumit Kumar,Ion Turcanu
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL)
备注: Accepted at the IEEE GLOBECOM Workshops 2025: “Large AI Model over Future Wireless Networks”
[NLP-88] Debiasing Large Language Models in Thai Political Stance Detection via Counterfactual Calibration
【速读】: 该论文旨在解决低资源且文化复杂的语境下,大型语言模型(LLM)在泰语政治立场检测中出现的系统性偏差问题,如情感泄露(sentiment leakage)和对特定实体的偏好,这些问题损害了模型的公平性和可靠性。解决方案的关键在于提出 ThaiFACTUAL,一个轻量级、与模型无关的校准框架,其核心机制包括反事实数据增强(counterfactual data augmentation)和基于理由的监督(rationale-based supervision),从而实现情感与立场的解耦,并有效降低偏见。
链接: https://arxiv.org/abs/2509.21946
作者: Kasidit Sermsri,Teerapong Panboonyuen
机构: Chulalongkorn University (朱拉隆功大学); MARSAIL (Motor AI Recognition Solution Artificial Intelligence Laboratory)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages
点击查看摘要
Abstract:Political stance detection in low-resource and culturally complex settings poses a critical challenge for large language models (LLMs). In the Thai political landscape - marked by indirect language, polarized figures, and entangled sentiment and stance - LLMs often display systematic biases such as sentiment leakage and favoritism toward entities. These biases undermine fairness and reliability. We present ThaiFACTUAL, a lightweight, model-agnostic calibration framework that mitigates political bias without requiring fine-tuning. ThaiFACTUAL uses counterfactual data augmentation and rationale-based supervision to disentangle sentiment from stance and reduce bias. We also release the first high-quality Thai political stance dataset, annotated with stance, sentiment, rationales, and bias markers across diverse entities and events. Experimental results show that ThaiFACTUAL significantly reduces spurious correlations, enhances zero-shot generalization, and improves fairness across multiple LLMs. This work highlights the importance of culturally grounded debiasing techniques for underrepresented languages.
zh
[NLP-89] Why Chain of Thought Fails in Clinical Text Understanding
【速读】: 该论文旨在解决生成式 AI(Generative AI)在临床文本理解任务中因缺乏透明推理而导致的可信度问题,特别是在电子健康记录(Electronic Health Records, EHRs)这一复杂、碎片化且噪声较大的数据源上,Chain-of-thought (CoT) 提示方法是否能提升模型性能与可解释性的问题。其解决方案的关键在于通过大规模实证研究系统评估95个先进大语言模型(LLMs)在87项真实临床任务上的表现,发现86.3%的模型在CoT设置下出现性能下降,揭示了CoT虽增强推理透明度但可能削弱可靠性的核心矛盾,并提出需结合细粒度分析(如推理长度、医学概念对齐度及错误模式)和专家评估来构建更可信的临床推理策略。
链接: https://arxiv.org/abs/2509.21933
作者: Jiageng Wu,Kevin Xie,Bowen Gu,Nils Krüger,Kueiyu Joshua Lin,Jie Yang
机构: Harvard Medical School (哈佛医学院); MIT (麻省理工学院); Broad Institute of MIT and Harvard (麻省理工学院和哈佛大学布罗德研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.
zh
[NLP-90] SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation
链接: https://arxiv.org/abs/2509.21932
作者: Haotian Tan,Hiroki Ouchi,Sakriani Sakti
机构: 未知
类目: Computation and Language (cs.CL)
备注: © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
[NLP-91] AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自动化评分任务中面临的准确性低、提示敏感性高、可解释性差以及评分量规(rubric)对齐不足等问题,这些问题限制了LLM在教育评估实践中的落地应用。解决方案的关键在于提出AutoSCORE框架,其采用多智能体设计:首先由“评分量规组件提取代理”(Scoring Rubric Component Extraction Agent)从学生作答中识别并结构化提取与评分量规相关的成分,随后由“评分代理”(Scoring Agent)基于该结构化表示进行最终打分。这种分步推理机制模拟人类评分流程,显著提升了评分的准确性、鲁棒性和可解释性,尤其在复杂多维评分量规下表现突出,且对小规模LLM带来相对更大的性能增益。
链接: https://arxiv.org/abs/2509.21910
作者: Yun Wang,Zhaojun Ding,Xuansheng Wu,Siyue Sun,Ninghao Liu,Xiaoming Zhai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures
点击查看摘要
Abstract:Automated scoring plays a crucial role in education by reducing the reliance on human raters, offering scalable and immediate evaluation of student work. While large language models (LLMs) have shown strong potential in this task, their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment. These issues hinder the implementation of LLM-based automated scoring in assessment practice. To address the limitations, we propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition. With two agents, AutoSCORE first extracts rubric-relevant components from student responses and encodes them into a structured representation (i.e., Scoring Rubric Component Extraction Agent), which is then used to assign final scores (i.e., Scoring Agent). This design ensures that model reasoning follows a human-like grading process, enhancing interpretability and robustness. We evaluate AutoSCORE on four benchmark datasets from the ASAP benchmark, using both proprietary and open-source LLMs (GPT-4o, LLaMA-3.1-8B, and LLaMA-3.1-70B). Across diverse tasks and rubrics, AutoSCORE consistently improves scoring accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE) compared to single-agent baselines, with particularly strong benefits on complex, multi-dimensional rubrics, and especially large relative gains on smaller LLMs. These results demonstrate that structured component recognition combined with multi-agent design offers a scalable, reliable, and interpretable solution for automated scoring.
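下面给出一个示意性的双代理流水线草图:先由抽取代理把学生作答整理为与评分量规对齐的结构化成分,再由打分代理仅基于该结构化表示给分;其中 llm 为任意文本到文本的调用接口,提示词、分值范围与 JSON 结构均为假设,并非 AutoSCORE 官方实现。

```python
import json
from typing import Callable, List

def autoscore_like_pipeline(llm: Callable[[str], str], response: str,
                            rubric: List[str]) -> dict:
    # 代理一:从作答中抽取与评分量规各成分对应的证据,输出结构化 JSON
    extract_prompt = (
        "Extract evidence for each rubric component from the response.\n"
        f"Rubric components: {rubric}\nResponse: {response}\n"
        'Return JSON like {"component": "evidence or null", ...}'
    )
    components = json.loads(llm(extract_prompt))

    # 代理二:仅基于结构化成分打分,使推理过程贴近人工评分流程
    score_prompt = (
        "Assign a score from 0 to 3 based only on these extracted components:\n"
        f"{json.dumps(components, ensure_ascii=False)}\nReturn a single integer."
    )
    return {"components": components, "score": int(llm(score_prompt).strip())}
```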
zh
[NLP-92] A Large-Scale Dataset and Citation Intent Classification in Turkish with LLMs
【速读】: 该论文旨在解决土耳其语等黏着语中引用意图(citation intent)定性分析的难题,这对学术研究的全面评估至关重要。其核心挑战在于现有方法在处理此类语言时效果不稳定,且缺乏标准化的数据集与高效的分类框架。解决方案的关键在于:首先构建了一个公开可用的土耳其语引用意图标注数据集;其次提出基于DSPy框架的可编程分类流水线,自动优化提示(prompt)以克服传统上下文学习(In-Context Learning, ICL)因人工设计提示导致的不一致性问题;最终采用XGBoost作为元模型的堆叠泛化集成方法,实现91.3%的准确率,显著提升了预测稳定性与可靠性。
链接: https://arxiv.org/abs/2509.21907
作者: Kemal Sami Karaca,Bahaeddin Eravcı
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE UBMK 2025 International Conference on Computer Science and Engineering
点击查看摘要
Abstract:Understanding the qualitative intent of citations is essential for a comprehensive assessment of academic research, a task that poses unique challenges for agglutinative languages like Turkish. This paper introduces a systematic methodology and a foundational dataset to address this problem. We first present a new, publicly available dataset of Turkish citation intents, created with a purpose-built annotation tool. We then evaluate the performance of standard In-Context Learning (ICL) with Large Language Models (LLMs), demonstrating that its effectiveness is limited by inconsistent results caused by manually designed prompts. To address this core limitation, we introduce a programmable classification pipeline built on the DSPy framework, which automates prompt optimization systematically. For final classification, we employ a stacked generalization ensemble to aggregate outputs from multiple optimized models, ensuring stable and reliable predictions. This ensemble, with an XGBoost meta-model, achieves a state-of-the-art accuracy of 91.3%. Ultimately, this study provides the Turkish NLP community and the broader academic circles with a foundational dataset and a robust classification framework paving the way for future qualitative citation studies.
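下面给出一个堆叠泛化的示意性草图:基模型的交叉验证概率输出拼接后交给 XGBoost 元模型;论文中的基模型是经 DSPy 优化的多个 LLM 分类器,这里用普通 sklearn 分类器与合成数据代替,仅演示集成结构,超参数为假设值。

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

def stack_with_xgb_meta(base_models, X_train, y_train, X_test):
    """各基模型的类别概率经 5 折交叉验证拼接后,作为 XGBoost 元模型的输入特征。"""
    train_meta, test_meta = [], []
    for model in base_models:
        oof = cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
        train_meta.append(oof)
        model.fit(X_train, y_train)
        test_meta.append(model.predict_proba(X_test))
    meta = XGBClassifier(n_estimators=200, max_depth=4)
    meta.fit(np.hstack(train_meta), y_train)
    return meta.predict(np.hstack(test_meta))

# 合成数据上的最小演示
X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)
preds = stack_with_xgb_meta([LogisticRegression(max_iter=500)],
                            X[:150], y[:150], X[150:])
print(preds[:10])
```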
zh
[NLP-93] Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
链接: https://arxiv.org/abs/2509.21892
作者: Naibin Gu,Zhenyu Zhang,Yuchen Feng,Yilong Chen,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[NLP-94] Agent Pack: A Dataset of Code Changes Co-Authored by Agents and Humans
链接: https://arxiv.org/abs/2509.21891
作者: Yangtian Zi,Zixuan Wu,Aleksander Boruch-Gruszecki,Jonathan Bell,Arjun Guha
机构: Northeastern University (东北大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:
[NLP-95] QoNext: Towards Next-generation QoE for Foundation Models
链接: https://arxiv.org/abs/2509.21889
作者: Yijin Guo,Ye Shen,Farong Wen,Junying Wang,Zicheng Zhang,Qi Jia,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-96] You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors CCS25
链接: https://arxiv.org/abs/2509.21884
作者: Bochuan Cao,Changjiang Li,Yuanpu Cao,Yameng Ge,Ting Wang,Jinghui Chen
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Palo Alto Networks (帕洛阿尔托网络公司); Stony Brook University (石溪大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages, 10 tables, 6 figures, accepted by CCS 25
[NLP-97] No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping
链接: https://arxiv.org/abs/2509.21880
作者: Thanh-Long V. Le,Myeongho Jeon,Kim Vu,Viet Lai,Eunho Yang
机构: KAIST(韩国科学技术院); EPFL(瑞士洛桑联邦理工学院); Adobe Research(Adobe研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[NLP-98] LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals
链接: https://arxiv.org/abs/2509.21875
作者: Min-Hsuan Yeh,Yixuan Li,Tanwi Mallick
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Argonne National Laboratory (阿贡国家实验室)
类目: Computation and Language (cs.CL)
备注:
[NLP-99] Enhancing Low-Rank Adaptation with Structured Nonlinear Transformations
【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)方法因线性特性导致表达能力受限的问题。其解决方案的关键在于提出LoRAN,即一种非线性扩展的LoRA框架,通过在低秩更新中引入轻量级非线性变换来增强模型的表示能力;同时,进一步设计了Sinter——一种基于正弦函数的激活机制,在不增加参数数量的前提下引入结构化扰动,从而提升微调性能。实验表明,LoRAN在摘要和分类任务上均优于QLoRA,且消融实验证明Sinter在性能上优于Sigmoid、ReLU和Tanh等标准激活函数,凸显了激活函数设计在低秩微调中的关键作用。
链接: https://arxiv.org/abs/2509.21870
作者: Guanzhi Deng,Mingyang Liu,Dapeng Wu,Yinqiao Li,Linqi Song
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This manuscript has been submitted to IEEE Journal of Selected Topics in Signal Processing (JSTSP) for review. Until the moment I submitted the manuscript to arXiv, we haven’t received any review comments from JSTSP
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models. However, its linear nature limits expressiveness. We propose LoRAN, a non-linear extension of LoRA that applies lightweight transformations to the low-rank updates. We further introduce Sinter, a sine-based activation that adds structured perturbations without increasing parameter count. Experiments across summarization and classification tasks show that LoRAN consistently improves over QLoRA. Ablation studies reveal that Sinter outperforms standard activations such as Sigmoid, ReLU, and Tanh, highlighting the importance of activation design in low-rank tuning.
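下面用 PyTorch 给出一个示意性草图:在 LoRA 的低秩路径中插入一个不增加参数量的正弦扰动;这里把 Sinter 写成 x + sin(x) 的形式仅为说明,具体的非线性形式、秩与缩放系数均为假设,以论文为准。

```python
import math
import torch
import torch.nn as nn

class LoRANLinear(nn.Module):
    """LoRA 的非线性扩展草图:低秩更新路径中加入轻量非线性变换。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # 冻结原始权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) / math.sqrt(r))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    @staticmethod
    def sinter(x: torch.Tensor) -> torch.Tensor:
        return x + torch.sin(x)                              # 不新增参数的正弦扰动(假设形式)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = self.sinter(x @ self.A.T) @ self.B.T        # 低秩路径中的非线性
        return self.base(x) + self.scaling * update

layer = LoRANLinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
```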
zh
[NLP-100] What Makes LLM Agent Simulations Useful for Policy? Insights From an Iterative Design Engagement in Emergency Preparedness
链接: https://arxiv.org/abs/2509.21868
作者: Yuxuan Li,Sauvik Das,Hirokazu Shirado
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
[NLP-101] KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues
链接: https://arxiv.org/abs/2509.21856
作者: Junhao Chen,Yu Huang,Siyuan Li,Rui Yao,Hanqian Li,Hanyu Zhang,Jungang Li,Jian Chen,Bowen Wang,Xuming Hu
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Hong Kong University of Science and Technology (香港科技大学); The University of Osaka (大阪大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-102] Following the TRACE: A Structured Path to Empathetic Response Generation with Multi-Agent Models
【速读】: 该论文旨在解决当前情感回应生成任务中存在的一大核心矛盾:专业化模型在分析深度上的优势与大型语言模型(LLM)在生成流畅性上的优势难以兼顾。为应对这一挑战,作者提出了一种名为TRACE(Task-decomposed Reasoning for Affective Communication and Empathy)的新框架,其关键在于将情感理解与回应生成过程结构化地分解为一个分阶段的分析-合成流水线(analysis-synthesis pipeline),从而在生成前建立全面的认知理解,实现深度分析与高表达力生成的统一。实验结果表明,该方法在自动评估和基于LLM的人工评估中均显著优于现有强基线模型,验证了结构化任务分解作为构建更强大且可解释的情感对话代理的可行范式。
链接: https://arxiv.org/abs/2509.21849
作者: Ziqi Liu,Ziyang Zhou,Yilin Li,Haiyang Zhang,Yangbin Chen
机构: 未知
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Empathetic response generation is a crucial task for creating more human-like and supportive conversational agents. However, existing methods face a core trade-off between the analytical depth of specialized models and the generative fluency of Large Language Models (LLMs). To address this, we propose TRACE, Task-decomposed Reasoning for Affective Communication and Empathy, a novel framework that models empathy as a structured cognitive process by decomposing the task into a pipeline for analysis and synthesis. By building a comprehensive understanding before generation, TRACE unites deep analysis with expressive generation. Experimental results show that our framework significantly outperforms strong baselines in both automatic and LLM-based evaluations, confirming that our structured decomposition is a promising paradigm for creating more capable and interpretable empathetic agents. Our code is available at this https URL.
zh
[NLP-103] SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models
链接: https://arxiv.org/abs/2509.21843
作者: Jingkai Guo,Chaitali Chakrabarti,Deliang Fan
机构: Arizona State University (亚利桑那州立大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 4 figures, 5 tables, 2 equations. Topics: Bit-flip attacks, adversarial attacks, large language models (LLMs)
[NLP-104] Semantic Agreement Enables Efficient Open-Ended LLM Cascades EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)部署中成本与生成质量之间的权衡问题,特别是在开放文本生成场景下如何准确判断输出可靠性——因为生成质量通常处于连续谱上且存在多个合理答案。其解决方案的关键在于提出“语义一致性”(semantic agreement),即通过集成多个模型输出在语义层面达成的一致性作为无需训练的可靠度信号,替代传统的基于token级置信度的判断方式。实验表明,该方法可在不访问模型内部结构的前提下,适用于黑盒API,并实现高达40%的成本降低和60%的延迟减少,同时保持或超越目标模型的质量表现。
链接: https://arxiv.org/abs/2509.21837
作者: Duncan Soiffer,Steven Kolawole,Virginia Smith
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Industry Track
点击查看摘要
Abstract:Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement – meaning-level consensus between ensemble outputs – as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
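下面是一个示意性的语义级联草图:小模型多次作答,若句向量两两余弦相似度足够高则直接采用其答案,否则升级到大模型;其中 small_ask、large_ask、embed 均为任意可调用接口,阈值与采样次数为假设值。

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def semantic_cascade(small_ask, large_ask, embed, prompt,
                     n_samples=3, agree_threshold=0.85):
    """small_ask / large_ask 返回文本,embed 返回句向量(任意句向量模型均可)。
    语义一致性高则视为可靠,直接返回小模型答案;否则延迟给大模型。"""
    answers = [small_ask(prompt) for _ in range(n_samples)]
    vecs = [np.asarray(embed(a)) for a in answers]
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    agreement = float(np.mean(sims)) if sims else 1.0
    if agreement >= agree_threshold:
        return {"answer": answers[0], "deferred": False, "agreement": agreement}
    return {"answer": large_ask(prompt), "deferred": True, "agreement": agreement}
```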
zh
[NLP-105] ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在工具使用任务中因依赖稀疏结果奖励而导致策略梯度方差过大、训练效率低的问题。现有强化学习(Reinforcement Learning, RL)方法未充分考虑工具使用任务的特性,导致策略优化不稳定。解决方案的关键在于提出一种基于熵感知的token级策略梯度重塑方法(Reshaped Token-level policy gradients, ResT),通过引入熵引导的token重加权机制,在训练过程中逐步提升推理token的权重,从而实现从结构正确性到语义推理的平滑过渡,并显著提升多轮工具使用任务中的训练稳定性和收敛性。
链接: https://arxiv.org/abs/2509.21826
作者: Zihan Lin,Xiaohan Wang,Jie Cao,Jiajun Chai,Guojun Yin,Wei Lin,Ran He
机构: CRIPAC, Institute of Automation (中国科学院自动化研究所); Meituan(美团)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.
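下面给出一个示意性的 token 级损失草图,说明"按 token 熵进行重加权、并随训练进度逐步上调高熵(推理)token 权重"的思路;权重函数形式与超参数均为本文假设,并非 ResT 的官方实现。

```python
import torch
import torch.nn.functional as F

def rest_style_loss(logits, actions, advantages, step_frac, beta=1.0):
    """熵感知的 token 级策略梯度重塑草图:
    logits:     [T, V] 每个生成位置的词表 logits
    actions:    [T]    实际采样的 token id
    advantages: [T]    每个 token 的优势估计
    step_frac:  0~1    训练进度,用于随训练推进逐步上调高熵 token 的权重。"""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # [T]
    probs = logp.exp()
    entropy = -(probs * logp).sum(-1)                                  # [T]
    norm_ent = entropy / entropy.max().clamp_min(1e-6)                 # 归一化到 [0,1]
    weights = 1.0 + beta * step_frac * norm_ent                        # 推理 token 权重渐增(假设形式)
    weights = weights / weights.mean()                                 # 保持整体梯度尺度
    return -(weights.detach() * advantages * token_logp).mean()

loss = rest_style_loss(torch.randn(10, 32000), torch.randint(0, 32000, (10,)),
                       torch.randn(10), step_frac=0.5)
print(loss)
```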
zh
[NLP-106] Can LLMs Solve and Generate Linguistic Olympiad Puzzles? EMNLP2025
【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)提升语言谜题的求解效率,并进一步探索生成语言谜题的新任务,以促进语言学知识的普及与传播。其解决方案的关键在于:首先扩展了现有语言谜题求解的基准数据集,系统评估了包括OpenAI o1在内的最新LLMs在不同语言主题上的表现,发现LLMs在多数谜题类型中优于人类,尤其在书写系统相关的谜题上表现不佳;其次,基于这些求解实验的洞察,设计并实现了语言谜题生成机制,从而将谜题求解能力转化为可自动化的生成任务,为推广语言学、特别是稀有和未充分研究的语言提供新的工具与路径。
链接: https://arxiv.org/abs/2509.21820
作者: Neh Majmudar,Elena Filatova
机构: CUNY(纽约市立大学)
类目: Computation and Language (cs.CL)
备注: To be published in the Proceedings of Main Conference of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)
点击查看摘要
Abstract:In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI’s o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzles types, except for those centered on writing systems, and for the understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.
zh
[NLP-107] Towards Minimal Causal Representations for Human Multimodal Language Understanding
链接: https://arxiv.org/abs/2509.21805
作者: Menghua Jiang,Yuncheng Jiang,Haifeng Hu,Sijie Mai
机构: South China Normal University (华南师范大学); Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-108] Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies
链接: https://arxiv.org/abs/2509.21801
作者: Qianen Zhang,Satoshi Nakamura
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Nara Institute of Science and Technology (奈良科学技术研究所)
类目: Computation and Language (cs.CL)
备注:
[NLP-109] Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment ICLR2026
链接: https://arxiv.org/abs/2509.21798
作者: Hongbin Zhang,Kehai Chen,Xuefeng Bai,Yang Xiang,Min Zhang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区); Peng Cheng Laboratory(鹏城实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review on ICLR 2026;Work in progress;
[NLP-110] Compiling by Proving: Language-Agnostic Automatic Optimization from Formal Semantics
链接: https://arxiv.org/abs/2509.21793
作者: Jianhong Zhao,Everett Hildenbrandt,Juan Conejero,Yongwang Zhao
机构: Runtime Verification Inc. (运行时验证公司)
类目: Programming Languages (cs.PL); Computation and Language (cs.CL)
备注:
[NLP-111] Navigating the Impact of Structured Output Format on Large Language Models through the Compass of Causal Inference
链接: https://arxiv.org/abs/2509.21791
作者: Han Yuan,Yue Zhao,Li Zhang,Wuqiong Luo,Zheng Ma
机构: American Express(美国运通)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
[NLP-112] DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images AAAI2024
链接: https://arxiv.org/abs/2509.21787
作者: Dwip Dalal,Gautam Vashishtha,Anku Ranui,Aishwarya Reganti,Parth Patwa,Mohd Sarique,Chandan Gupta,Keshav Nath,Viswanatha Reddy,Vinija Jain,Aman Chadha,Amitava Das,Amit Sheth,Asif Ekbal
机构: IIT Gandhinagar, India; MIT Media Lab, USA; CMU, USA; UCLA, USA; IIIT Kalyani, India; IIIT Delhi, India; DTU, India; UW Madison, USA; Stanford University, USA; Amazon GenAI, USA; University of South Carolina, USA; IIT Patna, India
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Defactify 3 workshop at AAAI 2024
[NLP-113] SynerGen: Contextualized Generative Recommender for Unified Search and Recommendation
【速读】: 该论文旨在解决大规模推荐系统中“检索-排序”流水线因架构分离和优化目标不一致导致的校准失准(mis-calibration)与工程开销过大的问题。现有生成式序列模型虽尝试通过自回归生成排序结果来统一检索与排序,但通常仅能处理个性化搜索或无查询推荐,难以同时兼顾两者且存在性能折衷。其解决方案的关键在于提出一种名为SynerGen的新颖生成式推荐模型,该模型以单一生成式骨干网络同时支持个性化搜索与推荐任务,并通过联合优化策略实现检索与排序能力的协同提升:利用InfoNCE损失进行检索建模,结合点对点与成对损失(hybrid pointwise-pairwise loss)优化排序,使搜索中的语义信号增强推荐效果、反之亦然;此外,引入时间感知旋转位置编码(time-aware rotary positional embedding)有效融合时序信息至注意力机制中,从而在多个主流推荐与搜索基准上显著优于强基线模型,验证了单个生成式基础模型在工业级统一信息获取场景下的可行性。
链接: https://arxiv.org/abs/2509.21777
作者: Vianne R. Gao,Chen Xue,Marc Versage,Xie Zhou,Zhongruo Wang,Chao Li,Yeon Seonwoo,Nan Chen,Zhen Ge,Gourab Kundu,Weiqi Zhang,Tian Wang,Qingjun Cui,Trishul Chilimbi
机构: Store Foundation AI, Amazon(亚马逊)
类目: Computation and Language (cs.CL)
备注: Generative Recommender, Recommendation System, Information Retrieval
点击查看摘要
Abstract:The dominant retrieve-then-rank pipeline in large-scale recommender systems suffers from mis-calibration and engineering overhead due to its architectural split and differing optimization objectives. While recent generative sequence models have shown promise in unifying retrieval and ranking by auto-regressively generating ranked items, existing solutions typically address either personalized search or query-free recommendation, often exhibiting performance trade-offs when attempting to unify both. We introduce SynerGen, a novel generative recommender model that bridges this critical gap by providing a single generative backbone for both personalized search and recommendation, while simultaneously excelling at retrieval and ranking tasks. Trained on behavioral sequences, our decoder-only Transformer leverages joint optimization with InfoNCE for retrieval and a hybrid pointwise-pairwise loss for ranking, allowing semantic signals from search to improve recommendation and vice versa. We also propose a novel time-aware rotary positional embedding to effectively incorporate time information into the attention mechanism. SynerGen achieves significant improvements on widely adopted recommendation and search benchmarks compared to strong generative recommender and joint search and recommendation baselines. This work demonstrates the viability of a single generative foundation model for industrial-scale unified information access.
zh
[NLP-114] UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
【速读】: 该论文旨在解决当前自主代理(Autonomous Agents)在长时程(long-horizon)和部分可观测(partially observable)场景下评估不足的问题,这类场景广泛存在于软件开发、商业投资与科学发现等复杂现实任务中。现有基准测试多聚焦于短时程、完全可观测的任务,难以系统性衡量代理在持续推理、规划、记忆管理和工具调用等方面的核心能力。为此,作者提出了 UltraHorizon 基准,通过探索任务统一跨三个不同环境的评估框架,要求代理在长时程发现任务中迭代识别隐藏规则,并整合推理、规划、记忆与工具使用能力。其关键创新在于构建了具有高复杂度的轨迹数据集(平均超过20万token和400次工具调用),并揭示出当前大语言模型代理(LLM-agents)在长时程任务中存在显著性能缺陷,主要归因于“上下文锁定”(in-context locking)和“功能基础能力缺口”(functional fundamental capability gaps),从而为后续研究指明方向。
链接: https://arxiv.org/abs/2509.21766
作者: Haotian Luo,Huaisong Zhang,Xuelin Zhang,Haoyu Wang,Zeyu Qin,Wenjie Lu,Guozheng Ma,Haiying He,Yingsha Xie,Qiyang Zhou,Zixuan Hu,Hongze Mi,Yibo Wang,Naiqiang Tan,Hong Chen,Yi R. Fung,Chun Yuan,Li Shen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tools management, and interaction with environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. Our code will be available at this https URL.
zh
[NLP-115] Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
【速读】: 该论文旨在解决当前大型音频语言模型(Large Audio-Language Models, LALMs)在复杂声学场景下进行音频推理任务时表现不佳的问题,尤其是其缺乏对噪声抑制、声源分离和精确时间对齐等声学工具的调用能力。解决方案的关键在于提出Thinking-with-Sound(TwS)框架,通过引入Audio Chain-of-Thought(Audio CoT),使LALMs能够在推理过程中结合语言推理与实时音频域分析,实现对音频信号的数值计算和数字操作,从而主动“思考”音频内容而非仅将其视为静态输入。此方法无需重新训练即可显著提升模型在噪声干扰下的鲁棒性。
链接: https://arxiv.org/abs/2509.21749
作者: Zhen Xiong,Yujun Cai,Zhecheng Li,Junsong Yuan,Yiwei Wang
机构: University of Southern California (南加州大学); University of Queensland (昆士兰大学); University of California, San Diego (加州大学圣地亚哥分校); University of Buffalo (水牛城大学); University of California, Merced (加州大学默塞德分校)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:
点击查看摘要
Abstract:Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than 50% compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain 24.73% absolute accuracy, with improvements scaling consistently up to 36.61% for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.
zh
[NLP-116] Self-Speculative Biased Decoding for Faster Live Translation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在流式应用(如实时翻译)中因输入上下文持续扩展而导致的高延迟问题,即如何在保持较低计算成本的同时实现连续更新的输出。其核心解决方案是提出一种名为“自洽推测性偏置解码”(Self-Speculative Biased Decoding)的新推理范式:利用最新生成的输出作为当前不断增长输入上下文的草稿,在验证阶段对候选词进行偏置以提高草稿接受率,从而避免重复从头生成,显著减少闪烁现象并提升推理速度。该方法无需额外草稿计算,具有模型无关性和即插即用特性,实验表明相比传统自回归重翻译策略可实现最高1.7倍加速,同时通过引入仅显示掩码-k(display-only mask-k)技术将闪烁减少80%。
链接: https://arxiv.org/abs/2509.21740
作者: Linxiao Zeng,Haoyun Deng,Kangyuan Shu,Shizhen Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have recently demonstrated impressive capabilities in various text generation tasks. However, it remains challenging to use them off-the-shelf in streaming applications (such as live translation), where the output must continually update as the input context expands, while still maintaining a reasonable computational cost to meet the latency requirement. In this work, we reexamine the re-translation approach to simultaneous translation and propose Self-Speculative Biased Decoding, a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We propose using the most recent output as a draft for the current growing input context. During the verification stage, the output will be biased towards the draft token for a higher draft acceptance rate. This strategy not only minimizes flickering that might distract users but also leads to higher speedups. Conventional decoding may take charge from the point of divergence after draft verification and continue until the end condition is met. Unlike existing speculative decoding strategies, our approach eliminates the need for draft computations, making it a model-agnostic and plug-and-play solution for accelerating latency-sensitive streaming applications. Experimental results on simultaneous text-to-text re-translation demonstrate that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality. Additionally, it significantly reduces flickering by 80% by incorporating the display-only mask-k technique.
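下面给出一个简化的验证阶段草图:把上一轮译文当作本轮(输入增长后)的草稿,一次前向给所有草稿位置打分,并给草稿 token 的 logit 加偏置以提高接受率,在首个分歧点截断;这里采用的贪心验证与偏置大小均为本文假设的简化,并非论文官方实现。

```python
import torch

@torch.no_grad()
def biased_draft_verification(model, prefix_ids, draft_ids, bias=2.0):
    """prefix_ids: [T_p] 当前输入前缀;draft_ids: [T_d] 上一轮输出充当的草稿。
    model(ids) 需返回 [1, T, V] 的自回归 logits。"""
    full = torch.cat([prefix_ids, draft_ids]).unsqueeze(0)      # [1, T_p + T_d]
    logits = model(full)[0]                                     # [T, V]
    n_prefix = prefix_ids.size(0)
    accepted = []
    for i, draft_tok in enumerate(draft_ids.tolist()):
        step_logits = logits[n_prefix + i - 1].clone()          # 预测第 i 个草稿位置
        step_logits[draft_tok] += bias                          # 向草稿 token 偏置
        tok = int(torch.argmax(step_logits))
        accepted.append(tok)
        if tok != draft_tok:                                    # 首个分歧点:停止接受
            break                                               # 其后交还常规自回归解码
    return torch.tensor(accepted, dtype=torch.long)
```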
zh
[NLP-117] UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
链接: https://arxiv.org/abs/2509.21733
作者: Jiannan Xiang,Yun Zhu,Lei Shu,Maria Wang,Lijun Yu,Gabriel Barcik,James Lyon,Srinivas Sunkara,Jindong Chen
机构: University of California, San Diego (加州大学圣地亚哥分校); Google DeepMind
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
[NLP-118] How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts? EMNLP2025
链接: https://arxiv.org/abs/2509.21732
作者: Xiliang Zhu,Shi Zong,David Rossouw
机构: Dialpad Inc. (Dialpad公司)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2025 Industry Track
[NLP-119] ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
链接: https://arxiv.org/abs/2509.21730
作者: Jiho Kim,Junseong Choi,Woosog Chay,Daeun Kyung,Yeonsu Kwon,Yohan Jo,Edward Choi
机构: KAIST (韩国科学技术院); SNU (首尔国立大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-120] Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval
链接: https://arxiv.org/abs/2509.21710
作者: Xiaojun Wu,Cehao Yang,Xueyuan Lin,Chengjin Xu,Xuhui Jiang,Yuanliang Sun,Hui Xiong,Jia Li,Jian Guo
机构: IDEA Research (国际数字经济发展研究院); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); DataArc Tech Ltd. (数据弧科技有限公司); Hithink RoyalFlush Information Network Co., Ltd (同花顺)
类目: Computation and Language (cs.CL)
备注: 28 pages, 17 figures
[NLP-121] GRAB: A Risk Taxonomy–Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures NEURIPS2025
【速读】: 该论文旨在解决金融领域10-K风险披露文本中风险类别划分的评估问题,现有研究缺乏公开基准来无监督地评估主题模型在该任务上的表现。其解决方案的关键在于构建GRAB——一个面向金融领域的基准数据集,包含来自8,247份财报的161万句文本,并通过结合FinBERT词元注意力、YAKE关键词信号与基于分类体系的共现匹配方法自动标注句子标签,无需人工标注;标签锚定于一个将193个术语映射到21个细粒度类别的风险分类体系,其中21个细粒度类型用于弱监督训练,而评估则以5个宏观类别进行报告。该方案统一了评估标准,提供固定数据划分和鲁棒指标(如准确率、宏F1、Topic BERTScore及基于熵的有效话题数),支持对传统、嵌入式、神经网络及混合主题模型在财务披露中的标准化比较。
链接: https://arxiv.org/abs/2509.21698
作者: Ying Li,Tiejun Ma
机构: The University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: NeurIPS 2025 Workshop on Generative AI in Finance
点击查看摘要
Abstract:Risk categorization in 10-K risk disclosures matters for oversight and investment, yet no public benchmark evaluates unsupervised topic models for this task. We present GRAB, a finance-specific benchmark with 1.61M sentences from 8,247 filings and span-grounded sentence labels produced without manual annotation by combining FinBERT token attention, YAKE keyphrase signals, and taxonomy-aware collocation matching. Labels are anchored in a risk taxonomy mapping 193 terms to 21 fine-grained types nested under five macro classes; the 21 types guide weak supervision, while evaluation is reported at the macro level. GRAB unifies evaluation with fixed dataset splits and robust metrics–Accuracy, Macro-F1, Topic BERTScore, and the entropy-based Effective Number of Topics. The dataset, labels, and code enable reproducible, standardized comparison across classical, embedding-based, neural, and hybrid topic models on financial disclosures.
zh
[NLP-122] ReviewScore: Misinformed Peer Review Detection with Large Language Models
【速读】: 该论文旨在解决人工智能(AI)顶会中审稿质量随投稿量激增而持续下降的问题,核心挑战在于如何可靠识别低质量审稿意见。其解决方案的关键在于提出“误判审稿点”(misinformed review points)的定义,即审稿中包含错误前提的“弱点”或可被论文直接解答的“问题”,并构建ReviewScore指标来量化此类误判。为评估审稿点中每个前提的事实性,作者设计了一个自动化的推理引擎以重构显性和隐性前提,并基于人工专家标注的ReviewScore数据集测试大语言模型(LLM)的自动化能力,结果表明在前提层级进行事实性判断能显著提升人机一致性,验证了实现全自动审稿质量评估的可行性。
链接: https://arxiv.org/abs/2509.21679
作者: Hyun Ryu,Doohyuk Jang,Hyemin S. Lee,Joonhyun Jeong,Gyeongman Kim,Donghyeon Cho,Gyouk Chu,Minyeong Hwang,Hyeongwon Jang,Changhun Kim,Haechan Kim,Jina Kim,Joowon Kim,Yoonjeon Kim,Kwanhyung Lee,Chanjae Park,Heecheol Yun,Gregor Betz,Eunho Yang
机构: KAIST(韩国科学技术院); MIT(麻省理工学院); KRAFTON; AITRICS; KIT(卡尔斯鲁厄理工学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Peer review serves as a backbone of academic research, but in most AI conferences, the review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either “weaknesses” in a review that contain incorrect premises, or “questions” in a review that can be already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore indicating if a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports a potential of fully automated ReviewScore evaluation.
zh
[NLP-123] Towards Transparent AI: A Survey on Explainable Language Models
链接: https://arxiv.org/abs/2509.21631
作者: Avash Palikhe,Zichong Wang,Zhipeng Yin,Rui Guo,Qiang Duan,Jie Yang,Wenbin Zhang
机构: Florida International University (佛罗里达国际大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-124] InvBench: Can LLMs Accelerate Program Verification with Invariant Synthesis?
链接: https://arxiv.org/abs/2509.21629
作者: Anjiang Wei,Tarun Suresh,Tianran Sun,Haoze Wu,Ke Wang,Alex Aiken
机构: Stanford University (斯坦福大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Shanghai Jiao Tong University (上海交通大学); Amherst College (阿姆赫斯特学院); Nanjing University (南京大学)
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
[NLP-125] OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Ojas Rule
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文推理中因键值缓存(Key-Value Cache, KV Cache)占用内存过大而导致的显著内存瓶颈问题。例如,Llama-3.1-8B模型在处理32K-token提示词时,其KV缓存所需内存可达16GB,甚至超过模型参数本身。传统低秩投影压缩方法依赖静态离线学习的子空间,在数据分布变化时性能下降明显。解决方案的关键在于提出OjaKV框架,该框架结合了策略性混合存储机制与在线子空间自适应:首先保留关键首尾token的全秩表示以维持注意力锚点;其次对中间大部分token采用低秩压缩,并通过Oja算法实现增量式主成分分析(Online Principal Component Analysis),在预填充阶段进行完整更新、解码阶段轻量周期更新,确保投影基底持续匹配动态上下文变化。该方案兼容FlashAttention等现代注意力模块,在高压缩比下仍能保持或提升零样本准确率,尤其在复杂推理任务上表现优异,成为无需微调的即插即用型长上下文高效推理方案。
链接: https://arxiv.org/abs/2509.21623
作者: Yuxuan Zhu,David H. Yang,Mohammad Mohammadi Amiri,Keerthiram Murugesan,Tejaswini Pedapati,Pin-Yu Chen
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); IBM Research (IBM 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model’s weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja’s algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
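为帮助理解摘要中 Oja 规则在在线低秩 KV 压缩中的作用,下面给出一个极简的 NumPy 示意(非论文官方实现,接口与超参数均为假设):投影基随新到达的 key/value 向量增量更新,首尾 token 保留全秩锚点,中间 token 仅存低秩坐标。

```python
import numpy as np

def oja_update(W, x, lr=1e-3):
    """Oja 规则单步更新:W 为 (d, r) 投影基,x 为 (d,) 新到的 key/value 向量。"""
    y = W.T @ x                                       # 低秩坐标 (r,)
    W = W + lr * (np.outer(x, y) - W @ np.outer(y, y))
    W, _ = np.linalg.qr(W)                            # 重新正交化,保持数值稳定(示意)
    return W

d, r = 64, 8
rng = np.random.default_rng(0)
W = np.linalg.qr(rng.normal(size=(d, r)))[0]          # 初始正交投影基

full_rank_cache, low_rank_cache = [], []
for t, x in enumerate(rng.normal(size=(1000, d))):
    W = oja_update(W, x)                              # 在线子空间自适应
    if t < 4 or t >= 996:                             # 首尾 token 保留全秩锚点
        full_rank_cache.append(x)
    else:                                             # 中间 token 仅存低秩坐标
        low_rank_cache.append(W.T @ x)
```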
zh
[NLP-126] Multi-Objective Reinforcement Learning for Large Language Model Optimization: Visionary Perspective ECAI
链接: https://arxiv.org/abs/2509.21613
作者: Lingxiao Kong,Cong Yang,Oya Deniz Beyan,Zeyd Boukhers
机构: Fraunhofer Institute for Applied Information Technology FIT (弗劳恩霍夫应用信息技术研究所); Soochow University (苏州大学); University Hospital of Cologne (科隆大学医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 3 pages, 1 figure, accepted by ECAI MODeM 2025
[NLP-127] Leveraging Big Data Frameworks for Spam Detection in Amazon Reviews
【速读】: 该论文旨在解决在线购物环境中虚假评论(spam reviews)泛滥所引发的消费者信任危机问题。虚假评论可能误导消费者决策并损害商家声誉,从而破坏电商平台的可信度。解决方案的关键在于利用大规模数据处理技术和机器学习方法对亚马逊产品评论进行精准检测与分类,通过构建可扩展的大数据框架高效提取欺诈行为的关键特征,并采用多种机器学习分类器进行建模分析,其中逻辑回归(Logistic Regression)模型达到了90.35%的准确率,显著提升了评论的真实性与平台透明度。
链接: https://arxiv.org/abs/2509.21579
作者: Mst Eshita Khatun,Halima Akter,Tasnimul Rehan,Toufiq Ahmed
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted presented at THE 16th INTERNATIONAL IEEE CONFERENCE ON COMPUTING, COMMUNICATION AND NETWORKING TECHNOLOGIES (ICCCNT) 2025
点击查看摘要
Abstract:In this digital era, online shopping is common practice in our daily lives. Product reviews significantly influence consumer buying behavior and help establish buyer trust. However, the prevalence of fraudulent reviews undermines this trust by potentially misleading consumers and damaging the reputations of the sellers. This research addresses this pressing issue by employing advanced big data analytics and machine learning approaches on a substantial dataset of Amazon product reviews. The primary objective is to detect and classify spam reviews accurately so that it enhances the authenticity of the review. Using a scalable big data framework, we efficiently process and analyze a large scale of review data, extracting key features indicative of fraudulent behavior. Our study illustrates the utility of various machine learning classifiers in detecting spam reviews, with Logistic Regression achieving an accuracy of 90.35%, thus contributing to a more trustworthy and transparent online shopping environment.
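下面给出一个 TF-IDF + Logistic Regression 的最小 scikit-learn 示意,说明这类评论垃圾检测基线的常见做法;玩具数据与参数均为演示用假设,并非论文所用的大数据框架或特征工程。

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 玩具数据:0 表示真实评论,1 表示疑似垃圾/刷单评论(仅作演示)
reviews = [
    "Great battery life, works exactly as described.",
    "Arrived late but the quality is decent for the price.",
    "BEST DEAL EVER!!! click the link to win a free gift card",
    "Amazing amazing amazing five stars buy buy buy now",
]
labels = [0, 0, 1, 1]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),    # 文本向量化
    LogisticRegression(max_iter=1000),                # 论文中表现最好的分类器
)
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.5, random_state=42, stratify=labels
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```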
zh
[NLP-128] “Be My Cheese?”: Assessing Cultural Nuance in Multilingual LLM Translations
链接: https://arxiv.org/abs/2509.21577
作者: Madison Van Doren,Cory Holland
机构: 未知
类目: Computation and Language (cs.CL)
备注:
[NLP-129] Vision Language Models Cannot Plan but Can They Formalize?
【速读】: 该论文旨在解决多模态环境中长程规划(long-horizon planning)的挑战,即如何让视觉语言模型(Vision Language Models, VLMs)在复杂、开放词汇且图像质量较低的真实场景中,准确地将任务描述转化为形式化规划语言(如PDDL),从而调用正式求解器生成可验证的行动计划。现有基于视觉语言模型(Vision-Language Models, VLMs)的方法通常将动作预测视为文本生成任务,难以实现精确的空间控制。本文提出See, Point, Fly (SPF)框架,其核心创新在于将动作预测重新建模为2D空间定位(spatial grounding)任务:利用VLMs将模糊的语言指令分解为图像上的迭代2D航点标注,并结合预测的行进距离,将其转换为UAV可执行的3D位移向量作为动作指令;同时引入自适应距离调整机制以提升导航效率,并采用闭环控制策略支持动态环境中的目标追踪。这一设计显著提升了导航准确性与泛化能力,在仿真和真实场景中均取得领先性能。其解决方案的关键在于提出五种VLM-as-formalizer管道,实现“单样本(one-shot)”、“开放词汇”和“多模态”的PDDL形式化,相较于端到端直接生成动作序列的方法显著提升性能;同时揭示当前瓶颈在于视觉理解能力不足,而非语言处理,尤其在捕捉必要对象关系方面存在局限,尽管中间文本表示(如描述或场景图)能部分缓解问题,但效果不稳定,仍需进一步研究。
链接: https://arxiv.org/abs/2509.21576
作者: Muyu He,Yuxi Zheng,Yuchen Liu,Zijian An,Bill Cai,Jiani Huang,Lifeng Zhou,Feng Liu,Ziyang Li,Li Zhang
机构: Drexel University (德雷塞尔大学); University of Pennsylvania (宾夕法尼亚大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs. Instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language like the Planning Domain Definition Language (PDDL), which can call a formal solver to derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce, usually involving gross simplifications such as predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate those on an existing benchmark while presenting another two that for the first time account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations. While generating intermediate, textual representations such as captions or scene graphs partially compensate for the performance, their inconsistent gain leaves headroom for future research directions on multimodal planning formalization.
zh
[NLP-130] Comparative Personalization for Multi-document Summarization
链接: https://arxiv.org/abs/2509.21562
作者: Haoyuan Li,Snigdha Chaturvedi
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL)
备注:
[NLP-131] Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution NEURIPS2025
链接: https://arxiv.org/abs/2509.21557
作者: Yash Saxena,Raviteja Bommireddy,Ankur Padia,Manas Gaur
机构: UMBC (University of Maryland, Baltimore County); IIITDM (International Institute of Information Technology, Design and Manufacturing)
类目: Computation and Language (cs.CL)
备注: Accepted at NeurIPS 2025 LLM Evaluation Workshop
[NLP-132] Domain-Aware Speaker Diarization On African-Accented English
【速读】: 该论文旨在解决非洲口音英语在说话人分割(speaker diarization)任务中因领域差异导致的性能下降问题,特别是临床对话场景下的显著误差增加。其关键解决方案在于通过轻量级领域自适应方法——在口音匹配数据上微调分割模块(segmentation module),有效缓解了跨领域误差,但未能完全消除差距;同时提出了一种可控的基准测试框架、对话级别的错误分解方法及易于复现的适配策略,强调未来应关注重叠感知的分割技术和临床资源的均衡配置以提升系统鲁棒性。
链接: https://arxiv.org/abs/2509.21554
作者: Chibuzor Okocha,Kelechi Ezema,Christan Grant
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages
点击查看摘要
Abstract:This study examines domain effects in speaker diarization for African-accented English. We evaluate multiple production and open systems on general and clinical dialogues under a strict DER protocol that scores overlap. A consistent domain penalty appears for clinical speech and remains significant across models. Error analysis attributes much of this penalty to false alarms and missed detections, aligning with short turns and frequent overlap. We test lightweight domain adaptation by fine-tuning a segmentation module on accent-matched data; it reduces error but does not eliminate the gap. Our contributions include a controlled benchmark across domains, a concise approach to error decomposition and conversation-level profiling, and an adaptation recipe that is easy to reproduce. Results point to overlap-aware segmentation and balanced clinical resources as practical next steps.
zh
[NLP-133] Learning GUI Grounding with Spatial Reasoning from Visual Feedback
链接: https://arxiv.org/abs/2509.21552
作者: Yu Zhao,Wei-Ning Chen,Huseyin Atahan Inan,Samuel Kessler,Lu Wang,Lukas Wutschitz,Fangkai Yang,Chaoyun Zhang,Pasquale Minervini,Saravan Rajmohan,Robert Sim
机构: University of Edinburgh (爱丁堡大学); Microsoft (微软); Miniml.AI
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
[NLP-134] C-QUERI: Congressional Questions Exchanges and Responses in Institutions Dataset
链接: https://arxiv.org/abs/2509.21548
作者: Manjari Rudra,Daniel Magleby,Sujoy Sikdar
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
[NLP-135] Agribot: agriculture-specific question answer system
链接: https://arxiv.org/abs/2509.21535
作者: Naman Jain,Pranjali Jain,Pratik Kayal,Jayakrishna Sahit,Soham Pachpande,Jayesh Choudhari
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[NLP-136] Uncertainty-Aware Knowledge Tracing Models
【速读】: 该论文旨在解决知识追踪(Knowledge Tracing, KT)模型在预测学生答题错误时的局限性,尤其是当学生选择干扰项(distractor)时,现有模型往往无法准确识别错误,导致对学生学习状态的误判。其解决方案的关键在于引入对模型预测不确定性的捕捉机制,研究表明预测不确定性与模型错误预测之间存在显著正相关关系,从而为教育平台提供了一种可解释且具有教学价值的信号,尤其适用于资源有限但需精准评估学生能力的教学场景。
链接: https://arxiv.org/abs/2509.21514
作者: Joshua Mitton,Prarthana Bhattacharyya,Ralph Abboud,Simon Woodhead
机构: Eedi; Learning Engineering Virtual Institute
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages, 7 figures. Joshua Mitton and Prarthana Bhattacharyya contributed equally to this paper
点击查看摘要
Abstract:The main focus of research on Knowledge Tracing (KT) models is on model developments with the aim of improving predictive accuracy. Most of these models make the most incorrect predictions when students choose a distractor, leading to student errors going undetected. We present an approach to add new capabilities to KT models by capturing predictive uncertainty and demonstrate that a larger predictive uncertainty aligns with model incorrect predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful for application in an educational learning platform that can be used in a limited resource setting where understanding student ability is necessary.
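摘要未说明不确定性估计的具体实现,下面以 MC-Dropout 为例给出一种常见做法的示意:对知识追踪模型多次带 dropout 前向,用预测熵近似预测不确定性,再观察其与预测错误的相关性;`kt_model` 为假设的 PyTorch 模型接口。

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(kt_model, batch, n_samples=20):
    """MC-Dropout 示意:多次带 dropout 前向,返回答对概率的均值与预测熵。"""
    kt_model.train()                                  # 保持 dropout 激活
    probs = torch.stack([
        torch.sigmoid(kt_model(batch)) for _ in range(n_samples)
    ])                                                # (n_samples, batch, ...)
    mean_p = probs.mean(dim=0)
    entropy = -(mean_p * torch.log(mean_p + 1e-8)
                + (1 - mean_p) * torch.log(1 - mean_p + 1e-8))
    return mean_p, entropy                            # 熵越大,预测越不确定
```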
zh
[NLP-137] LLM Agent Meets Agentic AI: Can LLM Agents Simulate Customers to Evaluate Agentic-AI-based Shopping Assistants?
链接: https://arxiv.org/abs/2509.21501
作者: Lu Sun,Shihan Fu,Bingsheng Yao,Yuxuan Lu,Wenbo Li,Hansu Gu,Jiri Gesi,Jing Huang,Chen Luo,Dakuo Wang
机构: University of California San Diego (加州大学圣地亚哥分校); Northeastern University (东北大学); North Carolina State University (北卡罗来纳州立大学); Independent Researcher (独立研究员); Independent Researcher (独立研究员); Independent Researcher (独立研究员); Independent Researcher (独立研究员); Independent Researcher (独立研究员)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
[NLP-138] On Code-Induced Reasoning in LLMs
链接: https://arxiv.org/abs/2509.21499
作者: Abdul Waheed,Zhen Wu,Carolyn Rosé,Daphne Ippolito
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:
[NLP-139] Dual-Head Reasoning Distillation: Improving Classifier Accuracy with Train-Time-Only Reasoning NEURIPS2025
链接: https://arxiv.org/abs/2509.21487
作者: Jillian Xu,Dylan Zhou,Vinay Shukla,Yang Yang,Junrui Ruan,Shuhuai Lin,Wenfei Zou,Yinxiao Liu,Karthik Lakshmanan
机构: University of Waterloo (滑铁卢大学); Google (谷歌); Google DeepMind (谷歌深度思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by the Workshop on Efficient Reasoning, Neurips 2025, 5 pages
[NLP-140] Learning to Reason with Mixture of Tokens
链接: https://arxiv.org/abs/2509.21482
作者: Adit Jain,Brendan Rappazzo
机构: Morgan Stanley (摩根士丹利); Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages
[NLP-141] Are Hallucinations Bad Estimations?
链接: https://arxiv.org/abs/2509.21473
作者: Hude Liu,Jerry Yao-Chieh Hu,Jennifer Yuntong Zhang,Zhao Song,Han Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Code is available at this https URL
[NLP-142] Gender Stereotypes in Professional Roles Among Saudis: An Analytical Study of AI-Generated Images Using Language Models
链接: https://arxiv.org/abs/2509.21466
作者: Khaloud S. AlKhalifah,Malak Mashaabi,Hend Al-Khalifa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-143] A State-of-the-Art SQL Reasoning Model using RLVR
链接: https://arxiv.org/abs/2509.21459
作者: Alnur Ali,Ashutosh Baheti,Jonathan Chang,Ta-Chung Chi,Brandon Cui,Andrew Drozdov,Jonathan Frankle,Abhay Gupta,Pallavi Koppol,Sean Kulinski,Jonathan Li,Dipendra Misra,Krista Opsahl-Ong,Jose Javier Gonzalez Ortiz,Matei Zaharia,Yue Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:
[NLP-144] Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes
链接: https://arxiv.org/abs/2509.21456
作者: Guangliang Liu,Bocheng Chen,Xitong Zhang,Kristen Marie Johnson
机构: Michigan State University (密歇根州立大学); University of Mississippi (密西西比大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-145] VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
链接: https://arxiv.org/abs/2509.21451
作者: Abdul Waheed,Zhen Wu,Dareen Alharthi,Seungone Kim,Bhiksha Raj
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Work in progress
[NLP-146] LLM-Based Support for Diabetes Diagnosis: Opportunities, Scenarios, and Challenges with GPT-5
链接: https://arxiv.org/abs/2509.21450
作者: Gaurav Kumar Gupta,Nirajan Acharya,Pranal Pande
机构: Youngstown State University (杨斯敦州立大学)
类目: Computation and Language (cs.CL)
备注:
[NLP-147] One Model Many Morals: Uncovering Cross-Linguistic Misalignments in Computational Moral Reasoning
链接: https://arxiv.org/abs/2509.21443
作者: Sualeha Farid,Jayden Lin,Zean Chen,Shivani Kumar,David Jurgens
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 11 figures, 6 tables
[NLP-148] How Large Language Models Need Symbolism
链接: https://arxiv.org/abs/2509.21404
作者: Xiaotie Deng,Hanyu Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
[NLP-149] LLMs for Bayesian Optimization in Scientific Domains: Are We There Yet? EMNLP2025
链接: https://arxiv.org/abs/2509.21403
作者: Rushil Gupta,Jason Hartford,Bang Liu
机构: DIRO, Université de Montréal & Institut Courtois (DIRO,蒙特利尔大学与库尔托伊研究所); Mila - Quebec AI Institute (Mila - 魁北克人工智能研究所); The University of Manchester (曼彻斯特大学); Valence Labs (Valence Labs); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025
[NLP-150] ReGeS: Reciprocal Retrieval-Generation Synergy for Conversational Recommender Systems
链接: https://arxiv.org/abs/2509.21371
作者: Dayu Yang,Hui Fang
机构: University of Delaware (特拉华大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by WISE 2025: 26th International Web Information Systems Engineering conference. Our code is publicly available at the link: this https URL
[NLP-151] Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在实际应用中“最大上下文窗口(Maximum Context Window, MCW)”与“有效最大上下文窗口(Maximum Effective Context Window, MECW)”之间存在显著差异的问题,即模型虽宣称支持超长上下文,但在真实任务中性能迅速下降。其解决方案的关键在于:首先定义MECW概念,其次提出一种标准化测试方法以评估不同问题类型下上下文窗口的有效性,并建立可比较的基准来识别模型在多大上下文长度时出现性能崩溃点。实验表明,MECW远低于MCW(最高差达99%),且随任务类型变化,揭示了优化模型准确性与降低幻觉率的明确路径。
链接: https://arxiv.org/abs/2509.21361
作者: Norman Paulsen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 charts
点击查看摘要
Abstract:Large language model (LLM) providers boast big numbers for maximum context window sizes. To test the real world use of context windows, we 1) define a concept of maximum effective context window, 2) formulate a testing method of a context window’s effectiveness over various sizes and problem types, and 3) create a standardized way to compare model efficacy for increasingly larger context window sizes to find the point of failure. We collected hundreds of thousands of data points across several models and found significant differences between reported Maximum Context Window (MCW) size and Maximum Effective Context Window (MECW) size. Our findings show that the MECW is, not only, drastically different from the MCW but also shifts based on the problem type. A few top of the line models in our test group failed with as little as 100 tokens in context; most had severe degradation in accuracy by 1000 tokens in context. All models fell far short of their Maximum Context Window by as much as 99 percent. Our data reveals the Maximum Effective Context Window shifts based on the type of problem provided, offering clear and actionable insights into how to improve model accuracy and decrease model hallucination rates.
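下面是一个测量“有效最大上下文窗口”的示意测试循环:把同一组可验证问题嵌入不断加长的填充上下文中,记录各长度下的准确率,将准确率首次跌破阈值的位置视作 MECW;其中 `ask_model` 为假设的模型调用接口,并非论文的官方测试协议。

```python
def ask_model(prompt: str) -> str:
    """假设的 LLM 调用接口;实际使用时替换为具体 API。"""
    raise NotImplementedError

def build_prompt(filler_tokens: int, question: str, fact: str) -> str:
    filler = "The sky is blue. " * (filler_tokens // 5)   # 粗略按每句约 5 token 填充
    return f"{filler}\n{fact}\n\nQuestion: {question}\nAnswer:"

def measure_mecw(cases, sizes=(100, 1000, 4000, 16000, 64000), threshold=0.8):
    """cases 为 (question, fact, answer) 三元组列表;返回仍保持有效的最大上下文长度。"""
    mecw = None
    for size in sizes:
        correct = 0
        for question, fact, answer in cases:
            reply = ask_model(build_prompt(size, question, fact))
            correct += int(answer.lower() in reply.lower())
        acc = correct / len(cases)
        print(f"context≈{size} tokens, accuracy={acc:.2f}")
        if acc >= threshold:
            mecw = size            # 在该长度下模型仍然有效
        else:
            break                  # 首次跌破阈值即视为失效点
    return mecw
```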
zh
[NLP-152] Influence Guided Context Selection for Effective Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2509.21359
作者: Jiale Deng,Yanyan Shen,Ziyuan Pei,Youmin Chen,Linpeng Huang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
[NLP-153] A Novel Differential Feature Learning for Effective Hallucination Detection and Classification
链接: https://arxiv.org/abs/2509.21357
作者: Wenkai Wang,Vincent Lee,Yizhen Zheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, 13 tables
[NLP-154] Random Direct Preference Optimization for Radiography Report Generation
链接: https://arxiv.org/abs/2509.21351
作者: Valentin Samokhin,Boris Shirokikh,Mikhail Goncharov,Dmitriy Umerenkov,Maksim Bobrin,Ivan Oseledets,Dmitry Dylov,Mikhail Belyaev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-155] Towards mitigating information leakage when evaluating safety monitors
链接: https://arxiv.org/abs/2509.21344
作者: Gerard Boxo,Aman Neelappa,Shivam Raval
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 4 figures
[NLP-156] HetaRAG : Hybrid Deep Retrieval-Augmented Generation across Heterogeneous Data Stores
链接: https://arxiv.org/abs/2509.21336
作者: Guohang Yan,Yue Zhang,Pinlong Cai,Ding Wang,Song Mao,Hongwei Zhang,Yaoze Zhang,Hairong Zhang,Xinyu Cai,Botian Shi
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 15 pages, 4 figures
[NLP-157] Accelerate Creation of Product Claims Using Generative AI NEURIPS2025
链接: https://arxiv.org/abs/2509.20652
作者: Po-Yu Liang,Yong Zhang,Tatiana Hwa,Aaron Byers
机构: University of Cincinnati (辛辛那提大学); P&G (宝洁公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been accepted at the GenProCC workshop (NeurIPS 2025)
[NLP-158] Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback
【速读】: 该论文旨在解决在线强化学习与人类反馈(Online RLHF)中的探索策略问题,即如何在数据效率的前提下自适应地收集新的偏好数据以同时优化奖励模型和策略。现有基于乐观性(optimism-based)的探索算法存在缺陷:其采样协议倾向于收集对减少奖励差异不确定性贡献最小的比较信息,且理论证明此类方法在指数级长时 horizon 上可能产生线性 regret。论文的关键解决方案是提出一种新的探索机制,通过引导偏好查询聚焦于最有助于政策改进的奖励差异不确定性区域,从而实现更高效的探索;在多臂老虎机建模下,该方案实现了关于时间 T 的多项式尺度 regret 上界 $ T^{(\beta+1)/(\beta+2)} $,其中 β>0 是平衡奖励最大化与缓解分布偏移的超参数,这是首个在所有模型参数上均呈现多项式 regret 比例的在线 RLHF 算法。
链接: https://arxiv.org/abs/2509.22633
作者: Gen Li,Yuling Yan
机构: The Chinese University of Hong Kong (香港中文大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:
点击查看摘要
Abstract:Reinforcement learning with human feedback (RLHF), which learns a reward model from human preference data and then optimizes a policy to favor preferred responses, has emerged as a central paradigm for aligning large language models (LLMs) with human preferences. In this paper, we investigate exploration principles for online RLHF, where one seeks to adaptively collect new preference data to refine both the reward model and the policy in a data-efficient manner. By examining existing optimism-based exploration algorithms, we identify a drawback in their sampling protocol: they tend to gather comparisons that fail to reduce the most informative uncertainties in reward differences, and we prove lower bounds showing that such methods can incur linear regret over exponentially long horizons. Motivated by this insight, we propose a new exploration scheme that directs preference queries toward reducing uncertainty in reward differences most relevant to policy improvement. Under a multi-armed bandit model of RLHF, we establish regret bounds of order $T^{(\beta+1)/(\beta+2)}$, where $\beta > 0$ is a hyperparameter that balances reward maximization against mitigating distribution shift. To our knowledge, this is the first online RLHF algorithm with regret scaling polynomially in all model parameters.
zh
[NLP-159] Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias ICASSP2026
链接: https://arxiv.org/abs/2509.22061
作者: Shree Harsha Bokkahalli Satish,Harm Lameris,Olivier Perrotin,Gustav Eje Henter,Éva Székely
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 6 pages, 1 figure, Submitted to IEEE ICASSP 2026
[NLP-160] AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
链接: https://arxiv.org/abs/2509.21597
作者: Yi Zhu,Heitor R. Guimarães,Arthur Pimentel,Tiago Falk
机构: Institut national de la recherche scientifique (国家科学研究院); Reality Defender
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:
[NLP-161] ARTI-6: Towards Six-dimensional Articulatory Speech Encoding
链接: https://arxiv.org/abs/2509.21447
作者: Jihwan Lee,Sean Foley,Thanathai Lertpetchpun,Kevin Huang,Yoonjeong Lee,Tiantian Feng,Louis Goldstein,Dani Byrd,Shrikanth Narayanan
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
计算机视觉
[CV-0] Pixel Motion Diffusion is What We Need for Robot Control
【速读】:该论文旨在解决机器人操控中高阶运动意图与低阶动作之间缺乏统一建模框架的问题,特别是在语言条件下的多任务、跨域迁移场景下如何实现高效且可解释的控制。其解决方案的关键在于提出DAWN(Diffusion is All We Need for robot control),一个基于扩散模型(diffusion-based)的统一框架,通过结构化的像素级运动表示(structured pixel motion representation)将高层语义指令与底层机器人动作进行端到端映射;其中,高低层控制器均以扩散过程建模,从而生成可解释的中间运动抽象,并在CALVIN和MetaWorld基准上取得最优性能,同时在仿真到现实的迁移中仅需少量微调即可实现稳定部署,验证了扩散建模与以运动为中心的表示结合在构建可扩展、鲁棒机器人学习系统中的有效性。
链接: https://arxiv.org/abs/2509.22652
作者: E-Ro Nguyen,Yichi Zhang,Kanchana Ranasinghe,Xiang Li,Michael S. Ryoo
机构: Stony Brook University (石溪大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 7 figures
点击查看摘要
Abstract:We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: this https URL
zh
[CV-1] RefAM: Attention Magnets for Zero-Shot Referral Segmentation
【速读】:该论文旨在解决现有指代表达分割(referring segmentation)方法依赖微调或多个预训练模型组合所带来的额外训练开销与架构修改问题。其解决方案的关键在于直接利用扩散Transformer模型中提取的特征和注意力分数,无需任何额外训练或结构改动。核心创新包括:识别停用词(stop words)作为注意力聚集点并用于噪声过滤;发现深层网络中的全局注意力汇聚点(Global Attention Sinks, GAS),并通过抑制或重定向其注意力提升定位精度;提出一种注意力重分配策略,通过插入停用词将背景激活分割为更小簇,生成更锐利、局部化的热力图。基于上述发现,作者构建了RefAM框架,实现了零样本条件下的图像与视频指代表达分割性能的显著提升,成为无需微调的新SOTA方法。
链接: https://arxiv.org/abs/2509.22650
作者: Anna Kukleva,Enis Simsar,Alessio Tonioni,Muhammad Ferjad Naeem,Federico Tombari,Jan Eric Lenssen,Bernt Schiele
机构: Max Planck Institute for Informatics (马普计算机科学研究所); ETH Zürich (苏黎世联邦理工学院); Google(谷歌); TU Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.
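下面给出“把停用词视作注意力磁铁并加以过滤”这一核心操作的示意实现:从交叉注意力图中剔除停用词 token 后,对剩余内容词的注意力求平均并归一化,得到更干净的定位热力图;注意力张量形状与停用词表均为假设,并非论文官方代码。

```python
import torch

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "on", "in"}

def filter_stopword_attention(cross_attn, tokens):
    """cross_attn: (num_text_tokens, H, W),文本 token 到图像 patch 的注意力(示意)。

    剔除停用词 token 吸收的注意力,再对内容词求平均,作为定位热力图。
    """
    keep = [i for i, tok in enumerate(tokens) if tok.lower() not in STOP_WORDS]
    content_attn = cross_attn[keep]                 # 只保留内容词的注意力
    heatmap = content_attn.mean(dim=0)              # (H, W)
    heatmap = heatmap / (heatmap.max() + 1e-8)      # 归一化到 [0, 1]
    return heatmap
```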
zh
[CV-2] Hierarchical Representation Matching for CLIP-based Class-Incremental Learning
【速读】:该论文旨在解决基于预训练视觉语言模型(如CLIP)的类增量学习(Class-Incremental Learning, CIL)中因使用简单模板(如"a photo of a [CLASS]")和单一层次特征表示而导致的语义表达不足与灾难性遗忘问题。其解决方案的关键在于提出HiErarchical Representation MAtchiNg (HERMAN)框架,通过大语言模型(LLM)递归生成具有显式层级结构的文本描述符,将这些描述符匹配到语义层次的不同层级,并根据任务需求自适应路由,从而在保留细粒度区分能力的同时增强模型对新类别的适应性,有效缓解增量学习中的性能退化问题。
链接: https://arxiv.org/abs/2509.22645
作者: Zhen-Hao Wen,Yan Wang,Ji Feng,Han-Jia Ye,De-Chuan Zhan,Da-Wei Zhou
机构: School of Artificial Intelligence, Nanjing University (南京大学人工智能学院); State Key Laboratory for Novel Software Technology, Nanjing University (南京大学软件新技术国家重点实验室); Baiont Quant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as “a photo of a [CLASS]”, which overlook the hierarchical nature of visual concepts. For example, recognizing “cat” versus “car” depends on coarse-grained cues, while distinguishing “cat” from “lion” requires fine-grained details. Similarly, the current feature mapping in CLIP relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
zh
[CV-3] WoW: Towards a World omniscient World model Through Embodied Interaction
【速读】:该论文旨在解决当前视频生成模型(如Sora)因依赖被动观察而难以理解物理因果关系的问题,即如何让人工智能具备类似人类的直观物理直觉。其解决方案的关键在于:通过大规模真实世界交互数据训练一个140亿参数的生成式世界模型(WoW),该模型基于200万条机器人交互轨迹学习到一种以概率分布形式表征的物理理解能力;进一步引入SOPHIA框架,利用视觉-语言模型代理对扩散Transformer(DiT)生成结果进行迭代评估与语言指令优化,从而约束生成内容向物理现实收敛,并结合逆动力学模型将优化后的计划转化为可执行的机器人动作,形成“想象-行动”闭环。这一方法系统验证了真实世界交互是构建AI物理直觉的核心基础。
链接: https://arxiv.org/abs/2509.22642
作者: Xiaowei Chi,Peidong Jia,Chun-Kai Fan,Xiaozhu Ju,Weishi Mi,Kevin Zhang,Zhiyuan Qin,Wanxin Tian,Kuangzhi Ge,Hao Li,Zezhong Qian,Anthony Chen,Qiang Zhou,Yueru Jia,Jiaming Liu,Yong Dai,Qingpo Wuwu,Chengyu Bai,Yu-Kai Wang,Ying Li,Lizhang Chen,Yong Bao,Zhiyuan Jiang,Jiacheng Zhu,Kai Tang,Ruichuan An,Yulin Luo,Qiuxuan Feng,Siyuan Zhou,Chi-min Chan,Chengkai Hou,Wei Xue,Sirui Han,Yike Guo,Shanghang Zhang,Jian Tang
机构: Beijing Innovation Center of Humanoid Robotics (北京人形机器人创新中心); State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University (北京大学计算机学院多媒体信息处理国家重点实验室); Hong Kong University of Science and Technology (香港科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model’s understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
zh
[CV-4] Scale-Wise VAR is Secretly Discrete Diffusion
链接: https://arxiv.org/abs/2509.22636
作者: Amandeep Kumar,Nithin Gopalakrishnan Nair,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Technical Reports
[CV-5] Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance BMVC2025
【速读】:该论文旨在解决少样本图像分类(few-shot image classification)中因标注样本稀缺而导致性能受限的问题。传统方法常依赖生成式 AI (Generative AI) 模型合成训练数据,但往往需要大量模型微调或外部信息源(如图像描述和过滤工具)。其解决方案的关键在于提出一种无需训练的框架 DIPSY,利用 IP-Adapter 实现图像到图像的翻译,仅基于少量已知样本生成具有判别性的合成图像;核心创新包括:(1) 扩展的无分类器引导机制,实现对正负样本条件的独立控制;(2) 基于类别相似性的采样策略,筛选出有效的对比样本;(3) 一个无需模型微调或外部 captioning/ filtering 的简洁有效流程。实验表明,该方法在十个基准数据集上达到或超越当前最优性能,尤其在细粒度分类任务中表现出显著优势。
链接: https://arxiv.org/abs/2509.22635
作者: Luc Boudier,Loris Manganelli,Eleftherios Tsonis,Nicolas Dufour,Vicky Kalogeiton
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: BMVC 2025. Project page: this https URL
点击查看摘要
Abstract:Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.
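为说明“对正、负图像条件独立控制的扩展无分类器引导”,下面给出单步去噪的示意实现:分别计算无条件、正条件与负条件下的噪声预测,再以两组相互独立的权重拉向正条件、推离负条件;`denoiser`、IP-Adapter 条件张量与引导权重均为示意性假设,并非论文官方代码。

```python
def dual_guidance_step(denoiser, x_t, t, cond_pos, cond_neg, w_pos=5.0, w_neg=2.0):
    """扩展 classifier-free guidance:独立控制正、负图像条件的引导强度(示意)。"""
    eps_uncond = denoiser(x_t, t, cond=None)        # 无条件噪声预测
    eps_pos = denoiser(x_t, t, cond=cond_pos)       # 以同类少样本图像为正条件
    eps_neg = denoiser(x_t, t, cond=cond_neg)       # 以相似类图像为负(对比)条件
    # 拉向正条件、推离负条件,两个方向的权重相互独立
    return eps_uncond + w_pos * (eps_pos - eps_uncond) - w_neg * (eps_neg - eps_uncond)
```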
zh
[CV-6] UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning
【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)提示在具身任务中因依赖非结构化文本而导致的可解释性差与不可执行的问题。现有方法虽尝试使用场景图或逻辑图构建结构化CoT,但仍受限于仅建模低阶关系、缺乏继承或行为抽象等机制,且未提供标准化语义以支持顺序或条件规划。其解决方案的关键在于提出UML-CoT框架,利用统一建模语言(Unified Modeling Language, UML)生成符号化的CoT和可执行的动作计划:通过类图(class diagram)刻画对象的组合语义,用活动图(activity diagram)建模过程控制流,并结合监督微调与组相对策略优化(Group Relative Policy Optimization, GRPO)的三阶段训练流程,实现更高效、可解释且可执行的推理与规划。
链接: https://arxiv.org/abs/2509.22628
作者: Hongyu Chen,Guangrun Wang
机构: Sun Yat-sen University (中山大学); X-Era AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), but its reliance on unstructured text limits interpretability and executability in embodied tasks. Prior work has explored structured CoTs using scene or logic graphs, yet these remain fundamentally limited: they model only low-order relations, lack constructs like inheritance or behavioral abstraction, and provide no standardized semantics for sequential or conditional planning. We propose UML-CoT, a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic CoTs and executable action plans. UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. Our three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. We evaluate UML-CoT on MRoom-30k, a new benchmark of cluttered room-cleaning scenarios. UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.
zh
[CV-7] CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach
【速读】:该论文旨在解决深度估计(Depth Estimation)在计算资源受限场景下的性能与效率平衡问题,特别是在机器人、自动驾驶和增强现实等应用中,如何利用无标签数据实现高精度且低延迟的深度预测。其解决方案的关键在于提出了一种新颖的自监督卷积神经网络架构CCNeXt,该架构采用现代CNN特征提取器,并在编码器中引入一种创新的窗口化视差交叉注意力模块(windowed epipolar cross-attention module),同时对解码器进行了系统性重构,从而在保持高精度的同时显著降低计算开销——实验表明,CCNeXt在KITTI Eigen Split测试集上优于现有最优CNN和视觉Transformer(Vision Transformer, ViT)模型,且推理速度比当前最佳模型快10.18倍,在改进的真实深度标注数据集上也达到了最先进性能。
链接: https://arxiv.org/abs/2509.22627
作者: Alexandre Lopes,Roberto Souza,Helio Pedrini
机构: University of Campinas (坎皮纳斯州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Depth Estimation plays a crucial role in recent applications in robotics, autonomous vehicles, and augmented reality. These scenarios commonly operate under constraints imposed by computational power. Stereo image pairs offer an effective solution for depth estimation since it only needs to estimate the disparity of pixels in image pairs to determine the depth in a known rectified system. Due to the difficulty in acquiring reliable ground-truth depth data across diverse scenarios, self-supervised techniques emerge as a solution, particularly when large unlabeled datasets are available. We propose a novel self-supervised convolutional approach that outperforms existing state-of-the-art Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) while balancing computational cost. The proposed CCNeXt architecture employs a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a comprehensive redesign of the depth estimation decoder. Our experiments demonstrate that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18× faster than the current best model and achieves state-of-the-art results in all metrics in the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques. To ensure complete reproducibility, our project is accessible at this https URL.
zh
[CV-8] SPARK: Synergistic Policy And Reward Co-Evolving Framework
链接: https://arxiv.org/abs/2509.22624
作者: Ziyu Liu,Yuhang Zang,Shengyuan Ding,Yuhang Cao,Xiaoyi Dong,Haodong Duan,Dahua Lin,Jiaqi Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project: this https URL
[CV-9] LongLive: Real-time Interactive Long Video Generation
【速读】:该论文旨在解决长视频生成中效率与质量难以兼顾的问题,尤其是传统扩散模型(Diffusion)因双向注意力机制导致推理效率低下,而因果自回归(Causal Autoregressive, AR)模型虽支持键值缓存(KV caching)加速推理,却在长视频训练中因记忆瓶颈导致质量下降;此外,现有方法缺乏对交互式输入(如流式提示)的支持,难以保证提示切换时的视觉一致性与语义连贯性。解决方案的关键在于:提出一种帧级因果AR框架LongLive,集成三项核心技术——基于新提示刷新缓存状态的KV-recache机制以实现平滑提示切换、流式长视频微调策略(streaming long tuning)确保训练与推理一致(train-long-test-long),以及短窗口注意力结合帧级注意力池(frame sink)在保持长程一致性的同时提升生成速度。通过这些设计,LongLive仅用32 GPU天即可将1.3B参数短片段模型扩展至分钟级视频生成,并在单张NVIDIA H100上实现20.7 FPS的实时推理能力,支持长达240秒的视频输出及INT8量化推理。
链接: https://arxiv.org/abs/2509.22622
作者: Shuai Yang,Wei Huang,Ruihang Chu,Yicheng Xiao,Yuyang Zhao,Xianbang Wang,Muyang Li,Enze Xie,Yingcong Chen,Yao Lu,Song Han,Yukang Chen
机构: NVIDIA(英伟达); MIT(麻省理工学院); HKUST(GZ)(香港科技大学(广州)); HKU(香港大学); THU(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code, model, and demos are available at this https URL
点击查看摘要
Abstract:We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
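下面以伪代码形式勾勒 KV-recache 在提示切换时的做法:保留帧级 attention sink 与滑动窗口内的最近帧,但在新提示条件下重新前向计算它们的 KV,从而既保持视觉连续性又贴合新指令;`model.prefill` 等接口与缓存结构均为示意性假设,并非论文官方实现。

```python
def kv_recache(model, cached_frames, new_prompt, sink_len=1, window=16):
    """提示切换时刷新 KV 缓存(示意)。

    cached_frames: 按时间顺序保存的已生成帧(或其 latent)。
    返回在新提示条件下重新编码后的 KV 缓存,供后续帧的自回归生成使用。
    """
    sink = cached_frames[:sink_len]              # 帧级 attention sink,维持长程一致性
    recent = cached_frames[-window:]             # 短窗口内的最近帧
    kept_frames = sink + recent
    # 关键一步:用新提示重新前向这些帧,得到既记得旧内容、又对齐新指令的 KV
    new_kv = model.prefill(prompt=new_prompt, frames=kept_frames)
    return new_kv
```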
zh
[CV-10] SpikeMatch: Semi-Supervised Learning with Temporal Dynamics of Spiking Neural Networks
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在半监督学习(Semi-Supervised Learning, SSL)场景下方法研究不足的问题,尤其是如何有效利用有限标签数据提升模型性能。其解决方案的关键在于提出SpikeMatch框架,该框架通过利用SNN中的泄漏因子(leakage factor)所蕴含的时间动态特性,在协同训练(co-training)框架内实现多样化的伪标签生成;具体而言,它基于单个SNN对弱增强未标记样本的多个预测结果的一致性,生成可靠伪标签,并用于强增强样本的训练,从而有效缓解因标签信息稀缺导致的确认偏差(confirmation bias),同时增强对判别特征的捕捉能力。
链接: https://arxiv.org/abs/2509.22581
作者: Jini Yang,Beomseok Oh,Seungryong Kim,Sunok Kim
机构: KAIST AI (KAIST人工智能); Korea Aerospace University (韩国航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Spiking neural networks (SNNs) have recently been attracting significant attention for their biological plausibility and energy efficiency, but semi-supervised learning (SSL) methods for SNN-based models remain underexplored compared to those for artificial neural networks (ANNs). In this paper, we introduce SpikeMatch, the first SSL framework for SNNs that leverages the temporal dynamics through the leakage factor of SNNs for diverse pseudo-labeling within a co-training framework. By utilizing agreement among multiple predictions from a single SNN, SpikeMatch generates reliable pseudo-labels from weakly-augmented unlabeled samples to train on strongly-augmented ones, effectively mitigating confirmation bias by capturing discriminative features with limited labels. Experiments show that SpikeMatch outperforms existing SSL methods adapted to SNN backbones across various standard benchmarks.
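下面用 PyTorch 风格的示意代码勾勒 SpikeMatch 的伪标签思路:同一个 SNN 在不同泄漏因子下产生多份预测,仅当预测一致且置信度足够高时才为弱增强样本生成伪标签,再用于监督强增强样本;`snn(x, leak=...)` 等接口与阈值均为假设,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def spikematch_loss(snn, x_weak, x_strong, leaks=(0.2, 0.5, 0.8), tau=0.95):
    """基于多泄漏因子预测一致性的伪标签损失(示意)。"""
    with torch.no_grad():
        probs = torch.stack([F.softmax(snn(x_weak, leak=l), dim=-1) for l in leaks])
        mean_p = probs.mean(dim=0)                            # 多份预测的平均
        conf, pseudo = mean_p.max(dim=-1)
        agree = (probs.argmax(dim=-1) == pseudo).all(dim=0)   # 所有预测类别一致
        mask = (conf >= tau) & agree                          # 仅保留可靠伪标签
    logits_strong = snn(x_strong, leak=leaks[0])
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask.float()).mean()
```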
zh
[CV-11] MINT-RVAE: Multi-Cues Intention Prediction of Human-Robot Interaction using Human Pose and Emotion Information from RGB-only Camera Data
链接: https://arxiv.org/abs/2509.22573
作者: Farida Mohsen,Ali Safa
机构: Hamad Bin Khalifa University (哈马德本哈利法大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-12] Activation Function Design Sustains Plasticity in Continual Learning
【速读】:该论文旨在解决持续学习(continual learning)中因激活函数选择不当而导致的“可塑性丧失”(loss of plasticity)问题,即模型在面对新任务或环境变化时逐渐失去适应能力的现象。传统研究多关注灾难性遗忘(catastrophic forgetting),而本文指出激活函数的非线性特性在导致可塑性丧失中的关键作用尚未被充分探索。解决方案的关键在于通过分析激活函数负分支形状(negative-branch shape)和饱和行为(saturation behavior)的属性,设计两种即插即用的新型非线性变换:Smooth-Leaky 和 Randomized Smooth-Leaky。这些激活函数无需额外容量或任务特定调参,即可显著提升模型在类增量学习与非平稳MuJoCo强化学习场景下的持续适应能力,从而以轻量、通用的方式维持模型的可塑性。
链接: https://arxiv.org/abs/2509.22562
作者: Lute Lillo,Nick Cheney
机构: University of Vermont (佛蒙特大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.
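论文摘要未给出 Smooth-Leaky / Randomized Smooth-Leaky 的精确形式,下面给出一种“负分支保留小斜率、整体平滑”的示意实现,仅用于说明这类激活设计的思路,并非论文的官方定义;随机版本在训练时对负斜率做区间采样。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothLeaky(nn.Module):
    """示意:负半轴近似保留斜率 alpha,正半轴近似恒等,且处处平滑。"""
    def __init__(self, alpha=0.1):
        super().__init__()
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * x + (1 - self.alpha) * F.softplus(x)

class RandomizedSmoothLeaky(SmoothLeaky):
    """示意:训练时在区间内随机采样负斜率,推理时使用区间中点。"""
    def __init__(self, low=0.05, high=0.3):
        super().__init__(alpha=(low + high) / 2)
        self.low, self.high = low, high

    def forward(self, x):
        if self.training:
            alpha = torch.empty(1, device=x.device).uniform_(self.low, self.high)
            return alpha * x + (1 - alpha) * F.softplus(x)
        return super().forward(x)
```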
zh
[CV-13] JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)中现有方法依赖显式语义记忆(如构建文本认知地图或存储历史视觉帧)所导致的空间信息丢失、计算冗余和内存膨胀问题。其解决方案的关键在于提出JanusVLN框架,该框架引入一种双隐式神经记忆机制,将空间几何记忆与视觉语义记忆分别建模为独立、紧凑且固定大小的神经表示,从而实现高效增量更新;通过扩展多模态大语言模型(Multimodal Large Language Models, MLLM)以融入来自空间几何编码器的3D先验知识,增强仅基于RGB输入模型的空间推理能力,并利用初始窗口与滑动窗口内token的键值缓存(Key-Value Caches)构建双重隐式记忆,有效避免冗余计算,显著提升导航性能。
链接: https://arxiv.org/abs/2509.22548
作者: Shuang Zeng,Dekang Qi,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Shiyi Liang,Mu Xu,Xing Wei
机构: Amap, Alibaba Group (高德地图); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL
点击查看摘要
Abstract:Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain’s semantic understanding and the right brain’s spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Ours project page: this https URL.
zh
[CV-14] HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection
【速读】:该论文旨在解决复杂视频异常检测(Complex Video Anomaly Detection, CVAD)中难以识别由多个实体间复杂交互关系和时序依赖定义的异常事件的问题。现有自监督学习(Self-Supervised Learning, SSL)方法虽能建模低层时空模式,但缺乏对交互语义的理解;而大语言模型(Large Language Models, LLMs)虽具备强大的上下文推理能力,却因逐帧计算成本高且缺乏细粒度空间定位能力而不适用于实时视频分析。解决方案的关键在于提出HyCoVAD——一种混合SSL-LLM架构:首先利用基于nnFormer的多任务SSL模块从视频帧中提取疑似异常区域,再将候选帧输入LLM进行结构化规则推理以验证异常存在性,从而在提升检测精度的同时显著降低LLM计算开销。实验表明,该方法在ComplexVAD数据集上达到72.5%的帧级AUC,优于现有基线12.5%,并公开了交互异常分类体系与自适应阈值协议以推动后续研究。
链接: https://arxiv.org/abs/2509.22544
作者: Mohammad Mahdi Hemmatyar,Mahdi Jafari,Mohammad Amin Yousefi,Mohammad Reza Nemati,Mobin Azadani,Hamid Reza Rastad,Amirmohammad Akbari
机构: Sharif University of Technology (谢里夫理工大学); University of Tehran (德黑兰大学); Iran University of Science and Technology (伊朗科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies, which are events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD, Hybrid Complex Video Anomaly Detection, a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with LLM validator. The SSL module is built upon an nnFormer backbone which is a transformer-based model for image segmentation. It is trained with multiple proxy tasks, learns from video frames to identify those suspected of anomaly. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.
zh
[CV-15] Category Discovery: An Open-World Perspective
链接: https://arxiv.org/abs/2509.22542
作者: Zhenqi He,Yuanpei Liu,Kai Han
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-16] EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model
链接: https://arxiv.org/abs/2509.22527
作者: Andrii Litvynchuk,Ivan Livinsky,Anand Ravi,Nima Kalantari,Andrii Tsarov
机构: Leia Inc.(Leia公司); Texas A&M University (德克萨斯农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures, 5 tables
[CV-17] Color Names in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在颜色命名能力上是否与人类一致的问题,以提升人机交互的有效性。其关键解决方案在于首次系统性地评估了五种代表性VLMs的颜色命名表现,通过复现经典颜色命名实验方法,在957个色样上进行测试,并结合跨语言分析和消融实验,揭示了模型在颜色命名策略上的差异:约束型模型倾向于使用基本颜色词,而扩展型模型则依赖亮度修饰语;同时发现语言模型架构对颜色命名决策具有独立于视觉处理能力的重要影响。
链接: https://arxiv.org/abs/2509.22524
作者: Alexandra Gomez-Villa,Pablo Hernández-Cámara,Muhammad Atif Butt,Valero Laparra,Jesus Malo,Javier Vazquez-Corral
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Color serves as a fundamental dimension of human visual perception and a primary means of communicating about objects and scenes. As vision-language models (VLMs) become increasingly prevalent, understanding whether they name colors like humans is crucial for effective human-AI interaction. We present the first systematic evaluation of color naming capabilities across VLMs, replicating classic color naming methodologies using 957 color samples across five representative models. Our results show that while VLMs achieve high accuracy on prototypical colors from classical studies, performance drops significantly on expanded, non-prototypical color sets. We identify 21 common color terms that consistently emerge across all models, revealing two distinct approaches: constrained models using predominantly basic terms versus expansive models employing systematic lightness modifiers. Cross-linguistic analysis across nine languages demonstrates severe training imbalances favoring English and Chinese, with hue serving as the primary driver of color naming decisions. Finally, ablation studies reveal that language model architecture significantly influences color naming independent of visual processing capabilities.
zh
[CV-18] JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation
链接: https://arxiv.org/abs/2509.22522
作者: Guillem Capellera,Luis Ferraz,Antonio Rubio,Alexandre Alahi,Antonio Agudo
机构: Kognia Sports Intelligence (Kognia体育智能); Visual Intelligence for Transportation, EPFL (瑞士联邦理工学院); Institut de Robòtica i Informàtica Industrial, CSIC-UPC (西班牙国家研究委员会-巴塞罗那理工大学机器人与信息工业研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-19] Adaptive Dual-Mode Distillation with Incentive Schemes for Scalable Heterogeneous Federated Learning on Non-IID Data
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中的三大核心挑战:一是客户端在业务需求和计算资源上的异质性导致无法统一训练相同模型;二是数据统计异质性(即非独立同分布,non-IID)对全局模型性能的显著负面影响;三是缺乏有效的成本可控激励机制以促进客户端持续参与训练。针对这些问题,论文提出三种方法:DL-SH 通过优化通信效率与隐私保护来应对统计异质性;DL-MH 支持模型异构性以适应不同客户端的能力差异并缓解统计偏差;I-DL-MH 进一步引入基于激励的机制,增强客户端参与意愿。其关键创新在于将模型异构性、统计异质性和激励机制三者有机结合,在保证隐私前提下实现高精度、低通信开销的分布式学习,实验表明 DL-SH 和 I-DL-MH 在非独立同分布场景下分别提升全局模型准确率达 153% 和 225%。
链接: https://arxiv.org/abs/2509.22507
作者: Zahid Iqbal
机构: University of Gujrat (古贾尔大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Federated Learning (FL) has emerged as a promising decentralized learning (DL) approach that enables the use of distributed data without compromising user privacy. However, FL poses several key challenges. First, it is frequently assumed that every client can train the same machine learning models, however, not all clients are able to meet this assumption because of differences in their business needs and computational resources. Second, statistical heterogeneity (a.k.a. non-IID data) poses a major challenge in FL, which can lead to lower global model performance. Third, while addressing these challenges, there is a need for a cost-effective incentive mechanism to encourage clients to participate in FL training. In response to these challenges, we propose several methodologies: DL-SH, which facilitates efficient, privacy-preserving, and communication-efficient learning in the context of statistical heterogeneity; DL-MH, designed to manage fully heterogeneous models while tackling statistical disparities; and I-DL-MH, an incentive-based extension of DL-MH that promotes client engagement in federated learning training by providing incentives within this complex federated learning framework. Comprehensive experiments were carried out to assess the performance and scalability of the proposed approaches across a range of complex experimental settings. This involved utilizing various model architectures, in diverse data distributions, including IID and several non-IID scenarios, as well as multiple datasets. Experimental results demonstrate that the proposed approaches significantly enhance accuracy and decrease communication costs while effectively addressing statistical heterogeneity and model heterogeneity in comparison to existing state-of-the-art approaches and baselines, with DL-SH improving global model accuracy by 153%, and I-DL-MH achieving a 225% improvement under non-IID conditions.
zh
[CV-20] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成文本时对视觉模态依赖程度不明确的问题,从而限制了模型决策的可解释性和可靠性。解决方案的关键在于提出EAGLE框架,这是一个轻量级黑箱解释方法,能够将选定的生成token归因于紧凑的感知区域,并量化语言先验与感知证据的相对影响。其核心创新是引入一个统一的优化目标函数,同时考虑充分性(insight score)和必要性(necessity score),并通过稀疏化图像区域的贪心搜索实现高效且忠实的归因分析。此外,EAGLE还支持模态感知分析,可精细区分不同token所依赖的模态信息,显著提升了MLLMs决策过程的可解释性。
链接: https://arxiv.org/abs/2509.22496
作者: Ruoyu Chen,Xiaoqing Guo,Kangwei Liu,Siyuan Liang,Shiming Liu,Qunli Zhang,Hua Zhang,Xiaochun Cao
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Department of Computer Science, Hong Kong Baptist University (香港浸会大学计算机科学系); School of Computing, NUS (新加坡国立大学计算机学院); RAMS Lab, Huawei Technologies Co., Ltd. (华为技术有限公司RAMS实验室); RAMS Lab, Munich Research Center, Huawei Technologies Düsseldorf GmbH (华为技术有限公司慕尼黑研究中心); School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区网络科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at this https URL.
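A minimal sketch of the greedy, black-box attribution idea described above: grid cells are added greedily so as to maximize a combined sufficiency ("insight") and necessity score, with masked-out regions filled by the image mean as a simple form of sparsification. The grid size, mean-fill masking, additive combination of the two scores, and the toy `token_prob` stand-in are illustrative assumptions, not EAGLE's exact objective.

```python
import numpy as np

def token_prob(image: np.ndarray) -> float:
    """Black-box stand-in for the MLLM's probability of the explained token.
    In practice this would be one forward pass of the model (assumption)."""
    return float(image.mean() / 255.0)  # toy placeholder so the sketch runs

def greedy_region_attribution(image: np.ndarray, grid: int = 4, budget: int = 4):
    """Greedily select grid cells that maximize insight (sufficiency) plus necessity."""
    h, w, _ = image.shape
    rh, rw = h // grid, w // grid
    mean_fill = image.mean(axis=(0, 1), keepdims=True)

    def compose(cells, keep=True):
        """Keep only `cells` (keep=True) or delete exactly those cells (keep=False);
        masked-out areas are filled with the image mean."""
        out = np.broadcast_to(mean_fill, image.shape).copy() if keep else image.astype(float).copy()
        src = image if keep else np.broadcast_to(mean_fill, image.shape)
        for r, c in cells:
            out[r*rh:(r+1)*rh, c*rw:(c+1)*rw] = src[r*rh:(r+1)*rh, c*rw:(c+1)*rw]
        return out.astype(image.dtype)

    full_prob = token_prob(image)
    selected = []
    candidates = [(r, c) for r in range(grid) for c in range(grid)]
    for _ in range(budget):
        best, best_score = None, -np.inf
        for cell in candidates:
            trial = selected + [cell]
            insight = token_prob(compose(trial, keep=True))                   # sufficiency
            necessity = full_prob - token_prob(compose(trial, keep=False))    # drop when removed
            score = insight + necessity
            if score > best_score:
                best, best_score = cell, score
        selected.append(best)
        candidates.remove(best)
    return selected

if __name__ == "__main__":
    dummy = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
    print(greedy_region_attribution(dummy, grid=4, budget=3))
```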
zh
[CV-21] Group Critical-token Policy Optimization for Autoregressive Image Generation
【速读】:该论文旨在解决自回归(Autoregressive, AR)视觉生成中因对所有图像token采用统一优化策略而导致的效率低下问题,尤其关注不同token在强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练中的贡献差异未被充分挖掘的问题。解决方案的关键在于提出Group Critical-token Policy Optimization (GCPO),其核心是通过三个维度识别关键token:(1) 因果依赖性(早期token决定后续生成效果),(2) 基于熵梯度的空间结构重要性(高熵梯度token对应图像结构和区域连接),(3) RLVR驱动的token多样性(低视觉相似性的token提升token级多样性)。针对这些关键token,GCPO引入动态token级优势权重,基于策略模型与参考模型之间的置信度分歧鼓励探索,从而仅使用30%的token即可超越全token优化方法GRPO,在多个文本到图像生成基准上实现更优性能。
链接: https://arxiv.org/abs/2509.22485
作者: Guohui Zhang,Hu Yu,Xiaoxiao Ma,JingHao Zhang,Yaning Pan,Mingde Yao,Jie Xiao,Linjiang Huang,Feng Zhao
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学); CUHK (香港中文大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL
点击查看摘要
Abstract:Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens to RLVR training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose Group Critical-token Policy Optimization (GCPO), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: (1) Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; (2) Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions; (3) RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
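The advantage-weighting step can be pictured with a small sketch: given per-token log-probabilities under the policy and the frozen reference model, plus any per-token criticality score (entropy gradient, diversity, etc.), only the top ~30% most critical tokens receive a non-zero weight, and that weight grows with the confidence divergence between the two models. The tanh-bounded weight and the surrogate loss below are illustrative assumptions; GCPO's exact formulas are not given in the abstract.

```python
import torch

def gcpo_token_weights(policy_logps, ref_logps, criticality, keep_ratio=0.3, beta=1.0):
    """Per-token weights for one sampled image (a sequence of visual tokens).
    policy_logps / ref_logps: log-probs of the chosen tokens under the policy and
    the frozen reference model; criticality: any per-token importance score.
    Only the top `keep_ratio` tokens get a non-zero weight, which grows with the
    policy/reference confidence divergence (tanh-bounded form is an assumption)."""
    n = policy_logps.numel()
    k = max(1, int(keep_ratio * n))
    critical = torch.zeros(n, dtype=torch.bool)
    critical[criticality.topk(k).indices] = True

    divergence = (policy_logps - ref_logps).abs()      # confidence disagreement
    weights = 1.0 + beta * torch.tanh(divergence)      # dynamic advantage weight
    return torch.where(critical, weights, torch.zeros_like(weights))

def weighted_policy_loss(policy_logps, advantages, token_weights):
    """REINFORCE-style surrogate restricted to the weighted (critical) tokens."""
    return -(token_weights * advantages * policy_logps).sum() / token_weights.sum().clamp_min(1e-6)

if __name__ == "__main__":
    T = 16
    pol, ref = -torch.rand(T), -torch.rand(T)      # dummy per-token log-probs
    crit = torch.rand(T)                           # dummy criticality scores
    w = gcpo_token_weights(pol, ref, crit)
    print(weighted_policy_loss(pol, torch.full((T,), 0.5), w))
```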
zh
[CV-22] PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning
【速读】:该论文旨在解决事件数据(event data)在时空表征学习中因帧间运动冗余和空间稀疏性导致的计算开销过大的问题,以及现有针对RGB视频的token稀疏化方法因依赖不可靠的中间token表示并忽略事件噪声而难以直接应用于事件数据的问题。解决方案的关键在于提出一种无额外参数的即插即用模块Progressive Spatio-Temporal Token Selection (PSTTS),其核心机制分为两个阶段:首先通过Spatial Token Purification剔除事件帧内的噪声和非事件区域,以保障后续时域冗余评估的准确性;其次通过Temporal Token Selection基于相邻事件帧间的运动模式相似性识别并移除冗余时域信息,从而实现精度与效率之间的最优平衡。
链接: https://arxiv.org/abs/2509.22481
作者: Xiangmo Zhao,Nan Yang,Yang Wang,Zhanwen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Mainstream event-based spatio-temporal representation learning methods typically process event streams by converting them into sequences of event frames, achieving remarkable performance. However, they neglect the high spatial sparsity and inter-frame motion redundancy inherent in event frame sequences, leading to significant computational overhead. Existing token sparsification methods for RGB videos rely on unreliable intermediate token representations and neglect the influence of event noise, making them ineffective for direct application to event data. In this paper, we propose Progressive Spatio-Temporal Token Selection (PSTTS), a Plug-and-Play module for event data without introducing any additional parameters. PSTTS exploits the spatio-temporal distribution characteristics embedded in raw event data to effectively identify and discard spatio-temporal redundant tokens, achieving an optimal trade-off between accuracy and efficiency. Specifically, PSTTS consists of two stages, Spatial Token Purification and Temporal Token Selection. Spatial Token Purification discards noise and non-event regions by assessing the spatio-temporal consistency of events within each event frame to prevent interference with subsequent temporal redundancy evaluation. Temporal Token Selection evaluates the motion pattern similarity between adjacent event frames, precisely identifying and removing redundant temporal information. We apply PSTTS to four representative backbones UniformerV2, VideoSwin, EVMamba, and ExACT on the HARDVS, DailyDVS-200, and SeACT datasets. Experimental results demonstrate that PSTTS achieves significant efficiency improvements. Specifically, PSTTS reduces FLOPs by 29-43.6% and increases FPS by 21.6-41.3% on the DailyDVS-200 dataset, while maintaining task accuracy. Our code will be available.
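A minimal sketch of the temporal-selection stage only, assuming the spatial purification stage has already run: a token in frame t is treated as temporally redundant when it is highly similar to the token at the same spatial position in the previous frame. The cosine-similarity criterion and the threshold are illustrative assumptions, not PSTTS's exact motion-pattern measure.

```python
import torch
import torch.nn.functional as F

def temporal_token_selection(tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """`tokens` has shape (T, N, C): T event frames, N spatial tokens, C channels.
    Returns a boolean keep-mask of shape (T, N); tokens whose motion pattern
    barely changed relative to the previous frame are dropped."""
    T, N, _ = tokens.shape
    keep = torch.ones(T, N, dtype=torch.bool)
    for t in range(1, T):
        sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)  # (N,)
        keep[t] = sim < sim_threshold
    return keep

if __name__ == "__main__":
    x = torch.randn(8, 196, 64)
    x[4] = x[3]                      # make frame 4 a duplicate of frame 3
    mask = temporal_token_selection(x)
    print(mask.float().mean(dim=1))  # frame 4 keeps (almost) no tokens
```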
zh
[CV-23] Bézier Meets Diffusion: Robust Generation Across Domains for Medical Image Segmentation
【速读】:该论文旨在解决跨医学影像模态(medical imaging modalities)之间因领域差异(domain gap)导致的深度学习模型泛化能力差的问题。现有方法多依赖基于生成对抗网络(GAN)的风格迁移,但在高变异性区域难以准确建模跨域映射关系。其解决方案的关键在于提出一个统一框架——Bézier Meets Diffusion:首先利用基于贝塞尔曲线(Bézier curve)的风格迁移策略有效缩小源域与目标域之间的分布差异,从而训练出更具鲁棒性的分割模型;随后,借助该模型在目标域上生成的伪标签(pseudo-labels),训练条件扩散模型(conditional diffusion model, CDM)合成高质量、带标签的目标域图像;进一步地,通过不确定性引导的分数匹配(uncertainty-guided score matching)方法降低噪声伪标签对CDM训练的影响,提升模型稳定性与性能。
链接: https://arxiv.org/abs/2509.22476
作者: Chen Li,Meilong Xu,Xiaoling Hu,Weimin Lyu,Chao Chen
机构: Stony Brook University (石溪大学); Massachusetts General Hospital and Harvard Medical School (马萨诸塞州总医院和哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures
点击查看摘要
Abstract:Training robust learning algorithms across different medical imaging modalities is challenging due to the large domain gap. Unsupervised domain adaptation (UDA) mitigates this problem by using annotated images from the source domain and unlabeled images from the target domain to train the deep models. Existing approaches often rely on GAN-based style transfer, but these methods struggle to capture cross-domain mappings in regions with high variability. In this paper, we propose a unified framework, Bézier Meets Diffusion, for cross-domain image generation. First, we introduce a Bézier-curve-based style transfer strategy that effectively reduces the domain gap between source and target domains. The transferred source images enable the training of a more robust segmentation model across domains. Thereafter, using pseudo-labels generated by this segmentation model on the target domain, we train a conditional diffusion model (CDM) to synthesize high-quality, labeled target-domain images. To mitigate the impact of noisy pseudo-labels, we further develop an uncertainty-guided score matching method that improves the robustness of CDM training. Extensive experiments on public datasets demonstrate that our approach generates realistic labeled images, significantly augmenting the target domain and improving segmentation performance.
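To make the Bézier-curve style-transfer idea concrete, here is a minimal sketch of a monotone Bézier intensity (tone-curve) transform applied to a normalized image; the control points and their sampling range are illustrative assumptions, as the paper's exact curve parameterization is not given in the abstract.

```python
import numpy as np

def bezier_intensity_map(image: np.ndarray, p1: float = 0.3, p2: float = 0.7, rng=None) -> np.ndarray:
    """Apply a cubic Bezier tone curve with endpoints (0,0) and (1,1) to an image
    in [0, 1]; this shifts intensity statistics while preserving structure."""
    if rng is not None:                      # optionally randomize the curve per sample
        p1, p2 = rng.uniform(0.0, 1.0, size=2)
    t = np.linspace(0.0, 1.0, 256)
    # x-coordinates use fixed, increasing control points so the curve is invertible.
    x = 3 * (1 - t) ** 2 * t * 0.25 + 3 * (1 - t) * t ** 2 * 0.75 + t ** 3
    y = 3 * (1 - t) ** 2 * t * p1 + 3 * (1 - t) * t ** 2 * p2 + t ** 3
    return np.interp(image, x, y).astype(np.float32)

if __name__ == "__main__":
    img = np.random.rand(128, 128).astype(np.float32)     # e.g. a normalized MRI slice
    shifted = bezier_intensity_map(img, rng=np.random.default_rng(0))
    print(img.mean(), shifted.mean())
```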
zh
[CV-24] SSVIF: Self-Supervised Segmentation-Oriented Visible and Infrared Image Fusion
链接: https://arxiv.org/abs/2509.22450
作者: Zixian Zhao,Xingchen Zhang
机构: University of Exeter (埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-25] γ-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition
【速读】:该论文旨在解决在低带宽和能量受限场景下,传感器采集的高比特深度数据(如12-bit图像或传感数据)在传输与处理中效率低下、耗能严重的问题,尤其是在无需人类干预的自动化模式识别任务中,传统基于人眼感知优化的预处理流程(如ISP流水线)可能并非最优。其解决方案的关键在于提出一种任务特定的非线性量化方法——γ-Quant(gamma-Quant),即通过学习得到针对具体识别任务的可微分非线性量化函数,从而在仅使用4-bit低比特深度原始数据的情况下,仍能实现与原始12-bit数据相当的性能表现,显著降低数据传输开销并延长可穿戴设备续航能力。
链接: https://arxiv.org/abs/2509.22448
作者: Mishal Fatima,Shashank Agnihotri,Marius Bock,Kanchana Vaishnavi Gandikota,Kristof Van Laerhoven,Michael Moeller,Margret Keuper
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at DAGM GCPR 2025
点击查看摘要
Abstract:Most pattern recognition models are developed on pre-processed data. In computer vision, for instance, RGB images processed through image signal processing (ISP) pipelines designed to cater to human perception are the most frequent input to image analysis networks. However, many modern vision tasks operate without a human in the loop, raising the question of whether such pre-processing is optimal for automated analysis. Similarly, human activity recognition (HAR) on body-worn sensor data commonly takes normalized floating-point data arising from a high-bit analog-to-digital converter (ADC) as an input, despite such an approach being highly inefficient in terms of data transmission, significantly affecting the battery life of wearable devices. In this work, we target low-bandwidth and energy-constrained settings where sensors are limited to low-bit-depth capture. We propose γ-Quant, i.e., the task-specific learning of a non-linear quantization for pattern recognition. We exemplify our approach on raw-image object detection as well as HAR of wearable data, and demonstrate that raw data with a learnable quantization using as few as 4 bits can perform on par with the use of raw 12-bit data. All code to reproduce our experiments is publicly available via this https URL
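A minimal sketch of what a task-specific learnable non-linear quantizer can look like: raw values are passed through a learnable power-law (gamma) curve and uniformly quantized to a few bits, with a straight-through estimator so gradients reach the curve parameter. The gamma parameterization and the straight-through trick are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class GammaQuant(nn.Module):
    """Learnable non-linearity followed by low-bit uniform quantization."""

    def __init__(self, bits: int = 4):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.log_gamma = nn.Parameter(torch.zeros(1))    # gamma = exp(log_gamma) > 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.clamp(1e-4, 1.0) ** self.log_gamma.exp()   # learnable tone curve (avoid log(0) in backward)
        q = torch.round(x * self.levels) / self.levels   # low-bit quantization
        return x + (q - x).detach()                      # straight-through estimator

if __name__ == "__main__":
    quant = GammaQuant(bits=4)
    raw = torch.rand(2, 3, 32, 32, requires_grad=True)   # stand-in for normalized raw sensor data
    out = quant(raw)
    out.mean().backward()                                # gradients flow to log_gamma and raw
    print(out.unique().numel(), quant.log_gamma.grad)    # at most 16 distinct values for 4 bits
```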
zh
[CV-26] U-MAN: U-Net with Multi-scale Adaptive KAN Network for Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割中因复杂解剖结构和病灶区域导致的细节信息丢失与边界不精确问题,其根源在于传统U-Net架构的两个局限:一是简单的跳跃连接忽略了编码器与解码器之间特征的语义差异,二是深层网络缺乏多尺度特征提取能力。解决方案的关键在于提出U-Net with Multi-scale Adaptive KAN(U-MAN),通过引入两个核心模块实现改进:其一为渐进式注意力引导特征融合(Progressive Attention-Guided Feature Fusion, PAGF)模块,替代原始跳跃连接,利用注意力机制融合编码器与解码器特征以增强语义一致性;其二为多尺度自适应Kolmogorov-Arnold Network(Multi-scale Adaptive KAN, MAN)模块,使网络能够自适应地处理不同尺度的特征表示,从而提升对不同尺寸目标的分割精度与边界保持能力。
链接: https://arxiv.org/abs/2509.22444
作者: Bohan Huang,Qianyun Bao,Haoyuan Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages
点击查看摘要
Abstract:Medical image segmentation faces significant challenges in preserving fine-grained details and precise boundaries due to complex anatomical structures and pathological regions. These challenges primarily stem from two key limitations of conventional U-Net architectures: (1) their simple skip connections ignore the encoder-decoder semantic gap between various features, and (2) they lack the capability for multi-scale feature extraction in deep layers. To address these challenges, we propose the U-Net with Multi-scale Adaptive KAN (U-MAN), a novel architecture that enhances the emerging Kolmogorov-Arnold Network (KAN) with two specialized modules: Progressive Attention-Guided Feature Fusion (PAGF) and the Multi-scale Adaptive KAN (MAN). Our PAGF module replaces the simple skip connection, using attention to fuse features from the encoder and decoder. The MAN module enables the network to adaptively process features at multiple scales, improving its ability to segment objects of various sizes. Experiments on three public datasets (BUSI, GLAS, and CVC) show that U-MAN outperforms state-of-the-art methods, particularly in defining accurate boundaries and preserving fine details.
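The idea of replacing a plain skip connection with an attention-guided fusion can be sketched as below: the concatenated encoder and decoder features produce a spatial gate that reweights the encoder feature before fusion. This follows the general description of PAGF in the abstract; the concrete block design (and the KAN components) in the paper may differ.

```python
import torch
import torch.nn as nn

class AttentionGuidedFusion(nn.Module):
    """Attention-gated replacement for a plain U-Net skip connection (illustrative)."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, enc: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
        attn = self.gate(torch.cat([enc, dec], dim=1))   # (B, 1, H, W) spatial attention
        return self.fuse(torch.cat([enc * attn, dec], dim=1))

if __name__ == "__main__":
    block = AttentionGuidedFusion(channels=32)
    enc, dec = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
    print(block(enc, dec).shape)   # torch.Size([1, 32, 64, 64])
```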
zh
[CV-27] Explaining multimodal LLMs via intra-modal token interactions
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在解释其决策过程时存在的局限性问题,尤其是现有跨模态归因方法忽视了模态内部依赖关系所带来的干扰,导致视觉解释碎片化、文本解释受无关上下文噪声影响。解决方案的关键在于引入模态内交互增强机制:针对视觉分支提出多尺度解释聚合(Multi-Scale Explanation Aggregation, MSEA),通过多尺度输入归因聚合动态调整感受野,生成空间上更连贯的视觉解释;针对文本分支提出激活排名相关性(Activation Ranking Correlation, ARC),基于上下文词元与当前词元预测排名的一致性来衡量语义相关性,从而抑制无关上下文引起的虚假激活,保留语义一致的响应,显著提升归因的忠实度和细粒度。
链接: https://arxiv.org/abs/2509.22415
作者: Jiawei Liang,Ruoyu Chen,Xianghao Jiao,Siyuan Liang,Shiming Liu,Qunli Zhang,Zheng Hu,Xiaochun Cao
机构: Sun Yat-sen University (中山大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); NUS (新加坡国立大学); Huawei Technologies Co., Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-k prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
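An ARC-style relevance score can be sketched as follows: compare the top-k next-token rankings induced at a contextual position and at the current position, and keep context tokens whose rankings align well. The abstract does not specify the alignment measure, so a Spearman correlation over the union of the two top-k sets is used here as an illustrative choice.

```python
import numpy as np

def ranking_alignment(logits_ctx: np.ndarray, logits_cur: np.ndarray, k: int = 20) -> float:
    """Relevance of a context token to the current token via top-k ranking alignment."""
    top_ctx = np.argsort(-logits_ctx)[:k]
    top_cur = np.argsort(-logits_cur)[:k]
    vocab = np.union1d(top_ctx, top_cur)          # tokens appearing in either top-k list

    def ranks(logits):
        order = np.argsort(-logits[vocab])        # rank the union tokens under this position
        r = np.empty_like(order)
        r[order] = np.arange(len(order))
        return r

    r1, r2 = ranks(logits_ctx), ranks(logits_cur)
    r1 = (r1 - r1.mean()) / r1.std()
    r2 = (r2 - r2.mean()) / r2.std()
    return float((r1 * r2).mean())                # in [-1, 1]; high => semantically relevant context

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=1000)
    related = base + 0.1 * rng.normal(size=1000)   # similar predictive ranking
    unrelated = rng.normal(size=1000)
    print(ranking_alignment(base, related), ranking_alignment(base, unrelated))
```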
zh
[CV-28] LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
链接: https://arxiv.org/abs/2509.22414
作者: Song Fei,Tian Ye,Lujia Wang,Lei Zhu
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
[CV-29] FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing CVPR2025
【速读】:该论文旨在解决深度伪造检测模型在面对新型伪造类型时泛化能力不足的问题,其根源在于模型从有限训练数据中学习到了特定频率域的偏差(spectral bias),即过度依赖某些频带特征,从而限制了对未见伪造类型的识别能力。解决方案的关键在于提出一种频率去偏框架FreqDebias,通过两个互补策略实现:一是引入新型伪造混叠增强方法(Forgery Mixup, Fo-Mixup),动态多样化训练样本的频域特性;二是设计双一致性正则化机制(dual consistency regularization, CR),结合局部一致性(基于类激活图CAM)与全局一致性(在超球面嵌入空间中使用von Mises-Fisher分布),从而抑制模型对特定频段的过拟合,提升跨域泛化性能。
链接: https://arxiv.org/abs/2509.22412
作者: Hossein Kashiani,Niloufar Alipour Talemi,Fatemeh Afghah
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)
点击查看摘要
Abstract:Deepfake detectors often struggle to generalize to novel forgery types due to biases learned from limited training data. In this paper, we identify a new type of model bias in the frequency domain, termed spectral bias, where detectors overly rely on specific frequency bands, restricting their ability to generalize across unseen forgeries. To address this, we propose FreqDebias, a frequency debiasing framework that mitigates spectral bias through two complementary strategies. First, we introduce a novel Forgery Mixup (Fo-Mixup) augmentation, which dynamically diversifies frequency characteristics of training samples. Second, we incorporate a dual consistency regularization (CR), which enforces both local consistency using class activation maps (CAMs) and global consistency through a von Mises-Fisher (vMF) distribution on a hyperspherical embedding space. This dual CR mitigates over-reliance on certain frequency components by promoting consistent representation learning under both local and global supervision. Extensive experiments show that FreqDebias significantly enhances cross-domain generalization and outperforms state-of-the-art methods in both cross-domain and in-domain settings.
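To illustrate how an augmentation can diversify frequency characteristics, here is a minimal sketch of a frequency-space mixup: the two images' Fourier amplitude spectra are blended while the phase of image A is kept, perturbing spectral statistics without destroying structure. The amplitude-only blend is a common, simple instantiation and an assumption here; the exact mixing rule of Fo-Mixup is not given in the abstract.

```python
import numpy as np

def frequency_mixup(img_a: np.ndarray, img_b: np.ndarray, lam: float = None, rng=None) -> np.ndarray:
    """Blend amplitude spectra of two images in [0, 1]; keep the phase of image A."""
    rng = rng or np.random.default_rng()
    lam = rng.uniform(0.5, 1.0) if lam is None else lam

    fa = np.fft.fft2(img_a, axes=(0, 1))
    fb = np.fft.fft2(img_b, axes=(0, 1))
    amp = lam * np.abs(fa) + (1.0 - lam) * np.abs(fb)    # mixed amplitude spectrum
    mixed = amp * np.exp(1j * np.angle(fa))              # phase of image A preserved
    out = np.real(np.fft.ifft2(mixed, axes=(0, 1)))
    return np.clip(out, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.random((64, 64, 3)), rng.random((64, 64, 3))
    print(frequency_mixup(a, b, rng=rng).shape)
```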
zh
[CV-30] RAU: Reference-based Anatomical Understanding with Vision Language Models
【速读】:该论文旨在解决医学图像中解剖结构理解受限于专家标注数据稀缺的问题,尤其关注如何利用已标注的参考图像来指导未标注目标图像的解剖区域识别与定位。其解决方案的关键在于提出RAU框架,通过视觉语言模型(VLM)学习参考图像与目标图像之间的相对空间关系,实现对解剖区域的参考式识别与细粒度定位;进一步将VLM提取的空间线索与SAM2(Segment Anything Model 2)的像素级分割能力融合,从而在小尺度解剖结构(如血管段)上实现精准定位与分割。RAU在分布内和分布外数据集上均优于基于SAM2微调的基线方法,展现出优异的泛化性能,为自动化临床工作流中的解剖理解提供了新路径。
链接: https://arxiv.org/abs/2509.22404
作者: Yiwei Li,Yikang Liu,Jiaqi Guo,Lin Zhao,Zheyuan Zhang,Xiao Chen,Boris Mailhe,Ankush Mukherjee,Terrence Chen,Shanhui Sun
机构: United Imaging Intelligence (United Imaging Intelligence); University of Georgia (佐治亚大学); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images, trained on a moderately sized dataset. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline using the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization ability makes it scalable to out-of-distribution datasets, a property crucial for medical image applications. To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.
zh
[CV-31] Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
链接: https://arxiv.org/abs/2509.22400
作者: Xinhao Zhong,Yimin Zhou,Zhiqi Zhang,Junhao Li,Yi Sun,Bin Chen,Shu-Tao Xia,Ke Xu
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Jilin University (吉林大学); Peng Cheng Laboratory (鹏城实验室); Department of Computer Science and Technology, Tsinghua University (清华大学计算机科学与技术系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-32] Integrating Background Knowledge in Medical Semantic Segmentation with Logic Tensor Networks IJCNN2025
【速读】:该论文旨在解决医学图像语义分割中因训练数据稀缺导致模型性能受限的问题。现有基于深度学习的分割方法虽具扩展潜力,但在噪声和伪影干扰下仍存在不足,且缺乏对医学先验知识的有效利用。解决方案的关键在于引入逻辑张量网络(Logic Tensor Networks, LTNs),将医学领域的先验知识(如解剖结构形状约束及不同区域间的逻辑关系)以一阶逻辑(First-Order Logic, FOL)规则形式编码至损失函数中,并与SwinUNETR架构结合形成端到端框架。实验表明,该方法在脑部MRI中海马体分割任务上显著提升性能,尤其在小样本场景下优势明显,验证了神经符号方法在医学语义分割中的有效性与通用性。
链接: https://arxiv.org/abs/2509.22399
作者: Luca Bergamin,Giovanna Maria Dimitri,Fabio Aiolli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at TAIM@IJCNN 2025
点击查看摘要
Abstract:Semantic segmentation is a fundamental task in medical image analysis, aiding medical decision-making by helping radiologists distinguish objects in an image. Research in this field has been driven by deep learning applications, which have the potential to scale these systems even in the presence of noise and artifacts. However, these systems are not yet perfected. We argue that performance can be improved by incorporating common medical knowledge into the segmentation model’s loss function. To this end, we introduce Logic Tensor Networks (LTNs) to encode medical background knowledge using first-order logic (FOL) rules. The encoded rules span from constraints on the shape of the produced segmentation, to relationships between different segmented areas. We apply LTNs in an end-to-end framework with a SwinUNETR for semantic segmentation. We evaluate our method on the task of segmenting the hippocampus in brain MRI scans. Our experiments show that LTNs improve the baseline segmentation performance, especially when training data is scarce. Despite being in its preliminary stages, we argue that neurosymbolic methods are general enough to be adapted and applied to other medical semantic segmentation tasks.
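The core LTN mechanism is turning a first-order rule into a differentiable truth value that can be added to the loss. Below is a minimal sketch for one hypothetical rule ("the predicted hippocampus occupies a plausible volume fraction"), using smooth step functions and a product t-norm for the conjunction; the thresholds, sharpness, and the rule itself are illustrative, not the rules used in the paper.

```python
import torch

def volume_constraint_loss(prob_map: torch.Tensor, v_min=0.01, v_max=0.05, sharpness=50.0):
    """Fuzzy truth of `v_min <= volume_fraction <= v_max`, returned as 1 - truth."""
    v = prob_map.mean()                                   # soft (differentiable) volume fraction
    ge_min = torch.sigmoid(sharpness * (v - v_min))       # fuzzy truth of v >= v_min
    le_max = torch.sigmoid(sharpness * (v_max - v))       # fuzzy truth of v <= v_max
    truth = ge_min * le_max                               # product t-norm for the AND
    return 1.0 - truth

if __name__ == "__main__":
    logits = torch.randn(1, 1, 32, 32, 32, requires_grad=True)
    prob = torch.sigmoid(logits)
    loss = volume_constraint_loss(prob)    # would be added to the Dice/CE loss of the segmenter
    loss.backward()
    print(float(loss))                     # close to 1 for a random map: the rule is violated
```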
zh
[CV-33] Text Adversarial Attacks with Dynamic Outputs
【速读】:该论文旨在解决文本对抗攻击方法在动态输出场景下的有效性问题,即现有攻击方法通常假设输出标签数量固定且标签空间预定义,难以适应输出空间动态变化的大语言模型(Large Language Models, LLMs)。其解决方案的关键在于提出Textual Dynamic Outputs Attack (TDOA) 方法,通过聚类驱动的代理模型训练策略将动态输出场景转化为静态单输出场景,并引入最远标签目标攻击策略(farthest-label targeted attack strategy),选择与模型粗粒度标签差异最大的对抗向量以最大化扰动效果。此方法显著提升了在有限查询条件下对LLMs的攻击成功率,在仅需单次查询的情况下达到最高50.81%的攻击成功率(Attack Success Rate, ASR),同时在传统静态输出场景中也达到了82.68%的ASR,展现出优越的通用性和迁移能力。
链接: https://arxiv.org/abs/2509.22393
作者: Wenqiang Wang,Siyuan Liang,Xiao Yan,Xiaochun Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text adversarial attack methods are typically designed for static scenarios with fixed numbers of output labels and a predefined label space, relying on extensive querying of the victim model (query-based attacks) or the surrogate model (transfer-based attacks). To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model’s coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static output scenarios, reaching a maximum ASR of 82.68%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.
zh
[CV-34] Gradient-based multi-focus image fusion with focus-aware saliency enhancement
【速读】:该论文旨在解决多聚焦图像融合(Multi-focus Image Fusion, MFIF)中焦点-非焦点边界模糊和细节丢失的问题,这在监控、显微成像和计算摄影等应用中尤为关键。解决方案的关键在于提出一种基于显著边界增强的方法:首先构建一个梯度域模型以生成具有完整边界的初始融合结果并有效保留边界细节;其次引入Tenengrad梯度检测提取源图像及初始融合图像的显著特征,生成显著图;最后设计一种基于梯度与互补信息的聚焦度量机制,将显著特征与跨图像的互补信息融合,强化聚焦区域,从而获得高质量的初始决策结果。该方法在四个公开数据集上的实验表明其在主观和客观评价上均优于12种前沿方法。
链接: https://arxiv.org/abs/2509.22392
作者: Haoyu Li,XiaoSong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: iCIG 2025
点击查看摘要
Abstract:Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications covering surveillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and loss of focused details. To solve this problem, we propose an MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus information. In particular, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively preserve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the initial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary information across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. Our code is available at this https URL
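The Tenengrad focus measure mentioned above is a standard sharpness criterion (locally averaged squared Sobel gradient magnitude). The sketch below reproduces only this part of the pipeline and a naive per-pixel decision map for two inputs; the gradient-domain initial fusion and the complementary-information refinement of the paper are not modelled here, and the window size is an assumption.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def tenengrad_map(gray: np.ndarray, window: int = 9) -> np.ndarray:
    """Squared Sobel gradient magnitude, averaged over a local window."""
    gx, gy = sobel(gray, axis=1), sobel(gray, axis=0)
    return uniform_filter(gx ** 2 + gy ** 2, size=window)

def focus_decision_map(img_a: np.ndarray, img_b: np.ndarray, window: int = 9) -> np.ndarray:
    """1 where image A is judged in focus, 0 where image B is."""
    return (tenengrad_map(img_a, window) >= tenengrad_map(img_b, window)).astype(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sharp = rng.random((128, 128))
    blurred = uniform_filter(sharp, size=7)            # crude defocus of the same content
    decision = focus_decision_map(sharp, blurred)
    fused = decision * sharp + (1 - decision) * blurred
    print(decision.mean())                             # close to 1.0: the sharp input wins
```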
zh
[CV-35] GPT -4 for Occlusion Order Recovery
【速读】:该论文旨在解决当前视觉模型在复杂且密集的真实场景图像中难以鲁棒地解析遮挡关系的问题。其核心挑战在于准确预测物体之间的遮挡顺序(occlusion order),这对图像理解与场景解析至关重要。解决方案的关键在于利用预训练的GPT-4模型,通过设计特定提示(prompt)并结合输入图像,使其基于语义上下文、视觉模式和常识知识推理出遮挡顺序。该方法无需标注训练数据,可实现零样本(zero-shot)预测,并能生成可用于其他遮挡处理任务的遮挡矩阵(occlusion matrix),从而提升现有视觉系统的泛化能力与实用性。
链接: https://arxiv.org/abs/2509.22383
作者: Kaziwa Saleh,Zhyar Rzgar K Rostam,Sándor Szénási,Zoltán Vámossy
机构: Nikolaus Esterházy College of Education (尼古拉斯·埃斯特哈齐教育学院); University of Óbuda (布达佩斯城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures
点击查看摘要
Abstract:Occlusion remains a significant challenge for current vision models to robustly interpret complex and dense real-world images and scenes. To address this limitation and to enable accurate prediction of the occlusion order relationship between objects, we propose leveraging the advanced capability of a pre-trained GPT-4 model to deduce the order. By providing a specifically designed prompt along with the input image, GPT-4 can analyze the image and generate order predictions. The response can then be parsed to construct an occlusion matrix which can be utilized in assisting with other occlusion handling tasks and image understanding. We report the results of evaluating the model on COCOA and InstaOrder datasets. The results show that by using semantic context, visual patterns, and commonsense knowledge, the model can produce more accurate order predictions. Unlike baseline methods, the model can reason about occlusion relationships in a zero-shot fashion, which requires no annotated training data and can easily be integrated into occlusion handling frameworks.
zh
[CV-36] Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results
【速读】:该论文旨在解决多模态虚假信息(multimodal disinformation)在数字平台上的传播问题,特别是文本与图像结合场景下的检测难题。其解决方案的关键在于利用GPT-4o大模型的多模态理解能力,并通过六个核心贡献构建一个系统化、可复现的自动化分析框架:包括优化提示工程以提升评估一致性、设计符合token限制的预处理流程、定义六项细粒度评价标准并引入置信度自评估机制、跨多个异构数据集(Gossipcop、Politifact、Fakeddit、MMFakeBench、AMMEBA)进行性能验证、量化预测变异性以评估模型稳定性,以及提出基于置信度和变异性双重维度的评估方法。这些措施共同提升了多模态虚假信息检测的准确性、鲁棒性和可解释性。
链接: https://arxiv.org/abs/2509.22377
作者: Yasmina Kheddache,Marc Lalonde
机构: Université de Montréal (蒙特利尔大学); Computer Research Institute of Montreal (蒙特利尔计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages
点击查看摘要
Abstract:The proliferation of disinformation, particularly in multimodal contexts combining text and images, presents a significant challenge across digital platforms. This study investigates the potential of large multimodal models (LMMs) in detecting and mitigating false information. We propose to approach multimodal disinformation detection by leveraging the advanced capabilities of the GPT-4o model. Our contributions include: (1) the development of an optimized prompt incorporating advanced prompt engineering techniques to ensure precise and consistent evaluations; (2) the implementation of a structured framework for multimodal analysis, including a preprocessing methodology for images and text to comply with the model's token limitations; (3) the definition of six specific evaluation criteria that enable a fine-grained classification of content, complemented by a self-assessment mechanism based on confidence levels; (4) a comprehensive performance analysis of the model across multiple heterogeneous datasets (Gossipcop, Politifact, Fakeddit, MMFakeBench, and AMMEBA), highlighting GPT-4o's strengths and limitations in disinformation detection; (5) an investigation of prediction variability through repeated testing, evaluating the stability and reliability of the model's classifications; and (6) the introduction of confidence-level and variability-based evaluation methods. These contributions provide a robust and reproducible methodological framework for automated multimodal disinformation analysis.
zh
[CV-37] HierLight-YOLO: A Hierarchical and Lightweight Object Detection Network for UAV Photography
【速读】:该论文旨在解决在资源受限平台(如无人机)上实时检测小目标(如32像素以下)时面临的双重挑战:一是小目标检测精度低,尤其是相较于大目标场景下YOLO系列检测器显著增高的假阴性率;二是保持模型轻量化以满足实时性要求。解决方案的关键在于提出HierLight-YOLO架构,其核心创新包括:(1) 引入分层扩展路径聚合网络(Hierarchical Extended Path Aggregation Network, HEPAN),通过跨层级的分层连接实现多尺度特征融合,提升小目标特征表达能力;(2) 设计两种轻量模块——倒置残差深度卷积块(Inverted Residual Depthwise Convolution Block, IRDCB)与轻量下采样模块(LDown),有效降低参数量和计算复杂度而不损失检测性能;(3) 构建专门用于小目标的检测头,增强空间分辨率和特征融合能力,从而实现对极小目标(如4像素)的有效检测。
链接: https://arxiv.org/abs/2509.22365
作者: Defan Chen,Yaohua Hu,Luchan Zhang
机构: 深圳大学(Shenzhen University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The real-time detection of small objects in complex scenes, such as the unmanned aerial vehicle (UAV) photography captured by drones, poses the dual challenges of detecting small targets (less than 32 pixels) and maintaining real-time efficiency on resource-constrained platforms. While YOLO-series detectors have achieved remarkable success in real-time large object detection, they suffer from significantly higher false negative rates for drone-based detection where small objects dominate, compared to large object scenarios. This paper proposes HierLight-YOLO, a hierarchical feature fusion and lightweight model that enhances the real-time detection of small objects, based on the YOLOv8 architecture. We propose the Hierarchical Extended Path Aggregation Network (HEPAN), a multi-scale feature fusion method through hierarchical cross-level connections, enhancing the small object detection accuracy. HierLight-YOLO includes two innovative lightweight modules: the Inverted Residual Depthwise Convolution Block (IRDCB) and the Lightweight Downsample (LDown) module, which significantly reduce the model's parameters and computational complexity without sacrificing detection capabilities. A small object detection head is designed to further enhance spatial resolution and feature fusion to tackle tiny object (4 pixels) detection. Comparison experiments and ablation studies on the VisDrone2019 benchmark demonstrate the state-of-the-art performance of HierLight-YOLO.
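The IRDCB mentioned above follows the general pattern of an inverted residual block with a depthwise convolution. The sketch below shows one such block; the expansion ratio, activation, and normalization are MobileNetV2-style assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class IRDCB(nn.Module):
    """Inverted residual depthwise block: pointwise expand -> depthwise 3x3 -> pointwise project."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),   # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)        # residual connection when shapes match

if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(IRDCB(64)(x).shape)           # torch.Size([1, 64, 80, 80])
```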
zh
[CV-38] RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation
链接: https://arxiv.org/abs/2509.22356
作者: Enguang Liu,Siyuan Liang,Liming Lu,Xiyu Zeng,Xiaochun Cao,Aishan Liu,Shuchao Pang
机构: Nanjing University of Science and Technology (南京理工大学); National University of Singapore (新加坡国立大学); Sun Yat-sen University (中山大学); Beihang University (北京航空航天大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-39] CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在电路理解任务中从视觉输入到符号数学建模能力不足的问题,尤其是其在工程设计层级中从元件级原理图到系统级框图的跨层次推理能力缺失。解决方案的关键在于构建了一个名为CircuitSense的综合性基准测试集,涵盖8006+个问题,覆盖感知(Perception)、分析(Analysis)与设计(Design)三个完整工程流程,并引入分层合成生成管道——包括基于网格的原理图生成器和自动推导符号方程标签的框图生成器,从而系统性评估模型在视觉到数学推理链条中的表现。实验表明,尽管闭源模型在组件识别与拓扑判断等感知任务上准确率超过85%,但在符号方程推导与分析推理任务上的表现低于19%,凸显了当前MLLMs在视觉到符号推理间的显著鸿沟;同时验证了符号推理能力是电路设计任务性能的核心决定因素,确立其为衡量工程智能的关键指标。
链接: https://arxiv.org/abs/2509.22339
作者: Arman Akbari,Jian Gao,Yifei Zou,Mei Yang,Jinru Duan,Dmitrii Torbunov,Yanzhi Wang,Yihui Ren,Xuan Zhang
机构: Northeastern University (东北大学); Brookhaven National Laboratory (布鲁克海文国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present CircuitSense, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.
zh
[CV-40] Pedestrian Attribute Recognition via Hierarchical Cross-Modality HyperGraph Learning
链接: https://arxiv.org/abs/2509.22331
作者: Xiao Wang,Shujuan Wu,Xiaoxia Cheng,Changwei Bi,Jin Tang,Bin Luo
机构: Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The First Work that Exploits Multi-modal Knowledge Graph for Pedestrian Attribute Recognition
[CV-41] RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer
链接: https://arxiv.org/abs/2509.22323
作者: Wangbo Zhao,Yizeng Han,Zhiwei Tang,Jiasheng Tang,Pengfei Zhou,Kai Wang,Bohan Zhuang,Zhangyang Wang,Fan Wang,Yang You
机构: National University of Singapore (新加坡国立大学); DAMO Academy, Alibaba Group (阿里巴巴达摩院); Hupan Lab (湖畔实验室); ZIP Lab, Zhejiang University (浙江大学ZIP实验室); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-42] NIFTY: a Non-Local Image Flow Matching for Texture Synthesis
链接: https://arxiv.org/abs/2509.22318
作者: Pierrick Chatillon,Julien Rabin,David Tschumperlé
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-43] Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
链接: https://arxiv.org/abs/2509.22307
作者: Jinpeng Lu,Linghan Cai,Yinda Chen,Guo Tang,Songhan Jiang,Haoyuan Shi,Zhiwei Xiong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-44] HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models
【速读】:该论文旨在解决扩散模型(diffusion models)在图像生成中因采样步数较少或引导尺度较低时,输出图像 realism(真实感)不足、细节缺失的问题。其解决方案的关键在于提出一种基于动量的采样技术——历史引导采样(history-guided sampling, HiGS),该方法通过将当前预测与过去预测的加权平均之间的差异引入每一步推理过程,以引导采样路径趋向更真实、结构更清晰的图像结果。HiGS无需额外训练或微调,计算开销几乎为零,可无缝集成至现有扩散框架中,并在多种模型架构和采样预算下显著提升图像质量,例如在仅30步采样条件下即实现未引导ImageNet生成的最新FID得分1.61(256×256)。
链接: https://arxiv.org/abs/2509.22300
作者: Seyedmorteza Sadat,Farnood Salehi,Romann M. Weber
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using a smaller number of neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances the quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256×256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.
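The history-guided idea can be sketched as a drop-in modification of a sampling loop: at each step, the model prediction is extrapolated away from a decaying average of past predictions. The guidance strength and the exponential weighting below are assumptions; the abstract only states that a weighted average of past predictions is used.

```python
import torch

class HistoryGuidedSampler:
    """Momentum-style guidance of diffusion predictions using a running history."""

    def __init__(self, gamma: float = 0.1, decay: float = 0.8):
        self.gamma, self.decay = gamma, decay
        self.history = None                         # running weighted average of predictions

    def guide(self, pred: torch.Tensor) -> torch.Tensor:
        if self.history is None:
            self.history = pred.detach().clone()
            return pred
        guided = pred + self.gamma * (pred - self.history)   # push away from the past average
        self.history = self.decay * self.history + (1 - self.decay) * pred.detach()
        return guided

if __name__ == "__main__":
    sampler = HistoryGuidedSampler()
    x = torch.randn(1, 4, 32, 32)                   # latent being denoised
    for step in range(5):
        eps = torch.randn_like(x)                   # stand-in for the diffusion model's prediction
        eps = sampler.guide(eps)                    # one-line change inside the sampling loop
        x = x - 0.1 * eps                           # placeholder update; a real scheduler goes here
    print(x.shape)
```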
zh
[CV-45] Jailbreaking on Text-to-Video Models via Scene Splitting Strategy
【速读】:该论文旨在解决文本到视频(Text-to-Video, T2V)生成模型中存在的安全风险问题,尤其是当前对T2V模型的攻击研究严重不足,导致其安全机制存在显著漏洞。现有方法主要针对大语言模型(LLM)、视觉语言模型(VLM)和文本到图像(Text-to-Image, T2I)模型开展对抗性攻击,而T2V模型尚未被充分探索。论文提出的解决方案核心是SceneSplit——一种基于黑盒攻击的新型越狱方法,其关键在于将有害叙事拆分为多个单独看似无害的场景片段,利用这些片段在生成空间中的组合约束效应,从整体上引导输出进入一个不安全区域,从而显著提升生成有害视频的概率。该机制通过迭代式场景操控绕过模型内部的安全过滤器,并借助攻击模式库复用成功策略以增强攻击的鲁棒性和有效性,实验证明其在多个T2V模型上平均攻击成功率(ASR)超过77%,显著优于现有基线。
链接: https://arxiv.org/abs/2509.22292
作者: Wonjun Lee,Haon Park,Doehyeon Lee,Bumsub Ham,Suhyun Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack’s overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.
zh
[CV-46] Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models
【速读】:该论文旨在解决文档分析领域中下游任务(如文档图像分类)在面对分布外数据时泛化能力不足的问题。传统方法在处理未见过的类别、不同模态或分布外图像时表现不稳定,而本文提出通过规则驱动的强化学习(Rule-based Reinforcement Learning, RbRL)来增强模型的推理能力与适应性。其解决方案的关键在于利用可验证的奖励机制(verifiable rewards)引导模型学习更鲁棒的决策策略,从而在三种典型场景下——分布外图像、未见类别和跨模态数据中均展现出优于基线方法的泛化性能。
链接: https://arxiv.org/abs/2509.22283
作者: Michael Jungo,Andreas Fischer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL
点击查看摘要
Abstract:Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emerging properties of reinforcement learning, particularly the enhanced reasoning capabilities. We study the effects of rule-based reinforcement learning on the task of Document Image Classification, which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distribution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at this https URL.
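A rule-based verifiable reward for classification is simple to write down; the sketch below combines a small format bonus with a correctness reward, in the spirit of DeepSeek-R1-style rewards. The answer-tag convention, the label set, and the reward values are assumptions, not the paper's exact scheme.

```python
import re

def classification_reward(response: str, gold_label: str,
                          labels=("invoice", "letter", "form")) -> float:
    """Verifiable reward: format bonus if the answer is wrapped in <answer> tags,
    plus a correctness reward if the extracted label matches the ground truth."""
    reward = 0.0
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", response, flags=re.I | re.S)
    if match:
        reward += 0.1                                       # format reward
        predicted = match.group(1).strip().lower()
        if predicted == gold_label.lower() and predicted in labels:
            reward += 1.0                                   # accuracy reward
    return reward

if __name__ == "__main__":
    out = "The layout has line items and totals... <answer>invoice</answer>"
    print(classification_reward(out, "invoice"))                        # 1.1
    print(classification_reward("I think it is a letter.", "letter"))   # 0.0 (no tag)
```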
zh
[CV-47] MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning NEURIPS2025
【速读】:该论文旨在解决机器人在执行操作任务时,如何根据高阶任务指令生成符合语义且物理合理的桌面场景(tabletop scene)的问题。传统方法依赖人工设计或随机布局,难以保证场景的真实性与任务一致性。解决方案的关键在于提出一种名为Spatial Reasoning Chain的结构化推理链,将场景生成过程分解为物体推断、空间关系推理和场景图构建三个阶段,并结合大语言模型(LLM)与直接偏好优化(DPO)算法,构建了MesaTask框架,从而实现从任务描述到物理合理3D桌面布局的高质量映射。
链接: https://arxiv.org/abs/2509.22281
作者: Jinkun Hao,Naifu Liang,Zhen Luo,Xudong Xu,Weipeng Zhong,Ran Yi,Yichen Jin,Zhaoyang Lyu,Feng Zheng,Lizhuang Ma,Jiangmiao Pang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Southern University of Science and Technology (南方科技大学); Peking University (北京大学); SII
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by NeurIPS 2025; Project page: this https URL
点击查看摘要
Abstract:The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at this https URL
zh
[CV-48] GS-2M: Gaussian Splatting for Joint Mesh Reconstruction and Material Decomposition
链接: https://arxiv.org/abs/2509.22276
作者: Dinh Minh Nguyen,Malte Avenhaus,Thomas Lindemeier
机构: Norwegian University of Science and Technology (挪威科技大学); Carl Zeiss AG (卡尔·蔡司公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures
[CV-49] UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data
链接: https://arxiv.org/abs/2509.22262
作者: Yujian Yuan,Changjie Wu,Xinyuan Chang,Sijin Wang,Hang Zhang,Shiyi Liang,Shuang Zeng,Mu Xu
机构: Amap(高德地图); Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 10 figures
[CV-50] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在医学领域中存在“评估幻觉”问题,即模型在标准分类任务上表现优异,但缺乏真正临床推理能力,尤其在神经科诊断中难以胜任高风险的推理任务。解决方案的关键在于提出一个名为Neural-MedBench的紧凑且推理密集型基准测试集,其整合多序列MRI影像、结构化电子健康记录与临床笔记,涵盖鉴别诊断、病灶识别和推理生成三类核心任务,并采用结合大语言模型(Large Language Model, LLM)评分、临床医生验证和语义相似性度量的混合评分流程,以实现对模型推理能力的可靠评估。实证表明,现有先进VLMs在此基准下性能显著下降,且推理错误远超感知错误,凸显了深度导向的紧凑基准对于保障临床可信AI的重要性。
链接: https://arxiv.org/abs/2509.22258
作者: Miao Jing,Mengting Jia,Junling Lin,Zhongxia Shen,Lijun Wang,Yuanyuan Peng,Huan Gao,Mingkun Xu,Shangyang Li
机构: Guangdong Institute of Intelligence Science and Technology (广东省智能科学与技术研究院); Beijing Chaoyang Hospital (北京市朝阳医院); Sleep Medical Center of Huzhou Third Municipal Hospital (湖州第三人民医院睡眠医学中心); Shenzhen Hospital (Futian) of Guangzhou University of Chinese Medicine (广州中医药大学深圳医院(福田院区)); University of Macau (澳门大学); Academy for Advanced Interdisciplinary Studies, Peking University (北京大学前沿交叉学科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 12 figures
点击查看摘要
Abstract:Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at this https URL as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
zh
[CV-51] FlashEdit: Decoupling Speed Structure and Semantics for Precise Image Editing
链接: https://arxiv.org/abs/2509.22244
作者: Junyi Wu,Zhiteng Li,Haotong Qin,Xiaohong Liu,Linghe Kong,Yulun Zhang,Xiaokang Yang
机构: Shanghai Jiao Tong University (上海交通大学); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code will be made publicly available at this https URL
[CV-52] Clinical Uncertainty Impacts Machine Learning Evaluations
【速读】:该论文试图解决临床数据标注不确定性问题,即标注者之间存在分歧且标注置信度在不同病例中不一致,而传统聚合方法(如多数投票)会掩盖这种变异性,从而导致模型评估结果失真。解决方案的关键在于引入基于概率的评估指标,这些指标能够直接作用于标签分布,并显式地考虑标注不确定性;该方法独立于标注生成过程(无论是基于计数、主观置信度评分还是概率响应模型),且计算效率高,具有线性时间复杂度,只需对样本按模型得分排序即可实现。因此,作者呼吁社区公开原始标注并采用不确定性感知的评估方式,以更真实地反映临床数据下的模型性能。
链接: https://arxiv.org/abs/2509.22242
作者: Simone Lionetti,Fabian Gröger,Philippe Gottfrois,Alvaro Gonzalez-Jimenez,Ludovic Amruthalingam,Alexander A. Navarini,Marc Pouly
机构: Lucerne University of Applied Sciences and Arts(卢塞恩应用科学与艺术大学); University of Basel(巴塞尔大学); University Hospital of Basel(巴塞尔大学医院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Clinical dataset labels are rarely certain as annotators disagree and confidence is not uniform across cases. Typical aggregation procedures, such as majority voting, obscure this variability. In simple experiments on medical imaging benchmarks, accounting for the confidence in binary labels significantly impacts model rankings. We therefore argue that machine-learning evaluations should explicitly account for annotation uncertainty using probabilistic metrics that directly operate on distributions. These metrics can be applied independently of the annotations’ generating process, whether modeled by simple counting, subjective confidence ratings, or probabilistic response models. They are also computationally lightweight, as closed-form expressions have linear-time implementations once examples are sorted by model score. We thus urge the community to release raw annotations for datasets and to adopt uncertainty-aware evaluation so that performance estimates may better reflect clinical data.
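A minimal example of a probabilistic metric that operates directly on the label distribution: each example carries the annotators' probability of being positive (e.g. the fraction of raters voting positive), and accuracy is computed in expectation over those soft labels. This is one simple instance of the uncertainty-aware metrics advocated above, not the specific metrics studied in the paper.

```python
import numpy as np

def expected_accuracy(scores, pos_prob, threshold: float = 0.5) -> float:
    """Expected accuracy under per-example label probabilities; reduces to
    ordinary accuracy when pos_prob is exactly 0 or 1."""
    pred_pos = np.asarray(scores) >= threshold
    pos_prob = np.asarray(pos_prob, dtype=float)
    return float(np.where(pred_pos, pos_prob, 1.0 - pos_prob).mean())

if __name__ == "__main__":
    scores = np.array([0.9, 0.8, 0.3, 0.6])      # model scores
    pos_prob = np.array([1.0, 0.6, 0.1, 0.5])    # annotator agreement per case
    hard = (pos_prob >= 0.5).astype(float)       # what majority voting would keep
    print(expected_accuracy(scores, pos_prob))   # accounts for the uncertain cases
    print(expected_accuracy(scores, hard))       # conventional, over-confident estimate
```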
zh
[CV-53] A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation
链接: https://arxiv.org/abs/2509.22229
作者: Jiaping Yu,Muli Yang,Jiapeng Ji,Jiexi Yan,Cheng Deng
机构: Xidian University (西安电子科技大学); Institute for Infocomm Research (I2R) (资讯通信研究所); A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-54] UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective
链接: https://arxiv.org/abs/2509.22228
作者: Jun He,Yi Lin,Zilong Huang,Jiacong Yin,Junyan Ye,Yuchuan Zhou,Weijia Li,Xiang Zhang
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures
[CV-55] Aerial Path Planning for Urban Geometry and Texture Co-Capture SIGGRAPH
链接: https://arxiv.org/abs/2509.22227
作者: Weidan Xiong,Bochuan Zeng,Ziyu Hu,Jianwei Guo,Ke Xie,Hui Huang
机构: Shenzhen University (深圳大学); Beijing Normal University (北京师范大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: ACM TOG and SIGGRAPH Asia 2025 (Patent Protected); Project page: this https URL
[CV-56] Polysemous Language Gaussian Splatting via Matching-based Mask Lifting
链接: https://arxiv.org/abs/2509.22225
作者: Jiayu Ding,Xinpeng Liu,Zhiyi Pan,Shiqiang Long,Ge Li
机构: Peking University (北京大学); Tianjin University (天津大学); Guangdong Bohua UHD Innovation Center Co., Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-57] Rigidity-Aware 3D Gaussian Deformation from a Single Image
链接: https://arxiv.org/abs/2509.22222
作者: Jinhyeok Kim,Jaehun Bang,Seunghyun Seo,Kyungdon Joo
机构: UNIST (Republic of Korea)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 11 figures, conference
[CV-58] Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models
链接: https://arxiv.org/abs/2509.22221
作者: Jiaqi Liu,Lang Sun,Ronghao Fu,Bo Yang
机构: Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-59] Drag GANSpace: Latent Space Exploration and Control for GANs
链接: https://arxiv.org/abs/2509.22169
作者: Kirsten Odendaal,Neela Kaushik,Spencer Halverson
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages with 7 figures and 3 tables
[CV-60] MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models ICLR2026
【速读】:该论文旨在解决生成式材料节点图(Material Node Graphs)的自动化构建问题,这类图是计算机图形学中用于参数化表示虚拟三维物体外观的关键工具,但其创建通常依赖专业人员的手动设计。现有基于神经程序合成的方法仅将节点图表示为文本形式,忽略了其固有的视觉空间结构,导致生成结果难以直观理解和使用。解决方案的关键在于提出 MultiMat 框架,该框架利用大规模多模态模型同时处理节点图的视觉和文本表示,并结合约束树搜索推理算法,在保证语法正确性的同时高效探索程序空间,从而显著提升无条件与条件下的材料节点图生成效率与视觉保真度,达到当前最优性能。
链接: https://arxiv.org/abs/2509.22151
作者: Jonas Belouadi,Tamy Boubekeur,Adrien Kaiser
机构: University of Mannheim (曼海姆大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICLR 2026
点击查看摘要
Abstract:Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structures and intermediate states provide an intuitive understanding and workflow for interactive appearance modeling. Creating such graphs is a challenging task and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures syntactic validity while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
zh
[CV-61] Joint graph entropy knowledge distillation for point cloud classification and robustness against corruptions
【速读】:该论文旨在解决3D点云分类任务中传统方法假设类别事件服从独立同分布(independent and identically distributed, IID)所带来的类间相关性丢失问题。其解决方案的关键在于提出一种名为联合图熵知识蒸馏(Joint Graph Entropy Knowledge Distillation, JGEKD)的策略,通过构建基于联合图熵(joint graph entropy)的损失函数实现类间相关性的知识迁移;具体而言,利用联合图捕捉类别间的隐式关系,并结合知识蒸馏机制优化模型训练;此外,为增强对空间变换不变性的鲁棒性,设计了自知识蒸馏与教师知识蒸馏两种框架,分别在相同数据的不同变换形式间及原始点云与其退化形式之间实现信息传递,从而提升模型的整体性能与抗干扰能力。
链接: https://arxiv.org/abs/2509.22150
作者: Zhiqiang Tian,Weigang Li,Junwei Hu,Chunhua Deng
机构: Wuhan University of Science and Technology (武汉科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Classification tasks in 3D point clouds often assume that class events are independent and identically distributed (IID), although this assumption destroys the correlation between classes. This study proposes a classification strategy, Joint Graph Entropy Knowledge Distillation (JGEKD), suitable for non-independent and identically distributed 3D point cloud data, which achieves knowledge transfer of class correlations through knowledge distillation by constructing a loss function based on joint graph entropy. First, we employ joint graphs to capture the hidden relationships between classes and implement knowledge distillation to train our model by calculating the entropy of the graph. Subsequently, to handle 3D point clouds invariant to spatial transformations, we construct Siamese structures and develop two frameworks, self-knowledge distillation and teacher-knowledge distillation, to facilitate information transfer between different transformation forms of the same data. In addition, we use the above framework to achieve knowledge transfer between point clouds and their corrupted forms, and increase the robustness of the model against corruption. Extensive experiments on ScanObject, ModelNet40, ScanNetV2_cls and ModelNet-C demonstrate that the proposed strategy can achieve competitive results.
zh
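为直观说明上述“联合图熵”式蒸馏的计算流程,下面给出一个极简的 PyTorch 草图(非论文官方实现,仅为示意):用一个 batch 内教师/学生的类别概率近似构造类-类联合图,再同时对齐联合图分布及其熵。其中联合图的具体构造方式、温度与权重系数均为此处假设的简化设定,实际细节以原文为准。

```python
# 极简示意(非论文官方实现):用“联合图熵”在教师/学生模型之间传递类间相关性
import torch
import torch.nn.functional as F

def joint_graph(probs: torch.Tensor) -> torch.Tensor:
    """probs: (B, C) 的类别概率,返回 (C, C) 的归一化联合图(类间协同强度)。"""
    g = probs.t() @ probs                # 类-类协同计数
    return g / g.sum()                   # 归一化为联合分布

def joint_graph_entropy(g: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    return -(g * (g + eps).log()).sum()

def jgekd_loss(student_logits, teacher_logits, T: float = 2.0):
    p_s = F.softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    g_s, g_t = joint_graph(p_s), joint_graph(p_t.detach())
    # 同时对齐联合图本身与其熵,促使学生保留类间相关结构
    graph_term = F.kl_div((g_s + 1e-8).log(), g_t, reduction="sum")
    entropy_term = (joint_graph_entropy(g_s) - joint_graph_entropy(g_t)).abs()
    return graph_term + 0.1 * entropy_term

if __name__ == "__main__":
    s = torch.randn(16, 40, requires_grad=True)   # 例如 ModelNet40 的 40 类
    t = torch.randn(16, 40)
    loss = jgekd_loss(s, t)
    loss.backward()
    print(float(loss))
```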
[CV-62] REFINE-CONTROL: A Semi-supervised Distillation Method For Conditional Image Generation
【速读】:该论文旨在解决生成式图像模型在边缘设备部署时面临的高资源消耗与标注数据稀缺问题,这些问题导致计算成本高昂且存在用户隐私风险。解决方案的关键在于提出一种半监督蒸馏框架Refine-Control,其核心创新是引入三层次知识融合损失(tri-level knowledge fusion loss),以实现从教师模型到学生模型的不同粒度知识迁移,并结合有标签与无标签数据的半监督蒸馏策略,从而在显著降低计算开销和延迟的同时,保持高质量的图像生成能力和文本控制精度。
链接: https://arxiv.org/abs/2509.22139
作者: Yicheng Jiang,Jin Yuan,Hua Yuan,Yao Zhang,Yong Rui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages,17 figures
点击查看摘要
Abstract:Conditional image generation models have achieved remarkable results by leveraging text-based control to generate customized images. However, the high resource demands of these models and the scarcity of well-annotated data have hindered their deployment on edge devices, leading to enormous costs and privacy concerns, especially when user data is sent to a third party. To overcome these challenges, we propose Refine-Control, a semi-supervised distillation framework. Specifically, we improve the performance of the student model by introducing a tri-level knowledge fusion loss to transfer different levels of knowledge. To enhance generalization and alleviate dataset scarcity, we introduce a semi-supervised distillation method utilizing both labeled and unlabeled data. Our experiments reveal that Refine-Control achieves significant reductions in computational cost and latency, while maintaining high-fidelity generation capabilities and controllability, as quantified by comparative metrics.
zh
[CV-63] Self-Supervised Point Cloud Completion based on Multi-View Augmentations of Single Partial Point Cloud
【速读】:该论文旨在解决点云补全(point cloud completion)任务中现有方法的局限性:监督学习方法依赖真实标签,导致在真实数据集上的泛化能力受限;无监督方法需完整的点云作为训练数据,而弱监督方法则要求多视角观测,这限制了其实际应用;现有自监督方法因自监督信号能力有限,常产生质量不佳的补全结果。解决方案的关键在于设计了一种基于单个部分点云多视角增强的新颖自监督信号,并首次将Mamba架构引入自监督点云补全任务中,从而显著提升模型的学习能力和生成点云的质量,实验证明该方法在合成与真实数据集上均达到当前最优性能。
链接: https://arxiv.org/abs/2509.22132
作者: Jingjing Lu,Huilong Pi,Yunchuan Qin,Zhuo Tang,Ruihui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point cloud completion aims to reconstruct complete shapes from partial observations. Although current methods have achieved remarkable performance, they still have some limitations: Supervised methods heavily rely on ground truth, which limits their generalization to real-world datasets due to the synthetic-to-real domain gap. Unsupervised methods require complete point clouds to compose unpaired training data, and weakly-supervised methods need multi-view observations of the object. Existing self-supervised methods frequently produce unsatisfactory predictions due to the limited capabilities of their self-supervised signals. To overcome these challenges, we propose a novel self-supervised point cloud completion method. We design a set of novel self-supervised signals based on multi-view augmentations of the single partial point cloud. Additionally, to enhance the model’s learning ability, we first incorporate Mamba into self-supervised point cloud completion task, encouraging the model to generate point clouds with better quality. Experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art results.
zh
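下面的 PyTorch 草图示意“单个残缺点云的多视角增强 + 一致性约束”这一自监督信号的一种朴素实现(非论文官方代码):对同一残缺点云施加不同随机旋转,分别补全后旋回原坐标系,再用 Chamfer 距离约束各视角补全结果相互一致。其中 completion_net 为假设的占位网络,真实模型(含 Mamba 模块)与增强策略以论文为准。

```python
# 极简示意(非论文官方实现):多视角(随机绕 z 轴旋转)增强下的补全一致性损失
import torch

def random_rotation_z(batch: int) -> torch.Tensor:
    theta = torch.rand(batch) * 2 * torch.pi
    c, s = torch.cos(theta), torch.sin(theta)
    R = torch.zeros(batch, 3, 3)
    R[:, 0, 0], R[:, 0, 1] = c, -s
    R[:, 1, 0], R[:, 1, 1] = s, c
    R[:, 2, 2] = 1.0
    return R

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (B, N, 3), b: (B, M, 3) 的双向最近邻平均距离。"""
    d = torch.cdist(a, b)                        # (B, N, M)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def multiview_consistency_loss(partial: torch.Tensor, completion_net, n_views: int = 2):
    """partial: (B, N, 3)。每个视角:旋转 -> 补全 -> 旋回原坐标系,再两两做 Chamfer 一致性。"""
    B = partial.shape[0]
    completions = []
    for _ in range(n_views):
        R = random_rotation_z(B)
        rotated = torch.bmm(partial, R.transpose(1, 2))
        comp = completion_net(rotated)           # (B, M, 3)
        completions.append(torch.bmm(comp, R))   # 旋回原坐标系
    loss = 0.0
    for i in range(n_views):
        for j in range(i + 1, n_views):
            loss = loss + chamfer(completions[i], completions[j])
    return loss

if __name__ == "__main__":
    net = lambda x: x.repeat(1, 2, 1) + 0.01 * torch.randn(x.shape[0], x.shape[1] * 2, 3)
    pts = torch.rand(4, 512, 3)
    print(float(multiview_consistency_loss(pts, net)))
```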
[CV-64] Guidance Watermarking for Diffusion Models
【速读】:该论文旨在解决扩散模型(diffusion models)中生成内容的版权归属与可追溯性问题,即如何在不显著影响图像质量与多样性的情况下,实现鲁棒且高效的水印嵌入。其解决方案的关键在于引入一种基于梯度引导的水印嵌入机制:利用任意现成的水印解码器(watermark decoder)计算梯度,并将该梯度信息融入扩散过程的每一步迭代中,从而在生成过程中直接嵌入水印信号。该方法无需重新训练或微调模型,且通过引入图像增强策略提升对攻击的鲁棒性,同时兼容现有基于变分自编码器(VAE)末端修改的水印技术,实现更全面的内容溯源能力。
链接: https://arxiv.org/abs/2509.22126
作者: Enoal Gesny,Eva Giboulot,Teddy Furon,Vivien Chappelier
机构: Univ. Rennes, Inria, CNRS, IRISA; LABEL4.AI
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper introduces a novel watermarking method for diffusion models. It is based on guiding the diffusion process using the gradient computed from any off-the-shelf watermark decoder. The gradient computation encompasses different image augmentations, increasing robustness to attacks against which the decoder was not originally robust, without retraining or fine-tuning. Our method effectively converts any post-hoc watermarking scheme into an in-generation embedding along the diffusion process. We show that this approach is complementary to watermarking techniques modifying the variational autoencoder at the end of the diffusion process. We validate the method on different diffusion models and detectors. The watermarking guidance does not significantly alter the generated image for a given seed and prompt, preserving both the diversity and quality of generation.
zh
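为说明“用现成水印解码器的梯度引导扩散采样”的思路,下面给出一个高度简化的草图(非论文官方实现):每一步去噪后,对当前样本计算水印比特损失的梯度并按比例回推。其中 decoder 与 denoise_step 均为假设的占位函数,真实方法还需结合图像增强与具体扩散调度器。

```python
# 极简示意(非论文官方实现):类似 classifier guidance 的水印梯度引导采样
import torch
import torch.nn.functional as F

def watermark_guidance_grad(x: torch.Tensor, decoder, target_bits: torch.Tensor):
    with torch.enable_grad():                            # 采样外层可能处于 no_grad 中
        x = x.detach().requires_grad_(True)
        logits = decoder(x)                              # (B, n_bits)
        loss = F.binary_cross_entropy_with_logits(logits, target_bits)
        grad, = torch.autograd.grad(loss, x)
    return grad

@torch.no_grad()
def guided_sampling(x_T, denoise_step, decoder, target_bits, steps=50, scale=0.5):
    x = x_T
    for t in reversed(range(steps)):
        x = denoise_step(x, t)                           # 常规去噪一步
        g = watermark_guidance_grad(x, decoder, target_bits)
        x = x - scale * g                                # 沿降低水印损失的方向微调样本
    return x

if __name__ == "__main__":
    torch.manual_seed(0)
    n_bits = 32
    proj = torch.randn(3 * 64 * 64, n_bits)
    decoder = lambda x: x.flatten(1) @ proj              # 假设的线性“水印解码器”
    denoise = lambda x, t: 0.98 * x                      # 假设的去噪步
    bits = torch.randint(0, 2, (1, n_bits)).float()
    out = guided_sampling(torch.randn(1, 3, 64, 64), denoise, decoder, bits)
    print(out.shape)
```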
[CV-65] Large Material Gaussian Model for Relightable 3D Generation
【速读】:该论文旨在解决当前基于3D Gaussian Splatting的大规模重建模型(LRMs)在生成3D内容时缺乏物理材质属性(如反照率、粗糙度和金属度)的问题,这限制了其在复杂光照环境下的真实感渲染能力。解决方案的关键在于提出一种名为Large Material Gaussian Model (MGM)的新框架,该框架首先微调一个多视角材质扩散模型,以输入的深度图和法线图作为条件生成多视角PBR(Physically Based Rendering)图像;随后构建一种高斯材质表示方法,不仅与2D高斯点绘(Gaussian Splatting)兼容,还能分别建模PBR各通道属性;最终通过重建点云获取可动态 relighting 的PBR材质,从而实现更真实的渲染效果。
链接: https://arxiv.org/abs/2509.22112
作者: Jingrui Ye,Lingting Zhu,Runze Zhang,Zeyu Hu,Yingda Yin,Lanjiong Li,Lequan Yu,Qingmin Liao
机构: Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); The University of Hong Kong (香港大学); LIGHTSPEED; The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The increasing demand for 3D assets across various industries necessitates efficient and automated methods for 3D content creation. Leveraging 3D Gaussian Splatting, recent large reconstruction models (LRMs) have demonstrated the ability to efficiently achieve high-quality 3D rendering by integrating multiview diffusion for generation and scalable transformers for reconstruction. However, existing models fail to produce the material properties of assets, which is crucial for realistic rendering in diverse lighting environments. In this paper, we introduce the Large Material Gaussian Model (MGM), a novel framework designed to generate high-quality 3D content with Physically Based Rendering (PBR) materials, i.e., albedo, roughness, and metallic properties, rather than merely producing RGB textures with uncontrolled light baking. Specifically, we first fine-tune a new multiview material diffusion model conditioned on input depth and normal maps. Utilizing the generated multiview PBR images, we explore a Gaussian material representation that not only aligns with 2D Gaussian Splatting but also models each channel of the PBR materials. The reconstructed point clouds can then be rendered to acquire PBR attributes, enabling dynamic relighting by applying various ambient light maps. Extensive experiments demonstrate that the materials produced by our method not only exhibit greater visual appeal compared to baseline methods but also enhance material modeling, thereby enabling practical downstream rendering applications.
zh
[CV-66] SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection ACM-MM
链接: https://arxiv.org/abs/2509.22070
作者: Inzamamul Alam,Md Tanvir Islam,Simon S. Woo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM MM Accepted
[CV-67] High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling
链接: https://arxiv.org/abs/2509.22063
作者: Chao Huang,Susan Liang,Yapeng Tian,Anurag Kumar,Chenliang Xu
机构: University of Rochester (罗切斯特大学); University of Texas at Dallas (德克萨斯大学达拉斯分校); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: Accepted to IJCV
[CV-68] Enriching Knowledge Distillation with Intra-Class Contrastive Learning
链接: https://arxiv.org/abs/2509.22053
作者: Hua Yuan,Ning Xu,Xin Geng,Yong Rui
机构: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-69] EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking ICCV2025
链接: https://arxiv.org/abs/2509.22019
作者: Yuki Sakai,Ryosuke Furuta,Juichun Yen,Yoichi Sato
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the I-HFM Workshop at ICCV 2025
[CV-70] Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在医疗机器人场景中面临的局限性,包括时间推理能力不足、不确定性估计缺失以及缺乏结构化输出以支持机器人规划的问题。解决方案的关键在于提出一个轻量级的代理式多模态框架,其核心由Qwen2.5-VL-3B-Instruct模型与基于SmolAgent的编排层结合构成,实现了链式思维推理(chain-of-thought reasoning)、视听融合(speech-vision fusion)及动态工具调用,并通过结构化场景图生成和混合检索模块实现可解释且自适应的推理过程,从而显著提升了在视频理解任务中的准确性和鲁棒性。
链接: https://arxiv.org/abs/2509.22014
作者: Saurav Jha,Stefan K. Ehrlich
机构: SETLabs Resarch GmbH(SETLabs研究有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: 11 pages, 3 figures
点击查看摘要
Abstract:Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.
zh
[CV-71] CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
链接: https://arxiv.org/abs/2509.22010
作者: Xinyu Zhang,Yuxuan Dong,Lingling Zhang,Chengyou Jia,Zhuohang Dang,Basura Fernando,Jun Liu,Mike Zheng Shou
机构: Xi’an Jiaotong University (西安交通大学); Agency for Science, Technology and Research, Singapore (新加坡科技研究局); National University of Singapore (新加坡国立大学); Nanyang Technological University, Singapore (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-72] Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors
链接: https://arxiv.org/abs/2509.21997
作者: Youxu Shi,Suorong Yang,Dong Liu
机构: University of Science and Technology of China (中国科学技术大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-73] FailureAtlas:Mapping the Failure Landscape of T2I Models via Active Exploration
【速读】:该论文旨在解决静态基准测试在评估文本到图像(Text-to-Image, T2I)生成模型时诊断能力不足的问题,即难以全面揭示系统性失败模式及其根本原因。为应对这一挑战,作者提出了一种新的主动探索范式,并设计了FailureAtlas框架,其核心创新在于将错误发现建模为对最小失败诱导概念的结构化搜索问题。通过引入新颖的加速技术,该方法有效缓解了计算爆炸难题,在Stable Diffusion模型上成功识别出数十万种此前未知的失败切片(如SD1.5中超过247,000个),并首次提供了大规模证据表明这些失败与训练数据稀缺性密切相关。FailureAtlas为生成式AI模型的深度审计提供了一个原则性强且可扩展的工具,确立了以诊断为导向的新研发路径。
链接: https://arxiv.org/abs/2509.21995
作者: Muxi Chen,Zhaohua Zhang,Chenchen Zhao,Mingyang Chen,Wenyu Jiang,Tianwen Jiang,Jianhuan Zhuo,Yu Tang,Qiuyong Xiao,Jihong Zhang,Qiang Xu
机构: The Chinese University of Hong Kong (香港中文大学); Tencent (腾讯); Nanjing University (南京大学); Dalian University of Technology (大连理工大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Static benchmarks have provided a valuable foundation for comparing Text-to-Image (T2I) models. However, their passive design offers limited diagnostic power, struggling to uncover the full landscape of systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration. We introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscape of T2I models at scale. FailureAtlas frames error discovery as a structured search for minimal, failure-inducing concepts. While it is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at this https URL
zh
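“把错误发现建模为对最小失败诱导概念的结构化搜索”可以用一个带剪枝的逐层组合搜索来直观理解。下面的纯 Python 草图(非论文官方实现)演示这一骨架:只扩展仍然“通过”的组合,一旦某组合失败就不再扩展其超集;evaluate_prompt 为假设的评测函数,真实系统中应调用 T2I 模型并自动判分,论文中的加速技术此处未包含。

```python
# 极简示意(非论文官方实现):搜索最小的失败诱导概念组合,并做超集剪枝
def find_minimal_failures(concepts, evaluate_prompt, max_size=3, threshold=0.5):
    failures = []          # 已发现的最小失败概念组合
    survivors = [()]       # 上一层仍“通过”的组合,只有它们才值得继续扩展
    for size in range(1, max_size + 1):
        next_survivors = []
        for combo in {tuple(sorted(set(s) | {c})) for s in survivors for c in concepts}:
            if len(combo) != size:
                continue
            if any(set(f) <= set(combo) for f in failures):
                continue                      # 超集剪枝:已包含某个最小失败组合
            if evaluate_prompt(combo) < threshold:
                failures.append(combo)        # 新发现的最小失败组合
            else:
                next_survivors.append(combo)
        survivors = next_survivors
    return failures

if __name__ == "__main__":
    concepts = ["a red cube", "glass material", "underwater", "low light", "motion blur"]
    # 假设的评测器:用固定规则模拟“某些概念组合会让模型失败”
    def fake_eval(combo):
        return 0.2 if {"glass material", "underwater"} <= set(combo) else 0.9
    for f in find_minimal_failures(concepts, fake_eval):
        print("failure slice:", f)
```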
[CV-74] Rate-Distortion Optimized Communication for Collaborative Perception
链接: https://arxiv.org/abs/2509.21994
作者: Genjia Liu,Anning Hu,Yue Hu,Wenjun Zhang,Siheng Chen
机构: Shanghai Jiao Tong University (上海交通大学); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-75] DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints NEURIPS2025
【速读】:该论文旨在解决深度聚焦(Depth-from-Focus, DFF)在复杂场景中因细纹理或突变深度导致焦点线索模糊或误导的问题,从而提升深度估计的准确性与鲁棒性。其解决方案的关键在于提出DualFocus框架,通过联合建模焦点堆栈在空间和焦距维度上的梯度变化模式,引入一种带有双重约束的变分公式:空间约束利用不同焦距下梯度模式的变化来区分真实深度边缘与纹理伪影,焦距约束则强制焦点概率满足单峰且单调的物理特性,从而增强模型对挑战性区域的判别能力。
链接: https://arxiv.org/abs/2509.21992
作者: Sungmin Woo,Sangyoun Lee
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Depth-from-Focus (DFF) enables precise depth estimation by analyzing focus cues across a stack of images captured at varying focal lengths. While recent learning-based approaches have advanced this field, they often struggle in complex scenes with fine textures or abrupt depth changes, where focus cues may become ambiguous or misleading. We present DualFocus, a novel DFF framework that leverages the focal stack’s unique gradient patterns induced by focus variation, jointly modeling focus changes over spatial and focal dimensions. Our approach introduces a variational formulation with dual constraints tailored to DFF: spatial constraints exploit gradient pattern changes across focus levels to distinguish true depth edges from texture artifacts, while focal constraints enforce unimodal, monotonic focus probabilities aligned with physical focus behavior. These inductive biases improve robustness and accuracy in challenging regions. Comprehensive experiments on four public datasets demonstrate that DualFocus consistently outperforms state-of-the-art methods in both depth accuracy and perceptual quality.
zh
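针对“焦点概率需单峰且单调”这一物理约束,下面给出一个可能的损失函数写法作为示意(非论文官方实现):对每个像素沿焦点维度的概率分布,惩罚峰值之前的下降与峰值之后的上升。论文中的空间约束与完整变分形式此处从略。

```python
# 极简示意(非论文官方实现):焦点概率的“单峰且单调”约束损失
import torch
import torch.nn.functional as F

def unimodal_focal_loss(focus_probs: torch.Tensor) -> torch.Tensor:
    """focus_probs: (B, S, H, W),S 为焦点堆栈切片数,值为像素级焦点概率。"""
    B, S, H, W = focus_probs.shape
    peak = focus_probs.argmax(dim=1, keepdim=True)              # (B, 1, H, W)
    idx = torch.arange(S, device=focus_probs.device).view(1, S, 1, 1)
    diff = focus_probs[:, 1:] - focus_probs[:, :-1]             # 相邻切片概率差 (B, S-1, H, W)
    before_peak = (idx[:, 1:] <= peak).float()                  # 差分位置是否处于峰值之前
    # 峰前要求 diff >= 0(惩罚负增量),峰后要求 diff <= 0(惩罚正增量)
    return (F.relu(-diff) * before_peak + F.relu(diff) * (1 - before_peak)).mean()

if __name__ == "__main__":
    probs = torch.softmax(torch.randn(2, 10, 32, 32), dim=1)
    print(float(unimodal_focal_loss(probs)))
```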
[CV-76] WAVE: Learning Unified Versatile Audio-Visual Embeddings with Multimodal LLM
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态模态(如音频和视频)上应用不足的问题,尤其是缺乏统一的文本-音频-视频嵌入表示空间。其解决方案的关键在于提出WAVE(unified versatile audio-visual embeddings),通过一种新颖的分层特征融合策略与联合多模态、多任务训练方法,实现任意模态间的跨模态检索(any-to-any cross-modal retrieval)以及根据用户指令生成提示感知嵌入(prompt-aware embeddings)。该设计显著提升了音频与视频到音频的检索性能,并在多模态问答任务中超越现有嵌入模型,同时在MMEB-v2视频基准测试中达到新的最先进水平。
链接: https://arxiv.org/abs/2509.21990
作者: Changli Tang,Qinfan Xiao,Ke Mei,Tianyi Wang,Fengyun Rao,Chao Zhang
机构: Tsinghua University (清华大学); WeChat Vision, Tencent Inc. (腾讯公司微信视觉团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:
点击查看摘要
Abstract:While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
zh
[CV-77] Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation NEURIPS2025
【速读】:该论文旨在解决预训练扩散模型(diffusion models)中视觉特征(visual features)与语义特征(semantic features)难以分离的问题,从而实现对主体驱动图像生成过程中视觉不一致性的量化与定位。当前方法通常依赖全局特征(如CLIP、DINO或视觉-语言模型)进行评估,但无法精确定位不一致区域。其解决方案的关键在于:首先构建一个自动化的数据处理流水线,基于现有主体驱动图像生成数据集生成带有语义和视觉对应关系标注的图像对;其次设计一种对比学习架构,有效分离出视觉与语义特征表示;最终提出视觉语义匹配(Visual Semantic Matching, VSM)指标,不仅优于现有全局特征方法在量化视觉不一致性上的表现,还能实现空间级的不一致区域定位,是首个同时支持量化与定位的评估工具。
链接: https://arxiv.org/abs/2509.21989
作者: Abdelrahman Eldesokey,Aleksandar Cvejic,Bernard Ghanem,Peter Wonka
机构: KAUST(国王阿卜杜拉大学科技学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 (Spotlight). Project Page: this https URL
点击查看摘要
Abstract:We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision–language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project Page:this https URL
zh
[CV-78] Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
【速读】:该论文旨在解决智能眼镜场景下,基于视觉-语言模型(Vision-Language Models, VLMs)进行多模态查询时,因用户注视数据(gaze data)存在噪声和时空复杂性而导致的语义模糊问题,以及现有方法仅使用静态图像作为视觉输入、无法捕捉用户注意力动态变化的局限。其解决方案的关键在于提出GLARIFY框架:首先通过分析大量带注视信息的查询样本揭示注视模式的噪声特性;其次利用GPT-4o构建自动数据合成流程生成GLARIFY-Ambi数据集,并引入链式思维(Chain-of-Thought, CoT)机制处理噪声;最后设计热力图模块将时空注视信息有效融合进先进VLMs中,同时保留预训练知识,从而实现对人类注意力的鲁棒对齐,显著提升真实场景下的交互效果。
链接: https://arxiv.org/abs/2509.21980
作者: Zeyu Wang,Baiyu Chen,Kun Yan,Hongjing Piao,Hao Xue,Flora D. Salim,Yuanchun Shi,Yuntao Wang
机构: Tsinghua University (清华大学); The University of New South Wales (新南威尔士大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the rise in popularity of smart glasses, users’ attention has been integrated into Vision-Language Models (VLMs) to streamline multi-modal querying in daily scenarios. However, leveraging gaze data to model users’ attention may introduce ambiguity challenges: (1) users’ verbal questions become ambiguous by using pronouns or skipping context, (2) humans’ gaze patterns can be noisy and exhibit complex spatiotemporal relationships with their spoken questions. Previous works only consider single image as visual modality input, failing to capture the dynamic nature of the user’s attention. In this work, we introduce GLARIFY, a novel method to leverage spatiotemporal gaze information to enhance the model’s effectiveness in real-world applications. Initially, we analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users’ gaze patterns. We then utilized GPT-4o to design an automatic data synthesis pipeline to generate the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process to handle noisy gaze patterns. Finally, we designed a heatmap module to incorporate gaze information into cutting-edge VLMs while preserving their pretrained knowledge. We evaluated GLARIFY using a hold-out test set. Experiments demonstrate that GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.
zh
[CV-79] Benchmarking and Mitigate Psychological Sycophancy in Medical Vision-Language Models
链接: https://arxiv.org/abs/2509.21979
作者: Zikun Guo,Xinyue Xu,Pei Xiang,Shu Yang,Xin Han,Di Wang,Lijie Hu
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Hong Kong University of Science and Technology (香港科技大学); King Abdullah University of Science and Technology (阿卜杜拉国王科学技术大学); Xidian University (西安电子科技大学); Zhejiang A&F University (浙江农林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19figures, 37pages
[CV-80] Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
链接: https://arxiv.org/abs/2509.21976
作者: Zilun Zhang,Zian Guan,Tiancheng Zhao,Haozhan Shen,Tianyu Li,Yuxiang Cai,Zhonggen Su,Zhaojun Liu,Jianwei Yin,Xiang Li
机构: Zhejiang University (浙江大学); China Academy of Space Technology (中国航天科技集团公司); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-81] No-Reference Image Contrast Assessment with Customized EfficientNet-B0
【速读】:该论文旨在解决无参考图像质量评估(No Reference Image Quality Assessment, NR IQA)中对真实场景下对比度失真(contrast distortions)感知评估准确性不足的问题。现有方法在多样现实条件下难以有效捕捉人类视觉系统对对比度变化的敏感性,导致评分与主观感知不一致。解决方案的关键在于:通过定制化微调三种轻量级预训练网络(EfficientNet B0、ResNet18 和 MobileNetV2),并引入对比度感知回归头(contrast-aware regression head),结合针对对比度失真的数据增强策略,在 CID2013 和 CCID2014 两个基准数据集上进行端到端训练,从而显著提升模型对感知对比度失真的建模能力。实验表明,优化后的 EfficientNet B0 模型在 CCID2014 和 CID2013 上分别达到 PLCC = 0.9286 / SRCC = 0.9178 和 PLCC = 0.9581 / SRCC = 0.9369,优于传统方法及其它深度基线,验证了对比度感知适应机制在资源受限和实时应用中的有效性与鲁棒性。
链接: https://arxiv.org/abs/2509.21967
作者: Javad Hassannataj Joloudari,Bita Mesbahzadeh,Omid Zare,Emrah Arslan,Roohallah Alizadehsani,Hossein Moosaei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 32 pages, 9 tables, 6 figures
点击查看摘要
Abstract:Image contrast is a fundamental factor in visual perception and plays a vital role in overall image quality. However, most no-reference image quality assessment (NR-IQA) models struggle to accurately evaluate contrast distortions under diverse real-world conditions. In this study, we propose a deep learning based framework for blind contrast quality assessment by customizing and fine-tuning three pre-trained architectures, EfficientNet-B0, ResNet18, and MobileNetV2, for perceptual Mean Opinion Score prediction, along with an additional model built on a Siamese network, which showed a limited ability to capture perceptual contrast distortions. Each model is modified with a contrast-aware regression head and trained end to end using targeted data augmentations on two benchmark datasets, CID2013 and CCID2014, containing synthetic and authentic contrast distortions. Performance is evaluated using the Pearson Linear Correlation Coefficient and the Spearman Rank Order Correlation Coefficient, which assess the alignment between predicted and human-rated scores. Among these three models, our customized EfficientNet-B0 model achieved state-of-the-art performance with PLCC = 0.9286 and SRCC = 0.9178 on CCID2014 and PLCC = 0.9581 and SRCC = 0.9369 on CID2013, surpassing traditional methods and outperforming other deep baselines. These results highlight the model's robustness and effectiveness in capturing perceptual contrast distortion. Overall, the proposed method demonstrates that contrast-aware adaptation of lightweight pre-trained networks can yield a high-performing, scalable solution for no-reference contrast quality assessment suitable for real-time and resource-constrained applications.
zh
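“在预训练骨干上替换对比度感知回归头”这一步可以用 torchvision 很直接地搭出来。下面是一个最小示例(非论文官方实现,回归头的层数与维度为此处假设):将 EfficientNet-B0 的分类头换成输出单一 MOS 分数的回归头,并用 MSE 损失训练;加载 ImageNet 预训练权重的写法需要 torchvision 0.13 及以上版本。

```python
# 极简示意(非论文官方实现):EfficientNet-B0 + 对比度感知回归头,直接回归 MOS 分数
import torch
import torch.nn as nn
from torchvision import models

def build_contrast_iqa_model() -> nn.Module:
    # 实践中可传入 weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1 加载预训练权重
    model = models.efficientnet_b0(weights=None)
    in_features = model.classifier[1].in_features        # 默认分类头: Dropout + Linear(1280, 1000)
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.3),
        nn.Linear(in_features, 256),
        nn.ReLU(inplace=True),
        nn.Linear(256, 1),                                # 输出标量质量分
    )
    return model

if __name__ == "__main__":
    net = build_contrast_iqa_model()
    scores = net(torch.randn(4, 3, 224, 224))             # (4, 1)
    loss = nn.functional.mse_loss(scores.squeeze(1), torch.tensor([3.2, 4.1, 2.5, 3.8]))
    print(scores.shape, float(loss))
```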
[CV-82] PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data
链接: https://arxiv.org/abs/2509.21965
作者: Zhe Zhu,Le Wan,Rui Xu,Yiheng Zhang,Honghua Chen,Zhiyang Dou,Cheng Lin,Yuan Liu,Mingqiang Wei
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Hong Kong University of Science and Technology (香港科技大学); The University of Hong Kong (香港大学); National University of Singapore (新加坡国立大学); Lingnan University (岭南大学); Macau University of Science and Technology (澳门科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-83] MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning ICRA
链接: https://arxiv.org/abs/2509.21953
作者: Tao Wu,Yibo Jiang,Yehao Lu,Zhizhong Wang,Zeyi Huang,Zequn Qin,Xi Li
机构: College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); School of Software Technology, Zhejiang University (浙江大学软件技术学院); Huawei Technologies Ltd (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
[CV-84] Customizing Visual Emotion Evaluation for MLLM s: An Open-vocabulary Multifaceted and Scalable Approach
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像情绪感知能力评估中存在的不一致性问题,其根源在于现有评估方法存在诸多局限,如忽视合理但非标准的回答、情感分类体系单一、忽略情境因素以及人工标注成本高等。解决方案的关键在于提出一种名为“情绪陈述判断”(Emotion Statement Judgment)的新任务范式,该范式能够更全面地衡量模型对图像情绪的理解能力;同时设计了一个自动化流程,可高效构建以情绪为中心的陈述句,显著降低人工标注负担。这一框架为定制化视觉情绪评估提供了可靠基础,并揭示了MLLMs在情境依赖情绪判断上的优势及对感知主观性理解的不足。
链接: https://arxiv.org/abs/2509.21950
作者: Daiqing Wu,Dongbao Yang,Sicheng Zhao,Can Ma,Yu Zhou
机构: IIE, Chinese Academy of Sciences (中国科学院信息工程研究所); Nankai University (南开大学); Tsinghua University (清华大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: this https URL.
zh
[CV-85] SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet BMVC2025
链接: https://arxiv.org/abs/2509.21938
作者: Woosung Joung,Daewon Chae,Jinkyu Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: BMVC 2025
[CV-86] DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation NEURIPS2025
链接: https://arxiv.org/abs/2509.21930
作者: Jiahui Wang,Changhao Chen
机构: National University of Singapore (新加坡国立大学); The Hong Kong University of Science and Technology (广州) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted as a poster in NeurIPS 2025
[CV-87] SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference NEURIPS2025
链接: https://arxiv.org/abs/2509.21927
作者: Jiahui Wang,Haiyue Zhu,Haoren Guo,Abdullah Al Mamun,Cheng Xiang,Tong Heng Lee
机构: National University of Singapore (新加坡国立大学); Agency for Science, Technology and Research (A*STAR) (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a poster in NeurIPS 2025
[CV-88] PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning
链接: https://arxiv.org/abs/2509.21926
作者: Jiahao Zhang,Bowen Wang,Hong Liu,Yuta Nakashima,Hajime Nagahara
机构: D3 Center, The University of Osaka (大阪大学D3中心); SANKEN, The University of Osaka (大阪大学SANKEN研究所); Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 12 figures
[CV-89] Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding NEURIPS
链接: https://arxiv.org/abs/2509.21922
作者: Vahid Mirjalili,Ramin Giahi,Sriram Kollipara,Akshay Kekuda,Kehui Yao,Kai Zhao,Jianpeng Xu,Kaushiki Nag,Sinduja Subramaniam,Topojoy Biswas,Evren Korpeoglu,Kannan Achan
机构: Walmart Global Tech (沃尔玛全球科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, NeurIPS Workshop SpaVLE
[CV-90] Multi-View Crowd Counting With Self-Supervised Learning
链接: https://arxiv.org/abs/2509.21918
作者: Hong Mo,Xiong Zhang,Tengfei Shi,Zhongbo Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-91] Taming Flow-based I2V Models for Creative Video Editing
链接: https://arxiv.org/abs/2509.21917
作者: Xianghao Kong,Hansheng Chen,Yuwei Guo,Lvmin Zhang,Gordon Wetzstein,Maneesh Agrawala,Anyi Rao
机构: HKUST(香港科技大学); Stanford University (斯坦福大学); CUHK(香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
[CV-92] Enhancing Vehicle Detection under Adverse Weather Conditions with Contrastive Learning
【速读】:该论文旨在解决无人机(UAV)图像中车辆检测在北欧地区因积雪覆盖导致的可见度下降和域偏移问题,同时应对标注数据稀缺与计算资源受限的挑战。其解决方案的关键在于提出了一种“侧载对比学习适配”(sideload-CL-adaptation)框架:首先在预训练阶段利用未标注数据通过对比学习(contrastive learning, CL)训练一个轻量级卷积神经网络(CNN)表示提取器,随后将该提取器以“侧载”方式注入冻结的YOLO11n骨干网络中进行微调,从而有效利用廉价获取的未标注数据提升模型在目标域上的检测性能(mAP50提升3.8%–9.5%)。
链接: https://arxiv.org/abs/2509.21916
作者: Boying Li,Chang Liu,Petter Kyösti,Mattias Öhman,Devashish Singha Roy,Sofia Plazzi,Hamam Mokayed,Olle Hagner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Aside from common challenges in remote sensing like small, sparse targets and computation cost limitations, detecting vehicles from UAV images in the Nordic regions faces strong visibility challenges and domain shifts caused by diverse levels of snow coverage. Although annotated data are expensive, unannotated data is cheaper to obtain by simply flying the drones. In this work, we proposed a sideload-CL-adaptation framework that enables the use of unannotated data to improve vehicle detection using lightweight models. Specifically, we propose to train a CNN-based representation extractor through contrastive learning on the unannotated data in the pretraining stage, and then sideload it to a frozen YOLO11n backbone in the fine-tuning stage. To find a robust sideload-CL-adaptation, we conducted extensive experiments to compare various fusion methods and granularity. Our proposed sideload-CL-adaptation model improves the detection performance by 3.8% to 9.5% in terms of mAP50 on the NVD dataset.
zh
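“先用对比学习在未标注数据上预训练轻量 CNN 表示提取器”这一阶段,常见做法是 SimCLR 风格的 NT-Xent 损失。下面的 PyTorch 草图(非论文官方实现)给出一个最小版本:编码器结构、增强方式均为此处假设的简化设定,预训练完成后的编码器再按论文所述“侧载”进冻结的 YOLO11n 主干。

```python
# 极简示意(非论文官方实现):轻量 CNN 编码器的 NT-Xent 对比预训练
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x):
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def nt_xent(z1, z2, temperature=0.2):
    """z1, z2: 同一批图像两种增强视图的归一化嵌入,形状 (B, D)。"""
    B = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)                        # (2B, D)
    sim = z @ z.t() / temperature
    mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))            # 去掉与自身的相似度
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)

if __name__ == "__main__":
    enc = SmallEncoder()
    x = torch.randn(8, 3, 96, 96)
    v1, v2 = x + 0.1 * torch.randn_like(x), torch.flip(x, dims=[-1])   # 两个简化的“增强视图”
    loss = nt_xent(enc(v1), enc(v2))
    loss.backward()
    print(float(loss))
```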
[CV-93] TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation
链接: https://arxiv.org/abs/2509.21905
作者: Qihang Wang,Yaxiong Wang,Lechao Cheng,Zhun Zhong
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-94] Closing the Oracle Gap: Increment Vector Transformation for Class Incremental Learning
链接: https://arxiv.org/abs/2509.21898
作者: Zihuan Qiu,Yi Xu,Fanman Meng,Runtong Zhang,Linfeng Xu,Qingbo Wu,Hongliang Li
机构: University of Electronic Science and Technology of China (电子科技大学); Dalian University of Technology (大连理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-95] LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation
链接: https://arxiv.org/abs/2509.21894
作者: Yixiao Liu(1),Yizhou Yang(1),Jinwen Li(2),Jun Tao(1),Ruoyu Li(1),Xiangkun Wang(1),Min Zhu(1),Junlong Cheng(1) ((1) College of Computer Science, Sichuan University, China, (2) School of Computer Science and Technology, Xinjiang University, China)
机构: Sichuan University (四川大学); Xinjiang University (新疆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: *Corresponding authors: Min Zhu and Junlong Cheng
[CV-96] Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers
【速读】:该论文旨在解决当前文本到视频(text-to-video)和图像到视频(image-to-video)生成模型在运动时间控制上的精度不足问题,即难以精确调控视频中动作发生的时机。尽管音频能提供与视频运动对齐的时间线索,但现有音频到视频(audio-to-video, A2V)模型因间接条件控制机制或有限的时序建模能力,仍存在细粒度同步困难。其解决方案的关键在于提出Syncphony框架,通过两个核心组件提升同步性能:(1) 运动感知损失(Motion-aware Loss),强化高运动区域的学习;(2) 音频同步引导(Audio Sync Guidance),利用一个无音频层但视觉对齐的离线模型在推理阶段指导整体模型,从而更有效地利用音频线索并保持视频质量。
链接: https://arxiv.org/abs/2509.21893
作者: Jibin Song,Mingi Kwon,Jaeseok Jeong,Youngjung Uh
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: this https URL
zh
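Motion-aware Loss“强调高运动区域”的思想,可以用相邻真值帧的差分幅度作为逐像素权重来近似实现。下面给出一个示意性写法(非论文官方实现,权重的归一化与系数 alpha 均为假设),Audio Sync Guidance 部分此处未涉及。

```python
# 极简示意(非论文官方实现):按真值帧间运动幅度加权的重建损失
import torch

def motion_aware_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 4.0):
    """pred/target: (B, T, C, H, W) 的视频张量。"""
    motion = (target[:, 1:] - target[:, :-1]).abs().mean(dim=2, keepdim=True)   # (B, T-1, 1, H, W)
    motion = torch.cat([motion[:, :1], motion], dim=1)                          # 首帧沿用下一帧的权重
    weight = 1.0 + alpha * motion / (motion.amax(dim=(2, 3, 4), keepdim=True) + 1e-6)
    return (weight * (pred - target) ** 2).mean()

if __name__ == "__main__":
    t = torch.rand(2, 8, 3, 32, 32)
    p = t + 0.05 * torch.randn_like(t)
    print(float(motion_aware_loss(p, t)))
```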
[CV-97] Drag4D: Align Your Motion with Text-Driven 3D Scene Generation
链接: https://arxiv.org/abs/2509.21888
作者: Minjun Kang,Inkyu Shin,Taeyeop Lee,In So Kweon,Kuk-Jin Yoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: version 1
[CV-98] StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing
链接: https://arxiv.org/abs/2509.21887
作者: Liyang Chen,Tianze Zhou,Xu He,Boshi Tang,Zhiyong Wu,Yang Huang,Yang Wu,Zhongqian Sun,Wei Yang,Helen Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
[CV-99] Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization
链接: https://arxiv.org/abs/2509.21871
作者: Boyang Liu,Yifan Hu,Senjie Jin,Shihan Dou,Gonglei Shi,Jie Shao,Tao Gui,Xuanjing Huang
机构: Fudan University (复旦大学); Tsinghua University (清华大学); Bytedance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-100] Deepfakes: we need to re-think the concept of “real” images
链接: https://arxiv.org/abs/2509.21864
作者: Janis Keuper,Margret Keuper
机构: Offenburg University (奥芬堡大学); University of Mannheim (曼海姆大学); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-101] SRHand: Super-Resolving Hand Images and 3D Shapes via View/Pose-aware Neural Image Representations and Explicit 3D Meshes
链接: https://arxiv.org/abs/2509.21859
作者: Minje Kim,Tae-Kyun Kim
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures
[CV-102] Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization
【速读】:该论文旨在解决多模态大语言模型在融合视觉感知与符号推理任务中因感知错误传播而导致性能下降的问题,尤其是现有强化学习(Reinforcement Learning, RL)微调方法未能有效缓解视觉定位(visual grounding)与后续推理过程之间的错位问题。解决方案的关键在于提出一种名为Caption-Regularized Policy Optimization (CapPO) 的新型强化学习框架,其核心机制包括:(1) 基于图像描述(caption)的一致性正则化,通过最小化原始图像和对应描述条件下的响应差异,确保推理锚定于语义一致的视觉内容;(2) 基于KL散度加权的优势估计策略,自适应地调整强化信号,强化感知一致的决策路径并抑制虚假相关性。该方法显著提升了模型在数学和通用推理任务中的准确率,并有效减少了感知相关的错误。
链接: https://arxiv.org/abs/2509.21854
作者: Songjun Tu,Qichao Zhang,Jingbo Sun,Yuqian Fu,Linjing Li,Xiangyuan Lan,Dongmei Jiang,Yaowei Wang,Dongbin Zhao
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多模态人工智能系统重点实验室); Pengcheng Laboratory (鹏城实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: 12pages, 11 figures
点击查看摘要
Abstract:While multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning, their performance is often undermined by a critical vulnerability: perception-induced errors that propagate through the reasoning chain. Current reinforcement learning (RL) fine-tuning methods, while enhancing reasoning abilities, largely fail to address the underlying misalignment between visual grounding and the subsequent reasoning process. To address this challenge, we propose \textbfCaption-Regularized Policy Optimization (CapPO), a novel RL framework that explicitly enforces perceptual consistency during policy optimization. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, thereby anchoring reasoning to semantically faithful visual content; and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories while suppressing spurious correlations. Extensive experiments on five math-focused and five general reasoning benchmarks demonstrate that CapPO achieves competitive performance, yielding gains of +6.0% accuracy on math-related tasks and +2.4% on general reasoning tasks over the base Qwen2.5-VL-7B model. Moreover, ablation studies further confirm the effectiveness of each component, while error analysis reveals that CapPO significantly reduces perception-related mistakes compared with baselines. Overall, CapPO provides a simple yet effective framework for improving multimodal reasoning.
zh
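下面用一个玩具化的 PyTorch 草图(非论文官方实现)示意 CapPO 的两个机制:以图像为条件和以 caption 为条件的输出分布之间的 KL 一致性正则,以及用该 KL 对优势进行加权、削弱感知不一致轨迹的强化信号。张量形状、加权形式与系数均为此处假设的简化设定。

```python
# 极简示意(非论文官方实现):caption 一致性 KL 正则 + KL 加权优势
import torch
import torch.nn.functional as F

def caption_consistency_kl(logits_img, logits_cap):
    """logits_*: (B, L, V),同一回答序列在“图像条件”与“caption 条件”下的词表 logits。"""
    logp_img = F.log_softmax(logits_img, dim=-1)
    p_cap = F.softmax(logits_cap, dim=-1)
    kl = (p_cap * (p_cap.clamp_min(1e-8).log() - logp_img)).sum(-1)   # (B, L) 的 token 级 KL
    return kl.mean(dim=-1)                                            # (B,)

def cappo_objective(logits_img, logits_cap, advantages, beta=0.1, tau=1.0):
    kl = caption_consistency_kl(logits_img, logits_cap)               # (B,)
    weights = torch.exp(-kl.detach() / tau)                           # KL 越大,强化信号越弱
    weighted_adv = weights * advantages
    reg = beta * kl.mean()                                            # 感知一致性正则项
    return weighted_adv, reg

if __name__ == "__main__":
    li, lc = torch.randn(4, 16, 1000), torch.randn(4, 16, 1000)
    adv = torch.randn(4)
    wa, reg = cappo_objective(li, lc, adv)
    print(wa.shape, float(reg))
```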
[CV-103] Dynamic Novel View Synthesis in High Dynamic Range
链接: https://arxiv.org/abs/2509.21853
作者: Kaixuan Zhang,Zhipeng Xiong,Minxian Li,Mingwu Ren,Jiankang Deng,Xiatian Zhu
机构: Nanjing University of Science and Technology (南京理工大学); Imperial College London (帝国理工学院); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-104] A Comprehensive Evaluation of Transformer-Based Question Answering Models and RAG -Enhanced Design
【速读】:该论文旨在解决多跳问答(multi-hop question answering)中因需跨多个文档片段整合证据而导致的挑战,尤其是在检索增强生成(retrieval-augmented generation, RAG)框架下的检索策略优化问题。其解决方案的关键在于提出一种混合检索方法,该方法融合密集嵌入(dense embeddings)与词法重叠(lexical overlap)特征,并引入重排序(re-ranking)机制,从而显著提升检索质量;同时结合高效RAG(EfficientRAG)管道中的查询优化技术,如标记(token labeling)和迭代精炼(iterative refinement),在保持效率的同时增强了实体召回率和证据互补性,最终在HotpotQA数据集上实现了相比余弦相似度基线50%的精确匹配提升和47%的F1分数提升。
链接: https://arxiv.org/abs/2509.21845
作者: Zichen Zhang,Kunlong Zhang,Hongwei Ruan,Yiming Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Transformer-based models have advanced the field of question answering, but multi-hop reasoning, where answers require combining evidence across multiple passages, remains difficult. This paper presents a comprehensive evaluation of retrieval strategies for multi-hop question answering within a retrieval-augmented generation framework. We compare cosine similarity, maximal marginal relevance, and a hybrid method that integrates dense embeddings with lexical overlap and re-ranking. To further improve retrieval, we adapt the EfficientRAG pipeline for query optimization, introducing token labeling and iterative refinement while maintaining efficiency. Experiments on the HotpotQA dataset show that the hybrid approach substantially outperforms baseline methods, achieving a relative improvement of 50 percent in exact match and 47 percent in F1 score compared to cosine similarity. Error analysis reveals that hybrid retrieval improves entity recall and evidence complementarity, while remaining limited in handling distractors and temporal reasoning. Overall, the results suggest that hybrid retrieval-augmented generation provides a practical zero-shot solution for multi-hop question answering, balancing accuracy, efficiency, and interpretability.
zh
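文中的混合检索可以概括为“稠密余弦相似度 + 词面重叠,线性加权后重排”。下面的示意代码(非论文官方实现)用随机向量代替真实编码器来演示这一打分与重排流程;alpha 权重、Jaccard 的分词方式等均为假设的简化设定,后续的重排序模型亦未包含。

```python
# 极简示意(非论文官方实现):稠密嵌入 + 词面 Jaccard 重叠的混合检索打分
import numpy as np

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def hybrid_retrieve(query, passages, embed, k=3, alpha=0.7):
    q = embed(query)
    scores = []
    for p in passages:
        v = embed(p)
        dense = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
        scores.append(alpha * dense + (1 - alpha) * jaccard(query, p))
    order = np.argsort(scores)[::-1][:k]
    return [(passages[i], scores[i]) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = {}
    def fake_embed(text):                      # 假设的编码器:按文本缓存固定随机向量
        if text not in cache:
            cache[text] = rng.standard_normal(64)
        return cache[text]
    docs = ["The Eiffel Tower is in Paris.",
            "HotpotQA requires multi-hop reasoning.",
            "Dense retrieval uses vector embeddings."]
    for doc, s in hybrid_retrieve("multi-hop reasoning over passages", docs, fake_embed):
        print(round(s, 3), doc)
```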
[CV-105] DiTraj: training-free trajectory control for video diffusion transformer
链接: https://arxiv.org/abs/2509.21839
作者: Cheng Lei,Jiayu Zhang,Yue Ma,Xinyu Wang,Long Chen,Liang Tang,Yiqiang Yan,Fei Su,Zhicheng Zhao
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Lenovo (联想); HKUST (香港科技大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-106] MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation
链接: https://arxiv.org/abs/2509.21797
作者: Yu Shang,Yangcheng Yu,Xin Zhang,Xin Jin,Haisheng Su,Wei Wu,Yong Li
机构: Tsinghua University (清华大学); Manifold AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures
[CV-107] LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE
【速读】:该论文旨在解决视频生成方法在长时程(long-horizon)场景下难以保持稳定性和一致性的难题,尤其是传统基于扩散模型的方法存在时间不一致性与视觉漂移问题,而自回归方法则常牺牲视觉细节质量。其解决方案的关键在于提出一种混合框架LongScape,通过动作引导的可变长度分块机制(action-guided, variable-length chunking)将视频按机器人动作语义上下文进行分割,确保每个片段代表一个完整且连贯的动作;同时引入上下文感知的专家混合(Context-aware Mixture-of-Experts, CMoE)机制,在生成过程中动态激活针对各片段的专用专家模块,从而实现高质量视觉输出与平滑的片段间过渡,最终保障长时间序列生成的稳定性与一致性。
链接: https://arxiv.org/abs/2509.21790
作者: Yu Shang,Lei Jin,Yiding Ma,Xin Zhang,Chen Gao,Wei Wu,Yong Li
机构: Tsinghua University (清华大学); Manifold AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures
点击查看摘要
Abstract:Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: this https URL.
zh
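“按动作语义把视频切成可变长度 chunk”可以抽象为:在动作标签变化的边界处切分,同时限制每个 chunk 的最小/最大长度。下面的纯 Python 草图(非论文官方实现)演示这一分块逻辑,过短的片段会并入相邻 chunk;真实系统中的边界应来自机器人动作序列,CMoE 部分此处未涉及。

```python
# 极简示意(非论文官方实现):按动作边界的可变长度分块
def action_guided_chunks(frame_actions, min_len=4, max_len=32):
    """frame_actions: 每帧对应的动作标签列表,返回 [(start, end), ...] 的左闭右开区间。"""
    chunks, start = [], 0
    for i in range(1, len(frame_actions) + 1):
        boundary = i == len(frame_actions) or frame_actions[i] != frame_actions[i - 1]
        too_long = i - start >= max_len
        if (boundary and i - start >= min_len) or too_long:
            chunks.append((start, i))
            start = i
    if start < len(frame_actions):                 # 末尾不足 min_len 的残段并入最后一个 chunk
        if chunks:
            chunks[-1] = (chunks[-1][0], len(frame_actions))
        else:
            chunks.append((start, len(frame_actions)))
    return chunks

if __name__ == "__main__":
    actions = ["reach"] * 6 + ["grasp"] * 3 + ["lift"] * 10 + ["place"] * 2
    print(action_guided_chunks(actions))   # 过短的 "grasp"/"place" 段被并入相邻 chunk
```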
[CV-108] Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
【速读】:该论文旨在解决多智能体系统(Multi-Agent System, MAS)中由视觉语言模型(Visual Language Models, VLMs)驱动时出现的“多智能体视觉幻觉雪球效应”(multi-agent visual hallucination snowballing)问题,即初始单个智能体产生的视觉幻觉会因后续智能体过度依赖文本流传递视觉信息而被逐层放大。解决方案的关键在于识别出在中间层具有单峰注意力峰值的视觉标记(vision tokens),这些标记能最好地保留视觉证据,但随智能体交互轮次加深逐渐衰减,从而导致幻觉扩散;为此,作者提出ViF(Visual Flow-based mitigation paradigm),通过选择特定视觉中继标记(visual relay tokens)构建视觉流(Visual Flow)来传递智能体间消息,并引入注意力重分配机制强化该模式,实现轻量、可插拔的幻觉抑制。
链接: https://arxiv.org/abs/2509.21789
作者: Xinlei Yu,Chengming Xu,Guibin Zhang,Yongbo He,Zhangquan Chen,Zhucun Xue,Jiangning Zhang,Yue Liao,Xiaobin Hu,Yu-Gang Jiang,Shuicheng Yan
机构: National University of Singapore (新加坡国立大学); Tencent Youtu Lab (腾讯优图实验室); Zhejiang University (浙江大学); Tsinghua University (清华大学); Fudan University (复旦大学)
类目: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure term, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code will be available at: this https URL.
zh
[CV-109] MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning
链接: https://arxiv.org/abs/2509.21788
作者: Lihao Zheng,Jiawei Chen,Xintian Shen,Hao Ma,Tao Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-110] Prompt-guided Representation Disentanglement for Action Recognition
链接: https://arxiv.org/abs/2509.21783
作者: Tianci Wu,Guangming Zhu,Jiang Lu,Siyuan Wang,Ning Wang,Nuoye Xiong,Zhang Liang
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-111] raining-Free Multimodal Deepfake Detection via Graph Reasoning
链接: https://arxiv.org/abs/2509.21774
作者: Yuxin Liu,Fei Wang,Kun Li,Yiqi Nie,Junjie Chen,Yanyan Wei,Zhangling Duan,Zhaohong Jia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
[CV-112] CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones
链接: https://arxiv.org/abs/2509.21764
作者: Wenyi Gong,Mieszko Lis
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-113] UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
链接: https://arxiv.org/abs/2509.21760
作者: Lan Chen,Yuchao Gu,Qi Mao
机构: MIPG, Communication University of China (中国传媒大学); Show Lab, National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-114] KG-SAM: Injecting Anatomical Knowledge into Segment Anything Models via Conditional Random Fields
【速读】:该论文旨在解决生成式 AI(Generative AI)在医学图像分割中的应用瓶颈问题,包括边界模糊、解剖关系建模不足以及缺乏不确定性量化。其核心解决方案是提出KG-SAM框架,关键在于三方面协同:(i)利用医学知识图谱编码精细的解剖学先验关系;(ii)引入基于能量的条件随机场(CRF)强制实现解剖一致性预测;(iii)设计不确定性感知融合模块以提升高风险临床场景下的可靠性。该方法显著提升了多中心医学影像数据上的分割性能,验证了其鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2509.21750
作者: Yu Li,Da Chang,Xi Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While the Segment Anything Model (SAM) has achieved remarkable success in image segmentation, its direct application to medical imaging remains hindered by fundamental challenges, including ambiguous boundaries, insufficient modeling of anatomical relationships, and the absence of uncertainty quantification. To address these limitations, we introduce KG-SAM, a knowledge-guided framework that synergistically integrates anatomical priors with boundary refinement and uncertainty estimation. Specifically, KG-SAM incorporates (i) a medical knowledge graph to encode fine-grained anatomical relationships, (ii) an energy-based Conditional Random Field (CRF) to enforce anatomically consistent predictions, and (iii) an uncertainty-aware fusion module to enhance reliability in high-stakes clinical scenarios. Extensive experiments across multi-center medical datasets demonstrate the effectiveness of our approach: KG-SAM achieves an average Dice score of 82.69% on prostate segmentation and delivers substantial gains in abdominal segmentation, reaching 78.05% on MRI and 79.68% on CT. These results establish KG-SAM as a robust and generalizable framework for advancing medical image segmentation.
zh
[CV-115] Incorporating Scene Context and Semantic Labels for Enhanced Group-level Emotion Recognition
【速读】:该论文旨在解决当前群体情绪识别(Group-level Emotion Recognition, GER)方法中对视觉场景上下文信息利用不足以及忽视情感标签语义信息的问题,从而导致对群体情绪理解不完整。解决方案的关键在于提出一个融合视觉场景上下文与标签引导语义信息的新框架:首先通过视觉上下文编码模块利用多尺度场景信息多样化建模个体间关系;其次通过情感语义编码模块借助群体级情感标签激发大语言模型生成细粒度情感词典,并结合结构化情感树将其精炼为全面的语义表示;最后采用相似性感知交互机制对齐并融合视觉与语义信息,从而生成增强的群体情绪表征,显著提升GER性能。
链接: https://arxiv.org/abs/2509.21747
作者: Qing Zhu,Wangdong Guo,Qirong Mao,Xiaohua Huang,Xiuyan Shao,Wenming Zheng
机构: Jiangsu University (江苏大学); Nanjing Institute of Technology (南京工程学院); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5figures, submitted to IEEE Transactions on Human-Machine Systems
点击查看摘要
Abstract:Group-level emotion recognition (GER) aims to identify holistic emotions within a scene involving multiple individuals. Existing methods underestimate the importance of visual scene contextual information in modeling individual relationships. Furthermore, they overlook the crucial role of semantic information from emotional labels for a complete understanding of emotions. To address these limitations, we propose a novel framework that incorporates visual scene context and label-guided semantic information to improve GER performance. It involves a visual context encoding module that leverages multi-scale scene information to diversely encode individual relationships. Complementarily, an emotion semantic encoding module utilizes group-level emotion labels to prompt a large language model to generate nuanced emotion lexicons. These lexicons, in conjunction with the emotion labels, are then refined into comprehensive semantic representations through a structured emotion tree. Finally, a similarity-aware interaction is proposed to align and integrate visual and semantic information, thereby generating enhanced group-level emotion representations and subsequently improving the performance of GER. Experiments on three widely adopted GER datasets demonstrate that our proposed method achieves competitive performance compared to state-of-the-art methods.
zh
[CV-116] LFA-Net: A Lightweight Network with LiteFusion Attention for Retinal Vessel Segmentation
【速读】:该论文旨在解决轻量化视网膜血管分割在计算资源受限的临床环境中面临的两大挑战:一是小血管分割精度不足,二是现有深度学习模型计算成本过高。解决方案的关键在于提出了一种新型轻量级分割网络LFA-Net,其核心创新是引入了一个名为LiteFusion-Attention的注意力模块,该模块融合了残差学习连接、受Vision Mamba启发的动力学机制以及基于调制的注意力机制,从而在极低参数量(0.11百万)、内存占用(0.42 MB)和计算复杂度(4.46 GFLOPs)下高效捕获局部与全局上下文信息,显著提升了小血管分割性能。
链接: https://arxiv.org/abs/2509.21738
作者: Mehwish Mehmood,Ivor Spence,Muhammad Fahim
机构: Queen’s University Belfast (贝尔法斯特女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Lightweight retinal vessel segmentation is important for the early diagnosis of vision-threatening and systemic diseases, especially in a real-world clinical environment with limited computational resources. Although segmentation methods based on deep learning are improving, existing models are still facing challenges of small vessel segmentation and high computational costs. To address these challenges, we proposed a new vascular segmentation network, LFA-Net, which incorporates a newly designed attention module, LiteFusion-Attention. This attention module incorporates residual learning connections, Vision Mamba-inspired dynamics, and modulation-based attention, enabling the model to capture local and global context efficiently and in a lightweight manner. LFA-Net offers high performance with 0.11 million parameters, 0.42 MB memory size, and 4.46 GFLOPs, which make it ideal for resource-constrained environments. We validated our proposed model on DRIVE, STARE, and CHASE_DB with outstanding performance in terms of dice scores of 83.28, 87.44, and 84.50% and Jaccard indices of 72.85, 79.31, and 74.70%, respectively. The code of LFA-Net is available online this https URL.
zh
[CV-117] On the Status of Foundation Models for SAR Imagery
链接: https://arxiv.org/abs/2509.21722
作者: Nathan Inkawhich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
[CV-118] DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining
链接: https://arxiv.org/abs/2509.21719
作者: Shuning Sun,Jialang Lu,Xiang Chen,Jichao Wang,Dianjie Lu,Guijuan Zhang,Guangwei Gao,Zhuoran Zheng
机构: University of Chinese Academy of Sciences (中国科学院大学); Hubei University (湖北大学); Nanjing University of Science and Technology (南京理工大学); Shandong Normal University (山东师范大学); Nanjing University of Posts and Telecommunications (南京邮电大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-119] Motion-Aware Transformer for Multi-Object Tracking
链接: https://arxiv.org/abs/2509.21715
作者: Xu Yang,Gady Agam
机构: Illinois Institute of Technology (伊利诺伊理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-120] MS-YOLO: Infrared Object Detection for Edge Deployment via MobileNetV4 and SlideLoss IJCNN
链接: https://arxiv.org/abs/2509.21696
作者: Jiali Zhang,Thomas S. White,Haoliang Zhang,Wenqing Hu,Donald C. Wunsch II,Jian Liu
机构: Missouri University of Science and Technology (密苏里科技大学); University of Oklahoma (俄克拉荷马大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the International Joint Conference on Neural Networks (IJCNN) 2025. Keywords: Infrared Object Detection, MobileNetV4, SlideLoss, YOLO Model
[CV-121] MORPH: Shape-agnostic PDE Foundation Models
【速读】:该论文旨在解决科学计算中偏微分方程(Partial Differential Equations, PDEs)建模与预测的通用性与数据异构性问题,即如何构建一个能够处理多维(1D–3D)、多场(scalar 和 vector 组分混合)、不同分辨率的异构时空数据的统一基础模型。其解决方案的关键在于提出 MORPH——一种形状无关(shape-agnostic)、自回归的基础模型架构,通过三种核心机制实现高效建模:(i) 逐通道卷积(component-wise convolution)以联合捕捉标量与矢量通道的局部相互作用;(ii) 跨场交叉注意力(inter-field cross-attention)用于选择性地在不同物理场之间传播信息;(iii) 轴向注意力(axial attention)将全时空自注意力沿空间和时间轴分解,显著降低计算复杂度同时保持表达能力。该设计使模型在预训练后可在多种下游任务中实现零样本(zero-shot)和全样本(full-shot)迁移性能超越从头训练模型,展现出强大的泛化能力和数据效率。
链接: https://arxiv.org/abs/2509.21670
作者: Mahindra Singh Rautela,Alexander Most,Siddharth Mansingh,Bradley C. Love,Ayan Biswas,Diane Oyen,Earl Lawrence
机构: Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注:
点击查看摘要
Abstract:We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D–3D) at different resolutions, multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning.
zh
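下面用一小段 PyTorch 代码演示上文 MORPH 摘要中"轴向注意力"分解的思路:把 (B, T, H, W, C) 的时空张量依次沿 T、H、W 轴做一维自注意力,避免对全部 THW 个位置做全量自注意力。张量布局、是否跨轴共享注意力权重等细节为本文假设,非论文原始实现。

```python
# 示意:轴向注意力分解(实现细节为本文假设)
import torch
import torch.nn as nn

def axial_attention(x, attn, axis):
    # x: (B, T, H, W, C);axis ∈ {1, 2, 3} 对应 T/H/W 轴
    perm = [0] + [i for i in (1, 2, 3) if i != axis] + [axis, 4]
    y = x.permute(*perm)                                  # 把目标轴移到序列维
    b, d1, d2, L, c = y.shape
    y = y.reshape(b * d1 * d2, L, c)
    y, _ = attn(y, y, y)                                  # 只沿该轴做自注意力
    y = y.reshape(b, d1, d2, L, c)
    inv = [perm.index(i) for i in range(5)]               # 逆置换还原布局
    return y.permute(*inv)

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(2, 8, 16, 16, 32)
out = x
for ax in (1, 2, 3):                                      # 依次沿时间轴与两个空间轴
    out = out + axial_attention(out, attn, ax)            # 残差累加
print(out.shape)                                          # torch.Size([2, 8, 16, 16, 32])
```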
[CV-122] FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
链接: https://arxiv.org/abs/2509.21657
作者: Yixiang Dai,Fan Jiang,Chiyu Wang,Mu Xu,Yonggang Qi
机构: AMAP, Alibaba Group (阿里巴巴集团); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-123] A Data-driven Typology of Vision Models from Integrated Representational Metrics
【速读】:该论文旨在解决大规模视觉模型(Large Vision Models)在架构和训练范式差异显著的情况下,如何系统性地识别其表征中哪些成分是跨家族共享的、哪些反映了独特的计算策略这一问题。解决方案的关键在于引入一套多维度的表示相似性度量(包括几何结构、单元调谐特性及线性可解码性),并通过受多组学整合启发的相似性网络融合(Similarity Network Fusion, SNF)方法,将不同度量维度的信息进行融合,从而获得更清晰、更具判别力的模型家族分离结果。该框架揭示了几何与调谐特性承载家族特异性信号,而线性可解码信息则更为普遍共享,并发现自监督模型无论架构如何均聚类在一起,暗示训练目标对表征结构的影响超越了表面设计类别。
链接: https://arxiv.org/abs/2509.21628
作者: Jialin Wu,Shreya Saha,Yiqing Bo,Meenakshi Khosla
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet-geometry, unit tuning, or linear decodability-and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies-shaped jointly by architecture and training objective-define representational structure beyond surface design categories.
zh
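上文摘要提到用 Similarity Network Fusion(SNF)融合多种表示相似度矩阵。下面是一个省略了 kNN 局部核等细节的简化迭代示意,仅展示"交叉扩散、取平均"的骨架;迭代次数与归一化方式为本文假设。

```python
# 示意:SNF 式的相似度矩阵融合(简化版,省略 kNN 稀疏核等步骤)
import numpy as np

def row_normalize(m):
    return m / m.sum(axis=1, keepdims=True)

def snf_fuse(sim_list, iters=10):
    P = [row_normalize(s) for s in sim_list]                       # 每个度量一个状态矩阵
    for _ in range(iters):
        P_new = []
        for v, Pv in enumerate(P):
            others = sum(P[k] for k in range(len(P)) if k != v) / (len(P) - 1)
            P_new.append(row_normalize(Pv @ others @ Pv.T))        # 用其余视图的平均状态做交叉扩散
        P = P_new
    return sum(P) / len(P)                                         # 融合后的复合相似度矩阵

rng = np.random.default_rng(0)
sims = [np.abs(rng.standard_normal((6, 6))) + np.eye(6) for _ in range(3)]   # 三种度量下的模型相似度
sims = [(s + s.T) / 2 for s in sims]
print(snf_fuse(sims).round(2))
```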
[CV-124] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
链接: https://arxiv.org/abs/2509.21609
作者: Md. Mahfuzur Rahman,Kishor Datta Gupta,Marufa Kamal,Fahad Rahman,Sunzida Siddique,Ahmed Rafi Hasan,Mohd Ariful Haque,Roy George
机构: Clark Atlanta University (克拉克亚特兰大大学); BRAC University (BRAC大学); United International University (联合国际大学); Daffodil International University (水仙国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 29 pages, 40 figures, 3 algorithms
[CV-125] Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis
链接: https://arxiv.org/abs/2509.21595
作者: Sai Varun Kodathala,Rakesh Vunnam
机构: Sports Vision, Inc.(Sports Vision, Inc.); Vizworld, Inc.(Vizworld, Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-126] What Happens Next? Anticipating Future Motion by Generating Point Trajectories
链接: https://arxiv.org/abs/2509.21592
作者: Gabrijel Boduljak,Laurynas Karazija,Iro Laina,Christian Rupprecht,Andrea Vedaldi
机构: Visual Geometry Group, University of Oxford (牛津大学视觉几何组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[CV-127] X-Streamer: Unified Human World Modeling with Audiovisual Interaction
链接: https://arxiv.org/abs/2509.21574
作者: You Xie,Tianpei Gu,Zenan Li,Chenxu Zhang,Guoxian Song,Xiaochen Zhao,Chao Liang,Jianwen Jiang,Hongyi Xu,Linjie Luo
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page at this https URL
[CV-128] Enhancing Contrastive Learning for Geolocalization by Discovering Hard Negatives on Semivariograms
链接: https://arxiv.org/abs/2509.21573
作者: Boyi Chen,Zhangyu Wang,Fabian Deuser,Johann Maximilian Zollner,Martin Werner
机构: Technical University of Munich (慕尼黑工业大学); University of Maine (缅因大学); University of the Bundeswehr Munich (慕尼黑联邦国防军大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-129] No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models
链接: https://arxiv.org/abs/2509.21565
作者: Junno Yun,Yaşar Utku Alçalar,Mehmet Akçakaya
机构: University of Minnesota (明尼苏达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[CV-130] Unsupervised Defect Detection for Surgical Instruments
链接: https://arxiv.org/abs/2509.21561
作者: Joseph Huang,Yichi Zhang,Jingxi Yu,Wei Chen,Seunghyun Hwang,Qiang Qiu,Amy R. Reibman,Edward J. Delp,Fengqing Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-131] X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning EMNLP2025
链接: https://arxiv.org/abs/2509.21559
作者: Prasanna Reddy Pulakurthi,Jiamian Wang,Majid Rabbani,Sohail Dianat,Raghuveer Rao,Zhiqiang Tao
机构: Rochester Institute of Technology (罗彻斯特理工学院); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures. Accepted at EMNLP 2025 (Main Conference)
[CV-132] ControlHair: Physically-based Video Diffusion for Controllable Dynamic Hair Rendering
链接: https://arxiv.org/abs/2509.21541
作者: Weikai Lin,Haoxiang Li,Yuhao Zhu
机构: University of Rochester (罗切斯特大学); Pixocial Technology (Pixocial科技)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages,Project website: this https URL
[CV-133] TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning NEURIPS2025
链接: https://arxiv.org/abs/2509.21526
作者: Hongyang He,Xinyuan Song,Yangfan He,Zeyu Zhang,Yanshu Li,Haochen You,Lifan Sun,Wenqiao Zhang
机构: University of Warwick (华威大学); Emory University (埃默里大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); The Australian National University (澳大利亚国立大学); Brown University (布朗大学); Columbia University (哥伦比亚大学); University of California, San Diego (加州大学圣地亚哥分校); Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025
[CV-134] DistillKac: Few-Step Image Generation via Damped Wave Equations
链接: https://arxiv.org/abs/2509.21513
作者: Weiqiao Han,Chenlin Meng,Christopher D. Manning,Stefano Ermon
机构: MIT (麻省理工学院); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR); Machine Learning (stat.ML)
备注:
[CV-135] SlimDiff: Training-Free Activation-Guided Hands-free Slimming of Diffusion Models
链接: https://arxiv.org/abs/2509.21498
作者: Arani Roy,Shristi Das Biswas,Kaushik Roy
机构: Purdue University (普渡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-136] Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Moderation
链接: https://arxiv.org/abs/2509.21486
作者: Zixuan Wang,Yu Sun,Hongwei Wang,Baoyu Jing,Xiang Shen,Xin Dong,Zhuolin Hao,Hongyu Xiong,Yang Song
机构: TikTok
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-137] VISION: Prompting Ocean Vertical Velocity Reconstruction from Incomplete Observations
链接: https://arxiv.org/abs/2509.21477
作者: Yuan Gao,Hao Wu,Qingsong Wen,Kun Wang,Xian Wu,Xiaomeng Huang
机构: Tsinghua University (清华大学); Squirrel Ai Learning (智谱AI); Nanyang Technological University (南洋理工大学); Tencent (腾讯)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
[CV-138] Residual Vector Quantization For Communication-Efficient Multi-Agent Perception
链接: https://arxiv.org/abs/2509.21464
作者: Dereje Shenkut,B.V.K Vijaya Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 5 pages
[CV-139] DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation
链接: https://arxiv.org/abs/2509.21433
作者: Jiaqi Liu,Lan Zhang,Xiaoyong Yuan
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[CV-140] QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models
【速读】:该论文旨在解决现有生成式模型在构建四边形主导网格(quadrilateral-dominant meshes)时存在的拓扑质量差的问题,即当前方法通常先生成三角形网格,再通过规则合并为四边形,导致最终网格拓扑结构不理想。其解决方案的关键在于提出QuadGPT,一个端到端的自回归框架,首次实现原生四边形网格生成;核心创新包括:一种统一的标记化方法(tokenization method),可处理三角形与四边形混合拓扑;以及一种面向拓扑感知的强化学习微调方法(tDPO),显著提升生成质量。实验表明,QuadGPT在几何精度和拓扑质量上均优于传统三角形转四边形的流水线,建立了原生四边形网格生成的新基准。
链接: https://arxiv.org/abs/2509.21420
作者: Jian Liu,Chunshi Wang,Song Guo,Haohan Weng,Zhen Zhou,Zhiqi Li,Jiaao Yu,Yiling Zhu,Jing Xu,Biwen Lei,Zhuo Chen,Chunchao Guo
机构: Hong Kong University of Science and Technology (香港科技大学); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The generation of quadrilateral-dominant meshes is a cornerstone of professional 3D content creation. However, existing generative models generate quad meshes by first generating triangle meshes and then merging triangles into quadrilaterals with some specific rules, which typically produces quad meshes with poor topology. In this paper, we introduce QuadGPT, the first autoregressive framework for generating quadrilateral meshes in an end-to-end manner. QuadGPT formulates this as a sequence prediction paradigm, distinguished by two key innovations: a unified tokenization method to handle mixed topologies of triangles and quadrilaterals, and a specialized Reinforcement Learning fine-tuning method tDPO for better generation quality. Extensive experiments demonstrate that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality. Our work establishes a new benchmark for native quad-mesh generation and showcases the power of combining large-scale autoregressive models with topology-aware RL refinement for creating structured 3D assets.
zh
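QuadGPT 摘要的核心之一是"统一的 tokenization 处理三角形与四边形混合拓扑"。下面给出一个最小化的序列化示意:每个面片先输出类型 token,再输出顶点索引;词表与排序规则均为本文假设,仅用于说明混合面片如何进入自回归序列。

```python
# 示意:三角形/四边形混合面片的统一 token 化(词表与规则为本文假设)
TRI, QUAD, END = "<tri>", "<quad>", "<end>"

def tokenize_faces(faces):
    # faces: 面片列表,每个面片是 3 个或 4 个顶点索引
    tokens = []
    for f in faces:
        tokens.append(TRI if len(f) == 3 else QUAD)   # 面片类型 token
        tokens.extend(str(i) for i in f)              # 顶点索引 token
    tokens.append(END)
    return tokens

mesh = [(0, 1, 2), (2, 1, 3, 4), (4, 3, 5)]           # 混合拓扑的小例子
print(tokenize_faces(mesh))
```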
[CV-141] Overview of ExpertLifeCLEF 2018: how far automated identification systems are from the best experts?
链接: https://arxiv.org/abs/2509.21419
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures, CLEF 2018 Conference and Labs of the Evaluation Forum, September 10 to 14, 2018, Avignon, France
[CV-142] JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation
链接: https://arxiv.org/abs/2509.21401
作者: Md Jueal Mia,M. Hadi Amini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-143] Downscaling climate projections to 1 km with single-image super resolution
链接: https://arxiv.org/abs/2509.21399
作者: Petr Košťál,Pavel Kordík,Ondřej Podsztavek
机构: Czech Technical University in Prague (布拉格捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-144] Skeleton Sparsification and Densification Scale-Spaces
【速读】:该论文旨在解决传统中轴线(medial axis)在噪声敏感性方面的缺陷,即微小边界扰动可能导致骨架结构显著且不合理的扩张,从而影响其在实际应用中的鲁棒性。解决方案的关键在于提出骨架化尺度空间(skeletonisation scale-spaces),通过引入对中轴线的稀疏化(sparsification)机制,实现形状的分层简化。该框架不仅继承了经典剪枝方法的优势,还天然满足尺度空间的核心属性,如层次架构、可控简化和几何变换等变性(equivariance),并通过连续与离散形式的理论支撑实现了从粗到细的逆向重构能力,甚至可生成超越原始骨架的过完备表示,适用于鲁棒骨架提取、形状压缩及增材制造中的刚度增强等任务。
链接: https://arxiv.org/abs/2509.21398
作者: Julia Gierke,Pascal Peter
机构: University of Saarland (萨尔兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. This allows inverse progression from coarse to fine scales and can even reach beyond the original skeleton to produce overcomplete shape representations with relevancy for practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.
zh
[CV-145] mmHSense: Multi-Modal and Distributed mmWave ISAC Datasets for Human Sensing
链接: https://arxiv.org/abs/2509.21396
作者: Nabeel Nisar Bhat,Maksim Karnaukh,Stein Vandenbroeke,Wouter Lemoine,Jakob Struye,Jesus Omar Lacruz,Siddhartha Kumar,Mohammad Hossein Moghaddam,Joerg Widmer,Rafael Berkvens,Jeroen Famaey
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-146] Large AI Model-Enabled Generative Semantic Communications for Image Transmission
链接: https://arxiv.org/abs/2509.21394
作者: Qiyu Ma,Wanli Ni,Zhijin Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: Accepted to the IEEE GLOBECOM 2025
[CV-147] UN3D: Towards Real-World Scene Understanding from Unposed Images
链接: https://arxiv.org/abs/2509.21388
作者: Anton Konushin,Nikita Drozdov,Bulat Gabdullin,Alexey Zakharov,Anna Vorontsova,Danila Rukhovich,Maksim Kolodiazhnyi
机构: Lomonosov Moscow State University (莫斯科国立大学); Higher School of Economics (高等经济学院); Institute of Mechanics, Armenia (亚美尼亚力学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
[CV-148] Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity Sparsity and Concept Coherence
链接: https://arxiv.org/abs/2509.21387
作者: Sanish Suwal,Dipkamal Bhusal,Michael Clifford,Nidhi Rastogi
机构: Rochester Institute of Technology (罗彻斯特理工学院); Toyota InfoTech Labs (丰田信息科技实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 pages
[CV-149] ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data
【速读】:该论文旨在解决海洋考古中船难遗址(shipwreck)自动检测效率低下的问题,传统方法依赖人工逐帧分析多波束声呐(multibeam sonar)数据,耗时且需专家经验。解决方案的关键在于开发了一个开源的QGIS插件ShipwreckFinder,其核心是一个基于深度学习的模型,通过自动预处理水深数据、执行推理、阈值分割等步骤,输出像素级分割掩码或边界框预测结果;同时利用合成数据生成技术扩充并多样化训练集,从而显著提升检测精度与效率,优于现有的ArcGIS深度学习工具包和经典逆凹陷检测方法。
链接: https://arxiv.org/abs/2509.21386
作者: Anja Sheppard,Tyler Smithline,Andrew Scheffer,David Smith,Advaith V. Sethuraman,Ryan Bird,Sabrina Lin,Katherine A. Skinner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Accepted to OCEANS 2025 Great Lakes
点击查看摘要
Abstract:In this paper, we introduce ShipwreckFinder, an open-source QGIS plugin that detects shipwrecks from multibeam sonar data. Shipwrecks are an important historical marker of maritime history, and can be discovered through manual inspection of bathymetric data. However, this is a time-consuming process and often requires expert analysis. Our proposed tool allows users to automatically preprocess bathymetry data, perform deep learning inference, threshold model outputs, and produce either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The backbone of this open-source tool is a deep learning model, which is trained on a variety of shipwreck data from the Great Lakes and the coasts of Ireland. Additionally, we employ synthetic data generation in order to increase the size and diversity of our dataset. We demonstrate superior segmentation performance with our open-source tool and training pipeline as compared to a deep learning-based ArcGIS toolkit and a more classical inverse sinkhole detection method. The open-source tool can be found at this https URL.
zh
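ShipwreckFinder 摘要描述的插件流程为"预处理 → 深度学习推理 → 阈值化 → 输出分割掩码或边界框"。下面用 numpy/scipy 演示其中阈值化与连通域取框这一步的典型做法;阈值取值与后处理细节为本文假设。

```python
# 示意:概率图阈值化并用连通域提取预测框(参数为本文假设)
import numpy as np
from scipy import ndimage

def masks_and_boxes(prob_map, threshold=0.5):
    mask = prob_map >= threshold                      # 像素级分割掩码
    labeled, num = ndimage.label(mask)                # 连通域标记
    boxes = []
    for sl in ndimage.find_objects(labeled):
        if sl is not None:
            boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))  # (y0, x0, y1, x1)
    return mask, boxes

prob = np.zeros((64, 64)); prob[10:20, 30:45] = 0.9
mask, boxes = masks_and_boxes(prob)
print(boxes)   # [(10, 30, 20, 45)]
```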
[CV-150] Debugging Concept Bottleneck Models through Removal and Retraining
【速读】:该论文旨在解决概念瓶颈模型(Concept Bottleneck Models, CBMs)在实际应用中因数据偏差导致的系统性误判问题,即模型学习到的捷径依赖于训练数据中的伪相关性,而这种偏差无法通过简单的测试时干预(如移除特定概念)来修正。解决方案的关键在于提出一个通用的可解释调试框架CBDebug,其核心是两步流程:首先由专家基于概念解释识别并移除不良概念(Removal),随后通过CBDebug方法将概念层面的用户反馈转化为样本级别的辅助标签,用于监督式偏差缓解和针对性增强,从而有效降低模型对不良概念的依赖,提升其鲁棒性和可解释性。
链接: https://arxiv.org/abs/2509.21385
作者: Eric Enouen,Sainyam Galhotra
机构: Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Concept Bottleneck Models (CBMs) use a set of human-interpretable concepts to predict the final task label, enabling domain experts to not only validate the CBM’s predictions, but also intervene on incorrect concepts at test time. However, these interventions fail to address systemic misalignment between the CBM and the expert’s reasoning, such as when the model learns shortcuts from biased data. To address this, we present a general interpretable debugging framework for CBMs that follows a two-step process of Removal and Retraining. In the Removal step, experts use concept explanations to identify and remove any undesired concepts. In the Retraining step, we introduce CBDebug, a novel method that leverages the interpretability of CBMs as a bridge for converting concept-level user feedback into sample-level auxiliary labels. These labels are then used to apply supervised bias mitigation and targeted augmentation, reducing the model’s reliance on undesired concepts. We evaluate our framework with both real and automated expert feedback, and find that CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.
zh
[CV-151] Assessing the Alignment of Popular CNNs to the Brain for Valence Appraisal
链接: https://arxiv.org/abs/2509.21384
作者: Laurent Mertens,Elahe’ Yargholi,Laura Van Hove,Hans Op de Beeck,Jan Van den Stock,Joost Vennekens
机构: KU Leuven, De Nayer Campus, Dept. of Computer Science (鲁汶大学,德纳耶校园,计算机科学系); Leuven.AI - KU Leuven Institute for AI (鲁汶人工智能研究所); Department of Brain and Cognition, Leuven Brain Institute, Faculty of Psychology & Educational Sciences, KU Leuven (鲁汶大脑研究所,心理与教育科学学院); Neuropsychiatry, Leuven Brain Institute, KU Leuven (鲁汶大脑研究所,神经精神病学); Vrije Universiteit Brussel (布鲁塞尔自由大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures
[CV-152] he LongiMam model for improved breast cancer risk prediction using longitudinal mammograms
【速读】:该论文旨在解决乳腺癌筛查中风险适应性建模的问题,即如何利用纵向影像数据提升乳腺癌预测的准确性,尤其是在现实临床环境中存在结果分布不均和随访异质性的挑战。解决方案的关键在于提出一种端到端的深度学习模型LongiMam,该模型整合当前及最多四次既往乳腺X线摄影(mammogram)数据,通过卷积神经网络(Convolutional Neural Network, CNN)捕捉空间特征、循环神经网络(Recurrent Neural Network, RNN)建模时间动态变化,从而实现对乳腺癌风险的精细化预测。实验表明,结合当前与既往检查信息显著优于仅使用单次检查的模型,尤其在乳腺密度随时间变化的人群中表现最优,验证了纵向建模在风险分层中的核心价值。
链接: https://arxiv.org/abs/2509.21383
作者: Manel Rakez,Thomas Louis,Julien Guillaumin,Foucauld Chamming’s,Pierre Fillard,Brice Amadeo,Virginie Rondeau
机构: BIOSTAT team, Bordeaux Population Health U1219, Bordeaux University, ISPED, Bordeaux, France; Therapixel, Nice, France; Department of Radiology, Institut Bergonié, Comprehensive Cancer Centre, Bordeaux, France; EPICENE team, Bordeaux Population Health U1219, Bordeaux University, ISPED, Bordeaux, France
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Risk-adapted breast cancer screening requires robust models that leverage longitudinal imaging data. Most current deep learning models use single or limited prior mammograms and lack adaptation for real-world settings marked by imbalanced outcome distribution and heterogeneous follow-up. We developed LongiMam, an end-to-end deep learning model that integrates both current and up to four prior mammograms. LongiMam combines a convolutional and a recurrent neural network to capture spatial and temporal patterns predictive of breast cancer. The model was trained and evaluated using a large, population-based screening dataset with disproportionate case-to-control ratio typical of clinical screening. Across several scenarios that varied in the number and composition of prior exams, LongiMam consistently improved prediction when prior mammograms were included. The addition of prior and current visits outperformed single-visit models, while priors alone performed less well, highlighting the importance of combining historical and recent information. Subgroup analyses confirmed the model’s efficacy across key risk groups, including women with dense breasts and those aged 55 years or older. Moreover, the model performed best in women with observed changes in mammographic density over time. These findings demonstrate that longitudinal modeling enhances breast cancer prediction and support the use of repeated mammograms to refine risk stratification in screening programs. LongiMam is publicly available as open-source software.
zh
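上文摘要明确说明 LongiMam 由"CNN 提取空间特征 + RNN 建模多次检查的时间动态"组成。下面给出一个端到端的最小骨架:共享的 2D CNN 编码每次乳腺 X 线检查,GRU 聚合当前与既往检查后输出风险 logit。骨干结构、输入尺寸与维度均为本文假设。

```python
# 示意:CNN 编码 + GRU 时间聚合的纵向风险预测骨架(结构细节为本文假设)
import torch
import torch.nn as nn

class LongitudinalRiskSketch(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(                                    # 各次检查共享的图像编码器
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)     # 跨检查的时间建模
        self.head = nn.Linear(feat_dim, 1)                          # 风险 logit

    def forward(self, exams):                                        # exams: (B, T, 1, H, W)
        b, t = exams.shape[:2]
        feats = self.cnn(exams.flatten(0, 1)).reshape(b, t, -1)
        _, h = self.rnn(feats)
        return self.head(h[-1])

x = torch.randn(2, 5, 1, 128, 128)                                   # 当前 + 最多 4 次既往检查
print(LongitudinalRiskSketch()(x).shape)                             # torch.Size([2, 1])
```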
[CV-153] Coreset selection based on Intra-class diversity
链接: https://arxiv.org/abs/2509.21380
作者: Imran Ashraf,Mukhtar Ullah,Muhammad Faisal Nadeem,Muhammad Nouman Noor
机构: NUCES-FAST (National University of Computer and Emerging Sciences); Informatics Complex (信息复杂系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-154] SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders
链接: https://arxiv.org/abs/2509.21379
作者: Enrico Cassano,Riccardo Renzulli,Marco Nurisso,Mirko Zaffaroni,Alan Perotti,Marco Grangetto
机构: University of Turin (都灵大学); Politecnico di Torino (都灵理工大学); CENTAI Institute (CENTAI 研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-155] Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation ECAI
链接: https://arxiv.org/abs/2509.21377
作者: Yinfeng Yu,Hailong Zhang,Meiling Zhu
机构: Xinjiang University (新疆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Main paper (8 pages). Accepted for publication by ECAI (European Conference on Artificial Intelligence) 2025
[CV-156] In silico Deep Learning Protocols for Label-Free Super-Resolution Microscopy: A Comparative Study of Network Architectures and SNR Dependence
链接: https://arxiv.org/abs/2509.21376
作者: Shiraz S Kaderuppan,Jonathan Mar,Andrew Irvine,Anurag Sharma,Muhammad Ramadan Saifuddin,Wai Leong Eugene Wong,Wai Lok Woo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 10 figures
[CV-157] Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis
链接: https://arxiv.org/abs/2509.21375
作者: Aleksa Jelaca,Ying Jiao,Chang Tian,Marie-Francine Moens
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: text-to-image generation, automatic prompt, DPO, Counterfactual
[CV-158] Language-in-the-Loop Culvert Inspection on the Erie Canal
【速读】:该论文旨在解决传统人工巡检渠道涵洞(culvert)效率低、安全性差的问题,尤其针对老旧设施如伊利运河(Erie Canal)下的涵洞因结构复杂、光照不足、通行困难等因素导致的检测挑战。解决方案的关键在于提出一个端到端的“语言驱动自主系统”(VISION),其核心是将大规模视觉-语言模型(VLM)与约束感知的视角规划相结合:通过简短提示(prompt)从VLM获取开放词汇的目标区域(ROI)建议及其置信度和推理依据,利用立体深度信息恢复尺度,再由考虑涵洞几何与操作约束的规划器驱动四足机器人移动至指定位置进行近距离成像,从而闭环完成“观察、决策、移动、重成像”流程,无需领域特定微调即可生成高分辨率图像用于专家评估,实测表明该方法显著提升了初步识别与最终判断的一致性(分别达61.4%和80%)。
链接: https://arxiv.org/abs/2509.21370
作者: Yashom Dighe,Yash Turkar,Karthik Dantu
机构: University at Buffalo (纽约州立大学布法罗分校); New York Canal Corporation (纽约运河公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: First two authors contributed equally
点击查看摘要
Abstract:Culverts on canals such as the Erie Canal, built originally in 1825, require frequent inspections to ensure safe operation. Human inspection of culverts is challenging due to age, geometry, poor illumination, weather, and lack of easy access. We introduce VISION, an end-to-end, language-in-the-loop autonomy system that couples a web-scale vision-language model (VLM) with constrained viewpoint planning for autonomous inspection of culverts. Brief prompts to the VLM solicit open-vocabulary ROI proposals with rationales and confidences, stereo depth is fused to recover scale, and a planner – aware of culvert constraints – commands repositioning moves to capture targeted close-ups. Deployed on a quadruped in a culvert under the Erie Canal, VISION closes the see, decide, move, re-image loop on-board and produces high-resolution images for detailed reporting without domain-specific fine-tuning. In an external evaluation by New York Canal Corporation personnel, initial ROI proposals achieved 61.4% agreement with subject-matter experts, and final post-re-imaging assessments reached 80%, indicating that VISION converts tentative hypotheses into grounded, expert-aligned findings.
zh
[CV-159] Safety Assessment of Scaffolding on Construction Site using AI
链接: https://arxiv.org/abs/2509.21368
作者: Sameer Prabhu,Amit Patwardhan,Ramin Karim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-160] MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation
【速读】:该论文旨在解决现有评估指标在多模态数据相关性分析中的局限性问题,即传统方法(如CLIP)仅适用于双模态关联分析,难以有效衡量三个及以上模态之间的复杂相似性关系。其解决方案的关键在于提出MAJORScore,一种基于多模态联合表示(multimodal joint representation)的新评估指标,通过将多种模态映射到统一的潜在空间中,实现跨模态信息在相同尺度上的对齐与整合,从而支持公平且精准的相关性评分。实验表明,MAJORScore在一致性场景下相比现有方法提升26.03%-64.29%,在不一致场景下下降13.28%-20.54%,显著提升了大规模多模态数据集和模型性能评估的可靠性。
链接: https://arxiv.org/abs/2509.21365
作者: Zhicheng Du,Qingyang Shi,Jiasheng Lu,Yingshan Liang,Xinyu Zhang,Yiran Wang,Peiwu Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The multimodal relevance metric is usually borrowed from the embedding ability of pretrained contrastive learning models for bimodal data, which is used to evaluate the correlation between cross-modal data (e.g., CLIP). However, the commonly used evaluation metrics are only suitable for the associated analysis between two modalities, which greatly limits the evaluation of multimodal similarity. Herein, we propose MAJORScore, a brand-new evaluation metric for the relevance of multiple modalities (N modalities, N ≥ 3) via multimodal joint representation for the first time. The ability of multimodal joint representation to integrate multiple modalities into the same latent space can accurately represent different modalities at one scale, providing support for fair relevance scoring. Extensive experiments have shown that MAJORScore increases by 26.03%-64.29% for consistent modality and decreases by 13.28%-20.54% for inconsistent modality compared to existing methods. MAJORScore serves as a more reliable metric for evaluating similarity on large-scale multimodal datasets and multimodal model performance evaluation.
zh
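MAJORScore 的具体打分公式摘要中未给出;下面只演示其直观思路的一个朴素版本:把 N(≥3)个模态的嵌入映射到同一潜在空间后,用两两余弦相似度的平均值作为相关性分数。编码器与打分方式均为本文假设,并非论文原始定义。

```python
# 示意:统一潜在空间中的朴素多模态相关性打分(非 MAJORScore 原始定义)
import numpy as np

def multimodal_relevance(embeddings):
    # embeddings: 每个模态一条 (D,) 向量,均已映射到同一潜在空间
    e = [v / np.linalg.norm(v) for v in embeddings]
    sims = [float(e[i] @ e[j]) for i in range(len(e)) for j in range(i + 1, len(e))]
    return float(np.mean(sims))                                      # 两两余弦相似度的平均

rng = np.random.default_rng(0)
text, image, audio = (rng.standard_normal(16) for _ in range(3))     # 三个模态的假设嵌入
print(multimodal_relevance([text, image, audio]))
```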
[CV-161] A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision–Revised
链接: https://arxiv.org/abs/2509.21363
作者: Runmin Wu,Mengyang Feng,Wenlong Guan,Dong Wang,Huchuan Lu,Errui Ding
机构: Dalian University of Technology (大连理工大学); Department of Computer Vision Technology (VIS), Baidu Inc. (百度公司视觉技术部门)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages
[CV-162] Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models
链接: https://arxiv.org/abs/2509.21360
作者: Xingkai Peng,Jun Jiang,Meng Tong,Shuai Li,Weiming Zhang,Nenghai Yu,Kejiang Chen
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-163] MDF-MLLM: Deep Fusion Through Cross-Modal Feature Alignment for Contextually Aware Fundoscopic Image Classification
链接: https://arxiv.org/abs/2509.21358
作者: Jason Jordan,Mohammadreza Akbari Lor,Peter Koulen,Mei-Ling Shyu,Shu-Ching Chen
机构: 1. University of Texas at Dallas (德克萨斯大学达拉斯分校); 2. Fraunhofer Institute for Manufacturing Engineering and Automation IPA (弗劳恩霍夫制造工程与自动化研究所); 3. University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Word count: 5157, Table count: 2, Figure count: 5
[CV-164] Phrase-grounded Fact-checking for Automatically Generated Chest X-ray Reports MICCAI2025
链接: https://arxiv.org/abs/2509.21356
作者: Razi Mahmood,Diego Machado-Reyes,Joy Wu,Parisa Kaviani,Ken C.L. Wong,Niharika D’Souza,Mannudeep Kalra,Ge Wang,Pingkun Yan,Tanveer Syeda-Mahmood
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: In proceedings MICCAI 2025
[CV-165] KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache
链接: https://arxiv.org/abs/2509.21354
作者: Wanshun Xu,Long Zhuang
机构: University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-166] Improving Autism Detection with Multimodal Behavioral Analysis
链接: https://arxiv.org/abs/2509.21352
作者: William Saakyan,Matthias Norden,Lola Eversmann,Simon Kirsch,Muyu Lin,Simon Guendelman,Isabel Dziobek,Hanna Drimalla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-167] Cross-Modal Retrieval with Cauchy-Schwarz Divergence
【速读】:该论文旨在解决跨模态检索中异构数据类型对齐不充分的问题,现有方法多集中于双模态任务,依赖Kullback-Leibler散度、最大均值差异(Maximum Mean Discrepancy, MMD)和相关性对齐等分布对齐技术,但这些方法常面临数值不稳定、超参数敏感以及难以捕捉底层分布全结构等局限。解决方案的关键在于提出一种无超参数的柯西-施瓦茨(Cauchy-Schwarz, CS)散度,显著提升训练稳定性和检索性能;进一步基于赫尔德不等式(Hölder’s inequality)设计了广义柯西-施瓦茨(Generalized CS, GCS)散度,通过双向环形比较机制在统一数学框架内实现三模态及以上模态的直接对齐,避免了繁琐的成对比较过程。
链接: https://arxiv.org/abs/2509.21339
作者: Jiahao Zhang,Wenzhe Yin,Shujian Yu
机构: The HongKong University of Science and Technology (Guangzhou)(香港科技大学(广州) ); University of Amsterdam(阿姆斯特丹大学); Vrije Universiteit Amsterdam(自由大学(阿姆斯特丹))
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by ACMMM-25
点击查看摘要
Abstract:Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and their inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder’s inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at this https URL.
zh
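Cauchy-Schwarz 散度在离散分布上有一个无超参数的闭式表达,下面给出一个最小实现作为参考;论文中针对样本/核密度的估计方式以及 GCS 的环形多模态扩展此处从略。

```python
# 示意:离散分布上的 Cauchy-Schwarz 散度
# D_CS(p, q) = -log( (Σ p_i q_i)^2 / (Σ p_i^2 · Σ q_i^2) )
import numpy as np

def cs_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    num = np.dot(p, q) ** 2
    den = np.dot(p, p) * np.dot(q, q)
    return -np.log(num / (den + eps) + eps)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.25, 0.35, 0.4])
print(cs_divergence(p, q))   # 越接近 0 表示两个分布方向越一致
print(cs_divergence(p, p))   # 同一分布时约为 0
```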
[CV-168] SGAligner: Cross-Modal Language-Aided 3D Scene Graph Alignment
链接: https://arxiv.org/abs/2509.20401
作者: Binod Singh,Sayan Deb Sarkar,Iro Armeni
机构: Technical University of Munich (慕尼黑工业大学); Stanford University (斯坦福大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
[CV-169] Deep Learning-Based Cross-Anatomy CT Synthesis Using Adapted nnResU-Net with Anatomical Feature Prioritized Loss
【速读】:该论文旨在解决跨模态医学图像合成中的结构保真度与重建质量难题,特别是在磁共振成像(MRI)到计算机断层扫描(CT)及锥形束CT(CBCT)到CT的图像转换任务中,如何提升关键解剖结构(如骨组织和病灶)的重建精度。解决方案的关键在于:首先采用基于3D nnUNet框架的patch-based方法,结合标准UNet与残差UNet两种网络结构以实现区域自适应;其次引入解剖特征优先损失(Anatomical Feature-Prioritized, AFP)损失函数,通过对比从TotalSegmentator训练的分割网络提取的多层特征来增强对临床相关结构的重建;最后在训练策略上采用分阶段优化——先用L1损失预训练,再以L1+AFP联合目标进行500轮微调,从而在保持整体强度一致性的同时显著提升局部解剖细节的准确性。
链接: https://arxiv.org/abs/2509.22394
作者: Javier Sequeiro González,Arthur Longuefosse,Miguel Díaz Benito,Álvaro García Martín,Fabien Baldacci
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present a patch-based 3D nnUNet adaptation for MR to CT and CBCT to CT image translation using the multicenter SynthRAD2025 dataset, covering head and neck (HN), thorax (TH), and abdomen (AB) regions. Our approach leverages two main network configurations: a standard UNet and a residual UNet, both adapted from nnUNet for image synthesis. The Anatomical Feature-Prioritized (AFP) loss was introduced, which compares multilayer features extracted from a compact segmentation network trained on TotalSegmentator labels, enhancing reconstruction of clinically relevant structures. Input volumes were normalized per-case using z-score normalization for MRIs, and clipping plus dataset-level z-score normalization for CBCT and CT. Training used 3D patches tailored to each anatomical region without additional data augmentation. Models were trained for 1000 and 1500 epochs, with AFP fine-tuning performed for 500 epochs using a combined L1+AFP objective. During inference, overlapping patches were aggregated via mean averaging with a step size of 0.3, and postprocessing included reverse z-score normalization. Both network configurations were applied across all regions, allowing consistent model design while capturing local adaptations through residual learning and AFP loss. Qualitative and quantitative evaluation revealed that residual networks combined with AFP yielded sharper reconstructions and improved anatomical fidelity, particularly for bone structures in MR to CT and lesions in CBCT to CT, while L1-only networks achieved slightly better intensity-based metrics. This methodology provides a stable solution for cross-modality medical image synthesis, demonstrating the effectiveness of combining the automatic nnUNet pipeline with residual learning and anatomically guided feature losses.
zh
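上文的 AFP 损失是"用冻结的分割网络提取多层特征,再比较合成 CT 与真实 CT 的特征差异"。下面给出一个 2D 版本的感知式损失骨架;真实流程中特征提取器应为在 TotalSegmentator 标签上训练的分割网络且输入为 3D patch,此处用随机初始化的小卷积塔代替,属本文假设。

```python
# 示意:AFP 式的多层特征匹配损失(特征提取器与维度为本文假设)
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
        ])
        for p in self.parameters():
            p.requires_grad_(False)               # 分割网络保持冻结

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)                       # 收集多层特征
        return feats

def afp_loss(fake_ct, real_ct, extractor):
    return sum(F.l1_loss(a, b) for a, b in zip(extractor(fake_ct), extractor(real_ct)))

ext = FrozenFeatureExtractor()
fake, real = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
total = F.l1_loss(fake, real) + afp_loss(fake, real, ext)             # L1 + AFP 联合目标
print(total.item())
```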
[CV-170] COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics
链接: https://arxiv.org/abs/2509.22240
作者: Matt Y. Cheung,Ashok Veeraraghavan,Guha Balakrishnan
机构: Rice University (莱斯大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注:
[CV-171] Comparative Analysis of GAN and Diffusion for MRI-to-CT translation
【速读】:该论文旨在解决在无法获取或难以获得计算机断层扫描(CT)图像时,如何从磁共振成像(MRI)生成合成CT(sCT)图像的问题。其关键解决方案在于对比两种主流生成模型架构——条件生成对抗网络(cGAN)与条件去噪扩散概率模型(cDDPM)的性能差异,并提出通过将三维(3D)翻译任务分解为一系列二维(2D)横断面翻译以降低计算成本的策略。研究发现,多通道条件输入和采用cDDPM架构显著提升生成质量,尤其在保持跨切片连续性方面表现优异,验证了该方法在临床应用中的可行性与有效性。
链接: https://arxiv.org/abs/2509.22049
作者: Emily Honey,Anders Helbo,Jens Petersen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Computed tomography (CT) is essential for treatment and diagnostics; when CT is missing or otherwise difficult to obtain, methods for generating synthetic CT (sCT) images from magnetic resonance imaging (MRI) are sought after. Therefore, it is valuable to establish a reference for what strategies are most effective for MRI-to-CT translation. In this paper, we compare the performance of two frequently used architectures for MRI-to-CT translation: a conditional generative adversarial network (cGAN) and a conditional denoising diffusion probabilistic model (cDDPM). We chose well-established implementations to represent each architecture: Pix2Pix for cGAN, and Palette for cDDPM. We separate the classical 3D translation problem into a sequence of 2D translations on the transverse plane, to investigate the viability of a strategy that reduces the computational cost. We also investigate the impact of conditioning the generative process on a single MRI image/slice and on multiple MRI slices. The performance is assessed using a thorough evaluation protocol, including a novel slice-wise metric, Similarity Of Slices (SIMOS), which measures the continuity between transverse slices when compiling the sCTs into 3D format. Our comparative analysis revealed that MRI-to-CT generative models benefit from multi-channel conditional input and using cDDPM as an architecture.
zh
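上文提出的 SIMOS 用于衡量把 2D 切片拼回 3D 体数据时相邻横断面之间的连续性,但摘要未给出具体公式。下面是一个本文假设的朴素替代:对相邻切片计算皮尔逊相关并取平均,仅用于说明"逐切片连续性指标"的形态,并非论文原始定义。

```python
# 示意:相邻切片连续性的朴素度量(非论文 SIMOS 的原始定义)
import numpy as np

def slice_continuity(volume):
    # volume: (Z, H, W) 的体数据,逐对相邻切片计算相关系数后取平均
    scores = []
    for z in range(volume.shape[0] - 1):
        a, b = volume[z].ravel(), volume[z + 1].ravel()
        scores.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(scores))

vol = np.cumsum(np.random.default_rng(0).standard_normal((8, 32, 32)), axis=0)
print(slice_continuity(vol))   # 值越高表示切片间越连续
```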
[CV-172] Patch-Based Diffusion for Data-Efficient Radiologist-Preferred MRI Reconstruction
链接: https://arxiv.org/abs/2509.21531
作者: Rohan Sanda,Asad Aali,Andrew Johnston,Eduardo Reis,Jonathan Singh,Gordon Wetzstein,Sara Fridovich-Keil
机构: Stanford University (斯坦福大学); Stanford University School of Medicine (斯坦福大学医学院); Kaiser Permanente (凯撒医疗集团); Georgia Institute of Technology (佐治亚理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL
人工智能
[AI-0] Learning Admissible Heuristics for A*: Theory and Practice
【速读】:该论文旨在解决搜索算法(如A*)中启发式函数(heuristic functions)的两个关键问题:一是传统深度学习方法往往忽视启发式函数的可采纳性(admissibility),即不保证不高估真实最短路径代价,从而影响解的最优性;二是现有方法在训练数据之外的泛化能力有限。解决方案的关键在于将启发式学习建模为带约束的优化问题,并提出交叉熵可采纳性(Cross-Entropy Admissibility, CEA)损失函数,在训练过程中显式强制启发式函数满足可采纳性。此外,论文通过引入压缩模式数据库(compressed pattern database, PDB)抽象和图结构特性,理论分析了学习启发式的样本复杂度,证明使用ReLU神经网络时,泛化界主要依赖于网络宽度与深度而非图规模,从而为基于神经网络的启发式函数提供了首个目标依赖型启发式的一般化保证。
链接: https://arxiv.org/abs/2509.22626
作者: Ehsan Futuhi,Nathan R. Sturtevant
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Heuristic functions are central to the performance of search algorithms such as A-star, where admissibility - the property of never overestimating the true shortest-path cost - guarantees solution optimality. Recent deep learning approaches often disregard admissibility and provide limited guarantees on generalization beyond the training data. This paper addresses both of these limitations. First, we pose heuristic learning as a constrained optimization problem and introduce Cross-Entropy Admissibility (CEA), a loss function that enforces admissibility during training. On the Rubik’s Cube domain, this method yields near-admissible heuristics with significantly stronger guidance than compressed pattern database (PDB) heuristics. Theoretically, we study the sample complexity of learning heuristics. By leveraging PDB abstractions and the structural properties of graphs such as the Rubik’s Cube, we tighten the bound on the number of training samples needed for A-star to generalize. Replacing a general hypothesis class with a ReLU neural network gives bounds that depend primarily on the network’s width and depth, rather than on graph size. Using the same network, we also provide the first generalization guarantees for goal-dependent heuristics.
zh
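上文把启发式学习写成带可采纳性约束的优化问题;CEA 损失的精确形式摘要未给出。下面给出一个本文假设的示意:对离散代价桶做交叉熵,并对"预测启发值超过真实代价 h*(s)"的部分加铰链惩罚,以体现"训练中强制可采纳性"的思路,并非论文 CEA 原式。

```python
# 示意:交叉熵 + 过估计惩罚的可采纳性感知损失(形式为本文假设,非论文 CEA 原式)
import torch
import torch.nn.functional as F

def admissibility_aware_loss(logits, true_cost, cost_values, lam=1.0):
    # logits: (B, C) 对离散代价桶的打分;cost_values: (C,) 各桶代价;true_cost: (B,) 真实最短路代价
    target_bin = torch.bucketize(true_cost, cost_values.contiguous()).clamp(max=cost_values.numel() - 1)
    ce = F.cross_entropy(logits, target_bin)                        # 常规交叉熵项
    h_pred = (F.softmax(logits, dim=-1) * cost_values).sum(-1)      # 预测启发值的期望
    overestimate = F.relu(h_pred - true_cost)                       # 仅惩罚 h(s) > h*(s) 的部分
    return ce + lam * overestimate.mean()

logits = torch.randn(4, 10, requires_grad=True)
cost_values = torch.arange(10, dtype=torch.float)
true_cost = torch.tensor([2.0, 5.0, 7.0, 3.0])
loss = admissibility_aware_loss(logits, true_cost, cost_values)
loss.backward()
print(loss.item())
```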
[AI-1] A Theoretical Analysis of Discrete Flow Matching Generative Models
【速读】:该论文旨在解决离散生成模型中分布估计误差的理论保障问题,特别是针对端到端训练的离散流匹配(Discrete Flow Matching, DFM)方法。其核心挑战在于如何从理论上证明DFM生成的分布能够随着训练数据规模的增加而收敛至真实数据分布。解决方案的关键在于建立一个清晰的误差传递链条:首先证明生成分布与目标分布之间的总变差距离受学习到的速度场风险控制;随后将该风险分解为两个主要来源——近似误差(由Transformer架构对真实速度场的表示能力决定)和估计误差(由有限数据集训练带来的统计收敛速率决定);最终通过组合这两个误差界,首次提供了DFM模型生成分布收敛性的形式化证明。
链接: https://arxiv.org/abs/2509.22623
作者: Maojiang Su,Mingcheng Lu,Jerry Yao-Chieh Hu,Shang Wu,Zhao Song,Alex Reneau,Han Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:We provide a theoretical analysis for end-to-end training Discrete Flow Matching (DFM) generative models. DFM is a promising discrete generative modeling framework that learns the underlying generative dynamics by training a neural network to approximate the transformative velocity field. Our analysis establishes a clear chain of guarantees by decomposing the final distribution estimation error. We first prove that the total variation distance between the generated and target distributions is controlled by the risk of the learned velocity field. We then bound this risk by analyzing its two primary sources: (i) Approximation Error, where we quantify the capacity of the Transformer architecture to represent the true velocity, and (ii) Estimation Error, where we derive statistical convergence rates that bound the error from training on a finite dataset. By composing these results, we provide the first formal proof that the distribution generated by a trained DFM model provably converges to the true data distribution as the training set size increases.
zh
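把上文摘要中的误差链用示意性的不等式写出来(常数、范数与具体函数形式从略,仅表达"受控于"的依赖关系,并非论文原式):

```latex
% 示意:DFM 分布估计误差的分解链(符号与常数为本文假设)
\mathrm{TV}\bigl(p_{\hat v},\, p_{\mathrm{data}}\bigr)
  \;\lesssim\; \mathcal{R}(\hat v)
  \;\lesssim\; \underbrace{\varepsilon_{\mathrm{approx}}}_{\text{Transformer 的近似误差}}
  \;+\; \underbrace{\varepsilon_{\mathrm{est}}(n)}_{\text{有限样本的估计误差},\ n\to\infty\ \text{时趋于 } 0}
```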
[AI-2] Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)方法在提升大语言模型(Large Language Models, LLMs)规划能力时缺乏理论基础的问题。其核心挑战在于理解RL为何有效以及其局限性,尤其是在处理策略偏差、多样性丧失和奖励欺骗(reward hacking)等方面。解决方案的关键在于构建一个可 tractable 的图抽象模型,通过理论分析揭示:监督微调(Supervised Fine-Tuning, SFT)易引入共现相关的虚假解,而RL主要依赖探索实现正确规划,凸显探索对泛化性能的重要性;进一步发现策略梯度(Policy Gradient, PG)存在多样性崩溃问题,而Q-learning则具备离策略学习与收敛时保持输出多样性的优势;此外,强调精心设计奖励函数是避免Q-learning中奖励欺骗的关键。该框架在真实世界规划基准Blocksworld上的实证验证了上述理论行为的实践表现。
链接: https://arxiv.org/abs/2509.22613
作者: Siwei Wang,Yifei Shen,Haoran Sun,Shi Feng,Shang-Hua Teng,Li Dong,Yaru Hao,Wei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL’s benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration’s role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
zh
[AI-3] Quantile Advantage Estimation for Entropy-Safe Reasoning
【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中出现的熵崩溃(entropy collapse)与熵爆炸(entropy explosion)问题,其根源在于传统无价值强化学习方法(如GRPO和DAPO)所采用的均值基线(mean baseline)在存在奖励异常值时对负优势样本施加了不当惩罚。解决方案的关键在于提出分位数优势估计(Quantile Advantage Estimation, QAE),用分组K-分位数基线替代均值基线,从而在响应层面引入双 regimes 门控机制:对于困难查询(p = 1 - K)增强罕见成功案例,对于简单查询(p < 1 - K)聚焦剩余失败案例。理论证明表明,在一阶Softmax更新下,QAE可实现双向熵安全性,提供单步熵变化的上下界以抑制爆炸并防止崩溃;实证结果进一步显示,该最小改动显著稳定了熵动态、稀疏化信用分配(约80%响应获得零优势),并在Qwen3-8B/14B-Base模型上持续提升pass@1指标,揭示基线设计而非词元级启发式策略才是RLVR扩展的核心机制。
链接: https://arxiv.org/abs/2509.22611
作者: Junkang Wu,Kexin Huang,Jiancan Wu,An Zhang,Xiang Wang,Xiangnan He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p ≤ 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design – rather than token-level heuristics – as the primary mechanism for scaling RLVR.
zh
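QAE 的核心改动可以用几行代码说清楚:把组内均值基线换成 K 分位数基线。下面的小例子对应摘要中的两种情形(难题保留稀有成功、易题聚焦剩余失败);分位数插值方式与 K 的取值为本文假设。

```python
# 示意:分位数优势估计(QAE)的基线替换
import torch

def quantile_advantage(rewards, k=0.8):
    # rewards: (G,) 同一 query 下 G 个采样回复的奖励(例如 0/1 验证结果)
    baseline = torch.quantile(rewards, k)        # 组内 K 分位数基线
    return rewards - baseline

hard_query = torch.tensor([0., 0., 0., 1., 0., 0., 0., 0.])   # 难题:成功率低
easy_query = torch.tensor([1., 1., 1., 0., 1., 1., 1., 1.])   # 易题:成功率高
print(quantile_advantage(hard_query))   # 基线为 0:稀有成功获得正优势,失败样本不被额外惩罚
print(quantile_advantage(easy_query))   # 基线为 1:只有剩余失败样本获得负优势
```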
[AI-4] UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration
链接: https://arxiv.org/abs/2509.22570
作者: Qi Mao,Tinghan Yang,Jiahao Li,Bin Li,Libiao Jin,Yan Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-5] From Parameters to Behavior: Unsupervised Compression of the Policy Space
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中策略优化的样本效率低下问题,尤其在多任务场景下更为显著。其核心挑战在于直接在高维且高度冗余的策略参数空间 Θ 中进行优化,导致学习过程低效。解决方案的关键在于提出一种无监督方法,将原始策略参数空间 Θ 压缩至一个低维潜在空间 Z,并通过训练生成模型 g:Z→Θ 来最小化行为重建损失(behavioral reconstruction loss),从而使得潜在空间 Z 按功能相似性组织而非依赖于参数空间中的几何邻近性。这一机制揭示了潜在流形的固有维度可能由环境复杂度决定,而非策略网络规模,并实验证明可在保留策略表达能力的同时实现高达五数量级的压缩,同时支持在潜在空间中通过策略梯度(Policy Gradient)实现任务特异性适应。
链接: https://arxiv.org/abs/2509.22566
作者: Davide Tenedini,Riccardo Zamboni,Mirco Mutti,Marcello Restelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient. We argue that this inefficiency stems from the standard practice of optimizing policies directly in the high-dimensional and highly redundant parameter space \Theta. This challenge is greatly compounded in multi-task settings. In this work, we develop a novel, unsupervised approach that compresses the policy parameter space \Theta into a low-dimensional latent space \mathcal{Z}. We train a generative model g:\mathcal{Z}\to\Theta by optimizing a behavioral reconstruction loss, which ensures that the latent space is organized by functional similarity rather than proximity in parameterization. We conjecture that the inherent dimensionality of this manifold is a function of the environment’s complexity, rather than the size of the policy network. We validate our approach in continuous control domains, showing that the parameterization of standard policy networks can be compressed up to five orders of magnitude while retaining most of its expressivity. As a byproduct, we show that the learned manifold enables task-specific adaptation via Policy Gradient operating in the latent space \mathcal{Z}.
zh
[AI-6] StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在运筹学(Operations Research, OR)问题求解中面临的两大局限:一是结果奖励(outcome reward)存在信用分配难题(credit assignment problem),即正确最终答案可能强化错误的推理路径;二是传统判别式过程监督(discriminative process supervision)具有短视性,无法对OR建模中的步骤间依赖关系进行整体评估。解决方案的关键在于提出StepORLM框架,其核心是一个由策略模型与生成式过程奖励模型(Generative Process Reward Model, GenPRM)构成的协同进化循环,通过双重反馈机制驱动优化:外部求解器提供的确定性结果验证和GenPRM提供的细粒度过程评价共同作用,利用加权直接偏好优化(Weighted Direct Preference Optimization, W-DPO)对策略模型进行对齐,并同步更新GenPRM,从而实现更高效、可靠的OR问题求解能力。
链接: https://arxiv.org/abs/2509.22558
作者: Chenyu Zhou,Tianyi Xu,Jianghao Lin,Dongdong Ge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown promising capabilities for solving Operations Research (OR) problems. While reinforcement learning serves as a powerful paradigm for LLM training on OR problems, existing works generally face two key limitations. First, outcome reward suffers from the credit assignment problem, where correct final answers can reinforce flawed reasoning. Second, conventional discriminative process supervision is myopic, failing to evaluate the interdependent steps of OR modeling holistically. To this end, we introduce StepORLM, a novel self-evolving framework with generative process supervision. At its core, StepORLM features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other. This loop is driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver, and nuanced, holistic process evaluation from the GenPRM. The combined signal is used to align the policy via Weighted Direct Preference Optimization (W-DPO) and simultaneously refine the GenPRM. Our resulting 8B-parameter StepORLM establishes a new state-of-the-art across six benchmarks, significantly outperforming vastly larger generalist models, agentic methods, and specialized baselines. Moreover, the co-evolved GenPRM is able to act as a powerful and universally applicable process verifier, substantially boosting the inference scaling performance of both our own model and other existing LLMs.
zh
[AI-7] he Emergence of Altruism in Large-Language-Model Agents Society
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的社会模拟研究中,对大规模代理社会中利他行为(altruism)如何涌现缺乏理解的问题。现有研究多集中于小规模、任务导向的博弈场景中的合作机制,忽视了在复杂社会系统中个体如何在自利(egoistic)与利他(altruistic)目标之间权衡。为填补这一空白,作者提出了一种改进的谢林(Schelling)城市迁移模型,构建了一个社会困境环境,使超过200个LLM代理必须在个人效用与系统效用之间做出选择。其关键解决方案在于识别出两类具有本质差异的代理类型:一类是“适应性利己者”(Adaptive Egoists),默认优先自我利益,但在受到社会规范引导后显著提升利他行为;另一类是“利他优化者”(Altruistic Optimizers),表现出内在的利他逻辑,即使牺牲自身利益也持续追求集体福祉。这一发现揭示了不同LLM在社会倾向上的内禀异质性,表明社会模拟中模型选择的核心不应仅关注推理能力,而应聚焦于其内在的社会行动逻辑。
链接: https://arxiv.org/abs/2509.22537
作者: Haoyang Li,Xiao Jia,Zhanzhan Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Leveraging Large Language Models (LLMs) for social simulation is a frontier in computational social science. Understanding the social logics these agents embody is critical to this attempt. However, existing research has primarily focused on cooperation in small-scale, task-oriented games, overlooking how altruism, which means sacrificing self-interest for collective benefit, emerges in large-scale agent societies. To address this gap, we introduce a Schelling-variant urban migration model that creates a social dilemma, compelling over 200 LLM agents to navigate an explicit conflict between egoistic (personal utility) and altruistic (system utility) goals. Our central finding is a fundamental difference in the social tendencies of LLMs. We identify two distinct archetypes: “Adaptive Egoists”, which default to prioritizing self-interest but whose altruistic behaviors significantly increase under the influence of a social norm-setting message board; and “Altruistic Optimizers”, which exhibit an inherent altruistic logic, consistently prioritizing collective benefit even at a direct cost to themselves. Furthermore, to qualitatively analyze the cognitive underpinnings of these decisions, we introduce a method inspired by Grounded Theory to systematically code agent reasoning. In summary, this research provides the first evidence of intrinsic heterogeneity in the egoistic and altruistic tendencies of different LLMs. We propose that for social simulation, model selection is not merely a matter of choosing reasoning capability, but of choosing an intrinsic social action logic. While “Adaptive Egoists” may offer a more suitable choice for simulating complex human societies, “Altruistic Optimizers” are better suited for modeling idealized pro-social actors or scenarios where collective welfare is the primary consideration.
zh
[AI-8] REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂推理过程中失败机制难以解释的问题,尤其关注如何从内部表示的角度量化和定位推理错误的起源。其解决方案的关键在于提出“推理流形”(Reasoning Manifold)这一概念,即由所有正确推理生成对应的内部表征构成的低维几何结构,并基于此构建REMA框架:通过计算错误样本表征到正确表征流形的k近邻距离来量化几何偏差,进而沿模型各层追踪该偏差并对比正确推理的基线波动,从而精确定位推理链偏离正常路径的分叉点。该方法将抽象的推理失败映射为可测量的表征空间几何偏移,为黑箱模型的内部计算过程提供了新的诊断工具。
链接: https://arxiv.org/abs/2509.22518
作者: Bo Li,Guanzhi Deng,Ronghao Chen,Junrong Yue,Shuo Zhang,Qinghua Zhao,Linqi Song,Lijie Wen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model’s layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.
zh
[AI-9] TrueGradeAI: Retrieval-Augmented and Bias-Resistant AI for Transparent and Explainable Digital Assessments
【速读】:该论文旨在解决传统纸质考试中存在的诸多问题,包括纸张浪费、后勤复杂性高、评分延迟以及评分者偏见等。其解决方案的核心在于提出一个名为TrueGradeAI的AI驱动数字考试框架,该框架通过在安全平板设备上捕捉手写笔迹并利用基于Transformer的光学字符识别(Optical Character Recognition, OCR)技术进行转录,从而保留自然书写特征;同时采用检索增强型评分管道,融合教师参考答案、缓存层和外部资源,使大语言模型能够基于可解释的、证据关联的推理过程进行评分。该方案不仅实现了评分自动化与透明化,还通过可审计的评分轨迹有效缓解了评分偏见,提升了评估公平性与效率。
链接: https://arxiv.org/abs/2509.22516
作者: Rakesh Thakur,Shivaansh Kaushik,Gauri Chopra,Harsh Rohilla
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper introduces TrueGradeAI, an AI-driven digital examination framework designed to overcome the shortcomings of traditional paper-based assessments, including excessive paper usage, logistical complexity, grading delays, and evaluator bias. The system preserves natural handwriting by capturing stylus input on secure tablets and applying transformer-based optical character recognition for transcription. Evaluation is conducted through a retrieval-augmented pipeline that integrates faculty solutions, cache layers, and external references, enabling a large language model to assign scores with explicit, evidence-linked reasoning. Unlike prior tablet-based exam systems that primarily digitize responses, TrueGradeAI advances the field by incorporating explainable automation, bias mitigation, and auditable grading trails. By uniting handwriting preservation with scalable and transparent evaluation, the framework reduces environmental costs, accelerates feedback cycles, and progressively builds a reusable knowledge base, while actively working to mitigate grading bias and ensure fairness in assessment.
zh
[AI-10] Estimating the Empowerment of Language Model Agents ICLR2026
链接: https://arxiv.org/abs/2509.22504
作者: Jinyeop Song,Jeff Gore,Max Kleiman-Weiner
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 8 figures. Submitted to ICLR 2026
[AI-11] InfiAgent: Self-Evolving Pyramid Agent Framework for Infinite Scenarios ICLR2026
链接: https://arxiv.org/abs/2509.22502
作者: Chenglin Yu,Yang Yu,Songmiao Wang,Yucheng Wang,Yifan Yang,Jinjia Li,Ming Li,Hongxia Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages of main content and 32 pages of others, 2 figures, under review as a conference paper at ICLR 2026
[AI-12] Ontological foundations for contrastive explanatory narration of robot plans
【速读】:该论文旨在解决机器人在人机交互中因决策过程不透明而导致信任度不足的问题,核心在于提升机器人对竞争性行动计划的比较与解释能力。解决方案的关键在于提出一种新颖的本体模型(ontological model),用于形式化建模和推理不同计划间的差异,并据此分类出最优方案(如最短路径、最安全路径或最符合人类偏好的方案);同时设计了一种基于计划间差异知识的新算法,通过构建对比叙事(contrastive narratives)来增强解释的合理性与可理解性,从而显著优于基线方法。
链接: https://arxiv.org/abs/2509.22493
作者: Alberto Olivares-Alarcos,Sergi Foix,Júlia Borràs,Gerard Canal,Guillem Alenyà
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Logic in Computer Science (cs.LO)
备注: This version was submitted to the journal Information Sciences and is under review since October 2024
点击查看摘要
Abstract:Mutual understanding of artificial agents’ decisions is key to ensuring a trustworthy and successful human-robot interaction. Hence, robots are expected to make reasonable decisions and communicate them to humans when needed. In this article, the focus is on an approach to modeling and reasoning about the comparison of two competing plans, so that robots can later explain the divergent result. First, a novel ontological model is proposed to formalize and reason about the differences between competing plans, enabling the classification of the most appropriate one (e.g., the shortest, the safest, the closest to human preferences, etc.). This work also investigates the limitations of a baseline algorithm for ontology-based explanatory narration. To address these limitations, a novel algorithm is presented, leveraging divergent knowledge between plans and facilitating the construction of contrastive narratives. Through empirical evaluation, it is observed that the explanations excel beyond the baseline method.
zh
[AI-13] A Machine Learning Pipeline for Multiple Sclerosis Biomarker Discovery: Comparing explainable AI and Traditional Statistical Approaches
链接: https://arxiv.org/abs/2509.22484
作者: Samuele Punzo,Silvia Giulia Galfrè,Francesco Massafra,Alessandro Maglione,Corrado Priami,Alina Sîrbu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Short paper presented at the 20th conference on Computational Intelligence methods for Bioinformatics and Biostatistics (CIBB2025)
[AI-14] OFMU: Optimization-Driven Framework for Machine Unlearning ICLR2026
【速读】:该论文旨在解决机器遗忘(machine unlearning)问题,即在敏感应用场景中,如何高效地移除模型对特定数据(如用户请求、版权内容或过时信息)的记忆影响,同时保持模型在剩余数据上的性能表现,且无需从头重新训练。传统方法通常将遗忘与保留视为多目标优化问题并采用加权求和的方式转化为单目标问题,但这种方法因梯度方向冲突导致训练不稳定且模型效用下降。论文提出了一种基于惩罚的双层优化框架OFMU(Optimization Framework for Machine Unlearning),其核心创新在于通过分层结构显式优先保障遗忘:内层最大化步骤引入一种相似性感知惩罚项以解耦遗忘与保留目标的梯度方向,外层最小化步骤恢复模型性能;该框架设计了两层循环算法,在凸与非凸场景下均具备收敛性保证,并在多个视觉与语言基准测试中验证了其在遗忘效果与保留效用之间更优的平衡能力。
链接: https://arxiv.org/abs/2509.22483
作者: Sadia Asif,Mohammad Mohammadi Amiri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review at ICLR 2026
点击查看摘要
Abstract:Large language models deployed in sensitive applications increasingly require the ability to unlearn specific knowledge, such as user requests, copyrighted materials, or outdated information, without retraining from scratch to ensure regulatory compliance, user privacy, and safety. This task, known as machine unlearning, aims to remove the influence of targeted data (forgetting) while maintaining performance on the remaining data (retention). A common approach is to formulate this as a multi-objective problem and reduce it to a single-objective problem via scalarization, where forgetting and retention losses are combined using a weighted sum. However, this often results in unstable training dynamics and degraded model utility due to conflicting gradient directions. To address these challenges, we propose OFMU, a penalty-based bi-level optimization framework that explicitly prioritizes forgetting while preserving retention through a hierarchical structure. Our method enforces forgetting via an inner maximization step that incorporates a similarity-aware penalty to decorrelate the gradients of the forget and retention objectives, and restores utility through an outer minimization step. To ensure scalability, we develop a two-loop algorithm with provable convergence guarantees under both convex and non-convex regimes. We further provide a rigorous theoretical analysis of convergence rates and show that our approach achieves better trade-offs between forgetting efficacy and model utility compared to prior methods. Extensive experiments across vision and language benchmarks demonstrate that OFMU consistently outperforms existing unlearning methods in both forgetting efficacy and retained utility.
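以下是对 OFMU 双层结构的一个高度简化的 PyTorch 草图,仅供理解摘要思想:内层在遗忘目标上做梯度上升,并以投影方式去除其梯度中与保留梯度对齐的分量(作为“相似性感知惩罚去相关”的一种近似),外层在保留数据上做常规下降以恢复效用。函数名、学习率与损失形式均为假设,并非论文算法的精确复现。

```python
import torch
import torch.nn as nn

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def ofmu_style_step(model, forget_batch, retain_batch, loss_fn, lr_in=1e-2, lr_out=1e-2):
    """简化示意:内层步在“遗忘”目标上做上升,并去除其梯度中与“保留”梯度对齐的分量;
    外层步在保留数据上做下降。这只是对摘要思想的假设性演绎。"""
    params = [p for p in model.parameters() if p.requires_grad]
    g_f = flat_grad(loss_fn(model, forget_batch), params)
    g_r = flat_grad(loss_fn(model, retain_batch), params)
    g_f = g_f - (g_f @ g_r) / (g_r @ g_r + 1e-12) * g_r       # 去相关:投影掉与 g_r 同向的部分
    with torch.no_grad():
        offset = 0
        for p in params:                                        # 内层:沿 +g_f 上升(增大遗忘损失)
            n = p.numel()
            p.add_(lr_in * g_f[offset:offset + n].view_as(p)); offset += n
    retain_loss = loss_fn(model, retain_batch)                  # 外层:常规下降恢复保留性能
    model.zero_grad(); retain_loss.backward()
    with torch.no_grad():
        for p in params:
            p.sub_(lr_out * p.grad)

model = nn.Linear(8, 2)
loss_fn = lambda m, batch: nn.functional.cross_entropy(m(batch[0]), batch[1])
forget = (torch.randn(16, 8), torch.randint(0, 2, (16,)))
retain = (torch.randn(16, 8), torch.randint(0, 2, (16,)))
ofmu_style_step(model, forget, retain, loss_fn)
```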
zh
[AI-15] Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining
【速读】:该论文旨在解决分子表示学习中因缺乏高质量标注数据而导致的属性预测与分子设计性能受限的问题,尤其针对现有自监督预训练方法依赖手工设计的数据增强或复杂生成目标、且仅利用二维拓扑信息而忽视三维结构信息的局限性。其解决方案的关键在于提出C-FREE(Contrast-Free Representation learning on Ego-nets),一种无需负样本、无需位置编码且无需昂贵预处理的简单框架:通过固定半径的 ego-net 作为建模单元,在多个3D构象中提取互补邻域信息,以子图嵌入预测的方式联合学习2D图结构与3D几何信息;该方法在混合图神经网络(GNN)-Transformer骨干网络中实现几何与拓扑信息的融合,从而显著提升分子表示的质量和跨域迁移能力。
链接: https://arxiv.org/abs/2509.22468
作者: Boshra Ariguib,Mathias Niepert,Andrei Manolache
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.
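下面用 networkx 给出一个小示意,说明固定半径 ego-net 与其互补邻域的划分方式(C-FREE 的训练目标即由互补邻域的表征预测该子图的嵌入);图结构与半径取值均为演示假设,未涉及论文中的 3D 构象与 GNN-Transformer 主干。

```python
import networkx as nx

def ego_net_and_complement(G: nx.Graph, center, radius=1):
    """以固定半径提取 ego-net 子图,并返回其在分子图中的互补邻域节点集。"""
    ego = nx.ego_graph(G, center, radius=radius)
    complement_nodes = set(G.nodes) - set(ego.nodes)
    return ego, complement_nodes

# 玩具分子图:简单链状骨架(真实场景中节点为原子、边为化学键,并附带 3D 构象坐标)
G = nx.path_graph(8)
ego, rest = ego_net_and_complement(G, center=3, radius=1)
print("ego-net 节点:", sorted(ego.nodes), " 互补邻域:", sorted(rest))
```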
zh
[AI-16] GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在几何问题求解(Geometric Problem Solving, GPS)中面临的挑战,即如何实现对文本与图形的联合理解,并支持迭代式的、动态的视觉空间推理。现有方法将图形视为静态图像,缺乏对人类几何推理中关键操作(如辅助线构造和仿射变换)的动态交互能力。其解决方案的核心在于提出GeoSketch——一个神经符号框架,通过构建感知-推理-动作的闭环系统:首先由感知模块将图形抽象为结构化逻辑形式,接着符号推理模块基于几何定理决定下一步推导,最后绘图动作模块执行具体操作(如画辅助线或变换图形),从而形成可执行、可验证的动态交互过程。该框架通过两阶段训练策略(监督微调+强化学习)提升鲁棒性和探索能力,显著优于静态感知方法,在新提出的GeoSketch Benchmark上验证了其在步骤准确率和问题解决成功率上的优势。
链接: https://arxiv.org/abs/2509.22460
作者: Shichao Weng,Zhiqiang Wang,Yuhua Zhou,Rui Lu,Ting Liu,Zhiyang Teng,Xiaozhang Liu,Hanmeng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. While existing approaches process diagrams as static images, they lack the capacity for dynamic manipulation - a core aspect of human geometric reasoning involving auxiliary line construction and affine transformations. We present GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolic-curated trajectories followed by reinforcement learning with dense, symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems requiring auxiliary construction or affine transformations. Experiments on strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems.
zh
[AI-17] Physics-informed GNN for medium-high voltage AC power flow with edge-aware attention and line search correction operator ICASSP2026
【速读】:该论文旨在解决物理信息图神经网络(Physics-informed Graph Neural Networks, PIGNNs)在实际电力系统应用中精度不足的问题,特别是现有方法在推理阶段无法有效利用物理损失函数,从而限制了其在运行场景中的采纳。解决方案的关键在于提出 PIGNN-Attn-LS 模型,其核心创新包括:1)引入边感知注意力机制(edge-aware attention mechanism),通过每条线路的偏置项显式编码输电线路的物理特性,捕捉电网的各向异性;2)设计基于回溯线搜索(backtracking line-search)的全局校正算子,确保推理阶段仍能维持有效的下降准则,从而提升解的准确性。该方法在高/中压电网场景下实现了显著优于基线模型的精度和速度优势。
链接: https://arxiv.org/abs/2509.22458
作者: Changhun Kim,Timon Conrad,Redwanul Karim,Julian Oelhaf,David Riebesel,Tomás Arias-Vergara,Andreas Maier,Johann Jäger,Siming Bayer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures. Submitted to ICASSP 2026. Code available at this https URL
点击查看摘要
Abstract:Physics-informed graph neural networks (PIGNNs) have emerged as fast AC power-flow solvers that can replace classic Newton–Raphson (NR) solvers, especially when thousands of scenarios must be evaluated. However, current PIGNNs still need accuracy improvements at parity speed; in particular, the physics loss is inoperative at inference, which can deter operational adoption. We address this with PIGNN-Attn-LS, combining an edge-aware attention mechanism that explicitly encodes line physics via per-edge biases, capturing the grid’s anisotropy, with a backtracking line-search-based globalized correction operator that restores an operative decrease criterion at inference. Training and testing use a realistic High-/Medium-Voltage scenario generator, with NR used only to construct reference states. On held-out HV cases consisting of 4–32-bus grids, PIGNN-Attn-LS achieves a test RMSE of 0.00033 p.u. in voltage and 0.08° in angle, outperforming the PIGNN-MLP baseline by 99.5% and 87.1%, respectively. With streaming micro-batches, it delivers 2–5× faster batched inference than NR on 4–1024-bus grids.
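摘要中的“回溯线搜索全局化校正算子”本质上依赖 Armijo 型充分下降准则。下面给出一个通用的回溯线搜索示意(数值梯度、玩具目标函数均为假设),仅说明该准则如何在推理阶段保证目标值下降,并非论文中针对潮流方程的具体实现。

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """中心差分数值梯度,仅用于演示。"""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def backtracking_line_search(f, x, d, alpha=1.0, beta=0.5, c=1e-4, max_iter=20):
    """Armijo 回溯线搜索:不断缩小步长,直到满足充分下降条件
    f(x + alpha*d) <= f(x) + c * alpha * g^T d。"""
    fx = f(x)
    slope = numerical_grad(f, x) @ d
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c * alpha * slope:
            return alpha
        alpha *= beta
    return alpha

# 玩具残差函数(代替潮流失配):f(x) = 0.5 * ||x||^2
f = lambda x: 0.5 * float(x @ x)
x = np.array([1.0, -2.0, 0.5])
d = -numerical_grad(f, x)                  # 下降方向:负梯度
step = backtracking_line_search(f, x, d)
print("选取步长:", step, " 新目标值:", f(x + step * d))
```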
zh
[AI-18] Guiding Evolution of Artificial Life Using Vision-Language Models
链接: https://arxiv.org/abs/2509.22447
作者: Nikhil Baid,Hannah Erlebach,Paul Hellegouarch,Frederico Wieser
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 9 pages, 6 figures. Accepted for publication in the Proceedings of the Artificial Life Conference 2025 (MIT Press)
[AI-19] Learning to Ball: Composing Policies for Long-Horizon Basketball Moves SIGGRAPH
【速读】:该论文旨在解决多阶段、长时程任务(如篮球动作)中策略组合与技能过渡的挑战,尤其是在个体策略之间缺乏显著共享状态空间或中间状态定义不明确的情况下。其核心解决方案是提出一种新颖的策略集成框架,结合一个高层软路由器(high-level soft router),实现不同运动技能在多阶段任务中的无缝且鲁棒的切换,从而支持基于实时用户指令完成复杂交互任务,而无需依赖球体轨迹参考信息。
链接: https://arxiv.org/abs/2509.22442
作者: Pei Xu,Zhen Wu,Ruocheng Wang,Vishnu Sarukkai,Kayvon Fatahalian,Ioannis Karamouzas,Victor Zordan,C. Karen Liu
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2025). Website: this http URL . Video: this https URL . Code: this https URL
点击查看摘要
Abstract:Learning a control policy for a multi-phase, long-horizon task, such as basketball maneuvers, remains challenging for reinforcement learning approaches due to the need for seamless policy composition and transitions between skills. A long-horizon task typically consists of distinct subtasks with well-defined goals, separated by transitional subtasks with unclear goals but critical to the success of the entire task. Existing methods like the mixture of experts and skill chaining struggle with tasks where individual policies do not share significant commonly explored states or lack well-defined initial and terminal states between different phases. In this paper, we introduce a novel policy integration framework to enable the composition of drastically different motor skills in multi-phase long-horizon tasks with ill-defined intermediate states. Based on that, we further introduce a high-level soft router to enable seamless and robust transitions between the subtasks. We evaluate our framework on a set of fundamental basketball skills and challenging transitions. Policies trained by our approach can effectively control the simulated character to interact with the ball and accomplish the long-horizon task specified by real-time user commands, without relying on ball trajectory references.
zh
[AI-20] Global Convergence in Neural ODEs: Impact of Activation Functions ICLR2025
【速读】:该论文旨在解决神经微分方程(Neural Ordinary Differential Equations, Neural ODEs)在训练过程中因连续性与参数共享特性所带来的梯度计算精度不足和收敛性分析困难的问题。其解决方案的关键在于系统性地分析激活函数的性质对训练动态的影响:首先,平滑的激活函数可确保前向与反向ODE均存在全局唯一解;其次,足够的非线性有助于维持训练过程中神经切线核(Neural Tangent Kernel, NTK)的谱特性。这两个性质共同保障了在过参数化条件下,基于梯度下降法的全局收敛性,从而为实际应用中加速训练并提升性能提供了理论依据与实践指导。
链接: https://arxiv.org/abs/2509.22436
作者: Tianxiang Gao,Siyuan Sun,Hailiang Liu,Hongyang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ICLR 2025 (Oral)
点击查看摘要
Abstract:Neural Ordinary Differential Equations (ODEs) have been successful in various applications due to their continuous nature and parameter-sharing efficiency. However, these unique characteristics also introduce challenges in training, particularly with respect to gradient computation accuracy and convergence analysis. In this paper, we address these challenges by investigating the impact of activation functions. We demonstrate that the properties of activation functions, specifically smoothness and nonlinearity, are critical to the training dynamics. Smooth activation functions guarantee globally unique solutions for both forward and backward ODEs, while sufficient nonlinearity is essential for maintaining the spectral properties of the Neural Tangent Kernel (NTK) during training. Together, these properties enable us to establish the global convergence of Neural ODEs under gradient descent in overparameterized regimes. Our theoretical findings are validated by numerical experiments, which not only support our analysis but also provide practical guidelines for scaling Neural ODEs, potentially leading to faster training and improved performance in real-world applications.
zh
[AI-21] An Ontology for Unified Modeling of Tasks Actions Environments and Capabilities in Personal Service Robotics
链接: https://arxiv.org/abs/2509.22434
作者: Margherita Martorana,Francesca Urgese,Ilaria Tiddi,Stefan Schlobach
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
[AI-22] Partial Parameter Updates for Efficient Distributed Training
【速读】:该论文旨在解决分布式训练中通信开销过高导致的效率瓶颈问题。现有方法虽通过减少全局同步频率来降低通信量,但仍存在内存占用高和计算资源消耗大的缺陷。其解决方案的关键在于引入参数子集更新机制:在局部迭代过程中,每个节点仅对固定参数子集进行反向传播更新,其余参数保持冻结状态,从而显著降低峰值内存使用和训练浮点运算次数(FLOPs);同时,由于仍执行完整的前向传播以覆盖所有参数,无需跨节点交换激活值,进一步减少了通信需求。实验表明,在32个节点上训练13亿参数语言模型时,该方法在相同token和带宽预算下可达到与先前低通信方法相当的困惑度(perplexity),但训练FLOPs和峰值内存均明显下降。
链接: https://arxiv.org/abs/2509.22418
作者: Anastasiia Filippova,Angelos Katharopoulos,David Grangier,Ronan Collobert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a 1.3 B-parameter language model trained across 32 nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.
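下面的 PyTorch 草图演示该方法的核心操作:局部步内只让本节点负责的参数子集参与反向传播更新,其余参数冻结但仍参与前向计算。模型结构、子集划分与步数均为演示假设。

```python
import torch
import torch.nn as nn

def set_trainable_subset(model: nn.Module, subset: set):
    """局部迭代期间仅更新 subset 中的参数,其余参数冻结;
    前向仍覆盖全部参数,因此无需跨节点交换激活值。"""
    for name, p in model.named_parameters():
        p.requires_grad = name in subset

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
# 假设:本节点负责第一层(不同节点可分配不同的参数子集)
my_subset = {name for name, _ in model.named_parameters() if name.startswith("0.")}
set_trainable_subset(model, my_subset)

opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)
for _ in range(5):                         # 两次全局同步之间的局部步
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward()       # 参数梯度只为子集参数计算
    opt.step()
# 之后:与其他节点做一次全局参数同步(此处省略)
```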
zh
[AI-23] EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因缺乏大规模多样化真实数据而导致泛化能力受限的问题。其核心挑战在于收集覆盖不同物体外观和环境条件的真实世界机器人操作数据既耗时又昂贵。解决方案的关键在于提出Embodied Manipulation Media Adaptation (EMMA)框架,该框架包含两个核心技术:一是基于扩散Transformer的DreamTransfer方法,用于生成多视角一致、几何结构合理的具身操作视频,并支持文本控制的视觉编辑;二是AdaMix训练策略,通过动态重加权训练批次,聚焦于感知或运动学上更具挑战性的样本。实验表明,使用生成数据训练的VLA模型可在零样本视觉域迁移场景下实现超过200%的性能提升,进一步结合AdaMix后性能再提升13%,显著增强了策略的泛化能力。
链接: https://arxiv.org/abs/2509.22407
作者: Zhehao Dong,Xiaofeng Wang,Zheng Zhu,Yirui Wang,Yang Wang,Yukun Zhou,Boyuan Wang,Chaojun Ni,Runqi Ouyang,Wenkang Qin,Xinze Chen,Yun Ye,Guan Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization. However, collecting large-scale real-world robot manipulation data across varied object appearances and environmental conditions remains prohibitively time-consuming and expensive. To overcome this bottleneck, we propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that integrates a generative data engine with an effective training pipeline. We introduce DreamTransfer, a diffusion Transformer-based framework for generating multi-view consistent, geometrically grounded embodied manipulation videos. DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometrical plausibility. Furthermore, we explore hybrid training with real and generated data, and introduce AdaMix, a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples. Extensive experiments show that videos generated by DreamTransfer significantly outperform prior video generation methods in multi-view consistency, geometric fidelity, and text-conditioning accuracy. Crucially, VLAs trained with generated data enable robots to generalize to unseen object categories and novel visual domains using only demonstrations from a single appearance. In real-world robotic manipulation tasks with zero-shot visual domains, our approach achieves over a 200% relative performance gain compared to training on real data alone, and further improves by 13% with AdaMix, demonstrating its effectiveness in boosting policy generalization.
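AdaMix 的核心是按样本“难度”动态重加权训练批次。下面给出一个以逐样本损失作为难度信号的假设性简化(softmax 加权、温度参数均为演示设置),论文中的实际难度度量与加权形式可能不同。

```python
import torch

def adamix_style_weights(per_sample_losses, temperature=1.0):
    """按样本难度动态重加权:损失越大(感知或运动学上越“难”)的样本权重越高。
    这是对 AdaMix 思想的假设性简化。"""
    w = torch.softmax(per_sample_losses.detach() / temperature, dim=0)
    return w * len(per_sample_losses)            # 归一化到均值约为 1

losses = torch.tensor([0.2, 1.5, 0.4, 2.3])      # 一个 batch 内各轨迹片段的损失
weights = adamix_style_weights(losses)
weighted_loss = (weights * losses).mean()        # 用加权损失做反向传播
```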
zh
[AI-24] Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents
链接: https://arxiv.org/abs/2509.22391
作者: Jiaqi Shao,Yuxiang Lin,Munish Prasad Lohani,Yufeng Miao,Bing Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-25] SpinGPT: A Large-Language-Model Approach to Playing Poker Correctly
链接: https://arxiv.org/abs/2509.22387
作者: Narada Maugin,Tristan Cazenave
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Accepted at Advances in Computer Games (ACG) 2025, LNCS (Springer)
[AI-26] Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
链接: https://arxiv.org/abs/2509.22378
作者: Zijian Zhao,Dian Jin,Zijing Zhou
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:
[AI-27] Stochastic activations
【速读】:该论文旨在解决大语言模型中ReLU激活函数在优化过程中因负输入区域梯度流受限而导致的性能瓶颈问题(即ReLU的恒定形状特性阻碍了梯度传播)。解决方案的关键在于引入随机激活机制(stochastic activations),即在前向传播层中根据伯努利分布随机选择SILU或ReLU作为非线性激活函数。这一策略在预训练阶段使用随机激活以缓解优化困难,随后在微调阶段固定为ReLU用于推理,从而生成稀疏潜在向量并显著降低CPU上的计算FLOPs;同时,该方法还能在生成任务中提供可控的文本多样性提升,效果接近最优确定性非线性(如SILU结合温度缩放),且优于从头训练时直接使用ReLU的方案。
链接: https://arxiv.org/abs/2509.22358
作者: Maria Lomeli,Matthijs Douze,Gergely Szilvasy,Loic Cabannes,Jade Copet,Sainbayar Sukhbaatar,Jason Weston,Gabriel Synnaeve,Pierre-Emmanuel Mazaré,Hervé Jégou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
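下面是一个按摘要描述写出的极简 PyTorch 示意:训练时按伯努利抽样在 SiLU 与 ReLU 间切换,推理(eval)时固定为 ReLU 以得到稀疏激活。抽样粒度(逐层或逐次前向)与概率 p 为演示假设,并非论文官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticActivation(nn.Module):
    """训练时按伯努利概率 p_silu 在 SiLU 与 ReLU 之间随机选择;
    推理(eval)时固定使用 ReLU,以获得稀疏的中间激活。"""
    def __init__(self, p_silu: float = 0.5):
        super().__init__()
        self.p_silu = p_silu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.p_silu:
            return F.silu(x)
        return F.relu(x)

# 在一个简化的前馈块中使用
ffn = nn.Sequential(nn.Linear(64, 256), StochasticActivation(p_silu=0.5), nn.Linear(256, 64))
ffn.train()
y = ffn(torch.randn(8, 64))         # 训练:每次前向随机抽取非线性
ffn.eval()
y_sparse = ffn(torch.randn(8, 64))  # 推理:固定 ReLU,中间激活稀疏
```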
zh
[AI-28] Context and Diversity Matter: The Emergence of In-Context Learning in World Models
【速读】:该论文旨在解决当前环境建模方法在面对新奇或罕见场景时性能下降的问题,即静态世界模型(world model)难以适应动态变化的环境。其解决方案的关键在于提出上下文环境学习(In-Context Environment Learning, ICEL),通过将环境识别(environment recognition)与环境学习(environment learning)作为两个核心机制进行形式化建模,并理论推导出二者误差的上界,从而揭示其涌现机制;同时实证验证了不同数据分布和模型架构对ICEL效果的影响,明确指出长上下文长度和多样化环境是实现有效自我适应世界模型的关键因素。
链接: https://arxiv.org/abs/2509.22353
作者: Fan Wang,Zhiyuan Chen,Yuxuan Zhong,Sunjian Zheng,Pengtao Shao,Bo Yu,Shaoshan Liu,Jianan Wang,Ning Ding,Yang Cao,Yu Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The capability of predicting environmental dynamics underpins both biological neural systems and general embodied AI in adapting to their surroundings. Yet prevailing approaches rest on static world models that falter when confronted with novel or rare configurations. We investigate in-context environment learning (ICEL), shifting attention from zero-shot performance to the growth and asymptotic limits of the world model. Our contributions are three-fold: (1) we formalize in-context learning of a world model and identify two core mechanisms: environment recognition and environment learning; (2) we derive error upper-bounds for both mechanisms that expose how the mechanisms emerge; and (3) we empirically confirm that distinct ICL mechanisms exist in the world model, and we further investigate how data distribution and model architecture affect ICL in a manner consistent with theory. These findings demonstrate the potential of self-adapting world models and highlight the key factors behind the emergence of ICEL, most notably the necessity of long context and diverse environments.
zh
[AI-29] SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis
【速读】:该论文旨在解决生存分析(Survival Analysis)中合成数据生成的挑战,特别是如何准确再现事件时间分布和删失机制(censoring mechanism),因为临床生存数据常因失访或截尾导致不完整。其解决方案的关键在于提出 SurvDiff——一种专为生存数据设计的端到端扩散模型(diffusion model),通过联合生成混合类型协变量、事件时间和右删失信息,并借助面向生存任务的损失函数进行优化,从而确保生成数据在分布保真度和下游生存建模性能上均优于现有最先进方法。
链接: https://arxiv.org/abs/2509.22352
作者: Marie Brockschmidt,Maresa Schröder,Stefan Feuerriegel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Survival analysis is a cornerstone of clinical research by modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout, or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. Across multiple datasets, we show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first diffusion model explicitly designed for generating synthetic survival data.
zh
[AI-30] Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning
链接: https://arxiv.org/abs/2509.22335
作者: Naicheng He,Kaicheng Guo,Arjun Prakash,Saket Tiwari,Ruo Yu Tao,Tyrone Serapio,Amy Greenwald,George Konidaris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-31] Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments
【速读】:该论文旨在解决深度学习模型在移动设备和低延迟敏感场景中因模型规模增大导致的加载时间长、初始推理延迟高问题,同时避免传统知识蒸馏(Knowledge Distillation, KD)方法在压缩模型时性能下降的缺陷。其解决方案的关键在于提出一种渐进式权重加载(Progressive Weight Loading, PWL)技术:首先部署轻量级学生模型以实现快速初始推理,随后逐步用预训练教师模型的层替换学生模型的对应层;为支持无缝层替换,PWL还引入了一种训练策略,不仅对齐学生与教师模型中间特征表示,还提升学生模型的整体输出性能。实验表明,采用PWL训练的模型在逐步加载教师层的过程中精度持续提升,最终达到与完整教师模型相当的性能,且不牺牲初始推理速度,特别适用于资源受限且对响应性和性能均要求高的动态部署场景。
链接: https://arxiv.org/abs/2509.22319
作者: Hyunwoo Kim,Junha Lee,Mincheol Choi,Jeonghwan Lee,Jaeshin Cho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Deep learning models have become increasingly large and complex, resulting in higher memory consumption and computational demands. Consequently, model loading times and initial inference latency have increased, posing significant challenges in mobile and latency-sensitive environments where frequent model loading and unloading are required, which directly impacts user experience. While Knowledge Distillation (KD) offers a solution by compressing large teacher models into smaller student ones, it often comes at the cost of reduced performance. To address this trade-off, we propose Progressive Weight Loading (PWL), a novel technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model. To support seamless layer substitution, we introduce a training method that not only aligns intermediate feature representations between student and teacher layers, but also improves the overall output performance of the student model. Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded, matching the final accuracy of the full teacher model without compromising initial inference speed. This makes PWL particularly suited for dynamic, resource-constrained deployments where both responsiveness and performance are critical.
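渐进式权重加载的核心是逐块用教师层替换学生层。下面给出一个最小示意,假设师生各块的输入/输出维度一致(论文通过中间特征对齐训练来保证可替换性);结构与块数均为演示设置,并非论文官方代码。

```python
import torch.nn as nn

def progressively_load(student: nn.Sequential, teacher: nn.Sequential, num_loaded: int):
    """返回一个混合模型:前 num_loaded 个块来自教师,其余沿用学生。
    假设师生块数相同且块间接口维度已对齐。"""
    blocks = [teacher[i] if i < num_loaded else student[i] for i in range(len(student))]
    return nn.Sequential(*blocks)

student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))
teacher = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))

# 部署流程:先用轻量学生模型快速响应,再随着教师权重加载逐块替换
model = progressively_load(student, teacher, num_loaded=0)   # 初始:纯学生模型
model = progressively_load(student, teacher, num_loaded=2)   # 教师前两块已加载
model = progressively_load(student, teacher, num_loaded=3)   # 完全替换为教师模型
```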
zh
[AI-32] Adaptive Policy Backbone via Shared Network
链接: https://arxiv.org/abs/2509.22310
作者: Bumgeun Park,Donghwan Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-33] HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
链接: https://arxiv.org/abs/2509.22299
作者: Ke Li,Zheng Yang,Zhongbin Zhou,Feng Xue,Zhonglin Jiang,Wenxiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-34] Large Language Models as Nondeterministic Causal Models
【速读】:该论文试图解决当前生成大语言模型(Large Language Models, LLMs)反事实(counterfactuals)方法中存在的语义模糊性问题,即现有方法未能基于LLM的预期含义进行建模,而是假设可通过修改采样过程而不改变模型本身来生成反事实,或强行将非确定性LLM表示为确定性因果模型。其解决方案的关键在于提出一种更简洁的方法,该方法基于LLM的预期语义,将其建模为非确定性因果模型(nondeterministic causal model),从而无需修改黑盒LLM即可直接生成具有理论一致性的反事实输出。此方法在保持对LLM本质理解的同时,也为后续针对特定应用场景定制反事实生成策略提供了理论基础。
链接: https://arxiv.org/abs/2509.22297
作者: Sander Beckers
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint: under review
点击查看摘要
Abstract:Recent work by Chatzi et al. and Ravfogel et al. has developed, for the first time, a method for generating counterfactuals of probabilistic Large Language Models. Such counterfactuals tell us what would - or might - have been the output of an LLM if some factual prompt x had been x* instead. The ability to generate such counterfactuals is an important necessary step towards explaining, evaluating, and comparing, the behavior of LLMs. I argue, however, that the existing method rests on an ambiguous interpretation of LLMs: it does not interpret LLMs literally, for the method involves the assumption that one can change the implementation of an LLM’s sampling process without changing the LLM itself, nor does it interpret LLMs as intended, for the method involves explicitly representing a nondeterministic LLM as a deterministic causal model. I here present a much simpler method for generating counterfactuals that is based on an LLM’s intended interpretation by representing it as a nondeterministic causal model instead. The advantage of my simpler method is that it is directly applicable to any black-box LLM without modification, as it is agnostic to any implementation details. The advantage of the existing method, on the other hand, is that it directly implements the generation of a specific type of counterfactuals that is useful for certain purposes, but not for others. I clarify how both methods relate by offering a theoretical foundation for reasoning about counterfactuals in LLMs based on their intended semantics, thereby laying the groundwork for novel application-specific methods for generating counterfactuals.
zh
[AI-35] Leveraging Large Language Models for Robot-Assisted Learning of Morphological Structures in Preschool Children with Language Vulnerabilities
链接: https://arxiv.org/abs/2509.22287
作者: Stina Sundstedt,Mattias Wingren,Susanne Hägglund,Daniel Ventus
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 12 pages, 2 figures, Preprint of: Sundstedt, S., Wingren, M., Hägglund, S. Ventus, D. (2025). Leveraging Large Language Models for Robot-Assisted Learning of Morphological Structures in Preschool Children with Language Vulnerabilities. In: Stephanidis, C., Antona, M., Ntoa, S. Salvendy, G. (eds.), Communications in Computer and Information Science, vol. 2523, pp. 415-425. Springer
[AI-36] Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models NEURIPS2025
【速读】:该论文旨在解决状态空间模型(State-Space Models, SSMs)在表达能力与计算效率之间的权衡问题:传统使用结构化过渡矩阵的SSMs虽然计算高效,但受限于其表达能力(即模拟有限状态自动机,Finite-State Automata, FSA的能力),而完全无结构的过渡矩阵虽能实现最优表达能力,却因计算和内存开销过大难以实用。解决方案的关键在于提出一种结构稀疏的过渡矩阵参数化方法——PD-SSM,其将过渡矩阵表示为一个列一热矩阵(P)与一个复数对角矩阵(D)的乘积,从而在保持线性复杂度的并行扫描计算的同时,理论上可精确模拟任意N状态FSA,且仅需单层维度N和N×N线性读出,显著优于现有结构化SSM的理论保证。
链接: https://arxiv.org/abs/2509.22284
作者: Aleksandar Terzić,Nicolas Menet,Michael Hersche,Thomas Hofmann,Abbas Rahimi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, NeurIPS 2025 Spotlight
点击查看摘要
Abstract:Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model’s expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix (P) and a complex-valued diagonal matrix (D). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any N-state FSA with one layer of dimension N and a linear readout of size N × N, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at this https URL
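下面的示意展示如何由“列独热矩阵 P × 复数对角矩阵 D”构造过渡矩阵并执行一步递推。P 的硬选择方式(argmax)、对角元的参数化与维度均为演示假设,也未包含论文中的并行扫描实现;实际训练还需对 P 的选择使用可微近似。

```python
import torch

def one_hot_from_logits(logits):
    """按列生成 one-hot 矩阵 P:每一列恰有一个 1(此处用 argmax 作硬选择,仅为示意)。"""
    n = logits.shape[-1]
    idx = logits.argmax(dim=-1)                 # 每列选中的行索引
    P = torch.zeros(n, n)
    P[idx, torch.arange(n)] = 1.0               # 列 one-hot
    return P

def pd_ssm_step(h, u, logits, log_mag, phase, B):
    """单步递推 h_{t+1} = (P D) h_t + B u_t,其中 D 为复数对角矩阵。"""
    P = one_hot_from_logits(logits).to(torch.cfloat)
    d = torch.exp(log_mag) * torch.exp(1j * phase)      # 复数对角元
    A = P * d.unsqueeze(0)                              # 等价于 P @ diag(d):按列缩放
    return A @ h + (B @ u).to(torch.cfloat)

n, m = 6, 3
h = torch.zeros(n, dtype=torch.cfloat)
B = torch.randn(n, m)
for t in range(4):
    u = torch.randn(m)
    logits = torch.randn(n, n)          # 实际模型中由输入经网络产生
    log_mag = -torch.rand(n)            # |d| < 1,保证稳定
    phase = torch.rand(n) * 3.14
    h = pd_ssm_step(h, u, logits, log_mag, phase, B)
print(h.real)
```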
zh
[AI-37] A Global Analysis of Cyber Threats to the Energy Sector: “Currents of Conflict” from a Geopolitical Perspective
【速读】:该论文旨在解决日益频繁且复杂的网络威胁所带来的认知与应对挑战,特别是在能源领域中如何更有效地理解威胁来源、识别攻击模式并提升检测能力的问题。其解决方案的关键在于利用生成式人工智能(Generative AI)从原始的网络安全威胁描述中提取并结构化信息,从而增强威胁情报分析的深度与效率;同时结合地缘政治视角对多个数据库中的攻击者起源与目标区域进行对比分析,并评估基于学习的检测技术在识别针对能源行业的入侵指标(Indicators of Compromise, IoC)方面的有效性,由此为研究人员、政策制定者和安全从业人员提供可操作的洞察。
链接: https://arxiv.org/abs/2509.22280
作者: Gustavo Sánchez,Ghada Elbez,Veit Hagenmeyer
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: THIS IS A POSTPRINT OF A PEER-REVIEWED ARTICLE, PLEASE CITE IT IF USING THIS WORK: Gustavo Sanchez, Ghada Elbez, and Veit Hagenmeyer. “A Global Analysis of Cyber Threats to the Energy Sector:“Currents of Conflict” from a geopolitical perspective.” atp magazin 67.9 (2025): 56-66. this https URL
点击查看摘要
Abstract:The escalating frequency and sophistication of cyber threats increased the need for their comprehensive understanding. This paper explores the intersection of geopolitical dynamics, cyber threat intelligence analysis, and advanced detection technologies, with a focus on the energy domain. We leverage generative artificial intelligence to extract and structure information from raw cyber threat descriptions, enabling enhanced analysis. By conducting a geopolitical comparison of threat actor origins and target regions across multiple databases, we provide insights into trends within the general threat landscape. Additionally, we evaluate the effectiveness of cybersecurity tools – with particular emphasis on learning-based techniques – in detecting indicators of compromise for energy-targeted attacks. This analysis yields new insights, providing actionable information to researchers, policy makers, and cybersecurity professionals.
zh
[AI-38] Wavelet-Induced Rotary Encodings: RoPE Meets Graphs
链接: https://arxiv.org/abs/2509.22259
作者: Isaac Reid,Arijit Sehanobish,Cedrik Höfs,Bruno Mlodozeniec,Leonhard Vulpius,Federico Barbero,Adrian Weller,Krzysztof Choromanski,Richard E. Turner,Petar Veličković
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-39] Secure and Efficient Access Control for Computer-Use Agents via Context Space
链接: https://arxiv.org/abs/2509.22256
作者: Haochen Gong,Chenxiao Li,Rui Chang,Wenbo Shen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
备注:
[AI-40] Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing NEURIPS2025
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在组合优化领域,特别是二维装箱问题(2D bin-packing problem)中的能力评估与应用问题。其关键解决方案是提出一种系统性方法,将LLMs与进化算法(evolutionary algorithms)相结合,通过迭代生成和优化启发式策略来提升求解效率。实验表明,该方法能显著优于传统启发式算法(如有限首次适应法和混合首次适应法),例如GPT-4o仅需两次迭代即可获得最优解,平均使用箱子数从16降至15,空间利用率从0.76–0.78提升至0.83,同时降低计算资源消耗。
链接: https://arxiv.org/abs/2509.22255
作者: Syed Mahbubul Huq,Daniel Brito,Daniel Sikar,Rajesh Mojumder
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 1 table, 6 figures. 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Accepted for the Workshop: Evaluating the Evolving LLM Lifecycle Benchmarks, Emergent Abilities, and Scaling
点击查看摘要
Abstract:This paper presents an evaluation framework for assessing Large Language Models’ (LLMs) capabilities in combinatorial optimization, specifically addressing the 2D bin-packing problem. We introduce a systematic methodology that combines LLMs with evolutionary algorithms to generate and refine heuristic solutions iteratively. Through comprehensive experiments comparing LLM generated heuristics against traditional approaches (Finite First-Fit and Hybrid First-Fit), we demonstrate that LLMs can produce more efficient solutions while requiring fewer computational resources. Our evaluation reveals that GPT-4o achieves optimal solutions within two iterations, reducing average bin usage from 16 to 15 bins while improving space utilization from 0.76-0.78 to 0.83. This work contributes to understanding LLM evaluation in specialized domains and establishes benchmarks for assessing LLM performance in combinatorial optimization tasks.
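作为参照,下面给出一个常见的“货架式”首次适应(shelf first-fit)二维装箱启发式的纯 Python 示意。它只是此类基线的一个通用简化,并非论文中 Finite First-Fit / Hybrid First-Fit 的精确实现;箱子尺寸与物品列表均为演示数据。

```python
def shelf_first_fit(items, bin_w, bin_h):
    """货架式首次适应:每个矩形放入第一个放得下的货架,
    放不下则在当前箱内开新货架,仍不行则开新箱,返回所用箱数。"""
    assert all(w <= bin_w and h <= bin_h for w, h in items), "物品必须能放进单个箱子"
    bins = []                                              # 每个箱:货架列表,货架 = [y, 高度, 已用宽度]
    for w, h in sorted(items, key=lambda it: -it[1]):      # 按高度降序放置
        placed = False
        for shelves in bins:
            for shelf in shelves:
                if shelf[2] + w <= bin_w and h <= shelf[1]:
                    shelf[2] += w; placed = True; break
            if placed: break
            used_h = shelves[-1][0] + shelves[-1][1]
            if used_h + h <= bin_h:                        # 在该箱内开新货架
                shelves.append([used_h, h, w]); placed = True; break
        if not placed:
            bins.append([[0, h, w]])                       # 开新箱
    return len(bins)

items = [(4, 3), (5, 2), (2, 2), (6, 4), (3, 3), (4, 4), (2, 5)]
print("使用箱数:", shelf_first_fit(items, bin_w=10, bin_h=10))
```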
zh
[AI-41] ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
链接: https://arxiv.org/abs/2509.22246
作者: Xiaoyang Liu,Tao Zhu,Zineng Dong,Yuntian Liu,Qingfeng Guo,Zhaoxuan Liu,Yu Chen,Tao Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-42] Fairness-Aware Reinforcement Learning (FAReL): A Framework for Transparent and Balanced Sequential Decision-Making
链接: https://arxiv.org/abs/2509.22232
作者: Alexandra Cimpean,Nicole Orzan,Catholijn Jonker,Pieter Libin,Ann Nowé
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-43] Automatic Discovery of One Parameter Subgroups of SO(n)
链接: https://arxiv.org/abs/2509.22219
作者: Pavan Karjol,Vivek V Kashyap,Rohan Kashyap,Prathosh A P
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-44] VizGen: Data Exploration and Visualization from Natural Language via a Multi-Agent AI Architecture
链接: https://arxiv.org/abs/2509.22218
作者: Sandaru Fernando,Imasha Jayarathne,Sithumini Abeysekara,Shanuja Sithamparanthan,Thushari Silva,Deshan Jayawardana
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
[AI-45] Impact of Collective Behaviors of Autonomous Vehicles on Urban Traffic Dynamics: A Multi-Agent Reinforcement Learning Approach
【速读】:该论文旨在解决混合交通环境中强化学习(Reinforcement Learning, RL)赋能的自动驾驶车辆(Autonomous Vehicles, AVs)对城市交通流影响的问题,特别是在多智能体场景下,通过设定不同行为目标来探究AVs如何优化自身路径选择并影响人类驾驶员的出行效率。解决方案的关键在于构建一个基于深度Q-learning算法的RL框架PARCOUR,将AVs设定为具有六种不同行为模式(自私型、协作型、竞争型、社会型、利他型和恶意型)的智能体,并通过奖励机制引导其行为;仿真结果表明,AVs在采用自利行为时可实现最高达5%的旅行时间优化,但其对人类驾驶员的影响因行为类型而异,凸显了多智能体RL在交通网络协同路径规划中的适用性及其复杂性差异。
链接: https://arxiv.org/abs/2509.22216
作者: Ahmet Onur Akman,Anastasia Psarou,Zoltán György Varga,Grzegorz Jamróz,Rafał Kucharski
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Work presented at the European Workshop on Reinforcement Learning (EWRL 2024)
点击查看摘要
Abstract:This study examines the potential impact of reinforcement learning (RL)-enabled autonomous vehicles (AV) on urban traffic flow in a mixed traffic environment. We focus on a simplified day-to-day route choice problem in a multi-agent setting. We consider a city network where human drivers travel through their chosen routes to reach their destinations in minimum travel time. Then, we convert one-third of the population into AVs, which are RL agents employing Deep Q-learning algorithm. We define a set of optimization targets, or as we call them behaviors, namely selfish, collaborative, competitive, social, altruistic, and malicious. We impose a selected behavior on AVs through their rewards. We run our simulations using our in-house developed RL framework PARCOUR. Our simulations reveal that AVs optimize their travel times by up to 5%, with varying impacts on human drivers’ travel times depending on the AV behavior. In all cases where AVs adopt a self-serving behavior, they achieve shorter travel times than human drivers. Our findings highlight the complexity differences in learning tasks of each target behavior. We demonstrate that the multi-agent RL setting is applicable for collective routing on traffic networks, though their impact on coexisting parties greatly varies with the behaviors adopted.
zh
[AI-46] Reversible GNS for Dissipative Fluids with Consistent Bidirectional Dynamics
链接: https://arxiv.org/abs/2509.22207
作者: Mu Huang,Linning Xu,Mingyue Dai,Yidi Shao,Bo Dai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注: 13 pages, 5 figures
[AI-47] MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training
链接: https://arxiv.org/abs/2509.22199
作者: Haoyun Li,Ivan Zhang,Runqi Ouyang,Xiaofeng Wang,Zheng Zhu,Zhiqin Yang,Zhentao Zhang,Boyuan Wang,Chaojun Ni,Wenkang Qin,Xinze Chen,Yun Ye,Guan Huang,Zhenbo Song,Xingang Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
[AI-48] Learning Equivariant Functions via Quadratic Forms
【速读】:该论文旨在解决如何从数据中学习群(已知或未知)等变函数的问题,尤其关注在假设对称群为正交群(orthogonal group)的前提下,通过学习其对应的二次型 $ x^T A x $ 来揭示潜在的对称结构。解决方案的关键在于利用正交群保持特定二次型不变的数学性质,将该二次型对应的唯一对称矩阵及其对角化形式引入神经网络架构设计中,从而嵌入适当的归纳偏置(inductive bias),使得模型既简化又高效。由此构建的等变模型可分解为一个保范数模型与一个尺度不变模型的乘积(指群作用下的组合),并在更一般的多输入向量场景下进一步扩展:此时等变函数被分解为仅依赖于归一化第一向量的角向分量和依赖于整个输入张量Gram矩阵的尺度不变分量,从而在保留群对称性的同时捕捉多输入间的相互依赖关系。
链接: https://arxiv.org/abs/2509.22184
作者: Pavan Karjol,Vivek V Kashyap,Rohan Kashyap,Prathosh A P
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this study, we introduce a method for learning group (known or unknown) equivariant functions by learning the associated quadratic form x^T A x corresponding to the group from the data. Certain groups, known as orthogonal groups, preserve a specific quadratic form, and we leverage this property to uncover the underlying symmetry group under the assumption that it is orthogonal. By utilizing the corresponding unique symmetric matrix and its inherent diagonal form, we incorporate suitable inductive biases into the neural network architecture, leading to models that are both simplified and efficient. Our approach results in an invariant model that preserves norms, while the equivariant model is represented as a product of a norm-invariant model and a scale-invariant model, where the ``product’’ refers to the group action. Moreover, we extend our framework to a more general setting where the function acts on tuples of input vectors via a diagonal (or product) group action. In this extension, the equivariant function is decomposed into an angular component extracted solely from the normalized first vector and a scale-invariant component that depends on the full Gram matrix of the tuple. This decomposition captures the inter-dependencies between multiple inputs while preserving the underlying group symmetry. We assess the effectiveness of our framework across multiple tasks, including polynomial regression, top quark tagging, and moment of inertia matrix prediction. Comparative analysis with baseline methods demonstrates that our model consistently excels in both discovering the underlying symmetry and efficiently learning the corresponding equivariant function.
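下面用 PyTorch 给出两个与摘要对应的小示意:(1) 以可学习对称矩阵 A 参数化二次型 x^T A x 作为候选不变特征;(2) 多输入情形下,由整个输入的 Gram 矩阵得到尺度不变部分、由归一化后的第一个向量得到角向部分。维度与初始化均为演示假设,并非论文的完整模型。

```python
import torch
import torch.nn as nn

class QuadraticInvariant(nn.Module):
    """学习一个对称矩阵 A,用二次型 x^T A x 作为(候选的)群不变特征;
    训练后对 A 做对角化可用于揭示数据所保持的正交型对称性。"""
    def __init__(self, dim):
        super().__init__()
        self.M = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, x):                       # x: (batch, dim)
        A = 0.5 * (self.M + self.M.T)           # 对称化
        return torch.einsum("bi,ij,bj->b", x, A, x)

def tuple_features(X):
    """多输入向量情形:尺度不变部分依赖整个输入的 Gram 矩阵,
    角向部分仅取自归一化后的第一个向量(与摘要中的分解一致)。"""
    gram = X @ X.transpose(-1, -2)                          # (batch, k, k)
    angular = X[:, 0] / X[:, 0].norm(dim=-1, keepdim=True)  # 归一化第一向量
    return gram.flatten(1), angular

model = QuadraticInvariant(dim=4)
q = model(torch.randn(8, 4))                    # (8,) 二次型取值
g, a = tuple_features(torch.randn(8, 3, 4))     # Gram 特征与角向特征
```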
zh
[AI-49] Efficiency Boost in Decentralized Optimization: Reimagining Neighborhood Aggregation with Minimal Overhead
链接: https://arxiv.org/abs/2509.22174
作者: Durgesh Kalwar,Mayank Baranwal,Harshad Khadilkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-50] Teaching AI to Feel: A Collaborative Full-Body Exploration of Emotive Communication ACM-MM’25
链接: https://arxiv.org/abs/2509.22168
作者: Esen K. Tütüncü,Lissette Lemus,Kris Pilcher,Holger Sprengel,Jordi Sabater-Mir
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 9 pages, 10 Figures, ACM MM’25
[AI-51] Lightweight error mitigation strategies for post-training N:M activation sparsity in LLM s
链接: https://arxiv.org/abs/2509.22166
作者: Shirin Alanova,Kristina Kazistova,Ekaterina Galaeva,Alina Kostromina,Vladimir Smirnov,Redko Dmitry,Alexey Dontsov,Maxim Zhelnin,Evgeny Burnaev,Egor Shvetsov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-52] Pushing Toward the Simplex Vertices: A Simple Remedy for Code Collapse in Smoothed Vector Quantization
链接: https://arxiv.org/abs/2509.22161
作者: Takashi Morita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-53] Log2Plan: An Adaptive GUI Automation Framework Integrated with Task Mining Approach
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)或视觉语言模型(Vision-Language Model, VLM)的规划-执行代理在GUI任务自动化中面临的脆弱泛化能力、高延迟以及长程一致性不足的问题。其核心解决方案是提出Log2Plan框架,关键在于结合结构化的两级规划机制与基于用户行为日志的任务挖掘方法:首先通过将用户命令映射到结构化任务词典生成高层计划,实现一致且可泛化的自动化;其次利用任务挖掘从用户行为日志中识别个性化模式,支持定制化和复用;最后通过实时GUI上下文解释将高层计划转化为低层动作序列,确保跨不同界面的鲁棒执行。
链接: https://arxiv.org/abs/2509.22137
作者: Seoyoung Lee,Seonbin Yoon,Seongbeen Lee,Hyesoo Kim,Joo Yong Sim
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:GUI task automation streamlines repetitive tasks, but existing LLM or VLM-based planner-executor agents suffer from brittle generalization, high latency, and limited long-horizon coherence. Their reliance on single-shot reasoning or static plans makes them fragile under UI changes or complex tasks. Log2Plan addresses these limitations by combining a structured two-level planning framework with a task mining approach over user behavior logs, enabling robust and adaptable GUI automation. Log2Plan constructs high-level plans by mapping user commands to a structured task dictionary, enabling consistent and generalizable automation. To support personalization and reuse, it employs a task mining approach from user behavior logs that identifies user-specific patterns. These high-level plans are then grounded into low-level action sequences by interpreting real-time GUI context, ensuring robust execution across varying interfaces. We evaluated Log2Plan on 200 real-world tasks, demonstrating significant improvements in task success rate and execution time. Notably, it maintains over 60.0% success rate even on long-horizon task sequences, highlighting its robustness in complex, multi-step workflows.
zh
[AI-54] Multi-Agent Path Finding via Offline RL and LLM Collaboration
【速读】:该论文旨在解决多智能体路径规划(Multi-Agent Path Finding, MAPF)中的两大挑战:一是去中心化强化学习方法易导致代理间出现自私行为,引发频繁碰撞;二是传统方法依赖复杂通信模块,训练周期长,常需数周。其解决方案的关键在于提出一种基于决策Transformer(Decision Transformer, DT)的高效去中心化规划框架,利用离线强化学习显著缩短训练时间至数小时,并有效处理长时程信用分配问题,提升稀疏奖励场景下的性能;同时引入大语言模型(GPT-4o)动态引导代理策略,增强系统在动态环境中的适应能力。
链接: https://arxiv.org/abs/2509.22130
作者: Merve Atasever,Matthew Hong,Mihir Nitin Kulkarni,Qingpei Li,Jyotirmoy V. Deshmukh
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multi-Agent Path Finding (MAPF) poses a significant and challenging problem critical for applications in robotics and logistics, particularly due to its combinatorial complexity and the partial observability inherent in realistic environments. Decentralized reinforcement learning methods commonly encounter two substantial difficulties: first, they often yield self-centered behaviors among agents, resulting in frequent collisions, and second, their reliance on complex communication modules leads to prolonged training times, sometimes spanning weeks. To address these challenges, we propose an efficient decentralized planning framework based on the Decision Transformer (DT), uniquely leveraging offline reinforcement learning to substantially reduce training durations from weeks to mere hours. Crucially, our approach effectively handles long-horizon credit assignment and significantly improves performance in scenarios with sparse and delayed rewards. Furthermore, to overcome adaptability limitations inherent in standard RL methods under dynamic environmental changes, we integrate a large language model (GPT-4o) to dynamically guide agent policies. Extensive experiments in both static and dynamically changing environments demonstrate that our DT-based approach, augmented briefly by GPT-4o, significantly enhances adaptability and performance.
zh
[AI-55] The AI_INFN Platform: Artificial Intelligence Development in the Cloud
链接: https://arxiv.org/abs/2509.22117
作者: Lucio Anderlini,Giulio Bianchini,Diego Ciangottini,Stefano Dal Pra,Diego Michelotto,Rosa Petrini,Daniele Spiga
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: To be published in SciPost Physics Proceedings for European AI for Fundamental Physics Conference (EuCAIFCon 2025)
[AI-56] Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization ICLR2026
【速读】:该论文旨在解决批评者-free(critic-free)强化学习方法(如GRPO)在训练过程中因大量无信息样本和token导致的学习信号稀释问题,从而造成收敛速度缓慢的挑战。其解决方案的关键在于提出动态双层下采样(Dynamic Dual-Level Down-Sampling, D³S)框架:在样本层面上,通过选择能最大化优势方差(Var(A))的rollout子集来提升策略梯度上界;在token层面上,优先选取优势幅度与策略熵乘积(|A_i,t|×H_i,t)高的token进行更新,聚焦于既不确定又具影响力的区域;同时引入受课程学习启发的动态下采样调度机制,早期激进下采样加速学习,后期逐步放松以增强泛化能力。该方案显著提升了策略优化效率,在多个推理基准测试中实现了更优性能与更低样本/token消耗。
链接: https://arxiv.org/abs/2509.22115
作者: Chao Wang,Tao Yang,Hongtao Tian,Yunsheng Shi,Qiyao Ma,Xiaotao Liu,Ting Yao,Wenbo Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures, Under review as a conference paper at ICLR 2026
点击查看摘要
Abstract:Critic-free methods like GRPO reduce memory demands by estimating advantages from multiple rollouts but tend to converge slowly, as critical learning signals are diluted by an abundance of uninformative samples and tokens. To tackle this challenge, we propose the Dynamic Dual-Level Down-Sampling (D³S) framework that prioritizes the most informative samples and tokens across groups to improve the efficiency of policy optimization. D³S operates along two levels: (1) the sample-level, which selects a subset of rollouts to maximize advantage variance (Var(A)). We theoretically prove that this selection is positively correlated with the upper bound of the policy gradient norms, yielding higher policy gradients. (2) the token-level, which prioritizes tokens with a high product of advantage magnitude and policy entropy (|A_{i,t}| × H_{i,t}), focusing updates on tokens where the policy is both uncertain and impactful. Moreover, to prevent overfitting to high-signal data, D³S employs a dynamic down-sampling schedule inspired by curriculum learning. This schedule starts with aggressive down-sampling to accelerate early learning and gradually relaxes to promote robust generalization. Extensive experiments on Qwen2.5 and Llama3.1 demonstrate that integrating D³S into advanced RL algorithms achieves state-of-the-art performance and generalization while requiring fewer samples and tokens across diverse reasoning benchmarks. Our code is added in the supplementary materials and will be made publicly available.
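token 级下采样的选择准则是 |A| × H。下面给出一个按该得分保留 top-k token 的最小示意(keep_ratio、掩码化损失等均为演示设置,样本级的 Var(A) 选择与动态调度未包含在内)。

```python
import torch

def select_tokens(advantage, entropy, keep_ratio=0.5):
    """token 级下采样:按 |A_{i,t}| * H_{i,t} 选出得分最高的一部分 token 参与策略更新。
    advantage / entropy 形状为 (batch, seq_len);返回同形状的 0/1 掩码。"""
    score = advantage.abs() * entropy
    k = max(1, int(keep_ratio * score.numel()))
    threshold = score.flatten().topk(k).values.min()
    return (score >= threshold).float()

adv = torch.randn(4, 16)          # 每个 token 的优势估计
ent = torch.rand(4, 16)           # 每个 token 的策略熵
mask = select_tokens(adv, ent, keep_ratio=0.25)

# 策略损失只在被选中的 token 上计算(示意)
logp = torch.randn(4, 16, requires_grad=True)
loss = -(mask * adv * logp).sum() / mask.sum()
loss.backward()
```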
zh
[AI-57] Reinforcement Learning for Durable Algorithmic Recourse
【速读】:该论文旨在解决算法性归因(algorithmic recourse)中忽视时间动态性的问题,尤其是在资源受限且存在竞争的环境中,推荐策略如何影响未来申请人池的行为演化。现有方法多关注模型更新下的鲁棒性,但未充分考虑个体在收到建议后随时间变化的适应行为及其对系统长期有效性的影响。论文的关键解决方案是提出一种时间感知的归因框架,显式建模候选人群体对推荐的响应机制,并设计了一种基于强化学习(reinforcement learning, RL)的归因算法,能够捕捉环境的动态演化过程,生成在可行性和长期有效性之间取得平衡的推荐策略。特别地,推荐被设计为具有“持久性”(durability),确保在预定义的时间窗口 T 内保持有效,使个体可在实施建议后自信地重新申请。
链接: https://arxiv.org/abs/2509.22102
作者: Marina Ceccon,Alessandro Fabris,Goran Radanović,Asia J. Biega,Gian Antonio Susto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Algorithmic recourse seeks to provide individuals with actionable recommendations that increase their chances of receiving favorable outcomes from automated decision systems (e.g., loan approvals). While prior research has emphasized robustness to model updates, considerably less attention has been given to the temporal dynamics of recourse–particularly in competitive, resource-constrained settings where recommendations shape future applicant pools. In this work, we present a novel time-aware framework for algorithmic recourse, explicitly modeling how candidate populations adapt in response to recommendations. Additionally, we introduce a novel reinforcement learning (RL)-based recourse algorithm that captures the evolving dynamics of the environment to generate recommendations that are both feasible and valid. We design our recommendations to be durable, supporting validity over a predefined time horizon T. This durability allows individuals to confidently reapply after taking time to implement the suggested changes. Through extensive experiments in complex simulation environments, we show that our approach substantially outperforms existing baselines, offering a superior balance between feasibility and long-term validity. Together, these results underscore the importance of incorporating temporal and behavioral dynamics into the design of practical recourse systems.
zh
[AI-58] Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation
链接: https://arxiv.org/abs/2509.22093
作者: Xiaohuan Pei,Yuxing Chen,Siyu Xu,Yunke Wang,Yuheng Shi,Chang Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
[AI-59] Ground-Truthing AI Energy Consumption: Validating CodeCarbon Against External Measurements
链接: https://arxiv.org/abs/2509.22092
作者: Raphael Fischer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-60] Generalizing Multi-Objective Search via Objective-Aggregation Functions
链接: https://arxiv.org/abs/2509.22085
作者: Hadar Peer,Eyal Weiss,Ron Alterovitz,Oren Salzman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-61] The Rogue Scalpel: Activation Steering Compromises LLM Safety
【速读】:该论文旨在解决生成式 AI(Generative AI)模型在使用激活引导(activation steering)技术进行行为控制时可能引发的安全风险问题。当前该技术常被视为一种精确、可解释且更安全的替代微调的方法,但本文通过大量实验表明,激活引导实际上会系统性地破坏模型对有害请求的防护机制,显著提升其合规率(从0%上升至2-27%),甚至在随机方向或利用稀疏自编码器(sparse autoencoder, SAE)提取的良性特征方向上也能实现类似效果。解决方案的关键在于揭示了“内部可控性”不等于“行为可控性”——即对模型隐藏状态的精细干预未必带来预期的行为控制,反而可能被用于构造通用越狱攻击(universal jailbreak),从而挑战以可解释性保障安全的传统范式。
链接: https://arxiv.org/abs/2509.22067
作者: Anton Korznikov,Andrey Galichin,Alexey Dontsov,Oleg Y. Rogov,Ivan Oseledets,Elena Tutubalina
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model’s hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 2-27%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, increases these rates by a further 2-4%. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise control over model internals does not guarantee precise control over model behavior.
zh
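For readers unfamiliar with the mechanism under attack, the sketch below shows how activation steering is typically wired up: a fixed direction vector is added to one transformer block's hidden states through a forward hook at inference time. The layer path, scale, and random direction are placeholders, not the paper's experimental setup.

```python
import torch

def add_steering_hook(block, direction, alpha=8.0):
    """Register a forward hook that adds a scaled, unit-norm steering vector
    to the hidden states emitted by one transformer block."""
    direction = direction / direction.norm()

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    return block.register_forward_hook(hook)

# Usage (illustrative HuggingFace-style layer path; adjust for the actual model):
# handle = add_steering_hook(model.model.layers[15], torch.randn(model.config.hidden_size))
# ... run generation with steering active ...
# handle.remove()
```

The paper's point is that even a random `direction` passed to such a hook can measurably raise harmful compliance.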
[AI-62] Decoding Deception: Understanding Automatic Speech Recognition Vulnerabilities in Evasion and Poisoning Attacks
链接: https://arxiv.org/abs/2509.22060
作者: Aravindhan G,Yuvaraj Govindarajulu,Parin Shah
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
[AI-63] An Adaptive ICP LiDAR Odometry Based on Reliable Initial Pose
【速读】:该论文旨在解决基于迭代最近点(Iterative Closest Point, ICP)的激光雷达里程计(LiDAR odometry)在复杂动态环境中因初始位姿不可靠和缺乏自适应机制而导致的局部最优收敛及注册精度下降问题。解决方案的关键在于:首先通过密度滤波的分布式粗配准获取初始位姿,并结合运动预测位姿进行可靠性筛选,从而显著降低源点云与目标点云间的初始误差;其次,利用当前帧与历史误差信息动态调整ICP的阈值,实现对动态环境变化的自适应响应;最终,在可靠初始位姿和自适应阈值基础上,采用点到平面的自适应ICP算法完成高精度点云配准,有效提升了LiDAR里程计的鲁棒性与准确性。
链接: https://arxiv.org/abs/2509.22058
作者: Qifeng Wang,Weigang Li,Lei Nie,Xin Xu,Wenping Liu,Zhe Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As a key technology for autonomous navigation and positioning in mobile robots, light detection and ranging (LiDAR) odometry is widely used in autonomous driving applications. The Iterative Closest Point (ICP)-based methods have become the core technique in LiDAR odometry due to their efficient and accurate point cloud registration capability. However, some existing ICP-based methods do not consider the reliability of the initial pose, which may cause the method to converge to a local optimum. Furthermore, the absence of an adaptive mechanism hinders the effective handling of complex dynamic environments, resulting in a significant degradation of registration accuracy. To address these issues, this paper proposes an adaptive ICP-based LiDAR odometry method that relies on a reliable initial pose. First, distributed coarse registration based on density filtering is employed to obtain the initial pose estimation. The reliable initial pose is then selected by comparing it with the motion prediction pose, reducing the initial error between the source and target point clouds. Subsequently, by combining the current and historical errors, the adaptive threshold is dynamically adjusted to accommodate the real-time changes in the dynamic environment. Finally, based on the reliable initial pose and the adaptive threshold, point-to-plane adaptive ICP registration is performed from the current frame to the local map, achieving high-precision alignment of the source and target point clouds. Extensive experiments on the public KITTI dataset demonstrate that the proposed method outperforms existing approaches and significantly enhances the accuracy of LiDAR odometry.
zh
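Two of the decision rules in this pipeline, reliable-initial-pose selection and the adaptive threshold, can be sketched as below. The abstract does not give the exact formulas, so the error-based pose choice and the exponential-moving-average blend of current and historical errors are assumptions made purely for illustration.

```python
class AdaptiveThreshold:
    """Correspondence-distance gate blended from current and historical ICP errors.
    The blending rule and constants are illustrative, not the paper's exact formula."""
    def __init__(self, init=1.0, smoothing=0.9, scale=3.0):
        self.history = init
        self.smoothing = smoothing
        self.scale = scale

    def update(self, current_error):
        # Exponential moving average over past frame errors, scaled to a gate distance.
        self.history = self.smoothing * self.history + (1 - self.smoothing) * current_error
        return self.scale * self.history


def select_initial_pose(coarse_pose, coarse_error, predicted_pose, predicted_error):
    """Keep whichever candidate (coarse registration vs. motion prediction)
    leaves the smaller residual against the local map."""
    return coarse_pose if coarse_error <= predicted_error else predicted_pose
```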
[AI-64] Latent Diffusion: Multi-Dimension Stable Diffusion Latent Space Explorer
链接: https://arxiv.org/abs/2509.22038
作者: Zhihua Zhong,Xuanyang Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-65] GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments
链接: https://arxiv.org/abs/2509.21998
作者: Hanlin Zhu,Tianyu Guo,Song Mei,Stuart Russell,Nikhil Ghosh,Alberto Bietti,Jiantao Jiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 35 pages, 8 figures
[AI-66] Bilinear relational structure fixes reversal curse and enables consistent model editing
【速读】:该论文旨在解决语言模型(Language Model, LM)中存在的“反转诅咒”(reversal curse)问题,即模型无法从已学习的命题“A is B”中推理出未见过的逆向事实“B is A”。研究表明,这一现象并非模型的本质缺陷,而是其知识编码方式所致。解决方案的关键在于通过在关系知识图谱(relational knowledge graphs)的合成数据集上从头训练语言模型,使其隐藏层表示中自然涌现出双线性(bilinear)结构。这种结构显著缓解了反转诅咒,并使模型能够正确推理逆向事实;更重要的是,该结构还确保了模型编辑时逻辑一致性——当更新某一事实时,其逆向及逻辑相关事实能正确传播,从而避免引入新的不一致。因此,论文指出,模型编辑的成功不仅依赖于编辑算法,更取决于知识所依赖的底层表征几何结构。
链接: https://arxiv.org/abs/2509.21993
作者: Dong-Kyum Kim,Minsung Kim,Jea Kwon,Nakyeong Yang,Meeyoung Cha
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages
点击查看摘要
Abstract:The reversal curse – a language model's (LM) inability to infer an unseen fact "B is A" from a learned fact "A is B" – is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. By training LMs from scratch on a synthetic dataset of relational knowledge graphs, we demonstrate that bilinear relational structure emerges in their hidden representations. This structure substantially alleviates the reversal curse, enabling LMs to infer unseen reverse facts. Crucially, we also find that this bilinear structure plays a key role in consistent model editing. When a fact is updated in a LM with this structure, the edit correctly propagates to its reverse and other logically dependent facts. In contrast, models lacking this representation not only suffer from the reversal curse but also fail to generalize edits, further introducing logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn enable LMs to behave in a logically consistent manner after editing. This implies that the success of model editing depends critically not just on editing algorithms but on the underlying representational geometry of the knowledge being modified.
zh
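The phrase "bilinear relational structure" can be made concrete with a small sketch: a relation-specific matrix scores entity pairs through a bilinear form, so the parameters used for "A is B" also constrain the reverse direction. This is an illustrative parameterization, not the architecture analyzed in the paper.

```python
import torch
import torch.nn as nn

class BilinearRelationScorer(nn.Module):
    """Score (head, relation, tail) triples with the bilinear form h^T W_r t."""
    def __init__(self, num_entities, num_relations, dim=64):
        super().__init__()
        self.entity = nn.Embedding(num_entities, dim)
        self.rel = nn.Parameter(torch.randn(num_relations, dim, dim) * 0.01)

    def forward(self, head, relation, tail):
        h = self.entity(head)          # (B, d)
        t = self.entity(tail)          # (B, d)
        W = self.rel[relation]         # (B, d, d)
        return torch.einsum("bd,bde,be->b", h, W, t)

# The reverse fact "B is A" is scored by the transposed form t^T W_r^T h, so a
# bilinear parameterization ties the forward and reverse directions together,
# which is the intuition behind both the reversal-curse fix and edit propagation.
```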
[AI-67] Developing Vision-Language-Action Model from Egocentric Videos
【速读】:该论文旨在解决如何利用无需额外标注的原始第一人称视角视频(egocentric videos)来训练视觉-语言-动作模型(Vision-Language-Action models, VLAs),以替代传统依赖昂贵且人工密集型的手部姿态标注或专家遥控操作的方法。其核心挑战在于从无辅助信息的视频中提取精确的物体6自由度(6DoF)操作轨迹,从而构建可用于VLA预训练的大规模数据集。解决方案的关键在于引入EgoScaler框架,该框架能直接从第一人称视频中自动估计并校正6DoF物体运动轨迹,无需任何手部姿态或其他辅助标注,并通过在四个大规模第一人称视频数据集上应用该方法,构建了一个高质量、可扩展的VLA预训练数据集。实验表明,基于此数据集进行预训练显著提升了机器人任务成功率,且性能接近甚至超越使用真实机器人数据训练的结果。
链接: https://arxiv.org/abs/2509.21986
作者: Tomoya Yoshida,Shuhei Kurita,Taichi Nishimura,Shinsuke Mori
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such as detailed hand-pose recordings. Consequently, it remains unclear whether VLAs can be trained directly from raw egocentric videos. In this work, we address this challenge by leveraging EgoScaler, a framework that extracts 6DoF object manipulation trajectories from egocentric videos without requiring auxiliary recordings. We apply EgoScaler to four large-scale egocentric video datasets and automatically refine noisy or incomplete trajectories, thereby constructing a new large-scale dataset for VLA pre-training. Our experiments with a state-of-the-art π_0 architecture in both simulated and real-robot environments yield three key findings: (i) pre-training on our dataset improves task success rates by over 20% compared to training from scratch, (ii) the performance is competitive with that achieved using real-robot datasets, and (iii) combining our dataset with real-robot data yields further improvements. These results demonstrate that egocentric videos constitute a promising and scalable resource for advancing VLA research.
zh
[AI-68] Hybrid Diffusion for Simultaneous Symbolic and Continuous Planning
链接: https://arxiv.org/abs/2509.21983
作者: Sigmund Hennum Høeg,Aksel Vaaler,Chaoqi Liu,Olav Egeland,Yilun Du
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 11 figures. This work has been submitted to the IEEE for possible publication. See this https URL for the project website
[AI-69] CoBel-World: Harnessing LLM Reasoning to Build a Collaborative Belief World for Optimizing Embodied Multi-Agent Collaboration
链接: https://arxiv.org/abs/2509.21981
作者: Zhimin Wang,Shaokang He,Duo Wu,Jinghe Wang,Linjia Kang,Jing Yu,Zhi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
[AI-70] From Superficial Outputs to Superficial Learning: Risks of Large Language Models in Education
链接: https://arxiv.org/abs/2509.21972
作者: Iris Delikoura,Yi.R(May)Fung,Pan Hui
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
[AI-71] FlowDrive: moderated flow matching with data balancing for trajectory planning
链接: https://arxiv.org/abs/2509.21961
作者: Lingguang Wang,Ömer Şahin Taş,Marlon Steiner,Christoph Stiller
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-72] Active Attacks: Red-teaming LLMs via Adaptive Environments
链接: https://arxiv.org/abs/2509.21947
作者: Taeyoung Yun,Pierre-Luc St-Charles,Jinkyoo Park,Yoshua Bengio,Minsu Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures, 18 tables
[AI-73] Unveiling Many Faces of Surrogate Models for Configuration Tuning: A Fitness Landscape Analysis Perspective
链接: https://arxiv.org/abs/2509.21945
作者: Pengzhou Chen,Hongyuan Liang,Tao Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This paper is under review
[AI-74] Outlier Detection in Plantar Pressure: Human-Centered Comparison of Statistical Parametric Mapping and Explainable Machine Learning
【速读】:该论文旨在解决植物压力数据集中因技术误差或操作不一致导致的异常值检测问题,传统统计参数映射(Statistical Parametric Mapping, SPM)方法虽具可解释性但对配准敏感且鲁棒性不足。解决方案的关键在于比较一种非参数、依赖配准的SPM方法与一种基于卷积神经网络(Convolutional Neural Network, CNN)的可解释机器学习(Explainable Machine Learning, XAI)方法,并利用SHapley Additive exPlanations(SHAP)提供模型决策的可解释性,从而建立透明的质量控制流程。实验表明,CNN结合SHAP在准确率上优于SPM,能有效识别真实异常值并避免误判临床有意义的变异,同时专家评估显示两种方法的解释均清晰、可信,凸显了可解释性在将复杂模型输出转化为可决策洞察中的核心作用。
链接: https://arxiv.org/abs/2509.21943
作者: Carlo Dindorf,Jonas Dully,Steven Simon,Dennis Perchthaler,Stephan Becker,Hannah Ehmann,Kjell Heitmann,Bernd Stetter,Christian Diers,Michael Fröhlich
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Plantar pressure mapping is essential in clinical diagnostics and sports science, yet large heterogeneous datasets often contain outliers from technical errors or procedural inconsistencies. Statistical Parametric Mapping (SPM) provides interpretable analyses but is sensitive to alignment, and its capacity for robust outlier detection remains unclear. This study compares an SPM approach with an explainable machine learning (ML) approach to establish transparent quality-control pipelines for plantar pressure datasets. Data from multiple centers were annotated by expert consensus and enriched with synthetic anomalies, resulting in 798 valid samples and 2000 outliers. We evaluated (i) a non-parametric, registration-dependent SPM approach and (ii) a convolutional neural network (CNN), explained using SHapley Additive exPlanations (SHAP). Performance was assessed via nested cross-validation; explanation quality via a semantic differential survey with domain experts. The ML model reached high accuracy and outperformed SPM, which misclassified clinically meaningful variations and missed true outliers. Experts perceived both SPM and SHAP explanations as clear, useful, and trustworthy, though SPM was assessed as less complex. These findings highlight the complementary potential of SPM and explainable ML as approaches for automated outlier detection in plantar pressure data, and underscore the importance of explainability in translating complex model outputs into interpretable insights that can effectively inform decision-making.
zh
[AI-75] SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks
链接: https://arxiv.org/abs/2509.21928
作者: Jialiang Li,Wenzheng Wu,Gaojing Zhang,Yifan Han,Wenzhao Lian
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
[AI-76] Generation Properties of Stochastic Interpolation under Finite Training Set
【速读】:该论文旨在解决生成模型在有限训练样本下的理论行为问题,特别是分析其在理想和非理想设置下的生成机制与过拟合/欠拟合现象。解决方案的关键在于基于随机插值生成框架(stochastic interpolation generative framework),推导出在仅有有限训练样本时最优速度场(velocity field)和得分函数(score function)的闭式表达式,并揭示 deterministic 生成过程能精确恢复训练样本,而 stochastic 生成过程则表现为带高斯噪声的训练样本;进一步引入针对生成模型的过拟合与欠拟合定义,表明在估计误差存在时,随机生成过程等效于由均匀噪声和高斯噪声混合扰动的训练样本的凸组合。
链接: https://arxiv.org/abs/2509.21925
作者: Yunchen Li,Shaohui Lin,Zhou Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper investigates the theoretical behavior of generative models under finite training populations. Within the stochastic interpolation generative framework, we derive closed-form expressions for the optimal velocity field and score function when only a finite number of training samples are available. We demonstrate that, under some regularity conditions, the deterministic generative process exactly recovers the training samples, while the stochastic generative process manifests as training samples with added Gaussian noise. Beyond the idealized setting, we consider model estimation errors and introduce formal definitions of underfitting and overfitting specific to generative models. Our theoretical analysis reveals that, in the presence of estimation errors, the stochastic generation process effectively produces convex combinations of training samples corrupted by a mixture of uniform and Gaussian noise. Experiments on generation tasks and downstream tasks such as classification support our theory.
zh
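The closed-form objects mentioned in this abstract can be illustrated with a standard finite-sample computation. Assuming a linear interpolant x_t = α_t x_0 + β_t z with z ~ N(0, I) and a training set {x^(1), ..., x^(n)} (a choice of convention for illustration, not necessarily the paper's exact setting), the marginal at time t is a Gaussian mixture and its score has the explicit form shown below.

```latex
% Score of the Gaussian-mixture marginal induced by a finite training set
\nabla_x \log p_t(x) \;=\; \sum_{i=1}^{n} w_i(x,t)\,\frac{\alpha_t x^{(i)} - x}{\beta_t^{2}},
\qquad
w_i(x,t) \;=\; \frac{\exp\!\left(-\lVert x-\alpha_t x^{(i)}\rVert^{2}/2\beta_t^{2}\right)}
                    {\sum_{j=1}^{n}\exp\!\left(-\lVert x-\alpha_t x^{(j)}\rVert^{2}/2\beta_t^{2}\right)}.
```

As β_t shrinks, the softmax weights collapse onto the nearest training point, which is consistent with the abstract's claim that the deterministic process reproduces training samples while the stochastic process yields training samples perturbed by Gaussian noise.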
[AI-77] DyRo-MCTS: A Robust Monte Carlo Tree Search Approach to Dynamic Job Shop Scheduling
链接: https://arxiv.org/abs/2509.21902
作者: Ruiqi Chen,Yi Mei,Fangfang Zhang,Mengjie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-78] GenesisGeo: Technical Report
链接: https://arxiv.org/abs/2509.21896
作者: Minfeng Zhu,Zi Wang,Sizhe Ji,Zhengtong Du,Junming Ke,Xiao Deng,Zanlang Yin,Xiuqi Huang,Heyu Wang,Wei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-79] TRACE: Learning to Compute on Graphs
【速读】:该论文旨在解决图表示学习中的核心挑战——“学习计算”(learning to compute),即建模计算图(computational graph)的功能行为。现有主流方法,如消息传递神经网络(MPNNs)及其基于Transformer的变体,因架构设计上的缺陷无法捕捉计算过程中位置敏感且分层的特性。解决方案的关键在于提出一种全新的范式——TRACE,其包含两个核心创新:一是采用分层Transformer(Hierarchical Transformer)作为架构基础,以模拟计算的逐步执行流程,替代传统模型中不合理的置换不变聚合机制;二是引入函数偏移学习(function shift learning),通过将复杂全局函数的学习解耦为局部近似与函数偏移的预测任务,从而提升模型的表达能力和训练稳定性。实验证明,该范式在电子电路等复杂计算图上显著优于先前方法,验证了架构对齐与解耦学习目标的有效性。
链接: https://arxiv.org/abs/2509.21886
作者: Ziyang Zheng,Jiaying Zhu,Jingyi Zhou,Qiang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Learning to compute, the ability to model the functional behavior of a computational graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. This flawed assumption, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on electronic circuits, one of the most complex and economically critical classes of computational graphs. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning to compute on graphs.
zh
[AI-80] Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards
【速读】:该论文旨在解决强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在提升大语言模型性能时存在的评估偏差和潜在代价问题,具体包括:1)报告的性能增益是否在严格公平控制条件下依然成立;2)RLVR是否真正“无成本”,抑或存在可量化的效率损耗(即“RLVR税”)。其解决方案的关键在于提出一种“税感知”的训练与评估协议,该协议通过联合优化准确性、事实一致性(grounding)与校准后的弃权行为(calibrated abstention),并引入标准化的预算分配与数据溯源核查机制,从而实现更可靠地量化推理能力提升,并修正此前因数据污染、评估漏洞及未控变量导致的高估结论。
链接: https://arxiv.org/abs/2509.21882
作者: Aaron Tu,Weihao Xuan,Heli Qi,Xu Huang,Qingcheng Zeng,Shayan Talaei,Yijia Xiao,Peng Xia,Xiangru Tang,Yuchen Zhuang,Bing Hu,Hanqun Cao,Wenqi Shi,Tianang Leng,Rui Yang,Yingjian Chen,Ziqi Wang,Irene Li,Nan Liu,Huaxiu Yao,Li Erran Li,Ge Liu,Amin Saberi,Naoto Yokoya,Jure Leskovec,Yejin Choi,Fang Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reinforcement learning with verifiable rewards (RLVR) is a practical and scalable approach to enhancing large language models in areas such as math, code, and other structured tasks. Two questions motivate this paper: how much of the reported gains survive under strictly parity-controlled evaluation, and whether RLVR is cost-free or exacts a measurable tax. We argue that progress is real, but gains are often overstated due to three forces - an RLVR tax, evaluation pitfalls, and data contamination. Using a partial-prompt contamination audit and matched-budget reproductions across base and RL models, we show that several headline gaps shrink or vanish under clean, parity-controlled evaluation. We then propose a tax-aware training and evaluation protocol that co-optimizes accuracy, grounding, and calibrated abstention and standardizes budgeting and provenance checks. Applied to recent RLVR setups, this protocol yields more reliable estimates of reasoning gains and, in several cases, revises prior conclusions. Our position is constructive: RLVR is valuable and industry-ready; we advocate keeping its practical benefits while prioritizing reliability, safety, and measurement.
zh
[AI-81] Reimagining Agent-based Modeling with Large Language Model Agents via Shachi
链接: https://arxiv.org/abs/2509.21862
作者: So Kuroki,Yingtao Tian,Kou Misaki,Takashi Ikegami,Takuya Akiba,Yujin Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); General Economics (econ.GN)
备注:
[AI-82] Graph of Agents: Principled Long Context Modeling by Emergent Multi-Agent Collaboration
链接: https://arxiv.org/abs/2509.21848
作者: Taejong Joo,Shu Ishida,Ivan Sosnovik,Bryan Lim,Sahand Rezaei-Shoshtari,Adam Gaier,Robert Giaquinto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
[AI-83] Beyond Johnson-Lindenstrauss: Uniform Bounds for Sketched Bilinear Forms
链接: https://arxiv.org/abs/2509.21847
作者: Rohan Deb,Qiaobo Li,Mayank Shrivastava,Arindam Banerjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
[AI-84] DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents
链接: https://arxiv.org/abs/2509.21842
作者: Yansong Ning,Rui Liu,Jun Wang,Kai Chen,Wei Li,Jun Fang,Kan Zheng,Naiqiang Tan,Hao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review
[AI-85] Can Large Language Models Autoformalize Kinematics?
链接: https://arxiv.org/abs/2509.21840
作者: Aditi Kabra,Jonathan Laurent,Sagar Bharadwaj,Ruben Martins,Stefan Mitsch,André Platzer
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
[AI-86] Axiomatic Choice and the Decision-Evaluation Paradox
链接: https://arxiv.org/abs/2509.21836
作者: Ben Abramowitz,Nicholas Mattei
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
[AI-87] DS-STAR: Data Science Agent via Iterative Planning and Verification
链接: https://arxiv.org/abs/2509.21825
作者: Jaehyun Nam,Jinsung Yoon,Jiefeng Chen,Jinwoo Shin,Tomas Pfister
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-88] ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration
【速读】:该论文旨在解决当前奖励机制在GUI代理(Graphical User Interface agents)训练与评估中的局限性问题,特别是规则或模型驱动的奖励方法难以泛化、且基于静态轨迹的LLM-as-a-Judge方法准确性不足的问题。其核心解决方案是提出ProRe——一种主动式奖励系统,关键在于引入一个通用推理器(reasoner)和领域特定评估器(evaluator)代理协同工作:推理器调度有针对性的状态探测任务,评估器通过与环境的主动交互获取额外观测,从而提升奖励的准确性与可验证性。这一机制显著增强了对GUI代理行为的判别能力,实验证明其在3000+轨迹上将奖励准确率和F1分数分别提升5.3%和19.4%,并使先进策略代理的成功率最高提升22.4%。
链接: https://arxiv.org/abs/2509.21823
作者: Gaole Dai,Shiqi Jiang,Ting Cao,Yuqing Yang,Yuanchun Li,Rui Tan,Mo Li,Lili Qiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures
点击查看摘要
Abstract:Reward is critical to the evaluation and training of large language models (LLMs). However, existing rule-based or model-based reward methods struggle to generalize to GUI agents, where access to ground-truth trajectories or application databases is often unavailable, and static trajectory-based LLM-as-a-Judge approaches suffer from limited accuracy. To address these challenges, we propose ProRe, a proactive reward system that leverages a general-purpose reasoner and domain-specific evaluator agents (actors). The reasoner schedules targeted state probing tasks, which the evaluator agents then execute by actively interacting with the environment to collect additional observations. This enables the reasoner to assign more accurate and verifiable rewards to GUI agents. Empirical results on over 3K trajectories demonstrate that ProRe improves reward accuracy and F1 score by up to 5.3% and 19.4%, respectively. Furthermore, integrating ProRe with state-of-the-art policy agents yields a success rate improvement of up to 22.4%.
zh
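The reasoner-actor loop described above can be written schematically as follows. Every interface here (`plan_probes`, `execute`, `score`, the probe objects) is hypothetical shorthand for the roles named in the abstract, not the ProRe API; the sketch only shows how active state probing turns a static LLM-as-a-Judge into a reward grounded in fresh observations.

```python
def prore_reward(trajectory, env, reasoner, evaluators):
    """Illustrative reasoner-actor reward loop (all interfaces are assumed)."""
    probes = reasoner.plan_probes(trajectory)            # e.g. "open settings and read the toggle state"
    observations = []
    for probe in probes:
        actor = evaluators[probe.domain]                 # domain-specific evaluator agent
        observations.append(actor.execute(probe, env))   # active interaction with the environment
    return reasoner.score(trajectory, observations)      # reward backed by probed, verifiable state
```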
[AI-89] ChaosNexus: A Foundation Model for Universal Chaotic System Forecasting with Multi-scale Representations
链接: https://arxiv.org/abs/2509.21802
作者: Chang Liu,Bohao Zhao,Jingtao Ding,Yong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-90] D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents
【速读】:该论文旨在解决当前图形用户界面(GUI)代理在端到端训练中面临的数据瓶颈、延迟错误检测成本高以及执行过程中存在矛盾指令风险等关键挑战。其解决方案的核心在于提出一种受人类认知循环(Thinking, Alignment, Reflection)启发的 deliberative 框架 D-Artemis:通过细粒度、应用特定的提示检索机制增强决策能力;引入预执行对齐阶段(Pre-execution Alignment),结合 Thought-Action Consistency (TAC) 检查模块与 Action Correction Agent (ACA) 降低执行失败风险;并通过后执行状态反思代理(Status Reflection Agent, SRA)完成闭环学习,实现经验驱动的战略性优化。该框架无需依赖复杂轨迹数据集进行训练,即可显著提升通用多模态大语言模型(MLLMs)在 GUI 任务中的泛化性能,在 AndroidWorld 和 ScreenSpot-V2 基准上分别达到 75.8% 和 96.8% 的成功率,验证了各组件的有效性与协同作用。
链接: https://arxiv.org/abs/2509.21799
作者: Hongze Mi,Yibo Feng,Wenjie Lu,Yuqi Wang,Jinyuan Li,Song Cao,He Cui,Tengfei Tian,Xuelin Zhang,Haotian Luo,Di Sun,Naiqiang Tan,Gang Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advancements, current approaches are hindered by several critical challenges: a data bottleneck in end-to-end training, the high cost of delayed error detection, and the risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, this paper presents D-Artemis, a novel deliberative framework. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, where the Thought-Action Consistency (TAC) Check module and the Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) for GUI tasks without the need for training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results across both major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
zh
[AI-91] FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning ICLR2026
【速读】:该论文旨在解决基于群体相对策略优化(Group Relative Policy Optimization, GRPO)训练大型语言模型(Large Language Models, LLMs)时生成阶段效率低下问题,其核心瓶颈在于每轮训练中需对每个查询进行多次自回归生成响应,导致计算开销巨大。解决方案的关键在于提出一种并发感知的推测解码框架(concurrency-aware speculative decoding framework),该框架能根据实时并发水平动态调整草稿(drafting)与验证(verification)策略,从而最大化生成加速;同时引入在线草稿学习机制(online draft learning mechanism),通过目标模型的反馈信号持续更新固定草稿模型,缓解因分布漂移(distributional drift)导致的性能下降。实验表明,该方法在多个数学推理数据集和模型上实现了2.35x至2.72x的端到端加速效果,显著优于基线方法。
链接: https://arxiv.org/abs/2509.21792
作者: Yizhou Zhang,Ning Lv,Teng Wang,Jisheng Dang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to ICLR 2026
点击查看摘要
Abstract:Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at this https URL.
zh
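The two ingredients of this framework, a concurrency-aware speculation schedule and online draft learning, can be sketched as below. The thresholds in the schedule and the KL-distillation update are illustrative choices, and the HuggingFace-style `.logits` access on the draft model is an assumption, not the paper's implementation.

```python
import torch.nn.functional as F

def draft_length_for(concurrency, max_len=8):
    """Illustrative concurrency-aware schedule: speculate aggressively when few
    sequences are in flight, and fall back toward plain decoding under heavy load,
    where large verification batches would dominate the step time."""
    if concurrency >= 64:
        return 0          # speculation disabled
    if concurrency >= 16:
        return 2
    if concurrency >= 4:
        return 4
    return max_len

def online_draft_update(draft_model, input_ids, target_logits, optimizer):
    """One distillation step keeping the draft aligned with the drifting target policy."""
    draft_logits = draft_model(input_ids).logits          # HuggingFace-style output assumed
    loss = F.kl_div(F.log_softmax(draft_logits, dim=-1),
                    F.softmax(target_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```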
[AI-92] Unbiased Binning: Fairness-aware Attribute Representation
链接: https://arxiv.org/abs/2509.21785
作者: Abolfazl Asudeh,Zeinab(Mila)Asoodeh,Bita Asoodeh,Omid Asudeh
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
[AI-93] Benchmarking MLLM-based Web Understanding: Reasoning Robustness and Safety
链接: https://arxiv.org/abs/2509.21782
作者: Junliang Liu,Jingyu Xiao,Wenxin Tang,Wenxuan Wang,Zhixian Wang,Minrui Zhang,Shuanghe Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-94] Lifelong Learning with Behavior Consolidation for Vehicle Routing
【速读】:该论文旨在解决神经路由求解器在面对多任务、动态变化的问题分布与规模时,因传统训练方式导致的灾难性遗忘(catastrophic forgetting)问题,即模型在学习新任务时会显著退化对旧任务的性能。解决方案的关键在于提出一种名为“终身学习路由器与行为巩固”(Lifelong Learning Router with Behavior Consolidation, LLR-BC)的新框架,其核心机制是通过决策导向的行为对齐策略,将新任务训练过程中产生的行为与缓存的历史经验进行有效对齐,并为低置信度决策分配更高权重以强化关键经验的巩固,从而在保持模型对已学任务性能的同时,提升其对新任务的学习效率和零样本泛化能力。
链接: https://arxiv.org/abs/2509.21765
作者: Jiyuan Pei,Yi Mei,Jialin Liu,Mengjie Zhang,Xin Yao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent neural solvers have demonstrated promising performance in learning to solve routing problems. However, existing studies are primarily based on one-off training on one or a set of predefined problem distributions and scales, i.e., tasks. When a new task arises, they typically rely on either zero-shot generalization, which may be poor due to the discrepancies between the new task and the training task(s), or fine-tuning the pretrained solver on the new task, which possibly leads to catastrophic forgetting of knowledge acquired from previous tasks. This paper explores a novel lifelong learning paradigm for neural VRP solvers, where multiple tasks with diverse distributions and scales arise sequentially over time. Solvers are required to effectively and efficiently learn to solve new tasks while maintaining their performance on previously learned tasks. Consequently, a novel framework called Lifelong Learning Router with Behavior Consolidation (LLR-BC) is proposed. LLR-BC consolidates prior knowledge effectively by aligning behaviors of the solver trained on a new task with the buffered ones in a decision-seeking way. To encourage more focus on crucial experiences, LLR-BC assigns greater consolidated weights to decisions with lower confidence. Extensive experiments on capacitated vehicle routing problems and traveling salesman problems demonstrate LLR-BC’s effectiveness in training high-performance neural solvers in a lifelong learning setting, addressing the catastrophic forgetting issue, maintaining their plasticity, and improving zero-shot generalization ability.
zh
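The behavior-consolidation idea, aligning the solver's decisions on buffered experiences and up-weighting low-confidence decisions, admits a compact sketch. The exact weighting and divergence used by LLR-BC are not given in the abstract, so the form below is an illustrative stand-in.

```python
import torch.nn.functional as F

def consolidation_loss(new_logits, buffered_logits):
    """Align the current solver's decision distribution with buffered behavior,
    weighting low-confidence buffered decisions more heavily (illustrative form)."""
    old_probs = F.softmax(buffered_logits, dim=-1)
    confidence = old_probs.max(dim=-1).values             # (B,) confidence of each buffered decision
    weights = 1.0 - confidence                            # focus consolidation on uncertain decisions
    kl = F.kl_div(F.log_softmax(new_logits, dim=-1), old_probs,
                  reduction="none").sum(dim=-1)           # per-decision KL divergence
    return (weights * kl).mean()
```

This term would be added to the new-task training loss, which is how the method trades plasticity on the new task against forgetting on old ones.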
[AI-95] Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
链接: https://arxiv.org/abs/2509.21761
作者: Miao Yu,Zhenhong Zhou,Moayad Aloqaily,Kun Wang,Biwei Huang,Stephen Wang,Yueming Jin,Qingsong Wen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
[AI-96] SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection
链接: https://arxiv.org/abs/2509.21748
作者: Brian B. Moser,Tobias C. Nauen,Arundhati S. Shanbhag,Federico Raue,Stanislav Frolov,Joachim Folz,Andreas Dengel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-97] HyperCore: Coreset Selection under Noise via Hypersphere Models
【速读】:该论文旨在解决现有核心集(coreset)选择方法在真实场景中因忽略标注错误和依赖固定剪枝比例而难以应用的问题。其解决方案的关键在于提出HyperCore框架,该框架通过为每个类别学习轻量级超球面模型(hypersphere model),将类内样本嵌入至超球心附近,并基于距离自然分离类外样本;同时利用Youden’s J统计量自适应确定剪枝阈值,实现无需超参数调优的噪声感知数据剪枝,从而在噪声环境和低数据条件下有效剔除误标与模糊样本,生成紧凑且信息丰富的子集,支持可扩展、抗噪的学习过程。
链接: https://arxiv.org/abs/2509.21746
作者: Brian B. Moser,Arundhati S. Shanbhag,Tobias C. Nauen,Stanislav Frolov,Federico Raue,Joachim Folz,Andreas Dengel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden’s J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
zh
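The two mechanisms named in the abstract, per-class hypersphere distances and a Youden's J threshold, are simple to sketch. The feature space, centers, and the exhaustive threshold search below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def youden_threshold(distances, is_in_class):
    """Pick the distance cutoff maximizing Youden's J = TPR - FPR,
    treating in-class samples as positives (illustrative search)."""
    best_j, best_t = -1.0, None
    for t in np.unique(distances):
        pred_in = distances <= t
        tpr = np.mean(pred_in[is_in_class])
        fpr = np.mean(pred_in[~is_in_class])
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t

def prune_class(features, center, threshold):
    """Keep samples whose embedding lies inside the class hypersphere."""
    d = np.linalg.norm(features - center, axis=1)
    return d <= threshold
```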
[AI-98] Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在推理过程中因生成冗长推理轨迹而导致的延迟高和成本高的问题。解决方案的关键在于提出检索式思维(Retrieval-of-Thought, RoT),其核心是将先前的推理步骤组织成带有顺序和语义边的思维图(thought graph),从而实现高效检索与灵活重组;在推理阶段,RoT通过奖励引导的遍历策略从图中检索相关节点并构建问题特定的动态模板,以减少冗余探索,显著降低输出token数量、推理延迟和成本,同时保持原有准确性。
链接: https://arxiv.org/abs/2509.21743
作者: Ammar Ahmed,Azal Ahmad Khan,Ayaan Ahmad,Sheng Di,Zirui Liu,Ali Anwar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable "thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
zh
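A minimal sketch of the retrieve-then-traverse step may help. It assumes the thought graph is a networkx DiGraph whose nodes carry a "text" attribute and whose edges encode the sequential/semantic links, and that the caller supplies `embed` and `reward` scoring functions; none of these names come from the paper.

```python
import networkx as nx

def build_template(graph: nx.DiGraph, query_embedding, embed, reward, max_steps=6):
    """Illustrative RoT-style assembly: seed with the most query-similar thought node,
    then greedily follow outgoing edges by a reward score to form a template."""
    seed = max(graph.nodes,
               key=lambda n: embed(graph.nodes[n]["text"]) @ query_embedding)
    path = [seed]
    for _ in range(max_steps - 1):
        neighbors = list(graph.successors(path[-1]))
        if not neighbors:
            break
        path.append(max(neighbors,
                        key=lambda n: reward(graph.nodes[n]["text"], query_embedding)))
    return [graph.nodes[n]["text"] for n in path]
```

The resulting list of thought steps is what would be prepended to the prompt as a problem-specific template.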
[AI-99] Brain PathoGraph Learning
链接: https://arxiv.org/abs/2509.21742
作者: Ciyuan Peng,Nguyen Linh Dan Le,Shan Jin,Dexuan Ding,Shuo Yu,Feng Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-100] POLO: Preference-Guided Multi-Turn Reinforcement Learning for Lead Optimization
链接: https://arxiv.org/abs/2509.21737
作者: Ziqing Wang,Yibo Wen,William Pattie,Xiao Luo,Weimin Wu,Jerry Yao-Chieh Hu,Abhishek Pandey,Han Liu,Kaize Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-101] Uncovering Alzheimer's Disease Progression via SDE-based Spatio-Temporal Graph Deep Learning on Longitudinal Brain Networks
链接: https://arxiv.org/abs/2509.21735
作者: Houliang Zhou,Rong Zhou,Yangying Liu,Kanhao Zhao,Li Shen,Brian Y. Chen,Yu Zhang,Lifang He,Alzheimer’s Disease Neuroimaging Initiative
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-102] Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization ICASSP2026
链接: https://arxiv.org/abs/2509.21718
作者: Shehzeen Hussain,Paarth Neekhara,Xuesong Yang,Edresson Casanova,Subhankar Ghosh,Roy Fejgin,Ryan Langman,Mikyas Desta,Leili Tavabi,Jason Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2026
[AI-103] Developing Strategies to Increase Capacity in AI Education
链接: https://arxiv.org/abs/2509.21713
作者: Noah Q. Cowit,Sri Yash Tadimalla,Stephanie T. Jones,Mary Lou Maher,Tracy Camp,Enrico Pontelli
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: This is a 40 page report prepared by the CRA based on 32 virtual roundtable discussions with 202 experts committed to developing AI Education from varied backgrounds
[AI-104] Not My Agent Not My Boundary? Elicitation of Personal Privacy Boundaries in AI-Delegated Information Sharing
链接: https://arxiv.org/abs/2509.21712
作者: Bingcan Guo,Eryue Xu,Zhiping Zhang,Tianshi Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
[AI-105] QueryGym: Step-by-Step Interaction with Relational Databases
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在数据库查询生成过程中缺乏可解释性、依赖特定查询语言方言以及难以进行系统化评估的问题。其解决方案的关键在于提出QueryGym——一个基于Gymnasium接口的交互式环境,要求代理(agent)显式构建关系代数操作序列,从而实现与数据库引擎无关的评估和透明的分步查询规划;该环境提供结构化观测信息(如模式细节、中间结果和执行反馈)并接收数据库探索动作(如预览表、采样列值)及关系代数操作(如筛选、投影、连接),为错误修复、透明性研究和强化学习驱动的查询生成提供了可复现的实验平台。
链接: https://arxiv.org/abs/2509.21674
作者: Haritha Ananthakrishanan,Harsha Kokel,Kelsey Sikes,Debarun Bhattacharjya,Michael Katz,Shirin Sohrabi,Kavitha Srinivas
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents. Existing frameworks often tie agents to specific query language dialects or obscure their reasoning; QueryGym instead requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is implemented as a Gymnasium interface that supplies observations – including schema details, intermediate results, and execution feedback – and receives actions that represent database exploration (e.g., previewing tables, sampling column values, retrieving unique values) as well as relational algebra operations (e.g., filter, project, join). We detail the motivation and the design of the environment. In the demo, we showcase the utility of the environment by contrasting it with contemporary LLMs that query databases. QueryGym serves as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation. For the associated demo, see this https URL.
zh
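Since the environment follows the Gymnasium interface, an interaction loop looks like the sketch below. The environment id, configuration keywords, observation fields, and action dictionary format are assumptions for illustration, not the actual QueryGym API.

```python
# Minimal Gymnasium-style interaction sketch (ids and action format are assumed).
import gymnasium as gym

env = gym.make("QueryGym-v0", db="flights")          # hypothetical id and config
obs, info = env.reset(seed=0)                        # schema details, samples, feedback

plan = [
    {"op": "preview_table", "table": "flights"},                       # database exploration
    {"op": "filter", "table": "flights", "predicate": "origin = 'JFK'"},  # relational algebra
    {"op": "project", "columns": ["carrier", "dep_delay"]},
]
for action in plan:
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
```

The explicit sequence of relational-algebra operations is what makes the agent's plan engine-agnostic and inspectable step by step.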
[AI-106] SlotFM: A Motion Foundation Model with Slot Attention for Diverse Downstream Tasks
链接: https://arxiv.org/abs/2509.21673
作者: Junyong Park,Oron Levy,Rebecca Adaimi,Asaf Liberman,Gierad Laput,Abdelkareem Bedri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-107] DIM: Enforcing Domain-Informed Monotonicity in Deep Neural Networks
链接: https://arxiv.org/abs/2509.21666
作者: Joshua Salim,Jordan Yu,Xilei Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-108] Logic of Hypotheses: from Zero to Full Knowledge in Neurosymbolic Integration
链接: https://arxiv.org/abs/2509.21663
作者: Davide Bizzaro,Alessandro Daniele
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
[AI-109] Limitations on Safe Trusted Artificial General Intelligence
链接: https://arxiv.org/abs/2509.21654
作者: Rina Panigrahy,Vatsal Sharan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注: 17 pages, 1 figure
[AI-110] Can AI Perceive Physical Danger and Intervene?
链接: https://arxiv.org/abs/2509.21651
作者: Abhishek Jindal,Dmitry Kalashnikov,Oscar Chang,Divya Garikapati,Anirudha Majumdar,Pierre Sermanet,Vikas Sindhwani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
[AI-111] MobiLLM: An Agentic AI Framework for Closed-Loop Threat Mitigation in 6G Open RANs
链接: https://arxiv.org/abs/2509.21634
作者: Prakhar Sharma,Haohuang Wen,Vinod Yegneswaran,Ashish Gehani,Phillip Porras,Zhiqiang Lin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:
[AI-112] Semantic F1 Scores: Fair Evaluation Under Fuzzy Class Boundaries
链接: https://arxiv.org/abs/2509.21633
作者: Georgios Chochlakis,Jackson Trager,Vedant Jhaveri,Nikhil Ravichandran,Alexandros Potamianos,Shrikanth Narayanan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, 1 table, 29 figures, 4 algorithms
[AI-113] Guiding Audio Editing with Audio Language Model
链接: https://arxiv.org/abs/2509.21625
作者: Zitong Lan,Yiduo Hao,Mingmin Zhao
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
[AI-114] LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning
链接: https://arxiv.org/abs/2509.21617
作者: Marco Paul E. Apolinario,Kaushik Roy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 16 pages, 3 figures
[AI-115] Automated and Interpretable Survival Analysis from Multimodal Data
链接: https://arxiv.org/abs/2509.21600
作者: Mafalda Malafaia,Peter A.N. Bosman,Coen Rasch,Tanja Alderliesten
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 figures; 4 tables; 24 pages
[AI-116] GeoEvolve: Automating Geospatial Model Discovery via Multi-Agent Large Language Models
【速读】:该论文旨在解决现有基于大语言模型(Large Language Model, LLM)的算法发现框架在处理复杂地理空间问题时,因缺乏领域知识和多步推理能力而导致的性能瓶颈问题。其解决方案的关键在于提出GeoEvolve——一个融合进化搜索与地理空间领域知识的多智能体LLM框架,通过嵌套双循环机制实现自动算法设计与优化:内层由代码进化器生成并变异候选解,外层由代理控制器评估全局最优解,并调用GeoKnowRAG模块(结构化地理空间知识库)注入理论先验,从而引导搜索过程向理论合理且计算高效的算法收敛。该方法显著提升了空间插值(RMSE降低13–21%)和不确定性量化(性能提升17%)的性能,验证了领域知识驱动的自动化地理空间建模路径的有效性。
链接: https://arxiv.org/abs/2509.21593
作者: Peng Luo,Xiayin Lou,Yu Zheng,Zhuo Zheng,Stefano Ermon
机构: 未知
类目: Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
备注:
点击查看摘要
Abstract:Geospatial modeling provides critical solutions for pressing global challenges such as sustainability and climate change. Existing large language model (LLM)-based algorithm discovery frameworks, such as AlphaEvolve, excel at evolving generic code but lack the domain knowledge and multi-step reasoning required for complex geospatial problems. We introduce GeoEvolve, a multi-agent LLM framework that couples evolutionary search with geospatial domain knowledge to automatically design and refine geospatial algorithms. GeoEvolve operates in two nested loops: an inner loop leverages a code evolver to generate and mutate candidate solutions, while an outer agentic controller evaluates global elites and queries a GeoKnowRAG module – a structured geospatial knowledge base that injects theoretical priors from geography. This knowledge-guided evolution steers the search toward theoretically meaningful and computationally efficient algorithms. We evaluate GeoEvolve on two fundamental and classical tasks: spatial interpolation (kriging) and spatial uncertainty quantification (geospatial conformal prediction). Across these benchmarks, GeoEvolve automatically improves and discovers new algorithms, incorporating geospatial theory on top of classical models. It reduces spatial interpolation error (RMSE) by 13-21% and enhances uncertainty estimation performance by 17%. Ablation studies confirm that domain-guided retrieval is essential for stable, high-quality evolution. These results demonstrate that GeoEvolve provides a scalable path toward automated, knowledge-driven geospatial modeling, opening new opportunities for trustworthy and efficient AI-for-Science discovery.
zh
[AI-117] EEG-Based Consumer Behaviour Prediction: An Exploration from Classical Machine Learning to Graph Neural Networks
链接: https://arxiv.org/abs/2509.21567
作者: Mohammad Parsa Afshar,Aryan Azimi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-118] AutoClimDS: Climate Data Science Agentic AI – A Knowledge Graph is All You Need
链接: https://arxiv.org/abs/2509.21553
作者: Ahmed Jaber,Wangshu Zhu,Karthick Jayavelu,Justin Downes,Sameer Mohamed,Candace Agonafir,Linnia Hawkins,Tian Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
[AI-119] Correct Reasoning Paths Visit Shared Decision Pivots
链接: https://arxiv.org/abs/2509.21549
作者: Dongkyu Cho,Amy B.Z. Zhang,Bilel Fehri,Sheng Wang,Rumi Chunara,Rui Song,Hengrui Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures
[AI-120] Psychological and behavioural responses in human-agent vs. human-human interactions: a systematic review and meta-analysis
链接: https://arxiv.org/abs/2509.21542
作者: Jianan Zhou,Fleur Corbett,Joori Byun,Talya Porat,Nejra van Zalk
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
[AI-121] Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
链接: https://arxiv.org/abs/2509.21528
作者: Sathwik Karnik,Somil Bansal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-122] Shortcut Flow Matching for Speech Enhancement: Step-Invariant flows via single stage training ICASSP2026
链接: https://arxiv.org/abs/2509.21522
作者: Naisong Zhou,Saisamarth Rajesh Phaye,Milos Cernak,Tijana Stojkovic,Andy Pearce,Andrea Cavallaro,Andy Harper
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures, submitted to ICASSP2026
[AI-123] Li_2: A Framework on Dynamics of Feature Emergence and Delayed Generalization
【速读】:该论文旨在解决复杂结构化输入下“grokking”(延迟泛化)现象的数学刻画问题,即明确在训练过程中哪些特征会涌现、如何涌现以及在何种条件下发生。其解决方案的关键在于提出一个名为 Li2 的新框架,该框架通过分析反向传播梯度 GF 的层间结构,将 grokking 行为划分为三个阶段:(I) 懒惰学习(Lazy learning),此时 GF 为随机,顶层过拟合于随机隐藏表示;(II) 独立特征学习(independent feature learning),此时 GF 中每个节点的梯度仅依赖于自身激活,隐藏节点独立学习表示,且该过程等价于能量函数 E 的梯度上升,局部极大值对应新兴特征;(III) 交互特征学习(interactive feature learning),隐藏节点开始相互作用,GF 调整以聚焦缺失特征的学习。此框架揭示了权重衰减、学习率和样本量等超参数的作用机制,并提供了记忆与泛化能力的可证明缩放律,从而从梯度动力学的第一性原理出发解释了如 Muon 等优化器的有效性。
链接: https://arxiv.org/abs/2509.21519
作者: Yuandong Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework to characterize what kind of features emerge, how and in which conditions it happens from training, for complex structured inputs. We propose a novel framework, named Li_2, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) Lazy learning, (II) independent feature learning and (III) interactive feature learning, characterized by the structure of the backpropagated gradient G_F across layers. In (I), G_F is random, and the top layer overfits to random hidden representation. In (II), the gradient of each node (column of G_F ) only depends on its own activation, and thus each hidden node learns their representation independently from G_F , which now carries information about target labels, thanks to weight decay. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function E , and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change on sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how G_F changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying reason why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.
zh
[AI-124] New Algorithmic Directions in Optimal Transport and Applications for Product Spaces
链接: https://arxiv.org/abs/2509.21502
作者: Salman Beigi,Omid Etesami,Mohammad Mahmoody,Amir Najafi
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:
[AI-125] Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
【速读】:该论文旨在解决强化学习微调(Reinforcement Fine-Tuning, RFT)中常见的“奖励过优化”(reward over-optimization)问题,即策略模型通过操纵奖励信号获得高分,但输出质量低下。研究表明,问题根源在于高奖励尾部区域的奖励误设:模型无法可靠区分“优秀”响应与“仅良好”响应。为此,作者聚焦于高奖励区域,并提出基于评分量表(rubric-based rewards)的解决方案——该方法利用离策略样本(如更强模型生成或重写示例)训练,同时保持对这些样本人工特征的鲁棒性,从而在不引入偏差的前提下精准刻画高奖励尾部差异。关键创新在于设计一套能区分多样且高质量响应的评分机制,并构建相应工作流以实现该目标,实验证明其显著缓解了奖励过优化并提升了大语言模型(LLM)后训练效果。
链接: https://arxiv.org/abs/2509.21500
作者: Junkai Zhang,Zihao Wang,Lin Gui,Swarnashree Mysore Sathyendra,Jaehwan Jeong,Victor Veitch,Wei Wang,Yunzhong He,Bing Liu,Lifeng Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g. from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at this https URL .
zh
[AI-126] Neural Operators for Mathematical Modeling of Transient Fluid Flow in Subsurface Reservoir Systems
链接: https://arxiv.org/abs/2509.21485
作者: Daniil D. Sirota,Sergey A. Khan,Sergey L. Kostikov,Kirill A. Butov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph)
备注: 10 pages, 6 figures
[AI-127] Score-based Idempotent Distillation of Diffusion Models
链接: https://arxiv.org/abs/2509.21470
作者: Shehtab Zaman,Chengyan Liu,Kenneth Chiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
[AI-128] MIXRAG: Mixture-of-Experts Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering
链接: https://arxiv.org/abs/2509.21391
作者: Lihui Liu,Carl J. Yang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
[AI-129] Towards Adapting Federated Quantum Machine Learning for Network Intrusion Detection: A Survey
链接: https://arxiv.org/abs/2509.21389
作者: Devashish Chaudhary,Sutharshan Rajasegarar,Shiva Raj Pokhrel
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 34 pages, 16 figures, IEEE Communication Surveys and Tutorials
[AI-130] Design and Implementation of a Secure RAG-Enhanced AI Chatbot for Smart Tourism Customer Service: Defending Against Prompt Injection Attacks – A Case Study of Hsinchu, Taiwan
链接: https://arxiv.org/abs/2509.21367
作者: Yu-Kai Shih,You-Kai Kang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, 5 tables
[AI-131] Domain-Informed Genetic Superposition Programming: A Case Study on SFRC Beams
链接: https://arxiv.org/abs/2509.21355
作者: Mohammad Sadegh Khorshidi,Navid Yazdanjue,Hassan Gharoun,Mohammad Reza Nikoo,Fang Chen,Amir H. Gandomi
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 tables, 4 figures
[AI-132] SGNNBench: A Holistic Evaluation of Spiking Graph Neural Network on Large-scale Graph
链接: https://arxiv.org/abs/2509.21342
作者: Huizhe Zhang,Jintang Li,Yuchang Zhu,Liang Chen,Li Kuang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The code is available at this https URL
[AI-133] From Embeddings to Equations: Genetic-Programming Surrogates for Interpretable Transformer Classification
链接: https://arxiv.org/abs/2509.21341
作者: Mohammad Sadegh Khorshidi,Navid Yazdanjue,Hassan Gharoun,Mohammad Reza Nikoo,Fang Chen,Amir H. Gandomi
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 8 tables, 7 figures
[AI-134] Cycle is All You Need: More Is Different
链接: https://arxiv.org/abs/2509.21340
作者: Xin Li
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
[AI-135] PIR-RAG: A System for Private Information Retrieval in Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2509.21325
作者: Baiqiang Wang,Qian Lou,Mengxin Zheng,Dongfang Zhao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
[AI-136] From Search to Reasoning: A Five-Level RAG Capability Framework for Enterprise Data
链接: https://arxiv.org/abs/2509.21324
作者: Gurbinder Gill,Ritvik Gupta,Denis Lusson,Anand Chandrashekar,Donald Nguyen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
[AI-137] Toward a Physics of Deep Learning and Brains
【速读】:该论文试图解决的问题是:如何建立一个统一的理论框架,以解释生物神经网络(如大脑)与人工深度神经网络在学习机制上的共性。其解决方案的关键在于发现并验证了描述活体大脑中神经元级联活动(neuronal avalanches)的非平衡统计物理方程,同样适用于深度神经网络中的活动传播过程;进一步表明,深度神经网络在接近临界点但处于准临界(quasi-critical)状态时性能最优,此时系统表现出裂纹噪声(crackling noise)标度关系,且最大敏感度(maximal susceptibility)比接近临界点本身更能可靠预测学习效果,从而为优化网络性能提供了理论依据。
链接: https://arxiv.org/abs/2509.22649
作者: Arsham Ghavasieh,Meritxell Vila-Minana,Akanksha Khurd,John Beggs,Gerardo Ortiz,Santo Fortunato
机构: 未知
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Biological Physics (physics.bio-ph)
备注:
点击查看摘要
Abstract:Deep neural networks and brains both learn and share superficial similarities: processing nodes are likened to neurons and adjustable weights are likened to modifiable synapses. But can a unified theoretical framework be found to underlie them both? Here we show that the equations used to describe neuronal avalanches in living brains can also be applied to cascades of activity in deep neural networks. These equations are derived from non-equilibrium statistical physics and show that deep neural networks learn best when poised between absorbing and active phases. Because these networks are strongly driven by inputs, however, they do not operate at a true critical point but within a quasi-critical regime – one that still approximately satisfies crackling noise scaling relations. By training networks with different initializations, we show that maximal susceptibility is a more reliable predictor of learning than proximity to the critical point itself. This provides a blueprint for engineering improved network performance. Finally, using finite-size scaling we identify distinct universality classes, including Barkhausen noise and directed percolation. This theoretical framework demonstrates that universal features are shared by both biological and artificial neural networks.
zh
[AI-138] ConQuER: Modular Architectures for Control and Bias Mitigation in IQP Quantum Generative Models
【速读】:该论文旨在解决当前基于瞬时量子多项式(Instantaneous Quantum Polynomial, IQP)电路的量子生成模型存在的两大问题:一是生成输出缺乏可控性,二是对特定预期模式存在严重的生成偏差。解决方案的关键在于提出一种可控制的量子生成框架 ConQuER,其核心是采用模块化电路架构,嵌入一个轻量级控制器电路,该控制器可与预训练的 IQP 电路直接组合,在无需重新训练整个模型的前提下精确调控输出分布;同时,受控制器设计启发,进一步通过数据驱动优化将隐式控制路径嵌入 IQP 架构中,显著降低在结构化数据集上的生成偏差。ConQuER 在保持 IQP 模型经典可训练性和高可扩展性的基础上,实现了对如汉明权重分布等关键属性的精准控制,并以极低的参数和门资源开销提升了生成质量与可控性。
链接: https://arxiv.org/abs/2509.22551
作者: Xiaocheng Zou,Shijin Duan,Charles Fleming,Gaowen Liu,Ramana Rao Kompella,Shaolei Ren,Xiaolin Xu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Quantum generative models based on instantaneous quantum polynomial (IQP) circuits show great promise in learning complex distributions while maintaining classical trainability. However, current implementations suffer from two key limitations: lack of controllability over generated outputs and severe generation bias towards certain expected patterns. We present a Controllable Quantum Generative Framework, ConQuER, which addresses both challenges through a modular circuit architecture. ConQuER embeds a lightweight controller circuit that can be directly combined with pre-trained IQP circuits to precisely control the output distribution without full retraining. Leveraging the advantages of IQP, our scheme enables precise control over properties such as the Hamming Weight distribution with minimal parameter and gate overhead. In addition, inspired by the controller design, we extend this modular approach through data-driven optimization to embed implicit control paths in the underlying IQP architecture, significantly reducing generation bias on structured datasets. ConQuER retains efficient classical training properties and high scalability. We experimentally validate ConQuER on multiple quantum state datasets, demonstrating its superior control accuracy and balanced generation performance, with only very low overhead over the original IQP circuits. Our framework bridges the gap between the advantages of quantum computing and the practical needs of controllable generation modeling.
zh
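为帮助理解"预训练 IQP 电路 + 轻量控制器电路模块化组合"的思路,下面给出一个基于 Qiskit 的最小示意(电路结构、参数取值与控制器的具体形式均为假设性简化,仅说明组合方式,并非 ConQuER 的实际实现):

```python
# Minimal sketch (assumptions throughout): an IQP-style block followed by a small
# "controller" block composed onto it, illustrating only the modular-composition idea.
import numpy as np
from qiskit import QuantumCircuit

def iqp_block(n_qubits: int, rng: np.random.Generator) -> QuantumCircuit:
    """H-layer, diagonal phase layer (RZ + RZZ), H-layer: the standard IQP template."""
    qc = QuantumCircuit(n_qubits)
    qc.h(range(n_qubits))
    for i in range(n_qubits):
        qc.rz(rng.uniform(0, 2 * np.pi), i)          # single-qubit diagonal phases
        for j in range(i + 1, n_qubits):
            qc.rzz(rng.uniform(0, 2 * np.pi), i, j)  # two-qubit diagonal phases
    qc.h(range(n_qubits))
    return qc

def controller_block(n_qubits: int, thetas) -> QuantumCircuit:
    """Hypothetical lightweight controller: a single trainable RY layer."""
    qc = QuantumCircuit(n_qubits)
    for i, theta in enumerate(thetas):
        qc.ry(theta, i)
    return qc

rng = np.random.default_rng(0)
n = 4
pretrained = iqp_block(n, rng)                 # stands in for a pre-trained IQP circuit
controller = controller_block(n, rng.uniform(0, np.pi, size=n))
combined = pretrained.compose(controller)      # modular composition, no retraining of the IQP part
combined.measure_all()
print(combined.draw(output="text"))
```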
[AI-139] Forecasting the Future with Yesterday's Climate: Temperature Bias in AI Weather and Climate Models
链接: https://arxiv.org/abs/2509.22359
作者: Jacob B. Landsberg,Elizabeth A. Barnes
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures
[AI-140] EqDiff-CT: Equivariant Conditional Diffusion model for CT Image Synthesis from CBCT
链接: https://arxiv.org/abs/2509.21913
作者: Alzahra Altalib,Chunhui Li,Alessandro Perelli
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures, 3 tables, submitted to IEEE Transactions on Radiation and Plasma Medical Sciences
[AI-141] Beyond Structure: Invariant Crystal Property Prediction with Pseudo-Particle Ray Diffraction
链接: https://arxiv.org/abs/2509.21778
作者: Bin Cao,Yang Liu,Longhan Zhang,Yifan Wu,Zhixun Li,Yuyu Luo,Hong Cheng,Yang Ren,Tong-Yi Zhang
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:
[AI-142] Optimizing the non-Clifford-count in unitary synthesis using Reinforcement Learning
链接: https://arxiv.org/abs/2509.21709
作者: David Kremer,Ali Javadi-Abhari,Priyanka Mukhopadhyay
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
[AI-143] Enhanced Generative Machine Listener
【速读】:该论文旨在解决主观音频质量预测中现有客观指标(如PEAQ和ViSQOL)在跨内容类型和编码配置下相关性不足、预测可靠性差的问题。解决方案的关键在于提出GMLv2模型,其核心创新是引入基于Beta分布的损失函数以更精准地建模听者评分的分布特性,并融合额外的神经音频编码(Neural Audio Coding, NAC)主观数据集,从而提升模型的泛化能力与适用范围。实验表明,GMLv2在多个测试集上均显著优于主流指标,在预测MUSHRA分数时表现出更高的相关性和稳定性。
链接: https://arxiv.org/abs/2509.21463
作者: Vishnu Raj,Gouthaman KV,Shiv Gehlot,Lars Villemoes,Arijit Biswas
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present GMLv2, a reference-based model designed for the prediction of subjective audio quality as measured by MUSHRA scores. GMLv2 introduces a Beta distribution-based loss to model the listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to extend its generalization and applicability. Extensive evaluations on diverse test sets demonstrate that the proposed GMLv2 consistently outperforms widely used metrics, such as PEAQ and ViSQOL, both in terms of correlation with subjective scores and in reliably predicting these scores across diverse content types and codec configurations. Consequently, GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation, poised to accelerate research and development in modern audio coding technologies.
zh
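下面给出"以 Beta 分布建模听者评分、以负对数似然为损失"这一思路的最小 PyTorch 示意(网络结构与参数化方式均为假设,并非 GMLv2 的实际实现;MUSHRA 分数需先归一化到 (0,1) 区间):

```python
# Minimal sketch (assumptions): a head predicts Beta(alpha, beta) parameters per item,
# and the loss is the negative log-likelihood of normalized MUSHRA ratings under that Beta.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaHead(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 2)  # -> (alpha, beta) before positivity transform

    def forward(self, features: torch.Tensor) -> torch.distributions.Beta:
        raw = self.proj(features)
        alpha = F.softplus(raw[..., 0]) + 1e-3
        beta = F.softplus(raw[..., 1]) + 1e-3
        return torch.distributions.Beta(alpha, beta)

def beta_nll(dist: torch.distributions.Beta, mushra: torch.Tensor) -> torch.Tensor:
    # Normalize MUSHRA scores from [0, 100] to the open interval (0, 1).
    target = (mushra / 100.0).clamp(1e-4, 1 - 1e-4)
    return -dist.log_prob(target).mean()

head = BetaHead(feat_dim=128)
feats = torch.randn(8, 128)  # stand-in for reference/degraded audio-pair embeddings
scores = torch.tensor([80., 65., 92., 40., 55., 70., 88., 30.])
loss = beta_nll(head(feats), scores)
loss.backward()
print(float(loss))
```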
[AI-144] Foundation models for high-energy physics
链接: https://arxiv.org/abs/2509.21434
作者: Anna Hallin
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
备注: To be submitted to SciPost Physics Proceedings (EuCAIFCon 2025)
[AI-145] PhenoMoler: Phenotype-Guided Molecular Optimization via Chemistry Large Language Model
链接: https://arxiv.org/abs/2509.21424
作者: Ran Song,Hui Liu
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:
[AI-146] Near-Optimal Experiment Design in Linear non-Gaussian Cyclic Models
链接: https://arxiv.org/abs/2509.21423
作者: Ehsan Sharifian,Saber Salehkaleybar,Negar Kiyavash
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-147] Toward a Realistic Encoding Model of Auditory Affective Understanding in the Brain
链接: https://arxiv.org/abs/2509.21381
作者: Guandong Pan,Yaqian Yang,Shi Chen,Xin Wang,Longzhao Liu,Hongwei Zheng,Shaoting Tang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
[AI-148] Seismic Velocity Inversion from Multi-Source Shot Gathers Using Deep Segmentation Networks: Benchmarking U-Net Variants and SeismoLabV3
链接: https://arxiv.org/abs/2509.21331
作者: Mahedi Hasan
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-149] Assessment of deep learning models integrated with weather and environmental variables for wildfire spread prediction and a case study of the 2023 Maui fires
【速读】:该论文旨在解决现有 wildfire spread 预测模型中对深度学习(deep learning)模型性能理解不足的问题,以及如何将基于 AI 的模型与传统非 AI 模型(如 FARSITE)进行有效比较。其解决方案的关键在于:首先,通过在夏威夷十年的野火数据上评估五种典型深度学习模型(包括 ConvLSTM 及其带注意力机制的变体),识别出表现最优的模型;其次,以 2023 年毛伊岛火灾为案例,系统对比最佳 AI 模型与 FARSITE 模型的预测精度、召回率和 F1 分数;最后,结合可解释人工智能(Explainable AI, XAI)方法,量化关键天气与环境因子对火灾蔓延的影响,从而揭示 AI 模型在灵活性与可解释性方面的优势。
链接: https://arxiv.org/abs/2509.21327
作者: Jiyeon Kim,Yingjie Hu,Negar Elhami-Khorasani,Kai Sun,Ryan Zhenqi Zhou
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Predicting the spread of wildfires is essential for effective fire management and risk assessment. With the fast advancements of artificial intelligence (AI), various deep learning models have been developed and utilized for wildfire spread prediction. However, there is limited understanding of the advantages and limitations of these models, and it is also unclear how deep learning-based fire spread models can be compared with existing non-AI fire models. In this work, we assess the ability of five typical deep learning models integrated with weather and environmental variables for wildfire spread prediction based on over ten years of wildfire data in the state of Hawaii. We further use the 2023 Maui fires as a case study to compare the best deep learning models with a widely-used fire spread model, FARSITE. The results show that two deep learning models, i.e., ConvLSTM and ConvLSTM with attention, perform the best among the five tested AI models. FARSITE shows higher precision, lower recall, and higher F1-score than the best AI models, while the AI models offer higher flexibility for the input data. By integrating AI models with an explainable AI method, we further identify important weather and environmental factors associated with the 2023 Maui wildfires.
zh
机器学习
[LG-0] Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives NEURIPS2025
链接: https://arxiv.org/abs/2509.22596
作者: Qixin Zhang,Yan Sun,Can Jin,Xikun Zhang,Yao Shu,Puning Zhao,Li Shen,Dacheng Tao
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:In this paper, we present two effective policy learning algorithms for the multi-agent online coordination (MA-OC) problem. The first one, MA-SPL, not only can achieve the optimal (1-\frac{c}{e})-approximation guarantee for the MA-OC problem with submodular objectives but also can handle the unexplored \alpha-weakly DR-submodular and (\gamma,\beta)-weakly submodular scenarios, where c is the curvature of the investigated submodular functions, \alpha denotes the diminishing-return (DR) ratio and the tuple (\gamma,\beta) represents the submodularity ratios. Subsequently, in order to reduce the reliance on the unknown parameters \alpha,\gamma,\beta inherent in the MA-SPL algorithm, we further introduce a second online algorithm named MA-MPL. This MA-MPL algorithm is entirely parameter-free and simultaneously can maintain the same approximation ratio as the first MA-SPL algorithm. The core of our MA-SPL and MA-MPL algorithms is a novel continuous-relaxation technique termed as policy-based continuous extension. Compared with the well-established multi-linear extension, a notable advantage of this new policy-based continuous extension is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle the challenging weakly submodular objectives. Finally, extensive simulations are conducted to validate the effectiveness of our proposed algorithms.
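作为背景,下面给出摘要中作为对比提到的经典多线性扩张(multi-linear extension)以及带曲率 c 的 (1-c/e) 近似保证的标准写法(通用定义,并非论文新提出的 policy-based continuous extension):

```latex
% Standard multi-linear extension of a set function f: 2^V -> R (background, not the paper's new extension)
\begin{equation}
  F(\mathbf{x}) \;=\; \mathbb{E}_{S \sim \mathbf{x}}\big[f(S)\big]
  \;=\; \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j),
  \qquad \mathbf{x} \in [0,1]^{V}.
\end{equation}
% For a monotone submodular objective with curvature c, the classical guarantee takes the form
\begin{equation}
  f(S_{\mathrm{alg}}) \;\ge\; \Big(1 - \frac{c}{e}\Big)\, f(S^{\star}).
\end{equation}
```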
[LG-1] Transport Based Mean Flows for Generative Modeling
链接: https://arxiv.org/abs/2509.22592
作者: Elaheh Akbari,Ping He,Ahmadreza Moradipari,Yikun Bai,Soheil Kolouri
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Flow-matching generative models have emerged as a powerful paradigm for continuous data generation, achieving state-of-the-art results across domains such as images, 3D shapes, and point clouds. Despite their success, these models suffer from slow inference due to the requirement of numerous sequential sampling steps. Recent work has sought to accelerate inference by reducing the number of sampling steps. In particular, Mean Flows offer a one-step generation approach that delivers substantial speedups while retaining strong generative performance. Yet, in many continuous domains, Mean Flows fail to faithfully approximate the behavior of the original multi-step flow-matching process. In this work, we address this limitation by incorporating optimal transport-based sampling strategies into the Mean Flow framework, enabling one-step generators that better preserve the fidelity and diversity of the original multi-step flow process. Experiments on controlled low-dimensional settings and on high-dimensional tasks such as image generation, image-to-image translation, and point cloud generation demonstrate that our approach achieves superior inference accuracy in one-step generative modeling.
[LG-2] The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?
链接: https://arxiv.org/abs/2509.22580
作者: Guannan Lai,Da-Wei Zhou,Xin Yang,Han-Jia Ye
类目: Machine Learning (cs.LG)
*备注:
[LG-3] Machine learning approaches to seismic event classification in the Ostrava region
链接: https://arxiv.org/abs/2509.22574
作者: Marek Pecha,Michael Skotnica,Jana Rušajová,Bohdan Rieznikov,Vít Wandrol,Markéta Rösnerová,Jaromír Knejzlík
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures
[LG-4] Nearly Tight Regret Bounds for Profit Maximization in Bilateral Trade
链接: https://arxiv.org/abs/2509.22563
作者: Simone Di Gregorio,Paul Dütting,Federico Fusco,Chris Schwiegelshohn
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Accept at FOCS '25
点击查看摘要
Abstract:Bilateral trade models the task of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. We study this problem from the perspective of a broker, in a regret minimization framework. At each time step, a new seller and buyer arrive, and the broker has to propose a mechanism that is incentive-compatible and individually rational, with the goal of maximizing profit. We propose a learning algorithm that guarantees a nearly tight \tilde{O}(\sqrt{T}) regret in the stochastic setting when seller and buyer valuations are drawn i.i.d. from a fixed and possibly correlated unknown distribution. We further show that it is impossible to achieve sublinear regret in the non-stationary scenario where valuations are generated upfront by an adversary. Our ambitious benchmark for these results is the best incentive-compatible and individually rational mechanism. This separates us from previous works on efficiency maximization in bilateral trade, where the benchmark is a single number: the best fixed price in hindsight. A particular challenge we face is that uniform convergence for all mechanisms' profits is impossible. We overcome this difficulty via a careful chaining analysis that proves convergence for a provably near-optimal mechanism at (essentially) optimal rate. We further showcase the broader applicability of our techniques by providing nearly optimal results for the joint ads problem.
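下面给出以"最优 IC+IR 机制"为基准的利润 regret 的一种示意写法(记号为假设,仅帮助理解该基准与 \tilde{O}(\sqrt{T}) 界的含义):

```latex
% Profit regret against the best incentive-compatible, individually rational (IC+IR) mechanism
% (illustrative notation; (s_t, b_t) are the seller's and buyer's valuations at round t,
%  and M_t is the mechanism posted by the broker)
\begin{equation}
  R_T \;=\; \max_{M \in \mathcal{M}_{\mathrm{IC,IR}}}
      \sum_{t=1}^{T} \mathbb{E}\big[\mathrm{profit}(M; s_t, b_t)\big]
  \;-\; \sum_{t=1}^{T} \mathbb{E}\big[\mathrm{profit}(M_t; s_t, b_t)\big].
\end{equation}
% The paper's algorithm guarantees R_T = \tilde{O}(\sqrt{T}) in the i.i.d. (stochastic) setting.
```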
[LG-5] Learning to Price Bundles: A GCN Approach for Mixed Bundling
链接: https://arxiv.org/abs/2509.22557
作者: Liangyu Ding,Chenghan Wu,Guokai Li,Zizhuo Wang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-6] ECHO: Toward Contextual Seq2Seq Paradigms in Large EEG Models
链接: https://arxiv.org/abs/2509.22556
作者: Chenyu Liu,Yuqiu Deng,Tianyu Liu,Jinan Zhou,Xinliang Zhou,Ziyu Jia,Yi Ding
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Electroencephalography (EEG), with its broad range of applications, necessitates models that can generalize effectively across various tasks and datasets. Large EEG Models (LEMs) address this by pretraining encoder-centric architectures on large-scale unlabeled data to extract universal representations. While effective, these models lack decoders of comparable capacity, limiting the full utilization of the learned features. To address this issue, we introduce ECHO, a novel decoder-centric LEM paradigm that reformulates EEG modeling as sequence-to-sequence learning. ECHO captures layered relationships among signals, labels, and tasks within sequence space, while incorporating discrete support samples to construct contextual cues. This design equips ECHO with in-context learning, enabling dynamic adaptation to heterogeneous tasks without parameter updates. Extensive experiments across multiple datasets demonstrate that, even with basic model components, ECHO consistently outperforms state-of-the-art single-task LEMs in multi-task settings, showing superior generalization and adaptability.
[LG-7] Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise
链接: https://arxiv.org/abs/2509.22500
作者: Juan Ramirez,Simon Lacoste-Julien
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.
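为便于理解摘要中的等价性结论,下面给出约束问题的标准拉格朗日函数、增广拉格朗日函数与对偶乐观上升(PI 控制)更新的常见形式(为简洁起见按等式约束书写,属通用写法,具体参数化可能与论文不同):

```latex
% Constrained problem and its (augmented) Lagrangian, shown for equality constraints for simplicity
\begin{align}
  &\min_{x} \ f(x) \quad \text{s.t.} \quad g(x) = 0, \\
  &L(x,\lambda) \;=\; f(x) + \lambda^{\top} g(x), \qquad
   L_{\rho}(x,\lambda) \;=\; f(x) + \lambda^{\top} g(x) + \tfrac{\rho}{2}\,\| g(x) \|^{2}.
\end{align}
% One common "dual optimistic ascent" (PI-style) update on the standard Lagrangian
% (a generic form; the paper's exact parameterization may differ):
\begin{equation}
  \lambda_{t+1} \;=\; \lambda_t + \eta\, g(x_t) + \kappa\,\big( g(x_t) - g(x_{t-1}) \big).
\end{equation}
```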
[LG-8] Bayesian Transfer Operators in Reproducing Kernel Hilbert Spaces
链接: https://arxiv.org/abs/2509.22482
作者: Septimus Boshoff,Sebastian Peitz,Stefan Klus
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
点击查看摘要
Abstract:The Koopman operator, as a linear representation of a nonlinear dynamical system, has been attracting attention in many fields of science. Recently, Koopman operator theory has been combined with another concept that is popular in data science: reproducing kernel Hilbert spaces. We follow this thread into Gaussian process methods, and illustrate how these methods can alleviate two pervasive problems with kernel-based Koopman algorithms. The first being sparsity: most kernel methods do not scale well and require an approximation to become practical. We show that not only can the computational demands be reduced, but also demonstrate improved resilience against sensor noise. The second problem involves hyperparameter optimization and dictionary learning to adapt the model to the dynamical system. In summary, the main contribution of this work is the unification of Gaussian process regression and dynamic mode decomposition.
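作为背景,下面给出基于字典的 Koopman 算子近似(EDMD 风格)的最小 numpy 示意(字典与玩具数据均为假设,仅说明"在特征空间中用线性算子近似非线性动力学"这一思路,并非该文的高斯过程方法):

```python
# Minimal EDMD-style sketch (assumptions): approximate the Koopman operator on a
# small polynomial dictionary from snapshot pairs (x_t, x_{t+1}).
import numpy as np

def dictionary(x: np.ndarray) -> np.ndarray:
    """Hypothetical dictionary: [1, x, x^2] applied elementwise, stacked per sample."""
    return np.hstack([np.ones((x.shape[0], 1)), x, x ** 2])

# Snapshot data from a toy nonlinear map x_{t+1} = 0.9 x_t - 0.1 x_t^3 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
Y = 0.9 * X - 0.1 * X ** 3 + 0.01 * rng.standard_normal(X.shape)

Psi_X, Psi_Y = dictionary(X), dictionary(Y)
# Least-squares Koopman matrix K such that Psi(x_{t+1}) ≈ Psi(x_t) @ K
K, *_ = np.linalg.lstsq(Psi_X, Psi_Y, rcond=None)

# Eigenvalues of the finite-dimensional Koopman approximation
eigvals = np.linalg.eigvals(K.T)
print(np.sort(np.abs(eigvals))[::-1])
```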
[LG-9] Nonlinear Optimization with GPU-Accelerated Neural Network Constraints
链接: https://arxiv.org/abs/2509.22462
作者: Robert Parker,Oscar Dowson,Nicole LoGiudice,Manuel Garcia,Russell Bent
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a reduced-space formulation for optimizing over trained neural networks where the network’s outputs and derivatives are evaluated on a GPU. To do this, we treat the neural network as a “gray box” where intermediate variables and constraints are not exposed to the optimization solver. Compared to the full-space formulation, in which intermediate variables and constraints are exposed to the optimization solver, the reduced-space formulation leads to faster solves and fewer iterations in an interior point method. We demonstrate the benefits of this method on two optimization problems: Adversarial generation for a classifier trained on MNIST images and security-constrained optimal power flow with transient feasibility enforced using a neural network surrogate.
[LG-10] Overclocking Electrostatic Generative Models
链接: https://arxiv.org/abs/2509.22454
作者: Daniil Shlenskii,Alexander Korotin
类目: Machine Learning (cs.LG)
*备注:
[LG-11] he Flood Complex: Large-Scale Persistent Homology on Millions of Points
链接: https://arxiv.org/abs/2509.22432
作者: Florian Graf,Paolo Pellizzoni,Martin Uray,Stefan Huber,Roland Kwitt
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG)
*备注:
点击查看摘要
Abstract:We consider the problem of computing persistent homology (PH) for large-scale Euclidean point cloud data, aimed at downstream machine learning tasks, where the exponential growth of the most widely-used Vietoris-Rips complex imposes serious computational limitations. Although more scalable alternatives such as the Alpha complex or sparse Rips approximations exist, they often still result in a prohibitively large number of simplices. This poses challenges in the complex construction and in the subsequent PH computation, prohibiting their use on large-scale point clouds. To mitigate these issues, we introduce the Flood complex, inspired by the advantages of the Alpha and Witness complex constructions. Informally, at a given filtration value r\geq 0 , the Flood complex contains all simplices from a Delaunay triangulation of a small subset of the point cloud X that are fully covered by balls of radius r emanating from X , a process we call flooding. Our construction allows for efficient PH computation, possesses several desirable theoretical properties, and is amenable to GPU parallelization. Scaling experiments on 3D point cloud data show that we can compute PH of up to dimension 2 on several millions of points. Importantly, when evaluating object classification performance on real-world and synthetic data, we provide evidence that this scaling capability is needed, especially if objects are geometrically or topologically complex, yielding performance superior to other PH-based methods and neural networks for point cloud data.
[LG-12] Learning from Delayed Feedback in Games via Extra Prediction
链接: https://arxiv.org/abs/2509.22426
作者: Yuma Fujimoto,Kenshi Abe,Kaito Ariu
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
*备注: 11 pages, 3 figures (main); 9 pages (appendix)
点击查看摘要
Abstract:This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted n times. We further capture an intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers the desirable guarantees: the regret is constant (O(1)-regret) in general-sum normal-form games, and the strategies converge to the Nash equilibrium as a subsequence (best-iterate convergence) in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
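下面以标准 OFTRL 更新式为基础,给出摘要中"将预测向量加权 n 倍"的 WOFTRL 思路的示意写法(记号为通用约定,非论文原文):

```latex
% Optimistic FTRL (standard form) and its weighted variant sketched from the abstract
% (generic notation; u_s is the observed utility vector, m_{t+1} the prediction of the next one)
\begin{align}
  x_{t+1} &= \arg\max_{x \in \mathcal{X}}
      \Big\langle x,\; \sum_{s=1}^{t} u_s + m_{t+1} \Big\rangle - \frac{1}{\eta} R(x)
      && \text{(OFTRL)} \\
  x_{t+1} &= \arg\max_{x \in \mathcal{X}}
      \Big\langle x,\; \sum_{s=1}^{t} u_s + n\, m_{t+1} \Big\rangle - \frac{1}{\eta} R(x)
      && \text{(WOFTRL, prediction weighted $n$ times)}
\end{align}
% Under a feedback delay of d steps, the abstract's result corresponds to choosing the optimistic weight n > d.
```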
[LG-13] One Prompt Fits All: Universal Graph Adaptation for Pretrained Models NEURIPS2025
链接: https://arxiv.org/abs/2509.22416
作者: Yongqi Huang,Jitao Zhao,Dongxiao He,Xiaobao Wang,Yawen Li,Yuxiao Huang,Di Jin,Zhiyong Feng
类目: Machine Learning (cs.LG)
*备注: accepted by NeurIPS 2025 main conference
点击查看摘要
Abstract:Graph Prompt Learning (GPL) has emerged as a promising paradigm that bridges graph pretraining models and downstream scenarios, mitigating label dependency and the misalignment between upstream pretraining and downstream tasks. Although existing GPL studies explore various prompt strategies, their effectiveness and underlying principles remain unclear. We identify two critical limitations: (1) Lack of consensus on underlying mechanisms: Although current GPL methods have advanced the field, there is no consensus on how prompts interact with pretrained models, as different strategies intervene at varying spaces within the model, i.e., input-level, layer-wise, and representation-level prompts. (2) Limited scenario adaptability: Most methods fail to generalize across diverse downstream scenarios, especially under data distribution shifts (e.g., homophilic-to-heterophilic graphs). To address these issues, we theoretically analyze existing GPL approaches and reveal that representation-level prompts essentially function as fine-tuning a simple downstream classifier, proposing that graph prompt learning should focus on unleashing the capability of pretrained models, and the classifier adapts to downstream scenarios. Based on our findings, we propose UniPrompt, a novel GPL method that adapts any pretrained model, unleashing the capability of pretrained models while preserving the structure of the input graph. Extensive experiments demonstrate that our method can effectively integrate with various pretrained models and achieve strong performance across in-domain and cross-domain scenarios.
[LG-14] Fast-Forward Lattice Boltzmann: Learning Kinetic Behaviour with Physics-Informed Neural Operators
链接: https://arxiv.org/abs/2509.22411
作者: Xiao Xue,Marco F.P. ten Eikelder,Mingyang Gao,Xiaoyuan Cheng,Yiming Yang,Yi He,Shuo Wang,Sibo Cheng,Yukun Hu,Peter V. Coveney
类目: Machine Learning (cs.LG); Cellular Automata and Lattice Gases (nlin.CG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:
[LG-15] NeuroScalar: A Deep Learning Framework for Fast, Accurate, and In-the-Wild Cycle-Level Performance Prediction
链接: https://arxiv.org/abs/2509.22410
作者: Shayne Wadle,Yanxin Zhang,Vikas Singh,Karthikeyan Sankaralingam
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
[LG-16] MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning
链接: https://arxiv.org/abs/2509.22403
作者: Fanjin Meng,Yuan Yuan,Jingtao Ding,Jie Feng,Chonghua Han,Yong Li
类目: Machine Learning (cs.LG)
*备注:
[LG-17] ReLAM: Learning Anticipation Model for Rewarding Visual Robotic Manipulation
链接: https://arxiv.org/abs/2509.22402
作者: Nan Tang,Jing-Cheng Pang,Guanlin Li,Chao Qian,Yang Yu
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Reward design remains a critical bottleneck in visual reinforcement learning (RL) for robotic manipulation. In simulated environments, rewards are conventionally designed based on the distance to a target position. However, such precise positional information is often unavailable in real-world visual settings due to sensory and perceptual limitations. In this study, we propose a method that implicitly infers spatial distances through keypoints extracted from images. Building on this, we introduce Reward Learning with Anticipation Model (ReLAM), a novel framework that automatically generates dense, structured rewards from action-free video demonstrations. ReLAM first learns an anticipation model that serves as a planner and proposes intermediate keypoint-based subgoals on the optimal path to the final goal, creating a structured learning curriculum directly aligned with the task’s geometric objectives. Based on the anticipated subgoals, a continuous reward signal is provided to train a low-level, goal-conditioned policy under the hierarchical reinforcement learning (HRL) framework with provable sub-optimality bound. Extensive experiments on complex, long-horizon manipulation tasks show that ReLAM significantly accelerates learning and achieves superior performance compared to state-of-the-art methods.
[LG-18] Improving accuracy in short mortality rate series: Exploring Multi-step Forecasting Approaches in Hybrid Systems
链接: https://arxiv.org/abs/2509.22395
作者: Filipe C. L. Duarte,Paulo S. G. de Mattos Neto,Paulo R. A. Firmino
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The decline in interest rates and economic stabilization has heightened the importance of accurate mortality rate forecasting, particularly in insurance and pension markets. Multi-step-ahead predictions are crucial for public health, demographic planning, and insurance risk assessments; however, they face challenges when data are limited. Hybrid systems that combine statistical and Machine Learning (ML) models offer a promising solution for handling both linear and nonlinear patterns. This study evaluated the impact of different multi-step forecasting approaches (Recursive, Direct, and Multi-Input Multi-Output) and ML models on the accuracy of hybrid systems. Results from 12 datasets and 21 models show that the selection of both the multi-step approach and the ML model is essential for improving performance, with the ARIMA-LSTM hybrid using a recursive approach outperforming other models in most cases.
[LG-19] (Sometimes) Less is More: Mitigating the Complexity of Rule-based Representation for Interpretable Classification IJCNN2025
链接: https://arxiv.org/abs/2509.22384
作者: Luca Bergamin,Roberto Confalonieri,Fabio Aiolli
类目: Machine Learning (cs.LG)
*备注: Presented at IJCNN 2025
[LG-20] Enhancing Credit Risk Prediction: A Meta-Learning Framework Integrating Baseline Models, LASSO, and ECOC for Superior Accuracy
链接: https://arxiv.org/abs/2509.22381
作者: Haibo Wang,Lutfu S. Sua,Jun Huang,Figen Balo,Burak Dolar
类目: Machine Learning (cs.LG)
*备注: 36 pages
[LG-21] Role-Aware Multi-modal federated learning system for detecting phishing webpages
链接: https://arxiv.org/abs/2509.22369
作者: Bo Wang,Imran Khan,Martin White,Natalia Beloff
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 22 pages, 9 figures
[LG-22] Investigating Faithfulness in Large Audio Language Models
链接: https://arxiv.org/abs/2509.22363
作者: Lovenya Jain,Pooneh Mousavi,Mirco Ravanelli,Cem Subakan
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. Across these interventions, datasets, and tasks, our experiments suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
[LG-23] Neural Feature Geometry Evolves as Discrete Ricci Flow
链接: https://arxiv.org/abs/2509.22362
作者: Moritz Hehl,Max von Renesse,Melanie Weber
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Differential Geometry (math.DG)
*备注: 38 pages, 14 figures
点击查看摘要
Abstract:Deep neural networks learn feature representations via complex geometric transformations of the input data manifold. Despite the models’ empirical success across domains, our understanding of neural feature representations is still incomplete. In this work we investigate neural feature geometry through the lens of discrete geometry. Since the input data manifold is typically unobserved, we approximate it using geometric graphs that encode local similarity structure. We provide theoretical results on the evolution of these graphs during training, showing that nonlinear activations play a crucial role in shaping feature geometry in feedforward neural networks. Moreover, we discover that the geometric transformations resemble a discrete Ricci flow on these graphs, suggesting that neural feature geometry evolves analogous to Ricci flow. This connection is supported by experiments on over 20,000 feedforward neural networks trained on binary classification tasks across both synthetic and real-world datasets. We observe that the emergence of class separability corresponds to the emergence of community structure in the associated graph representations, which is known to relate to discrete Ricci flow dynamics. Building on these insights, we introduce a novel framework for locally evaluating geometric transformations through comparison with discrete Ricci flow dynamics. Our results suggest practical design principles, including a geometry-informed early-stopping heuristic and a criterion for selecting network depth.
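作为背景,下面给出离散(Ollivier-)Ricci 流在图边权上的一种常见更新形式(通用写法,并非该文对特征几何演化的具体定义):

```latex
% A common discrete (Ollivier-)Ricci flow update on graph edge weights
% (generic background form; kappa(e) denotes the discrete Ricci curvature of edge e)
\begin{equation}
  w_e^{(k+1)} \;=\; \big( 1 - \varepsilon\, \kappa^{(k)}(e) \big)\, w_e^{(k)}.
\end{equation}
% Positively curved edges (typically within communities) shrink while negatively curved edges
% (typically between communities) stretch, which is how community structure emerges under the flow.
```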
[LG-24] Distributed Associative Memory via Online Convex Optimization
链接: https://arxiv.org/abs/2509.22321
作者: Bowen Wang,Matteo Zecchin,Osvaldo Simeone
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:An associative memory (AM) enables cue-response recall, and associative memorization has recently been noted to underlie the operation of modern neural architectures such as Transformers. This work addresses a distributed setting where agents maintain a local AM to recall their own associations as well as selective information from others. Specifically, we introduce a distributed online gradient descent method that optimizes local AMs at different agents through communication over routing trees. Our theoretical analysis establishes sublinear regret guarantees, and experiments demonstrate that the proposed protocol consistently outperforms existing online optimization baselines.
[LG-25] SoDaDE: Solvent Data-Driven Embeddings with Small Transformer Models NEURIPS2025
链接: https://arxiv.org/abs/2509.22302
作者: Gabriel Kitso Gibberd,Jose Pablo Folch,Antonio Del Rio Chanona
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, 3 tables, to be presented as a poster at the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences
点击查看摘要
Abstract:Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. While representations were initially hand-designed, machine learning has shown that meaningful representations can be learnt from data. Chemical datasets are limited and so the representations learnt from data are generic, being trained on broad datasets which contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. However, the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme by developing Solvent Data Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and solvent property dataset to create a fingerprint for solvents. To showcase their effectiveness, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. We demonstrate through this paper that data-driven fingerprints can be made with small datasets, and we set up a workflow that can be explored for other applications.
[LG-26] Aurora: Towards Universal Generative Multimodal Time Series Forecasting
链接: https://arxiv.org/abs/2509.22295
作者: Xingjian Wu,Jianxin Jin,Wanghui Qiu,Peng Chen,Yang Shu,Bin Yang,Chenjuan Guo
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to the domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like texts, the former lacks the explicit utilization of them, thus hindering the performance. The latter is tailored for end-to-end scenarios and does not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on a Cross-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corresponding text or image modalities, thus possessing strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilizes a Modality-Guided Multi-head Self-Attention to inject them into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora on both unimodal and multimodal scenarios.
[LG-27] A Multi-Level Framework for Multi-Objective Hypergraph Partitioning: Combining Minimum Spanning Tree and Proximal Gradient
链接: https://arxiv.org/abs/2509.22294
作者: Yingying Li,Mingxuan Xie,Hailong You,Yongqiang Yao,Hongwei Liu
类目: Machine Learning (cs.LG); Combinatorics (math.CO)
*备注:
点击查看摘要
Abstract:This paper proposes an efficient hypergraph partitioning framework based on a novel multi-objective non-convex constrained relaxation model. A modified accelerated proximal gradient algorithm is employed to generate diverse k-dimensional vertex features to avoid local optima and enhance partition quality. Two MST-based strategies are designed for different data scales: for small-scale data, the Prim algorithm constructs a minimum spanning tree followed by pruning and clustering; for large-scale data, a subset of representative nodes is selected to build a smaller MST, while the remaining nodes are assigned accordingly to reduce complexity. To further improve partitioning results, refinement strategies including greedy migration, swapping, and recursive MST-based clustering are introduced for partitions. Experimental results on public benchmark sets demonstrate that the proposed algorithm achieves reductions in cut size of approximately 2%–5% on average compared to KaHyPar in 2, 3, and 4-way partitioning, with improvements of up to 35% on specific instances. Particularly on weighted vertex sets, our algorithm outperforms state-of-the-art partitioners including KaHyPar, hMetis, Mt-KaHyPar, and K-SpecPart, highlighting its superior partitioning quality and competitiveness. Furthermore, the proposed refinement strategy improves hMetis partitions by up to 16%. A comprehensive evaluation based on virtual instance methodology and parameter sensitivity analysis validates the algorithm's competitiveness and characterizes its performance trade-offs.
[LG-28] Conditional Denoising Diffusion Autoencoders for Wireless Semantic Communications
链接: https://arxiv.org/abs/2509.22282
作者: Mehdi Letafati,Samad Ali,Matti Latva-aho
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Semantic communication (SemCom) systems aim to learn the mapping from low-dimensional semantics to high-dimensional ground-truth. While this is more akin to a "domain translation" problem, existing frameworks typically emphasize channel-adaptive neural encoding-decoding schemes, lacking full exploration of the signal distribution. Moreover, such methods so far have employed autoencoder-based architectures, where the encoding is tightly coupled to a matched decoder, causing scalability issues in practice. To address these gaps, diffusion autoencoder models are proposed for wireless SemCom. The goal is to learn a "semantic-to-clean" mapping, from the semantic space to the ground-truth probability distribution. A neural encoder at the semantic transmitter extracts the high-level semantics, and a conditional diffusion model (CDiff) at the semantic receiver exploits the source distribution for signal-space denoising, while the received semantic latents are incorporated as the conditioning input to "steer" the decoding process towards the semantics intended by the transmitter. It is analytically proved that the proposed decoder model is a consistent estimator of the ground-truth data. Furthermore, extensive simulations over CIFAR-10 and MNIST datasets are provided along with design insights, highlighting the performance compared to legacy autoencoders and variational autoencoders (VAE). Simulations are further extended to the multi-user SemCom, identifying the dominating factors in a more realistic setup.
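为说明"以接收到的语义潜变量为条件在信号空间去噪"的思路,下面给出条件 DDPM 反向去噪步骤的标准形式(通用写法,z 表示接收到的语义潜变量,并非论文的具体参数化):

```latex
% Standard conditional DDPM reverse step, with the received semantic latent z as conditioning
% (generic form; alpha_t, bar-alpha_t, sigma_t follow the usual DDPM noise schedule)
\begin{equation}
  x_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}
    \Big( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,
          \epsilon_\theta(x_t, t, z) \Big) + \sigma_t\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I).
\end{equation}
```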
[LG-29] Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics
链接: https://arxiv.org/abs/2509.22279
作者: Xingjian Wu,Zhengyu Li,Hanyin Cheng,Xiangfei Qiu,Jilin Hu,Chenjuan Guo,Bin Yang
类目: Machine Learning (cs.LG)
*备注:
[LG-30] Fine-Grained Uncertainty Decomposition in Large Language Models: A Spectral Approach
链接: https://arxiv.org/abs/2509.22272
作者: Nassim Walha,Sebastian G. Gruber,Thomas Decker,Yinchong Yang,Alireza Javanmardi,Eyke Hüllermeier,Florian Buettner
类目: Machine Learning (cs.LG)
*备注:
[LG-31] Towards a more realistic evaluation of machine learning models for bearing fault diagnosis
链接: https://arxiv.org/abs/2509.22267
作者: João Paulo Vieira,Victor Afonso Bauler,Rodrigo Kobashikawa Rosa,Danilo Silva
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Submitted to Mechanical Systems and Signal Processing
[LG-32] Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning
链接: https://arxiv.org/abs/2509.22263
作者: Nakyeong Yang,Dong-Kyum Kim,Jea Kwon,Minsung Kim,Kyomin Jung,Meeyoung Cha
类目: Machine Learning (cs.LG)
*备注: 15 pages
点击查看摘要
Abstract:Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to “relearning” during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
[LG-33] A Law of Data Reconstruction for Random Features (and Beyond)
链接: https://arxiv.org/abs/2509.22214
作者: Leonardo Iurada,Simone Bombari,Tatiana Tommasi,Marco Mondelli
类目: Machine Learning (cs.LG)
*备注:
[LG-34] Kernel Regression of Multi-Way Data via Tensor Trains with Hadamard Overparametrization: The Dynamic Graph Flow Case
链接: https://arxiv.org/abs/2509.22197
作者: Duc Thien Nguyen,Konstantinos Slavakis,Eleftherios Kofidis,Dimitris Pados
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:A regression-based framework for interpretable multi-way data imputation, termed Kernel Regression via Tensor Trains with Hadamard overparametrization (KReTTaH), is introduced. KReTTaH adopts a nonparametric formulation by casting imputation as regression via reproducing kernel Hilbert spaces. Parameter efficiency is achieved through tensors of fixed tensor-train (TT) rank, which reside on low-dimensional Riemannian manifolds, and is further enhanced via Hadamard overparametrization, which promotes sparsity within the TT parameter space. Learning is accomplished by solving a smooth inverse problem posed on the Riemannian manifold of fixed TT-rank tensors. As a representative application, the estimation of dynamic graph flows is considered. In this setting, KReTTaH exhibits flexibility by seamlessly incorporating graph-based (topological) priors via its inverse problem formulation. Numerical tests on real-world graph datasets demonstrate that KReTTaH consistently outperforms state-of-the-art alternatives, including a nonparametric tensor-based method and a neural-network-based method, for imputing missing, time-varying edge flows.
[LG-35] Mechanistic Independence: A Principle for Identifiable Disentangled Representations
链接: https://arxiv.org/abs/2509.22196
作者: Stefan Matthes,Zhiwei Han,Hao Shen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-36] Slicing Wasserstein Over Wasserstein Via Functional Optimal Transport
链接: https://arxiv.org/abs/2509.22138
作者: Moritz Piening,Robert Beinert
类目: Machine Learning (cs.LG); Metric Geometry (math.MG); Optimization and Control (math.OC)
*备注:
[LG-37] Mind the Missing: Variable-Aware Representation Learning for Irregular EHR Time Series using Large Language Models
链接: https://arxiv.org/abs/2509.22121
作者: Jeong Eul Kwon,Joo Heung Yoon,Hyo Kyung Lee
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Irregular sampling and high missingness are intrinsic challenges in modeling time series derived from electronic health records (EHRs), where clinical variables are measured at uneven intervals depending on workflow and intervention timing. To address this, we propose VITAL, a variable-aware, large language model (LLM) based framework tailored for learning from irregularly sampled physiological time series. VITAL differentiates between two distinct types of clinical variables: vital signs, which are frequently recorded and exhibit temporal patterns, and laboratory tests, which are measured sporadically and lack temporal structure. It reprograms vital signs into the language space, enabling the LLM to capture temporal context and reason over missing values through explicit encoding. In contrast, laboratory variables are embedded either using representative summary values or a learnable [Not measured] token, depending on their availability. Extensive evaluations on benchmark datasets from PhysioNet demonstrate that VITAL outperforms state-of-the-art methods designed for irregular time series. Furthermore, it maintains robust performance under high levels of missingness, which is prevalent in real-world clinical scenarios where key variables are often unavailable.
[LG-38] Countering adversarial evasion in regression analysis
链接: https://arxiv.org/abs/2509.22113
作者: David Benfield,Phan Tu Vuong,Alain Zemkoho
类目: Machine Learning (cs.LG)
*备注:
[LG-39] Modeling Psychological Profiles in Volleyball via Mixed-Type Bayesian Networks
链接: https://arxiv.org/abs/2509.22111
作者: Maria Iannario,Dae-Jin Lee,Manuele Leonelli
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
[LG-40] SHAKE-GNN: Scalable Hierarchical Kirchhoff-Forest Graph Neural Network
链接: https://arxiv.org/abs/2509.22100
作者: Zhipu Cui,Johannes Lutzeyer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-41] Non-Linear Trajectory Modeling for Multi-Step Gradient Inversion Attacks in Federated Learning
链接: https://arxiv.org/abs/2509.22082
作者: Li Xia,Zheng Liu,Sili Huang,Wei Tang,Xuan Liu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
[LG-42] Towards Understanding Feature Learning in Parameter Transfer
链接: https://arxiv.org/abs/2509.22056
作者: Hua Yuan,Xuran Meng,Qiufeng Wang,Shiyu Xia,Ning Xu,Xu Yang,Jing Wang,Xin Geng,Yong Rui
类目: Machine Learning (cs.LG)
*备注:
[LG-43] BrainPro: Towards Large-scale Brain State-aware EEG Representation Learning
链接: https://arxiv.org/abs/2509.22050
作者: Yi Ding,Muyun Jiang,Weibang Jiang,Shuailei Zhang,Xinliang Zhou,Chenyu Liu,Shanglin Li,Yong Li,Cuntai Guan
类目: Machine Learning (cs.LG)
*备注: 26 pages, 9 figures
点击查看摘要
Abstract:Electroencephalography (EEG) is a non-invasive technique for recording brain electrical activity, widely used in brain-computer interface (BCI) and healthcare. Recent EEG foundation models trained on large-scale datasets have shown improved performance and generalizability over traditional decoding methods, yet significant challenges remain. Existing models often fail to explicitly capture channel-to-channel and region-to-region interactions, which are critical sources of information inherently encoded in EEG signals. Due to varying channel configurations across datasets, they either approximate spatial structure with self-attention or restrict training to a limited set of common channels, sacrificing flexibility and effectiveness. Moreover, although EEG datasets reflect diverse brain states such as emotion, motor, and others, current models rarely learn state-aware representations during self-supervised pre-training. To address these gaps, we propose BrainPro, a large EEG model that introduces a retrieval-based spatial learning block to flexibly capture channel- and region-level interactions across varying electrode layouts, and a brain state-decoupling block that enables state-aware representation learning through parallel encoders with decoupling and region-aware reconstruction losses. This design allows BrainPro to adapt seamlessly to diverse tasks and hardware settings. Pre-trained on an extensive EEG corpus, BrainPro achieves state-of-the-art performance and robust generalization across nine public BCI datasets. Our codes and the pre-trained weights will be released.
[LG-44] MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems
链接: https://arxiv.org/abs/2509.22047
作者: Yuki Ichihara,Yuu Jinnai,Tetsuro Morimura,Mitsuki Sakamoto,Ryota Mitsuhashi,Eiji Uchibe
类目: Machine Learning (cs.LG)
*备注:
[LG-45] Convexity-Driven Projection for Point Cloud Dimensionality Reduction
链接: https://arxiv.org/abs/2509.22043
作者: Suman Sanyal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We propose Convexity-Driven Projection (CDP), a boundary-free linear method for dimensionality reduction of point clouds that targets preserving detour-induced local non-convexity. CDP builds a k-NN graph, identifies admissible pairs whose Euclidean-to-shortest-path ratios are below a threshold, and aggregates their normalized directions to form a positive semidefinite non-convexity structure matrix. The projection uses the top-k eigenvectors of the structure matrix. We give two verifiable guarantees: a pairwise a-posteriori certificate that bounds the post-projection distortion for each admissible pair, and an average-case spectral bound that links expected captured direction energy to the spectrum of the structure matrix, yielding quantile statements for typical distortion. Our evaluation protocol reports fixed- and reselected-pairs detour errors and certificate quantiles, enabling practitioners to check guarantees on their data.
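下面给出 CDP 主要流程(构建 k-NN 图、按欧氏距离/最短路比值筛选 admissible pairs、构造结构矩阵并投影到前 k 个特征向量)的最小 numpy/scipy 示意(阈值与参数取值均为假设,仅为说明步骤顺序):

```python
# Minimal sketch of the CDP pipeline described in the abstract (parameter values are assumptions).
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def cdp_projection(X: np.ndarray, n_neighbors: int = 10, ratio_thresh: float = 0.7, k_out: int = 2):
    # 1) k-NN graph with Euclidean edge lengths, made symmetric.
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    G = G.maximum(G.T)
    # 2) Graph (detour) distances vs. straight-line distances.
    D_graph = shortest_path(G, directed=False)
    D_eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # 3) Admissible pairs: Euclidean-to-shortest-path ratio below the threshold (large detours).
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(D_graph > 0, D_eucl / D_graph, 1.0)
    ii, jj = np.where(np.triu(ratio < ratio_thresh, k=1) & np.isfinite(D_graph))
    # 4) Aggregate normalized pair directions into a PSD structure matrix.
    dirs = X[ii] - X[jj]
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True) + 1e-12
    M = dirs.T @ dirs
    # 5) Project onto the top-k eigenvectors of the structure matrix.
    eigvals, eigvecs = np.linalg.eigh(M)
    P = eigvecs[:, ::-1][:, :k_out]
    return X @ P

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
print(cdp_projection(X).shape)  # (300, 2)
```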
[LG-46] OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
链接: https://arxiv.org/abs/2509.22033
作者: Anton Korznikov,Andrey Galichin,Alexey Dontsov,Oleg Rogov,Elena Tutubalina,Ivan Oseledets
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.
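下面给出"惩罚 SAE 特征两两余弦相似度"这一正交正则项的最小 PyTorch 示意(SAE 结构与损失权重均为假设性简化;示例中的全量 Gram 矩阵计算是二次开销,实际方法应采用更省的实现以保持与 SAE 规模线性相关的开销):

```python
# Minimal sketch (assumptions): penalize high pairwise cosine similarity between SAE decoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySAE(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x):
        z = F.relu(self.enc(x))  # sparse-ish feature activations
        return self.dec(z), z

def orthogonality_penalty(decoder_weight: torch.Tensor) -> torch.Tensor:
    # decoder_weight: (d_model, n_features); columns are feature directions.
    W = F.normalize(decoder_weight, dim=0)             # unit-norm feature directions
    gram = W.T @ W                                     # pairwise cosine similarities
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return (off_diag ** 2).mean()                      # push off-diagonal similarities toward zero

sae = TinySAE(d_model=64, n_features=512)
x = torch.randn(32, 64)
recon, z = sae(x)
loss = F.mse_loss(recon, x) + 1e-2 * z.abs().mean() + 0.1 * orthogonality_penalty(sae.dec.weight)
loss.backward()
print(float(loss))
```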
[LG-47] MCGM: Multi-stage Clustered Global Modeling for Long-range Interactions in Molecules
链接: https://arxiv.org/abs/2509.22028
作者: Haodong Pan,Yusong Wang,Nanning Zheng,Caijui Jiang
类目: Machine Learning (cs.LG)
*备注: 27 pages, 1 figures
[LG-48] Teaching Transformers to Solve Combinatorial Problems through Efficient Trial and Error
链接: https://arxiv.org/abs/2509.22023
作者: Panagiotis Giannoulis,Yorgos Pantis,Christos Tzamos
类目: Machine Learning (cs.LG)
*备注:
[LG-49] Task-Adaptive Parameter-Efficient Fine-Tuning for Weather Foundation Models
链接: https://arxiv.org/abs/2509.22020
作者: Shilei Cao,Hehai Lin,Jiashun Cheng,Yang Liu,Guowen Li,Xuehe Wang,Juepeng Zheng,Haoyuan Liang,Meng Jin,Chengwei Qin,Hong Cheng,Haohuan Fu
类目: Machine Learning (cs.LG)
*备注:
[LG-50] AEGIS: Authentic Edge Growth In Sparsity for Link Prediction in Edge-Sparse Bipartite Knowledge Graphs
链接: https://arxiv.org/abs/2509.22017
作者: Hugh Xuechen Liu,Kıvanç Tatar
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Bipartite knowledge graphs in niche domains are typically data-poor and edge-sparse, which hinders link prediction. We introduce AEGIS (Authentic Edge Growth In Sparsity), an edge-only augmentation framework that resamples existing training edges, either uniformly (the simple variant) or with an inverse-degree bias (the degree-aware variant), thereby preserving the original node set and sidestepping fabricated endpoints. To probe authenticity across regimes, we consider naturally sparse graphs (the game design patterns (GDP) network) and induce sparsity in denser benchmarks (Amazon, MovieLens) via high-rate bond percolation. We evaluate augmentations on two complementary metrics: AUC-ROC (higher is better) and the Brier score (lower is better), using two-tailed paired t-tests against sparse baselines. On Amazon and MovieLens, copy-based AEGIS variants match the baseline while the semantic KNN augmentation is the only method that restores AUC and calibration; random and synthetic edges remain detrimental. On the text-rich GDP graph, semantic KNN achieves the largest AUC improvement and Brier score reduction, and the simple variant also lowers the Brier score relative to the sparse control. These findings position authenticity-constrained resampling as a data-efficient strategy for sparse bipartite link prediction, with semantic augmentation providing an additional boost when informative node descriptions are available.
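下面给出"按均匀(simple)或逆度数偏置(degree-aware)重采样已有训练边"这一增广思路的最小 numpy 示意(打分方式与示例数据均为假设,仅说明两种变体的区别):

```python
# Minimal sketch (assumptions): resample existing bipartite training edges, either uniformly
# ("simple") or biased toward low-degree endpoints ("degree_aware"), keeping the node set fixed.
import numpy as np
from collections import Counter

def resample_edges(edges, n_new, mode="simple", seed=0):
    rng = np.random.default_rng(seed)
    edges = list(edges)
    if mode == "simple":
        probs = np.full(len(edges), 1.0 / len(edges))
    elif mode == "degree_aware":
        deg_u = Counter(u for u, _ in edges)
        deg_v = Counter(v for _, v in edges)
        # Weight each existing edge by the inverse degree of its endpoints.
        w = np.array([1.0 / deg_u[u] + 1.0 / deg_v[v] for u, v in edges])
        probs = w / w.sum()
    else:
        raise ValueError(mode)
    idx = rng.choice(len(edges), size=n_new, replace=True, p=probs)
    return [edges[i] for i in idx]  # duplicated authentic edges, no fabricated endpoints

train_edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i3")]
print(resample_edges(train_edges, 3, mode="degree_aware"))
```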
[LG-51] Concept-SAE: Active Causal Probing of Visual Model Behavior
链接: https://arxiv.org/abs/2509.22015
作者: Jianrong Ding,Muxi Chen,Chenchen Zhao,Qiang Xu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model’s learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model’s failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
[LG-52] Goal-Guided Efficient Exploration via Large Language Model in Reinforcement Learning
链接: https://arxiv.org/abs/2509.22008
作者: Yajie Qi,Wei Wei,Lin Li,Lijun Zhang,Zhidong Gao,Da Wang,Huizhong Song
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Real-world decision-making tasks typically occur in complex and open environments, posing significant challenges to reinforcement learning (RL) agents’ exploration efficiency and long-horizon planning capabilities. A promising approach is LLM-enhanced RL, which leverages the rich prior knowledge and strong planning capabilities of LLMs to guide RL agents in efficient exploration. However, existing methods mostly rely on frequent and costly LLM invocations and suffer from limited performance due to the semantic mismatch. In this paper, we introduce a Structured Goal-guided Reinforcement Learning (SGRL) method that integrates a structured goal planner and a goal-conditioned action pruner to guide RL agents toward efficient exploration. Specifically, the structured goal planner utilizes LLMs to generate a reusable, structured function for goal generation, in which goals are prioritized. Furthermore, by utilizing LLMs to determine goals’ priority weights, it dynamically generates forward-looking goals to guide the agent’s policy toward more promising decision-making trajectories. The goal-conditioned action pruner employs an action masking mechanism that filters out actions misaligned with the current goal, thereby constraining the RL agent to select goal-consistent policies. We evaluate the proposed method on Crafter and Craftax-Classic, and experimental results demonstrate that SGRL achieves superior performance compared to existing state-of-the-art methods.
[LG-53] Stage-wise Dynamics of Classifier-Free Guidance in Diffusion Models
链接: https://arxiv.org/abs/2509.22007
作者: Cheng Jin,Qitan Shi,Yuantao Gu
类目: Machine Learning (cs.LG)
*备注: 24 pages, 10 figures
[LG-54] GRAM-TDI: adaptive multimodal representation learning for drug target interaction prediction
链接: https://arxiv.org/abs/2509.21971
作者: Feng Jiang,Amina Mollaysa,Hehuan Ma,Tommaso Mansi,Junzhou Huang,Mangal Prakash,Rui Liao
类目: Machine Learning (cs.LG)
*备注:
[LG-55] Think Smart Not Hard: Difficulty Adaptive Reasoning for Large Audio Language Models
链接: https://arxiv.org/abs/2509.21960
作者: Zhichao Sheng,Shilin Zhou,Chen Gong,Zhenghua Li
类目: Machine Learning (cs.LG)
*备注:
[LG-56] Learnable Conformal Prediction with Context-Aware Nonconformity Functions for Robotic Planning and Perception
链接: https://arxiv.org/abs/2509.21955
作者: Divake Kumar,Sina Tayebati,Francesco Migliarba,Ranganath Krishnan,Amit Ranjan Trivedi
类目: Robotics (cs.RO); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
[LG-57] Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2509.21942
作者: Xianghua Zeng,Hao Peng,Angsheng Li,Yicheng Pan
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Diffusion-based generative methods have shown promising potential for modeling trajectories from offline reinforcement learning (RL) datasets, and hierarchical diffusion has been introduced to mitigate variance accumulation and computational challenges in long-horizon planning tasks. However, existing approaches typically assume a fixed two-layer diffusion hierarchy with a single predefined temporal scale, which limits adaptability to diverse downstream tasks and reduces flexibility in decision making. In this work, we propose SIHD, a novel Structural Information-based Hierarchical Diffusion framework for effective and stable offline policy learning in long-horizon environments with sparse rewards. Specifically, we analyze structural information embedded in offline trajectories to construct the diffusion hierarchy adaptively, enabling flexible trajectory modeling across multiple temporal scales. Rather than relying on reward predictions from localized sub-trajectories, we quantify the structural information gain of each state community and use it as a conditioning signal within the corresponding diffusion layer. To reduce overreliance on offline datasets, we introduce a structural entropy regularizer that encourages exploration of underrepresented states while avoiding extrapolation errors from distributional shifts. Extensive evaluations on challenging offline RL tasks show that SIHD significantly outperforms state-of-the-art baselines in decision-making performance and demonstrates superior generalization across diverse scenarios.
[LG-58] Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
链接: https://arxiv.org/abs/2509.21936
作者: O. Duranthon,P. Marion,C. Boyer,B. Loureiro,L. Zdeborová
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注:
[LG-59] Extracting Actionable Insights from Building Energy Data using Vision LLMs on Wavelet and 3D Recurrence Representations
链接: https://arxiv.org/abs/2509.21934
作者: Amine Bechar,Adel Oulefki,Abbes Amira,Fatih Kurogollu,Yassine Himeur
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: IEEE International Conference on Data Mining 2025
[LG-60] Multiplicative-Additive Constrained Models: Toward Joint Visualization of Interactive and Independent Effects
链接: https://arxiv.org/abs/2509.21923
作者: Fumin Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Interpretability is one of the considerations when applying machine learning to high-stakes fields such as healthcare that involve matters of life safety. Generalized Additive Models (GAMs) enhance interpretability by visualizing shape functions. Nevertheless, to preserve interpretability, GAMs omit higher-order interaction effects (beyond pairwise interactions), which imposes significant constraints on their predictive performance. We observe that Curve Ergodic Set Regression (CESR), a multiplicative model, naturally enables the visualization of its shape functions and simultaneously incorporates both interactions among all features and individual feature effects. Nevertheless, CESR fails to demonstrate superior performance compared to GAMs. We introduce Multiplicative-Additive Constrained Models (MACMs), which augment CESR with an additive part to disentangle the intertwined coefficients of its interactive and independent terms, thus effectively broadening the hypothesis space. The model is composed of a multiplicative part and an additive part, whose shape functions can both be naturally visualized, thereby assisting users in interpreting how features participate in the decision-making process. Consequently, MACMs constitute an improvement over both CESR and GAMs. The experimental results indicate that neural network-based MACMs significantly outperform both CESR and the current state-of-the-art GAMs in terms of predictive performance.
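As a rough picture of the model family, a MACM-style predictor combines a multiplicative part (a product of per-feature shape functions) with an additive part (a sum of per-feature shape functions), and both kinds of shape functions can be plotted one feature at a time. The sketch below is our own minimal PyTorch rendering of that structure; layer sizes and parameterization are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ShapeFunction(nn.Module):
    """A 1-D per-feature network whose output can be plotted against its input."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):  # x: (batch, 1)
        return self.net(x)

class MACMSketch(nn.Module):
    """Multiplicative part (product of shapes) plus additive part (sum of shapes)."""
    def __init__(self, num_features, hidden=16):
        super().__init__()
        self.mult_shapes = nn.ModuleList([ShapeFunction(hidden) for _ in range(num_features)])
        self.add_shapes = nn.ModuleList([ShapeFunction(hidden) for _ in range(num_features)])

    def forward(self, x):  # x: (batch, num_features)
        cols = x.split(1, dim=1)
        mult = torch.stack([f(c) for f, c in zip(self.mult_shapes, cols)]).prod(dim=0)
        add = torch.stack([g(c) for g, c in zip(self.add_shapes, cols)]).sum(dim=0)
        return (mult + add).squeeze(-1)

model = MACMSketch(num_features=4)
print(model(torch.randn(8, 4)).shape)  # torch.Size([8])
```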
[LG-61] Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
链接: https://arxiv.org/abs/2509.21912
作者: Zhengyan Wan,Yidong Ouyang,Liyan Xie,Fang Fang,Hongyuan Zha,Guang Cheng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-62] Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs
链接: https://arxiv.org/abs/2509.21895
作者: Yuka Hashimoto,Sho Sonoda,Isao Ishikawa,Masahiro Ikeda
类目: Machine Learning (cs.LG); Functional Analysis (math.FA); Representation Theory (math.RT); Machine Learning (stat.ML)
*备注:
[LG-63] Zubov-Net: Adaptive Stability for Neural ODEs Reconciling Accuracy with Robustness
链接: https://arxiv.org/abs/2509.21879
作者: Chaoyang Luo,Yan Zou,Nanjing Huang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Despite neural ordinary differential equations (Neural ODEs) exhibiting intrinsic robustness under input perturbations due to their dynamical systems nature, recent approaches often involve imposing Lyapunov-based stability conditions to provide formal robustness guarantees. However, a fundamental challenge remains: the tension between robustness and accuracy, primarily stemming from the difficulty in imposing appropriate stability conditions. To address this, we propose an adaptive stable learning framework named Zubov-Net, which innovatively reformulates Zubov’s equation into a consistency characterization between regions of attraction (RoAs) and prescribed RoAs (PRoAs). Building on this consistency, we introduce a new paradigm for actively controlling the geometry of RoAs by directly optimizing PRoAs to reconcile accuracy and robustness. Our approach is realized through tripartite losses (consistency, classification, and separation losses) and a parallel boundary sampling algorithm that co-optimizes the Neural ODE and the Lyapunov function. To enhance the discriminativity of Lyapunov functions, we design an input-attention-based convex neural network via a softmax attention mechanism that focuses on equilibrium-relevant features and also serves as weight normalization to maintain training stability in deep architectures. Theoretically, we prove that minimizing the tripartite loss guarantees consistent alignment of PRoAs-RoAs, trajectory stability, and non-overlapping PRoAs. Moreover, we establish stochastic convex separability with tighter probability bounds and fewer dimensionality requirements to justify the convex design in Lyapunov functions. Experimentally, Zubov-Net maintains high classification accuracy while significantly improving robustness against various stochastic noises and adversarial attacks.
[LG-64] Abductive Logical Rule Induction by Bridging Inductive Logic Programming and Multimodal Large Language Models
链接: https://arxiv.org/abs/2509.21874
作者: Yifei Peng,Yaoli Liu,Enbo Xia,Yu Jin,Wang-Zhou Dai,Zhong Ren,Yao-Xiang Ding,Kun Zhou
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose ILP-CoT, a method that bridges Inductive Logic Programming (ILP) and Multimodal Large Language Models (MLLMs) for abductive logical rule induction. The task involves both discovering logical facts and inducing logical rules from a small number of unstructured textual or visual inputs, which still remain challenging when solely relying on ILP, due to the requirement of specified background knowledge and high computational cost, or MLLMs, due to the appearance of perceptual hallucinations. Based on the key observation that MLLMs could propose structure-correct rules even under hallucinations, our approach automatically builds ILP tasks with pruned search spaces based on the rule structure proposals from MLLMs, and utilizes ILP system to output rules built upon rectified logical facts and formal inductive reasoning. Its effectiveness is verified through challenging logical induction benchmarks, as well as a potential application of our approach, namely text-to-image customized generation with rule induction. Our code and data are released at this https URL.
[LG-65] Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding
链接: https://arxiv.org/abs/2509.21865
作者: Seong-Woong Shim,Myunsoo Kim,Jae Hyeon Cho,Byung-Jun Lee
类目: Machine Learning (cs.LG)
*备注:
[LG-66] MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation and 3D Structure Generation
链接: https://arxiv.org/abs/2509.21861
作者: Shuaike Shen,Jiaqing Xie,Zhuo Yang,Antong Zhang,Shuzhou Sun,Ben Gao,Tianfan Fu,Biqing Qi,Yuqiang Li
类目: Machine Learning (cs.LG)
*备注:
[LG-67] On the Complexity Theory of Masked Discrete Diffusion: From poly(1/ε) to Nearly ε-Free
链接: https://arxiv.org/abs/2509.21835
作者: Xunpeng Huang,Yingyu Lin,Nishant Jain,Kaibo Wang,Difan Zou,Yian Ma,Tong Zhang
类目: Machine Learning (cs.LG)
*备注: 44 pages
[LG-68] Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2509.21828
作者: TheViet Bui,Tien Mai,Hong Thanh Nguyen
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
[LG-69] Sharpness-Aware Minimization Can Hallucinate Minimizers
链接: https://arxiv.org/abs/2509.21818
作者: Chanwoong Park,Uijeong Jang,Ernest K. Ryu,Insoon Yang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-70] Scaling Laws for Neural Material Models
链接: https://arxiv.org/abs/2509.21811
作者: Akshay Trikha,Kyle Chu,Advait Gosai,Parker Szachta,Eric Weiner
类目: Machine Learning (cs.LG)
*备注: 12 pages, 11 figures, preprint
点击查看摘要
Abstract:Predicting material properties is crucial for designing better batteries, semiconductors, and medical devices. Deep learning helps scientists quickly find promising materials by predicting their energy, forces, and stresses. Companies scale capacities of deep learning models in multiple domains, such as language modeling, and invest many millions of dollars into such models. Our team analyzes how scaling training data (giving models more information to learn from), model sizes (giving models more capacity to learn patterns), and compute (giving models more computational resources) for neural networks affects their performance for material property prediction. In particular, we trained both transformer and EquiformerV2 neural networks to predict material properties. We find empirical scaling laws for these models: we can predict how increasing each of the three hyperparameters (training data, model size, and compute) affects predictive performance. In particular, the loss L can be measured with a power law relationship L = \alpha \cdot N^{-\beta}, where \alpha and \beta are constants while N is the relevant hyperparameter. We also incorporate command-line arguments for changing training settings such as the number of epochs, maximum learning rate, and whether mixed precision is enabled. Future work could entail further investigating scaling laws for other neural network models in this domain, such as GemNet and fully connected networks, to assess how they compare to the models we trained.
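The power law quoted above is straightforward to fit from a handful of (hyperparameter, loss) measurements: taking logs turns L = \alpha \cdot N^{-\beta} into a line, so \beta is the negative slope and \alpha the exponentiated intercept. A small sketch with made-up numbers (the measurements below are illustrative only, not results from the paper):

```python
import numpy as np

# Hypothetical (N, loss) pairs, e.g. N = number of training examples.
N = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
loss = np.array([0.52, 0.41, 0.33, 0.27, 0.22])

# L = alpha * N**(-beta)  =>  log L = log alpha - beta * log N
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
beta, alpha = -slope, np.exp(intercept)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
print("extrapolated loss at N = 1e6:", alpha * 1e6 ** (-beta))
```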
[LG-71] Exploring the Relationships Between Physiological Signals During Automated Fatigue Detection
链接: https://arxiv.org/abs/2509.21794
作者: Kourosh Kakhi,Abbas Khosravi,Roohallah Alizadehsani,U. Rajendra Acharya
类目: Machine Learning (cs.LG)
*备注: 14 Pages, 12 Figures, 3 Tables
[LG-72] Beyond Formula Complexity: Effective Information Criterion Improves Performance and Interpretability for Symbolic Regression
链接: https://arxiv.org/abs/2509.21780
作者: Zihan Yu,Guanren Wang,Jingtao Ding,Huandong Wang,Yong Li
类目: Machine Learning (cs.LG)
*备注:
[LG-73] Machine Learning and AI Applied to fNIRS Data Reveals Novel Brain Activity Biomarkers in Stable Subclinical Multiple Sclerosis
链接: https://arxiv.org/abs/2509.21770
作者: Sadman Saumik Islam,Bruna Dalcin Baldasso,Davide Cattaneo,Xianta Jiang,Michelle Ploughman
类目: Machine Learning (cs.LG)
*备注:
[LG-74] Reparameterizing 4DVAR with neural fields
链接: https://arxiv.org/abs/2509.21751
作者: Jaemin Oh
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注: 22 pages, 10 figures, 6 tables
[LG-75] Reinforcement Learning Based Traffic Signal Design to Minimize Queue Lengths
链接: https://arxiv.org/abs/2509.21745
作者: Anirud Nandakumar,Chayan Banerjee,Lelitha Devi Vanajakshi
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
[LG-76] Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription
链接: https://arxiv.org/abs/2509.21739
作者: Michael Yeung,Keisuke Toyama,Toya Teramoto,Shusuke Takahashi,Tamaki Kojima
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
[LG-77] Information-Theoretic Bayesian Optimization for Bilevel Optimization Problems
链接: https://arxiv.org/abs/2509.21725
作者: Takuya Kanayama,Yuki Ito,Tomoyuki Tamura,Masayuki Karasuyama
类目: Machine Learning (cs.LG)
*备注:
[LG-78] A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems
链接: https://arxiv.org/abs/2509.21716
作者: Xavier Gonzalez,E. Kelly Buchanan,Hyun Dong Lee,Jerry Weihong Liu,Ke Alexander Wang,David M. Zoltowski,Christopher Ré,Scott W. Linderman
类目: Machine Learning (cs.LG)
*备注: Repo: this https URL
点击查看摘要
Abstract:Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
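The fixed-point idea behind these methods can be seen on a toy recurrence: instead of unrolling h_t = f(h_{t-1}, x_t) step by step, one repeatedly updates all time steps at once from the previous iterate, which is a Jacobi/Picard-style iteration whose每 sweep is parallel over t. The cell below is a toy tanh recurrence of our own choosing, not one of the paper's models; for a length-T sequence the iteration becomes exact after at most T sweeps.

```python
import numpy as np

def cell(h_prev, x):
    # Toy nonlinear recurrence (tanh RNN-style update).
    return np.tanh(0.9 * h_prev + x)

def sequential_scan(x, h0=0.0):
    h, out = h0, []
    for xt in x:
        h = cell(h, xt)
        out.append(h)
    return np.array(out)

def jacobi_scan(x, h0=0.0, num_sweeps=40):
    """Fixed-point evaluation: update every time step simultaneously each sweep."""
    h = np.zeros_like(x)
    for _ in range(num_sweeps):
        h_prev = np.concatenate(([h0], h[:-1]))
        h = cell(h_prev, x)  # this line is parallel over t in a real implementation
    return h

x = 0.1 * np.random.default_rng(0).normal(size=32)
print(np.max(np.abs(sequential_scan(x) - jacobi_scan(x))))  # ~0: both scans agree
```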
[LG-79] PQFed: A Privacy-Preserving Quality-Controlled Federated Learning Framework
链接: https://arxiv.org/abs/2509.21704
作者: Weiqi Yue,Wenbiao Li,Yuzhou Jiang,Anisa Halimi,Roger French,Erman Ayday
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated learning enables collaborative model training without sharing raw data, but data heterogeneity consistently challenges the performance of the global model. Traditional optimization methods often rely on collaborative global model training involving all clients, followed by local adaptation to improve individual performance. In this work, we focus on early-stage quality control and propose PQFed, a novel privacy-preserving personalized federated learning framework that designs customized training strategies for each client prior to the federated training process. PQFed extracts representative features from each client’s raw data and applies clustering techniques to estimate inter-client dataset similarity. Based on these similarity estimates, the framework implements a client selection strategy that enables each client to collaborate with others who have compatible data distributions. We evaluate PQFed on two benchmark datasets, CIFAR-10 and MNIST, integrated with three existing federated learning algorithms. Experimental results show that PQFed consistently improves the target client’s model performance, even with a limited number of participants. We further benchmark PQFed against a baseline cluster-based algorithm, IFCA, and observe that PQFed also achieves better performance in low-participation scenarios. These findings highlight PQFed’s scalability and effectiveness in personalized federated learning settings.
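The early-stage quality control described above amounts to grouping clients with compatible data distributions before federated training starts. A minimal sketch of that step under our own assumptions (each client shares only a compact feature summary, and the number of clusters is picked by hand):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-client summaries: each client sends a representative
# feature vector (e.g. the mean of locally extracted features), not raw data.
rng = np.random.default_rng(0)
client_features = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(4, 8)),  # clients drawn from one distribution
    rng.normal(loc=3.0, scale=1.0, size=(4, 8)),  # clients drawn from another
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(client_features)

# Each client then collaborates only with clients assigned to the same cluster.
groups = {int(c): np.where(labels == c)[0].tolist() for c in set(labels)}
print(groups)
```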
[LG-80] Downscaling human mobility data based on demographic socioeconomic and commuting characteristics using interpretable machine learning methods
链接: https://arxiv.org/abs/2509.21703
作者: Yuqin Jiang,Andrey A. Popov,Tianle Duan,Qingchun Li
类目: Machine Learning (cs.LG)
*备注:
[LG-81] Exact Subgraph Isomorphism Network for Predictive Graph Mining
链接: https://arxiv.org/abs/2509.21699
作者: Taiga Kojima,Masayuki Karasuyama
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the graph-level prediction task (predict a label for a given graph), the information contained in subgraphs of the input graph plays a key role. In this paper, we propose Exact subgraph Isomorphism Network (EIN), which combines exact subgraph enumeration, a neural network, and sparse regularization. In general, building a graph-level prediction model achieving high discriminative ability along with interpretability is still a challenging problem. Our combination of subgraph enumeration and a neural network contributes to high discriminative ability about the subgraph structure of the input graph. Further, the sparse regularization in EIN enables us 1) to derive an effective pruning strategy that mitigates the computational difficulty of the enumeration while maintaining the prediction performance, and 2) to identify important subgraphs that contribute to high interpretability. We empirically show that EIN has sufficiently high prediction performance compared with standard graph neural network models, and also, we show examples of post-hoc analysis based on the selected subgraphs.
[LG-82] Wav2Arrest 2.0: Long-Horizon Cardiac Arrest Prediction with Time-to-Event Modeling, Identity-Invariance, and Pseudo-Lab Alignment
链接: https://arxiv.org/abs/2509.21695
作者: Saurabh Kataria,Davood Fattahi,Minxiao Wang,Ran Xiao,Matthew Clark,Timothy Ruchti,Mark Mai,Xiao Hu
类目: Machine Learning (cs.LG)
*备注: Submitted to BPSC
[LG-83] SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding NEURIPS2025
链接: https://arxiv.org/abs/2509.21689
作者: Thomas Walton,Darin Tsui,Aryan Musharaf,Amirali Aghazadeh
类目: Machine Learning (cs.LG)
*备注: Accepted as spotlight at NeurIPS 2025
[LG-84] Prophecy: Inferring Formal Properties from Neuron Activations
链接: https://arxiv.org/abs/2509.21677
作者: Divya Gopinath,Corina S. Pasareanu,Muhammad Usman
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
[LG-85] Scalable Second-order Riemannian Optimization for K-means Clustering
链接: https://arxiv.org/abs/2509.21675
作者: Peng Xu,Chun-Ying Hou,Xiaohui Chen,Richard Y. Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-86] Neuroprobe: Evaluating Intracranial Brain Responses to Naturalistic Stimuli
链接: https://arxiv.org/abs/2509.21671
作者: Andrii Zahorodnii,Christopher Wang,Bennett Stankovits,Charikleia Moraitaki,Geeling Chau,Andrei Barbu,Boris Katz,Ila R Fiete
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 31 pages, 7 main figures
[LG-87] Generating Stable Placements via Physics-guided Diffusion Models
链接: https://arxiv.org/abs/2509.21664
作者: Philippe Nadeau,Miguel Rogel,Ivan Bilić,Ivan Petrović,Jonathan Kelly
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to the IEEE International Conference on Robotics and Automation 2026, Vienna, Austria, June 1-5, 2026
[LG-88] MMPlanner: Zero-Shot Multimodal Procedural Planning with Chain-of-Thought Object State Reasoning EMNLP2025
链接: https://arxiv.org/abs/2509.21662
作者: Afrina Tabassum,Bin Guo,Xiyao Ma,Hoda Eldardiry,Ismini Lourentzou
类目: Machine Learning (cs.LG)
*备注: 17 pages, 9 figures, 14 tables, Findings of the Association for Computational Linguistics: EMNLP 2025
[LG-89] A Systematic Review of Conformal Inference Procedures for Treatment Effect Estimation: Methods and Challenges
链接: https://arxiv.org/abs/2509.21660
作者: Pascal Memmesheimer,Vincent Heuveline,Jürgen Hesser
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages, 3 figures
点击查看摘要
Abstract:Treatment effect estimation is essential for informed decision-making in many fields such as healthcare, economics, and public policy. While flexible machine learning models have been widely applied for estimating heterogeneous treatment effects, quantifying the inherent uncertainty of their point predictions remains an issue. Recent advancements in conformal prediction address this limitation by allowing for inexpensive computation, as well as distribution shifts, while still providing frequentist, finite-sample coverage guarantees under minimal assumptions for any point-predictor model. This advancement holds significant potential for improving decision-making in especially high-stakes environments. In this work, we perform a systematic review regarding conformal prediction methods for treatment effect estimation and provide for both the necessary theoretical background. Through a systematic filtering process, we select and analyze eleven key papers, identifying and describing current state-of-the-art methods in this area. Based on our findings, we propose directions for future research.
[LG-90] RED-DiffEq: Regularization by denoising diffusion models for solving inverse PDE problems with application to full waveform inversion
链接: https://arxiv.org/abs/2509.21659
作者: Siming Shan,Min Zhu,Youzuo Lin,Lu Lu
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
[LG-91] Differentiable Structure Learning for General Binary Data NEURIPS2024
链接: https://arxiv.org/abs/2509.21658
作者: Chang Deng,Bryon Aragam
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 33 pages, 6 figures, to appear at NeurIPS 2024
[LG-92] DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models
链接: https://arxiv.org/abs/2509.21655
作者: Yinuo Ren,Wenhao Gao,Lexing Ying,Grant M. Rotskoff,Jiequn Han
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study inference-time scaling for diffusion models, where the goal is to adapt a pre-trained model to new target distributions without retraining. Existing guidance-based methods are simple but introduce bias, while particle-based corrections suffer from weight degeneracy and high computational cost. We introduce DriftLite, a lightweight, training-free particle-based approach that steers the inference dynamics on the fly with provably optimal stability control. DriftLite exploits a previously unexplored degree of freedom in the Fokker-Planck equation between the drift and particle potential, and yields two practical instantiations: Variance- and Energy-Controlling Guidance (VCG/ECG) for approximating the optimal drift with minimal overhead. Across Gaussian mixture models, particle systems, and large-scale protein-ligand co-folding problems, DriftLite consistently reduces variance and improves sample quality over pure guidance and sequential Monte Carlo baselines. These results highlight a principled, efficient route toward scalable inference-time adaptation of diffusion models.
[LG-93] Understanding and Enhancing Mask-Based Pretraining towards Universal Representations NEURIPS2025
链接: https://arxiv.org/abs/2509.21650
作者: Mingze Dong,Leda Wang,Yuval Kluger
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025
[LG-94] Blockwise Hadamard high-Rank Adaptation for Parameter-Efficient LLM Fine-Tuning
链接: https://arxiv.org/abs/2509.21637
作者: Feng Yu,Jia Hu,Geyong Min
类目: Machine Learning (cs.LG)
*备注:
[LG-95] Shoot from the HIP: Hessian Interatomic Potentials without derivatives
链接: https://arxiv.org/abs/2509.21624
作者: Andreas Burger,Luca Thiede,Nikolaj Rønne,Varinia Bernales,Nandita Vijaykumar,Tejs Vegge,Arghya Bhowmik,Alan Aspuru-Guzik
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注: this https URL
点击查看摘要
Abstract:Fundamental tasks in computational chemistry, from transition state search to vibrational analysis, rely on molecular Hessians, which are the second derivatives of the potential energy. Yet, Hessians are computationally expensive to calculate and scale poorly with system size, with both quantum mechanical methods and neural networks. In this work, we demonstrate that Hessians can be predicted directly from a deep learning model, without relying on automatic differentiation or finite differences. We observe that one can construct SE(3)-equivariant, symmetric Hessians from irreducible representations (irrep) features up to degree l =2 computed during message passing in graph neural networks. This makes HIP Hessians one to two orders of magnitude faster, more accurate, more memory efficient, easier to train, and enables more favorable scaling with system size. We validate our predictions across a wide range of downstream tasks, demonstrating consistently superior performance for transition state search, accelerated geometry optimization, zero-point energy corrections, and vibrational analysis benchmarks. We open-source the HIP codebase and model weights to enable further development of the direct prediction of Hessians at this https URL
[LG-96] PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters
链接: https://arxiv.org/abs/2509.21619
作者: Krishu K Thapa,Reet Barik,Krishna Teja Chitty-Venkata,Murali Emani,Venkatram Vishwanath
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 7 pages, 7 figures, 2 algorithms, 1 table, conference paper
点击查看摘要
Abstract:Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (higher change in weights) takes place in the earlier stage of the training loop. These changes stabilize as training continues, enabling them to be captured by matrices of a low intrinsic rank. Therefore, we propose an approach to identify such states of partial convergence and dynamically switch from full parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible approach that leverages user-defined hyperparameters to determine the switching point and assign a rank specific to each module layer based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of its original size, resulting in a 3x improvement in throughput, and a 1.5x reduction in average training time per epoch while also reducing GPU memory consumption by 20%
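The switching decision described above needs some notion of partial convergence per module. One simple proxy, sketched below under our own assumptions (the threshold and the exact criterion are not taken from the paper), is the relative change of each weight tensor between two checkpoints: modules whose updates have become small are candidates for freezing into a low-rank adapter.

```python
import torch

def relative_change(prev, curr, eps=1e-12):
    """Relative weight change of one tensor between two checkpoints."""
    return (torch.norm(curr - prev) / (torch.norm(prev) + eps)).item()

def modules_ready_for_lora(prev_state, curr_state, threshold=1e-3):
    """Return parameter names whose updates have stabilized (hypothetical criterion)."""
    ready = []
    for name, prev in prev_state.items():
        curr = curr_state.get(name)
        if curr is not None and prev.dtype.is_floating_point:
            if relative_change(prev.float(), curr.float()) < threshold:
                ready.append(name)
    return ready

# Usage sketch: compare checkpoints saved a few epochs apart.
# ready = modules_ready_for_lora(torch.load("epoch_10.pt"), torch.load("epoch_12.pt"))
```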
[LG-97] Causal Abstraction Inference under Lossy Representations ICML2025
链接: https://arxiv.org/abs/2509.21607
作者: Kevin Xia,Elias Bareinboim
类目: Machine Learning (cs.LG)
*备注: 35 pages, 8 figures, published at ICML 2025
[LG-98] Task-Agnostic Federated Continual Learning via Replay-Free Gradient Projection
链接: https://arxiv.org/abs/2509.21606
作者: Seohyeon Cha,Huancheng Chen,Haris Vikalo
类目: Machine Learning (cs.LG)
*备注:
[LG-99] GenUQ: Predictive Uncertainty Estimates via Generative Hyper-Networks
链接: https://arxiv.org/abs/2509.21605
作者: Tian Yu Yen,Reese E. Jones,Ravi G. Patel
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 9 pages, 6 figures
[LG-100] Interpretable time series analysis with Gumbel dynamics
链接: https://arxiv.org/abs/2509.21578
作者: Yiliu Wang,Timothy Doyeon Kim,Eric Shea-Brown,Uygar Sümbül
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 5 figures
[LG-101] Machine Learning. The Science of Selection under Uncertainty
链接: https://arxiv.org/abs/2509.21547
作者: Yevgeny Seldin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-102] Evidence for Limited Metacognition in LLMs
链接: https://arxiv.org/abs/2509.21545
作者: Christopher Ackerman
类目: Machine Learning (cs.LG)
*备注: 25 pages, 22 figures
[LG-103] A circuit for predicting hierarchical structure in-context in Large Language Models
链接: https://arxiv.org/abs/2509.21534
作者: Tankred Saanum,Can Demircan,Samuel J. Gershman,Eric Schulz
类目: Machine Learning (cs.LG)
*备注:
[LG-104] Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration
链接: https://arxiv.org/abs/2509.21530
作者: Dongkyu Cho,Miao Zhang,Rumi Chunara
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures
[LG-105] Contrastive Mutual Information Learning: Toward Robust Representations without Positive-Pair Augmentations
链接: https://arxiv.org/abs/2509.21511
作者: Micha Livne
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint. 9 pages main manuscript, 23 pages with appendix
[LG-106] Functional Encryption in Secure Neural Network Training: Data Leakage and Practical Mitigations RAID2025
链接: https://arxiv.org/abs/2509.21497
作者: Alexandru Ioniţă,Andreea Ioniţă
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at RAID 2025. © IEEE
[LG-107] Context-Aware Hybrid Routing in Bluetooth Mesh Networks Using Multi-Model Machine Learning and AODV Fallback
链接: https://arxiv.org/abs/2509.21490
作者: Md Sajid Islam,Tanvir Hasan
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 15 pages, 2 figures
[LG-108] GraphPFN: A Prior-Data Fitted Graph Foundation Model
链接: https://arxiv.org/abs/2509.21489
作者: Dmitry Eremeev,Oleg Platonov,Gleb Bazhenov,Artem Babenko,Liudmila Prokhorenkova
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Foundation models pretrained on large-scale datasets have transformed such fields as natural language processing and computer vision, but their application to graph data remains limited. Recently emerged graph foundation models, such as G2T-FM, utilize tabular foundation models for graph tasks and were shown to significantly outperform prior attempts to create GFMs. However, these models primarily rely on hand-crafted graph features, limiting their ability to learn complex graph-specific patterns. In this work, we propose GraphPFN: a prior-data fitted network for node-level prediction. First, we design a prior distribution of synthetic attributed graphs. For graph structure generation, we use a novel combination of multiple stochastic block models and a preferential attachment process. We then apply graph-aware structured causal models to generate node attributes and targets. This procedure allows us to efficiently generate a wide range of realistic graph datasets. Then, we augment the tabular foundation model LimiX with attention-based graph neighborhood aggregation layers and train it on synthetic graphs sampled from our prior, allowing the model to capture graph structural dependencies not present in tabular data. On diverse real-world graph datasets with up to 50,000 nodes, GraphPFN shows strong in-context learning performance and achieves state-of-the-art results after finetuning, outperforming both G2T-FM and task-specific GNNs trained from scratch on most datasets. More broadly, our work demonstrates that pretraining on synthetic graphs from a well-designed prior distribution is an effective strategy for building graph foundation models.
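The prior described above mixes two classic generators of graph structure. The sketch below only illustrates the two ingredients, a stochastic block model and a preferential-attachment growth step, using networkx; how GraphPFN actually combines them (and how node attributes and targets are generated) is not reproduced here.

```python
import random
import networkx as nx

rng = random.Random(0)

# Ingredient 1: a stochastic block model with three communities.
sizes = [30, 30, 40]
p = [[0.15, 0.01, 0.01],
     [0.01, 0.15, 0.01],
     [0.01, 0.01, 0.15]]
g = nx.stochastic_block_model(sizes, p, seed=0)

# Ingredient 2: grow the graph by preferential attachment, linking each new
# node to existing nodes with probability proportional to their degree.
for new_node in range(sum(sizes), sum(sizes) + 50):
    weights = [g.degree(n) + 1 for n in g.nodes]
    targets = set(rng.choices(list(g.nodes), weights=weights, k=3))
    g.add_node(new_node)
    g.add_edges_from((new_node, t) for t in targets)

print(g.number_of_nodes(), g.number_of_edges())
```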
[LG-109] High-Probability Analysis of Online and Federated Zero-Order Optimisation
链接: https://arxiv.org/abs/2509.21484
作者: Arya Akhavan,David Janz,El-Mahdi El-Mhamdi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-110] Filtering with Confidence: When Data Augmentation Meets Conformal Prediction
链接: https://arxiv.org/abs/2509.21479
作者: Zixuan Wu,So Won Jeong,Yating Liu,Yeo Jin Jung,Claire Donnat
类目: Machine Learning (cs.LG)
*备注:
[LG-111] d2: Improved Techniques for Training Reasoning Diffusion Language Models
链接: https://arxiv.org/abs/2509.21474
作者: Guanghan Wang,Yair Schiff,Gilad Turok,Volodymyr Kuleshov
类目: Machine Learning (cs.LG)
*备注: preprint
[LG-112] Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
链接: https://arxiv.org/abs/2509.21465
作者: George Yakushev,Alina Shutova,Ivan Rubachev,Renat Sergazinov,Artem Babenko
类目: Machine Learning (cs.LG)
*备注: Preprint, code at this https URL
[LG-113] Forecasting Seismic Waveforms: A Deep Learning Approach for Einstein Telescope
链接: https://arxiv.org/abs/2509.21446
作者: Waleed Esmail,Alexander Kappes,Stuart Russell,Christine Thomas
类目: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); General Relativity and Quantum Cosmology (gr-qc)
*备注: 8 pages, 3 figures, ICRC 2025 Proceedings
[LG-114] Null-Space Filtering for Data-Free Continual Model Merging: Preserving Transparency Promoting Fidelity
链接: https://arxiv.org/abs/2509.21413
作者: Zihuan Qiu,Lei Wang,Yang Cao,Runtong Zhang,Bing Su,Yi Xu,Fanman Meng,Linfeng Xu,Qingbo Wu,Hongliang Li
类目: Machine Learning (cs.LG)
*备注:
[LG-115] Object Identification Under Known Dynamics: A PIRNN Approach for UAV Classification ICML
链接: https://arxiv.org/abs/2509.21405
作者: Nyi Nyi Aung,Neil Muralles,Adrian Stein
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 2025 International Conference on Machine Learning and Applications (ICMLA)
[LG-116] Impact of Loss Weight and Model Complexity on Physics-Informed Neural Networks for Computational Fluid Dynamics
链接: https://arxiv.org/abs/2509.21393
作者: Yi En Chou,Te Hsin Liu,Chao An Lin
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
[LG-117] Spiking Neural Networks for Mental Workload Classification with a Multimodal Approach
链接: https://arxiv.org/abs/2509.21346
作者: Jiahui An,Sara Irina Fabrikant,Giacomo Indiveri,Elisa Donati
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 8 pages
点击查看摘要
Abstract:Accurately assessing mental workload is crucial in cognitive neuroscience, human-computer interaction, and real-time monitoring, as cognitive load fluctuations affect performance and decision-making. While Electroencephalography (EEG) based machine learning (ML) models can be used to this end, their high computational cost hinders embedded real-time applications. Hardware implementations of spiking neural networks (SNNs) offer a promising alternative for low-power, fast, event-driven processing. This study compares hardware compatible SNN models with various traditional ML ones, using an open-source multimodal dataset. Our results show that multimodal integration improves accuracy, with SNN performance comparable to the ML one, demonstrating their potential for real-time implementations of cognitive load detection. These findings position event-based processing as a promising solution for low-latency, energy efficient workload monitoring in adaptive closed-loop embedded devices that dynamically regulate cognitive load.
[LG-118] Discovering and Analyzing Stochastic Processes to Reduce Waste in Food Retail
链接: https://arxiv.org/abs/2509.21322
作者: Anna Kalenkova,Lu Xia,Dirk Neumann
类目: Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP)
*备注:
[LG-119] Linear Causal Representation Learning by Topological Ordering Pruning and Disentanglement
链接: https://arxiv.org/abs/2509.22553
作者: Hao Chen,Lin Liu,Yu Guang Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-120] Metrics for Parametric Families of Networks
链接: https://arxiv.org/abs/2509.22549
作者: Mario Gómez,Guanqun Ma,Tom Needham,Bei Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Metric Geometry (math.MG)
*备注:
[LG-121] Debiased Front-Door Learners for Heterogeneous Effects
链接: https://arxiv.org/abs/2509.22531
作者: Yonghan Jung
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages, 3 figures. Preprint. Code available at this https URL
[LG-122] Smoothing-Based Conformal Prediction for Balancing Efficiency and Interpretability
链接: https://arxiv.org/abs/2509.22529
作者: Mingyi Zheng,Hongyu Jiang,Yizhou Lu,Jiaye Teng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-123] CausalKANs: interpretable treatment effect estimation with Kolmogorov-Arnold networks
链接: https://arxiv.org/abs/2509.22467
作者: Alejandro Almodóvar,Patricia A. Apellániz,Santiago Zazo,Juan Parras
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-124] Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
链接: https://arxiv.org/abs/2509.22459
作者: Nikita Kornilov,David Li,Tikhon Mavrin,Aleksei Leonov,Nikita Gushchin,Evgeny Burnaev,Iaroslav Koshelev,Alexander Korotin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-125] Multidimensional Uncertainty Quantification via Optimal Transport
链接: https://arxiv.org/abs/2509.22380
作者: Nikita Kotelevskii,Maiya Goloburda,Vladimir Kondratyev,Alexander Fishkov,Mohsen Guizani,Eric Moulines,Maxim Panov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Most uncertainty quantification (UQ) approaches provide a single scalar value as a measure of model reliability. However, different uncertainty measures could provide complementary information on the prediction confidence. Even measures targeting the same type of uncertainty (e.g., ensemble-based and density-based measures of epistemic uncertainty) may capture different failure modes. We take a multidimensional view on UQ by stacking complementary UQ measures into a vector. Such vectors are assigned with Monge-Kantorovich ranks produced by an optimal-transport-based ordering method. The prediction is then deemed more uncertain than the other if it has a higher rank. The resulting VecUQ-OT algorithm uses entropy-regularized optimal transport. The transport map is learned on vectors of scores from in-distribution data and, by design, applies to unseen inputs, including out-of-distribution cases, without retraining. Our framework supports flexible non-additive uncertainty fusion (including aleatoric and epistemic components). It yields a robust ordering for downstream tasks such as selective prediction, misclassification detection, out-of-distribution detection, and selective generation. Across synthetic, image, and text data, VecUQ-OT shows high efficiency even when individual measures fail. The code for the method is available at: this https URL.
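To make the idea concrete: several scalar uncertainty scores are stacked into one vector per prediction, and an optimal-transport map to a fixed reference distribution turns those vectors into a single ordering. The sketch below is not the paper's VecUQ-OT algorithm; it is a generic illustration of entropy-regularized transport (plain Sinkhorn iterations) mapping 2-D score vectors onto a reference grid, with the norm of the mapped point used as a crude rank. All numbers and the choice of reference are assumptions.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.1, num_iters=500):
    """Entropy-regularized OT plan between histograms a and b for a given cost matrix."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(num_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
# Hypothetical 2-D uncertainty vectors, e.g. (ensemble disagreement, density-based score).
scores = rng.random((200, 2))

# Reference points on a grid in [0, 1]^2; distance from the origin defines the rank.
gx, gy = np.meshgrid(np.linspace(0, 1, 15), np.linspace(0, 1, 15))
ref = np.stack([gx.ravel(), gy.ravel()], axis=1)

cost = ((scores[:, None, :] - ref[None, :, :]) ** 2).sum(-1)
plan = sinkhorn_plan(np.full(len(scores), 1 / len(scores)),
                     np.full(len(ref), 1 / len(ref)), cost)

# Barycentric projection: where each score vector lands on the reference grid.
mapped = (plan @ ref) / plan.sum(axis=1, keepdims=True)
ranks = np.linalg.norm(mapped, axis=1)  # larger = "more uncertain" overall
print(ranks[:5])
```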
[LG-126] Multi-channel convolutional neural quantum embedding
链接: https://arxiv.org/abs/2509.22355
作者: Yujin Kim,Changjae Im,Taehyun Kim,Tak Hur,Daniel K. Park
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 20 pages, 7 figures
[LG-127] Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression
链接: https://arxiv.org/abs/2509.22341
作者: Anvit Garg,Sohom Bhattacharya,Pragya Sur
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 28 pages, 2 figures
[LG-128] Incorporating priors in learning: a random matrix study under a teacher-student framework
链接: https://arxiv.org/abs/2509.22124
作者: Malik Tiomoko,Ekkehard Schnoor
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures
点击查看摘要
Abstract:Regularized linear regression is central to machine learning, yet its high-dimensional behavior with informative priors remains poorly understood. We provide the first exact asymptotic characterization of training and test risks for maximum a posteriori (MAP) regression with Gaussian priors centered at a domain-informed initialization. Our framework unifies ridge regression, least squares, and prior-informed estimators, and – using random matrix theory – yields closed-form risk formulas that expose the bias-variance-prior tradeoff, explain double descent, and quantify prior mismatch. We also identify a closed-form minimizer of test risk, enabling a simple estimator of the optimal regularization parameter. Simulations confirm the theory with high accuracy. By connecting Bayesian priors, classical regularization, and modern asymptotics, our results provide both conceptual clarity and practical guidance for learning with structured prior knowledge.
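The estimator being analyzed has a simple closed form: with a Gaussian prior centered at a domain-informed initialization w_prior, the MAP solution minimizes ||y - Xw||^2 + \lambda ||w - w_prior||^2, which recovers least squares as \lambda tends to 0 and ordinary ridge when w_prior = 0. A small numerical sketch (the synthetic data and the choice of \lambda are ours, not the paper's experiments):

```python
import numpy as np

def map_ridge(X, y, w_prior, lam=1.0):
    """MAP estimate for linear regression with a Gaussian prior centered at w_prior."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_prior)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50)

# An informative (slightly misspecified) prior center versus a zero-centered one.
w_informed = map_ridge(X, y, w_prior=w_true + 0.05 * rng.normal(size=10), lam=5.0)
w_ridge = map_ridge(X, y, w_prior=np.zeros(10), lam=5.0)
print(np.linalg.norm(w_informed - w_true), np.linalg.norm(w_ridge - w_true))
```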
[LG-129] Direct Bias-Correction Term Estimation for Propensity Scores and Average Treatment Effect Estimation
链接: https://arxiv.org/abs/2509.22122
作者: Masahiro Kato
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
[LG-130] Exploring the Early Universe with Deep Learning
链接: https://arxiv.org/abs/2509.22018
作者: Emmanuel de Salis,Massimo De Santis,Davide Piras,Sambit K. Giri,Michele Bianco,Nicolas Cerardi,Philipp Denzel,Merve Selcuk-Simsek,Kelley M. Hess,M. Carmen Toribio,Franz Kirsten,Hatem Ghorbel
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: EPIA 2025 preprint version, 12 pages, 3 figures
[LG-131] A Random Matrix Perspective of Echo State Networks: From Precise Bias–Variance Characterization to Optimal Regularization
链接: https://arxiv.org/abs/2509.22011
作者: Yessin Moakher,Malik Tiomoko,Cosme Louart,Zhenyu Liao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing
[LG-132] A Nonparametric Discrete Hawkes Model with a Collapsed Gaussian-Process Prior
链接: https://arxiv.org/abs/2509.21996
作者: Trinnhallen Brisley,Gordon Ross,Daniel Paulin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Hawkes process models are used in settings where past events increase the likelihood of future events occurring. Many applications record events as counts on a regular grid, yet discrete-time Hawkes models remain comparatively underused and are often constrained by fixed-form baselines and excitation kernels. In particular, there is a lack of flexible, nonparametric treatments of both the baseline and the excitation in discrete time. To this end, we propose the Gaussian Process Discrete Hawkes Process (GP-DHP), a nonparametric framework that places Gaussian process priors on both the baseline and the excitation and performs inference through a collapsed latent representation. This yields smooth, data-adaptive structure without prespecifying trends, periodicities, or decay shapes, and enables maximum a posteriori (MAP) estimation with near-linear-time (O(T\log T)) complexity. A closed-form projection recovers interpretable baseline and excitation functions from the optimized latent trajectory. In simulations, GP-DHP recovers diverse excitation shapes and evolving baselines. In case studies on U.S. terrorism incidents and weekly Cryptosporidiosis counts, it improves test predictive log-likelihood over standard parametric discrete Hawkes baselines while capturing bursts, delays, and seasonal background variation. The results indicate that flexible discrete-time self-excitation can be achieved without sacrificing scalability or interpretability.
[LG-133] Sequential 1-bit Mean Estimation with Near-Optimal Sample Complexity
链接: https://arxiv.org/abs/2509.21940
作者: Ivan Lau,Jonathan Scarlett
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
[LG-134] Error Analysis of Discrete Flow with Generator Matching
链接: https://arxiv.org/abs/2509.21906
作者: Zhengyan Wan,Yidong Ouyang,Qiang Yao,Liyan Xie,Fang Fang,Hongyuan Zha,Guang Cheng
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-135] Causal-EPIG: A Prediction-Oriented Active Learning Framework for CATE Estimation
链接: https://arxiv.org/abs/2509.21866
作者: Erdun Gao,Jake Fawkes,Dino Sejdinovic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-136] Multi-modal Bayesian Neural Network Surrogates with Conjugate Last-Layer Estimation
链接: https://arxiv.org/abs/2509.21711
作者: Ian Taylor,Juliane Mueller,Julie Bessac
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35 pages including references and appendix, 5 figures
[LG-137] SADA: Safe and Adaptive Inference with Multiple Black-Box Predictions
链接: https://arxiv.org/abs/2509.21707
作者: Jiawei Shan,Yiming Dong,Jiwei Zhao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
[LG-138] HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech
链接: https://arxiv.org/abs/2509.21676
作者: Aurosweta Mahapatra,Ismail Rasim Ulgen,Berrak Sisman
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to IEEE Transactions on Affective Computing
点击查看摘要
Abstract:Current anti-spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. Prosody is central to human expressiveness and emotion, and humans instinctively use prosodic cues such as F0 patterns and voiced/unvoiced structure to distinguish natural from synthetic speech. In this paper, we propose HuLA, a two-stage prosody-aware multi-task learning framework for spoof detection. In Stage 1, a self-supervised learning (SSL) backbone is trained on real speech with auxiliary tasks of F0 prediction and voiced/unvoiced classification, enhancing its ability to capture natural prosodic variation similar to human perceptual learning. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, leveraging prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out-of-domain dataset, including expressive, emotional, and cross-lingual attacks. These results demonstrate that explicit prosodic supervision, combined with SSL embeddings, substantially improves robustness against advanced synthetic speech attacks.
[LG-139] Automating Sensor Characterization with Bayesian Optimization
链接: https://arxiv.org/abs/2509.21661
作者: J. Cuevas-Zepeda,C. Chavez,J. Estrada,J. Noonan,B. D. Nord,N. Saffold,M. Sofo-Haro,R. Spinola e Castro,S. Trivedi
类目: Instrumentation and Detectors (physics.ins-det); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:
[LG-140] A regret minimization approach to fixed-point iterations
链接: https://arxiv.org/abs/2509.21653
作者: Joon Kwon
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
[LG-141] Automated Machine Learning Pipeline for Training and Analysis Using Large Language Models
链接: https://arxiv.org/abs/2509.21647
作者: Adam Lahouari,Jutta Rogal,Mark E. Tuckerman
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
[LG-142] Effective continuous equations for adaptive SGD: a stochastic analysis view
链接: https://arxiv.org/abs/2509.21614
作者: Luca Callisti,Marco Romito,Francesco Triggiano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
[LG-143] IndiSeek learns information-guided disentangled representations
链接: https://arxiv.org/abs/2509.21584
作者: Yu Gui,Cong Ma,Zongming Ma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-144] General Pruning Criteria for Fast SBL
链接: https://arxiv.org/abs/2509.21572
作者: Jakob Möderl,Erik Leitinger,Bernard Henri Fleury
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 5 pages, 2 figures, submitted to IEEE Signal Processing Letters
[LG-145] Data-driven approach to the design of complexing agents for trivalent transuranium elements
链接: https://arxiv.org/abs/2509.21362
作者: Kirill V. Karpov,Ivan S. Pikulin,Grigory V. Bokov,Artem A. Mitrofanov
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 9 pages, 7 figures,
点击查看摘要
Abstract:The properties of complexes with transuranium elements have long been the object of research in various fields of chemistry. However, their experimental study is complicated by their rarity, high cost and special conditions necessary for working with such elements, and the complexity of quantum chemical calculations does not allow their use for large systems. To overcome these problems, we used modern machine learning methods to create a novel neural network architecture that allows to use available experimental data on a number of elements and thus significantly improve the quality of the resulting models. We also described the applicability domain of the presented model and identified the molecular fragments that most influence the stability of the complexes.
[LG-146] Accurate typhoon intensity forecasts using a non-iterative spatiotemporal transformer model
链接: https://arxiv.org/abs/2509.21349
作者: Hongyu Qu,Hongxiong Xu,Lin Dong,Chunyi Xiang,Gaozhen Nie
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 41 pages, 5 figures in the text and 6 figures in the appendix. Submitted to npj Climate and Atmospheric Science
点击查看摘要
Abstract:Accurate forecasting of tropical cyclone (TC) intensity - particularly during periods of rapid intensification and rapid weakening - remains a challenge for operational meteorology, with high-stakes implications for disaster preparedness and infrastructure resilience. Recent advances in machine learning have yielded notable progress in TC prediction; however, most existing systems provide forecasts that degrade rapidly in extreme regimes and lack long-range consistency. Here we introduce TIFNet, a transformer-based forecasting model that generates non-iterative, 5-day intensity trajectories by integrating high-resolution global forecasts with a historical-evolution fusion mechanism. Trained on reanalysis data and fine-tuned with operational data, TIFNet consistently outperforms operational numerical models across all forecast horizons, delivering robust improvements across weak, strong, and super typhoon categories. In rapid intensity change regimes - long regarded as the most difficult to forecast - TIFNet reduces forecast error by 29-43% relative to current operational baselines. These results represent a substantial advance in artificial-intelligence-based TC intensity forecasting, especially under extreme conditions where traditional models consistently underperform.
[LG-147] Interpretable Spectral Features Predict Conductivity in Self-Driving Doped Conjugated Polymer Labs
链接: https://arxiv.org/abs/2509.21330
作者: Ankush Kumar Mishra,Jacob P. Mauthe,Nicholas Luke,Aram Amassian,Baskar Ganapathysubramanian
类目: Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注: 31 Pages, 19 Figures
信息检索
[IR-0] Your RAG is Unfair: Exposing Fairness Vulnerabilities in Retrieval-Augmented Generation via Backdoor Attacks EMNLP2025
链接: https://arxiv.org/abs/2509.22486
作者: Gaurav Bagwe,Saket S. Chaturvedi,Xiaolong Ma,Xiaoyong Yuan,Kuang-Ching Wang,Lan Zhang
类目: Information Retrieval (cs.IR); Cryptography and Security (cs.CR)
*备注: Accepted by EMNLP 2025
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) enhances factual grounding by integrating retrieval mechanisms with generative models but introduces new attack surfaces, particularly through backdoor attacks. While prior research has largely focused on disinformation threats, fairness vulnerabilities remain underexplored. Unlike conventional backdoors that rely on direct trigger-to-target mappings, fairness-driven attacks exploit the interaction between retrieval and generation models, manipulating semantic relationships between target groups and social biases to establish a persistent and covert influence on content generation. This paper introduces BiasRAG, a systematic framework that exposes fairness vulnerabilities in RAG through a two-phase backdoor attack. During the pre-training phase, the query encoder is compromised to align the target group with the intended social bias, ensuring long-term persistence. In the post-deployment phase, adversarial documents are injected into knowledge bases to reinforce the backdoor, subtly influencing retrieved content while remaining undetectable under standard fairness evaluations. Together, BiasRAG ensures precise target alignment over sensitive attributes, stealthy execution, and resilience. Empirical evaluations demonstrate that BiasRAG achieves high attack success rates while preserving contextual relevance and utility, establishing a persistent and evolving threat to fairness in RAG.
[IR-1] The system of processing and analysis of customer tracking data for customer journey research on the base of RFID technology
链接: https://arxiv.org/abs/2509.22162
作者: Marina Kholod
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注: 20 pages, in Russian language, 5 figures
[IR-2] Does Generative Retrieval Overcome the Limitations of Dense Retrieval?
链接: https://arxiv.org/abs/2509.22116
作者: Yingchen Zhang,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Yixing Fan,Xueqi Cheng
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Generative retrieval (GR) has emerged as a new paradigm in neural information retrieval, offering an alternative to dense retrieval (DR) by directly generating identifiers of relevant documents. In this paper, we theoretically and empirically investigate how GR fundamentally diverges from DR in both learning objectives and representational capacity. GR performs globally normalized maximum-likelihood optimization and encodes corpus and relevance information directly in the model parameters, whereas DR adopts locally normalized objectives and represents the corpus with external embeddings before computing similarity via a bilinear interaction. Our analysis suggests that, under scaling, GR can overcome the inherent limitations of DR, yielding two major benefits. First, with larger corpora, GR avoids the sharp performance degradation caused by the optimization drift induced by DR’s local normalization. Second, with larger models, GR’s representational capacity scales with parameter size, unconstrained by the global low-rank structure that limits DR. We validate these theoretical insights through controlled experiments on the Natural Questions and MS MARCO datasets, across varying negative sampling strategies, embedding dimensions, and model scales. But despite its theoretical advantages, GR does not universally outperform DR in practice. We outline directions to bridge the gap between GR’s theoretical potential and practical performance, providing guidance for future research in scalable and robust generative retrieval.
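The contrast drawn above is between how the two paradigms are normalized during training: dense retrieval typically optimizes a locally normalized softmax over bilinear query-document similarities (e.g. in-batch negatives), while generative retrieval maximizes the token-level likelihood of the relevant document identifier. The sketch below only illustrates those two standard training losses in simplified form; it is not the paper's analysis or implementation.

```python
import torch
import torch.nn.functional as F

def dense_retrieval_loss(q_emb, d_emb, temperature=0.05):
    """Locally normalized (in-batch softmax) loss over bilinear similarities.

    q_emb, d_emb: (batch, dim) query / positive-document embeddings; the other
    documents in the batch act as negatives.
    """
    scores = q_emb @ d_emb.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, labels)

def generative_retrieval_loss(logits, docid_tokens):
    """Token-level likelihood of the relevant document identifier.

    logits: (batch, seq_len, vocab); docid_tokens: (batch, seq_len) long tensor.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), docid_tokens.reshape(-1))
```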
[IR-3] GoalRank: Group-Relative Optimization for a Large Ranking Model
链接: https://arxiv.org/abs/2509.22046
作者: Kaike Zhang,Xiaobei Wang,Shuchang Liu,Hailan Yang,Xiang Li,Lantao Hu,Han Li,Qi Cao,Fei Sun,Kun Gai
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Mainstream ranking approaches typically follow a Generator-Evaluator two-stage paradigm, where a generator produces candidate lists and an evaluator selects the best one. Recent work has attempted to enhance performance by expanding the number of candidate lists, for example, through multi-generator settings. However, ranking involves selecting a recommendation list from a combinatorially large space. Simply enlarging the candidate set remains ineffective, and performance gains quickly saturate. At the same time, recent advances in large recommendation models have shown that end-to-end one-stage models can achieve promising performance with the expectation of scaling laws. Motivated by this, we revisit ranking from a generator-only one-stage perspective. We theoretically prove that, for any (finite Multi-)Generator-Evaluator model, there always exists a generator-only model that achieves strictly smaller approximation error to the optimal ranking policy, while also enjoying scaling laws as its size increases. Building on this result, we derive an evidence upper bound of the one-stage optimization objective, from which we find that one can leverage a reward model trained on real user feedback to construct a reference policy in a group-relative manner. This reference policy serves as a practical surrogate of the optimal policy, enabling effective training of a large generator-only ranker. Based on these insights, we propose GoalRank, a generator-only ranking framework. Extensive offline experiments on public benchmarks and large-scale online A/B tests demonstrate that GoalRank consistently outperforms state-of-the-art methods.
[IR-4] Effect of Model Merging in Domain-Specific Ad-hoc Retrieval CIKM2025
链接: https://arxiv.org/abs/2509.21966
作者: Taiga Sasaki,Takehiro Yamamoto,Hiroaki Ohshima,Sumio Fujita
类目: Information Retrieval (cs.IR)
*备注: Accepted at CIKM 2025, 5 pages
点击查看摘要
Abstract:In this study, we evaluate the effect of model merging in ad-hoc retrieval tasks. Model merging is a technique that combines the diverse characteristics of multiple models. We hypothesized that applying model merging to domain-specific ad-hoc retrieval tasks could improve retrieval effectiveness. To verify this hypothesis, we merged the weights of a source retrieval model and a domain-specific (non-retrieval) model using a linear interpolation approach. A key advantage of our approach is that it requires no additional fine-tuning of the models. We conducted two experiments each in the medical and Japanese domains. The first compared the merged model with the source retrieval model, and the second compared it with a LoRA fine-tuned model under both full and limited data settings for model construction. The experimental results indicate that model merging has the potential to produce more effective domain-specific retrieval models than the source retrieval model, and may serve as a practical alternative to LoRA fine-tuning, particularly when only a limited amount of data is available.
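The merging step itself, linear interpolation of two checkpoints that share an architecture, is only a few lines and needs no further training, which matches the abstract's point that no additional fine-tuning is required. A minimal sketch under those assumptions (parameter names must coincide; the interpolation weight alpha is a hyperparameter):

```python
import torch

def merge_linear(source_state, domain_state, alpha=0.5):
    """Weight-space merge: alpha * source + (1 - alpha) * domain, per parameter."""
    merged = {}
    for name, w_src in source_state.items():
        w_dom = domain_state[name]
        if torch.is_floating_point(w_src):
            merged[name] = alpha * w_src + (1.0 - alpha) * w_dom
        else:  # e.g. integer buffers: keep the source value
            merged[name] = w_src
    return merged

# Usage sketch with hypothetical models sharing one architecture:
# retriever.load_state_dict(merge_linear(retriever.state_dict(),
#                                        domain_model.state_dict(), alpha=0.7))
```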
[IR-5] SPELUNKER: Item Similarity Search Using Large Language Models and Custom K-Nearest Neighbors
链接: https://arxiv.org/abs/2509.21323
作者: Ana Rodrigues,João Mata,Rui Rego
类目: Information Retrieval (cs.IR)
*备注: 6 pages, 4 figures
[IR-6] Chronic Stress Immune Suppression and Cancer Occurrence: Unveiling the Connection using Survey Data and Predictive Models
链接: https://arxiv.org/abs/2509.22275
作者: Teddy Lazebnik,Vered Aharonson
类目: Applications (stat.AP); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Chronic stress was implicated in cancer occurrence, but a direct causal connection has not been consistently established. Machine learning and causal modeling offer opportunities to explore complex causal interactions between psychological chronic stress and cancer occurrences. We developed predictive models employing variables from stress indicators, cancer history, and demographic data from self-reported surveys, unveiling the direct and immune suppression mitigated connection between chronic stress and cancer occurrence. The models were corroborated by traditional statistical methods. Our findings indicated significant causal correlations between stress frequency, stress level and perceived health impact, and cancer incidence. Although stress alone showed limited predictive power, integrating socio-demographic and familial cancer history data significantly enhanced model accuracy. These results highlight the multidimensional nature of cancer risk, with stress emerging as a notable factor alongside genetic predisposition. These findings strengthen the case for addressing chronic stress as a modifiable cancer risk factor, supporting its integration into personalized prevention strategies and public health interventions to reduce cancer incidence.
附件下载
点击下载今日全部论文列表