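This post lists the latest papers retrieved from arXiv.org on 2025-09-17. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.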

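Overview (2025-09-17)

589 papers were updated today, including:

  • Natural Language Processing: 69 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 169 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 133 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 158 papers (Machine Learning (cs.LG))

Natural Language Processing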

[NLP-0] Do Natural Language Descriptions of Model Activations Convey Privileged Information?

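[Quick read]: This paper asks whether activation verbalization methods for current generative AI models genuinely reveal the internal workings of the target model. Existing methods use a second language model (the verbalizer LLM) to translate the internal activations of a target large language model (LLM) into natural language descriptions, with the aim of improving interpretability. The paper points out that these methods perform well on evaluation benchmarks yet may not use the target model's internal information at all, instead reflecting the verbalizer's own parametric knowledge. The key to the solution is to design controlled experiments and targeted benchmarks that distinguish whether a verbalization stems from the target model's activations or from the verbalizer's prior knowledge, establishing a stricter evaluation framework for judging whether verbalization methods offer real insight into the internal operations of LLMs.

Link: https://arxiv.org/abs/2509.13316
Authors: Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace
Affiliations: Northeastern University; Kempner Institute, Harvard University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 34 pages, 6 figures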

Abstract:Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

[NLP-1] ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

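[Quick read]: This paper addresses the limited exploration ability of large language model (LLM)-based web agents on complex, knowledge-intensive tasks, which is caused by context window limits. Specifically, in paradigms such as ReAct, complex queries that involve multiple entities, intertwined relationships, and high uncertainty require frequent search cycles that quickly exhaust the context budget and prevent the agent from reaching a complete solution. The key to the solution is ReSum, a new paradigm that periodically summarizes the interaction history (context summarization), compressing the growing interaction record into a compact reasoning state so that the agent stays aware of earlier findings while sidestepping context constraints. To adapt agents to this paradigm, the authors design ReSum-GRPO, which combines segmented trajectory training with advantage broadcasting so that agents become familiar with summary-conditioned reasoning, markedly improving long-horizon exploration and task performance.

Link: https://arxiv.org/abs/2509.13313
Authors: Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: this https URL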

Abstract:Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
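
The periodic summarization described in the abstract can be pictured as a small control loop. The following is a minimal sketch, not the authors' implementation: `run_resum`, `count_tokens`, the callbacks `step`, `summarize`, and `is_final`, and the word-count budget are all hypothetical names and simplifications introduced for illustration.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def run_resum(
    question: str,
    step: Callable[[str], str],
    summarize: Callable[[str], str],
    is_final: Callable[[str], bool],
    budget: int = 50,
    max_steps: int = 20,
) -> str:
    """ReAct-style loop that compresses the history once it exceeds a budget."""
    context = question
    for _ in range(max_steps):
        observation = step(context)  # hypothetical: one think/search/observe round
        if is_final(observation):
            return observation
        context = context + "\n" + observation
        if count_tokens(context) > budget:
            # Replace the growing history with a compact reasoning state and
            # continue from the question plus that summary.
            context = question + "\nSummary: " + summarize(context)
    return context
```

In a real agent, `step` and `summarize` would both be LLM calls; here they are left as plain callables so the loop can be exercised with stubs.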

[NLP-2] WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

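[Quick read]: This paper addresses the central challenge of open-ended deep research (OEDR): synthesizing structured, trustworthy, high-quality reports from vast amounts of web information. Current methods have two limitations: static research pipelines separate planning from evidence acquisition, and one-shot generation is prone to long-context problems such as "loss in the middle" and hallucination. The key to the solution is WebWeaver, a dual-agent framework that emulates the human research process. The planner works in an iterative loop that interleaves evidence acquisition with outline optimization, producing a comprehensive, source-grounded outline linked to a memory bank of evidence. The writer then uses hierarchical retrieval and writing, retrieving only the evidence needed for each section, which mitigates long-context problems. This design couples adaptive planning with focused synthesis and substantially improves report quality and reliability.

Link: https://arxiv.org/abs/2509.13312
Authors: Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: An agent system for open-ended deep research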

Abstract:This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like “loss in the middle” and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.

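[NLP-3] Towards General Agentic Intelligence via Environment Scaling

[Quick read]: This paper addresses a challenge in deploying large language models in real-world applications: how to improve an agent's ability to perform precise, robust function calling across diverse real-world API environments. The core problem is that current agents are limited by the narrowness of their training environments and generalize poorly across scenarios. The solution has two key parts: first, a scalable framework that automatically constructs fully simulated, heterogeneous environments to systematically broaden the space of function-calling scenarios; second, a two-phase fine-tuning strategy that first gives the agent fundamental agentic capabilities and then specializes it for specific domains. Experiments show that AgentScaler, the model trained with this approach, significantly improves function-calling performance, with strong results on the tau-bench, tau2-Bench, and ACEBench benchmarks.

Link: https://arxiv.org/abs/2509.13311
Authors: Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Affiliations: Tongyi Lab; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: this https URL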

Abstract:Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.

[NLP-4] Scaling Agents via Continual Pre-training

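[Quick read]: This paper addresses the poor performance of post-training approaches built on general-purpose foundation models in agentic tasks, particularly in open-source implementations. The core problem is the lack of robust agentic foundation models, which forces the model to learn diverse agentic behaviors and align them with expert demonstrations at the same time during post-training, creating optimization conflicts. The key to the solution is to introduce Agentic Continual Pre-training (Agentic CPT) into the training pipeline of deep research agents for the first time, in order to build strong agentic foundation models. Based on this approach, the authors develop a deep research agent model named AgentFounder, which reaches state-of-the-art performance on 10 benchmarks while retaining strong tool-use ability.

Link: https://arxiv.org/abs/2509.13310
Authors: Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: this https URL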

Abstract:Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retains strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

[NLP-5] WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents

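[Quick read]: This paper addresses two core problems that current AI agents face in deep research tasks: first, the "context suffocation" and noise contamination that traditional mono-contextual methods suffer because of context window limits; second, the lack of effective training data to support the leap from passive knowledge recall to active knowledge construction. The key to the solution is a new framework named WebResearcher with two core components: (1) WebResearcher itself, an iterative deep-research paradigm that models research as a Markov Decision Process and relieves context pressure by periodically consolidating findings and maintaining a focused workspace; (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, systematically moving models from knowledge retrieval toward knowledge construction. The approach significantly improves tool-use ability, naturally supports parallel thinking for concurrent multi-agent exploration, and achieves state-of-the-art performance on 6 challenging benchmarks.

Link: https://arxiv.org/abs/2509.13309
Authors: Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Affiliations: Tongyi Lab; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: this https URL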

Abstract:Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.

[NLP-6] WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

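[Quick read]: This paper addresses the difficulty open-source large language models (LLMs) have in transcending human cognitive limitations on complex information-seeking tasks, in particular their weak performance in highly uncertain information environments. The core challenge is the lack of systematic reasoning ability to navigate vast information landscapes and reduce uncertainty. The key to the solution is WebSailor, a complete post-training methodology that generates high-uncertainty tasks through structured sampling and information obfuscation, introduces an RFT cold start, and adopts an efficient agentic reinforcement learning algorithm, Duplicating Sampling Policy Optimization (DUPO). Together these substantially improve open-source agents on complex information-seeking tasks, bringing them close to or on par with proprietary agent systems such as DeepResearch.

Link: https://arxiv.org/abs/2509.13305
Authors: Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: this https URL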

Abstract:Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.

[NLP-7] ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement EMNLP2025

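[Quick read]: This paper addresses the low reasoning accuracy and poor interpretability of Large Vision-Language Models (LVLMs) on chart question answering (CQA), which arise when models attend to irrelevant regions of the chart. The key to the solution is ChartGaze, a new eye-tracking dataset of human gaze during chart reasoning, together with a gaze-guided attention refinement that aligns the model's image-text attention with human fixations. This improves both answer accuracy and attention alignment, with gains of up to 2.56 percentage points across multiple models.

Link: https://arxiv.org/abs/2509.13282
Authors: Ali Salamatian, Amirhossein Abaskohi, Wan-Cyuan Fan, Mir Rayat Imtiaz Hossain, Leonid Sigal, Giuseppe Carenini
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: EMNLP 2025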

Abstract:Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.

[NLP-8] RepIt: Representing Isolated Targets to Steer Language Models

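[Quick read]: This paper addresses the off-target side effects that activation steering methods often cause in large language models (LLMs), with the goal of enabling precise interventions on specific concepts and a more granular understanding of model behavior. The core solution is RepIt, a data-efficient framework for isolating purer concept-specific representations (concept vectors) so that an intervention acts only on the targeted concept and leaves the rest unaffected. Experiments show that, with as few as a dozen examples on a single A6000 GPU, RepIt localizes the corrective signal to roughly 100 to 200 neurons and selectively suppresses refusal on a targeted topic (for example, WMD-related questions) while safety benchmark scores are preserved, counteracting overgeneralization.

Link: https://arxiv.org/abs/2509.13281
Authors: Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang
Affiliations: University of California, Santa Cruz; University of California, Berkeley
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: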

Abstract:While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.

[NLP-9] HARMONIC: A Content-Centric Cognitive Robotic Architecture

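[Quick read]: This paper addresses three core problems that robots face in human-robot teams: data scarcity, insufficient explainability of decisions, and difficulty guaranteeing safety. To address them, the authors propose the HARMONIC cognitive-robotic architecture, whose key feature is the integration of semantic perception interpretation, human-like decision-making, and intentional language communication. This promotes transparency and trust, and the architecture is demonstrated both in a high-fidelity simulation environment and on physical robotic platforms.

Link: https://arxiv.org/abs/2509.13279
Authors: Sanjay Oruganti, Sergei Nirenburg, Marjorie McShane, Jesse English, Michael K. Roberts, Christian Arndt, Carlos Gonzalez, Mingyo Seo, Luis Sentis
Affiliations: Rensselaer Polytechnic Institute; University of Texas at Austin
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: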

Abstract:This paper introduces HARMONIC, a cognitive-robotic architecture designed for robots in human-robotic teams. HARMONIC supports semantic perception interpretation, human-like decision-making, and intentional language communication. It addresses the issues of safety and quality of results; aims to solve problems of data scarcity, explainability, and safety; and promotes transparency and trust. Two proof-of-concept HARMONIC-based robotic systems are demonstrated, each implemented in both a high-fidelity simulation environment and on physical robotic platforms.

[NLP-10] Evaluating LLM Alignment on Personality Inference from Real-World Interview Data

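[Quick read]: This paper addresses the limited ability of current large language models (LLMs) to identify human personality traits (the Big Five) in natural conversational settings; in particular, how well LLMs align with continuous personality assessments derived from real interactions has not been studied systematically. The key to the solution is a new benchmark dataset that pairs semi-structured interview transcripts with validated continuous Big Five scores, on which three paradigms are evaluated: zero-shot and chain-of-thought prompting, LoRA fine-tuning (applied to RoBERTa and Meta-LLaMA architectures), and regression on static embeddings (BERT and OpenAI text-embedding-3-small). All Pearson correlations between model predictions and ground-truth personality scores stay below 0.26, indicating a significant gap in how current LLMs understand personality. Chain-of-thought prompting brings only limited gains, suggesting that personality inference relies more on latent semantic representations than on explicit reasoning.

Link: https://arxiv.org/abs/2509.13244
Authors: Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures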

Abstract:Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM “personas” using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI’s text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
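
The headline metric in the abstract is the Pearson correlation between predicted and ground-truth trait scores. As a reminder of what that number measures, here is a minimal sketch in plain Python; `pearson_r` is a hypothetical helper name, and any scores passed to it in an example would be made up rather than taken from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means predictions rise and fall with the ground-truth scores, while values below 0.26, as reported in the abstract, indicate only a weak linear relationship.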

[NLP-11] Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter

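[Quick read]: This paper asks how participation in collective action is expressed in podcasts, an audio format, and how it can be identified and analyzed. Using the Black Lives Matter (BLM) movement as the case study, it also examines the role emotions play at different stages of participation. The key to the solution is a layered framework applied to the Structured Podcast Research Corpus (SPoRC) that classifies collective-action statements in podcasts into four categories (problem-solution, call-to-action, intention, and execution), combined with the detection of eight key emotions and their association with each stage of participation, revealing how emotional framing may depend on the format of the discussion.

Link: https://arxiv.org/abs/2509.13197
Authors: Theodora Moldovan, Arianna Pera, Davide Vega, Luca Maria Aiello
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 11 pages, 5 figures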

Abstract:We study how participation in collective action is articulated in podcast discussions, using the Black Lives Matter (BLM) movement as a case study. While research on collective action discourse has primarily focused on text-based content, this study takes a first step toward analyzing audio formats by using podcast transcripts. Using the Structured Podcast Research Corpus (SPoRC), we investigated spoken language expressions of participation in collective action, categorized as problem-solution, call-to-action, intention, and execution. We identified podcast episodes discussing racial justice after important BLM-related events in May and June of 2020, and extracted participatory statements using a layered framework adapted from prior work on social media. We examined the emotional dimensions of these statements, detecting eight key emotions and their association with varying stages of activism. We found that emotional profiles vary by stage, with different positive emotions standing out during calls-to-action, intention, and execution. We detected negative associations between collective action and negative emotions, contrary to theoretical expectations. Our work contributes to a better understanding of how activism is expressed in spoken digital discourse and how emotional framing may depend on the format of the discussion.

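[NLP-12] The Few-shot Dilemma: Over-prompting Large Language Models

[Quick read]: This paper addresses "over-prompting": in in-context few-shot learning, feeding a large language model (LLM) too many domain-specific examples can degrade performance. This challenges the conventional assumption that more relevant examples always improve model performance. The key to the solution is a prompting framework built on three standard few-shot selection methods (random sampling, semantic embedding, and TF-IDF vectors), with systematic experiments that identify the optimal number of examples for each LLM on real-world software requirement classification. The approach avoids over-prompting and, with fewer examples, surpasses the state of the art by 1% in classifying functional and non-functional requirements.

Link: https://arxiv.org/abs/2509.13196
Authors: Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler
Affiliations: Siemens AG; Technical University of Munich
Categories: Computation and Language (cs.CL)
Comments: accepted for the main track of FLLM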

Abstract:Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
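
To make the TF-IDF selection step concrete, the sketch below ranks a pool of labeled examples by TF-IDF cosine similarity to the query and keeps the top `k`. It is an illustrative assumption, not the paper's code: the function names, the smoothed IDF formula, and whitespace tokenization are choices made here, and the stratification mentioned in the abstract is not included.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors with a smoothed IDF (an assumed variant)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_few_shot(
    query: str, pool: List[Tuple[str, str]], k: int
) -> List[Tuple[str, str]]:
    """Return the k (text, label) pairs most similar to the query."""
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in ranked[:k]]
```

Since the abstract's point is that larger `k` is not always better, `k` would be swept on a validation set for each model rather than fixed in advance.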

[NLP-13] Textarium: Entangling Annotation, Abstraction and Argument IEEE-VIS2025

[Quick read]: This paper addresses the difficulty of visualizing and sharing interpretive processes in traditional text interpretation, and of connecting close reading with distant reading. The key to the solution is Textarium, a web-based interactive environment that dynamically integrates annotation, abstraction, and argumentation. Readers' interpretive actions are rendered as parameterized visualization states, and lightweight computational processing supports combined human and machine analysis of text, making interpretive processes in digital narratives more transparent and shareable.

Link: https://arxiv.org/abs/2509.13191
Authors: Philipp Proff, Marian Dörk
Affiliations: Technical University Berlin; Berlin University of the Arts; University of Applied Sciences Potsdam
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: This is the authors’ version of the article presented at VIS4DH and published in the proceedings of IEEE VIS 2025

Abstract:We present a web-based environment that connects annotation, abstraction, and argumentation during the interpretation of text. As a visual interface for scholarly reading and writing, Textarium combines human analysis with lightweight computational processing to bridge close and distant reading practices. Readers can highlight text, group keywords into concepts, and embed these observations as anchors in essays. The interface renders these interpretive actions as parameterized visualization states. Through a speculative design process of co-creative and iterative prototyping, we developed a reading-writing approach that makes interpretive processes transparent and shareable within digital narratives.

[NLP-14] LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals

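[Quick read]: This paper addresses the obstacle that hallucination poses to deploying large language models (LLMs) in reliability-sensitive settings. Existing detection methods fall into two groups: factuality checking, which is limited by the coverage of external knowledge, and static hidden-state analysis, which cannot capture deviations in reasoning dynamics; as a result, their effectiveness and robustness are limited. The key to the solution is HSAD (Hidden Signal Analysis-based Detection). Its core idea is to construct hidden-layer signals by sampling activations across layers during autoregressive generation, apply the Fast Fourier Transform (FFT) to obtain a frequency-domain representation, and take the strongest non-DC frequency component as the spectral feature, while using the autoregressive nature of LLMs to choose optimal observation points. On multiple benchmarks, including TruthfulQA, the method improves on prior state-of-the-art methods by more than 10 percentage points, establishing a new paradigm that combines reasoning-process modeling with frequency-domain analysis.

Link: https://arxiv.org/abs/2509.13154
Authors: Jinxin Li, Gang Tu, ShengYu Cheng, Junjie Hu, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan
Affiliations: Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: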

Abstract:Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, that fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
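
The spectral feature named in the abstract, the strongest non-DC frequency component, can be illustrated on a toy signal. The sketch below makes two loud assumptions: the input is simply a list of scalars (how HSAD actually builds its hidden-layer signal is not specified here), and a naive O(n²) discrete Fourier transform stands in for the FFT so the example needs only the standard library. The function names are hypothetical.

```python
import cmath
import math
from typing import List, Tuple


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Magnitudes of DFT bins 0..n//2 of a real signal (naive DFT, not FFT)."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


def strongest_non_dc(signal: List[float]) -> Tuple[int, float]:
    """Return (bin index, magnitude) of the largest bin, skipping bin 0 (DC)."""
    magnitudes = dft_magnitudes(signal)
    if len(magnitudes) < 2:
        raise ValueError("signal too short to have a non-DC component")
    k = max(range(1, len(magnitudes)), key=lambda i: magnitudes[i])
    return k, magnitudes[k]


# Toy signal: a constant offset (DC) plus a cosine with 2 cycles over 16 samples.
toy_signal = [3.0 + math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
peak_bin, peak_magnitude = strongest_non_dc(toy_signal)
```

Skipping bin 0 matters because the constant offset would otherwise dominate the spectrum; for real workloads, `numpy.fft.rfft` computes the same bins far more efficiently.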

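[NLP-15] Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning IJCNN2025

[Quick read]: This paper addresses the difficulty of grounding large language models (LLMs) in complex, adversarial long-horizon environments. Existing methods fall into two groups: using the LLM as a policy that directly generates low-level feasible actions, which is unreliable, and using the LLM to generate high-level tasks or language guidance that triggers action generation, which relies heavily on expert experience to translate into concrete action sequences. To overcome these limits, the authors propose the Plan with Language, Act with Parameter (PLAP) planning framework. Its core consists of a skill library of environment-specific parameterized skills, an LLM-based skill planner, and a skill executor that maps parameterized skills to executable action sequences, enabling reliable decisions and actions by LLMs on long-horizon tasks.

Link: https://arxiv.org/abs/2509.13127
Authors: Sijia Cui, Shuai Xu, Aiyao He, Yanna Wang, Bo Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Nanjing Artificial Intelligence Research of IA; University of Chinese Academy of Sciences, Nanjing; Nanjing University of Information Science & Technology
Categories: Computation and Language (cs.CL)
Comments: Accepted to IJCNN 2025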

Abstract:Recent advancements in Large Language Models(LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at this https URL.
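
The three components listed in the abstract (skill library, skill planner, skill executor) suggest a simple division of labor, sketched below. Everything concrete here is hypothetical: the skill names, their parameters, the action strings, and the rule-based `plan_with_stub_llm` that stands in for an LLM call. None of it reflects the MicroRTS API or the authors' code.

```python
from typing import Callable, Dict, List, Tuple


# Skill library: each parameterized skill expands into low-level actions.
def harvest(worker_count: int) -> List[str]:
    return [f"worker_{i}:gather" for i in range(worker_count)]


def attack(unit_type: str, count: int) -> List[str]:
    return [f"{unit_type}_{i}:attack_nearest" for i in range(count)]


SKILLS: Dict[str, Callable[..., List[str]]] = {
    "harvest": harvest,
    "attack": attack,
}

Plan = List[Tuple[str, dict]]


def plan_with_stub_llm(observation: str) -> Plan:
    # Stand-in for the LLM planner: it only chooses skills and parameters,
    # never raw actions.
    if "enemy" in observation:
        return [("attack", {"unit_type": "heavy", "count": 2})]
    return [("harvest", {"worker_count": 2})]


def execute(plan: Plan) -> List[str]:
    """Skill executor: turn (skill name, parameters) pairs into actions."""
    actions: List[str] = []
    for name, params in plan:
        if name not in SKILLS:
            raise KeyError(f"unknown skill: {name}")
        actions.extend(SKILLS[name](**params))
    return actions
```

Keeping the planner's output restricted to skill names and parameters means malformed plans can be rejected at the executor boundary instead of producing invalid game actions.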

[NLP-16] Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

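[Quick read]: This paper addresses the difficulty of aligning large language models (LLMs) with complex qualitative goals, such as pedagogical soundness, when generating text. Standard reinforcement learning methods rely on slow, expensive LLM-as-a-judge evaluation or on brittle keyword-matching metrics such as ROUGE, which fail to capture the semantic essence of a high-quality explanation. The key to the solution is to use a lightweight encoder-only transformer as a semantic reward model within the Group Relative Policy Optimisation (GRPO) framework. It provides a dense, semantically rich reward signal from the cosine similarity between the generated explanation and a reference answer, guiding the policy toward explanations that are not only factually correct but also structurally and conceptually closer to expert reasoning.

Link: https://arxiv.org/abs/2509.13081
Authors: Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, Roberto Marras
Affiliations: AI Sparks; University of Graz; Foyer Group; Onepix Academy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: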

点击查看摘要

Abstract:While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
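
A rough sketch of the reward described above: score each sampled explanation by cosine similarity to the reference, then normalise the scores within the sampled group, as GRPO is commonly described. The bag-of-words `embed` function is a toy stand-in for the encoder-only transformer, and the function names and example strings are assumptions made for illustration, not the authors' code.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_rewards(candidates: List[str], reference: str) -> List[float]:
    """One similarity-based reward per sampled explanation."""
    ref = embed(reference)
    return [cosine(embed(c), ref) for c in candidates]


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Standardise rewards within the group (subtract the mean, divide by the std)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


reference = "the heart pumps blood through the body"
candidates = ["the heart pumps blood through the body", "plants need sunlight"]
rewards = semantic_rewards(candidates, reference)
advantages = group_relative_advantages(rewards)
```

With a real encoder the reward would vary smoothly with meaning rather than with word overlap, which is the property the abstract relies on.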

[NLP-17] When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning

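[Quick Read]: This paper addresses the poor alignment of current generative AI models in bidirectional reasoning settings; existing methods rely mainly on unidirectional supervised fine-tuning (SFT) and overlook the intricate interplay between different reasoning paths. The key to the solution is constructing r1k, a high-quality reverse reasoning dataset obtained by inverting 1,000 forward examples from s1k, and systematically evaluating how SFT and Direct Preference Optimization (DPO) align models under bidirectional reasoning objectives. Experiments show that SFT on r1k alone yields an accuracy gain of 1.6%–6.8% over s1k, whereas mixing forward and reverse data during SFT weakens the distinction between reasoning directions. DPO partially recovers that distinction but also suppresses less preferred reasoning paths by shifting probability mass toward irrelevant outputs. These findings show that mixed reasoning data introduce conflicting supervision signals and underline the need for robust, direction-aware alignment strategies.

Link: https://arxiv.org/abs/2509.13079
Authors: Mengyi Deng,Xin Li,Tingyu Zhu,Zhicheng Yang,Zhijiang Guo,Wei Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: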

Abstract:Existing work has shown that o1-level performance can be achieved with limited data distillation, but most existing methods focus on unidirectional supervised fine-tuning (SFT), overlooking the intricate interplay between diverse reasoning patterns. In this paper, we construct r1k, a high-quality reverse reasoning dataset derived by inverting 1,000 forward examples from s1k, and examine how SFT and Direct Preference Optimization (DPO) affect alignment under bidirectional reasoning objectives. SFT on r1k yields a 1.6%–6.8% accuracy improvement over s1k across evaluated benchmarks. However, naively mixing forward and reverse data during SFT weakens the directional distinction. Although DPO can partially recover this distinction, it also suppresses less preferred reasoning paths by shifting the probability mass toward irrelevant outputs. These findings suggest that mixed reasoning data introduce conflicting supervision signals, underscoring the need for robust and direction-aware alignment strategies.

[NLP-18] Multi-Model Synthetic Training for Mission-Critical Small Language Models

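[Quick Read]: This paper addresses the limited performance of generative AI in specialized domains such as maritime intelligence, where domain-specific training data are scarce and complex. The core of the solution is to use large language models (LLMs) as one-time teachers: multi-model generation (GPT-4o and o3-mini) transforms 3.2 billion Automatic Identification System (AIS) vessel-tracking records into 21,543 synthetic question-answer pairs, yielding a synthetic dataset intended to be high quality and to limit overfitting. A small model (Qwen2.5-7B) fine-tuned on this data reaches 75% accuracy on maritime tasks at a cost 261 times lower than running a large model directly for inference. This supports the claim that a properly fine-tuned small model can approach the accuracy of expensive large models, and offers a cost-effective, reproducible development paradigm for specialized domains that lack manual annotation resources.

Link: https://arxiv.org/abs/2509.13047
Authors: Nolan Platt,Pragyansmita Nayak
Affiliations: Virginia Tech; Hitachi Vantara Federal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages. Accepted as a full paper to the 3rd International Conference on Foundation and Large Language Models (IEEE FLLM) 2025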

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models - when fine-tuned properly - can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.

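[NLP-19] SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data

[Quick Read]: This paper addresses the shortcomings of existing sitting-posture monitoring systems, namely coarse-grained recognition and limited semantic expressiveness, which restrict the generation of personalized health feedback. The core solution is SitLLM, a lightweight multimodal framework that fuses pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and health-oriented response generation. It has three modules: a Gaussian-Robust Sensor Embedding Module that makes feature extraction from pressure maps more robust, a Prompt-Driven Cross-Modal Alignment Module that maps sensor embeddings into the LLM's semantic space, and a Multi-Context Prompt Module that fuses multi-level information to improve instruction comprehension.

Link: https://arxiv.org/abs/2509.12994
Authors: Jian Gao,Fufangchen Zhao,Yiyang Zhang,Danfeng Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: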

Abstract:Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM’s semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.

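[NLP-20] Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews EMNLP2025

[Quick Read]: This paper addresses the adaptation of wine reviews across cultures, that is, how to preserve the information in the original while fitting the target culture's taste preferences and culture-specific flavor descriptors, rather than translating literally. The key to the solution is building the first parallel corpus of professional wine reviews (8k Chinese and 16k English-language reviews) and proposing three culture-oriented evaluation criteria, Cultural Proximity, Cultural Neutrality, and Cultural Genuineness, to assess how naturally a translated review reads to target-culture readers. Comparing neural machine translation baselines with state-of-the-art large language models (LLMs), the study shows that current models struggle to capture cultural nuance, which underlines both the importance and the difficulty of culture-aware translation.

Link: https://arxiv.org/abs/2509.12961
Authors: Chenye Zou,Xingyue Wen,Tianyi Hu,Qian Janice Wang,Daniel Hershcovich
Affiliations: University of Copenhagen; Aarhus University
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings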

Abstract:Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria – Cultural Proximity, Cultural Neutrality, and Cultural Genuineness – to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.

[NLP-21] Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

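[Quick Read]: This paper examines whether low-rank adaptation (LoRA) methods are suitable for the pretraining stage of small language models (SLMs), focusing on the performance and learning dynamics of the ReLoRA extension. The key to the approach is a systematic evaluation of ReLoRA on SLMs with 11M to 66M parameters. Ablation experiments show that it underperforms standard training on loss, Paloma perplexity, and BLiMP, and that it reinforces the rank deficiencies inherent in small models, indicating that low-rank update strategies do not transfer easily to pretraining in low-compute settings.

Link: https://arxiv.org/abs/2509.12960
Authors: Yuval Weiss,David Demitri Africa,Paula Buttery,Richard Diehl Martinez
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 Pages, 6 Tables, 8 Figures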

Abstract:Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.
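
As background for the rank discussion, the toy example below shows why ReLoRA-style training is usually described as merging a low-rank adapter into the weights and then restarting it: each merged update has rank 1, yet the accumulated weight can reach rank 2. The abstract does not spell out this mechanism, so treat the merge-and-reinitialise loop as an assumption about the method; the 2x2 matrices and vectors are made up purely for illustration.

```python
from typing import List

Matrix = List[List[float]]


def outer(u: List[float], v: List[float]) -> Matrix:
    """Rank-1 update u v^T, playing the role of a LoRA product B A with rank 1."""
    return [[ui * vj for vj in v] for ui in u]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, i.e. merging an adapter into the weight matrix."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def det2(m: Matrix) -> float:
    """Determinant of a 2x2 matrix; non-zero means full rank (rank 2)."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


weight: Matrix = [[0.0, 0.0], [0.0, 0.0]]  # toy weight, zero-initialised for clarity

# Cycle 1: train a rank-1 adapter, then merge it into the weight.
delta1 = outer([1.0, 0.0], [1.0, 0.0])
weight = add(weight, delta1)
det_after_first = det2(weight)  # still rank-deficient

# Cycle 2: reinitialise the adapter, train again, merge again.
delta2 = outer([0.0, 1.0], [0.0, 1.0])
weight = add(weight, delta2)
det_after_second = det2(weight)  # full rank after two merges
```

Whether successive updates actually explore new directions in practice is an empirical question, which is where the paper's finding about reinforced rank deficiency comes in.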

[NLP-22] Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

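[Quick Read]: This paper addresses the fact that existing methods extract only fragmented procedural components and fail to capture complete research workflows, which hinders research reproducibility and slows the "AI for Science" paradigm. The key to the solution is an end-to-end framework that builds structured research workflows automatically by mining full-text academic papers. It first uses SciBERT-based Positive-Unlabeled (PU) learning to identify workflow-descriptive paragraphs (F1-score of 0.9772), then uses Flan-T5 with prompt learning to generate workflow phrases from those paragraphs (ROUGE-1, ROUGE-2, and ROUGE-L of 0.4543, 0.2877, and 0.4427), then classifies the phrases with few-shot ChatGPT (precision of 0.958), and finally maps them to their locations in the source text to produce visual flowcharts. This enables a quantitative analysis of how methodology in natural language processing (NLP) has evolved over the past two decades.

Link: https://arxiv.org/abs/2509.12955
Authors: Heng Zhang,Chengzhi Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: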

Abstract:The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their document locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: this https URL.

[NLP-23] Jailbreaking Large Language Models Through Content Concretization

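[Quick Read]: This paper addresses the vulnerability of the safety mechanisms of large language models (LLMs) to jailbreaking, in which malicious requests bypass a model's built-in safety filters. Its core contribution is a new jailbreaking technique called Content Concretization (CC), a two-stage iterative process: a lower-tier model with less restrictive safety filters first generates a preliminary response, and a higher-tier model then processes the original prompt together with that preliminary output, gradually turning an abstract malicious request into a concrete, executable implementation. Experiments show that after three refinement iterations the jailbreak Success Rate (SR) rises from 7% (no refinement) to 62% at a cost of 7.5 cents per prompt, and that LLM evaluators consistently rate the refined outputs as more malicious and technically superior, exposing critical vulnerabilities in current LLM safety frameworks.

Link: https://arxiv.org/abs/2509.12937
Authors: Johan Wahréus,Ahmed Hussain,Panos Papadimitratos
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for presentation in the Conference on Game Theory and AI for Security (GameSec) 2025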

Abstract:Large Language Models (LLMs) are increasingly deployed for task automation and content generation, yet their safety mechanisms remain vulnerable to circumvention through different jailbreaking techniques. In this paper, we introduce Content Concretization (CC), a novel jailbreaking technique that iteratively transforms abstract malicious requests into concrete, executable implementations. CC is a two-stage process: first, generating initial LLM responses using lower-tier models with less constrained safety filters, then refining them through higher-tier models that process both the preliminary output and original prompt. We evaluate our technique using 350 cybersecurity-specific prompts, demonstrating substantial improvements in jailbreak Success Rates (SRs), increasing from 7% (no refinements) to 62% after three refinement iterations, while maintaining a cost of 7.5 cents per prompt. Comparative A/B testing across nine different LLM evaluators confirms that outputs from additional refinement steps are consistently rated as more malicious and technically superior. Moreover, manual code analysis reveals that generated outputs execute with minimal modification, although optimal deployment typically requires target-specific fine-tuning. With eventual improved harmful code generation, these results highlight critical vulnerabilities in current LLM safety frameworks.

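[NLP-24] Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety

[Quick Read]: This paper addresses the conflict among multiple objectives in aligning large language models (LLMs), namely how to trade off factuality, safety, conciseness, proactivity, and diversity. Existing studies usually focus on a single technique or a specific dimension and lack a systematic assessment of how different alignment methods perform overall. The key to the solution is a unified evaluation framework that uses a purpose-built LLM-as-Judge prompt, together with in-distribution and out-of-distribution datasets, to compare four mainstream alignment methods (PPO, DPO, ORPO, and KTO) quantitatively across the five dimensions. This reveals each method's strengths and trade-offs and provides empirical grounding for developing more balanced and reliable LLMs.

Link: https://arxiv.org/abs/2509.12936
Authors: Denis Janiak,Julia Moska,Dawid Motyka,Karolina Seweryn,Paweł Walkowiak,Bartosz Żuk,Arkadiusz Janz
Affiliations: Wroclaw University of Science and Technology (WUST); National Research Institute (NASK); Institute of Computer Science, Polish Academy of Sciences (IPI PAN)
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: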

Abstract:Large language models (LLMs) require careful alignment to balance competing objectives - factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.

[NLP-25] All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning EMNLP2025

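[Quick Read]: This paper addresses the lack of reliable confidence estimation for large language models (LLMs) on reasoning tasks. Existing methods are designed mainly for factual question answering (factual QA) and generalize poorly to tasks that require logical reasoning. The key to the solution is a set of training-free, graph-based confidence estimation methods: reasoning paths are modeled as directed graphs, and confidence is quantified from graph properties such as centrality, path convergence, and path weighting. This yields more accurate confidence estimates on several reasoning datasets and improves downstream task performance.

Link: https://arxiv.org/abs/2509.12908
Authors: Caiqi Zhang,Chang Shu,Ehsan Shareghi,Nigel Collier
Affiliations: University of Cambridge; Monash University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Main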

Abstract:Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
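
To make the idea of graph-based confidence concrete, the sketch below merges several sampled reasoning chains into one directed graph and derives two simple signals: the share of paths that converge on each answer, and the in-degree of an answer node as a crude centrality measure. These particular scoring rules, the function names, and the example chains are assumptions chosen for illustration; the abstract names the graph properties but not the exact estimators.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_graph(paths: List[List[str]]) -> Dict[str, Set[str]]:
    """Merge reasoning chains into a directed graph of step -> next-step edges."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            graph[src].add(dst)
    return graph


def convergence_confidence(paths: List[List[str]]) -> Dict[str, float]:
    """Fraction of sampled paths that terminate at each final answer."""
    counts: Dict[str, int] = defaultdict(int)
    for path in paths:
        counts[path[-1]] += 1
    return {answer: n / len(paths) for answer, n in counts.items()}


def in_degree(graph: Dict[str, Set[str]], node: str) -> int:
    """Number of distinct predecessor steps leading into a node."""
    return sum(1 for targets in graph.values() if node in targets)


# Three hypothetical reasoning chains for the same question "Q".
paths = [
    ["Q", "x = 3", "x + 4 = 7", "answer: 7"],
    ["Q", "4 + 3", "answer: 7"],
    ["Q", "x = 3", "x * 4 = 12", "answer: 12"],
]
graph = build_graph(paths)
confidence = convergence_confidence(paths)
```

Here two of the three chains reach "answer: 7" through different intermediate steps, so both signals favour it over "answer: 12".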

[NLP-26] Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings EMNLP2025

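[Quick Read]: This paper addresses the limited performance of large language models (LLMs) on text embedding tasks caused by differences in data distribution and a mismatch in training mechanisms. Specifically, existing approaches such as LoRA fine-tuning are constrained by the data gap and the training-objective gap between LLMs and embedding models (LLMs use a causal mask with token-level loss, whereas embedding models use a bidirectional mask with sentence-level loss). The solution has three key parts. First, news data and multilingual pairs are added during pretraining to bridge the data gap, and a cross-lingual retrieval dataset is built to help the model integrate embeddings across languages. Second, a soft-masking mechanism gradually transitions from the causal mask to the bidirectional mask so that the model learns more comprehensive representations. Third, a dynamic hard negative mining strategy keeps introducing more challenging negatives to improve embedding quality. With only about 1.4B parameters, Conan-embedding-v2 achieves state-of-the-art (SOTA) performance on MTEB and Chinese MTEB.

Link: https://arxiv.org/abs/2509.12892
Authors: Shiyu Li,Yang Tang,Ruijie Liu,Shi-Zhe Chen,Xi Chen
Affiliations: Basic Algorithm Center, PCG, Tencent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Oral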

Abstract:Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
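
One way to picture a gradual transition from a causal to a bidirectional attention mask is a weight matrix whose lower triangle stays at 1 while the upper triangle (future positions) is scaled by a coefficient that grows from 0 to 1 over training. The linear schedule and the function below are assumptions made for illustration; the abstract does not state the actual schedule or how the soft mask enters the attention computation.

```python
from typing import List


def soft_mask(seq_len: int, step: int, total_steps: int) -> List[List[float]]:
    """Attention weights: 1.0 for past/current tokens, alpha for future tokens.

    alpha rises linearly from 0 (pure causal mask) to 1 (fully bidirectional mask).
    """
    alpha = min(1.0, max(0.0, step / total_steps))
    return [[1.0 if j <= i else alpha for j in range(seq_len)] for i in range(seq_len)]


causal = soft_mask(3, step=0, total_steps=10)          # start of the transition
halfway = soft_mask(3, step=5, total_steps=10)         # future tokens weighted 0.5
bidirectional = soft_mask(3, step=10, total_steps=10)  # end of the transition
```

At step 0 this reproduces the usual lower-triangular causal mask, and at the final step every position can attend to every other position.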

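[NLP-27] The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations

[Quick Read]: This paper addresses the inaccurate estimation of how difficult an input question is for a large language model (LLM), which affects both the reliability of performance evaluation and the effectiveness of adaptive inference strategies. Existing methods typically rely on repeated sampling, auxiliary models, or fine-tuning the target model, and therefore suffer from high computational cost or weak generality. The key to the solution is to use only the hidden representations produced by the target LLM: the token-level generation process is modeled as a Markov chain, and a value function is defined to estimate the expected output quality given a hidden state. This gives efficient and accurate difficulty estimates from the initial hidden state alone, without generating any output tokens. The estimates are then applied to adaptive reasoning strategies such as Self-Consistency, Best-of-N, and Self-Refine, improving inference efficiency while reducing the number of generated tokens.

Link: https://arxiv.org/abs/2509.12886
Authors: Yubo Zhu,Dongrui Liu,Zecheng Lin,Wei Tong,Sheng Zhong,Jing Shao
Affiliations: Nanjing University; Shanghai Artificial Intelligence Laboratory; Xidian University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: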

Abstract:Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.

[NLP-28] Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents

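[Quick Read]: This paper addresses the lack of a systematic evaluation of Large Vision-Language Models (LVLMs) on Multimedia Event Extraction (M2E2). Although LVLMs show strong cross-modal understanding, their actual performance on M2E2 has not been fully explored. The key contribution is the first comprehensive evaluation of representative LVLMs (such as DeepSeek-VL2 and the Qwen-VL series), covering text-only, image-only, and cross-media subtasks under both few-shot prompting and fine-tuning. The study finds that (1) few-shot LVLMs do well on visual tasks but fall short on textual tasks; (2) fine-tuning with LoRA (Low-Rank Adaptation) substantially improves performance; and (3) combining modalities produces strong synergy, with the best results in cross-modal settings. This analysis gives empirical grounding and directions for improving future M2E2 systems.

Link: https://arxiv.org/abs/2509.12876
Authors: Fuyu Xing,Zimu Wang,Wei Wang,Haiyang Zhang
Affiliations: School of Advanced Technology, Xi’an Jiaotong-Liverpool University
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted at INLG 2025. Camera-ready version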

Abstract:The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.

[NLP-29] Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data EMNLP

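[Quick Read]: This paper addresses the scarcity of natural language processing (NLP) resources for Maltese, a language with Semitic roots that is written in the Latin script and differs markedly from Arabic, which complicates cross-lingual transfer. The key to the solution is to use Arabic resources for cross-lingual augmentation, through several transliteration schemes and machine translation (MT) approaches, along with new transliteration systems that better fit Maltese orthography. Experiments show that such Arabic-based augmentation can significantly improve monolingual and multilingual models on Maltese tasks.

Link: https://arxiv.org/abs/2509.12853
Authors: Kurt Micallef,Nizar Habash,Claudia Borg
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: EMNLP Camera-Ready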

Abstract:Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.

[NLP-30] ConvergeWriter: Data-Driven Bottom-Up Article Construction

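[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in generating long-form, factual documents: existing "top-down" methods suffer from a disconnect between the generation plan and the available knowledge, leading to fragmented content and factual errors. The key to the solution is a "bottom-up", data-driven framework built on a "retrieve first for knowledge, cluster for structure" strategy. Iterative retrieval first establishes the knowledge boundaries of the source corpus; an unsupervised clustering algorithm then groups the retrieved documents into "knowledge clusters". These clusters form an objective, data-driven structural basis that directly guides generation of the hierarchical outline and the text, so that the output is constrained by and traceable to the original knowledge base. This reduces the risk of hallucination and improves faithfulness and structure in knowledge-constrained settings.

Link: https://arxiv.org/abs/2509.12811
Authors: Binquan Ji,Jiaqi Wang,Ruiting Li,Xingchen Han,Yiyang Qi,Shichao Wang,Yifei Lu,Yuantao Han,Feiliang Ren
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: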

Abstract:Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing “top-down” methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model’s plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel “bottom-up,” data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a “Retrieval-First for Knowledge, Clustering for Structure” strategy, which first establishes the “knowledge boundaries” of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct “knowledge clusters.” These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.

[NLP-31] Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision

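[Quick Read]: This paper addresses the lack of concept abstraction in vision-language models (VLMs) during image understanding, that is, whether a model can go beyond recognizing the objects in an image and their relationships and capture higher-level, generalized concepts. The key to the solution is a training method based on a grouped contrastive loss. It introduces MAGIC, a grouped image-caption dataset in which each group contains several images, their captions, and higher-level concept labels. An outer contrastive loss pulls the image and caption representations within a group toward a shared higher-level concept representation in the latent semantic space, while an inner loss optimizes the distances between image-caption instances within the group. Because the model is never shown the higher-level concept labels, concept abstraction emerges from training, and the resulting model improves on abstract concept recognition compared with SOTA models.

Link: https://arxiv.org/abs/2509.12771
Authors: Omri Suissa,Muhiim Ali,Shengmai Chen,Yinuo Cai,Shekhar Pradhan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: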

Abstract:Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate 1. The extent to which VLMs have this concept abstraction capacity, and 2. Strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM model (CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models.

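[NLP-32] InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering EMNLP’25

[Quick Read]: This paper addresses a key problem in current Retrieval-Augmented Generation (RAG) frameworks: how to identify and select the retrieved documents that contribute meaningfully to answer generation, so that irrelevant or misleading content does not hurt final performance. Existing methods often cannot quantify the value of a retrieved document and therefore cannot filter low-quality information precisely. The key to the solution is a new metric, Document Information Gain (DIG), which measures a document's value as the difference in a large language model's (LLM's) generation confidence with and without that document. The InfoGain-RAG framework then uses DIG scores to train a specialized reranker that prioritizes documents from two perspectives, exact distinguishing and accurate sorting, improving the answer accuracy and robustness of RAG systems.

Link: https://arxiv.org/abs/2509.12765
Authors: Zihan Wang,Zihan Liang,Zhou Shao,Yufei Ma,Huangyu Dai,Ben Chen,Lingtao Mao,Chenyi Lei,Yuqing Ding,Han Li
Affiliations: Kuaishou Technology; Peking University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP’25 Oral Presentation. Contact: benchen4395@gmail.com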

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address key limitations of Large Language Models (LLMs), such as hallucination, outdated knowledge, and lacking reference. However, current RAG frameworks often struggle with identifying whether retrieved documents meaningfully contribute to answer generation. This shortcoming makes it difficult to filter out irrelevant or even misleading content, which notably impacts the final performance. In this paper, we propose Document Information Gain (DIG), a novel metric designed to quantify the contribution of retrieved documents to correct answer generation. DIG measures a document’s value by computing the difference of LLM’s generation confidence with and without the document augmented. Further, we introduce InfoGain-RAG, a framework that leverages DIG scores to train a specialized reranker, which prioritizes each retrieved document from exact distinguishing and accurate sorting perspectives. This approach can effectively filter out irrelevant documents and select the most valuable ones for better answer generation. Extensive experiments across various models and benchmarks demonstrate that InfoGain-RAG can significantly outperform existing approaches, on both single and multiple retrievers paradigm. Specifically on NaturalQA, it achieves the improvements of 17.9%, 4.5%, 12.5% in exact match accuracy against naive RAG, self-reflective RAG and modern ranking-based RAG respectively, and even an average of 15.3% increment on advanced proprietary model GPT-4o across all datasets. These results demonstrate the feasibility of InfoGain-RAG as it can offer a reliable solution for RAG in multiple applications.
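
The DIG definition above reduces to a subtraction, which the sketch below makes explicit: a document's gain is the model's confidence in the correct answer with the document minus its confidence without it. The confidence values, the document names, and the direct filter-and-sort step are hypothetical; in the paper DIG scores are used to train a reranker rather than applied directly like this.

```python
from typing import Dict, List, Tuple


def document_information_gain(conf_with: float, conf_without: float) -> float:
    """DIG as described in the abstract: confidence with the document minus without it."""
    return conf_with - conf_without


def rank_and_filter(
    conf_without: float,
    conf_with_docs: Dict[str, float],
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score every document, drop those at or below the threshold, sort by gain."""
    scored = [
        (doc, document_information_gain(conf, conf_without))
        for doc, conf in conf_with_docs.items()
    ]
    kept = [(doc, gain) for doc, gain in scored if gain > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical confidences: 0.40 without any document, then with each document added.
ranked = rank_and_filter(0.40, {"doc_a": 0.85, "doc_b": 0.35, "doc_c": 0.55})
```

A negative gain, as for `doc_b` here, corresponds to a document that makes the model less confident in the correct answer, which is the kind of misleading content the framework aims to filter out.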

[NLP-33] Similarity-Distance-Magnitude Activations

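[Quick read]: This paper addresses the sensitivity of the standard softmax activation to covariate shifts and out-of-distribution inputs, particularly in high-probability regions, and its lack of interpretability. The key to the solution is a new activation function, the Similarity-Distance-Magnitude (SDM) activation, which adds two kinds of awareness to the output Magnitude (decision-boundary) awareness that softmax already has: Similarity (correctly predicted depth-matches into training) and Distance-to-training-distribution. Used as the final-layer activation of a language model, SDM is more robust than softmax, provides interpretability-by-exemplar via dense matching, and enables a partitioning of the class-wise empirical CDFs that guards against low class-wise recall among selective classifications.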

Link: https://arxiv.org/abs/2509.12760
Authors: Allen Schmaltz
Affiliations: Reexpress AI
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 17 pages, 5 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167

Abstract:We introduce a more robust and interpretable formulation of the standard softmax activation function commonly used with neural networks by adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness. When used as the final-layer activation with language models, the resulting Similarity-Distance-Magnitude (SDM) activation function is more robust than the softmax function to co-variate shifts and out-of-distribution inputs in high-probability regions, and provides interpretability-by-exemplar via dense matching. Complementing the prediction-conditional estimates, the SDM activation enables a partitioning of the class-wise empirical CDFs to guard against low class-wise recall among selective classifications. These properties make it preferable for selective classification, even when considering post-hoc calibration methods over the softmax.

[NLP-34] Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs

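[Quick read]: This paper addresses the limitations of existing graph reasoning methods, which rely on extensive fine-tuning or predefined algorithms and therefore lack flexibility and generalize poorly across diverse graph tasks. The key to the solution is a training-free framework, Graph Reasoning via Retrieval Augmented Framework (GRRAF), which combines retrieval-augmented generation (RAG) with the code-generation capabilities of large language models (LLMs): the target graph is stored in a graph database, and the LLM generates executable code queries that retrieve the required information. This yields accurate and efficient answers on a range of graph reasoning tasks such as cycle detection, bipartite graph checks, and shortest path computation, while an error feedback loop and a time-out mechanism safeguard correctness and efficiency.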

Link: https://arxiv.org/abs/2509.12743
Authors: Hanqing Li, Kiran Sheena Jyothi, Henry Liang, Sharika Mahadevan, Diego Klabjan
Affiliations: Northwestern University; EXL Service; Vail Systems; Netflix
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.
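
The error feedback loop with a time-out mentioned in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: `generate_query` stands in for the LLM call and `run_query` for the graph database, both hypothetical callables, and the toy graph and query names are invented for the example.

```python
import time

def answer_with_feedback(question, generate_query, run_query,
                         max_attempts=3, time_budget_s=5.0):
    """Generate a query, run it, and feed any error back until success or time-out."""
    deadline = time.monotonic() + time_budget_s
    feedback = None
    attempts = 0
    # The budget is checked between attempts; a running query is not interrupted.
    while attempts < max_attempts and time.monotonic() < deadline:
        attempts += 1
        query = generate_query(question, feedback)
        try:
            return run_query(query), attempts
        except Exception as exc:  # report the failure back to the generator
            feedback = f"{type(exc).__name__}: {exc}"
    return None, attempts

# Toy stand-ins (hypothetical): a tiny adjacency map and one named query.
GRAPH = {"a": {"b"}, "b": {"c"}, "c": set()}
QUERIES = {"edge_a_b": lambda: "b" in GRAPH["a"]}

def fake_generate(question, feedback):
    # The first draft contains a typo; the retry after feedback is correct.
    return "edge_a_b" if feedback else "edge_ab"

def fake_run(query):
    return QUERIES[query]()  # raises KeyError for an unknown query

result, attempts = answer_with_feedback("Is there an edge a -> b?", fake_generate, fake_run)
```

In this toy run the first query fails, the error text is passed back, and the second attempt succeeds. The prompt size is independent of the graph size because only the query, not the graph, passes through the generator.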

[NLP-35] A Novel Recurrent Neural Network Framework for Prediction and Treatment of Oncogenic Mutation Progression

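[Quick read]: This paper addresses the dependence of current cancer pathway analysis on wet lab data that is time-consuming and expensive to obtain, which limits its efficient clinical use. The authors propose an AI-based end-to-end framework that predicts cancer severity and mutation progression and recommends possible treatments accordingly. The key to the solution is combining time-series machine learning models with pathway analysis: mutation sequences are first extracted from the TCGA database, and a novel preprocessing algorithm filters key mutations by mutation frequency; a recurrent neural network (RNN) then predicts cancer severity; finally, probabilistic inference over the RNN output, the preprocessing information, and multiple drug-target databases predicts future mutations and recommends treatments. The framework reports ROC-curve results with accuracies above 60%, and the preprocessing suggests that each cancer stage studied may contain on the order of a few hundred key driver mutations. The approach reduces reliance on wet lab work and offers efficient, low-cost prediction of cancer progression together with treatment suggestions.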

Link: https://arxiv.org/abs/2509.12732
Authors: Rishab Parthasarathy, Achintya Bhowmik
Affiliations: Massachusetts Institute of Technology; Stanford University School of Medicine
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 12 pages, 11 figures, work originally done in 2022/2023 and was awarded as one of the Regeneron Science Talent Search Finalists in 2022

Abstract:Despite significant medical advancements, cancer remains the second leading cause of death, with over 600,000 deaths per year in the US. One emerging field, pathway analysis, is promising but still relies on manually derived wet lab data, which is time-consuming to acquire. This work proposes an efficient, effective end-to-end framework for Artificial Intelligence (AI) based pathway analysis that predicts both cancer severity and mutation progression, thus recommending possible treatments. The proposed technique involves a novel combination of time-series machine learning models and pathway analysis. First, mutation sequences were isolated from The Cancer Genome Atlas (TCGA) Database. Then, a novel preprocessing algorithm was used to filter key mutations by mutation frequency. This data was fed into a Recurrent Neural Network (RNN) that predicted cancer severity. Then, the model probabilistically used the RNN predictions, information from the preprocessing algorithm, and multiple drug-target databases to predict future mutations and recommend possible treatments. This framework achieved robust results and Receiver Operating Characteristic (ROC) curves (a key statistical metric) with accuracies greater than 60%, similar to existing cancer diagnostics. In addition, preprocessing played an instrumental role in isolating important mutations, demonstrating that each cancer stage studied may contain on the order of a few-hundred key driver mutations, consistent with current research. Heatmaps based on predicted gene frequency were also generated, highlighting key mutations in each cancer. Overall, this work is the first to propose an efficient, cost-effective end-to-end framework for projecting cancer progression and providing possible treatments without relying on expensive, time-consuming wet lab work.

[NLP-36] HistoryBankQA: Multilingual Temporal Question Answering on Historical Events

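[Quick read]: This paper addresses gaps in how the temporal reasoning of large language models (LLMs) over historical events is evaluated: existing datasets are limited in scale, lack multilingual coverage, and focus on contemporary events. To address this, the authors present HistoryBank, a multilingual database of more than 10 million historical events extracted from Wikipedia timeline pages and article infoboxes, covering 10 languages with broad historical depth and linguistic breadth. The key contribution is a comprehensive cross-lingual temporal question answering (Temporal QA) benchmark that covers 6 temporal reasoning tasks, together with a systematic evaluation of popular models (GPT4o, LLaMA-3-8B, Mistral-7B, and others), providing a scalable and comparable resource for temporal understanding of historical events across languages.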

Link: https://arxiv.org/abs/2509.12720
Authors: Biswadip Mandal, Anant Khandelwal, Manish Gupta
Affiliations: Microsoft
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.

[NLP-37] Case-Based Decision-Theoretic Decoding with Quality Memories EMNLP2025

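[Quick read]: This paper addresses the difficulty that Minimum Bayes Risk (MBR) decoding has in capturing out-of-domain knowledge or information: MBR depends on sample texts drawn from the text generation model, and those samples may not represent the target domain well. The key to the solution is case-based decision-theoretic (CBDT) decoding, which estimates the expected utility using examples of domain data. Experiments show that CBDT decoding generates higher-quality texts than maximum a posteriori (MAP) decoding, and that combining MBR and CBDT decoding outperforms MBR decoding alone on seven-domain De–En and Ja↔En translation tasks and on image captioning with the MSCOCO and nocaps datasets.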

Link: https://arxiv.org/abs/2509.12677
Authors: Hiroyuki Deguchi, Masaaki Nagata
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP2025 main

Abstract:Minimum Bayes risk (MBR) decoding is a decision rule of text generation, which selects the hypothesis that maximizes the expected utility and robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures the knowledge or information of out-of-domain. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method to estimate the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding, but also the combination of MBR and CBDT decoding outperformed MBR decoding in seven domain De–En and Ja↔En translation tasks and image captioning tasks on MSCOCO and nocaps datasets.
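
For readers unfamiliar with the MBR decision rule that the abstract builds on, the sketch below picks the candidate with the highest average utility against a set of sampled pseudo-references. It covers plain MBR only; the paper's CBDT estimator, which draws on domain examples, is not reproduced. The unigram-F1 utility and the toy sentences are assumptions chosen to keep the example self-contained (real systems typically use a learned or n-gram-based metric).

```python
from collections import Counter

def token_f1(hyp, ref):
    """Toy utility: unigram-overlap F1 between whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(candidates, pseudo_references, utility=token_f1):
    """Return the candidate that maximizes the average utility over the references."""
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in pseudo_references) / len(pseudo_references)
    return max(candidates, key=expected_utility)

# Made-up model samples serve as both candidates and pseudo-references.
samples = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran away"]
best = mbr_decode(samples, samples)
```

Because the pseudo-references are the model's own samples, the estimate is only as good as those samples, which is the weakness on out-of-domain inputs that the abstract describes.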

[NLP-38] Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

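[Quick read]: This paper addresses the misclassifications that toxicity detection models suffer as LLM-generated content proliferates: conventional content moderation classifiers are trained mainly on human-written text and are not robust to LLM-generated text or to adversarial attacks. The key to the solution is using mechanistic interpretability techniques to identify the vulnerable components ("vulnerable circuits") of toxicity classifiers and then suppressing them to improve performance on adversarial input. The study finds that distinct attention heads are either crucial for performance or vulnerable to attack, and that different heads are responsible for vulnerability across different demographic groups. This exposes fairness and robustness gaps in current model training and points toward more inclusive toxicity detection systems.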

Link: https://arxiv.org/abs/2509.12672
Authors: Shaz Furniturewala, Arkaitz Zubiaga
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.

[NLP-39] Chat-Driven Text Generation and Interaction for Person Retrieval EMNLP2025

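[Quick read]: This paper addresses the scalability and deployment problems of text-based person search (TBPS) systems, which depend on large amounts of high-quality, manually annotated text descriptions. The key to the solution is two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG uses multimodal large language models (MLLMs) to simulate dialogues and generate fine-grained, diverse pseudo-label descriptions without manual annotation; MTI refines user queries at inference time through dynamic, dialogue-based reasoning, which improves the handling of vague, incomplete, or ambiguous descriptions. Together they form a unified, annotation-free TBPS framework that improves retrieval accuracy, robustness, and usability and supports scalable, practical deployment.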

Link: https://arxiv.org/abs/2509.12662
Authors: Zequn Xie, Chuxin Wang, Sihang Cai, Yeqiang Wang, Shulei Wang, Tao Jin
Affiliations: Zhejiang University; Northwest A&F University
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025. 13 pages, 3 figures

Abstract:Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.

[NLP-40] Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations

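[Quick read]: This paper addresses the poor performance of large language models (LLMs) in emotional support conversation (ESC), which stems from low accuracy in strategy planning and a considerable preference bias towards specific strategies. The solution has two parts: first, the authors identify the knowledge boundaries of LLMs in strategy planning to reveal the underlying causes of the preference bias; second, they propose reinforcement learning with a dual reward function that optimizes strategy planning via both accuracy and entropy-based confidence for each knowledge region, which mitigates the bias and improves planning quality. Experiments on the ESCov and ExTES datasets show that the approach outperforms multiple baselines.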

Link: https://arxiv.org/abs/2509.12661
Authors: Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Xingjiao Wu, Liang He
Affiliations: Shanghai Institute of Artificial Intelligence for Education; School of Computer Science and Technology; School of Psychology and Cognitive Science; School of Pharmacy; East China Normal University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not well been studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.

[NLP-41] Don't Change My View: Ideological Bias Auditing in Large Language Models

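[Quick read]: This paper addresses the risk that large language models (LLMs) could be intentionally steered toward specific ideological positions, such as political or religious views, giving those who control them disproportionate influence over public discourse. The core challenge is detecting such steering when the model is a black box whose internals are inaccessible. The key to the solution is a method based on distributional shift analysis that requires no access to model internals: it identifies potential ideological steering by analyzing how the statistical distribution of model outputs shifts across prompts thematically related to a chosen topic. The method is model-agnostic and therefore particularly suitable for independent post hoc audits of proprietary systems.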

Link: https://arxiv.org/abs/2509.12652
Authors: Paul Kröger, Emilio Barkett
Affiliations: Columbia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.

[NLP-42] PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition ICASSP2026

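[Quick read]: This paper addresses two key challenges in large language model (LLM)-based automatic speech recognition (ASR): effective pronunciation modeling and robust homophone discrimination, both essential for recognizing raw or long-tail words. The core of the solution is a Pronunciation-Aware Contextualized (PAC) framework with a two-stage learning paradigm. The first stage introduces pronunciation-guided context learning, which uses an interleaved grapheme-phoneme context modeling strategy with grapheme-only distractors so that the model learns to rely on phonemic cues. The second stage proposes pronunciation-discriminative reinforcement learning with perturbed label sampling, which further improves the model's ability to distinguish contextualized homophones. On English Librispeech and Mandarin AISHELL-1, PAC reduces relative word error rate (WER) by 30.2% and 53.8%, respectively, compared with pre-trained LLM-based ASR models, and reduces biased WER for long-tail words by 31.8% and 60.5% relative to strong baselines.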

Link: https://arxiv.org/abs/2509.12647
Authors: Li Fu, Yu Xin, Sunlu Zeng, Lu Fan, Youzheng Wu, Xiaodong He
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2026

Abstract:This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.

[NLP-43] Positional Encoding via Token-Aware Phase Attention

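[Quick read]: This paper addresses a limitation of Rotary Positional Embedding (RoPE) in modeling long sequences: an intrinsic distance-dependent bias in attention scores that restricts its expressiveness in long-context settings. The key to the solution is a new positional encoding method, Token-Aware Phase Attention (TAPA), whose core idea is to incorporate a learnable phase function into the attention mechanism. TAPA preserves long-range token interactions, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context inputs than the RoPE family.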

Link: https://arxiv.org/abs/2509.12635
Authors: Yu (Sid) Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract:We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE’s ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context than RoPE families.
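
As background for the abstract's claim about RoPE (this is not TAPA, whose phase function is not specified here), the sketch below shows the standard property that a RoPE attention score depends on the two positions only through their offset. It uses a single 2-D frequency pair; the vectors and the `theta` value are arbitrary illustrative choices.

```python
import math

def rotate(vec2, position, theta=0.1):
    """Rotate a 2-D vector by the angle position * theta (one RoPE frequency pair)."""
    x, y = vec2
    angle = position * theta
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of the position-rotated query and key."""
    qr, kr = rotate(q, q_pos, theta), rotate(k, k_pos, theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 0.5), (0.3, -0.8)
near_start = rope_score(q, k, q_pos=10, k_pos=7)     # offset 3
far_along = rope_score(q, k, q_pos=110, k_pos=107)   # offset 3, shifted by 100
other_offset = rope_score(q, k, q_pos=10, k_pos=2)   # offset 8
```

The first two scores agree up to floating-point error because the offset is the same, while the third differs. The distance-dependent bias that the paper analyzes arises from how such scores behave as the offset grows.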

[NLP-44] EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

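[Quick read]: This paper addresses the substantial computational overhead that large language models (LLMs) incur in automated theorem proving (ATP) when they rely on test-time scaling strategies such as reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. Existing cost analyses typically regulate only the number of sampling passes and neglect the large differences in sampling cost between scaling strategies, which leads to inefficiency. The key to the solution is two complementary methods integrated into a unified EconRL pipeline: (1) a dynamic CoT switching mechanism that reduces unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning (RL) with trainable prefixes, which raises pass rates under a constrained number of sampling passes. Experiments show that the resulting EconProver matches baseline performance on miniF2F and ProofNet with only 12% of the computational cost.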

Link: https://arxiv.org/abs/2509.12603
Authors: Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu
Affiliations: Tencent; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.

[NLP-45] DaSAThco: Data-Aware SAT Heuristics Combinations Optimization via Large Language Models

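[Quick read]: This paper addresses the generality of heuristic configuration in Conflict-Driven Clause Learning (CDCL) solvers: because SAT problems are heterogeneous, no single configuration is optimal everywhere, while existing dataset-specific optimization methods generalize poorly and require costly re-tuning for new problem types. The key to the solution is the DaSAThco framework, which uses a Large Language Model (LLM) guided by systematically defined Problem Archetypes to generate a diverse portfolio of specialized heuristic ensembles, and then learns an adaptive selection mechanism. The result is a transferable mapping from instance features to tailored heuristic ensembles, in line with the train-once, adapt-broadly goal.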

Link: https://arxiv.org/abs/2509.12602
Authors: Minyu Chen, Guoqiang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages

Abstract:The performance of Conflict-Driven Clause Learning solvers hinges on internal heuristics, yet the heterogeneity of SAT problems makes a single, universally optimal configuration unattainable. While prior automated methods can find specialized configurations for specific problem families, this dataset-specific approach lacks generalizability and requires costly re-optimization for new problem types. We introduce DaSAThco, a framework that addresses this challenge by learning a generalizable mapping from instance features to tailored heuristic ensembles, enabling a train-once, adapt-broadly model. Our framework uses a Large Language Model, guided by systematically defined Problem Archetypes, to generate a diverse portfolio of specialized heuristic ensembles and subsequently learns an adaptive selection mechanism to form the final mapping. Experiments show that DaSAThco achieves superior performance and, most notably, demonstrates robust out-of-domain generalization where non-adaptive methods show limitations. Our work establishes a more scalable and practical path toward automated algorithm design for complex, configurable systems.

[NLP-46] The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

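[Quick read]: This paper addresses the attention-computation bottleneck that large sets of visual tokens create when vision-language-action (VLA) models are deployed on resource-constrained platforms. The key to the solution is LightVLA, a differentiable token pruning framework: it generates dynamic queries to evaluate the importance of visual tokens and uses Gumbel softmax to make token selection differentiable, so that, without any additional trainable parameters, the model learns to keep the visual tokens most informative for task execution while cutting computation. On the LIBERO benchmark, LightVLA achieves a higher task success rate (a 2.9% improvement) while reducing FLOPs by 59.1% and latency by 38.2%, a step toward efficient, high-performing real-time robotic systems.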

Link: https://arxiv.org/abs/2509.12594
Authors: Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang
Affiliations: Li Auto Inc.; Tsinghua University; Chinese Academy of Sciences
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. Project site: this https URL

Abstract:We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
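
To make the Gumbel softmax step in the abstract more concrete, here is a small sketch in plain Python. It shows only the sampling side: Gumbel noise is added to the token scores and a softmax produces a relaxed one-hot vector. How LightVLA builds its queries and scores, and the rule of keeping the union of per-query winners, are assumptions made for illustration. In an autodiff framework the hard choice would typically be paired with the soft probabilities (a straight-through estimator) so that gradients can flow.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [value / total for value in exps]

def select_tokens(score_rows, temperature=1.0, rng=random):
    """Keep the union of the tokens chosen by each query (hypothetical selection rule)."""
    kept = set()
    for row in score_rows:
        probs = gumbel_softmax(row, temperature, rng)
        kept.add(max(range(len(probs)), key=probs.__getitem__))
    return sorted(kept)

# Made-up scores: 3 queries over 4 visual tokens.
rng = random.Random(0)
scores = [[5.0, 0.1, 0.2, 0.1], [0.1, 0.2, 6.0, 0.1], [4.0, 0.1, 0.3, 0.2]]
kept = select_tokens(scores, temperature=0.5, rng=rng)
probs = gumbel_softmax(scores[0], temperature=0.5, rng=rng)
```

Several queries can pick the same token, so the number of kept tokens adapts to the input instead of being fixed by a hand-set pruning ratio.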

[NLP-47] Match Chat: Real Time Generative AI and Generative Computing for Tennis

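[Quick read]: This paper addresses the unmet need of live sports audiences for real-time, accurate, and easy-to-use match information, particularly in dynamic sports such as tennis, where users want to ask for key data and insights in natural language. The key to the solution is Match Chat, a real-time assistant built on an Agent-Oriented Architecture (AOA). It integrates Generative Artificial Intelligence (GenAI) with Generative Computing (GenComp) techniques and combines rule engines, predictive models, and agents to pre-process and optimize user queries before they reach the GenAI components. Under loads of up to 120 requests per second, the system reached 92.83% answer accuracy with an average response time of 6.25 seconds, maintained 100% uptime, and served nearly 1 million users, with a user experience that prioritized clarity, responsiveness, and minimal effort.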

Link: https://arxiv.org/abs/2509.12592
Authors: Aaron Baughman, Gozde Akay, Eduardo Morales, Rahul Agarwal, Preetika Srivastava
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 5 Figures, 4 Tables

Abstract:We present Match Chat, a real-time, agent-driven assistant designed to enhance the tennis fan experience by delivering instant, accurate responses to match-related queries. Match Chat integrates Generative Artificial Intelligence (GenAI) with Generative Computing (GenComp) techniques to synthesize key insights during live tennis singles matches. The system debuted at the 2025 Wimbledon Championships and the 2025 US Open, where it provided about 1 million users with seamless access to streaming and static data through natural language queries. The architecture is grounded in an Agent-Oriented Architecture (AOA) combining rule engines, predictive models, and agents to pre-process and optimize user queries before passing them to GenAI components. The Match Chat system had an answer accuracy of 92.83% with an average response time of 6.25 seconds under loads of up to 120 requests per second (RPS). Over 96.08% of all queries were guided using interactive prompt design, contributing to a user experience that prioritized clarity, responsiveness, and minimal effort. The system was designed to mask architectural complexity, offering a frictionless and intuitive interface that required no onboarding or technical familiarity. Across both Grand Slam deployments, Match Chat maintained 100% uptime and supported nearly 1 million unique users, underscoring the scalability and reliability of the platform. This work introduces key design patterns for real-time, consumer-facing AI systems that emphasize speed, precision, and usability that highlights a practical path for deploying performant agentic systems in dynamic environments.

[NLP-48] MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

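[Quick read]: This paper addresses the difficulty of training high-quality Automated Audio Captioning (AAC) models when datasets are limited compared with image captioning. The authors propose a zero-shot AAC system that needs no extensive training. The key is to use a pre-trained audio CLIP model to extract auditory features and generate a structured prompt that guides a Large Language Model (LLM) in caption generation. Instead of task-specific fine-tuning or plain greedy decoding, the method refines token selection through the audio CLIP model so that the caption stays aligned with the audio content. With MAGIC search and the WavCaps model, the NLG mean score rises from 4.7 to 7.3. Performance depends heavily on the audio-text matching model and on keyword selection: a single keyword prompt works best, and performance drops by 50% when no keyword list is used.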

Link: https://arxiv.org/abs/2509.12591
Authors: Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted in The 26th International Conference on Web Information Systems Engineering (WISE), scheduled for 15-17 December 2025 in Marrakech, Morocco

Abstract:Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.

[NLP-49] Yet Another Watermark for Large Language Models

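[Quick read]: This paper addresses the difficulty that existing watermarking methods for large language models (LLMs) have in balancing semantic quality, robustness, and imperceptibility. Most methods embed the watermark by adjusting token sampling or by post-processing, so they lack intrinsic coupling with the LLM and can reduce the semantic quality of the generated text; methods based on training or fine-tuning are mostly limited to the white-box scenario or are very time-consuming. The key to the solution is embedding the watermark into the LLM by manipulating its internal parameters, which entangles the watermark with the model itself. The watermark can then be extracted from the generated text alone, without access to the LLM, which better balances robustness and imperceptibility and allows efficient extraction in the black-box scenario.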

Link: https://arxiv.org/abs/2509.12574
Authors: Siyuan Bao, Ying Shi, Zhiguang Yang, Hanzhou Wu, Xinpeng Zhang
Affiliations: Shanghai University; Guizhou Normal University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: this https URL

Abstract:Existing watermarking methods for large language models (LLMs) mainly embed watermark by adjusting the token sampling prediction or post-processing, lacking intrinsic coupling with LLMs, which may significantly reduce the semantic quality of the generated marked texts. Traditional watermarking methods based on training or fine-tuning may be extendable to LLMs. However, most of them are limited to the white-box scenario, or very time-consuming due to the massive parameters of LLMs. In this paper, we present a new watermarking framework for LLMs, where the watermark is embedded into the LLM by manipulating the internal parameters of the LLM, and can be extracted from the generated text without accessing the LLM. Comparing with related methods, the proposed method entangles the watermark with the intrinsic parameters of the LLM, which better balances the robustness and imperceptibility of the watermark. Moreover, the proposed method enables us to extract the watermark under the black-box scenario, which is computationally efficient for use. Experimental results have also verified the feasibility, superiority and practicality. This work provides a new perspective different from mainstream works, which may shed light on future research.

[NLP-50] LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

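[Quick read]: This paper targets the high compute and deployment cost of text embedding models and the difficulty of balancing quality with efficiency. Other embedding training frameworks typically require relevance judgments, hard negatives, and large batch sizes. The key to the solution is LEAF (Lightweight Embedding Alignment Framework), a knowledge distillation framework whose small "leaf" models are explicitly aligned to their teacher model. This alignment allows flexible asymmetric architectures: documents are encoded with the larger teacher model, while queries are served by the lightweight leaf model. LEAF requires neither judgments nor hard negatives, supports training with small batch sizes, and leaf models automatically inherit MRL and robustness to output quantization whenever the teacher has these properties, so dataset and training infrastructure requirements are modest.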

Link: https://arxiv.org/abs/2509.12539
Authors: Robin Vujanic, Thomas Rueckstiess
Affiliations: MongoDB Research
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 12 figures

Abstract:We present LEAF (“Lightweight Embedding Alignment Framework”), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures where documents are encoded with the larger teacher model, while queries can be served with the smaller leaf models. We also show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework we publish leaf-ir, a 23M parameters information retrieval oriented text embedding model trained using LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark and for models of its size. When run in asymmetric mode, its retrieval performance is further increased. Our scheme is however not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and in contrast to other embedding model training frameworks, it does not require judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, dataset and training infrastructure requirements for our framework are modest. We make our models publicly available under a permissive Apache 2.0 license.
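
Because leaf models are aligned to their teacher, a query embedded by the leaf can be compared directly with documents embedded by the teacher. The sketch below illustrates that asymmetric setup with toy vectors. The abstract does not state the training objective, so the mean-squared-error alignment loss here is an assumption, and all function names and numbers are hypothetical.

```python
import math

def alignment_loss(student_vec, teacher_vec):
    """Assumed objective: mean squared error between student and teacher embeddings of one text."""
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(teacher_vec)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def asymmetric_search(leaf_query_vec, teacher_doc_vecs):
    """Rank teacher-encoded documents against a leaf-encoded query, best match first."""
    ranked = sorted(
        teacher_doc_vecs.items(),
        key=lambda item: cosine(leaf_query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked]

# Toy vectors standing in for real embeddings.
teacher_docs = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 1.0, 0.0]}
leaf_query = [0.9, 0.1, 0.0]

ranking = asymmetric_search(leaf_query, teacher_docs)
loss = alignment_loss([0.9, 0.1, 0.0], [1.0, 0.0, 0.0])
```

The design point is that only the query path needs the small model at serving time; the document index can be built once, offline, with the larger teacher.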

[NLP-51] The Adaptation Paradox: Agency vs. Mimicry in Companion Chatbots

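[Quick read]: This paper asks how companion chatbots powered by Generative AI can foster genuine connection, in particular how to raise rapport and satisfaction during user-AI interaction. The key is to distinguish two personalization routes: visible user authorship (for example, generating a custom avatar) versus covert mimicry through Language Style Matching (LSM). User-generated avatars significantly boosted rapport (ω² = .040, p = .013), whereas adaptive LSM underperformed static style on perceived personalization and satisfaction (d = 0.35, p = .009) and was paradoxically judged less adaptive. The authors call this the Adaptation Paradox: synchrony erodes connection when it is perceived as incoherent and destabilizes the persona. They propose a stability-and-legibility account and recommend that designers prioritize legible, user-driven personalization and limit stylistic shifts rather than rely on opaque mimicry.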

Link: https://arxiv.org/abs/2509.12525
Authors: T. James Brandt, Cecilia Xi Wang
Affiliations: University of Minnesota
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 31 pages, 17 figures, 2 tables. Submitted to CHI 2026 (under review). Preregistered: this https URL ; Code/Materials: this https URL

Abstract:Generative AI powers a growing wave of companion chatbots, yet principles for fostering genuine connection remain unsettled. We test two routes: visible user authorship versus covert language-style mimicry. In a preregistered 3x2 experiment (N = 162), we manipulated user-controlled avatar generation (none, premade, user-generated) and Language Style Matching (LSM) (static vs. adaptive). Generating an avatar boosted rapport (ω² = .040, p = .013), whereas adaptive LSM underperformed static style on personalization and satisfaction (d = 0.35, p = .009) and was paradoxically judged less adaptive (t = 3.07, p = .003, d = 0.48). We term this an Adaptation Paradox: synchrony erodes connection when perceived as incoherent, destabilizing persona. To explain, we propose a stability-and-legibility account: visible authorship fosters natural interaction, while covert mimicry risks incoherence. Our findings suggest designers should prioritize legible, user-driven personalization and limit stylistic shifts rather than rely on opaque mimicry.
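
For readers unfamiliar with Language Style Matching, the sketch below shows the commonly used LSM formula: for each function-word category, compare the usage rates of two texts and average the per-category agreement. The abstract does not describe the study's implementation, so this is only an illustration; the tiny word lists are hypothetical stand-ins for a full function-word dictionary.

```python
# Hypothetical, heavily abbreviated function-word categories.
FUNCTION_WORDS = {
    "articles": {"a", "an", "the"},
    "pronouns": {"i", "you", "we", "it", "they"},
    "prepositions": {"in", "on", "of", "to", "with"},
}

def category_rates(text):
    """Fraction of words in the text that fall into each function-word category."""
    words = text.lower().split()
    total = len(words) or 1
    return {
        category: sum(word in vocab for word in words) / total
        for category, vocab in FUNCTION_WORDS.items()
    }

def lsm_score(text_a, text_b):
    """Average over categories of 1 - |a - b| / (a + b + 0.0001); 1.0 means identical style."""
    rates_a, rates_b = category_rates(text_a), category_rates(text_b)
    per_category = [
        1 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + 0.0001)
        for c in FUNCTION_WORDS
    ]
    return sum(per_category) / len(per_category)

same_style = lsm_score("the cat sat on it", "the cat sat on it")
different_style = lsm_score("the cat sat on it", "cats sleep")
```

An adaptive condition would presumably steer the chatbot's replies toward a higher score against the user's recent messages, while a static condition keeps the style fixed.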

[NLP-52] Context-Aware Language Models for Forecasting Market Impact from Sequences of Financial News

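[Quick read]: This paper addresses two problems: financial news articles are often not self-contained and need historical context to be interpreted correctly for market impact, and identifying and incorporating the most relevant context is hard. The key to the solution is a two-level contextualization method: a large language model (LLM) processes the main article, while a small language model encodes the historical news into concise summary embeddings that are then aligned with the large model's representation space. Historical context yields consistent, significant gains across methods and time horizons, and these gains translate into substantial improvements in simulated investment performance.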

Link: https://arxiv.org/abs/2509.12519
Authors: Ross Koval, Nicholas Andrews, Xifeng Yan
Affiliations: University of California, Santa Barbara; Johns Hopkins University
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
Comments: Preprint

Abstract:Financial news plays a critical role in the information diffusion process in financial markets and is a known driver of stock prices. However, the information in each news article is not necessarily self-contained, often requiring a broader understanding of the historical news coverage for accurate interpretation. Further, identifying and incorporating the most relevant contextual information presents significant challenges. In this work, we explore the value of historical context in the ability of large language models to understand the market impact of financial news. We find that historical context provides a consistent and significant improvement in performance across methods and time horizons. To this end, we propose an efficient and effective contextualization method that uses a large LM to process the main article, while a small LM encodes the historical context into concise summary embeddings that are then aligned with the large model’s representation space. We explore the behavior of the model through multiple qualitative and quantitative interpretability tests and reveal insights into the value of contextualization. Finally, we demonstrate that the value of historical context in model predictions has real-world applications, translating to substantial improvements in simulated investment performance.

[NLP-53] A comparison of pipelines for the translation of a low resource language based on transformers

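[Quick read]: This paper addresses machine translation for Bambara, a low-resource language, and in particular how to obtain good French-to-Bambara translation without large parallel corpora. It compares three training pipelines: (1) training a simple Transformer to translate directly; (2) fine-tuning LLaMA3 (3B-8B) instructor models with decoder-only architectures; (3) language distillation with a student-teacher dual network that integrates Bambara into a pre-trained LaBSE model, followed by a BERT extension that generates translations. Although it is the simplest, the first pipeline achieves the best translation accuracy (10% BLEU and 21% chrF on Bayelemagaba), which suggests that lightweight, dedicated models remain effective in low-resource settings. Instructor-based models perform better on single datasets than on aggregated collections, which suggests they capture dataset-specific patterns more effectively.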

Link: https://arxiv.org/abs/2509.12514
Authors: Chiara Bonfanti, Michele Colombino, Giulia Coucourde, Faeze Memari, Stefano Pinardi, Rosa Meo
Affiliations: University of Turin; Politecnico di Torino
Subjects: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

Abstract:This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandè language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.

[NLP-54] FunAudio-ASR Technical Report

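[Quick read]: This paper addresses two problems in LLM-based automatic speech recognition (ASR): hallucination by Large Language Models (LLMs), which degrades user experience, and the performance drop seen on real industry evaluation sets. The key to the solution is FunAudio-ASR, a large-scale LLM-based ASR system built for practical deployment. It combines massive data, large model capacity, LLM integration, and reinforcement learning, and adds production-oriented optimizations for streaming, noise robustness, code-switching, and hotword customization, achieving state-of-the-art performance on real application datasets.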

Link: https://arxiv.org/abs/2509.12508
Authors: Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

[NLP-55] Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction

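[Quick read]: This paper addresses the challenge of training a task-specific small reasoning model when direct human supervision or high-quality labels are scarce, that is, how to build reliable supervision signals from limited annotated data. The key to the solution is Reason-Refine-then-Align (R2tA). An open-source base model first generates initial reasoning traces and responses; these traces are then systematically refined to fix hallucinations and inconsistencies, producing a high-fidelity dataset. A two-stage alignment follows: supervised fine-tuning (SFT) and then direct preference optimization (DPO) calibrate the model's intermediate reasoning against human-validated conceptual preferences, and the final output is conditioned on that aligned reasoning. A case study on evaluating extended entity relationship diagrams (EERDs) in database system design suggests that R2tA offers a practical, cost-effective path to scalable LLM adaptation in data-scarce domains.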

Link: https://arxiv.org/abs/2509.12476
Authors: Sumanta Bhattacharyya, Sara Riaz, Pedram Rooshenas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model’s intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
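
The second alignment stage uses direct preference optimization (DPO). As a reminder of what that objective computes, here is a minimal sketch of the standard DPO loss for a single preference pair. The log-probability values and the `beta` setting are hypothetical, and nothing here reflects the paper's actual training configuration.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy equals the reference, the margin is zero and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# When the policy has shifted probability toward the chosen response, the loss drops.
loss_after_shift = dpo_loss(-4.0, -9.0, -5.0, -7.0)
```

In an R2tA-style setup the "chosen" response would presumably be the refined reasoning trace and the "rejected" one the unrefined trace, but that pairing is an assumption, not something the abstract states.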

[NLP-56] Does Language Model Understand Language?

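[Quick read]: This paper addresses the weakness of current language models (LMs) on fine-grained linguistic phenomena such as tense, negation, voice, and modality, which affects their reliability in educational settings related to United Nations Sustainable Development Goal 4 (SDG 4). For a structured assessment, the authors introduce new evaluation guidelines and the LUCID dataset, which consists of carefully crafted sentence pairs in English and Bengali that target these phenomena. The key to the solution is a new metric, HCE accuracy, which measures how often a model's prediction falls within one standard deviation of the mean human rating and thus reflects human-like tolerance for variability in interpretation. It is used alongside Pearson correlation, Spearman correlation, and Mean Absolute Error (MAE) to evaluate models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta. Compound-Beta emerges as the most balanced model across language conditions, indicating strong alignment with human judgments.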

Link: https://arxiv.org/abs/2509.12459
Authors: Suvojit Acharjee, Utathya Aich, Asfak Ali
Affiliations: Institute of Engineering and Management; Jadavpur University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Despite advances in natural language generation and understanding, LM still struggle with fine grained linguistic phenomena such as tense, negation, voice, and modality which are the elements central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs are increasingly powering applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct a evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce a new Route for Evaluation of Cognitive Inference in Systematic Environments guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as novel, linguistically inspired metric the HCE accuracy. The HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross lingual scenarios.
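
The HCE accuracy described above counts a prediction as correct when it lies within one standard deviation of the mean human rating. A minimal sketch under stated assumptions: the abstract does not say whether the population or sample standard deviation is used, so the population form (`pstdev`) is assumed here, and the ratings are made up.

```python
from statistics import mean, pstdev

def hce_accuracy(model_scores, human_ratings):
    """Share of items whose model score is within one std of the mean human rating.

    model_scores: one predicted score per item.
    human_ratings: one list of human ratings per item.
    """
    hits = 0
    for prediction, ratings in zip(model_scores, human_ratings):
        center, spread = mean(ratings), pstdev(ratings)
        if abs(prediction - center) <= spread:
            hits += 1
    return hits / len(model_scores)

# Hypothetical ratings for three sentence pairs.
human = [[4, 5, 3], [1, 2, 1], [3, 3, 3]]
predictions = [4.5, 3.0, 3.0]

accuracy = hce_accuracy(predictions, human)
```

Note the edge case in the third item: when all annotators agree, the tolerance band collapses to a single value, so only an exact match counts.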

[NLP-57] Topic Coverage-based Demonstration Retrieval for In-Context Learning EMNLP2025

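[Quick read]: This paper addresses insufficient knowledge coverage in in-context learning caused by poor demonstration selection: existing methods retrieve demonstrations based only on embedding similarity or generation probability, which often yields irrelevant or redundant examples. The key to the solution is TopicK, a topic coverage-based retrieval framework. TopicK first estimates the topics a test input requires and assesses the model's knowledge of those topics. It then iteratively selects demonstrations that introduce required topics not yet covered and on which the model's knowledge is weak, so that the selected set comprehensively covers the topic-level knowledge relevant to both the test input and the model.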

Link: https://arxiv.org/abs/2509.12451
Authors: Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
Affiliations: University of Illinois Urbana-Champaign; Korea University; Pohang University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main

Abstract:The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model’s knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at this https URL.
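
The iterative selection step can be pictured as a greedy coverage loop. The sketch below is an illustration under loud assumptions, not TopicK itself: the gain function (topic requirement times one minus model knowledge), the dictionaries, and the topic names are all hypothetical, since the abstract does not give the scoring formula.

```python
def select_demonstrations(required, knowledge, demo_topics, k):
    """Greedily pick up to k demonstrations that cover required, poorly known topics.

    required: dict topic -> how strongly the test input needs it (assumed in [0, 1]).
    knowledge: dict topic -> how well the model already knows it (assumed in [0, 1]).
    demo_topics: dict demo_id -> set of topics that demonstration covers.
    """
    uncovered = set(required)
    remaining = dict(demo_topics)
    chosen = []

    def gain(demo_id):
        # Reward only required topics that are still uncovered, weighted by the knowledge gap.
        return sum(
            required[topic] * (1.0 - knowledge.get(topic, 0.0))
            for topic in remaining[demo_id] & uncovered
        )

    while remaining and len(chosen) < k:
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # nothing left that adds new required topics
        chosen.append(best)
        uncovered -= remaining.pop(best)
    return chosen

required = {"dosage": 0.9, "anatomy": 0.6, "history": 0.3}
knowledge = {"dosage": 0.2, "anatomy": 0.9, "history": 0.5}
demo_topics = {
    "demo_a": {"anatomy"},
    "demo_b": {"dosage"},
    "demo_c": {"dosage", "history"},
}

selected = select_demonstrations(required, knowledge, demo_topics, k=2)
```

After `demo_c` is chosen, `demo_b` contributes nothing new, so the loop prefers `demo_a` even though the model already knows anatomy fairly well. This is the redundancy-avoidance behaviour that similarity-only retrieval lacks.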

[NLP-58] MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

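[Quick read]: This paper addresses the insufficient factual reliability of Large Language Models (LLMs) in healthcare, and in particular the narrow data domains of existing benchmarks, which fail to capture the complexity of real-world medical information. The key to the solution is MedFact, a new and challenging benchmark for Chinese medical fact-checking. It comprises 2,116 expert-annotated instances spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels, and is built with a hybrid AI-human framework in which iterative expert feedback refines an AI-driven, multi-criteria filtering process to ensure both quality and difficulty. The benchmark is used to evaluate 20 leading LLMs on veracity classification and error localization. The evaluation reveals a clear weakness in error localization and an "over-criticism" phenomenon, in which models misidentify correct information as erroneous; advanced reasoning techniques such as multi-agent collaboration and inference-time scaling make this worse.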

Link: https://arxiv.org/abs/2509.12440
Authors: Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent “over-criticism” phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.

[NLP-59] Small Models Big Results: Achieving Superior Intent Extraction through Decomposition

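[Quick read]: This paper addresses accurate understanding of user intent on resource-constrained devices, specifically from UI interaction trajectories. Massive datacenter-based multi-modal large language models (MLLMs) have greater capacity to handle such sequences, but smaller models that can run on-device, and thereby offer a privacy-preserving, low-cost, low-latency experience, struggle with accurate intent inference. The key to the solution is a decomposed approach: first, structured interaction summarization captures the key information from each user action; second, a fine-tuned model performs intent extraction on the aggregated summaries. This improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.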

Link: https://arxiv.org/abs/2509.12423
Authors: Danielle Cohen, Yoni Halpern, Noam Kahlon, Joel Oren, Omri Berkovitch, Sapir Caduri, Ido Dagan, Anatoly Efros
Affiliations: Google; Bar-Ilan University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.

[NLP-60] MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering

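[Quick read]: This paper addresses a core challenge in evaluating natural language generation (NLG) systems in the medical domain: traditional automatic metrics such as BLEU, ROUGE, and BERTScore often fail to distinguish between high-quality outputs, especially in open-ended medical question answering, where multiple valid answers may exist. The key to the solution is MORQA (Medical Open-Response QA), a multilingual benchmark whose datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for English and Chinese subsets. The authors compare traditional metrics with large language model (LLM)-based evaluators such as GPT-4 and Gemini and find that LLM-based approaches correlate significantly better with expert judgments, which they attribute to LLMs' sensitivity to semantic nuances and robustness to variability among reference answers.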

Link: https://arxiv.org/abs/2509.12405
Authors: Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia, Meliha Yetisgen
Affiliations: Microsoft Health AI; University of Washington
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 8 tables

Abstract:Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs’ sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research.

[NLP-61] SENTRA: Selected-Next-Token Transformer for LLM Text Detection EMNLP

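[Quick read]: This paper addresses the detection of LLM-generated text that is not explicitly declared as such. The key to the solution is SElected-Next-Token tRAnsformer (SENTRA), a novel, general-purpose, supervised LLM text detector. SENTRA is a Transformer-based encoder that takes selected-next-token-probability sequences as input and uses contrastive pre-training on large amounts of unlabeled data; it significantly outperforms popular baselines in the out-of-domain setting.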

Link: https://arxiv.org/abs/2509.12385
Authors: Mitchell Plyler, Yilun Zhang, Alexander Tuzhilin, Saoud Khalifah, Sen Tian
Affiliations: Mozilla Corporation; New York University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP Findings 2025

Abstract:LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse is also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.
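
One plausible reading of "selected-next-token-probability sequences" is: for each position in a text, the probability a frozen language model assigned to the token that actually comes next. The sketch below extracts such a sequence as log-probabilities from toy distributions. This interpretation, the function name, and the numbers are assumptions; the abstract does not define the feature precisely, and the detector's Transformer encoder is not shown.

```python
import math

def selected_next_token_logprobs(token_ids, next_token_probs):
    """Log-probability a scoring LM gave to each token that actually followed.

    token_ids: the observed token sequence.
    next_token_probs: next_token_probs[i] is a dict token_id -> probability for
        position i + 1, conditioned on token_ids[: i + 1] (toy stand-in for LM output).
    """
    return [
        math.log(next_token_probs[i][token_ids[i + 1]])
        for i in range(len(token_ids) - 1)
    ]

# Toy example with three tokens and two next-token distributions.
tokens = [7, 3, 9]
distributions = [{3: 0.5, 9: 0.5}, {9: 0.25, 1: 0.75}]

features = selected_next_token_logprobs(tokens, distributions)
```

A sequence like `features` is what a downstream encoder would consume in place of raw text, which may help explain the out-of-domain robustness the authors report: the detector sees how predictable the text was, not what it was about.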

[NLP-62] LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation RECSYS2025

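[Quick read]: This paper addresses the evaluation bottleneck that recommendation systems face with the rise of Generative AI, especially in high-stakes domains such as legal research, where traditional metrics miss nuanced quality differences. The core question is how to use Large Language Models (LLMs) reliably as judges (LLM-as-a-Judge) of Retrieval-Augmented Generation (RAG) systems. The solution has two parts. First, Gwet's AC2 and rank correlation coefficients replace traditional agreement metrics such as Krippendorff's alpha, which can be misleading under skewed distributions, as more robust indicators of alignment between LLM and human assessments. Second, the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides statistically sound comparisons between systems. Together these form a scalable, cost-effective automated evaluation framework that maintains the precision legal applications demand.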

Link: https://arxiv.org/abs/2509.12382
Authors: Anu Pradhan, Alexandra Ortan, Apurv Verma, Madhavan Seshadri
Affiliations: Bloomberg
Subjects: Computation and Language (cs.CL); ACM classes: H.3.3; I.2.7; I.2.6
Comments: Accepted in EARL 25: The 2nd Workshop on Evaluating and Applying Recommender Systems with Large Language Models at RecSys 2025

Abstract:The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff’s alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet’s AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.
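
When several systems are compared pairwise, each comparison yields a p-value, and the Benjamini-Hochberg procedure controls the false discovery rate across them. The sketch below implements only that correction step in plain Python. The p-values are made up and are assumed to come from paired Wilcoxon signed-rank tests run elsewhere; the paper's actual analysis code is not described in the abstract.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at false discovery rate q.

    Step-up rule: find the largest rank r (1-based, ascending p-values) with
    p_(r) <= r / m * q, then reject the r hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff = rank
    rejected = [False] * m
    for index in order[:cutoff]:
        rejected[index] = True
    return rejected

# Hypothetical p-values from four pairwise system comparisons.
mostly_null = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
# Step-up behaviour: the largest p-value passes its threshold, so all are rejected.
all_rejected = benjamini_hochberg([0.03, 0.04, 0.045, 0.049])
```

The second call shows why the procedure is called step-up: no early p-value clears its own threshold, yet the last one does, which carries the smaller ones with it.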

[NLP-63] MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables EMNLP2025

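[Quick read]: This paper addresses the evaluation of complex abstract reasoning and moral inference in Large Language Models (LLMs), skills that standard reading comprehension benchmarks, on which LLMs already excel, do not measure well. The key to the solution is MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. Its main task consists of multiple-choice questions targeting moral inference, with carefully crafted distractors that push models beyond shallow, extractive question answering; adversarial variants expose vulnerabilities and shortcuts caused by issues such as data contamination. Larger models outperform smaller ones, yet they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.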

Link: https://arxiv.org/abs/2509.12371
Authors: Matteo Marcuzzo, Alessandro Zangari, Andrea Albarelli, Jose Camacho-Collados, Mohammad Taher Pilehvar
Affiliations: Ca’ Foscari University of Venice; Cardiff University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 Main Conference

Abstract:As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.

[NLP-64] MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch

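[Quick read]: This paper addresses the underrepresentation of Dutch in embedding resources: Dutch typically makes up only a small fraction of published multilingual resources. The solution has three parts. First, the authors build the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which covers a wide range of tasks with both existing and newly created datasets. Second, they compile a training dataset from available Dutch retrieval datasets and complement it with synthetic data generated by large language models to expand task coverage beyond retrieval. Third, they release a series of compact yet efficient E5-NL embedding models that perform strongly across multiple tasks. All resources are publicly available through the Hugging Face Hub and the MTEB package.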

Link: https://arxiv.org/abs/2509.12340
Authors: Nikolay Banar, Ehsan Lotfi, Jens Van Nooten, Cristina Arhiliuc, Marija Kliocaite, Walter Daelemans
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recently, embedding resources, including models, benchmarks, and datasets, have been widely released to support a variety of languages. However, the Dutch language remains underrepresented, typically comprising only a small fraction of the published multilingual resources. To address this gap and encourage the further development of Dutch embeddings, we introduce new resources for their evaluation and generation. First, we introduce the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and newly created ones, covering a wide range of tasks. Second, we provide a training dataset compiled from available Dutch retrieval datasets, complemented with synthetic data generated by large language models to expand task coverage beyond retrieval. Finally, we release a series of E5-NL models compact yet efficient embedding models that demonstrate strong performance across multiple tasks. We make our resources publicly available through the Hugging Face Hub and the MTEB package.

[NLP-65] LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences

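[Quick Read]: This paper addresses two core problems in natural language-driven route planning. First, approaches that use a large language model (LLM) as an agent (LLM-as-Agent) struggle to handle large-scale map data. Second, graph-search strategies have limited ability to understand users' natural language preferences. In addition, existing methods adapt poorly to a global user population whose spatio-temporal distribution is highly heterogeneous and unpredictable. The key to the solution is a new LLM-Assisted route Planning (LLMAP) system. It uses the LLM as a parser (LLM-as-Parser) to understand natural language, identify tasks, and extract user preferences and task dependencies, and it pairs this with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for finding optimal routes. A multi-objective optimization scheme adaptively adjusts the objective weights to maximize point-of-interest (POI) quality and task completion rate while minimizing route distance, subject to user time limits, POI opening hours, and task dependencies, so that the system provides reliable and efficient route planning in complex scenarios.

Link: https://arxiv.org/abs/2509.12273
Authors: Liangqi Yuan,Dong-Jun Han,Christopher G. Brinton,Sabine Brunswicker
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: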

Abstract:The rise of large language models (LLMs) has made natural language-driven route planning an emerging research area that encompasses rich user objectives. Current research exhibits two distinct approaches: direct route planning using LLM-as-Agent and graph-based searching strategies. However, LLMs in the former approach struggle to handle extensive map data, while the latter shows limited capability in understanding natural language preferences. Additionally, a more critical challenge arises from the highly heterogeneous and unpredictable spatio-temporal distribution of users across the globe. In this paper, we introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an LLM-as-Parser to comprehend natural language, identify tasks, extract user preferences, and recognize task dependencies, coupled with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for optimal route finding. Our multi-objective optimization approach adaptively tunes objective weights to maximize points of interest (POI) quality and task completion rate while minimizing route distance, subject to three key constraints: user time limits, POI opening hours, and task dependencies. We conduct extensive experiments using 1,000 routing prompts sampled with varying complexity across 14 countries and 27 cities worldwide. The results demonstrate that our approach achieves superior performance with guarantees across multiple constraints.
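
The objective described in this abstract (maximize POI quality and task completion, minimize distance, subject to three constraints) can be illustrated with a small sketch. This is not the paper's MSGS solver: the field names, the fixed weights, and the candidate routes are hypothetical, and the paper tunes its weights adaptively, which this sketch does not attempt.

```python
def is_feasible(route, time_limit_min):
    # The three constraints named in the abstract, as hypothetical precomputed flags.
    return (
        route["duration_min"] <= time_limit_min
        and route["respects_opening_hours"]
        and route["respects_dependencies"]
    )


def route_score(route, weights):
    # Weighted sum: reward POI quality and completion rate, penalize distance.
    return (
        weights["quality"] * route["poi_quality"]
        + weights["completion"] * route["completion_rate"]
        - weights["distance"] * route["distance_km"]
    )


def best_route(routes, weights, time_limit_min):
    # Drop infeasible candidates first, then pick the highest-scoring one.
    feasible = [r for r in routes if is_feasible(r, time_limit_min)]
    if not feasible:
        return None
    return max(feasible, key=lambda r: route_score(r, weights))


# Hypothetical candidate routes, made up for illustration only.
candidates = [
    {"name": "A", "poi_quality": 4.5, "completion_rate": 1.0, "distance_km": 12.0,
     "duration_min": 170, "respects_opening_hours": True, "respects_dependencies": True},
    {"name": "B", "poi_quality": 4.9, "completion_rate": 1.0, "distance_km": 9.0,
     "duration_min": 200, "respects_opening_hours": True, "respects_dependencies": True},
    {"name": "C", "poi_quality": 4.0, "completion_rate": 0.5, "distance_km": 5.0,
     "duration_min": 90, "respects_opening_hours": True, "respects_dependencies": True},
]
weights = {"quality": 1.0, "completion": 5.0, "distance": 0.1}
chosen = best_route(candidates, weights, time_limit_min=180)
```

Route B has the best raw score but exceeds the 180-minute limit, so the sketch selects route A. Treating constraints as hard filters and objectives as a weighted sum is one common design; other formulations are possible.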

[NLP-66] Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics EMNLP2025

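[Quick Read]: This paper addresses the marked weakness of Large Multimodal Models (LMMs) in understanding multimodal humor and recognizing narrative sequences. The key to the solution is PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics for systematically evaluating how well LMMs integrate visual and textual cues for coherent narrative and humor understanding. Experiments show that the best current models reach only 61% accuracy on panel sequencing, far below human performance, which exposes a key shortcoming in multimodal contextual reasoning. PixelHumor provides a rigorous evaluation framework for moving LMMs toward more natural, socially aware interaction.

Link: https://arxiv.org/abs/2509.12248
Authors: Yuriel Ryan,Rui Yang Tan,Kenny Tsu Wei Choo,Roy Ka-Wei Lee
Affiliations: Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 27 pages, 8 figures, EMNLP 2025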

Abstract:Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.

[NLP-67] MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors

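[Quick Read]: This paper addresses a problem in the safety alignment of Large Language Models (LLMs): blanket refusal policies also block legitimate uses in high-stakes domains such as policing and defense. Earlier “refusal-direction” edits can bypass safety restrictions, but they rely on a single vector, offer no semantic control, and tend to leak across topics. The key to the solution is Mutually Exclusive Unlock Vectors (MEUV), which factorize the single refusal direction into multiple topic-aligned, nearly orthogonal vectors, each dedicated to activating one sensitive capability, enabling fine-grained, controllable capability release. MEUV is learned in a single epoch with a multi-task objective that combines a differential-ablation margin, a cross-topic penalty, and an orthogonality constraint. It reaches an attack success rate of at least 87% while cutting cross-topic leakage by up to 90%, and vectors trained in Chinese and English transfer well to each other, which suggests a language-agnostic refusal subspace.

Link: https://arxiv.org/abs/2509.12221
Authors: Xin Tong,Zhi Lin,Jingya Wang,Meng Han,Bo Jin
Affiliations: People’s Public Security University of China; Tsinghua University; Zhejiang University; The Third Research Institute of the Ministry of Public Security of China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Under Review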

Abstract:Large language models (LLMs) enforce safety alignment to reliably refuse malicious requests, yet the same blanket safeguards also block legitimate uses in policing, defense, and other high-stakes settings. Earlier “refusal-direction” edits can bypass those layers, but they rely on a single vector that indiscriminately unlocks all hazardous topics, offering no semantic control. We introduce Mutually Exclusive Unlock Vectors (MEUV), a lightweight framework that factorizes the monolithic refusal direction into topic-aligned, nearly orthogonal vectors, each dedicated to one sensitive capability. MEUV is learned in a single epoch with a multi-task objective that blends a differential-ablation margin, cross-topic and orthogonality penalties, and several auxiliary terms. On bilingual malicious-prompt benchmarks, MEUV achieves an attack success rate of no less than 87% on Gemma-2-2B, LLaMA-3-8B, and Qwen-7B, yet cuts cross-topic leakage by up to 90% compared with the best single-direction baseline. Vectors trained in Chinese transfer almost unchanged to English (and vice versa), suggesting a language-agnostic refusal subspace. The results show that fine-grained, topic-level capability activation is achievable with minimal utility loss, paving the way for controlled LLMs deployment in security-sensitive domains.
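
To make the phrase "nearly orthogonal vectors" concrete, here is a minimal sketch of one generic way to measure how far a set of vectors is from mutual orthogonality: the sum of squared pairwise cosine similarities. This is a textbook-style penalty chosen for illustration, not the loss term used in the paper, and the example vectors are made up.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(vectors):
    # Sum of squared cosine similarities over all distinct pairs.
    # The value is 0.0 exactly when every pair is orthogonal.
    penalty = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            penalty += cosine(vectors[i], vectors[j]) ** 2
    return penalty


# Hypothetical toy vectors: one orthogonal set and one overlapping pair.
orthogonal_set = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
overlapping_set = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

penalty_orthogonal = orthogonality_penalty(orthogonal_set)
penalty_overlapping = orthogonality_penalty(overlapping_set)
```

A penalty of this kind, when added to a training objective, pushes the vectors apart; the overlapping pair above sits at a 45-degree angle, so its squared cosine is 0.5.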

[NLP-68] Exact Coset Sampling for Quantum Lattice Algorithms

[Quick Read]: This paper studies the “domain-extension” problem in Step 9 of a recent windowed Quantum Fourier Transform (QFT) lattice algorithm, a step that fails because of a periodicity/support mismatch. The key to the solution is a pair-shift difference construction that coherently cancels all unknown offsets, produces an exact uniform Chinese Remainder Theorem (CRT) coset state over $\mathbb{Z}_P$, and then uses the QFT to enforce the intended modular linear relation. The construction is reversible, uses $\mathrm{poly}(\log M_2)$ gates, and preserves the asymptotics of the original algorithm.

Link: https://arxiv.org/abs/2509.12341
Authors: Yifan Zhang
Affiliations: Princeton University
Categories: Quantum Physics (quant-ph); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Project Page: this https URL

Abstract:We give a simple, fully correct, and assumption-light replacement for the contested “domain-extension” in Step 9 of a recent windowed-QFT lattice algorithm with complex-Gaussian windows \citep{chen2024quantum}. The published Step 9 suffers from a periodicity/support mismatch. We present a pair-shift difference construction that coherently cancels all unknown offsets, produces an exact uniform CRT-coset state over $\mathbb{Z}_P$, and then uses the QFT to enforce the intended modular linear relation. The unitary is reversible, uses $\mathrm{poly}(\log M_2)$ gates, and preserves the algorithm’s asymptotics. Project Page: this https URL.
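
As background for the "CRT" in "CRT-coset state", the following sketch shows the classical Chinese Remainder Theorem reconstruction only. It says nothing about the quantum construction in the paper; the function name and the example moduli are chosen here for illustration, and the moduli are assumed to be pairwise coprime.

```python
from math import prod


def crt(residues, moduli):
    # Return the unique x in [0, P), P = product of the moduli, such that
    # x % m == r for each (r, m) pair. Assumes pairwise coprime moduli.
    P = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = P // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % P


# Classic example: x = 2 (mod 3), x = 3 (mod 5), x = 2 (mod 7).
x = crt([2, 3, 2], [3, 5, 7])
```

The reconstructed value is unique modulo the product of the moduli, which is the sense in which a residue tuple identifies one element of $\mathbb{Z}_P$.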

Computer Vision

[CV-0] 3D Aware Region Prompted Vision Language Model WWW

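[Quick Read]: This paper addresses how to effectively combine single-view 2D images and multi-view 3D data for unified scene understanding, and in particular how to improve spatial reasoning without exhaustive multi-frame annotation. The core challenges are cross-modal alignment, flexible region prompting, and using 2D visual priors to make 3D spatial modeling more accurate. The key to the solution is the Spatial Region 3D (SR-3D) aware vision-language model, which connects 2D and 3D data through a shared visual token space and enriches 2D features with 3D positional embeddings. This lets the model relate objects spatially across views and reason accurately even when the objects of interest do not co-occur in the same frame.

Link: https://arxiv.org/abs/2509.13317
Authors: An-Chieh Cheng,Yang Fu,Yukang Chen,Zhijian Liu,Xiaolong Li,Subhashree Radhakrishnan,Song Han,Yao Lu,Jan Kautz,Pavlo Molchanov,Hongxu Yin,Xiaolong Wang,Sifei Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL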

Abstract:We present a Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space on scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.

[CV-1] StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance SIGGRAPH

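[Quick Read]: This paper addresses the difficulty of controlling style when generating 3D assets, and in particular how to produce high-quality 3D assets that carry the texture and geometry style of one or more style images, given a content image and no training data. The key to the solution is StyleSculptor, a training-free framework. Its core is the Style Disentangled Attention (SD-Attn) module, which uses a cross-3D attention mechanism to set up a dynamic interaction between the content image and the style image, giving stable feature fusion and fine-grained style guidance. A style-disentangled feature selection strategy uses the variance of 3D feature patches to separate style-significant from content-significant channels, so features can be injected selectively within the attention framework. The network can then compute texture-guided, geometry-guided, or jointly guided features, and the Style Guided Control (SGC) mechanism adds geometry-only or texture-only stylization with adjustable intensity, improving the fidelity and flexibility of the results.

Link: https://arxiv.org/abs/2509.13301
Authors: Zefan Qu,Zhenwei Wang,Haoyuan Wang,Ke Xu,Gerhard Hancke,Rynson W.H. Lau
Affiliations: City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH Asia 2025 Conference Paper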

Abstract:Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.

[CV-2] Image Realness Assessment and Localization with Multimodal Features

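[Quick Read]: This paper addresses how to reliably quantify the perceptual realness of AI-generated images and identify visually inconsistent regions in them, which matters both for trusting generated images in practical use and for improving training through realness feedback. The key to the solution is a multimodal framework that uses vision-language models trained on large datasets to generate textual descriptions of visual inconsistencies, which serve as reliable substitutes for human annotations. The framework performs objective assessment of overall image realness and localizes inconsistent regions, producing dense realness maps that effectively distinguish realistic from unrealistic spatial regions.

Link: https://arxiv.org/abs/2509.13289
Authors: Lovish Kaushik,Agnij Biswas,Somdyuti Paul
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: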

Abstract:A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.

[CV-3] RadGame: An AI-Powered Platform for Radiology Education

[Quick Read]: This paper addresses the delayed feedback and poor scalability of traditional radiology education, where learners rely on passive exposure to cases or on active practice with real-time input from supervising radiologists and rarely get immediate, personalized feedback at scale. The key to the solution is RadGame, an AI-powered gamified platform that combines public datasets with automated AI-driven feedback to train and assess two core skills: localizing abnormalities and generating reports. In RadGame Localize, vision-language models produce visual explanations for findings the user missed. In RadGame Report, radiology report generation metrics compare the user's report against a radiologist-written ground truth and yield performance and style scores, substantially improving learners' accuracy.

Link: https://arxiv.org/abs/2509.13270
Authors: Mohammed Baharoon,Siavash Raissi,John S. Jun,Thibault Heintz,Mahmoud Alabbad,Ali Alburkani,Sung Eun Kim,Kent Kleinschmidt,Abdulrahman O. Alhumaydhi,Mohannad Mohammed G. Alghamdi,Jeremy Francis Palacio,Mohammed Bukhaytan,Noah Michael Prudlo,Rithvik Akula,Brady Chrisler,Benjamin Galligos,Mohammed O. Almutairi,Mazeen Mohammed Alanazi,Nasser M. Alrashdi,Joel Jihwan Hwang,Sri Sai Dinesh Jaliparthi,Luke David Nelson,Nathaniel Nguyen,Sathvik Suryadevara,Steven Kim,Mohammed F. Mohammed,Yevgeniy R. Semenov,Kun-Hsing Yu,Abdulrhman Aljouie,Hassan AlOmaish,Adam Rodman,Pranav Rajpurkar
Affiliations: Harvard Medical School; Mass General Brigham; Maastricht University; King Abdulaziz Medical City; Seoul National University Hospital; Saint Louis University School of Medicine; King Saud bin Abdulaziz University for Health Sciences; Tufts University School of Medicine; King Faisal Specialist Hospital & Research Center; King Abdullah International Medical Research Center; Beth Israel Deaconess Medical Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for findings the user missed. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist’s written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.

[CV-4] ResidualViT for Efficient Temporally Dense Video Encoding

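[Quick Read]: This paper addresses the high cost of computing frame-level features for video understanding tasks that sample frames at high temporal resolution, that is, “temporally dense” tasks such as natural language temporal video grounding, temporal activity localization, and audio description generation. The key to the solution is ResidualViT, a vision transformer architecture with learnable residual connections that ensure temporal consistency across consecutive frames and a token reduction module that selectively discards temporally redundant information while reusing the weights of a pretrained foundation model. A lightweight distillation strategy approximates the frame-level features of the original foundation model. Across multiple tasks and datasets, the approach cuts computational cost by up to 60% and speeds up inference by up to 2.5x while staying close to the accuracy of the original model.

Link: https://arxiv.org/abs/2509.13255
Authors: Mattia Soldan,Fabian Caba Heilbron,Bernard Ghanem,Josef Sivic,Bryan Russell
Affiliations: KAUST; CIIRC CTU; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Comments: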

Abstract:Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require “temporally dense” reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.

[CV-5] Intelligent Vacuum Thermoforming Process

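[Quick Read]: This paper addresses inconsistent product quality in vacuum thermoforming caused by variations in material properties and tooling configurations. The key to the solution is a vision-based quality control system that predicts and optimises process parameters from a small amount of data. A comprehensive dataset is built from visual samples produced under different process parameters and extended with image augmentation to improve model training. A k-Nearest Neighbour algorithm then maps low-quality parts to their high-quality counterparts to determine how to adjust heating power, heating time, and vacuum time, reducing defects and improving production efficiency.

Link: https://arxiv.org/abs/2509.13250
Authors: Andi Kuswoyo,Christos Margadji,Sebastian W. Pattinson
Affiliations: University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Contains 6 figures in total, 15 pages. Under revision for Journal of Intelligent Manufacturing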

Abstract:Ensuring consistent quality in vacuum thermoforming presents challenges due to variations in material properties and tooling configurations. This research introduces a vision-based quality control system to predict and optimise process parameters, thereby enhancing part quality with minimal data requirements. A comprehensive dataset was developed using visual data from vacuum-formed samples subjected to various process parameters, supplemented by image augmentation techniques to improve model training. A k-Nearest Neighbour algorithm was subsequently employed to identify adjustments needed in process parameters by mapping low-quality parts to their high-quality counterparts. The model exhibited strong performance in adjusting heating power, heating time, and vacuum time to reduce defects and improve production efficiency.
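
The idea of mapping a low-quality part to its nearest high-quality counterparts can be sketched with a plain k-Nearest Neighbour lookup. Everything concrete below is an assumption made for illustration: the two-dimensional feature vectors, the parameter values, and the rule of averaging the neighbours' parameters are hypothetical and are not taken from the paper.

```python
import math


def nearest_neighbours(query, samples, k):
    # Rank reference samples by Euclidean distance to the query features.
    ranked = sorted(samples, key=lambda s: math.dist(query, s["features"]))
    return ranked[:k]


def suggest_adjustment(current_params, query_features, good_samples, k=2):
    # Average the parameters of the k nearest high-quality parts and
    # return the difference from the current settings.
    neighbours = nearest_neighbours(query_features, good_samples, k)
    adjustment = {}
    for name, value in current_params.items():
        target = sum(n["params"][name] for n in neighbours) / len(neighbours)
        adjustment[name] = target - value
    return adjustment


# Hypothetical high-quality reference parts (features would come from images).
good_samples = [
    {"features": [0.9, 0.1],
     "params": {"heating_power": 80.0, "heating_time": 30.0, "vacuum_time": 5.0}},
    {"features": [0.8, 0.2],
     "params": {"heating_power": 90.0, "heating_time": 40.0, "vacuum_time": 7.0}},
    {"features": [0.1, 0.9],
     "params": {"heating_power": 50.0, "heating_time": 10.0, "vacuum_time": 1.0}},
]

# Hypothetical low-quality part and the settings that produced it.
current = {"heating_power": 70.0, "heating_time": 25.0, "vacuum_time": 4.0}
adjustment = suggest_adjustment(current, [0.7, 0.3], good_samples, k=2)
```

With these made-up numbers the two closest reference parts are the first two, so the sketch suggests raising heating power by 15, heating time by 10, and vacuum time by 2 (units left unspecified).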

[CV-6] Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy

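[Quick Read]: This paper addresses the limited clinical trust and utility of current AI systems for Diabetic Retinopathy (DR) screening, whose output is mostly a binary referral decision. The key to the solution is to evaluate Multimodal Large Language Models (MLLMs) both on DR detection and as simulators of clinical AI assistance, and in particular to study how different output types (such as numeric predictions or descriptive text) affect clinician-AI collaboration. The study finds that MedGemma, a lightweight open-source medical model, outperforms the general-purpose GPT-4o at baseline and is more stable when given incorrect simulated AI inputs. GPT-4o, when guided by MedGemma's descriptive outputs, reaches an AUROC of up to 0.96 even without direct image access, which suggests that descriptive outputs can make AI assistance more explainable and more trusted by clinicians.

Link: https://arxiv.org/abs/2509.13234
Authors: Nadim Barakat,William Lotter
Affiliations: Dana-Farber Cancer Institute; Tufts University School of Medicine; Brigham and Women’s Hospital; Harvard Medical School
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: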

Abstract:Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o’s performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma’s descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.

[CV-7] Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation

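[Quick Read]: This paper addresses the high dimensionality and slow transmission of Hyperspectral Imaging (HSI) data, and in particular the need for lightweight models on satellite platforms that can process data onboard and avoid transmitting redundant data such as cloud-covered areas. The key to the solution is a Curriculum Multi-Task Self-Supervised Learning (CMTSSL) framework. It combines masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, and a curriculum learning strategy progressively increases data complexity during training, guiding the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. The design handles spatial and spectral reasoning within one computationally efficient architecture and improves lightweight models on downstream segmentation tasks, using architectures more than 16,000x lighter than some state-of-the-art models, which makes it suitable for onboard satellite deployment.

Link: https://arxiv.org/abs/2509.13229
Authors: Hugo Carlesso,Josiane Mothe,Radu Tudor Ionescu
Affiliations: Université de Toulouse; University of Bucharest
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: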

Abstract:Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g. cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at this https URL.

[CV-8] End4: End-to-end Denoising Diffusion for Diffusion-Based Inpainting Detection

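[Quick Read]: This paper addresses the difficulty of detecting images produced by diffusion-based inpainting, which existing methods fail to identify reliably even when similar inpainted images are included in their training data. The key to the solution is End4, a detection method based on end-to-end denoising diffusion. It designs a denoising reconstruction model that improves the alignment between the latent spaces of the reconstruction and detection processes, yielding features better suited to detection, and it adds a Scale-aware Pyramid-like Fusion Module (SPFM) that refines local features under the guidance of attention pyramid layers at different scales to make them more discriminative. The method generalizes to unseen masking patterns and remains robust under various perturbations.

Link: https://arxiv.org/abs/2509.13214
Authors: Fei Wang,Xuecheng Wu,Zheng Zhang,Danlei Huang,Yuheng Huang,Bo Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: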

Abstract:The powerful generative capabilities of diffusion models have significantly advanced the field of image synthesis, enhancing both full image generation and inpainting-based image editing. Despite their remarkable advancements, diffusion models also raise concerns about potential misuse for malicious purposes. However, existing approaches struggle to identify images generated by diffusion-based inpainting models, even when similar inpainted images are included in their training data. To address this challenge, we propose a novel detection method based on End-to-end denoising diffusion (End4). Specifically, End4 designs a denoising reconstruction model to improve the alignment degree between the latent spaces of the reconstruction and detection processes, thus reconstructing features that are more conducive to detection. Meanwhile, it leverages a Scale-aware Pyramid-like Fusion Module (SPFM) that refines local image features under the guidance of attention pyramid layers at different scales, enhancing feature discriminability. Additionally, to evaluate detection performance on inpainted images, we establish a comprehensive benchmark comprising images generated from five distinct masked regions. Extensive experiments demonstrate that our End4 effectively generalizes to unseen masking patterns and remains robust under various perturbations. Our code and dataset will be released soon.

[CV-9] Vi-SAFE: A Spatial-Temporal Framework for Efficient Violence Detection in Public Surveillance

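[Quick Read]: This paper addresses the challenges of violence detection in public surveillance: small targets, complex environments, and the need for real-time temporal analysis. The key to the solution is Vi-SAFE, a spatial-temporal framework that combines an optimized YOLOv8 object detector with a Temporal Segment Network (TSN) for temporal classification. YOLOv8 uses GhostNetV3 as a lightweight backbone, an exponential moving average (EMA) attention mechanism, and pruning to reduce computational cost while maintaining accuracy, and TSN performs binary classification of violent behavior on the extracted human regions. With the two models trained separately, the system reaches an accuracy of 0.88 on the RWF-2000 dataset, well above TSN alone (0.77).

Link: https://arxiv.org/abs/2509.13210
Authors: Ligang Chang,Shengkai Xu,Liangchang Shen,Binhan Xu,Junqiao Wang,Tianyu Shi,Yanhui Du
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: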

Abstract:Violence detection in public surveillance is critical for public safety. This study addresses challenges such as small-scale targets, complex environments, and real-time temporal analysis. We propose Vi-SAFE, a spatial-temporal framework that integrates an enhanced YOLOv8 with a Temporal Segment Network (TSN) for video surveillance. The YOLOv8 model is optimized with GhostNetV3 as a lightweight backbone, an exponential moving average (EMA) attention mechanism, and pruning to reduce computational cost while maintaining accuracy. YOLOv8 and TSN are trained separately on pedestrian and violence datasets, where YOLOv8 extracts human regions and TSN performs binary classification of violent behavior. Experiments on the RWF-2000 dataset show that Vi-SAFE achieves an accuracy of 0.88, surpassing TSN alone (0.77) and outperforming existing methods in both accuracy and efficiency, demonstrating its effectiveness for public safety surveillance. Code is available at this https URL.

[CV-10] Road Obstacle Video Segmentation

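[Quick Read]: This paper addresses the temporal inconsistency of current road-obstacle segmentation methods, which operate on individual frames and so produce predictions that are not coherent across consecutive video frames, undermining safe navigation in autonomous driving. The key to the solution is to treat road-obstacle segmentation as an inherently temporal task. The authors curate and adapt four evaluation benchmarks for road-obstacle video segmentation, evaluate 11 state-of-the-art image- and video-based segmentation methods on them, and introduce two strong baselines based on vision foundation models, which set a new state of the art for long-range video sequences.

Link: https://arxiv.org/abs/2509.13181
Authors: Shyam Nandan Rai,Shyamgopal Karthik,Mariana-Iuliana Georgescu,Barbara Caputo,Carlo Masone,Zeynep Akata
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GCPR 2025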

Abstract:With the growing deployment of autonomous driving agents, the detection and segmentation of road obstacles have become critical to ensure safe autonomous navigation. However, existing road-obstacle segmentation methods are applied on individual frames, overlooking the temporal nature of the problem, leading to inconsistent prediction maps between consecutive frames. In this work, we demonstrate that the road-obstacle segmentation task is inherently temporal, since the segmentation maps for consecutive frames are strongly correlated. To address this, we curate and adapt four evaluation benchmarks for road-obstacle video segmentation and evaluate 11 state-of-the-art image- and video-based segmentation methods on these benchmarks. Moreover, we introduce two strong baseline methods based on vision foundation models. Our approach establishes a new state-of-the-art in road-obstacle video segmentation for long-range video sequences, providing valuable insights and direction for future research.

[CV-11] More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era MICCAI2025

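[Quick Read]: This paper addresses the scarcity and high annotation cost of training data for vision-language alignment in medical imaging, which makes large-scale supervised pre-training hard to realize. The key to the solution is to use Large Language Models (LLMs) to automatically extract diagnostic labels from radiology reports, building a high-quality “silver-standard” dataset at low cost (96% AUC in the experiments, and about $3 per 50k CT image-report pairs). The approach needs no complex prompt engineering, and vision encoders trained on this dataset perform comparably to those trained on labels extracted by specialized BERT-based models. The authors further find that such supervised pre-training fundamentally improves vision-language alignment, achieving state-of-the-art results with only a 3D ResNet-18 and vanilla CLIP training (for example, 83.8% AUC for zero-shot diagnosis on CT-RATE and MAP@50 = 53.7% for image-image retrieval).

Link: https://arxiv.org/abs/2509.13175
Authors: Yingtai Li,Haoran Lai,Xiaoqian Zhou,Shuai Ming,Wenxin Ma,Wei Wei,Shaohua Kevin Zhou
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025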

Abstract:The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale “silver-standard” datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that vision encoders trained on this “silver-standard” dataset achieve performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing the access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at this https URL.

[CV-12] WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory

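[Quick Read]: This paper targets a bottleneck in fine-grained management of urban street tree assets: traditional manual surveys are inefficient, and existing datasets are small in scale, sparsely annotated, and single-modality, so they cannot support multi-task analysis and intelligent decision-making. The core contribution is WHU-STree, a cross-city, multi-modal, richly annotated street tree dataset that combines synchronized point clouds with high-resolution images, covers 50 species and 2 morphological parameters, and supports more than 10 street tree inventory tasks. The experiments show the potential of multi-modal data fusion to improve model performance and underline cross-domain generalization as a prerequisite for practical deployment. The dataset lays groundwork for future research on multi-modal fusion, multi-task collaboration, spatial pattern learning, and multi-modal large language models for street tree asset management.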

Link: https://arxiv.org/abs/2509.13172
Authors: Ruifei Ding, Zhe Chen, Wen Fan, Chen Long, Huijuan Xiao, Yelu Zeng, Zhen Dong, Bisheng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scene, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging the unique characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks–tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline potential future works for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Model for street tree asset management. The WHU-STree dataset is accessible at: this https URL.

[CV-13] Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)

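[Quick Read]: This paper addresses the hallucinations and inaccuracies that video language models produce during comprehensive video reasoning because the spatio-temporal information within a single video is incomplete. The proposed solution is a multi-video collaborative framework. A Video Structuring Module represents each video's knowledge as a spatio-temporal graph; a Graph Fusion Module fuses valuable information from related videos into augmented graph node tokens; and a carefully designed multi-video structured prompt integrates graph, visual, and textual tokens as input, improving the reasoning performance of the large language model.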

Link: https://arxiv.org/abs/2509.13161
Authors: Zhihao He, Tianyao He, Tieyuan Chen, Yun Xu, Huabin Liu, Chaofan Gan, Gui Zou, Weiyao Lin
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the prosperity of the video language model, the current pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video’s knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models.

[CV-14] TexTAR: Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images ICDAR2025

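[Quick Read]: This paper addresses the limited computational efficiency and poor adaptability of Textual Attribute Recognition (TAR) in noisy, multilingual settings. Existing methods struggle to recognize attributes such as bold, italic, underline, and strikeout in complex scenes, even though these attributes are essential for understanding document semantics, structure, and visual presentation. The key contribution is TexTAR, whose main innovations are a novel data selection pipeline that enhances context awareness and a 2D RoPE (Rotary Positional Embedding)-style mechanism that incorporates input context for more accurate attribute prediction. The authors also build MMTAD, a multilingual, multi-domain dataset annotated across real-world documents such as legal records, notices, and textbooks, providing a benchmark for training and evaluation.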

Link: https://arxiv.org/abs/2509.13151
Authors: Rohan Kumar, Jyothi Swaroopa Jinka, Ravi Kiran Sarvadevabhatla
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICDAR 2025 (Oral)

Abstract:Recognizing textual attributes such as bold, italic, underline and strikeout is essential for understanding text semantics, structure, and visual presentation. These attributes highlight key information, making them crucial for document analysis. Existing methods struggle with computational efficiency or adaptability in noisy, multilingual settings. To address this, we introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR). Our novel data selection pipeline enhances context awareness, and our architecture employs a 2D RoPE (Rotary Positional Embedding)-style mechanism to incorporate input context for more accurate attribute predictions. We also introduce MMTAD, a diverse, multilingual, multi-domain dataset annotated with text attributes across real-world documents such as legal records, notices, and textbooks. Extensive evaluations show TexTAR outperforms existing methods, demonstrating that contextual awareness contributes to state-of-the-art TAR performance.
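
The abstract names a "2D RoPE-style mechanism" without giving details. As a rough illustration of the general idea only, the sketch below applies ordinary rotary embeddings to the first half of a feature vector using an x coordinate and to the second half using a y coordinate. The half-and-half split, the base of 10000, and all function names are assumptions; TexTAR's actual design may differ.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    # Rotate consecutive feature pairs by an angle proportional to the position.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def rope_2d(vec, x, y):
    # Assumed layout: first half encodes the x coordinate, second half the y
    # coordinate. The vector length must be a multiple of 4.
    half = len(vec) // 2
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

The useful property is that the dot product between a rotated query and a rotated key depends only on their relative offset in x and y, so attention scores become translation-invariant over the page.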

[CV-15] MSDNet: Efficient 4D Radar Super-Resolution via Multi-Stage Distillation

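[Quick Read]: This paper addresses the high training cost, high inference latency, and poor generalization of existing 4D radar point cloud super-resolution methods, which make it hard to balance reconstruction accuracy against computational efficiency. The proposed solution is MSDNet, a multi-stage knowledge distillation framework with two feature distillation stages. The first stage performs reconstruction-guided feature distillation, aligning and densifying the student's features through feature reconstruction. The second stage performs diffusion-guided feature distillation, treating the stage-one output as a noisy version of the teacher's features and refining it with a lightweight diffusion network. A noise adapter adaptively aligns the feature noise level with a predefined diffusion timestep, yielding both high-quality reconstruction and low-latency inference.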

Link: https://arxiv.org/abs/2509.13149
Authors: Minqing Huang, Shouyi Lu, Boyuan Zheng, Ziyao Li, Xiao Tang, Guirong Zhuo
Affiliations: Tongji University (同济大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures

Abstract:4D radar super-resolution, which aims to reconstruct sparse and noisy point clouds into dense and geometrically consistent representations, is a foundational problem in autonomous perception. However, existing methods often suffer from high training cost or rely on complex diffusion-based sampling, resulting in high inference latency and poor generalization, making it difficult to balance accuracy and efficiency. To address these limitations, we propose MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features to achieve both high reconstruction quality and computational efficiency. The first stage performs reconstruction-guided feature distillation, aligning and densifying the student’s features through feature reconstruction. In the second stage, we propose diffusion-guided feature distillation, which treats the stage-one distilled features as a noisy version of the teacher’s representations and refines them via a lightweight diffusion network. Furthermore, we introduce a noise adapter that adaptively aligns the noise level of the feature with a predefined diffusion timestep, enabling a more precise denoising. Extensive experiments on the VoD and in-house datasets demonstrate that MSDNet achieves both high-fidelity reconstruction and low-latency inference in the task of 4D radar point cloud super-resolution, and consistently improves performance on downstream tasks. The code will be publicly available upon publication.

[CV-16] Advancing Real-World Parking Slot Detection with Large-Scale Dataset and Semi-Supervised Baseline

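[Quick Read]: This paper aims to improve parking slot detection accuracy in automatic parking systems, addressing three problems: existing datasets are limited in scale, their scenes rarely contain real-world noise (such as lighting changes and occlusion), and manual annotation is costly and error-prone. The solution is to build CRPS-D, a large-scale and diverse parking slot detection dataset, and to propose SS-PSD, a semi-supervised method based on the teacher-student model. SS-PSD uses confidence-guided mask consistency and adaptive feature perturbation to exploit unlabeled data. Experiments show that SS-PSD outperforms state-of-the-art (SoTA) methods on both the proposed dataset and an existing one, and that the gains grow as more unlabeled data is added.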

Link: https://arxiv.org/abs/2509.13133
Authors: Zhihao Zhang, Chunyu Lin, Lang Nie, Jiyuan Wang, Yao Zhao
Affiliations: Beijing Jiaotong University (北京交通大学); Visual Intelligence +X International Cooperation Joint Laboratory of MOE (教育部视觉智能+x国际合作联合实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE Transactions on Intelligent Transportation Systems (T-ITS)

Abstract:As automatic parking systems evolve, the accurate detection of parking slots has become increasingly critical. This study focuses on parking slot detection using surround-view cameras, which offer a comprehensive bird’s-eye view of the parking environment. However, the current datasets are limited in scale, and the scenes they contain are seldom disrupted by real-world noise (e.g., light, occlusion, etc.). Moreover, manual data annotation is prone to errors and omissions due to the complexity of real-world conditions, significantly increasing the cost of annotating large-scale datasets. To address these issues, we first construct a large-scale parking slot detection dataset (named CRPS-D), which includes various lighting distributions, diverse weather conditions, and challenging parking slot variants. Compared with existing datasets, the proposed dataset boasts the largest data scale and consists of a higher density of parking slots, particularly featuring more slanted parking slots. Additionally, we develop a semi-supervised baseline for parking slot detection, termed SS-PSD, to further improve performance by exploiting unlabeled data. To our knowledge, this is the first semi-supervised approach in parking slot detection, which is built on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Experimental results demonstrate the superiority of SS-PSD over the existing state-of-the-art (SoTA) solutions on both the proposed dataset and the existing dataset. Particularly, the more unlabeled data there is, the more significant the gains brought by our semi-supervised scheme. The relevant source codes and the dataset have been made publicly available at this https URL.
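
The abstract describes "confidence-guided mask consistency" in a teacher-student setup but does not define it. One plausible reading, sketched here purely as an assumption, is that the student is only pulled toward the teacher at locations where the teacher is confident. The function name, the 0.9 threshold, and the squared-error form are hypothetical and are not taken from SS-PSD.

```python
def masked_consistency(teacher_probs, student_probs, threshold=0.9):
    # teacher_probs / student_probs: per-location foreground probabilities.
    # A location contributes only if the teacher is confident either way.
    total, count = 0.0, 0
    for t, s in zip(teacher_probs, student_probs):
        if max(t, 1.0 - t) >= threshold:
            total += (t - s) ** 2
            count += 1
    # With no confident locations there is nothing to enforce.
    return total / count if count else 0.0
```

Masking out low-confidence locations keeps noisy teacher predictions on unlabeled images from dominating the training signal.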

[CV-17] Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving CVPR2023

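[Quick Read]: This paper addresses motion prediction in dynamic environments for autonomous driving, specifically how to achieve class-agnostic motion prediction with less reliance on annotation. The core problem is that conventional supervised learning depends on large amounts of fine-grained motion annotations, which are expensive and hard to obtain. The key idea is a new paradigm combining weak supervision and self-supervision: fully or partially annotated foreground/background masks replace motion labels as supervision, and non-ground/ground masks are introduced as an even cheaper alternative signal, sharply reducing annotation requirements. The authors also design a Robust Consistency-aware Chamfer Distance loss that combines multi-frame information with robust penalty functions to suppress outliers in self-supervised learning. Experiments show that, with very few or no annotations, the method outperforms existing self-supervised approaches and even rivals some supervised models.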

Link: https://arxiv.org/abs/2509.13116
Authors: Ruibo Li, Hanyu Shi, Zhe Wang, Guosheng Lin
Affiliations: Nanyang Technological University (南洋理工大学); SenseTime Research (商汤科技)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: An extension of our CVPR 2023 paper, “Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving,” accepted for publication in TPAMI

Abstract:Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones. This demonstrates that our approaches effectively balance annotation effort and performance.
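
The loss named in the abstract builds on the Chamfer distance between point sets. The sketch below shows the plain symmetric Chamfer distance and how a robust penalty can damp outliers. The Geman-McClure penalty is swapped in here only as a familiar example of a robust function; the paper's actual penalty and its multi-frame consistency terms are not reproduced.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def one_sided(src, dst, penalty):
    # Average penalised squared distance from each source point
    # to its nearest neighbour in the destination set.
    return sum(penalty(min(squared_distance(p, q) for q in dst))
               for p in src) / len(src)

def chamfer(a, b, penalty=lambda d2: d2):
    # Symmetric Chamfer distance; the default penalty is the identity.
    return one_sided(a, b, penalty) + one_sided(b, a, penalty)

def geman_mcclure(d2, scale=1.0):
    # Illustrative robust penalty: grows like d2 near zero, saturates at 1.
    return d2 / (d2 + scale)
```

With `a = [(0, 0, 0), (1, 0, 0)]` and `b` equal to `a` plus one stray point at `(10, 0, 0)`, the plain distance is dominated by the stray point, while the robust variant caps its contribution.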

[CV-18] Hierarchical Deep Fusion Framework for Multi-dimensional Facial Forgery Detection - The 2024 Global Deepfake Image Detection Challenge

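[Quick Read]: This paper addresses the challenges that increasingly sophisticated deepfake technology poses to digital security and authenticity verification, in particular the weak generalization of forgery detectors across many image manipulation techniques. The proposed solution is the Hierarchical Deep Fusion Framework (HDFF), which integrates four pre-trained sub-models (Swin-MLP, CoAtNet, EfficientNetV2, and DaViT), fine-tunes them in multiple stages, concatenates their specialized feature representations, and trains a final classifier on top. This combines the strengths of the different architectures and improves recognition in complex facial forgery scenarios.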

Link: https://arxiv.org/abs/2509.13107
Authors: Kohou Wang, Huan Hu, Xiang Liu, Zezhou Chen, Ping Chen, Zhaoxiang Liu, Shiguo Lian
Affiliations: AI Innovation Center, China Unicom (中国联通人工智能创新中心)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 2024 Global Deepfake Image Detection Challenge Top20 Reward, 5 pages

Abstract:The proliferation of sophisticated deepfake technology poses significant challenges to digital security and authenticity. Detecting these forgeries, especially across a wide spectrum of manipulation techniques, requires robust and generalized models. This paper introduces the Hierarchical Deep Fusion Framework (HDFF), an ensemble-based deep learning architecture designed for high-performance facial forgery detection. Our framework integrates four diverse pre-trained sub-models, Swin-MLP, CoAtNet, EfficientNetV2, and DaViT, which are meticulously fine-tuned through a multi-stage process on the MultiFFDI dataset. By concatenating the feature representations from these specialized models and training a final classifier layer, HDFF effectively leverages their collective strengths. This approach achieved a final score of 0.96852 on the competition’s private leaderboard, securing the 20th position out of 184 teams, demonstrating the efficacy of hierarchical fusion for complex image classification tasks.
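
The fusion step described in the abstract, concatenating sub-model features and training a final classifier layer, can be sketched in a few lines. The feature vectors, weights, and function names below are placeholders for illustration; the real HDFF operates on features from the four fine-tuned backbones.

```python
import math

def fuse(feature_vectors):
    # Concatenate the per-model feature vectors into one long vector.
    return [x for features in feature_vectors for x in features]

def classify(fused, weights, bias):
    # Final linear layer followed by a sigmoid, giving P(forged).
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Concatenation keeps each backbone's representation intact, so the final layer can learn how much weight each model's features deserve.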

[CV-19] A Synthetic Data Pipeline for Supporting Manufacturing SMEs in Visual Assembly Control

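[Quick Read]: This paper addresses the resource constraints that small- and medium-sized enterprises (SMEs) face in assembly quality control, where the high cost of image acquisition, annotation, and training of computer vision algorithms makes automated assembly control hard to adopt. The proposed solution is a data-efficient, easily integrable visual assembly control pipeline that combines simulated scene generation driven by computer-aided design (CAD) data with object detection algorithms, greatly reducing dependence on real-world data. Experiments report a mean Average Precision (mAP@0.5:0.95) of up to 99.5% on the synthetic training data and up to 93% on real camera-captured test data, supporting the effectiveness and transferability of synthetic data for assembly quality inspection in manufacturing.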

Link: https://arxiv.org/abs/2509.13089
Authors: Jonas Werheid, Shengjie He, Aymen Gannouni, Anas Abdelrazeq, Robert H. Schmitt
Affiliations: RWTH Aachen University (亚琛工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Quality control of assembly processes is essential in manufacturing to ensure not only the quality of individual components but also their proper integration into the final product. To assist in this matter, automated assembly control using computer vision methods has been widely implemented. However, the costs associated with image acquisition, annotation, and training of computer vision algorithms pose challenges for integration, especially for small- and medium-sized enterprises (SMEs), which often lack the resources for extensive training, data collection, and manual image annotation. Synthetic data offers the potential to reduce manual data collection and labeling. Nevertheless, its practical application in the context of assembly quality remains limited. In this work, we present a novel approach for easily integrable and data-efficient visual assembly control. Our approach leverages simulated scene generation based on computer-aided design (CAD) data and object detection algorithms. The results demonstrate a time-saving pipeline for generating image data in manufacturing environments, achieving a mean Average Precision (mAP@0.5:0.95) up to 99.5% for correctly identifying instances of synthetic planetary gear system components within our simulated training data, and up to 93% when transferred to real-world camera-captured testing data. This research highlights the effectiveness of synthetic data generation within an adaptable pipeline and underscores its potential to support SMEs in implementing resource-efficient visual assembly control solutions.
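
The metric mAP@0.5:0.95 averages Average Precision over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below covers only the building blocks, box IoU and the threshold sweep, rather than a full mAP evaluation; the box format `(x1, y1, x2, y2)` and the helper names are assumptions.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def matched_thresholds(pred, gt):
    # Thresholds at which this prediction would count as a true positive.
    overlap = iou(pred, gt)
    return [t for t in THRESHOLDS if overlap >= t]
```

A prediction with IoU 0.78 counts as correct at the six thresholds up to 0.75 and as a miss at the remaining four, which is why this metric rewards tight localization far more than mAP@0.5 alone.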

[CV-20] Enhancing Dual Network Based Semi-Supervised Medical Image Segmentation with Uncertainty-Guided Pseudo-Labeling

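[Quick Read]: This paper addresses the impracticality of relying on large amounts of labeled data for 3D medical image segmentation, along with two weaknesses of existing semi-supervised methods: noisy pseudo-labels and insufficient supervision in the feature space. The proposed solution is a new semi-supervised framework built on a dual-network architecture. First, a Cross Consistency Enhancement module combines cross pseudo supervision with entropy-filtered supervision to reduce pseudo-label noise. Second, an uncertainty-aware dynamic weighting strategy based on Kullback-Leibler divergence adaptively adjusts the contribution of pseudo-labels. Third, a self-supervised contrastive learning mechanism aligns uncertain voxel features with reliable class prototypes, separating trustworthy from uncertain predictions and reducing prediction uncertainty.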

Link: https://arxiv.org/abs/2509.13084
Authors: Yunyao Lu, Yihang Wu, Ahmad Chaddad, Tareef Daqqaq, Reem Kateb
Affiliations: Guilin University of Electronic Technology (桂林电子科技大学); École de Technologie Supérieure (高级技术学院); Taibah University (塔伊巴大学); Jeddah University (杰达大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Knowledge-Based Systems

Abstract:Despite the remarkable performance of supervised medical image segmentation models, relying on a large amount of labeled data is impractical in real-world situations. Semi-supervised learning approaches aim to alleviate this challenge using unlabeled data through pseudo-label generation. Yet, existing semi-supervised segmentation methods still suffer from noisy pseudo-labels and insufficient supervision within the feature space. To solve these challenges, this paper proposes a novel semi-supervised 3D medical image segmentation framework based on a dual-network architecture. Specifically, we investigate a Cross Consistency Enhancement module using both cross pseudo and entropy-filtered supervision to reduce the noisy pseudo-labels, while we design a dynamic weighting strategy to adjust the contributions of pseudo-labels using an uncertainty-aware mechanism (i.e., Kullback-Leibler divergence). In addition, we use a self-supervised contrastive learning mechanism to align uncertain voxel features with reliable class prototypes by effectively differentiating between trustworthy and uncertain predictions, thus reducing prediction uncertainty. Extensive experiments are conducted on three 3D segmentation datasets, Left Atrial, NIH Pancreas and BraTS-2019. The proposed approach consistently exhibits superior performance across various settings (e.g., 89.95% Dice score on left Atrial with 10% labeled data) compared to the state-of-the-art methods. Furthermore, the usefulness of the proposed modules is further validated via ablation experiments.
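
Two ingredients from the abstract, entropy filtering and KL-based weighting of pseudo-labels, can be illustrated for a single voxel. This is a sketch under stated assumptions: the entropy cut-off of 0.5 and the `exp(-KL)` weighting are hypothetical choices, and the paper's exact formulation is not given in the abstract.

```python
import math

def entropy(p):
    # Shannon entropy (in nats) of one voxel's class distribution.
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between the two networks' predictions for one voxel.
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def pseudo_label_weight(p_a, p_b, max_entropy=0.5):
    # Drop the pseudo-label if network A is too uncertain (entropy filter);
    # otherwise down-weight it by how much the two networks disagree.
    if entropy(p_a) > max_entropy:
        return 0.0
    return math.exp(-kl_divergence(p_a, p_b))
```

A confident prediction that both networks agree on receives weight 1.0, a confident prediction they disagree on receives a smaller weight, and an uncertain prediction is filtered out entirely.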

[CV-21] Using KL-Divergence to Focus Frequency Information in Low-Light Image Enhancement

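[Quick Read]: This paper addresses a weakness of conventional Fourier-domain image enhancement: pixel-wise loss functions such as mean squared error (MSE) focus excessively on local information and lose global structural information. The key solution is a distribution-aware loss in the frequency domain that directly models the amplitude and phase spectra and minimizes a closed-form KL-Divergence objective, improving how well the model aligns Fourier-domain information. KL-Divergence is also embedded in the VGG feature space to strengthen the perceptual loss and improve structural fidelity. The overall architecture is a U-shaped deep network that integrates cross-attention and gating mechanisms for frequency-aware enhancement.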

Link: https://arxiv.org/abs/2509.13083
Authors: Yan Xingyang, Huang Xiaohong, Zhang Zhao, You Tian, Xu Ziheng
Affiliations: Hebei Key Laboratory of Industrial Intelligent Perception (河北省工业智能感知重点实验室); College of Artificial Intelligence (人工智能学院); North China University of Science and Technology (华北理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the Fourier domain, luminance information is primarily encoded in the amplitude spectrum, while spatial structures are captured in the phase components. The traditional Fourier Frequency information fitting employs pixel-wise loss functions, which tend to focus excessively on local information and may lead to global information loss. In this paper, we present LLFDisc, a U-shaped deep enhancement network that integrates cross-attention and gating mechanisms tailored for frequency-aware enhancement. We propose a novel distribution-aware loss that directly fits the Fourier-domain information and minimizes their divergence using a closed-form KL-Divergence objective. This enables the model to align Fourier-domain information more robustly than with conventional MSE-based losses. Furthermore, we enhance the perceptual loss based on VGG by embedding KL-Divergence on extracted deep features, enabling better structural fidelity. Extensive experiments across multiple benchmarks demonstrate that LLFDisc achieves state-of-the-art performance in both qualitative and quantitative evaluations. Our code will be released at: this https URL
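
The abstract mentions a "closed-form KL-Divergence objective" without stating which distribution family is used. The best-known closed form is the KL divergence between two Gaussians, so the sketch below assumes, purely for illustration, that each set of Fourier-domain values is summarised by a univariate Gaussian. Both that assumption and the function names are hypothetical.

```python
import math

def fit_gaussian(samples):
    # Summarise a list of values by its mean and (population) variance.
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

def gaussian_kl(mean_p, var_p, mean_q, var_q, eps=1e-8):
    # Closed form: KL(N(mean_p, var_p) || N(mean_q, var_q)).
    # eps guards against zero variance.
    var_p, var_q = var_p + eps, var_q + eps
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mean_p - mean_q) ** 2) / var_q
                  - 1.0)
```

Unlike a pixel-wise MSE, this compares summary statistics of the whole spectrum, which is one way a loss can emphasise global rather than local agreement.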

[CV-22] TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation

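[Quick Read]: This paper addresses mislocalized or incompletely segmented targets in Referring Image Segmentation (RIS) caused by multimodal misalignment and loss of language semantics, a problem that is most pronounced in complex scenes containing multiple visually similar objects. The proposed solution is TFANet, a three-stage image-text feature alignment network that strengthens cross-modal alignment through a hierarchical architecture. The first stage uses the Multiscale Linear Cross-Attention Module (MLAM) for bidirectional semantic exchange between visual and linguistic features across scales. The second stage introduces the Cross-modal Feature Scanning Module (CFSM) to capture long-range dependencies and build a unified multimodal representation. The third stage proposes the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced during the earlier alignment stages, improving segmentation accuracy and robustness in complex scenes.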

Link: https://arxiv.org/abs/2509.13070
Authors: Qianqi Lu, Yuxiang Xie, Jing Zhang, Shiwei Zou, Yan Chen, Xidao Luan
Affiliations: National University of Defense Technology (国防科技大学); Changsha University of Science and Technology (长沙理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages.

[CV-23] HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models

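[Quick Read]: This paper addresses the heavy computational and memory overhead that High-Resolution Large Vision-Language Models (HR-LVLMs) incur when processing local tiles, which sharply increases the number of visual tokens. The proposed solution is HERO, a framework that combines content-adaptive token budget allocation with function-aware token selection. Without any training, it estimates tile-level importance and retains visual tokens that play complementary roles, achieving a better trade-off between inference efficiency and accuracy.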

Link: https://arxiv.org/abs/2509.13067
Authors: Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
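
Content-adaptive budget allocation, as described in the abstract, means that more important tiles keep more visual tokens. The sketch below splits a total token budget across tiles in proportion to given importance scores and then keeps the highest-scoring tokens within a tile. How HERO actually estimates importance and selects tokens is not reproduced; the proportional rule, the rounding scheme, and the function names are assumptions.

```python
def allocate_budget(importance, total):
    # Split `total` tokens across tiles proportionally to positive importance
    # scores, using largest-remainder rounding so the budgets sum to `total`.
    scale = total / sum(importance)
    raw = [w * scale for w in importance]
    budget = [int(r) for r in raw]
    leftover = total - sum(budget)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - budget[i], reverse=True)
    for i in by_remainder[:leftover]:
        budget[i] += 1
    return budget

def keep_top_tokens(scores, k):
    # Indices of the k highest-scoring tokens, returned in original order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Returning the kept indices in their original order preserves the spatial sequence of the surviving tokens, which matters when they are fed to the language model.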

[CV-24] Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models

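[Quick Read]: This paper addresses the limited performance of Vision-Language Models (VLMs) on complex tasks caused by insufficient perception and reasoning capabilities. Existing Reinforcement Learning (RL) methods are transplanted directly from Large Language Models (LLMs) and overlook the fact that a VLM must first perceive and understand its visual input accurately before it can reason effectively, so training results are suboptimal. The key solution is a two-stage reinforcement learning framework: the first stage improves coarse- and fine-grained visual perception, and the second stage strengthens reasoning. To mitigate the vanishing advantage issue common in RL training, the authors first apply dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. The resulting vision-language model is named PeBR-R1.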

Link: https://arxiv.org/abs/2509.13031
Authors: Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang
Affiliations: Tsinghua University (清华大学); Alibaba Group (阿里巴巴集团)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model’s visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
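
The "vanishing advantage issue" can be illustrated with group-normalised advantages, where each sampled response is scored relative to the other responses for the same prompt. The abstract does not say which RL algorithm is used, so the group-normalisation formula below is an assumption chosen because it makes the effect easy to see.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    # Advantage of each response = reward standardised within its group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every response to a prompt earns the same reward, because the prompt is either too easy or too hard, all advantages are zero and the prompt contributes no gradient. Choosing training data where outcomes vary is one way to keep the learning signal alive.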

[CV-25] Dream3DAvatar: Text-Controlled 3D Avatar Reconstruction from a Single Image

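[Quick Read]: This paper addresses the geometric and texture ambiguity involved in generating a high-quality, editable full-body 3D avatar from a single image, especially the difficulty of controlling occluded regions. The core challenge is that monocular input carries limited information, which makes the reconstruction ambiguous. The proposed solution is Dream3DAvatar, an efficient, text-controllable two-stage framework. The first stage is a lightweight multi-view generation model: a Pose-Adapter injects SMPL-X renderings and skeletal information, and ID-Adapter-G injects high-resolution facial features, enforcing cross-view geometric consistency and identity preservation, while BLIP2 generates textual descriptions of the multi-view images to strengthen text guidance in occluded regions. The second stage is a feedforward Transformer model with a multi-view feature fusion module that reconstructs high-fidelity 3D Gaussian Splatting (3DGS) representations from the generated images, and ID-Adapter-R uses a gating mechanism to fuse facial features and improve recovery of high-frequency detail. The method produces animation-ready 3D avatars without post-processing and outperforms existing baselines on multiple metrics.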

Link: https://arxiv.org/abs/2509.13013
Authors: Gaofeng Liu, Hengsen Li, Ruoyu Gao, Xuetong Li, Zhiyuan Ma, Tao Fang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid advancement of 3D representation techniques and generative models, substantial progress has been made in reconstructing full-body 3D avatars from a single image. However, this task remains fundamentally ill-posed due to the limited information available from monocular input, making it difficult to control the geometry and texture of occluded regions during generation. To address these challenges, we redesign the reconstruction pipeline and propose Dream3DAvatar, an efficient and text-controllable two-stage framework for 3D avatar generation. In the first stage, we develop a lightweight, adapter-enhanced multi-view generation model. Specifically, we introduce the Pose-Adapter to inject SMPL-X renderings and skeletal information into SDXL, enforcing geometric and pose consistency across views. To preserve facial identity, we incorporate ID-Adapter-G, which injects high-resolution facial features into the generation process. Additionally, we leverage BLIP2 to generate high-quality textual descriptions of the multi-view images, enhancing text-driven controllability in occluded regions. In the second stage, we design a feedforward Transformer model equipped with a multi-view feature fusion module to reconstruct high-fidelity 3D Gaussian Splat representations (3DGS) from the generated images. Furthermore, we introduce ID-Adapter-R, which utilizes a gating mechanism to effectively fuse facial features into the reconstruction process, improving high-frequency detail recovery. Extensive experiments demonstrate that our method can generate realistic, animation-ready 3D avatars without any post-processing and consistently outperforms existing baselines across multiple evaluation metrics.

[CV-26] Drone Detection Using a Low-Power Neuromorphic Virtual Tripwire

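[Quick Read]: This paper addresses the growing threat that small drones pose to military personnel and civilian infrastructure, where the core challenge is early, automated, low-power detection. The solution is a detection system built entirely on spiking neural networks and neuromorphic cameras (event cameras) and deployed on a neuromorphic chip, giving a fully neuromorphic architecture. Multiple detection units form a virtual tripwire that detects when and where drones enter a restricted zone. The system is several orders of magnitude more energy efficient than a reference solution on an edge GPU, can run for over a year on battery power, and suits deployment in contested areas or locations without power infrastructure.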

Link: https://arxiv.org/abs/2509.12997
Authors: Anton Eldeborg Lundin, Rasmus Winzell, Hanna Hamrell, David Gustafsson, Hannes Ovrén
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Small drones are an increasing threat to both military personnel and civilian infrastructure, making early and automated detection crucial. In this work we develop a system that uses spiking neural networks and neuromorphic cameras (event cameras) to detect drones. The detection model is deployed on a neuromorphic chip making this a fully neuromorphic system. Multiple detection units can be deployed to create a virtual tripwire which detects when and where drones enter a restricted zone. We show that our neuromorphic solution is several orders of magnitude more energy efficient than a reference solution deployed on an edge GPU, allowing the system to run for over a year on battery power. We investigate how synthetically generated data can be used for training, and show that our model most likely relies on the shape of the drone rather than the temporal characteristics of its propellers. The small size and low power consumption allow easy deployment in contested areas or locations that lack power infrastructure.

[CV-27] Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection

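[Quick read]: This paper addresses the severe performance drop of specialized AI-generated image detectors in real-world (in-the-wild) settings, in particular their high false-negative rates, which prevent them from catching forged images in practice. The key to the solution is to abandon bespoke detector design and instead use a simple linear classifier on a modern Vision Foundation Model (VFM) as the baseline. With identical training data, this VFM-based linear classifier clearly outperforms existing specialized detectors, raising in-the-wild accuracy by more than 20%. Its advantage stems from the VFM's improved ability to align synthetic images with forgery-related concepts (e.g., 'AI-generated'), which may originate from data exposure during pre-training: on new data collected after the VFM's pre-training cut-off date, performance drops sharply. This indicates that evaluating true generalization requires test data that is independent of the model's entire training history, including pre-training.

Link: https://arxiv.org/abs/2509.12995
Authors: Yue Zhou,Xinan He,Kaiqing Lin,Bing Fan,Feng Ding,Jinhua Zeng,Bin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: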

Abstract:While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on 'in-the-wild' benchmarks. Instead of crafting another specialized 'knife' for this problem, we bring a 'gun' to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively 'outguns' bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20%. Our analysis pinpoints the source of the VFM's 'firepower': First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., 'AI-generated'), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM's pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world 'gunfight' of AI-generated image detection, the raw 'firepower' of an updated VFM is far more effective than the 'craftsmanship' of a static detector. 2) True generalization evaluation requires test data to be independent of the model's entire training history, including pre-training.

[CV-28] Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection

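[Quick read]: This paper addresses the problem of determining from egocentric video whether a user performs an action incorrectly, focusing on mistakes that are subtle and infrequent. The key to the solution is a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) model. In the first stage, features are extracted with a frozen ViViT model and a LoRA-tuned ViViT model and combined by a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to improve calibration and generalization; their predictions are fused by a classification-level expert module. The method performs strongly on rare and ambiguous mistake instances.

Link: https://arxiv.org/abs/2509.12990
Authors: Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Sicong Li,Qingming Huang
Affiliations: Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Institute of Information Engineering, CAS; School of Cyber Security, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: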

Abstract:In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at this https URL.
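
As a rough illustration of the first of the three objectives (reweighted cross-entropy), the sketch below weights each class inversely to its frequency. The inverse-frequency rule, the function names, and the toy labels are assumptions made for illustration; the abstract does not state the exact weighting DR-MoE uses.

```python
import math

def inverse_frequency_weights(labels, num_classes):
    """Assumed rule: weight class c by N / (K * count_c), so rare classes count more."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]

def reweighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over a batch."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

# Toy batch: 8 "correct" clips (class 0) and 2 "mistake" clips (class 1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = inverse_frequency_weights(labels, num_classes=2)
```

With these weights, each rare "mistake" clip contributes four times as much to the loss as a "correct" clip, which is the general effect reweighting is meant to have under class imbalance.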

[CV-29] PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era

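[Quick read]: This paper addresses the limits of traditional pinhole vision in the completeness of environmental perception and the reliability of decision-making, and promotes omnidirectional vision for the embodied AI era. The core question is how to build a general-purpose omnidirectional vision system with comprehensive environmental awareness that can support complex tasks. The key to the solution is an ideal panoramic system architecture, PANORAMA, which consists of four key subsystems. It integrates recent breakthroughs in omnidirectional generation, perception, understanding, and datasets, draws on insights from academia and industry, and lays out directions for future research, including open challenges and a roadmap.

Link: https://arxiv.org/abs/2509.12989
Authors: Xu Zheng,Chenfei Liao,Ziqiao Weng,Kaiyu Lei,Zihao Dongfang,Haocong He,Yuanhuiyi Lyu,Lutao Jiang,Lu Qi,Li Chen,Danda Pani Paudel,Kailun Yang,Linfeng Zhang,Luc Van Gool,Xuming Hu
Affiliations: 1. Tsinghua University; 2. KU Leuven; 3. Zhejiang University; 4. Peking University; 5. University of California, Berkeley; 6. Google; 7. Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper presents a draft overview of the emerging field of omnidirectional vision in the context of embodied AI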

Abstract:Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.

[CV-30] Improving Accuracy and Efficiency of Implicit Neural Representations: Making SIREN a WINNER

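[Quick read]: This paper studies why sinusoidal representation networks (SIRENs), when not appropriately initialized, struggle to fit signals outside their frequency support. In particular, when the target spectrum is misaligned with the network's frequency support, a 'spectral bottleneck' appears: the output collapses to near zero and the model fails to recover even frequency components within its capacity. The key to the solution is WINNER (Weight Initialization with Noise for Neural Representations), which perturbs the uniformly initialized weights of a base SIREN with Gaussian noise whose scale is determined adaptively from the spectral centroid of the target signal. This mitigates 'spectral bias' without adding trainable parameters, markedly improves audio fitting, and outperforms the base SIREN on image and 3D shape fitting.

Link: https://arxiv.org/abs/2509.12980
Authors: Hemanth Chandravamsi,Dhanush V. Shenoy,Steven H. Frankel
Affiliations: Technion - Israel Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: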

Abstract:We identify and address a fundamental limitation of sinusoidal representation networks (SIRENs), a class of implicit neural representations. SIRENs (Sitzmann et al., 2020), when not initialized appropriately, can struggle to fit signals that fall outside their frequency support. In extreme cases, when the network’s frequency support misaligns with the target spectrum, a ‘spectral bottleneck’ phenomenon is observed, where the model yields a near-zero output and fails to recover even the frequency components that are within its representational capacity. To overcome this, we propose WINNER - Weight Initialization with Noise for Neural Representations. WINNER perturbs the uniformly initialized weights of a base SIREN with Gaussian noise whose scales are adaptively determined by the spectral centroid of the target signal. Similar to random Fourier embeddings, this mitigates ‘spectral bias’ but without introducing additional trainable parameters. Our method achieves state-of-the-art audio fitting and significant gains in image and 3D shape fitting tasks over base SIREN. Beyond signal fitting, WINNER suggests new avenues in adaptive, target-aware initialization strategies for optimizing deep neural network training. For code and data visit this http URL.
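
The two ingredients named in the abstract, a spectral centroid of the target signal and Gaussian noise added to uniformly initialized weights, can be sketched as follows. This is only a minimal illustration under stated assumptions: the normalized centroid definition, the linear mapping `0.1 * centroid` from centroid to noise scale, the `fan_in` value, and all function names are hypothetical and are not taken from the WINNER paper.

```python
import cmath
import math
import random

def spectral_centroid(signal):
    """Magnitude-weighted mean frequency bin (naive DFT), normalized to [0, 1]."""
    n = len(signal)
    half = n // 2
    mags = []
    for k in range(half + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        mags.append(abs(s))
    total = sum(mags)
    if total == 0 or half == 0:
        return 0.0
    return sum(k * m for k, m in enumerate(mags)) / (total * half)

def perturb_weights(weights, noise_scale, rng):
    """Add zero-mean Gaussian noise with standard deviation noise_scale to each weight."""
    return [[w + rng.gauss(0.0, noise_scale) for w in row] for row in weights]

# Toy target: a pure tone at bin 8 of a 64-sample signal.
n = 64
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
centroid = spectral_centroid(tone)

rng = random.Random(0)
fan_in = 2  # hypothetical layer width
base = [[rng.uniform(-1 / fan_in, 1 / fan_in) for _ in range(fan_in)] for _ in range(4)]
noise_scale = 0.1 * centroid  # assumed mapping, for illustration only
noisy = perturb_weights(base, noise_scale, rng)
```

The point of the sketch is the data flow: the target signal is inspected once before training, and the result only changes the initial weights, so no trainable parameters are added.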

[CV-31] SHREC 2025: Protein surface shape retrieval including electrostatic potential

[Quick read]: This paper addresses protein surface shape retrieval: efficiently and accurately matching molecular surfaces with similar geometry in a large dataset of protein surfaces. The key finding is that using the electrostatic potential as a descriptor complementary to molecular surface shape clearly improves retrieval performance, including for classes with limited training data, which supports combining multiple molecular surface descriptors.

Link: https://arxiv.org/abs/2509.12976
Authors: Taher Yacoub,Camille Depenveiller,Atsushi Tatsuma,Tin Barisin,Eugen Rusakov,Udo Gobel,Yuxu Peng,Shiqiang Deng,Yuki Kagaya,Joon Hong Park,Daisuke Kihara,Marco Guerra,Giorgio Palmieri,Andrea Ranieri,Ulderico Fugacci,Silvia Biasotti,Ruiwen He,Halim Benhabiles,Adnane Cabani,Karim Hammoudi,Haotian Li,Hao Huang,Chunyan Li,Alireza Tehrani,Fanwang Meng,Farnaz Heidar-Zadeh,Tuan-Anh Yang,Matthieu Montes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Biomolecules (q-bio.BM)
Comments: Published in Computers & Graphics, Elsevier. 59 pages, 12 figures

Abstract:This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.

[CV-32] ICDAR 2025 Competition on FEw-Shot Text line segmentation of ancient handwritten documents (FEST)

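[Quick read]: This paper addresses text line segmentation in images of historical handwritten documents. The core challenges are irregular handwriting, faded ink, complex layouts, and non-linear text flow; because high-quality annotated data is scarce, conventional fully supervised methods are impractical. The key to the solution is the Few-Shot Text Line Segmentation of Ancient Handwritten Documents (FEST) competition, in which participants train with only three annotated images per manuscript. This encourages few-shot learning methods that yield robust, adaptable text line segmentation systems for historical documents, reduces the annotation burden on humanities scholars, and fosters wider use of automated document analysis tools in historical research.

Link: https://arxiv.org/abs/2509.12965
Authors: Silvia Zottin,Axel De Nardin,Giuseppe Branca,Claudio Piciarelli,Gian Luca Foresti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICDAR 2025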

Abstract:Text line segmentation is a critical step in handwritten document image analysis. Segmenting text lines in historical handwritten documents, however, presents unique challenges due to irregular handwriting, faded ink, and complex layouts with overlapping lines and non-linear text flow. Furthermore, the scarcity of large annotated datasets renders fully supervised learning approaches impractical for such materials. To address these challenges, we introduce the Few-Shot Text Line Segmentation of Ancient Handwritten Documents (FEST) Competition. Participants are tasked with developing systems capable of segmenting text lines in the U-DIADS-TL dataset, using only three annotated images per manuscript for training. The competition dataset features a diverse collection of ancient manuscripts exhibiting a wide range of layouts, degradation levels, and non-standard formatting, closely reflecting real-world conditions. By emphasizing few-shot learning, the FEST competition aims to promote the development of robust and adaptable methods that can be employed by humanities scholars with minimal manual annotation effort, thus fostering broader adoption of automated document analysis tools in historical research.

[CV-33] MMMS: Multi-Modal Multi-Surface Interactive Segmentation

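[Quick read]: This paper addresses Multi-Modal Multi-Surface interactive segmentation (MMMS): segmenting, from user clicks, multiple entangled or adjacent surfaces that are present in the same image. The core challenge is that conventional methods struggle with multi-object separation and interactive response time in complex scenes. The key to the solution is a new network architecture that fuses the RGB image, non-RGB modalities (e.g., depth or infrared), and user clicks. It uses the RGB feature extractor only as a black box, without access to its internals, and injects the interaction information after image feature extraction and multi-modal fusion, which improves segmentation accuracy and lowers the user interaction cost measured by NoC@90. Experiments show that adding modalities reduces the clicks needed per surface by up to 1.28 on average on DeLiVER, and that the RGB-only baseline remains competitive in the classical single-mask interactive segmentation task.

Link: https://arxiv.org/abs/2509.12963
Authors: Robin Schön,Julian Lorenz,Katja Ludwig,Daniel Kienzle,Rainer Lienhart
Affiliations: University of Augsburg
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 11 figures, 10 pages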

Abstract:In this paper, we present a method to interactively create segmentation masks on the basis of user clicks. We pay particular attention to the segmentation of multiple surfaces that are simultaneously present in the same image. Since these surfaces may be heavily entangled and adjacent, we also present a novel extended evaluation metric that accounts for the challenges of this scenario. Additionally, the presented method is able to use multi-modal inputs to facilitate the segmentation task. At the center of this method is a network architecture which takes as input an RGB image, a number of non-RGB modalities, an erroneous mask, and encoded clicks. Based on this input, the network predicts an improved segmentation mask. We design our architecture such that it adheres to two conditions: (1) The RGB backbone is only available as a black-box. (2) To reduce the response time, we want our model to integrate the interaction-specific information after the image feature extraction and the multi-modal fusion. We refer to the overall task as Multi-Modal Multi-Surface interactive segmentation (MMMS). We are able to show the effectiveness of our multi-modal fusion strategy. Using additional modalities, our system reduces the NoC@90 by up to 1.28 clicks per surface on average on DeLiVER and up to 1.19 on MFNet. On top of this, we are able to show that our RGB-only baseline achieves competitive, and in some cases even superior performance when tested in a classical, single-mask interactive segmentation scenario.
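
For readers unfamiliar with NoC@90, the sketch below shows the basic single-mask form of the metric: the number of clicks needed until the IoU first reaches 0.90. The cap of 20 clicks when the threshold is never reached is a common convention and an assumption here, as are the function name and toy IoU curves; the paper's extended multi-surface metric is not reproduced.

```python
def noc(iou_after_each_click, threshold=0.90, max_clicks=20):
    """Number of clicks until IoU >= threshold; max_clicks if it is never reached."""
    for clicks, iou in enumerate(iou_after_each_click[:max_clicks], start=1):
        if iou >= threshold:
            return clicks
    return max_clicks

# Toy IoU curves for three surfaces; the last one never reaches the threshold.
per_surface = [[0.5, 0.8, 0.92], [0.95], [0.3, 0.4]]
mean_noc = sum(noc(curve) for curve in per_surface) / len(per_surface)
```

A reported reduction of up to 1.28 clicks per surface would correspond to a lower value of an average like `mean_noc` when additional modalities are supplied.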

[CV-34] Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain

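[Quick read]: This paper addresses the training bottleneck for event cameras combined with spiking neural networks (SNNs) that is caused by scarce event data and sparse DVS outputs, as well as the challenge that the large distribution gap between RGB and DVS (modality shift) poses for knowledge transfer. The key to the solution is Time-step Mixup knowledge transfer (TMKT), a fine-grained mixing strategy that interpolates RGB and DVS inputs at different time-steps and exploits the asynchronous nature of SNNs to fuse knowledge across modalities. Modality-aware auxiliary learning objectives support label mixing and strengthen the model's discrimination across modalities, giving smoother knowledge transfer and better spiking image classification performance.

Link: https://arxiv.org/abs/2509.12959
Authors: Yuqi Xie,Shuhan Ye,Chong Wang,Jiazhen Xu,Le Shen,Yuanbin Qian,Jiangbo Qian
Affiliations: 1. Tsinghua University; 2. Peking University; 3. Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: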

Abstract:The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior work has attempted to transfer semantic knowledge from RGB datasets to DVS, they often overlook the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model’s ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.
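
One way to picture mixing at the time-step level is sketched below: an SNN input is a sequence of T frames, and some time-steps are taken from the RGB sample while the rest come from the DVS sample. The single random switch point, the mixing ratio `lam`, and the function name are assumptions for illustration only; the abstract does not specify how TMKT chooses time-steps or how it mixes labels.

```python
import random

def time_step_mixup(rgb_steps, dvs_steps, rng):
    """Use RGB frames for the first `switch` time-steps and DVS frames afterwards.

    Returns the mixed sequence and lam, the fraction of time-steps taken from RGB.
    """
    T = len(dvs_steps)
    assert len(rgb_steps) == T, "both inputs must cover the same number of time-steps"
    switch = rng.randint(0, T)  # inclusive on both ends
    mixed = rgb_steps[:switch] + dvs_steps[switch:]
    lam = switch / T
    return mixed, lam

# Placeholder frames; real inputs would be image tensors per time-step.
rgb_steps = ["rgb"] * 4
dvs_steps = ["dvs"] * 4
mixed, lam = time_step_mixup(rgb_steps, dvs_steps, random.Random(0))
```

A ratio such as `lam` could then weight modality-related targets, in the spirit of the label mixing the abstract mentions.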

[CV-35] Sy-FAR: Symmetry-based Fair Adversarial Robustness

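[Quick read]: This paper addresses unfair robustness in security-critical machine-learning systems (such as face recognition) under adversarial-example attacks: some classes or groups are easier to attack from than others, so robustness is uneven across classes. Existing methods try to improve robustness while achieving fairness between classes, but the ideal balance is hard to reach in realistic settings, especially when some classes are inherently very similar. The key idea is to use symmetry as the fairness criterion: the attack success rate from class i to class j should be close to that from j to i, which better matches the symmetric nature of class resemblance. The authors design Sy-FAR, which explicitly encourages this symmetry while optimizing adversarial robustness, and prove theoretically that symmetry between individuals carries over to symmetry between sub-groups. Experiments show that Sy-FAR significantly outperforms state-of-the-art methods across several datasets and model architectures, is faster and more consistent across runs, and also mitigates another kind of unfairness identified in this work: target classes that adversarial examples tend to be classified into become significantly less vulnerable once symmetry is induced.

Link: https://arxiv.org/abs/2509.12939
Authors: Haneen Najjar,Eyal Ronen,Mahmood Sharif
Affiliations: Tel Aviv University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 11 figures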

Abstract:Security-critical machine-learning (ML) systems, such as face-recognition systems, are susceptible to adversarial examples, including real-world physically realizable attacks. Various means to boost ML’s adversarial robustness have been proposed; however, they typically induce unfair robustness: It is often easier to attack from certain classes or groups than from others. Several techniques have been developed to improve adversarial robustness while seeking perfect fairness between classes. Yet, prior work has focused on settings where security and fairness are less critical. Our insight is that achieving perfect parity in realistic fairness-critical tasks, such as face recognition, is often infeasible – some classes may be highly similar, leading to more misclassifications between them. Instead, we suggest that seeking symmetry – i.e., attacks from class i to j would be as successful as from j to i – is more tractable. Intuitively, symmetry is desirable because class resemblance is a symmetric relation in most domains. Additionally, as we prove theoretically, symmetry between individuals induces symmetry between any set of sub-groups, in contrast to other fairness notions where group-fairness is often elusive. We develop Sy-FAR, a technique to encourage symmetry while also optimizing adversarial robustness and extensively evaluate it using five datasets, with three model architectures, including against targeted and untargeted realistic attacks. The results show Sy-FAR significantly improves fair adversarial robustness compared to state-of-the-art methods. Moreover, we find that Sy-FAR is faster and more consistent across runs. Notably, Sy-FAR also ameliorates another type of unfairness we discover in this work – target classes that adversarial examples are likely to be classified into become significantly less vulnerable after inducing symmetry.
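
The symmetry property itself is easy to measure. The sketch below computes the mean absolute gap between the attack success rate from class i to j and from j to i, given a matrix of success rates. It only measures the property described in the abstract; it is not the Sy-FAR training objective, and the function name and toy matrix are hypothetical.

```python
def asymmetry(success):
    """Mean |success[i][j] - success[j][i]| over all unordered class pairs."""
    k = len(success)
    gaps = [abs(success[i][j] - success[j][i])
            for i in range(k) for j in range(i + 1, k)]
    return sum(gaps) / len(gaps)

# success[i][j]: fraction of attacks from class i that are classified as class j.
success = [
    [0.0, 0.4, 0.1],
    [0.2, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
score = asymmetry(success)
```

A value of zero means every pair of classes is equally easy to attack in both directions, even if some pairs remain easier to confuse than others, which is the distinction the abstract draws between symmetry and perfect parity.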

[CV-36] Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings

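[Quick read]: This paper addresses the limitations of 3D Gaussian Splatting (3DGS) for 3D scene understanding: its inherent fuzziness makes fine-grained 3D semantic understanding difficult and restricts applications in augmented reality (AR), virtual reality (VR), and robotics. Existing methods learn semantics by distilling 2D foundation models, but alpha blending averages semantics across objects, so true 3D-level understanding is out of reach. The key to the solution is to abandon semantic learning through differentiable rendering and instead use predecomposed object-level Gaussians, aggregating multiview CLIP features into a 'bag of embeddings' for each object. This enables two core capabilities: open-vocabulary object retrieval by comparing text queries with object-level embeddings, and task adaptation by propagating object IDs to pixels (for 2D segmentation) or to Gaussians (for 3D extraction).

Link: https://arxiv.org/abs/2509.12938
Authors: Abdalla Arafa,Didier Stricker
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: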

Abstract:Novel view synthesis has seen significant advancements with 3D Gaussian Splatting (3DGS), enabling real-time photorealistic rendering. However, the inherent fuzziness of Gaussian Splatting presents challenges for 3D scene understanding, restricting its broader applications in AR/VR and robotics. While recent works attempt to learn semantics via 2D foundation model distillation, they inherit fundamental limitations: alpha blending averages semantics across objects, making 3D-level understanding impossible. We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation, creating comprehensive “bags of embeddings” that holistically describe objects. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction. Experiments demonstrate that our method effectively overcomes the challenges of 3D open-vocabulary object extraction while remaining comparable to state-of-the-art performance in 2D open-vocabulary segmentation, ensuring minimal compromise.
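
Object-level retrieval with a bag of embeddings can be sketched as below: each object keeps several per-view embeddings, and a text query is matched against whole objects rather than individual Gaussians. Scoring an object by its best-matching view (a max over cosine similarities), the function names, and the 2D toy vectors are assumptions for illustration; the abstract does not say how the bag is compared to the query, and real embeddings would come from CLIP.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(text_emb, bags):
    """Score each object by its best-matching view embedding; return the top object."""
    scores = {obj_id: max(cosine(text_emb, e) for e in embs)
              for obj_id, embs in bags.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy bags: two objects, each with two per-view embeddings.
bags = {
    "object_0": [[1.0, 0.0], [0.9, 0.1]],
    "object_1": [[0.0, 1.0], [0.6, 0.8]],
}
best, scores = retrieve([0.0, 1.0], bags)
```

Once an object ID is selected this way, it could be propagated to the pixels or Gaussians that belong to that object, as the abstract describes.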

[CV-37] 4DRadar-GS: Self-Supervised Dynamic Driving Scene Reconstruction with 4D Radar

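[Quick read]: This paper addresses incomplete or distorted reconstruction of dynamic objects by self-supervised 3D reconstruction methods in dynamic driving scenes, caused by imprecise motion estimation and weak temporal consistency. The key to the solution is the 4DRadar-GS framework, whose core contributions are: (1) a Gaussian initialization scheme that uses the velocity and spatial information of 4D radar to segment dynamic objects and recover monocular depth scale, producing accurate Gaussian point representations; and (2) a Velocity-guided PointTrack (VGPT) model, trained jointly with the reconstruction pipeline under scene flow supervision, that tracks fine-grained dynamic trajectories and builds temporally consistent representations.

Link: https://arxiv.org/abs/2509.12931
Authors: Xiao Tang,Guirong Zhuo,Cong Wang,Boyuan Zheng,Minqing Huang,Lianqing Zheng,Long Chen,Shouyi Lu
Affiliations: Tongji University; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: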

Abstract:3D reconstruction and novel view synthesis are critical for validating autonomous driving systems and training advanced perception models. Recent self-supervised methods have gained significant attention due to their cost-effectiveness and enhanced generalization in scenarios where annotated bounding boxes are unavailable. However, existing approaches, which often rely on frequency-domain decoupling or optical flow, struggle to accurately reconstruct dynamic objects due to imprecise motion estimation and weak temporal consistency, resulting in incomplete or distorted representations of dynamic scene elements. To address these challenges, we propose 4DRadar-GS, a 4D Radar-augmented self-supervised 3D reconstruction framework tailored for dynamic driving scenes. Specifically, we first present a 4D Radar-assisted Gaussian initialization scheme that leverages 4D Radar’s velocity and spatial information to segment dynamic objects and recover monocular depth scale, generating accurate Gaussian point representations. In addition, we propose a Velocity-guided PointTrack (VGPT) model, which is jointly trained with the reconstruction pipeline under scene flow supervision, to track fine-grained dynamic trajectories and construct temporally consistent representations. Evaluated on the OmniHD-Scenes dataset, 4DRadar-GS achieves state-of-the-art performance in dynamic driving scene 3D reconstruction.

[CV-38] HLSMAC: A New StarCraft Multi-Agent Challenge for High-Level Strategic Decision-Making

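[Quick read]: This paper addresses the lack of comprehensive measures of high-level strategic decision-making in current multi-agent reinforcement learning (MARL) evaluation. Existing benchmarks such as SMAC focus mainly on micromanagement and cannot adequately assess agents on higher-level strategy such as tactical coordination, timing, and deception. The key to the solution is HLSMAC, a new cooperative MARL benchmark of 12 StarCraft II scenarios designed around classical stratagems from the Thirty-Six Stratagems. Each scenario corresponds to a specific stratagem, and new metrics beyond conventional win rate (such as ability utilization and advancement efficiency) allow a systematic evaluation of agents' high-level strategic decision-making.

Link: https://arxiv.org/abs/2509.12927
Authors: Xingxing Hong,Yungong Wang,Dexin Jin,Ye Yuan,Ximing Huang,Zijian Wu,Wenxin Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 30 pages, 13 figures with appendix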

Abstract:Benchmarks are crucial for assessing multi-agent reinforcement learning (MARL) algorithms. While StarCraft II-related environments have driven significant advances in MARL, existing benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high-level strategic intelligence. To address this, we introduce HLSMAC, a new cooperative MARL benchmark with 12 carefully designed StarCraft II scenarios based on classical stratagems from the Thirty-Six Stratagems. Each scenario corresponds to a specific stratagem and is designed to challenge agents with diverse strategic elements, including tactical maneuvering, timing coordination, and deception, thereby opening up avenues for evaluating high-level strategic decision-making capabilities. We also propose novel metrics across multiple dimensions beyond conventional win rate, such as ability utilization and advancement efficiency, to assess agents’ overall performance within the HLSMAC environment. We integrate state-of-the-art MARL algorithms and LLM-based agents with our benchmark and conduct comprehensive experiments. The results demonstrate that HLSMAC serves as a robust testbed for advancing multi-agent strategic decision-making.

[CV-39] MATTER: Multiscale Attention for Registration Error Regression

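[Quick read]: This paper addresses point cloud registration (PCR) quality validation: detecting and quantifying registration error precisely rather than assigning it to a few classes. Existing methods all treat validation as a classification problem, which gives only a coarse quality assessment. The key innovation is to use regression for PCR validation, giving a continuous, fine-grained estimate of registration error. Multiscale feature extraction with attention-based aggregation makes the model more robust to point clouds with heterogeneous spatial densities and improves error estimation accuracy on diverse datasets. In a downstream mapping task, using the estimates to choose which frames to re-register improves mapping quality.

Link: https://arxiv.org/abs/2509.12924
Authors: Shipeng Liu,Ziliang Xiong,Khac-Hoang Ngo,Per-Erik Forssén
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: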

Abstract:Point cloud registration (PCR) is crucial for many downstream tasks, such as simultaneous localization and mapping (SLAM) and object tracking. This makes detecting and quantifying registration misalignment, i.e., PCR quality validation, an important task. All existing methods treat validation as a classification task, aiming to assign the PCR quality to a few classes. In this work, we instead use regression for PCR validation, allowing for a more fine-grained quantification of the registration quality. We also extend previously used misalignment-related features by using multiscale extraction and attention-based aggregation. This leads to accurate and robust registration error estimation on diverse datasets, especially for point clouds with heterogeneous spatial densities. Furthermore, when used to guide a mapping downstream task, our method significantly improves the mapping quality for a given amount of re-registered frames, compared to the state-of-the-art classification-based method.

[CV-40] A Novel Compression Framework for YOLOv8: Achieving Real-Time Aerial Object Detection on Edge Devices via Structured Pruning and Channel-Wise Distillation

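[Quick read]: This paper addresses the tension between model compression and preserved performance when deploying deep learning models for aerial object detection on resource-constrained devices. The key to the solution is a three-stage compression pipeline. First, sparsity-aware training introduces dynamic sparsity to balance parameter reduction against detection accuracy. Second, structured channel pruning based on batch normalization scaling factors reduces model size and computational complexity. Finally, Channel-Wise Knowledge Distillation (CWD) compensates for the accuracy lost to pruning, with an adjustable temperature and loss weighting scheme tailored to small and medium objects. On the VisDrone dataset, the method compresses YOLOv8m substantially at a small cost in accuracy, and TensorRT further raises inference speed to meet real-time deployment needs on edge devices.

Link: https://arxiv.org/abs/2509.12918
Authors: Melika Sabaghian,Mohammad Ali Keyvanrad,Seyyedeh Mahila Moghadami
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 11 figures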

Abstract:Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance. In this study, we propose a novel three-stage compression pipeline for the YOLOv8 object detection model, integrating sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge Distillation (CWD). First, sparsity-aware training introduces dynamic sparsity during model optimization, effectively balancing parameter reduction and detection accuracy. Second, we apply structured channel pruning by leveraging batch normalization scaling factors to eliminate redundant channels, significantly reducing model size and computational complexity. Finally, to mitigate the accuracy drop caused by pruning, we employ CWD to transfer knowledge from the original model, using an adjustable temperature and loss weighting scheme tailored for small and medium object detection. Extensive experiments on the VisDrone dataset demonstrate the effectiveness of our approach across multiple YOLOv8 variants. For YOLOv8m, our method reduces model parameters from 25.85M to 6.85M (a 73.51% reduction), FLOPs from 49.6G to 13.3G, and MACs from 101G to 34.5G, while reducing AP50 by only 2.7%. The resulting compressed model achieves 47.9 AP50 and boosts inference speed from 26 FPS (YOLOv8m baseline) to 45 FPS, enabling real-time deployment on edge devices. We further apply TensorRT as a lightweight optimization step. While this introduces a minor drop in AP50 (from 47.9 to 47.6), it significantly improves inference speed from 45 to 68 FPS, demonstrating the practicality of our approach for high-throughput, resource-constrained scenarios.
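
The pruning stage can be illustrated with a small sketch that ranks channels by the magnitude of their batch normalization scaling factors and keeps those above a global threshold. The global-quantile threshold, the rule of always keeping at least one channel per layer, the function name, and the toy gamma values are assumptions for illustration, not the paper's exact procedure. The last lines simply redo the abstract's arithmetic from its rounded figures.

```python
def channels_to_keep(bn_gammas, prune_ratio):
    """Keep channels whose |gamma| is at or above the global prune_ratio quantile."""
    magnitudes = sorted(abs(g) for gammas in bn_gammas.values() for g in gammas)
    cut = min(int(len(magnitudes) * prune_ratio), len(magnitudes) - 1)
    threshold = magnitudes[cut]
    keep = {}
    for name, gammas in bn_gammas.items():
        kept = [i for i, g in enumerate(gammas) if abs(g) >= threshold]
        if not kept:  # assumed safeguard: never remove an entire layer
            kept = [max(range(len(gammas)), key=lambda i: abs(gammas[i]))]
        keep[name] = kept
    return keep

# Toy scaling factors for two batch normalization layers.
gammas = {"bn1": [0.9, 0.01, 0.5, 0.02], "bn2": [0.03, 0.7]}
keep = channels_to_keep(gammas, prune_ratio=0.5)

# Arithmetic check on the reported YOLOv8m figures (rounded values from the abstract).
params_before, params_after = 25.85e6, 6.85e6
reduction = 1 - params_after / params_before   # about 0.735
speedup = 45 / 26                              # about 1.73x before TensorRT
```

The rounded parameter counts give a reduction of about 73.5%, in line with the 73.51% stated in the abstract.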

[CV-41] T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking

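[Quick read]: This paper addresses scale variation, dynamic backgrounds, clutter, and frequent occlusion in aerial object tracking, and in particular the limited long-term robustness and poor adaptation to non-linear appearance changes of existing correlation-based Siamese trackers that ignore temporal dependencies. The key to the solution is T-SiamTPN, which adds explicit temporal modeling to SiamTPN: temporal feature fusion and attention-based temporal interaction strengthen temporal consistency and enrich the feature representation without a notable increase in computation. Experiments show gains of 13.7% in success rate and 14.7% in precision over the baseline, while the tracker runs at 7.1 FPS on the embedded Jetson Nano, supporting its use in resource-constrained settings.

Link: https://arxiv.org/abs/2509.12913
Authors: Hojat Ardi(1),Amir Jahanshahi(1),Ali Diba(2) ((1) Department of Electrical Engineering, Amirkabir University of Technology (AUT), Tehran, Iran (2) Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar)
Affiliations: Amirkabir University of Technology; Hamad Bin Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: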

Abstract:Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. Code is available at: this https URL

[CV-42] AREPAS: Anomaly Detection in Fine-Grained Anatomy with Reconstruction-Based Semantic Patch-Scoring

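[Quick read]: This paper addresses anomaly detection (AD) and unsupervised segmentation in medical images, in particular the difficulty existing generative AD methods have in identifying lesions where normal tissue shows fine-grained variability, as in the lungs. The key to the solution is a new generative AD framework: an image-to-image translation model first produces an anomaly-free reconstruction, and patch similarity scoring between the observed and generated image pairs then localizes anomalies precisely. The method is applied to detecting and segmenting infectious disease lesions in chest CT scans, and its generalizability is tested on ischemic stroke lesion segmentation in T1-weighted brain MRI. Pixel-level anomaly segmentation improves over other state-of-the-art reconstruction-based methods, with relative DICE score gains of 1.9% and 4.4%, respectively.

Link: https://arxiv.org/abs/2509.12905
Authors: Branko Mitic,Philipp Seeböck,Helmut Prosch,Georg Langs
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: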

Abstract:Early detection of newly emerging diseases, lesion severity assessment, differentiation of medical conditions and automated screening are examples of the wide applicability and importance of anomaly detection (AD) and unsupervised segmentation in medicine. Normal fine-grained tissue variability, such as that present in pulmonary anatomy, is a major challenge for existing generative AD methods. Here, we propose a novel generative AD approach addressing this issue. It consists of an image-to-image translation for anomaly-free reconstruction and a subsequent patch similarity scoring between observed and generated image-pairs for precise anomaly localization. We validate the new method on chest computed tomography (CT) scans for the detection and segmentation of infectious disease lesions. To assess generalizability, we evaluate the method on an ischemic stroke lesion segmentation task in T1-weighted brain MRI. Results show improved pixel-level anomaly segmentation in both chest CTs and brain MRIs, with relative DICE score improvements of +1.9% and +4.4%, respectively, compared to other state-of-the-art reconstruction-based methods.
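
The second step, scoring patches of the observed image against the anomaly-free reconstruction, can be sketched as follows. Mean absolute pixel difference per patch is used here as a simple stand-in; the title suggests the paper's scoring is semantic (likely computed on learned features), so the dissimilarity measure, the patch size, and the function name are all assumptions for illustration.

```python
def patch_scores(observed, reconstructed, patch):
    """Mean absolute difference per non-overlapping patch; higher means more anomalous."""
    h, w = len(observed), len(observed[0])
    scores = []
    for r in range(0, h, patch):
        row = []
        for c in range(0, w, patch):
            diffs = [abs(observed[i][j] - reconstructed[i][j])
                     for i in range(r, min(r + patch, h))
                     for j in range(c, min(c + patch, w))]
            row.append(sum(diffs) / len(diffs))
        scores.append(row)
    return scores

# Toy 4x4 images: the reconstruction is all zeros, the observation has one bright pixel.
reconstructed = [[0.0] * 4 for _ in range(4)]
observed = [[0.0] * 4 for _ in range(4)]
observed[3][3] = 1.0
scores = patch_scores(observed, reconstructed, patch=2)
```

Only the bottom-right patch receives a non-zero score, which is the localization behaviour the scoring step is meant to provide.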

[CV-43] MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

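[Quick read]: This paper addresses the difficulty of capturing high-level semantic information in infrared and visible image fusion. Existing deep learning methods rely mostly on low-level visual cues such as texture and contrast, and struggle to achieve fine-grained semantic alignment and structure preservation in complex scenes. The key to the solution is the MSGFusion framework, which deeply couples structured scene graphs derived from text and vision, explicitly models entities, attributes, and spatial relations, and uses hierarchical aggregation and graph-driven fusion modules to refine high-level semantics and low-level details together. The authors report better structural clarity and detail preservation in the fused images, as well as better semantic consistency and generalizability in downstream tasks such as low-light object detection and semantic segmentation.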

Link: https://arxiv.org/abs/2509.12901
Authors: Guihui Li,Bowei Dong,Kaizhi Dong,Jiayi Li,Haiyong Zheng
Affiliations: Ocean University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.

[CV-44] Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models

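[Quick read]: This paper addresses the observation that large vision-language models (LVLMs) attend to key objects in an image only briefly, which limits their visual understanding. The key to the solution is Cross-Layer Vision Smoothing (CLVS), which introduces a vision memory that smooths the attention distribution across layers. Specifically, CLVS initializes the memory with position-unbiased visual attention in the first layer; in subsequent layers, visual attention jointly considers the memory from previous layers while the memory is updated iteratively, sustaining focus on key objects. Because visual understanding occurs mainly in the early and middle layers, uncertainty is used as an indicator that visual understanding is complete, and smoothing is terminated accordingly. Experiments on four benchmarks across three LVLMs show gains on visual understanding tasks, most notably relation and attribute understanding.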

Link: https://arxiv.org/abs/2509.12897
Authors: Jianfei Zhao,Feng Zhang,Xin Sun,Lingxing Kong,Zhixing Tan,Chong Feng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs’ visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model’s visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.

[CV-45] DialNav: Multi-turn Dialog Navigation with a Remote Guide ICCV2025

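[Quick read]: This paper addresses the lack of evaluation for multi-turn collaborative dialog in navigation tasks, in particular how an embodied agent (the Navigator) and a remote guide (the Guide) can communicate effectively to reach a goal location. The key to the solution is DialNav, a new collaborative embodied dialog task in which the Guide must infer the Navigator's location, together with the RAIN (Remote Assistance in Navigation) dataset of human-human dialog paired with navigation trajectories in photorealistic environments, which supports joint evaluation of navigation and dialog. The authors design a comprehensive benchmark, analyze the impact of different Navigator and Guide models, and release the dataset, code, and evaluation framework for future research in embodied dialog.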

Link: https://arxiv.org/abs/2509.12894
Authors: Leekyeung Han,Hyunji Min,Gyeom Hwangbo,Jonghyun Choi,Paul Hongsuck Seo
Affiliations: Korea University; University of Seoul; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 8 figures, ICCV 2025

Abstract:We introduce DialNav, a novel collaborative embodied dialog task, where a navigation agent (Navigator) and a remote guide (Guide) engage in multi-turn dialog to reach a goal location. Unlike prior work, DialNav aims for holistic evaluation and requires the Guide to infer the Navigator’s location, making communication essential for task success. To support this task, we collect and release the Remote Assistance in Navigation (RAIN) dataset, human-human dialog paired with navigation trajectories in photorealistic environments. We design a comprehensive benchmark to evaluate both navigation and dialog, and conduct extensive experiments analyzing the impact of different Navigator and Guide models. We highlight key challenges and publicly release the dataset, code, and evaluation framework to foster future research in embodied dialog.

[CV-46] MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization

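[Quick read]: This paper addresses optimization conflicts in surgical triplet recognition under long-tailed data distributions. Two challenges are identified: inter-task optimization conflicts caused by entangled task-generic and task-specific representations, and intra-task optimization conflicts caused by class-imbalanced training data. The key to the solution is the MLLM-Engaged Joint Optimization (MEJO) framework, with two components: 1) the Shared-Specific-Disentangled (S$^2$D) learning scheme, which decomposes representations into task-shared and task-specific components, uses a probabilistic prompt pool powered by a multimodal large language model (MLLM) to dynamically augment visual features with expert-level semantic cues, and models task-specific cues through distinct task prompts covering the temporal-spatial dimensions to mitigate inter-task ambiguity; 2) the Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive and negative gradients originating from head and tail classes for more coordinated intra-task learning.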

Link: https://arxiv.org/abs/2509.12893
Authors: Yiyi Zhang,Yuchen Yuan,Ying Zheng,Jialun Pei,Jinpeng Li,Zheng Li,Pheng-Ann Heng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Surgical triplet recognition, which involves identifying instrument, verb, target, and their combinations, is a complex surgical scene understanding challenge plagued by long-tailed data distribution. The mainstream multi-task learning paradigm benefiting from cross-task collaborative promotion has shown promising performance in identifying triples, but two key challenges remain: 1) inter-task optimization conflicts caused by entangling task-generic and task-specific representations; 2) intra-task optimization conflicts due to class-imbalanced training data. To overcome these difficulties, we propose the MLLM-Engaged Joint Optimization (MEJO) framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. To enhance task-shared representations, we construct a Multimodal Large Language Model (MLLM) powered probabilistic prompt pool to dynamically augment visual features with expert-level semantic cues. Additionally, comprehensive task-specific cues are modeled via distinct task prompts covering the temporal-spatial dimensions, effectively mitigating inter-task ambiguities. To tackle intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative gradients originating from head and tail classes for more coordinated learning behaviors. Extensive experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of our proposed framework, validating its effectiveness in handling optimization conflicts.

[CV-47] Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing

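[Quick read]: This paper addresses two problems with rectified flow (RF) models in generative AI: low inversion accuracy, which hurts consistency with the source image, and entangled multimodal attention in diffusion transformers, which hinders precise attention control. For the first, the authors propose an efficient high-order inversion method based on the Runge-Kutta solver for differential equations. For the second, they propose Decoupled Diffusion Transformer Attention (DDTA), which disentangles text and image attention inside multimodal diffusion transformers for more precise semantic control. Experiments on image reconstruction and text-guided editing report state-of-the-art fidelity and editability.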

Link: https://arxiv.org/abs/2509.12888
Authors: Weiming Chen,Zhihan Zhu,Yijia Wang,Zhihai He
Affiliations: Southern University of Science and Technology; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at this https URL.
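
As a rough illustration of why a higher-order solver can help inversion, the sketch below inverts and then re-integrates a toy one-dimensional ODE using Euler steps and classical fourth-order Runge-Kutta (RK4) steps. This is not the authors' method: `toy_velocity` is a hypothetical stand-in for a learned rectified-flow velocity network, the time convention (data at t=1, noise at t=0) is an assumption, and all function names are illustrative.

```python
def euler_step(v, x, t, h):
    """One first-order (Euler) step of dx/dt = v(x, t)."""
    return x + h * v(x, t)


def rk4_step(v, x, t, h):
    """One classical fourth-order Runge-Kutta step of dx/dt = v(x, t)."""
    k1 = v(x, t)
    k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = v(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(step, v, x, t0, t1, n):
    """Integrate from t0 to t1 in n equal steps (h is negative if t1 < t0)."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        x = step(v, x, t, h)
        t += h
    return x


def round_trip_error(step, v, x0, n=10):
    """Invert (t=1 -> t=0), reconstruct (t=0 -> t=1), and measure the drift."""
    latent = integrate(step, v, x0, 1.0, 0.0, n)
    recon = integrate(step, v, latent, 0.0, 1.0, n)
    return abs(recon - x0)


def toy_velocity(x, t):
    """Hypothetical velocity field; a real model would be a neural network."""
    return -x


euler_err = round_trip_error(euler_step, toy_velocity, 1.0)
rk4_err = round_trip_error(rk4_step, toy_velocity, 1.0)
print(f"Euler round-trip error: {euler_err:.3e}")
print(f"RK4 round-trip error:   {rk4_err:.3e}")
```

With the same number of steps, the RK4 round trip lands much closer to the starting point than the Euler round trip, which is the kind of source-image consistency the abstract is concerned with.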

[CV-48] Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder

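[Quick read]: This paper addresses the poor generalization of instruction-based image editing to the diverse instructions of real-world users; existing methods often fail on instructions outside their training domain. The key to the solution is the Lego-Edit framework, with two designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data plus several image manipulation functions, which lets a multimodal large language model (MLLM) compose editing actions at fine granularity; (2) a three-stage progressive reinforcement learning approach that trains the MLLM using feedback on unannotated, open-domain instructions, giving it generalized reasoning capabilities for real-world instructions. The method reports state-of-the-art performance on GEdit-Bench and ImgBench and can use newly introduced editing tools without additional fine-tuning.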

Link: https://arxiv.org/abs/2509.12883
Authors: Qifei Jia,Yu Liu,Yajie Chai,Xintong Yao,Qiming Lu,Yasen Zhang,Runyu Shi,Ying Huang,Guoquan Zhang
Affiliations: Xiaomi Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: this https URL.

[CV-49] Few to Big: Prototype Expansion Network via Diffusion Learner for Point Cloud Few-shot Semantic Segmentation

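[Quick read]: This paper addresses two challenges in few-shot 3D point cloud semantic segmentation: (1) intra-class diversity, where a prototype's limited representational capacity cannot cover a class's full variations; and (2) inter-set inconsistency, where prototypes derived from the support set are misaligned with the query feature space. The key to the solution is the Prototype Expansion Network (PENet), which builds big-capacity prototypes with a dual-stream learner architecture: a conventional fully supervised Intrinsic Learner (IL) distills representative features, while a Diffusion Learner (DL), built on the pre-trained conditional encoder of a diffusion model, supplies rich generalizable features. A Prototype Assimilation Module (PAM) then iteratively aligns the prototypes with the query space using a push-pull cross-guidance attention block, and a Prototype Calibration Mechanism (PCM) regularizes the final prototype to prevent semantic drift.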

Link: https://arxiv.org/abs/2509.12878
Authors: Qianguang Zhao,Dongli Wang,Yan Zhou,Jianxun Li,Richard Irampa
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Few-shot 3D point cloud semantic segmentation aims to segment novel categories using a minimal number of annotated support samples. While existing prototype-based methods have shown promise, they are constrained by two critical challenges: (1) Intra-class Diversity, where a prototype’s limited representational capacity fails to cover a class’s full variations, and (2) Inter-set Inconsistency, where prototypes derived from the support set are misaligned with the query feature space. Motivated by the powerful generative capability of diffusion models, we re-purpose a diffusion model's pre-trained conditional encoder to provide a novel source of generalizable features for expanding the prototype’s representational range. Under this setup, we introduce the Prototype Expansion Network (PENet), a framework that constructs big-capacity prototypes from two complementary feature sources. PENet employs a dual-stream learner architecture: it retains a conventional fully supervised Intrinsic Learner (IL) to distill representative features, while introducing a novel Diffusion Learner (DL) to provide rich generalizable features. The resulting dual prototypes are then processed by a Prototype Assimilation Module (PAM), which adopts a novel push-pull cross-guidance attention block to iteratively align the prototypes with the query space. Furthermore, a Prototype Calibration Mechanism (PCM) regularizes the final big-capacity prototype to prevent semantic drift. Extensive experiments on the S3DIS and ScanNet datasets demonstrate that PENet significantly outperforms state-of-the-art methods across various few-shot settings.

[CV-50] Cumulative Consensus Score: Label-Free and Model-Agnostic Evaluation of Object Detectors in Deployment

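[Quick read]: This paper addresses the evaluation of object detectors in deployment, where ground-truth annotations are rarely available, so that detector performance can be monitored and compared continuously. The key to the solution is a label-free metric, the Cumulative Consensus Score (CCS): test-time data augmentation produces several augmented views of each image, predicted bounding boxes are collected across views, and overlaps are computed with Intersection over Union (IoU). The maximum overlaps are normalized and averaged across augmentation pairs, giving a measure of spatial consistency that serves as a proxy for reliability without annotations and highlights under-performing cases.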

Link: https://arxiv.org/abs/2509.12871
Authors: Avinaash Manoharan,Xiangyu Yin,Domenik Helm,Chih-Hong Cheng
Affiliations: DLR Institute of Systems Engineering for Future Mobility; Carl von Ossietzky University of Oldenburg; Chalmers University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Evaluating object detection models in deployment is challenging because ground-truth annotations are rarely available. We introduce the Cumulative Consensus Score (CCS), a label-free metric that enables continuous monitoring and comparison of detectors in real-world settings. CCS applies test-time data augmentation to each image, collects predicted bounding boxes across augmented views, and computes overlaps using Intersection over Union. Maximum overlaps are normalized and averaged across augmentation pairs, yielding a measure of spatial consistency that serves as a proxy for reliability without annotations. In controlled experiments on Open Images and KITTI, CCS achieved over 90% congruence with F1-score, Probabilistic Detection Quality, and Optimal Correction Cost. The method is model-agnostic, working across single-stage and two-stage detectors, and operates at the case level to highlight under-performing scenarios. Altogether, CCS provides a robust foundation for DevOps-style monitoring of object detectors.
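
The abstract outlines CCS but does not give its exact normalization, so the following is only a minimal sketch of the idea under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples already mapped back to the original image frame, per-pair consistency is the mean best-match IoU taken in both directions, and the final score is the mean over all pairs of views. The names `pair_consistency` and `consensus_score` are illustrative, not the paper's.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_consistency(boxes_a, boxes_b):
    """Mean best-match IoU between two views, taken in both directions."""
    if not boxes_a and not boxes_b:
        return 1.0  # assumption: two empty predictions agree perfectly
    if not boxes_a or not boxes_b:
        return 0.0  # assumption: detections in only one view disagree fully
    best_ab = [max(iou(a, b) for b in boxes_b) for a in boxes_a]
    best_ba = [max(iou(b, a) for a in boxes_a) for b in boxes_b]
    return (sum(best_ab) + sum(best_ba)) / (len(boxes_a) + len(boxes_b))


def consensus_score(views):
    """Average pairwise consistency over all pairs of augmented views."""
    pairs = list(combinations(views, 2))
    if not pairs:
        raise ValueError("need at least two augmented views")
    return sum(pair_consistency(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical views of one image: two agree, one is shifted.
views = [
    [(0, 0, 10, 10)],
    [(0, 0, 10, 10)],
    [(5, 0, 15, 10)],
]
print(round(consensus_score(views), 3))
```

A score near 1 means the detector places its boxes consistently under augmentation; a low score flags an image worth inspecting, with no ground-truth labels involved.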

[CV-51] Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use

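[Quick read]: This paper addresses the limits of large language models (LLMs) on real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. The core solution is Tool-R1, a reinforcement learning framework in which the LLM performs general, compositional, multi-step tool use by generating executable Python code. It supports user-defined tools and standard libraries, with variables shared across steps to build coherent workflows. The key innovations are an outcome-based reward function that combines LLM-based answer judgment with code execution success to guide policy optimization, and a dynamic sample queue that caches and reuses high-quality trajectories to reduce the cost of online sampling. On the GAIA benchmark, Tool-R1 improves both accuracy and robustness, with a gain of about 10% over strong baselines and larger improvements on complex multi-step tasks.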

Link: https://arxiv.org/abs/2509.12867
Authors: Yabo Zhang,Yihan Zeng,Qingyun Li,Zhen Hu,Kavin Han,Wangmeng Zuo
Affiliations: Harbin Institute of Technology; Huawei Noah’s Ark Lab
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at this https URL.

[CV-52] Leveraging Large Language Models to Effectively Generate Visual Data for Canine Musculoskeletal Diagnoses

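[Quick read]: This paper addresses the scarcity of real data for training AI models for canine musculoskeletal diagnoses, especially for rare conditions or where data collection is costly. The key to the solution is using large language models (LLMs) to generate synthetic visual documentations: a mapping segments each visual documentation into over 200 labeled regions corresponding to muscles or joints, and guided decoding, chain-of-thought reasoning, and few-shot prompting are used to generate documentations that are sensitive to the location and severity of the diagnosis. A classification model trained solely on this synthetic data achieved an F1 score of 88% on 70 real-world documentations, supporting LLM-generated data as a way to mitigate data scarcity.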

Link: https://arxiv.org/abs/2509.12866
Authors: Martin Thißen,Thi Ngoc Diep Tran,Barbara Esteve Ratsch,Ben Joel Schönbein,Ute Trapp,Beate Egner,Romana Piat,Elke Hergenröther
Affiliations: Darmstadt University of Applied Sciences; Veterinary Academy of Higher Learning; European University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:It is well-established that more data generally improves AI model performance. However, data collection can be challenging for certain tasks due to the rarity of occurrences or high costs. These challenges are evident in our use case, where we apply AI models to a novel approach for visually documenting the musculoskeletal condition of dogs. Here, abnormalities are marked as colored strokes on a body map of a dog. Since these strokes correspond to distinct muscles or joints, they can be mapped to the textual domain in which large language models (LLMs) operate. LLMs have demonstrated impressive capabilities across a wide range of tasks, including medical applications, offering promising potential for generating synthetic training data. In this work, we investigate whether LLMs can effectively generate synthetic visual training data for canine musculoskeletal diagnoses. For this, we developed a mapping that segments visual documentations into over 200 labeled regions representing muscles or joints. Using techniques like guided decoding, chain-of-thought reasoning, and few-shot prompting, we generated 1,000 synthetic visual documentations for patellar luxation (kneecap dislocation) diagnosis, the diagnosis for which we have the most real-world data. Our analysis shows that the generated documentations are sensitive to location and severity of the diagnosis while remaining independent of the dog’s sex. We further generated 1,000 visual documentations for various other diagnoses to create a binary classification dataset. A model trained solely on this synthetic data achieved an F1 score of 88% on 70 real-world documentations. These results demonstrate the potential of LLM-generated synthetic data, which is particularly valuable for addressing data scarcity in rare diseases. While our methodology is tailored to the medical domain, the insights and techniques can be adapted to other fields.

[CV-53] Unleashing the Power of Discrete-Time State Representation: Ultrafast Target-based IMU-Camera Spatial-Temporal Calibration

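[Quick read]: This paper addresses the computational cost of spatial-temporal calibration between the IMU and cameras in visual-inertial systems. Most existing methods use a continuous-time state representation (typically B-splines), which yields precise calibration but is computationally expensive. The paper proposes a calibration method based on a discrete-time state representation, which greatly reduces computation while also tackling the weakness of discrete-time representations in temporal calibration. The efficiency gain matters most for large-scale device deployments such as drones and cellphones.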

Link: https://arxiv.org/abs/2509.12846
Authors: Junlin Song,Antoine Richard,Miguel Olivares-Mendez
Affiliations: Space Robotics (SpaceR) Research Group, Int. Centre for Security, Reliability and Trust (SnT), University of Luxembourg
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Visual-inertial fusion is crucial for a large number of intelligent and autonomous applications, such as robot navigation and augmented reality. To bootstrap and achieve optimal state estimation, the spatial-temporal displacements between the IMU and cameras must be calibrated in advance. Most existing calibration methods adopt a continuous-time state representation, more specifically the B-spline. Although these methods achieve precise spatial-temporal calibration, they suffer from the high computational cost of the continuous-time state representation. To this end, we propose a novel and extremely efficient calibration method that unleashes the power of discrete-time state representation. Moreover, the weakness of discrete-time state representation in temporal calibration is tackled in this paper. With the increasing production of drones, cellphones and other visual-inertial platforms, if one million devices need calibration around the world, saving one minute for the calibration of each device means saving 2083 work days in total. To benefit both the research and industry communities, our code will be open-sourced.
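
The 2083 figure in the abstract can be checked with a short calculation. It works out if a work day is taken to be 8 hours; the abstract does not state the day length, so that is an inference from the number rather than something the authors say:

```latex
\frac{10^{6}\ \text{devices} \times 1\ \text{min/device}}{60\ \text{min/h} \times 8\ \text{h/work day}}
  = \frac{10^{6}}{480}\ \text{work days}
  \approx 2083.3\ \text{work days}
```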

[CV-54] Exploring Metric Fusion for Evaluation of NeRFs

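[Quick read]: This paper addresses the lack of an effective objective quality metric for images generated by Neural Radiance Fields (NeRFs) in novel view synthesis; no individual metric correlates consistently with subjective quality across datasets. The key to the solution is fusing two established metrics based on different perceptual methods, Deep Image Structure and Texture Similarity (DISTS) and Video Multi-Method Assessment Fusion (VMAF). The authors test two normalization strategies and two fusion strategies to improve correlation with subjective quality scores, aiming for a more robust and general evaluation of NeRF outputs.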

Link: https://arxiv.org/abs/2509.12836
Authors: Shreyas Shivakumara,Gabriel Eilertsen,Karljohan Lundin Palmerius
Affiliations: Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for 17th International Conference on Quality of Multimedia Experience (QoMEX 25)

Abstract:Neural Radiance Fields (NeRFs) have demonstrated significant potential in synthesizing novel viewpoints. Evaluating the NeRF-generated outputs, however, remains a challenge due to the unique artifacts they exhibit, and no individual metric performs well across all datasets. We hypothesize that combining two successful metrics, Deep Image Structure and Texture Similarity (DISTS) and Video Multi-Method Assessment Fusion (VMAF), based on different perceptual methods, can overcome the limitations of individual metrics and achieve improved correlation with subjective quality scores. We experiment with two normalization strategies for the individual metrics and two fusion strategies to evaluate their impact on the resulting correlation with the subjective scores. The proposed pipeline is tested on two distinct datasets, Synthetic and Outdoor, and its performance is evaluated across three different configurations. We present a detailed analysis comparing the correlation coefficients of fusion methods and individual scores with subjective scores to demonstrate the robustness and generalizability of the fusion metrics.
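
The abstract does not say which normalization or fusion strategies were used, so the sketch below only illustrates the general shape of such a pipeline under assumptions: min-max normalization, an arithmetic-mean fusion, and Pearson correlation against subjective scores. DISTS is a distance (lower is better) and VMAF is a quality score (higher is better), so DISTS is flipped before fusing. All numbers are made up for illustration, and the function names are hypothetical.

```python
import math


def min_max(values):
    """Rescale values linearly to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_mean(dists_scores, vmaf_scores):
    """Average both metrics on a common 'higher is better' [0, 1] scale."""
    # DISTS is a distance, so it is flipped after rescaling.
    dists_quality = [1.0 - d for d in min_max(dists_scores)]
    vmaf_quality = min_max(vmaf_scores)
    return [(d + q) / 2.0 for d, q in zip(dists_quality, vmaf_quality)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up scores for three rendered views, for illustration only.
dists = [0.1, 0.2, 0.4]
vmaf = [90.0, 70.0, 40.0]
subjective = [4.5, 3.5, 1.5]

fused = fuse_mean(dists, vmaf)
print(fused, pearson(fused, subjective))
```

In a real evaluation, the correlation of the fused score would be compared against the correlations of DISTS and VMAF alone on the same data.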

[CV-55] Data Scaling Laws for Radiology Foundation Models

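[Quick read]: This paper addresses the limited understanding of how data scale and pretraining paradigm affect medical imaging foundation models, which are constrained by small datasets. The key approach is continual pretraining of two vision encoders, the contrastive MedImageInsight (MI2) and the self-supervised RAD-DINO (representing the CLIP and DINOv2 paradigms), on up to 3.5M chest X-rays from a single institution, with compute and evaluation protocols held constant. MI2 scales more effectively on tasks related to radiology findings, while RAD-DINO is stronger on tube-related tasks. Continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale, and for some tasks as few as 30k in-domain samples are enough to surpass open-weights foundation models, highlighting the value of center-specific continual pretraining.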

Link: https://arxiv.org/abs/2509.12818
Authors: Maximilian Ilse,Harshita Sharma,Anton Schwaighofer,Sam Bond-Taylor,Fernando Pérez-García,Olesya Melnichenko,Anne-Marie G. Sykes,Kelly K. Horst,Ashish Khandelwal,Maxwell Reynolds,Maria T. Wetscherek,Noel C. F. Codella,Javier Alvarez-Valle,Korfiatis Panagiotis,Valentina Salvatelli
Affiliations: Microsoft Health Futures UK; Mayo Clinic; Radiology AI Lab, Mayo Clinic; Cambridge University Hospitals; Microsoft Health & Life Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Foundation vision encoders such as CLIP and DINOv2, trained on web-scale data, exhibit strong transfer performance across tasks and datasets. However, medical imaging foundation models remain constrained by smaller datasets, limiting our understanding of how data scale and pretraining paradigms affect performance in this setting. In this work, we systematically study continual pretraining of two vision encoders, MedImageInsight (MI2) and RAD-DINO, representing the two major encoder paradigms CLIP and DINOv2, on up to 3.5M chest x-rays from a single institution, holding compute and evaluation protocols constant. We evaluate on classification (radiology findings, lines and tubes), segmentation (lines and tubes), and radiology report generation. While prior work has primarily focused on tasks related to radiology findings, we include lines and tubes tasks to counterbalance this bias and evaluate a model’s ability to extract features that preserve continuity along elongated structures. Our experiments show that MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Surprisingly, continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale. We further show that for some tasks, as few as 30k in-domain samples are sufficient to surpass open-weights foundation models. These results highlight the utility of center-specific continual pretraining, enabling medical institutions to derive significant performance gains by utilizing in-domain data.

[CV-56] SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

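[Quick read]: This paper addresses the computational bottleneck of softmax-based attention on high-resolution images, caused by its quadratic complexity $\mathcal{O}(N^2)$, and the weakness of existing linear attention methods, which compress historical key-value (KV) information uniformly. Uniform compression leads to feature redundancy and loss of directional alignment with the query, producing low-rank KV feature maps and a performance gap relative to softmax attention. The key to the solution is Selective Adaptive Gating for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates that selectively modulate information aggregation into the KV feature map, enhancing semantic diversity and alleviating the low-rank constraint. An efficient Hadamard-product decomposition computes the gates without additional memory overhead.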

Link: https://arxiv.org/abs/2509.12817
Authors: Yuan Cao,Dong Wang
Affiliations: Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank KV feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose Selective Adaptive GAting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the KV feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a $1.76\times$ improvement in throughput and a $2.69\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
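
The reordering that the abstract describes can be illustrated with plain matrix products. The sketch below is a simplification, not SAGA itself: it drops the softmax, any kernel feature map, the normalization term, and the paper's gates, and it writes the key transpose explicitly. It only shows that regrouping the product leaves the result unchanged while replacing the N x N score matrix with a d x d summary. All function names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def attention_quadratic(q, k, v):
    """(Q K^T) V: materializes an N x N score matrix first."""
    return matmul(matmul(q, transpose(k)), v)


def attention_linear(q, k, v):
    """Q (K^T V): materializes only a d x d key-value summary."""
    return matmul(q, matmul(transpose(k), v))


def multiply_counts(n, d):
    """Scalar multiplications for each ordering, with Q, K, V all N x d."""
    quadratic = 2 * n * n * d  # Q K^T costs n*n*d, then (.) V costs n*n*d
    linear = 2 * n * d * d     # K^T V costs d*d*n, then Q (.) costs n*d*d
    return quadratic, linear


# Small integer example: N = 4 tokens, d = 2 channels.
q = [[1, 2], [3, 4], [5, 6], [7, 8]]
k = [[1, 0], [0, 1], [1, 1], [2, 1]]
v = [[1, 2], [3, 4], [5, 6], [7, 8]]

print(attention_quadratic(q, k, v) == attention_linear(q, k, v))
print(multiply_counts(1280 * 1280 // 256, 64))
```

The cost of the second ordering grows linearly in the number of tokens N, which is why the KV summary matters at high resolutions; the abstract's point is that a uniformly aggregated summary tends to be low-rank, which SAGA's gates are meant to address.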

[CV-57] Gesture Evaluation in Virtual Reality

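[Quick read]: This paper addresses how the presentation environment affects the perception of AI-generated gestures on virtual avatars, comparing Virtual Reality (VR) with 2D viewing. It evaluates three gesture generation models from the 2023 GENEA Challenge in both settings. Model rankings remained consistent across settings, but gestures viewed in VR were rated slightly higher on average, with the strongest effect for motion-capture "true movement". The results suggest that VR influences participants' overall perception and offers benefits over traditional 2D evaluation.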

Link: https://arxiv.org/abs/2509.12816
Authors: Axel Wiebe Werner,Jonas Beskow,Anna Deichler
Affiliations: KTH Royal Institute of Technology
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in Proceedings of the 26th International Conference on Multimodal Interaction (ICMI '24), ACM. Copyright 2024 ACM. Licensed under CC BY

Abstract:Gestures are central to human communication, enriching interactions through non-verbal expression. Virtual avatars increasingly use AI-generated gestures to enhance life-likeness, yet evaluations have largely been confined to 2D. Virtual Reality (VR) provides an immersive alternative that may affect how gestures are perceived. This paper presents a comparative evaluation of computer-generated gestures in VR and 2D, examining three models from the 2023 GENEA Challenge. Results show that gestures viewed in VR were rated slightly higher on average, with the strongest effect observed for motion-capture “true movement.” While model rankings remained consistent across settings, VR influenced participants’ overall perception and offered unique benefits over traditional 2D evaluation.

[CV-58] Hunyuan3D Studio: End-to-End AI Pipeline for Game-Ready 3D Asset Generation

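[Quick read]: This paper addresses the labor-intensive, specialized workflows long required to create high-quality 3D assets for modern game development. The core solution is Hunyuan3D Studio, an end-to-end AI-powered content creation platform that integrates a suite of neural modules (such as Part-level 3D Generation, Polygon Generation, and Semantic UV) into a unified, user-friendly system. It turns a single concept image or text description into a game-ready 3D model with optimized geometry and high-fidelity PBR textures, which the authors report reduces iteration time and lowers the barrier to entry for 3D content creation.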

Link: https://arxiv.org/abs/2509.12815
Authors: Biwen Lei,Yang Li,Xinhai Liu,Shuhui Yang,Lixin Xu,Jingwei Huang,Ruining Tang,Haohan Weng,Jian Liu,Jing Xu,Zhen Zhou,Yiling Zhu,Jiankai Xing,Jiachen Xu,Changfeng Ma,Xinhao Yan,Yunhan Yang,Chunshi Wang,Duoteng Xu,Xueqi Ma,Yuguang Chen,Jing Li,Mingxin Yang,Sheng Zhang,Yifei Feng,Xin Huang,Di Luo,Zebin He,Puhua Jiang,Changrong Hu,Zihan Qin,Shiwei Miao,Haolin Liu,Yunfei Zhao,Zeqiang Lai,Qingxiang Lin,Zibo Zhao,Kunhong Li,Xianghui Yang,Huiwen Shi,Xin Yang,Yuxuan Wang,Zebin Yao,Yihang Lian,Sicong Liu,Xintong Han,Wangchen Qin,Caisheng Ouyang,Jianyin Liu,Tianwen Yuan,Shuai Jiang,Hong Duan,Yanqi Niu,Wencong Lin,Yifu Sun,Shirui Huang,Lin Niu,Gu Gong,Guojian Xiao,Bojian Zheng,Xiang Yuan,Qi Chen,Jie Xiao,Dongyang Zheng,Xiaofeng Yang,Kai Liu,Jianchen Zhu,Lifu Wang,Qinglin Lu,Jie Liu,Liang Dong,Fan Jiang,Ruibin Chen,Lei Wang,Chao Zhang,Jiaxin Lin,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Yinhe Wu,Jiayao Du,Jupeng Chen,Xinyue Mao,Dongyuan Guo,Yixuan Tang,Yulin Tsai,Yonghao Tan,Jiaao Yu,Junlin Yu,Keren Zhang,Yifan Li,Peng Chen,Tian Liu,Di Wang,Yuhong Liu,Linus,Jie Jiang,Zhuo Chen,Chunchao Guo
Affiliations: Tencent Hunyuan3D
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:The creation of high-quality 3D assets, a cornerstone of modern game development, has long been characterized by labor-intensive and specialized workflows. This paper presents Hunyuan3D Studio, an end-to-end AI-powered content creation platform designed to revolutionize the game production pipeline by automating and streamlining the generation of game-ready 3D assets. At its core, Hunyuan3D Studio integrates a suite of advanced neural modules (such as Part-level 3D Generation, Polygon Generation, Semantic UV, etc.) into a cohesive and user-friendly system. This unified framework allows for the rapid transformation of a single concept image or textual description into a fully-realized, production-quality 3D model complete with optimized geometry and high-fidelity PBR textures. We demonstrate that assets generated by Hunyuan3D Studio are not only visually compelling but also adhere to the stringent technical requirements of contemporary game engines, significantly reducing iteration time and lowering the barrier to entry for 3D content creation. By providing a seamless bridge from creative intent to technical asset, Hunyuan3D Studio represents a significant leap forward for AI-assisted workflows in game development and interactive media.

[CV-59] Superpixel Anything: A general object-based framework for accurate yet regular superpixel segmentation

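Quick read: This paper addresses the trade-off between regularity and accuracy in superpixel generation. Traditional methods rely on low-level features, which keeps superpixels geometrically regular but makes complex objects hard to capture. Deep learning-based methods improve segmentation accuracy but often sacrifice regular structure, which makes the results less interpretable. The key to the solution is SPAM (SuperPixel Anything Model), which at inference uses a large-scale pretrained model for semantic-agnostic segmentation so that superpixels align with object boundaries while keeping their regular shape. The framework accepts any prior high-level segmentation, resolves uncertainty regions, and supports interactive focus on specific objects, improving both the accuracy and the interpretability of superpixel segmentation.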

Link: https://arxiv.org/abs/2509.12791
Authors: Julien Walther, Rémi Giraud, Michaël Clément
Affiliations: Univ. Bordeaux, CNRS, Bordeaux INP, IMS, UMR 5218, France; Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Superpixels are widely used in computer vision to simplify image representation and reduce computational complexity. While traditional methods rely on low-level features, deep learning-based approaches leverage high-level features but also tend to sacrifice regularity of superpixels to capture complex objects, leading to accurate but less interpretable segmentations. In this work, we introduce SPAM (SuperPixel Anything Model), a versatile framework for segmenting images into accurate yet regular superpixels. We train a model to extract image features for superpixel generation, and at inference, we leverage a large-scale pretrained model for semantic-agnostic segmentation to ensure that superpixels align with object masks. SPAM can handle any prior high-level segmentation, resolving uncertainty regions, and is able to interactively focus on specific objects. Comprehensive experiments demonstrate that SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks, making it a valuable and robust tool for various applications. Code and pre-trained models are available here: this https URL.
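
To make the idea of superpixels that stay regular yet never cross an object boundary more concrete, here is a minimal Python sketch. It is not the SPAM method: it assumes a fixed square grid as the "regular" superpixels and a hypothetical integer label map `mask` standing in for a prior object segmentation, and it simply splits every grid cell by mask label. The function name and the toy mask are illustrative assumptions.

```python
def grid_superpixels_aligned_to_masks(mask, cell):
    """Assign a superpixel id to every pixel.

    mask: 2D list of ints, one object label per pixel (hypothetical prior,
          e.g. produced by a class-agnostic segmenter).
    cell: side length of the square grid cells used as regular superpixels.

    Two pixels share an id only if they lie in the same grid cell AND carry
    the same mask label, so no superpixel crosses an object boundary.
    """
    ids = {}
    labels = []
    for y, row in enumerate(mask):
        out_row = []
        for x, obj in enumerate(row):
            key = (y // cell, x // cell, obj)
            if key not in ids:
                ids[key] = len(ids)  # ids are issued in scan order
            out_row.append(ids[key])
        labels.append(out_row)
    return labels


# Toy 4x4 mask: label 1 is an object occupying the right-hand side.
mask = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
labels = grid_superpixels_aligned_to_masks(mask, cell=2)
for row in labels:
    print(row)
```

In this toy case only the top-right grid cell straddles the object boundary, so it is the only cell that gets split into two superpixels; the other cells stay as regular 2x2 squares.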

[CV-60] Double Helix Diffusion for Cross-Domain Anomaly Image Generation

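Quick read: This paper addresses the difficulty of training visual anomaly detectors in manufacturing when real anomaly samples are scarce, as well as the limitations of existing synthetic data generation methods in structural consistency and feature entanglement. The key to the solution is Double Helix Diffusion (DH-Diff), whose architecture, inspired by the double helix, cycles through modules for feature separation, connection, and merging. A domain-decoupled attention mechanism mitigates feature entanglement between images and annotation masks, while a semantic score map alignment module ensures the structural authenticity of anomaly foregrounds. The model thereby jointly generates high-fidelity anomaly images and their pixel-level annotation masks, which significantly improves downstream detection performance.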

Link: https://arxiv.org/abs/2509.12787
Authors: Linchun Wu, Qin Zou, Xianbiao Qi, Bo Du, Zhongyuan Wang, Qingquan Li
Affiliations: Wuhan University; Shenzhen Intellifusion Technologies Co Ltd; Guangdong Artificial Intelligence and Digital Economy Laboratory (SZ)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual anomaly inspection is critical in manufacturing, yet hampered by the scarcity of real anomaly samples for training robust detectors. Synthetic data generation presents a viable strategy for data augmentation; however, current methods remain constrained by two principal limitations: 1) the generation of anomalies that are structurally inconsistent with the normal background, and 2) the presence of undesirable feature entanglement between synthesized images and their corresponding annotation masks, which undermines the perceptual realism of the output. This paper introduces Double Helix Diffusion (DH-Diff), a novel cross-domain generative framework designed to simultaneously synthesize high-fidelity anomaly images and their pixel-level annotation masks, explicitly addressing these challenges. DH-Diff employs a unique architecture inspired by a double helix, cycling through distinct modules for feature separation, connection, and merging. Specifically, a domain-decoupled attention mechanism mitigates feature entanglement by enhancing image and annotation features independently, and meanwhile a semantic score map alignment module ensures structural authenticity by coherently integrating anomaly foregrounds. DH-Diff offers flexible control via text prompts and optional graphical guidance. Extensive experiments demonstrate that DH-Diff significantly outperforms state-of-the-art methods in diversity and authenticity, leading to significant improvements in downstream anomaly detection performance.

[CV-61] Modeling the Multivariate Relationship with Contextualized Representations for Effective Human-Object Interaction Detection

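Quick read: This paper addresses the difficulty of recognizing complex interactions in Human-Object Interaction (HOI) detection that stems from insufficient context modeling. Existing two-stage methods have made progress but remain limited when handling multivariate relationships that depend on auxiliary entities such as tools. The key to the solution is a Contextualized Representation Learning Network that combines affordance-guided reasoning with contextual prompts and visual cues, and explicitly models a (human, tool, object) triplet structure to capture tool-dependent interactions such as "filling". It also introduces a learnable prompt that is enriched with instance categories and aligned with visual features through an attention mechanism, bringing language and image content into semantic agreement at both the global and regional levels. This strengthens the model's reasoning over complex, context-dependent interactions.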

Link: https://arxiv.org/abs/2509.12784
Authors: Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang, Jiafei Wu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning Network that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures ⟨human, tool, object⟩. This enables our model to identify tool-dependent interactions such as ‘filling’. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. Codes will be released upon acceptance.

[CV-62] CECT-Mamba: a Hierarchical Contrast-enhanced-aware Model for Pancreatic Tumor Subtyping from Multi-phase CECT

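Quick read: This paper addresses the difficulty of accurately classifying pancreatic tumor subtypes, in particular pancreatic ductal adenocarcinoma (PDAC) versus pancreatic neuroendocrine tumors (PNETs), which arises from high tumor heterogeneity and under-use of multi-phase contrast-enhanced CT (CECT) data. Existing methods fail to exploit the contextual information across the multiple CECT phases that radiologists routinely use in diagnosis, which limits their performance. The key to the solution is, for the first time, an automatic way of combining multi-phase CECT data, built on the Mamba architecture for efficient temporal and spatial modeling. Specifically, it designs a dual hierarchical contrast-enhanced-aware Mamba module with novel spatial and temporal sampling sequences to explore intra-phase and inter-phase contrast variations of lesions; adds a similarity-guided refinement module that emphasizes learning on local tumor regions with more obvious temporal variations; and uses a space complementary integrator and a multi-granularity fusion module to encode and aggregate semantics across scales. On an in-house dataset the method reaches 97.4% accuracy and 98.6% AUC.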

Link: https://arxiv.org/abs/2509.12777
Authors: Zhifang Gong, Shuo Gao, Ben Zhao, Yingjing Xu, Yijun Yang, Shenghong Ju, Guangquan Zhou
Affiliations: Southeast University; Zhongda Hospital, Medical School of Southeast University; Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Contrast-enhanced computed tomography (CECT) is the primary imaging technique that provides valuable spatial-temporal information about lesions, enabling the accurate diagnosis and subclassification of pancreatic tumors. However, the high heterogeneity and variability of pancreatic tumors still pose substantial challenges for precise subtyping diagnosis. Previous methods fail to effectively explore the contextual information across multiple CECT phases commonly used in radiologists’ diagnostic workflows, thereby limiting their performance. In this paper, we introduce, for the first time, an automatic way to combine the multi-phase CECT data to discriminate between pancreatic tumor subtypes, among which the key is using Mamba with promising learnability and simplicity to encourage both temporal and spatial modeling from multi-phase CECT. Specifically, we propose a dual hierarchical contrast-enhanced-aware Mamba module incorporating two novel spatial and temporal sampling sequences to explore intra and inter-phase contrast variations of lesions. A similarity-guided refinement module is also imposed into the temporal scanning modeling to emphasize the learning on local tumor regions with more obvious temporal variations. Moreover, we design the space complementary integrator and multi-granularity fusion module to encode and aggregate the semantics across different scales, achieving more efficient learning for subtyping pancreatic tumors. The experimental results on an in-house dataset of 270 clinical cases achieve an accuracy of 97.4% and an AUC of 98.6% in distinguishing between pancreatic ductal adenocarcinoma (PDAC) and pancreatic neuroendocrine tumors (PNETs), demonstrating its potential as a more accurate and efficient tool.

[CV-63] BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers IJCNN

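Quick read: This paper addresses the limited performance of Vision Transformers (ViTs) in few-shot learning, where the main challenges are insufficiently refined token-level interactions, difficulty optimizing features with limited training data, and the lack of a strong inductive bias. The key to the solution is Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach. In the pre-training phase, Masked Image Modeling (MIM) yields transferable patch-level representations. In the meta-fine-tuning phase, Token Clustering, Uncertainty-Aware Token Weighting, and a Bi-Level Attention mechanism refine local features while balancing intra-cluster and inter-cluster relationships. Graph Token Propagation keeps support and query instances semantically consistent, and a Class Separation Penalty sharpens class discrimination, which together significantly improve ViT performance on few-shot classification.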

Link: https://arxiv.org/abs/2509.12768
Authors: Mohammed Al-Habib, Zuping Zhang, Abdulrahman Noman
Affiliations: Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This paper has been accepted for publication at the IEEE International Joint Conference on Neural Networks (IJCNN), Rome, Italy 2025

Abstract:Vision Transformers (ViTs) have shown significant promise in computer vision applications. However, their performance in few-shot learning is limited by challenges in refining token-level interactions, struggling with limited training data, and developing a strong inductive bias. Existing methods often depend on inflexible token matching or basic similarity measures, which limit the effective incorporation of global context and localized feature refinement. To address these challenges, we propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach that progressively improves token representations and maintains a robust inductive bias for few-shot classification. During the pre-training phase, Masked Image Modeling (MIM) provides Vision Transformers (ViTs) with transferable patch-level representations by recreating masked image regions, providing a robust basis for subsequent adaptation. In the meta-fine-tuning phase, BATR-FST incorporates a Bi-Level Adaptive Token Refinement module that utilizes Token Clustering to capture localized interactions, Uncertainty-Aware Token Weighting to prioritize dependable features, and a Bi-Level Attention mechanism to balance intra-cluster and inter-cluster relationships, thereby facilitating thorough token refinement. Furthermore, Graph Token Propagation ensures semantic consistency between support and query instances, while a Class Separation Penalty preserves different class borders, enhancing discriminative capability. Extensive experiments on three benchmark few-shot datasets demonstrate that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios and improves the few-shot classification via transformers.

[CV-64] DyGLNet: Hybrid Global-Local Feature Fusion with Dynamic Upsampling for Medical Image Segmentation

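Quick read: This paper addresses three challenges in medical image segmentation: multi-scale lesion variability, ill-defined tissue boundaries, and high computational cost. The key to the solution is a new model, DyGLNet, which fuses global and local features and adds a dynamic upsampling mechanism to achieve efficient and accurate segmentation. Specifically, it designs a hybrid feature extraction module (SHDCBlock) that combines single-head self-attention with multi-scale dilated convolutions to model local details and global context jointly, and a dynamic adaptive upsampling module (DyFusionUp) that reconstructs feature maps with high fidelity based on learnable offsets. A lightweight design reduces computational overhead, so the model improves inference efficiency while maintaining accuracy, with particularly strong results on boundary accuracy and small-object segmentation.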

Link: https://arxiv.org/abs/2509.12763
Authors: Yican Zhao, Ce Wang, You Hao, Lei Li, Tianli Liao
Affiliations: Henan University of Technology; Sun Yat-sen University; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, under review

Abstract:Medical image segmentation grapples with challenges including multi-scale lesion variability, ill-defined tissue boundaries, and computationally intensive processing demands. This paper proposes the DyGLNet, which achieves efficient and accurate segmentation by fusing global and local features with a dynamic upsampling mechanism. The model innovatively designs a hybrid feature extraction module (SHDCBlock), combining single-head self-attention and multi-scale dilated convolutions to model local details and global context collaboratively. We further introduce a dynamic adaptive upsampling module (DyFusionUp) to realize high-fidelity reconstruction of feature maps based on learnable offsets. Then, a lightweight design is adopted to reduce computational overhead. Experiments on seven public datasets demonstrate that DyGLNet outperforms existing methods, particularly excelling in boundary accuracy and small-object segmentation. Meanwhile, it exhibits lower computation complexity, enabling an efficient and reliable solution for clinical medical image analysis. The code will be made available soon.
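
As a rough illustration of what "multi-scale dilated convolutions" means, the following Python sketch applies one small kernel at several dilation rates to a 1D signal. It is a generic textbook construction, not the SHDCBlock from the paper; the function names, the toy signal, and the kernel are assumptions made for the example.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent covered by a dilated kernel: dilation * (k - 1) + 1 samples."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation):
    """Valid (no padding) 1D cross-correlation with a dilated kernel."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Kernel taps are spaced `dilation` samples apart.
            acc += w * signal[start + k * dilation]
        out.append(acc)
    return out


signal = [float(i) for i in range(10)]  # a linear ramp 0, 1, ..., 9
kernel = [1.0, 0.0, -1.0]               # simple difference kernel

# Same three weights, three receptive-field sizes (3, 5, and 7 samples).
multi_scale = {d: dilated_conv1d(signal, kernel, d) for d in (1, 2, 3)}
for d, out in multi_scale.items():
    print(d, effective_kernel_size(len(kernel), d), out)
```

The point of the sketch is that larger dilation rates widen the receptive field without adding weights, which is why several rates are often run in parallel to capture context at multiple scales.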

[CV-65] A-TDOM: Active TDOM via On-the-Fly 3DGS

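Quick read: This paper addresses two problems in True Digital Orthophoto Map (TDOM) generation: traditional methods rely on a complex offline photogrammetric pipeline, which causes delays and makes real-time use difficult, and TDOM quality degrades because of inaccurate camera poses, Digital Surface Model (DSM) errors, and scene occlusions. The key to the solution is A-TDOM, a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. As each image is acquired, its pose and a sparse point cloud are computed via On-the-Fly SfM (Structure from Motion), and new Gaussians are integrated and optimized into previously unseen or coarsely reconstructed regions. Combined with orthogonal splatting, a TDOM can be rendered immediately after each update of the 3DGS field, with 3DGS optimization taking seconds per new image while maintaining acceptable geometric accuracy.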

Link: https://arxiv.org/abs/2509.12759
Authors: Yiwei Xu, Xiang Wang, Yifei Yu, Wentian Gan, Luca Morelli, Giulio Perda, Xiongwu Xiao, Zongqian Zhan, Xin Wang, Fabio Remondino
Affiliations: Wuhan University; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS); Bruno Kessler Foundation (FBK)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:True Digital Orthophoto Map (TDOM) serves as a crucial geospatial product in various fields such as urban management, city planning, land surveying, etc. However, traditional TDOM generation methods generally rely on a complex offline photogrammetric pipeline, resulting in delays that hinder real-time applications. Moreover, the quality of TDOM may degrade due to various challenges, such as inaccurate camera poses or Digital Surface Model (DSM) and scene occlusions. To address these challenges, this work introduces A-TDOM, a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. As each image is acquired, its pose and sparse point cloud are computed via On-the-Fly SfM. Then new Gaussians are integrated and optimized into previously unseen or coarsely reconstructed regions. By integrating with orthogonal splatting, A-TDOM can render just after each update of a new 3DGS field. Initial experiments on multiple benchmarks show that the proposed A-TDOM is capable of actively rendering TDOM in near real-time, with 3DGS optimization for each new image in seconds while maintaining acceptable rendering quality and TDOM geometric accuracy.
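
The geometric core of rendering Gaussians into an orthophoto can be sketched in a few lines of Python. This is only an illustration of a top-down orthographic projection of a single 3D Gaussian, not the A-TDOM renderer: the function name, the `origin` and `gsd` (ground sampling distance) parameters, and the convention that image rows grow southward are all assumptions.

```python
def ortho_project_gaussian(mean, cov, origin, gsd):
    """Project a 3D Gaussian straight down (along z) onto an orthophoto grid.

    mean:   (x, y, z) center in world units.
    cov:    3x3 covariance as nested lists.
    origin: (x0, y0) world coordinates of the top-left image corner (assumed).
    gsd:    ground sampling distance in world units per pixel (assumed).

    Returns the pixel-space center (col, row) and the 2x2 pixel-space covariance.
    """
    col = (mean[0] - origin[0]) / gsd
    row = (origin[1] - mean[1]) / gsd  # image rows grow as world y decreases

    # The projection is linear, A = [[1/gsd, 0, 0], [0, -1/gsd, 0]], so the
    # 2D covariance is A * cov * A^T: the upper-left 2x2 block of cov, scaled
    # by 1/gsd^2, with the off-diagonal sign flipped by the y-axis inversion.
    s = 1.0 / (gsd * gsd)
    cov2d = [
        [cov[0][0] * s, -cov[0][1] * s],
        [-cov[1][0] * s, cov[1][1] * s],
    ]
    return (col, row), cov2d


center, cov2d = ortho_project_gaussian(
    mean=(10.0, 20.0, 5.0),
    cov=[[4.0, 1.0, 0.0], [1.0, 9.0, 0.0], [0.0, 0.0, 1.0]],
    origin=(0.0, 100.0),
    gsd=0.5,
)
print(center, cov2d)
```

Because the projection is orthographic, height only affects which Gaussian is on top during compositing; it does not shift the footprint, which is what keeps building facades from leaning in a true orthophoto.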

[CV-66] Recurrent Cross-View Object Geo-Localization

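Quick read: This paper addresses a weakness of existing Cross-view Object Geo-localization (CVOGL) methods: because they regress locations directly, they are sensitive to feature noise and have no mechanism for error correction. The key to the solution is to recast CVOGL as a recurrent localization task and propose the ReCOT framework. ReCOT introduces learnable tokens that encode task-specific intent from the query image and point prompt, and these tokens iteratively attend to the reference features to refine the predicted location. It is complemented by a knowledge distillation strategy based on the Segment Anything Model (SAM) that provides semantic guidance, and by a Reference Feature Enhancement Module (RFEM) that uses hierarchical attention to emphasize object-relevant regions. Together these improve localization accuracy while reducing the number of model parameters by 60%.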

Link: https://arxiv.org/abs/2509.12757
Authors: Xiaohan Zhang, Si-Yuan Cao, Xiaokai Bai, Yiming Li, Zhangkai Shen, Zhe Wu, Xiaoxi Hu, Hui-liang Shen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces a hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.

[CV-67] What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment

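Quick read: This paper addresses the difficulty of automatically evaluating generative text-to-image models, and in particular asks how multimodal LLMs use concepts relevant to human perception (such as image style and composition) when judging image quality. The approach has two parts. First, the authors build a human preference dataset from synthetically generated image pairs and analyze the inter-task correlation between image quality attributes (aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style), comparing how humans and multimodal LLMs judge them. Second, they study each attribute in isolation using highly controlled synthetic datasets. Humans reliably judge quality differences for every attribute, whereas multimodal LLMs perform markedly worse on some of them, such as anatomical accuracy, which reveals substantive differences in how humans and multimodal LLMs perceive images.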

Link: https://arxiv.org/abs/2509.12750
Authors: Rishab Parthasarathy, Jasmine Collins, Cory Stephenson
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 9 figures, 3 tables; appendix 16 pages, 9 figures, 6 tables

Abstract:Automated evaluation of generative text-to-image models remains a challenging problem. Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment. In this work, we study what attributes of an image–specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style–are important for both LLMs and humans to make judgments on image quality. We first curate a dataset of human preferences using synthetically generated image pairs. We use inter-task correlation between each pair of image quality attributes to understand which attributes are related in making human judgments. Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker. Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control for each axis. Humans are able to easily judge the quality of an image with respect to all of the specific image quality attributes (e.g. high vs. low aesthetic image), however we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge. Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images.
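
The "inter-task correlation" analysis can be pictured with a small Python sketch. It assumes, as a simplification, that each attribute is represented by a vector of binary preferences over the same image pairs (1 if image A is preferred, 0 otherwise) and that Pearson correlation is the measure used; the abstract does not specify either detail. The preference vectors below are made-up toy values, not data from the paper.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(prefs):
    """Pairwise correlations between attributes, keyed by (attr_p, attr_q)."""
    names = sorted(prefs)
    return {(p, q): pearson(prefs[p], prefs[q]) for p in names for q in names}


# Toy preferences over six image pairs (illustrative values only).
prefs = {
    "aesthetics": [1, 0, 1, 1, 0, 1],
    "style":      [1, 0, 1, 0, 0, 1],
    "anatomy":    [0, 1, 0, 0, 1, 0],
}
corr = correlation_matrix(prefs)
for (p, q), r in sorted(corr.items()):
    print(f"{p:>10} vs {q:<10} r = {r:+.3f}")
```

Running the same computation on human judgments and on model judgments, then comparing the two matrices, is one simple way to check whether a model treats the attributes as related in the way people do.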

[CV-68] Modelling and analysis of the 8 filters from the “master key filters hypothesis” for depthwise-separable deep networks in relation to idealized receptive fields based on scale-space theory

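Quick read: This paper addresses the modelling and interpretation of the filters (receptive fields) learned in depthwise-separable deep networks, and in particular how to approximate these learned filters with mathematically simple, physically meaningful models. The core challenge is to understand the spatial distribution of the filters and whether they can be characterized by some basic filtering operation. The approach is as follows. First, spatial spread measures (weighted means and weighted variances) are computed, which support the hypotheses that the learned filters can be approximated by separable filtering operations over the spatial domain and that the spatial offsets of the non-centered filters are close to half a grid unit. Second, the clustered "master key filters" are modelled as difference operators applied to smoothing with the discrete analogue of the Gaussian kernel, using either possibly different scale parameters in the two coordinate directions or the same scale parameter in both. Finally, the models are fitted either by minimizing the discrete $l_1$- or $l_2$-norm of the error or by requiring the spatial variances to be equal. The idealized models approximate the learned filters well and have good predictive properties, showing that the filters in depthwise-separable networks can be well approximated by discrete scale-space filters.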

Link: https://arxiv.org/abs/2509.12746
Authors: Tony Lindeberg, Zahra Babaiee, Peyman M. Kiasari
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 11 figures, 17 tables

Abstract:This paper presents the results of analysing and modelling a set of 8 "master key filters", which have been extracted by applying a clustering approach to the receptive fields learned in depthwise-separable deep networks based on the ConvNeXt architecture. For this purpose, we first compute spatial spread measures in terms of weighted mean values and weighted variances of the absolute values of the learned filters, which support the working hypotheses that: (i) the learned filters can be modelled by separable filtering operations over the spatial domain, and that (ii) the spatial offsets of those learned filters that are non-centered are rather close to half a grid unit. Then, we model the clustered "master key filters" in terms of difference operators applied to a spatial smoothing operation in terms of the discrete analogue of the Gaussian kernel, and demonstrate that the resulting idealized models of the receptive fields show good qualitative similarity to the learned filters. This modelling is performed in two different ways: (i) using possibly different values of the scale parameters in the coordinate directions for each filter, and (ii) using the same value of the scale parameter in both coordinate directions. Then, we perform the actual model fitting by either (i) requiring spatial spread measures in terms of spatial variances of the absolute values of the receptive fields to be equal, or (ii) minimizing the discrete $l_1$- or $l_2$-norms between the idealized receptive field models and the learned filters. Complementary experimental results then demonstrate the idealized models of receptive fields have good predictive properties for replacing the learned filters by idealized filters in depthwise-separable deep networks, thus showing that the learned filters in depthwise-separable deep networks can be well approximated by discrete scale-space filters.
Submission history: [v1] Tue, 16 Sep 2025 07:04:45 UTC (93 KB), submitted by Tony Lindeberg. DOI: https://doi.org/10.48550/arXiv.2509.12746 (arXiv-issued DOI via DataCite, pending registration).
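
The building blocks named in the abstract can be sketched in one dimension with plain Python. The sketch below uses the standard scale-space definition of the discrete analogue of the Gaussian kernel, $T(n; s) = e^{-s} I_n(s)$ with $I_n$ the modified Bessel function of integer order, applies a first-order central difference to it, and computes a weighted mean and variance of the absolute filter values as a spread measure. It is not the authors' fitting code: the function names, the scale value `s = 1.0`, the kernel radius, and the choice of a central difference are assumptions for illustration.

```python
import math


def bessel_i(n, s, terms=40):
    """Modified Bessel function I_n(s) of integer order via its power series."""
    n = abs(n)
    return sum(
        (s / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
        for k in range(terms)
    )


def discrete_gaussian(s, radius):
    """Discrete analogue of the Gaussian: T(n; s) = exp(-s) * I_n(s)."""
    return [math.exp(-s) * bessel_i(n, s) for n in range(-radius, radius + 1)]


def central_difference(kernel):
    """First-order central difference (f[n+1] - f[n-1]) / 2, zero-padded."""
    padded = [0.0] + list(kernel) + [0.0]
    return [(padded[i + 2] - padded[i]) / 2.0 for i in range(len(kernel))]


def spread_of_abs(kernel):
    """Weighted mean and variance of position, weighted by |kernel value|."""
    radius = len(kernel) // 2
    positions = range(-radius, radius + 1)
    weights = [abs(v) for v in kernel]
    total = sum(weights)
    mean = sum(n * w for n, w in zip(positions, weights)) / total
    var = sum((n - mean) ** 2 * w for n, w in zip(positions, weights)) / total
    return mean, var


s = 1.0  # scale parameter (assumed value)
smooth = discrete_gaussian(s, radius=8)
deriv = central_difference(smooth)

print("sum of smoothing kernel:", sum(smooth))
print("mean, variance of smoothing kernel:", spread_of_abs(smooth))
print("mean, variance of |derivative kernel|:", spread_of_abs(deriv))
```

A useful property to check is that the spatial variance of the discrete Gaussian equals the scale parameter `s`, which is what makes variance matching a natural way to fit a scale to a learned filter.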

[CV-69] Effective Gaussian Management for High-fidelity Object Reconstruction

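Quick read: This paper addresses two problems in high-fidelity object reconstruction: gradient conflicts caused by dual supervision, and the inefficient representation that results when existing Gaussian Splatting (GS) methods assign attributes indiscriminately. The key to the solution is a new Gaussian management strategy. First, spherical harmonics (SHs) or normals are dynamically activated under the supervision of a surface reconstruction module, which mitigates the gradient conflicts. Second, a lightweight Gaussian representation adaptively adjusts the SH order of each Gaussian according to gradient magnitudes, and a task-decoupled pruning strategy removes the Gaussians that have minimal impact on a given reconstruction task, which substantially reduces the parameter count while preserving reconstruction quality. The approach is model-agnostic and can be integrated into other frameworks to improve performance and reduce model size.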

Link: https://arxiv.org/abs/2509.12742
Authors: Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Feng Xu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. Departing from recent Gaussian Splatting (GS) methods that employ indiscriminate attribute assignment, our approach introduces a novel densification strategy that dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, which effectively mitigates the gradient conflicts caused by dual supervision and achieves superior reconstruction results. To further improve representation efficiency, we develop a lightweight Gaussian representation that adaptively adjusts the SH orders of each Gaussian based on gradient magnitudes and performs task-decoupled pruning to remove Gaussian with minimal impact on a reconstruction task without sacrificing others, which balances the representational capacity with parameter quantity. Notably, our management approach is model-agnostic and can be seamlessly integrated into other frameworks, enhancing performance while reducing model size. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art approaches in both reconstruction quality and efficiency, achieving superior performance with significantly fewer parameters.
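
To see why adapting the SH order per Gaussian saves parameters, consider the following Python sketch. The coefficient count of `(degree + 1) ** 2` per color channel is a standard property of spherical harmonics; everything else (the threshold values, the rule of raising the degree by one per threshold exceeded, the toy gradient magnitudes) is a hypothetical stand-in, since the abstract does not give the actual selection rule.

```python
def sh_coeff_count(degree, channels=3):
    """Number of SH coefficients up to `degree`, for `channels` color channels."""
    return channels * (degree + 1) ** 2


def choose_sh_degree(grad_mag, thresholds=(1e-4, 1e-3, 1e-2)):
    """Hypothetical rule: raise the degree by one per threshold exceeded (0..3)."""
    return sum(grad_mag > t for t in thresholds)


# Toy per-Gaussian gradient magnitudes (illustrative values only).
grad_mags = [5e-5, 5e-4, 5e-3, 5e-2]

degrees = [choose_sh_degree(g) for g in grad_mags]
adaptive_params = sum(sh_coeff_count(d) for d in degrees)
uniform_params = len(grad_mags) * sh_coeff_count(3)

print("degrees:", degrees)
print("adaptive color parameters:", adaptive_params)
print("uniform degree-3 parameters:", uniform_params)
```

Since the coefficient count grows quadratically with the degree, lowering the order for Gaussians that contribute little view-dependent detail removes a large share of the color parameters.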

[CV-70] Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models

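Quick read: This paper studies the insufficient robustness of Vision-Language Models (VLMs) to adversarial attacks, and in particular how jailbreak attacks can bypass their built-in safety guardrails more effectively and efficiently. The proposed method, Defense2Attack, rests on the observation that incorporating a weak defense into the attack pipeline strengthens the attack. It has three components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator improved through reinforcement fine-tuning. In a single attempt the method outperforms existing attacks that require multiple tries, suggesting a new perspective in which defensive patterns guide attack design.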

Link: https://arxiv.org/abs/2509.12724
Authors: Yunhan Zhao, Xiang Zheng, Xingjun Ma
Affiliations: Fudan University; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.

[CV-71] SPGen: Spherical Projection as Consistent and Flexible Representation for Single Image 3D Shape Generation

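Quick read: This paper addresses the inter-view inconsistency of existing single-view 3D generative models when reconstructing object surfaces, and their inability to faithfully represent complex internal structures or nontrivial topologies. The key to the solution is a novel multi-layer 2D Spherical Projection (SP) representation: geometry is projected onto a bounding sphere and unwrapped into a compact, structured multi-layer image representation, so that high-fidelity, consistent, and efficient 3D generation can be carried out purely in the image domain. The injective SP mapping removes view ambiguity, the multiple layers support nested internal structures, and the representation directly inherits powerful 2D diffusion priors, which significantly improves geometric quality and computational efficiency.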

Link: https://arxiv.org/abs/2509.12721
Authors: Jingdong Zhang, Weikai Chen, Yuan Liu, Jionghao Wang, Zhengming Yu, Zhuowen Shen, Bo Yang, Wenping Wang, Xin Li
Affiliations: Texas A&M University; LightSpeed Studios; Hong Kong University of Science and Technology; Waymo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing single-view 3D generative models typically adopt multiview diffusion priors to reconstruct object surfaces, yet they remain prone to inter-view inconsistencies and are unable to faithfully represent complex internal structure or nontrivial topologies. In particular, we encode geometry information by projecting it onto a bounding sphere and unwrapping it into a compact and structural multi-layer 2D Spherical Projection (SP) representation. Operating solely in the image domain, SPGen offers three key advantages simultaneously: (1) Consistency. The injective SP mapping encodes surface geometry with a single viewpoint which naturally eliminates view inconsistency and ambiguity; (2) Flexibility. Multi-layer SP maps represent nested internal structures and support direct lifting to watertight or open 3D surfaces; (3) Efficiency. The image-domain formulation allows the direct inheritance of powerful 2D diffusion priors and enables efficient finetuning with limited computational resources. Extensive experiments demonstrate that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.
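
The projection-and-unwrapping idea can be illustrated with a short Python sketch that bins surface points by direction from a center and stores their radii in stacked 2D maps. It is a simplified point-based stand-in, not the SPGen representation: the equirectangular layout, the outermost-first layer ordering, the function name, and the toy points are all assumptions.

```python
import math


def spherical_projection(points, center, height, width, num_layers):
    """Unwrap 3D points into `num_layers` radius maps of size height x width.

    Each pixel corresponds to a direction (polar angle theta, azimuth phi)
    from `center`. Layer 0 holds the outermost radius seen in that direction,
    layer 1 the next one inward, and so on; empty pixels hold None.
    """
    buckets = {}
    for x, y, z in points:
        dx, dy, dz = x - center[0], y - center[1], z - center[2]
        r = math.sqrt(dx * dx + dy * dy + dz * dz)
        if r == 0.0:
            continue  # the center itself has no direction
        theta = math.acos(max(-1.0, min(1.0, dz / r)))  # in [0, pi]
        phi = math.atan2(dy, dx)                        # in [-pi, pi]
        row = min(int(theta / math.pi * height), height - 1)
        col = int((phi + math.pi) / (2.0 * math.pi) * width) % width
        buckets.setdefault((row, col), []).append(r)

    layers = [[[None] * width for _ in range(height)] for _ in range(num_layers)]
    for (row, col), radii in buckets.items():
        for i, r in enumerate(sorted(radii, reverse=True)[:num_layers]):
            layers[i][row][col] = r
    return layers


# Two surface points along the same direction: an outer shell and an inner one.
points = [(2.0, 2.0, 1.0), (1.0, 1.0, 0.5)]
layers = spherical_projection(points, center=(0.0, 0.0, 0.0),
                              height=4, width=6, num_layers=2)
print(layers[0][1][3], layers[1][1][3])
```

Nested surfaces that share a direction end up in the same pixel but in different layers, which is how a stack of 2D maps can describe internal structure that a single depth map would hide.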

[CV-72] EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer

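Quick read: This paper addresses the fact that most existing spatial reasoning benchmarks are limited to static or globally observable environments, and so fail to capture the challenges of long-horizon reasoning and memory use when local perception, environment feedback, and global objectives are tightly coupled. Its core contribution is two dynamic spatial reasoning benchmarks, locally observable maze navigation and match-2 elimination, in which each action changes the structure of the environment and forces the model to keep updating its understanding and strategy. The paper also introduces a subjective experience-based memory mechanism for cross-task experience transfer and validation, enabling a systematic evaluation of spatial understanding and adaptive planning in dynamic settings.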

Link: https://arxiv.org/abs/2509.12718
Authors: Pukun Zhao, Longxiang Wang, Miaowei Wang, Chen Chen, Fanqing Zhou, Haojian Huang
Affiliations: Guangdong University of Finance and Economics; Chongqing University; University of Edinburgh; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Ongoing Work, 29 pages, 3 figures, 7 tables

Abstract:Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination that systematically evaluate models’ abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous update of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at this https URL.
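
"Locally observable" can be made concrete with a small Python sketch in which the agent sees only a square window around its position. The maze layout, the symbols, the window radius, and the `?` marker for cells outside the grid are illustrative assumptions, not the benchmark's actual interface.

```python
def local_view(grid, pos, radius):
    """Return the (2 * radius + 1)-square window around `pos`.

    grid:   list of equal-length strings, one per maze row.
    pos:    (row, col) of the agent.
    Cells that fall outside the grid are reported as '?'.
    """
    r0, c0 = pos
    view = []
    for r in range(r0 - radius, r0 + radius + 1):
        row = []
        for c in range(c0 - radius, c0 + radius + 1):
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                row.append(grid[r][c])
            else:
                row.append("?")
        view.append("".join(row))
    return view


# '#' wall, '.' free cell, 'A' agent, 'G' goal (toy layout).
maze = [
    "#####",
    "#A..#",
    "#.#.#",
    "#..G#",
    "#####",
]
view = local_view(maze, pos=(1, 1), radius=1)
print("\n".join(view))
```

From this window the agent cannot see the goal, so it has to remember what it has already explored, which is the kind of memory pressure a partially observable benchmark is meant to create.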

[CV-73] AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

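Quick read: This paper addresses a problem in Large Vision-Language Models (LVLMs) built on Mixture of Experts (MoE) architectures: because visual and linguistic processing are asymmetric in how their information is structured, it is hard to balance modality-specific features against cross-modal interactions. In particular, language experts in deeper layers progressively lose contextual grounding and rely on parametric knowledge instead of making full use of the provided visual and linguistic information. The key to the solution is AsyMoE, which models this asymmetry with three specialized expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts that suppress parametric bias and maintain contextual grounding. The design improves accuracy by 26.58% and 15.45% over vanilla MoE and modality-specific MoE respectively, while activating 25.45% fewer parameters than dense models.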

链接: https://arxiv.org/abs/2509.12715
作者: Heng Zhang,Haichuan Hu,Yaomin Shen,Weihao Yu,Yilei Yuan,Haochen You,Guo Cheng,Zijian Zhang,Lubin Gan,Huihui Wei,Hao Zhang,Jin Huang
机构: South China Normal University (华南师范大学); Alibaba Cloud (阿里云); Nanchang Research Institute, Zhejiang University (浙江大学南昌研究院); Research Institute of China Telecom Corporate Ltd (中国电信集团研究院); University of Michigan (密歇根大学); Columbia University (哥伦比亚大学); Tsinghua University (清华大学); University of Pennsylvania (宾夕法尼亚大学); University of Science and Technology of China (中国科学技术大学); Shanghai Jiao Tong University (上海交通大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.
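
The "fewer activated parameters" claim rests on sparse routing: each token is sent to only a few experts. The sketch below shows generic top-k softmax routing as commonly used in MoE layers. It is not AsyMoE's routing rule, which this digest does not specify, and the router logits are made-up values.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token over four experts.
routes = top_k_route([2.0, 1.0, 0.0, -1.0], k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for this token, so the compute per token scales with `k` rather than with the total number of experts.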

[CV-74] Learning by Imagining: Debiased Feature Augmentation for Compositional Zero-Shot Learning

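[Quick read]: This paper addresses the difficulty of learning generalizable compositional representations in Compositional Zero-Shot Learning (CZSL), which stems from the entanglement of attribute and object features and the prevalence of long-tailed distributions in real-world data. The key to the solution is a new method called Debiased Feature Augmentation (DeFA), which combines a disentangle-and-reconstruct framework for feature augmentation with a debiasing strategy: it explicitly uses prior knowledge of seen attributes and objects to synthesize high-fidelity composition features that support compositional generalization.

Link: https://arxiv.org/abs/2509.12711
Authors: Haozhe Zhang, Chenchen Jing, Mingyu Liu, Qingsheng Wang, Hao Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: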

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by learning prior knowledge of seen primitives, i.e., attributes and objects. Learning generalizable compositional representations in CZSL remains challenging due to the entangled nature of attributes and objects as well as the prevalence of long-tailed distributions in real-world data. Inspired by neuroscientific findings that imagination and perception share similar neural processes, we propose a novel approach called Debiased Feature Augmentation (DeFA) to address these challenges. The proposed DeFA integrates a disentangle-and-reconstruct framework for feature augmentation with a debiasing strategy. DeFA explicitly leverages the prior knowledge of seen attributes and objects by synthesizing high-fidelity composition features to support compositional generalization. Extensive experiments on three widely used datasets demonstrate that DeFA achieves state-of-the-art performance in both closed-world and open-world settings.

[CV-75] RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation

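[Quick read]: This paper addresses the lack of a goal-aligned task for supervising and evaluating text-driven infrared and visible image fusion, which leaves existing methods unable to measure how much the input text contributes to the fusion result. The key to the solution is the RIS-FUSION framework, built around the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. Because fusion and Referring Image Segmentation (RIS) share the objective of highlighting the object the text refers to, the framework optimizes the two jointly, improving the consistency between the fused result and the text semantics.

Link: https://arxiv.org/abs/2509.12710
Authors: Siju Ma, Changsiyu Gong, Xiaofeng Fan, Yong Ma, Chengjie Jiang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures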

Abstract:Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at this https URL.

[CV-76] SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes

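[Quick read]: This paper addresses the severe loss of visibility in surveillance images caused by smoke in early-stage fire scenes (0-15 minutes after ignition), which impairs situational awareness for emergency response and rescue operations. The key to the solution is SmokeBench, a real-world benchmark dataset for surveillance image desmoking. It contains paired smoke-degraded and smoke-free images captured across diverse scene setups and smoke concentrations, precisely aligned to support supervised learning and rigorous evaluation, and it provides a foundation for more robust and practical desmoking algorithms in real fire scenes.

Link: https://arxiv.org/abs/2509.12701
Authors: Wenzhuo Jin, Qianfeng Yang, Xianhao Wu, Hongming Chen, Pengpeng Li, Xiang Chen
Affiliations: Beijing Jiaotong University; Dalian Polytechnic University; Nanjing University of Aeronautics and Astronautics; Dalian Maritime University; Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACMMM 2025 Datasets Track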

Abstract:Early-stage fire scenes (0-15 minutes after ignition) represent a crucial temporal window for emergency interventions. During this stage, the smoke produced by combustion significantly reduces the visibility of surveillance systems, severely impairing situational awareness and hindering effective emergency response and rescue operations. Consequently, there is an urgent need to remove smoke from images to obtain clear scene information. However, the development of smoke removal algorithms remains limited due to the lack of large-scale, real-world datasets comprising paired smoke-free and smoke-degraded images. To address these limitations, we present a real-world surveillance image desmoking benchmark dataset named SmokeBench, which contains image pairs captured under diverse scenes setup and smoke concentration. The curated dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of desmoking methods on our dataset. Our dataset provides a valuable foundation for advancing robust and practical image desmoking in real-world fire scenes. This dataset has been released to the public and can be downloaded from this https URL.

[CV-77] StereoCarla: A High-Fidelity Driving Dataset for Generalizable Stereo

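[Quick read]: This paper addresses the limited generalization of learning-based stereo matching models in real-world scenes, which stems from the limited diversity of existing training data. The key to the solution is StereoCarla, a high-fidelity synthetic stereo dataset built on the CARLA simulator. It covers a range of camera configurations (different baselines, viewpoints, and sensor placements) and varied environmental conditions (lighting changes, weather effects, and road geometries), which substantially improves cross-domain generalization. Experiments show that models trained on StereoCarla outperform those trained on 11 existing datasets across several standard benchmarks (KITTI2012, KITTI2015, Middlebury, ETH3D), and that adding StereoCarla to multi-dataset training still yields substantial accuracy gains, confirming its compatibility and scalability.

Link: https://arxiv.org/abs/2509.12683
Authors: Xianda Guo, Chenming Zhang, Ruilin Wang, Youmin Zhang, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Qin Zou, Long Chen
Affiliations: Wuhan University; Xi’an Jiaotong University; Waytous; CASIA (Institute of Automation, Chinese Academy of Sciences); University of Bologna; University of California, Berkeley; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: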

Abstract:Stereo matching plays a crucial role in enabling depth perception for autonomous driving and robotics. While recent years have witnessed remarkable progress in stereo matching algorithms, largely driven by learning-based methods and synthetic datasets, the generalization performance of these models remains constrained by the limited diversity of existing training data. To address these challenges, we present StereoCarla, a high-fidelity synthetic stereo dataset specifically designed for autonomous driving scenarios. Built on the CARLA simulator, StereoCarla incorporates a wide range of camera configurations, including diverse baselines, viewpoints, and sensor placements as well as varied environmental conditions such as lighting changes, weather effects, and road geometries. We conduct comprehensive cross-domain experiments across four standard evaluation datasets (KITTI2012, KITTI2015, Middlebury, ETH3D) and demonstrate that models trained on StereoCarla outperform those trained on 11 existing stereo datasets in terms of generalization accuracy across multiple benchmarks. Furthermore, when integrated into multi-dataset training, StereoCarla contributes substantial improvements to generalization accuracy, highlighting its compatibility and scalability. This dataset provides a valuable benchmark for developing and evaluating stereo algorithms under realistic, diverse, and controllable settings, facilitating more robust depth perception systems for autonomous vehicles. Code can be available at this https URL, and data can be available at this https URL.
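
Why baseline diversity matters can be seen from the standard pinhole stereo relation Z = f * B / d (depth equals focal length times baseline divided by disparity). The sketch below uses made-up focal length and baseline values, not StereoCarla's camera settings, to show how the same scene depth maps to very different disparities under different rigs.

```python
def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth in metres from disparity in pixels for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_to_disparity(depth_m: float, focal_px: float, baseline_m: float) -> float:
    """Disparity in pixels that a point at `depth_m` produces."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


FOCAL_PX = 1000.0  # hypothetical focal length in pixels
for baseline_m in (0.1, 0.5):  # hypothetical narrow and wide baselines
    d = depth_to_disparity(20.0, FOCAL_PX, baseline_m)
    print(f"baseline {baseline_m} m -> disparity {d:.1f} px at 20 m")
```

A model trained on a single baseline only ever sees one disparity range for a given depth, which is one plausible reason a dataset with varied rigs would transfer better.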

[CV-78] A Comparative Study of YOLOv8 to YOLOv11 Performance in Underwater Vision Tasks

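[Quick read]: This paper addresses how Autonomous Underwater Vehicles (AUVs) can perform vision tasks such as habitat mapping, ecological monitoring, and infrastructure inspection efficiently and accurately under limited computational resources. The core challenges are poor underwater image quality (light attenuation and high turbidity), class imbalance, and the real-time requirements of deployment. The key to the solution is a systematic evaluation of recent YOLO one-stage detectors (YOLOv8-s through YOLOv11-s) on two public underwater datasets (coral disease and fish species recognition). Training set size is varied (25% to 100%) while validation and test sets stay fixed, and the study measures precision, recall, mAP50, mAP50-95, inference speed (FPS), and feature visualizations (Grad-CAM). The results show that accuracy saturates after YOLOv9 while inference speed improves markedly, with lightweight YOLOv10 offering the best speed-accuracy trade-off. The work provides a reproducible benchmark and codebase for embedded AUV deployment.

Link: https://arxiv.org/abs/2509.12682
Authors: Gordon Hung, Ivan Felipe Rodriguez
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures, 10 tables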

Abstract:Autonomous underwater vehicles (AUVs) increasingly rely on on-board computer-vision systems for tasks such as habitat mapping, ecological monitoring, and infrastructure inspection. However, underwater imagery is hindered by light attenuation, turbidity, and severe class imbalance, while the computational resources available on AUVs are limited. One-stage detectors from the YOLO family are attractive because they fuse localization and classification in a single, low-latency network; however, their terrestrial benchmarks (COCO, PASCAL-VOC, Open Images) leave open the question of how successive YOLO releases perform in the marine domain. We curate two openly available datasets that span contrasting operating conditions: a Coral Disease set (4,480 images, 18 classes) and a Fish Species set (7,500 images, 20 classes). For each dataset, we create four training regimes (25 %, 50 %, 75 %, 100 % of the images) while keeping balanced validation and test partitions fixed. We train YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s with identical hyperparameters (100 epochs, 640 px input, batch = 16, T4 GPU) and evaluate precision, recall, mAP50, mAP50-95, per-image inference time, and frames-per-second (FPS). Post-hoc Grad-CAM visualizations probe feature utilization and localization faithfulness. Across both datasets, accuracy saturates after YOLOv9, suggesting architectural innovations primarily target efficiency rather than accuracy. Inference speed, however, improves markedly. Our results (i) provide the first controlled comparison of recent YOLO variants on underwater imagery, (ii) show that lightweight YOLOv10 offers the best speed-accuracy trade-off for embedded AUV deployment, and (iii) deliver an open, reproducible benchmark and codebase to accelerate future marine-vision research.
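
For readers unfamiliar with the metrics, mAP50 counts a detection as correct when its box overlaps a ground-truth box with an intersection-over-union (IoU) of at least 0.5, and mAP50-95 averages over IoU thresholds from 0.50 to 0.95. The sketch below shows the IoU test and the conversion from per-image latency to FPS. The boxes and the latency are made-up values, and the helper names are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred: Box, truth: Box) -> bool:
    """True if the prediction would count as a hit at the IoU >= 0.5 threshold."""
    return box_iou(pred, truth) >= 0.5


def fps_from_latency(ms_per_image: float) -> float:
    """Frames per second implied by a per-image inference time in milliseconds."""
    if ms_per_image <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_image


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(matches_at_50((0, 0, 10, 10), (0, 0, 10, 8)))
print(fps_from_latency(20.0))
```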

[CV-79] MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization

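[Quick read]: This paper addresses two problems in cross-view geo-localization: the large appearance differences caused by viewpoint changes and the difficulty of extracting discriminative features. Existing methods usually extract features by segmenting feature maps but neglect spatial and semantic information, which limits localization accuracy. The key to the solution is the EVA02-based Multi-scale Frequency Attention Fusion (MFAF) method, which consists of the Multi-Frequency Branch-wise Block (MFB) and the Frequency-aware Spatial Attention (FSA) module. MFB captures low-frequency structural features and high-frequency edge details across multiple scales, improving the consistency and robustness of feature representations across viewpoints. FSA adaptively focuses on key regions of the frequency features, suppressing interference from background noise and viewpoint variation. On the University-1652, SUES-200, and Dense-UAV benchmarks, the method achieves competitive performance in drone localization and drone navigation.

Link: https://arxiv.org/abs/2509.12673
Authors: YiTong Liu, TianZhu Liu, YanFeng GU
Affiliations: Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 13 figures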

Abstract:Cross-view geo-localization aims to determine the geographical location of a query image by matching it against a gallery of images. This task is challenging due to the significant appearance variations of objects observed from variable views, along with the difficulty in extracting discriminative features. Existing approaches often rely on extracting features through feature map segmentation while neglecting spatial and semantic information. To address these issues, we propose the EVA02-based Multi-scale Frequency Attention Fusion (MFAF) method. The MFAF method consists of Multi-Frequency Branch-wise Block (MFB) and the Frequency-aware Spatial Attention (FSA) module. The MFB block effectively captures both low-frequency structural features and high-frequency edge details across multiple scales, improving the consistency and robustness of feature representations across various viewpoints. Meanwhile, the FSA module adaptively focuses on the key regions of frequency features, significantly mitigating the interference caused by background noise and viewpoint variability. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and Dense-UAV, demonstrate that the MFAF method achieves competitive performance in both drone localization and drone navigation tasks.

[CV-80] Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations

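[Quick read]: This paper addresses the detection and grounding of manipulated content in multimodal data, and in particular the "pseudo-anomalies" in existing benchmark datasets: real attacks usually keep the visual and textual content semantically consistent, whereas existing datasets artificially break cross-modal alignment, so models trained on them adapt poorly to real scenarios. The key to the solution is the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated by a two-stage pipeline that first applies state-of-the-art image manipulations and then generates semantically consistent text that reinforces the visual deception. On top of it, the paper builds the Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework, which retrieves contextual evidence from external knowledge repositories as auxiliary text and encodes it jointly with the inputs to trace and localize manipulations more precisely. Experiments show 2.06% higher detection accuracy on SAMM than state-of-the-art approaches.

Link: https://arxiv.org/abs/2509.12653
Authors: Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong
Affiliations: Hefei University of Technology; University of Trento
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: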

Abstract:The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as the auxiliary texts and encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at this https URL.

[CV-81] CIARD: Cyclic Iterative Adversarial Robustness Distillation

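[Quick read]: This paper addresses the drop in performance on clean examples that existing Adversarial Robustness Distillation (ARD) methods incur when transferring a teacher model's robustness to a lightweight student model. It traces the problem to two causes: in dual-teacher frameworks, the clean and robust teachers have divergent optimization objectives, which impedes knowledge transfer; and the adversarial examples generated iteratively during training degrade the robust teacher. To address these, the authors propose Cyclic Iterative ARD (CIARD), with two key innovations: (a) a multi-teacher framework with contrastive push-loss alignment to reconcile the conflicting objectives of the teachers, and (b) continuous adversarial retraining to keep the teacher robust as the distribution of adversarial examples shifts. Experiments show that CIARD improves adversarial defense rates by 3.53% on average and clean-sample accuracy by 5.87%, striking a better balance between robustness and generalization.

Link: https://arxiv.org/abs/2509.12633
Authors: Liming Lu, Shuchao Pang, Xu Zheng, Xiang Gu, Anan Du, Yunhuai Liu, Yongbin Zhou
Affiliations: Nanjing University of Science and Technology; HKUST(GZ); INSAIT; Sofia University, St. Kliment Ohridski; Nanjing University of Industry Technology; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: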

Abstract:Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from teacher model to lightweight student model, enabling resilient performance on resource-constrained scenarios. Though existing ARD approaches enhance student model’s robustness, the inevitable by-product leads to the degraded performance on clean examples. We summarize the causes of this problem inherent in existing methods with dual-teacher framework as: 1. The divergent optimization objectives of dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and 2. The iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: a. A multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and b. Continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance with an average 3.53 improvement in adversarial defense rates across various attack scenarios and a 5.87 increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at this https URL

[CV-82] Maps for Autonomous Driving: Full-process Survey and Frontiers

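[Quick read]: This paper surveys how map representations and map production pipelines have evolved in autonomous driving, that is, how maps have adapted to the growing demands for accuracy, efficiency, and flexibility. It divides the evolution into three stages, High-Definition (HD) maps, Lightweight (Lite) maps, and Implicit maps, and for each stage reviews the core technical challenges in the production workflow and the solutions proposed by the academic community. It also discusses how new map representations can be embedded in end-to-end autonomous driving frameworks, moving maps from static data carriers toward tools that support dynamic perception and decision-making.

Link: https://arxiv.org/abs/2509.12632
Authors: Pengxin Chen, Zhipeng Luo, Xiaoqi Jiang, Zhangcai Yin, Jonathan Li
Affiliations: Wuhan University of Technology; Huawei Technologies; Minnan Normal University; Chery Automobile Co., Ltd; University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: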

Abstract:Maps have always been an essential component of autonomous driving. With the advancement of autonomous driving technology, both the representation and production process of maps have evolved substantially. The article categorizes the evolution of maps into three stages: High-Definition (HD) maps, Lightweight (Lite) maps, and Implicit maps. For each stage, we provide a comprehensive review of the map production workflow, with highlighting technical challenges involved and summarizing relevant solutions proposed by the academic community. Furthermore, we discuss cutting-edge research advances in map representations and explore how these innovations can be integrated into end-to-end autonomous driving frameworks.

[CV-83] Exploring Spectral Characteristics for Single Image Reflection Removal

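[Quick read]: This paper addresses reflection removal in image restoration, where reflections arise from incident light interacting with a reflective medium. The problem is ill-posed: the reflection and transmission components overlap spatially in the captured image, which makes it hard to separate them and recover a clean background. The key to the solution is a new perspective based on spectral learning. The paper proposes a Spectral Codebook to reconstruct the optical spectrum of the reflection image and distinguishes reflections by perceiving the wavelength differences between light sources in the spectrum. It then designs two spectral prior refinement modules that redistribute pixels in the spatial dimension and adaptively enhance spectral differences along the wavelength dimension. Finally, a Spectrum-Aware Transformer jointly recovers the transmitted content in the spectral and pixel domains, giving more accurate reflection removal that generalizes well.

Link: https://arxiv.org/abs/2509.12627
Authors: Pengbo Guo, Chengxu Liu, Guoshuai Zhao, Xingsong Hou, Jialie Shen, Xueming Qian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: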

Abstract:Eliminating reflections caused by incident light interacting with reflective medium remains an ill-posed problem in the image restoration area. The primary challenge arises from the overlapping of reflection and transmission components in the captured images, which complicates the task of accurately distinguishing and recovering the clean background. Existing approaches typically address reflection removal solely in the image domain, ignoring the spectral property variations of reflected light, which hinders their ability to effectively discern reflections. In this paper, we start with a new perspective on spectral learning, and propose the Spectral Codebook to reconstruct the optical spectrum of the reflection image. The reflections can be effectively distinguished by perceiving the wavelength differences between different light sources in the spectrum. To leverage the reconstructed spectrum, we design two spectral prior refinement modules to re-distribute pixels in the spatial dimension and adaptively enhance the spectral differences along the wavelength dimension. Furthermore, we present the Spectrum-Aware Transformer to jointly recover the transmitted content in spectral and pixel domains. Experimental results on three different reflection benchmarks demonstrate the superiority and generalization ability of our method compared to state-of-the-art models.

[CV-84] ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation

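[Quick read]: This paper addresses the limits in training efficiency and exploration ability of current Vision-and-Language Navigation (VLN) methods based on Multimodal Large Language Models (MLLMs). Existing methods rely mainly on Imitation Learning (IL) followed by DAgger post-training to mitigate covariate shift, which makes data collection and training costly. Prior Reinforcement Learning (RL) methods lack dynamic interaction with the environment and depend on expert trajectories for reward design, which limits the agent's ability to discover diverse, plausible routes. The key to the solution is ActiveVLN, a framework that enables active exploration through multi-turn RL. In the first stage, a small number of expert trajectories initialize the agent via IL. In the second stage, the agent predicts and executes actions on its own, automatically collects diverse trajectories, and optimizes multiple rollouts with the GRPO objective. A dynamic early-stopping strategy prunes long-tail or failed trajectories to improve RL efficiency. Experiments show larger gains over IL baselines than DAgger-based and prior RL-based post-training, and performance competitive with state-of-the-art methods at a smaller model size.

Link: https://arxiv.org/abs/2509.12618
Authors: Zekai Zhang, Weiye Zhu, Hewei Pan, Xiangchen Wang, Rongtao Xu, Xing Sun, Feng Zheng
Affiliations: Southern University of Science and Technology; Spatialtemporal AI; MBZUAI; Tencent Youtu Lab
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: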

Abstract:The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent’s ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model. Code and data will be released soon.
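
The step "optimizes multiple rollouts via the GRPO objective" depends on scoring each rollout relative to the others sampled for the same instruction. A minimal sketch of that group-relative advantage, in the form GRPO is commonly described (reward minus the group mean, divided by the group standard deviation), follows. The rewards are hypothetical success flags, and the paper's reward design and full objective are not given in this digest.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each rollout's reward against the other rollouts in its group."""
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one instruction (1.0 = reached the goal).
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_rewards)
print(advantages)
```

When every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one reason collecting diverse trajectories matters for this kind of training.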

[CV-85] DisorientLiDAR: Physical Attacks on LiDAR-based Localization

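[Quick read]: This paper addresses the vulnerability of LiDAR-based localization systems to adversarial attacks, particularly in autonomous driving. It introduces DisorientLiDAR, an adversarial attack framework that disrupts localization models based on point-cloud registration. The key idea is to reverse-engineer the localization model (for example, its feature extraction network), identify the keypoints that matter most for localization accuracy, and strategically remove the regions containing them, which breaks point-cloud registration and causes localization drift. Experiments show that hiding even a few critical regions significantly degrades mainstream point-cloud registration models (HRegNet, D3Feat, and GeoTransformer). The attack's practical impact is validated on the Autoware platform and reproduced in the physical world using near-infrared absorptive materials, demonstrating its real-world feasibility and generality.

Link: https://arxiv.org/abs/2509.12595
Authors: Yizhen Lao, Yu Zhang, Ziting Wang, Chengbo Wang, Yifei Xue, Wanpeng Shao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: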

Abstract:Deep learning models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations. Even though this poses a serious security challenge for the localization of self-driving cars, there has been very little exploration of attacks on it, as most adversarial attacks have been applied to 3D perception. In this work, we propose a novel adversarial attack framework called DisorientLiDAR targeting LiDAR-based localization. By reverse-engineering localization models (e.g., feature extraction networks), adversaries can identify critical keypoints and strategically remove them, thereby disrupting LiDAR-based localization. Our proposal is first evaluated on three state-of-the-art point-cloud registration models (HRegNet, D3Feat, and GeoTransformer) using the KITTI dataset. Experimental results demonstrate that removing regions containing Top-K keypoints significantly degrades their registration accuracy. We further validate the attack’s impact on the Autoware autonomous driving platform, where hiding merely a few critical regions induces noticeable localization drift. Finally, we extend our attacks to the physical world by hiding critical regions with near-infrared absorptive materials, successfully replicating the attack effects observed in KITTI data. This step moves closer to a realistic physical-world attack and demonstrates the veracity and generality of our proposal.

[CV-86] Adaptive Sampling Scheduler

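[Quick read]: This paper addresses the inflexibility of Consistency Distillation methods in practice: they use fixed strategies for selecting target timesteps, which limits the sampling potential and generation quality of diffusion models across distillation frameworks. The key to the solution is an adaptive sampling scheduler with three innovations: (i) dynamic target timestep selection that adapts to the computed importance of each timestep; (ii) optimized alternating sampling that uses timestep importance to guide forward denoising and backward noise addition, exploring the solution space more effectively; and (iii) smoothing clipping and color balancing techniques that give stable, high-quality results at high guidance scales. Together these broaden the applicability and improve the generation performance of consistency distillation models in complex scenarios.

Link: https://arxiv.org/abs/2509.12569
Authors: Qi Wang, Shuliang Zhu, Jinjia Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures, 2 tables, 18 equations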

Abstract:Consistent distillation methods have evolved into effective techniques that significantly accelerate the sampling process of diffusion models. Although existing methods have achieved remarkable results, the selection of target timesteps during distillation mainly relies on deterministic or stochastic strategies, which often require sampling schedulers to be designed specifically for different distillation processes. Moreover, this pattern severely limits flexibility, thereby restricting the full sampling potential of diffusion models in practical applications. To overcome these limitations, this paper proposes an adaptive sampling scheduler that is applicable to various consistency distillation frameworks. The scheduler introduces three innovative strategies: (i) dynamic target timestep selection, which adapts to different consistency distillation frameworks by selecting timesteps based on their computed importance; (ii) Optimized alternating sampling along the solution trajectory by guiding forward denoising and backward noise addition based on the proposed time step importance, enabling more effective exploration of the solution space to enhance generation performance; and (iii) Utilization of smoothing clipping and color balancing techniques to achieve stable and high-quality generation results at high guidance scales, thereby expanding the applicability of consistency distillation models in complex generation scenarios. We validated the effectiveness and flexibility of the adaptive sampling scheduler across various consistency distillation methods through comprehensive experimental evaluations. Experimental results consistently demonstrated significant improvements in generative performance, highlighting the strong adaptability achieved by our method.

[CV-87] VQT-Light:Lightweight HDR Illumination Map Prediction with Richer Texture.pdf

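[Quick read]: This paper addresses the accuracy of lighting estimation in computer vision and graphics: existing methods struggle to restore detailed textures in illumination maps and tend to be slow with low texture fidelity. The key to the solution is VQT-Light, a new framework built on VQVAE (Vector Quantized Variational Autoencoder) and ViT (Vision Transformer) with two core modules. First, VQVAE extracts discrete rather than continuous features of the illumination map, avoiding the information loss caused by "posterior collapse". Second, ViT captures the global context and dependencies of the input image, improving the prediction of illumination outside the field of view. Lighting estimation is then formulated as a multiclass classification task, so the model stays lightweight and fast (40 FPS) while producing richer and more faithful textures.

Link: https://arxiv.org/abs/2509.12556
Authors: Kunliang Xie
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures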

Abstract:Accurate lighting estimation is a significant yet challenging task in computer vision and graphics. However, existing methods either struggle to restore detailed textures of illumination map, or face challenges in running speed and texture fidelity. To tackle this problem, we propose a novel framework (VQT-Light) based on VQVAE and ViT architecture. VQT-Light includes two modules: feature extraction and lighting estimation. First, we take advantages of VQVAE to extract discrete features of illumination map rather than continuous features to avoid “posterior collapse”. Second, we capture global context and dependencies of input image through ViT rather than CNNs to improve the prediction of illumination outside the field of view. Combining the above two modules, we formulate the lighting estimation as a multiclass classification task, which plays a key role in our pipeline. As a result, our model predicts light map with richer texture and better fidelity while keeping lightweight and fast. VQT-Light achieves an inference speed of 40FPS and improves multiple evaluation metrics. Qualitative and quantitative experiments demonstrate that the proposed method realizes superior results compared to existing state-of-the-art methods.
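
The link between "discrete features" and "multiclass classification" is the codebook lookup at the heart of vector quantization: each continuous feature vector is replaced by the index of its nearest codebook entry, and a downstream model can then predict those indices as class labels. The sketch below shows only that lookup, with a tiny made-up codebook. It is not the VQT-Light implementation, and the function names are illustrative.

```python
from typing import List, Sequence


def nearest_code(feature: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Index of the codebook vector closest to `feature` in squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def quantize(features: List[Sequence[float]], codebook: List[Sequence[float]]) -> List[int]:
    """Map every continuous feature vector to a discrete token (codebook index)."""
    return [nearest_code(f, codebook) for f in features]


# Hypothetical 2-D codebook with three entries, and three feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
tokens = quantize([(0.9, 1.2), (3.5, 4.2), (-0.1, 0.2)], codebook)
print(tokens)
```

With K codebook entries, predicting each token becomes a K-way classification problem, which is the formulation the abstract describes.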

[CV-88] Explicit Multimodal Graph Modeling for Human-Object Interaction Detection

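[Quick read]: This paper addresses the limited interaction recognition performance of Transformer-based methods for Human-Object Interaction (HOI) detection, which do not explicitly model the relational structure between humans and objects. The key to the solution is the Multimodal Graph Network Modeling (MGNM) framework, which explicitly models the HOI task in a four-stage graph structure and introduces a multi-level feature interaction mechanism that uses multi-level vision and language features to enhance information propagation across human-object pairs. MGNM achieves state-of-the-art performance on the two mainstream benchmarks, HICO-DET and V-COCO.

Link: https://arxiv.org/abs/2509.12554
Authors: Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: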

Abstract:Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.

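[CV-89] iCD: A Implicit Clustering Distillation Mathod for Structural Information Mining

Quick read: This paper addresses the limited interpretability of the decision-making process in Logit Knowledge Distillation. The key to the solution is implicit Clustering Distillation (iCD), which computes Gram matrices over decoupled local logit representations to mine and transfer interpretable structural knowledge. The student model can thus learn latent semantic structural patterns without ground-truth labels or feature-space alignment, which improves performance, most notably on fine-grained classification tasks.

Link: https://arxiv.org/abs/2509.12553
Authors: Xiang Xue, Yatu Ji, Qing-dao-er-ji Ren, Bao Shi, Min Lu, Nier Wu, Xufei Zhuang, Haiteng Xu, Gan-qi-qi-ge Cha
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: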

Abstract:Logit Knowledge Distillation has gained substantial research interest in recent years due to its simplicity and lack of requirement for intermediate feature alignment; however, it suffers from limited interpretability in its decision-making process. To address this, we propose implicit Clustering Distillation (iCD): a simple and effective method that mines and transfers interpretable structural knowledge from logits, without requiring ground-truth labels or feature-space alignment. iCD leverages Gram matrices over decoupled local logit representations to enable student models to learn latent semantic structural patterns. Extensive experiments on benchmark datasets demonstrate the effectiveness of iCD across diverse teacher-student architectures, with particularly strong performance in fine-grained classification tasks – achieving a peak improvement of +5.08% over the baseline. The code is available at: this https URL.
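
As a rough illustration of matching structure through Gram matrices, the sketch below compares the pairwise-similarity matrix of student logits with that of teacher logits. It is a simplified assumption, not the iCD loss: it uses whole logit vectors rather than decoupled local representations, and the names `gram` and `gram_loss` are hypothetical.

```python
import math


def normalize(row):
    """Scale a logit vector to unit L2 norm (an all-zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in row)) or 1.0
    return [x / norm for x in row]


def gram(logits):
    """Batch-by-batch matrix of cosine similarities between logit vectors."""
    rows = [normalize(r) for r in logits]
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]


def gram_loss(student_logits, teacher_logits):
    """Mean squared difference between student and teacher Gram matrices."""
    gs, gt = gram(student_logits), gram(teacher_logits)
    n = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


teacher = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
same_structure = [[4.0, 2.0, 0.0], [0.0, 2.0, 4.0]]   # rescaled copy of the teacher
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]        # both samples look identical
print(gram_loss(same_structure, teacher), gram_loss(collapsed, teacher))
```

The loss depends only on relationships between samples, so no ground-truth labels are needed, and a student whose logits differ in scale but preserve the same sample-to-sample structure is not penalized.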

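[CV-90] Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection

Quick read: This paper addresses the marked gap between offline benchmark results and real-world efficacy in face forgery detection, which the authors attribute to the ecological invalidity of existing training data. They propose Agent4FaceForgery, a multi-agent framework in which LLM-powered agents, equipped with profile and memory modules, simulate the diverse intents and iterative processes of human forgery creation. The agents interact in a simulated social environment to generate data labeled for fine-grained text-image consistency, going beyond traditional binary labels. An Adaptive Rejection Sampling (ARS) mechanism safeguards the quality and diversity of the generated data, which improves the performance of detectors across multiple architectures.

Link: https://arxiv.org/abs/2509.12546
Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: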

Abstract:Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real-world efficacy, which we attribute to the ecological invalidity of training data. This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media. To solve this, we propose a multi-agent framework where LLM-powered agents, equipped with profile and memory modules, simulate the forgery creation process. Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text-image consistency, moving beyond simple binary classification. An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity. Extensive experiments validate that the data generated by our simulation-driven approach brings significant performance gains to detectors of multiple architectures, fully demonstrating the effectiveness and value of our framework.

[CV-91] Neural Collapse-Inspired Multi-Label Federated Learning under Label-Distribution Skew

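Quick read: This paper addresses the performance degradation of Federated Learning (FL) in multi-label settings, especially when client data distributions are skewed, label co-occurrence is complex, and local and global label relationships disagree. Existing FL methods mostly target single-label classification and handle the semantic redundancy and class dependencies of multi-label data poorly. The key to the solution is to draw on Neural Collapse (NC) theory: a feature disentanglement module extracts semantically specific class-wise features, and a predefined shared NC structure guides feature clustering across clients, which mitigates model conflicts caused by differing local data distributions. The paper also designs regularization losses that encourage compact clustering in the latent feature space. Experiments show that the method outperforms existing approaches on multiple benchmark datasets.

Link: https://arxiv.org/abs/2509.12544
Authors: Can Peng, Yuyuan Liu, Yingyu Yang, Pramit Saha, Qianye Yang, J. Alison Noble
Affiliations: University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: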

Abstract:Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. However, the performance of deep learning often deteriorates in FL due to decentralized and heterogeneous data. This challenge is further amplified in multi-label scenarios, where data exhibit complex characteristics such as label co-occurrence, inter-label dependency, and discrepancies between local and global label relationships. While most existing FL research primarily focuses on single-label classification, many real-world applications, particularly in domains such as medical imaging, often involve multi-label settings. In this paper, we address this important yet underexplored scenario in FL, where clients hold multi-label data with skewed label distributions. Neural Collapse (NC) describes a geometric structure in the latent feature space where features of each class collapse to their class mean with vanishing intra-class variance, and the class means form a maximally separated configuration. Motivated by this theory, we propose a method to align feature distributions across clients and to learn high-quality, well-clustered representations. To make the NC-structure applicable to multi-label settings, where image-level features may contain multiple semantic concepts, we introduce a feature disentanglement module that extracts semantically specific features. The clustering of these disentangled class-wise features is guided by a predefined shared NC structure, which mitigates potential conflicts between client models due to diverse local data distributions. In addition, we design regularisation losses to encourage compact clustering in the latent feature space. Experiments conducted on four benchmark datasets across eight diverse settings demonstrate that our approach outperforms existing methods, validating its effectiveness in this challenging FL scenario.
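
In the Neural Collapse literature, the "maximally separated configuration" of class means is a simplex equiangular tight frame (ETF). The sketch below builds one and checks its geometry. It assumes the standard construction with prototypes living in a space whose dimension equals the number of classes. The shared structure actually used in the paper may differ, and the function names are hypothetical.

```python
import math


def simplex_etf(num_classes):
    """Rows are unit-norm class prototypes with pairwise cosine -1/(K-1)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # scale * (I - (1/K) * ones): the identity with the mean direction removed.
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


prototypes = simplex_etf(4)
print(cosine(prototypes[0], prototypes[1]))  # close to -1/3 for four classes
```

Because such a structure is fixed in advance and depends only on the number of classes, every client can target the same prototypes regardless of its local label distribution.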

[CV-92] Human AI for Accelerating Ad Localization Evaluation

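Quick read: This paper addresses the challenge of localizing advertisements for multilingual audiences, where text must be translated while visual consistency, spatial alignment, and stylistic integrity are preserved. The key to the solution is a structured framework that combines automated components with human oversight, integrating scene text detection, inpainting, machine translation (MT), and text reimposition. The framework accelerates ad localization evaluation workflows and yields results that are semantically accurate and visually coherent enough for real-world deployment.

Link: https://arxiv.org/abs/2509.12543
Authors: Harshit Rajgarhia, Shivali Dalmia, Mengyang Zhao, Mukherji Abhishek, Kiran Ganesh
Affiliations: Centific
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: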

Abstract:Adapting advertisements for multilingual audiences requires more than simple text translation; it demands preservation of visual consistency, spatial alignment, and stylistic integrity across diverse languages and formats. We introduce a structured framework that combines automated components with human oversight to address the complexities of advertisement localization. To the best of our knowledge, this is the first work to integrate scene text detection, inpainting, machine translation (MT), and text reimposition specifically for accelerating ad localization evaluation workflows. Qualitative results across six locales demonstrate that our approach produces semantically accurate and visually coherent localized advertisements, suitable for deployment in real-world workflows.

[CV-93] Axis-Aligned 3D Stalk Diameter Estimation from RGB-D Imagery

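Quick read: This paper addresses the drawbacks of traditional crop stalk diameter measurement in high-throughput phenotyping: it is labor-intensive, error-prone, and hard to scale. The key to the solution is a geometry-aware computer vision pipeline that combines deep learning-based instance segmentation, 3D point cloud reconstruction, and axis-aligned slicing via Principal Component Analysis (PCA). The pipeline mitigates the effects of curvature, occlusion, and image noise on measurement accuracy, giving a robust and scalable estimate of stalk diameter.

Link: https://arxiv.org/abs/2509.12511
Authors: Benjamin Vail, Rahul Harsha Cheppally, Ajay Sharda, Sidharth Rai
Affiliations: Kansas State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures, 4 tables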

Abstract:Accurate, high-throughput phenotyping is a critical component of modern crop breeding programs, especially for improving traits such as mechanical stability, biomass production, and disease resistance. Stalk diameter is a key structural trait, but traditional measurement methods are labor-intensive, error-prone, and unsuitable for scalable phenotyping. In this paper, we present a geometry-aware computer vision pipeline for estimating stalk diameter from RGB-D imagery. Our method integrates deep learning-based instance segmentation, 3D point cloud reconstruction, and axis-aligned slicing via Principal Component Analysis (PCA) to perform robust diameter estimation. By mitigating the effects of curvature, occlusion, and image noise, this approach offers a scalable and reliable solution to support high-throughput phenotyping in breeding and agronomic research.
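
A minimal sketch of PCA-based axis-aligned slicing follows, assuming an idealized, fully observed stalk point cloud (a real RGB-D view sees only one side, which this sketch ignores). It finds the dominant axis by power iteration, cuts the cloud into slices along that axis, and estimates a diameter per slice. All function names and the synthetic cylinder are hypothetical and not the paper's implementation.

```python
import math


def principal_axis(points, iters=100):
    """Return (centroid, unit vector of the dominant PCA direction)."""
    n = len(points)
    c = [sum(p[d] for p in points) / n for d in range(3)]
    centered = [[p[d] - c[d] for d in range(3)] for p in points]
    cov = [[sum(q[a] * q[b] for q in centered) / n for b in range(3)] for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):  # power iteration on the 3x3 covariance matrix
        w = [sum(cov[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return c, v


def slice_diameters(points, slice_len=1.0):
    """Estimate one diameter per slice taken along the principal axis."""
    c, axis = principal_axis(points)
    slices = {}
    for p in points:
        q = [p[d] - c[d] for d in range(3)]
        t = sum(q[d] * axis[d] for d in range(3))       # position along the axis
        perp = [q[d] - t * axis[d] for d in range(3)]   # offset perpendicular to it
        slices.setdefault(math.floor(t / slice_len), []).append(perp)
    diameters = []
    for key in sorted(slices):
        perps = slices[key]
        center = [sum(v[d] for v in perps) / len(perps) for d in range(3)]
        radii = [math.dist(v, center) for v in perps]
        diameters.append(2.0 * sum(radii) / len(radii))
    return diameters


def tilted_cylinder(radius=0.5, length=10.0, tilt_deg=30.0):
    """Synthetic stalk: rings of surface points on a cylinder tilted about the x axis."""
    tilt = math.radians(tilt_deg)
    pts = []
    for i in range(40):
        z = length * (i + 0.5) / 40
        for j in range(16):
            a = 2 * math.pi * j / 16
            x, y = radius * math.cos(a), radius * math.sin(a)
            pts.append((x, y * math.cos(tilt) - z * math.sin(tilt),
                        y * math.sin(tilt) + z * math.cos(tilt)))
    return pts


diameters = slice_diameters(tilted_cylinder())
print(len(diameters), diameters[0])
```

Using a per-slice center rather than the global axis is one simple way to tolerate mild curvature, since each slice is measured around its own local midpoint.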

[CV-94] Artist-Created Mesh Generation from Raw Observation

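Quick read: This paper addresses the generation of high-quality, artist-style meshes from noisy or incomplete point clouds, such as those captured by LiDAR or RGB-D cameras. Existing methods usually assume clean, complete input point clouds or rely on complex multi-stage pipelines, which limits their use in real-world scenarios. The key to the solution is an end-to-end framework that reformulates 3D point cloud refinement as a 2D inpainting task, so that powerful generative models can be used to directly produce complete, artist-style meshes, improving its ability to handle real-world data.

Link: https://arxiv.org/abs/2509.12501
Authors: Yao He, Youngjoong Kwon, Wenxiao Cai, Ehsan Adeli
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: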

Abstract:We present an end-to-end framework for generating artist-style meshes from noisy or incomplete point clouds, such as those captured by real-world sensors like LiDAR or mobile RGB-D cameras. Artist-created meshes are crucial for commercial graphics pipelines due to their compatibility with animation and texturing tools and their efficiency in rendering. However, existing approaches often assume clean, complete inputs or rely on complex multi-stage pipelines, limiting their applicability in real-world scenarios. To address this, we propose an end-to-end method that refines the input point cloud and directly produces high-quality, artist-style meshes. At the core of our approach is a novel reformulation of 3D point cloud refinement as a 2D inpainting task, enabling the use of powerful generative models. Preliminary results on the ShapeNet dataset demonstrate the promise of our framework in producing clean, complete meshes.

[CV-95] Instance-Guided Class Activation Mapping for Weakly Supervised Semantic Segmentation

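Quick read: This paper addresses two problems in Weakly Supervised Semantic Segmentation (WSSS): imprecise object boundary localization, and models that attend only to the most discriminative local regions rather than whole objects. The key to the solution is IG-CAM (Instance-Guided Class Activation Mapping), which generates high-quality, boundary-aware localization maps through three core innovations: (1) ground-truth segmentation masks guide Class Activation Mapping (CAM) generation to ensure complete object coverage; (2) influence functions capture the relationship between training samples and model predictions, making feature representations more robust; and (3) a multi-scale boundary enhancement strategy uses progressive refinement to obtain sharp, precise object boundaries. On PASCAL VOC 2012 the method reaches 82.3% mIoU without post-processing and 86.6% after Conditional Random Field (CRF) refinement, significantly outperforming existing WSSS methods.

Link: https://arxiv.org/abs/2509.12496
Authors: Ali Torabi, Sanjog Gaihre, MD Mahbubur Rahman, Yaqoob Majeed
Affiliations: University of Wyoming
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: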

Abstract:Weakly Supervised Semantic Segmentation (WSSS) addresses the challenge of training segmentation models using only image-level annotations, eliminating the need for expensive pixel-level labeling. While existing methods struggle with precise object boundary localization and often focus only on the most discriminative regions, we propose IG-CAM (Instance-Guided Class Activation Mapping), a novel approach that leverages instance-level cues and influence functions to generate high-quality, boundary-aware localization maps. Our method introduces three key innovations: (1) Instance-Guided Refinement that uses ground truth segmentation masks to guide CAM generation, ensuring complete object coverage rather than just discriminative parts; (2) Influence Function Integration that captures the relationship between training samples and model predictions, leading to more robust feature representations; and (3) Multi-Scale Boundary Enhancement that employs progressive refinement strategies to achieve sharp, precise object boundaries. IG-CAM achieves state-of-the-art performance on the PASCAL VOC 2012 dataset with an mIoU of 82.3% before post-processing, which further improves to 86.6% after applying Conditional Random Field (CRF) refinement, significantly outperforming previous WSSS methods. Our approach demonstrates superior localization accuracy, with complete object coverage and precise boundary delineation, while maintaining computational efficiency. Extensive ablation studies validate the contribution of each component, and qualitative comparisons across 600 diverse images showcase the method’s robustness and generalization capability. The results establish IG-CAM as a new benchmark for weakly supervised semantic segmentation, offering a practical solution for scenarios where pixel-level annotations are unavailable or prohibitively expensive.
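
For readers unfamiliar with the baseline that IG-CAM refines, the sketch below computes a classic class activation map: a weighted sum of the final convolutional feature maps, using the classifier weights of one class, followed by ReLU and normalization. It illustrates plain CAM only, not the instance-guided refinement, and the tiny feature maps and weights are made up.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps, clipped at zero and scaled to [0, 1]."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * fmap[i][j]
    cam = [[max(v, 0.0) for v in row] for row in cam]   # keep positive evidence only
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Two hypothetical 2x2 feature maps and the classifier weights of one class.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
cam = class_activation_map(feature_maps, [1.0, -1.0])
print(cam)  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the region with positive class evidence lights up, which shows why plain CAM tends to highlight the most discriminative part of an object rather than its full extent.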

[CV-96] Evaluating Robustness of Vision-Language Models Under Noisy Conditions

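Quick read: This paper addresses the insufficiently understood robustness of Vision-Language Models (VLMs) under noisy conditions, that is, how current VLMs degrade under controlled perturbations such as lighting variation, motion blur, and compression artifacts. The key to the solution is a systematic evaluation framework that quantifies semantic alignment under several noise conditions using lexical metrics (BLEU, METEOR, ROUGE, CIDEr) and neural similarity measures based on sentence embeddings. Experiments on diverse datasets reveal nuanced trade-offs among model size, dataset characteristics, and noise resilience, and provide a standardized benchmark for future robust multimodal learning.

Link: https://arxiv.org/abs/2509.12492
Authors: Purushoth, Alireza
Affiliations: University of Nevada Reno
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: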

Abstract:Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. However, their robustness under noisy conditions remains unfamiliar. In this study, we present a comprehensive evaluation framework to evaluate the performance of several state-of-the-art VLMs under controlled perturbations, including lighting variation, motion blur, and compression artifacts. We used both lexical-based metrics (BLEU, METEOR, ROUGE, CIDEr) and neural-based similarity measures using sentence embeddings to quantify semantic alignment. Our experiments span diverse datasets, revealing key insights: (1) descriptiveness of ground-truth captions significantly influences model performance; (2) larger models like LLaVA excel in semantic understanding but do not universally outperform smaller models; and (3) certain noise types, such as JPEG compression and motion blur, dramatically degrade performance across models. Our findings highlight the nuanced trade-offs between model size, dataset characteristics, and noise resilience, offering a standardized benchmark for future robust multimodal learning.
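
One of the controlled perturbations, motion blur, can be sketched as a horizontal box filter over a grayscale image. This is an assumed, simplified form of the perturbation (the study's exact kernels and parameters are not given here), and the function name `motion_blur` is hypothetical.

```python
def motion_blur(image, kernel_size):
    """Average each pixel with its horizontal neighbours (windows shrink at borders)."""
    half = kernel_size // 2
    blurred = []
    for row in image:
        out = []
        for i in range(len(row)):
            window = row[max(0, i - half): i + half + 1]
            out.append(sum(window) / len(window))
        blurred.append(out)
    return blurred


image = [[0, 0, 9, 0, 0]]          # a single bright pixel
print(motion_blur(image, 3))       # [[0.0, 3.0, 3.0, 3.0, 0.0]]
```

A robustness study would apply such a perturbation at increasing strengths, caption each perturbed image, and track how the caption metrics change relative to the clean image.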

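[CV-97] Towards Foundational Models for Single-Chip Radar ICCV2025

Quick read: This paper addresses the limited perception performance of low-cost single-chip millimeter-wave (mmWave) radar, caused by its poor angular resolution, in applications such as automotive and indoor sensing. The core difficulty is that existing methods mostly train small task-specific models, with no general foundational model or large-scale dataset to build on. The key to the solution is the largest raw radar dataset to date (1M samples, about 29 hours), on which the authors train the Generalizable Radar Transformer (GRT), a foundational model for 4D single-chip radar. GRT predicts 3D occupancy and semantic segmentation from raw radar data with quality comparable to much higher-resolution sensors. Experiments show that GRT generalizes across diverse settings, can be fine-tuned for different tasks, and scales logarithmically with data (about 20% improvement per 10× data), and that using raw radar data instead of conventional lossy representations is equivalent to a 10× increase in training data.

Link: https://arxiv.org/abs/2509.12482
Authors: Tianshu Huang, Akarsh Prabhakara, Chuhan Chen, Jay Karhade, Deva Ramanan, Matthew O’Toole, Anthony Rowe
Affiliations: Carnegie Mellon University; Bosch Research; University of Wisconsin–Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in ICCV 2025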

Abstract:mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20% per 10× data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a 10× increase in training data. Finally, we roughly estimate that ≈100M samples (3000 hours) of data are required to fully exploit the potential of GRT.

[CV-98] Image Tokenizer Needs Post-Training

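Quick read: This paper addresses the significant discrepancy between the reconstruction distribution and the generation distribution in the latent space of current image generative models: the frozen image tokenizer is optimized only for reconstruction and ignores the errors that arise during generative sampling. The key to the solution is a new tokenizer training scheme with a main-training stage and a post-training stage. Main training introduces a latent perturbation strategy that simulates the sampling noise of generative inference, which makes the tokenizer more robust and improves generation quality and convergence speed. Post-training further optimizes the tokenizer decoder against a well-trained generative model to narrow the gap between the generated and reconstructed token distributions. Experiments show a clear gain in generation quality (gFID falls from 1.60 after main training to 1.36 with post-training), and the method applies broadly across generative models and tokenizers.

Link: https://arxiv.org/abs/2509.12474
Authors: Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Affiliations: Carnegie Mellon University; Adobe Research; University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 16 figures, 10 tables. arXiv admin note: substantial text overlap with arXiv:2503.08354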

Abstract:Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distribution, where current tokenizers only prioritize the reconstruction task that happens before generative training without considering the generation errors during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space, and, from which, we propose a novel tokenizer training scheme including both main-training and post-training, focusing on improving latent space construction and decoding respectively. During the main training, a latent perturbation strategy is proposed to simulate sampling noises, i.e., the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer, thus boosting the generation quality and convergence speed, and a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality. During post-training, we further optimize the tokenizer decoder regarding a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
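
The idea of simulating sampling noise can be sketched as randomly swapping a fraction of discrete token indices before decoding. This is a simplified assumption: the paper's perturbation strategy may choose replacement tokens differently, and the names `perturb_tokens`, `rate`, and the codebook size of 16 are hypothetical.

```python
import random


def perturb_tokens(tokens, codebook_size, rate, rng):
    """Replace each token with a uniformly random codebook index with probability `rate`."""
    perturbed = []
    for token in tokens:
        if rng.random() < rate:
            perturbed.append(rng.randrange(codebook_size))  # simulated "unexpected" token
        else:
            perturbed.append(token)
    return perturbed


rng = random.Random(0)                      # fixed seed so runs are repeatable
tokens = [3, 7, 7, 1, 0, 12, 5, 9]
noisy = perturb_tokens(tokens, codebook_size=16, rate=0.25, rng=rng)
print(noisy)
```

Training the decoder to reconstruct the image from such perturbed sequences exposes it to the kind of token errors a generator makes at sampling time, rather than only to clean encoder outputs.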

[CV-99] Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles

Quick read: This paper addresses the difficulty of performing high-precision 3D reconstruction with small Unmanned Aerial Vehicles (UAVs), whose payload and autonomy are tightly constrained. The core of the solution is a dual-reconstruction pipeline whose key innovation is a real-time feedback loop: a near-real-time (near-RT) Structure from Motion (SfM) process generates an instantaneous point cloud, and the UAV trajectory is adapted dynamically, based on model quality, to cover poorly captured areas. A non-real-time (non-RT) Neural Radiance Fields (NeRF) method then fuses the SfM camera poses with Ultra Wide-Band (UWB) location data to produce a high-fidelity 3D reconstruction. The approach markedly improves the autonomous scanning capability and reconstruction accuracy of miniature UAVs in constrained environments.

Link: https://arxiv.org/abs/2509.12458
Authors: Àlmos Veres-Vitàlyos, Genis Castillo Gomez-Raya, Filip Lemic, Daniel Johannes Bugelnig, Bernhard Rinner, Sergi Abadal, Xavier Costa-Pérez
Affiliations: i2CAT Foundation; University of Klagenfurt; Universitat Politècnica de Catalunya; Faculty of Electrical Engineering and Computing, University of Zagreb; NEC Labs Europe GmbH; ICREA
Categories: Robotics (cs.RO); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Comments: 13 pages, 16 figures, 3 tables, 45 references

Abstract:Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for navigating indoor and hard-to-reach areas, yet their significant constraints in payload and autonomy have largely prevented their use for complex tasks like high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we introduce a novel system architecture that enables fully autonomous, high-fidelity 3D scanning of static objects using UAVs weighing under 100 grams. Our core innovation lies in a dual-reconstruction pipeline that creates a real-time feedback loop between data capture and flight control. A near-real-time (near-RT) process uses Structure from Motion (SfM) to generate an instantaneous pointcloud of the object. The system analyzes the model quality on the fly and dynamically adapts the UAV’s trajectory to intelligently capture new images of poorly covered areas. This ensures comprehensive data acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR) approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB) location data to achieve superior accuracy. We implemented and validated this architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both single- and multi-UAV configurations, conclusively show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths. This work demonstrates a scalable and autonomous solution that unlocks the potential of miniaturized UAVs for fine-grained 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.

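[CV-100] Two-Stage Decoupling Framework for Variable-Length Glaucoma Prognosis

Quick read: This paper addresses two challenges in glaucoma prognosis: existing methods depend on fixed-length sequential inputs and cannot flexibly handle clinical follow-up data of varying length, and traditional end-to-end models perform poorly on the small glaucoma datasets available. The key to the solution is a Two-Stage Decoupling Framework (TSDF). The first stage uses a self-supervised learning module to aggregate multiple glaucoma datasets for feature representation learning, disregarding differences in their labels, which improves the feature representations learned from small datasets. The second stage adds an attention-based temporal aggregation module that handles variable-length sequential inputs and makes efficient use of all available data while keeping the parameter count compact, which improves prognosis performance and robustness.

Link: https://arxiv.org/abs/2509.12453
Authors: Yiran Song, Yikai Zhang, Silvia Orengo-Nania, Nian Wang, Fenglong Ma, Rui Zhang, Yifan Peng, Mingquan Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures, 4 tables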

Abstract:Glaucoma is one of the leading causes of irreversible blindness worldwide. Glaucoma prognosis is essential for identifying at-risk patients and enabling timely intervention to prevent blindness. Many existing approaches rely on historical sequential data but are constrained by fixed-length inputs, limiting their flexibility. Additionally, traditional glaucoma prognosis methods often employ end-to-end models, which struggle with the limited size of glaucoma datasets. To address these challenges, we propose a Two-Stage Decoupling Framework (TSDF) for variable-length glaucoma prognosis. In the first stage, we employ a feature representation module that leverages self-supervised learning to aggregate multiple glaucoma datasets for training, disregarding differences in their supervisory information. This approach enables datasets of varying sizes to learn better feature representations. In the second stage, we introduce a temporal aggregation module that incorporates an attention-based mechanism to process sequential inputs of varying lengths, ensuring flexible and efficient utilization of all available data. This design significantly enhances model performance while maintaining a compact parameter size. Extensive experiments on two benchmark glaucoma datasets: the Ocular Hypertension Treatment Study (OHTS) and the Glaucoma Real-world Appraisal Progression Ensemble (GRAPE), which differ significantly in scale and clinical settings, demonstrate the effectiveness and robustness of our approach.
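
Attention-based aggregation over a variable number of visits can be sketched as a softmax-weighted average of per-visit feature vectors. The sketch assumes a single fixed query vector standing in for learned parameters. It is not the TSDF module itself, and `attention_pool` and the toy features are hypothetical.

```python
import math


def attention_pool(visits, query):
    """Softmax-weighted average of a variable-length list of feature vectors."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in visits]
    top = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v[d] for w, v in zip(weights, visits)) for d in range(len(query))]
    return pooled, weights


query = [1.0, 0.0]                                  # stand-in for a learned query vector
one_visit, w1 = attention_pool([[0.5, 2.0]], query)
three_visits, w3 = attention_pool([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]], query)
print(one_visit, three_visits, w3)
```

The output has the same size whether a patient has one visit or many, so the downstream predictor needs no padding or truncation to a fixed sequence length.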

[CV-101] Deep learning for 3D point cloud processing - from approaches tasks to its implications on urban and environmental applications

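Quick read: This paper addresses the limited transfer of deep learning-based point cloud processing methods into real-world practice. Existing work concentrates on network architecture design and overlooks key challenges of real scenes: very large data volumes, diverse scene content, varying point density, and multiple data modalities. The key to the solution is a meta review that systematically surveys current mainstream deep learning methods and their datasets, focusing on the core tasks of scene completion, registration, semantic segmentation, and modeling. By relating these tasks to typical urban and environmental applications, it identifies the gaps that remain at both the algorithmic and practical levels, offering guidance for follow-up research and deployment.

Link: https://arxiv.org/abs/2509.12452
Authors: Zhenxin Zhang, Zhihua Xu, Yuwei Cao, Ningli Xu, Shuye Wang, Shen’ao Cui, Zhen Li, Rongjun Qin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 57 Pages, 4 Figures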

Abstract:Point cloud processing, a fundamental task in the field of geomatics and computer vision, has been supporting tasks and applications at different scales from air to ground, including mapping, environmental monitoring, urban/tree structure modeling, automated driving, robotics, disaster response, etc. Due to the rapid development of deep learning, point cloud processing algorithms are nowadays almost entirely dominated by learning-based approaches, most of which have yet to be transitioned into real-world practice. Existing surveys primarily focus on the ever-updating network architectures that accommodate unordered point clouds, largely ignoring their practical value in typical point cloud processing applications, in which extra-large volumes of data, diverse scene contents, varying point density, and data modality need to be considered. In this paper, we provide a meta review on deep learning approaches and datasets that cover a selection of critical tasks of point cloud processing in use such as scene completion, registration, semantic segmentation, and modeling. By reviewing a broad range of urban and environmental applications these tasks can support, we identify gaps to be closed as these methods are transformed into applications and draw concluding remarks on both the algorithmic and practical aspects of the surveyed methods.

[CV-102] Cott-ADNet: Lightweight Real-Time Cotton Boll and Flower Detection Under Field Conditions

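Quick read: This paper addresses problems in cotton harvesting that stem from reliance on manual labor: low efficiency, yield losses, and the difficulty of accurately recognizing cotton boll maturity. Solving them would advance harvest automation, yield estimation, and breeding research. The key to the solution is Cott-ADNet, a lightweight real-time detector built on YOLOv11n. It strengthens spatial representation and robustness through improved convolutional designs and adds two new modules: a NeLU-enhanced Global Attention Mechanism, which better captures weak and low-contrast features, and a Dilated Receptive Field SPPF, which enlarges the receptive field for multi-scale context modeling at controlled computational cost. On a curated dataset of 4,966 labeled images and an external validation set, the method is accurate (93.3% mAP50) and efficient (only 7.5 GFLOPs) and remains stable under scale and rotation changes, making it a reliable basis for in-field deployment.

Link: https://arxiv.org/abs/2509.12442
Authors: Rui-Feng Wang, Mingrui Xu, Matthew C Bauer, Iago Beffart Schardong, Xiaowen Ma, Kangning Cui
Affiliations: University of Georgia; University of Georgia-Tifton Campus; Zhejiang University; Wake Forest University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, 1 table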

Abstract:Cotton is one of the most important natural fiber crops worldwide, yet harvesting remains limited by labor-intensive manual picking, low efficiency, and yield losses from missing the optimal harvest window. Accurate recognition of cotton bolls and their maturity is therefore essential for automation, yield estimation, and breeding research. We propose Cott-ADNet, a lightweight real-time detector tailored to cotton boll and flower recognition under complex field conditions. Building on YOLOv11n, Cott-ADNet enhances spatial representation and robustness through improved convolutional designs, while introducing two new modules: a NeLU-enhanced Global Attention Mechanism to better capture weak and low-contrast features, and a Dilated Receptive Field SPPF to expand receptive fields for more effective multi-scale context modeling at low computational cost. We curate a labeled dataset of 4,966 images, and release an external validation set of 1,216 field images to support future research. Experiments show that Cott-ADNet achieves 91.5% Precision, 89.8% Recall, 93.3% mAP50, 71.3% mAP, and 90.6% F1-Score with only 7.5 GFLOPs, maintaining stable performance under multi-scale and rotational variations. These results demonstrate Cott-ADNet as an accurate and efficient solution for in-field deployment, and thus provide a reliable basis for automated cotton harvesting and high-throughput phenotypic analysis. Code and dataset is available at this https URL.
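
As a quick consistency check on the reported numbers, the F1-Score is the harmonic mean of precision and recall. The short calculation below uses only the 91.5% Precision and 89.8% Recall quoted in the abstract.

```python
precision = 0.915   # reported Precision
recall = 0.898      # reported Recall

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(100 * f1, 1))  # 90.6
```

The result agrees with the 90.6% F1-Score stated in the abstract.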

[CV-103] DYNAMO: Dependency-Aware Deep Learning Framework for Articulated Assembly Motion Prediction

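Quick read: This paper addresses the problem of inferring the coupled motion of parts in mechanical assemblies from static geometry. In mechanisms such as gears, motion arises from geometric contact (meshing teeth or aligned axes), and traditional methods that depend on simplified kinematic structures or joint annotations model it poorly. The key to the solution is MechBench, a benchmark dataset of 693 diverse synthetic gear assemblies in which each part's motion trajectory is generated from geometric contact and transmission relationships. Building on it, the authors propose DYNAMO, a dependency-aware neural model that predicts each part's SE(3) motion trajectory directly from segmented CAD point clouds. By explicitly modeling motion dependencies between parts, it produces accurate and temporally consistent predictions.

Link: https://arxiv.org/abs/2509.12430
Authors: Mayank Patel, Rahul Jain, Asim Unmesh, Karthik Ramani
Affiliations: Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: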

Abstract:Understanding the motion of articulated mechanical assemblies from static geometry remains a core challenge in 3D perception and design automation. Prior work on everyday articulated objects such as doors and laptops typically assumes simplified kinematic structures or relies on joint annotations. However, in mechanical assemblies like gears, motion arises from geometric coupling, through meshing teeth or aligned axes, making it difficult for existing methods to reason about relational motion from geometry alone. To address this gap, we introduce MechBench, a benchmark dataset of 693 diverse synthetic gear assemblies with part-wise ground-truth motion trajectories. MechBench provides a structured setting to study coupled motion, where part dynamics are induced by contact and transmission rather than predefined joints. Building on this, we propose DYNAMO, a dependency-aware neural model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds. Experiments show that DYNAMO outperforms strong baselines, achieving accurate and temporally consistent predictions across varied gear configurations. Together, MechBench and DYNAMO establish a novel systematic framework for data-driven learning of coupled mechanical motion in CAD assemblies.
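
The kind of coupled motion at stake can be illustrated with the textbook relation for a simple train of externally meshing gears: each gear turns opposite to its neighbour, at a speed scaled by the ratio of tooth counts. The sketch shows the underlying dependency only. It is not DYNAMO, which learns such relations from geometry, and the function name and tooth counts are hypothetical.

```python
def gear_train_speeds(teeth, input_speed):
    """Angular speed of each gear in a chain of externally meshing gears."""
    speeds = [input_speed]
    for driver, driven in zip(teeth, teeth[1:]):
        # Meshing reverses direction and scales speed by the tooth ratio.
        speeds.append(-speeds[-1] * driver / driven)
    return speeds


speeds = gear_train_speeds([20, 40, 10], input_speed=1.0)
print(speeds)  # [1.0, -0.5, 2.0]
```

No joint annotation appears anywhere in this calculation: the motion of every part follows from the motion of one part plus the contact relationships, which is the dependency structure a dependency-aware model has to recover.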

[CV-104] From Orthomosaics to Raw UAV Imagery: Enhancing Palm Detection and Crown-Center Localization

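Quick read: This paper addresses the accuracy and practicality of detecting individual trees and localizing crown centers in UAV imagery of tropical forests, in particular how to improve detection in field deployments while reducing reliance on heavy preprocessing. The key step is a comparison of raw UAV imagery against orthomosaic imagery, within-domain and cross-domain, combined with crown-center annotations for localization. The study finds that raw imagery performs better in deployment-relevant scenarios, while orthomosaics remain valuable for robust cross-domain generalization. Training with crown-center annotations further improves localization beyond bounding-box centroids and provides precise tree positions for ecological monitoring.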

Link: https://arxiv.org/abs/2509.12400
Authors: Rongkun Zhu, Kangning Cui, Wei Tang, Rui-Feng Wang, Sarra Alqahtani, David Lutz, Fan Yang, Paul Fine, Jordan Karubian, Robert Plemmons, Jean-Michel Morel, Victor Pauca, Miles Silman
Affiliations: Hong Kong Baptist University; Wake Forest University; City University of Hong Kong; University of Georgia; Colby-Sawyer College; University of California, Berkeley; Tulane University; Lingnan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures, 2 tables

Abstract:Accurate mapping of individual trees is essential for ecological monitoring and forest management. Orthomosaic imagery from unmanned aerial vehicles (UAVs) is widely used, but stitching artifacts and heavy preprocessing limit its suitability for field deployment. This study explores the use of raw UAV imagery for palm detection and crown-center localization in tropical forests. Two research questions are addressed: (1) how detection performance varies across orthomosaic and raw imagery, including within-domain and cross-domain transfer, and (2) to what extent crown-center annotations improve localization accuracy beyond bounding-box centroids. Using state-of-the-art detectors and keypoint models, we show that raw imagery yields superior performance in deployment-relevant scenarios, while orthomosaics retain value for robust cross-domain generalization. Incorporating crown-center annotations in training further improves localization and provides precise tree positions for downstream ecological analyses. These findings offer practical guidance for UAV-based biodiversity and conservation monitoring.

[CV-105] GhostNetV3-Small: A Tailored Architecture and Comparative Study of Distillation Strategies for Tiny Images

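Quick read: This paper addresses the high computational cost of deploying deep neural networks on resource-constrained edge devices, with a focus on the accuracy loss that mobile architectures suffer on low-resolution image classification. The key contribution is GhostNetV3-Small, a modified variant of the mobile-oriented GhostNetV3 that is restructured for low-resolution inputs such as CIFAR-10, where it reaches 93.94% accuracy and significantly outperforms the original GhostNetV3. The study also finds that every knowledge distillation strategy examined (traditional distillation, teacher assistants, and teacher ensembles) reduced accuracy relative to baseline training. This suggests that architectural adaptation can matter more than distillation for small-scale image classification, and that distillation techniques suited to low-resolution domains need further research.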

Link: https://arxiv.org/abs/2509.12380
Authors: Florian Zager, Hamza A. A. Gardi
Affiliations: ETIT-KIT; IIIT at ETIT-KIT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Deep neural networks have achieved remarkable success across a range of tasks; however, their computational demands often make them unsuitable for deployment on resource-constrained edge devices. This paper explores strategies for compressing and adapting models to enable efficient inference in such environments. We focus on GhostNetV3, a state-of-the-art architecture for mobile applications, and propose GhostNetV3-Small, a modified variant designed to perform better on low-resolution inputs such as those in the CIFAR-10 dataset. In addition to architectural adaptation, we provide a comparative evaluation of knowledge distillation techniques, including traditional knowledge distillation, teacher assistants, and teacher ensembles. Experimental results show that GhostNetV3-Small significantly outperforms the original GhostNetV3 on CIFAR-10, achieving an accuracy of 93.94%. Contrary to expectations, all examined distillation strategies led to reduced accuracy compared to baseline training. These findings indicate that architectural adaptation can be more impactful than distillation in small-scale image classification tasks, highlighting the need for further research on effective model design and advanced distillation techniques for low-resolution domains.
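
For readers unfamiliar with "traditional knowledge distillation", the following is a minimal sketch of the commonly used soft-target loss, not the paper's implementation. The function names, the temperature of 4.0, and the weighting `alpha` are illustrative assumptions; the abstract does not state which loss or hyperparameters the authors used.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student).

    temperature and alpha are hypothetical defaults, not values from the paper.
    """
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL divergence between softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(q_teacher, q_student) if t > 0)
    return alpha * hard + (1 - alpha) * temperature ** 2 * kl
```

When the student already matches the teacher, the KL term vanishes and only the weighted cross-entropy remains, which is a quick sanity check for any implementation of this loss.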

[CV-106] DS@GT AnimalCLEF: Triplet Learning over ViT Manifolds with Nearest Neighbor Classification for Animal Re-identification

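Quick read: This paper addresses how to use features from pretrained models for identifying known individuals and detecting new ones in animal re-identification (re-ID). The core question is whether, with limited and scenario-specific data, features from a general-purpose model such as DINOv2 are good enough for fine-grained identification, and how much post-hoc processing can help. The approach compares a general-purpose backbone (DINOv2) with a domain-specific backbone (MegaDescriptor) as embedding extractors, then applies a K-Nearest Neighbor (KNN) classifier with robust thresholding to identify known individuals or flag new ones. A triplet-learning projection head improved the specialized model by 0.13 points on averaged BAKS and BAUS but gave only a 0.03 gain for the general-purpose model, indicating that the general-purpose feature space is hard to reshape for fine-grained tasks and underscoring the importance of domain-specific pre-training.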

Link: https://arxiv.org/abs/2509.12353
Authors: Anthony Miyaguchi, Chandrasekaran Maruthaiyannan, Charles R. Clark
Affiliations: Georgia Institute of Technology; University of Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CLEF 2025 working notes

Abstract:This paper details the DS@GT team’s entry for the AnimalCLEF 2025 re-identification challenge. Our key finding is that the effectiveness of post-hoc metric learning is highly contingent on the initial quality and domain-specificity of the backbone embeddings. We compare a general-purpose model (DINOv2) with a domain-specific model (MegaDescriptor) as a backbone. A K-Nearest Neighbor classifier with robust thresholding then identifies known individuals or flags new ones. While a triplet-learning projection head improved the performance of the specialized MegaDescriptor model by 0.13 points, it yielded minimal gains (0.03) for the general-purpose DINOv2 on averaged BAKS and BAUS. We demonstrate that the general-purpose manifold is more difficult to reshape for fine-grained tasks, as evidenced by stagnant validation loss and qualitative visualizations. This work highlights the critical limitations of refining general-purpose features for specialized, limited-data re-ID tasks and underscores the importance of domain-specific pre-training. The implementation for this work is publicly available at this http URL.
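
The identification step described in the abstract (nearest-neighbor voting plus a threshold that flags unseen animals) can be illustrated roughly as follows. This is a sketch under stated assumptions: cosine similarity, `k=3`, the 0.6 threshold, and the `"new_individual"` label are all hypothetical choices, and the team's actual "robust thresholding" rule is not given in the abstract.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_identify(query, gallery, k=3, threshold=0.6):
    """Return the majority identity among the k most similar gallery items.

    gallery is a list of (embedding, identity) pairs. If even the best match
    falls below the (hypothetical) similarity threshold, the query is flagged
    as a previously unseen individual.
    """
    scored = sorted(
        ((cosine_similarity(query, emb), ident) for emb, ident in gallery),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = scored[:k]
    if not top or top[0][0] < threshold:
        return "new_individual"
    votes = Counter(ident for _, ident in top)
    return votes.most_common(1)[0][0]
```

In practice the embeddings would come from the backbone (optionally followed by the projection head), and the threshold would be tuned on validation data.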

[CV-107] Uncertainty-Aware Hourly Air Temperature Mapping at 2 km Resolution via Physics-Guided Deep Learning

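Quick read: This paper addresses the lack of spatiotemporally seamless near-surface air temperature data: weather stations provide continuous records but sparse spatial coverage, while satellites cover wide areas but are obscured by clouds and do not measure air temperature directly. The solution is a data-driven, physics-guided deep learning approach called Amplifier Air-Transformer. It first reconstructs cloud-obscured GOES-16 surface temperature with a neural network that encodes the annual temperature cycle, uses a linear term to amplify ERA5 temperature values at finer scales, and applies convolutional layers to capture spatiotemporal variation. A second neural network then converts the reconstructed surface temperature into air temperature, and deep ensemble learning provides predictive uncertainty estimates. The result is hourly air temperature mapping at 2 km resolution over the contiguous United States, with an accuracy of 1.93 °C in station-based validation.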

Link: https://arxiv.org/abs/2509.12329
Authors: Shengjie Kris Liu, Siqin Wang, Lu Zhang
Affiliations: University of Southern California; Keck School of Medicine
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Near-surface air temperature is a key physical property of the Earth’s surface. Although weather stations offer continuous monitoring and satellites provide broad spatial coverage, no single data source offers seamless data in a spatiotemporal fashion. Here, we propose a data-driven, physics-guided deep learning approach to generate hourly air temperature data at 2 km resolution over the contiguous United States. The approach, called Amplifier Air-Transformer, first reconstructs GOES-16 surface temperature data obscured by clouds. It does so through a neural network encoded with the annual temperature cycle, incorporating a linear term to amplify ERA5 temperature values at finer scales and convolutional layers to capture spatiotemporal variations. Then, another neural network transforms the reconstructed surface temperature into air temperature by leveraging its latent relationship with key Earth surface properties. The approach is further enhanced with predictive uncertainty estimation through deep ensemble learning to improve reliability. The proposed approach is built and tested on 77.7 billion surface temperature pixels and 155 million air temperature records from weather stations across the contiguous United States (2018-2024), achieving hourly air temperature mapping accuracy of 1.93 °C in station-based validation. The proposed approach streamlines surface temperature reconstruction and air temperature prediction, and it can be extended to other satellite sources for seamless air temperature monitoring at high spatiotemporal resolution. The generated data of this study can be downloaded at this https URL, and the project webpage can be found at this https URL.
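
The idea behind ensemble-based uncertainty can be shown with a toy sketch: several independently trained models predict the same quantity, and the spread of their predictions serves as an uncertainty estimate. The code below is a simplified illustration only. The function name is hypothetical, the "members" are stand-in callables rather than neural networks, and using the population standard deviation as the spread is an assumption, not the paper's stated formulation.

```python
from statistics import mean, pstdev


def ensemble_predict(members, x):
    """Run every ensemble member on x and return (mean, spread).

    members is a list of callables mapping an input to a float prediction.
    The spread (population standard deviation across members) is used here
    as a simple proxy for predictive uncertainty.
    """
    predictions = [member(x) for member in members]
    return mean(predictions), pstdev(predictions)


# Toy stand-ins for trained models: each adds a different bias (in degrees).
toy_members = [lambda t: t + 1.0, lambda t: t + 2.0, lambda t: t + 3.0]
estimate, spread = ensemble_predict(toy_members, 10.0)
```

A pixel where the members disagree strongly would be reported with a wide uncertainty band, while close agreement indicates a more reliable estimate.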

[CV-108] Domain Adaptive SAR Wake Detection: Leveraging Similarity Filtering and Memory Guidance

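Quick read: This paper addresses cross-modal domain adaptation for ship wake detection: how to transfer a detector trained on optical images to Synthetic Aperture Radar (SAR) images while keeping it accurate and robust. The main obstacle is the large visual discrepancy and domain shift between optical and SAR imagery, which degrades direct transfer. The proposed framework, SimMemDA (Similarity-Guided and Memory-Guided Domain Adaptation), combines several mechanisms. First, WakeGAN performs style transfer on optical images to produce SAR-style pseudo-images, and an instance-level feature similarity filter prioritizes source samples with target-like distributions to limit negative transfer. Second, a Feature-Confidence Memory Bank with a K-nearest-neighbor confidence-weighted fusion strategy dynamically calibrates target-domain pseudo-labels, improving their reliability and stability. Finally, region-mixed training combines source annotations with the calibrated target pseudo-labels to improve generalization.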

Link: https://arxiv.org/abs/2509.12279
Authors: He Gao, Baoxiang Huang, Milena Radenkovic, Borui Li, Ge Chen
Affiliations: Qingdao University; Laboratory for Regional Oceanography and Numerical Modeling; The University of Nottingham; Ocean University of China; Laoshan Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Synthetic Aperture Radar (SAR), with its all-weather and wide-area observation capabilities, serves as a crucial tool for wake detection. However, due to its complex imaging mechanism, wake features in SAR images often appear abstract and noisy, posing challenges for accurate annotation. In contrast, optical images provide more distinct visual cues, but models trained on optical data suffer from performance degradation when applied to SAR images due to domain shift. To address this cross-modal domain adaptation challenge, we propose a Similarity-Guided and Memory-Guided Domain Adaptation (termed SimMemDA) framework for unsupervised domain adaptive ship wake detection via instance-level feature similarity filtering and feature memory guidance. Specifically, to alleviate the visual discrepancy between optical and SAR images, we first utilize WakeGAN to perform style transfer on optical images, generating pseudo-images close to the SAR style. Then, an instance-level feature similarity filtering mechanism is designed to identify and prioritize source samples with target-like distributions, minimizing negative transfer. Meanwhile, a Feature-Confidence Memory Bank combined with a K-nearest neighbor confidence-weighted fusion strategy is introduced to dynamically calibrate pseudo-labels in the target domain, improving the reliability and stability of pseudo-labels. Finally, the framework further enhances generalization through region-mixed training, strategically combining source annotations with calibrated target pseudo-labels. Experimental results demonstrate that the proposed SimMemDA method can improve the accuracy and robustness of cross-modal ship wake detection tasks, validating the effectiveness and feasibility of the proposed method.

[CV-109] PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models

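Quick read: This paper addresses the lack of position awareness in conventional Text Image Machine Translation (TIMT): existing methods usually translate all the text in an image without providing bounding boxes, and they cover only a limited set of scenarios. The authors propose position-aware TIMT (PATIMT), which supports fine-grained region-specific translation and full-image translation with grounding, enabling layout-preserving translation. The key components are the PATIMT benchmark (PATIMTBench), which covers 10 diverse real-world scenarios; an Adaptive Image OCR Refinement Pipeline that selects OCR tools by scenario and refines results on text-rich images; and a test set of 1,200 high-quality instances manually annotated and reviewed by human experts. After fine-tuning on this data, compact Large Vision-Language Models (LVLMs) achieve state-of-the-art performance on both sub-tasks, and the experiments also highlight the scalability and generalizability of the training data.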

Link: https://arxiv.org/abs/2509.12278
Authors: Wanru Zhuang, Wenbo Li, Zhibin Lan, Xu Han, Peng Li, Jinsong Su
Affiliations: Xiamen University; Tsinghua University; Shanghai Artificial Intelligence Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Text Image Machine Translation (TIMT) aims to translate texts embedded within an image into another language. Current TIMT studies primarily focus on providing translations for all the text within an image, while neglecting to provide bounding boxes and covering limited scenarios. In this work, we extend traditional TIMT into position-aware TIMT (PATIMT), aiming to support fine-grained and layout-preserving translation, which holds great practical value but remains largely unexplored. This task comprises two key sub-tasks: region-specific translation and full-image translation with grounding. To support existing models on PATIMT and conduct fair evaluation, we construct the PATIMT benchmark (PATIMTBench), which consists of 10 diverse real-world scenarios. Specifically, we introduce an Adaptive Image OCR Refinement Pipeline, which adaptively selects appropriate OCR tools based on scenario and refines the results of text-rich images. To ensure evaluation reliability, we further construct a test set, which contains 1,200 high-quality instances manually annotated and reviewed by human experts. After fine-tuning on our data, compact Large Vision-Language Models (LVLMs) achieve state-of-the-art performance on both sub-tasks. Experimental results also highlight the scalability and generalizability of our training data.

[CV-110] GraphDerm: Fusing Imaging Physical Scale and Metadata in a Population-Graph Classifier for Dermoscopic Lesions

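Quick read: This paper addresses the fact that image-only AI for dermoscopic classification often ignores patient metadata (age, sex, lesion site) and the millimeter-scale physical calibration needed for geometric analysis of lesions. The proposed GraphDerm framework is a population-graph classifier that fuses imaging, millimeter-scale calibration, and metadata; the authors describe it as, to the best of their knowledge, the first ISIC-scale application of graph neural networks (GNNs) to dermoscopy. The pipeline synthesizes ruler-embedded images and trains U-Nets to segment lesions and rulers, regresses pixels-per-millimeter from the two-point correlation of the ruler mask, and computes real-scale descriptors (area, perimeter, radius of gyration) from lesion masks. It then builds a population graph whose node features come from EfficientNet-B3 and whose edges encode metadata and geometry similarity, and applies a spectral GNN for semi-supervised node classification. On ISIC-2019 the graph reaches AUC 0.9812 versus 0.9440 for the image-only baseline, and a thresholded variant using about 25% of the edges preserves AUC 0.9788, which suggests efficient deployment is feasible.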

Link: https://arxiv.org/abs/2509.12277
Authors: Mehdi Yousefzadeh, Parsa Esfahanian, Sara Rashidifar, Hossein Salahshoor Gavalan, Negar Sadat Rafiee Tabatabaee, Saeid Gorgin, Dara Rahmati, Maryam Daneshpazhooh
Affiliations: Institute for Research in Fundamental Sciences (IPM); Shahid Beheshti University; University of Tehran; Alborz University of Medical Sciences; Sungkyunkwan University; Tehran University of Medical Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Introduction. Dermoscopy aids melanoma triage, yet image-only AI often ignores patient metadata (age, sex, site) and the physical scale needed for geometric analysis. We present GraphDerm, a population-graph framework that fuses imaging, millimeter-scale calibration, and metadata for multiclass dermoscopic classification, to the best of our knowledge the first ISIC-scale application of GNNs to dermoscopy. Methods. We curate ISIC 2018/2019, synthesize ruler-embedded images with exact masks, and train U-Nets (SE-ResNet-18) for lesion and ruler segmentation. Pixels-per-millimeter are regressed from the ruler-mask two-point correlation via a lightweight 1D-CNN. From lesion masks we compute real-scale descriptors (area, perimeter, radius of gyration). Node features use EfficientNet-B3; edges encode metadata/geometry similarity (fully weighted or thresholded). A spectral GNN performs semi-supervised node classification; an image-only ANN is the baseline. Results. Ruler and lesion segmentation reach Dice 0.904 and 0.908; scale regression attains MAE 1.5 px (RMSE 6.6). The graph attains AUC 0.9812, with a thresholded variant using about 25% of edges preserving AUC 0.9788 (vs. 0.9440 for the image-only baseline); per-class AUCs typically fall in the 0.97-0.99 range. Conclusion. Unifying calibrated scale, lesion geometry, and metadata in a population graph yields substantial gains over image-only pipelines on ISIC-2019. Sparser graphs retain near-optimal accuracy, suggesting efficient deployment. Scale-aware, graph-based AI is a promising direction for dermoscopic decision support; future work will refine learned edge semantics and evaluate on broader curated benchmarks.
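
To make the "real-scale descriptors" concrete, here is a minimal sketch of how a binary lesion mask and a pixels-per-millimeter estimate could be turned into physical measurements. It is an illustration under assumptions: the function name and dictionary keys are hypothetical, perimeter is omitted for brevity, and the authors' exact definitions are not given in the abstract.

```python
import math


def real_scale_descriptors(mask, pixels_per_mm):
    """Compute area (mm^2) and radius of gyration (mm) of a binary mask.

    mask is a 2D list of 0/1 values; pixels_per_mm is the calibrated scale.
    Radius of gyration is the root-mean-square distance of lesion pixels
    from their centroid, converted from pixels to millimeters.
    """
    points = [(r, c)
              for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not points:
        return {"area_mm2": 0.0, "radius_of_gyration_mm": 0.0}
    n = len(points)
    centroid_r = sum(r for r, _ in points) / n
    centroid_c = sum(c for _, c in points) / n
    mean_sq = sum((r - centroid_r) ** 2 + (c - centroid_c) ** 2
                  for r, c in points) / n
    return {
        # Each pixel covers (1 / pixels_per_mm)^2 square millimeters.
        "area_mm2": n / pixels_per_mm ** 2,
        "radius_of_gyration_mm": math.sqrt(mean_sq) / pixels_per_mm,
    }
```

Descriptors of this kind let two lesions photographed at different magnifications be compared on a common physical scale.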

[CV-111] Developing an aeroponic smart experimental greenhouse for controlling irrigation and plant disease detection using deep learning and IoT

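Quick read: This paper addresses the difficulty of monitoring plant status and environmental conditions in greenhouses in real time and of detecting disease early, both of which affect how promptly and precisely crops can be managed. The solution is an experimental-scale smart aeroponic greenhouse that integrates the Internet of Things (IoT) and artificial intelligence (AI). The IoT platform continuously publishes temperature, humidity, water flow, and charge-tank volume online and adjusts the controlled parameters to keep plants in an optimal growth environment. The AI component is a disease detection framework built on VGG-19, InceptionResNetV2, and InceptionV3; VGG-19 distinguished drought-stressed and rust-affected leaves from healthy leaves with the highest accuracy of the three, 92%, and can support users in making informed management decisions.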

Link: https://arxiv.org/abs/2509.12274
Authors: Mohammadreza Narimani, Ali Hajiahmad, Ali Moghimi, Reza Alimardani, Shahin Rafiee, Amir Hossein Mirzabe
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Author-accepted version. Presented at ASABE Annual International Meeting (AIM) 2021 (virtual), Paper 2101252. Please cite the published meeting paper: doi: https://doi.org/10.13031/aim.202101252 . Minor wording and formatting updates in this preprint

Abstract:Controlling environmental conditions and monitoring plant status in greenhouses is critical to promptly making appropriate management decisions aimed at promoting crop production. The primary objective of this research study was to develop and test a smart aeroponic greenhouse on an experimental scale where the status of Geranium plant and environmental conditions are continuously monitored through the integration of the internet of things (IoT) and artificial intelligence (AI). An IoT-based platform was developed to control the environmental conditions of plants more efficiently and provide insights to users to make informed management decisions. In addition, we developed an AI-based disease detection framework using VGG-19, InceptionResNetV2, and InceptionV3 algorithms to analyze the images captured periodically after an intentional inoculation. The performance of the AI framework was compared with an expert’s evaluation of disease status. Preliminary results showed that the IoT system implemented in the greenhouse environment is able to publish data such as temperature, humidity, water flow, and volume of charge tanks online continuously to users and adjust the controlled parameters to provide an optimal growth environment for the plants. Furthermore, the results of the AI framework demonstrate that the VGG-19 algorithm was able to identify drought stress and rust leaves from healthy leaves with the highest accuracy, 92% among the other algorithms.

[CV-112] A Modern Look at Simplicity Bias in Image Classification Tasks

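Quick read: This paper addresses two open problems: simplicity bias (SB) is hard to quantify in large models, and its relevance to performance on image classification tasks is poorly understood. Prior studies mostly consider small models or synthetic tasks and offer little fine-grained characterization or empirical analysis of SB in large pretrained models such as CLIP. The key contribution is a frequency-aware SB measure that captures finer-grained differences; it is validated on CLIP models subjected to two recent SB-modulation methods and is shown to be more informative and consistent than previous measures. Across a range of image classification settings, including zero-shot and fine-tuning, the relationship between SB and performance turns out to be task-dependent. For example, a stronger SB correlates with better performance on out-of-distribution generalization than on adversarial robustness, which highlights the benefit of aligning a model's inductive biases with the characteristics of the target task.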

Link: https://arxiv.org/abs/2509.12265
Authors: Xiaoguang Chang, Teng Wang, Changyin Sun
Affiliations: School of Cyber Science and Engineering, Southeast University; School of Automation, Southeast University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The simplicity bias (SB) of neural networks, i.e., their tendency to represent simple functions, is a key factor in their generalization capabilities. Recent studies show that an excessive SB may harm performance on complex tasks, and the need for this bias varies across tasks. Many of these studies focus on simple models or synthetic tasks. It remains challenging to measure the SB in large models and little is known about the relevance of the SB to various image classification tasks. In this paper, we investigate the relationship between the SB in CLIP models and their performance across image classification tasks. First, we theoretically analyze the potential limitation of existing measures of complexity that have been used to characterize small models. To address this, we propose a frequency-aware measure capturing finer-grained SB differences. We validate this measure on CLIP models subjected to two recent SB-modulation methods, demonstrating that it is more informative and consistent than previous measures. Second, we examine the relation between the SB of those models and their performance across a range of image classification tasks, including zero-shot and fine-tuning settings. These experiments reveal a range of behaviors. For example, a stronger SB correlates with a better performance on OOD generalization than on adversarial robustness. These results highlight the benefits of aligning a model’s inductive biases with the characteristics of the target task.

[CV-113] EfficientNet-Based Multi-Class Detection of Real Deepfake and Plastic Surgery Faces

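Quick read: This paper focuses on the societal risks that accompany the wide use of deep learning in computer vision, and in particular on the threats that deepfake technology poses to privacy, the reputations of public figures, and national security. As summarized from the abstract, its contribution is to identify and describe the many-sided harms of forged images and videos, including the spread of misleading content, the impairment of facial recognition systems, and political manipulation. This provides a basis for the subsequent development of detection mechanisms and governance strategies.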

Link: https://arxiv.org/abs/2509.12258
Authors: Li Kun, Milena Radenkovic
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Currently, deep learning has been utilised to tackle several difficulties in our everyday lives. It not only exhibits progress in computer vision but also constitutes the foundation for several revolutionary technologies. Nonetheless, similar to all phenomena, the use of deep learning in diverse domains has produced a multifaceted interaction of advantages and disadvantages for human society. Deepfake technology has advanced, significantly impacting social life. However, developments in this technology can affect privacy, the reputations of prominent personalities, and national security via software development. It can produce indistinguishable counterfeit photographs and films, potentially impairing the functionality of facial recognition systems, thereby presenting a significant risk. The improper application of deepfake technology produces several detrimental effects on society. Face-swapping programs mislead users by altering persons’ appearances or expressions to fulfil particular aims or to appropriate personal information. Deepfake technology permeates daily life through such techniques. Certain individuals endeavour to sabotage election campaigns or subvert prominent political figures by creating deceptive pictures to influence public perception, causing significant harm to a nation’s political and economic structure.

[CV-114] V-Math: An Agentic Approach to the Vietnamese National High School Graduation Mathematics Exams

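Quick read: This paper addresses two problems in preparing for Vietnam's National High School Graduation Mathematics Exams (NHSGMEs): students lack personalized study resources, and teachers carry a heavy burden in writing exam questions. The solution is V-Math, an autonomous agentic framework built around three specialized AI agents: a specification-matrix-conditioned question generator, a solver/explainer that produces detailed step-by-step reasoning, and a personalized tutor that adapts to student performance. Beyond supporting self-paced practice, the framework helps teachers generate compliant, varied, high-quality exam questions and question banks, reducing manual workload and enriching instructional resources in support of scalable, equitable exam preparation.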

Link: https://arxiv.org/abs/2509.12251
Authors: Duong Q. Nguyen, Quy P. Nguyen, Nguyen Van Nhon, Quang-Thinh Bui, H. Nguyen-Xuan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Abstract:This paper develops an autonomous agentic framework called V-Math that aims to assist Vietnamese high school students in preparing for the National High School Graduation Mathematics Exams (NHSGMEs). The salient framework integrates three specialized AI agents: a specification-matrix-conditioned question generator, a solver/explainer for detailed step-by-step reasoning, and a personalized tutor that adapts to student performance. Beyond enabling self-paced student practice, V-Math supports teachers by generating innovative, compliant exam questions and building diverse, high-quality question banks. This reduces manual workload and enriches instructional resources. We describe the system architecture, focusing on practice modes for learners and teacher-oriented features for question generation. Preliminary evaluations demonstrate that V-Math produces matrix-aligned exams with high solution accuracy, delivers coherent explanations, and enhances the variety of practice materials. These results highlight its potential to support scalable, equitable mathematics preparation aligned with national standards while also empowering teachers through AI-assisted exam creation.

[CV-115] OnlineHOI: Towards Online Human-Object Interaction Generation and Perception ACM-MM2025

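Quick read: This paper addresses the perception and generation of Human-Object Interaction (HOI) in an online setting, where a model can use only the current moment and historical data rather than the entire interaction sequence that offline methods assume. Existing offline methods perform poorly in this setting. The proposed OnlineHOI framework is a network architecture based on Mamba that adds a memory mechanism, combining Mamba's strength at modeling streaming data with the memory module's efficient integration of historical information. It achieves state-of-the-art results on the Core4D and OAKINK2 online generation tasks and on the HOI4D online perception task.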

Link: https://arxiv.org/abs/2509.12250
Authors: Yihong Ji, Yunze Liu, Yiyao Zhuo, Weijiang Yu, Fei Ma, Joshua Huang, Fei Yu
Affiliations: Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Tsinghua University; Shanghai Qi Zhi Institute; Sun Yat-Sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted at ACM MM 2025

Abstract:The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address this task, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba’s powerful modeling capabilities for streaming data and the Memory mechanism’s efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.

[CV-116] Modular On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture

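Quick read: This paper addresses inefficient nutrient management in crop production: conventional nutrient analysis is too slow for real-time optimization, and image-based phenotyping, although fast, can be too computationally intensive to deploy under resource constraints. The key contribution is a flexible, tiered pipeline for anomaly detection and status estimation. An autoencoder (AE) first provides early warning of anomalies such as nutrient depletion. Two status estimation modules of different complexity then give a more detailed analysis: vegetation index (VI) features with a Random Forest (RF), and a Vision Transformer (ViT) that learns directly from raw whole images. The AE achieved 73% net detection of T3 samples 9 days after transplanting at substantially lower energy than the embodied energy of wasted nitrogen. ViT outperformed RF on phosphorus and calcium estimation (R2 0.61 vs. 0.58 and 0.48 vs. 0.35) at a higher energy cost, pointing to practical options for edge diagnostics and agricultural sustainability.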

Link: https://arxiv.org/abs/2509.12247
Authors: Abigail R. Cohen, Yuming Sun, Zhihao Qin, Harsh S. Muriki, Zihao Xiao, Yeonju Lee, Matthew Housley, Andrew F. Sharkey, Rhuanito S. Ferrarezi, Jing Li, Lu Gan, Yongsheng Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R2 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.

[CV-117] RU-Net for Automatic Characterization of TRISO Fuel Cross Sections

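Quick read: This paper addresses the automated analysis of irradiation-induced morphological changes, such as kernel swelling and buffer densification, in tristructural isotropic (TRISO) particle fuel; manual post-irradiation microscopy is cumbersome and subjective. The solution uses convolutional neural networks (CNNs) to segment cross-sectional images of irradiated TRISO particles automatically. The authors build a dataset of more than 2,000 annotated microscopic images and compare four architectures: RU-Net (developed in this study), U-Net, ResNet, and Attention U-Net. Preliminary results show that RU-Net performs best in terms of Intersection over Union (IoU), which speeds up analysis and makes the segmentation results more objective.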

Link: https://arxiv.org/abs/2509.12244
Authors: Lu Cai, Fei Xu, Min Xian, Yalei Tang, Shoukun Sun, John Stempien
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:During irradiation, phenomena such as kernel swelling and buffer densification may impact the performance of tristructural isotropic (TRISO) particle fuel. Post-irradiation microscopy is often used to identify these irradiation-induced morphologic changes. However, each fuel compact generally contains thousands of TRISO particles. Manually performing the work to get statistical information on these phenomena is cumbersome and subjective. To reduce the subjectivity inherent in that process and to accelerate data analysis, we used convolutional neural networks (CNNs) to automatically segment cross-sectional images of microscopic TRISO layers. CNNs are a class of machine-learning algorithms specifically designed for processing structured grid data. They have gained popularity in recent years due to their remarkable performance in various computer vision tasks, including image classification, object detection, and image segmentation. In this research, we generated a large irradiated TRISO layer dataset with more than 2,000 microscopic images of cross-sectional TRISO particles and the corresponding annotated images. Based on these annotated images, we used different CNNs to automatically segment different TRISO layers. These CNNs include RU-Net (developed in this study), as well as three existing architectures: U-Net, Residual Network (ResNet), and Attention U-Net. The preliminary results show that the model based on RU-Net performs best in terms of Intersection over Union (IoU). Using CNN models, we can expedite the analysis of TRISO particle cross sections, significantly reducing the manual labor involved and improving the objectivity of the segmentation results.
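
Since the comparison above is reported in terms of Intersection over Union, a minimal sketch of that metric for a single binary mask may help. The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the abstract does not say how the authors aggregate IoU across TRISO layers or images.

```python
def intersection_over_union(pred, target):
    """IoU between two binary masks given as equally shaped 2D lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly; treat that case as IoU = 1.0.
    return intersection / union if union else 1.0
```

For multi-layer segmentation, one common choice is to compute this per class and average the results, though that aggregation is also an assumption here.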

[CV-118] Artificial Intelligence in Breast Cancer Care: Transforming Preoperative Planning and Patient Education with 3D Reconstruction

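Quick read: This paper addresses the poor generalization of traditional medical image segmentation algorithms across datasets and imaging scenarios, which limits their usefulness for preoperative planning. The solution is a human-in-the-loop machine learning approach in which segmentations are iteratively refined using U-Mamba, yielding accurate 3D segmentation and reconstruction of the whole breast, fibroglandular tissue, and tumors from MRI. On 120 retrospective breast MRIs the method performed strongly (for example, a Dice similarity coefficient of 0.97 for whole organs), and clinician and patient interviews indicated improved preoperative planning, intraoperative navigation support, and patient communication.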

链接: https://arxiv.org/abs/2509.12242
作者: Mustafa Khanbhai,Giulia Di Nardo,Jun Ma,Vivienne Freitas,Caterina Masino,Ali Dolatabadi,Zhaoxun “Lorenz” Liu,Wey Leong,Wagner H. Souza,Amin Madani
机构: Surgical Artificial Intelligence Research Academy, University Health Network, Toronto, ON, Canada; Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, ON, Canada; Vector Institute, Toronto, ON, Canada; Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effective preoperative planning requires accurate algorithms for segmenting anatomical structures across diverse datasets, but traditional models struggle with generalization. This study presents a novel machine learning methodology to improve algorithm generalization for 3D anatomical reconstruction beyond breast cancer applications. We processed 120 retrospective breast MRIs (January 2018-June 2023) through three phases: anonymization and manual segmentation of T1-weighted and dynamic contrast-enhanced sequences; co-registration and segmentation of whole breast, fibroglandular tissue, and tumors; and 3D visualization using ITK-SNAP. A human-in-the-loop approach refined segmentations using U-Mamba, designed to generalize across imaging scenarios. Dice similarity coefficient assessed overlap between automated segmentation and ground truth. Clinical relevance was evaluated through clinician and patient interviews. U-Mamba showed strong performance with DSC values of 0.97 (±0.013) for whole organs, 0.96 (±0.024) for fibroglandular tissue, and 0.82 (±0.12) for tumors on T1-weighted images. The model generated accurate 3D reconstructions enabling visualization of complex anatomical features. Clinician interviews indicated improved planning, intraoperative navigation, and decision support. Integration of 3D visualization enhanced patient education, communication, and understanding. This human-in-the-loop machine learning approach successfully generalizes algorithms for 3D reconstruction and anatomical segmentation across patient datasets, offering enhanced visualization for clinicians, improved preoperative planning, and more effective patient education, facilitating shared decision-making and empowering informed patient choices across medical applications.
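
The abstract reports segmentation quality as a Dice similarity coefficient (DSC). As a minimal sketch of how that overlap score is computed, the snippet below evaluates DSC = 2|A ∩ B| / (|A| + |B|) on two flattened binary masks. The function name `dice_coefficient`, the toy masks, and the convention for two empty masks are illustrative assumptions, not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat sequences of 0/1 values (one entry per voxel).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example (hypothetical data): 2 shared voxels, 3 foreground voxels each.
pred = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 0, 0, 1]
print(round(dice_coefficient(pred, truth), 3))
```

A 3D volume can be scored the same way after flattening it, since the coefficient depends only on voxel counts.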

[CV-119] InJecteD: Analyzing Trajectories and Drift Dynamics in Denoising Diffusion Probabilistic Models for 2D Point Cloud Generation

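【Quick Read】: This paper addresses the interpretability of Denoising Diffusion Probabilistic Models (DDPMs) in generative AI: how to understand the behavior of sample trajectories during generation and how that behavior affects final output quality. The key to its solution is the InJecteD framework, which analyzes trajectory properties during 2D point cloud generation (displacement, velocity, clustering, and drift field dynamics) and quantifies them with statistical metrics such as Wasserstein distance and cosine similarity. This reveals distinct denoising phases (initial noise exploration, rapid shape formation, final refinement) with dataset-specific behavior, supports debugging and refining models, and shows that Fourier-based embeddings improve trajectory stability and reconstruction quality.

Link: https://arxiv.org/abs/2509.12239
Authors: Sanyam Jain, Khuram Naveed, Illia Oleksiienko, Alexandros Iosifidis, Ruben Pauwels
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: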

Abstract:This work introduces InJecteD, a framework for interpreting Denoising Diffusion Probabilistic Models (DDPMs) by analyzing sample trajectories during the denoising process of 2D point cloud generation. We apply this framework to three datasets from the Datasaurus Dozen (bullseye, dino, and circle) using a simplified DDPM architecture with customizable input and time embeddings. Our approach quantifies trajectory properties, including displacement, velocity, clustering, and drift field dynamics, using statistical metrics such as Wasserstein distance and cosine similarity. By enhancing model transparency, InJecteD supports human-AI collaboration by enabling practitioners to debug and refine generative models. Experiments reveal distinct denoising phases: initial noise exploration, rapid shape formation, and final refinement, with dataset-specific behaviors, for example bullseye's concentric convergence vs. dino's complex contour formation. We evaluate four model configurations, varying embeddings and noise schedules, demonstrating that Fourier-based embeddings improve trajectory stability and reconstruction quality.
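
To make the trajectory metrics concrete, here is a rough sketch of how displacement and step-to-step cosine similarity could be measured for a single 2D point as it moves through the denoising steps. It assumes a trajectory is simply a list of `(x, y)` positions, one per step; the function names are hypothetical, and the Wasserstein distance and drift-field analysis mentioned in the abstract are not covered.

```python
import math


def step_vectors(trajectory):
    """Per-step movement vectors; with unit time steps these act as velocities."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]


def net_displacement(trajectory):
    """Straight-line distance between the first and last positions."""
    (x0, y0), (xn, yn) = trajectory[0], trajectory[-1]
    return math.hypot(xn - x0, yn - y0)


def path_length(trajectory):
    """Total distance travelled along the trajectory."""
    return sum(math.hypot(dx, dy) for dx, dy in step_vectors(trajectory))


def cosine_similarity(u, v):
    """Cosine of the angle between two 2D vectors (0.0 if either is zero)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def direction_consistency(trajectory):
    """Cosine similarity between each pair of consecutive steps."""
    steps = step_vectors(trajectory)
    return [cosine_similarity(a, b) for a, b in zip(steps, steps[1:])]


# Hypothetical trajectory: two steps right, then one step up.
traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
print(net_displacement(traj), path_length(traj), direction_consistency(traj))
```

A large gap between path length and net displacement, or low consecutive-step similarity, would suggest a wandering trajectory, which is one plausible way to read "trajectory stability".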

[CV-120] Neural Diffeomorphic-Neural Operator for Residual Stress-Induced Deformation Prediction

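【Quick Read】: This paper addresses the difficulty of predicting machining deformation in geometrically complex structural components, where deformation arises from unevenly distributed residual stress fields and conventional numerical methods are computationally expensive across varying geometries. The key to its solution is a neural operator framework based on diffeomorphic embedding, the Neural Diffeomorphic-Neural Operator (NDNO): a diffeomorphic neural network constrained by smoothness and invertibility explicitly maps complex 3D geometries to a common reference domain, and a neural operator is trained on that domain to learn the deformation fields induced by residual stresses. Once trained, the method adapts quickly to different geometries and delivers accurate, fast deformation prediction for parts of varying types, dimensions, and features.

Link: https://arxiv.org/abs/2509.12237
Authors: Changqing Liu, Kaining Dai, Zhiwei Zhao, Tianyi Wu, Yingguang Li
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: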

Abstract:Accurate prediction of machining deformation in structural components is essential for ensuring dimensional precision and reliability. Such deformation often originates from residual stress fields, whose distribution and influence vary significantly with geometric complexity. Conventional numerical methods for modeling the coupling between residual stresses and deformation are computationally expensive, particularly when diverse geometries are considered. Neural operators have recently emerged as a powerful paradigm for efficiently solving partial differential equations, offering notable advantages in accelerating residual stress-deformation analysis. However, their direct application across changing geometric domains faces theoretical and practical limitations. To address this challenge, a novel framework based on diffeomorphic embedding neural operators named neural diffeomorphic-neural operator (NDNO) is introduced. Complex three-dimensional geometries are explicitly mapped to a common reference domain through a diffeomorphic neural network constrained by smoothness and invertibility. The neural operator is then trained on this reference domain, enabling efficient learning of deformation fields induced by residual stresses. Once trained, both the diffeomorphic neural network and the neural operator demonstrate efficient prediction capabilities, allowing rapid adaptation to varying geometries. The proposed method thus provides an effective and computationally efficient solution for deformation prediction in structural components subject to varying geometries. The proposed method is validated to predict both main-direction and multi-direction deformation fields, achieving high accuracy and efficiency across parts with diverse geometries including component types, dimensions and features.

[CV-121] Flexible Multimodal Neuroimaging Fusion for Alzheimer's Disease Progression Prediction

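【Quick Read】: This paper addresses the drop in predictive performance that multimodal neuroimaging models suffer in clinical practice when modalities are missing. In predicting cognitive decline in Alzheimer's disease (AD) patients, existing models struggle to stay accurate when imaging data (such as T1-weighted MRI, FLAIR, amyloid beta PET, and tau PET) are incomplete. The key to its solution is PerM-MoE, a sparse mixture-of-experts (MoE) architecture whose core innovation is an independent router for each modality in place of the conventional single router, which improves flexibility and expert utilization under high modality missingness.

Link: https://arxiv.org/abs/2509.12234
Authors: Benjamin Burns, Yuan Xue, Douglas W. Scharre, Xia Ning
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted at Applications of Medical AI 2025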

Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disease with high inter-patient variance in rate of cognitive decline. AD progression prediction aims to forecast patient cognitive decline and benefits from incorporating multiple neuroimaging modalities. However, existing multimodal models fail to make accurate predictions when many modalities are missing during inference, as is often the case in clinical settings. To increase multimodal model flexibility under high modality missingness, we introduce PerM-MoE, a novel sparse mixture-of-experts method that uses independent routers for each modality in place of the conventional, single router. Using T1-weighted MRI, FLAIR, amyloid beta PET, and tau PET neuroimaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), we evaluate PerM-MoE, state-of-the-art Flex-MoE, and unimodal neuroimaging models on predicting two-year change in Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores under varying levels of modality missingness. PerM-MoE outperforms the state of the art in most variations of modality missingness and demonstrates more effective utility of experts than Flex-MoE.
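
The idea of one router per modality can be illustrated with a small sketch. The abstract only states that PerM-MoE replaces the single shared router with independent per-modality routers, so everything below is an assumption made for illustration: the linear scoring, the top-k gate renormalization, the modality names, and the function names. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, router_weights, top_k=1):
    """Score every expert with a linear router and keep the top_k gates.

    router_weights holds one weight row per expert. Returns a mapping
    {expert_index: gate}, with gates renormalized to sum to 1.
    """
    scores = [sum(w * f for w, f in zip(row, features)) for row in router_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def per_modality_routing(inputs, routers, top_k=1):
    """Route each available modality with its own router.

    inputs maps modality name -> feature vector, or None when the modality is
    missing. Missing modalities are skipped, so they never influence routing.
    """
    return {name: route(feat, routers[name], top_k)
            for name, feat in inputs.items() if feat is not None}


# Hypothetical setup: two modalities, two experts, 2-dimensional features.
routers = {
    "mri": [[2.0, 0.0], [0.0, 2.0]],
    "pet": [[0.0, 1.0], [1.0, 0.0]],
}
inputs = {"mri": [1.0, 0.0], "pet": None}  # PET scan missing for this patient
print(per_modality_routing(inputs, routers))
```

With a single shared router, a missing modality changes the input that every routing decision depends on; with separate routers, each available modality is routed independently of which others are present.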

[CV-122] Scalable RF Simulation in Generative 4D Worlds

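【Quick Read】: This paper addresses the difficulty of collecting high-quality radio frequency (RF) data in dynamic, diverse indoor environments, which holds back RF-based indoor perception. Its solution is WaveVerse, a prompt-based framework that simulates realistic RF signals through a language-guided 4D world generator. Two components are key: a state-aware causal transformer that generates human motion conditioned on spatial constraints and text, and a phase-coherent ray tracing simulator that produces accurate, coherent RF signals, which the authors apply to beamforming and respiration monitoring. The approach enables data generation for RF imaging for the first time and yields consistent performance gains in both data-limited and data-adequate scenarios.

Link: https://arxiv.org/abs/2508.12176
Authors: Zhiwei Zheng, Dongyin Hu, Mingmin Zhao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: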

Abstract:Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for indoor perception tasks. However, collecting high-quality RF data in dynamic and diverse indoor environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions. WaveVerse introduces a language-guided 4D world generator, which includes a state-aware causal transformer for human motion generation conditioned on spatial constraints and texts, and a phase-coherent ray tracing simulator that enables the simulation of accurate and coherent RF signals. Experiments demonstrate the effectiveness of our approach in conditioned human motion generation and highlight how phase coherence is applied to beamforming and respiration monitoring. We further present two case studies in ML-based high-resolution imaging and human activity recognition, demonstrating that WaveVerse not only enables data generation for RF imaging for the first time, but also consistently achieves performance gain in both data-limited and data-adequate scenarios.

[CV-123] QDFlow: A Python package for physics simulations of quantum dot devices

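【Quick Read】: This paper addresses the scarcity of high-quality labeled data for training and validating machine learning (ML) models for quantum dot (QD) devices. Because large labeled datasets are difficult and labor-intensive to obtain experimentally, reliable ML benchmarks are hard to build. The key to its solution is QDFlow, an open-source physics simulator for multi-QD arrays that combines a self-consistent Thomas-Fermi solver, a dynamic capacitance model, and customizable noise modules to generate realistic synthetic data with ground-truth labels, supporting large, diverse datasets for ML development, benchmarking, and quantum device research.

Link: https://arxiv.org/abs/2509.13298
Authors: Donovan L. Buterakos, Sandesh S. Kalantre, Joshua Ziegler, Jacob M Taylor, Justyna P. Zwolak
Institutions: Unknown
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 17 pages, 5 figures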

Abstract:Recent advances in machine learning (ML) have accelerated progress in calibrating and operating quantum dot (QD) devices. However, most ML approaches rely on access to large, high-quality labeled datasets for training, benchmarking, and validation, with labels capturing key features in the data. Obtaining such datasets experimentally is challenging due to limited data availability and the labor-intensive nature of labeling. QDFlow is an open-source physics simulator for multi-QD arrays that generates realistic synthetic data with ground-truth labels. QDFlow combines a self-consistent Thomas-Fermi solver, a dynamic capacitance model, and flexible noise modules to produce charge stability diagrams and ray-based data closely resembling experiments. With extensive tunable parameters and customizable noise models, QDFlow supports the creation of large, diverse datasets for ML development, benchmarking, and quantum device research.

[CV-124] MEGAN: Mixture of Experts for Robust Uncertainty Estimation in Endoscopy Videos MICCAI

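【Quick Read】: This paper addresses the reliability of uncertainty quantification (UQ) in medical AI. Existing methods such as Monte Carlo Dropout and Deep Ensembles typically train on a single expert's annotations, overlooking the inter-rater variability common in clinical practice. The authors propose MEGAN, a Multi-Expert Gating Network. Its core idea is to aggregate predictions and uncertainty estimates from multiple Evidential Deep Learning (EDL) models trained with different ground truths and modeling strategies, with a gating network that weights and combines the EDL outputs to improve overall prediction confidence and calibration. On endoscopy videos for ulcerative colitis (UC) assessment, the approach improves F1-score and reduces Expected Calibration Error (ECE), and it supports uncertainty-guided sample stratification that lowers the annotation burden.

Link: https://arxiv.org/abs/2509.12772
Authors: Damola Agbelese, Krishna Chaitanya, Pushpak Pati, Chaitanya Parmar, Pooya Mobadersany, Shreyas Fadnavis, Lindsey Surace, Shadi Yarandi, Louis R. Ghanem, Molly Lucas, Tommaso Mansi, Oana Gabriela Cula, Pablo F. Damasceno, Kristopher Standish
Institutions: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, 1 table, accepted at UNSURE, MICCAI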

Abstract:Reliable uncertainty quantification (UQ) is essential in medical AI. Evidential Deep Learning (EDL) offers a computationally efficient way to quantify model uncertainty alongside predictions, unlike traditional methods such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these methods often rely on a single expert’s annotations as ground truth for model training, overlooking the inter-rater variability in healthcare. To address this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts via EDL models trained with diverse ground truths and modeling strategies. MEGAN’s gating network optimally combines predictions and uncertainties from each EDL model, enhancing overall prediction confidence and calibration. We extensively benchmark MEGAN on endoscopy videos for ulcerative colitis (UC) disease severity estimation, assessed by visual labeling of Mayo Endoscopic Subscore (MES), where inter-rater variability is prevalent. In a large-scale prospective UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5% reduction in Expected Calibration Error (ECE) compared to existing methods. Furthermore, MEGAN facilitated uncertainty-guided sample stratification, reducing the annotation burden and potentially increasing efficiency and consistency in UC trials.
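
Since the headline calibration result is a reduction in Expected Calibration Error, a short sketch of the usual binned ECE may help: predictions are grouped by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The equal-width binning, bin count, and function name below are assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |mean confidence - accuracy| per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: the 0.9 bin is under-confident, the 0.6 bin over-confident.
confs = [0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False]
print(round(expected_calibration_error(confs, hits), 3))
```

A perfectly calibrated model has an ECE of 0, so lower values are better.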

[CV-125] Generalizable Holographic Reconstruction via Amplitude-Only Diffusion Priors

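【Quick Read】: This paper addresses phase retrieval in inline holography, an ill-posed inverse problem caused by the nonlinear coupling between amplitude and phase in coherent imaging. The key to its solution is a diffusion-model-based method trained only on object amplitude data. A predictor-corrector sampling framework with separate likelihood gradients for amplitude and phase reconstructs the complex field (both amplitude and phase) from diffraction intensities. Training requires no ground-truth phase data, and the method generalizes well across object shapes, imaging system configurations, and lensless setups. Notably, a prior trained on simple amplitude samples (such as polystyrene beads) successfully reconstructs complex biological tissue structures, which illustrates the framework's generality and cost-effectiveness for nonlinear inverse problems in computational imaging.

Link: https://arxiv.org/abs/2509.12728
Authors: Jeongsol Kim, Chanseok Lee, Jong Chul Ye, Mooseok Jang
Institutions: KAIST
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Keywords: Diffusion model, phase retrieval, inline-holography, inverse problem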

Abstract:Phase retrieval in inline holography is a fundamental yet ill-posed inverse problem due to the nonlinear coupling between amplitude and phase in coherent imaging. We present a novel off-the-shelf solution that leverages a diffusion model trained solely on object amplitude to recover both amplitude and phase from diffraction intensities. Using a predictor-corrector sampling framework with separate likelihood gradients for amplitude and phase, our method enables complex field reconstruction without requiring ground-truth phase data for training. We validate the proposed approach through extensive simulations and experiments, demonstrating robust generalization across diverse object shapes, imaging system configurations, and modalities, including lensless setups. Notably, a diffusion prior trained on simple amplitude data (e.g., polystyrene beads) successfully reconstructs complex biological tissue structures, highlighting the method’s adaptability. This framework provides a cost-effective, generalizable solution for nonlinear inverse problems in computational imaging, and establishes a foundation for broader coherent imaging applications beyond holography.

[CV-126] DeepEyeNet: Generating Medical Report for Retinal Images CIKM

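【Quick Read】: This paper addresses the inefficiency of retinal disease diagnosis caused by a shortage of ophthalmologists, in particular the time-consuming and error-prone nature of writing medical reports by hand. Its core solution is an AI-based system for automated medical report generation. The key elements are: (1) a multi-modal deep learning approach that captures interactions between textual keywords and retinal images to produce more comprehensive reports; (2) improved medical keyword representations that better capture nuances in terminology; (3) strategies that overcome the limitations of recurrent neural networks (RNNs) in modeling long-range dependencies within medical descriptions; and (4) interpretability techniques that increase clinical trust in the system. The methods achieve state-of-the-art performance on several metrics.

Link: https://arxiv.org/abs/2509.12534
Authors: Jia-Hong Huang
Institutions: University of Amsterdam
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper is accepted by the Conference on Information and Knowledge Management (CIKM), 2025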

Abstract:The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists’ limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors’ workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) A multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (2) Improved methods for medical keyword representation enhance the system’s ability to capture nuances in medical terminology; (3) Strategies to overcome RNN-based models’ limitations, particularly in capturing long-range dependencies within medical descriptions; (4) Techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI’s potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care.

[CV-127] DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification ICCV2025

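【Quick Read】: This paper addresses challenges in anomaly detection and classification for medical imaging, including limited annotated data, class imbalance, and the high cost of expert labeling. The key to its solution is an attention-based global aggregation framework that uses the self-supervised DINOv2 model as a pretrained feature extractor on each 2D axial slice of a brain MRI and assigns adaptive slice-level importance weights through a soft attention mechanism. A composite loss combining supervised contrastive learning with class-variance regularization strengthens inter-class separability and intra-class consistency, giving robust volumetric anomaly classification under data scarcity and severe class imbalance.

Link: https://arxiv.org/abs/2509.12512
Authors: Fazle Rafsani, Jay Shah, Catherine D. Chong, Todd J. Schwedt, Teresa Wu
Institutions: Arizona State University; Mayo Clinic, Arizona
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the ICCV 2025 Workshop on Anomaly Detection with Foundation Models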

Abstract:Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at this https URL.
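
The slice-level soft attention described above can be sketched as a weighted average of per-slice embeddings, with weights given by a softmax over learned scores. The sketch assumes the simplest scoring function, a dot product with one learned vector; the paper's actual attention module may differ, and the function name and toy embeddings are hypothetical.

```python
import math


def attention_pool(slice_embeddings, attention_vector):
    """Aggregate per-slice embeddings into one volume-level embedding.

    Each slice gets a score (dot product with attention_vector), scores are
    turned into weights with a softmax, and the weighted mean is returned
    together with the weights themselves.
    """
    scores = [sum(a * e for a, e in zip(attention_vector, emb))
              for emb in slice_embeddings]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(slice_embeddings[0])
    pooled = [sum(w * emb[d] for w, emb in zip(weights, slice_embeddings))
              for d in range(dim)]
    return pooled, weights


# Hypothetical 2-dimensional embeddings for three axial slices.
slices = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled, weights = attention_pool(slices, attention_vector=[2.0, 0.0])
print([round(w, 3) for w in weights], [round(p, 3) for p in pooled])
```

Because the weights are produced per volume, they can also be inspected afterwards to see which slices drove a prediction.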

[CV-128] Universal Gröbner Bases of (Universal) Multiview Ideals

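【Quick Read】: This paper addresses the algebraic characterization of multiview ideals and their analogs for unknown cameras, universal multiview ideals, and in particular the construction of their Gröbner bases. The key to its solution is a Gröbner basis criterion introduced by Huang and Larson, which the authors use to prove that a natural collection of polynomials forms a universal Gröbner basis for both types of ideals. Symmetry reduction and induction extend the method to an infinite family of ideals, and the paper gives an explicit description of the matroids on which the methodology depends.

Link: https://arxiv.org/abs/2509.12376
Authors: Timothy Duff, Jack Kendrick, Rekha R. Thomas
Institutions: Unknown
Categories: Commutative Algebra (math.AC); Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
Comments: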

Abstract:Multiview ideals arise from the geometry of image formation in pinhole cameras, and universal multiview ideals are their analogs for unknown cameras. We prove that a natural collection of polynomials form a universal Gröbner basis for both types of ideals using a criterion introduced by Huang and Larson, and include a proof of their criterion in our setting. Symmetry reduction and induction enable the method to be deployed on an infinite family of ideals. We also give an explicit description of the matroids on which the methodology depends, in the context of multiview ideals.

[CV-129] Enhancing Radiographic Disease Detection with MetaCheX, a Context-Aware Multimodal Model

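【Quick Read】: This paper addresses the limited diagnostic accuracy and fairness of existing deep learning models for chest radiology, which neglect patient metadata. The key to its solution is MetaCheX, a multimodal framework that jointly models chest X-ray images and structured patient metadata: a convolutional neural network (CNN) extracts image features, a multilayer perceptron (MLP) processes the metadata, and a shared classifier makes the fused decision. Experiments show higher overall diagnostic accuracy (measured by AUROC), reduced algorithmic bias, and better generalization across patient populations.

Link: https://arxiv.org/abs/2509.12287
Authors: Nathan He, Cody Chen
Institutions: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: All authors contributed equally, 5 pages, 2 figures, 1 table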

Abstract:Existing deep learning models for chest radiology often neglect patient metadata, limiting diagnostic accuracy and fairness. To bridge this gap, we introduce MetaCheX, a novel multimodal framework that integrates chest X-ray images with structured patient metadata to replicate clinical decision-making. Our approach combines a convolutional neural network (CNN) backbone with metadata processed by a multilayer perceptron through a shared classifier. Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed radiograph-only baseline models across multiple CNN architectures. By integrating metadata, the overall diagnostic accuracy was significantly improved, measured by an increase in AUROC. The results of this study demonstrate that metadata reduces algorithmic bias and enhances model generalizability across diverse patient populations. MetaCheX advances clinical artificial intelligence toward robust, context-aware radiographic disease detection.

Artificial Intelligence

[AI-0] Shapes of Cognition for Computational Cognitive Modeling

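【Quick Read】: This paper addresses how to achieve efficient, explainable, and human-like computational cognitive modeling in Language-Endowed Intelligent Agents (LEIAs). Traditional AI systems often struggle with the complexity and uncertainty of the real world; the paper proposes “shapes” as a new conceptual paradigm. Shapes are remembered constellations of sensory, linguistic, conceptual, episodic, and procedural knowledge that let agents reduce cognitive load by expecting things to be typical, recognizing patterns, acting by habit, reasoning by analogy, and satisficing. Atypical outcomes are handled with shapes-based recovery methods (learning on the fly, asking a human partner for help, or seeking an actionable, if imperfect, understanding). The approach comes with specific modeling objectives, hypotheses, and knowledge bases, all implemented within a particular cognitive architecture, with the aim of building agent systems that are explainable, extensible, and trustworthy.

Link: https://arxiv.org/abs/2509.13288
Authors: Marjorie McShane, Sergei Nirenburg, Sanjay Oruganti, Jesse English
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: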

Abstract:Shapes of cognition is a new conceptual paradigm for the computational cognitive modeling of Language-Endowed Intelligent Agents (LEIAs). Shapes are remembered constellations of sensory, linguistic, conceptual, episodic, and procedural knowledge that allow agents to cut through the complexity of real life the same way as people do: by expecting things to be typical, recognizing patterns, acting by habit, reasoning by analogy, satisficing, and generally minimizing cognitive load to the degree situations permit. Atypical outcomes are treated using shapes-based recovery methods, such as learning on the fly, asking a human partner for help, or seeking an actionable, even if imperfect, situational understanding. Although shapes is an umbrella term, it is not vague: shapes-based modeling involves particular objectives, hypotheses, modeling strategies, knowledge bases, and actual models of wide-ranging phenomena, all implemented within a particular cognitive architecture. Such specificity is needed both to vet our hypotheses and to achieve our practical aims of building useful agent systems that are explainable, extensible, and worthy of our trust, even in critical domains. However, although the LEIA example of shapes-based modeling is specific, the principles can be applied more broadly, giving new life to knowledge-based and hybrid AI.

[AI-1] Contrastive timbre representations for musical instrument and synthesizer retrieval

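【Quick Read】: This paper addresses the difficulty of efficiently retrieving specific instrument timbres from audio mixtures in digital music production, where common audio data augmentation methods struggle to produce realistic positive/negative pairs. The key to its solution is a contrastive learning framework with sound generation techniques designed for virtual instruments (such as samplers and synthesizers) that build realistic positive/negative pairs, so a single model can handle both single- and multi-instrument queries. On single-instrument retrieval over 3,884 instruments, the contrastive approaches are competitive with previous work based on classification pre-training; on three-instrument mixtures, the framework outperforms related work with 81.7% top-1 and 95.7% top-5 accuracy.

Link: https://arxiv.org/abs/2509.13285
Authors: Gwendal Le Vaillant, Yannick Molle
Institutions: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: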

Abstract:Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures.
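
The top-1 and top-5 figures are retrieval accuracies: a query counts as a hit when the true instrument appears among the k database entries closest to the query embedding. Below is a minimal sketch of that evaluation using cosine similarity, assuming one true instrument per query and precomputed embeddings. The instrument names, vectors, and function names are invented for illustration, and the paper's multi-instrument protocol may score mixtures differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot / (nu * nv)


def rank_instruments(query, database):
    """Instrument names sorted from most to least similar to the query."""
    return sorted(database, key=lambda name: cosine(query, database[name]),
                  reverse=True)


def top_k_accuracy(queries, database, k):
    """Fraction of queries whose true instrument is among the k best matches.

    queries is a list of (embedding, true_instrument_name) pairs.
    """
    hits = sum(1 for emb, name in queries
               if name in rank_instruments(emb, database)[:k])
    return hits / len(queries)


# Hypothetical 2-dimensional timbre embeddings.
database = {"piano": [1.0, 0.0], "bass": [0.0, 1.0], "pad": [1.0, 1.0]}
queries = [([0.9, 0.1], "piano"), ([0.6, 0.5], "piano")]
print(top_k_accuracy(queries, database, k=1), top_k_accuracy(queries, database, k=2))
```

In a contrastive setup the embeddings would come from the trained encoder, so retrieval reduces to this nearest-neighbor ranking.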

[AI-2] JANUS: A Dual-Constraint Generative Framework for Stealthy Node Injection Attacks

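【Quick Read】: This paper addresses the limited stealthiness of node injection attacks on Graph Neural Networks (GNNs). Existing methods typically rely on indirect proxy metrics or only imitate local structures, which leads to “local myopia” and falls short of truly stealthy attacks. The key to its solution is a dual-constraint stealthy node injection framework, JANUS (Joint Alignment of Nodal and Universal Structures). At the local level, a local feature manifold alignment strategy achieves geometric consistency in the feature space; at the global level, structured latent variables are introduced and their mutual information with the generated structures is maximized, so that injected structures stay consistent with the semantic patterns of the original graph. The attack is modeled as a sequential decision process optimized by a reinforcement learning agent, which improves both attack effectiveness and stealthiness.

Link: https://arxiv.org/abs/2509.13266
Authors: Jiahao Zhang, Xiaobing Pei, Zhaokun Zhong, Wenqiang Hao, Zhenghao Tang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: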

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable performance across various applications, yet they are vulnerable to sophisticated adversarial attacks, particularly node injection attacks. The success of such attacks heavily relies on their stealthiness, the ability to blend in with the original graph and evade detection. However, existing methods often achieve stealthiness by relying on indirect proxy metrics, lacking consideration for the fundamental characteristics of the injected content, or focusing only on imitating local structures, which leads to the problem of local myopia. To overcome these limitations, we propose a dual-constraint stealthy node injection framework, called Joint Alignment of Nodal and Universal Structures (JANUS). At the local level, we introduce a local feature manifold alignment strategy to achieve geometric consistency in the feature space. At the global level, we incorporate structured latent variables and maximize the mutual information with the generated structures, ensuring the injected structures are consistent with the semantic patterns of the original graph. We model the injection attack as a sequential decision process, which is optimized by a reinforcement learning agent. Experiments on multiple standard datasets demonstrate that the JANUS framework significantly outperforms existing methods in terms of both attack effectiveness and stealthiness.

[AI-3] Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors

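【Quick Read】: This paper addresses the way large language models (LLMs) repeatedly re-derive the same intermediate steps during multi-step reasoning, which inflates token usage and latency and leaves less of the context window for exploration. The key to its solution is to use the model's own metacognitive analysis of prior traces to distill recurring reasoning fragments into concise, reusable “behaviors” (name + instruction) stored in a “behavior handbook”. These behaviors can be supplied in-context at inference or distilled into parameters through supervised fine-tuning (SFT). The approach reduces reasoning tokens by up to 46% while matching or improving accuracy, and it supports self-improvement and more effective conversion of non-reasoning models into reasoning models.

Link: https://arxiv.org/abs/2509.13237
Authors: Aniket Didolkar, Nicolas Ballas, Sanjeev Arora, Anirudh Goyal
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 Figures, 5 Tables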

Abstract:Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token usage and latency. This saturation of the context window leaves less capacity for exploration. We study a simple mechanism that converts recurring reasoning fragments into concise, reusable “behaviors” (name + instruction) via the model’s own metacognitive analysis of prior traces. These behaviors are stored in a “behavior handbook” which supplies them to the model in-context at inference or distills them into parameters via supervised fine-tuning. This approach achieves improved test-time reasoning across three different settings - 1) Behavior-conditioned inference: Providing the LLM relevant behaviors in-context during reasoning reduces number of reasoning tokens by up to 46% while matching or improving baseline accuracy; 2) Behavior-guided self-improvement: Without any parameter updates, the model improves its own future reasoning by leveraging behaviors from its own past problem solving attempts. This yields up to 10% higher accuracy than a naive critique-and-revise baseline; and 3) Behavior-conditioned SFT: SFT on behavior-conditioned reasoning traces is more effective at converting non-reasoning models into reasoning models as compared to vanilla SFT. Together, these results indicate that turning slow derivations into fast procedural hints enables LLMs to remember how to reason, not just what to conclude.
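
As a rough illustration of behavior-conditioned inference, the sketch below stores behaviors as (name, instruction) pairs, picks the ones that look relevant to a question, and prepends them to the prompt. Only the name-plus-instruction structure comes from the abstract. The keyword-overlap retrieval, the prompt layout, and all identifiers are assumptions made for this example, and a real system would likely use a stronger retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    """A reusable reasoning hint: a short name plus an instruction."""
    name: str
    instruction: str


def select_behaviors(handbook, question, limit=2):
    """Toy retrieval: rank behaviors by word overlap with the question.

    This crude matcher (an assumption, not the paper's method) does not
    filter stop words, so it only suits small illustrative handbooks.
    """
    words = set(question.lower().split())
    scored = [(len(words & set(b.instruction.lower().split())), b)
              for b in handbook]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [b for _, b in scored[:limit]]


def build_prompt(question, behaviors):
    """Place the selected behaviors in-context ahead of the question."""
    lines = ["Relevant behaviors:"]
    lines += [f"- {b.name}: {b.instruction}" for b in behaviors]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Hypothetical handbook entries.
handbook = [
    Behavior("behavior_complement",
             "compute the probability of the complement event and subtract from one"),
    Behavior("behavior_triangle_angles",
             "angles in a triangle sum to 180 degrees"),
]
question = "What is the probability of two heads?"
print(build_prompt(question, select_behaviors(handbook, question)))
```

The token saving claimed in the abstract comes from the model reading a one-line hint rather than re-deriving the same step inside its chain of thought.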

[AI-4] Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation

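【Quick read】: This paper addresses the structural underrepresentation of Black digital archives in AI research and infrastructure, focusing on the poor optical character recognition (OCR) accuracy that inconsistent typography, visual degradation, and scarce annotated layout data cause when digitizing historical Black newspapers. The key to the solution is a layout-aware OCR pipeline that combines synthetic layout generation, pretraining on augmented data, and a fusion of state-of-the-art YOLO object detectors, together with an unsupervised evaluation framework. Three metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), quantify the linguistic fluency, informational diversity, and cross-region redundancy of the OCR output. In low-resource archival settings the pipeline improves structural diversity and reduces redundancy with modest trade-offs in coherence, underscoring the importance of respecting cultural layout logic in AI-driven document understanding.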

Link: https://arxiv.org/abs/2509.13236
Authors: Fitsum Sileshi Beyene,Christopher L. Dancy
Affiliations: unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: IEEE-ISTAS conference

Abstract:Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a layout-aware OCR pipeline tailored for Black newspaper archives and introduce an unsupervised evaluation framework suited to low-resource archival contexts. Our approach integrates synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art You Only Look Once (YOLO) detectors. We used three annotation-free evaluation metrics, the Semantic Coherence Score (SCS), Region Entropy (RE), and Textual Redundancy Score (TRS), which quantify linguistic fluency, informational diversity, and redundancy across OCR regions. Our evaluation on a 400-page dataset from ten Black newspaper titles demonstrates that layout-aware OCR improves structural diversity and reduces redundancy compared to full-page baselines, with modest trade-offs in coherence. Our results highlight the importance of respecting cultural layout logic in AI-driven document understanding and lay the foundation for future community-driven and ethically grounded archival AI systems.

[AI-5] A Scenario-Driven Cognitive Approach to Next-Generation AI Memory

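【Quick read】: This paper addresses the limited adaptability, insufficient multimodal integration, and lack of continual-learning support in current artificial intelligence (AI) memory architectures, limitations that constrain progress toward artificial general intelligence (AGI). The key to the solution is a scenario-driven methodology that extracts essential functional requirements from representative cognitive scenarios and condenses them into a unified set of design principles. On that basis the authors build the COgnitive Layered Memory Architecture (COLMA), which integrates cognitive scenarios, memory processes, and storage mechanisms into one design and provides a structured foundation for next-generation AI systems capable of lifelong learning and human-like reasoning.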

Link: https://arxiv.org/abs/2509.13235
Authors: Linyue Cai,Yuyang Cheng,Xiaoding Shao,Huiming Wang,Yong Zhao,Wei Zhang,Kang Li
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As artificial intelligence advances toward artificial general intelligence (AGI), the need for robust and human-like memory systems has become increasingly evident. Current memory architectures often suffer from limited adaptability, insufficient multimodal integration, and an inability to support continuous learning. To address these limitations, we propose a scenario-driven methodology that extracts essential functional requirements from representative cognitive scenarios, leading to a unified set of design principles for next-generation AI memory systems. Based on this approach, we introduce the COgnitive Layered Memory Architecture (COLMA), a novel framework that integrates cognitive scenarios, memory processes, and storage mechanisms into a cohesive design. COLMA provides a structured foundation for developing AI systems capable of lifelong learning and human-like reasoning, thereby contributing to the pragmatic development of AGI.

[AI-6] Single-stream Policy Optimization

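【Quick read】: This paper addresses two problems of group-based policy-gradient methods such as GRPO in training large language models (LLMs): frequent degenerate groups erase the learning signal, and synchronization barriers limit scalability. The authors propose Single-stream Policy Optimization (SPO), whose core change is to replace per-group baselines with a persistent, KL-adaptive value tracker and to normalize advantages globally across the batch, giving every sample a stable, low-variance learning signal. Being group-free, SPO also raises throughput, scales better in long-horizon or tool-integrated settings, and enables an adaptive curriculum through prioritized sampling. With Qwen3-8B on five hard math benchmarks, SPO improves average maj@32 by 3.4 percentage points over GRPO.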

Link: https://arxiv.org/abs/2509.13232
Authors: Zhongwen Xu,Zihan Ding
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO’s gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain in pass@k across the evaluated k values. SPO’s success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.
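
The two ingredients the abstract names, a persistent per-prompt baseline and batch-wide advantage normalization, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `ValueTracker` class, its fixed update rate (the paper describes a KL-adaptive tracker, which is not reproduced here), and the `batch_advantages` helper are all hypothetical names.

```python
from statistics import mean, pstdev


class ValueTracker:
    """Running per-prompt estimate of expected reward (hypothetical helper).

    Assumption: a fixed exponential-moving-average rate stands in for the
    KL-adaptive update described in the abstract.
    """

    def __init__(self, init=0.5, rate=0.1):
        self.init = init
        self.rate = rate
        self.values = {}

    def get(self, prompt_id):
        return self.values.get(prompt_id, self.init)

    def update(self, prompt_id, reward):
        old = self.get(prompt_id)
        self.values[prompt_id] = old + self.rate * (reward - old)


def batch_advantages(tracker, batch, eps=1e-8):
    """batch: list of (prompt_id, reward), one sampled response per prompt."""
    # Baseline comes from the persistent tracker, not from a group of samples.
    raw = [reward - tracker.get(pid) for pid, reward in batch]
    # Normalize across the whole batch rather than within per-prompt groups.
    mu, sigma = mean(raw), pstdev(raw)
    advantages = [(a - mu) / (sigma + eps) for a in raw]
    # Only after computing advantages do we fold the new rewards into the tracker.
    for pid, reward in batch:
        tracker.update(pid, reward)
    return advantages


tracker = ValueTracker()
adv = batch_advantages(tracker, [("p1", 1.0), ("p2", 0.0), ("p3", 1.0), ("p4", 0.0)])
```

Because each prompt contributes a single response, no sample is discarded when all responses to one prompt happen to receive the same reward, which is the degenerate-group case the abstract refers to.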

[AI-7] G-CSEA: A Graph-Based Conflict Set Extraction Algorithm for Identifying Infeasibility in Pseudo-Boolean Models

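【Quick read】: This paper addresses infeasibility in workforce scheduling models caused by conflicting rule-based constraints (shift limits, staffing policies, working-hour restrictions, and so on); the core challenge is to efficiently identify a minimal set of constraints responsible for the infeasibility. Existing methods such as Additive Deletion and QuickXplain rely on many feasibility checks and are computationally expensive, while dual ray analysis may fail when the relaxed problem is feasible but the underlying pseudo-Boolean model is not. The proposed Graph-based Conflict Set Extraction Algorithm (G-CSEA), inspired by Conflict-Driven Clause Learning (CDCL) in SAT solvers, builds an implication graph during constraint propagation and, when a conflict is detected, traces all contributing constraints across both decision branches to produce a conflict set. That set can optionally be minimized with QuickXplain to obtain an Irreducible Infeasible Subset (IIS).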

Link: https://arxiv.org/abs/2509.13203
Authors: Kanishk Garg,Saranya D.,Sanal Kumar,Saurabh Singh,Anupam Purwar
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper presents G-CSEA, a novel graph-based algorithm for rapidly diagnosing infeasibility in workforce scheduling models. Inspired by Conflict-Driven Clause Learning (CDCL), our method efficiently extracts a compact conflict set from an implication graph, reducing the initial constraint set by approximately 94%

Abstract:Workforce scheduling involves a variety of rule-based constraints, such as shift limits, staffing policies, working hour restrictions, and many similar scheduling rules, which can interact in conflicting ways, leading to infeasible models. Identifying the underlying causes of such infeasibility is critical for resolving scheduling issues and restoring feasibility. A common diagnostic approach is to compute Irreducible Infeasible Subsets (IISs): minimal sets of constraints that are jointly infeasible but become feasible when any one is removed. We consider models formulated using pseudo-Boolean constraints with inequality relations over binary variables, which naturally encode scheduling logic. Existing IIS extraction methods such as Additive Deletion and QuickXplain rely on repeated feasibility checks, often incurring large numbers of solver calls. Dual ray analysis, while effective for LP-based models, may fail when the relaxed problem is feasible but the underlying pseudo-Boolean model is not. To address these limitations, we propose Graph-based Conflict Set Extraction Algorithm (G-CSEA) to extract a conflict set, an approach inspired by Conflict-Driven Clause Learning (CDCL) in SAT solvers. Our method constructs an implication graph during constraint propagation and, upon detecting a conflict, traces all contributing constraints across both decision branches. The resulting conflict set can optionally be minimized using QuickXplain to produce an IIS.
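
To make the IIS definition concrete, here is a small Python sketch of a classic deletion filter on a toy pseudo-Boolean model. It is not G-CSEA: it is the kind of repeated-feasibility-check baseline the abstract contrasts against, and the brute-force `feasible` check, the `(coefficients, rhs)` encoding of `sum(coeffs * x) <= rhs`, and the example constraints are assumptions made for illustration.

```python
from itertools import product


def feasible(constraints, n_vars):
    """Brute-force check: does any 0/1 assignment satisfy every constraint?

    Each constraint is (coeffs, rhs) and means sum(coeffs[i] * x[i]) <= rhs.
    A real model would call a solver here; enumeration only works for toys.
    """
    for x in product((0, 1), repeat=n_vars):
        if all(sum(c * v for c, v in zip(coeffs, x)) <= rhs
               for coeffs, rhs in constraints):
            return True
    return False


def deletion_filter(constraints, n_vars):
    """Shrink an infeasible constraint list to an irreducible infeasible subset."""
    kept = list(constraints)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if not feasible(trial, n_vars):
            kept = trial  # still infeasible without constraint i, so drop it
        else:
            i += 1        # removing it restores feasibility, so it is essential
    return kept


# Toy model over two binary variables x0, x1 (hypothetical example).
c1 = ((-1, -1), -2)  # x0 + x1 >= 2, i.e. both must be 1
c2 = ((1, 0), 0)     # x0 <= 0
c3 = ((0, 1), 1)     # x1 <= 1 (always true)
c4 = ((1, 1), 2)     # x0 + x1 <= 2 (always true)
iis = deletion_filter([c1, c2, c3, c4], n_vars=2)
```

Each loop iteration costs one feasibility check, so the number of solver calls grows with the number of constraints; that cost is what motivates extracting a compact conflict set first and minimizing afterwards.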

[AI-8] B-TGAT: A Bi-directional Temporal Graph Attention Transformer for Clustering Multivariate Spatiotemporal Data

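【Quick read】: This paper addresses the clustering of high-dimensional multivariate spatiotemporal climate data, where complex temporal dependencies, evolving spatial interactions, and non-stationary dynamics make it hard for conventional clustering methods to capture both local and global temporal relationships while preserving spatial context. The key to the solution is a time-distributed hybrid U-Net autoencoder that integrates a Bi-directional Temporal Graph Attention Transformer (B-TGAT). ConvLSTM2D modules extract joint spatial-temporal features by modeling localized dynamics and spatial correlations, and U-Net skip connections preserve multiscale spatial detail. At the bottleneck, B-TGAT combines graph-based spatial modeling with attention-driven temporal encoding, adaptively weighting temporal neighbors to capture short- and long-range dependencies across regions and producing discriminative latent embeddings optimized for clustering.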

Link: https://arxiv.org/abs/2509.13202
Authors: Francis Ndikum Nji,Vandana Janaja,Jianwu Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, In review

Abstract:Clustering high-dimensional multivariate spatiotemporal climate data is challenging due to complex temporal dependencies, evolving spatial interactions, and non-stationary dynamics. Conventional clustering methods, including recurrent and convolutional models, often struggle to capture both local and global temporal relationships while preserving spatial context. We present a time-distributed hybrid U-Net autoencoder that integrates a Bi-directional Temporal Graph Attention Transformer (B-TGAT) to guide efficient temporal clustering of multidimensional spatiotemporal climate datasets. The encoder and decoder are equipped with ConvLSTM2D modules that extract joint spatial–temporal features by modeling localized dynamics and spatial correlations over time, and skip connections that preserve multiscale spatial details during feature compression and reconstruction. At the bottleneck, B-TGAT integrates graph-based spatial modeling with attention-driven temporal encoding, enabling adaptive weighting of temporal neighbors and capturing both short and long-range dependencies across regions. This architecture produces discriminative latent embeddings optimized for clustering. Experiments on three distinct spatiotemporal climate datasets demonstrate superior cluster separability, temporal stability, and alignment with known climate transitions compared to state-of-the-art baselines. The integration of ConvLSTM2D, U-Net skip connections, and B-TGAT enhances temporal clustering performance while providing interpretable insights into complex spatiotemporal variability, advancing both methodological development and climate science applications.

[AI-9] Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy ICCV2025

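【Quick read】: This paper addresses the debate over whether meta-learning really outperforms the conventional whole-class training strategy in few-shot classification. Using an entropy-limited supervised setting for fair comparison, the authors show that meta-learning has a tighter generalization bound, is more efficient under limited entropy, and is more robust to label noise and heterogeneous tasks, which makes it well suited to unsupervised tasks. The key to the solution is the MINO framework, which uses the adaptive clustering algorithm DBSCAN with a dynamic head for unsupervised task construction and a stability-based meta-scaler for robustness to label noise. Experiments confirm its effectiveness on multiple unsupervised few-shot and zero-shot tasks.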

Link: https://arxiv.org/abs/2509.13185
Authors: Yunchuan Guan,Yu Liu,Ke Zhou,Zhiqi Shen,Jenq-Neng Hwang,Serge Belongie,Lei Li
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV 2025

Abstract:Meta-learning is a powerful paradigm for tackling few-shot tasks. However, recent studies indicate that models trained with the whole-class training strategy can achieve comparable performance to those trained with meta-learning in few-shot classification tasks. To demonstrate the value of meta-learning, we establish an entropy-limited supervised setting for fair comparisons. Through both theoretical analysis and experimental validation, we establish that meta-learning has a tighter generalization bound compared to whole-class training. We unravel that meta-learning is more efficient with limited entropy and is more robust to label noise and heterogeneous tasks, making it well-suited for unsupervised tasks. Based on these insights, we propose MINO, a meta-learning framework designed to enhance unsupervised performance. MINO utilizes the adaptive clustering algorithm DBSCAN with a dynamic head for unsupervised task construction and a stability-based meta-scaler for robustness against label noise. Extensive experiments confirm its effectiveness in multiple unsupervised few-shot and zero-shot tasks.

[AI-10] On the Correlation between Individual Fairness and Predictive Accuracy in Probabilistic Models

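【Quick read】: This paper studies individual fairness in generative probabilistic classifiers, focusing on the robustness of posterior inferences to perturbations of private features. Fairness and accuracy are traditionally seen as a trade-off; by analyzing the stability of the posterior under perturbations of private features, the authors hypothesize that more robust instances are more likely to be classified accurately. They test the hypothesis on fourteen datasets with fairness concerns, using Bayesian networks as the generative models. The key technical step is to reformulate robustness analysis over multiple private features as a most probable explanation (MPE) task in an auxiliary Markov random field, which addresses the computational complexity. The experiments support the hypothesis and suggest that improving robustness is a new route to easing the fairness-accuracy trade-off.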

Link: https://arxiv.org/abs/2509.13165
Authors: Alessandro Antonucci,Eric Rossetto,Ivan Duvnjak
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures, 1 table

Abstract:We investigate individual fairness in generative probabilistic classifiers by analysing the robustness of posterior inferences to perturbations in private features. Building on established results in robustness analysis, we hypothesise a correlation between robustness and predictive accuracy, specifically, instances exhibiting greater robustness are more likely to be classified accurately. We empirically assess this hypothesis using a benchmark of fourteen datasets with fairness concerns, employing Bayesian networks as the underlying generative models. To address the computational complexity associated with robustness analysis over multiple private features with Bayesian networks, we reformulate the problem as a most probable explanation task in an auxiliary Markov random field. Our experiments confirm the hypothesis about the correlation, suggesting novel directions to mitigate the traditional trade-off between fairness and accuracy.

[AI-11] FinSearchComp: Towards a Realistic Expert-Level Evaluation of Financial Search and Reasoning

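【Quick read】: This paper addresses the lack of an open benchmark for evaluating end-to-end agents on data search and knowledge-grounded reasoning in realistic financial settings. Existing open financial datasets do not measure an agent's ability to perform multi-step retrieval over time-sensitive, domain-specific data, mainly because building complex tasks requires deep financial expertise and time-sensitive data is hard to evaluate. The key to the solution is FinSearchComp, presented as the first fully open-source benchmark for financial search and reasoning. It comprises three tasks that mirror real analyst workflows (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation), with annotation by 70 professional financial experts and multi-stage quality control to ensure difficulty and reliability. The benchmark contains 635 questions covering global and Greater China markets and is used to evaluate 21 models (products). The results show that web search and financial plugins substantially improve performance and that the country of origin of models and tools significantly affects results.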

Link: https://arxiv.org/abs/2509.13160
Authors: Liang Hu,Jianpeng Jiao,Jiashuo Liu,Yanle Ren,Zhoufutu Wen,Kaiyuan Zhang,Xuanliang Zhang,Xiang Gao,Tianci He,Fei Hu,Yali Liao,Zaiyuan Wang,Chenghao Yang,Qianyu Yang,Mingren Yin,Zhiyuan Zeng,Ge Zhang,Xinyi Zhang,Xiying Zhao,Zhenwei Zhu,Hongseok Namkoong,Wenhao Huang,Yuwen Tang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate data searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks (Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation) that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country origin of models and tools impact performance significantly. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

[AI-12] Agentic AI for Financial Crime Compliance

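【Quick read】: This paper addresses the rising cost and complexity of financial crime compliance (FCC) without measurable gains in effectiveness, as well as the opacity of current AI solutions and their poor alignment with regulatory expectations. The core of the solution is the design and deployment of an agentic AI system for FCC in digitally native financial platforms. Developed through an Action Design Research (ADR) process with a fintech firm and regulatory stakeholders, the system automates onboarding, monitoring, investigation, and reporting, with explainability, traceability, and compliance-by-design as guiding principles. The key innovation is artifact-centric modeling that assigns clearly bounded roles to autonomous agents and supports task-specific model routing and audit logging, reconfiguring FCC workflows under regulatory constraints and strengthening transparency and institutional trust in high-stakes, regulated environments.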

Link: https://arxiv.org/abs/2509.13137
Authors: Henrik Axelsen,Valdemar Licht,Jan Damsgaard
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Accepted for presentation at HICSS-59 (2026), forthcoming in Proceedings

Abstract:The cost and complexity of financial crime compliance (FCC) continue to rise, often without measurable improvements in effectiveness. While AI offers potential, most solutions remain opaque and poorly aligned with regulatory expectations. This paper presents the design and deployment of an agentic AI system for FCC in digitally native financial platforms. Developed through an Action Design Research (ADR) process with a fintech firm and regulatory stakeholders, the system automates onboarding, monitoring, investigation, and reporting, emphasizing explainability, traceability, and compliance-by-design. Using artifact-centric modeling, it assigns clearly bounded roles to autonomous agents and enables task-specific model routing and audit logging. The contribution includes a reference architecture, a real-world prototype, and insights into how Agentic AI can reconfigure FCC workflows under regulatory constraints. Our findings extend IS literature on AI-enabled compliance by demonstrating how automation, when embedded within accountable governance structures, can support transparency and institutional trust in high-stakes, regulated environments.

[AI-13] An Uncertainty-Weighted Decision Transformer for Navigation in Dense Complex Driving Scenarios

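【Quick read】: This paper addresses decision-making for autonomous vehicles in dense, dynamic traffic, with attention to exploiting spatial structure and long-horizon temporal dependencies while remaining robust to uncertainty. The core challenge is the learning imbalance between frequent low-risk states and rare safety-critical decisions. The key to the solution is the Uncertainty-Weighted Decision Transformer (UWDT): a frozen teacher transformer estimates per-token predictive entropy, and that entropy is used as a weight in the student model's loss, amplifying learning from uncertain, high-impact states while keeping training stable on common low-risk transitions. In a roundabout simulator across varying traffic densities, UWDT outperforms the baselines in reward, collision rate, and behavioral stability.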

Link: https://arxiv.org/abs/2509.13132
Authors: Zhihao Zhang,Chengyang Peng,Minghao Zhu,Ekim Yurtsever,Keith A. Redmill
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous driving in dense, dynamic environments requires decision-making systems that can exploit both spatial structure and long-horizon temporal dependencies while remaining robust to uncertainty. This work presents a novel framework that integrates multi-channel bird’s-eye-view occupancy grids with transformer-based sequence modeling for tactical driving in complex roundabout scenarios. To address the imbalance between frequent low-risk states and rare safety-critical decisions, we propose the Uncertainty-Weighted Decision Transformer (UWDT). UWDT employs a frozen teacher transformer to estimate per-token predictive entropy, which is then used as a weight in the student model’s loss function. This mechanism amplifies learning from uncertain, high-impact states while maintaining stability across common low-risk transitions. Experiments in a roundabout simulator, across varying traffic densities, show that UWDT consistently outperforms other baselines in terms of reward, collision rate, and behavioral stability. The results demonstrate that uncertainty-aware, spatial-temporal transformers can deliver safer and more efficient decision-making for autonomous driving in complex traffic environments.
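
The weighting mechanism described in the abstract can be illustrated with a short pure-Python sketch. The exact weighting function used in the paper is not given in the abstract, so the form `1 + alpha * normalized_entropy`, the `alpha` parameter, and the function names below are assumptions; a real implementation would operate on tensors inside a training loop.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weighted_loss(teacher_logits, student_logits, targets, alpha=1.0):
    """Student cross-entropy, weighted per token by the frozen teacher's entropy.

    Assumed weighting: w = 1 + alpha * H / log(K), where H is the teacher's
    predictive entropy and K the number of actions, so w lies in [1, 1 + alpha].
    """
    weighted, total_weight = 0.0, 0.0
    for t_log, s_log, y in zip(teacher_logits, student_logits, targets):
        h = entropy(softmax(t_log)) / math.log(len(t_log))
        w = 1.0 + alpha * h
        ce = -math.log(softmax(s_log)[y])
        weighted += w * ce
        total_weight += w
    return weighted / total_weight


# Token 0: teacher is maximally uncertain; token 1: teacher is confident.
teacher = [[0.0, 0.0], [10.0, -10.0]]
student = [[0.0, 0.0], [5.0, -5.0]]
targets = [0, 0]
plain = uncertainty_weighted_loss(teacher, student, targets, alpha=0.0)
weighted = uncertainty_weighted_loss(teacher, student, targets, alpha=1.0)
```

With `alpha=0.0` the function reduces to an ordinary mean cross-entropy; raising `alpha` shifts the loss toward the token on which the teacher is uncertain.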

[AI-14] Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

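【Quick read】: This paper addresses the limited reasoning ability of large language models (LLMs) on combinatorial optimization problems with preference constraints, in particular the lack of systematic evaluation on matching problems. The key to the solution is a benchmark of 369 instances of the College Admission Problem, used to evaluate LLMs on feasibility, stability, and optimality. By comparing prompting strategies (Chain-of-Thought, In-Context Learning, and role-based prompting) and iterative prompting with auto-generated feedback, the study maps the current limits of LLMs under structured preference constraints and where they can improve.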

Link: https://arxiv.org/abs/2509.13131
Authors: Marylou Fauchard,Florian Carichon,Margarida Carvalho,Golnoosh Farnadi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts. However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality. We employ this benchmark to assess the performance of several open-weight LLMs. Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently. They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms. Moreover, we observed that LLMs reacted differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. Finally, we report the performances from iterative prompting with auto-generated feedback and show that they are not monotonic; they can peak early and then significantly decline in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints.
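
Two of the evaluation dimensions named in the abstract, feasibility and stability, have standard definitions for many-to-one matching that are easy to check mechanically. The Python sketch below is an assumption about how such a check could look, not the benchmark's own scoring code; the data layout (dicts of preference lists, a `student -> college` matching) and the function names are hypothetical.

```python
def is_feasible(matching, capacity):
    """matching maps student -> college (or None); no college may exceed its quota."""
    load = {}
    for college in matching.values():
        if college is not None:
            load[college] = load.get(college, 0) + 1
    return all(c in capacity and n <= capacity[c] for c, n in load.items())


def blocking_pairs(matching, student_prefs, college_prefs, capacity):
    """Return (student, college) pairs that would both rather be matched together."""
    rank = {c: {s: i for i, s in enumerate(prefs)} for c, prefs in college_prefs.items()}
    admitted = {c: [s for s, m in matching.items() if m == c] for c in capacity}
    pairs = []
    for s, prefs in student_prefs.items():
        current = matching.get(s)
        for c in prefs:
            if c == current:
                break  # every later college is less preferred than the current one
            if s not in rank[c]:
                continue  # the college finds this student unacceptable
            has_seat = len(admitted[c]) < capacity[c]
            admits_worse = any(rank[c].get(t, float("inf")) > rank[c][s]
                               for t in admitted[c])
            if has_seat or admits_worse:
                pairs.append((s, c))
    return pairs


# Hypothetical instance: college X has one seat, college Y has two.
capacity = {"X": 1, "Y": 2}
student_prefs = {"a": ["X", "Y"], "b": ["X", "Y"], "c": ["Y", "X"]}
college_prefs = {"X": ["b", "a", "c"], "Y": ["a", "b", "c"]}

unstable = {"a": "X", "b": "Y", "c": "Y"}
stable = {"a": "Y", "b": "X", "c": "Y"}
```

A matching is stable when `blocking_pairs` returns an empty list. In the instance above, `unstable` is feasible but student `b` and college `X` prefer each other to their assigned partners, which is the kind of violation an LLM-proposed matching can contain even when all capacities are respected.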

[AI-15] A Design Co-Pilot for Task-Tailored Manipulators

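【Quick read】: This paper addresses the suboptimal performance of general-purpose robots on specific tasks: a “one-fits-all” design cannot exploit task particularities, while custom robots are held back by long, costly development cycles. The key to the solution is a generative approach to automatically designing and optimizing robot morphologies. It learns the inverse kinematics (IK) for a wide range of manipulators and uses a fully differentiable framework for gradient-based fine-tuning of both the designed robot and its IK solutions. The approach cuts the generation of specialized designs from hours with optimization-based methods to seconds, can be adapted to different hardware constraints, and is demonstrated both in simulation and with a real modular robot.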

Link: https://arxiv.org/abs/2509.13077
Authors: Jonathan Külz,Sehoon Ha,Matthias Althoff
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although robotic manipulators are used in an ever-growing range of applications, robot manufacturers typically follow a “one-fits-all” philosophy, employing identical manipulators in various settings. This often leads to suboptimal performance, as general-purpose designs fail to exploit particularities of tasks. The development of custom, task-tailored robots is hindered by long, cost-intensive development cycles and the high cost of customized hardware. Recently, various computational design methods have been devised to overcome the bottleneck of human engineering. In addition, a surge of modular robots allows quick and economical adaptation to changing industrial settings. This work proposes an approach to automatically designing and optimizing robot morphologies tailored to a specific environment. To this end, we learn the inverse kinematics for a wide range of different manipulators. A fully differentiable framework realizes gradient-based fine-tuning of designed robots and inverse kinematics solutions. Our generative approach accelerates the generation of specialized designs from hours with optimization-based methods to seconds, serving as a design co-pilot that enables instant adaptation and effective human-AI collaboration. Numerical experiments show that our approach finds robots that can navigate cluttered environments, manipulators that perform well across a specified workspace, and can be adapted to different hardware constraints. Finally, we demonstrate the real-world applicability of our method by setting up a modular robot designed in simulation that successfully moves through an obstacle course.

[AI-16] MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data

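【Quick read】: This paper addresses the privacy risk of synthetic tabular data, specifically membership inference attacks (MIAs) made possible when diffusion models memorize training records while generating high-fidelity tabular data. Prior MIA work concentrates on images and text, and tabular data, with its structured attributes and limited record diversity, remains underexplored. The key to the solution is MIA-EPT (Membership Inference Attack via Error Prediction for Tabular Data), a black-box attack designed for tabular diffusion models. It masks and reconstructs the attributes of a target record to build error-based feature vectors and uses how well those attributes are predicted as the membership signal for deciding whether the record was in the training set. The attack needs no access to the generative model's internals, relies only on its synthetic output, and generalizes across multiple state-of-the-art diffusion models.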

Link: https://arxiv.org/abs/2509.13046
Authors: Eyal German,Daniel Samira,Yuval Elovici,Asaf Shabtai
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Synthetic data generation plays an important role in enabling data sharing, particularly in sensitive domains like healthcare and finance. Recent advances in diffusion models have made it possible to generate realistic, high-quality tabular data, but they may also memorize training records and leak sensitive information. Membership inference attacks (MIAs) exploit this vulnerability by determining whether a record was used in training. While MIAs have been studied in images and text, their use against tabular diffusion models remains underexplored despite the unique risks of structured attributes and limited record diversity. In this paper, we introduce MIA-EPT, Membership Inference Attack via Error Prediction for Tabular Data, a novel black-box attack specifically designed to target tabular diffusion models. MIA-EPT constructs error-based feature vectors by masking and reconstructing attributes of target records, disclosing membership signals based on how well these attributes are predicted. MIA-EPT operates without access to the internal components of the generative model, relying only on its synthetic data output, and was shown to generalize across multiple state-of-the-art diffusion models. We validate MIA-EPT on three diffusion-based synthesizers, achieving AUC-ROC scores of up to 0.599 and TPR@10% FPR values of 22.0% in our internal tests. Under the MIDST 2025 competition conditions, MIA-EPT achieved second place in the Black-box Multi-Table track (TPR@10% FPR = 20.0%). These results demonstrate that our method can uncover substantial membership leakage in synthetic tabular data, challenging the assumption that synthetic data is inherently privacy-preserving. Our code is publicly available at this https URL.

[AI-17] Introducing the A2AJs Canadian Legal Data: An open-source alternative to CanLII for the era of computational law

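【Quick read】: This paper addresses unequal access to Canadian legal data: the Canadian Legal Information Institute (CanLII) restricts bulk and programmatic access, so well-resourced actors get the best new technological tools while the public is left with second-rate ones, creating a digital divide. The key to the solution is the open-source Access to Algorithmic Justice (A2AJ) project, whose Canadian Legal Data project provides open access to over 116,000 court decisions and 5,000 statutes through multiple channels, including APIs, machine learning datasets, and AI integration protocols. The paper argues that open legal data enables courts to conduct evidence-based assessments and lets developers build tools for practitioners serving low-income communities.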

Link: https://arxiv.org/abs/2509.13032
Authors: Simon Wallace,Sean Rehaag
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Access to Algorithmic Justice project (A2AJ) is an open-source alternative to the Canadian Legal Information Institute (CanLII). At a moment when technology promises to enable new ways of working with law, CanLII is becoming an impediment to the free access of law and access to justice movements because it restricts bulk and programmatic access to Canadian legal data. This means that Canada is staring down a digital divide: well-resourced actors have the best new technological tools and, because CanLII has disclaimed leadership, the public only gets second-rate tools. This article puts CanLII in its larger historical context and shows how long and deep efforts to democratize access to Canadian legal data are, and how often they are thwarted by private industry. We introduce the A2AJ’s Canadian Legal Data project, which provides open access to over 116,000 court decisions and 5,000 statutes through multiple channels including APIs, machine learning datasets, and AI integration protocols. Through concrete examples, we demonstrate how open legal data enables courts to conduct evidence-based assessments and allows developers to create tools for practitioners serving low-income communities.

[AI-18] GView: A Survey of Binary Forensics via Visual Semantic and AI-Enhanced Analysis

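【Quick read】: This paper addresses the challenge that increasingly sophisticated and diverse threats pose to cybersecurity forensic analysis, where the volume and complexity of attack artifacts keep growing. The key to the solution is GView, an open-source forensic analysis framework with visual and AI-enhanced reasoning. It incorporates large language models (LLMs) to dynamically enhance reasoning, and it applies logical inference through predicates and inference rules to both the analyzed documents and the user's actions in order to make better suggestions. Its extensible architecture is presented as a bridge between practical forensics and academic research.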

Link: https://arxiv.org/abs/2509.13025
Authors: Raul Zaharia(Al. I. Cuza University & Bitdefender),Dragoş Gavriluţ(Al. I. Cuza University & Bitdefender),Gheorghiţă Mutu(Al. I. Cuza University & Bitdefender)
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: In Proceedings FROM 2025, arXiv:2509.11877

Abstract:Cybersecurity threats continue to become more sophisticated and diverse in their artifacts, boosting both their volume and complexity. To overcome those challenges, we present GView, an open-source forensic analysis framework with visual and AI-enhanced reasoning. It started with focus on the practical cybersecurity industry. It has evolved significantly, incorporating large language models (LLMs) to dynamically enhance reasoning and ease the forensic workflows. This paper surveys both the current state of GView with its published papers alongside those that are in the publishing process. It also includes its innovative use of logical inference through predicates and inference rules for both the analyzed documents and the user’s actions for better suggestions. We highlight the extensible architecture, showcasing its potential as a bridge between the practical forensics worlds with the academic research.

[AI-19] Validating Solidity Code Defects using Symbolic and Concrete Execution powered by Large Language Models

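【Quick read】: This paper addresses the high false-alarm rate of static analysis tools and large language models (LLMs) in detecting vulnerabilities in Solidity smart contracts. The key to the solution is a detection pipeline that integrates custom Slither-based detectors, LLMs, Kontrol, and Forge, and uses symbolic or concrete execution to validate and classify reported defects. It generates proofs for vulnerabilities that are hard for current tools (Reentrancy, Complex Fallback, and Faulty Access Control Policies), reducing the manual verification burden and combining heuristic analysis with formal verification for more reliable automated auditing.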

Link: https://arxiv.org/abs/2509.13023
Authors: Ştefan-Claudiu Susan(“Alexandru Ioan Cuza”, University of Iaşi, Department of Computer Science),Andrei Arusoaie(“Alexandru Ioan Cuza”, University of Iaşi, Department of Computer Science),Dorel Lucanu(“Alexandru Ioan Cuza”, University of Iaşi, Department of Computer Science)
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: In Proceedings FROM 2025, arXiv:2509.11877

Abstract:The high rate of false alarms from static analysis tools and Large Language Models (LLMs) complicates vulnerability detection in Solidity Smart Contracts, demanding methods that can formally or empirically prove the presence of defects. This paper introduces a novel detection pipeline that integrates custom Slither-based detectors, LLMs, Kontrol, and Forge. Our approach is designed to reliably detect defects and generate proofs. We currently perform experiments with promising results for seven types of critical defects. We demonstrate the pipeline’s efficacy by presenting our findings for three vulnerabilities – Reentrancy, Complex Fallback, and Faulty Access Control Policies – that are challenging for current verification solutions, which often generate false alarms or fail to detect them entirely. We highlight the potential of either symbolic or concrete execution in correctly classifying such code faults. By chaining these instruments, our method effectively validates true positives, significantly reducing the manual verification burden. Although we identify potential limitations, such as the inconsistency and the cost of LLMs, our findings establish a robust framework for combining heuristic analysis with formal verification to achieve more reliable and automated smart contract auditing.

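[AI-20] xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems

[Quick read]: This paper addresses the reliance of traditional penetration testing on manual work, which makes it inefficient and hard to scale. Existing approaches are mostly carried out by hand by experts, so they are slow, costly, and lack repeatability and automation. The key to the solution is xOffense, a multi-agent penetration testing framework. At its core, the Qwen3-32B Large Language Model (LLM) is fine-tuned on Chain-of-Thought data so that it can generate precise tool commands and perform multi-step reasoning. Agents with clearly divided roles (reconnaissance, vulnerability scanning, exploitation) work together under an orchestration layer, giving a fully automated, scalable, and reproducible penetration testing workflow.

Link: https://arxiv.org/abs/2509.13021
Authors: Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, Phan The Duy
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures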

Abstract:This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows capable of scaling seamlessly with computational infrastructure. At its core, xOffense leverages a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases. Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi-step reasoning. We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI-Pentest-Benchmark. The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain-adapted mid-scale LLMs, when embedded within structured multi-agent orchestration, to deliver superior, cost-efficient, and reproducible solutions for autonomous penetration testing.

[AI-21] A Visualized Framework for Event Cooperation with Generative Agents

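[Quick read]: This paper addresses two key problems in current simulations of agent societies with Large Language Models (LLMs): the lack of a systematic evaluation mechanism for event organization, and the lack of visualized integration with physically grounded environments, which limits agents' ability to navigate real spaces and interact with objects. The core of the solution is MiniAgentPro, a visualization platform that includes an intuitive map editor for customizing environments and a simulation player with smooth animations. On top of it, the authors build a comprehensive test set of eight diverse event scenarios, with basic and hard variants, for systematically evaluating agent behavior. Experiments show that GPT-4o performs well in the basic scenarios but still struggles with coordination in the hard variants, which highlights the limits of existing methods in complex social interaction.

Link: https://arxiv.org/abs/2509.13011
Authors: Yuyang Tian, Shunqiang Mao, Wenchang Gao, Lanlan Qiu, Tianxing He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: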

Abstract:Large Language Models (LLMs) have revolutionized the simulation of agent societies, enabling autonomous planning, memory formation, and social interactions. However, existing frameworks often overlook systematic evaluations for event organization and lack visualized integration with physically grounded environments, limiting agents’ ability to navigate spaces and interact with items realistically. We develop MiniAgentPro, a visualization platform featuring an intuitive map editor for customizing environments and a simulation player with smooth animations. Based on this tool, we introduce a comprehensive test set comprising eight diverse event scenarios with basic and hard variants to assess agents’ ability. Evaluations using GPT-4o demonstrate strong performance in basic settings but highlight coordination challenges in hard variants.

[AI-22] Data-driven Methods of Extracting Text Structure and Information Transfer

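[Quick read]: This paper asks whether text structure in different media (novels, online encyclopedias, research papers, and movies) follows a general pattern of success and failure, that is, whether some structural principle determines success. The key to the solution is to represent each text as a sequence of functional blocks and to analyze how transitions between blocks converge in order and in position, testing the Anna Karenina Principle (AKP) and its variants (reverse AKP, an ordered pattern, and a noisy pattern) across media. The study finds that each medium shows its own structural constraints: novels follow reverse AKP in order, Wikipedia combines AKP with ordered patterns, research papers show reverse AKP in order but stay noisy in position, and movies differ by genre. Success therefore depends on medium-specific structural constraints, while failure takes varied forms.

Link: https://arxiv.org/abs/2509.12999
Authors: Shinichi Honna, Taichi Murayama, Akira Matsui
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: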

Abstract:The Anna Karenina Principle (AKP) holds that success requires satisfying a small set of essential conditions, whereas failure takes diverse forms. We test AKP, its reverse, and two further patterns described as ordered and noisy across novels, online encyclopedias, research papers, and movies. Texts are represented as sequences of functional blocks, and convergence is assessed in transition order and position. Results show that structural principles vary by medium: novels follow reverse AKP in order, Wikipedia combines AKP with ordered patterns, academic papers display reverse AKP in order but remain noisy in position, and movies diverge by genre. Success therefore depends on structural constraints that are specific to each medium, while failure assumes different shapes across domains.

[AI-23] Bridging Performance Gaps for Foundation Models: A Post-Training Strategy for ECGFounder

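[Quick read]: This paper addresses the performance bottleneck of current electrocardiogram (ECG) foundation models in clinical use: even after large-scale pre-training and fine-tuning on the target task, they still trail task-specific models. The core problem is the lack of an effective post-training strategy to further improve generalization and stability. The key to the solution is a simple but effective post-training method whose main components include stochastic depth and preview linear probing. Using only 10% of the training data, it improves on the baseline by 9.1% in macro AUROC (area under the receiver operating characteristic curve) and 34.9% in macro AUPRC (area under the precision-recall curve), and it also outperforms several recent task-specific models and advanced architectures.

Link: https://arxiv.org/abs/2509.12991
Authors: Ya Zhou, Yujie Yang, Xiaohan Fan, Wei Zhao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: A simple yet effective strategy for ECG foundation models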

Abstract:ECG foundation models are increasingly popular due to their adaptability across various tasks. However, their clinical applicability is often limited by performance gaps compared to task-specific models, even after pre-training on large ECG datasets and fine-tuning on target data. This limitation is likely due to the lack of an effective post-training strategy. In this paper, we propose a simple yet effective post-training approach to enhance ECGFounder, a state-of-the-art ECG foundation model pre-trained on over 7 million ECG recordings. Experiments on the PTB-XL benchmark show that our approach improves the baseline fine-tuning strategy by 1.2%-3.3% in macro AUROC and 5.3%-20.9% in macro AUPRC. Additionally, our method outperforms several recent state-of-the-art approaches, including task-specific and advanced architectures. Further evaluation reveals that our method is more stable and sample-efficient compared to the baseline, achieving a 9.1% improvement in macro AUROC and a 34.9% improvement in macro AUPRC using just 10% of the training data. Ablation studies identify key components, such as stochastic depth and preview linear probing, that contribute to the enhanced performance. These findings underscore the potential of post-training strategies to improve ECG foundation models, and we hope this work will contribute to the continued development of foundation models in the ECG domain.

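[AI-24] Toward PDDL Planning Copilot

[Quick read]: This paper addresses the inability of Large Language Models (LLMs) to plan reliably over long horizons when carrying out complex tasks. The key to the solution is the Planning Copilot, a chatbot that integrates multiple planning tools and connects to the LLM through the Model Context Protocol (MCP) standard, so that users can invoke external planning tools with natural-language instructions. The design works with any MCP-capable LLM without domain-specific fine-tuning, and it markedly improves LLM performance on common planning tasks such as syntax checking, planner selection, plan generation and validation, and execution simulation.

Link: https://arxiv.org/abs/2509.12987
Authors: Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: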

Abstract:Large Language Models (LLMs) are increasingly being used as autonomous agents capable of performing complicated tasks. However, they lack the ability to perform reliable long-horizon planning on their own. This paper bridges this gap by introducing the Planning Copilot, a chatbot that integrates multiple planning tools and allows users to invoke them through instructions in natural language. The Planning Copilot leverages the Model Context Protocol (MCP), a recently developed standard for connecting LLMs with external tools and systems. This approach allows using any LLM that supports MCP without domain-specific fine-tuning. Our Planning Copilot supports common planning tasks such as checking the syntax of planning problems, selecting an appropriate planner, calling it, validating the plan it generates, and simulating its execution. We empirically evaluate the ability of our Planning Copilot to perform these tasks using three open-source LLMs. The results show that the Planning Copilot highly outperforms using the same LLMs without the planning tools. We also conducted a limited qualitative comparison of our tool against Chat GPT-5, a very recent commercial LLM. Our results show that our Planning Copilot significantly outperforms GPT-5 despite relying on a much smaller LLM. This suggests dedicated planning tools may be an effective way to enable LLMs to perform planning tasks.

[AI-25] Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins

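[Quick read]: This paper addresses how self-adaptive robots (SARs) in complex, uncertain environments can proactively detect abnormal behaviors, including out-of-distribution (OOD) cases. The key to the solution is ODiSAR, a digital twin-based OOD detection method. It builds the digital twin on a Transformer architecture to forecast SAR states, and quantifies uncertainty with reconstruction error and Monte Carlo dropout. Combining reconstruction error with predictive variance lets it identify OOD behavior under previously unseen conditions, and an explainability module maps potential OOD events to specific robot states to inform self-adaptation decisions.

Link: https://arxiv.org/abs/2509.12982
Authors: Erblin Isaku, Hassan Sartaj, Shaukat Ali, Beatriz Sanguino, Tongtong Wang, Guoyuan Li, Houxiang Zhang, Thomas Peyrucain
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 15 pages, 4 figures, 3 tables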

Abstract:Self-adaptive robots (SARs) in complex, uncertain environments must proactively detect and address abnormal behaviors, including out-of-distribution (OOD) cases. To this end, digital twins offer a valuable solution for OOD detection. Thus, we present a digital twin-based approach for OOD detection (ODiSAR) in SARs. ODiSAR uses a Transformer-based digital twin to forecast SAR states and employs reconstruction error and Monte Carlo dropout for uncertainty quantification. By combining reconstruction error with predictive variance, the digital twin effectively detects OOD behaviors, even in previously unseen conditions. The digital twin also includes an explainability layer that links potential OOD to specific SAR states, offering insights for self-adaptation. We evaluated ODiSAR by creating digital twins of two industrial robots: one navigating an office environment, and another performing maritime ship navigation. In both cases, ODiSAR forecasts SAR behaviors (i.e., robot trajectories and vessel motion) and proactively detects OOD events. Our results showed that ODiSAR achieved high detection performance – up to 98% AUROC, 96% TNR@TPR95, and 95% F1-score – while providing interpretable insights to support self-adaptation.
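
The scoring idea in the abstract, combining reconstruction error with predictive variance from Monte Carlo dropout, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the mixing weight `alpha`, and the fixed threshold are all assumptions, and the stochastic forecasts are taken as given rather than produced by a Transformer.

```python
import statistics

def ood_score(observed, mc_forecasts, alpha=0.5):
    """Blend reconstruction error with Monte Carlo dropout variance.

    observed:     list of floats, the state that was actually measured
    mc_forecasts: list of forecasts (each a list of floats), one per
                  stochastic forward pass with dropout left switched on
    alpha:        hypothetical mixing weight between the two terms
    """
    dims = range(len(observed))
    # Mean forecast over the stochastic passes, per state dimension.
    mean = [statistics.fmean([f[d] for f in mc_forecasts]) for d in dims]
    # Reconstruction error: mean squared gap between forecast and observation.
    recon = statistics.fmean([(mean[d] - observed[d]) ** 2 for d in dims])
    # Predictive variance: spread of the passes, averaged over dimensions.
    var = statistics.fmean(
        [statistics.pvariance([f[d] for f in mc_forecasts]) for d in dims]
    )
    return alpha * recon + (1 - alpha) * var

def is_ood(observed, mc_forecasts, threshold, alpha=0.5):
    """Flag a state as out-of-distribution when the blended score is high."""
    return ood_score(observed, mc_forecasts, alpha) > threshold
```

A state the twin forecasts confidently and accurately scores near zero; either a large forecast error or a large disagreement between dropout passes pushes the score up. How the two terms are actually weighted and thresholded in ODiSAR is not stated in the abstract.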

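[AI-26] Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning

[Quick read]: This paper addresses the privacy challenges that Continual Learning (CL) models face as they accumulate diverse information, in particular the substantial loss of model utility caused by allocating a uniform Differential Privacy (DP) budget. The key to the solution is a privacy-enhanced continual learning (PeCL) framework with two core mechanisms. The first is a token-level dynamic DP strategy that adaptively allocates privacy budgets according to semantic sensitivity, so that protection concentrates on sensitive entities and noise injected into general knowledge is minimized. The second is a privacy-guided memory sculpting module that uses the sensitivity analysis to forget sensitive information while explicitly preserving the task-invariant historical knowledge needed to mitigate catastrophic forgetting, balancing privacy protection against model utility.

Link: https://arxiv.org/abs/2509.12958
Authors: Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: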

Abstract:Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges due to accumulating diverse information. Traditional privacy methods, like a uniform Differential Privacy (DP) budget, indiscriminately protect all data, leading to substantial model utility degradation and hindering CL deployment in privacy-sensitive areas. To overcome this, we propose a privacy-enhanced continual learning (PeCL) framework that forgets what’s sensitive and remembers what matters. Our approach first introduces a token-level dynamic Differential Privacy strategy that adaptively allocates privacy budgets based on the semantic sensitivity of individual tokens. This ensures robust protection for private entities while minimizing noise injection for non-sensitive, general knowledge. Second, we integrate a privacy-guided memory sculpting module. This module leverages the sensitivity analysis from our dynamic DP mechanism to intelligently forget sensitive information from the model’s memory and parameters, while explicitly preserving the task-invariant historical knowledge crucial for mitigating catastrophic forgetting. Extensive experiments show that PeCL achieves a superior balance between privacy preserving and model utility, outperforming baseline models by maintaining high accuracy on previous tasks while ensuring robust privacy.
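
To make the idea of a token-level privacy budget concrete, here is a minimal sketch in which each token's sensitivity score sets its own epsilon and therefore its own noise scale. Everything specific here is an assumption for illustration: the linear mapping from sensitivity to epsilon, the `eps_min` and `eps_max` values, the Laplace mechanism, and the one-scalar-per-token setup are not taken from the paper, and the sketch makes no claim about the formal guarantee PeCL provides.

```python
import random

def allocate_budgets(sensitivities, eps_min=0.5, eps_max=8.0):
    """Map per-token sensitivity in [0, 1] to a privacy budget epsilon.

    Sensitive tokens (score near 1) get a small epsilon, i.e. more noise;
    general tokens (score near 0) get a large epsilon, i.e. less noise.
    The linear interpolation is an illustrative assumption.
    """
    return [eps_max - s * (eps_max - eps_min) for s in sensitivities]

def laplace_noise(scale, rng):
    # The difference of two i.i.d. Exp(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def perturb(values, sensitivities, clip=1.0, seed=0):
    """Add Laplace noise to one scalar per token, scaled by 2 * clip / epsilon."""
    rng = random.Random(seed)
    budgets = allocate_budgets(sensitivities)
    noisy = []
    for value, eps in zip(values, budgets):
        # Clipping bounds each token's value to [-clip, clip], so two
        # values can differ by at most 2 * clip.
        value = max(-clip, min(clip, value))
        noisy.append(value + laplace_noise(2 * clip / eps, rng))
    return noisy
```

With these hypothetical settings, a token scored 1.0 receives sixteen times the noise scale of a token scored 0.0, which is the qualitative behavior the abstract describes: strong protection for private entities, little distortion for general knowledge.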

[AI-27] Black-box Model Merging for Language-Model-as-a-Service with Massive Model Repositories

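[Quick read]: This paper addresses black-box model merging (BMM): how to merge multiple large language models that are reachable only through API calls into one unified model when their parameters are unavailable. The core challenge is that conventional merging methods based on task vectors depend on accessible model weights, whereas leading LLMs such as GPT-4 are usually offered only as a service (Language-Model-as-a-Service), so their parameters are hidden. The authors propose Evo-Merging, a derivative-free optimization framework based on an evolutionary algorithm, with two key components: sparsity-based denoising, which identifies and filters out irrelevant or redundant information across models, and sign-aware scaling, which dynamically computes combination weights for the relevant models based on their performance. The method needs only inference-time API queries and reports state-of-the-art results on a range of tasks.

Link: https://arxiv.org/abs/2509.12951
Authors: Shilian Chen, Jie Zhou, Tianyu Huai, Yujiang Lu, Junsong Li, Bihao Zhan, Qianjun Pan, Yutao Yang, Xin Li, Qin Chen, Hang Yan, Liang He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: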

Abstract:Model merging refers to the process of integrating multiple distinct models into a unified model that preserves and combines the strengths and capabilities of the individual models. Most existing approaches rely on task vectors to combine models, typically under the assumption that model parameters are accessible. However, for extremely large language models (LLMs) such as GPT-4, which are often provided solely as black-box services through API interfaces (Language-Model-as-a-Service), model weights are not available to end users. This presents a significant challenge, which we refer to as black-box model merging (BMM) with massive LLMs. To address this challenge, we propose a derivative-free optimization framework based on the evolutionary algorithm (Evo-Merging) that enables effective model merging using only inference-time API queries. Our method consists of two key components: (1) sparsity-based denoising, designed to identify and filter out irrelevant or redundant information across models, and (2) sign-aware scaling, which dynamically computes optimal combination weights for the relevant models based on their performance. We also provide a formal justification, along with a theoretical analysis, for our asymmetric sparsification. Extensive experimental evaluations demonstrate that our approach achieves state-of-the-art results on a range of tasks, significantly outperforming existing strong baselines.
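
The derivative-free part of this setup can be illustrated with a small elitist evolutionary search over combination weights, where the only access to the merged system is a black-box `score` call. This is a generic sketch under stated assumptions, not Evo-Merging itself: `score` is a hypothetical stand-in for performance measured through API queries, the top-k magnitude pruning in `sparsify` merely stands in for the paper's sparsity-based denoising, and sign-aware scaling is not modeled at all.

```python
import random

def sparsify(weights, keep):
    """Zero all but the `keep` largest-magnitude weights (illustrative pruning)."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(ranked[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]

def evolve_weights(score, n_models, keep, generations=200, children=8,
                   sigma=0.3, seed=0):
    """Elitist evolutionary search; `score` is only called, never differentiated."""
    rng = random.Random(seed)
    best = sparsify([1.0 / n_models] * n_models, keep)
    best_score = score(best)
    for _ in range(generations):
        for _ in range(children):
            # Mutate the current best with Gaussian noise, then prune.
            child = sparsify([w + rng.gauss(0.0, sigma) for w in best], keep)
            child_score = score(child)
            # Keep the child only if it strictly improves the score.
            if child_score > best_score:
                best, best_score = child, child_score
    return best, best_score
```

In a real black-box setting every `score` call would cost API queries, so the `generations` and `children` budget matters far more than it does in this toy loop.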

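[AI-28] The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

[Quick read]: This paper addresses the lack of transparency in current alignment methods for large language models. Reinforcement Learning from Human Feedback (RLHF) in particular induces diffuse parameter changes that are hard to interpret, which obscures what the model has internalized. The key to the solution is Feature Steering with Reinforcement Learning (FSRL), which trains a lightweight adapter to modulate interpretable features extracted by a Sparse Autoencoder (SAE), enabling controllable intervention on model behavior and mechanistic analysis. The method is comparable to mainstream RLHF methods in preference optimization, and the analysis shows that the alignment process tends to reward style features rather than explicit alignment concepts, which offers a new diagnostic tool for understanding the model's internal mechanisms.

Link: https://arxiv.org/abs/2509.12934
Authors: Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work in Progress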

Abstract:Aligning large language models is critical for their usability and safety. However, the prevailing approach of Reinforcement Learning from Human Feedback (RLHF) induces diffuse, opaque parameter changes, making it difficult to discern what the model has internalized. Hence, we introduce Feature Steering with Reinforcement Learning (FSRL), a transparent alignment framework that trains a lightweight adapter to steer behavior by modulating interpretable features from a Sparse Autoencoder (SAE). First, we demonstrate that FSRL is an effective method for preference optimization and is comparable with current RLHF methods. We then perform mechanistic analysis on the trained adapter, and find that its policy systematically promotes style features over explicit alignment concepts, suggesting that the preference optimization process rewards stylistic presentation as a proxy for quality. Ultimately, we hope that FSRL provides a tool for both interpretable model control and diagnosing the internal mechanisms of alignment.

[AI-29] Population Estimation using Deep Learning over Gandhinagar Urban Area

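[Quick read]: This paper addresses the high cost, long duration, and heavy manpower requirements of traditional population estimation methods such as censuses and surveys. The core solution is a deep learning framework built on high-resolution satellite imagery (0.3 m), a Digital Elevation Model (DEM, 0.5 m), and vector boundary data. It combines a Convolutional Neural Network (CNN), which classifies buildings as residential or non-residential, with an Artificial Neural Network (ANN), which estimates population at building level from the residential buildings. By combining high-spatial-resolution geospatial data with AI models, the approach estimates the population of Gandhinagar at about 279,000, and it is described as supporting real-time data updates, standardized metrics, and integration with urban planning, making population monitoring more automated and scalable.

Link: https://arxiv.org/abs/2509.12926
Authors: Jai Singla, Peal Jotania, Keivalya Pandya
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: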

Abstract:Population estimation is crucial for various applications, from resource allocation to urban planning. Traditional methods such as surveys and censuses are expensive, time-consuming and also heavily dependent on human resources, requiring significant manpower for data collection and processing. In this study, a deep learning solution is proposed to estimate population using high-resolution (0.3 m) satellite imagery, Digital Elevation Models (DEM) of 0.5 m resolution, and vector boundaries. The proposed method combines a Convolutional Neural Network (CNN) architecture, which classifies buildings as residential or non-residential, with an Artificial Neural Network (ANN) architecture, which estimates the population. Approximately 48k building footprints over the Gandhinagar urban area are used, covering both residential and non-residential buildings, with the residential category further used for building-level population estimation. Experimental results on a large-scale dataset demonstrate the effectiveness of our model, achieving an impressive overall F1-score of 0.9936. The proposed system employs advanced geospatial analysis with high spatial resolution to estimate Gandhinagar population at 278,954. By integrating real-time data updates, standardized metrics, and infrastructure planning capabilities, this automated approach addresses critical limitations of conventional census-based methodologies. The framework provides municipalities with a scalable and replicable tool for optimized resource management in rapidly urbanizing cities, showcasing the efficiency of AI-driven geospatial analytics in enhancing data-driven urban governance.

[AI-30] A Graph-Based Approach to Alert Contextualisation in Security Operations Centres

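[Quick read]: This paper addresses the difficulty of interpreting the massive volume of security alerts in Security Operations Centres (SOCs). The core challenge is to distinguish genuine threats from benign activity quickly, so that high-risk events are handled first. The key to the solution is a graph-based approach to alert contextualisation: alerts are aggregated into graph-structured alert groups, in which nodes represent individual alerts and edges represent relationships between alerts within defined time windows. Grouping at this higher level of abstraction captures attack steps more effectively than individual alerts, and the format suits downstream machine learning methods such as Graph Matching Networks (GMNs), which correlate incoming alert groups with historical incidents to give analysts additional insight.

Link: https://arxiv.org/abs/2509.12923
Authors: Magnus Wiik Eckhoff, Peter Marius Flydal, Siem Peters, Martin Eian, Jonas Halvorsen, Vasileios Mavroeidis, Gudmund Grov
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: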

Abstract:Interpreting the massive volume of security alerts is a significant challenge in Security Operations Centres (SOCs). Effective contextualisation is important, enabling quick distinction between genuine threats and benign activity to prioritise what needs further analysis. This paper proposes a graph-based approach to enhance alert contextualisation in a SOC by aggregating alerts into graph-based alert groups, where nodes represent alerts and edges denote relationships within defined time-windows. By grouping related alerts, we enable analysis at a higher abstraction level, capturing attack steps more effectively than individual alerts. Furthermore, to show that our format is well suited for downstream machine learning methods, we employ Graph Matching Networks (GMNs) to correlate incoming alert groups with historical incidents, providing analysts with additional insights.
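
The grouping step (alerts as nodes, edges between related alerts inside a time window) can be sketched as a small connected-components routine. This is a hedged illustration rather than the paper's method: the tuple layout, the `entity` field used as the relationship between alerts, and the rule that an edge requires a shared entity are all assumptions, since the abstract does not specify how relationships are defined.

```python
from collections import defaultdict

def build_alert_groups(alerts, window):
    """Group alerts into connected components of a time-window graph.

    alerts: list of (alert_id, timestamp_seconds, entity) tuples; the
            `entity` field (for example a host name) is a hypothetical
            linking key
    window: maximum time gap, in seconds, for two alerts to share an edge
    """
    edges = defaultdict(set)
    for i, (id_a, t_a, ent_a) in enumerate(alerts):
        edges[id_a]  # make sure isolated alerts still appear as nodes
        for id_b, t_b, ent_b in alerts[i + 1:]:
            if ent_a == ent_b and abs(t_a - t_b) <= window:
                edges[id_a].add(id_b)
                edges[id_b].add(id_a)

    groups, seen = [], set()
    for start, _, _ in alerts:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups
```

Because membership is transitive, a chain of alerts that are each within the window of the next ends up in one group even when the first and last are far apart, which is one way a multi-step attack could surface as a single unit.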

[AI-31] Stochastic Streets: A Walk Through Random LLM Address Generation in four European Cities

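[Quick read]: This paper asks whether Large Language Models (LLMs) can generate random street addresses for European cities that match the real geographic distribution. The key to the approach is to evaluate whether LLMs, without explicit supporting training data, can use their internal knowledge and language understanding to produce street addresses that are well-formed, plausible, and regionally characteristic, and thereby to reveal their generalization ability and potential biases in generating spatial information.

Link: https://arxiv.org/abs/2509.12914
Authors: Tairan Fu, David Campo-Nazareno, Javier Coronado-Blázquez, Javier Conde, Pedro Reviriego, Fabrizio Lombardi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: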

Abstract:Large Language Models (LLMs) are capable of solving complex math problems or answering difficult questions on almost any topic, but can they generate random street addresses for European cities?

[AI-32] LTA-thinker: Latent Thought-Augmented Training Framework for Large Language Models on Complex Reasoning

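[Quick read]: This paper addresses the efficiency and accuracy bottleneck caused by overthinking when Large Language Models (LLMs) perform complex reasoning. The core challenge is how to generate and use high-quality Latent Thought representations efficiently. The key to the solution is LTA-Thinker, a Latent Thought-Augmented Training Framework, with two contributions. First, it builds a Latent Thought generation architecture based on a learnable prior, which raises the variance of the distribution of generated Latent Thought vectors so that it better approximates the ideal true distribution. Second, it introduces a distribution-based directional optimization paradigm that jointly constrains distribution locality and scale, combining the standard Supervised Fine-Tuning (SFT) loss with two new losses: a Semantic Alignment Loss, based on KL divergence, that keeps the Latent Thought closely tied to the semantics of the question, and a Reasoning Focus Loss, based on contrastive learning, that guides the model toward the most critical reasoning steps. Experiments show that LTA-Thinker reaches state-of-the-art performance among the baselines, with a higher performance ceiling and better scaling behavior.

Link: https://arxiv.org/abs/2509.12875
Authors: Jiaqi Wang, Binquan Ji, Haibo Luo, Yiyang Qi, Ruiting Li, Huiyan Wang, Yuantao Han, Cangyi Yang, jiaxu Zhang, Feiliang Ren
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: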

Abstract:Complex reasoning in Large Language Models can be dynamically optimized using Test-Time Scaling (TTS) to mitigate Overthinking. Although methods such as Coconut, SoftCoT and its variant are effective in continuous latent space inference, the core bottleneck still lies in the efficient generation and utilization of high-quality Latent Thought. Drawing from the theory of SoftCoT++ that a larger variance in the generated Latent Thought distribution more closely approximates the golden truth distribution, we propose a Latent Thought-Augmented Training Framework, LTA-Thinker, which improves distributional variance and enhances reasoning performance from two perspectives. First, LTA-Thinker constructs a Latent Thought generation architecture based on a learnable prior. This architecture aims to increase the variance of the distribution of generated Latent Thought Vectors in order to simplify the overall structure and raise the performance ceiling. Second, LTA-Thinker introduces a distribution-based directional optimization paradigm that jointly constrains both distribution locality and distribution scale. This mechanism improves information efficiency and computational cost through a multi-objective co-training strategy, which combines standard Supervised Fine-Tuning (SFT) loss with two novel losses: Semantic Alignment Loss, which utilizes KL divergence to ensure that the Latent Thought is highly relevant to the semantics of the question; and Reasoning Focus Loss, which utilizes a contrastive learning mechanism to guide the model to focus on the most critical reasoning steps. Experiments show that LTA-Thinker achieves state-of-the-art (SOTA) performance among various baselines and demonstrates a higher performance ceiling and better scaling effects.

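[AI-33] AI Factories: It's time to rethink the Cloud-HPC divide

[Quick read]: This paper addresses the limited usability and accessibility of current High-Performance Computing (HPC) systems when they must support modern AI services such as generative AI, and in particular, against the background of Europe's Sovereign AI strategy, how to combine traditional HPC supercomputers with cloud-native technologies to build efficient, easy-to-use AI Factories. The key to the solution is a dual-stack approach that integrates both HPC and cloud-native technologies (such as Kubernetes and object storage) within the supercomputer, so that high performance and hardware acceleration are paired with service-oriented front-ends. It follows two directions: Serverless HPC, to make HPC more elastic and easier to use, and High-performance Cloud, to strengthen cloud platforms' support for HPC-class workloads. The convergence aims to bridge the divide between HPC and cloud computing so that each paradigm amplifies the other.

Link: https://arxiv.org/abs/2509.12849
Authors: Pedro Garcia Lopez, Daniel Barcelona Pons, Marcin Copik, Torsten Hoefler, Eduardo Quiñones, Maciej Malawski, Peter Pietzutch, Alberto Marti, Thomas Ohlson Timoudas, Aleksander Slominski
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: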

Abstract:The strategic importance of artificial intelligence is driving a global push toward Sovereign AI initiatives. National governments are increasingly developing dedicated infrastructures, called AI Factories (AIF), to achieve technological autonomy and secure the resources necessary to sustain robust local digital ecosystems. In Europe, the EuroHPC Joint Undertaking is investing hundreds of millions of euros into several AI Factories, built atop existing high-performance computing (HPC) supercomputers. However, while HPC systems excel in raw performance, they are not inherently designed for usability, accessibility, or serving as public-facing platforms for AI services such as inference or agentic applications. In contrast, AI practitioners are accustomed to cloud-native technologies like Kubernetes and object storage, tools that are often difficult to integrate within traditional HPC environments. This article advocates for a dual-stack approach within supercomputers: integrating both HPC and cloud-native technologies. Our goal is to bridge the divide between HPC and cloud computing by combining high performance and hardware acceleration with ease of use and service-oriented front-ends. This convergence allows each paradigm to amplify the other. To this end, we will study the cloud challenges of HPC (Serverless HPC) and the HPC challenges of cloud technologies (High-performance Cloud).

[AI-34] Improving Anomalous Sound Detection with Attribute-aware Representation from Domain-adaptive Pre-training

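[Quick read]: This paper addresses the difficulty of training Anomalous Sound Detection (ASD) models when complete machine attribute labels are missing. Mainstream methods formulate ASD as a machine attribute classification task, but in practice only normal data is available and labeling every machine attribute by hand is costly and impractical. The authors propose assigning pseudo-attribute labels by agglomerative hierarchical clustering over sound representations extracted from a domain-adaptive pre-trained model, which are expected to capture machine attribute characteristics. The pre-trained model is then adapted by supervised fine-tuning for machine attribute classification. The key is that the pre-trained model's representations and the clustering strategy together supply usable pseudo-labels, which markedly improves performance when only normal samples are available.

Link: https://arxiv.org/abs/2509.12845
Authors: Xin Fang, Guirui Zhong, Qing Wang, Fan Chu, Lei Wang, Mengui Qian, Mingqi Cai, Jiangzhao Wu, Jianqing Gao, Jun Du
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures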

Abstract:Anomalous Sound Detection (ASD) is often formulated as a machine attribute classification task, a strategy necessitated by the common scenario where only normal data is available for training. However, the exhaustive collection of machine attribute labels is laborious and impractical. To address the challenge of missing attribute labels, this paper proposes an agglomerative hierarchical clustering method for the assignment of pseudo-attribute labels using representations derived from a domain-adaptive pre-trained model, which are expected to capture machine attribute characteristics. We then apply model adaptation to this pre-trained model through supervised fine-tuning for machine attribute classification, resulting in a new state-of-the-art performance. Evaluation on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge dataset demonstrates that our proposed approach yields significant performance gains, ultimately outperforming our previous top-ranking system in the challenge.
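
Pseudo-label assignment by agglomerative hierarchical clustering can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the authors' pipeline: it uses average linkage with Euclidean distance and a fixed target number of clusters, none of which is stated in the abstract, and it takes the embeddings as plain tuples rather than outputs of a pre-trained audio model.

```python
import math

def agglomerative_labels(embeddings, n_clusters):
    """Average-linkage agglomerative clustering; returns one pseudo-label per item."""
    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(embeddings[i], embeddings[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

The resulting integer labels would then play the role of attribute classes in a supervised fine-tuning step. This brute-force version recomputes all linkages at every merge, so it is only suitable for small illustrative inputs.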

[AI-35] Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models

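[Quick read]: This paper addresses how, in multi-robot cooperation, to allocate tasks efficiently according to the situational knowledge each robot holds (such as spatial concepts learned from its user-designated area), especially when natural-language instructions are ambiguous or context-dependent (for example "Find an apple and a banana" or "Get ready for a field trip"). The key to the solution is a task planning framework that combines Large Language Models (LLMs) with spatial concepts. A novel few-shot prompting strategy lets the LLM infer the required objects from ambiguous instructions and decompose them into executable subtasks, which are then allocated and sequenced. In the experiments the method made 47 successful assignments out of 50, clearly ahead of random assignment (28/50) and commonsense-based assignment (26/50).

Link: https://arxiv.org/abs/2509.12838
Authors: Kento Murata, Shoichi Hasegawa, Tomochika Ishikawa, Yoshinobu Hagiwara, Akira Taniguchi, Lotfi El Hafi, Tadahiro Taniguchi
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Submitted to AROB-ISBC 2026 (Journal Track option)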

Abstract:It is crucial to efficiently execute instructions such as “Find an apple and a banana” or “Get ready for a field trip,” which require searching for multiple objects or understanding context-dependent commands. This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge, specifically spatial concepts learned from the area designated to it by the user. We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots. We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random (28/50) and commonsense-based assignment (26/50). Furthermore, we conducted qualitative evaluations using two actual mobile manipulators. The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as “Get ready for a field trip,” by successfully performing task decomposition, assignment, sequential planning, and execution.

[AI-36] A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis

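[Quick read]: This paper addresses the difficulty of applying current voice cloning and talking head generation techniques in noisy or low-resource settings, since they typically depend on large, high-quality datasets and computationally intensive training. The key to the solution is a modular pipeline: Tortoise text-to-speech (TTS), a transformer-based latent diffusion model, performs high-fidelity zero-shot voice cloning from only a few samples, and a lightweight Generative Adversarial Network (GAN) architecture provides robust real-time lip synchronization. The design reduces dependence on large-scale pre-training data and compute, and its modular structure is intended to extend to emotionally expressive speech generation and multi-modal control in unconstrained real-world scenarios.

Link: https://arxiv.org/abs/2509.12831
Authors: Javeria Amir, Farwa Attaria, Mah Jabeen, Umara Noor, Zahid Rashid
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: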

Abstract:Recent developments in voice cloning and talking head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods are typically trained on large-scale datasets of clean, studio-recorded inputs using computationally intensive processes, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline built around Tortoise text-to-speech, a transformer-based latent diffusion model that can perform high-fidelity zero-shot voice cloning given only a few training samples. We use a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution contributes to tasks that call for less reliance on massive pre-training, such as generating emotionally expressive speech and lip synchronization in noisy and unconstrained scenarios. The modular structure of the pipeline allows easy extension to future multi-modal and text-guided voice modulation, and it could be used in real-world systems.

[AI-37] A Pressure-Based Diffusion Model for Influence Maximization on Social Networks

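[Quick read]: This paper addresses the modeling of influence diffusion on social networks and the Influence Maximization (IM) problem, that is, how to select seed nodes so that coverage is maximal once the diffusion process ends. The key to the solution is a new diffusion model, the Pressure Threshold model (PT), which extends the classic Linear Threshold model (LT) by adjusting a node's outgoing influence in proportion to the influence it receives from its activated neighbors, so that influence spread is simulated more dynamically. Experiments on real-world networks show that the PT model selects different seed nodes than the LT model, and that densely connected networks amplify pressure effects more than sparse ones.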

Link: https://arxiv.org/abs/2509.12822
Authors: Curt Stutsman, Eliot W. Robson, Abhishek K. Umrawal
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, and 2 tables

Abstract:In many real-world scenarios, an individual’s local social network carries significant influence over the opinions they form and subsequently propagate to others. In this paper, we propose a novel diffusion model – the Pressure Threshold model (PT) – for dynamically simulating the spread of influence through a social network. This new model extends the popular Linear Threshold Model (LT) by adjusting a node’s outgoing influence proportional to the influence it receives from its activated neighbors. We address the Influence Maximization (IM) problem, which involves selecting the most effective seed nodes to achieve maximal graph coverage after a diffusion process, and how the problem manifests with the PT Model. Experiments conducted on real-world networks, facilitated by enhancements to the open-source network-diffusion Python library, CyNetDiff, demonstrate unique seed node selection for the PT Model when compared to the LT Model. Moreover, analyses demonstrate that densely connected networks amplify pressure effects more significantly than sparse networks.
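To make the difference between the two diffusion models concrete, here is a minimal sketch in plain Python. It is not the CyNetDiff implementation and not the authors' exact update rule: the function name `simulate`, the synchronous update, and in particular the scaling factor `1 + received` applied to a newly activated node's outgoing edges are assumptions chosen only to illustrate "outgoing influence proportional to received influence".

```python
def simulate(edges, thresholds, seeds, pressure=False):
    """Run a threshold diffusion and return the set of active nodes.

    edges: dict mapping (source, target) -> influence weight
    thresholds: dict mapping node -> activation threshold
    pressure: if True, apply a hypothetical Pressure-Threshold-style rescaling
    """
    weights = dict(edges)  # copy so the caller's graph is not modified
    active = set(seeds)
    frontier = set(seeds)
    while frontier:
        newly_active = set()
        for node in thresholds:
            if node in active:
                continue
            # Total influence arriving from neighbors that are already active.
            received = sum(w for (src, dst), w in weights.items()
                           if dst == node and src in active)
            if received >= thresholds[node]:
                newly_active.add(node)
                if pressure:
                    # Assumed rule: scale outgoing weights by the pressure received.
                    for (src, dst) in list(weights):
                        if src == node:
                            weights[(src, dst)] *= 1.0 + received
        active |= newly_active
        frontier = newly_active
    return active


graph = {("a", "b"): 0.6, ("b", "c"): 0.4}
limits = {"a": 1.0, "b": 0.5, "c": 0.5}
lt_result = simulate(graph, limits, {"a"})
pt_result = simulate(graph, limits, {"a"}, pressure=True)
```

In this toy chain the plain threshold rule stops at `b`, because `c` only receives 0.4, while the pressure variant boosts the edge from `b` to `c` past the threshold. That is the kind of behavioral difference that can change which seed nodes are worth selecting.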

[AI-38] H2R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents

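[Quick read]: This paper addresses inefficient knowledge transfer in Large Language Model (LLM)-based agents in multi-task settings: existing methods treat prior experience as monolithic units, which makes transfer coarse-grained and wasteful. The key to the solution is a hierarchical memory architecture that decouples high-level planning memory from low-level execution memory to enable fine-grained transfer, together with Hierarchical Hindsight Reflection (H²R), a mechanism that distills reusable hierarchical knowledge from past interactions and retrieves high-level and low-level memories separately at test time, improving the generalization and decision-making performance of LLM agents on new tasks.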

Link: https://arxiv.org/abs/2509.12810
Authors: Shicheng Ye, Chao Yu, Kaiqiang Ke, Chengdong Xu, Yinqi Wei
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM)-based agents have shown strong potential in multi-task scenarios, owing to their ability to transfer knowledge across diverse tasks. However, existing approaches often treat prior experiences and knowledge as monolithic units, leading to inefficient and coarse-grained knowledge transfer. In this work, we propose a novel hierarchical memory architecture that enables fine-grained knowledge transfer by decoupling high-level planning memory from low-level execution memory. To construct and refine these hierarchical memories, we introduce Hierarchical Hindsight Reflection (H²R), a mechanism that distills reusable and hierarchical knowledge from past agent-environment interactions. At test time, H²R performs retrievals of high-level and low-level memories separately, allowing LLM-based agents to efficiently access and utilize task-relevant knowledge for new tasks. Experimental results across two benchmarks demonstrate that H²R can improve generalization and decision-making performance, outperforming prior baselines such as Expel.

[AI-39] LLM-Based Approach for Enhancing Maintainability of Automotive Architectures

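[Quick read]: This paper addresses the limited flexibility of automotive systems during long-term maintenance, updates, and extensions, which stems from long re-engineering, standardization, and compliance procedures and from the heterogeneity and sheer number of devices and underlying software components. The key to the solution is using Large Language Models (LLMs) to automate the relevant tasks and processes, explored in three case studies: 1) updates, hardware abstraction, and compliance; 2) interface compatibility checking; and 3) architecture modification suggestions. A proof-of-concept implementation built on OpenAI's GPT-4o model illustrates the potential of LLMs to increase flexibility across the automotive system lifecycle.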

Link: https://arxiv.org/abs/2509.12798
Authors: Nenad Petrovic, Lukasz Mazur, Alois Knoll
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:There are many bottlenecks that decrease the flexibility of automotive systems, making their long-term maintenance, as well as updates and extensions in later lifecycle phases increasingly difficult, mainly due to long re-engineering, standardization, and compliance procedures, as well as heterogeneity and numerosity of devices and underlying software components involved. In this paper, we explore the potential of Large Language Models (LLMs) when it comes to the automation of tasks and processes that aim to increase the flexibility of automotive systems. Three case studies towards achieving this goal are considered as outcomes of early-stage research: 1) updates, hardware abstraction, and compliance, 2) interface compatibility checking, and 3) architecture modification suggestions. For proof-of-concept implementation, we rely on OpenAI’s GPT-4o model.

[AI-40] EmbeddedML: A New Optimized and Fast Machine Learning Library

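[Quick read]: This paper addresses the long training times of machine learning models on large datasets, targeting the efficiency of conventional libraries such as scikit-learn on regression and classification tasks. The key to the solution is a mathematical reformulation of core algorithms: Multiple Linear Regression, Logistic Regression, and Support Vector Machines (SVM) are rewritten to shorten training time without sacrificing accuracy. According to the reported experiments, EmbeddedML trains SVM about 2 times faster on small datasets and around 800 times faster on large ones, and trains Logistic Regression about 4 times faster, while keeping prediction accuracy comparable to scikit-learn.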

Link: https://arxiv.org/abs/2509.12774
Authors: Halil Hüseyin Çalışkan, Talha Koruk
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures

Abstract:Machine learning models and libraries can train datasets of different sizes and perform prediction and classification operations, but machine learning models and libraries cause slow and long training times on large datasets. This article introduces EmbeddedML, a training-time-optimized and mathematically enhanced machine learning library. The speed was increased by approximately times compared to scikit-learn without any loss in terms of accuracy in regression models such as Multiple Linear Regression. Logistic Regression and Support Vector Machines (SVM) algorithms have been mathematically rewritten to reduce training time and increase accuracy in classification models. With the applied mathematical improvements, training time has been reduced by approximately 2 times for SVM on small datasets and by around 800 times on large datasets, and by approximately 4 times for Logistic Regression, compared to the scikit-learn implementation. In summary, the EmbeddedML library offers regression, classification, clustering, and dimensionality reduction algorithms that are mathematically rewritten and optimized to reduce training time.

[AI-41] Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative Model

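[Quick read]: This paper addresses task failures that occur when robots in home and office environments execute instructions such as "Bring me my cup", because object ownership cannot be reliably inferred from visual features alone. The key to the solution is the Active Ownership Learning (ActOwL) framework: a probabilistic generative model selects the ownership-related questions with the highest information gain so that knowledge is acquired efficiently, and commonsense reasoning from a Large Language Model (LLM) pre-classifies objects as shared or owned so that questions are asked only about owned objects, improving both learning efficiency and ownership identification accuracy.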

Link: https://arxiv.org/abs/2509.12754
Authors: Saki Hashimoto, Shoichi Hasegawa, Tomochika Ishikawa, Akira Taniguchi, Yoshinobu Hagiwara, Lotfi El Hafi, Tadahiro Taniguchi
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Submitted to AROB-ISBC 2026 (Journal Track option)

Abstract:Robots operating in domestic and office environments must understand object ownership to correctly execute instructions such as “Bring me my cup.” However, ownership cannot be reliably inferred from visual features alone. To address this gap, we propose Active Ownership Learning (ActOwL), a framework that enables robots to actively generate and ask ownership-related questions to users. ActOwL employs a probabilistic generative model to select questions that maximize information gain, thereby acquiring ownership knowledge efficiently to improve learning efficiency. Additionally, by leveraging commonsense knowledge from Large Language Models (LLM), objects are pre-classified as either shared or owned, and only owned objects are targeted for questioning. Through experiments in a simulated home environment and a real-world laboratory setting, ActOwL achieved significantly higher ownership clustering accuracy with fewer questions than baseline methods. These findings demonstrate the effectiveness of combining active inference with LLM-guided commonsense reasoning, advancing the capability of robots to acquire ownership knowledge for practical and socially appropriate task execution.
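The idea of choosing the question with the highest information gain can be sketched as follows. This is a deliberately simplified stand-in, not ActOwL's probabilistic generative model: it assumes each object has a categorical belief over candidate owners, that a question has the yes/no form "Does this object belong to this user?", and that answers are noise-free. The function names and example beliefs are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def expected_information_gain(belief, user):
    """Expected entropy reduction from asking whether `user` owns the object.

    belief: dict mapping candidate owner -> probability (sums to 1).
    A "yes" answer removes all uncertainty; a "no" answer renormalizes
    the belief over the remaining candidates.
    """
    p_yes = belief[user]
    if p_yes in (0.0, 1.0):
        return 0.0  # the answer is already known, nothing to learn
    remaining = [p / (1.0 - p_yes) for owner, p in belief.items() if owner != user]
    expected_posterior_entropy = (1.0 - p_yes) * entropy(remaining)
    return entropy(belief.values()) - expected_posterior_entropy


def pick_question(beliefs):
    """Return the (object, user) pair whose yes/no question is most informative."""
    candidates = ((obj, user) for obj, belief in beliefs.items() for user in belief)
    return max(candidates,
               key=lambda q: expected_information_gain(beliefs[q[0]], q[1]))


beliefs = {
    "cup": {"alice": 0.5, "bob": 0.5},   # very uncertain: worth asking about
    "pen": {"alice": 0.9, "bob": 0.1},   # nearly known: low expected gain
}
best_object, best_user = pick_question(beliefs)
```

Under these assumptions the selector prefers the cup, whose owner is maximally uncertain. Objects that an LLM has pre-classified as shared would simply be left out of `beliefs` before calling `pick_question`.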

[AI-42] Force-Modulated Visual Policy for Robot-Assisted Dressing with Arm Motions

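[Quick read]: This paper addresses two core challenges in robot-assisted dressing: handling partial observations caused by visual occlusion, and adapting in real time to the wearer's arm motions during dressing. Prior work often assumes that the human limb stays static, which limits real-world applicability. The key to the solution is to take a policy trained in simulation and fine-tune it in the real world with a small amount of data and multi-modal feedback from vision and force sensing, improving the policy's adaptability to arm motions and its safety. In simulation and in a human study (12 participants, 264 dressing trials), the method successfully dresses two long-sleeve everyday garments and clearly outperforms prior baselines in task completion and user feedback.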

Link: https://arxiv.org/abs/2509.12741
Authors: Alexis Yihong Hao, Yufei Wang, Navin Sriram Ravie, Bharath Hegde, David Held, Zackory Erickson
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CoRL 2025

Abstract:Robot-assisted dressing has the potential to significantly improve the lives of individuals with mobility impairments. To ensure an effective and comfortable dressing experience, the robot must be able to handle challenging deformable garments, apply appropriate forces, and adapt to limb movements throughout the dressing process. Prior work often makes simplifying assumptions – such as static human limbs during dressing – which limits real-world applicability. In this work, we develop a robot-assisted dressing system capable of handling partial observations with visual occlusions, as well as robustly adapting to arm motions during the dressing process. Given a policy trained in simulation with partial observations, we propose a method to fine-tune it in the real world using a small amount of data and multi-modal feedback from vision and force sensing, to further improve the policy’s adaptability to arm motions and enhance safety. We evaluate our method in simulation with simplified articulated human meshes and in a real world human study with 12 participants across 264 dressing trials. Our policy successfully dresses two long-sleeve everyday garments onto the participants while being adaptive to various kinds of arm motions, and greatly outperforms prior baselines in terms of task completion and user feedback. Video are available at this https URL.

[AI-43] Deep Generative and Discriminative Digital Twin endowed with Variational Autoencoder for Unsupervised Predictive Thermal Condition Monitoring of Physical Robots in Industry 6.0 and Society 6.0

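[Quick read]: This paper addresses the risk of thermal saturation and burns caused by motor overheating in industrial robots, particularly in the human-robot collaboration and sustainability settings emphasized by Industry 5.0 and the envisioned Industry 6.0. Shutting the robot down when thermal saturation is imminent, the traditional response, hurts factory productivity and human comfort, and cooling strategies are hard to retrofit after the robot has been acquired. The key to the solution is a smart digital twin endowed with generative AI: Variational Autoencoders (VAEs) model the robot's thermal state, and the VAE reconstruction error defines a "thermal difficulty" score that lets the robot predict, anticipate, and share the thermal feasibility of desired motion profiles, enabling thermally robust operation and task planning without human intervention.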

Link: https://arxiv.org/abs/2509.12740
Authors: Eric Guiffo Kaigom
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: © 2025 the authors. This work has been accepted to the 10th IFAC Symposium on Mechatronic Systems and 14th IFAC Symposium on Robotics, July 15-18, 2025, Paris, France, for publication under a Creative Commons Licence CC-BY-NC-ND

Abstract:Robots are unrelentingly used to achieve operational efficiency in Industry 4.0 along with symbiotic and sustainable assistance for the work-force in Industry 5.0. As resilience, robustness, and well-being are required in anti-fragile manufacturing and human-centric societal tasks, an autonomous anticipation and adaption to thermal saturation and burns due to motors overheating become instrumental for human safety and robot availability. Robots are thereby expected to self-sustain their performance and deliver user experience, in addition to communicating their capability to other agents in advance to ensure fully automated thermally feasible tasks, and prolong their lifetime without human intervention. However, the traditional robot shutdown, when facing an imminent thermal saturation, inhibits productivity in factories and comfort in the society, while cooling strategies are hard to implement after the robot acquisition. In this work, smart digital twins endowed with generative AI, i.e., variational autoencoders, are leveraged to manage thermally anomalous and generate uncritical robot states. The notion of thermal difficulty is derived from the reconstruction error of variational autoencoders. A robot can use this score to predict, anticipate, and share the thermal feasibility of desired motion profiles to meet requirements from emerging applications in Industry 6.0 and Society 6.0.

[AI-44] Deep Learning for Model-Free Prediction of Thermal States of Robot Joint Motors

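[Quick read]: This paper addresses the difficulty of predicting the thermal behavior of robot joint motors, in particular the complexity and uncertainty that conventional modeling faces when a large number of parameters must be derived, identified, and validated. The key to the solution is a model-free, scalable deep neural network architecture that stacks multiple hidden Long Short-Term Memory (LSTM) and feedforward layers and learns from sensed joint torques to predict the temperature dynamics of the joint motors. The paper reports promising prediction results for the joint motors of a redundant robot with seven joints.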

Link: https://arxiv.org/abs/2509.12739
Authors: Trung Kien La, Eric Guiffo Kaigom
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: © 2025 the authors. This work has been accepted to the 10th IFAC Symposium on Mechatronic Systems and 14th IFAC Symposium on Robotics, July 15-18, 2025, Paris, France, for publication under a Creative Commons Licence CC-BY-NC-ND

Abstract:In this work, deep neural networks made up of multiple hidden Long Short-Term Memory (LSTM) and Feedforward layers are trained to predict the thermal behavior of the joint motors of robot manipulators. A model-free and scalable approach is adopted. It accommodates complexity and uncertainty challenges stemming from the derivation, identification, and validation of a large number of parameters of an approximation model that is hardly available. To this end, sensed joint torques are collected and processed to foresee the thermal behavior of joint motors. Promising prediction results of the machine learning based capture of the temperature dynamics of joint motors of a redundant robot with seven joints are presented.

[AI-45] A Graph Machine Learning Approach for Detecting Topological Patterns in Transactional Graphs ICDM2025

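[Quick read]: This paper addresses the limited adaptability of traditional rule-based financial crime detection systems to sophisticated, coordinated criminal behavior. In digital ecosystems, criminals share operational knowledge and techniques across environments (fiat-based, crypto-assets, and so on), so static rules struggle to recognize their hidden patterns. The key to the solution is an approach that combines graph machine learning with network analysis. It builds transactional graphs through a four-step preprocessing framework: (i) extracting graph structures, (ii) using data temporality to manage large node sets, (iii) detecting communities, and (iv) applying automatic labeling strategies to generate weak ground-truth labels, which compensates for the sparse, unlabeled nature of traditional financial datasets. Graph Autoencoders (GAE) are then used to distinguish among well-known topological patterns. Preliminary results indicate that this topology-driven pattern recognition is effective for detecting complex financial crime schemes and offers a promising alternative to rule-based systems.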

Link: https://arxiv.org/abs/2509.12730
Authors: Francesco Zola, Jon Ander Medina, Andrea Venturi, Amaia Gil, Raul Orduna
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Paper accepted @ Workshop on AI for Financial Crime Fight (AI4FCF @ ICDM 2025)

Abstract:The rise of digital ecosystems has exposed the financial sector to evolving abuse and criminal tactics that share operational knowledge and techniques both within and across different environments (fiat-based, crypto-assets, etc.). Traditional rule-based systems lack the adaptability needed to detect sophisticated or coordinated criminal behaviors (patterns), highlighting the need for strategies that analyze actors’ interactions to uncover suspicious activities and extract their modus operandi. For this reason, in this work, we propose an approach that integrates graph machine learning and network analysis to improve the detection of well-known topological patterns within transactional graphs. However, a key challenge lies in the limitations of traditional financial datasets, which often provide sparse, unlabeled information that is difficult to use for graph-based pattern analysis. Therefore, we firstly propose a four-step preprocessing framework that involves (i) extracting graph structures, (ii) considering data temporality to manage large node sets, (iii) detecting communities within, and (iv) applying automatic labeling strategies to generate weak ground-truth labels. Then, once the data is processed, Graph Autoencoders are implemented to distinguish among the well-known topological patterns. Specifically, three different GAE variants are implemented and compared in this analysis. Preliminary results show that this pattern-focused, topology-driven method is effective for detecting complex financial crime schemes, offering a promising alternative to conventional rule-based detection systems.

[AI-46] Unbiased Online Curvature Approximation for Regularized Graph Continual Learning

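[Quick read]: This paper addresses catastrophic forgetting in Graph Continual Learning (GCL), where a model must retain old knowledge without storing historical data, focusing on the replay-free, class-incremental setting. The key to the solution is a general regularization framework built on the curved parameter space induced by the Fisher Information Matrix (FIM), together with a new unbiased online curvature approximation that directly estimates the regularization term of the full FIM without explicitly computing or storing the FIM itself. The method captures the loss landscape online while new tasks are being learned, which yields a better trade-off between stability (retaining old knowledge) and plasticity (acquiring new knowledge).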

Link: https://arxiv.org/abs/2509.12727
Authors: Jie Yin, Ke Sun, Han Wu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Graph continual learning (GCL) aims to learn from a continuous sequence of graph-based tasks. Regularization methods are vital for preventing catastrophic forgetting in GCL, particularly in the challenging replay-free, class-incremental setting, where each task consists of a set of unique classes. In this work, we first establish a general regularization framework for GCL based on the curved parameter space induced by the Fisher information matrix (FIM). We show that the dominant Elastic Weight Consolidation (EWC) and its variants are a special case within this framework, using a diagonal approximation of the empirical FIM based on parameters from previous tasks. To overcome their limitations, we propose a new unbiased online curvature approximation of the full FIM based on the model’s current learning state. Our method directly estimates the regularization term in an online manner without explicitly evaluating and storing the FIM itself. This enables the model to better capture the loss landscape during learning new tasks while retaining the knowledge learned from previous tasks. Extensive experiments on three graph datasets demonstrate that our method significantly outperforms existing regularization-based methods, achieving a superior trade-off between stability (retaining old knowledge) and plasticity (acquiring new knowledge).
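For readers unfamiliar with the baseline being generalized, the Elastic Weight Consolidation penalty is commonly written as below. The notation is the standard textbook form and is an assumption here, not copied from the paper: $\theta^{*}$ denotes the parameters after the previous task, $F_i$ the $i$-th diagonal entry of the empirical FIM, and $\lambda$ the regularization strength.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{new}}(\theta)
  + \frac{\lambda}{2} \sum_{i} F_{i}\,\bigl(\theta_{i} - \theta_{i}^{*}\bigr)^{2}
```

The diagonal sum is the special case of the quadratic form $\frac{\lambda}{2}\,(\theta - \theta^{*})^{\top} F\,(\theta - \theta^{*})$ in which off-diagonal entries of $F$ are dropped. The abstract's proposal targets the full-FIM version of this term while estimating it online rather than storing $F$; the paper's actual estimator is not reproduced here.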

[AI-47] Joint AoI and Handover Optimization in Space-Air-Ground Integrated Network

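[Quick read]: This paper addresses the intermittent coverage and limited communication windows that orbital dynamics impose on Low Earth Orbit (LEO) satellite communication in remote areas and emergency scenarios, with the goal of maintaining service continuity while serving users with different priorities. The key to the solution is an Age of Information (AoI)-aware Space-Air-Ground Integrated Network (SAGIN) architecture in which a High-Altitude Platform (HAP) acts as an intelligent relay between LEO satellites and ground terminals, using hybrid Free-Space Optical (FSO) links for satellite-to-HAP communication and reliable Radio Frequency (RF) links for HAP-to-ground transmission. Transmit power allocation and satellite selection are jointly optimized to minimize AoI and handover frequency. To solve this highly dynamic, non-convex problem with time-coupled constraints, the authors propose a Diffusion Model (DM)-enhanced dueling double deep Q-network with action decomposition and state transformer encoder (DD3QN-AS), which uses a Transformer encoder for temporal feature extraction and a DM-based latent prompt generative module that refines state-action representations through conditional denoising.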

Link: https://arxiv.org/abs/2509.12716
Authors: Zifan Lang, Guixia Liu, Geng Sun, Jiahui Li, Jiacheng Wang, Weijie Yuan, Dusit Niyato, Dong In Kim
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the widespread deployment of terrestrial networks, providing reliable communication services to remote areas and maintaining connectivity during emergencies remains challenging. Low Earth orbit (LEO) satellite constellations offer promising solutions with their global coverage capabilities and reduced latency, yet struggle with intermittent coverage and limited communication windows due to orbital dynamics. This paper introduces an age of information (AoI)-aware space-air-ground integrated network (SAGIN) architecture that leverages a high-altitude platform (HAP) as intelligent relay between the LEO satellites and ground terminals. Our three-layer design employs hybrid free-space optical (FSO) links for high-capacity satellite-to-HAP communication and reliable radio frequency (RF) links for HAP-to-ground transmission, and thus addressing the temporal discontinuity in LEO satellite coverage while serving diverse user priorities. Specifically, we formulate a joint optimization problem to simultaneously minimize the AoI and satellite handover frequency through optimal transmit power distribution and satellite selection decisions. This highly dynamic, non-convex problem with time-coupled constraints presents significant computational challenges for traditional approaches. To address these difficulties, we propose a novel diffusion model (DM)-enhanced dueling double deep Q-network with action decomposition and state transformer encoder (DD3QN-AS) algorithm that incorporates transformer-based temporal feature extraction and employs a DM-based latent prompt generative module to refine state-action representations through conditional denoising. Simulation results highlight the superior performance of the proposed approach compared with policy-based methods and some other deep reinforcement learning (DRL) benchmarks.

[AI-48] Instance-level Randomization: Toward More Stable LLM Evaluations EMNLP2025

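[Quick read]: This paper addresses instability in the evaluation of Large Language Models (LLMs): small changes in random factors such as few-shot examples can cause drastic fluctuations in scores and model rankings, and different models may favor different settings of those factors, which makes comparisons under one fixed setting potentially unfair. The key to the solution is Instance-Level Randomization (ILR): all factors that affect the score are randomized independently for every evaluation instance, multiple experiments are run, and the averaged score is reported. This reduces the variance induced by random factors and makes model comparisons fairer and more stable. Theoretical analysis and empirical results indicate that ILR reaches a similar level of robustness at less than half the computational cost of previous methods.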

Link: https://arxiv.org/abs/2509.12678
Authors: Yiyang Li, Yonghuang Wu, Ying Luo, Liangtai Sun, Zishu Qin, Lin Qiu, Xuezhi Cao, Xunliang Cai
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by Findings of EMNLP 2025

Abstract:Evaluations of large language models (LLMs) suffer from instability, where small changes of random factors such as few-shot examples can lead to drastic fluctuations of scores and even model rankings. Moreover, different LLMs can have different preferences for a certain setting of random factors. As a result, using a fixed setting of random factors, which is often adopted as the paradigm of current evaluations, can lead to potential unfair comparisons between LLMs. To mitigate the volatility of evaluations, we first theoretically analyze the sources of variance induced by changes in random factors. Targeting these specific sources, we then propose the instance-level randomization (ILR) method to reduce variance and enhance fairness in model comparisons. Instead of using a fixed setting across the whole benchmark in a single experiment, we randomize all factors that affect evaluation scores for every single instance, run multiple experiments and report the averaged score. Theoretical analyses and empirical results demonstrate that ILR can reduce the variance and unfair comparisons caused by random factors, as well as achieve similar robustness level with less than half computational cost compared with previous methods.
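The procedure described in the abstract (randomize every factor per instance, repeat, average) can be sketched in a few lines. This is an illustrative outline rather than the authors' code: `score_fn`, the factor names, and the default of three runs are hypothetical placeholders for whatever a real evaluation harness provides.

```python
import random
from statistics import mean


def ilr_score(score_fn, instances, factor_choices, n_runs=3, seed=0):
    """Average benchmark score under instance-level randomization.

    score_fn(instance, setting) -> float   scores one instance (hypothetical hook)
    factor_choices: dict mapping factor name -> list of allowed values,
                    e.g. few-shot example sets or option orderings
    """
    rng = random.Random(seed)  # seeded so that the evaluation is reproducible
    run_scores = []
    for _ in range(n_runs):
        per_instance = []
        for instance in instances:
            # Draw a fresh setting for every single instance instead of
            # fixing one setting for the whole benchmark.
            setting = {name: rng.choice(options)
                       for name, options in factor_choices.items()}
            per_instance.append(score_fn(instance, setting))
        run_scores.append(mean(per_instance))
    return mean(run_scores)


factors = {"few_shot_set": ["A", "B", "C"], "option_order": ["abcd", "dcba"]}
# Toy scorer: an instance counts as solved only under few-shot set "A".
toy_score = ilr_score(lambda inst, s: float(s["few_shot_set"] == "A"),
                      instances=list(range(50)), factor_choices=factors)
```

With a fixed setting the toy scorer would report either 0.0 or 1.0 depending on which few-shot set happened to be chosen; with per-instance randomization the score lands between the two extremes, which is the variance-reduction effect the paper analyzes.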

[AI-49] Leveraging Intermediate Representations of Time Series Foundation Models for Anomaly Detection

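[Quick read]: This paper addresses anomaly detection in time series, targeting a limitation of existing methods based on Time Series Foundation Models (TSFMs): they rely only on final-layer representations and therefore struggle to capture fine-grained anomaly patterns. The key to the solution is TimeRep, which computes anomaly scores from intermediate-layer representations of the TSFM. Specifically, the anomaly score is the distance between the intermediate representation of the input and those in a reference collection, and a core-set strategy reduces the size of the collection while preserving distributional coverage. To handle concept drift, TimeRep adds an adaptation mechanism that, at inference time, augments the reference collection only with non-redundant intermediate representations, yielding efficient and robust detection.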

Link: https://arxiv.org/abs/2509.12650
Authors: Chan Sik Han, Keon Myung Lee
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures

Abstract:Detecting anomalies in time series data is essential for the reliable operation of many real-world systems. Recently, time series foundation models (TSFMs) have emerged as a powerful tool for anomaly detection. However, existing methods typically rely on the final layer’s representations of TSFMs, computing the anomaly score as a reconstruction or forecasting error via a task-specific head. Instead, we propose TimeRep, a novel anomaly detection approach that leverages the intermediate layer’s representations of TSFMs, computing the anomaly score as the distance between these representations. Given a pre-trained TSFM, TimeRep selects the intermediate layer and patch-token position that yield the most informative representation. TimeRep forms a reference collection of intermediate representations from the training data and applies a core-set strategy to reduce its size while maintaining distributional coverage. During inference, TimeRep computes the anomaly score for incoming data by measuring the distance between its intermediate representations and those of the collection. To address concept drift, TimeRep integrates an adaptation mechanism that, at inference time, augments the collection exclusively with non-redundant intermediate representations from incoming data. We conducted extensive experiments on the UCR Anomaly Archive, which contains 250 univariate time series. TimeRep consistently outperforms a broad spectrum of state-of-the-art baselines, including non-DL, DL, and foundation model-based methods.
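The two mechanical pieces of this pipeline, shrinking a reference collection with a core-set and scoring by distance to it, can be sketched as follows. Several details are assumptions not stated in the abstract: greedy k-center selection is used as the core-set strategy, Euclidean distance to the nearest reference vector is used as the score, and the tiny 2-D tuples stand in for real intermediate-layer representations.

```python
import math


def greedy_coreset(points, k):
    """Greedy k-center selection: repeatedly add the point farthest
    from everything chosen so far, to keep coverage with fewer points."""
    chosen = [points[0]]
    # Distance from every point to its nearest chosen point.
    nearest = [math.dist(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        farthest = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[farthest])
        nearest = [min(d, math.dist(p, points[farthest]))
                   for d, p in zip(nearest, points)]
    return chosen


def anomaly_score(representation, reference):
    """Distance from a representation to its nearest reference vector."""
    return min(math.dist(representation, r) for r in reference)


# Stand-ins for intermediate representations of normal training windows.
train_reps = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
reference = greedy_coreset(train_reps, k=2)

normal_score = anomaly_score((0.0, 0.5), reference)
unusual_score = anomaly_score((5.0, 5.0), reference)
```

A drift-adaptation step in the spirit of the abstract would append an incoming representation to `reference` only when it is not redundant, for example when its score exceeds a small redundancy threshold but stays below the anomaly threshold. Those thresholds are not specified in the source and would need tuning.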

[AI-50] A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs

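[Quick read]: This paper addresses the frequent generation of insecure code by code-generating Large Language Models (LLMs), which poses serious risks to software security. The core of the solution is to improve the security of generated code with Parameter-Efficient Fine-Tuning (PEFT). The key findings: prompt-tuning achieves an 80.86% Overall-Secure-Rate on CodeGen2 16B, a 13.5-point improvement over the 67.28% baseline, and optimizing the decoding strategy through sampling temperature raises the rate further to 87.65%, equivalent to roughly 203,700 fewer vulnerable code snippets per million generated. In addition, prompt and prefix tuning increase robustness against poisoning attacks, with strong performance on typical attack vectors such as CWE-79 (cross-site scripting) and CWE-502 (deserialization of untrusted data), and the results hold consistently across Python and Java.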

Link: https://arxiv.org/abs/2509.12649
Authors: Kiho Lee, Jungkon Kim, Doowon Kim, Hyoungshick Kim
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:Code-generating Large Language Models (LLMs) significantly accelerate software development. However, their frequent generation of insecure code presents serious risks. We present a comprehensive evaluation of seven parameter-efficient fine-tuning (PEFT) techniques, demonstrating substantial gains in secure code generation without compromising functionality. Our research identifies prompt-tuning as the most effective PEFT method, achieving an 80.86% Overall-Secure-Rate on CodeGen2 16B, a 13.5-point improvement over the 67.28% baseline. Optimizing decoding strategies through sampling temperature further elevated security to 87.65%. This equates to a reduction of approximately 203,700 vulnerable code snippets per million generated. Moreover, prompt and prefix tuning increase robustness against poisoning attacks in our TrojanPuzzle evaluation, with strong performance against CWE-79 and CWE-502 attack vectors. Our findings generalize across Python and Java, confirming prompt-tuning’s consistent effectiveness. This study provides essential insights and practical guidance for building more resilient software systems with LLMs.
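The headline numbers can be cross-checked with simple arithmetic. The sketch below assumes that the "203,700 per million" figure compares the 87.65% rate against the 67.28% baseline and that every generated snippet is either secure or vulnerable; both readings are inferred from the numbers, not stated explicitly in the abstract. The function name is illustrative.

```python
def fewer_vulnerable_per_million(baseline_rate, improved_rate):
    """Convert a gain in secure rate (percentage points) into the number
    of vulnerable snippets avoided per one million generations."""
    return round((improved_rate - baseline_rate) / 100 * 1_000_000)


baseline = 67.28         # Overall-Secure-Rate of the untuned model, in percent
prompt_tuned = 80.86     # after prompt-tuning
with_temperature = 87.65 # after also tuning the sampling temperature

gain_prompt_tuning = round(prompt_tuned - baseline, 2)
avoided = fewer_vulnerable_per_million(baseline, with_temperature)
```

Here `gain_prompt_tuning` comes out at 13.58 points, which the abstract reports as a 13.5-point improvement, and `avoided` matches the reported figure of approximately 203,700.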

[AI-51] Large Language Models Imitate Logical Reasoning but at what Cost?

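[Quick read]: This paper examines how the reasoning capability of Large Language Models (LLMs) has evolved and how its computational cost can be reduced. A longitudinal study evaluates frontier LLMs over an eighteen-month period on true-or-false questions from the PrOntoQA dataset, measuring both accuracy and faithfulness to the provided reasoning strategies. The improvement from 2023 to 2024 is attributed mainly to hidden Chain of Thought (CoT) prompting, and the improvement from 2024 to 2025 to the introduction of dedicated "thinking models". The key to the solution is a neuro-symbolic architecture in which LLMs with fewer than 15 billion parameters translate problems into a standardized form, which is then parsed into a program for the SMT solver Z3 to determine the satisfiability of the query. The approach maintains near-perfect reasoning accuracy while significantly reducing computational cost in FLOPs, and the measured FLOPs agree within 10% with the common estimate that inference FLOPs are about twice the product of active parameters and total tokens.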

Link: https://arxiv.org/abs/2509.12645
Authors: Lachlan McGinness, Peter Baumgartner
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: This work has been accepted as a main track paper for publication in the proceedings of the Australasian Joint Conference on Artificial Intelligence 2025 held in Canberra, Australia

Abstract:We present a longitudinal study which evaluates the reasoning capability of frontier Large Language Models over an eighteen month period. We measured the accuracy of three leading models from December 2023, September 2024 and June 2025 on true or false questions from the PrOntoQA dataset and their faithfulness to reasoning strategies provided through in-context learning. The improvement in performance from 2023 to 2024 can be attributed to hidden Chain of Thought prompting. The introduction of thinking models allowed for significant improvement in model performance between 2024 and 2025. We then present a neuro-symbolic architecture which uses LLMs of less than 15 billion parameters to translate the problems into a standardised form. We then parse the standardised forms of the problems into a program to be solved by Z3, an SMT solver, to determine the satisfiability of the query. We report the number of prompt and completion tokens as well as the computational cost in FLOPs for open source models. The neuro-symbolic approach significantly reduces the computational cost while maintaining near perfect performance. The common approximation that the number of inference FLOPs is double the product of the active parameters and total tokens was accurate within 10% for all experiments.
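The FLOPs rule of thumb mentioned at the end of the abstract is easy to state as code. The function below implements only that approximation (two FLOPs per active parameter per token); the model size and token counts in the example are made-up values for illustration and do not come from the paper.

```python
def approx_inference_flops(active_params, prompt_tokens, completion_tokens):
    """Estimate inference cost as 2 x active parameters x total tokens."""
    total_tokens = prompt_tokens + completion_tokens
    return 2 * active_params * total_tokens


# Hypothetical example: a 14-billion-parameter dense model that reads
# a 900-token prompt and writes a 100-token answer.
flops = approx_inference_flops(14_000_000_000, 900, 100)
```

For a mixture-of-experts model, `active_params` would be the parameters actually used per token rather than the total parameter count, which is why the abstract speaks of active parameters.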

[AI-52] Learn to Relax with Large Language Models: Solving Nonlinear Combinatorial Optimization Problems via Bidirectional Coevolution

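[Quick read]: This paper addresses the computational challenges of Nonlinear Combinatorial Optimization Problems (NCOPs) in practice, in particular the multi-modal solution spaces that arise from their nonconvexity and resist efficient optimization. Traditional constraint relaxation relies on expert-driven, iterative design and lacks systematic automation and scalable adaptability, while existing Large Language Model (LLM)-based optimization methods mostly act as passive constraint validators and cannot proactively devise strategies for complex constraint interactions. The paper proposes what it describes as the first end-to-end Automated Constraint Optimization (AutoCO) method. Its core ideas are: structured LLM reasoning generates constraint relaxation strategies that evolve dynamically, with algorithmic principles and executable code tied together through a unified triple-representation scheme; and a novel bidirectional (global-local) coevolution mechanism combines Evolutionary Algorithms for intensive local refinement with Monte Carlo Tree Search for global exploration of the strategy space, balancing intensification and diversification in fragmented solution spaces.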

Link: https://arxiv.org/abs/2509.12643
Authors: Beidan Liu, Zhengqiu Zhu, Chen Gao, Yong Zhao, Wei Qi, Quanjun Yin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Nonlinear Combinatorial Optimization Problems (NCOPs) present a formidable computational hurdle in practice, as their nonconvex nature gives rise to multi-modal solution spaces that defy efficient optimization. Traditional constraint relaxation approaches rely heavily on expert-driven, iterative design processes that lack systematic automation and scalable adaptability. While recent Large Language Model (LLM)-based optimization methods show promise for autonomous problem-solving, they predominantly function as passive constraint validators rather than proactive strategy architects, failing to handle the sophisticated constraint interactions inherent to NCOPs. To address these limitations, we introduce the first end-to-end Automated Constraint Optimization (AutoCO) method, which revolutionizes NCOPs resolution through learning to relax with LLMs. Specifically, we leverage structured LLM reasoning to generate constraint relaxation strategies, which are dynamically evolving with algorithmic principles and executable code through a unified triple-representation scheme. We further establish a novel bidirectional (global-local) coevolution mechanism that synergistically integrates Evolutionary Algorithms for intensive local refinement with Monte Carlo Tree Search for systematic global strategy space exploration, ensuring optimal balance between intensification and diversification in fragmented solution spaces. Finally, comprehensive experiments on three challenging NCOP benchmarks validate AutoCO’s consistent effectiveness and superior performance over the baselines.

[AI-53] DoubleAgents: Exploring Mechanisms of Building Trust with Proactive AI

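[Quick read]: This paper addresses user trust in agentic workflows, that is, whether people are willing to let an AI system make decisions and carry out tasks on their behalf. The core challenge is to build and strengthen human trust in autonomously acting AI systems without sacrificing efficiency. The key to the solution is DoubleAgents, a tool that achieves "trust-by-design" by embedding transparency, controllability, value-aligned policies, state visualization, uncertainty flagging, and a built-in respondent simulation. The simulated environment lets users rehearse repeatedly in risk-free scenarios, adjust policies, and calibrate their reliance, so that they gradually form a reliable picture of the AI agent, which in turn improves adoption and usefulness in real deployments.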

Link: https://arxiv.org/abs/2509.12626
Authors: Tao Long,Xuanming Zhang,Sitong Wang,Zhou Yu,Lydia B Chilton
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: 21 pages, 10 figures

Abstract:Agentic workflows promise efficiency, but adoption hinges on whether people actually trust systems that act on their behalf. We present DoubleAgents, an agentic planning tool that embeds transparency and control through user intervention, value-reflecting policies, rich state visualizations, and uncertainty flagging for human coordination tasks. A built-in respondent simulation generates realistic scenarios, allowing users to rehearse, refine policies, and calibrate their reliance before live use. We evaluate DoubleAgents in a two-day lab study (n=10), two deployments (n=2), and a technical evaluation. Results show that participants initially hesitated to delegate but grew more reliant as they experienced transparency, control, and adaptive learning during simulated cases. Deployment results demonstrate DoubleAgents’ real-world relevance and usefulness, showing that the effort required scaled appropriately with task complexity and contextual data. We contribute trust-by-design patterns and mechanisms for proactive AI – consistency, controllability, and explainability – along with simulation as a safe path to build and calibrate trust over time.

[AI-54] ECG-aBcDe: Overcoming Model Dependence Encoding ECG into a Universal Language for Any LLM

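[Quick read]: This paper addresses three challenges that Large Language Models (LLMs) face in electrocardiogram (ECG) analysis: poor transferability caused by model-specific encoders, the difficulty Transformer architectures have in capturing the time-scale information inherent in ECGs, and a black-box nature that limits clinical interpretability. The key to the solution is ECG-aBcDe, a new ECG encoding method that converts ECG signals into a universal ECG language, so that any pre-trained LLM can be fine-tuned directly without architectural changes, giving a "construct once, use anywhere" capability. In addition, the bidirectional convertibility between ECG and the ECG language supports extracting attention heatmaps from ECG signals, which significantly improves interpretability, and explicitly modeling time-scale information mitigates the Transformer's limitations in learning temporal features.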

Link: https://arxiv.org/abs/2509.12625
Authors: Yong Xia,Jingxuan Li,YeTeng Sun,Jiarui Bu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures

Abstract:Large Language Models (LLMs) hold significant promise for electrocardiogram (ECG) analysis, yet challenges remain regarding transferability, time-scale information learning, and interpretability. Current methods suffer from model-specific ECG encoders, hindering transfer across LLMs. Furthermore, LLMs struggle to capture crucial time-scale information inherent in ECGs due to Transformer limitations. And their black-box nature limits clinical adoption. To address these limitations, we introduce ECG-aBcDe, a novel ECG encoding method that transforms ECG signals into a universal ECG language readily interpretable by any LLM. By constructing a hybrid dataset of ECG language and natural language, ECG-aBcDe enables direct fine-tuning of pre-trained LLMs without architectural modifications, achieving “construct once, use anywhere” capability. Moreover, the bidirectional convertibility between ECG and ECG language of ECG-aBcDe allows for extracting attention heatmaps from ECG signals, significantly enhancing interpretability. Finally, ECG-aBcDe explicitly represents time-scale information, mitigating Transformer limitations. This work presents a new paradigm for integrating ECG analysis with LLMs. Compared with existing methods, our method achieves competitive performance on ROUGE-L and METEOR. Notably, it delivers significant improvements in the BLEU-4, with improvements of 2.8 times and 3.9 times in in-dataset and cross-dataset evaluations, respectively, reaching scores of 42.58 and 30.76. These results provide strong evidence for the feasibility of the new paradigm.

[AI-55] Mob-based cattle weight gain forecasting using ML models

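[Quick read]: This paper addresses the problem of forecasting mob-based cattle weight gain (MB CWG) on large livestock farms, in order to support feeding strategies, informed breeding decisions, and lower exposure to climate variability and market risk. The key to the solution is a Random Forest (RF) forecasting model built on historical data from the Charles Sturt University Farm (756 samples from 108 mob-based cattle, together with weather factors such as rainfall and temperature) combined with cattle age. Experiments show that the RF model performs best when weather and age factors are both included, reaching an R² of 0.973, a root mean square error (RMSE) of 0.040, and a mean absolute error (MAE) of 0.033, and outperforming the Support Vector Regression (SVR) and Long Short-Term Memory (LSTM) models. The study also develops an open-source automated preprocessing tool that generates a benchmark dataset for MB CWG predictive models, providing standardized data for follow-up analysis.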

Link: https://arxiv.org/abs/2509.12615
Authors: Muhammad Riaz Hasib Hossain,Rafiqul Islam,Shawn R McGrath,Md Zahidul Islam,David Lamb
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Forecasting mob based cattle weight gain (MB CWG) may benefit large livestock farms, allowing farmers to refine their feeding strategies, make educated breeding choices, and reduce risks linked to climate variability and market fluctuations. In this paper, a novel technique termed MB CWG is proposed to forecast the one month advanced weight gain of herd based cattle using historical data collected from the Charles Sturt University Farm. This research employs a Random Forest (RF) model, comparing its performance against Support Vector Regression (SVR) and Long Short Term Memory (LSTM) models for monthly weight gain prediction. Four datasets were used to evaluate the performance of models, using 756 sample data from 108 herd-based cattle, along with weather data (rainfall and temperature) influencing CWG. The RF model performs better than the SVR and LSTM models across all datasets, achieving an R^2 of 0.973, RMSE of 0.040, and MAE of 0.033 when both weather and age factors were included. The results indicate that including both weather and age factors significantly improves the accuracy of weight gain predictions, with the RF model outperforming the SVR and LSTM models in all scenarios. These findings demonstrate the potential of RF as a robust tool for forecasting cattle weight gain in variable conditions, highlighting the influence of age and climatic factors on herd based weight trends. This study has also developed an innovative automated pre processing tool to generate a benchmark dataset for MB CWG predictive models. The tool is publicly available on GitHub and can assist in preparing datasets for current and future analytical research…
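
The three scores quoted above (R², RMSE, MAE) are standard regression metrics. As a minimal sketch of how they are computed, assuming nothing about the authors' pipeline, the function name `regression_metrics` and the sample numbers below are made up for illustration:

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return R^2, RMSE and MAE for paired observations and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observations are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Hypothetical weight-gain values, purely illustrative (not from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
metrics = regression_metrics(observed, predicted)
```

A constant offset of one unit gives RMSE and MAE of 1.0 here, while R² drops to about 0.2, which shows why the three numbers are usually reported together.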

[AI-56] GBV-SQL: Guided Generation and SQL2Text Back-Translation Validation for Multi-Agent Text2SQL

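[Quick read]: This paper addresses the semantic gap in LLM-based Text2SQL generation: generated SQL is often syntactically valid yet fails to reflect the user's intent. The key to the solution is GBV-SQL, a multi-agent framework whose core mechanism is Guided Generation with SQL2Text Back-translation Validation: a specialized agent translates the generated SQL back into natural language to verify that its logic is consistent with the original question. The study also shows that current evaluation is distorted by systematic flaws in the benchmark data itself (termed "Gold Errors"), and that removing these flawed examples markedly raises measured model performance, which underlines the importance of high-quality dataset construction for model evaluation.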

Link: https://arxiv.org/abs/2509.12612
Authors: Daojun Chen,Xi Wang,Shenyuan Ren,Qingzhi Ma,Pengpeng Zhao,An Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:While Large Language Models have significantly advanced Text2SQL generation, a critical semantic gap persists where syntactically valid queries often misinterpret user intent. To mitigate this challenge, we propose GBV-SQL, a novel multi-agent framework that introduces Guided Generation with SQL2Text Back-translation Validation. This mechanism uses a specialized agent to translate the generated SQL back into natural language, which verifies its logical alignment with the original question. Critically, our investigation reveals that current evaluation is undermined by a systemic issue: the poor quality of the benchmarks themselves. We introduce a formal typology for “Gold Errors”, which are pervasive flaws in the ground-truth data, and demonstrate how they obscure true model performance. On the challenging BIRD benchmark, GBV-SQL achieves 63.23% execution accuracy, a 5.8% absolute improvement. After removing flawed examples, GBV-SQL achieves 96.5% (dev) and 97.6% (test) execution accuracy on the Spider benchmark. Our work offers both a robust framework for semantic validation and a critical perspective on benchmark integrity, highlighting the need for more rigorous dataset curation.
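
The "execution accuracy" figures above compare what a predicted query returns with what the gold query returns. The sketch below illustrates that idea with Python's built-in `sqlite3`. It assumes an order-insensitive comparison of result rows; the official BIRD and Spider evaluation scripts may differ in details, and the table and queries are invented for the example.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries run and return the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


# Hypothetical toy database, not taken from any benchmark.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 24), (3, 'Cy', 45);
    """
)
gold = "SELECT name FROM singer WHERE age > 30"
same_meaning = "SELECT name FROM singer WHERE NOT age <= 30 ORDER BY name DESC"
wrong_intent = "SELECT name FROM singer WHERE age < 30"

print(execution_match(conn, same_meaning, gold))
print(execution_match(conn, wrong_intent, gold))
```

Note that the check trusts `gold_sql` completely: if the gold query is itself wrong, a correct prediction is scored as a miss, which is the kind of distortion the paper attributes to "Gold Errors".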

[AI-57] Analogy-Driven Financial Chain-of-Thought (AD-FCoT): A Prompting Approach for Financial Sentiment Analysis

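[Quick read]: This paper addresses the fact that existing methods for financial news sentiment analysis struggle to capture complex economic context and lack a transparent reasoning process, which undermines their reliability. The key to the solution is the Analogy-Driven Financial Chain-of-Thought (AD-FCoT) prompting framework, which guides Large Language Models (LLMs) to compare the current event with similar historical scenarios whose outcomes are known and embeds these analogies into a structured, step-by-step reasoning chain, improving both the model's grasp of financial context and its interpretability. The method needs no additional training or fine-tuning: through prompting alone it draws on the model's internal knowledge to produce reasoning paths consistent with domain expertise, and it improves sentiment classification accuracy and correlation with market returns.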

Link: https://arxiv.org/abs/2509.12611
Authors: Anmol Singhal,Navya Singhal
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: IEEE AIxB 2025

Abstract:Financial news sentiment analysis is crucial for anticipating market movements. With the rise of AI techniques such as Large Language Models (LLMs), which demonstrate strong text understanding capabilities, there has been renewed interest in enhancing these systems. Existing methods, however, often struggle to capture the complex economic context of news and lack transparent reasoning, which undermines their reliability. We propose Analogy-Driven Financial Chain-of-Thought (AD-FCoT), a prompting framework that integrates analogical reasoning with chain-of-thought (CoT) prompting for sentiment prediction on historical financial news. AD-FCoT guides LLMs to draw parallels between new events and relevant historical scenarios with known outcomes, embedding these analogies into a structured, step-by-step reasoning chain. To our knowledge, this is among the first approaches to explicitly combine analogical examples with CoT reasoning in finance. Operating purely through prompting, AD-FCoT requires no additional training data or fine-tuning and leverages the model’s internal financial knowledge to generate rationales that mirror human analytical reasoning. Experiments on thousands of news articles show that AD-FCoT outperforms strong baselines in sentiment classification accuracy and achieves substantially higher correlation with market returns. Its generated explanations also align with domain expertise, providing interpretable insights suitable for real-world financial analysis.

[AI-58] ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

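[Quick read]: This paper addresses the semantic-understanding challenge that arises when modern data queries run over large numbers of unstructured documents: traditional value-based predicates cannot judge the deeper meaning of document content, and Large Language Models (LLMs), despite strong zero-shot capabilities, carry a high inference cost. The key to the solution is ScaleDoc, a system that decouples predicate execution into an offline representation phase and an optimized online filtering phase. Offline, an LLM generates a semantic representation for each document; online, a lightweight proxy model trained on these representations filters out the majority of documents and forwards only the ambiguous cases to the LLM for a final decision. The system adds two core innovations: a contrastive-learning-based training framework that makes the proxy model produce reliable decision scores, and an adaptive cascade mechanism that determines a filtering policy meeting a given accuracy target. In the reported evaluation this yields more than a 2× end-to-end speedup and up to 85% fewer LLM invocations.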

Link: https://arxiv.org/abs/2509.12610
Authors: Hengrui Zhang,Yulong Hui,Yihao Liu,Huanchen Zhang
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. Given enormous documents and ad-hoc queries, while Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
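
To make the cascade idea concrete, here is a minimal sketch of a two-threshold filter: low proxy scores are rejected, high scores are accepted, and only the band in between goes to the LLM. This is not ScaleDoc's actual algorithm. The brute-force threshold search, the assumption that the LLM is always right, and the names `choose_thresholds` and `cascade_predict` are all illustrative.

```python
import math
from typing import Callable, Sequence, Tuple


def choose_thresholds(
    scores: Sequence[float], labels: Sequence[bool], target_accuracy: float
) -> Tuple[int, float, float]:
    """Return (forwarded, low, high) minimising LLM calls on a calibration set.

    Documents with score <= low are rejected, score >= high are accepted,
    and anything strictly in between is forwarded to the LLM, which this
    sketch assumes always answers correctly. Expects target_accuracy <= 1.
    """
    n = len(scores)
    candidates = [-math.inf] + sorted(set(scores)) + [math.inf]
    best = None
    for i, low in enumerate(candidates):
        for high in candidates[i + 1:]:
            # Errors: true documents rejected, or false documents accepted.
            errors = sum(
                1 for s, y in zip(scores, labels)
                if (s <= low and y) or (s >= high and not y)
            )
            forwarded = sum(1 for s in scores if low < s < high)
            accurate_enough = 1 - errors / n >= target_accuracy
            if accurate_enough and (best is None or forwarded < best[0]):
                best = (forwarded, low, high)
    return best  # (-inf, +inf) forwards everything with zero errors


def cascade_predict(score: float, low: float, high: float,
                    ask_llm: Callable[[], bool]) -> bool:
    """Decide cheaply when the proxy is confident, otherwise call the LLM."""
    if score <= low:
        return False
    if score >= high:
        return True
    return ask_llm()


# Hypothetical calibration data: proxy scores with ground-truth predicate labels.
calib_scores = [0.05, 0.10, 0.45, 0.55, 0.90, 0.95]
calib_labels = [False, False, True, False, True, True]
strict = choose_thresholds(calib_scores, calib_labels, target_accuracy=1.0)
relaxed = choose_thresholds(calib_scores, calib_labels, target_accuracy=0.8)
```

With a 100% target the two overlapping middle documents must be forwarded; relaxing the target to 80% lets the proxy decide everything on this toy set, which is the accuracy-versus-cost trade-off the adaptive cascade is meant to control.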

[AI-59] A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction

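[Quick read]: This paper addresses the difficulty existing AI models have in fully exploiting the rich information in multimodal data, which leaves them with poorly generalizable representations. To tackle this, the authors propose MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model built on collaboration among functionally diverse experts, which integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. The key points are: instead of a conventional multi-expert module design, it uses several functionally distinct expert networks to capture both cross-cancer commonalities and cancer-specific information, and it couples contrastive and supervised learning to improve generalization on data from 11,799 patients across 30 cancer types. MICE outperforms unimodal models and state-of-the-art multi-expert-based multimodal models, improving the C-index by 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, and it is also data-efficient, providing a scalable foundation for pan-cancer prognosis prediction.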

Link: https://arxiv.org/abs/2509.12600
Authors: Huajun Zhou,Fengtao Zhou,Jiabo Ma,Yingxue Xu,Xi Wang,Xiuming Zhang,Li Liang,Zhenhui Li,Hao Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 27 pages, 7 figures

Abstract:Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE’s generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, respectively. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.
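
The C-index used above measures how often a model ranks patients' risks in the same order as their observed survival. A minimal sketch of Harrell's concordance index follows, assuming right-censored data and ties in risk counted as half. The function name and toy data are illustrative, and the paper may compute the metric with a different library or tie rule.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[bool],
                      risks: Sequence[float]) -> float:
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's event or censoring time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event had the higher predicted risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties in risk get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: survival times (months), event flags, predicted risks.
times = [2.0, 4.0, 6.0, 8.0]
events = [True, True, False, True]
good_risks = [0.9, 0.7, 0.3, 0.1]
c_good = concordance_index(times, events, good_risks)
c_bad = concordance_index(times, events, list(reversed(good_risks)))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering, so the percentage-point gains quoted in the abstract are improvements on that scale.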

[AI-60] Redefining CX with Agentic AI: Minerva CQ Case Study

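[Quick read]: This paper addresses persistently poor customer experience (CX) in contact centers, seen in high average handling time (AHT), low first-call resolution, and poor customer satisfaction (CSAT). A core cause is the cognitive load on agents, who must switch between fragmented systems, troubleshoot manually, and frequently put customers on hold. Conventional AI assist tools are mostly reactive, relying on static rules, simple prompting, or retrieval-augmented generation (RAG) without deeper contextual reasoning. The key solution is Agentic AI: goal-driven, autonomous, tool-using systems that proactively support agents in real time by identifying customer intent, triggering modular workflows, maintaining an evolving context, and adapting to the conversation state. Using Minerva CQ, a product deployed in voice-based customer support, as a case study, the paper shows how real-time transcription, intent and sentiment detection, entity recognition, contextual retrieval, dynamic customer profiling, and partial conversational summaries are integrated to enable proactive workflows and continuous context-building, improving agent efficiency and customer experience.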

Link: https://arxiv.org/abs/2509.12589
Authors: Garima Agrawal,Riccardo De Maria,Kiran Davuluri,Daniele Spera,Charlie Read,Cosimo Spera,Jack Garrett,Don Miller
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite advances in AI for contact centers, customer experience (CX) continues to suffer from high average handling time (AHT), low first-call resolution, and poor customer satisfaction (CSAT). A key driver is the cognitive load on agents, who must navigate fragmented systems, troubleshoot manually, and frequently place customers on hold. Existing AI-powered agent-assist tools are often reactive, driven by static rules, simple prompting, or retrieval-augmented generation (RAG) without deeper contextual reasoning. We introduce Agentic AI: goal-driven, autonomous, tool-using systems that proactively support agents in real time. Unlike conventional approaches, Agentic AI identifies customer intent, triggers modular workflows, maintains evolving context, and adapts dynamically to conversation state. This paper presents a case study of Minerva CQ, a real-time Agent Assist product deployed in voice-based customer support. Minerva CQ integrates real-time transcription, intent and sentiment detection, entity recognition, contextual retrieval, dynamic customer profiling, and partial conversational summaries, enabling proactive workflows and continuous context-building. Deployed in live production, Minerva CQ acts as an AI co-pilot, delivering measurable improvements in agent efficiency and customer experience across multiple deployments.

[AI-61] zELO: ELO-inspired Training Method for Rerankers and Embedding Models

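[Quick read]: This paper addresses the dependence of retrieval ranking on large amounts of annotated data and proposes zELO, a training method that needs no human annotation. Its core idea rests on the analysis that ranking tasks are statically equivalent to a Thurstone model, which is used to optimize retrieval performance. The key to the solution is end-to-end training on unannotated query-document pairs (112,000 queries with 100 documents each), which produces open-weight rerankers (zerank-1 and zerank-1-small) that outperform closed-source rerankers in domains including finance, legal, code, and STEM while keeping their zero-shot cross-domain generalization.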

Link: https://arxiv.org/abs/2509.12541
Authors: Nicholas Pipitone,Ghita Houir Alami,Advaith Avadhanam,Anton Kaminskyi,Ashley Khoo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 sections, 17 figures and tables

Abstract:We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statically equivalent to a Thurstone model. Based on the zELO method, we use unsupervised data in order to train a suite of state-of-the-art open-weight reranker models: zerank-1 and zerank-1-small. These models achieve the highest retrieval scores in multiple domains, including finance, legal, code, and STEM, outperforming closed-source proprietary rerankers on both NDCG@10 and Recall. These models also demonstrate great versatility, maintaining their 0-shot performance on out-of-domain and private customer datasets. The training data included 112,000 queries and 100 documents per query, and was trained end-to-end from unannotated queries and documents in less than 10,000 H100-hours.
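
The abstract does not spell out how the Thurstone model is used, so the following is only a generic illustration of the underlying idea: recovering a latent score per document from pairwise "A beats B" outcomes. It assumes the simple Thurstone Case V form, P(i beats j) = Φ(s_i − s_j), fitted by plain gradient ascent; the function name, data, and hyperparameters are hypothetical and are not zELO's pipeline.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def fit_thurstone(n_items: int, comparisons: Sequence[Tuple[int, int]],
                  steps: int = 500, lr: float = 0.05) -> List[float]:
    """Fit latent scores from (winner, loser) pairs by maximum likelihood.

    Model assumption: P(winner beats loser) = Phi(score_winner - score_loser).
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for winner, loser in comparisons:
            diff = scores[winner] - scores[loser]
            prob = max(_STD_NORMAL.cdf(diff), 1e-12)  # guard against division by zero
            g = _STD_NORMAL.pdf(diff) / prob          # d/d(diff) of log Phi(diff)
            grad[winner] += g
            grad[loser] -= g
        scores = [s + lr * g for s, g in zip(scores, grad)]
        mean = sum(scores) / n_items
        scores = [s - mean for s in scores]           # scores are only defined up to a shift
    return scores


# Hypothetical pairwise judgements among three documents (indices 0, 1, 2).
comparisons = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)] + [(0, 2)] * 4
scores = fit_thurstone(3, comparisons)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
```

Sorting by the fitted scores gives a full ranking from noisy pairwise preferences, which is the property that makes pairwise judgements usable as a training signal for rerankers.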

[AI-62] Pre-trained Visual Representations Generalize Where it Matters in Model-Based Reinforcement Learning

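[Quick read]: This paper addresses the poor generalization of policy networks in model-based reinforcement learning (MBRL) under visual domain shifts. Prior work shows that pre-trained vision models (PVMs) improve robustness in model-free reinforcement learning (MFRL) but has found them ineffective in MBRL. The key contribution is to demonstrate that PVMs are effective in MBRL: under severe visual domain shifts they perform much better than a baseline trained from scratch, and partially fine-tuning the PVM maintains the highest average task performance under the most extreme shifts, which supports wider use of PVMs in model-based robotic visual policy learning.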

Link: https://arxiv.org/abs/2509.12531
Authors: Scott Jones,Liyou Zhou,Sebastian W. Pattinson
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:In visuomotor policy learning, the control policy for the robotic agent is derived directly from visual inputs. The typical approach, where a policy and vision encoder are trained jointly from scratch, generalizes poorly to novel visual scene changes. Using pre-trained vision models (PVMs) to inform a policy network improves robustness in model-free reinforcement learning (MFRL). Recent developments in Model-based reinforcement learning (MBRL) suggest that MBRL is more sample-efficient than MFRL. However, counterintuitively, existing work has found PVMs to be ineffective in MBRL. Here, we investigate PVM’s effectiveness in MBRL, specifically on generalization under visual domain shifts. We show that, in scenarios with severe shifts, PVMs perform much better than a baseline model trained from scratch. We further investigate the effects of varying levels of fine-tuning of PVMs. Our results show that partial fine-tuning can maintain the highest average task performance under the most extreme distribution shifts. Our results demonstrate that PVMs are highly successful in promoting robustness in visual policy learning, providing compelling evidence for their wider adoption in model-based robotic learning applications.

[AI-63] A Dimensionality-Reduced XAI Framework for Roundabout Crash Severity Insights

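[Quick read]: This paper addresses the complexity of crash severity at roundabouts in traffic engineering, namely how different environmental conditions and crash types jointly produce variation in severity. The key to the solution is a two-step, explainable analysis workflow: Cluster Correspondence Analysis (CCA) first identifies four crash patterns formed by co-occurring factors; a tree-based severity model is then built, and SHAP (SHapley Additive exPlanations) values quantify the drivers of injury within and across patterns. The method shows that severity is higher in specific settings (darkness, wet surfaces, and higher posted speeds coinciding with fixed-object or angle events) and highlights mechanisms such as failure to yield at entries, improper maneuvers in multi-lane circulation, and rear-end crashes during slow-downs, providing an actionable basis for site screening, countermeasure selection, and audit-ready reporting.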

Link: https://arxiv.org/abs/2509.12524
Authors: Rohit Chakraborty,Subasish Das
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This is the author’s preprint version of a paper accepted for presentation at HICSS 59 (Hawaii International Conference on System Sciences), 2026, Hawaii, USA. The final published version will appear in the official conference proceedings. Conference site: this https URL

Abstract:Roundabouts reduce severe crashes, yet risk patterns vary by conditions. This study analyzes 2017-2021 Ohio roundabout crashes using a two-step, explainable workflow. Cluster Correspondence Analysis (CCA) identifies co-occurring factors and yields four crash patterns. A tree-based severity model is then interpreted with SHAP to quantify drivers of injury within and across patterns. Results show higher severity when darkness, wet surfaces, and higher posted speeds coincide with fixed-object or angle events, and lower severity in clear, low-speed settings. Pattern-specific explanations highlight mechanisms at entries (fail-to-yield, gap acceptance), within multi-lane circulation (improper maneuvers), and during slow-downs (rear-end). The workflow links pattern discovery with case-level explanations, supporting site screening, countermeasure selection, and audit-ready reporting. The contribution to Information Systems is a practical template for usable XAI in public safety analytics.

[AI-64] Physical Complexity of a Cognitive Artifact

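[Quick read]: This paper asks how cognitive strategies reduce task difficulty, analyzing the cognitive solution of a physical puzzle (the Soma Cube) from the viewpoint of computational complexity. The key to the solution is a "Principle of Materiality" that maps concepts from computational complexity theory onto cognitive strategies and refines a trial-and-error search layer by layer: preprocessing (cognitive chunking), value ordering (cognitive free-sorting), variable ordering (cognitive scaffolding), and pruning (cognitive inference), systematically reducing effective time complexity. The paper further argues that competent use of artifacts and physical constraints lowers the computational burden of a problem, and proposes a model of intelligence as a library of algorithms that recruits the capabilities of both mind and matter.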

Link: https://arxiv.org/abs/2509.12495
Authors: Gülce Kardeş,David Krakauer,Joshua Grochow
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Cognitive science and theoretical computer science both seek to classify and explain the difficulty of tasks. Mechanisms of intelligence are those that reduce task difficulty. Here we map concepts from the computational complexity of a physical puzzle, the Soma Cube, onto cognitive problem-solving strategies through a “Principle of Materiality”. By analyzing the puzzle’s branching factor, measured through search tree outdegree, we quantitatively assess task difficulty and systematically examine how different strategies modify complexity. We incrementally refine a trial-and-error search by layering preprocessing (cognitive chunking), value ordering (cognitive free-sorting), variable ordering (cognitive scaffolding), and pruning (cognitive inference). We discuss how the competent use of artifacts reduces effective time complexity by exploiting physical constraints and propose a model of intelligence as a library of algorithms that recruit the capabilities of both mind and matter.
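
As a rough illustration of how pruning shrinks a trial-and-error search, the sketch below counts search-tree nodes with and without early constraint checks. It deliberately swaps in the much simpler 6-queens puzzle rather than the Soma Cube, and none of it is the authors' code; the function names are invented for the example.

```python
from typing import List, Tuple


def safe(cols: List[int], k: int) -> bool:
    """True if the queen in row k attacks none of the queens in rows 0..k-1."""
    return all(
        cols[k] != cols[r] and abs(cols[k] - cols[r]) != k - r
        for r in range(k)
    )


def search(n: int, prune: bool) -> Tuple[int, int]:
    """Return (solutions, nodes_visited) for n-queens backtracking."""
    nodes = 0

    def visit(cols: List[int]) -> int:
        nonlocal nodes
        nodes += 1
        if len(cols) == n:
            # Without pruning, consistency is only checked at the leaves.
            return 1 if prune or all(safe(cols, k) for k in range(n)) else 0
        found = 0
        for c in range(n):
            cols.append(c)
            # With pruning, an inconsistent partial placement is abandoned at once.
            if not prune or safe(cols, len(cols) - 1):
                found += visit(cols)
            cols.pop()
        return found

    return visit([]), nodes


blind_solutions, blind_nodes = search(6, prune=False)
pruned_solutions, pruned_nodes = search(6, prune=True)
print(blind_nodes, pruned_nodes)
```

Both runs find the same solutions, but the pruned search visits far fewer nodes because each early check lowers the effective outdegree of the tree, which is the quantity the paper uses to measure difficulty.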

[AI-65] Empowering Clinical Trial Design through AI: A Randomized Evaluation of PowerGPT

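[Quick read]: This paper addresses sample size calculation for statistical power analysis in clinical research, a step whose complexity and heavy reliance on statistical expertise often make it a barrier for researchers. The key to the solution is PowerGPT, an AI-driven system that integrates Large Language Models (LLMs) with statistical engines to automate test selection and sample size estimation. It improves task completion rates and accuracy, shortens completion time, and performs consistently across statistical tests and user backgrounds (including non-statisticians), helping to bridge gaps in statistical expertise and making clinical trial design more accessible and efficient.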

Link: https://arxiv.org/abs/2509.12471
Authors: Yiwen Lu,Lu Li,Dazheng Zhang,Xinyao Jian,Tingyin Wang,Siqi Chen,Yuqing Lei,Jiayi Tong,Zhaohan Xi,Haitao Chu,Chongliang Luo,Alexis Ogdie,Brian Athey,Alparslan Turan,Michael Abramoff,Joseph C Cappelleri,Hua Xu,Yun Lu,Jesse Berlin,Daniel I. Sessler,David A. Asch,Xiaoqian Jiang,Yong Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Sample size calculations for power analysis are critical for clinical research and trial design, yet their complexity and reliance on statistical expertise create barriers for many researchers. We introduce PowerGPT, an AI-powered system integrating large language models (LLMs) with statistical engines to automate test selection and sample size estimation in trial design. In a randomized trial to evaluate its effectiveness, PowerGPT significantly improved task completion rates (99.3% vs. 88.9% for test selection, 99.3% vs. 77.8% for sample size calculation) and accuracy (94.1% vs. 55.4% in sample size estimation, p < 0.001), while reducing average completion time (4.0 vs. 9.3 minutes, p < 0.001). These gains were consistent across various statistical tests and benefited both statisticians and non-statisticians, bridging expertise gaps. Already under deployment across multiple institutions, PowerGPT represents a scalable AI-driven approach that enhances accessibility, efficiency, and accuracy in statistical power analysis for clinical research.
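
For readers unfamiliar with the task being automated, here is a minimal sketch of one common sample size calculation: a two-sided comparison of two group means under the normal approximation, n per group = 2((z_{1-α/2} + z_{1-β})·σ/δ)². The abstract does not say which formulas PowerGPT's statistical engines use, so this is a textbook example only, and `n_per_group` is a hypothetical name.

```python
import math
from statistics import NormalDist


def n_per_group(effect: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per arm for a two-sided, two-sample comparison of means.

    Uses the normal (z) approximation; an exact t-based calculation
    typically gives a slightly larger answer.
    """
    if effect == 0:
        raise ValueError("effect must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)


# Detect a difference of half a standard deviation with 80% power at alpha = 0.05.
n_medium = n_per_group(effect=0.5, sd=1.0)
# A difference of one full standard deviation needs far fewer participants.
n_large = n_per_group(effect=1.0, sd=1.0)
```

Halving the detectable effect roughly quadruples the required sample, and picking the wrong test or formula changes the answer entirely, which is the kind of mistake the randomized evaluation measures.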

[AI-66] Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

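[Quick read]: This paper addresses the high compute cost of deploying reasoning language models, which stems from their long chain-of-thought traces. Existing compression techniques such as neural network pruning work well on typical language modeling tasks but tend to cause larger performance losses on reasoning tasks, and can even reduce inference efficiency by making the model generate more, lower-quality thinking tokens. The paper attributes this partly to the fact that standard pruning methods focus on input reconstruction, whereas reasoning is a decode-dominated task. The key solution is Reasoning-Aware Compression (RAC): during pruning, jointly reconstruct activations from the input and from the model's on-policy chain-of-thought traces, so that the activations that matter during reasoning are better preserved. RAC integrates seamlessly into existing pruning workflows such as SparseGPT and significantly improves the reasoning performance of the compressed model.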

Link: https://arxiv.org/abs/2509.12464
Authors: Ryan Lucas,Kayhan Behdin,Zhipeng Wang,Qingquan Song,Shao Tang,Rahul Mazumder
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model’s on-policy chain-of-thought traces. This “Reasoning-Aware Compression” (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: this https URL

[AI-67] PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization EMNLP2025

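[Quick read]: This paper addresses the problem that users of Text-to-Image (T2I) generative AI models must repeatedly refine vague or short prompts by hand to obtain high-quality images. The key to the solution is PromptSculptor, a multi-agent framework in which four specialized agents collaborate. Among them, a task-decomposition and context-inference agent uses Chain-of-Thought reasoning to fill in implicit scene and background details; a self-evaluation agent keeps the modified prompt semantically aligned with the original input; and a feedback-tuning agent incorporates user feedback for iterative refinement. The overall architecture is model-agnostic and can be integrated with a variety of T2I models. The design improves prompt quality and reduces the number of interaction rounds, offering a feasible path to industrial applications.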

Link: https://arxiv.org/abs/2509.12446
Authors: Dawei Xiang,Wenyan Xu,Kexin Chu,Zixu Shen,Tianqi Ding,Wei Zhang
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 System Demonstration Track

Abstract:The rapid advancement of generative AI has democratized access to powerful tools such as Text-to-Image models. However, to generate high-quality images, users must still craft detailed prompts specifying scene, style, and context, often through multiple rounds of refinement. We propose PromptSculptor, a novel multi-agent framework that automates this iterative prompt optimization process. Our system decomposes the task into four specialized agents that work collaboratively to transform a short, vague user prompt into a comprehensive, refined prompt. By leveraging Chain-of-Thought reasoning, our framework effectively infers hidden context and enriches scene and background details. To iteratively refine the prompt, a self-evaluation agent aligns the modified prompt with the original input, while a feedback-tuning agent incorporates user feedback for further refinement. Experimental results demonstrate that PromptSculptor significantly enhances output quality and reduces the number of iterations needed for user satisfaction. Moreover, its model-agnostic design allows seamless integration with various T2I models, paving the way for industrial applications.

[AI-68] Enhancing Physical Consistency in Lightweight World Models

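[Quick read]: This paper addresses the trade-off between parameter count and performance in world models: large models capture physical dynamics accurately but consume too much compute to run on edge devices, while small models are easy to deploy but often predict poorly because they model physics inaccurately. The key to the solution is the lightweight Physics-Informed BEV World Model (PIWM). Its core innovations are a Soft Mask mechanism used during training to improve dynamic object modeling and future-state prediction, and a Warm Start strategy used at inference to improve prediction quality with a zero-shot model. Experiments show that at the same parameter scale (400M) PIWM improves the weighted overall score by 60.6% over the baseline, and that the smallest variant (130M, Soft Mask) scores 7.4% higher than the largest baseline model while running inference 28% faster.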

Link: https://arxiv.org/abs/2509.12437
Authors: Dingrui Wang,Zhexiao Sun,Zhouheng Li,Cheng Wang,Youlun Peng,Hongyuan Ye,Baha Zarrouki,Wei Li,Mattia Piccinini,Lei Xie,Johannes Betz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:A major challenge in deploying world models is the trade-off between size and performance. Large world models can capture rich physical dynamics but require massive computing resources, making them impractical for edge devices. Small world models are easier to deploy but often struggle to learn accurate physics, leading to poor predictions. We propose the Physics-Informed BEV World Model (PIWM), a compact model designed to efficiently capture physical interactions in bird’s-eye-view (BEV) representations. PIWM uses Soft Mask during training to improve dynamic object modeling and future prediction. We also introduce a simple yet effective technique, Warm Start, for inference to enhance prediction quality with a zero-shot model. Experiments show that at the same parameter scale (400M), PIWM surpasses the baseline by 60.6% in weighted overall score. Moreover, even when compared with the largest baseline model (400M), the smallest PIWM (130M Soft Mask) achieves a 7.4% higher weighted overall score with a 28% faster inference speed.

[AI-69] Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization

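[Quick read]: This paper addresses the difficulty Large Language Models (LLMs) have with multi-step reasoning and coordinated tool use in software engineering tasks, especially when resolving complex issues in real codebases. Mainstream preference optimization methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) align outputs with human preferences effectively but often sacrifice output diversity, which limits the gains available from Test-Time Scaling (TTS); they are also typically designed for single-turn tasks and fit poorly with the multi-turn reasoning and tool integration that interactive coding agents require. The key to the solution is \sys, an entropy-enhanced multi-turn preference optimization framework. Its core innovations are to add an explicit policy-entropy term to the preference objective so that output diversity is preserved, and to extend learning from single-turn responses to multi-turn interactions, which better supports tool-assisted reasoning. The authors also design a hybrid best-trajectory selection scheme that combines a learned verifier with model-free approaches to further exploit TTS, achieving new state-of-the-art results among open-weight models on the SWE-bench benchmark.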

Link: https://arxiv.org/abs/2509.12434
Authors: Jiahao Yu,Zelei Cheng,Xian Wu,Xinyu Xing
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains are heavily dependent on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce \sys, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. \sys augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate \sys by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters). To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model-free approaches. On the SWE-bench leaderboard, our approach establishes new state-of-the-art results among open-weight models. A 30B parameter model trained with \sys ranks 1st on SWE-bench Lite and 4th on SWE-bench Verified on the open-weight leaderboard, surpassed only by models with over 10x more parameters (e.g., 350B).
Submission history: [v1] Mon, 15 Sep 2025 20:36:19 UTC (350 KB), submitted by Jiahao Yu

[AI-70] Understanding Prompt Management in GitHub Repositories: A Call for Best Practices

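[Quick Read]: This paper addresses prompt management for promptware (software built using natural language prompts) in the open-source Generative AI ecosystem, in particular disorganized prompts and uneven prompt quality. The key to its approach is an empirical analysis of 24,800 open-source prompts from 92 GitHub repositories, which systematically identifies the core quality problems: inconsistent formatting, internal and external duplication, poor readability, and spelling errors. Based on these findings it offers actionable recommendations to improve the usability and maintainability of prompts and to support the sustainable growth of the promptware ecosystem.

Link: https://arxiv.org/abs/2509.12421
Authors: Hao Li, Hicham Masri, Filipe R. Cogo, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: (none)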

Abstract:The rapid adoption of foundation models (e.g., large language models) has given rise to promptware, i.e., software built using natural language prompts. Effective management of prompts, such as organization and quality assurance, is essential yet challenging. In this study, we perform an empirical analysis of 24,800 open-source prompts from 92 GitHub repositories to investigate prompt management practices and quality attributes. Our findings reveal critical challenges such as considerable inconsistencies in prompt formatting, substantial internal and external prompt duplication, and frequent readability and spelling issues. Based on these findings, we provide actionable recommendations for developers to enhance the usability and maintainability of open-source prompts within the rapidly evolving promptware ecosystem.
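
The abstract reports internal and external prompt duplication but does not say how duplicates were detected. One simple way to check a collection for exact duplicates is to normalize each prompt and group by a content hash, as in this sketch. The normalization rule and the reading of "internal" (within one repository) and "external" (across repositories) are assumptions here.

```python
import hashlib
import re
from collections import defaultdict


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def find_duplicates(prompts_by_repo):
    """Group prompts by content hash and report internal/external duplicates.

    prompts_by_repo maps a repository name to a list of prompt strings.
    Returns two dicts keyed by hash: prompts repeated inside one repository,
    and prompts that appear in more than one repository.
    """
    locations = defaultdict(list)
    for repo, prompts in prompts_by_repo.items():
        for text in prompts:
            digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
            locations[digest].append(repo)
    internal = {h: repos for h, repos in locations.items()
                if any(repos.count(r) > 1 for r in set(repos))}
    external = {h: repos for h, repos in locations.items()
                if len(set(repos)) > 1}
    return internal, external
```

Exact hashing misses near-duplicates such as a prompt with one word changed; catching those would need a similarity measure, which is beyond this sketch.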

[AI-71] Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML

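[Quick Read]: This paper examines whether large language models (LLMs) can generate code in proprietary industrial software environments, where domain-specific constraints and code interdependencies are common. Whether LLMs can produce functional, maintainable code that compiles in such closed systems remains largely unexplored. The key to its approach is an evaluation framework tailored to ASML's proprietary codebase, together with a new metric, build@k, which assesses whether LLM-generated code compiles and integrates successfully in real industrial repositories. By comparing prompting techniques (such as few-shot and chain-of-thought), generic versus code-specific LLMs, and model sizes, the study finds that prompting technique and model size have a significant impact on output quality, while the gap between code-specific and generic models is less pronounced and varies across model families.

Link: https://arxiv.org/abs/2509.12395
Authors: Yash Mundhra, Max Valk, Maliheh Izadi
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted in the 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025 (Industry track)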

Abstract:Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. However, their applicability in proprietary industrial settings, where domain-specific constraints and code interdependencies are prevalent, remains largely unexplored. We present a case study conducted in collaboration with the leveling department at ASML to investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. We developed an evaluation framework tailored to ASML’s proprietary codebase and introduced a new benchmark. Additionally, we proposed a new evaluation metric, build@k, to assess whether LLM-generated code successfully compiles and integrates within real industrial repositories. We investigate various prompting techniques, compare the performance of generic and code-specific LLMs, and examine the impact of model size on code generation capabilities, using both match-based and execution-based metrics. The findings reveal that prompting techniques and model size have a significant impact on output quality, with few-shot and chain-of-thought prompting yielding the highest build success rates. The difference in performance between the code-specific LLMs and generic LLMs was less pronounced and varied substantially across different model families.
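
The abstract names build@k without defining it. If it is computed like the widely used pass@k estimator, with "the build succeeds" in place of "the tests pass", it could look like the sketch below. That analogy is an assumption; the paper may define the metric differently.

```python
from math import comb


def build_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled generations builds.

    n: generations sampled for a task
    c: how many of those n compiled and integrated successfully
    k: sample budget being scored (1 <= k <= n)

    Uses the unbiased pass@k form 1 - C(n - c, k) / C(n, k), applied to build
    success instead of test success (assumption, see the lead-in).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of `build_at_k` over all tasks.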

[AI-72] Evaluating the printability of stl files with ML

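[Quick Read]: This paper addresses 3D print failures caused by model geometry that is difficult to print, a risk that less experienced users in particular struggle to spot. The key to its approach is an AI-based detection model that adds a further layer of checking alongside the sanity checks that slicing software already performs. It flags features of a 3D model that are likely to cause print failures before printing begins, with the aim of raising print success rates and lowering the learning curve for new users.

Link: https://arxiv.org/abs/2509.12392
Authors: Janik Henn, Adrian Hauptmannl, Hamza A. A. Gardi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: (none)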

Abstract:3D printing has long been a technology for industry professionals and enthusiasts willing to tinker or even build their own machines. This stands in stark contrast to today’s market, where recent developments have prioritized ease of use to attract a broader audience. Slicing software nowadays has a few ways to sanity-check the input file as well as the output G-code. Our approach introduces a novel layer of support by training an AI model to detect common issues in 3D models. The goal is to assist less experienced users by identifying features that are likely to cause print failures due to difficult-to-print geometries before printing even begins.

[AI-73] Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization

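[Quick Read]: This paper addresses the poor generalization and low sample efficiency of deep learning models that rely on spurious correlations, which is especially limiting on tasks that require causal reasoning. The key to its approach is Causal-Symbolic Meta-Learning (CSML), a framework that meta-learns a shared causal world model and infers the latent causal structure of a task distribution. It combines a perception module, a differentiable causal induction module, and a graph-based reasoning module, which allows rapid adaptation to new tasks, including those requiring reasoning about interventions and counterfactuals, from only a handful of examples.

Link: https://arxiv.org/abs/2509.12387
Authors: Mohamed Zayaan S
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages, 4 figures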

Abstract:Modern deep learning models excel at pattern recognition but remain fundamentally limited by their reliance on spurious correlations, leading to poor generalization and a demand for massive datasets. We argue that a key ingredient for human-like intelligence (robust, sample-efficient learning) stems from an understanding of causal mechanisms. In this work, we introduce Causal-Symbolic Meta-Learning (CSML), a novel framework that learns to infer the latent causal structure of a task distribution. CSML comprises three key modules: a perception module that maps raw inputs to disentangled symbolic representations; a differentiable causal induction module that discovers the underlying causal graph governing these symbols; and a graph-based reasoning module that leverages this graph to make predictions. By meta-learning a shared causal world model across a distribution of tasks, CSML can rapidly adapt to novel tasks, including those requiring reasoning about interventions and counterfactuals, from only a handful of examples. We introduce CausalWorld, a new physics-based benchmark designed to test these capabilities. Our experiments show that CSML dramatically outperforms state-of-the-art meta-learning and neuro-symbolic baselines, particularly on tasks demanding true causal inference.

[AI-74] Amulet: a Python Library for Assessing Interactions Among ML Defenses and Risks

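[Quick Read]: This paper addresses unintended interactions in machine learning (ML): a defense against one risk can inadvertently change a model's susceptibility to other, unrelated risks. Emerging ML regulatory frameworks require practitioners to assess susceptibility to multiple risks, yet systematic tooling for measuring these interactions is lacking. The key to its approach is AMULET, a Python library designed to be (i) comprehensive, covering representative attacks, defenses, and metrics for security, privacy, and fairness risks; (ii) extensible, through a modular design; (iii) consistent, with a user-friendly API template for inputs and outputs; and (iv) applicable to evaluating previously unexplored unintended interactions.

Link: https://arxiv.org/abs/2509.12386
Authors: Asim Waheed, Vasisht Duddu, Rui Zhang, Sebastian Szyller, N. Asokan
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures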

Abstract:ML models are susceptible to risks to security, privacy, and fairness. Several defenses are designed to protect against their intended risks, but can inadvertently affect susceptibility to other unrelated risks, known as unintended interactions. Several jurisdictions are preparing ML regulatory frameworks that require ML practitioners to assess the susceptibility of ML models to different risks. A library for evaluating unintended interactions can be used by (a) practitioners to evaluate unintended interactions at scale prior to model deployment and (b) researchers to design defenses which do not suffer from an unintended increase in unrelated risks. Ideally, such a library should be i) comprehensive by including representative attacks, defenses and metrics for different risks, ii) extensible to new modules due to its modular design, iii) consistent with a user-friendly API template for inputs and outputs, iv) applicable to evaluate previously unexplored unintended interactions. We present AMULET, a Python library that covers risks to security, privacy, and fairness, which satisfies all these requirements. AMULET can be used to evaluate unexplored unintended interactions, compare effectiveness between defenses or attacks, and include new attacks and defenses.

[AI-75] Geometric Red-Teaming for Robotic Manipulation

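[Quick Read]: This paper addresses the limited robustness evaluation of robotic manipulation policies: standard protocols test only on curated, in-distribution sets and reveal little about how systems fail under plausible variation. It proposes Geometric Red-Teaming (GRT), which applies object-centric geometric perturbations to automatically generate CrashShapes, that is, structurally valid, user-constrained mesh deformations that trigger catastrophic failures in pre-trained manipulation policies. The key is the combination of a Jacobian field-based deformation model with a gradient-free, simulator-in-the-loop optimization strategy, which efficiently explores geometric variations that collapse policy performance. The paper also introduces blue-teaming: fine-tuning on individual CrashShapes improves task success on those shapes by up to 60 percentage points while preserving performance on the original object.

Link: https://arxiv.org/abs/2509.12379
Authors: Divyam Goel, Yufei Wang, Tiancheng Wu, Guixiu Qiao, Pavel Piliptchak, David Held, Zackory Erickson
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 9th Annual Conference on Robot Learning (CoRL 2025, Oral)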

Abstract:Standard evaluation protocols in robotic manipulation typically assess policy performance over curated, in-distribution test sets, offering limited insight into how systems fail under plausible variation. We introduce Geometric Red-Teaming (GRT), a red-teaming framework that probes robustness through object-centric geometric perturbations, automatically generating CrashShapes – structurally valid, user-constrained mesh deformations that trigger catastrophic failures in pre-trained manipulation policies. The method integrates a Jacobian field-based deformation model with a gradient-free, simulator-in-the-loop optimization strategy. Across insertion, articulation, and grasping tasks, GRT consistently discovers deformations that collapse policy performance, revealing brittle failure modes missed by static benchmarks. By combining task-level policy rollouts with constraint-aware shape exploration, we aim to build a general-purpose framework for structured, object-centric robustness evaluation in robotic manipulation. We additionally show that fine-tuning on individual CrashShapes, a process we refer to as blue-teaming, improves task success by up to 60 percentage points on those shapes, while preserving performance on the original object, demonstrating the utility of red-teamed geometries for targeted policy refinement. Finally, we validate both red-teaming and blue-teaming results with a real robotic arm, observing that simulated CrashShapes reduce task success from 90% to as low as 22.5%, and that blue-teaming recovers performance to up to 90% on the corresponding real-world geometry – closely matching simulation outcomes. Videos and code can be found on the project website linked from the arXiv abstract page.

[AI-76] An integrated process for design and control of lunar robotics using AI and simulation

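[Quick Read]: This paper addresses the difficulty of exploring physical design and control strategy in parallel when developing lunar construction equipment; the summary notes that treating the two separately tends to limit overall system performance. The key to its approach is an integrated technical framework built on OpenPLX, a readable/writable declarative language that links CAD models and autonomous systems to high-fidelity, real-time 3D simulations of contacting multibody dynamics, machine-regolith interaction forces, and non-ideal sensors, so that design and control can be iterated together.

Link: https://arxiv.org/abs/2509.12367
Authors: Daniel Lindmark, Jonas Andersson, Kenneth Bodin, Tora Bodin, Hugo Börjesson, Fredrik Nordfeldth, Martin Servin
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures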

Abstract:We envision an integrated process for developing lunar construction equipment, where physical design and control are explored in parallel. In this paper, we describe a technical framework that supports this process. It relies on OpenPLX, a readable/writable declarative language that links CAD models and autonomous systems to high-fidelity, real-time 3D simulations of contacting multibody dynamics, machine-regolith interaction forces, and non-ideal sensors. To demonstrate its capabilities, we present two case studies, including an autonomous lunar rover that combines a vision-language model for navigation with a reinforcement learning-based control policy for locomotion.

[AI-77] Enhancing Smart Farming Through Federated Learning: A Secure Scalable and Efficient Approach for AI-Driven Agriculture

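[Quick Read]: This paper addresses the tension in agriculture between data-driven decision-making and farm data privacy: improving crop disease detection accuracy without requiring farms to share sensitive operational data. The key to its approach is a federated learning framework for smart farming in which deep learning models are trained locally on farm devices and only model parameters are uploaded to a central aggregation server, so that multiple farms collaboratively train a disease classification model. The framework aims to improve generalization, reduce communication and training costs, and enable earlier disease identification and intervention; empirical validation is left to subsequent studies.

Link: https://arxiv.org/abs/2509.12363
Authors: Ritesh Janga, Rushit Dave
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures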

Abstract:The agricultural sector is undergoing a transformation with the integration of advanced technologies, particularly in data-driven decision-making. This work proposes a federated learning framework for smart farming, aiming to develop a scalable, efficient, and secure solution for crop disease detection tailored to the environmental and operational conditions of Minnesota farms. By maintaining sensitive farm data locally and enabling collaborative model updates, our proposed framework seeks to achieve high accuracy in crop disease classification without compromising data privacy. We outline a methodology involving data collection from Minnesota farms, application of local deep learning algorithms, transfer learning, and a central aggregation server for model refinement, aiming to achieve improved accuracy in disease detection, good generalization across agricultural scenarios, lower costs in communication and training time, and earlier identification and intervention against diseases in future implementations. We outline a methodology and anticipated outcomes, setting the stage for empirical validation in subsequent studies. This work comes in a context where more and more demand for data-driven interpretations in agriculture has to be weighed with concerns about privacy from farms that are hesitant to share their operational data. This will be important to provide a secure and efficient disease detection method that can finally revolutionize smart farming systems and solve local agricultural problems with data confidentiality. In doing so, this paper bridges the gap between advanced machine learning techniques and the practical, privacy-sensitive needs of farmers in Minnesota and beyond, leveraging the benefits of federated learning.
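
The abstract mentions a central aggregation server but does not name the aggregation rule. A common choice in federated learning is federated averaging (FedAvg), where the server averages client parameters weighted by local dataset size. The sketch below assumes that rule and represents each model as a flat list of floats; both are illustrative assumptions.

```python
def federated_average(client_weights, client_sizes):
    """Combine client model parameters into a global model (FedAvg-style).

    client_weights: one flat list of parameters per farm, all the same length
    client_sizes: number of local training samples on each farm
    Raw images never leave the farm; only these parameter lists are shared.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

In each round the server would send the averaged parameters back to the farms, which continue training locally before the next aggregation.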

[AI-78] Linear Dimensionality Reduction for Word Embeddings in Tabular Data Classification

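[Quick Read]: This paper addresses how to use high-dimensional word embeddings effectively for tabular data classification when training samples are limited. Its setting is the Engineers' Salary Prediction Challenge, where a 300-dimensional job-description embedding inflates dimensionality and invites overfitting. The key to its approach is linear dimensionality reduction. First, Principal Component Analysis (PCA) with an appropriate subspace dimension can outperform raw embeddings. Second, Linear Discriminant Analysis (LDA) without regularization performs poorly because of covariance estimation errors, but shrinkage improves it significantly, even with only two dimensions. Finally, the proposed Partitioned-LDA splits the embedding into equal-sized blocks and runs LDA on each block separately, which reduces the size of the covariance matrices; combined with shrinkage it achieves top-10 accuracy on the competition's public leaderboard.

Link: https://arxiv.org/abs/2509.12346
Authors: Liam Ressel, Hamza A. A. Gardi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: (none)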

Abstract:The Engineers’ Salary Prediction Challenge requires classifying salary categories into three classes based on tabular data. The job description is represented as a 300-dimensional word embedding incorporated into the tabular features, drastically increasing dimensionality. Additionally, the limited number of training samples makes classification challenging. Linear dimensionality reduction of word embeddings for tabular data classification remains underexplored. This paper studies Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). We show that PCA, with an appropriate subspace dimension, can outperform raw embeddings. LDA without regularization performs poorly due to covariance estimation errors, but applying shrinkage improves performance significantly, even with only two dimensions. We propose Partitioned-LDA, which splits embeddings into equal-sized blocks and performs LDA separately on each, thereby reducing the size of the covariance matrices. Partitioned-LDA outperforms regular LDA and, combined with shrinkage, achieves top-10 accuracy on the competition public leaderboard. This method effectively enhances word embedding performance in tabular data classification with limited training samples.
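
The Partitioned-LDA idea described above can be sketched with scikit-learn's `LinearDiscriminantAnalysis`, which supports shrinkage through its `eigen` solver. The block count of 6, the synthetic data, and the step of concatenating the per-block projections are assumptions for illustration; the abstract does not give these details.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def partitioned_lda(X_train, y_train, X_test, n_blocks=6):
    """Fit one shrinkage LDA per block of embedding dimensions.

    Each block yields at most (n_classes - 1) discriminant features, so three
    salary classes give two features per block. The per-block projections are
    concatenated into the reduced representation.
    """
    blocks = np.array_split(np.arange(X_train.shape[1]), n_blocks)
    train_parts, test_parts = [], []
    for idx in blocks:
        lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto")
        train_parts.append(lda.fit_transform(X_train[:, idx], y_train))
        test_parts.append(lda.transform(X_test[:, idx]))
    return np.hstack(train_parts), np.hstack(test_parts)


# Synthetic stand-in for 300-dimensional job-description embeddings.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 50)
embeddings = rng.normal(size=(150, 300)) + 0.3 * labels[:, None]
order = rng.permutation(150)
train_idx, test_idx = order[:120], order[120:]

Z_train, Z_test = partitioned_lda(
    embeddings[train_idx], labels[train_idx], embeddings[test_idx]
)
```

Each block's covariance matrix here is 50 by 50 instead of 300 by 300, which is the size reduction the abstract credits for the improvement when samples are scarce. The reduced features would then be joined with the remaining tabular columns before classification.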

[AI-79] Integrating Attention-Enhanced LSTM and Particle Swarm Optimization for Dynamic Pricing and Replenishment Strategies in Fresh Food Supermarkets

[Quick Read]: This paper addresses dynamic pricing and replenishment in fresh food supermarkets: maximizing profit and reducing food waste under sales fluctuations, spoilage, and inventory constraints. The key to its approach is combining an attention-enhanced Long Short-Term Memory (LSTM) network with Particle Swarm Optimization (PSO). The LSTM predicts sales volumes, pricing trends, and spoilage rates over a seven-day period; these predictions are the inputs to PSO, which iteratively optimizes pricing and replenishment strategies to improve profitability. Cost-plus pricing lets the strategy adjust to changes in fixed and variable costs. The attention mechanism also improves interpretability by identifying the key time points and factors that influence sales, and the framework is presented as a scalable approach for retailers of perishable goods.

Link: https://arxiv.org/abs/2509.12339
Authors: Xianchen Liu (1), Tianhui Zhang (2), Xinyu Zhang (3), Lingmin Hou (3), Zhen Guo (4), Yuanhao Tian (5), Yang Liu (6)
Affiliations: (1) Department of Electrical and Computer Engineering, Florida International University, Miami, FL 33199, USA; (2) College of Engineering, Northeastern University, Boston, MA 02169, USA; (3) Department of Computer Science, Rochester Institute of Technology, Rochester, USA; (4) Department of Mechanical and Materials Engineering, Florida International University, Miami, FL 33199, USA; (5) Department of Politics & International Relations, Florida International University, Miami, FL 33199, USA; (6) College of Arts & Sciences, University of Miami, Miami, FL 33124, USA
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures

Abstract:This paper presents a novel approach to optimizing pricing and replenishment strategies in fresh food supermarkets by combining Long Short-Term Memory (LSTM) networks with Particle Swarm Optimization (PSO). The LSTM model, enhanced with an attention mechanism, is used to predict sales volumes, pricing trends, and spoilage rates over a seven-day period. The predictions generated by the LSTM model serve as inputs for the PSO algorithm, which iteratively optimizes pricing and replenishment strategies to maximize profitability while adhering to inventory constraints. The integration of cost-plus pricing allows for dynamic adjustments based on fixed and variable costs, ensuring real-time adaptability to market fluctuations. The framework not only maximizes profits but also reduces food waste, contributing to more sustainable supermarket operations. The attention mechanism enhances the interpretability of the LSTM model by identifying key time points and factors influencing sales, improving decision-making accuracy. This methodology bridges the gap between predictive modeling and optimization, offering a scalable solution for dynamic pricing and inventory management in fresh food retail and other industries dealing with perishable goods.
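
To make the forecast-then-optimize loop concrete, here is a small PSO that searches over a cost-plus markup and an order quantity. Everything numeric is hypothetical: the linear demand curve stands in for the LSTM forecast, and the unit cost, spoilage rate, bounds, and PSO coefficients are illustrative choices rather than values from the paper.

```python
import random

UNIT_COST = 4.0        # hypothetical purchase cost per unit
SPOILAGE_RATE = 0.10   # hypothetical forecast: fraction of stock that spoils


def forecast_demand(price: float) -> float:
    """Stand-in for the LSTM forecast: a linear demand curve floored at zero."""
    return max(0.0, 100.0 - 10.0 * price)


def profit(position):
    """Profit of a (markup, order quantity) decision under cost-plus pricing."""
    markup, quantity = position
    price = UNIT_COST * (1.0 + markup)
    sellable = quantity * (1.0 - SPOILAGE_RATE)
    sold = min(sellable, forecast_demand(price))
    return price * sold - UNIT_COST * quantity


def particle_swarm_maximize(objective, bounds, n_particles=30, n_iters=200, seed=0):
    """Basic global-best PSO with positions clamped to the given bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    leader = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[leader][:], best_val[leader]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (best_pos[i][d] - pos[i][d])
                             + 1.5 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val > g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


best_position, best_profit = particle_swarm_maximize(
    profit, bounds=[(0.0, 2.0), (0.0, 100.0)]
)
```

Because the objective only needs to be evaluated, not differentiated, the same loop works when `forecast_demand` is replaced by the output of a trained forecasting model.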

[AI-80] An End to End Edge to Cloud Data and Analytics Strategy

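[Quick Read]: This paper addresses how enterprises, amid rapid growth in Internet of Things (IoT) devices, can build secure and efficient architectures spanning cloud and edge to support real-time data processing and critical decisions. The key to its approach is an end-to-end secure edge-to-cloud data and analytics strategy, with reference architectures for the device, edge, and cloud layers, to make the best use of cloud and edge assets.

Link: https://arxiv.org/abs/2509.12296
Authors: Vijay Kumar Butte, Sujata Butte
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: (none)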

Abstract:There is an exponential growth of connected Internet of Things (IoT) devices. These have given rise to applications that rely on real time data to make critical decisions quickly. Enterprises today are adopting cloud at a rapid pace. There is a critical need to develop secure and efficient strategy and architectures to best leverage capabilities of cloud and edge assets. This paper provides an end to end secure edge to cloud data and analytics strategy. To enable real life implementation, the paper provides reference architectures for device layer, edge layer and cloud layer.

[AI-81] C3DE: Causal-Aware Collaborative Neural Controlled Differential Equation for Long-Term Urban Crowd Flow Prediction

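[Quick Read]: This paper addresses cumulative sampling errors in long-term urban crowd flow prediction, which arise from longer sequences and larger sampling intervals. It also tackles the complex influence of evolving Points of Interest (POIs) on crowd flow, including multi-timescale asynchronous dynamics between crowd flow and POI distribution and latent spurious causality. The key to its approach is the Causal-aware Collaborative neural CDE (C3DE). It uses a dual-path Neural Controlled Differential Equation (NCDE) backbone to capture the asynchronous evolution of collaborative signals across time scales; a counterfactual-based causal effect estimator with a dynamic correction mechanism to quantify the causal impact of POIs on crowd flow and limit the accumulation of spurious correlations; and a predictor that fuses POI and crowd flow signals for long-term prediction. It performs particularly well in cities with notable flow fluctuations.

Link: https://arxiv.org/abs/2509.12289
Authors: Yuting Liu, Qiang Zhou, Hanzhe Li, Chenqi Gong, Jingjing Gu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: (none)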

Abstract:Long-term urban crowd flow prediction suffers significantly from cumulative sampling errors, due to increased sequence lengths and sampling intervals, which inspired us to leverage Neural Controlled Differential Equations (NCDEs) to mitigate this issue. However, regarding the crucial influence of Points of Interest (POIs) evolution on long-term crowd flow, the multi-timescale asynchronous dynamics between crowd flow and POI distribution, coupled with latent spurious causality, poses challenges to applying NCDEs for long-term urban crowd flow prediction. To this end, we propose Causal-aware Collaborative neural CDE (C3DE) to model the long-term dynamic of crowd flow. Specifically, we introduce a dual-path NCDE as the backbone to effectively capture the asynchronous evolution of collaborative signals across multiple time scales. Then, we design a dynamic correction mechanism with the counterfactual-based causal effect estimator to quantify the causal impact of POIs on crowd flow and minimize the accumulation of spurious correlations. Finally, we leverage a predictor for long-term prediction with the fused collaborative signals of POI and crowd flow. Extensive experiments on three real-world datasets demonstrate the superior performance of C3DE, particularly in cities with notable flow fluctuations.

[AI-82] Digital Voices of Survival: From Social Media Disclosures to Support Provisions for Domestic Violence Victims

[Quick Read]: This paper addresses the lack of systematic understanding and quantitative analysis of how domestic violence (DV) victims self-disclose on social media and how online communities respond with support. Existing research has no comprehensive picture of DV self-disclosure, support provisions, and the connections between them. The key to its approach is a computational framework with four components: self-disclosure detection, post clustering, topic summarization, and support extraction and mapping. It identifies DV-related posts in social media data, distills the key topics, and maps the support offered to the support sought, which informs victim-centered digital interventions.

Link: https://arxiv.org/abs/2509.12288
Authors: Kanlun Wang, Zhe Fu, Wangjiaxuan Xin, Lina Zhou, Shashi Kiran Chandrappa
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 9 pages, 4 figures and 4 tables. Accepted to The 59th Hawaii International Conference on System Sciences (HICSS) 2026

Abstract:Domestic Violence (DV) is a pervasive public health problem characterized by patterns of coercive and abusive behavior within intimate relationships. With the rise of social media as a key outlet for DV victims to disclose their experiences, online self-disclosure has emerged as a critical yet underexplored avenue for support-seeking. In addition, existing research lacks a comprehensive and nuanced understanding of DV self-disclosure, support provisions, and their connections. To address these gaps, this study proposes a novel computational framework for modeling DV support-seeking behavior alongside community support mechanisms. The framework consists of four key components: self-disclosure detection, post clustering, topic summarization, and support extraction and mapping. We implement and evaluate the framework with data collected from relevant social media communities. Our findings not only advance existing knowledge on DV self-disclosure and online support provisions but also enable victim-centered digital interventions.

[AI-83] Deriving the Scaled-Dot-Function via Maximum Likelihood Estimation and Maximum Entropy Approach

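[Quick Read]: This paper addresses the estimation of the value vector in transformer models using maximum likelihood estimation. It models the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The key to its approach is that the variance of each Gaussian depends on the time step, the corresponding key vector, and the query vector, while the mean depends on the time step and the corresponding value vector. This may offer a new probabilistic explanation of the scaled-dot-product function or softmax function used in transformers.

Link: https://arxiv.org/abs/2509.12285
Authors: Jiyong Ma
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: (none)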

Abstract:In this paper, we present a maximum likelihood estimation approach to determine the value vector in transformer models. We model the sequence of value vectors, key vectors, and the query vector as a sequence of Gaussian distributions. The variance in each Gaussian distribution depends on the time step, the corresponding key vector, and the query vector. The mean value in each Gaussian distribution depends on the time step, and the corresponding value vector. This analysis may offer a new explanation of the scaled-dot-product function or softmax function used in transformer architectures [1]. Another explanation, inspired by [4], is based on the maximum entropy approach in natural language processing [5]. In this approach, a query vector and key vectors are used to derive the feature functions for the maximum entropy model.
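
For reference, this is the standard scaled dot-product attention for a single query, the function the paper sets out to reinterpret. It is the usual textbook form in plain Python and does not reproduce the paper's Gaussian or maximum-entropy derivation.

```python
import math


def scaled_dot_product_attention(query, keys, values):
    """Attention output for one query over a sequence of key/value pairs.

    scores_t  = (query . key_t) / sqrt(d)
    weights   = softmax(scores)
    output    = sum_t weights_t * value_t
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights
```

The output is a weighted average of the value vectors, with weights that sum to one and grow with the query-key dot product.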

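[AI-84] AIssistant: An Agentic Approach for Human–AI Collaborative Scientific Work on Reviews and Perspectives in Machine Learning

[Quick Read]: This paper addresses the fragmentation of current AI-assisted research tools and their lack of human-centred workflows, which hinders efficient end-to-end scientific work. The key to its approach is AIssistant, an open-source, agentic Human-AI collaborative framework that uses modular tools and agents for literature synthesis, section-wise experimentation, citation management, and automatic LaTeX paper text generation, while keeping human oversight at every stage to ensure accuracy, coherence, and scholarly rigour.

Link: https://arxiv.org/abs/2509.12282
Authors: Sasi Kiran Gaddipati, Farhana Keya, Gollam Rabby, Sören Auer
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: (none)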

Abstract:Advances in AI-assisted research have introduced powerful tools for literature retrieval, hypothesis generation, experimentation, and manuscript preparation. However, systems remain fragmented and lack human-centred workflows. To address these gaps, we introduce AIssistant, an agentic, open-source Human-AI collaborative framework designed to simplify the end-to-end creation of scientific workflows. Since our development is still in an early stage, we present here the first experiments with AIssistant for perspective and review research papers in machine learning. Our system integrates modular tools and agents for literature synthesis, section-wise experimentation, citation management, and automatic LaTeX paper text generation, while maintaining human oversight at every stage to ensure accuracy, coherence, and scholarly rigour. We conducted a comprehensive evaluation across three layers: (1) Independent Human Review, following NeurIPS double-blind standards; (2) Automated LLM Review, using GPT-5 as a scalable human review proxy; and (3) Program Chair Oversight, where the chair monitors the entire review process and makes final validation and acceptance decisions. The results demonstrate that AIssistant improves drafting efficiency and thematic consistency. Nonetheless, Human-AI collaboration remains essential for maintaining factual correctness, methodological soundness, and ethical compliance. Despite its effectiveness, we identify key limitations, including hallucinated citations, difficulty adapting to dynamic paper structures, and incomplete integration of multimodal content.

[AI-85] Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering

[Quick Read]: This paper addresses inefficient training and weak generalization in multimodal audio-language understanding, in particular how to make effective use of existing high-quality data for audio question answering (AQA). The key to its approach is Omni-CLST, which combines two mechanisms: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism (guided Selective Chain-of-Thought) that focuses reasoning on challenging cases. Integrated with GRPO (Group Relative Policy Optimization) training, these let the model learn more effectively from informative samples. Omni-CLST reaches a competitive 73.80% accuracy on MMAU-mini and a new state of the art of 64.30% on MMAR.

Link: https://arxiv.org/abs/2509.12275
Authors: Jinghua Zhao, Hang Su, Lichun Fan, Zhenbo Luo, Jian Luan, Hui Wang, Haoqin Sun, Yong Qin
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure, 2 tables

Abstract:We propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought for audio question answering. The framework efficiently leverages existing high-quality dataset through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Integrated with GRPO training, these strategies enable the model to learn more effectively from informative samples. Experiments on MMAU-mini and MMAR demonstrate that Omni-CLST achieves competitive accuracy (73.80% on MMAU-mini) and establishes a new state of the art (64.30% on MMAR), highlighting its robustness and generalization capability in multimodal audio-language understanding.
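
The two strategies can be pictured with a small data-preparation sketch. The abstract does not say how difficulty is measured, in which direction samples are ordered, or when thoughts are dropped, so the error-rate score, the easy-to-hard order, the threshold, and all field names below are assumptions for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str
    thought: Optional[str]   # chain-of-thought annotation, if kept
    n_errors: int            # wrong answers from a base model (hypothetical field)
    n_attempts: int          # attempts by that base model (hypothetical field)

    @property
    def difficulty(self) -> float:
        return self.n_errors / self.n_attempts


def build_curriculum(samples: List[Sample], dropout_threshold: float = 0.0) -> List[Sample]:
    """Order samples easy-to-hard and drop thoughts on the easy ones.

    Samples whose error rate is at or below dropout_threshold keep only the
    final answer, so reasoning supervision concentrates on harder cases.
    """
    ordered = sorted(samples, key=lambda s: s.difficulty)
    return [
        s if s.difficulty > dropout_threshold else replace(s, thought=None)
        for s in ordered
    ]
```

The resulting list would then feed a training loop such as GRPO, which this sketch does not cover.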

[AI-86] A Variational Physics-Informed Neural Network Framework Using Petrov-Galerkin Method for Solving Singularly Perturbed Boundary Value Problems

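[Quick Read]: This paper addresses the numerical solution of one-dimensional singularly perturbed boundary value problems (BVPs) and parabolic partial differential equations (PDEs) involving one or two small parameters, with particular attention to accurately capturing boundary layers. The key to its solution is a Variational Physics-Informed Neural Network (VPINN) framework that combines the Petrov-Galerkin weak formulation with deep neural networks (DNNs): the trial space is defined by neural network functions, while the test space is built from hat functions. Localized test functions and interface penalty terms improve numerical stability and resolve the boundary layers; Dirichlet boundary conditions are imposed as hard constraints, and source terms are computed with automatic differentiation. The reported result is significantly better accuracy in both the L_2 and maximum norms than the standard VPINN approach.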

Link: https://arxiv.org/abs/2509.12271
Authors: Vijay Kumar,Gautam Singh
Affiliations: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work proposes a Variational Physics-Informed Neural Network (VPINN) framework that integrates the Petrov-Galerkin formulation with deep neural networks (DNNs) for solving one-dimensional singularly perturbed boundary value problems (BVPs) and parabolic partial differential equations (PDEs) involving one or two small parameters. The method adopts a nonlinear approximation in which the trial space is defined by neural network functions, while the test space is constructed from hat functions. The weak formulation is constructed using localized test functions, with interface penalty terms introduced to enhance numerical stability and accurately capture boundary layers. Dirichlet boundary conditions are imposed via hard constraints, and source terms are computed using automatic differentiation. Numerical experiments on benchmark problems demonstrate the effectiveness of the proposed method, showing significantly improved accuracy in both the L_2 and maximum norms compared to the standard VPINN approach for one-dimensional singularly perturbed differential equations (SPDEs).
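Two ingredients named in the abstract, hat test functions and hard-constrained Dirichlet conditions, are easy to show in isolation. The sketch below is an illustration under assumptions, not the paper's implementation: the domain is taken to be the unit interval, the ansatz `a(1-x) + bx + x(1-x)·net(x)` is one common way to enforce boundary values exactly, and `net` is a hypothetical stand-in for a trained network.

```python
import math


def hat(x, left, center, right):
    """Piecewise-linear hat test function: 1 at center, 0 outside (left, right)."""
    if left < x <= center:
        return (x - left) / (center - left)
    if center < x < right:
        return (right - x) / (right - center)
    return 0.0


def hard_constrained(net, a, b):
    """Wrap net so that u(0) = a and u(1) = b hold exactly on [0, 1]."""
    def u(x):
        return a * (1.0 - x) + b * x + x * (1.0 - x) * net(x)
    return u


# Hypothetical stand-in for a trained neural network.
net = lambda x: math.sin(3.0 * x) + 2.0
u = hard_constrained(net, a=1.0, b=-0.5)
```

Because the factor `x(1-x)` vanishes at both endpoints, the boundary values hold whatever the network outputs, so no boundary penalty term is needed in the loss.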

[AI-87] InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

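[Quick Read]: This paper addresses a limitation in the physical reasoning of Large Multimodal Models (LMMs): their weakness at inductive physical reasoning. Existing LMMs rely on parametric knowledge of physical laws acquired during training, which is insufficient when the inference scenario violates those laws; humans, by contrast, can adapt to an unseen physical environment from a few visual examples, an ability that matters for safety-critical applications. The paper proposes InPhyRe, the first visual question answering benchmark designed to measure inductive physical reasoning in LMMs. It uses algorithmically generated synthetic collision videos to test whether models can predict collision outcomes. Evaluating 13 LMMs, the key findings are that LMMs struggle to apply their limited parametric knowledge of universal physical laws, that their inductive reasoning is weak when demonstration samples violate those laws, and that they show a strong language bias, largely ignoring the visual input, which calls into question their trustworthiness on visual inputs.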

Link: https://arxiv.org/abs/2509.12263
Authors: Gautam Sreekumar,Vishnu Naresh Boddeti
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages including appendix

Abstract:Large multimodal models (LMMs) encode universal physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning when the inference scenario violates these physical laws. In contrast, humans possess the skill to adapt their physical reasoning to unseen physical environments from a few visual examples. This ability, which we refer to as inductive physical reasoning, is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks evaluate only the parametric knowledge in LMMs, and not inductive physical reasoning. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs on their ability to predict the outcome of collision events in algorithmically generated synthetic collision videos. By inspecting 13 LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when demonstration samples violate universal physical laws, and (3) inductive physical reasoning in LMMs suffers from language bias and largely ignores the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.

[AI-88] Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) for Diabetes Risk Prediction

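[Quick Read]: This paper targets diabetes risk prediction, where class imbalance degrades conventional machine learning models, and aims to improve both accuracy and interpretability. The key to its solution is the Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM), which uses quantum-inspired elements such as phase feature mapping and neighborhood sequence modeling to enrich feature representations, and combines a self-improving concept graph with a stacked ensemble of Random Forests (RF), Extra Trees (ET), transformers, convolutional neural networks (CNNs), and feed-forward neural networks (FFNNs). On the PIMA Indians Diabetes dataset augmented with 2,000 synthetic samples, it reports an out-of-fold (OOF) F1 score of 0.8933 and an AUC of 0.8699, which the authors describe as outperforming traditional methods, with CPU inference at 8.5 rows per second. The authors position it as a potential benchmark for AI-assisted clinical triage.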

Link: https://arxiv.org/abs/2509.12259
Authors: Kenneth G. Young II
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: 13 pages, 3 figures, includes performance tables and visualizations. Proposes a Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) that integrates phase feature mapping, self-improving concept graphs, and neighborhood sequence modeling within a stacked ensemble. Demonstrates improved F1 and AUC on an augmented PIMA Diabetes dataset with efficient CPU inference

Abstract:The Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) is an innovative machine learning framework that harnesses quantum-inspired techniques to predict diabetes risk with exceptional accuracy and efficiency. Utilizing the PIMA Indians Diabetes dataset augmented with 2,000 synthetic samples to mitigate class imbalance (total: 2,768 samples, 1,949 positives), QISICGM integrates a self-improving concept graph with a stacked ensemble comprising Random Forests (RF), Extra Trees (ET), transformers, convolutional neural networks (CNNs), and feed-forward neural networks (FFNNs). This approach achieves an out-of-fold (OOF) F1 score of 0.8933 and an AUC of 0.8699, outperforming traditional methods. Quantum inspired elements, such as phase feature mapping and neighborhood sequence modeling, enrich feature representations, enabling CPU-efficient inference at 8.5 rows per second. This paper presents a detailed architecture, theoretical foundations, code insights, and performance evaluations, including visualizations from the outputs subfolder. The open-source implementation (v1.0.0) is available at this https URL, positioning QISICGM as a potential benchmark for AI-assisted clinical triage in diabetes and beyond. Ultimately, this work emphasizes trustworthy AI through calibration, interpretability, and open-source reproducibility.

[AI-89] Representation Learning on Large Non-Bipartite Transaction Networks using GraphSAGE

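[Quick Read]: This paper addresses the scalability of analysing complex transaction networks in finance, where traditional graph embedding methods struggle with dynamic, real-world banking data. The key to its solution is GraphSAGE (Graph Sample and Aggregate), an inductive graph neural network framework that scales to large non-bipartite heterogeneous transaction networks and generalises to unseen nodes, which suits temporally evolving transaction data. The authors build a graph from anonymised customer and merchant transactions and train a GraphSAGE model to generate node embeddings. The embeddings form interpretable clusters aligned with geographic and demographic attributes, and using them in a money mule detection model improves the prioritisation of high-risk accounts, illustrating the method's scalability, inductive capability, and practicality on banking-scale networks.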

Link: https://arxiv.org/abs/2509.12255
Authors: Mihir Tare,Clemens Rattasits,Yiming Wu,Euan Wielewski
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Abstract:Financial institutions increasingly require scalable tools to analyse complex transactional networks, yet traditional graph embedding methods struggle with dynamic, real-world banking data. This paper demonstrates the practical application of GraphSAGE, an inductive Graph Neural Network framework, to non-bipartite heterogeneous transaction networks within a banking context. Unlike transductive approaches, GraphSAGE scales well to large networks and can generalise to unseen nodes which is critical for institutions working with temporally evolving transactional data. We construct a transaction network using anonymised customer and merchant transactions and train a GraphSAGE model to generate node embeddings. Our exploratory work on the embeddings reveals interpretable clusters aligned with geographic and demographic attributes. Additionally, we illustrate their utility in downstream classification tasks by applying them to a money mule detection model where using these embeddings improves the prioritisation of high-risk accounts. Beyond fraud detection, our work highlights the adaptability of this framework to banking-scale networks, emphasising its inductive capability, scalability, and interpretability. This study provides a blueprint for financial organisations to harness graph machine learning for actionable insights in transactional ecosystems.
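The inductive property mentioned in the abstract comes from the fact that a GraphSAGE layer learns weights for aggregating neighbour features, not a per-node embedding table. The following is a minimal pure-Python sketch of one mean-aggregation layer. It omits neighbour sampling and normalisation, and the node names, feature vectors, and weight matrices are invented for illustration.

```python
def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sage_layer(features, neighbours, node, W_self, W_neigh):
    """One GraphSAGE-style layer: ReLU(W_self * h_node + W_neigh * mean(h_neighbours))."""
    h_self = features[node]
    h_neigh = mean([features[n] for n in neighbours[node]])
    z = [a + b for a, b in zip(matvec(W_self, h_self), matvec(W_neigh, h_neigh))]
    return [max(0.0, v) for v in z]


# Hypothetical toy graph of customers and a merchant.
features = {"cust_a": [1.0, 0.0], "cust_b": [0.0, 1.0], "shop_x": [1.0, 1.0]}
neighbours = {"cust_a": ["shop_x"], "cust_b": ["shop_x"], "shop_x": ["cust_a", "cust_b"]}
W_self = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for trained weights
W_neigh = [[0.5, 0.0], [0.0, 0.5]]

# A node that did not exist at training time is embedded with the same weights.
features["cust_new"] = [2.0, 0.0]
neighbours["cust_new"] = ["shop_x", "cust_a"]
emb_new = sage_layer(features, neighbours, "cust_new", W_self, W_neigh)
emb_shop = sage_layer(features, neighbours, "shop_x", W_self, W_neigh)
```

No retraining is needed for `cust_new`; only its features and neighbourhood are required, which is what makes the approach attractive for a network that keeps gaining accounts.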

[AI-90] DISPLIB: a library of train dispatching problems

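[Quick Read]: This paper addresses delays in railway operations, which optimization-based systems can reduce by automatically re-routing and re-scheduling trains after a delay occurs. Its key contribution is DISPLIB, a common problem definition and file format that captures the main features of train re-routing and re-scheduling, together with openly available problem instances gathered from multiple real-world industrial use cases and a reference solver implementation. This improves reproducibility, lets researchers work on train dispatching without an industrial partner, and enables empirical comparisons between solvers.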

Link: https://arxiv.org/abs/2509.12254
Authors: Oddvar Kloster,Bjørnar Luteberget,Carlo Mannino,Giorgio Sartor
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Optimization-based decision support systems have a significant potential to reduce delays, and thus improve efficiency on the railways, by automatically re-routing and re-scheduling trains after delays have occurred. The operations research community has dedicated a lot of effort to developing optimization algorithms for this problem, but each study is typically tightly connected with a specific industrial use case. Code and data are seldom shared publicly. This fact hinders reproducibility, and has led to a proliferation of papers describing algorithms for more or less compatible problem definitions, without any real opportunity for readers to assess their relative performance. Inspired by the successful communities around MILP, SAT, TSP, VRP, etc., we introduce a common problem definition and file format, DISPLIB, which captures all the main features of train re-routing and re-scheduling. We have gathered problem instances from multiple real-world use cases and made them openly available. In this paper, we describe the problem definition, the industrial instances, and a reference solver implementation. This allows any researcher or developer to work on the train dispatching problem without an industrial connection, and enables the research community to perform empirical comparisons between solvers. All materials are available online at this https URL.

[AI-91] Why and How Auxiliary Tasks Improve JEPA Representations

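[Quick Read]: This paper addresses the poorly understood behaviour of the Joint-Embedding Predictive Architecture (JEPA) in visual representation learning and model-based reinforcement learning (RL). Its key contribution is a "No Unhealthy Representation Collapse" theorem: in deterministic Markov decision processes (MDPs), if training drives both the latent-transition consistency loss and the auxiliary regression loss to zero, then any two non-equivalent observations, that is, observations that differ in transition dynamics or auxiliary label, must map to distinct latent representations. The auxiliary task therefore anchors which distinctions the representation must preserve. Experiments in a counting environment corroborate the theory and show that training JEPA jointly with the auxiliary head yields richer representations than training them separately, which suggests a way to improve JEPA encoders: train them with an auxiliary function that, together with the transition dynamics, encodes the right equivalence relations.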

Link: https://arxiv.org/abs/2509.12249
Authors: Jiacan Yu,Siyi Chen,Mingrui Liu,Nono Horiuchi,Vladimir Braverman,Zicheng Xu,Dan Haramati,Randall Balestriero
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Joint-Embedding Predictive Architecture (JEPA) is increasingly used for visual representation learning and as a component in model-based RL, but its behavior remains poorly understood. We provide a theoretical characterization of a simple, practical JEPA variant that has an auxiliary regression head trained jointly with latent dynamics. We prove a No Unhealthy Representation Collapse theorem: in deterministic MDPs, if training drives both the latent-transition consistency loss and the auxiliary regression loss to zero, then any pair of non-equivalent observations, i.e., those that do not have the same transition dynamics or auxiliary label, must map to distinct latent representations. Thus, the auxiliary task anchors which distinctions the representation must preserve. Controlled ablations in a counting environment corroborate the theory and show that training the JEPA model jointly with the auxiliary head generates a richer representation than training them separately. Our work indicates a path to improve JEPA encoders: training them with an auxiliary function that, together with the transition dynamics, encodes the right equivalence relations.
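A rough sketch of why a statement of this kind is plausible, written in assumed notation that does not come from the abstract: an encoder E, a latent predictor P, an auxiliary head g, a deterministic transition o' = T(o, a), and an auxiliary label y(o). The squared-norm form of the two losses is also an assumption, and this is not the paper's proof.

```latex
\begin{align}
\mathcal{L}(E, P, g) &= \mathbb{E}\,\bigl\| P(E(o), a) - E(o') \bigr\|^2
  + \mathbb{E}\,\bigl\| g(E(o)) - y(o) \bigr\|^2, \qquad o' = T(o, a), \\
E(o_1) = E(o_2) \ \text{and}\ \mathcal{L} = 0
  &\;\Longrightarrow\; y(o_1) = g(E(o_1)) = g(E(o_2)) = y(o_2), \\
  &\;\Longrightarrow\; E(T(o_1, a)) = P(E(o_1), a) = P(E(o_2), a) = E(T(o_2, a))
  \quad \text{for every action } a.
\end{align}
```

Two observations that share a latent must therefore share an auxiliary label, and their successors share a latent as well, so the argument repeats along any action sequence on the visited data. Read in the contrapositive, observations that differ in auxiliary label, now or after some sequence of actions, cannot be mapped to the same latent.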

[AI-92] RL Fine-Tuning Heals OOD Forgetting in SFT

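[Quick Read]: This paper examines the unclear synergy between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in the two-stage fine-tuning paradigm, and in particular whether the common claim "SFT memorizes, RL generalizes" holds. It finds that out-of-distribution (OOD) performance peaks early in SFT and then declines (OOD forgetting), and that RL does not produce fundamentally better OOD capability but instead restores the reasoning ability lost during SFT (OOD restoration). Using singular value decomposition (SVD) of the parameter matrices, the authors find that OOD behaviour correlates strongly with the rotation of singular vectors, while the singular values themselves stay largely stable. The restoration has limits: if SFT runs for too short or too long, RL cannot recover the lost OOD ability. These findings redefine the roles of SFT and RL in two-stage fine-tuning and identify singular-vector rotation as the key mechanism.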

Link: https://arxiv.org/abs/2509.12235
Authors: Hangzhan Jin,Sitao Luan,Sicheng Lyu,Guillaume Rabusseau,Reihaneh Rabbany,Doina Precup,Mohammad Hamdaqa
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 15 figures

Abstract:The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim “SFT memorizes, RL generalizes” is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability, instead it plays an OOD restoration role, recovering the lost reasoning ability during SFT; (3) The recovery ability has boundaries, i.e., if SFT trains for too short or too long, RL cannot recover the lost OOD ability; (4) To uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the rotation of singular vectors. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism. Code is available at this https URL

[AI-93] Towards Trustworthy Agentic IoEV: AI Agents for Explainable Cyberthreat Mitigation and State Analytics

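[Quick Read]: This paper addresses three core problems in the Internet of Electric Vehicles (IoEV) ecosystem: vulnerability to cyberattacks, unreliable battery-state prediction, and opaque decision processes, all of which erode trust and performance. The key to its solution is an Agentic Artificial Intelligence (AAI) framework tailored to IoEV. It comprises dedicated agents for cyber-threat detection and response at charging stations, real-time State of Charge (SoC) estimation, and State of Health (SoH) anomaly detection, coordinated through a shared, explainable reasoning layer; interpretable threat-mitigation mechanisms that proactively identify and neutralize attacks on physical charging points and learning components; resilient SoC and SoH models that use continuous, adversarial-aware learning to produce accurate, uncertainty-aware forecasts with human-readable explanations; and a three-agent pipeline in which each agent uses large language model (LLM)-driven reasoning and dynamic tool invocation to interpret intent, contextualize tasks, and execute formal optimizations for user-centric assistance.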

Link: https://arxiv.org/abs/2509.12233
Authors: Meryem Malak Dif,Mouhamed Amine Bouchiha,Abdelaziz Amara Korba,Yacine Ghamri-Doudane
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 10 pages, 7 figures, Accepted at LCN’25

Abstract:The Internet of Electric Vehicles (IoEV) envisions a tightly coupled ecosystem of electric vehicles (EVs), charging infrastructure, and grid services, yet it remains vulnerable to cyberattacks, unreliable battery-state predictions, and opaque decision processes that erode trust and performance. To address these challenges, we introduce a novel Agentic Artificial Intelligence (AAI) framework tailored for IoEV, where specialized agents collaborate to deliver autonomous threat mitigation, robust analytics, and interpretable decision support. Specifically, we design an AAI architecture comprising dedicated agents for cyber-threat detection and response at charging stations, real-time State of Charge (SoC) estimation, and State of Health (SoH) anomaly detection, all coordinated through a shared, explainable reasoning layer; develop interpretable threat-mitigation mechanisms that proactively identify and neutralize attacks on both physical charging points and learning components; propose resilient SoC and SoH models that leverage continuous and adversarial-aware learning to produce accurate, uncertainty-aware forecasts with human-readable explanations; and implement a three-agent pipeline, where each agent uses LLM-driven reasoning and dynamic tool invocation to interpret intent, contextualize tasks, and execute formal optimizations for user-centric assistance. Finally, we validate our framework through comprehensive experiments across diverse IoEV scenarios, demonstrating significant improvements in security and prediction accuracy. All datasets, models, and code will be released publicly.

[AI-94] Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study

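[Quick Read]: This paper addresses the efficiency of fine-tuning Large Language Models (LLMs) on consumer GPUs, specifically an NVIDIA RTX 4060 limited to 8 GB of VRAM, where the practical performance of parameter-efficient techniques such as LoRA and QLoRA has not been systematically studied. Through controlled profiling, it varies batch size, sequence length, optimizer (AdamW vs. PagedAdamW), and precision (fp16 vs. bf16), and measures throughput (tokens/s), time per 10k tokens, and VRAM footprint. The authors describe it as, to their knowledge, the first systematic case study of LLM fine-tuning efficiency on consumer GPUs, with reproducible benchmarks and practical guidelines. Paged optimizers improve throughput by up to 25%, bf16 is less efficient than fp16, and sequence lengths up to 2048 tokens are feasible within the 8 GB limit.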

Link: https://arxiv.org/abs/2509.12229
Authors: MSR Avinash
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 8 pages, 3 figures, 2 tables. Primary category: cs.LG (Machine Learning); secondary: cs.AI (Artificial Intelligence). LaTeX source with figures included

Abstract:Fine-tuning large language models (LLMs) with parameter-efficient techniques such as LoRA and QLoRA has enabled adaptation of foundation models on modest hardware. Yet the efficiency of such training on consumer-grade GPUs, especially under strict 8 GB VRAM limits, remains underexplored. We present a controlled profiling study of LoRA/QLoRA fine-tuning using the Qwen2.5-1.5B-Instruct model on a single NVIDIA RTX 4060. Across three representative configurations, we systematically vary batch size, sequence length, optimizer choice (AdamW vs. PagedAdamW), and precision (fp16 vs. bf16). We report throughput (tokens/s), time per 10k tokens, and VRAM footprint, alongside energy estimates derived from GPU board power limits. Our results show that paged optimizers improve throughput by up to 25% (628 tok/s vs. 500 tok/s baseline), while bf16 degrades efficiency relative to fp16. Despite 8 GB constraints, sequence lengths up to 2048 tokens were feasible using parameter-efficient strategies. To our knowledge, this is the first systematic case study of LLM fine-tuning efficiency on consumer GPUs, providing reproducible benchmarks and practical guidelines for resource-constrained researchers and practitioners.
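The two throughput figures quoted in the abstract can be related to its other metric, time per 10k tokens, with simple arithmetic. Only the 628 and 500 tok/s values come from the abstract; the helper name is hypothetical.

```python
def time_per_10k_tokens(tokens_per_second):
    """Seconds needed to process 10,000 tokens at a given throughput."""
    return 10_000 / tokens_per_second


baseline_tps = 500.0  # AdamW baseline, from the abstract
paged_tps = 628.0     # paged optimizer, from the abstract

speedup_pct = (paged_tps / baseline_tps - 1.0) * 100.0
baseline_seconds = time_per_10k_tokens(baseline_tps)
paged_seconds = time_per_10k_tokens(paged_tps)
```

The relative gain works out to roughly 25.6%, consistent with the abstract's "up to 25%", and corresponds to about 4 seconds saved per 10k tokens.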

[AI-95] Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction

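[Quick Read]: This paper addresses multitask, multimodal prediction where data heterogeneity and task interactions vary from sample to sample, motivated by psychotherapy, where structured assessments and unstructured clinician notes coexist with partially missing data and correlated outcomes. The key to its solution is a unified adaptive routing framework that dynamically selects modality processing pathways and task-sharing strategies per sample. The model defines multiple modality paths, including raw and fused representations of text and numeric features, learns to route each input through the most informative expert combination, and produces task predictions from shared or independent heads depending on the routing decision; the whole system is trained end-to-end. The authors report that it consistently outperforms fixed multitask and single-task baselines while providing interpretable routing decisions that reveal modality relevance and task structure.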

Link: https://arxiv.org/abs/2509.12227
Authors: Marzieh Ajirak,Oded Bein,Ellen Rose Bowen,Dora Kanellopoulos,Avital Falk,Faith M. Gunning,Nili Solomonov,Logan Grosenick
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a unified framework for adaptive routing in multitask, multimodal prediction settings where data heterogeneity and task interactions vary across samples. Motivated by applications in psychotherapy where structured assessments and unstructured clinician notes coexist with partially missing data and correlated outcomes, we introduce a routing-based architecture that dynamically selects modality processing pathways and task-sharing strategies on a per-sample basis. Our model defines multiple modality paths, including raw and fused representations of text and numeric features and learns to route each input through the most informative expert combination. Task-specific predictions are produced by shared or independent heads depending on the routing decision, and the entire system is trained end-to-end. We evaluate the model on both synthetic data and real-world psychotherapy notes predicting depression and anxiety outcomes. Our experiments show that our method consistently outperforms fixed multitask or single-task baselines, and that the learned routing policy provides interpretable insights into modality relevance and task structure. This addresses critical challenges in personalized healthcare by enabling per-subject adaptive information processing that accounts for data heterogeneity and task correlations. Applied to psychotherapy, this framework could improve mental health outcomes, enhance treatment assignment precision, and increase clinical cost-effectiveness through personalized intervention strategies.

[AI-96] Ratio1 – AI meta-OS

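[Quick Read]: This paper addresses resource fragmentation, dependence on centralized architectures, and the difficulty of coordinating heterogeneous edge devices in AI model development and deployment. According to the authors, centralized cloud MLOps and existing decentralized compute platforms often lack integrated AI toolchains or trusted node-operator mechanics. The key to its solution is the Ratio1 AI meta-operating system (meta-OS), a blockchain-based decentralized framework that turns idle computing resources (laptops, smartphones, cloud VMs) into a trustless global supercomputer. Its components include a decentralized authentication layer (dAuth), an in-memory state database (CSTORE), a distributed storage system (R1FS), homomorphic encrypted federated learning (EDIL), decentralized container orchestration (Deeploy), and an oracle network (OracleSync), together with a circular token-economic model combining Proof-of-Availability (PoA) and Proof-of-AI (PoAI). The authors argue that this improves accessibility, scalability, security, and cost-efficiency over existing alternatives.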

Link: https://arxiv.org/abs/2509.12223
Authors: Andrei Damian,Petrica Butusina,Alessandro De Franceschi,Vitalii Toderian,Marius Grigoras,Cristian Bleotiu
Affiliations: Unknown
Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:We propose the Ratio1 AI meta-operating system (meta-OS), a decentralized MLOps protocol that unifies AI model development, deployment, and inference across heterogeneous edge devices. Its key innovation is an integrated blockchain-based framework that transforms idle computing resources (laptops, smartphones, cloud VMs) into a trustless global supercomputer. The architecture includes novel components: a decentralized authentication layer (dAuth), an in-memory state database (CSTORE), a distributed storage system (R1FS), homomorphic encrypted federated learning (EDIL), decentralized container orchestration (Deeploy) and an oracle network (OracleSync), which collectively ensure secure, resilient execution of AI pipelines and other container based apps at scale. The protocol enforces a formal circular token-economic model combining Proof-of-Availability (PoA) and Proof-of-AI (PoAI) consensus. Compared to centralized heterogeneous cloud MLOps and existing decentralized compute platforms, which often lack integrated AI toolchains or trusted Ratio1 node operators (R1OP) mechanics, Ratio1’s holistic design lowers barriers for AI deployment and improves cost-efficiency. We provide mathematical formulations of its secure licensing and reward protocols, and include descriptive information for the system architecture and protocol flow. We argue that our proposed fully functional ecosystem proposes and demonstrates significant improvements in accessibility, scalability, and security over existing alternatives.

[AI-97] Accelerating Privacy-Preserving Federated Learning in Large-Scale LEO Satellite Systems

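[Quick Read]: This paper addresses Federated Learning (FL) over large-scale low-Earth-orbit (LEO) satellite networks, where dynamic topology and limited bandwidth delay the aggregation and distribution of model parameters and so prolong training. The key to its solution is a discrete temporal graph-based on-demand scheduling framework that dynamically allocates communication resources to speed up parameter exchange while keeping raw data local. In simulation it reduces overall round times by 14.20% to 41.48% relative to traditional statistical multiplexing-based model exchange, and the acceleration grows with model size and number of clients, indicating good scalability.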

Link: https://arxiv.org/abs/2509.12222
Authors: Binquan Guo,Junteng Cao,Marie Siew,Binbin Chen,Tony Q. S. Quek,Zhu Han
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Submitted to IEEE conference for publication

Abstract:Large-scale low-Earth-orbit (LEO) satellite systems are increasingly valued for their ability to enable rapid and wide-area data exchange, thereby facilitating the collaborative training of artificial intelligence (AI) models across geographically distributed regions. Due to privacy concerns and regulatory constraints, raw data collected at remote clients cannot be centrally aggregated, posing a major obstacle to traditional AI training methods. Federated learning offers a privacy-preserving alternative by training local models on distributed devices and exchanging only model parameters. However, the dynamic topology and limited bandwidth of satellite systems will hinder timely parameter aggregation and distribution, resulting in prolonged training times. To address this challenge, we investigate the problem of scheduling federated learning over satellite networks and identify key bottlenecks that impact the overall duration of each training round. We propose a discrete temporal graph-based on-demand scheduling framework that dynamically allocates communication resources to accelerate federated learning. Simulation results demonstrate that the proposed approach achieves significant performance gains over traditional statistical multiplexing-based model exchange strategies, reducing overall round times by 14.20% to 41.48%. Moreover, the acceleration effect becomes more pronounced for larger models and higher numbers of clients, highlighting the scalability of the proposed approach.

[AI-98] Scaling Up Data Parallelism in Decentralized Deep Learning

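[Quick Read]: This paper addresses the lack of stability, scalability, and generality in large-scale decentralized deep neural network (DNN) training, which has kept it out of production use. The key to its solution is Ada, an adaptive decentralized approach that follows a decentralized SGD method and dynamically adapts the communication graph throughout training, motivated by the observation that model accuracy is sensitive to the variance of parameter tensors across model replicas. The authors report that Ada consistently obtains the best convergence rates in decentralized training and delivers accuracy equal or comparable to centralized learning, even when training ResNet50 on ImageNet-1K on 1008 GPUs.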

Link: https://arxiv.org/abs/2509.12213
Authors: Bing Xie,Junqi Yin,Zhenyu Zhou,Sarp Oral,Feiyi Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although it has been extensively explored in theory, decentralized learning is not yet green-lighted for production use, largely due to a lack of stability, scalability, and generality in large scale DNN training. To shed light on the production use of decentralized learning, this work studies decentralized data parallel training at scale. To this end, we introduce a benchmarking framework, namely DBench, to host both centralized and decentralized DNN training. Building upon DBench, we introduce a benchmarking methodology to uncover the correlations between model accuracy and the variances of parameter tensors by varying communication graphs and training scales. Based on the benchmarking results, we observe that, (1) Similar to centralized learning, decentralized data parallel training also presents the issues of scalability and generality when the training scales up; (2) The model accuracy of decentralized learning is correlated to the number of connections in a communication graph; (3) The model accuracy of decentralized learning is surprisingly sensitive to the variance of parameter tensors across model replicas. Built upon the observations, we propose Ada, a decentralized adaptive approach that performs large scale DNN training following a decentralized SGD method and adapting the communication graph in use dynamically throughout training iterations. We apply Ada on large scale training and observe that Ada can obtain the best convergence rates consistently in decentralized DNN training, and delivers equally or comparably good model accuracy for all sample applications as centralized learning does, even when training ResNet50 for ImageNet-1K on the scale of 1008 GPUs.
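Observations (2) and (3) in the abstract link the number of connections in the communication graph to the variance of parameters across replicas. A toy gossip-averaging step makes that link concrete. This sketch shows only the mixing step of decentralized SGD on one scalar parameter per replica, with no gradient update; it does not reproduce Ada's adaptation rule, which the abstract does not specify, and the graphs and values are invented.

```python
def ring_neighbours(n):
    """Ring communication graph: each replica talks to its two adjacent replicas."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def complete_neighbours(n):
    """Complete communication graph: every replica talks to every other one."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def gossip_step(params, graph):
    """Each replica replaces its parameter with the mean over itself and its neighbours."""
    return [
        (params[i] + sum(params[j] for j in graph[i])) / (1 + len(graph[i]))
        for i in range(len(params))
    ]


def variance(values):
    """Population variance across replicas."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


params = [0.0, 4.0, 8.0, 12.0]  # one scalar parameter per replica
after_ring = gossip_step(params, ring_neighbours(4))
after_complete = gossip_step(params, complete_neighbours(4))
```

One step on the sparse ring shrinks the cross-replica variance but does not remove it, while the fully connected graph reaches consensus immediately at a higher communication cost. That trade-off is the kind of quantity an adaptive choice of communication graph can exploit.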

[AI-99] PowerGrow: Feasible Co-Growth of Structures and Dynamics for Power Grid Synthesis

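[Quick Read]: This paper addresses the scarcity of publicly available power grid test cases at a time when topologies and loads are becoming more dynamic because of renewable energy variability, electric vehicle adoption, and active grid reconfiguration. Existing methods struggle to model the joint distribution of network topology, branch attributes, bus properties, and dynamic load profiles efficiently while preserving physical feasibility. The key to its solution is PowerGrow, a co-generative framework built on dependence decomposition: the joint distribution is factorized into a chain of conditional distributions, so that structure and dynamics are generated together. A hierarchical graph beta-diffusion process synthesizes the topology, and a temporal autoencoder embeds time-series data into a compact latent space, reducing computational overhead while improving sample fidelity and operational validity. Experiments report a 98.9% power flow convergence rate and improved N-1 contingency resilience, along with better fidelity and diversity than prior diffusion models.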

Link: https://arxiv.org/abs/2509.12212
Authors: Xinyu He,Chenhan Xiao,Haoran Li,Ruizhong Qiu,Zhe Xu,Yang Weng,Jingrui He,Hanghang Tong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Modern power systems are becoming increasingly dynamic, with changing topologies and time-varying loads driven by renewable energy variability, electric vehicle adoption, and active grid reconfiguration. Despite these changes, publicly available test cases remain scarce, due to security concerns and the significant effort required to anonymize real systems. Such limitations call for generative tools that can jointly synthesize grid structure and nodal dynamics. However, modeling the joint distribution of network topology, branch attributes, bus properties, and dynamic load profiles remains a major challenge, while preserving physical feasibility and avoiding prohibitive computational costs. We present PowerGrow, a co-generative framework that significantly reduces computational overhead while maintaining operational validity. The core idea is dependence decomposition: the complex joint distribution is factorized into a chain of conditional distributions over feasible grid topologies, time-series bus loads, and other system attributes, leveraging their mutual dependencies. By constraining the generation process at each stage, we implement a hierarchical graph beta-diffusion process for structural synthesis, paired with a temporal autoencoder that embeds time-series data into a compact latent space, improving both training stability and sample fidelity. Experiments across benchmark settings show that PowerGrow not only outperforms prior diffusion models in fidelity and diversity but also achieves a 98.9% power flow convergence rate and improved N-1 contingency resilience. This demonstrates its ability to generate operationally valid and realistic power grid scenarios.
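The "dependence decomposition" described in the abstract is an application of the chain rule of probability. One way to write it is shown below, with assumed symbols: G for the grid topology, X for the time-series bus loads, and A for the remaining system attributes. The ordering of the factors is an assumption; the abstract names the three groups but not the conditioning order.

```latex
\begin{equation}
p(G, X, A) \;=\; p(G)\; p(X \mid G)\; p(A \mid G, X)
\end{equation}
```

Each factor can then be handled by a model suited to its data type and constrained to feasible outputs before the next stage conditions on it, which is the role the abstract gives to the graph beta-diffusion process and the temporal autoencoder.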

[AI-100] TinyServe: Query-Aware Cache Selection for Efficient LLM Serving ICML ACM-MM

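[Quick Read]: This paper addresses the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding in Large Language Models (LLMs). Its solution is TinyServe, a lightweight and extensible serving system. The key innovation is a query-aware page selection mechanism that uses bounding-box metadata to estimate the attention relevance between the query and KV cache blocks, enabling selective KV loading that cuts decoding cost without model modifications. A fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass for hardware efficiency. Experiments report up to 3.4x speedup and over 2x memory savings with negligible accuracy drop, supporting its practicality on resource-constrained hardware.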

Link: https://arxiv.org/abs/2509.12211
Authors: Dong Liu,Yanxuan Yu
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM MM as Oral Paper, also accepted to ICML MOSS workshop, publicly available as this https URL

Abstract:Serving large language models (LLMs) efficiently remains challenging due to the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding. We present TinyServe, a lightweight and extensible serving system for deploying tiny LLMs (e.g., TinyLLaMA, GPT2-345M) with support for structured KV sparsity, plugin-based token selection, and hardware-efficient attention kernels. Unlike prior simulation frameworks, TinyServe executes real-time decoding with configurable sparsity strategies and fine-grained instrumentation. To reduce decoding cost, we introduce a query-aware page selection mechanism that leverages bounding-box metadata to estimate attention relevance between the query and KV cache blocks. This enables selective KV loading with minimal overhead and no model modifications. Our fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass. Experiments show that TinyServe achieves up to 3.4x speedup and over 2x memory savings with negligible accuracy drop. Additional analysis of cache reuse, page hit rate, and multi-GPU scaling confirms its practicality as an efficient system-level design for LLM training and inference research on resource-constrained hardware.
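The abstract says only that page selection uses "bounding-box metadata". One plausible reading, sketched here in plain Python instead of a fused CUDA kernel, keeps the per-dimension minimum and maximum of the keys in each page and scores a page by an upper bound on the query-key dot product. The scoring rule, function names, and toy data are assumptions and are not taken from the TinyServe implementation.

```python
def page_metadata(keys):
    """Bounding box of a page: per-dimension min and max over its key vectors."""
    dims = range(len(keys[0]))
    lo = [min(k[d] for k in keys) for d in dims]
    hi = [max(k[d] for k in keys) for d in dims]
    return lo, hi


def page_score(query, lo, hi):
    """Upper bound on the dot product of query with any key inside the bounding box."""
    return sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi))


def select_pages(query, pages, top_k):
    """Return the indices of the top_k highest-scoring pages, in ascending order."""
    scored = [(page_score(query, *page_metadata(keys)), idx) for idx, keys in enumerate(pages)]
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:top_k])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


pages = [
    [[1.0, 0.0], [0.5, 0.5]],    # page 0
    [[-1.0, 2.0], [0.0, 1.0]],   # page 1
    [[0.0, -1.0], [0.2, -0.5]],  # page 2
]
query = [1.0, -1.0]
selected = select_pages(query, pages, top_k=2)
```

A serving system would compute the metadata once when a page is filled, not on every query as this sketch does. Because the score is an upper bound, a page with a low score cannot contain a key with a high attention logit, so skipping it is safe as far as the bound is concerned.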

[AI-101] Rich Vehicle Routing Problem with diverse Vertices allowing Hierarchical and Multimodal Time-Dependant Transhipment of multiple Node- Vehicle- compatible Cargo with Cascaded Time-Minimization Objective for Emergency Decision Support Systems

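[Quick read]: This paper addresses a rich vehicle routing problem (RVRP) with multiple vehicle types, multiple depots, and multiple transport modes, aimed at disaster management, where the goal is to minimize the highest vehicle route duration (the makespan) to speed up emergency response. The core challenge is integrating heterogeneous vehicles, inter-modal transfer nodes (Transhipment Ports), cargo compatibility constraints with both vehicles and Transhipment Ports, and simultaneous or split pickup and delivery. The solution rests on a cascaded minimization approach and a decision-tree-structured heuristic (the PSR-GIP Heuristic): a mixed-integer linear programming (MILP) formulation first demonstrates the approach's advantage over existing makespan minimization, and the heuristic then handles large instances quickly by preferentially generating small route elements, integrating them under several different logics to build independent solutions, and perturbing those solutions to explore neighbouring ones, yielding good solutions while respecting the compatibility constraints.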

Link: https://arxiv.org/abs/2509.13227
Authors: Santanu Banerjee,Goutam Sen,Siddhartha Mukhopadhyay
Affiliations: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:A rich vehicle routing problem is considered allowing multiple trips of heterogeneous vehicles stationed at distributed vehicle depots spread across diverse geographies having access to different modes of transportation. The problem arises from the real world requirement of optimizing the disaster response/preparedness time and minimizes the route duration of the vehicles to achieve the solution with the minimum highest-vehicle-route-duration. Multiple diversely-functional vertices are considered including the concept of Transhipment Ports as inter-modal resource transfer stations. Both simultaneous and split pickup and transferring of different types of delivery and pickup cargo are considered, along with Vehicle-Cargo and Transhipment Port-Cargo Compatibility. The superiority of the proposed cascaded minimization approach is shown over existing makespan minimization approaches through the developed MILP formulation. To solve the problem quickly for practical implementation within Disaster Management-specific Decision Support Systems, an extensive Heuristic Algorithm is devised. The Heuristic utilizes Decision Tree based structuring of possible routes and is able to inherently consider the compatibility issues. Preferential generation of small route elements is performed, and these are integrated into route clusters; we consider multiple different logical integration approaches, as well as shuffling the logics to simultaneously produce multiple independent solutions. Finally, perturbation of the different solutions is done to find better neighbouring solutions. The computational performance of the PSR-GIP Heuristic, on our created novel datasets, indicates that it is able to give good solutions swiftly for practical problems involving large integer instances which the MILP is unable to solve.

[AI-102] FusionMAE: large-scale pretrained model to optimize and simplify diagnostic and control of fusion plasma

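[Quick read]: This paper addresses the proliferation of diagnostic systems and their tangled interrelations in magnetic confinement fusion devices, a consequence of the complex, multiscale, nonlinear dynamics of plasmas that has long slowed fusion energy development. The key is a large-scale pre-trained model, the Fusion Masked Auto-Encoder (FusionMAE), which compresses 88 diagnostic signals into a single embedding that serves as a unified interface between diagnostic systems and control actuators. Two mechanisms, compression-reduction and missing-signal reconstruction, keep the embedding meaningful. After pre-training, the model can act as a "virtual backup diagnosis", inferring missing diagnostic data with 96.7% reliability, and shows three emergent capabilities: automatic data analysis, a universal control-diagnosis interface, and improved control performance on multiple tasks. The result is a simpler system interface, fewer required diagnostics, and better operating performance for future fusion reactors.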

Link: https://arxiv.org/abs/2509.12945
Authors: Zongyu Yang,Zhenghao Yang,Wenjing Tian,Jiyuan Li,Xiang Sun,Guohui Zheng,Songfen Liu,Niannian Wu,Rongpeng Li,Zhaohe Xu,Bo Li,Zhongbing Shi,Zhe Gao,Wei Chen,Xiaoquan Ji,Min Xu,Wulyu Zhong
Affiliations: unknown
Categories: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:In magnetically confined fusion devices, the complex, multiscale, and nonlinear dynamics of plasmas necessitate the integration of extensive diagnostic systems to effectively monitor and control plasma behaviour. The complexity and uncertainty arising from these extensive systems and their tangled interrelations have long posed a significant obstacle to the acceleration of fusion energy development. In this work, a large-scale model, fusion masked auto-encoder (FusionMAE), is pre-trained to compress the information from 88 diagnostic signals into a concrete embedding, to provide a unified interface between diagnostic systems and control actuators. Two mechanisms are proposed to ensure a meaningful embedding: compression-reduction and missing-signal reconstruction. Upon completion of pre-training, the model acquires the capability for ‘virtual backup diagnosis’, enabling the inference of missing diagnostic data with 96.7% reliability. Furthermore, the model demonstrates three emergent capabilities: automatic data analysis, universal control-diagnosis interface, and enhancement of control performance on multiple tasks. This work pioneers large-scale AI model integration in fusion energy, demonstrating how pre-trained embeddings can simplify the system interface, reduce the necessary diagnostic systems, and optimize operation performance for future fusion reactors.

[AI-103] Exact alternative optima for nonlinear optimization problems defined with maximum component objective function constrained by the Sugeno-Weber fuzzy relational inequalities

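[Quick read]: This paper studies a latticized optimization problem with fuzzy relational inequality constraints, where the feasible region is the intersection of two fuzzy inequality systems and the Sugeno-Weber family of t-norms serves as the fuzzy composition. Such problems matter in fuzzy modelling, notably for modelling the intersection and union of fuzzy sets. The approach first characterizes the feasible region under max-Sugeno-Weber composition and gives a necessary and sufficient condition for feasibility, then builds on theoretical properties of the problem to design an algorithm that is proved to find the exact optimal solution of this nonlinear problem.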

Link: https://arxiv.org/abs/2509.12669
Authors: Amin Ghodousian,Sara Zal,Minoo Ahmadi
Affiliations: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 numerical example, presented at 17th International Conference on Information Technology, Computer and Telecommunication (ITCTC), Poland, December 2022

Abstract:In this paper, we study a latticized optimization problem with fuzzy relational inequality constraints where the feasible region is formed as the intersection of two inequality fuzzy systems and the Sugeno-Weber family of t-norms is considered as fuzzy composition. The Sugeno-Weber family of t-norms and t-conorms is one of the most widely applied in various fuzzy modelling problems. This family of t-norms and t-conorms was suggested by Weber for modeling intersection and union of fuzzy sets. Also, the t-conorms were suggested as addition rules by Sugeno for so-called alpha-fuzzy measures. The resolution of the feasible region of the problem is first investigated when it is defined with max-Sugeno-Weber composition, and a necessary and sufficient condition is presented for determining the feasibility. Then, based on some theoretical properties of the problem, an algorithm is presented for solving this nonlinear problem. It is proved that the algorithm can find the exact optimal solution and an example is presented to illustrate the proposed algorithm.
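
As a rough illustration of the max-Sugeno-Weber composition mentioned in the abstract, the sketch below uses the standard Sugeno-Weber t-norm T(x, y) = max(0, (x + y - 1 + λxy) / (1 + λ)) for λ > -1. The constraint form (one system bounded above, one bounded below, with matrices `A` and `D`) and all numbers are illustrative assumptions, not the paper's formulation or data.

```python
import numpy as np

def sugeno_weber(x, y, lam):
    """Sugeno-Weber t-norm for lam > -1."""
    return np.maximum(0.0, (x + y - 1.0 + lam * x * y) / (1.0 + lam))

def max_sw_compose(A, x, lam):
    """Max-Sugeno-Weber composition: (A o x)_i = max_j T(a_ij, x_j)."""
    return sugeno_weber(A, x[np.newaxis, :], lam).max(axis=1)

def is_feasible(A, b_up, D, b_low, x, lam):
    """Check A o x <= b_up and D o x >= b_low (assumed shape of the two systems)."""
    return bool(np.all(max_sw_compose(A, x, lam) <= b_up)
                and np.all(max_sw_compose(D, x, lam) >= b_low))

# Made-up data; lam = 0 reduces the t-norm to the Lukasiewicz t-norm.
A = np.array([[0.9, 0.4], [0.3, 0.8]])
D = np.array([[0.6, 0.9], [0.7, 0.2]])
x = np.array([0.5, 0.7])
composed = max_sw_compose(A, x, lam=0.0)
feasible = is_feasible(A, np.array([0.5, 0.6]), D, np.array([0.1, 0.1]), x, lam=0.0)
```

The check only tests a given candidate `x`; the paper's contribution is the characterization of the whole feasible region and an exact algorithm, neither of which is reproduced here.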

[AI-104] Reinforcement Learning-Based Market Making as a Stochastic Control on Non-Stationary Limit Order Book Dynamics

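[Quick read]: This paper addresses the difficulty market makers face in optimizing decision policies under non-stationary market conditions, in particular the stylized facts of real markets: clustered order arrival times, non-stationary spreads and return drifts, and stochastic order quantities and price volatility. The solution is a reinforcement learning (RL) market-making agent trained with Proximal Policy Optimization (PPO), together with a simulation environment that reproduces these stylized facts, so that domain knowledge is embedded in policy learning and the agent becomes more stable and adaptive in complex, dynamic markets.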

Link: https://arxiv.org/abs/2509.12456
Authors: Rafael Zimmer,Oswaldo Luiz do Valle Costa
Affiliations: unknown
Categories: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures, 3 tables, 31 equations

Abstract:Reinforcement Learning has emerged as a promising framework for developing adaptive and data-driven strategies, enabling market makers to optimize decision-making policies based on interactions with the limit order book environment. This paper explores the integration of a reinforcement learning agent in a market-making context, where the underlying market dynamics have been explicitly modeled to capture observed stylized facts of real markets, including clustered order arrival times, non-stationary spreads and return drifts, stochastic order quantities and price volatility. These mechanisms aim to enhance stability of the resulting control agent, and serve to incorporate domain-specific knowledge into the agent policy learning process. Our contributions include a practical implementation of a market making agent based on the Proximal-Policy Optimization (PPO) algorithm, alongside a comparative evaluation of the agent’s performance under varying market conditions via a simulator-based environment. As evidenced by our analysis of the financial return and risk metrics when compared to a closed-form optimal solution, our results suggest that the reinforcement learning agent can effectively be used under non-stationary market conditions, and that the proposed simulator-based environment can serve as a valuable tool for training and pre-training reinforcement learning agents in market-making scenarios.

[AI-105] Neural-Quantum-States Impurity Solver for Quantum Embedding Problems

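[Quick read]: This paper addresses the accuracy and stability of the impurity solver in quantum embedding methods, specifically the challenges of using a neural quantum states (NQS) framework in that role. The key is a graph-transformer-based NQS framework that can represent arbitrarily connected impurity orbitals, combined with an error control mechanism that stabilizes the iterative updates within the embedding loop, improving convergence and robustness while retaining high accuracy.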

Link: https://arxiv.org/abs/2509.12431
Authors: Yinzhanghao Zhou,Tsung-Han Lee,Ao Chen,Nicola Lanatà,Hong Guo
Affiliations: unknown
Categories: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 10 pages main text, and 4 figures. Note that YinZhangHao Zhou and Zhanghao Zhouyin are the same person, I use them both

Abstract:Neural quantum states (NQS) have emerged as a promising approach to solve second-quantised Hamiltonians, because of their scalability and flexibility. In this work, we design and benchmark an NQS impurity solver for the quantum embedding methods, focusing on the ghost Gutzwiller Approximation (gGA) framework. We introduce a graph transformer-based NQS framework able to represent arbitrarily connected impurity orbitals and develop an error control mechanism to stabilise iterative updates throughout the quantum embedding loops. We validate the accuracy of our approach with benchmark gGA calculations of the Anderson Lattice Model, yielding results in excellent agreement with the exact diagonalisation impurity solver. Finally, our analysis of the computational budget reveals the method’s principal bottleneck to be the high-accuracy sampling of physical observables required by the embedding loop, rather than the NQS variational optimisation, directly highlighting the critical need for more efficient inference techniques.

[AI-106] Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Realistic Synthetic Conditions

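[Quick read]: This paper addresses the performance drop of non-invasive glucose monitors outside the lab, which it attributes to existing datasets ignoring hardware noise, environmental drift, and person-to-person physiology. The key is what the authors describe as the first ultra-realistic near-infrared (NIR) simulator, which injects 12-bit ADC quantisation, LED ageing (±0.1%), photodiode dark noise, temperature (15-45 °C), humidity (30-90% RH), contact-pressure variation, Fitzpatrick I-VI skin pigmentation, and diurnal glucose excursions (dawn phenomenon). Six methods are benchmarked on this platform. The physics-based Enhanced Beer-Lambert model reaches 13.6 mg/dL root mean square error (RMSE), 95.8% Clarke-A, and 93.8% ±15% accuracy with only 56 parameters and 0.01 ms inference, outperforming the best physics-informed neural network (PINN, 14.6 mg/dL) and the shallow DNN baseline (35.1 mg/dL). The results challenge the assumption that deeper PINNs dominate and provide an open, end-to-end reference stack for rapid prototyping of embedded optical glucose sensors.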

Link: https://arxiv.org/abs/2509.12253
Authors: Riyaadh Gani
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Non-invasive glucose monitors often fail outside the lab because existing datasets ignore hardware noise, environmental drift, and person-to-person physiology. We introduce the first ultra-realistic near-infrared (NIR) simulator that injects 12-bit ADC quantisation, +/-0.1% LED ageing, photodiode dark noise, 15-45 C temperature, 30-90% relative humidity, contact-pressure variation, Fitzpatrick I-VI melanin, and diurnal glucose excursions (dawn phenomenon). Using this platform (rho glucose-NIR = 0.21), we benchmark six methods: Enhanced Beer-Lambert (physics-engineered ridge regression), three physics-informed neural networks (PINNs), a selective radiative-transfer PINN, and a shallow DNN. Beer-Lambert achieves 13.6 mg/dL RMSE, 95.8% Clarke-A and 93.8% +/-15% accuracy with only 56 parameters and 0.01 ms inference, outperforming the best PINN (14.6 mg/dL) and the SDNN baseline (35.1 mg/dL). Results overturn the assumption that deeper PINNs dominate and supply an open, end-to-end reference stack for rapid prototyping of embedded optical glucose sensors.

Machine Learning

[LG-0] LLMs for energy and macronutrients estimation using only text data from 24-hour dietary recalls: a parameter-efficient fine-tuning experiment using a 10-shot prompt

Link: https://arxiv.org/abs/2509.13268
Authors: Rodrigo M Carrillo-Larco
Categories: Machine Learning (cs.LG)
Comments: this https URL

Abstract:BACKGROUND: Most artificial intelligence tools used to estimate nutritional content rely on image input. However, whether large language models (LLMs) can accurately predict nutritional values based solely on text descriptions of foods consumed remains unknown. If effective, this approach could enable simpler dietary monitoring without the need for photographs. METHODS: We used 24-hour dietary recalls from adolescents aged 12-19 years in the National Health and Nutrition Examination Survey (NHANES). An open-source quantized LLM was prompted using a 10-shot, chain-of-thought approach to estimate energy and five macronutrients based solely on text strings listing foods and their quantities. We then applied parameter-efficient fine-tuning (PEFT) to evaluate whether predictive accuracy improved. NHANES-calculated values served as the ground truth for energy, proteins, carbohydrates, total sugar, dietary fiber and total fat. RESULTS: In a pooled dataset of 11,281 adolescents (49.9% male, mean age 15.4 years), the vanilla LLM yielded poor predictions. The mean absolute error (MAE) was 652.08 for energy and the Lin’s CCC 0.46 across endpoints. In contrast, the fine-tuned model performed substantially better, with energy MAEs ranging from 171.34 to 190.90 across subsets, and Lin’s CCC exceeding 0.89 for all outcomes. CONCLUSIONS: When prompted using a chain-of-thought approach and fine-tuned with PEFT, open-source LLMs exposed solely to text input can accurately predict energy and macronutrient values from 24-hour dietary recalls. This approach holds promise for low-burden, text-based dietary monitoring tools.
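
The two metrics reported in the abstract can be computed as follows. This is a generic sketch of mean absolute error and Lin's concordance correlation coefficient (CCC), using the population (1/n) moments; the arrays are made-up values and are not NHANES data or the study's evaluation code.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and estimate."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def lins_ccc(y_true, y_pred):
    """Lin's CCC: 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)."""
    x = np.asarray(y_true, float)
    y = np.asarray(y_pred, float)
    mx, my = x.mean(), y.mean()
    cov = np.mean((x - mx) * (y - my))
    return float(2.0 * cov / (x.var() + y.var() + (mx - my) ** 2))

reference = np.array([1800.0, 2200.0, 2500.0, 3000.0])   # illustrative energy values
estimate = reference + 100.0                              # constant over-estimate
mae = mean_absolute_error(reference, estimate)
ccc = lins_ccc(reference, estimate)
```

Unlike Pearson correlation, which would be exactly 1 for this constant offset, the CCC is penalized by the squared difference in means, so it measures agreement with the identity line and not just linear association.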

[LG-1] Post-Hoc Split-Point Self-Consistency Verification for Efficient Unified Quantification of Aleatoric and Epistemic Uncertainty in Deep Learning

Link: https://arxiv.org/abs/2509.13262
Authors: Zhizhong Zhao,Ke Chen
Categories: Machine Learning (cs.LG)
Comments: 32 pages, 15 figures and 16 tables. Technical Report submitted to a journal for publication

Abstract:Uncertainty quantification (UQ) is vital for trustworthy deep learning, yet existing methods are either computationally intensive, such as Bayesian or ensemble methods, or provide only partial, task-specific estimates, such as single-forward-pass techniques. In this paper, we propose a post-hoc single-forward-pass framework that jointly captures aleatoric and epistemic uncertainty without modifying or retraining pretrained models. Our method applies Split-Point Analysis (SPA) to decompose predictive residuals into upper and lower subsets, computing Mean Absolute Residuals (MARs) on each side. We prove that, under ideal conditions, the total MAR equals the harmonic mean of subset MARs; deviations define a novel Self-consistency Discrepancy Score (SDS) for fine-grained epistemic estimation across regression and classification. For regression, side-specific quantile regression yields prediction intervals with improved empirical coverage, which are further calibrated via SDS. For classification, when calibration data are available, we apply SPA-based calibration identities to adjust the softmax outputs and then compute predictive entropy on these calibrated probabilities. Extensive experiments on diverse regression and classification benchmarks demonstrate that our framework matches or exceeds several state-of-the-art UQ methods while incurring minimal overhead. Our source code is available at this https URL.
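
The harmonic-mean identity in the abstract can be checked numerically. The sketch below assumes one condition under which it holds exactly: the deviations about the split point sum to zero (here, splitting at the sample mean), so the upper and lower absolute sums are equal. Whether this matches the paper's "ideal conditions" is an assumption, and the `gap` values are only an illustrative discrepancy, not the paper's SDS definition.

```python
import numpy as np

def split_point_mars(residuals, split):
    """Return (total MAR, upper MAR, lower MAR) of residuals about a split point."""
    d = np.asarray(residuals, dtype=float) - split
    upper, lower = d[d > 0], -d[d < 0]
    return float(np.abs(d).mean()), float(upper.mean()), float(lower.mean())

def harmonic_mean(a, b):
    return 2.0 * a * b / (a + b)

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.3, scale=1.0, size=1000)   # synthetic, not from the paper

# Split at the sample mean: deviations sum to zero, so the identity holds.
total, up, low = split_point_mars(residuals, residuals.mean())
gap_centered = abs(total - harmonic_mean(up, low))

# Split away from the mean: the identity breaks and a gap appears.
total_s, up_s, low_s = split_point_mars(residuals, residuals.mean() + 0.5)
gap_shifted = abs(total_s - harmonic_mean(up_s, low_s))
```

The reasoning is short: if both sides have the same absolute sum S, with n_up and n_low points, the total MAR is 2S / (n_up + n_low), which is exactly the harmonic mean of S / n_up and S / n_low.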

[LG-2] Don't Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning

Link: https://arxiv.org/abs/2509.13240
Authors: Bo Yin,Xingyi Yang,Xinchao Wang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt weight matrices while keeping activation functions fixed. We introduce NoRA, the first PEFT framework that directly adapts nonlinear activation functions in pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients, with a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters (0.02M), achieving accuracy gains of +0.17% and +0.27%. When combined with LoRA (NoRA++), it outperforms LoRA and DoRA under matched training budgets by adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, yielding average MMLU gains of +0.3%–0.8%, including +1.6% on STEM (Alpaca) and +1.3% on OpenOrca. We further show that NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction. These results establish activation-space tuning as a complementary and highly parameter-efficient alternative to weight-based PEFT, positioning activation functions as first-class objects for model adaptation.

[LG-3] On the Out-of-Distribution Backdoor Attack for Federated Learning

Link: https://arxiv.org/abs/2509.13219
Authors: Jiahao Xu,Zikai Zhang,Rui Hu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: To appear at MobiHoc 2025

Abstract:Traditional backdoor attacks in federated learning (FL) operate within constrained attack scenarios, as they depend on visible triggers and require physical modifications to the target object, which limits their practicality. To address this limitation, we introduce a novel backdoor attack prototype for FL called the out-of-distribution (OOD) backdoor attack (OBA), which uses OOD data as both poisoned samples and triggers simultaneously. Our approach significantly broadens the scope of backdoor attack scenarios in FL. To improve the stealthiness of OBA, we propose SoDa, which regularizes both the magnitude and direction of malicious local models during local training, aligning them closely with their benign versions to evade detection. Empirical results demonstrate that OBA effectively circumvents state-of-the-art defenses while maintaining high accuracy on the main task. To address this security vulnerability in the FL system, we introduce BNGuard, a new server-side defense method tailored against SoDa. BNGuard leverages the observation that OOD data causes significant deviations in the running statistics of batch normalization layers. This allows BNGuard to identify malicious model updates and exclude them from aggregation, thereby enhancing the backdoor robustness of FL. Extensive experiments across various settings show the effectiveness of BNGuard on defending against SoDa. The code is available at this https URL.

[LG-4] FOSSIL: Regret-minimizing weighting for robust learning under imbalance and small data ICLR2025

Link: https://arxiv.org/abs/2509.13218
Authors: J. Cha(Gwinnett Technical College),J. Lee(Intel Corporation),J. Cho(Prairie View A&M University),J. Shin(Ohio State University)
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 6 figures, submitted to ICLR 2025

Abstract:Imbalanced and small data regimes are pervasive in domains such as rare disease imaging, genomics, and disaster response, where labeled samples are scarce and naive augmentation often introduces artifacts. Existing solutions such as oversampling, focal loss, or meta-weighting address isolated aspects of this challenge but remain fragile or complex. We introduce FOSSIL (Flexible Optimization via Sample Sensitive Importance Learning), a unified weighting framework that seamlessly integrates class imbalance correction, difficulty-aware curricula, augmentation penalties, and warmup dynamics into a single interpretable formula. Unlike prior heuristics, the proposed framework provides regret-based theoretical guarantees and achieves consistent empirical gains over ERM, curriculum, and meta-weighting baselines on synthetic and real-world datasets, while requiring no architectural changes.

[LG-5] Density-Aware Farthest Point Sampling

Link: https://arxiv.org/abs/2509.13213
Authors: Paolo Climaco,Jochen Garcke
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 2 figures

Abstract:We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set, a quantity we can estimate simply by considering the data features. We introduce “Density-Aware Farthest Point Sampling” (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
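
For context, the classical farthest point sampling (FPS) procedure that DA-FPS builds on can be sketched as below, together with the (unweighted) fill distance of the selected set. This is plain FPS under a Euclidean metric; the density weighting that defines DA-FPS is not described in the abstract and is deliberately not reproduced. The function name and the synthetic data are illustrative.

```python
import numpy as np

def farthest_point_sampling(X, k, start=0):
    """Greedy FPS: repeatedly add the point farthest from the selected set.

    Returns the selected indices and the fill distance, i.e. the largest
    distance from any point in X to its nearest selected point.
    """
    selected = [start]
    dist = np.linalg.norm(X - X[start], axis=1)    # distance to the selected set
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                 # farthest remaining point
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected), float(dist.max())

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                     # synthetic feature vectors
idx_10, fill_10 = farthest_point_sampling(X, 10)
idx_40, fill_40 = farthest_point_sampling(X, 40)
```

Because the selection is greedy, the first 10 indices of the 40-point run coincide with the 10-point run, and the fill distance can only shrink as more points are added. That quantity is what the paper's error bound depends on, in its weighted form.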

[LG-6] HAM: Hierarchical Adapter Merging for Scalable Continual Learning

Link: https://arxiv.org/abs/2509.13211
Authors: Eric Nuertey Coleman,Luigi Quarantiello,Samrat Mukherjee,Julio Hurtado,Vincenzo Lomonaco
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Continual learning is an essential capability of human cognition, yet it poses significant challenges for current deep learning models. The primary issue is that new knowledge can interfere with previously learned information, causing the model to forget earlier knowledge in favor of the new, a phenomenon known as catastrophic forgetting. Although large pre-trained models can partially mitigate forgetting by leveraging their existing knowledge and over-parameterization, they often struggle when confronted with novel data distributions. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, enable efficient adaptation to new knowledge. However, they still face challenges in scaling to dynamic learning scenarios and long sequences of tasks, as maintaining one adapter per task introduces complexity and increases the potential for interference. In this paper, we introduce Hierarchical Adapters Merging (HAM), a novel framework that dynamically combines adapters from different tasks during training. This approach enables HAM to scale effectively, allowing it to manage more tasks than competing baselines with improved efficiency. To achieve this, HAM maintains a fixed set of groups that hierarchically consolidate new adapters. For each task, HAM trains a low-rank adapter along with an importance scalar, then dynamically groups tasks based on adapter similarity. Within each group, adapters are pruned, scaled, and merged, facilitating transfer learning between related tasks. Extensive experiments on three vision benchmarks show that HAM significantly outperforms state-of-the-art methods, particularly as the number of tasks increases.

[LG-7] TRUST-FS: Tensorized Reliable Unsupervised Multi-View Feature Selection for Incomplete Data

Link: https://arxiv.org/abs/2509.13192
Authors: Minghui Lu,Yanyong Huang,Minbo Ma,Dongjie Wang,Xiuwen Yi,Tianrui Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multi-view unsupervised feature selection (MUFS), which selects informative features from multi-view unlabeled data, has attracted increasing research interest in recent years. Although great efforts have been devoted to MUFS, several challenges remain: 1) existing methods for incomplete multi-view data are limited to handling missing views and are unable to address the more general scenario of missing variables, where some features have missing values in certain views; 2) most methods address incomplete data by first imputing missing values and then performing feature selection, treating these two processes independently and overlooking their interactions; 3) missing data can result in an inaccurate similarity graph, which reduces the performance of feature selection. To solve this dilemma, we propose a novel MUFS method for incomplete multi-view data with missing variables, termed Tensorized Reliable UnSupervised mulTi-view Feature Selection (TRUST-FS). TRUST-FS introduces a new adaptive-weighted CP decomposition that simultaneously performs feature selection, missing-variable imputation, and view weight learning within a unified tensor factorization framework. By utilizing Subjective Logic to acquire trustworthy cross-view similarity information, TRUST-FS facilitates learning a reliable similarity graph, which subsequently guides feature selection and imputation. Comprehensive experimental results demonstrate the effectiveness and superiority of our method over state-of-the-art methods.

[LG-8] Efficient Cold-Start Recommendation via BPE Token-Level Embedding Initialization with LLM

Link: https://arxiv.org/abs/2509.13179
Authors: Yushang Zhao,Xinyue Han,Qian Leng,Qianyi Sun,Haotian Lyu,Chengrui Zhou
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:The cold-start issue is the challenge when we talk about recommender systems, especially in the case when we do not have the past interaction data of new users or new items. Content-based features or hybrid solutions are common as conventional solutions, but they can only work in a sparse metadata environment with shallow patterns. In this paper, the efficient cold-start recommendation strategy is presented, which is based on the sub word-level representations by applying Byte Pair Encoding (BPE) tokenization and pre-trained Large Language Model (LLM) embedding in the initialization procedure. We obtain fine-grained token-level vectors that are aligned with the BPE vocabulary as opposed to using coarse-grained sentence embeddings. Together, these token embeddings can be used as dense semantic priors on unseen entities, making immediate recommendation performance possible without user-item interaction history. Our mechanism can be compared to collaborative filtering systems and tested over benchmark datasets with stringent cold-start assumptions. Experimental findings show that the given BPE-LLM method achieves higher Recall@k, NDCG@k, and Hit Rate measurements compared to the standard baseline and displays the same capability of sufficient computational performance. Furthermore, we demonstrate that using subword-aware embeddings yields better generalizability and is more interpretable, especially within a multilingual and sparse input setting. The practical application of token-level semantic initialization as a lightweight, but nevertheless effective extension to modern recommender systems in the zero-shot setting is indicated within this work.

[LG-9] CoVariance Filters and Neural Networks over Hilbert Spaces

Link: https://arxiv.org/abs/2509.13178
Authors: Claudio Battiloro,Andrea Cavallo,Elvin Isufi
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 6 pages, 3 figures

Abstract:CoVariance Neural Networks (VNNs) perform graph convolutions on the empirical covariance matrix of signals defined over finite-dimensional Hilbert spaces, motivated by robustness and transferability properties. Yet, little is known about how these arguments extend to infinite-dimensional Hilbert spaces. In this work, we take a first step by introducing a novel convolutional learning framework for signals defined over infinite-dimensional Hilbert spaces, centered on the (empirical) covariance operator. We constructively define Hilbert coVariance Filters (HVFs) and design Hilbert coVariance Networks (HVNs) as stacks of HVF filterbanks with nonlinear activations. We propose a principled discretization procedure, and we prove that empirical HVFs can recover the Functional PCA (FPCA) of the filtered signals. We then describe the versatility of our framework with examples ranging from multivariate real-valued functions to reproducing kernel Hilbert spaces. Finally, we validate HVNs on both synthetic and real-world time-series classification tasks, showing robust performance compared to MLP and FPCA-based classifiers.
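
To make the finite-dimensional starting point concrete, the sketch below applies a polynomial filter in the sample covariance matrix, H(C) x = Σ_k h_k C^k x, which is the usual form of a graph convolution with the covariance matrix playing the role of the graph shift operator. That this matches the paper's filters in detail is an assumption, and the infinite-dimensional Hilbert-space construction is not attempted; `covariance_filter` and the data are illustrative.

```python
import numpy as np

def covariance_filter(X, x, taps):
    """Apply H(C) x = sum_k taps[k] * C^k x, with C the sample covariance of X.

    X has shape (n_samples, n_features); x has shape (n_features,).
    """
    C = np.cov(X, rowvar=False)
    out = np.zeros(x.shape, dtype=float)
    power = np.asarray(x, dtype=float).copy()   # holds C^k x, starting at k = 0
    for h in taps:
        out += h * power
        power = C @ power
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                   # synthetic dataset
x = rng.normal(size=5)                          # signal to be filtered
taps = [0.5, 1.0, -0.2]                         # hypothetical filter coefficients
y = covariance_filter(X, x, taps)
```

In the eigenbasis of C, the filter scales each principal component of `x` by the polynomial Σ_k h_k λ^k evaluated at the corresponding eigenvalue λ, which is the link to PCA that the FPCA recovery result in the abstract extends to functional data.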

[LG-10] Concentration inequalities for semidefinite least squares based on data

Link: https://arxiv.org/abs/2509.13166
Authors: Filippo Fabiani,Andrea Simonetto
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
Comments:

Abstract:We study data-driven least squares (LS) problems with semidefinite (SD) constraints and derive finite-sample guarantees on the spectrum of their optimal solutions when these constraints are relaxed. In particular, we provide a high confidence bound allowing one to solve a simpler program in place of the full SDLS problem, while ensuring that the eigenvalues of the resulting solution are ε-close to those enforced by the SD constraints. The developed certificate, which consistently shrinks as the number of data increases, turns out to be easy-to-compute, distribution-free, and only requires independent and identically distributed samples. Moreover, when the SDLS is used to learn an unknown quadratic function, we establish bounds on the error between a gradient descent iterate minimizing the surrogate cost obtained with no SD constraints and the true minimizer.

[LG-11] Learning from Heterophilic Graphs: A Spectral Theory Perspective on the Impact of Self-Loops and Parallel Edges

Link: https://arxiv.org/abs/2509.13139
Authors: Kushal Bose,Swagatam Das
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph heterophily poses a formidable challenge to the performance of Message-passing Graph Neural Networks (MP-GNNs). The familiar low-pass filters like Graph Convolutional Networks (GCNs) face performance degradation, which can be attributed to the blending of the messages from dissimilar neighboring nodes. The performance of the low-pass filters on heterophilic graphs still requires an in-depth analysis. In this context, we update the heterophilic graphs by adding a number of self-loops and parallel edges. We observe that eigenvalues of the graph Laplacian decrease and increase respectively by increasing the number of self-loops and parallel edges. We conduct several studies regarding the performance of GCN on various benchmark heterophilic networks by adding either self-loops or parallel edges. The studies reveal that the GCN exhibited either increasing or decreasing performance trends on adding self-loops and parallel edges. In light of the studies, we established connections between the graph spectra and the performance trends of the low-pass filters on the heterophilic graphs. The graph spectra characterize the essential intrinsic properties of the input graph like the presence of connected components, sparsity, average degree, cluster structures, etc. Our work is adept at seamlessly evaluating graph spectrum and properties by observing the performance trends of the low-pass filters without pursuing the costly eigenvalue decomposition. The theoretical foundations are also discussed to validate the impact of adding self-loops and parallel edges on the graph spectrum.
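
The spectral effect described in the abstract can be observed on a toy graph. The sketch below assumes the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}, with self-loops counted once in the degree and parallel edges modelled as an integer multiplier on the adjacency matrix; the paper may use a different convention, and the 4-cycle is only an illustrative example.

```python
import numpy as np

def normalized_laplacian_eigs(A, self_loops=0.0, multiplicity=1.0):
    """Eigenvalues of I - D^{-1/2} A' D^{-1/2}, where A' = multiplicity*A + self_loops*I."""
    n = A.shape[0]
    A_mod = multiplicity * A + self_loops * np.eye(n)
    scale = 1.0 / np.sqrt(A_mod.sum(axis=1))
    L = np.eye(n) - scale[:, None] * A_mod * scale[None, :]
    return np.linalg.eigvalsh(L)

# Undirected 4-cycle: 0-1-2-3-0.
cycle4 = np.array([[0, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0]], dtype=float)

base = normalized_laplacian_eigs(cycle4).max()
looped = normalized_laplacian_eigs(cycle4, self_loops=1.0).max()
doubled = normalized_laplacian_eigs(cycle4, self_loops=1.0, multiplicity=2.0).max()
```

For this 4-cycle the largest eigenvalue is 2 with no self-loops, 4/3 after adding one self-loop per node, and 8/5 once every edge is doubled while the self-loops are kept. This matches the direction of the trend stated in the abstract: self-loops shrink the spectrum and parallel edges push it back up.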

[LG-12] Curriculum Learning for Mesh-based simulations

Link: https://arxiv.org/abs/2509.13138
Authors: Paul Garnier, Vincent Lannelongue, Elie Hachem
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Graph neural networks (GNNs) have emerged as powerful surrogates for mesh-based computational fluid dynamics (CFD), but training them on high-resolution unstructured meshes with hundreds of thousands of nodes remains prohibitively expensive. We study a coarse-to-fine curriculum that accelerates convergence by first training on very coarse meshes and then progressively introducing medium and high resolutions (up to $3\times10^5$ nodes). Unlike multiscale GNN architectures, the model itself is unchanged; only the fidelity of the training data varies over time. We achieve comparable generalization accuracy while reducing total wall-clock time by up to 50%. Furthermore, on datasets where our model lacks the capacity to learn the underlying physics, using curriculum learning enables it to break through plateaus.
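
The coarse-to-fine idea in this abstract can be illustrated with a small scheduling sketch. This is only an assumption about how such a curriculum might be organized: the function name `resolution_for_epoch`, the three stage labels, and the equal-length hard switch between stages are hypothetical and are not taken from the paper, which may mix resolutions differently.

```python
def resolution_for_epoch(epoch, total_epochs, stages=("coarse", "medium", "fine")):
    """Pick the mesh-resolution stage for a given epoch (hypothetical schedule).

    Training time is split into equal-length stages, ordered from the
    cheapest (coarsest) meshes to the most expensive (finest) ones.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs)")
    stage_length = total_epochs / len(stages)
    index = min(int(epoch / stage_length), len(stages) - 1)
    return stages[index]


# The model is the same in every stage; only the training data changes.
schedule = [resolution_for_epoch(e, total_epochs=9) for e in range(9)]
```

In this sketch the early epochs are cheap because they only touch coarse meshes, which is where any wall-clock saving would come from.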

[LG-13] Discovering Mathematical Equations with Diffusion Language Model

Link: https://arxiv.org/abs/2509.13136
Authors: Xiaoxu Han, Chengzhen Ning, Jinghui Zhong, Fubiao Yang, Yu Wang, Xin Mu
Categories: Machine Learning (cs.LG)

Abstract:Discovering valid and meaningful mathematical equations from observed data plays a crucial role in scientific discovery. However, this task, known as symbolic regression, remains challenging due to the vast search space and the trade-off between accuracy and complexity. In this paper, we introduce DiffuSR, a pre-training framework for symbolic regression built upon a continuous-state diffusion language model. DiffuSR employs a trainable embedding layer within the diffusion process to map discrete mathematical symbols into a continuous latent space, modeling equation distributions effectively. Through iterative denoising, DiffuSR converts an initial noisy sequence into a symbolic equation, guided by numerical data injected via a cross-attention mechanism. We also design an effective inference strategy to enhance the accuracy of the diffusion-based equation generator, which injects logit priors into genetic programming. Experimental results on standard symbolic regression benchmarks demonstrate that DiffuSR achieves competitive performance with state-of-the-art autoregressive methods and generates more interpretable and diverse mathematical expressions.

[LG-14] Sublinear-Time Algorithms for Diagonally Dominant Systems and Applications to the Friedkin-Johnsen Model

Link: https://arxiv.org/abs/2509.13112
Authors: Weiming Feng, Zelin Li, Pan Peng
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:We study sublinear-time algorithms for solving linear systems $Sz = b$, where $S$ is a diagonally dominant matrix, i.e., $|S_{ii}| \geq \delta + \sum_{j \ne i} |S_{ij}|$ for all $i \in [n]$, for some $\delta \geq 0$. We present randomized algorithms that, for any $u \in [n]$, return an estimate $z_u$ of $z^*_u$ with additive error $\varepsilon$ or $\varepsilon \lVert z^* \rVert_\infty$, where $z^*$ is some solution to $Sz^* = b$, and the algorithm only needs to read a small portion of the input $S$ and $b$. For example, when the additive error is $\varepsilon$ and assuming $\delta > 0$, we give an algorithm that runs in time $O\left( \frac{\|b\|_\infty^2 S_{\max}}{\delta^3 \varepsilon^2} \log \frac{\|b\|_\infty}{\delta \varepsilon} \right)$, where $S_{\max} = \max_{i \in [n]} |S_{ii}|$. We also prove a matching lower bound, showing that the linear dependence on $S_{\max}$ is optimal. Unlike previous sublinear-time algorithms, which apply only to symmetric diagonally dominant matrices with non-negative diagonal entries, our algorithm works for general strictly diagonally dominant matrices ($\delta > 0$) and a broader class of non-strictly diagonally dominant matrices ($\delta = 0$). Our approach is based on analyzing a simple probabilistic recurrence satisfied by the solution. As an application, we obtain an improved sublinear-time algorithm for opinion estimation in the Friedkin–Johnsen model.
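
To make the diagonal-dominance condition in this abstract concrete, the sketch below computes the largest margin delta for which every row i satisfies |S_ii| >= delta + (sum over j != i of |S_ij|). The helper name `dominance_margin` and the example matrix are illustrative assumptions. The code only checks the definition, and it reads the whole matrix, so it is not the sublinear-time solver described in the paper.

```python
def dominance_margin(S):
    """Return the largest delta with |S_ii| >= delta + sum_{j != i} |S_ij| for all rows.

    A positive result means S is strictly diagonally dominant, zero means
    non-strictly dominant, and a negative result means not dominant.
    """
    margins = []
    for i, row in enumerate(S):
        off_diagonal = sum(abs(value) for j, value in enumerate(row) if j != i)
        margins.append(abs(row[i]) - off_diagonal)
    return min(margins)


# Non-symmetric example with a negative diagonal entry, a case the abstract
# says earlier symmetric-only methods do not cover.
S = [
    [4.0, -1.0, 1.0],
    [2.0, -5.0, 1.0],
    [0.0, 1.0, 3.0],
]
delta = dominance_margin(S)
```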

[LG-15] Traces Propagation: Memory-Efficient and Scalable Forward-Only Learning in Spiking Neural Networks

Link: https://arxiv.org/abs/2509.13053
Authors: Lorenzo Pes, Bojian Yin, Sander Stuijk, Federico Corradi
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Spiking Neural Networks (SNNs) provide an efficient framework for processing dynamic spatio-temporal signals and for investigating the learning principles underlying biological neural systems. A key challenge in training SNNs is to solve both spatial and temporal credit assignment. The dominant approach for training SNNs is Backpropagation Through Time (BPTT) with surrogate gradients. However, BPTT is in stark contrast with the spatial and temporal locality observed in biological neural systems and leads to high computational and memory demands, limiting efficient training strategies and on-device learning. Although existing local learning rules achieve local temporal credit assignment by leveraging eligibility traces, they fail to address the spatial credit assignment without resorting to auxiliary layer-wise matrices, which increase memory overhead and hinder scalability, especially on embedded devices. In this work, we propose Traces Propagation (TP), a forward-only, memory-efficient, scalable, and fully local learning rule that combines eligibility traces with a layer-wise contrastive loss without requiring auxiliary layer-wise matrices. TP outperforms other fully local learning rules on NMNIST and SHD datasets. On more complex datasets such as DVS-GESTURE and DVS-CIFAR10, TP showcases competitive performance and scales effectively to deeper SNN architectures such as VGG-9, while providing favorable memory scaling compared to prior fully local scalable rules, for datasets with a significant number of classes. Finally, we show that TP is well suited for practical fine-tuning tasks, such as keyword spotting on the Google Speech Commands dataset, thus paving the way for efficient learning at the edge.

[LG-16] Spiking Vocos: An Energy-Efficient Neural Vocoder

Link: https://arxiv.org/abs/2509.13049
Authors: Yukun Chen, Zhaoxi Mu, Andong Li, Peilin Li, Xinyu Yang
Categories: Machine Learning (cs.LG)

Abstract:Despite the remarkable progress in the synthesis speed and fidelity of neural vocoders, their high energy consumption remains a critical barrier to practical deployment on computationally restricted edge devices. Spiking Neural Networks (SNNs), widely recognized for their high energy efficiency due to their event-driven nature, offer a promising solution for low-resource scenarios. In this paper, we propose Spiking Vocos, a novel spiking neural vocoder with ultra-low energy consumption, built upon the efficient Vocos framework. To mitigate the inherent information bottleneck in SNNs, we design a Spiking ConvNeXt module to reduce Multiply-Accumulate (MAC) operations and incorporate an amplitude shortcut path to preserve crucial signal dynamics. Furthermore, to bridge the performance gap with its Artificial Neural Network (ANN) counterpart, we introduce a self-architectural distillation strategy to effectively transfer knowledge. A lightweight Temporal Shift Module is also integrated to enhance the model’s ability to fuse information across the temporal dimension with negligible computational overhead. Experiments demonstrate that our model achieves performance comparable to its ANN counterpart, with UTMOS and PESQ scores of 3.74 and 3.45 respectively, while consuming only 14.7% of the energy. The source code is available at this https URL.

[LG-17] ReTrack: Data Unlearning in Diffusion Models through Redirecting the Denoising Trajectory

Link: https://arxiv.org/abs/2509.13007
Authors: Qitan Shi, Cheng Jin, Jiawei Zhang, Yuantao Gu
Categories: Machine Learning (cs.LG)

Abstract:Diffusion models excel at generating high-quality, diverse images but suffer from training data memorization, raising critical privacy and safety concerns. Data unlearning has emerged to mitigate this issue by removing the influence of specific data without retraining from scratch. We propose ReTrack, a fast and effective data unlearning method for diffusion models. ReTrack employs importance sampling to construct a more efficient fine-tuning loss, which we approximate by retaining only dominant terms. This yields an interpretable objective that redirects denoising trajectories toward the $k$-nearest neighbors, enabling efficient unlearning while preserving generative quality. Experiments on MNIST T-Shirt, CelebA-HQ, CIFAR-10, and Stable Diffusion show that ReTrack achieves state-of-the-art performance, striking the best trade-off between unlearning strength and generation quality preservation.

[LG-18] Ensemble Visualization With Variational Autoencoder

Link: https://arxiv.org/abs/2509.13000
Authors: Cenyang Wu, Qinhan Yu, Liang Zhou
Categories: Machine Learning (cs.LG)
Comments: Accepted by the IEEE Workshop on Uncertainty Visualization

Abstract:We present a new method to visualize data ensembles by constructing structured probabilistic representations in latent spaces, i.e., lower-dimensional representations of spatial data features. Our approach transforms the spatial features of an ensemble into a latent space through feature space conversion and unsupervised learning using a variational autoencoder (VAE). The resulting latent spaces follow multivariate standard Gaussian distributions, enabling analytical computation of confidence intervals and density estimation of the probabilistic distribution that generates the data ensemble. Preliminary results on a weather forecasting ensemble demonstrate the effectiveness and versatility of our method.

[LG-19] Causal Discovery via Quantile Partial Effect

Link: https://arxiv.org/abs/2509.12981
Authors: Yikang Chen, Xingzhe Sun, Dehui Du
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 29 pages, 6 figures

Abstract:Quantile Partial Effect (QPE) is a statistic associated with conditional quantile regression, measuring the effect of covariates at different levels. Our theory demonstrates that when the QPE of cause on effect is assumed to lie in a finite linear span, cause and effect are identifiable from their observational distribution. This generalizes previous identifiability results based on Functional Causal Models (FCMs) with additive, heteroscedastic noise, etc. Meanwhile, since QPE resides entirely at the observational level, this parametric assumption does not require considering mechanisms, noise, or even the Markov assumption, but rather directly utilizes the asymmetry of shape characteristics in the observational distribution. By performing basis function tests on the estimated QPE, causal directions can be distinguished, which is empirically shown to be effective in experiments on a large number of bivariate causal discovery datasets. For multivariate causal discovery, leveraging the close connection between QPE and score functions, we find that Fisher Information is sufficient as a statistical measure to determine causal order when assumptions are made about the second moment of QPE. We validate the feasibility of using Fisher Information to identify causal order on multiple synthetic and real-world multivariate causal discovery datasets.

[LG-20] BAPFL: Exploring Backdoor Attacks Against Prototype-based Federated Learning

Link: https://arxiv.org/abs/2509.12964
Authors: Honghong Zeng, Jiong Lou, Zhe Wang, Hefeng Zhou, Chentao Wu, Wei Zhao, Jie Li
Categories: Machine Learning (cs.LG)

Abstract:Prototype-based federated learning (PFL) has emerged as a promising paradigm to address data heterogeneity problems in federated learning, as it leverages mean feature vectors as prototypes to enhance model generalization. However, its robustness against backdoor attacks remains largely unexplored. In this paper, we identify that PFL is inherently resistant to existing backdoor attacks due to its unique prototype learning mechanism and local data heterogeneity. To further explore the security of PFL, we propose BAPFL, the first backdoor attack method specifically designed for PFL frameworks. BAPFL integrates a prototype poisoning strategy with a trigger optimization mechanism. The prototype poisoning strategy manipulates the trajectories of global prototypes to mislead the prototype training of benign clients, pushing their local prototypes of clean samples away from the prototypes of trigger-embedded samples. Meanwhile, the trigger optimization mechanism learns a unique and stealthy trigger for each potential target label, and guides the prototypes of trigger-embedded samples to align closely with the global prototype of the target label. Experimental results across multiple datasets and PFL variants demonstrate that BAPFL achieves a 35%-75% improvement in attack success rate compared to traditional backdoor attacks, while preserving main task accuracy. These results highlight the effectiveness, stealthiness, and adaptability of BAPFL in PFL.

[LG-21] Spatiotemporal graph neural process for reconstruction, extrapolation, and classification of cardiac trajectories

Link: https://arxiv.org/abs/2509.12953
Authors: Jaume Banus, Augustin C. Ogier, Roger Hullin, Philippe Meyer, Ruud B. van Heeswijk, Jonas Richiardi
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Quantitative Methods (q-bio.QM)

Abstract:We present a probabilistic framework for modeling structured spatiotemporal dynamics from sparse observations, focusing on cardiac motion. Our approach integrates neural ordinary differential equations (NODEs), graph neural networks (GNNs), and neural processes into a unified model that captures uncertainty, temporal continuity, and anatomical structure. We represent dynamic systems as spatiotemporal multiplex graphs and model their latent trajectories using a GNN-parameterized vector field. Given the sparse context observations at node and edge levels, the model infers a distribution over latent initial states and control variables, enabling both interpolation and extrapolation of trajectories. We validate the method on three synthetic dynamical systems (coupled pendulum, Lorenz attractor, and Kuramoto oscillators) and two real-world cardiac imaging datasets - ACDC (N=150) and UK Biobank (N=526) - demonstrating accurate reconstruction, extrapolation, and disease classification capabilities. The model accurately reconstructs trajectories and extrapolates future cardiac cycles from a single observed cycle. It achieves state-of-the-art results on the ACDC classification task (up to 99% accuracy), and detects atrial fibrillation in UK Biobank subjects with competitive performance (up to 67% accuracy). This work introduces a flexible approach for analyzing cardiac motion and offers a foundation for graph-based learning in structured biomedical spatiotemporal time-series data.

[LG-22] Soft Gradient Boosting with Learnable Feature Transforms for Sequential Regression

Link: https://arxiv.org/abs/2509.12920
Authors: Huseyin Karaca, Suleyman Serdar Kozat
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:We propose a soft gradient boosting framework for sequential regression that embeds a learnable linear feature transform within the boosting procedure. At each boosting iteration, we train a soft decision tree and learn a linear input feature transform Q together. This approach is particularly advantageous in high-dimensional, data-scarce scenarios, as it discovers the most relevant input representations while boosting. We demonstrate, using both synthetic and real-world datasets, that our method effectively and efficiently increases the performance by an end-to-end optimization of feature selection/transform and boosting while avoiding overfitting. We also extend our algorithm to differentiable non-linear transforms if overfitting is not a problem. To support reproducibility and future work, we share our code publicly.

[LG-23] Reversible Deep Equilibrium Models

Link: https://arxiv.org/abs/2509.12917
Authors: Sam McCallum, Kamran Arora, James Foster
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Deep Equilibrium Models (DEQs) are an interesting class of implicit model where the model output is implicitly defined as the fixed point of a learned function. These models have been shown to outperform explicit (fixed-depth) models in large-scale tasks by trading many deep layers for a single layer that is iterated many times. However, gradient calculation through DEQs is approximate. This often leads to unstable training dynamics and requires regularisation or many function evaluations to fix. Here, we introduce Reversible Deep Equilibrium Models (RevDEQs) that allow for exact gradient calculation, no regularisation and far fewer function evaluations than DEQs. We show that RevDEQs achieve state-of-the-art performance on language modelling and image classification tasks against comparable implicit and explicit models.
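
The phrase "the model output is implicitly defined as the fixed point of a learned function" can be illustrated with a scalar toy example. The sketch below uses plain fixed-point iteration on a hypothetical one-parameter layer `layer(z, x) = tanh(w * z + x)`; the names, the weight `w = 0.5`, and the tolerance are assumptions. It does not implement the reversible solver or the exact-gradient computation of RevDEQs, only the definition of a DEQ output.

```python
import math


def fixed_point(f, x, z0=0.0, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z, x) until successive iterates differ by less than tol."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z, x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


def layer(z, x, w=0.5):
    """Hypothetical single 'layer'; |w| < 1 makes it a contraction in z."""
    return math.tanh(w * z + x)


# The "output" of this toy implicit model for input x = 0.3.
z_star = fixed_point(layer, x=0.3)
```

Because the toy layer is a contraction, repeated application converges to a unique `z_star` with `layer(z_star, x) == z_star` up to the tolerance; real DEQs typically use faster root-finding schemes.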

[LG-24] TimeCluster with PCA is Equivalent to Subspace Identification of Linear Dynamical Systems

Link: https://arxiv.org/abs/2509.12895
Authors: Christian L. Hines, Samuel Spillard, Daniel P. Martin
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: 15 pages, 9 figures

Abstract:TimeCluster is a visual analytics technique for discovering structure in long multivariate time series by projecting overlapping windows of data into a low-dimensional space. We show that, when Principal Component Analysis (PCA) is chosen as the dimensionality reduction technique, this procedure is mathematically equivalent to classical linear subspace identification (block-Hankel matrix plus Singular Value Decomposition (SVD)). In both approaches, the same low-dimensional linear subspace is extracted from the time series data. We first review the TimeCluster method and the theory of subspace system identification. Then we show that forming the sliding-window matrix of a time series yields a Hankel matrix, so applying PCA (via SVD) to this matrix recovers the same principal directions as subspace identification. Thus the cluster coordinates from TimeCluster coincide with those obtained by subspace identification methods. We present experiments on synthetic and real dynamical signals confirming that the two embeddings coincide. Finally, we explore and discuss future opportunities enabled by this equivalence, including forecasting from the identified state space, streaming/online extensions, incorporating and visualising external inputs and robust techniques for displaying underlying trends in corrupted data.
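
The claim that the sliding-window matrix of a time series is a Hankel matrix can be checked directly on a small example. The sketch below uses a univariate series and a stride of one, both of which are simplifying assumptions (the abstract concerns multivariate series, where the structure is block-Hankel); the function names are hypothetical.

```python
def sliding_windows(series, width):
    """Stack all overlapping windows of the given width as rows (stride 1)."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def is_hankel(matrix):
    """A matrix is Hankel when each entry equals its upper-right neighbour,
    i.e. entries are constant along every anti-diagonal."""
    rows, cols = len(matrix), len(matrix[0])
    return all(
        matrix[i][j] == matrix[i - 1][j + 1]
        for i in range(1, rows)
        for j in range(cols - 1)
    )


series = [1, 4, 9, 16, 25, 36]
H = sliding_windows(series, width=3)
```

Applying PCA to the rows of `H` then amounts to an SVD of a (centred) Hankel matrix, which is the step that the abstract identifies with subspace identification.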

[LG-25] Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation

Link: https://arxiv.org/abs/2509.12880
Authors: Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Presented at the Context-Awareness in HRI (CONAWA) Workshop, ACM/IEEE International Conference on Human-Robot Interaction (HRI 2022), March 7, 2022

Abstract:Pointing is a key mode of interaction with robots, yet most prior work has focused on recognition rather than generation. We present a motion capture dataset of human pointing gestures covering diverse styles, handedness, and spatial targets. Using reinforcement learning with motion imitation, we train policies that reproduce human-like pointing while maximizing precision. Results show our approach enables context-aware pointing behaviors in simulation, balancing task performance with natural dynamics.

[LG-26] Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

Link: https://arxiv.org/abs/2509.12833
Authors: Hannah Markgraf, Shamburaj Sawant, Hanna Krasowski, Lukas Schäfer, Sebastien Gros, Matthias Althoff
Categories: Machine Learning (cs.LG)

Abstract:Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe environment RL (SE-RL), where the safeguard is treated as part of the environment, and safe policy RL (SP-RL), where it is embedded within the policy through differentiable optimization layers. Despite their practical relevance in safety-critical settings, a formal understanding of their differences is lacking. In this work, we present a theoretical comparison of SE-RL and SP-RL. We identify a key distinction in how each approach is affected by action aliasing, a phenomenon in which multiple unsafe actions are projected to the same safe action, causing information loss in the policy gradients. In SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it manifests directly as rank-deficient Jacobians during backpropagation through the safeguard. Our contributions are threefold: (i) a unified formalization of SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical analysis of their respective policy gradient estimates, highlighting the role of action aliasing, and (iii) a comparative study of mitigation strategies, including a novel penalty-based improvement for SP-RL that aligns with established SE-RL practices. Empirical results support our theoretical predictions, showing that action aliasing is more detrimental for SP-RL than for SE-RL. However, with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across a range of environments. These findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics.
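
Action aliasing, as described in this abstract, is easy to reproduce when the safe set is an axis-aligned box, because Euclidean projection onto a box reduces to coordinate-wise clipping. The box-shaped safe set and the function name `project_to_box` are assumptions made for illustration; the paper's safe sets and safeguards may be more general.

```python
def project_to_box(action, low, high):
    """Euclidean projection of an action onto the box [low, high],
    which for a box is simply clipping each coordinate."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


low, high = [-1.0, -1.0], [1.0, 1.0]

# Two different unsafe actions ...
safe_1 = project_to_box([2.0, 0.5], low, high)
safe_2 = project_to_box([5.0, 0.5], low, high)
# ... are mapped to the same safe action: the projection discards how far
# the first coordinate overshoots the bound.
aliased = safe_1 == safe_2
```

In the clipped coordinate the projection is locally constant, so its derivative there is zero. This is one way to see the rank-deficient Jacobians that the abstract attributes to backpropagating through the safeguard in SP-RL.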

[LG-27] Energy-Efficient Quantized Federated Learning for Resource-constrained IoT devices

Link: https://arxiv.org/abs/2509.12814
Authors: Wilfrid Sougrinoma Compaoré, Yaya Etiabi, El Mehdi Amhoud, Mohamad Assaad
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages, accepted at IEEE PIMRC 2025

Abstract:Federated Learning (FL) has emerged as a promising paradigm for enabling collaborative machine learning while preserving data privacy, making it particularly suitable for Internet of Things (IoT) environments. However, resource-constrained IoT devices face significant challenges due to limited energy, unreliable communication channels, and the impracticality of assuming infinite blocklength transmission. This paper proposes a federated learning framework for IoT networks that integrates finite blocklength transmission, model quantization, and an error-aware aggregation mechanism to enhance energy efficiency and communication reliability. The framework also optimizes uplink transmission power to balance energy savings and model performance. Simulation results demonstrate that the proposed approach reduces energy consumption by up to 75% compared to a standard FL model, while maintaining robust model accuracy, making it a viable solution for FL in real-world IoT scenarios with constrained resources. This work paves the way for efficient and reliable FL implementations in practical IoT deployments. Index Terms: Federated learning, IoT, finite blocklength, quantization, energy efficiency.

[LG-28] Spatio-temporal DeepKriging in PyTorch: A Supplementary Application to Precipitation Data for Interpolation and Probabilistic Forecasting

Link: https://arxiv.org/abs/2509.12708
Authors: Pratik Nag
Categories: Machine Learning (cs.LG); Applications (stat.AP)

Abstract:A detailed analysis of precipitation data over Europe is presented, with a focus on interpolation and forecasting applications. A Spatio-temporal DeepKriging (STDK) framework has been implemented using the PyTorch platform to achieve these objectives. The proposed model is capable of handling spatio-temporal irregularities while generating high-resolution interpolations and multi-step forecasts. Reproducible code modules have been developed as standalone PyTorch implementations for interpolation (this https URL) and forecasting (this https URL), facilitating broader application to similar climate datasets. The effectiveness of this approach is demonstrated through extensive evaluation on daily precipitation measurements, highlighting predictive performance and robustness.

[LG-29] NORA: A Nephrology-Oriented Representation Learning Approach Towards Chronic Kidney Disease Classification

Link: https://arxiv.org/abs/2509.12704
Authors: Mohammad Abdul Hafeez Khan, Twisha Bhattacharyya, Omar Khan, Noorah Khan, Alina Aziz Fatima Khan, Mohammed Qutub Khan, Sujoy Ghosh Hajra
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 5 figures, accepted to the International Conference on Machine Learning and Applications (ICMLA) 2025

Abstract:Chronic Kidney Disease (CKD) affects millions of people worldwide, yet its early detection remains challenging, especially in outpatient settings where laboratory-based renal biomarkers are often unavailable. In this work, we investigate the predictive potential of routinely collected non-renal clinical variables for CKD classification, including sociodemographic factors, comorbid conditions, and urinalysis findings. We introduce the Nephrology-Oriented Representation leArning (NORA) approach, which combines supervised contrastive learning with a nonlinear Random Forest classifier. NORA first derives discriminative patient representations from tabular EHR data, which are then used for downstream CKD classification. We evaluated NORA on a clinic-based EHR dataset from Riverside Nephrology Physicians. Our results demonstrated that NORA improves class separability and overall classification performance, particularly enhancing the F1-score for early-stage CKD. Additionally, we assessed the generalizability of NORA on the UCI CKD dataset, demonstrating its effectiveness for CKD risk stratification across distinct patient cohorts.

[LG-30] Bi-level Personalization for Federated Foundation Models: A Task-vector Aggregation Approach

Link: https://arxiv.org/abs/2509.12697
Authors: Yiyuan Yang, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang
Categories: Machine Learning (cs.LG)

Abstract:Federated foundation models represent a new paradigm to jointly fine-tune pre-trained foundation models across clients. It is still a challenge to fine-tune foundation models for a small group of new users or specialized scenarios, which typically involve limited data compared to the large-scale data used in pre-training. In this context, the trade-off between personalization and federation becomes more sensitive. To address these challenges, we propose a bi-level personalization framework for federated fine-tuning of foundation models. Specifically, we conduct personalized fine-tuning at the client level using each client's private data, and then conduct a personalized aggregation at the server level using similar users, as measured by client-specific task vectors. Given the personalization information gained from client-level fine-tuning, the server-level personalized aggregation can gain group-wise personalization information while mitigating the disturbance from irrelevant clients or clients with conflicting interests under non-IID data. The effectiveness of the proposed algorithm is demonstrated by extensive experimental analysis on benchmark datasets.
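
The server-level step in this abstract, aggregating over "similar users measured by client-specific task vectors", can be sketched as follows. A task vector is taken here to be the difference between fine-tuned and pre-trained weights. The similarity measure (cosine), the threshold rule, the plain averaging, and all function names are assumptions for illustration; the abstract does not specify the paper's actual aggregation rule.

```python
import math


def task_vector(finetuned, pretrained):
    """Difference between a client's fine-tuned weights and the shared pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def personalized_aggregate(target, task_vectors, threshold=0.0):
    """Average the target client's task vector with those of sufficiently similar clients.

    Hypothetical rule: a peer is included when its cosine similarity to the
    target exceeds `threshold`; the target itself is always included.
    """
    own = task_vectors[target]
    peers = [
        tv for k, tv in enumerate(task_vectors)
        if k == target or cosine(own, tv) > threshold
    ]
    return [sum(column) / len(peers) for column in zip(*peers)]


pretrained = [0.0, 0.0]
clients = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.0]]  # toy fine-tuned weights
vectors = [task_vector(w, pretrained) for w in clients]
aggregated_for_client_0 = personalized_aggregate(0, vectors)
```

In this toy setting the third client points in the opposite direction and is excluded from client 0's aggregate, which mirrors the abstract's goal of reducing interference from conflicting clients.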

[LG-31] Soft Graph Transformer for MIMO Detection

Link: https://arxiv.org/abs/2509.12694
Authors: Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
Categories: Machine Learning (cs.LG)
Comments: 8 pages

Abstract:We propose the Soft Graph Transformer (SGT), a Soft-Input-Soft-Output neural architecture tailored for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its prohibitive exponential complexity renders it impractical for real-world systems. Conventional message passing algorithms offer tractable alternatives but rely on large-system asymptotics and random matrix assumptions, both of which break down under practical implementations. Prior Transformer-based detectors, on the other hand, fail to incorporate the MIMO factor graph structure and cannot utilize decoder-side soft information, limiting their standalone performance and their applicability in iterative detection-decoding (IDD). To overcome these limitations, SGT integrates message passing directly into a graph-aware attention mechanism and supports decoder-informed updates through soft-input embeddings. This design enables effective soft-output generation while preserving computational efficiency. As a standalone detector, SGT closely approaches ML performance and surpasses prior Transformer-based approaches.

[LG-32] ZTree: A Subgroup Identification Based Decision Tree Learning Framework

Link: https://arxiv.org/abs/2509.12688
Authors: Eric Cheng, Jie Cheng
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 1 table, 5 figures

Abstract:Decision trees are a commonly used class of machine learning models valued for their interpretability and versatility, capable of both classification and regression. We propose ZTree, a novel decision tree learning framework that replaces CART’s traditional purity based splitting with statistically principled subgroup identification. At each node, ZTree applies hypothesis testing (e.g., z-tests, t-tests, Mann-Whitney U, log-rank) to assess whether a candidate subgroup differs meaningfully from the complement. To adjust for the complication of multiple testing, we employ a cross-validation-based approach to determine if further node splitting is needed. This robust stopping criterion eliminates the need for post-pruning and makes the test threshold (z-threshold) the only parameter for controlling tree complexity. Because of the simplicity of the tree growing procedure, once a detailed tree is learned using the most lenient z-threshold, all simpler trees can be derived by simply removing nodes that do not meet the larger z-thresholds. This makes parameter tuning intuitive and efficient. Furthermore, this z-threshold is essentially a p-value, allowing users to easily plug in appropriate statistical tests into our framework without adjusting the range of parameter search. Empirical evaluation on five large-scale UCI datasets demonstrates that ZTree consistently delivers strong performance, especially at low data regimes. Compared to CART, ZTree also tends to grow simpler trees without sacrificing performance. ZTree introduces a statistically grounded alternative to traditional decision tree splitting by leveraging hypothesis testing and a cross-validation approach to multiple testing correction, resulting in an efficient and flexible framework.
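
The node-level idea of replacing purity-based splitting with a hypothesis test can be sketched with a two-sample z-statistic. The code below searches thresholds on a single numeric feature and keeps the split whose subgroup differs most from its complement, provided the statistic reaches a z-threshold. The function names, the minimum group size of two, and the use of sample variances are assumptions; the cross-validation step that the abstract uses to handle multiple testing is deliberately omitted.

```python
import math


def two_sample_z(group_a, group_b):
    """z-statistic for the difference in means of two samples (each of size >= 2)."""
    def mean(xs):
        return sum(xs) / len(xs)

    def sample_var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    diff = mean(group_a) - mean(group_b)
    se = math.sqrt(sample_var(group_a) / len(group_a) + sample_var(group_b) / len(group_b))
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def best_split(xs, ys, z_threshold):
    """Return (threshold, |z|) of the most significant split x <= t, or None."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < 2 or len(right) < 2:
            continue
        z = abs(two_sample_z(left, right))
        if z >= z_threshold and (best is None or z > best[1]):
            best = (t, z)
    return best


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
split = best_split(xs, ys, z_threshold=1.96)
```

Raising `z_threshold` makes the function refuse weaker splits, which is how a single threshold could control tree complexity in a scheme of this kind.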

[LG-33] Large Language Model Scaling Laws for Neural Quantum States in Quantum Chemistry

Link: https://arxiv.org/abs/2509.12679
Authors: Oliver Knitter, Dan Zhao, Stefan Leichenauer, Shravan Veerapaneni
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Quantum Physics (quant-ph)
Comments: 16 pages, 5 figures, to be submitted for peer review

Abstract:Scaling laws have been used to describe how large language model (LLM) performance scales with model size, training data size, or amount of computational resources. Motivated by the fact that neural quantum states (NQS) have increasingly adopted LLM-based components, we seek to understand NQS scaling laws, thereby shedding light on the scalability and optimal performance–resource trade-offs of NQS ansatze. In particular, we identify scaling laws that predict the performance, as measured by absolute error and V-score, for transformer-based NQS as a function of problem size in second-quantized quantum chemistry applications. By performing analogous compute-constrained optimization of the obtained parametric curves, we find that the relationship between model size and training time is highly dependent on loss metric and ansatz, and does not follow the approximately linear relationship found for language models.

[LG-34] High-Energy Concentration for Federated Learning in Frequency Domain

Link: https://arxiv.org/abs/2509.12630
Authors: Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Leyuan Fang
Subjects: Machine Learning (cs.LG)

Abstract:Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy components, such as high-frequency components, usually contain redundant information and noise, thus filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, real data-driven synthetic classification is imposed into the loss to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD outperforms state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha = 0.01$, FedFD achieves a minimum reduction of 37.78% in the communication cost, while attaining a 10.88% performance gain.
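
The "high-energy concentration" property that the FedFD abstract relies on can be checked on a toy signal. The sketch below is an assumption-laden illustration, not FedFD itself: it uses a one-dimensional orthonormal DCT-II written in plain Python (image data would use a two-dimensional transform), a hand-made smooth signal, and a hypothetical helper `low_frequency_mask` that keeps the first `keep` coefficients.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (the transposed basis)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def low_frequency_mask(n, keep):
    """Binary mask that keeps the `keep` lowest-frequency coefficients."""
    return [1 if k < keep else 0 for k in range(n)]


# Smooth toy signal: one sine period plus a slow ramp.
signal = [math.sin(2 * math.pi * i / 64) + 0.5 * i / 64 for i in range(64)]
coeffs = dct(signal)
mask = low_frequency_mask(len(coeffs), keep=8)
masked = [c * m for c, m in zip(coeffs, mask)]

# Fraction of total energy retained by 8 of the 64 coefficients.
energy_kept = sum(c * c for c in masked) / sum(c * c for c in coeffs)
reconstructed = idct(masked)
```

Because the transform is orthonormal, `energy_kept` is also the fraction of signal energy preserved in `reconstructed`; only the eight retained coefficients would need to be transmitted in this toy setting.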

[LG-35] Exploring Training Data Attribution under Limited Access Constraints

Link: https://arxiv.org/abs/2509.12581
Authors: Shiyuan Zhang, Junwei Deng, Juhan Bae, Jiaqi Ma
Subjects: Machine Learning (cs.LG)

Abstract:Training data attribution (TDA) plays a critical role in understanding the influence of individual training data points on model predictions. Gradient-based TDA methods, popularized by influence functions for their superior performance, have been widely applied in data selection, data cleaning, data economics, and fact tracing. However, in real-world scenarios where commercial models are not publicly accessible and computational resources are limited, existing TDA methods are often constrained by their reliance on full model access and high computational costs. This poses significant challenges to the broader adoption of TDA in practical applications. In this work, we present a systematic study of TDA methods under various access and resource constraints. We investigate the feasibility of performing TDA under varying levels of access constraints by leveraging appropriately designed solutions such as proxy models. Besides, we demonstrate that attribution scores obtained from models without prior training on the target dataset remain informative across a range of tasks, which is useful for scenarios where computational resources are limited. Our findings provide practical guidance for deploying TDA in real-world environments, aiming to improve feasibility and efficiency under limited access.

[LG-36] No Need for “Learning” to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction

Link: https://arxiv.org/abs/2509.12573
Authors: Tim Bary, Benoît Macq, Louis Petit
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 4 figures, 1 table

Abstract:AI systems often fail to deliver reliable predictions across all inputs, prompting the need for hybrid human-AI decision-making. Existing Learning to Defer (L2D) approaches address this by training deferral models, but these are sensitive to changes in expert composition and require significant retraining if experts change. We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method uses the prediction set generated by a conformal predictor to identify label-specific uncertainty and selects the most discriminative expert using a segregativity criterion, measuring how well an expert distinguishes between the remaining plausible labels. Experiments on CIFAR10-H and ImageNet16-H show that our method consistently outperforms both the standalone model and the strongest expert, with accuracies attaining $99.57 \pm 0.10\%$ and $99.40 \pm 0.52\%$, while reducing expert workload by up to a factor of 11. The method remains robust under degraded expert performance and shows a gradual performance drop in low-information settings. These results suggest a scalable, retraining-free alternative to L2D for real-world human-AI collaboration.
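
The deferral logic described in this abstract (build a conformal prediction set, answer directly when it contains one label, otherwise hand the case to an expert who separates the remaining labels well) can be sketched as follows. The prediction-set step is standard split conformal prediction with the score 1 - p(label). The expert-selection step is a hypothetical stand-in: the abstract does not define the segregativity criterion, so this sketch simply picks the expert with the best worst-case per-label accuracy over the plausible labels, and the `experts` table is made up for illustration.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p(label) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]


def decide(probs, qhat, expert_label_accuracy):
    """Return ("model", label) or ("defer", expert_name).

    Assumption: expert_label_accuracy maps an expert name to a list of
    per-label accuracies. The selection rule below is a stand-in, not the
    paper's segregativity criterion.
    """
    labels = prediction_set(probs, qhat)
    if len(labels) == 1:
        return ("model", labels[0])
    candidates = labels if labels else list(range(len(probs)))

    def worst_case(expert):
        return min(expert_label_accuracy[expert][l] for l in candidates)

    return ("defer", max(expert_label_accuracy, key=worst_case))


# Calibration scores: 1 - p(true label) on held-out data (toy values).
cal_scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60]
qhat = conformal_quantile(cal_scores, alpha=0.1)

experts = {"alice": [0.95, 0.60, 0.90], "bob": [0.85, 0.88, 0.50]}
confident = decide([0.90, 0.06, 0.04], qhat, experts)
ambiguous = decide([0.48, 0.45, 0.07], qhat, experts)
```

No component here is trained: changing the expert pool only changes the lookup table, which is the property the abstract emphasizes.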

[LG-37] Cross-Modal Deep Metric Learning for Time Series Anomaly Detection

Link: https://arxiv.org/abs/2509.12540
Authors: Wei Li, Zheze Yang
Subjects: Machine Learning (cs.LG)

Abstract:To effectively address the issues of low sensitivity and high time consumption in time series anomaly detection, we propose an anomaly detection method based on cross-modal deep metric learning. A cross-modal deep metric learning feature clustering model is constructed, composed of an input layer, a triplet selection layer, and a loss function computation layer. The squared Euclidean distances between cluster centers are calculated, and a stochastic gradient descent strategy is employed to optimize the model and classify different time series features. The inner product of principal component direction vectors is used as a metric for anomaly measurement. The von Mises-Fisher (vMF) distribution is applied to describe the directional characteristics of time series data, and historical data is used to train and obtain evaluation parameters. By comparing the principal component direction vector of actual time series data with the threshold, anomaly detection is performed. Experimental results demonstrate that the proposed method accurately classifies time series data with different attributes, exhibits high sensitivity to anomalies, and achieves high detection accuracy, fast detection speed, and strong robustness.

[LG-38] Graph Homophily Booster: Rethinking the Role of Discrete Features on Heterophilic Graphs

Link: https://arxiv.org/abs/2509.12530
Authors: Ruizhong Qiu, Ting-Wei Li, Gaotang Li, Hanghang Tong
Subjects: Machine Learning (cs.LG)
Comments: 14 pages

Abstract:Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph-structured data. However, existing GNNs often struggle with heterophilic graphs, where connected nodes tend to have dissimilar features or labels. While numerous methods have been proposed to address this challenge, they primarily focus on architectural designs without directly targeting the root cause of the heterophily problem. These approaches still perform even worse than the simplest MLPs on challenging heterophilic datasets. For instance, our experiments show that 21 of the latest GNNs still fall behind the MLP on the Actor dataset. This critical challenge calls for an innovative approach to addressing graph heterophily beyond architectural designs. To bridge this gap, we propose and study a new and unexplored paradigm: directly increasing the graph homophily via a carefully designed graph transformation. In this work, we present a simple yet effective framework called GRAPHITE to address graph heterophily. To the best of our knowledge, this work is the first method that explicitly transforms the graph to directly improve the graph homophily. Stemming from the exact definition of homophily, our proposed GRAPHITE creates feature nodes to facilitate homophilic message passing between nodes that share similar features. Furthermore, we both theoretically and empirically show that our proposed GRAPHITE significantly increases the homophily of originally heterophilic graphs, with only a slight increase in the graph size. Extensive experiments on challenging datasets demonstrate that our proposed GRAPHITE significantly outperforms state-of-the-art methods on heterophilic graphs while achieving comparable accuracy with state-of-the-art methods on homophilic graphs.

[LG-39] Selective Risk Certification for LLM Outputs via Information-Lift Statistics: PAC-Bayes Robustness and Skeleton Design

Link: https://arxiv.org/abs/2509.12527
Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Large language models often produce plausible but incorrect outputs. Existing heuristics such as HallBayes lack formal guarantees. We develop the first comprehensive theory of information-lift certificates under selective classification. Our contributions are: (i) a PAC-Bayes sub-gamma analysis extending beyond standard Bernstein bounds; (ii) explicit skeleton sensitivity theorems quantifying robustness to misspecification; (iii) failure-mode guarantees under assumption violations; and (iv) a principled variational method for skeleton construction. Across six datasets and multiple model families, we validate assumptions empirically, reduce abstention by 12–15% at the same risk, and maintain runtime overhead below 20% (further reduced via batching).

[LG-40] Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Link: https://arxiv.org/abs/2509.12521
Authors: Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen
Subjects: Machine Learning (cs.LG)

Abstract:Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation – a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at this https URL.

[LG-41] Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents

Link: https://arxiv.org/abs/2509.12507
Authors: Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: DOI: https://doi.org/10.3389/frobt.2023.1110534 . This is the author’s LaTeX version

Abstract:One of the main goals of robotics and intelligent agent research is to enable natural communication with humans in physically situated settings. While recent work has focused on verbal modes such as language and speech, non-verbal communication is crucial for flexible interaction. We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. Using a small motion capture dataset, our method learns a motor control policy that produces physically valid, naturalistic gestures with high referential accuracy. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users. Results show that our system achieves higher naturalness and accuracy than state-of-the-art supervised models, highlighting the promise of imitation-RL for communicative gesture generation and its potential application to robots.

[LG-42] Prediction and Causality of functional MRI and synthetic signal using a Zero-Shot Time-Series Foundation Model

Link: https://arxiv.org/abs/2509.12497
Authors: Alessandro Crimi, Andrea Brovelli
Subjects: Machine Learning (cs.LG)

Abstract:Time-series forecasting and causal discovery are central in neuroscience, as predicting brain activity and identifying causal relationships between neural populations and circuits can shed light on the mechanisms underlying cognition and disease. With the rise of foundation models, an open question is how they compare to traditional methods for brain signal forecasting and causality analysis, and whether they can be applied in a zero-shot setting. In this work, we evaluate a foundation model against classical methods for inferring directional interactions from spontaneous brain activity measured with functional magnetic resonance imaging (fMRI) in humans. Traditional approaches often rely on Wiener-Granger causality. We tested the forecasting ability of the foundation model in both zero-shot and fine-tuned settings, and assessed causality by comparing Granger-like estimates from the model with standard Granger causality. We validated the approach using synthetic time series generated from ground-truth causal models, including logistic map coupling and Ornstein-Uhlenbeck processes. The foundation model achieved competitive zero-shot forecasting of fMRI time series (mean absolute percentage error of 0.55 in controls and 0.27 in patients). Although standard Granger causality did not show clear quantitative differences between models, the foundation model provided a more precise detection of causal interactions. Overall, these findings suggest that foundation models offer versatility, strong zero-shot performance, and potential utility for forecasting and causal discovery in time-series data.

[LG-43] Finite-Agent Stochastic Differential Games on Large Graphs: II. Graph-Based Architectures

Link: https://arxiv.org/abs/2509.12484
Authors: Ruimeng Hu, Jihao Long, Haosheng Zhou
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)

Abstract:We propose a novel neural network architecture, called Non-Trainable Modification (NTM), for computing Nash equilibria in stochastic differential games (SDGs) on graphs. These games model a broad class of graph-structured multi-agent systems arising in finance, robotics, energy, and social dynamics, where agents interact locally under uncertainty. The NTM architecture imposes a graph-guided sparsification on feedforward neural networks, embedding fixed, non-trainable components aligned with the underlying graph topology. This design enhances interpretability and stability, while significantly reducing the number of trainable parameters in large-scale, sparse settings. We theoretically establish a universal approximation property for NTM in static games on graphs and numerically validate its expressivity and robustness through supervised learning tasks. Building on this foundation, we incorporate NTM into two state-of-the-art game solvers, Direct Parameterization and Deep BSDE, yielding their sparse variants (NTM-DP and NTM-DBSDE). Numerical experiments on three SDGs across various graph structures demonstrate that NTM-based methods achieve performance comparable to their fully trainable counterparts, while offering improved computational efficiency.

[LG-44] Comparative Analysis of Wave Scattering Numerical Modeling Using the Boundary Element Method and Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2509.12483
Authors: Oscar Rincón-Cardeno, Gregorio Pérez Bernal, Silvana Montoya Noguera, Nicolás Guarín-Zapata
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Abstract:Purpose - This study compares the Boundary Element Method (BEM) and Physics-Informed Neural Networks (PINNs) for solving the two-dimensional Helmholtz equation in wave scattering problems. The objective is to evaluate the performance of both methods under the same conditions. Design/methodology/approach - We solve the Helmholtz equation using BEM and PINNs for the same scattering problem. The PINNs are trained by minimizing the residual of the governing equations and boundary conditions, with their configuration determined through hyperparameter optimization, while the BEM is applied using boundary discretization. Both methods are evaluated in terms of solution accuracy, computation time, and generalization capacity. Findings - Numerical experiments were conducted by varying the number of integration points for BEM and the number of layers and neurons per layer for PINNs. Hyperparameter tuning provided further insight into suitable configurations for wave scattering problems. At comparable accuracy, PINNs produced consistent solutions but required training times approximately 42 times longer than BEM. However, once trained, PINNs achieved evaluation times up to 204 times faster. The generalization capacity was also assessed outside the PINN training domain, where the relative error increased from $7.46 \times 10^{-2}$ to 8.22, while BEM maintained a similar error level in the extended region. Originality/value - This work presents a direct comparison between PINNs and BEM for the Helmholtz equation. The analysis provides quantitative data on the performance of both methods, supporting their selection in future research on wave propagation problems and establishing future challenges and directions.

[LG-45] Nonlocal Neural Tangent Kernels via Parameter-Space Interactions

Link: https://arxiv.org/abs/2509.12467
Authors: Sriram Nagaraj, Vishakh Hari
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:The Neural Tangent Kernel (NTK) framework has provided deep insights into the training dynamics of neural networks under gradient flow. However, it relies on the assumption that the network is differentiable with respect to its parameters, an assumption that breaks down when considering non-smooth target functions or parameterized models exhibiting non-differentiable behavior. In this work, we propose a Nonlocal Neural Tangent Kernel (NNTK) that replaces the local gradient with a nonlocal interaction-based approximation in parameter space. Nonlocal gradients are known to exist for a wider class of functions than the standard gradient. This allows NTK theory to be extended to nonsmooth functions, stochastic estimators, and broader families of models. We explore both fixed-kernel and attention-based formulations of this nonlocal operator. We illustrate the new formulation with numerical studies.

[LG-46] On the Regularity and Fairness of Combinatorial Multi-Armed Bandit

Link: https://arxiv.org/abs/2509.12457
Authors: Xiaoyi Wu, Bin Li
Subjects: Machine Learning (cs.LG)

Abstract:The combinatorial multi-armed bandit model is designed to maximize cumulative rewards in the presence of uncertainty by activating a subset of arms in each round. This paper is inspired by two critical applications in wireless networks, where it’s not only essential to maximize cumulative rewards but also to guarantee fairness among arms (i.e., the minimum average reward required by each arm) and ensure reward regularity (i.e., how often each arm receives the reward). In this paper, we propose a parameterized regular and fair learning algorithm to achieve these three objectives. In particular, the proposed algorithm linearly combines virtual queue-lengths (tracking the fairness violations), Time-Since-Last-Reward (TSLR) metrics, and Upper Confidence Bound (UCB) estimates in its weight measure. Here, TSLR is similar to age-of-information and measures the elapsed number of rounds since the last time an arm received a reward, capturing the reward regularity performance, and UCB estimates are utilized to balance the tradeoff between exploration and exploitation in online learning. By exploring a key relationship between virtual queue-lengths and TSLR metrics and utilizing several non-trivial Lyapunov functions, we analytically characterize zero cumulative fairness violation, reward regularity, and cumulative regret performance under our proposed algorithm. These theoretical outcomes are verified by simulations based on two real-world datasets.
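
The weight measure described in this abstract, a linear combination of virtual queue lengths, TSLR counters, and UCB estimates, can be sketched as below. Everything beyond that one sentence is an assumption made for illustration: the coefficient names `w_queue`, `w_tslr`, and `w_ucb`, the top-`m` selection rule, the particular UCB bonus, and the queue update `max(Q + min_rate - reward, 0)` are common choices, not necessarily the paper's.

```python
import math


def ucb(mean, pulls, t):
    """Empirical mean plus a standard exploration bonus (assumed form)."""
    return mean + math.sqrt(2.0 * math.log(t) / max(pulls, 1))


def select_arms(state, t, m, w_queue=1.0, w_tslr=1.0, w_ucb=1.0):
    """Pick the m arms with the largest combined weight.

    Each arm is a dict with keys: queue (virtual queue length tracking
    fairness debt), tslr (rounds since last reward), mean, pulls.
    """
    def weight(i):
        arm = state[i]
        return (
            w_queue * arm["queue"]
            + w_tslr * arm["tslr"]
            + w_ucb * ucb(arm["mean"], arm["pulls"], t)
        )

    ranked = sorted(range(len(state)), key=weight, reverse=True)
    return sorted(ranked[:m])


def update(arm, activated, reward, min_rate):
    """Advance one arm by one round (assumed Lyapunov-style bookkeeping)."""
    # Fairness debt grows by the required rate and shrinks by the reward received.
    arm["queue"] = max(arm["queue"] + min_rate - reward, 0.0)
    # TSLR resets when the arm is rewarded, otherwise it ages by one round.
    arm["tslr"] = 0 if reward > 0 else arm["tslr"] + 1
    if activated:
        arm["pulls"] += 1
        arm["mean"] += (reward - arm["mean"]) / arm["pulls"]


state = [
    {"queue": 0.0, "tslr": 0, "mean": 0.9, "pulls": 50},  # strong arm
    {"queue": 0.0, "tslr": 0, "mean": 0.2, "pulls": 50},  # weak arm, no debt
    {"queue": 3.0, "tslr": 5, "mean": 0.1, "pulls": 50},  # weak arm, starved
]
chosen = select_arms(state, t=150, m=2)
reward_only = select_arms(state, t=150, m=2, w_queue=0.0, w_tslr=0.0)
```

With the fairness and regularity terms switched off, the sketch reduces to plain UCB ranking; with them on, the starved arm is activated despite its low estimated reward.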

[LG-47] Surrogate Representation Inference for Noisy Text and Image Annotations

Link: https://arxiv.org/abs/2509.12416
Authors: Kentaro Nakamura
Subjects: Machine Learning (cs.LG); Applications (stat.AP)

Abstract:As researchers increasingly rely on machine learning models and LLMs to annotate unstructured data, such as texts or images, various approaches have been proposed to correct bias in downstream statistical analysis. However, existing methods tend to yield large standard errors and require some error-free human annotation. In this paper, I introduce Surrogate Representation Inference (SRI), which assumes that unstructured data fully mediate the relationship between human annotations and structured variables. The assumption is guaranteed by design provided that human coders rely only on unstructured data for annotation. Under this setting, I propose a neural network architecture that learns a low-dimensional representation of unstructured data such that the surrogate assumption remains satisfied. When multiple human annotations are available, SRI can further correct non-differential measurement errors that may exist in human annotations. Focusing on text-as-outcome settings, I formally establish the identification conditions and semiparametric efficient estimation strategies that enable learning and leveraging such a low-dimensional representation. Simulation studies and a real-world application demonstrate that SRI reduces standard errors by over 50% when machine learning prediction accuracy is moderate and provides valid inference even when human annotations contain non-differential measurement errors.

[LG-48] Bayesian Parametric Matrix Models: Principled Uncertainty Quantification for Spectral Learning

Link: https://arxiv.org/abs/2509.12406
Authors: Mohammad Nooraiepour
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Scientific machine learning increasingly uses spectral methods to understand physical systems. Current spectral learning approaches provide only point estimates without uncertainty quantification, limiting their use in safety-critical applications where prediction confidence is essential. Parametric matrix models have emerged as powerful tools for scientific machine learning, achieving exceptional performance by learning governing equations. However, their deterministic nature limits deployment in uncertainty quantification applications. We introduce Bayesian parametric matrix models (B-PMMs), a principled framework that extends PMMs to provide uncertainty estimates while preserving their spectral structure and computational efficiency. B-PMM addresses the fundamental challenge of quantifying uncertainty in matrix eigenvalue problems where standard Bayesian methods fail due to the geometric constraints of spectral decomposition. The theoretical contributions include: (i) adaptive spectral decomposition with regularized matrix perturbation bounds that characterize eigenvalue uncertainty propagation, (ii) structured variational inference algorithms using manifold-aware matrix-variate Gaussian posteriors that respect Hermitian constraints, and (iii) finite-sample calibration guarantees with explicit dependence on spectral gaps and problem conditioning. Experimental validation across matrix dimensions from 5x5 to 500x500 with perfect convergence rates demonstrates that B-PMMs achieve exceptional uncertainty calibration (ECE 0.05) while maintaining favorable scaling. The framework exhibits graceful degradation under spectral ill-conditioning and provides reliable uncertainty estimates even in near-degenerate regimes. The proposed framework supports robust spectral learning in uncertainty-critical domains and lays the groundwork for broader Bayesian spectral machine learning.

[LG-49] Structured Information Loss in Network Embeddings

Link: https://arxiv.org/abs/2509.12396
Authors: Gabriel Chuang, Augustin Chaintreau
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:We analyze a simple algorithm for network embedding, explicitly characterizing conditions under which the learned representation encodes the graph’s generative model fully, partially, or not at all. In cases where the embedding loses some information (i.e., is not invertible), we describe the equivalence classes of graphons that map to the same embedding, finding that these classes preserve community structure but lose substantial density information. Finally, we show implications for community detection and link prediction. Our results suggest strong limitations on the effectiveness of link prediction based on embeddings alone, and we show common conditions under which naive link prediction adds edges in a disproportionate manner that can either mitigate or exacerbate structural biases.

[LG-50] Adaptive Spatial Goodness Encoding: Advancing and Scaling Forward-Forward Learning Without Backpropagation

Link: https://arxiv.org/abs/2509.12394
Authors: Qingchun Gong, Robert Bogdan Staszewski, Kai Xu
Subjects: Machine Learning (cs.LG)

Abstract:The Forward-Forward (FF) algorithm offers a promising alternative to backpropagation (BP). Despite advancements in recent FF-based extensions, which have enhanced the original algorithm and adapted it to convolutional neural networks (CNNs), they often suffer from limited representational capacity and poor scalability to large-scale datasets, primarily due to exploding channel dimensionality. In this work, we propose adaptive spatial goodness encoding (ASGE), a new FF-based training framework tailored for CNNs. ASGE leverages feature maps to compute spatially-aware goodness representations at each layer, enabling layer-wise supervision. Crucially, this approach decouples classification complexity from channel dimensionality, thereby addressing the issue of channel explosion and achieving competitive performance compared to other BP-free methods. ASGE outperforms all other FF-based approaches across multiple benchmarks, delivering test accuracies of 99.65% on MNIST, 93.41% on FashionMNIST, 90.62% on CIFAR-10, and 65.42% on CIFAR-100. Moreover, we present the first successful application of FF-based training to ImageNet, with Top-1 and Top-5 accuracies of 26.21% and 47.49%. By entirely eliminating BP and significantly narrowing the performance gap with BP-trained models, the ASGE framework establishes a viable foundation toward scalable BP-free CNN training.

[LG-51] Diffusion-Based Generation and Imputation of Driving Scenarios from Limited Vehicle CAN Data ITSC2025

Link: https://arxiv.org/abs/2509.12375
Authors: Julian Ripper, Ousama Esbel, Rafael Fietzek, Max Mühlhäuser, Thomas Kreutz
Subjects: Machine Learning (cs.LG)
Comments: Preprint; paper accepted at ITSC 2025

Abstract:Training deep learning methods on small time series datasets that also include corrupted samples is challenging. Diffusion models have been shown to be effective at generating realistic synthetic data and at correcting corrupted samples through imputation. In this context, this paper focuses on generating synthetic yet realistic samples of automotive time series data. We show that denoising diffusion probabilistic models (DDPMs) can effectively solve this task by applying them to a challenging vehicle CAN-dataset with long-term data and a limited number of samples. To this end, we propose a hybrid generative approach that combines autoregressive and non-autoregressive techniques. We evaluate our approach with two recently proposed DDPM architectures for time series generation, for which we propose several improvements. To evaluate the generated samples, we propose three metrics that quantify physical correctness and test track adherence. Our best model is able to outperform even the training data in terms of physical correctness, while showing plausible driving behavior. Finally, we use our best model to successfully impute physically implausible regions in the training data, thereby improving the data quality.

[LG-52] Explainable Unsupervised Multi-Anomaly Detection and Temporal Localization in Nuclear Times Series Data with a Dual Attention-Based Autoencoder

Link: https://arxiv.org/abs/2509.12372
Authors: Konstantinos Vasili, Zachery T. Dahm, Stylianos Chatzidakis
Subjects: Machine Learning (cs.LG)

Abstract:The nuclear industry is advancing toward more new reactor designs, with next-generation reactors expected to be smaller in scale and power output. These systems have the potential to produce large volumes of information in the form of multivariate time-series data, which could be used for enhanced real-time monitoring and control. In this context, the development of remote autonomous or semi-autonomous control systems for reactor operation has gained significant interest. A critical first step toward such systems is an accurate diagnostics module capable of detecting and localizing anomalies within the reactor system. Recent studies have proposed various ML and DL approaches for anomaly detection in the nuclear domain. Despite promising results, key challenges remain, including limited to no explainability, lack of access to real-world data, and scarcity of abnormal events, which impedes benchmarking and characterization. Most existing studies treat these methods as black boxes, while recent work highlights the need for greater interpretability of ML/DL outputs in safety-critical domains. Here, we propose an unsupervised methodology based on an LSTM autoencoder with a dual attention mechanism for characterization of abnormal events in a real-world reactor radiation area monitoring system. The framework includes not only detection but also localization of the event and was evaluated using real-world datasets of increasing complexity from the PUR-1 research reactor. The attention mechanisms operate in both the feature and temporal dimensions, where the feature attention assigns weights to radiation sensors exhibiting abnormal patterns, while time attention highlights the specific timesteps where irregularities occur, thus enabling localization. By combining the results, the framework can identify both the affected sensors and the duration of each anomaly within a single unified network.

[LG-53] Unsupervised Atomic Data Mining via Multi-Kernel Graph Autoencoders for Machine Learning Force Fields

Link: https://arxiv.org/abs/2509.12358
Authors: Hong Sun, Joshua A. Vita, Amit Samanta, Vincenzo Lordi
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Constructing a chemically diverse dataset while avoiding sampling bias is critical to training efficient and generalizable force fields. However, in computational chemistry and materials science, many common dataset generation techniques are prone to oversampling regions of the potential energy surface. Furthermore, these regions can be difficult to identify and isolate from each other or may not align well with human intuition, making it challenging to systematically remove bias in the dataset. While traditional clustering and pruning (down-sampling) approaches can be useful for this, they can often lead to information loss or a failure to properly identify distinct regions of the potential energy surface due to difficulties associated with the high dimensionality of atomic descriptors. In this work, we introduce the Multi-kernel Edge Attention-based Graph Autoencoder (MEAGraph) model, an unsupervised approach for analyzing atomic datasets. MEAGraph combines multiple linear kernel transformations with attention-based message passing to capture geometric sensitivity and enable effective dataset pruning without relying on labels or extensive training. Demonstrated applications on niobium, tantalum, and iron datasets show that MEAGraph efficiently groups similar atomic environments, allowing for the use of basic pruning techniques for removing sampling bias. This approach provides an effective method for representation learning and clustering that can be used for data analysis, outlier detection, and dataset optimization.

[LG-54] FEDONet: Fourier-Embedded DeepONet for Spectrally Accurate Operator Learning

Link: https://arxiv.org/abs/2509.12344
Authors: Arth Sojitra, Mrigank Dhingra, Omer San
Categories: Machine Learning (cs.LG)

Abstract:Deep Operator Networks (DeepONets) have recently emerged as powerful data-driven frameworks for learning nonlinear operators, particularly suited for approximating solutions to partial differential equations (PDEs). Despite their promising capabilities, the standard implementation of DeepONets, which typically employs fully connected linear layers in the trunk network, can encounter limitations in capturing complex spatial structures inherent to various PDEs. To address this, we introduce Fourier-embedded trunk networks within the DeepONet architecture, leveraging random Fourier feature mappings to enrich spatial representation capabilities. Our proposed Fourier-embedded DeepONet, FEDONet demonstrates superior performance compared to the traditional DeepONet across a comprehensive suite of PDE-driven datasets, including the two-dimensional Poisson equation, Burgers’ equation, the Lorenz-63 chaotic system, Eikonal equation, Allen-Cahn equation, Kuramoto-Sivashinsky equation, and the Lorenz-96 system. Empirical evaluations of FEDONet consistently show significant improvements in solution reconstruction accuracy, with average relative L2 performance gains ranging between 2-3x compared to the DeepONet baseline. This study highlights the effectiveness of Fourier embeddings in enhancing neural operator learning, offering a robust and broadly applicable methodology for PDE surrogate modeling.
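
The random Fourier feature mapping mentioned in this abstract can be illustrated with a small NumPy sketch. This is only an assumption-laden illustration: it uses the common construction with a fixed Gaussian frequency matrix `B` and a `2*pi` factor, and the names `fourier_embed`, `m`, and `sigma` are hypothetical. The abstract does not confirm that FEDONet uses exactly this form.

```python
import numpy as np

def fourier_embed(x, B):
    """Map coordinates x of shape (n, d) to [cos(2*pi*x@B), sin(2*pi*x@B)], shape (n, 2*m)."""
    proj = 2.0 * np.pi * x @ B                      # (n, m) random projections
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
d, m, sigma = 2, 64, 5.0                            # hypothetical: input dim, frequencies, scale
B = rng.normal(0.0, sigma, size=(d, m))             # fixed random frequency matrix (not trained)

x = rng.uniform(0.0, 1.0, size=(10, d))             # 10 query points in the unit square
phi = fourier_embed(x, B)                           # features that a trunk MLP could consume
print(phi.shape)                                    # (10, 128)
```

In such a setup the embedded features, rather than the raw coordinates, would be passed to the trunk network; larger `sigma` values expose higher spatial frequencies to the network.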

[LG-55] Spontaneous Kolmogorov-Arnold Geometry in Shallow MLPs

Link: https://arxiv.org/abs/2509.12326
Authors: Michael Freedman, Michael Mulligan
Categories: Machine Learning (cs.LG); Strongly Correlated Electrons (cond-mat.str-el); High Energy Physics - Theory (hep-th)
Comments: 25 pages + 3 appendices

Abstract:The Kolmogorov-Arnold (KA) representation theorem constructs universal, but highly non-smooth inner functions (the first layer map) in a single (non-linear) hidden layer neural network. Such universal functions have a distinctive local geometry, a “texture,” which can be characterized by the inner function’s Jacobian $J(\mathbf{x})$, as $\mathbf{x}$ varies over the data. It is natural to ask if this distinctive KA geometry emerges through conventional neural network optimization. We find that indeed KA geometry often is produced when training vanilla single hidden layer neural networks. We quantify KA geometry through the statistical properties of the exterior powers of $J(\mathbf{x})$: number of zero rows and various observables for the minor statistics of $J(\mathbf{x})$, which measure the scale and axis alignment of $J(\mathbf{x})$. This leads to a rough understanding for where KA geometry occurs in the space of function complexity and model hyperparameters. The motivation is first to understand how neural networks organically learn to prepare input data for later downstream processing and, second, to learn enough about the emergence of KA geometry to accelerate learning through a timely intervention in network hyperparameters. This research is the “flip side” of KA-Networks (KANs). We do not engineer KA into the neural network, but rather watch KA emerge in shallow MLPs.
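
As a rough illustration of the quantities named in this abstract, the sketch below computes the Jacobian of a first-layer map, counts its zero rows, and lists its 2x2 minors (the entries of the second exterior power). It assumes a ReLU first layer `h(x) = relu(W x + b)`, which the abstract does not specify, and the helper names and toy weights are hypothetical rather than the authors' observables.

```python
import numpy as np
from itertools import combinations

def first_layer_jacobian(W, b, x):
    """Jacobian of h(x) = relu(W @ x + b) with respect to x; shape (hidden, d)."""
    active = (W @ x + b > 0).astype(W.dtype)        # ReLU derivative for each hidden unit
    return active[:, None] * W

def count_zero_rows(J, tol=1e-12):
    """Number of hidden units whose output is locally insensitive to the input."""
    return int(np.sum(np.all(np.abs(J) <= tol, axis=1)))

def minors_2x2(J):
    """All 2x2 minors of J, indexed by (row pair, column pair)."""
    row_pairs = list(combinations(range(J.shape[0]), 2))
    col_pairs = list(combinations(range(J.shape[1]), 2))
    return np.array([[np.linalg.det(J[np.ix_(r, c)]) for c in col_pairs]
                     for r in row_pairs])

# Toy network: 2 inputs, 3 hidden units; the third unit is inactive at x.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, -10.0])
x = np.array([1.0, 2.0])

J = first_layer_jacobian(W, b, x)
print(count_zero_rows(J))                           # 1
print(minors_2x2(J).ravel())                        # minors for row pairs (0,1), (0,2), (1,2)
```

Statistics of these quantities over many data points `x` would be the natural next step; which statistics the paper actually reports is not stated in the abstract.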

[LG-56] More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition

Link: https://arxiv.org/abs/2509.12295
Authors: James Tavernor, Emily Mower Provost
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: ©20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Speech emotion recognition systems often predict a consensus value generated from the ratings of multiple annotators. However, these models have limited ability to predict the annotation of any one person. Alternatively, models can learn to predict the annotations of all annotators. Adapting such models to new annotators is difficult as new annotators must individually provide sufficient labeled training data. We propose to leverage inter-annotator similarity by using a model pre-trained on a large annotator population to identify a similar, previously seen annotator. Given a new, previously unseen, annotator and limited enrollment data, we can make predictions for a similar annotator, enabling off-the-shelf annotation of unseen data in target datasets, providing a mechanism for extremely low-cost personalization. We demonstrate our approach significantly outperforms other off-the-shelf approaches, paving the way for lightweight emotion adaptation, practical for real-world deployment.

[LG-57] Prediction of Stocks Index Price using Quantum GANs

Link: https://arxiv.org/abs/2509.12286
Authors: Sangram Deshpande, Gopal Ramesh Dahale, Sai Nandan Morapakula, Uday Wad
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:This paper investigates the application of Quantum Generative Adversarial Networks (QGANs) for stock price prediction. Financial markets are inherently complex, marked by high volatility and intricate patterns that traditional models often fail to capture. QGANs, leveraging the power of quantum computing, offer a novel approach by combining the strengths of generative models with quantum machine learning techniques. We implement a QGAN model tailored for stock price prediction and evaluate its performance using historical stock market data. Our results demonstrate that QGANs can generate synthetic data closely resembling actual market behavior, leading to enhanced prediction accuracy. The experiment was conducted using the Stocks index price data and the AWS Braket SV1 simulator for training the QGAN circuits. The quantum-enhanced model outperforms classical Long Short-Term Memory (LSTM) and GAN models in terms of convergence speed and prediction accuracy. This research represents a key step toward integrating quantum computing in financial forecasting, offering potential advantages in speed and precision over traditional methods. The findings suggest important implications for traders, financial analysts, and researchers seeking advanced tools for market analysis.

[LG-58] Meta-model Neural Process for Probabilistic Power Flow under Varying N-1 System Topologies

Link: https://arxiv.org/abs/2509.12281
Authors: Sel Ly, Kapil Chauhan, Anshuman Singh, Hung Dinh Nguyen
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: An improved version for the conference paper at PESGM 2025

Abstract:The probabilistic power flow (PPF) problem is essential to quantifying the distribution of the nodal voltages due to uncertain injections. The conventional PPF problem considers a fixed topology, and the solutions to such a PPF problem are associated with this topology. A change in the topology might alter the power flow patterns and thus require the PPF problem to be solved again. The previous PPF model and its solutions are no longer valid for the new topology. This practice incurs both inconvenience and computation burdens as more contingencies are foreseen due to high renewables and a large share of electric vehicles. This paper presents a novel topology-adaptive approach, based on the meta-model Neural Process (MMNP), for finding the solutions to PPF problems under varying N-1 topologies, particularly with one-line failures. By leveraging context set-based topology representation and conditional distribution over function learning techniques, the proposed MMNP enhances the robustness of PPF models to topology variations, mitigating the need for retraining PPF models on a new configuration. Simulations on an IEEE 9-bus system and IEEE 118-bus system validate the model’s performance. The maximum %L1-relative error norm was observed as 1.11% and 0.77% in 9-bus and 118-bus, respectively. This adaptive approach fills a critical gap in PPF methodology in an era of increasing grid volatility.

[LG-59] Research on Short-Video Platform User Decision-Making via Multimodal Temporal Modeling and Reinforcement Learning

Link: https://arxiv.org/abs/2509.12269
Authors: Jinmeiyang Wang, Jing Dong, Li Zhou
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 26 pages

Abstract:This paper proposes the MT-DQN model, which integrates a Transformer, Temporal Graph Neural Network (TGNN), and Deep Q-Network (DQN) to address the challenges of predicting user behavior and optimizing recommendation strategies in short-video environments. Experiments demonstrated that MT-DQN consistently outperforms traditional concatenated models, such as Concat-Modal, achieving an average F1-score improvement of 10.97% and an average NDCG@5 improvement of 8.3%. Compared to the classic reinforcement learning model Vanilla-DQN, MT-DQN reduces MSE by 34.8% and MAE by 26.5%. Nonetheless, we also recognize challenges in deploying MT-DQN in real-world scenarios, such as its computational cost and latency sensitivity during online inference, which will be addressed through future architectural optimization.

[LG-60] A Traditional Approach to Symbolic Piano Continuation

Link: https://arxiv.org/abs/2509.12267
Authors: Christian Zhou-Zheng, John Backsund, Dun Li Chan, Alex Coventry, Avid Eslami, Jyotin Goel, Xingwen Han, Danysh Soomro, Galen Wei
Categories: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 3 pages, extended abstract, MIREX session at ISMIR 2025 LBD

Abstract:We present a traditional approach to symbolic piano music continuation for the MIREX 2025 Symbolic Music Generation challenge. While computational music generation has recently focused on developing large foundation models with sophisticated architectural modifications, we argue that simpler approaches remain more effective for constrained, single-instrument tasks. We thus return to a simple, unaugmented next-token-prediction objective on tokenized raw MIDI, aiming to outperform large foundation models by using better data and better fundamentals. We release model weights and code at this https URL.

[LG-61] Explainable Fraud Detection with GNNExplainer and Shapley Values

Link: https://arxiv.org/abs/2509.12262
Authors: Ngoc Hieu Dao
Categories: Machine Learning (cs.LG)
Comments: B. Comp Dissertation

Abstract:The risk of financial fraud is increasing as digital payments are used more and more frequently. Although the use of artificial intelligence systems for fraud detection is widespread, society and regulators have raised the standards for these systems’ transparency for reliability verification purposes. To increase their effectiveness in conducting fraud investigations, fraud analysts also profit from having concise and understandable explanations. To solve these challenges, the paper will concentrate on developing an explainable fraud detector.

[LG-62] Interpretable Data Mining of Follicular Thyroid Cancer Ultrasound Features Using Enhanced Association Rules

Link: https://arxiv.org/abs/2509.12238
Authors: Songlin Zhou, Tao Zhou, Xin Li, Stephen Shing-Toung Yau
Categories: Machine Learning (cs.LG)

Abstract:Purpose: Thyroid cancer has been a common cancer. Papillary thyroid cancer and follicular thyroid cancer are the two most common types of thyroid cancer. Follicular thyroid cancer lacks distinctive ultrasound signs and is more difficult to diagnose preoperatively than the more prevalent papillary thyroid cancer, and the clinical studies associated with it are less well established. We aimed to analyze the clinical data of follicular thyroid cancer based on a novel data mining tool to identify some clinical indications that may help in preoperative diagnosis. Methods: We performed a retrospective analysis based on case data collected by the Department of General Surgery of Peking University Third Hospital between 2010 and 2023. Unlike traditional statistical methods, we improved the association rule mining, a classical data mining method, and proposed new analytical metrics reflecting the malignant association between clinical indications and cancer with the help of the idea of SHAP method in interpretable machine learning. Results: The dataset was preprocessed to contain 1673 cases (in terms of nodes rather than patients), of which 1414 were benign and 259 were malignant nodes. Our analysis pointed out that in addition to some common indicators (e.g., irregular or lobulated nodal margins, uneven thickness halo, hypoechogenicity), there were also some indicators with strong malignant associations, such as nodule-in-nodule pattern, trabecular pattern, and low TSH scores. In addition, our results suggest that the combination of Hashimoto’s thyroiditis may also have a strong malignant association. Conclusion: In the preoperative diagnosis of nodules suspected of follicular thyroid cancer, multiple clinical indications should be considered for a more accurate diagnosis. The diverse malignant associations identified in our study may serve as a reference for clinicians in related fields.
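
For readers unfamiliar with association rule mining, the classical metrics that the method above builds on (support, confidence, lift) can be sketched as follows. This shows only the textbook baseline, not the enhanced SHAP-inspired metrics proposed in the paper, and the toy records and the item name `feature_A` are invented purely for illustration.

```python
def rule_metrics(records, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    n = len(records)
    n_a = sum(antecedent <= r for r in records)                   # records containing the antecedent
    n_c = sum(consequent <= r for r in records)                   # records containing the consequent
    n_ac = sum((antecedent | consequent) <= r for r in records)   # records containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Made-up toy records (each record is a set of items); not data from the paper.
records = [
    {"feature_A", "malignant"},
    {"feature_A", "malignant"},
    {"feature_A"},
    {"malignant"},
    set(), set(), set(), set(),
]

s, c, l = rule_metrics(records, {"feature_A"}, {"malignant"})
print(f"support={s:.3f} confidence={c:.3f} lift={l:.3f}")
# support=0.250 confidence=0.667 lift=1.778
```

A lift above 1 means the antecedent and consequent co-occur more often than independence would predict, which is the kind of association the paper's refined metrics aim to quantify more reliably.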

[LG-63] A Physics-Informed Neural Networks-Based Model Predictive Control Framework for SIR Epidemics

Link: https://arxiv.org/abs/2509.12226
Authors: Aiping Zhong, Baike She, Philip E. Paré
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Populations and Evolution (q-bio.PE)

Abstract:This work introduces a physics-informed neural networks (PINNs)-based model predictive control (MPC) framework for susceptible-infected-recovered (SIR) spreading models. Existing studies in MPC design for epidemic control often assume either 1) measurable states of the dynamics, where the parameters are learned, or 2) known parameters of the model, where the states are learned. In this work, we address the joint real-time estimation of states and parameters within the MPC framework using only noisy infected states, under the assumption that 1) only the recovery rate is known, or 2) only the basic reproduction number is known. Under the first assumption, we propose MPC-PINNs and two novel PINNs algorithms, all of which are integrated into the MPC framework. First, we introduce MPC-PINNs, which are designed for SIR models with control. We then propose log-scaled PINNs (MPC-LS-PINNs), which incorporate a log-scaled loss function to improve robustness against noise. Next, we present split-integral PINNs (MPC-SI-PINNs), which leverage integral operators and state coupling in the neural network training process to effectively reconstruct the complete epidemic state information. Building upon these methods, we further extend our framework for the second assumption. We establish the necessary conditions and extend our PINNs algorithms, where MPC-SI-PINNs are simplified as split-PINNs (MPC-S-PINNs). By incorporating these algorithms into the MPC framework, we simultaneously estimate the epidemic states and parameters while generating optimal control strategies. Experiment results demonstrate the effectiveness of the proposed methods under different settings.
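
To make the underlying dynamics concrete, here is a minimal forward-Euler sketch of an SIR model with a constant control input. It assumes the control `u` acts as an extra removal rate on the infected compartment; the abstract does not state how control enters the model, so this choice, the function name `simulate_sir`, and all parameter values are hypothetical.

```python
import numpy as np

def simulate_sir(beta, gamma, u, s0, i0, dt=0.01, steps=10000):
    """Forward-Euler simulation of dS/dt = -beta*S*I, dI/dt = beta*S*I - (gamma+u)*I."""
    s, i, r = s0, i0, 1.0 - s0 - i0
    traj = np.empty((steps + 1, 3))
    traj[0] = (s, i, r)
    for k in range(steps):
        new_inf = beta * s * i                      # new infections per unit time
        removed = (gamma + u) * i                   # recoveries plus control-driven removals
        s, i, r = s - dt * new_inf, i + dt * (new_inf - removed), r + dt * removed
        traj[k + 1] = (s, i, r)
    return traj

beta, gamma = 0.5, 0.2                              # basic reproduction number beta/gamma = 2.5
free = simulate_sir(beta, gamma, u=0.0, s0=0.99, i0=0.01)
controlled = simulate_sir(beta, gamma, u=0.1, s0=0.99, i0=0.01)
print(free[:, 1].max(), controlled[:, 1].max())     # peak infected fraction without / with control
```

An MPC scheme would choose `u` over a receding horizon instead of holding it constant, and a PINN would estimate the unobserved states and parameters from noisy infected-state measurements.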

[LG-64] TripOptimizer: Generative 3D Shape Optimization and Drag Prediction using Triplane VAE Networks

Link: https://arxiv.org/abs/2509.12224
Authors: Parsa Vatani, Mohamed Elrefaie, Farhad Nazarpour, Faez Ahmed
Categories: Machine Learning (cs.LG)

Abstract:The computational cost of traditional Computational Fluid Dynamics-based Aerodynamic Shape Optimization severely restricts design space exploration. This paper introduces TripOptimizer, a fully differentiable deep learning framework for rapid aerodynamic analysis and shape optimization directly from vehicle point cloud data. TripOptimizer employs a Variational Autoencoder featuring a triplane-based implicit neural representation for high-fidelity 3D geometry reconstruction and a drag coefficient prediction head. Trained on DrivAerNet++, a large-scale dataset of 8,000 unique vehicle geometries with corresponding drag coefficients computed via Reynolds-Averaged Navier-Stokes simulations, the model learns a latent representation that encodes aerodynamically salient geometric features. We propose an optimization strategy that modifies a subset of the encoder parameters to steer an initial geometry towards a target drag value, and demonstrate its efficacy in case studies where optimized designs achieved drag coefficient reductions up to 11.8%. These results were subsequently validated by using independent, high-fidelity Computational Fluid Dynamics simulations with more than 150 million cells. A key advantage of the implicit representation is its inherent robustness to geometric imperfections, enabling optimization of non-watertight meshes, a significant challenge for traditional adjoint-based methods. The framework enables a more agile Aerodynamic Shape Optimization workflow, reducing reliance on computationally intensive CFD simulations, especially during early design stages.

[LG-65] Accelerating Protein Molecular Dynamics Simulation with DeepJump

Link: https://arxiv.org/abs/2509.13294
Authors: Allan dos Santos Costa, Manvitha Ponnapati, Dana Rubin, Tess Smidt, Joseph Jacobson
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)

Abstract:Unraveling the dynamical motions of biomolecules is essential for bridging their structure and function, yet it remains a major computational challenge. Molecular dynamics (MD) simulation provides a detailed depiction of biomolecular motion, but its high-resolution temporal evolution comes at significant computational cost, limiting its applicability to timescales of biological relevance. Deep learning approaches have emerged as promising solutions to overcome these computational limitations by learning to predict long-timescale dynamics. However, generalizable kinetics models for proteins remain largely unexplored, and the fundamental limits of achievable acceleration while preserving dynamical accuracy are poorly understood. In this work, we fill this gap with DeepJump, a Euclidean-Equivariant Flow Matching-based model for predicting protein conformational dynamics across multiple temporal scales. We train DeepJump on trajectories of the diverse proteins of mdCATH, systematically studying our model’s performance in generalizing to long-term dynamics of fast-folding proteins and characterizing the trade-off between computational acceleration and prediction accuracy. We demonstrate the application of DeepJump to ab initio folding, showcasing prediction of folding pathways and native states. Our results demonstrate that DeepJump achieves significant $\approx 1000\times$ computational acceleration while effectively recovering long-timescale dynamics, providing a stepping stone for enabling routine simulation of proteins.

[LG-66] Flow-Based Fragment Identification via Binding Site-Specific Latent Representations

Link: https://arxiv.org/abs/2509.13216
Authors: Rebecca Manuela Neeser, Ilia Igashov, Arne Schneuing, Michael Bronstein, Philippe Schwaller, Bruno Correia
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)

Abstract:Fragment-based drug design is a promising strategy leveraging the binding of small chemical moieties that can efficiently guide drug discovery. The initial step of fragment identification remains challenging, as fragments often bind weakly and non-specifically. We developed a protein-fragment encoder that relies on a contrastive learning approach to map both molecular fragments and protein surfaces in a shared latent space. The encoder captures interaction-relevant features and allows to perform virtual screening as well as generative design with our new method LatentFrag. In LatentFrag, fragment embeddings and positions are generated conditioned on the protein surface while being chemically realistic by construction. Our expressive fragment and protein representations allow location of protein-fragment interaction sites with high sensitivity and we observe state-of-the-art fragment recovery rates when sampling from the learned distribution of latent fragment embeddings. Our generative method outperforms common methods such as virtual screening at a fraction of its computational cost providing a valuable starting point for fragment hit discovery. We further show the practical utility of LatentFrag and extend the workflow to full ligand design tasks. Together, these approaches contribute to advancing fragment identification and provide valuable tools for fragment-based drug discovery.

[LG-67] SURGIN: SURrogate-guided Generative INversion for subsurface multiphase flow with quantified uncertainty

Link: https://arxiv.org/abs/2509.13189
Authors: Zhao Feng, Bicheng Yan, Luanxiao Zhao, Xianda Shen, Renyu Zhao, Wenhao Wang, Fengshou Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph)

Abstract:We present a direct inverse modeling method named SURGIN, a SURrogate-guided Generative INversion framework tailored for subsurface multiphase flow data assimilation. Unlike existing inversion methods that require adaptation for each new observational configuration, SURGIN features a zero-shot conditional generation capability, enabling real-time assimilation of unseen monitoring data without task-specific retraining. Specifically, SURGIN synergistically integrates a U-Net enhanced Fourier Neural Operator (U-FNO) surrogate with a score-based generative model (SGM), framing the conditional generation as a surrogate prediction-guidance process in a Bayesian perspective. Instead of directly learning the conditional generation of geological parameters, an unconditional SGM is first pretrained in a self-supervised manner to capture the geological prior, after which posterior sampling is performed by leveraging a differentiable U-FNO surrogate to enable efficient forward evaluations conditioned on unseen observations. Extensive numerical experiments demonstrate SURGIN’s capability to decently infer heterogeneous geological fields and predict spatiotemporal flow dynamics with quantified uncertainty across diverse measurement settings. By unifying generative learning with surrogate-guided Bayesian inference, SURGIN establishes a new paradigm for inverse modeling and uncertainty quantification in parametric functional spaces.

[LG-68] Fast reconstruction of degenerate populations of conductance-based neuron models from spike times

Link: https://arxiv.org/abs/2509.12783
Authors: Julien Brandoit, Damien Ernst, Guillaume Drion, Arthur Fyon
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)

Abstract:Neurons communicate through spikes, and spike timing is a crucial part of neuronal processing. Spike times can be recorded experimentally both intracellularly and extracellularly, and are the main output of state-of-the-art neural probes. On the other hand, neuronal activity is controlled at the molecular level by the currents generated by many different transmembrane proteins called ion channels. Connecting spike timing to ion channel composition remains an arduous task to date. To address this challenge, we developed a method that combines deep learning with a theoretical tool called Dynamic Input Conductances (DICs), which reduce the complexity of ion channel interactions into three interpretable components describing how neurons spike. Our approach uses deep learning to infer DICs directly from spike times and then generates populations of “twin” neuron models that replicate the observed activity while capturing natural variability in membrane channel composition. The method is fast, accurate, and works using only spike recordings. We also provide open-source software with a graphical interface, making it accessible to researchers without programming expertise.

[LG-69] DeltaHedge: A Multi-Agent Framework for Portfolio Options Optimization

Link: https://arxiv.org/abs/2509.12753
Authors: Feliks Bańka (Warsaw University of Technology, Faculty of Electronics and Information Technology), Jarosław A. Chudziak (Warsaw University of Technology)
Categories: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Presented at Pacific Asia Conference on Information Systems (PACIS 2025), Kuala Lumpur. Official proceedings available at this https URL. 16 pages, 7 figures, 3 tables

Abstract:In volatile financial markets, balancing risk and return remains a significant challenge. Traditional approaches often focus solely on equity allocation, overlooking the strategic advantages of options trading for dynamic risk hedging. This work presents DeltaHedge, a multi-agent framework that integrates options trading with AI-driven portfolio management. By combining advanced reinforcement learning techniques with an ensembled options-based hedging strategy, DeltaHedge enhances risk-adjusted returns and stabilizes portfolio performance across varying market conditions. Experimental results demonstrate that DeltaHedge outperforms traditional strategies and standalone models, underscoring its potential to transform practical portfolio management in complex financial environments. Building on these findings, this paper contributes to the fields of quantitative finance and AI-driven portfolio optimization by introducing a novel multi-agent system for integrating options trading strategies, addressing a gap in the existing literature.

[LG-70] PBPK-iPINNs: Inverse Physics-Informed Neural Networks for Physiologically Based Pharmacokinetic Brain Models

Link: https://arxiv.org/abs/2509.12666
Authors: Charuka D. Wickramasinghe, Krishanthi C. Weerasinghe, Pradeep K. Ranaweera
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 24 pages, 11 figures

Abstract:Physics-Informed Neural Networks (PINNs) leverage machine learning with differential equations to solve direct and inverse problems, ensuring predictions follow physical laws. Physiologically based pharmacokinetic (PBPK) modeling advances beyond classical compartmental approaches by using a mechanistic, physiology focused framework. A PBPK model is based on a system of ODEs, with each equation representing the mass balance of a drug in a compartment, such as an organ or tissue. These ODEs include parameters that reflect physiological, biochemical, and drug-specific characteristics to simulate how the drug moves through the body. In this paper, we introduce PBPK-iPINN, a method to estimate drug-specific or patient-specific parameters and drug concentration profiles in PBPK brain compartment models using inverse PINNs. We demonstrate that, for the inverse problem to converge to the correct solution, the loss function components (data loss, initial conditions loss, and residual loss) must be appropriately weighted, and parameters (including number of layers, number of neurons, activation functions, learning rate, optimizer, and collocation points) must be carefully tuned. The performance of the PBPK-iPINN approach is then compared with established traditional numerical and statistical methods.
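
The weighted three-part loss described in this abstract (data loss, initial-condition loss, residual loss) can be sketched on a deliberately simple stand-in problem. The example below assumes a single-compartment elimination ODE `dC/dt = -k*C` instead of the paper's PBPK brain model, and it passes analytic values where a real PINN would use network outputs and automatic differentiation. The function name, weights, and numbers are all hypothetical.

```python
import numpy as np

def weighted_inverse_loss(c_pred, dcdt_pred, k, data_idx, c_obs, c0,
                          w_data=1.0, w_ic=1.0, w_res=1.0):
    """Weighted sum of data, initial-condition, and ODE-residual losses for dC/dt = -k*C."""
    data_loss = np.mean((c_pred[data_idx] - c_obs) ** 2)    # mismatch with observed concentrations
    ic_loss = (c_pred[0] - c0) ** 2                          # mismatch at t = 0
    res_loss = np.mean((dcdt_pred + k * c_pred) ** 2)        # ODE residual at collocation points
    return w_data * data_loss + w_ic * ic_loss + w_res * res_loss

t = np.linspace(0.0, 5.0, 51)                                # collocation points
k_true, c0 = 0.8, 2.0
c_exact = c0 * np.exp(-k_true * t)                           # stand-in for the network prediction
dcdt_exact = -k_true * c_exact                               # stand-in for its time derivative
data_idx = np.arange(0, 51, 10)                              # sparse observation times
c_obs = c_exact[data_idx]                                    # noise-free synthetic observations

loss_true = weighted_inverse_loss(c_exact, dcdt_exact, k_true, data_idx, c_obs, c0)
loss_wrong = weighted_inverse_loss(c_exact, dcdt_exact, 0.3, data_idx, c_obs, c0)
print(loss_true, loss_wrong)                                 # the wrong rate constant gives a larger loss
```

In an inverse PINN, `k` would be a trainable parameter optimized together with the network weights; the relative sizes of `w_data`, `w_ic`, and `w_res` control which term dominates the gradient, which is the tuning concern the abstract raises.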

[LG-71] Sustainable LSTM-Based Precoding for RIS-Aided mmWave MIMO Systems with Implicit CSI

Link: https://arxiv.org/abs/2509.12658
Authors: Po-Heng Chou, Jiun-Jia Wu, Wan-Jen Huang, Ronald Y. Chang
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, 2 tables, and accepted by 2025 IEEE Globecom Workshops

Abstract:In this paper, we propose a sustainable long short-term memory (LSTM)-based precoding framework for reconfigurable intelligent surface (RIS)-assisted millimeter-wave (mmWave) MIMO systems. Instead of explicit channel state information (CSI) estimation, the framework exploits uplink pilot sequences to implicitly learn channel characteristics, reducing both pilot overhead and inference complexity. Practical hardware constraints are addressed by incorporating the phase-dependent amplitude model of RIS elements, while a multi-label training strategy improves robustness when multiple near-optimal codewords yield comparable performance. Simulations show that the proposed design achieves over 90% of the spectral efficiency of exhaustive search (ES) with only 2.2% of its computation time, cutting energy consumption by nearly two orders of magnitude. The method also demonstrates resilience under distribution mismatch and scalability to larger RIS arrays, making it a practical and energy-efficient solution for sustainable 6G wireless networks.

[LG-72] SamudrACE: Fast and Accurate Coupled Climate Modeling with 3D Ocean and Atmosphere Emulators

Link: https://arxiv.org/abs/2509.12490
Authors: James P. C. Duncan, Elynn Wu, Surya Dheeshjith, Adam Subel, Troy Arcomano, Spencer K. Clark, Brian Henn, Anna Kwa, Jeremy McGibbon, W. Andre Perkins, William Gregory, Carlos Fernandez-Granda, Julius Busecke, Oliver Watt-Meyer, William J. Hurlin, Alistair Adcroft, Laure Zanna, Christopher Bretherton
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 23 pages, 17 figures

Abstract:Traditional numerical global climate models simulate the full Earth system by exchanging boundary conditions between separate simulators of the atmosphere, ocean, sea ice, land surface, and other geophysical processes. This paradigm allows for distributed development of individual components within a common framework, unified by a coupler that handles translation between realms via spatial or temporal alignment and flux exchange. Following a similar approach adapted for machine learning-based emulators, we present SamudrACE: a coupled global climate model emulator which produces centuries-long simulations at 1-degree horizontal, 6-hourly atmospheric, and 5-daily oceanic resolution, with 145 2D fields spanning 8 atmospheric and 19 oceanic vertical levels, plus sea ice, surface, and top-of-atmosphere variables. SamudrACE is highly stable and has low climate biases comparable to those of its components with prescribed boundary forcing, with realistic variability in coupled climate phenomena such as ENSO that is not possible to simulate in uncoupled mode.

[LG-73] VADER: A Variational Autoencoder to Infer Planetary Masses and Gas-Dust Disk Properties Around Young Stars

Link: https://arxiv.org/abs/2509.12324
Authors: Sayed Shafaat Mahmud, Sayantan Auddy, Neal Turner, Jeffrey S. Bary
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, Accepted and Published at International Conference on Machine Learning, Machine Learning for Astrophysics Workshop 2025

Abstract:We present VADER (Variational Autoencoder for Disks Embedded with Rings), for inferring both planet mass and global disk properties from high-resolution ALMA dust continuum images of protoplanetary disks (PPDs). VADER, a probabilistic deep learning model, enables uncertainty-aware inference of planet masses, $\alpha$-viscosity, dust-to-gas ratio, Stokes number, flaring index, and the number of planets directly from protoplanetary disk images. VADER is trained on over 100,000 synthetic images of PPDs generated from FARGO3D simulations post-processed with RADMC3D. Our trained model predicts physical planet and disk parameters with $R^2 > 0.9$ from dust continuum images of PPDs. Applied to 23 real disks, VADER’s mass estimates are consistent with literature values and reveal latent correlations that reflect known disk physics. Our results establish VAE-based generative models as robust tools for probabilistic astrophysical inference, with direct applications to interpreting protoplanetary disk substructures in the era of large interferometric surveys.

[LG-74] Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models

Link: https://arxiv.org/abs/2509.12266
Authors: Weimin Wu,Xuefeng Song,Yibo Wen,Qinjie Lin,Zhihan Zhou,Jerry Yao-Chieh Hu,Zhong Wang,Han Liu
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:We introduce Genome-Factory, an integrated Python library for tuning, deploying, and interpreting genomic models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control, such as GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to interpretable genomic features by regressing on external readouts. To improve accessibility, Genome-Factory features both a zero-code command-line interface and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its end-to-end usability and practical value for real-world genomic analysis.
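
The quality-control step mentioned in the abstract depends on GC content. As a rough illustration only, the sketch below computes the GC fraction of a DNA sequence and keeps sequences whose GC fraction falls inside a range. This is a plain GC filter, not the normalization procedure of Genome-Factory, which the abstract does not describe. The function names `gc_content` and `filter_by_gc` and the default thresholds are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G/C bases among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = sum(seq.count(base) for base in "ACGT")
    # Sequences with no unambiguous bases (e.g. all N) get 0.0 by convention here.
    return gc / total if total else 0.0


def filter_by_gc(seqs, low=0.3, high=0.7):
    """Keep sequences whose GC fraction lies in [low, high] (assumed thresholds)."""
    return [s for s in seqs if low <= gc_content(s) <= high]


if __name__ == "__main__":
    print(filter_by_gc(["ATGC", "GGGG", "ATAT"]))
```

Ambiguous bases such as `N` are ignored in the denominator, which is one common convention; a real pipeline may handle them differently.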

[LG-75] CNN-BiLSTM for sustainable and non-invasive COVID-19 detection via salivary ATR-FTIR spectroscopy

Link: https://arxiv.org/abs/2509.12241
Authors: Anisio P. Santos Junior,Robinson Sabino-Silva,Mário Machado Martins,Thulio Marquez Cunha,Murillo G. Carneiro
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG)

Abstract:The COVID-19 pandemic has placed unprecedented strain on healthcare systems and remains a global health concern, especially with the emergence of new variants. Although real-time polymerase chain reaction (RT-PCR) is considered the gold standard for COVID-19 detection, it is expensive, time-consuming, labor-intensive, and sensitive to issues with RNA extraction. In this context, ATR-FTIR spectroscopy analysis of biofluids offers a reagent-free, cost-effective alternative for COVID-19 detection. We propose a novel architecture that combines Convolutional Neural Networks (CNN) with Bidirectional Long Short-Term Memory (BiLSTM) networks, referred to as CNN-BiLSTM, to process spectra generated by ATR-FTIR spectroscopy and diagnose COVID-19 from spectral samples. We compare the performance of this architecture against a standalone CNN and other state-of-the-art machine learning techniques. Experimental results demonstrate that our CNN-BiLSTM model outperforms all other models, achieving an average accuracy and F1-score of 0.80 on a challenging real-world COVID-19 dataset. The addition of the BiLSTM layer to the CNN architecture significantly enhances model performance, making CNN-BiLSTM a more accurate and reliable choice for detecting COVID-19 using ATR-FTIR spectra of non-invasive saliva samples.

Information Retrieval

[IR-0] Green Recommender Systems: Understanding and Minimizing the Carbon Footprint of AI-Powered Personalization

Link: https://arxiv.org/abs/2509.13001
Authors: Lukas Wegmeth,Tobias Vente,Alan Said,Joeran Beel
Categories: Information Retrieval (cs.IR)
Comments: Just Accepted at ACM TORS. arXiv admin note: substantial text overlap with arXiv:2408.08203

Abstract:As global warming soars, the need to assess and reduce the environmental impact of recommender systems is becoming increasingly urgent. Despite this, the recommender systems community hardly understands, addresses, and evaluates the environmental impact of their work. In this study, we examine the environmental impact of recommender systems research by reproducing typical experimental pipelines. Based on our results, we provide guidelines for researchers and practitioners on how to minimize the environmental footprint of their work and implement green recommender systems - recommender systems designed to minimize their energy consumption and carbon footprint. Our analysis covers 79 papers from the 2013 and 2023 ACM RecSys conferences, comparing traditional “good old-fashioned AI” models with modern deep learning models. We designed and reproduced representative experimental pipelines for both years, measuring energy consumption using a hardware energy meter and converting it into CO2 equivalents. Our results show that papers utilizing deep learning models emit approximately 42 times more CO2 equivalents than papers using traditional models. On average, a single deep learning-based paper generates 2,909 kilograms of CO2 equivalents - more than the carbon emissions of a person flying from New York City to Melbourne or the amount of CO2 sequestered by one tree over 260 years. This work underscores the urgent need for the recommender systems and wider machine learning communities to adopt green AI principles, balancing algorithmic advancements and environmental responsibility to build a sustainable future with AI-powered personalization.
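
The measurement step described in the abstract (metered energy converted into CO2 equivalents) comes down to multiplying energy by a grid carbon intensity. The sketch below shows that arithmetic only. The function name `co2e_kg` and the example figures (100 kWh, 400 gCO2e/kWh) are hypothetical placeholders, not values from the paper.

```python
def co2e_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Convert metered energy (kWh) to kilograms of CO2 equivalents.

    intensity_g_per_kwh is the carbon intensity of the electricity mix in
    gCO2e per kWh; it varies by region and year and must be supplied.
    """
    if energy_kwh < 0 or intensity_g_per_kwh < 0:
        raise ValueError("energy and carbon intensity must be non-negative")
    return energy_kwh * intensity_g_per_kwh / 1000.0


if __name__ == "__main__":
    # Hypothetical run: 100 kWh on a grid emitting 400 gCO2e/kWh.
    print(co2e_kg(100.0, 400.0))
```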

[IR-1] Protecting participants or population? Comparison of k-anonymous Origin-Destination matrices

Link: https://arxiv.org/abs/2509.12950
Authors: Pietro Armenante,Kai Huang,Nikhil Jha,Luca Vassio
Categories: Information Retrieval (cs.IR); Data Structures and Algorithms (cs.DS)
Comments: Accepted at NetMob 2025 Data Challenge (full report)

Abstract:Origin-Destination (OD) matrices are a core component of research on users’ mobility and summarize how individuals move between geographical regions. These regions should be small enough to be representative of user mobility, without incurring substantial privacy risks. There are two added values of the NetMob2025 challenge dataset. Firstly, the data is extensive and contains a lot of socio-demographic information that can be used to create multiple OD matrices, based on the segments of the population. Secondly, a participant is not merely a record in the data, but a statistically weighted proxy for a segment of the real population. This opens the door to a fundamental shift in the anonymization paradigm. A population-based view of privacy is central to our contribution. By adjusting our anonymization framework to account for representativeness, we are also protecting the inferred identity of the actual population, rather than survey participants alone. The challenge addressed in this work is to produce and compare OD matrices that are k-anonymous for survey participants and for the whole population. We compare several traditional methods of anonymization to k-anonymity by generalizing geographical areas. These include generalization over a hierarchy (ATG and OIGH) and the classical Mondrian. To this established toolkit, we add a novel method, i.e., ODkAnon, a greedy algorithm aiming at balancing speed and quality. Unlike previous approaches, which primarily address the privacy aspects of the given datasets, we aim to contribute to the generation of privacy-preserving OD matrices enriched with socio-demographic segmentation that achieves k-anonymity on the actual population.
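
To make the participant-versus-population distinction concrete, the sketch below aggregates weighted trips into OD cells, lists the cells that fall below k, and shows how merging regions through a hierarchy can remove a violation. It is a minimal illustration and not ATG, OIGH, Mondrian, or ODkAnon. The names `build_od` and `violating_cells`, the toy hierarchy, and the assumption that each trip record is one participant with one statistical weight are all hypothetical.

```python
from collections import defaultdict


def build_od(trips, parent=None):
    """Aggregate (origin, destination, weight) records into OD cells.

    parent: optional dict mapping a region to a coarser region
            (a hypothetical one-level hierarchy used for generalization).
    Returns {(origin, destination): [participant_count, weighted_population]}.
    """
    cells = defaultdict(lambda: [0, 0.0])
    for origin, dest, weight in trips:
        if parent:
            origin, dest = parent.get(origin, origin), parent.get(dest, dest)
        cells[(origin, dest)][0] += 1        # one record = one participant (assumption)
        cells[(origin, dest)][1] += weight   # people this participant represents
    return dict(cells)


def violating_cells(cells, k, population=False):
    """Return the cells whose count (participants or weighted population) is below k."""
    idx = 1 if population else 0
    return sorted(cell for cell, counts in cells.items() if counts[idx] < k)


if __name__ == "__main__":
    trips = [("A1", "B1", 40.0), ("A1", "B1", 35.0), ("A2", "B1", 90.0)]
    hierarchy = {"A1": "A", "A2": "A", "B1": "B"}
    fine = build_od(trips)
    coarse = build_od(trips, hierarchy)
    print(violating_cells(fine, k=2))    # cells with fewer than 2 participants
    print(violating_cells(coarse, k=2))  # after merging A1 and A2 into A
```

Generalizing regions trades spatial detail for privacy: the coarser matrix has fewer violating cells but says less about where people actually move.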

[IR-2] A Learnable Fully Interacted Two-Tower Model for Pre-Ranking System

Link: https://arxiv.org/abs/2509.12948
Authors: Chao Xiong,Xianwen Yu,Wei Xu,Lei Cheng,Chuan Yuan,Linjian Mo
Categories: Information Retrieval (cs.IR)

Abstract:Pre-ranking plays a crucial role in large-scale recommender systems by significantly improving the efficiency and scalability within the constraints of providing high-quality candidate sets in real time. The two-tower model is widely used in pre-ranking systems due to a good balance between efficiency and effectiveness with decoupled architecture, which independently processes user and item inputs before calculating their interaction (e.g. dot product or similarity measure). However, this independence also leads to the lack of information interaction between the two towers, resulting in less effectiveness. In this paper, a novel architecture named learnable Fully Interacted Two-tower Model (FIT) is proposed, which enables rich information interactions while ensuring inference efficiency. FIT mainly consists of two parts: Meta Query Module (MQM) and Lightweight Similarity Scorer (LSS). Specifically, MQM introduces a learnable item meta matrix to achieve expressive early interaction between user and item features. Moreover, LSS is designed to further obtain effective late interaction between the user and item towers. Finally, experimental results on several public datasets show that our proposed FIT significantly outperforms the state-of-the-art baseline pre-ranking models.
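
For readers unfamiliar with the baseline the abstract starts from, the sketch below shows a plain two-tower scorer: each tower maps its own features to an embedding independently, and the only interaction is a final dot product. It does not implement FIT, MQM, or LSS. The single linear layer per tower, the weight matrices, and all names are hypothetical.

```python
def linear(weights, x):
    """Apply a linear layer (given as a list of rows) to a feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_items(user_feats, item_embs, w_user):
    """Score precomputed item embeddings against one user and rank them."""
    user_emb = linear(w_user, user_feats)  # user tower, run once per request
    scores = {item: dot(user_emb, emb) for item, emb in item_embs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy weights and features (hypothetical values).
W_USER = [[1.0, 0.0], [0.0, 1.0]]
W_ITEM = [[2.0, 0.0], [0.0, 2.0]]
ITEM_FEATS = {"item_a": [3.0, 4.0], "item_b": [1.0, 0.0]}

# Item tower: embeddings can be computed offline because they never see the user.
ITEM_EMBS = {name: linear(W_ITEM, feats) for name, feats in ITEM_FEATS.items()}

if __name__ == "__main__":
    ranking, scores = rank_items([1.0, 2.0], ITEM_EMBS, W_USER)
    print(ranking, scores)
```

Because the item tower never sees user features, item embeddings can be cached, which is the efficiency the abstract refers to; the same independence is the missing interaction that the paper aims to restore.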

[IR-3] DiffHash: Text-Guided Targeted Attack via Diffusion Models against Deep Hashing Image Retrieval

Link: https://arxiv.org/abs/2509.12824
Authors: Zechao Liu,Zheng Zhou,Xiangkun Chen,Tao Liang,Dapeng Lang
Categories: Information Retrieval (cs.IR)

Abstract:Deep hashing models have been widely adopted to tackle the challenges of large-scale image retrieval. However, these approaches face serious security risks due to their vulnerability to adversarial examples. Despite the increasing exploration of targeted attacks on deep hashing models, existing approaches still suffer from a lack of multimodal guidance, reliance on labeling information and dependence on pixel-level operations for attacks. To address these limitations, we proposed DiffHash, a novel diffusion-based targeted attack for deep hashing. Unlike traditional pixel-based attacks that directly modify specific pixels and lack multimodal guidance, our approach focuses on optimizing the latent representations of images, guided by text information generated by a Large Language Model (LLM) for the target image. Furthermore, we designed a multi-space hash alignment network to align the high-dimension image space and text space to the low-dimension binary hash space. During reconstruction, we also incorporated text-guided attention mechanisms to refine adversarial examples, ensuring them aligned with the target semantics while maintaining visual plausibility. Extensive experiments have demonstrated that our method outperforms state-of-the-art (SOTA) targeted attack methods, achieving better black-box transferability and offering more excellent stability across datasets.

[IR-4] Timbre-Adaptive Transcription: A Lightweight Architecture with Associative Memory for Dynamic Instrument Separation

Link: https://arxiv.org/abs/2509.12712
Authors: Ruigang Li,Yongxu Zhu
Categories: Sound (cs.SD); Information Retrieval (cs.IR)

Abstract:Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments and rigid source-count constraints. We address these limitations with a lightweight deep clustering solution featuring: 1) a timbre-agnostic backbone achieving state-of-the-art performance with only half the parameters of comparable models, and 2) a novel associative memory mechanism that mimics human auditory cognition to dynamically encode unseen timbres via attention-based clustering. Our biologically-inspired framework enables adaptive polyphonic separation with minimal training data (12.5 minutes), supported by a new synthetic dataset method offering cost-effective, high-precision multi-timbre generation. Experiments show the timbre-agnostic transcription model outperforms existing models on public benchmarks, while the separation module demonstrates promising timbre discrimination. This work provides an efficient framework for timbre-related music transcription and explores new directions for timbre-aware separation through cognitive-inspired architectures.

[IR-5] What News Recommendation Research Did (But Mostly Didn't) Teach Us About Building A News Recommender

Link: https://arxiv.org/abs/2509.12361
Authors: Karl Higley,Robin Burke,Michael D. Ekstrand,Bart P. Knijnenburg
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:One of the goals of recommender systems research is to provide insights and methods that can be used by practitioners to build real-world systems that deliver high-quality recommendations to actual people grounded in their genuine interests and needs. We report on our experience trying to apply the news recommendation literature to build POPROX, a live platform for news recommendation research, and reflect on the extent to which the current state of research supports system-building efforts. Our experience highlights several unexpected challenges encountered in building personalization features that are commonly found in products from news aggregators and publishers, and shows how those difficulties are connected to surprising gaps in the literature. Finally, we offer a set of lessons learned from building a live system with a persistent user base and highlight opportunities to make future news recommendation research more applicable and impactful in practice.

[IR-6] Knowledge Graph Tokenization for Behavior-Aware Generative Next POI Recommendation

Link: https://arxiv.org/abs/2509.12350
Authors: Ke Sun,Mayi Xu
Categories: Information Retrieval (cs.IR)

Abstract:Generative paradigm, especially powered by Large Language Models (LLMs), has emerged as a new solution to the next point-of-interest (POI) recommendation. Pioneering studies usually adopt a two-stage pipeline, starting with a tokenizer converting POIs into discrete identifiers that can be processed by LLMs, followed by POI behavior prediction tasks to instruction-tune LLM for next POI recommendation. Despite of remarkable progress, they still face two limitations: (1) existing tokenizers struggle to encode heterogeneous signals in the recommendation data, suffering from information loss issue, and (2) previous instruction-tuning tasks only focus on users’ POI visit behavior while ignore other behavior types, resulting in insufficient understanding of mobility. To address these limitations, we propose KGTB (Knowledge Graph Tokenization for Behavior-aware generative next POI recommendation). Specifically, KGTB organizes the recommendation data in a knowledge graph (KG) format, of which the structure can seamlessly preserve the heterogeneous information. Then, a KG-based tokenizer is developed to quantize each node into an individual structural ID. This process is supervised by the KG’s structure, thus reducing the loss of heterogeneous information. Using generated IDs, KGTB proposes multi-behavior learning that introduces multiple behavior-specific prediction tasks for LLM fine-tuning, e.g., POI, category, and region visit behaviors. Learning on these behavior tasks provides LLMs with comprehensive insights on the target POI visit behavior. Experiments on four real-world city datasets demonstrate the superior performance of KGTB.

[IR-7] Identifying Information Technology Research Trends through Text Mining of NSF Awards

Link: https://arxiv.org/abs/2509.12245
Authors: Said Varlioglu,Hazem Said,Murat Ozer,Nelly Elsayed
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 8 pages, under review

Abstract:Information Technology (IT) is recognized as an independent and unique research field. However, there has been ambiguity and difficulty in identifying and differentiating IT research from other close variations. Given this context, this paper aimed to explore the roots of the Information Technology (IT) research domain by conducting a large-scale text mining analysis of 50,780 abstracts from awarded NSF CISE grants from 1985 to 2024. We categorized the awards based on their program content, labeling human-centric programs as IT research programs and infrastructure-centric programs as other research programs based on the IT definitions in the literature. This novel approach helped us identify the core concepts of IT research and compare the similarities and differences between IT research and other research areas. The results showed that IT differentiates itself from other close variations by focusing more on the needs of users, organizations, and societies.

Attachment Download

Click to download today's full paper list