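This post lists the latest papers retrieved from arXiv.org on 2025-09-11. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.
Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-09-11)
425 papers were updated today, including:
- Natural Language Processing: 43 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 110 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 76 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 114 papers (Machine Learning (cs.LG))
Natural Language Processing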
[NLP-0] A Survey of Reinforcement Learning for Large Reasoning Models
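[Quick Read]: This paper examines how reinforcement learning (RL) can improve the reasoning ability of large language models (LLMs) and drive their evolution into Large Reasoning Models (LRMs). The core challenge is that, although RL has markedly strengthened LLMs on complex logical tasks such as mathematics and coding, scaling RL further for LRMs faces systemic bottlenecks in computational resources, algorithm design, training data, and infrastructure. The key contribution is a systematic review of the foundational components, core problems, training resources, and downstream applications of RL for LLMs and LRMs, with a focus on progress since the release of DeepSeek-R1, in order to identify paths that improve the scalability of RL and to offer theoretical grounding and practical direction toward Artificial SuperIntelligence (ASI).
Link: https://arxiv.org/abs/2509.08827
Authors: Kaiyan Zhang,Yuxin Zuo,Bingxiang He,Youbang Sun,Runze Liu,Che Jiang,Yuchen Fan,Kai Tian,Guoli Jia,Pengfei Li,Yu Fu,Xingtai Lv,Yuchen Zhang,Sihang Zeng,Shang Qu,Haozhan Li,Shijie Wang,Yuru Wang,Xinwei Long,Fangfu Liu,Xiang Xu,Jiaze Ma,Xuekai Zhu,Ermo Hua,Yihao Liu,Zonglin Li,Huayu Chen,Xiaoye Qu,Yafu Li,Weize Chen,Zhenzhao Yuan,Junqi Gao,Dong Li,Zhiyuan Ma,Ganqu Cui,Zhiyuan Liu,Biqing Qi,Ning Ding,Bowen Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: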
Abstract:In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: this https URL
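[NLP-1] Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation
[Quick Read]: This paper addresses "LLM hacking" in empirical social science research: differences in implementation choices (such as model selection, prompting strategy, or temperature settings) introduce systematic biases and random errors that lead to incorrect statistical inferences (Type I, Type II, Type S, or Type M errors). The key to its approach is to replicate 37 data annotation tasks at scale and test 2,361 realistic hypotheses, quantifying how different implementation choices affect the reliability of conclusions. The results show that even with state-of-the-art language models, about one in three hypotheses yields an incorrect conclusion, and the risk is higher for small models. The paper also stresses the central role of human annotations in reducing false positive findings and notes that common regression estimator correction techniques do not effectively mitigate the risk, because they heavily trade off Type I against Type II errors.
Link: https://arxiv.org/abs/2509.08825
Authors: Joachim Baumann,Paul Röttger,Aleksandra Urman,Albert Wendsjö,Flor Miriam Plaza-del-Arco,Johannes B. Gruber,Dirk Hovy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: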
Abstract:Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With few LLMs and just a handful of prompt paraphrases, anything can be presented as statistically significant. 
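[NLP-2] Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
[Quick Read]: This paper addresses a key challenge in building training data for multilingual large language models (LLMs): how to construct high-quality, scalable corpora for non-English languages so as to improve model performance in the target language. The core problem is that existing work focuses mostly on English corpora and lacks data selection and preprocessing strategies for other languages such as Portuguese, which limits how well models transfer. The key to the solution is a set of scalable methods for building web-based corpora together with language-specific filtering pipelines (including classifiers for educational and STEM content and for toxic content). Using a continual pretraining setup, the paper shows that high-quality, language-specific data yields performance improvements, offering a general and practical path for multilingual LLM development.
Link: https://arxiv.org/abs/2509.08824
Authors: Thales Sales Almeida,Rodrigo Nogueira,Helio Pedrini
Affiliations: Institute of Computing, University of Campinas; Maritaca AI
Categories: Computation and Language (cs.CL)
Comments: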
Abstract:The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs. We apply them to build a new 120B token corpus in Portuguese that achieves competitive results to an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education, science, technology, engineering, and mathematics (STEM), as well as toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.
[NLP-3] Merge-of-Thought Distillation
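[Quick Read]: This paper addresses the single-teacher assumption that constrains knowledge distillation for long chain-of-thought (CoT) models: when multiple candidate teachers and growing CoT corpora are available, how can the knowledge of several teachers be fused efficiently to improve the student model? The core solution is a lightweight framework, Merge-of-Thought Distillation (MoT), which alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants, unifying the teachers' reasoning abilities while easing conflicts among their supervision signals. The method outperforms the best single-teacher distillation and the naive multi-teacher union, reduces catastrophic forgetting, improves general reasoning, and shows transfer potential on tasks beyond mathematics.
Link: https://arxiv.org/abs/2509.08814
Authors: Zhanming Shen,Zeyu Qin,Zenan Huang,Hao Chen,Jiaqi Hu,Yihong Zhuang,Guoshan Lu,Gang Chen,Junbo Zhao
Affiliations: Zhejiang University; Inclusion AI, Ant Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: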
Abstract:Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different “best teachers,” and even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers’ reasoning abilities into a student while overcoming conflicts among the teachers’ supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1, demonstrating substantial gains. Besides, MoT consistently outperforms the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and shows robustness to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
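The alternation described in the abstract (teacher-specific fine-tuning branches followed by weight-space merging) can be pictured with a small sketch. This is only an illustration under stated assumptions: the abstract does not say how the variants are merged, so uniform parameter averaging is assumed here, and `merge_students`, `mot_round`, and the `fine_tune` callback are hypothetical names rather than the authors' code.

```python
from typing import Callable, Dict, List, Sequence

# A model is represented as named, flattened parameter vectors (toy stand-in).
Weights = Dict[str, List[float]]


def merge_students(variants: Sequence[Weights]) -> Weights:
    """Uniformly average each parameter across student variants.

    Assumption: uniform averaging; the paper may use a different merge rule.
    """
    if not variants:
        raise ValueError("need at least one student variant")
    merged: Weights = {}
    for name in variants[0]:
        columns = zip(*(variant[name] for variant in variants))
        merged[name] = [sum(column) / len(variants) for column in columns]
    return merged


def mot_round(
    student: Weights,
    teachers: Sequence[object],
    fine_tune: Callable[[Weights, object], Weights],
) -> Weights:
    """One round: branch per teacher, fine-tune each branch, merge the results."""
    branches = [fine_tune(student, teacher) for teacher in teachers]
    return merge_students(branches)
```

Calling `mot_round` repeatedly, feeding each merged result back in as the next `student`, gives the alternating loop; the fine-tuning step itself is left abstract on purpose.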
[NLP-4] MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
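[Quick Read]: This paper addresses the failure of subword tokenization methods to preserve morpheme boundaries in low-resource, morphologically complex languages, such as those written in the Geez script. The key solution is MoVoC (Morpheme-aware Subword Vocabulary Construction), which integrates supervised morphological analysis into subword vocabulary construction to obtain a morpheme-aware segmentation strategy. The method combines morpheme-based and Byte Pair Encoding (BPE) tokens, preserving morphological integrity while maintaining lexical meaning, and thereby improves linguistic fidelity and token efficiency.
Link: https://arxiv.org/abs/2509.08812
Authors: Hailay Kidu Teklehaymanot,Dren Fazlija,Wolfgang Nejdl
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This submission is approximately 10 pages in length and includes 1 figure and 6 tables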
Abstract:Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: this https URL
[NLP-5] Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals
[Quick Read]: This paper addresses the difficulty of evaluating the quality of annotations generated by large language models (LLMs) in dynamic, unsupervised environments, where oracle (gold-standard) feedback is scarce and conventional evaluation methods fail. The key to the solution is a novel agentic annotation paradigm in which a student model collaborates with a noisy teacher (the LLM) and evaluates the consistency of the LLM outputs through a user preference-based majority voting strategy, providing a feedback mechanism that does not rely on oracle labels. The paper also introduces a new unsupervised evaluation metric, the Consistent and Inconsistent (CAI) Ratio, which quantifies LLM annotation quality and also plays a critical role in model selection by identifying LLMs that are more robust in dynamic, unsupervised settings.
Link: https://arxiv.org/abs/2509.08809
Authors: Cheng Chen,Haiyan Yin,Ivor Tsang
Affiliations: Australian Artificial Intelligence Institute (AAII), University of Technology Sydney; Centre for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research; College of Computing and Data Science, Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 10 figures
Abstract:Large Language Models (LLMs), when paired with prompt-based tasks, have significantly reduced data annotation costs and reliance on human annotators. However, evaluating the quality of their annotations remains challenging in dynamic, unsupervised environments where oracle feedback is scarce and conventional methods fail. To address this challenge, we propose a novel agentic annotation paradigm, where a student model collaborates with a noisy teacher (the LLM) to assess and refine annotation quality without relying on oracle feedback. The student model, acting as an unsupervised feedback mechanism, employs a user preference-based majority voting strategy to evaluate the consistency of the LLM outputs. To systematically measure the reliability of LLM-generated annotations, we introduce the Consistent and Inconsistent (CAI) Ratio, a novel unsupervised evaluation metric. The CAI Ratio not only quantifies the annotation quality of the noisy teacher under limited user preferences but also plays a critical role in model selection, enabling the identification of robust LLMs in dynamic, unsupervised environments. Applied to ten open-domain NLP datasets across four LLMs, the CAI Ratio demonstrates a strong positive correlation with LLM accuracy, establishing it as an essential tool for unsupervised evaluation and model selection in real-world settings.
[NLP-6] Scaling Truth: The Confidence Paradox in AI Fact-Checking
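[Quick Read]: This paper addresses how to achieve scalable and reliable automated fact-checking across languages and regions amid the spread of misinformation. The core challenge is that the generalization of existing large language models (LLMs) across contexts remains unclear; they perform worse on non-English languages and on claims from the Global South, which may widen information inequalities. The key to the approach is a dataset of 5,000 claims in 47 languages previously assessed by professional fact-checking organizations, with over 240,000 human annotations as ground truth, used to systematically evaluate nine LLMs that differ in architecture, size, and openness (open- or closed-source) under several prompting strategies. The evaluation reveals a mismatch between model accuracy and confidence that resembles the Dunning-Kruger effect, and the resulting multilingual benchmark supports future research and policy on equitable, trustworthy AI-assisted fact-checking.
Link: https://arxiv.org/abs/2509.08803
Authors: Ihsan A. Qazi,Zohaib Khan,Abdullah Ghani,Agha A. Raza,Zafar A. Qazi,Wassay Sajjad,Ayesha Ali,Asher Javaid,Muhammad Abdullah Sohail,Abdul H. Azeemi
Affiliations: Lahore University of Management Sciences
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 65 pages, 26 figures, 6 tables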
Abstract:The rise of misinformation underscores the need for scalable and reliable fact-checking solutions. Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Our methodology tests model generalizability on claims postdating training cutoffs and four prompting strategies mirroring both citizen and professional fact-checker interactions, with over 240,000 human annotations as ground truth. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence. This risks systemic bias in information verification, as resource-constrained organizations typically use smaller models. Performance gaps are most pronounced for non-English languages and claims originating from the Global South, threatening to widen existing information inequalities. These results establish a multilingual benchmark for future research and provide an evidence base for policy aimed at ensuring equitable access to trustworthy, AI-assisted fact-checking.
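[NLP-7] Do All Autoregressive Transformers Remember Facts the Same Way? A Cross-Architecture Analysis of Recall Mechanisms (EMNLP 2025)
[Quick Read]: This paper addresses the interpretability of how Transformer-based language models store and retrieve factual associations, focusing on whether factual memory is encoded and accessed in the same places across different autoregressive architectures. Prior work, mainly on GPT-style models, found that multilayer perceptron (MLP) modules in early layers are key to factual recall, but it was unclear whether this holds for other architectures. By systematically evaluating factual recall in GPT, LLaMA, Qwen, and DeepSeek models, the paper shows that Qwen-based models depart from the earlier pattern: attention modules in their earliest layers contribute more to factual recall than MLP modules. The key to the approach is cross-model comparison, which exposes fundamental differences in factual recall mechanisms across architectures and provides new evidence for interpretability and targeted model editing.
Link: https://arxiv.org/abs/2509.08778
Authors: Minyeong Choe,Haehyun Cho,Changho Seo,Hyunil Kim
Affiliations: Chosun University; Soongsil University; Kongju National University
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025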
Abstract:Understanding how Transformer-based language models store and retrieve factual associations is critical for improving interpretability and enabling targeted model editing. Prior work, primarily on GPT-style models, has identified MLP modules in early layers as key contributors to factual recall. However, it remains unclear whether these findings generalize across different autoregressive architectures. To address this, we conduct a comprehensive evaluation of factual recall across several models – including GPT, LLaMA, Qwen, and DeepSeek – analyzing where and how factual information is encoded and accessed. Consequently, we find that Qwen-based models behave differently from previous patterns: attention modules in the earliest layers contribute more to factual recall than MLP modules. Our findings suggest that even within the autoregressive Transformer family, architectural variations can lead to fundamentally different mechanisms of factual recall.
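[NLP-8] Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles (ICCV 2025)
[Quick Read]: This paper addresses the biases, overconfidence, and inconsistent performance across image domains that multimodal large language models (MLLMs) exhibit when used to evaluate text-to-image (TTI) generation systems. Existing prompt ensembling methods work in text-only settings but fail to generalize to TTI tasks. The key to the solution is a new multimodal-aware method, Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB), which uses image clustering to adjust prompt weights dynamically so that the judge model assigns prompt importance according to the visual characteristics of each sample. This improves accuracy in pairwise preference judgments and greatly improves calibration, giving a more reliable estimate of the judge's uncertainty.
Link: https://arxiv.org/abs/2509.08777
Authors: Eric Slyman,Mehrab Tanjim,Kushal Kafle,Stefan Lee
Affiliations: Adobe; Oregon State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 17 pages, 8 figures, Accepted at ICCV 2025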
Abstract:Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these “judge” models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge’s true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
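To make the idea of cluster-dependent prompt weights concrete, here is a minimal sketch. It assumes each prompt yields a probability that image A is preferred over image B, that the image pair has already been assigned to a cluster, and that per-cluster prompt weights are given; how MMB learns those weights is not described in the abstract, and the function name `ensemble_preference` is hypothetical.

```python
from typing import Dict, Sequence


def ensemble_preference(
    prompt_probs: Sequence[float],
    cluster_weights: Dict[int, Sequence[float]],
    cluster_id: int,
) -> float:
    """Weighted average of per-prompt preference probabilities.

    prompt_probs[i] is the probability, under prompt i, that image A is preferred.
    cluster_weights[c][i] is the (assumed, pre-computed) weight of prompt i
    for images falling in cluster c.
    """
    weights = cluster_weights[cluster_id]
    if len(weights) != len(prompt_probs):
        raise ValueError("one weight per prompt is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, prompt_probs)) / total
```

With the same two prompt outputs, a cluster that trusts the first prompt more produces a different ensemble probability than a cluster that trusts the second, which is the behavior the abstract attributes to image clustering.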
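[NLP-9] AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
[Quick Read]: This paper addresses the lack of a unified and effective reinforcement learning (RL) training framework for autonomous large language model (LLM) agents in complex, realistic environments, and in particular how to train agents for multi-turn interactive decision-making from scratch without relying on supervised fine-tuning (SFT). The key to the solution is the AgentGym-RL framework together with the ScalingInter-RL training approach. AgentGym-RL uses a modular, decoupled architecture that supports diverse real-world scenarios and mainstream RL algorithms, which improves flexibility and extensibility. ScalingInter-RL balances exploration and exploitation over the course of training: it first restricts the number of interactions to emphasize exploitation, then gradually lengthens the decision horizon to encourage exploration, so that agents develop more diverse behaviors and are less prone to collapse under long horizons. Experiments show that the resulting agents match or surpass commercial models on 27 tasks across diverse environments.
Link: https://arxiv.org/abs/2509.08755
Authors: Zhiheng Xi,Jixuan Huang,Chenyang Liao,Baodai Huang,Honglin Guo,Jiaqi Liu,Rui Zheng,Junjie Ye,Jiazheng Zhang,Wenxiang Chen,Wei He,Yiwen Ding,Guanyu Li,Zehui Chen,Zhengyin Du,Xuesong Yao,Yufei Xu,Jiecao Chen,Tao Gui,Zuxuan Wu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint, 39 pages, 16 figures. Project: this https URL. Framework and Code: this https URL, this https URL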
Abstract:Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch – without relying on supervised fine-tuning (SFT) – across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework – including code and datasets – to empower the research community in developing the next generation of intelligent agents.
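The horizon schedule that ScalingInter-RL describes (few interactions early, more later) might look like the step function below. This is a sketch only: the abstract gives no concrete schedule, so the staircase shape and every number (`start`, `increment`, `every`, `cap`) are hypothetical.

```python
def max_turns(step: int, start: int = 5, increment: int = 5,
              every: int = 100, cap: int = 30) -> int:
    """Maximum number of agent-environment interactions allowed at a training step.

    Early steps use a short horizon (favoring exploitation); the horizon grows
    by `increment` every `every` steps until it reaches `cap` (favoring
    exploration later in training). All default values are illustrative.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return min(cap, start + increment * (step // every))
```

A rollout loop would call `max_turns(step)` before each episode and truncate the episode once that many turns have been taken.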
[NLP-10] Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
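[Quick Read]: This paper addresses the difficulty that conventional sequence-to-sequence models have in flexibly supporting arbitrary input combinations and output sequence lengths in streaming settings. Existing methods typically rely on offline processing or on a learned policy that decides when to advance the input stream or write to the output stream, which limits real-time use and generality. The key to the solution is Delayed Streams Modeling (DSM), which moves time alignment to a pre-processing step and introduces appropriate delays between streams to build the input of a decoder-only language model, enabling streaming inference of arbitrary output sequences for many sequence-to-sequence tasks. Experiments show that DSM achieves state-of-the-art performance and latency on automatic speech recognition (ASR) and text-to-speech (TTS), handles arbitrarily long sequences, and is even competitive with offline baselines.
Link: https://arxiv.org/abs/2509.08753
Authors: Neil Zeghidour,Eugene Kharitonov,Manu Orsini,Václav Volhejn,Gabriel de Marmiesse,Edouard Grave,Patrick Pérez,Laurent Mazaré,Alexandre Défossez
Affiliations: Kyutai
Categories: Computation and Language (cs.CL)
Comments: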
Abstract:We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at this https URL
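The role of the inter-stream delay can be illustrated with a toy alignment step. The sketch assumes two streams that are already time-aligned frame by frame and simply shifts one of them; the padding token `PAD` and the function `align_streams` are hypothetical names, and real DSM inputs are token streams fed to a decoder-only model rather than strings.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical padding token used to fill the shifted positions


def align_streams(
    leading: Sequence[str], delayed: Sequence[str], delay: int
) -> List[Tuple[str, str]]:
    """Pair two time-aligned streams after shifting `delayed` by `delay` steps.

    At each step the model would see the leading stream's current frame and the
    delayed stream's frame from `delay` steps earlier.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    shifted = [PAD] * delay + list(delayed)
    length = max(len(leading), len(shifted))
    lead_padded = list(leading) + [PAD] * (length - len(leading))
    shifted += [PAD] * (length - len(shifted))
    return list(zip(lead_padded, shifted))
```

Passing audio as `leading` and text as `delayed` mirrors the ASR setting in the abstract (text is delayed); swapping the two roles mirrors the TTS setting.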
[NLP-11] X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
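[Quick Read]: This paper addresses the inefficiency of multi-turn red-teaming by compressing the iterative adversarial process into a single structured prompt (multi-turn-to-single-turn, M2S), making safety evaluation of large language models (LLMs) more efficient. The core challenge is discovering and optimizing effective M2S templates automatically, since prior work relied on a handful of manually written templates with limited generalization. The key to the solution is the X-Teaming Evolutionary M2S framework, an automated system that discovers and optimizes templates through language-model-guided evolution. It combines smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and maintains selection pressure with a success threshold of θ = 0.70; over five evolutionary generations it yields two new template families and 44.8% overall success (103/230) on GPT-4.1. The results show that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation.
Link: https://arxiv.org/abs/2509.08729
Authors: Hyunjun Kim,Junwoo Ha,Sangyoon Yu,Haon Park
Affiliations: AIM Intelligence; University of Seoul; Korea Advanced Institute of Science and Technology; Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: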
Abstract:Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to θ = 0.70, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at this https URL.
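The reported success figure follows from counting trials whose judge score clears the threshold. The sketch below assumes scores lie in [0, 1] and that "success" means a score at or above the threshold (whether the comparison is strict is not stated in the abstract); `success_rate` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def success_rate(scores: Sequence[float], theta: float = 0.70) -> Tuple[int, float]:
    """Count trials with judge score >= theta and return (successes, rate).

    Assumption: a non-strict comparison is used; the paper may define it differently.
    """
    if not scores:
        raise ValueError("need at least one score")
    successes = sum(1 for score in scores if score >= theta)
    return successes, successes / len(scores)


# Arithmetic check on the figure quoted in the abstract: 103 successes in 230 trials.
reported = round(103 / 230 * 100, 1)  # 44.8 (percent)
```

Raising `theta` can only lower or preserve the count, which is why the abstract stresses threshold calibration when comparing targets.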
[NLP-12] Generative Data Refinement: Just Ask for Better Data
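[Quick Read]: This paper addresses the data exhaustion facing large-scale model training: for a fixed parameter count, the quality and quantity of training data largely determine model capability, yet publicly indexed data is growing more slowly than training datasets. The core of the solution is a framework called Generative Data Refinement (GDR), which uses pretrained generative models to rewrite datasets containing undesirable content into refined datasets better suited for training. Because GDR generates synthetic data conditioned on each real example, its outputs match the diversity of the real data without additional prompt engineering, while also supporting anonymization and detoxification; the paper reports that it can outperform industry-grade solutions for dataset anonymization.
Link: https://arxiv.org/abs/2509.08653
Authors: Minqi Jiang,João G. M. Araújo,Will Ellsworth,Sian Gooding,Edward Grefenstette
Affiliations: Google DeepMind
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: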
Abstract:For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of their training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web, leading to projected data exhaustion over the next decade. Much more data exists as user-generated content that is not publicly indexed, but incorporating such data comes with considerable risks, such as leaking private information and other undesirable content. We introduce a framework, Generative Data Refinement (GDR), for using pretrained generative models to transform a dataset with undesirable content into a refined dataset that is more suitable for training. Our experiments show that GDR can outperform industry-grade solutions for dataset anonymization, as well as enable direct detoxification of highly unsafe datasets. Moreover, we show that by generating synthetic data that is conditioned on each example in the real dataset, GDR’s refined outputs naturally match the diversity of web scale datasets, and thereby avoid the often challenging task of generating diverse synthetic data via model prompting. The simplicity and effectiveness of GDR make it a powerful tool for scaling up the total stock of training data for frontier models.
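[NLP-13] OTESGN: Optimal Transport Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis
[Quick Read]: This paper addresses the difficulty of modeling complex semantic relationships in aspect-based sentiment analysis (ABSA) when methods depend on linear dot-product features, especially the challenge of accurately capturing key opinion words under noise. The key to the solution is the Optimal Transport Enhanced Syntactic-Semantic Graph Network (OTESGN), whose core innovation is a Syntactic-Semantic Collaborative Attention mechanism. A Syntactic Graph-Aware Attention mines latent syntactic dependencies and models global syntactic topology, while a Semantic Optimal Transport Attention uncovers fine-grained semantic alignments amid textual noise, capturing sentiment signals obscured by irrelevant words. An Adaptive Attention Fusion module integrates these heterogeneous features, and contrastive regularization improves robustness; the model outperforms previous best models on the Twitter and Laptop14 datasets.
Link: https://arxiv.org/abs/2509.08612
Authors: Xinfeng Liao,Xuanqi Chen,Lianxi Wang,Jiahuan Yang,Zhuowei Chen,Ziying Rong
Affiliations: Guangdong University of Foreign Studies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: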
Abstract:Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and determine their sentiment polarity. While dependency trees combined with contextual semantics effectively identify aspect sentiment, existing methods relying on syntax trees and aspect-aware attention struggle to model complex semantic relationships. Their dependence on linear dot-product features fails to capture nonlinear associations, allowing noisy similarity from irrelevant words to obscure key opinion terms. Motivated by Differentiable Optimal Matching, we propose the Optimal Transport Enhanced Syntactic-Semantic Graph Network (OTESGN), which introduces a Syntactic-Semantic Collaborative Attention. It comprises a Syntactic Graph-Aware Attention for mining latent syntactic dependencies and modeling global syntactic topology, as well as a Semantic Optimal Transport Attention designed to uncover fine-grained semantic alignments amidst textual noise, thereby accurately capturing sentiment signals obscured by irrelevant tokens. An Adaptive Attention Fusion module integrates these heterogeneous features, and contrastive regularization further improves robustness. Experiments demonstrate that OTESGN achieves state-of-the-art results, outperforming previous best models by +1.01% F1 on Twitter and +1.30% F1 on Laptop14 benchmarks. Ablative studies and visual analyses corroborate its efficacy in precise localization of opinion words and noise resistance.
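For readers unfamiliar with optimal transport as an alignment mechanism, the following is a generic entropy-regularized Sinkhorn iteration, not the paper's Semantic Optimal Transport Attention, whose exact formulation the abstract does not give. It assumes uniform marginals and a small cost matrix (for example, one minus cosine similarity between aspect and context tokens); `sinkhorn`, `epsilon`, and `iterations` are illustrative choices.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], epsilon: float = 0.1,
             iterations: int = 200) -> List[List[float]]:
    """Entropy-regularized optimal transport plan with uniform marginals.

    cost[i][j] is the cost of matching row item i to column item j.
    Note: very large cost / epsilon ratios can underflow exp() to zero.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass = [1.0 / n] * n  # uniform mass on row items
    col_mass = [1.0 / m] * m  # uniform mass on column items
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

The resulting plan concentrates mass on low-cost pairs while respecting the marginals, which is one way a transport-based score can suppress weak similarities from irrelevant tokens compared with a plain dot-product softmax.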
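[NLP-14] Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications
[Quick Read]: This paper addresses the memorization of training data by large language models (LLMs) in medicine and its potential impact, in particular its prevalence, what is memorized, how much is memorized, and the risks for downstream medical applications. The key to the approach is a systematic evaluation of memorization in three common adaptation scenarios: continued pretraining on medical corpora, fine-tuning on standard medical benchmarks, and fine-tuning on real-world clinical data (including over 13,000 inpatient records). On this basis, memorization is divided into three types: beneficial (such as accurate recall of clinical guidelines), uninformative (such as templated language), and harmful (such as leakage of sensitive patient information). The paper then recommends facilitating beneficial memorization to strengthen domain reasoning and factual accuracy, minimizing uninformative memorization to promote deeper learning, and mitigating harmful memorization to prevent privacy leakage, in support of safe and reliable deployment of medical LLMs.
Link: https://arxiv.org/abs/2509.08604
Authors: Anran Li,Lingfei Qian,Mengmeng Du,Yu Yin,Yan Hu,Zihao Sun,Yihang Fu,Erica Stutz,Xuguang Ai,Qianqian Xie,Rui Zhu,Jimin Huang,Yifan Yang,Siru Liu,Yih-Chung Tham,Lucila Ohno-Machado,Hyunghoon Cho,Zhiyong Lu,Hua Xu,Qingyu Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: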
Abstract:Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data. In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.
[NLP-15] LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge
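[Quick Read]: This paper addresses precise information retrieval and comprehension in biomedical question answering (QA), which is difficult because the terminology is specialized and the knowledge base changes quickly. The core challenge is to build an accurate QA system that generalizes without costly fine-tuning or labeled data. The key to the solution is a zero-shot ensemble of large language models (LLMs) combined with an effective Retrieval-Augmented Generation (RAG) pipeline: outputs from several LLMs from different vendors (e.g., Anthropic and Google) are aggregated to improve answer accuracy and robustness. The paper also reports a trade-off between context length and performance: longer contexts supply more evidence but risk information dilution and loss of model focus, so precise, focused retrieval is the basis for generating reliable answers from the retrieved documents.
Link: https://arxiv.org/abs/2509.08596
Authors: Dima Galat,Diego Molla-Aliod
Affiliations: University of Technology Sydney (UTS); Macquarie University
Categories: Computation and Language (cs.CL)
Notes: CEUR-WS, CLEF2025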
Abstract:Biomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.
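As a rough illustration of the aggregation step described above, the sketch below combines yes/no answers from several models by majority vote. The abstract does not state the exact aggregation rule, so majority voting, the `majority_vote` helper, and the tie-breaking default are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers, tie_break="no"):
    """Aggregate yes/no answers from several models by simple majority.

    `answers` is a list of raw model outputs; anything that does not
    normalize to "yes" or "no" is ignored. `tie_break` is a hypothetical
    default returned when the vote is tied or empty.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(a for a in normalized if a in ("yes", "no"))
    if counts["yes"] == counts["no"]:
        return tie_break
    return "yes" if counts["yes"] > counts["no"] else "no"


# Hypothetical outputs from three zero-shot models for one question.
model_answers = ["Yes", "no", "yes "]
print(majority_vote(model_answers))
```

A weighted vote (for example, by each model's validation accuracy) would be a natural variation, but it would need labeled data that a strictly zero-shot setup avoids.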
[NLP-16] CM-Align: Consistency-based Multilingual Alignment for Large Language Models EMNLP2025
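[Quick Read]: This paper addresses the significant gap in alignment performance between English and other languages in current large language models (LLMs). Existing methods use English responses as references when building cross-lingual preference data, and the noise this introduces limits alignment quality. The key to the solution is CM-Align (Consistency-based Multilingual Alignment), a consistency-based method for constructing high-quality preference data with two parts: consistency-guided English reference selection, which filters for high-quality English responses so that poor references do not mislead alignment, and preference-pair construction based on cross-lingual consistency, which avoids the bias of heuristic or biased construction methods. Experiments across several models and tasks show the method's advantage and support the need for high-quality preference data in multilingual alignment.
Link: https://arxiv.org/abs/2509.08541
Authors: Xue Zhang,Yunlong Liang,Fandong Meng,Songming Zhang,Yufeng Chen,Jinan Xu,Jie Zhou
Affiliations: Beijing Jiaotong University; Pattern Recognition Center, WeChat AI, Tencent Inc
Categories: Computation and Language (cs.CL)
Notes: EMNLP 2025 Findings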
Abstract:Current large language models (LLMs) generally show a significant performance gap in alignment between English and other languages. To bridge this gap, existing research typically leverages the model’s responses in English as a reference to select the best/worst responses in other languages, which are then used for Direct Preference Optimization (DPO) training. However, we argue that there are two limitations in the current methods that result in noisy multilingual preference data and further limited alignment performance: 1) Not all English responses are of high quality, and using a response with low quality may mislead the alignment for other languages. 2) Current methods usually use biased or heuristic approaches to construct multilingual preference pairs. To address these limitations, we design a consistency-based data selection method to construct high-quality multilingual preference data for improving multilingual alignment (CM-Align). Specifically, our method includes two parts: consistency-guided English reference selection and cross-lingual consistency-based multilingual preference data construction. Experimental results on three LLMs and three common tasks demonstrate the effectiveness and superiority of our method, which further indicates the necessity of constructing high-quality preference data.
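For reference, the DPO objective that consumes such preference pairs can be written for a single pair as below. This is a minimal sketch of the standard DPO loss, not the authors' training code; the function name, the scalar log-probability inputs, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the
    # reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)), computed as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# With identical policy and reference the margin is zero and the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

Noisy preference pairs matter here because a mislabeled pair pushes the margin in the wrong direction, which is the failure mode that consistency-based selection aims to reduce.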
[NLP-17] HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants
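[Quick Read]: This paper addresses the risk that widely deployed generative AI systems erode human agency: as people delegate more decisions to AI, they may lose control over their individual and collective futures. The key to the solution is HumanAgencyBench (HAB), a scalable and adaptive benchmark covering six dimensions: Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, and Maintain Social Boundaries. Large language models (LLMs) are used to simulate user queries and evaluate AI responses, which quantifies how well an assistant supports human agency. The study finds low-to-moderate agency support in current LLM assistants, with substantial variation across developers and dimensions, indicating that increasing model capability or instruction-following behavior (e.g., RLHF) is not enough to protect human agency and that more robust safety and alignment targets are needed.
Link: https://arxiv.org/abs/2509.08494
Authors: Benjamin Sturgeon,Daniel Samuelson,Jacob Haimes,Jacy Reese Anthis
Affiliations: Apart Research; AI Safety Cape Town; University of Chicago; Stanford University; Sentience Institute
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Notes: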
Abstract:As humans delegate more tasks and decisions to artificial intelligence (AI), we risk losing control of our individual and collective futures. Relatively simple algorithmic systems already steer human decision-making, such as social media feed algorithms that lead people to unintentionally and absent-mindedly scroll through engagement-optimized content. In this paper, we develop the idea of human agency by integrating philosophical and scientific theories of agency with AI-assisted evaluation methods: using large language models (LLMs) to simulate and validate user queries and to evaluate AI responses. We develop HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions of human agency based on typical AI use cases. HAB measures the tendency of an AI assistant or agent to Ask Clarifying Questions, Avoid Value Manipulation, Correct Misinformation, Defer Important Decisions, Encourage Learning, and Maintain Social Boundaries. We find low-to-moderate agency support in contemporary LLM-based assistants and substantial variation across system developers and dimensions. For example, while Anthropic LLMs most support human agency overall, they are the least supportive LLMs in terms of Avoid Value Manipulation. Agency support does not appear to consistently result from increasing LLM capabilities or instruction-following behavior (e.g., RLHF), and we encourage a shift towards more robust safety and alignment targets.
[NLP-18] Too Helpful, Too Harmless, Too Honest or Just Right? EMNLP’25
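[Quick Read]: This paper addresses the difficulty of jointly aligning large language models (LLMs) on Helpfulness, Harmlessness, and Honesty (HHH); existing methods usually optimize one dimension in isolation, which leads to trade-offs and inconsistent behavior. The key to the solution is TrinityX, a framework that adds a Mixture of Calibrated Experts (MoCaE) to the Transformer architecture: an expert is trained separately for each HHH dimension, and a calibrated, task-adaptive routing mechanism fuses the expert signals into a unified, alignment-aware representation, so the three objectives are optimized together. Experiments show that the method clearly outperforms baselines on several benchmarks while substantially reducing memory usage and inference latency.
Link: https://arxiv.org/abs/2509.08486
Authors: Gautam Siddharth Kashyap,Mark Dras,Usman Naseem
Affiliations: Macquarie University
Categories: Computation and Language (cs.CL)
Notes: EMNLP’25 Main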
Abstract:Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks-Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty)-demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX’s generalization across diverse LLM backbones.
[NLP-19] Simulating Identity, Propagating Bias: Abstraction and Stereotypes in LLM-Generated Text EMNLP
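[Quick Read]: This paper asks whether persona-prompting, a strategy for steering large language models (LLMs) to simulate a particular identity or linguistic style, changes how models describe social groups, and in particular whether it increases linguistic abstraction, an established marker of stereotyping. The key to the solution is the new Self-Stereo dataset (self-reported stereotypes from Reddit) and an analysis grounded in the Linguistic Expectancy Bias framework: three metrics (concreteness, specificity, and negation) quantify the abstraction of text generated by six open-weight LLMs under three prompting conditions. The results show the limits of persona-prompting in modulating abstraction and warn that it may propagate stereotypes even when it appears to speak for a marginalized group.
Link: https://arxiv.org/abs/2509.08484
Authors: Pia Sommerauer,Giulia Rambelli,Tommaso Caselli
Affiliations: Vrije Universiteit; Università di Bologna; University of Groningen
Categories: Computation and Language (cs.CL)
Notes: Accepted to EMNLP Findings 2025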
Abstract:Persona-prompting is a growing strategy to steer LLMs toward simulating particular perspectives or linguistic styles through the lens of a specified identity. While this method is often used to personalize outputs, its impact on how LLMs represent social groups remains underexplored. In this paper, we investigate whether persona-prompting leads to different levels of linguistic abstraction - an established marker of stereotyping - when generating short texts linking socio-demographic categories with stereotypical or non-stereotypical attributes. Drawing on the Linguistic Expectancy Bias framework, we analyze outputs from six open-weight LLMs under three prompting conditions, comparing 11 persona-driven responses to those of a generic AI assistant. To support this analysis, we introduce Self-Stereo, a new dataset of self-reported stereotypes from Reddit. We measure abstraction through three metrics: concreteness, specificity, and negation. Our results highlight the limits of persona-prompting in modulating abstraction in language, confirming criticisms about the ecology of personas as representative of socio-demographic groups and raising concerns about the risk of propagating stereotypes even when seemingly evoking the voice of a marginalized group.
[NLP-20] Acquiescence Bias in Large Language Models EMNLP2025
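[Quick Read]: This paper investigates whether large language models (LLMs) show a human-like acquiescence bias, that is, a tendency to agree with the statements they are presented with. It finds that, unlike humans, LLMs prefer the answer "no" across models, tasks, and languages (English, German, and Polish), regardless of whether "no" expresses agreement or disagreement. The key to the approach is a systematic empirical study that exposes this asymmetric response pattern; the tendency therefore does not mirror human behavior and may instead stem from implicit structure in the training data or from the model architecture.
Link: https://arxiv.org/abs/2509.08480
Authors: Daniel Braun
Affiliations: Marburg University
Categories: Computation and Language (cs.CL)
Notes: Accepted to EMNLP 2025 Findings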
Abstract:Acquiescence bias, i.e. the tendency of humans to agree with statements in surveys, independent of their actual beliefs, is well researched and documented. Since Large Language Models (LLMs) have been shown to be very influenceable by relatively small changes in input and are trained on human-generated data, it is reasonable to assume that they could show a similar tendency. We present a study investigating the presence of acquiescence bias in LLMs across different models, tasks, and languages (English, German, and Polish). Our results indicate that, contrary to humans, LLMs display a bias towards answering no, regardless of whether it indicates agreement or disagreement.
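The distinction the abstract draws, between a bias toward agreeing and a bias toward answering "no", can be made concrete with a small sketch. The record format and the `response_bias` helper are hypothetical and are not taken from the paper; the idea is only that each item is posed so that "no" sometimes signals agreement and sometimes disagreement.

```python
def response_bias(records):
    """Return (no_rate, agree_rate) for a list of (answer, agrees) pairs.

    `answer` is the literal reply ("yes" or "no"); `agrees` records whether
    that reply expresses agreement with the underlying statement, which
    depends on how the question was phrased.
    """
    total = len(records)
    no_rate = sum(1 for answer, _ in records if answer == "no") / total
    agree_rate = sum(1 for _, agrees in records if agrees) / total
    return no_rate, agree_rate


# Hypothetical replies: "no" is frequent, but agreement is at chance level.
replies = [("no", True), ("no", False), ("no", False), ("yes", True)]
print(response_bias(replies))
```

A high `no_rate` paired with an `agree_rate` near 0.5 would point to a literal "no" preference rather than acquiescence.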
[NLP-21] Adversarial Attacks Against Automated Fact-Checking: A Survey EMNLP2025
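[Quick Read]: This paper addresses the vulnerability of automated fact-checking (AFC) systems to adversarial attacks, in which an attacker manipulates or generates false claims, evidence, or claim-evidence pairs to mislead the model and undermine its reliability. The key contribution is a systematic survey that categorizes existing attack methods, analyzes their strategies and impact, and assesses the robustness of current AFC models. The paper also reviews emerging adversary-aware defenses, identifies open research questions, and calls for resilient fact-checking frameworks that maintain high verification accuracy under adversarial manipulation.
Link: https://arxiv.org/abs/2509.08463
Authors: Fanzhen Liu,Alsharif Abuadbba,Kristen Moore,Surya Nepal,Cecile Paris,Jia Wu,Jian Yang,Quan Z. Sheng
Affiliations: Macquarie University; CSIRO’s Data61; UNSW Sydney
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes: Accepted to the Main Conference of EMNLP 2025. Resources are available at this https URL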
Abstract:In an era where misinformation spreads freely, fact-checking (FC) plays a crucial role in verifying claims and promoting reliable information. While automated fact-checking (AFC) has advanced significantly, existing systems remain vulnerable to adversarial attacks that manipulate or generate claims, evidence, or claim-evidence pairs. These attacks can distort the truth, mislead decision-makers, and ultimately undermine the reliability of FC models. Despite growing research interest in adversarial attacks against AFC systems, a comprehensive, holistic overview of key challenges remains lacking. These challenges include understanding attack strategies, assessing the resilience of current models, and identifying ways to enhance robustness. This survey provides the first in-depth review of adversarial attacks targeting FC, categorizing existing attack methodologies and evaluating their impact on AFC systems. Additionally, we examine recent advancements in adversary-aware defenses and highlight open research questions that require further exploration. Our findings underscore the urgent need for resilient FC frameworks capable of withstanding adversarial manipulations in pursuit of preserving high verification accuracy.
[NLP-22] CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
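[Quick Read]: This paper targets two problems in Speech Relation Extraction (SpeechRE): existing benchmark datasets rely heavily on synthetic data and lack sufficient quantity and diversity of real human speech, and existing models are limited by rigid single-order generation templates and weak semantic alignment. The solution has two parts. First, CommonVoice-SpeechRE, a large-scale dataset of nearly 20,000 real speech samples from diverse speakers, provides a new benchmark. Second, the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe) uses a multi-order triplet generation strategy to exploit data diversity during both training and inference, together with CNN-based latent relation prediction heads that produce explicit relation prompts to improve cross-modal alignment and triplet generation accuracy.
Link: https://arxiv.org/abs/2509.08438
Authors: Jinzhong Ning,Paerhati Tulajiang,Yingying Le,Yijia Zhang,Yuanyuan Sun,Hongfei Lin,Haifeng Liu
Affiliations: Dalian Maritime University; Dalian University of Technology; Xinjiang Normal University; Nanjing Normal University
Categories: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Notes: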
Abstract:Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech. Moreover, existing models also suffer from rigid single-order generation templates and weak semantic alignment, substantially limiting their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at this https URL.
[NLP-23] Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model
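[Quick Read]: This paper addresses efficient and reliable structured data extraction in resource-constrained settings. In scenarios such as financial compliance reporting, legal document analytics, and multilingual knowledge base construction, large language models (LLMs) are hard to deploy because of high compute cost and scarce high-quality labeled data. The authors propose ETLCH, a small billion-parameter LLaMA-based model. The key to the solution is Low-Rank Adaptation (LoRA), which enables effective fine-tuning with only a few hundred to one thousand samples per task; ETLCH outperforms strong baselines on most evaluation metrics across JSON extraction, knowledge graph extraction, and named entity recognition, showing that small models are viable in low-resource, multi-task settings.
Link: https://arxiv.org/abs/2509.08381
Authors: Yu Cheng Chih,Yong Hao Hou
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 13 pages, 8 figures, includes experiments on JSON extraction, knowledge graph extraction, and NER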
Abstract:Deploying large language models (LLMs) for structured data extraction in domains such as financial compliance reporting, legal document analytics, and multilingual knowledge base construction is often impractical for smaller teams due to the high cost of running large architectures and the difficulty of preparing large, high-quality datasets. Most recent instruction-tuning studies focus on seven-billion-parameter or larger models, leaving limited evidence on whether much smaller models can work reliably under low-resource, multi-task conditions. This work presents ETLCH, a billion-parameter LLaMA-based model fine-tuned with low-rank adaptation on only a few hundred to one thousand samples per task for JSON extraction, knowledge graph extraction, and named entity recognition. Despite its small scale, ETLCH outperforms strong baselines across most evaluation metrics, with substantial gains observed even at the lowest data scale. These findings demonstrate that well-tuned small models can deliver stable and accurate structured outputs at a fraction of the computational cost, enabling cost-effective and reliable information extraction pipelines in resource-constrained environments.
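To make the low-rank adaptation idea concrete, the sketch below applies a frozen weight matrix plus a scaled low-rank update to one input vector. It is a dependency-free illustration of the general LoRA formulation, not the ETLCH training setup; the helper names, the tiny matrices, and the `alpha` value are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha):
    """Compute h = W0 x + (alpha / r) * B (A x).

    w0 is the frozen d x k weight, a is the r x k down-projection and
    b is the d x r up-projection; only a and b would be trained.
    """
    rank = len(a)
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


def trainable_fraction(d, k, rank):
    """Share of parameters trained by LoRA relative to the full d x k matrix."""
    return rank * (d + k) / (d * k)


w0 = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]          # rank 1
b_zero = [[0.0], [0.0]]   # zero-initialized B leaves the base model unchanged
print(lora_forward(w0, a, b_zero, [2.0, 3.0], alpha=8))
print(trainable_fraction(4096, 4096, 8))
```

The second number shows why the approach suits small teams: for a 4096 x 4096 layer at rank 8, the trained parameters are well under one percent of the full matrix.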
[NLP-24] <think> So let's replace this phrase with insult… </think> Lessons learned from generation of toxic texts with LLMs
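[Quick Read]: This paper asks whether synthetic toxic text generated by LLMs can replace human-written toxic data for training text detoxification models. Synthetic toxic counterparts are generated with Llama 3 and Qwen models and compared experimentally against human data: detoxification models trained on synthetic data consistently perform worse than those trained on human data, with a drop of up to 30% in joint metrics. The root cause is a lexical diversity gap: the generating models reuse a small, repetitive vocabulary of insults and fail to capture the nuance and variety of human toxicity. The findings expose the limits of current LLMs in this sensitive domain and underline the continued value of diverse, human-annotated data for building robust detoxification systems.
Link: https://arxiv.org/abs/2509.08358
Authors: Sergey Pletenev,Daniil Moskovskiy,Alexander Panchenko
Affiliations: AIRI; Skoltech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: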
Abstract:Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has not received proper attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training models for detoxification. Using Llama 3 and Qwen activation-patched models, we generated synthetic toxic counterparts for neutral texts from ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a drop in performance of up to 30% in joint metrics. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.
[NLP-25] Automatic Detection of Inauthentic Templated Responses in English Language Assessments
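[Quick Read]: This paper addresses a problem in high-stakes English language assessments: low-skill test takers may use memorized "templates" to write essay responses that fool the automated scoring system. The key to the solution is the AuDITR (Automated Detection of Inauthentic, Templated Responses) task and a machine learning-based approach to detecting such inauthentic responses automatically; the paper also stresses the importance of regularly updating the models in production to maintain detection quality.
Link: https://arxiv.org/abs/2509.08355
Authors: Yashad Samant,Lee Becker,Scott Hellman,Bradley Behan,Sarah Hughes,Joshua Southerland
Affiliations: Pearson Education; Pearson Education, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to National Council on Measurement in Education (NCME) 2025 Annual Meeting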
Abstract:In high-stakes English Language Assessments, low-skill test takers may employ memorized materials called "templates" on essay questions to "game" or fool the automated scoring system. In this study, we introduce the automated detection of inauthentic, templated responses (AuDITR) task, describe a machine learning-based approach to this task and illustrate the importance of regularly updating these models in production.
[NLP-26] Toward Subtrait-Level Model Explainability in Automated Writing Evaluation
[Quick Read]: This paper addresses the lack of transparency in automated writing scoring by introducing subtrait assessment to make scores more explainable. The key to the solution is prototyping subtrait scoring with generative language models and relating human and automated scores at the subtrait level. The reported correlations are modest, but the approach gives educators and students finer-grained detail that helps explain the scoring logic and build trust in the results.
Link: https://arxiv.org/abs/2509.08345
Authors: Alejandro Andrade-Lotero,Lee Becker,Joshua Southerland,Scott Hellman
Affiliations: Pearson Education; Pearson Education, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to National Council on Measurement in Education (NCME) 2025 Annual Meeting
Abstract:Subtrait (latent-trait components) assessment presents a promising path toward enhancing transparency of automated writing scores. We prototype explainability and subtrait scoring with generative language models and show modest correlation between human subtrait and trait scores, and between automated and human subtrait scores. Our approach provides details to demystify scores for educators and students.
[NLP-27] EvolKV: Evolutionary KV Cache Compression for LLM Inference
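[Quick Read]: This paper addresses the performance degradation caused by key-value (KV) cache compression methods that rely on heuristics such as uniform allocation across layers or static eviction policies, which ignore the interplay between layer-specific feature patterns and task performance. The core of the solution is EvolKV, a framework that formulates cache allocation as a multi-objective optimization problem and uses evolutionary search to configure per-layer cache budgets dynamically, jointly optimizing memory efficiency and downstream task performance and improving generalization on long-context tasks.
Link: https://arxiv.org/abs/2509.08315
Authors: Bohan Yu,Yekun Chai
Affiliations: University of Chinese Academy of Sciences; Institute of Automation, CAS; ETH Zurich
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Notes: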
Abstract:Existing key-value (KV) cache compression methods typically rely on heuristics, such as uniform cache allocation across layers or static eviction policies; however, they ignore the critical interplays among layer-specific feature patterns and task performance, which can lead to degraded generalization. In this paper, we propose EvolKV, an adaptive framework for layer-wise, task-driven KV cache compression that jointly optimizes the memory efficiency and task performance. By reformulating cache allocation as a multi-objective optimization problem, EvolKV leverages evolutionary search to dynamically configure layer budgets while directly maximizing downstream performance. Extensive experiments on 11 tasks demonstrate that our approach outperforms all baseline methods across a wide range of KV cache budgets on long-context tasks and surpasses heuristic baselines by up to 7 percentage points on GSM8K. Notably, EvolKV achieves superior performance over the full KV cache setting on code completion while utilizing only 1.5% of the original budget, suggesting the untapped potential in learned compression strategies for KV cache budget allocation.
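The idea of searching over per-layer budgets instead of allocating them uniformly can be sketched with a toy evolutionary loop. This is a simplified single-objective illustration under a fixed total budget, not the EvolKV algorithm: `toy_score` is a hypothetical stand-in for downstream task performance, and the mutation scheme, population size, and layer importances are all assumptions.

```python
import random


def toy_score(budgets, importance):
    """Hypothetical proxy for task performance with diminishing returns per layer."""
    return sum(w * (b ** 0.5) for w, b in zip(importance, budgets))


def mutate(budgets, rng, step=8, min_budget=8):
    """Move up to `step` cache slots from one random layer to another."""
    child = list(budgets)
    src, dst = rng.sample(range(len(child)), 2)
    moved = min(step, child[src] - min_budget)
    child[src] -= moved
    child[dst] += moved
    return child


def evolve(num_layers, total_budget, importance,
           generations=200, pop_size=8, seed=0):
    """Keep the best child per generation if it beats the current parent."""
    rng = random.Random(seed)
    parent = [total_budget // num_layers] * num_layers  # uniform start
    best = toy_score(parent, importance)
    for _ in range(generations):
        children = [mutate(parent, rng) for _ in range(pop_size)]
        candidate = max(children, key=lambda c: toy_score(c, importance))
        score = toy_score(candidate, importance)
        if score > best:
            parent, best = candidate, score
    return parent, best


layer_importance = [1.0, 3.0, 0.5, 2.0]  # hypothetical per-layer sensitivity
budgets, score = evolve(4, 256, layer_importance)
print(budgets, round(score, 2))
```

In a real setting the score would come from evaluating the compressed model on a task, which is far more expensive than this proxy and is the reason a sample-efficient search strategy matters.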
[NLP-28] Towards Knowledge-Aware Document Systems: Modeling Semantic Coverage Relations via Answerability Detection
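[Quick Read]: This paper addresses the modeling of Semantic Coverage Relations (SCR) for information shared across documents, that is, identifying how the content of a document pair aligns regardless of differences in textual form. The core challenge is to distinguish three relations: equivalence, inclusion, and semantic overlap. The key to the solution is a question answering (QA)-based approach that uses the answerability of shared questions across documents as an indicator of semantic coverage, together with a synthetic dataset derived from the SQuAD corpus by paraphrasing passages and selectively omitting information, which gives precise control over content overlap and supports benchmarking generative language models and discriminative classifiers on SCR prediction.
Link: https://arxiv.org/abs/2509.08304
Authors: Yehudit Aperstein,Alon Gottlib,Gal Benita,Alexander Apartsin
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Notes: 27 pages, 1 figure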
Abstract:Understanding how information is shared across documents, regardless of the format in which it is expressed, is critical for tasks such as information retrieval, summarization, and content alignment. In this work, we introduce a novel framework for modelling Semantic Coverage Relations (SCR), which classifies document pairs based on how their informational content aligns. We define three core relation types: equivalence, where both texts convey the same information using different textual forms or styles; inclusion, where one document fully contains the information of another and adds more; and semantic overlap, where each document presents partially overlapping content. To capture these relations, we adopt a question answering (QA)-based approach, using the answerability of shared questions across documents as an indicator of semantic coverage. We construct a synthetic dataset derived from the SQuAD corpus by paraphrasing source passages and selectively omitting information, enabling precise control over content overlap. This dataset allows us to benchmark generative language models and train transformer-based classifiers for SCR prediction. Our findings demonstrate that discriminative models significantly outperform generative approaches, with the RoBERTa-base model achieving the highest accuracy of 61.4% and the Random Forest-based model showing the best balance with a macro-F1 score of 52.9%. The results show that QA provides an effective lens for assessing semantic relations across stylistically diverse texts, offering insights into the capacity of current models to reason about information beyond surface similarity. The dataset and code developed in this study are publicly available to support reproducibility.
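The three relation types can be read directly off answerability sets. The sketch below assumes that, for a shared pool of questions, we already know which ones each document can answer; the `coverage_relation` helper and the extra "unrelated" fallback are illustrative assumptions and not part of the paper's label set.

```python
def coverage_relation(answerable_a, answerable_b):
    """Classify a document pair from the questions each document can answer."""
    a, b = set(answerable_a), set(answerable_b)
    if a == b:
        return "equivalence"        # same information, possibly different wording
    if a < b or b < a:
        return "inclusion"          # one document covers the other and adds more
    if a & b:
        return "semantic overlap"   # partial overlap in both directions
    return "unrelated"              # hypothetical fallback, not a paper category


# Hypothetical question IDs answerable from each document.
print(coverage_relation({"q1", "q2"}, {"q1", "q2", "q3"}))
```

In the paper the mapping is learned by classifiers rather than computed by set logic, since answerability judgments themselves are noisy.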
[NLP-29] Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions
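[Quick Read]: This paper addresses the tension between annotator reliability and label diversity when building machine learning datasets: how to filter low-quality annotations without erasing genuine differences of opinion. The key finding is that existing annotator-filtering heuristics (for example, thresholds on agreement or variance), designed for settings where variation from a single ground-truth label is treated as noise, often remove annotators who genuinely disagree rather than spam annotators, which increases the error of the mean label. Conservative removal (5% of annotators) works best, and beyond that point every tested method increases the mean absolute error from the true average label. The study also finds that most spam annotators are distributionally indistinguishable from real annotators, and that the distinguishable minority tends to give fixed rather than random answers, contrary to the usual assumption, so spam removal methods need to account for label diversity.
Link: https://arxiv.org/abs/2509.08217
Authors: Eve Fleisig,Matthias Orlikowski,Philipp Cimiano,Dan Klein
Affiliations: UC Berkeley; Bielefeld University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: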
Abstract:For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are less random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
[NLP-30] XML Prompting as Grammar-Constrained Interaction: Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols
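[Quick Read]: This paper addresses the lack of controllability and verifiability when large language models (LLMs) must produce structured, schema-adherent output in real-world systems; the core challenge is guaranteeing a correct output format while preserving task performance. The key to the solution is a logic-first framework for XML prompting built on three mechanisms: (i) grammar-constrained decoding, which guarantees well-formed XML output; (ii) fixed-point semantics over lattices of hierarchical prompts, where the Knaster-Tarski theorem shows that monotone prompt-to-prompt operators have least fixed points that characterize steady-state interaction protocols; and (iii) a task-aware contraction metric under which iterative guidance converges in the Banach sense, which ensures that human-AI interaction loops converge and remain stable. The framework is instantiated with context-free grammars (CFGs) and paired with multi-layer human-AI interaction recipes, such as "plan → verify → revise" routines and agentic tool use, as practical deployment patterns.
Link: https://arxiv.org/abs/2509.08182
Authors: Faruk Alpay,Taylan Alpay
Affiliations: Lightcap; Turkish Aeronautical Association
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 7 pages, multiple XML prompts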
Abstract:Structured prompting with XML tags has emerged as an effective way to steer large language models (LLMs) toward parseable, schema-adherent outputs in real-world systems. We develop a logic-first treatment of XML prompting that unifies (i) grammar-constrained decoding, (ii) fixed-point semantics over lattices of hierarchical prompts, and (iii) convergent human-AI interaction loops. We formalize a complete lattice of XML trees under a refinement order and prove that monotone prompt-to-prompt operators admit least fixed points (Knaster-Tarski) that characterize steady-state protocols; under a task-aware contraction metric on trees, we further prove Banach-style convergence of iterative guidance. We instantiate these results with context-free grammars (CFGs) for XML schemas and show how constrained decoding guarantees well-formedness while preserving task performance. A set of multi-layer human-AI interaction recipes demonstrates practical deployment patterns, including multi-pass “plan → verify → revise” routines and agentic tool use. We provide mathematically complete proofs and tie our framework to recent advances in grammar-aligned decoding, chain-of-verification, and programmatic prompting.
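The least-fixed-point claim can be illustrated on a much simpler lattice than XML trees. The sketch below iterates a monotone operator on sets of tag names, ordered by inclusion, starting from the empty set until nothing changes. The `REQUIRES` rule table and the `refine` operator are hypothetical; the paper works with a lattice of XML trees under a refinement order, which this toy example only approximates.

```python
def least_fixed_point(step, bottom=frozenset()):
    """Iterate a monotone operator from the bottom element until it stabilizes.

    On a finite lattice this ascending chain reaches the least fixed point
    whose existence Knaster-Tarski guarantees.
    """
    current = bottom
    while True:
        following = frozenset(step(current))
        if following == current:
            return current
        current = following


# Hypothetical refinement rules: a tag that is present forces further tags.
REQUIRES = {"plan": {"verify"}, "verify": {"revise"}, "revise": set()}


def refine(tags):
    """Monotone operator: always include 'plan', then add every required tag."""
    result = set(tags) | {"plan"}
    for tag in tags:
        result |= REQUIRES.get(tag, set())
    return result


print(sorted(least_fixed_point(refine)))  # ['plan', 'revise', 'verify']
```

The stable set plays the role of a steady-state protocol: applying the refinement operator again adds no new structure.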
[NLP-31] Verbalized Algorithms NEURIPS2025
[Quick Read]: This paper addresses the unreliability of large language models (LLMs) on complex reasoning tasks: a single one-shot query does not guarantee an accurate or repeatable answer. The key to the solution is verbalized algorithms (VAs), which decompose a task into simple elementary operations on natural language strings, organize those operations with a classical algorithm (for example, a bitonic sorting network), and restrict the LLM to the simple subtasks it can perform reliably (for example, acting as a binary comparison oracle). Confining the LLM to a well-understood algorithmic framework makes task execution more deterministic and robust.
Link: https://arxiv.org/abs/2509.08150
Authors: Supriya Lall,Christian Farrell,Hari Pathanjaly,Marko Pavic,Sarvesh Chezhian,Masataro Asai
Affiliations: MIT CSAIL; MIT-IBM Watson AI Lab; Marist University; IBM Infrastructure; UC Irvine; IBM Research Cambridge, USA
Categories: Computation and Language (cs.CL)
Notes: Submitted to NeurIPS 2025 Workshop on Efficient Reasoning
Abstract:Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call verbalized algorithms (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that they should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, verbalized sorting uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
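Verbalized sorting can be sketched as a bitonic sorting network whose only data-dependent step is a pairwise comparison. In the sketch below the comparison oracle is a plain Python function; in the setting the abstract describes it would be an LLM call. The `mock_llm_less_than` stand-in and its size table are hypothetical, and the network assumes the input length is a power of two.

```python
def bitonic_sort(items, less_than):
    """Sort with a bitonic network; `less_than(a, b)` is the comparison oracle."""
    n = len(items)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(items)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if ascending:
                        out_of_order = less_than(a[partner], a[i])
                    else:
                        out_of_order = less_than(a[i], a[partner])
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


# Hypothetical stand-in for "Is X smaller than Y?" answered by an LLM.
SIZE = {"ant": 1, "cat": 2, "horse": 3, "whale": 4}


def mock_llm_less_than(x, y):
    return SIZE[x] < SIZE[y]


print(bitonic_sort(["whale", "ant", "horse", "cat"], mock_llm_less_than))
# ['ant', 'cat', 'horse', 'whale']
```

A sorting network fixes the sequence of comparisons in advance, so the comparisons within one stage are independent and could be sent to the oracle as a batch.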
[NLP-32] Bias after Prompting: Persistent Discrimination in Large Language Models
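[Quick Read]: This paper examines whether biases in pre-trained large language models (LLMs) transfer to downstream tasks under prompt adaptation, challenging the assumption drawn from prior work that biases do not transfer from pre-trained to adapted models. The key finding is that, even under widely used prompting strategies, intrinsic biases and biases after prompt adaptation remain moderately to strongly correlated (for example, rho = 0.94 for gender in co-reference resolution and rho = 0.98 for age in question answering), and popular prompt-based debiasing methods do not consistently prevent bias transfer. Biases also remain strongly correlated (rho = 0.90) when few-shot composition parameters such as sample size, stereotypical content, occupational distribution, and representational balance are varied, which indicates that the bias is robust. The paper therefore suggests correcting bias in the intrinsic model rather than relying on prompt-level intervention to keep bias from propagating to downstream tasks.
Link: https://arxiv.org/abs/2509.08146
Authors: Nivedha Sivakumar,Natalie Mackraz,Samira Khorshidi,Krishna Patel,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff
Affiliations: Apple
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: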
Abstract:A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks – for example, gender (rho = 0.94) in co-reference resolution, and age (rho = 0.98) and religion (rho = 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho = 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
[NLP-33] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion
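[Quick read]: This paper addresses the weak complex-reasoning ability of large language models in low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger improve mid- and high-resource languages markedly but still leave a large gap on LRLs. The proposed solution, MERLIN, is a two-stage model-stacking framework that follows a curriculum learning strategy (from general bilingual bitext to task-specific data) and adapts efficiently by fine-tuning only a small set of DoRA (Weight-Decomposed Low-Rank Adaptation) weights. On AfriMGSM it improves accuracy by 12.9 percentage points over MindMerger, and it shows consistent gains across low- and high-resource settings.
Link: https://arxiv.org/abs/2509.08105
Authors: Kosei Uemura,David Guzmán,Quang Phuoc Nguyen,Jesujoba Oluwadara Alabi,En-shiun Annie Lee,David Ifeoluwa Adelani
Affiliations: University of Toronto; Mila - Quebec AI Institute, McGill University; Ontario Tech University; Saarland University; Canada CIFAR AI Chair
Subjects: Computation and Language (cs.CL)
Comments: under submission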
Abstract:Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy – from general bilingual bitext to task-specific data – and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.
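[NLP-34] Culturally transmitted color categories in LLMs reflect a learning bias toward efficient compression
[Quick read]: This paper asks whether large language models (LLMs) can evolve human-like semantic systems, and in particular whether such systems balance compression and accuracy as the Information Bottleneck (IB) principle prescribes. Using color naming, a core domain for cognitive theories of categorization, the authors run a two-part study with Gemini 2.0-flash and Llama 3.3-70B-Instruct. First, they test whether the models' color naming matches the naming patterns of English speakers. Second, they simulate cultural evolution through iterated in-context learning and find that the LLMs gradually restructure initially random systems into ones that are more IB-efficient and more aligned with patterns seen across the world's languages. This suggests an intrinsic, human-like inductive bias that lets the models evolve efficient, perceptually grounded semantic systems.
Link: https://arxiv.org/abs/2509.08093
Authors: Nathaniel Imel,Noga Zaslavsky
Affiliations: New York University
Subjects: Computation and Language (cs.CL)
Comments: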
Abstract:Converging evidence suggests that systems of semantic categories across human languages achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy principle. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-like semantic systems? To address this question, we focus on the domain of color as a key testbed of cognitive theories of categorization and replicate with LLMs (Gemini 2.0-flash and Llama 3.3-70B-Instruct) two influential human behavioral studies. First, we conduct an English color-naming study, showing that Gemini aligns well with the naming patterns of native English speakers and achieves a significantly high IB-efficiency score, while Llama exhibits an efficient but lower complexity system compared to English. Second, to test whether LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via iterated in-context language learning. We find that akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency and increased alignment with patterns observed across the world’s languages. These findings demonstrate that LLMs are capable of evolving perceptually grounded, human-like semantic systems, driven by the same fundamental principle that governs semantic efficiency across human languages.
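For reference, the IB complexity-accuracy trade-off is commonly written as the objective below. The notation is an assumption here, since the abstract gives no formula: a naming system q(w|m) maps a speaker's intended meaning M to a word W, and U is the world state (here, a color) that the listener must recover.

```latex
\min_{q(w \mid m)} \; F_\beta[q] \;=\; I_q(M; W) \;-\; \beta \, I_q(W; U)
```

In this formulation I_q(M;W) measures the complexity of the lexicon and I_q(W;U) its accuracy, with β setting the trade-off between them. A naming system counts as IB-efficient when it lies close to the optimal frontier traced out by varying β.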
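[NLP-35] No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models
[Quick read]: This paper addresses false refusal in personalized large language models (LLMs), where a model inappropriately refuses a user request under a persona prompt built from sociodemographic attributes such as gender, race, religion, and disability. Prior work has reported the phenomenon but not quantified it systematically. The authors propose a Monte Carlo sampling method that measures the effect of each persona on refusal behavior in a sample-efficient way. By varying the model, the task (natural language inference, politeness classification, and offensiveness classification), and the prompt paraphrase, they isolate the effect of the persona itself. They find that persona effects shrink as models become more capable, and that model choice and task sensitivity are the main drivers of false refusal, which suggests that earlier estimates of persona effects were inflated by other variables.
Link: https://arxiv.org/abs/2509.08075
Authors: Flor Miriam Plaza-del-Arco,Paul Röttger,Nino Scherrer,Emanuele Borgonovo,Elmar Plischke,Dirk Hovy
Affiliations: LIACS, Leiden University; Bocconi University; Independent Researcher; Bocconi University; Helmholtz-Zentrum Dresden-Rossendorf; Bocconi University
Subjects: Computation and Language (cs.CL)
Comments: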
Abstract:Large language models (LLMs) are increasingly integrated into our daily lives and personalized. However, LLM personalization might also increase unintended side effects. Recent work suggests that persona prompting can lead models to falsely refuse user requests. However, no work has fully quantified the extent of this issue. To address this gap, we measure the impact of 15 sociodemographic personas (based on gender, race, religion, and disability) on false refusal. To control for other factors, we also test 16 different models, 3 tasks (Natural Language Inference, politeness, and offensiveness classification), and nine prompt paraphrases. We propose a Monte Carlo-based method to quantify this issue in a sample-efficient manner. Our results show that as models become more capable, personas impact the refusal rate less and less. Certain sociodemographic personas increase false refusal in some models, which suggests underlying biases in the alignment strategies or safety mechanisms. However, we find that the model choice and task significantly influence false refusals, especially in sensitive content tasks. Our findings suggest that persona effects have been overestimated, and might be due to other factors.
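[NLP-36] SciGPT: A Large Language Model for Scientific Literature Understanding and Knowledge Discovery
[Quick read]: This paper addresses the difficulty general-purpose large language models (LLMs) have with scientific literature: lacking domain specificity, they mishandle technical jargon, methodological rigor, and interdisciplinary knowledge integration, which limits their use in complex research tasks. The proposed solution is SciGPT, a domain-adapted foundation model built on the Qwen3 architecture, with three core techniques: (1) low-cost domain distillation through a two-stage pipeline that balances performance and efficiency; (2) a Sparse Mixture-of-Experts (SMoE) attention mechanism that cuts memory consumption by 55% in long-document reasoning; and (3) knowledge-aware adaptation that integrates domain ontologies to bridge interdisciplinary knowledge gaps. On the ScienceBench benchmark, SciGPT outperforms GPT-4o and remains robust on unseen scientific tasks, supporting its potential for AI-augmented scientific discovery.
Link: https://arxiv.org/abs/2509.08032
Authors: Fengyu She,Nan Wang,Hongfei Wu,Ziyi Wan,Jingmian Wang,Chang Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: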
Abstract:Scientific literature is growing exponentially, creating a critical bottleneck for researchers to efficiently synthesize knowledge. While general-purpose Large Language Models (LLMs) show potential in text processing, they often fail to capture scientific domain-specific nuances (e.g., technical jargon, methodological rigor) and struggle with complex scientific tasks, limiting their utility for interdisciplinary research. To address these gaps, this paper presents SciGPT, a domain-adapted foundation model for scientific literature understanding and ScienceBench, an open source benchmark tailored to evaluate scientific LLMs. Built on the Qwen3 architecture, SciGPT incorporates three key innovations: (1) low-cost domain distillation via a two-stage pipeline to balance performance and efficiency; (2) a Sparse Mixture-of-Experts (SMoE) attention mechanism that cuts memory consumption by 55% for 32,000-token long-document reasoning; and (3) knowledge-aware adaptation integrating domain ontologies to bridge interdisciplinary knowledge gaps. Experimental results on ScienceBench show that SciGPT outperforms GPT-4o in core scientific tasks including sequence labeling, generation, and inference. It also exhibits strong robustness in unseen scientific tasks, validating its potential to facilitate AI-augmented scientific discovery.
[NLP-37] NOWJ@COLIEE 2025: A Multi-stage Framework Integrating Embedding Models and Large Language Models for Legal Retrieval and Entailment
[Quick read]: This paper addresses several tasks in legal information processing: legal case entailment, legal case retrieval, statute law retrieval, legal textual entailment, and legal judgment prediction. The core solution is a hybrid architecture that combines traditional information retrieval (IR) with generative AI. Pre-ranking models (BM25, BERT, monoT5) and embedding representations (BGE-m3, LLM2Vec) perform an initial filter, and large language models (LLMs) such as Qwen-2, QwQ-32B, and DeepSeek-V3 then handle semantic understanding, relevance scoring, and contextual re-ranking. For legal case entailment in particular, a two-stage retrieval system applies lexical-semantic filtering followed by contextualized LLM analysis, and took first place with an F1 score of 0.3195. The result supports hybrid models as an effective way to handle complex semantic relations in legal text.
Link: https://arxiv.org/abs/2509.08025
Authors: Hoang-Trung Nguyen,Tan-Minh Nguyen,Xuan-Bach Le,Tuan-Kiet Le,Khanh-Huyen Nguyen,Ha-Thanh Nguyen,Thi-Hai-Yen Vuong,Le-Minh Nguyen
Affiliations: VNU University of Engineering and Technology; Japan Advanced Institute of Science and Technology; Center for Juris-Informatics, ROIS-DS; Research and Development Center for Large Language Models, NII
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper presents the methodologies and results of the NOWJ team’s participation across all five tasks at the COLIEE 2025 competition, emphasizing advancements in the Legal Case Entailment task (Task 2). Our comprehensive approach systematically integrates pre-ranking models (BM25, BERT, monoT5), embedding-based semantic representations (BGE-m3, LLM2Vec), and advanced Large Language Models (Qwen-2, QwQ-32B, DeepSeek-V3) for summarization, relevance scoring, and contextual re-ranking. Specifically, in Task 2, our two-stage retrieval system combined lexical-semantic filtering with contextualized LLM analysis, achieving first place with an F1 score of 0.3195. Additionally, in other tasks–including Legal Case Retrieval, Statute Law Retrieval, Legal Textual Entailment, and Legal Judgment Prediction–we demonstrated robust performance through carefully engineered ensembles and effective prompt-based reasoning strategies. Our findings highlight the potential of hybrid models integrating traditional IR techniques with contemporary generative models, providing a valuable reference for future advancements in legal information processing.
[NLP-38] MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values
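[Quick read]: This paper addresses the lack of evaluation of value alignment in large language models (LLMs) across cultures and populations: existing benchmarks largely ignore cultural and demographic diversity, so it is hard to tell how well value alignment generalizes globally. The proposed solution is MVPBench, a new benchmark covering 75 countries with 24,020 high-quality annotated instances, each carrying fine-grained value labels, personalized questions, and rich demographic metadata, which enables systematic evaluation of LLMs against multi-dimensional human value preferences. Using the benchmark, the authors further show that lightweight fine-tuning methods such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO) significantly improve value alignment in both in-domain and out-of-domain settings, offering empirical evidence and a practical path toward culturally adaptive, value-sensitive LLMs.
Link: https://arxiv.org/abs/2509.08022
Authors: Yao Liang,Dongcheng Zhao,Feifei Zhao,Guobin Shen,Yuwei Wang,Dongqi Liang,Yi Zeng
Affiliations: 1. Tsinghua University; 2. Institute of Computing Technology, Chinese Academy of Sciences; 3. Peking University; 4. Beijing Academy of Artificial Intelligence; 5. National Engineering Research Center for Big Data Technology and Application
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: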
Abstract:The alignment of large language models (LLMs) with human values is critical for their safe and effective deployment across diverse user populations. However, existing benchmarks often neglect cultural and demographic diversity, leading to limited understanding of how value alignment generalizes globally. In this work, we introduce MVPBench, a novel benchmark that systematically evaluates LLMs’ alignment with multi-dimensional human value preferences across 75 countries. MVPBench contains 24,020 high-quality instances annotated with fine-grained value labels, personalized questions, and rich demographic metadata, making it the most comprehensive resource of its kind to date. Using MVPBench, we conduct an in-depth analysis of several state-of-the-art LLMs, revealing substantial disparities in alignment performance across geographic and demographic lines. We further demonstrate that lightweight fine-tuning methods, such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO), can significantly enhance value alignment in both in-domain and out-of-domain settings. Our findings underscore the necessity for population-aware alignment evaluation and provide actionable insights for building culturally adaptive and value-sensitive LLMs. MVPBench serves as a practical foundation for future research on global alignment, personalized value modeling, and equitable AI development.
[NLP-39] Measuring and mitigating overreliance is necessary for building human-compatible AI
[Quick read]: This paper addresses the risks of user overreliance on large language models (LLMs) in practice, including high-stakes errors and cognitive deskilling at the individual level and governance challenges at the societal level. The proposed approach has three parts: first, identify the drivers of overreliance, covering LLM characteristics, system design features, and user cognitive biases; second, examine the gaps in historical measurement approaches and propose directions for measuring overreliance more accurately; and third, develop mitigation strategies so that LLMs act as collaborative "thought partners" that augment rather than undermine human decision-making.
Link: https://arxiv.org/abs/2509.08010
Authors: Lujain Ibrahim,Katherine M. Collins,Sunnie S. Y. Kim,Anka Reuel,Max Lamparth,Kevin Feng,Lama Ahmad,Prajna Soni,Alia El Kattan,Merlin Stein,Siddharth Swaroop,Ilia Sucholutsky,Andrew Strait,Q. Vera Liao,Umang Bhatt
Affiliations: University of Oxford; University of Cambridge; Princeton University; Stanford University; Harvard Kennedy School; University of Washington; OpenAI; Alinia AI; New York University; UK AI Security Institute; Harvard University; University of Michigan
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative “thought partners,” capable of engaging more fluidly in natural language. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance - relying on LLMs beyond their capabilities - grows. This position paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that - together - raise serious and unique concerns about overreliance in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that the AI research community can pursue to ensure LLMs augment rather than undermine human capabilities.
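[NLP-40] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs
[Quick read]: This paper addresses how open-weight large language models (LLMs) can support accessible research while resisting malicious fine-tuning, especially when an attacker with full access to the model's weights and architecture uses full-parameter fine-tuning to erase existing safety alignment. The proposed solution, AntiDote, is a bi-level optimization method. An auxiliary adversary hypernetwork learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model's internal activations, and the defender model is trained to nullify the effect of these adversarial weight additions, which strengthens its robustness to tampering while preserving general capabilities.
Link: https://arxiv.org/abs/2509.08000
Authors: Debdeep Sanyal,Manodeep Ray,Murari Mandal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 19 pages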
Abstract:The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model’s weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model’s internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. AntiDote is up to 27.4% more robust against adversarial attacks compared to both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.
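To make "LoRA weight additions" concrete, the sketch below shows only the generic LoRA update W + scale * (B A) on a toy matrix in plain Python. It does not reproduce AntiDote's hypernetwork or its bi-level training loop, and the matrices and the `scale` parameter are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_lora(w: Matrix, b: Matrix, a: Matrix, scale: float = 1.0) -> Matrix:
    """Return W + scale * (B @ A), the weight matrix after a low-rank addition."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Toy 3x3 weight with a rank-1 update: B is 3x1 and A is 1x3.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 0.0, -1.0]]

adapted = apply_lora(W, B, A)
full_params = len(W) * len(W[0])                       # 9 entries in W
lora_params = len(B) * len(B[0]) + len(A) * len(A[0])  # 3 + 3 entries in B and A
print(adapted, full_params, lora_params)
```

In this picture the adversary's job is to choose B and A so that the adapted model misbehaves, while the defender's training objective tries to make any such addition ineffective.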
[NLP-41] Bilingual Word Level Language Identification for Omotic Languages
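[Quick read]: This paper addresses bilingual language identification (BLID) in multilingual text, specifically distinguishing Wolaita from Gofa, two languages spoken in southern Ethiopia. Lexical similarities and differences between the two languages make them hard to separate with conventional methods. The proposed solution combines a BERT-based pretrained language model with a long short-term memory (LSTM) network, pairing contextual semantic features with sequence modeling, and reaches an F1 score of 0.72 on the test set. The authors present it as a workable basis for social media content moderation and for further research in this area.
Link: https://arxiv.org/abs/2509.07998
Authors: Mesay Gemeda Yigezu,Girma Yohannis Bade,Atnafu Lambebo Tonja,Olga Kolesnikova,Grigori Sidorov,Alexander Gelbukh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: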
Abstract:Language identification is the task of determining the languages for a given text. In many real-world scenarios, text may contain more than one language, particularly in multilingual communities. Bilingual Language Identification (BLID) is the task of identifying and distinguishing between two languages in a given text. This paper presents BLID for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. The presence of word-level similarities and differences between the two languages makes the language identification task challenging. To overcome this challenge, we ran experiments with several approaches. The combination of a BERT-based pretrained language model and an LSTM performed best, with an F1 score of 0.72 on the test set. This work can help tackle unwanted social media issues and provides a foundation for further research in this area.
[NLP-42] Reinforcement Learning Foundations for Deep Research Systems: A Survey
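[Quick read]: This paper addresses three core problems in training deep research systems. First, supervised fine-tuning (SFT) suffers from imitation and exposure biases and underuses environment feedback. Second, preference alignment methods such as DPO depend on human-defined decision points and subskills and handle long-horizon credit assignment and multi-objective trade-offs poorly. Third, existing methods rely too heavily on human priors and rater biases, which hinders efficient, robust autonomous exploration and task execution. The proposed answer is reinforcement learning (RL): optimizing trajectory-level policies supports closed-loop tool interaction, exploration, recovery behaviors, and principled credit assignment, which reduces dependence on human labels and preset rules and improves agent generalization and transparency on complex multi-step tasks.
Link: https://arxiv.org/abs/2509.06733
Authors: Wenjun Li,Zhi Chen,Jingru Lin,Hannan Cao,Wei Han,Sheng Liang,Zhi Zhang,Kuicai Dong,Dexun Li,Chen Zhang,Yong Liu
Affiliations: Huawei Technologies Co., Ltd
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 38 pages, first version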
Abstract:Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL. 
Computer Vision
[CV-0] SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video WWW ATC
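[Quick read]: This paper addresses the reconstruction of three-dimensional dynamic scenes, fabrics in particular, from a monocular RGB video sequence, where depth ambiguity makes reconstruction inaccurate. The key is to combine physical simulation with differentiable rendering: the authors introduce two novel regularization terms that make the reconstruction more plausible, and use the optimized motion for high-quality appearance estimation, so that both high-fidelity geometry and appearance are recovered from a single monocular video.
Link: https://arxiv.org/abs/2509.08828
Authors: David Stotko,Reinhard Klein
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Video: this https URL GitHub: this https URL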
Abstract:The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction by addressing the depth ambiguity problem in monocular video. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of 2.64 while requiring a medium runtime of 30 min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.
[CV-1] RewardDance: Reward Scaling in Visual Generation
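[Quick read]: This paper addresses the scalability of reward models (RMs) for visual generation and the "reward hacking" that commonly arises in reinforcement learning from human feedback (RLHF). Existing approaches scale poorly because CLIP-based RMs are limited by architecture and input modality, and the Bradley-Terry loss is misaligned with the next-token prediction mechanism of vision-language models (VLMs); in addition, RLHF optimization is easily misled by flaws in the reward signal, so models chase high reward without improving true quality. The proposed solution, RewardDance, adopts a generative reward paradigm that redefines the reward score as the model's probability of predicting a "yes" token, meaning the generated image is judged better than a reference image. This aligns the reward objective with the VLM architecture and enables scaling along two dimensions: model scaling up to 26 billion parameters, and context scaling with task instructions, reference examples, and chain-of-thought reasoning. Experiments show that RewardDance significantly outperforms state-of-the-art methods, and that maintaining high reward variance helps it resist reward hacking and relieve mode collapse.
Link: https://arxiv.org/abs/2509.08826
Authors: Jie Wu,Yu Gao,Zilyu Ye,Ming Li,Liang Li,Hanzhong Guo,Jie Liu,Zeyue Xue,Xiaoxia Hou,Wei Liu,Yan Zeng,Weilin Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Bytedance Seed Technical Report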
Abstract:Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model’s probability of predicting a “yes” token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of “reward hacking”: Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.
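The "probability of a yes token" idea can be sketched in a few lines. The Python below is an illustration under assumptions, not the paper's implementation: it takes a dictionary of next-token logits (a hypothetical stand-in for a VLM's output at the answer position) and returns the softmax probability of the `"yes"` token as the reward. Whether the paper normalizes over the full vocabulary or only over yes/no is not stated in the abstract.

```python
import math
from typing import Dict


def yes_token_reward(next_token_logits: Dict[str, float], yes_token: str = "yes") -> float:
    """Softmax probability of the yes token, used directly as the reward."""
    peak = max(next_token_logits.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in next_token_logits.items()}
    return weights[yes_token] / sum(weights.values())


# Hypothetical logits a VLM might emit after a prompt such as
# "Is image A better than the reference image B? Answer yes or no."
logits = {"yes": 2.0, "no": 0.0}
print(round(yes_token_reward(logits), 4))  # 0.8808
```

Because the reward is read off the model's ordinary next-token distribution, no separate regression head is needed, which is the alignment with VLM architectures that the abstract describes.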
[CV-2] GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts
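[Quick read]: This paper addresses spatio-temporal artifacts in text-driven video generation, such as violations of physics and temporal inconsistency. The key contribution is GeneVA, a large-scale annotated dataset that focuses on spatio-temporal artifacts in videos generated from natural text prompts and includes rich human annotations, providing a systematic benchmark for evaluating models and improving generation quality.
Link: https://arxiv.org/abs/2509.08818
Authors: Jenna Kang,Maria Silva,Patsorn Sangkloy,Kenneth Chen,Niall Williams,Qi Sun
Affiliations: New York University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: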
Abstract:Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.
[CV-3] Handling Multiple Hypotheses in Coarse-to-Fine Dense Image Matching
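[Quick read]: This paper addresses mismatches in dense image matching caused by keeping a single correspondent hypothesis per location. At depth discontinuities, or when the target image is a strong zoom-in of the source image, the correspondents of neighboring source pixels are widely spread and the usual coarse-to-fine mechanism tends to produce erroneous matches. The proposed solution propagates multiple hypotheses: at each scale the model predicts several correspondent hypotheses per source location, carries them across scales with a beam search strategy, and integrates them into cross-attention layers, yielding a new architecture called BEAMER. Preserving and propagating multiple hypotheses makes matching considerably more robust than state-of-the-art methods, especially in these challenging cases.
Link: https://arxiv.org/abs/2509.08805
Authors: Matthieu Vilain,Rémi Giraud,Yannick Berthoumieu,Guillaume Bourmaud
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: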
Abstract:Dense image matching aims to find a correspondent for every pixel of a source image in a partially overlapping target image. State-of-the-art methods typically rely on a coarse-to-fine mechanism where a single correspondent hypothesis is produced per source location at each scale. In challenging cases – such as at depth discontinuities or when the target image is a strong zoom-in of the source image – the correspondents of neighboring source locations are often widely spread and predicting a single correspondent hypothesis per source location at each scale may lead to erroneous matches. In this paper, we investigate the idea of predicting multiple correspondent hypotheses per source location at each scale instead. We consider a beam search strategy to propagate multiple hypotheses at each scale and propose integrating these multiple hypotheses into cross-attention layers, resulting in a novel dense matching architecture called BEAMER. BEAMER learns to preserve and propagate multiple hypotheses across scales, making it significantly more robust than state-of-the-art methods, especially at depth discontinuities or when the target image is a strong zoom-in of the source image.
[CV-4] PianoVAM: A Multimodal Piano Performance Dataset
[Quick read]: This paper responds to the growing demand in music information retrieval (MIR) for multimodal piano performance data, and in particular to the need to combine audio, video, MIDI, hand landmarks, and fingering labels for more accurate analysis and modeling of piano performance. The solution is PianoVAM, a comprehensive piano performance dataset. Audio and MIDI are captured in sync on a Disklavier piano alongside top-view video in realistic and varied conditions, and hand landmarks and fingering labels are obtained with a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. The dataset provides benchmark data for tasks such as audio-visual piano transcription.
Link: https://arxiv.org/abs/2509.08800
Authors: Yonghyun Kim,Junhyung Park,Joonhyung Bae,Kirak Kim,Taegyun Kwon,Alexander Lerch,Juhan Nam
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted to the 26th International Society for Music Information Retrieval (ISMIR) Conference, 2025
Abstract:The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
[CV-5] Quantifying Accuracy of an Event-Based Star Tracker via Earth's Rotation
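【Quick Read】: This paper addresses the lack of accurate ground truth when evaluating event-based cameras (EBCs) for star-tracking-based attitude determination. Ground-truth attitude is hard to obtain for real data, so the authors use the Earth's rotation as a high-precision reference: the rotation is highly regular, with small irregularities that are measured to the milli-arcsecond level. With the event camera held static and pointed at the night sky through a ground-based telescope, the only camera motion in the celestial reference frame is that induced by the Earth's rotation. The resulting event stream is processed into orientation estimates, which are compared with the Earth orientation measured by the International Earth Rotation and Reference System (IERS). The key idea is to turn a well-characterized physical phenomenon into reliable ground truth for evaluating EBC attitude estimates. The system achieves a root-mean-square across error of 18.47 arcseconds and an about error of 78.84 arcseconds, which supports the use of event cameras for low-cost, low-latency star tracking.
Link: https://arxiv.org/abs/2509.08794
Authors: Dennis Melamed,Connor Hashemi,Scott McCloskey
Affiliations: Kitware
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: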
Abstract:Event-based cameras (EBCs) are a promising new technology for star tracking-based attitude determination, but prior studies have struggled to determine accurate ground truth for real data. We analyze the accuracy of an EBC star tracking system utilizing the Earth’s motion as the ground truth for comparison. The Earth rotates in a regular way with very small irregularities which are measured to the level of milli-arcseconds. By keeping an event camera static and pointing it through a ground-based telescope at the night sky, we create a system where the only camera motion in the celestial reference frame is that induced by the Earth’s rotation. The resulting event stream is processed to generate estimates of orientation which we compare to the International Earth Rotation and Reference System (IERS) measured orientation of the Earth. The event camera system is able to achieve a root mean squared across error of 18.47 arcseconds and an about error of 78.84 arcseconds. Combined with the other benefits of event cameras over framing sensors (reduced computation due to sparser data streams, higher dynamic range, lower energy consumption, faster update rates), this level of accuracy suggests the utility of event cameras for low-cost and low-latency star tracking. We provide all code and data used to generate our results: this https URL.
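To give a sense of scale for the reference motion, the sketch below computes how far the sky appears to rotate in a given time and how a root-mean-square error is formed from per-sample errors. It is an illustration under stated assumptions, not the authors' pipeline: it uses a constant mean sidereal day (about 86164.0905 s) and ignores the small irregularities, precession, nutation, and polar motion that the IERS products account for. The function names are hypothetical.

```python
import math

SIDEREAL_DAY_S = 86164.0905    # approximate mean sidereal day, in seconds
ARCSEC_PER_REV = 360 * 3600    # arcseconds in one full revolution


def earth_rotation_arcsec(dt_s: float) -> float:
    """Angle (arcseconds) the Earth turns in dt_s seconds at the mean rate."""
    return ARCSEC_PER_REV * dt_s / SIDEREAL_DAY_S


def rms(errors_arcsec: list) -> float:
    """Root-mean-square of a list of per-sample angular errors."""
    return math.sqrt(sum(e * e for e in errors_arcsec) / len(errors_arcsec))


# The mean rate is roughly 15.04 arcseconds per second of time.
print(round(earth_rotation_arcsec(1.0), 3))
```

At this rate, the reported 18.47-arcsecond RMS across error corresponds to a little more than one second of Earth rotation.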
[CV-6] An End-to-End Deep Learning Framework for Arsenicosis Diagnosis Using Mobile-Captured Skin Images
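【Quick Read】: This paper addresses the difficulty of diagnosing arsenicosis early in South and Southeast Asia, where it is caused by long-term consumption of arsenic-contaminated water, especially in rural areas that lack dermatologists. The core challenge is an automated diagnostic method that is non-invasive, accessible, and clinically transparent. The solution is an end-to-end deep learning framework that classifies skin images captured with mobile phones. The authors curate a dataset of 20 classes and over 11,000 images covering arsenic-induced and other dermatological conditions, then benchmark convolutional neural networks (CNNs) against Transformer-based models; the Swin Transformer performs best at 86% accuracy, clearly ahead of the CNN architectures. LIME and Grad-CAM visualize the regions the models attend to, which improves clinical trust, and a web-based tool demonstrates that the system can be deployed in resource-limited settings.
Link: https://arxiv.org/abs/2509.08780
Authors: Asif Newaz,Asif Ur Rahman Adib,Rajit Sahil,Mashfique Mehzad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: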
Abstract:Background: Arsenicosis is a serious public health concern in South and Southeast Asia, primarily caused by long-term consumption of arsenic-contaminated water. Its early cutaneous manifestations are clinically significant but often underdiagnosed, particularly in rural areas with limited access to dermatologists. Automated, image-based diagnostic solutions can support early detection and timely interventions. Methods: In this study, we propose an end-to-end framework for arsenicosis diagnosis using mobile phone-captured skin images. A dataset comprising 20 classes and over 11,000 images of arsenic-induced and other dermatological conditions was curated. Multiple deep learning architectures, including convolutional neural networks (CNNs) and Transformer-based models, were benchmarked for arsenicosis detection. Model interpretability was integrated via LIME and Grad-CAM, while deployment feasibility was demonstrated through a web-based diagnostic tool. Results: Transformer-based models significantly outperformed CNNs, with the Swin Transformer achieving the best results (86% accuracy). LIME and Grad-CAM visualizations confirmed that the models attended to lesion-relevant regions, increasing clinical transparency and aiding in error analysis. The framework also demonstrated strong performance on external validation samples, confirming its ability to generalize beyond the curated dataset. Conclusion: The proposed framework demonstrates the potential of deep learning for non-invasive, accessible, and explainable diagnosis of arsenicosis from mobile-acquired images. By enabling reliable image-based screening, it can serve as a practical diagnostic aid in rural and resource-limited communities, where access to dermatologists is scarce, thereby supporting early detection and timely intervention.
Submission history: [v1] Wed, 10 Sep 2025 17:08:31 UTC (6,489 KB), submitted by Asif Newaz. DOI: https://doi.org/10.48550/arXiv.2509.08780
[CV-7] ArgoTweak: Towards Self-Updating HD Maps through Structured Priors ICCV2025
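【Quick Read】: This paper addresses the sim2real gap in self-verifying and self-updating high-definition (HD) maps, which arises because no public dataset provides the required triplet of prior maps, current maps, and sensor data. Existing methods rely on synthetic priors, which introduce inconsistencies and limit generalization to real scenes. The key to the solution is ArgoTweak, the first dataset to complete the triplet with realistic map priors. It uses a bijective mapping framework that breaks large-scale map modifications down into fine-grained atomic changes at the map element level, enabling interpretable change detection and integration while preserving unchanged elements with high fidelity. Training on ArgoTweak significantly reduces the sim2real gap, and the dataset provides a benchmark and toolbox for explainable, prior-aided HD mapping.
Link: https://arxiv.org/abs/2509.08764
Authors: Lena Wild,Rafael Valencia,Patric Jensfelt
Affiliations: KTH Royal Institute of Technology; TRATON
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025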
Abstract:Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at this https URL.
[CV-8] SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
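【Quick Read】: This paper addresses the limited ability of current vision-language models (VLMs) to understand spatiotemporal relations and human intentions in complex social navigation scenes, with the goal of safer and more socially compliant robot navigation in dynamic, human-centered environments. The key to the solution is SocialNav-SUB, a visual question answering (VQA) dataset and evaluation framework for social robot navigation. It systematically measures VLM performance on spatial, spatiotemporal, and social reasoning tasks, compares the models against human and rule-based baselines, and exposes critical gaps in social scene understanding, giving follow-up work on foundation models for social robot navigation a quantifiable basis.
Link: https://arxiv.org/abs/2509.08757
Authors: Michael J. Munje,Chen Tang,Shuijing Liu,Zichao Hu,Yifeng Zhu,Jiaxun Cui,Garrett Warnell,Joydeep Biswas,Peter Stone
Affiliations: The University of Texas at Austin; Army Research Laboratory; Sony AI
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference on Robot Learning (CoRL) 2025. Project site: this https URL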
Abstract:Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at this https URL.
[CV-9] CrowdQuery: Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes IROS2025
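【Quick Read】: This paper addresses degraded object detection in crowded scenes, in particular missed and false detections by existing Transformer-based detectors in dense crowds. The key to the solution is CrowdQuery (CQ), whose core CQ module predicts and embeds an object density map and systematically integrates the density information into the decoder. Earlier density map definitions depend on head positions or object-based spatial statistics; CQ extends them to include individual bounding box dimensions and uses density-guided queries to improve detection in crowded scenes. The design applies to both 2D and 3D detection without additional data.
Link: https://arxiv.org/abs/2509.08738
Authors: Marius Dähling,Sebastian Krebs,J. Marius Zöllner
Affiliations: Karlsruhe Institute of Technology (KIT); Mercedes-Benz AG, Research and Development; Intelligent Vehicles Group at TU Delft; Research Center for Information Technology (FZI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, accepted by IROS 2025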
Abstract:This paper introduces a novel method for end-to-end crowd detection that leverages object density information to enhance existing transformer-based detectors. We present CrowdQuery (CQ), whose core component is our CQ module that predicts and subsequently embeds an object density map. The embedded density information is then systematically integrated into the decoder. Existing density map definitions typically depend on head positions or object-based spatial statistics. Our method extends these definitions to include individual bounding box dimensions. By incorporating density information into object queries, our method utilizes density-guided queries to improve detection in crowded scenes. CQ is universally applicable to both 2D and 3D detection without requiring additional data. Consequently, we are the first to design a method that effectively bridges 2D and 3D detection in crowded environments. We demonstrate the integration of CQ into both a general 2D and 3D transformer-based object detector, introducing the architectures CQ2D and CQ3D. CQ is not limited to the specific transformer models we selected. Experiments on the STCrowd dataset for both 2D and 3D domains show significant performance improvements compared to the base models, outperforming most state-of-the-art methods. When integrated into a state-of-the-art crowd detector, CQ can further improve performance on the challenging CrowdHuman dataset, demonstrating its generalizability. The code is released at this https URL.
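One plausible way to make a density map depend on bounding box dimensions is to place a Gaussian at each box center whose spread scales with the box width and height. The sketch below is only that, a plausible reading: the abstract does not give the exact definition used by CQ, and the `density_map` function and its `scale` parameter are hypothetical.

```python
import math


def density_map(boxes, height, width, scale=0.25):
    """Build a density map from boxes given as (cx, cy, w, h) in pixels.

    Each box contributes a 2D Gaussian centered at (cx, cy) with
    sigma_x = scale * w and sigma_y = scale * h, normalized so that it
    sums to 1. The map therefore sums to the number of contributing boxes.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for cx, cy, w, h in boxes:
        sx = max(scale * w, 1e-6)
        sy = max(scale * h, 1e-6)
        kernel = [[math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
                   for x in range(width)] for y in range(height)]
        total = sum(map(sum, kernel))
        if total == 0.0:
            continue  # box lies too far outside the map to contribute
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / total
    return dmap


dm = density_map([(8, 8, 4, 4), (20, 10, 8, 6)], height=24, width=32)
print(round(sum(map(sum, dm)), 6))
```

Compared with a fixed-width kernel on head positions, a size-dependent kernel spreads large (near) objects over more pixels and keeps small (distant) objects sharply peaked.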
[CV-10] BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
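【Quick Read】: This paper addresses the difficulty of deploying multimodal large language models (MLLMs) in resource-constrained environments, that is, how to balance light weight and high performance as energy efficiency, computational scalability, and environmental sustainability become more important. The key to the solution is BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model), a lightweight framework for end-to-end visual question answering built around BreezeCLIP, a compact vision-language encoder optimized for efficient multimodal understanding. The full model has only 1.2 billion parameters, which significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Its modular, extensible design allows generalization to broader multimodal tasks.
Link: https://arxiv.org/abs/2509.08715
Authors: Sike Xiang,Shuang Chen,Amir Atapour-Abarghouei
Affiliations: Durham University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: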
Abstract:As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at this https URL.
[CV-11] Computational Imaging for Enhanced Computer Vision
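【Quick Read】: This paper addresses the difficulty conventional imaging has in capturing high-fidelity visual data under challenging conditions such as low light, motion blur, or high dynamic range, which limits the performance of state-of-the-art computer vision (CV) systems. The survey's key contribution is a systematic treatment of computational imaging (CI) techniques, including light field imaging, high dynamic range (HDR) imaging, deblurring, high-speed imaging, and glare mitigation, which improve image acquisition and reconstruction. It examines their synergies with core CV tasks (object detection, depth estimation, optical flow, face recognition, and keypoint detection) and their potential to improve robustness, accuracy, and efficiency in applications such as autonomous navigation, surveillance, augmented reality, and robotics.
Link: https://arxiv.org/abs/2509.08712
Authors: Humera Shaikh,Kaur Jashanpreet
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: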
Abstract:This paper presents a comprehensive survey of computational imaging (CI) techniques and their transformative impact on computer vision (CV) applications. Conventional imaging methods often fail to deliver high-fidelity visual data in challenging conditions, such as low light, motion blur, or high dynamic range scenes, thereby limiting the performance of state-of-the-art CV systems. Computational imaging techniques, including light field imaging, high dynamic range (HDR) imaging, deblurring, high-speed imaging, and glare mitigation, address these limitations by enhancing image acquisition and reconstruction processes. This survey systematically explores the synergies between CI techniques and core CV tasks, including object detection, depth estimation, optical flow, face recognition, and keypoint detection. By analyzing the relationships between CI methods and their practical contributions to CV applications, this work highlights emerging opportunities, challenges, and future research directions. We emphasize the potential for task-specific, adaptive imaging pipelines that improve robustness, accuracy, and efficiency in real-world scenarios, such as autonomous navigation, surveillance, augmented reality, and robotics.
[CV-12] TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals ICRA2025
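【Quick Read】: This paper addresses the computational cost and poor generalization of robot visual navigation methods that depend on globally consistent 3D maps or learned controllers, particularly in open-set environments. The key to the solution is an RGB-only, object-level topometric navigation pipeline that combines global topological path planning with local metric trajectory control, enabling zero-shot, long-horizon navigation without 3D maps or pre-trained controllers. The system continuously predicts local trajectories from monocular depth and traversability estimation, includes an auto-switching mechanism that falls back to a baseline controller when necessary, and relies on foundation models for open-set applicability without domain-specific fine-tuning.
Link: https://arxiv.org/abs/2509.08699
Authors: Stefan Podgorski,Sourav Garg,Mehdi Hosseinzadeh,Lachlan Mares,Feras Dayoub,Ian Reid
Affiliations: Australian Institute for Machine Learning; The University of Adelaide; Mohamed Bin Zayed University of Artificial Intelligence
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 9 pages, 5 figures, ICRA 2025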
Abstract:Visual navigation in robotics traditionally relies on globally-consistent 3D maps or learned controllers, which can be computationally expensive and difficult to generalize across diverse environments. In this work, we present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation without requiring 3D maps or pre-trained controllers. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We address key limitations of previous methods by continuously predicting local trajectory using monocular depth and traversability estimation, and incorporating an auto-switching mechanism that falls back to a baseline controller when necessary. The system operates using foundational models, ensuring open-set applicability without the need for domain-specific fine-tuning. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability. Our approach outperforms existing state-of-the-art methods, offering a more adaptable and effective solution for visual navigation in open-set environments. The source code is made publicly available: this https URL.
[CV-13] Multi-Modal Robust Enhancement for Coastal Water Segmentation: A Systematic HSV-Guided Framework
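【Quick Read】: This paper addresses coastal water segmentation from satellite imagery, which is difficult because of complex spectral characteristics and irregular coastline boundaries; traditional RGB-based approaches often suffer from training instability and poor generalization. The key to the solution is a systematic robust enhancement framework, Robust U-Net, that integrates five synergistic components: HSV (hue, saturation, value) color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. In the ablation studies HSV supervision has the highest impact (influence score 0.85), and the complete framework reduces training variance by 84% while maintaining computational efficiency.
Link: https://arxiv.org/abs/2509.08694
Authors: Zhen Tian,Christos Anagnostopoulos,Qiyuan Wang,Zhiwei Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: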
Abstract:Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84% variance reduction) and enhanced segmentation quality. Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: this https URL.
[CV-14] FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization
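【Quick Read】: This paper addresses unsupervised dense optical flow estimation, that is, learning an accurate per-pixel motion field from consecutive grayscale frames without ground-truth annotations. The key to the solution is the Fractal Deformation Network (FDN), a neural architecture inspired by fractal geometry and self-similarity that nests encoder-decoder structures recursively with skip connections, capturing both fine-grained details and long-range motion patterns. Training uses a variational objective with total variation (TV) regularization: L¹ and L² data fidelity terms enforce brightness constancy, and the TV term promotes spatial smoothness and coherent flow fields. The model is reported to be especially effective on high-resolution data and in scenarios with limited annotations.
Link: https://arxiv.org/abs/2509.08670
Authors: Sara Behnamian,Rasoul Khaksarinezhad,Andreas Langer
Affiliations: Globe Institute, University of Copenhagen; Centre for Mathematical Sciences, Lund University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: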
Abstract:We present FractalPINN-Flow, an unsupervised deep learning framework for dense optical flow estimation that learns directly from consecutive grayscale frames without requiring ground truth. The architecture centers on the Fractal Deformation Network (FDN) - a recursive encoder-decoder inspired by fractal geometry and self-similarity. Unlike traditional CNNs with sequential downsampling, FDN uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Specifically, we minimize an energy functional that combines L^1 and L^2 data fidelity terms to enforce brightness constancy, along with a TV term that promotes spatial smoothness and coherent flow fields. Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. The model is especially effective for high-resolution data and scenarios with limited annotations.
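The kind of energy described above can be written down directly for a discrete flow field. The following is a minimal sketch under several assumptions that are not taken from the paper: the brightness constancy residual is linearized as r = I_x u + I_y v + I_t, derivatives are forward differences with a replicated border, the TV term is isotropic and applied to u and v separately, and the weights `lam1`, `lam2`, and `alpha` are hypothetical.

```python
import math


def flow_energy(I1, I2, u, v, lam1=1.0, lam2=1.0, alpha=0.1):
    """Discrete energy: lam1 * sum|r| + lam2 * sum r^2 + alpha * TV(u, v).

    I1, I2 are grayscale frames and u, v the flow components, all given as
    equally sized lists of rows. r is the linearized brightness constancy
    residual at each pixel.
    """
    H, W = len(I1), len(I1[0])
    data_l1 = data_l2 = tv = 0.0
    for y in range(H):
        for x in range(W):
            xn, yn = min(x + 1, W - 1), min(y + 1, H - 1)  # replicated border
            Ix = I1[y][xn] - I1[y][x]
            Iy = I1[yn][x] - I1[y][x]
            It = I2[y][x] - I1[y][x]
            r = Ix * u[y][x] + Iy * v[y][x] + It
            data_l1 += abs(r)
            data_l2 += r * r
            for f in (u, v):  # isotropic TV of each flow component
                tv += math.hypot(f[y][xn] - f[y][x], f[yn][x] - f[y][x])
    return lam1 * data_l1 + lam2 * data_l2 + alpha * tv


# A horizontal ramp shifted one pixel to the right between frames.
H, W = 4, 6
I1 = [[float(x) for x in range(W)] for _ in range(H)]
I2 = [[float(x - 1) for x in range(W)] for _ in range(H)]
zeros = [[0.0] * W for _ in range(H)]
ones = [[1.0] * W for _ in range(H)]
print(flow_energy(I1, I2, ones, zeros) < flow_energy(I1, I2, zeros, zeros))
```

In an unsupervised setting, an energy of this form would serve as the training loss evaluated on the network's predicted flow, so no ground-truth flow is needed.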
[CV-15] Skeleton-based sign language recognition using a dual-stream spatio-temporal dynamic graph convolutional network ICASSP
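【Quick Read】: This paper addresses geometric ambiguity in isolated sign language recognition (ISLR), where gestures are morphologically similar yet semantically distinct because of the complex interplay between hand shape and motion trajectory. The key to the solution is Dual-SignLanguageNet (DSLNet), a dual-reference, dual-stream architecture that decouples gesture morphology from trajectory and models each separately. A wrist-centric frame supports view-invariant shape analysis with a topology-aware graph convolution, and a facial-centric frame supports context-aware trajectory modeling with a Finsler geometry-based encoder. A geometry-driven optimal transport fusion mechanism integrates the two streams, improving accuracy with significantly fewer parameters.
Link: https://arxiv.org/abs/2509.08661
Authors: Liangjin Liu,Haoyang Zheng,Pei Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, ICASSP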
Abstract:Isolated Sign Language Recognition (ISLR) is challenged by gestures that are morphologically similar yet semantically distinct, a problem rooted in the complex interplay between hand shape and motion trajectory. Existing methods, often relying on a single reference frame, struggle to resolve this geometric ambiguity. This paper introduces Dual-SignLanguageNet (DSLNet), a dual-reference, dual-stream architecture that decouples and models gesture morphology and trajectory in separate, complementary coordinate systems. Our approach utilizes a wrist-centric frame for view-invariant shape analysis and a facial-centric frame for context-aware trajectory modeling. These streams are processed by specialized networks (a topology-aware graph convolution for shape and a Finsler geometry-based encoder for trajectory) and are integrated via a geometry-driven optimal transport fusion mechanism. DSLNet sets a new state-of-the-art, achieving 93.70%, 89.97% and 99.79% accuracy on the challenging WLASL-100, WLASL-300 and LSA64 datasets, respectively, with significantly fewer parameters than competing models.
[CV-16] X-Part: high fidelity and structure coherent shape decomposition
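【Quick Read】: This paper addresses the limited controllability and unclear semantic decomposition of existing part-based 3D shape generation methods, with the goal of better structural coherence and geometric fidelity. The key to the solution is X-Part, which uses bounding boxes as prompts to guide part generation and injects point-wise semantic features for semantically meaningful decomposition. An editable pipeline supports interactive part generation, so the resulting 3D parts are production-ready and easy to modify.
Link: https://arxiv.org/abs/2509.08643
Authors: Xinhao Yan,Jiachen Xu,Yang Li,Changfeng Ma,Yunhan Yang,Chunshi Wang,Zibo Zhao,Zeqiang Lai,Yunfei Zhao,Zhuo Chen,Chunchao Guo
Affiliations: Tencent Hunyuan; ShanghaiTech; NJU; HKU; ZJU; CUHK
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech Report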
Abstract:Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and suffer from poor semantically meaningful decomposition. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits the bounding box as prompts for the part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Codes will be released for public research.
[CV-17] LADB: Latent Aligned Diffusion Bridges for Semi-Supervised Domain Translation
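【Quick Read】: This paper addresses the challenges diffusion models face in data-scarce domains, where cross-domain translation usually requires large amounts of paired data or costly retraining. The core question is how to bridge the distribution gap between source and target domains with only partially paired data while achieving high-quality, controllable sample-to-sample translation. The key to the solution is Latent Aligned Diffusion Bridges (LADB), which aligns source and target distributions in a shared latent space and integrates a pretrained source-domain diffusion model with a target-domain Latent Aligned Diffusion Model (LADM) trained on partially paired latent representations, yielding deterministic domain mapping without full supervision. A mixture of paired and unpaired latent-target couplings balances fidelity and diversity, improves depth-to-image translation under partial supervision, and extends to multi-source and multi-target translation.
Link: https://arxiv.org/abs/2509.08628
Authors: Xuqin Wang,Tao Wu,Yanfeng Zhang,Lu Liu,Dong Wang,Mingwei Sun,Yongliang Wang,Niclas Zeller,Daniel Cremers
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: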
Abstract:Diffusion models excel at generating high-quality outputs but face challenges in data-scarce domains, where exhaustive retraining or costly paired data are often required. To address these limitations, we propose Latent Aligned Diffusion Bridges (LADB), a semi-supervised framework for sample-to-sample translation that effectively bridges domain gaps using partially paired data. By aligning source and target distributions within a shared latent space, LADB seamlessly integrates pretrained source-domain diffusion models with a target-domain Latent Aligned Diffusion Model (LADM), trained on partially paired latent representations. This approach enables deterministic domain mapping without the need for full supervision. Compared to unpaired methods, which often lack controllability, and fully paired approaches that require large, domain-specific datasets, LADB strikes a balance between fidelity and diversity by leveraging a mixture of paired and unpaired latent-target couplings. Our experimental results demonstrate superior performance in depth-to-image translation under partial supervision. Furthermore, we extend LADB to handle multi-source translation (from depth maps and segmentation masks) and multi-target translation in a class-conditioned style transfer task, showcasing its versatility in handling diverse and heterogeneous use cases. Ultimately, we present LADB as a scalable and versatile solution for real-world domain translation, particularly in scenarios where data annotation is costly or incomplete.
[CV-18] UOPSL: Unpaired OCT Predilection Sites Learning for Fundus Image Diagnosis Augmentation
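【Quick Read】: This paper addresses the performance bottleneck that scarce paired data imposes on multimodal ophthalmic image diagnosis, particularly the pronounced modality imbalance between fundus photography and optical coherence tomography (OCT). Conventional methods that rely on a single modality struggle to capture the fine-grained spatial distribution of lesions, which limits recognition accuracy. The key to the solution is UOPSL, an unpaired multimodal framework that learns a predilection sites matrix in the OCT latent space, using a large corpus of unpaired OCT images and extended disease text descriptions as a bridge to extract spatial priors dynamically. The matrix is then used in downstream fundus-only classification, improving disease recognition without paired OCT data.
Link: https://arxiv.org/abs/2509.08624
Authors: Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Xiangtong Yao,Quanmin Liang,Daniel Zapp,Kai Huang,Nassir Navab,M.Ali Nasseri
Affiliations: Technical University of Munich; Sun Yat-Sen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: BIBM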
Abstract:Significant advancements in AI-driven multimodal medical image diagnosis have led to substantial improvements in ophthalmic disease identification in recent years. However, acquiring paired multimodal ophthalmic images remains prohibitively expensive. While fundus photography is simple and cost-effective, the limited availability of OCT data and inherent modality imbalance hinder further progress. Conventional approaches that rely solely on fundus or textual features often fail to capture fine-grained spatial information, as each imaging modality provides distinct cues about lesion predilection sites. In this study, we propose a novel unpaired multimodal framework, UOPSL, that utilizes extensive OCT-derived spatial priors to dynamically identify predilection sites, enhancing fundus image-based disease recognition. Our approach bridges unpaired fundus and OCTs via extended disease text descriptions. Initially, we employ contrastive learning on a large corpus of unpaired OCT and fundus images while simultaneously learning the predilection sites matrix in the OCT latent space. Through extensive optimization, this matrix captures lesion localization patterns within the OCT feature space. During the fine-tuning or inference phase of the downstream classification task based solely on fundus images, where paired OCT data is unavailable, we eliminate OCT input and utilize the predilection sites matrix to assist in fundus image classification learning. Extensive experiments conducted on 9 diverse datasets across 28 critical categories demonstrate that our framework outperforms existing benchmarks.
[CV-19] AdsQA: Towards Advertisement Video Understanding ICCV-2025
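【Quick Read】: This paper addresses the limited ability of current large language models (LLMs) to handle semantic understanding that goes beyond objective physical content, in particular how to evaluate and improve their understanding of information implicit in advertisement videos, such as marketing logic, persuasive strategies, and audience engagement. First, the authors build AdsQA, a challenging video QA benchmark derived from 1,544 ad videos with 10,962 clips and covering 5 tasks. Second, they propose ReAd-R, a reinforcement learning model inspired by Deepseek-R1 that reflects on questions and generates answers through reward-driven optimization. Finally, they benchmark 14 top-tier LLMs on AdsQA, where ReAd-R outperforms strong baselines equipped with long-chain reasoning.
Link: https://arxiv.org/abs/2509.08621
Authors: Xinwei Long,Kai Tian,Peng Xu,Guoli Jia,Jingxuan Li,Sa Yang,Yihua Shao,Kaiyan Zhang,Che Jiang,Hao Xu,Yang Liu,Jiaheng Ma,Bowen Zhou
Affiliations: Tsinghua University; Peking University; CASIA; Harvard University; Shanghai Artificial Intelligence Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV-2025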
Abstract:Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming boost these general-purpose models to continuously evolve via learning deeper expertise. Now is thus the time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense ad videos’ traits, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 22.7 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a Deepseek-R1 styled RL model that reflects on questions, and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves state-of-the-art performance, outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.
[CV-20] CLAPS: A CLIP-Unified Auto-Prompt Segmentation for Multi-Modal Retinal Imaging
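【Quick Read】: This paper addresses three challenges that current foundation-model-based medical image segmentation methods face in retinal imaging: (1) modality ambiguity in textual disease descriptions, (2) continued reliance on manual prompting in Segment Anything Model (SAM) workflows, and (3) the lack of a unified framework, since most methods are modality- and task-specific. The key to the solution is CLIP-unified Auto-Prompt Segmentation (CLAPS). It pre-trains a CLIP-based image encoder on a large multimodal retinal dataset to mitigate data scarcity and distribution imbalance, uses GroundingDINO to detect local lesions and generate spatial bounding box prompts automatically, and enhances text prompts with a "modality signature" unique to each imaging modality. Together these prompts drive SAM through a fully automated, unified segmentation pipeline.
Link: https://arxiv.org/abs/2509.08618
Authors: Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Xiangtong Yao,Quanmin Liang,Shahrooz Faghihroohi,Kai Huang,Nassir Navab,M.Ali Nasseri
Affiliations: Technical University of Munich; Sun Yat-Sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BIBM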
Abstract:Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique “modality signature” for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.
[CV-21] EfficientIML: Efficient High-Resolution Image Manipulation Localization
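[Quick read]: This paper addresses the poor performance of current image forgery detectors on new high-resolution, diffusion-based manipulations: existing detectors are trained mainly on traditional forgery types (splicing, copy-move, and object removal) and have no exposure to diffusion-generated forgeries. The key to the solution is a new high-resolution SIF dataset of more than 1,200 diffusion-generated manipulations with semantically extracted masks, together with EfficientIML, a model built on a lightweight three-stage EfficientRWKV backbone. The backbone's hybrid state-space and attention network captures global context and local detail in parallel, and a multi-scale supervision strategy enforces consistency across hierarchical predictions, so the model reduces FLOPs and inference latency while keeping localization accuracy high, which suits real-time forensic applications.
Link: https://arxiv.org/abs/2509.08583
Authors: Jinhan Li,Haoyang He,Lei Xie,Jiangning Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: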
Abstract:With imaging devices delivering ever-higher resolutions and the emerging diffusion-based forgery methods, current detectors trained only on traditional datasets (with splicing, copy-moving and object removal forgeries) lack exposure to this new manipulation type. To address this, we propose a novel high-resolution SIF dataset of 1200+ diffusion-generated manipulations with semantically extracted masks. However, this also imposes a challenge on existing methods, as they face significant computational resource constraints due to their prohibitive computational complexities. Therefore, we propose a novel EfficientIML model with a lightweight, three-stage EfficientRWKV backbone. EfficientRWKV’s hybrid state-space and attention network captures global context and local details in parallel, while a multi-scale supervision strategy enforces consistency across hierarchical predictions. Extensive evaluations on our dataset and standard benchmarks demonstrate that our approach outperforms ViT-based and other SOTA lightweight baselines in localization performance, FLOPs and inference speed, underscoring its suitability for real-time forensic applications.
[CV-22] Implicit Shape-Prior for Few-Shot Assisted 3D Segmentation
[Quick read]: This paper aims to reduce the heavy manual workload that complex 3D segmentation tasks in medical imaging place on clinicians, for example identifying organs at risk (OARs) precisely in radiotherapy planning, or obtaining muscle-volume measurements by manual segmentation when diagnosing sarcopenia. The key to the solution is an implicit shape prior combined with a simple framework that automatically selects the most informative slices, so that accurate multi-organ segmentation is obtained from manual annotations on only a sparse set of slices and the number of follow-up interactions is minimized.
Link: https://arxiv.org/abs/2509.08580
Authors: Mathilde Monvoisin,Louise Piecuch,Blanche Texier,Cédric Hémon,Anaïs Barateau,Jérémie Huet,Antoine Nordez,Anne-Sophie Boureau,Jean-Claude Nunes,Diana Mateus
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Both first authors contributed equally to this work, last names in alphabetical order. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published in a Springer Nature Computer Science book series (CCIS, LNAI, LNBI, LNBIP, LNCS) and the DOI will soon be released
Abstract:The objective of this paper is to significantly reduce the manual workload required from medical professionals in complex 3D segmentation tasks that cannot yet be fully automated. For instance, in radiotherapy planning, organs at risk must be accurately identified in computed tomography (CT) or magnetic resonance imaging (MRI) scans to ensure they are spared from harmful radiation. Similarly, diagnosing age-related degenerative diseases such as sarcopenia, which involve progressive loss of muscle volume and strength, is commonly based on muscle mass measurements often obtained from manual segmentation of medical volumes. To alleviate the manual-segmentation burden, this paper introduces an implicit shape prior to segment volumes from sparse slice manual annotations, generalized to the multi-organ case, along with a simple framework for automatically selecting the most informative slices to guide and minimize the next interactions. The experimental validation shows the method’s effectiveness on two medical use cases: assisted segmentation of organs at risk for brain cancer patients, and acceleration of the creation of a new database with unseen muscle shapes for patients with sarcopenia.
[CV-23] Improving Greenland Bed Topography Mapping with Uncertainty-Aware Graph Learning on Sparse Radar Data
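[Quick read]: This paper addresses the limited accuracy of subglacial bed topography maps of the Greenland ice sheet, which stems from sparse and unevenly distributed radar observations and matters for sea-level projections. The key to the solution is the GraphTopoNet framework: it builds spatial graphs from surface observables (elevation, velocity, and mass balance) and augments them with gradient features and polynomial trends to capture both local variability and large-scale structure; it models uncertainty explicitly via Monte Carlo dropout; and it handles data gaps with a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization. In evaluations on three Greenland subregions, the method outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60% while preserving fine-scale glacial features and improving the reliability of ice-sheet modeling.
Link: https://arxiv.org/abs/2509.08571
Authors: Bayu Adhi Tama,Homayra Alam,Mostafa Cham,Omar Faruque,Jianwu Wang,Vandana Janeja
Affiliations: iHARP, University of Maryland Baltimore County (UMBC); Department of Information Systems, University of Maryland Baltimore County (UMBC)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: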
Abstract:Accurate maps of Greenland’s subglacial bed are essential for sea-level projections, but radar observations are sparse and uneven. We introduce GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and explicitly models uncertainty via Monte Carlo dropout. Spatial graphs built from surface observables (elevation, velocity, mass balance) are augmented with gradient features and polynomial trends to capture both local variability and broad structure. To handle data gaps, we employ a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization. Applied to three Greenland subregions, GraphTopoNet outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60 percent while preserving fine-scale glacial features. The resulting bed maps improve reliability for operational modeling, supporting agencies engaged in climate forecasting and policy. More broadly, GraphTopoNet shows how graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale.
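As a rough illustration of the Monte Carlo dropout idea mentioned in the abstract, the sketch below keeps dropout active at prediction time and treats the spread of repeated stochastic passes as an uncertainty estimate. It is a toy linear model written for this digest, not GraphTopoNet's code; the function names, the drop probability, and the number of passes are all assumptions.

```python
import random
import statistics


def dropout_forward(features, weights, drop_prob, rng):
    """One stochastic pass: randomly zero inputs and rescale the survivors."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += (x / keep) * w  # inverted-dropout scaling
    return total


def mc_dropout_predict(features, weights, drop_prob=0.2, passes=200, seed=0):
    """Return (mean, std) over repeated stochastic passes.

    The mean serves as the prediction and the standard deviation as a
    simple uncertainty estimate. All defaults here are hypothetical.
    """
    rng = random.Random(seed)
    samples = [
        dropout_forward(features, weights, drop_prob, rng) for _ in range(passes)
    ]
    return statistics.fmean(samples), statistics.pstdev(samples)


# Toy inputs standing in for surface observables at one graph node.
mean_value, uncertainty = mc_dropout_predict([1.0, 0.5, -0.3], [0.8, -0.2, 0.4])
```

With `drop_prob=0.0` every pass is identical, so the reported uncertainty collapses to zero; a larger spread signals a less stable prediction.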
[CV-24] Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation
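[Quick read]: This paper addresses the fact that multimodal models perform markedly worse on medical image segmentation than on natural images. The core challenges are the semantic gap between text prompts and fine-grained medical visual features and the feature dispersion that results from it. The key to the solution is a pair of complementary mechanisms: an Expectation-Maximization (EM) Aggregation mechanism that dynamically clusters features into compact semantic centers to strengthen cross-modal correspondence, and a Text-Guided Pixel Decoder that uses domain-invariant textual knowledge to guide deep visual representations and narrow the semantic gap. Together they substantially improve generalization across medical imaging domains.
Link: https://arxiv.org/abs/2509.08570
Authors: Wenjun Yu,Yinchen Zhou,Jia-Xuan Jiang,Shubin Zeng,Yuee Li,Zhong Wang
Affiliations: Lanzhou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages and 8 figures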
Abstract:Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal fusion, primarily the significant semantic gap between abstract textual prompts and fine-grained medical visual features, as well as the resulting feature dispersion. To address these issues, we revisit the problem from the perspective of semantic aggregation. Specifically, we propose an Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder. The former mitigates feature dispersion by dynamically clustering features into compact semantic centers to enhance cross-modal correspondence. The latter is designed to bridge the semantic gap by leveraging domain-invariant textual knowledge to effectively guide deep visual representations. The synergy between these two mechanisms significantly improves the model’s generalization ability. Extensive experiments on public cardiac and fundus datasets demonstrate that our method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.
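The abstract describes EM Aggregation only at a high level. A generic EM-style soft clustering loop, which is one plausible reading of "dynamically clustering features into compact semantic centers", might look like the sketch below. The dot-product similarity, the temperature, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def em_aggregate(features, centers, iterations=5, temperature=1.0):
    """Alternate soft assignment (E-step) and center update (M-step)."""
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each center for each feature.
        resp = [
            softmax([dot(f, c) / temperature for c in centers]) for f in features
        ]
        # M-step: each center becomes the responsibility-weighted feature mean.
        dim = len(features[0])
        new_centers = []
        for k in range(len(centers)):
            weight = sum(r[k] for r in resp)
            new_centers.append(
                [sum(r[k] * f[d] for r, f in zip(resp, features)) / weight
                 for d in range(dim)]
            )
        centers = new_centers
    return centers, resp


toy_features = [[5.0, 0.0], [4.8, 0.2], [0.0, 5.0], [0.2, 4.8]]
toy_centers, toy_resp = em_aggregate(toy_features, [[1.0, 0.0], [0.0, 1.0]])
```

On the toy data, the two centers settle near the two feature groups, which is the "compact semantic center" behavior in miniature.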
[CV-25] ViewSparsifier: Killing Redundancy in Multi-View Plant Phenotyping
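[Quick read]: This paper addresses inaccurate estimation of target traits in plant phenotyping (such as plant age prediction and leaf counting) caused by the limited information in single-view images, which undermines plant health assessment and harvest-readiness prediction. The key to the solution is ViewSparsifier, which learns view-invariant embeddings from 24 randomly selected views (the "selection vector"), handling the redundancy across multi-view images and improving generalization; the approach won both tasks of the Growth Modelling (GroMo) Grand Challenge at ACM Multimedia 2025. The authors also explore randomized view selection over a wider pool (120 views in total) as a direction for further improvement.
Link: https://arxiv.org/abs/2509.08550
Authors: Robin-Nico Kampa,Fabian Deuser,Konrad Habel,Norbert Oswald
Affiliations: University of the Bundeswehr Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: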
Abstract:Plant phenotyping involves analyzing observable characteristics of plants to better understand their growth, health, and development. In the context of deep learning, this analysis is often approached through single-view classification or regression models. However, these methods often fail to capture all information required for accurate estimation of target phenotypic traits, which can adversely affect plant health assessment and harvest readiness prediction. To address this, the Growth Modelling (GroMo) Grand Challenge at ACM Multimedia 2025 provides a multi-view dataset featuring multiple plants and two tasks: Plant Age Prediction and Leaf Count Estimation. Each plant is photographed from multiple heights and angles, leading to significant overlap and redundancy in the captured information. To learn view-invariant embeddings, we incorporate 24 views, referred to as the selection vector, in a random selection. Our ViewSparsifier approach won both tasks. For further improvement and as a direction for future research, we also experimented with randomized view selection across all five height levels (120 views total), referred to as selection matrices.
[CV-26] MESH – Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
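[Quick read]: This paper addresses hallucination in Large Video Models (LVMs), that is, their tendency to produce descriptions that are inaccurate or unrelated to the video content when interpreting dynamic video. Existing benchmarks for LVM hallucination rely heavily on manual categorization of video content and neglect the perception-based process by which humans naturally understand videos. The proposed solution is the MESH benchmark. Its key design is a Question-Answering framework with binary and multiple-choice formats that include target instances and trap instances, evaluating models level by level from basic objects, through coarse-to-fine subject features, to subject-action pairs, and thereby measuring hallucination tendencies systematically at each level. This bottom-up design mirrors human video understanding and gives a more realistic picture of model behavior in complex scenes.
Link: https://arxiv.org/abs/2509.08538
Authors: Garry Yang,Zizhe Chen,Man Hon Wong,Haoyu Lei,Yongqiang Chen,Zhenguo Li,Kaiwen Zhou,James Cheng
Affiliations: The Chinese University of Hong Kong; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: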
Abstract:Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.
[CV-27] HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
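[Quick read]: This paper addresses two challenges in coordinating multimodal inputs (text, image, and audio) for Human-Centric Video Generation (HCVG): the scarcity of high-quality paired triplet data (text, reference image, audio), and the difficulty of balancing the two sub-tasks of subject preservation and audio-visual sync. The key to the solution is HuMo, a unified HCVG framework with three components: (1) a high-quality paired multimodal dataset to ease the data scarcity; (2) a two-stage progressive multimodal training paradigm, in which a minimally invasive image injection strategy preserves the foundation model's prompt-following and visual generation abilities, and a focus-by-predicting strategy implicitly guides the model to associate audio with facial regions for audio-visual sync; and (3) joint learning of controllability across modalities by progressively incorporating the audio-visual sync task, plus a time-adaptive Classifier-Free Guidance strategy at inference that dynamically adjusts guidance weights across denoising steps for flexible, fine-grained multimodal control.
Link: https://arxiv.org/abs/2509.08519
Authors: Liyang Chen,Tianxiang Ma,Jiawei Liu,Bingchuan Li,Zhuowei Chen,Lijie Liu,Xu He,Gen Li,Qian He,Zhiyong Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: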
Abstract:Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of coordinating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt a minimally invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: this https URL.
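To make "guidance weights that change across denoising steps" concrete, here is a minimal sketch of multi-condition classifier-free guidance with a step-dependent schedule. The abstract does not give HuMo's schedule or its exact guidance formula, so the two-phase schedule, the weight values, and the way the text and audio terms are composed are all hypothetical.

```python
def guidance_weights(step, total_steps, switch_fraction=0.5):
    """Hypothetical schedule: favour text early in denoising, audio late."""
    progress = step / max(total_steps - 1, 1)
    if progress < switch_fraction:
        return {"text": 7.5, "audio": 2.0}  # illustrative values only
    return {"text": 3.0, "audio": 5.0}      # illustrative values only


def guided_noise(eps_uncond, eps_text, eps_text_audio, weights):
    """Combine noise predictions from three passes of a denoiser.

    eps_uncond:      prediction with no conditions
    eps_text:        prediction with the text condition only
    eps_text_audio:  prediction with text and audio conditions
    """
    return [
        u + weights["text"] * (t - u) + weights["audio"] * (ta - t)
        for u, t, ta in zip(eps_uncond, eps_text, eps_text_audio)
    ]


# One denoising step with toy one-dimensional predictions.
w = guidance_weights(step=10, total_steps=50)
eps = guided_noise([0.0], [1.0], [2.0], w)
```

Setting both weights to 1.0 recovers the fully conditioned prediction, which is a convenient sanity check for any schedule of this shape.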
[CV-28] Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening
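[Quick read]: This paper addresses the weak modeling of time sensitivity in existing video representations, in particular their difficulty in distinguishing temporally opposite actions (such as "opening vs. closing a door"). Such actions occur frequently in everyday life and depend on understanding visual change over time in object state, position, count, and so on, yet most video embeddings represent this information poorly. The key to the solution is a self-supervised adaptation recipe that injects time sensitivity into a sequence of frozen image features, built around an auto-encoder whose latent space carries an inductive bias inspired by perceptual straightening, so that temporally opposite action pairs become linearly separable. The method yields compact, time-sensitive video representations on Something-Something, EPIC-Kitchens, and Charades, outperforms much larger pre-trained video models, and improves classification performance on standard benchmarks when combined with them.
Link: https://arxiv.org/abs/2509.08502
Authors: Piyush Bagad,Andrew Zisserman
Affiliations: VGG, Dept. of Engineering Science, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 10 figures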
Abstract:Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as “opening vs. closing a door”, “approaching vs. moving away from something”, “folding vs. unfolding paper”, etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count, ...), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time-aware video representations which offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder with a latent space with inductive bias inspired by perceptual straightening. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charades. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.
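"Straightening" can be made measurable with a curvature score: the average angle between consecutive displacement vectors of a latent trajectory, which is zero for a perfectly straight path. The sketch below computes that score for a sequence of per-frame latent vectors. It illustrates the general notion only; the abstract does not state how the paper's latent-space bias or training loss is defined, and the function name is an assumption.

```python
import math


def mean_curvature(trajectory):
    """Average angle (radians) between consecutive displacement vectors."""
    steps = [
        [b - a for a, b in zip(p, q)]
        for p, q in zip(trajectory, trajectory[1:])
    ]
    angles = []
    for u, v in zip(steps, steps[1:]):
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0.0:
            continue  # skip repeated points, whose direction is undefined
        cosine = sum(a * b for a, b in zip(u, v)) / norm
        cosine = max(-1.0, min(1.0, cosine))  # guard against rounding
        angles.append(math.acos(cosine))
    return sum(angles) / len(angles) if angles else 0.0


straight_path = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
zigzag_path = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
straight_score = mean_curvature(straight_path)
zigzag_score = mean_curvature(zigzag_path)
```

A lower score means the frame embeddings move along a straighter line through latent space, so position along that line carries the ordering of events in time.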
[CV-29] A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models
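[Quick read]: This paper surveys the challenges that underwater object detection (UOD) faces in practice: image quality degradation, target-related issues, data scarcity and annotation difficulty, computational constraints, and limitations of the detection methods themselves. Its core contribution is a systematic review of how the techniques have evolved, together with an exploration of how Large Vision-Language Models (LVLMs) might improve UOD: generating synthetic data (for example, building a synthetic dataset with DALL-E 3) and fine-tuning an LVLM (Florence-2) to improve generalization to complex real-world conditions. The review finds that current methods still struggle with image degradation and small-object detection in dynamic underwater environments, and that LVLMs are promising but need further optimization before they can run efficiently in real time.
Link: https://arxiv.org/abs/2509.08490
Authors: Edwine Nabahirwa,Wei Song,Minghua Zhang,Yi Fang,Zhou Ni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 72 pages, 11 figures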
Abstract:Underwater object detection (UOD) is vital to diverse marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces numerous challenges that compromise its performance. Over the years, various methods have been proposed to address these issues, but they often fail to fully capture the complexities of underwater environments. This review systematically categorizes UOD challenges into five key areas: Image quality degradation, target-related issues, data-related challenges, computational and processing constraints, and limitations in detection methodologies. To address these challenges, we analyze the progression from traditional image processing and object detection techniques to modern approaches. Additionally, we explore the potential of large vision-language models (LVLMs) in UOD, leveraging their multi-modal capabilities demonstrated in other domains. We also present case studies, including synthetic dataset generation using DALL-E 3 and fine-tuning Florence-2 LVLM for UOD. This review identifies three key insights: (i) Current UOD methods are insufficient to fully address challenges like image degradation and small object detection in dynamic underwater environments. (ii) Synthetic data generation using LVLMs shows potential for augmenting datasets but requires further refinement to ensure realism and applicability. (iii) LVLMs hold significant promise for UOD, but their real-time application remains under-explored, requiring further research on optimization techniques.
[CV-30] Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation
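[Quick read]: This paper addresses how to turn a single natural-language instruction (prompt) into a unified multi-step image analysis workflow: locate, segment, edit, and describe. The central challenge is building an end-to-end, repeatable, and transparent system that makes generative AI more reliable and controllable on complex visual tasks. The key to the solution is integrating four modules (open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description) while retaining intermediate artifacts such as detections, masks, overlays, and before-and-after composites for transparent debugging. Threshold adjustments, mask inspection with light morphology, and resource-aware defaults reduce brittleness. On a small set of single-word prompts, detection and segmentation produced usable masks in over 90% of cases with accuracy above 85% by the authors' criteria, and the pipeline is practical for object replacement, scene augmentation, and removal.
Link: https://arxiv.org/abs/2509.08489
Authors: Kaleem Ahmad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages. Preprint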
Abstract:Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before and after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. In a small, single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases with an accuracy above 85% based on our criteria. On a high-end GPU, inpainting makes up 60 to 75% of total runtime under typical guidance and sampling settings, which highlights the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.
[CV-31] Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data
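[Quick read]: This paper addresses the poor performance of Contrastive Self-Supervised Learning (CSSL) on class-imbalanced datasets. CSSL methods usually rely on multi-view assumptions, learning discriminative features from similar and dissimilar sample pairs, and generalize poorly when the data are imbalanced. The core of the solution is a "more than two views" objective justified through mutual information, which extracts representative features of tail classes by separating intra-class from inter-class discriminative characteristics, together with a new loss function that filters out extreme features. Experiments show that the method improves imbalanced classification across a range of self-supervised frameworks (both contrastive and non-contrastive) and sets new state-of-the-art results.
Link: https://arxiv.org/abs/2509.08469
Authors: Yash Kumar Sharma,Vineet Nair,Wilson Naik
Affiliations: University of Hyderabad
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: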
Abstract:The robustness of contrastive self-supervised learning (CSSL) for imbalanced datasets is largely unexplored. CSSL usually makes use of multi-view assumptions to learn discriminatory features via similar and dissimilar data samples. CSSL works well on balanced datasets, but does not generalize well for imbalanced datasets. In a very recent paper, as part of future work, Yann LeCun pointed out that the self-supervised multiview framework can be extended to cases involving more than two views. Taking a cue from this insight, we propose a theoretical justification based on the concept of mutual information to support the more than two views objective and apply it to the problem of dataset imbalance in self-supervised learning. The proposed method helps extract representative characteristics of the tail classes by segregating between intra and inter discriminatory characteristics. We introduce a loss function that helps us to learn better representations by filtering out extreme features. Experimental evaluation on a variety of self-supervised frameworks (both contrastive and non-contrastive) also proves that the more than two view objective works well for imbalanced datasets. We achieve a new state-of-the-art accuracy in self-supervised imbalanced dataset classification (2% improvement in Cifar10-LT using Resnet-18, 5% improvement in Cifar100-LT using Resnet-18, 3% improvement in Imagenet-LT (1k) using Resnet-50).
[CV-32] Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
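[Quick read]: This paper addresses the classification of neutrino interactions from pixelated detector data in high-energy physics (HEP) experiments. Conventional approaches such as convolutional neural networks (CNNs) classify electron and muon neutrino events well but are limited in interpretability and in fusing multimodal information. The key to the solution is to use Vision Language Models (VLMs), specifically a fine-tuned LLaMa 3.2 variant, to classify detector image data, drawing on auxiliary textual or semantic information to make predictions more interpretable and flexible while keeping performance high.
Link: https://arxiv.org/abs/2509.08461
Authors: Dikshant Sagar,Kaiwen Yu,Alejandro Yankelevich,Jianming Bian,Pierre Baldi
Affiliations: University of California, Irvine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); High Energy Physics - Experiment (hep-ex)
Comments: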
Abstract:Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.
[CV-33] First-order State Space Model for Lightweight Image Super-resolution ICASSP2025
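[Quick read]: This paper addresses the lack of attention to the SSM module itself in current lightweight image super-resolution (SR) methods based on State Space Models (SSMs). Existing Mamba-based vision models focus on network architecture and scan-path design and leave the potential of the core SSM module unexplored. The key to the solution is the First-order State Space Model (FSSM), which reworks the SSM computation without adding parameters: it applies a first-order hold condition, derives the corresponding discretized form, and analyzes the cumulative error, which strengthens the modeling of correlations between tokens. Experiments show that FSSM improves MambaIR on five benchmark datasets and surpasses existing lightweight SR methods, achieving state-of-the-art results.
Link: https://arxiv.org/abs/2509.08458
Authors: Yujie Zhu,Xinyi Zhang,Yekai Lu,Guang Yang,Faming Fang,Guixu Zhang
Affiliations: East China Normal University; Guotai Junan Security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025 (Oral)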
Abstract:State space models (SSMs), particularly Mamba, have shown promise in NLP tasks and are increasingly applied to vision tasks. However, most Mamba-based vision models focus on network architecture and scan paths, with little attention to the SSM module. In order to explore the potential of SSMs, we modify the calculation process of SSM without increasing the number of parameters to improve the performance on lightweight super-resolution tasks. In this paper, we introduce the First-order State Space Model (FSSM) to improve the original Mamba module, enhancing performance by incorporating token correlations. We apply a first-order hold condition in SSMs, derive the new discretized form, and analyze cumulative error. Extensive experimental results demonstrate that FSSM improves the performance of MambaIR on five benchmark datasets without additionally increasing the number of parameters, and surpasses current lightweight SR methods, achieving state-of-the-art results.
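For readers unfamiliar with the term, a first-order hold (FOH) assumes the input varies linearly between two samples, whereas the more common zero-order hold (ZOH) assumes it stays constant. The sketch below gives the textbook scalar discretizations of dx/dt = a·x + b·u under both assumptions, for a ≠ 0. It is a generic illustration: the exact discretized form derived in the paper may differ, and the function names are made up here.

```python
import math


def zoh_step(x, u_k, a, b, dt):
    """Zero-order hold: the input is held at u_k over the whole step."""
    a_bar = math.exp(a * dt)
    return a_bar * x + b * (a_bar - 1.0) / a * u_k


def foh_step(x, u_k, u_next, a, b, dt):
    """First-order hold: the input ramps linearly from u_k to u_next."""
    a_bar = math.exp(a * dt)
    gamma0 = (a_bar - 1.0) / a                       # integral of exp(a*s)
    gamma1 = (a_bar - 1.0 - a * dt) / (a * a * dt)   # ramp-weighted integral
    return a_bar * x + b * ((gamma0 - gamma1) * u_k + gamma1 * u_next)


# Ramp input u(t) = t on [0, 0.5] with a = -1, b = 1, x(0) = 0.
# Closed-form solution at t = 0.5 is exp(-0.5) - 1 + 0.5.
exact_state = math.exp(-0.5) - 1.0 + 0.5
foh_state = foh_step(0.0, 0.0, 0.5, a=-1.0, b=1.0, dt=0.5)
zoh_state = zoh_step(0.0, 0.0, a=-1.0, b=1.0, dt=0.5)
```

Because the FOH update depends on both the current and the next input sample, each state update couples neighbouring inputs, which is one way to read the abstract's remark about "incorporating token correlations".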
[CV-34] Spherical Brownian Bridge Diffusion Models for Conditional Cortical Thickness Forecasting
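[Quick read]: This paper addresses the difficulty of forecasting individualized, high-resolution cortical thickness (CTh) trajectories, which matters for early detection of neurodegeneration and for planning interventions. The cerebral cortex has an intricate non-Euclidean geometry, and subject-specific prediction requires integrating multi-modal data, so conventional methods struggle to model how CTh evolves over time. The key to the solution is the Spherical Brownian Bridge Diffusion Model (SBDM), which uses a bidirectional conditional Brownian bridge diffusion process to forecast CTh trajectories at the vertex level of registered cortical surfaces. Its denoising model, the conditional spherical U-Net (CoS-UNet), combines spherical convolutions with dense cross-attention to integrate cortical surfaces and tabular conditions. SBDM significantly reduces prediction error and can generate both factual and counterfactual CTh trajectories, offering a new framework for exploring hypothetical scenarios of cortical development.
Link: https://arxiv.org/abs/2509.08442
Authors: Ivan Stoyanov,Fabian Bongratz,Christian Wachinger
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: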
Abstract:Accurate forecasting of individualized, high-resolution cortical thickness (CTh) trajectories is essential for detecting subtle cortical changes, providing invaluable insights into neurodegenerative processes and facilitating earlier and more precise intervention strategies. However, CTh forecasting is a challenging task due to the intricate non-Euclidean geometry of the cerebral cortex and the need to integrate multi-modal data for subject-specific predictions. To address these challenges, we introduce the Spherical Brownian Bridge Diffusion Model (SBDM). Specifically, we propose a bidirectional conditional Brownian bridge diffusion process to forecast CTh trajectories at the vertex level of registered cortical surfaces. Our technical contribution includes a new denoising model, the conditional spherical U-Net (CoS-UNet), which combines spherical convolutions and dense cross-attention to integrate cortical surfaces and tabular conditions seamlessly. Compared to previous approaches, SBDM achieves significantly reduced prediction errors, as demonstrated by our experiments based on longitudinal datasets from the ADNI and OASIS. Additionally, we demonstrate SBDM’s ability to generate individual factual and counterfactual CTh trajectories, offering a novel framework for exploring hypothetical scenarios of cortical development.
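A Brownian bridge diffusion differs from a standard diffusion in that both endpoints are fixed: the process starts at one sample and ends at another, with noise that vanishes at both ends. The sketch below samples such a bridge between a baseline vector and a follow-up vector, using a commonly used parameterization (linear interpolation weight m = t/T and variance 2·s·(m − m²)). Whether SBDM uses exactly this schedule is not stated in the abstract, so the formula, the scale parameter, and the function names are assumptions.

```python
import math
import random


def bridge_variance(t, total_steps, scale=1.0):
    """Noise variance of the bridge at step t; zero at both endpoints."""
    m = t / total_steps
    return 2.0 * scale * (m - m * m)


def bridge_sample(x0, y, t, total_steps, scale=1.0, rng=None):
    """Sample the bridge state at step t between x0 (t=0) and y (t=T)."""
    rng = rng or random.Random(0)
    m = t / total_steps
    std = math.sqrt(max(bridge_variance(t, total_steps, scale), 0.0))
    return [
        (1.0 - m) * a + m * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, y)
    ]


# Toy per-vertex thickness values (hypothetical numbers, in millimetres).
baseline = [2.5, 2.7, 3.1]
follow_up = [2.3, 2.6, 2.9]
midpoint = bridge_sample(baseline, follow_up, t=5, total_steps=10)
```

The variance peaks at the midpoint of the bridge and shrinks to zero at both ends, so the sampled trajectory is pinned to the observed scans.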
[CV-35] Beyond Distribution Shifts: Adaptive Hyperspectral Image Classification at Test Time
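[Quick read]: This paper addresses the limited robustness of hyperspectral image (HSI) classification models under distribution shifts caused by real-world degradations such as noise, blur, compression, and atmospheric effects. The key to the solution is HyperTTA, a unified framework with three parts: (1) a multi-degradation hyperspectral dataset covering nine representative degradation types, which serves as a benchmark for robust classification; (2) a spectral-spatial transformer classifier (SSTC) with a multi-level receptive field mechanism and label smoothing regularization, which captures multi-scale spatial context and improves generalization; and (3) a lightweight test-time adaptation (TTA) strategy, the confidence-aware entropy-minimized LayerNorm adapter (CELA), which updates only the affine parameters of LayerNorm layers by minimizing prediction entropy on high-confidence unlabeled target samples, enabling reliable adaptation without source data or target annotations.
Link: https://arxiv.org/abs/2509.08436
Authors: Xia Yue,Anfeng Liu,Ning Chen,Chenjia Huang,Hui Liu,Zhou Huang,Leyuan Fang
Affiliations: Central South University; Peking University; Nanjing University of Information Science & Technology; Hunan University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: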
Abstract:Hyperspectral image (HSI) classification models are highly sensitive to distribution shifts caused by various real-world degradations such as noise, blur, compression, and atmospheric effects. To address this challenge, we propose HyperTTA, a unified framework designed to enhance model robustness under diverse degradation conditions. Specifically, we first construct a multi-degradation hyperspectral dataset that systematically simulates nine representative types of degradations, providing a comprehensive benchmark for robust classification evaluation. Based on this, we design a spectral-spatial transformer classifier (SSTC) enhanced with a multi-level receptive field mechanism and label smoothing regularization to jointly capture multi-scale spatial context and improve generalization. Furthermore, HyperTTA incorporates a lightweight test-time adaptation (TTA) strategy, the confidence-aware entropy-minimized LayerNorm adapter (CELA), which updates only the affine parameters of LayerNorm layers by minimizing prediction entropy on high-confidence unlabeled target samples. This confidence-aware adaptation prevents unreliable updates from noisy predictions, enabling robust and dynamic adaptation without access to source data or target annotations. Extensive experiments on two benchmark datasets demonstrate that HyperTTA outperforms existing baselines across a wide range of degradation scenarios, validating the effectiveness of both its classification backbone and the proposed TTA scheme. Code will be made available publicly.
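The confidence-aware part of CELA can be pictured as a filter in front of an entropy loss: only test samples whose top predicted probability clears a threshold contribute to the objective. The sketch below computes that filtered loss from raw logits. It stops short of the actual adaptation step (a gradient update of the LayerNorm affine parameters), and the 0.9 threshold and function names are assumptions rather than values from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confident_entropy_loss(batch_logits, threshold=0.9):
    """Mean entropy over samples whose top probability is >= threshold.

    Returns (loss, number_of_samples_kept). Low-confidence samples are
    skipped so that noisy predictions do not drive the adaptation.
    """
    kept = []
    for logits in batch_logits:
        probs = softmax(logits)
        if max(probs) >= threshold:
            kept.append(entropy(probs))
    loss = sum(kept) / len(kept) if kept else 0.0
    return loss, len(kept)


# One confident prediction and one near-uniform prediction.
loss_value, kept_count = confident_entropy_loss(
    [[10.0, 0.0, 0.0], [0.1, 0.0, 0.05]]
)
```

In a full test-time adaptation loop this loss would be differentiated with respect to the LayerNorm scale and shift parameters only, leaving every other weight frozen.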
[CV-36] LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations
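[Quick Read]: This paper addresses the limited explainability of video-based AI systems in safety-critical domains such as autonomous driving and healthcare, in particular the limited temporal coherence, insufficient robustness, and lack of causal insight of existing explanation methods. The key to the solution is Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a framework that generates counterfactual explanations in latent space with a state-of-the-art diffusion model, which lowers computational cost, and adds a refinement step to produce semantically consistent and temporally coherent explanations, improving their practical utility and trustworthiness.
Link: https://arxiv.org/abs/2509.08422
Authors: Payal Varshney,Adriano Lucieri,Christoph Balada,Sheraz Ahmed,Andreas Dengel
Affiliations: DFKI (German Research Center for Artificial Intelligence)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages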
Abstract:Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.
[CV-37] Sparse BEV Fusion with Self-View Consistency for Multi-View Detection and Tracking
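[Quick Read]: This paper addresses inconsistent object identities in Multi-View Multi-Object Tracking (MVMOT) caused by viewpoint changes, lighting variations, and occlusions, and in particular the feature distortion and non-uniform density introduced by conventional Bird’s-Eye-View (BEV) projection. The key to the solution is the SCFusion framework, which combines three techniques: (1) a sparse transformation that avoids unnatural interpolation during projection; (2) density-aware weighting that adaptively fuses features based on spatial confidence and camera distance; and (3) a multi-view consistency loss that encourages each view to learn discriminative features before fusion. Together these improve the quality of multi-view feature integration and the accuracy and robustness of tracking.
Link: https://arxiv.org/abs/2509.08421
Authors: Keisuke Toida,Taigo Sakai,Naoki Kato,Kazutoyo Yokota,Takeshi Nakamura,Kazuhiro Hotta
Affiliations: Meijo University; Chubu Electric Power Co., Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: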
Abstract:Multi-View Multi-Object Tracking (MVMOT) is essential for applications such as surveillance, autonomous driving, and sports analytics. However, maintaining consistent object identities across multiple cameras remains challenging due to viewpoint changes, lighting variations, and occlusions, which often lead to tracking errors. Recent methods project features from multiple cameras into a unified Bird’s-Eye-View (BEV) space to improve robustness against occlusion. However, this projection introduces feature distortion and non-uniform density caused by variations in object scale with distance. These issues degrade the quality of the fused representation and reduce detection and tracking accuracy. To address these problems, we propose SCFusion, a framework that combines three techniques to improve multi-view feature integration. First, it applies a sparse transformation to avoid unnatural interpolation during projection. Next, it performs density-aware weighting to adaptively fuse features based on spatial confidence and camera distance. Finally, it introduces a multi-view consistency loss that encourages each camera to learn discriminative features independently before fusion. Experiments show that SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular. These results demonstrate that SCFusion effectively mitigates the limitations of conventional BEV projection and provides a robust and accurate solution for multi-view object detection and tracking.
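For readers unfamiliar with the IDF1 metric reported above, here is a short sketch of its standard definition: the harmonic mean of identity precision and identity recall, computed from identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN). The function name and the example counts are illustrative and are not taken from the paper.

```python
def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN); returns 0.0 when undefined."""
    denominator = 2 * idtp + idfp + idfn
    if denominator == 0:
        return 0.0
    return 2 * idtp / denominator

# Hypothetical counts: 90 correctly identified detections,
# 5 identity false positives, 15 identity false negatives.
score = idf1(90, 5, 15)
```

Because the counts come from a single global identity matching, IDF1 rewards trackers that keep the same identity on an object over the whole sequence.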
[CV-38] VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring
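[Quick Read]: This paper addresses the severe noise and blur in vehicle images captured for traffic surveillance under adverse weather, poor lighting, or high-speed motion. Such degradations markedly reduce the accuracy of license plate recognition, especially when the plate occupies only a small region of the image. The key to the solution is the Vertical Residual Autoencoder (VRAE) architecture, whose core innovation is an auxiliary block that injects input-aware features at each encoding stage to guide representation learning and help the network preserve general information. VRAE outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches; compared with an AE of the same depth, it improves PSNR by about 20%, reduces NMSE by about 50%, and improves SSIM by 1%, with only about a 1% increase in parameters.
Link: https://arxiv.org/abs/2509.08392
Authors: Cuong Nguyen,Dung T. Tran,Hong Nguyen,Xuan-Vu Phan,Nam-Phong Nguyen
Affiliations: HUST; VinUni; USC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: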
Abstract:In real-world traffic surveillance, vehicle images captured under adverse weather, poor lighting, or high-speed motion often suffer from severe noise and blur. Such degradations significantly reduce the accuracy of license plate recognition systems, especially when the plate occupies only a small region within the full vehicle image. Restoring these degraded images in a fast, real-time manner is thus a crucial pre-processing step to enhance recognition performance. In this work, we propose a Vertical Residual Autoencoder (VRAE) architecture designed for the image enhancement task in traffic surveillance. The method incorporates an enhancement strategy that employs an auxiliary block, which injects input-aware features at each encoding stage to guide the representation learning process, enabling better general information preservation throughout the network compared to conventional autoencoders. Experiments on a vehicle image dataset with visible license plates demonstrate that our method consistently outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in parameters.
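Two of the reconstruction metrics quoted above, PSNR and NMSE, can be sketched in a few lines. The definitions below are common ones and are an assumption here, since the abstract does not state which variants were used: PSNR assumes 8-bit images with a peak of 255, and NMSE normalizes the squared error by the energy of the reference image. Images are flattened to plain lists of pixel values.

```python
import math

def mse(reference, restored):
    """Mean squared error between two equally sized pixel lists."""
    return sum((r - t) ** 2 for r, t in zip(reference, restored)) / len(reference)

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(reference, restored)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)

def nmse(reference, restored):
    """Squared error normalized by the energy of the reference image."""
    numerator = sum((r - t) ** 2 for r, t in zip(reference, restored))
    denominator = sum(r ** 2 for r in reference)
    return numerator / denominator

# Hypothetical 4-pixel example.
clean = [100, 150, 200, 250]
restored = [110, 140, 210, 240]
quality_db = psnr(clean, restored)
relative_error = nmse(clean, restored)
```

Higher PSNR and lower NMSE both indicate a reconstruction closer to the clean reference.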
[CV-39] Semantic Causality-Aware Vision-Based 3D Occupancy Prediction ICCV2025
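[Quick Read]: This paper addresses the cascading errors that modular pipelines cause in existing vision-based 3D semantic occupancy prediction methods. These methods typically split the 2D-to-3D transformation into modules that are optimized independently or rely on pre-configured inputs, without end-to-end joint training, which limits overall performance and robustness. The key to the solution is a novel causal loss grounded in the principle of 2D-to-3D semantic causality. The loss regulates the gradient flow from 3D voxel representations back to the 2D features, making the whole 2D-to-3D pipeline differentiable, unifying the learning process, and turning previously non-trainable components into learnable ones. On this basis, the authors propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises Channel-Grouped Lifting, Learnable Camera Offsets, and Normalized Convolution, and which improves performance on the Occ3D benchmark as well as robustness to camera perturbations.
Link: https://arxiv.org/abs/2509.08388
Authors: Dubing Chen,Huan Zheng,Yucheng Zhou,Xianfei Li,Wenlong Liao,Tao He,Pai Peng,Jianbing Shen
Affiliations: SKL-IOTSC, CIS, University of Macau; COWAROBOT Co. Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025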
Abstract:Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, demonstrating significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.
[CV-40] Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
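[Quick Read]: This paper addresses the difficulty of disentangling dynamic motion from static content in video data, with the aim of more controllable video representation learning. The core solution is a self-supervised framework that avoids strong prior assumptions: a transformer-based architecture jointly generates flexible implicit features for frame-wise motion and clip-wise content, and a low-bitrate vector quantization acts as an information bottleneck that promotes disentanglement and forms a meaningful discrete motion space. The bitrate-controlled latent motion and content are then fed as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. The method is validated on motion transfer and auto-regressive motion generation with real-world talking head videos, and it generalizes to other video types such as pixel sprites of 2D cartoon characters.
Link: https://arxiv.org/abs/2509.08376
Authors: Xiao Li,Qi Chen,Xiulian Peng,Kai Yu,Xie Chen,Yan Lu
Affiliations: Microsoft Research Asia; Shanghai Jiao Tong University; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: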
Abstract:We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
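To make the idea of a low-bitrate vector-quantization bottleneck concrete, the sketch below shows nearest-codeword quantization and the resulting bitrate. It is a generic illustration under stated assumptions, not the paper's model: the codebook, the number of tokens per frame, and the frame rate are hypothetical, and the codebook here is fixed rather than learned.

```python
import math

def quantize(vector, codebook):
    """Return (index, codeword) of the nearest codeword by squared Euclidean distance."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )
    return best_index, codebook[best_index]

def bits_per_second(codebook_size, tokens_per_frame, frames_per_second):
    """Bitrate of a discrete code: log2(K) bits per token."""
    return math.log2(codebook_size) * tokens_per_frame * frames_per_second

# Hypothetical 2-D motion feature and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
index, codeword = quantize([0.9, 1.2], codebook)

# One token per frame from a 256-entry codebook at 25 fps carries 200 bits/s.
rate = bits_per_second(256, 1, 25)
```

Shrinking the codebook or the number of tokens per frame lowers the bitrate, which limits how much information the motion code can carry.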
[CV-41] InsFusion: Rethink Instance-level LiDAR-Camera Fusion for 3D Object Detection
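[Quick Read]: This paper addresses the noise and errors that gradually accumulate through basic feature extraction, perspective transformation, and feature fusion when fusing multi-view cameras with LiDAR. The key to the solution is the InsFusion framework, which extracts proposals from both the raw and the fused features and uses these proposals to query the raw features, thereby limiting error propagation; it also applies attention mechanisms to the raw features to further suppress the effect of accumulated errors.
Link: https://arxiv.org/abs/2509.08374
Authors: Zhongyu Xia,Hansong Yang,Yongtao Wang
Affiliations: Wangxuan Institute of Computer Technology, Peking University; Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: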
Abstract:Three-dimensional Object Detection from multi-view cameras and LiDAR is a crucial component for autonomous driving and smart transportation. However, in the process of basic feature extraction, perspective transformation, and feature fusion, noise and error will gradually accumulate. To address this issue, we propose InsFusion, which can extract proposals from both raw and fused features and utilizes these proposals to query the raw features, thereby mitigating the impact of accumulated errors. Additionally, it applies attention mechanisms to the raw features, which further reduces the effect of accumulated errors. Experiments on the nuScenes dataset demonstrate that InsFusion is compatible with various advanced baseline methods and delivers new state-of-the-art performance for 3D object detection.
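The idea of letting a proposal query the raw features can be illustrated with plain scaled dot-product attention. This is a generic sketch, assuming the proposal is embedded as a query vector and the raw features are given as key/value vectors; it is not the InsFusion architecture, and all names and values are hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns (output_vector, attention_weights).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Hypothetical proposal embedding and two raw feature vectors.
proposal_query = [1.0, 0.0]
raw_keys = [[10.0, 0.0], [0.0, 10.0]]
raw_values = [[1.0, 0.0], [0.0, 1.0]]
refined, weights = attend(proposal_query, raw_keys, raw_values)
```

Reading from raw rather than fused features means the refined proposal does not inherit errors introduced by the fusion step.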
[CV-42] Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis MICCAI
[Quick Read]: This paper addresses the challenge of accurate, early diagnosis of malignant melanoma. Existing convolutional neural networks (CNNs) for dermoscopic image analysis neglect clinical metadata and require extensive preprocessing, while vision-language models (VLMs) trained on general-domain data struggle to capture clinical specificity. The key to the solution is a retrieval-augmented VLM framework that inserts semantically similar patient cases into the diagnostic prompt, improving classification accuracy and error correction without fine-tuning and providing a more reliable basis for clinical decision support.
Link: https://arxiv.org/abs/2509.08338
Authors: Jihyun Moon,Charmgil Hong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Medical Image Computing and Computer-Assisted Intervention (MICCAI) ISIC Skin Image Analysis Workshop (MICCAI ISIC) 2025; 10 pages
Abstract:Accurate and early diagnosis of malignant melanoma is critical for improving patient outcomes. While convolutional neural networks (CNNs) have shown promise in dermoscopic image analysis, they often neglect clinical metadata and require extensive preprocessing. Vision-language models (VLMs) offer a multimodal alternative but struggle to capture clinical specificity when trained on general-domain data. To address this, we propose a retrieval-augmented VLM framework that incorporates semantically similar patient cases into the diagnostic prompt. Our method enables informed predictions without fine-tuning and significantly improves classification accuracy and error correction over conventional baselines. These results demonstrate that retrieval-augmented prompting provides a robust strategy for clinical decision support.
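Retrieval-augmented prompting of this kind can be sketched as: embed the current case, rank stored cases by cosine similarity, and paste the top matches into the prompt. The sketch below assumes embeddings are already available as plain lists. The case records, field names, and prompt wording are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_embedding, case_bank, k=2):
    """Return the k stored cases most similar to the query embedding."""
    ranked = sorted(
        case_bank,
        key=lambda case: cosine(query_embedding, case["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(patient_text, neighbors):
    """Assemble a diagnostic prompt that lists retrieved cases first."""
    lines = ["Similar prior cases:"]
    for case in neighbors:
        lines.append(f"- {case['text']} -> diagnosis: {case['label']}")
    lines.append(f"Current patient: {patient_text}")
    lines.append("Question: is this lesion melanoma or benign?")
    return "\n".join(lines)

# Hypothetical case bank with toy 2-D embeddings.
case_bank = [
    {"embedding": [1.0, 0.0], "text": "62-year-old, back, irregular border", "label": "melanoma"},
    {"embedding": [0.0, 1.0], "text": "25-year-old, arm, symmetric nevus", "label": "benign"},
    {"embedding": [0.9, 0.2], "text": "70-year-old, scalp, color variegation", "label": "melanoma"},
]
neighbors = retrieve([1.0, 0.1], case_bank, k=2)
prompt = build_prompt("58-year-old, shoulder, asymmetric lesion", neighbors)
```

Since the retrieved cases enter only through the prompt, the VLM itself stays frozen, which is consistent with the no-fine-tuning setting the abstract describes.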
[CV-43] Good Deep Features to Track: Self-Supervised Feature Extraction and Tracking in Visual Odometry
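[Quick Read]: This paper addresses the performance drop of visual-based localization in large-scale, outdoor, and long-term settings. The main challenges are lighting changes, dynamic scenes, and low-texture areas, which degrade feature extraction and tracking and, in turn, the accuracy of motion estimation. The key to the solution is self-supervised learning with task-specific feedback, which makes deep feature extraction and tracking more stable and informative and thereby improves generalization and reliability in challenging environments.
Link: https://arxiv.org/abs/2509.08333
Authors: Sai Puneeth Reddy Gottam,Haoming Zhang,Eivydas Keras
Affiliations: Montanuniversität Leoben; RWTH Aachen University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This short paper has been accepted as a workshop paper at European Conference on Mobile Robots 2025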
Abstract:Visual-based localization has made significant progress, yet its performance often drops in large-scale, outdoor, and long-term settings due to factors like lighting changes, dynamic scenes, and low-texture areas. These challenges degrade feature extraction and tracking, which are critical for accurate motion estimation. While learning-based methods such as SuperPoint and SuperGlue show improved feature coverage and robustness, they still face generalization issues with out-of-distribution data. We address this by enhancing deep feature extraction and tracking through self-supervised learning with task-specific feedback. Our method promotes stable and informative features, improving generalization and reliability in challenging environments.
[CV-44] Boosted Training of Lightweight Early Exits for Optimizing CNN Image Classification Inference
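[Quick Read]: This paper addresses real-time image classification on resource-constrained platforms, where the core challenge is balancing accuracy and inference efficiency under strict latency and power budgets. Conventional early-exit strategies attach auxiliary classifiers to intermediate CNN layers so that “easy” samples can terminate inference early, but their training suffers from a covariate shift: downstream branches are trained on the full dataset, whereas at inference they process only the harder, non-exited samples, which limits performance. The key to the solution is the Boosted Training Scheme for Early Exits (BTS-EE), a sequential scheme in which each branch is trained and calibrated before the next, so that the training distribution of each branch matches the samples it sees at inference. The paper also proposes a lightweight branch architecture based on 1D convolutions and a Class Precision Margin (CPM) calibration method that supports per-class exit thresholds for more reliable exit decisions. Experiments with a ResNet18 backbone show up to a 45% reduction in computation with only a 2% loss in accuracy.
Link: https://arxiv.org/abs/2509.08318
Authors: Yehudit Aperstein,Alexander Apartsin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures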
Abstract:Real-time image classification on resource-constrained platforms demands inference methods that balance accuracy with strict latency and power budgets. Early-exit strategies address this need by attaching auxiliary classifiers to intermediate layers of convolutional neural networks (CNNs), allowing “easy” samples to terminate inference early. However, conventional training of early exits introduces a covariate shift: downstream branches are trained on full datasets, while at inference they process only the harder, non-exited samples. This mismatch limits efficiency–accuracy trade-offs in practice. We introduce the Boosted Training Scheme for Early Exits (BTS-EE), a sequential training approach that aligns branch training with inference-time data distributions. Each branch is trained and calibrated before the next, ensuring robustness under selective inference conditions. To further support embedded deployment, we propose a lightweight branch architecture based on 1D convolutions and a Class Precision Margin (CPM) calibration method that enables per-class threshold tuning for reliable exit decisions. Experiments on the CINIC-10 dataset with a ResNet18 backbone demonstrate that BTS-EE consistently outperforms non-boosted training across 64 configurations, achieving up to 45 percent reduction in computation with only 2 percent accuracy degradation. These results expand the design space for deploying CNNs in real-time image processing systems, offering practical efficiency gains for applications such as industrial inspection, embedded vision, and UAV-based monitoring.
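The sequential idea described above can be sketched as a routing step: a branch exits the samples it is confident about, and only the remaining samples are passed on, which is also the set the next branch would be trained on. This is a minimal sketch under stated assumptions. The per-class thresholds are given as inputs (the CPM calibration that would produce them is not shown), and the function names are hypothetical.

```python
def should_exit(probs, class_thresholds):
    """Exit if the predicted class probability reaches that class's threshold.

    Returns (exit_decision, predicted_class).
    """
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return probs[predicted] >= class_thresholds[predicted], predicted

def route(samples, branch_probs, class_thresholds):
    """Split samples into those exiting at this branch and those continuing.

    The 'remaining' list is what a later branch sees at inference time,
    so a boosted scheme would train that branch on exactly these samples.
    """
    exited, remaining = [], []
    for sample, probs in zip(samples, branch_probs):
        leave, predicted = should_exit(probs, class_thresholds)
        if leave:
            exited.append((sample, predicted))
        else:
            remaining.append(sample)
    return exited, remaining

# Hypothetical two-class example with per-class thresholds.
samples = ["img_a", "img_b", "img_c"]
branch_probs = [[0.95, 0.05], [0.60, 0.40], [0.20, 0.80]]
exited, remaining = route(samples, branch_probs, class_thresholds=[0.9, 0.7])
```

Applying `route` branch by branch during training reproduces the inference-time sample distribution at each exit, which is the mismatch the boosted scheme is designed to remove.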
[CV-45] SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training MICCAI2025
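[Quick Read]: This paper addresses two challenges in medical imaging: the spatial sparsity of lesions, and the complex, implicit relationships between the pathological descriptions in report sentences and the corresponding image sub-regions, with a focus on multimodal understanding of chest CT scans. The core solution is the Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework. First, multi-modal masked modeling optimizes the encoder to capture precise low-level semantics in the image. Second, a similarity-driven alignment mechanism lets the model adaptively select and align the image patches that correspond to each sentence of the report. Finally, a cross-granularity fusion module integrates multimodal information across the instance level and the word-patch level, which helps the model identify sparse lesion structures and improves performance on multi-scale downstream tasks such as classification and segmentation.
Link: https://arxiv.org/abs/2509.08311
Authors: Rongsheng Wang,Fenghe Tang,Qingsong Yao,Rui Yan,Xu Zhang,Zhen Huang,Haoran Lai,Zhiyang He,Xiaodong Tao,Zihang Jiang,Shaohua Kevin Zhou
Affiliations: University of Science and Technology of China; Institute of Automation, Chinese Academy of Sciences; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025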
Abstract:Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at this https URL.
[CV-46] An Open Benchmark Dataset for GeoAI Foundation Models for Oil Palm Mapping in Indonesia
[Quick Read]: This paper addresses deforestation driven by oil palm cultivation in Indonesia by providing a high-accuracy, openly accessible geospatial dataset that supports sustainability efforts and emerging regulatory frameworks. The key to the solution is expert labeling of high-resolution satellite imagery from 2020 to 2024, with quality ensured through multi-interpreter consensus and field validation. The resulting polygon-based annotations cover a range of agro-ecological zones and use a hierarchical typology that distinguishes oil palm planting stages and similar perennial crops, providing high-quality training and benchmarking data for both conventional convolutional neural networks (CNNs) and newer geospatial foundation models.
Link: https://arxiv.org/abs/2509.08303
Authors: M. Warizmi Wafiq,Peter Cutter,Ate Poortinga,Daniel Marc G. dela Torre,Karis Tenneson,Vanna Teck,Enikoe Bihari,Chanarun Saisaward,Weraphong Suaruang,Andrea McMahon,Andi Vika Faradiba Muin,Karno B. Batiran,Chairil A,Nurul Qomar,Arya Arismaya Metananda,David Ganz,David Saah
Affiliations: RECOFTC – The Center for People and Forests; Spatial Informatics Group, LLC; Faculty of Forestry, Hasanuddin University; Department of Forestry, Faculty of Agriculture, Universitas Riau; University of San Francisco
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Oil palm cultivation remains one of the leading causes of deforestation in Indonesia. To better track and address this challenge, detailed and reliable mapping is needed to support sustainability efforts and emerging regulatory frameworks. We present an open-access geospatial dataset of oil palm plantations and related land cover types in Indonesia, produced through expert labeling of high-resolution satellite imagery from 2020 to 2024. The dataset provides polygon-based, wall-to-wall annotations across a range of agro-ecological zones and includes a hierarchical typology that distinguishes oil palm planting stages as well as similar perennial crops. Quality was ensured through multi-interpreter consensus and field validation. The dataset was created using wall-to-wall digitization over large grids, making it suitable for training and benchmarking both conventional convolutional neural networks and newer geospatial foundation models. Released under a CC-BY license, it fills a key gap in training data for remote sensing and aims to improve the accuracy of land cover types mapping. By supporting transparent monitoring of oil palm expansion, the resource contributes to global deforestation reduction goals and follows FAIR data principles.
[CV-47] Foundation Models for Autonomous Driving Perception: A Survey Through Core Capabilities
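[Quick Read]: This survey addresses the key challenges that foundation models face in autonomous driving perception, namely generalization, scalability, and robustness to distributional shifts. Its core contribution is a new taxonomy organized around four essential capabilities: generalized knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning. Unlike traditional method-centric surveys, the framework focuses on conceptual design principles and offers systematic guidance for building foundation models that perform robustly in dynamic driving environments. It also highlights open deployment problems, including real-time operation, computational efficiency, and reliability issues such as hallucinations and out-of-distribution failures.
Link: https://arxiv.org/abs/2509.08302
Authors: Rajendramayavan Sathyam,Yueqi Li
Affiliations: Zoox Inc.
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 14 figures, accepted at IEEE Open Journal of Vehicular Technology (OJVT)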
Abstract:Foundation models are revolutionizing autonomous driving perception, transitioning the field from narrow, task-specific deep learning models to versatile, general-purpose architectures trained on vast, diverse datasets. This survey examines how these models address critical challenges in autonomous perception, including limitations in generalization, scalability, and robustness to distributional shifts. The survey introduces a novel taxonomy structured around four essential capabilities for robust performance in dynamic driving environments: generalized knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning. For each capability, the survey elucidates its significance and comprehensively reviews cutting-edge approaches. Diverging from traditional method-centric surveys, our unique framework prioritizes conceptual design principles, providing a capability-driven guide for model development and clearer insights into foundational aspects. We conclude by discussing key challenges, particularly those associated with the integration of these capabilities into real-time, scalable systems, and broader deployment challenges related to computational demands and ensuring model reliability against issues like hallucinations and out-of-distribution failures. The survey also outlines crucial future research directions to enable the safe and effective deployment of foundation models in autonomous driving systems.
[CV-48] Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection
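[Quick Read]: This paper addresses three key problems in Weakly Supervised Object Detection (WSOD). First, pseudo ground-truth (GT) boxes either focus only on discriminative parts and miss the whole object, or cover the whole object but fail to separate adjacent instances of the same class. Second, the foundational WSDDN architecture lacks a background class representation for each proposal and has a large semantic gap between its branches. Third, discarding ignored proposals during optimization slows convergence. The solution has three parts: a Heatmap-Guided Proposal Selector (HGPS) that pre-selects proposals with dual thresholds on heatmaps, so that pseudo GT boxes both cover the full object and distinguish adjacent intra-class instances; a Weakly Supervised Basic Detection Network (WSBDN) that adds a background class representation to each proposal and uses heatmaps for pre-supervision to narrow the semantic gap between branches; and a negative certainty supervision loss on ignored proposals that accelerates convergence.
Link: https://arxiv.org/abs/2509.08289
Authors: Yuelin Guo,Haoyu He,Zhiyuan Chen,Zitong Huang,Renhao Lu,Lu Shi,Zejun Wang,Weizhe Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication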
Abstract:Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we first design a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. Finally, we introduce a negative certainty supervision loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of our framework. We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods. Our code is publicly available at this https URL.
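One plausible reading of dual-threshold pre-selection is sketched below. The abstract does not specify the HGPS algorithm, so everything here is an assumption made for illustration: a proposal is kept only if it contains at least one strongly activated pixel (high threshold) and is mostly filled with moderately activated pixels (low threshold). The threshold values, the `min_fill` ratio, and the box convention `(x0, y0, x1, y1)` with exclusive upper bounds are all hypothetical.

```python
def preselect(heatmap, proposals, low=0.4, high=0.8, min_fill=0.5):
    """Keep proposals that contain a high-activation seed and are mostly
    covered by activations above the low threshold.

    heatmap: 2-D list indexed as heatmap[row][col].
    proposals: list of (x0, y0, x1, y1) boxes, upper bounds exclusive.
    """
    kept = []
    for x0, y0, x1, y1 in proposals:
        cells = [heatmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        if not cells:
            continue  # skip empty boxes
        has_seed = any(v >= high for v in cells)
        fill = sum(1 for v in cells if v >= low) / len(cells)
        if has_seed and fill >= min_fill:
            kept.append((x0, y0, x1, y1))
    return kept

# Hypothetical 4x4 class heatmap with one object near the center.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.6, 0.9, 0.0],
    [0.0, 0.5, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
proposals = [(1, 1, 3, 3), (0, 0, 4, 4), (1, 2, 3, 3)]
selected = preselect(heatmap, proposals)
```

In this toy case the tight box survives, the full-image box is rejected for containing too much background, and the partial box is rejected for missing the strongly activated seed.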
[CV-49] Generalized Zero-Shot Learning for Point Cloud Segmentation with Evidence-Based Dynamic Calibration AAAI2025
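[Quick Read]: This paper addresses overconfident predictions toward classes seen during training in generalized zero-shot semantic segmentation of 3D point clouds. The problem is especially pronounced in 3D, where training data is typically smaller in scale than in image-based tasks. The key to the solution is E3DPC-GZSL, which integrates an evidence-based uncertainty estimator into the classifier and adjusts the prediction probabilities with a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty, reducing the bias toward seen classes. It also introduces a novel training strategy that refines the semantic space by merging learnable parameters with text-derived features, improving the modeling of unseen classes.
Link: https://arxiv.org/abs/2509.08280
Authors: Hyeonseok Kim,Byeongkeun Kang,Yeejin Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 12 figures, AAAI 2025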
Abstract:Generalized zero-shot semantic segmentation of 3D point clouds aims to classify each point into both seen and unseen classes. A significant challenge with these models is their tendency to make biased predictions, often favoring the classes encountered during training. This problem is more pronounced in 3D applications, where the scale of the training data is typically smaller than in image-based tasks. To address this problem, we propose a novel method called E3DPC-GZSL, which reduces overconfident predictions towards seen classes without relying on separate classifiers for seen and unseen data. E3DPC-GZSL tackles the overconfidence problem by integrating an evidence-based uncertainty estimator into a classifier. This estimator is then used to adjust prediction probabilities using a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty. In addition, E3DPC-GZSL introduces a novel training strategy that improves uncertainty estimation by refining the semantic space. This is achieved by merging learnable parameters with text-derived features, thereby improving model optimization for unseen data. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets, including ScanNet v2 and S3DIS.
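The combination of evidential uncertainty and calibrated stacking can be sketched as follows. The uncertainty formula is the standard subjective-logic one for a Dirichlet with parameters alpha = evidence + 1, namely u = K / sum(alpha). How the paper maps uncertainty to its stacking factor is not stated in the abstract, so the proportional penalty and the `base_factor` value below are assumptions made for illustration.

```python
def dirichlet_uncertainty(evidence):
    """Subjective-logic uncertainty u = K / sum(alpha), with alpha = evidence + 1."""
    alphas = [e + 1.0 for e in evidence]
    return len(alphas) / sum(alphas)

def calibrate(scores, seen_mask, uncertainty, base_factor=0.5):
    """Subtract an uncertainty-scaled penalty from seen-class scores only.

    Scaling the penalty by the pointwise uncertainty is a hypothetical choice.
    """
    penalty = base_factor * uncertainty
    return [s - penalty if seen else s for s, seen in zip(scores, seen_mask)]

def predict(scores, seen_mask, evidence, base_factor=0.5):
    """Return (predicted_class, uncertainty) for one point."""
    u = dirichlet_uncertainty(evidence)
    adjusted = calibrate(scores, seen_mask, u, base_factor)
    return max(range(len(adjusted)), key=lambda i: adjusted[i]), u

# Hypothetical point: classes 0 and 1 are seen, class 2 is unseen.
scores = [0.5, 0.3, 0.4]
seen_mask = [True, True, False]
uncertain_label, u_high = predict(scores, seen_mask, evidence=[0.0, 0.0, 0.0])
confident_label, u_low = predict(scores, seen_mask, evidence=[30.0, 0.0, 0.0])
```

With no evidence the penalty is large enough to flip the prediction to the unseen class, while strong evidence for a seen class leaves the prediction unchanged.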
[CV-50] Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features
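[Quick Read]: This paper addresses the performance degradation of Vision Language Models (VLMs) on certain visual tasks caused by inherent biases learned during training, particularly when a task requires focusing on specific image regions or is sensitive to detail (for example, counting the stars on a modified American flag). The key to the solution is a multi-dimensional examination framework that systematically analyzes how characteristics of the input data, including both the image and the accompanying prompt, affect model behavior. Using open-source VLMs, the authors measure how attention values change with parameters such as image size, number of objects, background color, and prompt specificity, revealing how sensitive and how predictable the model responses are with respect to these factors.
Link: https://arxiv.org/abs/2509.08266
Authors: Saurav Sengupta,Nazanin Moradinasab,Jiebei Liu,Donald E. Brown
Affiliations: University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: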
Abstract:Recent research on Vision Language Models (VLMs) suggests that they rely on inherent biases learned during training to respond to questions about visual properties of an image. These biases are exacerbated when VLMs are asked highly specific questions that require focusing on specific areas of the image. For example, a VLM tasked with counting stars on a modified American flag (e.g., with more than 50 stars) will often disregard the visual evidence and fail to answer accurately. We build upon this research and develop a multi-dimensional examination framework to systematically determine which characteristics of the input data, including both the image and the accompanying prompt, lead to such differences in performance. Using open-source VLMs, we further examine how attention values fluctuate with varying input parameters (e.g., image size, number of objects in the image, background color, prompt specificity). This research aims to learn how the behavior of vision language models changes and to explore methods for characterizing such changes. Our results suggest, among other things, that even minor modifications in image characteristics and prompt specificity can lead to large changes in how a VLM formulates its answer and, subsequently, its overall performance.
[CV-51] Hyperspectral Mamba for Hyperspectral Object Tracking
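[Quick Read]: This paper addresses the shortcomings of existing hyperspectral object tracking methods in capturing intrinsic spectral information, temporal dependencies, and cross-depth interactions. The key to the solution is HyMamba, a new network architecture built on state space modules (SSMs). Its core is the Spectral State Integration (SSI) module, which progressively refines and propagates spectral features by fusing cross-depth and temporal spectral information. Embedded in each SSI, the Hyperspectral Mamba (HSM) module learns spatial and spectral information synchronously through three directional scanning SSMs. Based on SSI and HSM, HyMamba builds joint features from false-color and hyperspectral inputs and enhances them through interaction with the original spectral features extracted from raw hyperspectral images.
Link: https://arxiv.org/abs/2509.08265
Authors: Long Gao,Yunhe Zhang,Yan Jiang,Weiying Xie,Yunsong Li
Affiliations: Xidian University; The University of Sheffield
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: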
Abstract:Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba) is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves an AUC score of 73.0% and a DP@20 score of 96.3% on the HOTC2020 dataset. The code will be released at this https URL.
[CV-52] EVDI++: Event-based Video Deblurring and Interpolation via Self-Supervised Learning
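[Quick Read]: This paper addresses the visible blur and inter-frame information loss that frame-based cameras produce under long exposure times, both of which significantly degrade video quality. Its solution is EVDI++, a unified self-supervised framework for Event-based Video Deblurring and Interpolation that exploits the high temporal resolution of event cameras to mitigate motion blur and predict intermediate frames. The key contributions are: (1) a Learnable Double Integral (LDI) network that estimates the mapping between reference frames and sharp latent images; (2) a learning-based division reconstruction module that refines the coarse results, improves training efficiency, and supports image conversion across varying exposure intervals; (3) an adaptive parameter-free fusion strategy that produces the final output using the confidence embedded in the LDI outputs of concurrent events; and (4) a self-supervised learning framework that trains on real-world blurry videos and events by exploiting the mutual constraints among blurry frames, latent images, and event streams, together with a newly collected real-world dataset of blurry images and events.
Link: https://arxiv.org/abs/2509.08260
Authors: Chi Zhang, Xiang Zhang, Chenxu Jiang, Gui-Song Xia, Lei Yu
Affiliations: Wuhan University; Peng Cheng Laboratory; ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages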
Abstract:Frame-based cameras with extended exposure times often produce perceptible visual blurring and information loss between frames, significantly degrading video quality. To address this challenge, we introduce EVDI++, a unified self-supervised framework for Event-based Video Deblurring and Interpolation that leverages the high temporal resolution of event cameras to mitigate motion blur and enable intermediate frame prediction. Specifically, the Learnable Double Integral (LDI) network is designed to estimate the mapping relation between reference frames and sharp latent images. Then, we refine the coarse results and optimize overall training efficiency by introducing a learning-based division reconstruction module, enabling images to be converted with varying exposure intervals. We devise an adaptive parameter-free fusion strategy to obtain the final results, utilizing the confidence embedded in the LDI outputs of concurrent events. A self-supervised learning framework is proposed to enable network training with real-world blurry videos and events by exploring the mutual constraints among blurry frames, latent images, and event streams. We further construct a dataset with real-world blurry images and events using a DAVIS346c camera, demonstrating the generalizability of the proposed EVDI++ in real-world scenarios. Extensive experiments on both synthetic and real-world datasets show that our method achieves state-of-the-art performance in video deblurring and interpolation tasks.
[CV-53] Symmetry Interactive Transformer with CNN Framework for Diagnosis of Alzheimer's Disease Using Structural MRI
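[Quick Read]: This paper addresses a performance bottleneck in Alzheimer's disease (AD) diagnosis: the structural asymmetry induced by left and right brain atrophy is often ignored or insufficiently modeled. Existing methods based on convolutional neural networks (CNNs) and general Transformers typically rely on pretraining or overlook this pathology-related asymmetry, which limits their sensitivity to AD-specific atrophy regions. The key to the solution is an end-to-end network consisting of a 3D CNN encoder and a Symmetry Interactive Transformer (SIT). An inter-equal grid block fetch operation aligns the corresponding left- and right-hemisphere features, and the SIT module then focuses the model on regions of structural asymmetry caused by AD, improving diagnostic accuracy and interpretability.
Link: https://arxiv.org/abs/2509.08243
Authors: Zheng Yang, Yanteng Zhang, Xupeng Kou, Yang Liu, Chao Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: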
Abstract:Structural magnetic resonance imaging (sMRI) combined with deep learning has achieved remarkable progress in the prediction and diagnosis of Alzheimer’s disease (AD). Existing studies have used CNN and transformer to build a well-performing network, but most of them are based on pretraining or ignoring the asymmetrical character caused by brain disorders. We propose an end-to-end network for the detection of disease-based asymmetric induced by left and right brain atrophy which consist of 3D CNN Encoder and Symmetry Interactive Transformer (SIT). Following the inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and subsequently fed into the SIT for diagnostic analysis. SIT can help the model focus more on the regions of asymmetry caused by structural changes, thus improving diagnostic performance. We evaluated our method based on the ADNI dataset, and the results show that the method achieves better diagnostic accuracy (92.5%) compared to several CNN methods and CNNs combined with a general transformer. The visualization results show that our network pays more attention in regions of brain atrophy, especially for the asymmetric pathological characteristics induced by AD, demonstrating the interpretability and effectiveness of the method.
[CV-54] RepViT-CXR: A Channel Replication Strategy for Vision Transformers in Chest X-ray Tuberculosis and Pneumonia Classification
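[Quick Read]: This paper addresses the difficulty of adapting Vision Transformers (ViTs) to chest X-ray (CXR) images: most ViTs are pretrained on three-channel natural images and require three-channel inputs, whereas CXR images are inherently single-channel grayscale, so feeding them in directly causes a mismatch. The key to the solution is RepViT-CXR, a simple channel replication strategy that copies the single-channel CXR image into three channels, making it compatible with the ViT architecture without introducing additional information loss. The method is reported to outperform prior approaches on three benchmark datasets, which supports its effectiveness and practicality for ViT-based medical image analysis.
Link: https://arxiv.org/abs/2509.08234
Authors: Faisal Ahmed
Affiliations: Embry-Riddle Aeronautical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures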
Abstract:Chest X-ray (CXR) imaging remains one of the most widely used diagnostic tools for detecting pulmonary diseases such as tuberculosis (TB) and pneumonia. Recent advances in deep learning, particularly Vision Transformers (ViTs), have shown strong potential for automated medical image analysis. However, most ViT architectures are pretrained on natural images and require three-channel inputs, while CXR scans are inherently grayscale. To address this gap, we propose RepViT-CXR, a channel replication strategy that adapts single-channel CXR images into a ViT-compatible format without introducing additional information loss. We evaluate RepViT-CXR on three benchmark datasets. On the TB-CXR dataset, our method achieved an accuracy of 99.9% and an AUC of 99.9%, surpassing prior state-of-the-art methods such as Topo-CXR (99.3% accuracy, 99.8% AUC). For the Pediatric Pneumonia dataset, RepViT-CXR obtained 99.0% accuracy, with 99.2% recall, 99.3% precision, and an AUC of 99.0%, outperforming strong baselines including DCNN and VGG16. On the Shenzhen TB dataset, our approach achieved 91.1% accuracy and an AUC of 91.2%, marking a performance improvement over previously reported CNN-based methods. These results demonstrate that a simple yet effective channel replication strategy allows ViTs to fully leverage their representational power on grayscale medical imaging tasks. RepViT-CXR establishes a new state of the art for TB and pneumonia detection from chest X-rays, showing strong potential for deployment in real-world clinical screening systems.
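The channel replication idea described in this abstract can be illustrated with a few lines of NumPy. This is a minimal sketch and not the authors' code: the function name `replicate_channels`, the channels-first layout, and the 4x4 stand-in image are all assumptions made for illustration.

```python
import numpy as np


def replicate_channels(gray: np.ndarray, n_channels: int = 3) -> np.ndarray:
    """Copy a single-channel image of shape (H, W) into shape (n_channels, H, W)."""
    if gray.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")
    # Every output channel is an identical copy of the input plane,
    # so no pixel value is altered or discarded.
    return np.repeat(gray[np.newaxis, :, :], n_channels, axis=0)


# Hypothetical 4x4 array standing in for a normalized chest X-ray.
cxr = np.arange(16, dtype=np.float32).reshape(4, 4) / 15.0
vit_input = replicate_channels(cxr)
print(vit_input.shape)
```

Because the three planes are exact copies, the original image can be recovered from any one of them, which is one way to read the abstract's "no additional information loss" claim.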
[CV-55] GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation
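[Quick Read]: This paper addresses the difficulty of detecting fatal violent incidents, such as shootings and stabbings, in video anomaly detection (VAD), where such events are rare and data collection raises ethical issues. The key to the solution is GTA-Crime, a synthetic fatal-incident video dataset with an accompanying generation framework. The framework uses the Grand Theft Auto 5 (GTA5) environment to generate fatal scenes from multiple CCTV viewpoints under varied conditions (action type, weather, time of day, and viewpoint). The paper also proposes a snippet-level domain adaptation strategy based on Wasserstein adversarial training to narrow the feature gap between the synthetic data and real-world data such as UCF-Crime, which consistently improves the accuracy of real-world fatal violence detection.
Link: https://arxiv.org/abs/2509.08232
Authors: Seongho Kim, Sejong Ryu, Hyoukjun You, Je Hyeong Hong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: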
Abstract:Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft Auto 5 (GTA5). Our dataset contains fatal situations such as shootings and stabbings, captured from CCTV multiview perspectives under diverse conditions including action types, weather, time of day, and viewpoints. To address the rarity of such scenarios, we also release a framework for generating these types of videos. Additionally, we propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to bridge the gap between synthetic GTA-Crime features and real-world features like UCF-Crime. Experimental results validate our GTA-Crime dataset and demonstrate that incorporating GTA-Crime with our domain adaptation strategy consistently enhances real world fatal violence detection accuracy. Our dataset and the data generation framework are publicly available at this https URL.
[CV-56] Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing
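[Quick Read]: This paper addresses the unsustainable power consumption of high-resolution, high-speed video imaging, in particular the energy bottleneck that the current processing model would face with future gigapixel cameras running at 100-1000 fps. Its core solution is an Ultra-Sparse Sampling (USS) strategy in which, at each spatial location, only one sub-frame is set to 1 and all others are set to 0; compared with random sampling, this strategy offers a higher dynamic range. Ideally the USS measurement can be decomposed into sub-measurements, but the mismatch between the DMD and the CCD prevents a perfect decomposition. The key innovation is therefore BSTFormer, a sparse Transformer that combines local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement and recover high-quality high-speed frames. Because its exposure time is fixed, USS is also well suited to implementing a complete video SCI system on chip.
Link: https://arxiv.org/abs/2509.08228
Authors: Miao Cao, Siming Zheng, Lishun Wang, Ziyang Chen, David Brady, Xin Yuan
Affiliations: Zhejiang University; Westlake University; vivo Mobile Communication Co., Ltd; Wyant College of Optical Sciences, University of Arizona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: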
Abstract:Digital cameras consume ~0.1 microjoule per pixel to capture and encode video, resulting in a power usage of ~20W for a 4K sensor operating at 30 fps. Imagining gigapixel cameras operating at 100-1000 fps, the current processing model is unsustainable. To address this, physical layer compressive measurement has been proposed to reduce power consumption per pixel by 10-100X. Video Snapshot Compressive Imaging (SCI) introduces high frequency modulation in the optical sensor layer to increase effective frame rate. A commonly used sampling strategy of video SCI is Random Sampling (RS) where each mask element value is randomly set to be 0 or 1. Similarly, image inpainting (I2P) has demonstrated that images can be recovered from a fraction of the image pixels. Inspired by I2P, we propose Ultra-Sparse Sampling (USS) regime, where at each spatial location, only one sub-frame is set to 1 and all others are set to 0. We then build a Digital Micro-mirror Device (DMD) encoding system to verify the effectiveness of our USS strategy. Ideally, we can decompose the USS measurement into sub-measurements for which we can utilize I2P algorithms to recover high-speed frames. However, due to the mismatch between the DMD and CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer, a sparse TransFormer that utilizes local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement. Extensive results on both simulated and real-world data show that our method significantly outperforms all previous state-of-the-art algorithms. Additionally, an essential advantage of the USS strategy is its higher dynamic range than that of the RS strategy. Finally, from the application perspective, the USS strategy is a good choice to implement a complete video SCI system on chip due to its fixed exposure time.
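The USS regime and the ideal decomposition mentioned in this abstract can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names (`uss_mask`, `snapshot`, `decompose`) are hypothetical, the exposed sub-frame is assumed to be chosen uniformly at random per pixel, and the DMD/CCD mismatch that the paper reports is not modeled.

```python
import numpy as np


def uss_mask(n_subframes: int, height: int, width: int, seed: int = 0) -> np.ndarray:
    """Binary mask of shape (T, H, W) with exactly one 1 per spatial location."""
    rng = np.random.default_rng(seed)
    # Assumption: the exposed sub-frame is drawn uniformly at random per pixel.
    chosen = rng.integers(0, n_subframes, size=(height, width))
    mask = np.zeros((n_subframes, height, width), dtype=np.uint8)
    rows, cols = np.indices((height, width))
    mask[chosen, rows, cols] = 1
    return mask


def snapshot(frames: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Sum the masked sub-frames into a single 2-D measurement."""
    return (frames * mask).sum(axis=0)


def decompose(measurement: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Ideal case: split the measurement into T sparse sub-measurements."""
    return measurement[np.newaxis, :, :] * mask


rng = np.random.default_rng(1)
frames = rng.random((8, 16, 16))  # hypothetical stack of 8 high-speed sub-frames
mask = uss_mask(8, 16, 16)
measurement = snapshot(frames, mask)
sub_measurements = decompose(measurement, mask)
```

In this idealized setting each sub-measurement is simply one frame observed at a sparse set of pixels, which is why inpainting-style recovery becomes applicable.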
[CV-57] Lightweight Deep Unfolding Networks with Enhanced Robustness for Infrared Small Target Detection
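[Quick Read]: This paper addresses the parameter redundancy and limited noise robustness of existing deep unfolding networks (DUNs) for infrared small target detection (ISTD). The key to the solution is L-RPCANet, a lightweight framework based on robust principal component analysis (RPCA). First, a hierarchical bottleneck structure reduces and then increases the channel dimension for the single-channel infrared input, with bottleneck layers in each module extracting features, which markedly reduces the parameter count. Second, an embedded noise reduction module improves robustness against complex noise. Finally, squeeze-and-excitation networks (SENets) serve as a channel attention mechanism that weights the importance of features across channels, delivering strong detection performance while keeping the model lightweight.
Link: https://arxiv.org/abs/2509.08205
Authors: Jingjing Liu, Yinchao Han, Xianchao Xiu, Jianhua Zhang, Wanquan Liu
Affiliations: Shanghai Key Laboratory of Automobile Intelligent Network Interaction Chip and System, School of Microelectronics, Shanghai University, Shanghai 200444, China; School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China; School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou 510275, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: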
Abstract:Infrared small target detection (ISTD) is one of the key techniques in image processing. Although deep unfolding networks (DUNs) have demonstrated promising performance in ISTD due to their model interpretability and data adaptability, existing methods still face significant challenges in parameter lightweightness and noise robustness. In this regard, we propose a highly lightweight framework based on robust principal component analysis (RPCA) called L-RPCANet. Technically, a hierarchical bottleneck structure is constructed to reduce and increase the channel dimension in the single-channel input infrared image to achieve channel-wise feature refinement, with bottleneck layers designed in each module to extract features. This reduces the number of channels in feature extraction and improves the lightweightness of network parameters. Furthermore, a noise reduction module is embedded to enhance the robustness against complex noise. In addition, squeeze-and-excitation networks (SENets) are leveraged as a channel attention mechanism to focus on the varying importance of different features across channels, thereby achieving excellent performance while maintaining both lightweightness and robustness. Extensive experiments on the ISTD datasets validate the superiority of our proposed method compared with state-of-the-art methods covering RPCANet, DRPCANet, and RPCANet++. The code will be available at this https URL.
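For readers unfamiliar with the squeeze-and-excitation channel attention mentioned in this abstract, the generic mechanism can be sketched in NumPy. This shows the standard SE computation only; how L-RPCANet wires it into its modules is not specified here, and the function name, shapes, and random weights below are hypothetical.

```python
import numpy as np


def squeeze_excite(features, w_reduce, w_expand):
    """Reweight the channels of a (C, H, W) feature map with per-channel gates."""
    squeezed = features.mean(axis=(1, 2))                # squeeze: one scalar per channel
    hidden = np.maximum(w_reduce @ squeezed, 0.0)        # bottleneck layer with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))   # sigmoid gate per channel
    return features * gates[:, np.newaxis, np.newaxis], gates


rng = np.random.default_rng(0)
channels, reduction = 8, 4
features = rng.standard_normal((channels, 5, 5))
# Hypothetical random weights; in a trained network these are learned.
w_reduce = rng.standard_normal((channels // reduction, channels))
w_expand = rng.standard_normal((channels, channels // reduction))
reweighted, gates = squeeze_excite(features, w_reduce, w_expand)
```

The two small matrices are the only parameters, which is why this form of attention adds little to a model's parameter count.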
[CV-58] Quadrotor Navigation using Reinforcement Learning with Privileged Information
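[Quick Read]: This paper addresses quadrotor navigation around large obstacles, especially when the goal location is blocked by large walls or terrain, a setting in which prior learning-based methods perform poorly. The key to the solution is to use time-of-arrival (ToA) maps as privileged information together with a yaw alignment loss that guides the quadrotor around large obstacles. Combining differentiable simulation with novel loss functions, the method achieves an 86% success rate in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends, outperforming baseline strategies by 34%, and it was also validated onboard a quadrotor in cluttered outdoor environments.
Link: https://arxiv.org/abs/2509.08177
Authors: Jonathan Lee, Abhishek Rathod, Kshitij Goel, John Stecklein, Wennie Tabib
Affiliations: Robotics Institute, Carnegie Mellon University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: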
Abstract:This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.
[CV-59] APML: Adaptive Probabilistic Matching Loss for Robust 3D Point Cloud Reconstruction
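[Quick Read]: This paper addresses a weakness of the loss functions commonly used for point cloud prediction tasks such as shape completion and generation. Chamfer Distance, HyperCD, and InfoCD rely on nearest-neighbor matching, which produces many-to-one correspondences and so causes point congestion in dense regions and poor coverage in sparse regions; their index selection is also non-differentiable, which may affect gradient-based optimization. The key to the solution is the Adaptive Probabilistic Matching Loss (APML), a fully differentiable approximation of one-to-one matching that applies Sinkhorn iterations to a temperature-scaled similarity matrix derived from pairwise distances. The temperature is computed analytically to guarantee a minimum assignment probability, which removes manual tuning. APML runs in near-quadratic time (versus the cubic complexity of EMD), yields faster convergence and a better spatial distribution, especially in low-density regions, and achieves improved or on-par quantitative performance on several models (PoinTr, PCN, FoldingNet, CSI2PC).
Link: https://arxiv.org/abs/2509.08104
Authors: Sasan Sharifipour, Constantino Álvarez Casado, Mohammad Sabokrou, Miguel Bordallo López
Affiliations: Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, Finland; Okinawa Institute of Science and Technology (OIST), Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 6 figures, conference, 7 tables, 15 formulas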
Abstract:Training deep learning models for point cloud prediction tasks such as shape completion and generation depends critically on loss functions that measure discrepancies between predicted and ground-truth point sets. Commonly used functions such as Chamfer Distance (CD), HyperCD, and InfoCD rely on nearest-neighbor assignments, which often induce many-to-one correspondences, leading to point congestion in dense regions and poor coverage in sparse regions. These losses also involve non-differentiable operations due to index selection, which may affect gradient-based optimization. Earth Mover Distance (EMD) enforces one-to-one correspondences and captures structural similarity more effectively, but its cubic computational complexity limits its practical use. We propose the Adaptive Probabilistic Matching Loss (APML), a fully differentiable approximation of one-to-one matching that leverages Sinkhorn iterations on a temperature-scaled similarity matrix derived from pairwise distances. We analytically compute the temperature to guarantee a minimum assignment probability, eliminating manual tuning. APML achieves near-quadratic runtime, comparable to Chamfer-based losses, and avoids non-differentiable operations. When integrated into state-of-the-art architectures (PoinTr, PCN, FoldingNet) on ShapeNet benchmarks and on a spatiotemporal Transformer (CSI2PC) that generates 3D human point clouds from WiFi CSI measurements, APM loss yields faster convergence, superior spatial distribution, especially in low-density regions, and improved or on-par quantitative performance without additional hyperparameter search. The code is available at: this https URL.
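The general idea of Sinkhorn iterations on a temperature-scaled similarity matrix can be sketched in NumPy. This is not the APML implementation: the temperature here is a fixed, hand-picked value rather than the analytically derived one the abstract describes, both point sets are assumed to have the same size, and the function name `soft_matching_loss` is hypothetical.

```python
import numpy as np


def soft_matching_loss(pred, target, temperature=0.05, n_iters=50, eps=1e-12):
    """Sinkhorn-normalized soft assignment between two equally sized point sets."""
    diff = pred[:, np.newaxis, :] - target[np.newaxis, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))             # pairwise distances, shape (N, N)
    # Temperature-scaled similarity; shifting by the minimum keeps exp() stable.
    plan = np.exp(-(dist - dist.min()) / temperature)
    for _ in range(n_iters):
        plan = plan / (plan.sum(axis=1, keepdims=True) + eps)  # normalize rows
        plan = plan / (plan.sum(axis=0, keepdims=True) + eps)  # normalize columns
    loss = (plan * dist).sum() / pred.shape[0]
    return loss, plan


# Four corners of the unit square, and the same points in shuffled order.
pred = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = pred[[2, 0, 3, 1]]
loss, plan = soft_matching_loss(pred, target)
print(plan.argmax(axis=1), loss)
```

Every step is a smooth arithmetic operation, so the same computation written in an autodiff framework would propagate gradients to the predicted points, unlike a hard nearest-neighbor index lookup.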
[CV-60] MCTED: A Machine-Learning-Ready Dataset for Digital Elevation Model Generation From Mars Imagery
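[Quick Read]: This paper addresses the scarcity and uneven quality of data for Martian digital elevation model (DEM) prediction, in particular the artefacts and missing data points introduced by large-scale DEM generation pipelines. The key solution is a new dataset, MCTED, built from high-resolution Mars orthoimage and DEM pairs through a comprehensive processing pipeline that yields 80,898 samples. Each sample includes two masks marking values that were originally missing or were altered by the authors, so users can handle those regions as they see fit. To avoid data leakage, the training and validation splits cover no common areas. Experiments show that even a small U-Net trained on this dataset outperforms the zero-shot performance of the monocular depth estimation foundation model DepthAnythingV2, which highlights the value of a dedicated dataset for Martian DEM prediction.
Link: https://arxiv.org/abs/2509.08027
Authors: Rafał Osadnik, Pablo Gómez, Eleni Bohacek, Rickbir Bahia
Affiliations: European Space Agency (ESA); European Space Astronomy Centre (ESAC); UK Research and Innovation - Innovate UK; UK Space Agency (UKSA)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 21 figures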
Abstract:This work presents a new dataset for the Martian digital elevation model prediction task, ready for machine learning applications called MCTED. The dataset has been generated using a comprehensive pipeline designed to process high-resolution Mars orthoimage and DEM pairs from Day et al., yielding a dataset consisting of 80,898 data samples. The source images are data gathered by the Mars Reconnaissance Orbiter using the CTX instrument, providing a very diverse and comprehensive coverage of the Martian surface. Given the complexity of the processing pipelines used in large-scale DEMs, there are often artefacts and missing data points in the original data, for which we developed tools to solve or mitigate their impact. We divide the processed samples into training and validation splits, ensuring samples in both splits cover no mutual areas to avoid data leakage. Every sample in the dataset is represented by the optical image patch, DEM patch, and two mask patches, indicating values that were originally missing or were altered by us. This allows future users of the dataset to handle altered elevation regions as they please. We provide statistical insights of the generated dataset, including the spatial distribution of samples, the distributions of elevation values, slopes and more. Finally, we train a small U-Net architecture on the MCTED dataset and compare its performance to a monocular depth estimation foundation model, DepthAnythingV2, on the task of elevation prediction. We find that even a very small architecture trained on this dataset specifically, beats a zero-shot performance of a depth estimation foundation model like DepthAnythingV2. We make the dataset and code used for its generation completely open source in public repositories.
[CV-61] Two-Stage Swarm Intelligence Ensemble Deep Transfer Learning (SI-EDTL) for Vehicle Detection Using Unmanned Aerial Vehicles
[Quick Read]: This paper addresses the limited accuracy and generalization of multi-vehicle detection in unmanned aerial vehicle (UAV) images. The key to the solution is SI-EDTL, a two-stage swarm intelligence ensemble deep transfer learning model. It combines three pre-trained Faster R-CNN feature extractors (InceptionV3, ResNet50, GoogLeNet) with five transfer classifiers (KNN, SVM, MLP, C4.5, Naïve Bayes) to form 15 base learners, aggregates them by weighted averaging, and optimizes hyperparameters with the whale optimization algorithm to balance accuracy, precision, and recall.
Link: https://arxiv.org/abs/2509.08026
Authors: Zeinab Ghasemi Darehnaei, Mohammad Shokouhifar, Hossein Yazdanjouei, S.M.J. Rastegar Fatemi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces SI-EDTL, a two-stage swarm intelligence ensemble deep transfer learning model for detecting multiple vehicles in UAV images. It combines three pre-trained Faster R-CNN feature extractor models (InceptionV3, ResNet50, GoogLeNet) with five transfer classifiers (KNN, SVM, MLP, C4.5, Naïve Bayes), resulting in 15 different base learners. These are aggregated via weighted averaging to classify regions as Car, Van, Truck, Bus, or background. Hyperparameters are optimized with the whale optimization algorithm to balance accuracy, precision, and recall. Implemented in MATLAB R2020b with parallel processing, SI-EDTL outperforms existing methods on the AU-AIR UAV dataset.
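The ensemble structure in this abstract (3 feature extractors x 5 classifiers = 15 base learners, combined by weighted averaging) can be sketched as follows. The paper's implementation is in MATLAB; this Python sketch only illustrates the aggregation step, with made-up probabilities and uniform weights standing in for the weights that the whale optimization algorithm would tune. The function name `weighted_average_vote` is hypothetical.

```python
import itertools

import numpy as np

EXTRACTORS = ["InceptionV3", "ResNet50", "GoogLeNet"]
CLASSIFIERS = ["KNN", "SVM", "MLP", "C4.5", "Naive Bayes"]
CLASSES = ["Car", "Van", "Truck", "Bus", "background"]

# Every (extractor, classifier) pair is one base learner.
base_learners = list(itertools.product(EXTRACTORS, CLASSIFIERS))


def weighted_average_vote(probs, weights):
    """Combine per-learner class probabilities of shape (n_learners, n_classes)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalize so the result stays a distribution
    combined = weights @ probs
    return CLASSES[int(np.argmax(combined))], combined


# Made-up outputs for one region: 14 undecided learners and one confident learner.
probs = np.full((len(base_learners), len(CLASSES)), 0.2)
probs[0] = [0.05, 0.05, 0.80, 0.05, 0.05]
label, combined = weighted_average_vote(probs, np.ones(len(base_learners)))
print(len(base_learners), label)
```

Replacing the uniform weights with tuned values changes only the `weights` argument, which is the part a metaheuristic such as whale optimization would search over.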
[CV-62] Two Stage Context Learning with Large Language Models for Multimodal Stance Detection on Climate Change
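[Quick Read]: This paper addresses stance detection on multimodal social media data that combines text and visual content; most existing methods analyze text only and cope poorly with real-world posts that mix text and images. The key to the solution is a hierarchical fusion framework. A Large Language Model (LLM) first retrieves stance-relevant summaries from the source text, while a domain-aware image caption generator interprets the visual content in the context of the target topic. The text, image captions, and reply text are then modeled jointly by a specialized transformer module that captures interactions across modalities, enabling more robust stance classification.
Link: https://arxiv.org/abs/2509.08024
Authors: Lata Pangtey, Omkar Kabde, Shahid Shafi Dar, Nagendra Kumar
Affiliations: Indian Institute of Technology (IIT) Indore; Chaitanya Bharathi Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: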
Abstract:With the rapid proliferation of information across digital platforms, stance detection has emerged as a pivotal challenge in social media analysis. While most of the existing approaches focus solely on textual data, real-world social media content increasingly combines text with visual elements creating a need for advanced multimodal methods. To address this gap, we propose a multimodal stance detection framework that integrates textual and visual information through a hierarchical fusion approach. Our method first employs a Large Language Model to retrieve stance-relevant summaries from source text, while a domain-aware image caption generator interprets visual content in the context of the target topic. These modalities are then jointly modeled along with the reply text, through a specialized transformer module that captures interactions between the texts and images. The proposed modality fusion framework integrates diverse modalities to facilitate robust stance classification. We evaluate our approach on the MultiClimate dataset, a benchmark for climate change-related stance detection containing aligned video frames and transcripts. We achieve accuracy of 76.2%, precision of 76.3%, recall of 76.2% and F1-score of 76.2%, respectively, outperforming existing state-of-the-art approaches.
[CV-63] Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
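[Quick Read]: This paper addresses a bottleneck of Video Large Language Models (VideoLLMs): increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational cost and to performance degradation from long contexts. The key to the solution is Video Parallel Scaling (VPS), an inference-time method that runs multiple parallel inference streams, each processing a disjoint subset of the video's frames, and aggregates the output probabilities of the streams, integrating richer visual information without enlarging the model's context window. The authors show theoretically that the approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, improving performance without additional training.
Link: https://arxiv.org/abs/2509.08016
Authors: Hyungjin Chung, Hyelin Nam, Jiyeon Kim, Hyojun Go, Byeongjun Park, Junho Kim, Joonseok Lee, Seongsu Ha, Byung-Hoon Kim
Affiliations: EverEx; University of Michigan; KAIST; ETH Zürich; UNC Chapel Hill; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: this https URL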
Abstract:Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model’s perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video’s frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
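The two steps this abstract describes, splitting frames into disjoint subsets and aggregating the per-stream output probabilities, can be sketched in NumPy. Several details are assumptions made for illustration: the interleaved split, the plain mean as the aggregation rule, and `fake_model`, which is a hypothetical stand-in for a VideoLLM forward pass and returns an arbitrary probability vector.

```python
import numpy as np


def disjoint_frame_subsets(n_frames, n_streams):
    """Interleaved split: stream k gets frames k, k + n_streams, k + 2 * n_streams, ..."""
    return [list(range(k, n_frames, n_streams)) for k in range(n_streams)]


def aggregate(stream_probs):
    """Average the per-stream output distributions (one assumed aggregation rule)."""
    return np.mean(np.stack(stream_probs), axis=0)


def fake_model(frame_ids, vocab_size=4):
    """Hypothetical stand-in for a VideoLLM: maps a frame subset to token probabilities."""
    rng = np.random.default_rng(sum(frame_ids))
    logits = rng.standard_normal(vocab_size)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


subsets = disjoint_frame_subsets(n_frames=32, n_streams=4)
stream_probs = [fake_model(ids) for ids in subsets]
combined = aggregate(stream_probs)
```

Each stream sees only 8 of the 32 frames, so the per-stream context length stays fixed while the total number of frames that influence the prediction grows with the number of streams.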
[CV-64] An Explainable Deep Neural Network with Frequency-Aware Channel and Spatial Refinement for Flood Prediction in Sustainable Cities
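[Quick Read]: This paper addresses shortcomings in urban flood classification: traditional methods rely on unimodal data and static rule-based systems and so fail to model dynamic, non-linear relationships, while existing attention mechanisms and ensemble learning approaches are limited in hierarchical refinement, cross-modal feature integration, and adaptability to noisy environments. The key to the solution is the XFloodNet framework, which has three core components: (1) a Hierarchical Cross-Modal Gated Attention mechanism that dynamically aligns visual and textual features and enables multi-granularity interactions; (2) a Heterogeneous Convolutional Adaptive Multi-Scale Attention module that uses frequency-enhanced channel attention and frequency-modulated spatial attention to extract and prioritize discriminative flood-related features across the spectral and spatial domains; and (3) a Cascading Convolutional Transformer Feature Refinement technique that integrates hierarchical features through adaptive scaling and cascading operations, improving noise resistance and detection robustness.
Link: https://arxiv.org/abs/2509.08003
Authors: Shahid Shafi Dar, Bharat Kaurav, Arnav Jain, Chandravardhan Singh Raghaw, Mohammad Zia Ur Rehman, Nagendra Kumar
Affiliations: Indian Institute of Technology Indore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: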
Abstract:In an era of escalating climate change, urban flooding has emerged as a critical challenge for sustainable cities, threatening lives, infrastructure, and ecosystems. Traditional flood detection methods are constrained by their reliance on unimodal data and static rule-based systems, which fail to capture the dynamic, non-linear relationships inherent in flood events. Furthermore, existing attention mechanisms and ensemble learning approaches exhibit limitations in hierarchical refinement, cross-modal feature integration, and adaptability to noisy or unstructured environments, resulting in suboptimal flood classification performance. To address these challenges, we present XFloodNet, a novel framework that redefines urban flood classification through advanced deep-learning techniques. XFloodNet integrates three novel components: (1) a Hierarchical Cross-Modal Gated Attention mechanism that dynamically aligns visual and textual features, enabling precise multi-granularity interactions and resolving contextual ambiguities; (2) a Heterogeneous Convolutional Adaptive Multi-Scale Attention module, which leverages frequency-enhanced channel attention and frequency-modulated spatial attention to extract and prioritize discriminative flood-related features across spectral and spatial domains; and (3) a Cascading Convolutional Transformer Feature Refinement technique that harmonizes hierarchical features through adaptive scaling and cascading operations, ensuring robust and noise-resistant flood detection. We evaluate our proposed method on three benchmark datasets (Chennai Floods, Rhine18 Floods, and Harz17 Floods), on which XFloodNet achieves state-of-the-art F1-scores of 93.33%, 82.24%, and 88.60%, respectively, surpassing existing methods by significant margins.
[CV-65] 3D and 4D World Modeling: A Survey
[Quick Read]: This paper addresses two gaps in current world modeling research. First, existing work focuses mainly on generative modeling of 2D images and video and overlooks progress in large-scale scene modeling based on native 3D/4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds. Second, the lack of a standardized definition and taxonomy for "world models" has led to fragmented and inconsistent claims in the literature. The key to the solution is the first comprehensive survey dedicated to 3D and 4D world modeling and generation: it establishes precise definitions, introduces a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and summarizes datasets and evaluation metrics tailored to 3D/4D settings, giving the field a unified and clear foundation.
Link: https://arxiv.org/abs/2509.07996
Authors: Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C.H. Hoi, Ziwei Liu
Affiliations: National University of Singapore; CNRS@CREATE; Zhejiang University; Horizon Robotics; Technical University of Munich; HKUST; Tsinghua University; Nanjing University of Science and Technology; University of Macau; Shanghai AI Laboratory; HyperGAI; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Survey; 34 pages, 10 figures, 14 tables; GitHub Repo at this https URL
Abstract:World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, they overlook the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at this https URL
[CV-66] Revisiting Deepfake Detection: Chronological Continual Learning and the Limits of Generalization
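[Quick Read]: This paper addresses the poor adaptability of deepfake detection (DFD) systems to constantly evolving generation techniques, where conventional non-continual learning methods require frequent and expensive retraining. Its core solution is to frame DFD as a Continual Learning (CL) problem and to propose an efficient framework that simulates the real-world chronological evolution of deepfake technologies over 7 years, rather than relying on unrealistic simulated sequences, and that builds on lightweight visual backbones for real-time performance. The key contributions include two new metrics: Continual AUC (C-AUC) for historical performance and Forward Transfer AUC (FWT-AUC) for future generalization. More than 600 simulations show that efficient adaptation (155 times faster than full retraining) and robust retention of historical knowledge are possible, but that generalization to unseen future generators remains near-random (FWT-AUC ≈ 0.5), which motivates the proposed Non-Universal Deepfake Distribution Hypothesis.
Link: https://arxiv.org/abs/2509.07993
Authors: Federico Fontana, Anxhelo Diko, Romeo Lanzino, Marco Raoul Marini, Bachir Kaddar, Gian Luca Foresti, Luigi Cinque
Affiliations: Sapienza University of Rome; University Ibn Khaldoun of Tiaret; University of Udine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: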
Abstract:The rapid evolution of deepfake generation technologies poses critical challenges for detection systems, as non-continual learning methods demand frequent and expensive retraining. We reframe deepfake detection (DFD) as a Continual Learning (CL) problem, proposing an efficient framework that incrementally adapts to emerging visual manipulation techniques while retaining knowledge of past generators. Our framework, unlike prior approaches that rely on unreal simulation sequences, simulates the real-world chronological evolution of deepfake technologies in extended periods across 7 years. Simultaneously, our framework builds upon lightweight visual backbones to allow for the real-time performance of DFD systems. Additionally, we contribute two novel metrics: Continual AUC (C-AUC) for historical performance and Forward Transfer AUC (FWT-AUC) for future generalization. Through extensive experimentation (over 600 simulations), we empirically demonstrate that while efficient adaptation (+155 times faster than full retraining) and robust retention of historical knowledge is possible, the generalization of current approaches to future generators without additional training remains near-random (FWT-AUC ≈ 0.5) due to the unique imprint characterizing each existing generator. Such observations are the foundation of our newly proposed Non-Universal Deepfake Distribution Hypothesis. Code will be released upon acceptance.
[CV-67] RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts
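[Quick read]: This paper addresses the widespread problem of shortcut learning in deep learning models for automated chest X-ray (CXR) interpretation: models base decisions on clinically irrelevant spurious correlations (such as device markers or background noise) rather than genuine pathological features, which lowers specificity and limits generalization. The key to the solution is the RoentMod framework, which combines the open-source medical image generator RoentGen with an image-to-image modification model and, without retraining, generates anatomically realistic CXRs that contain a specified synthetic pathology while preserving the other anatomical structures of the original scan, enabling controlled counterfactual interventions. Adding RoentMod-generated counterfactual images during training improved the discrimination of multi-task and foundation models across multiple pathologies (AUC gains of 3-19% in internal validation and 1-11% for 5 of 6 pathologies in external testing), mitigating shortcut reliance on off-target pathology and improving robustness and interpretability.
Link: https://arxiv.org/abs/2509.08640
Authors: Lauren H. Cooke,Matthias Jung,Jan M. Brendel,Nora M. Kerkovits,Borek Foldyna,Michael T. Lu,Vineet K. Raghu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 + 8 pages, 4 + 7 figures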
Abstract:Chest radiographs (CXRs) are among the most common tests in medicine. Automated image interpretation may reduce radiologists’ workload and expand access to diagnostic expertise. Deep learning multi-task and foundation models have shown strong performance for CXR interpretation but are vulnerable to shortcut learning, where models rely on spurious and off-target correlations rather than clinically relevant features to make decisions. We introduce RoentMod, a counterfactual image editing framework that generates anatomically realistic CXRs with user-specified, synthetic pathology while preserving unrelated anatomical features of the original scan. RoentMod combines an open-source medical image generator (RoentGen) with an image-to-image modification model without requiring retraining. In reader studies with board-certified radiologists and radiology residents, RoentMod-produced images appeared realistic in 93% of cases, correctly incorporated the specified finding in 89-99% of cases, and preserved native anatomy comparable to real follow-up CXRs. Using RoentMod, we demonstrate that state-of-the-art multi-task and foundation models frequently exploit off-target pathology as shortcuts, limiting their specificity. Incorporating RoentMod-generated counterfactual images during training mitigated this vulnerability, improving model discrimination across multiple pathologies by 3-19% AUC in internal validation and by 1-11% for 5 out of 6 tested pathologies in external testing. These findings establish RoentMod as a broadly applicable tool for probing and correcting shortcut learning in medical AI. By enabling controlled counterfactual interventions, RoentMod enhances the robustness and interpretability of CXR interpretation models and provides a generalizable strategy for improving foundation models in medical imaging.
[CV-68] CNN-ViT Hybrid for Pneumonia Detection: Theory and Empiric on Limited Data without Pretraining
[Quick read]: This paper addresses the unstable diagnostic performance of deep learning models trained on small, class-imbalanced datasets. The key to the solution is a hybrid architecture that combines the strengths of convolutional neural networks (CNNs) and Vision Transformers (ViTs): by pairing the CNN's spatial feature extraction with the ViT's global modeling, it achieves higher recall (up to 0.9443) and a stable F1 score (about 0.85) across data fractions and class distributions, outperforms standalone CNN and ViT models on imbalanced data, and keeps training time comparable to a pure Transformer.
Link: https://arxiv.org/abs/2509.08586
Authors: Prashant Singh Basnet,Roshan Chitrakar
Affiliations: The British College, Keele University; Nepal College of Information Technology, Pokhara University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 Tables, 5 Figures. Manuscript submitted to ICOIICS 2025 Conference. Currently under peer review
Abstract:This research explored the hybridization of CNN and ViT within a training dataset of limited size, and introduced a distinct class imbalance. The training was made from scratch with a mere focus on theoretically and experimentally exploring the architectural strengths of the proposed hybrid model. Experiments were conducted across varied data fractions with balanced and imbalanced training datasets. Comparatively, the hybrid model, complementing the strengths of CNN and ViT, achieved the highest recall of 0.9443 (50% data fraction in balanced) and consistency in F1 score around 0.85, suggesting reliability in diagnosis. Additionally, the model was successful in outperforming CNN and ViT in imbalanced datasets. Despite its complex architecture, it required comparable training time to the transformers in all data fractions.
[CV-69] Physics-Guided Rectified Flow for Low-light RAW Image Enhancement
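[Quick read]: This paper addresses the performance bottleneck in low-light RAW image enhancement caused by inaccurate noise modeling. Existing deep learning methods mostly rely on synthetic datasets that typically consider only additive noise and ignore multiplicative noise, and that use global calibration, which cannot capture the pixel-level spatial noise variation caused by CMOS manufacturing differences, so they struggle to reproduce real sensor noise. The key to the solution is threefold: first, derive from the physical mechanisms a composite noise model containing both additive and multiplicative noise; second, propose a physics-based per-pixel noise simulation and calibration scheme that estimates and synthesizes noise independently for each pixel, overcoming the limits of global calibration; third, combine this physics-guided noise synthesis with a rectified flow generative framework to build PGRF (Physics-Guided Rectified Flow), which uses rectified flow's ability to model complex data distributions and steers generation with physical priors to improve low-light RAW image enhancement.
Link: https://arxiv.org/abs/2509.08330
Authors: Juntai Zeng
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures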
Abstract:Enhancing RAW images captured under low-light conditions is a challenging task. Recent deep-learning-based RAW enhancement methods have shifted from using real paired data to relying on synthetic datasets. These synthetic datasets are typically generated by physically modeling sensor noise, but existing approaches often consider only additive noise, ignore multiplicative components, and rely on global calibration that overlooks pixel-level manufacturing variations. As a result, such methods struggle to accurately reproduce real sensor noise. To address these limitations, this paper derives a noise model from the physical noise generation mechanisms that occur under low illumination and proposes a novel composite model that integrates both additive and multiplicative noise. To solve the model, we introduce a physics-based per-pixel noise simulation and calibration scheme that estimates and synthesizes noise for each individual pixel, thereby overcoming the restrictions of traditional global calibration and capturing spatial noise variations induced by microscopic CMOS manufacturing differences. Motivated by the strong performance of rectified flow methods in image generation and processing, we further combine the physics-based noise synthesis with a rectified flow generative framework and present PGRF, a physics-guided rectified flow framework for low-light image enhancement. PGRF leverages the ability of rectified flows to model complex data distributions and uses physical guidance to steer the generation toward the desired clean image. To validate the effectiveness of the proposed model, we established the LLID dataset, an indoor low-light benchmark captured with the Sony A7S II camera. Experimental results demonstrate that the proposed framework achieves significant improvements in low-light RAW image enhancement.
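The idea of per-pixel noise with both additive and multiplicative parts can be sketched as below. This is an illustrative toy, not the paper's model: the abstract does not give the actual noise equations, so the Gaussian form, the function name `synthesize_noisy_raw`, and the per-pixel parameter maps `gain_sigma` and `read_sigma` are all assumptions.

```python
import random


def synthesize_noisy_raw(clean, gain_sigma, read_sigma, seed=0):
    """Add signal-dependent (multiplicative) and signal-independent (additive)
    Gaussian noise to a 2D image, using a separate sigma for every pixel.

    clean, gain_sigma, read_sigma: 2D lists of floats with the same shape.
    The Gaussian form is an assumption chosen for illustration only.
    """
    rng = random.Random(seed)
    noisy = []
    for row, g_row, r_row in zip(clean, gain_sigma, read_sigma):
        out_row = []
        for x, g, r in zip(row, g_row, r_row):
            multiplicative = x * rng.gauss(0.0, g)  # scales with the signal
            additive = rng.gauss(0.0, r)            # independent of the signal
            out_row.append(x + multiplicative + additive)
        noisy.append(out_row)
    return noisy


clean = [[10.0, 20.0], [30.0, 40.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Per-pixel maps: each pixel gets its own noise strength, in contrast to a
# single global calibration value shared by the whole sensor.
gain_sigma = [[0.01, 0.02], [0.03, 0.04]]
read_sigma = [[0.5, 0.6], [0.7, 0.8]]

noise_free = synthesize_noisy_raw(clean, zeros, zeros)
sample_a = synthesize_noisy_raw(clean, gain_sigma, read_sigma, seed=1)
sample_b = synthesize_noisy_raw(clean, gain_sigma, read_sigma, seed=1)
```

A global calibration would correspond to filling both maps with one constant; the per-pixel maps are what allow spatial variation between pixels to be represented.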
[CV-70] Enhancing Privacy Preservation and Reducing Analysis Time with Federated Transfer Learning in Digital Twins-based Computed Tomography Scan Analysis
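[Quick read]: This paper addresses three core problems in biomedical image analysis, particularly CT scan analysis: data privacy protection, limited computing resources, and the drop in model performance caused by data heterogeneity (non-IID data). The key to the solution is a Digital Twin (DT)-based Federated Transfer Learning (FTL) framework that shares pre-trained models and transfers knowledge between peer nodes, enabling cross-institution collaborative modeling without exchanging raw data, which protects patient identity while improving convergence speed and diagnostic accuracy. The method is particularly suited to non-IID data, outperforms conventional Federated Learning (FL) and Clustered Federated Learning (CFL) on precision, recall, F1 score, and other metrics, and offers a secure, efficient, and reliable approach to image analysis for precision medicine and smart healthcare systems.
Link: https://arxiv.org/abs/2509.08018
Authors: Avais Jan,Qasim Zia,Murray Patterson
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: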
Abstract:The application of Digital Twin (DT) technology and Federated Learning (FL) has great potential to change the field of biomedical image analysis, particularly for Computed Tomography (CT) scans. This paper presents Federated Transfer Learning (FTL) as a new Digital Twin-based CT scan analysis paradigm. FTL uses pre-trained models and knowledge transfer between peer nodes to solve problems such as data privacy, limited computing resources, and data heterogeneity. The proposed framework allows real-time collaboration between cloud servers and Digital Twin-enabled CT scanners while protecting patient identity. We apply the FTL method to a heterogeneous CT scan dataset and assess model performance using convergence time, model accuracy, precision, recall, F1 score, and confusion matrix. It has been shown to perform better than conventional FL and Clustered Federated Learning (CFL) methods with better precision, accuracy, recall, and F1-score. The technique is beneficial in settings where the data is not independently and identically distributed (non-IID), and it offers reliable, efficient, and secure solutions for medical diagnosis. These findings highlight the possibility of using FTL to improve decision-making in digital twin-based CT scan analysis, secure and efficient medical image analysis, promote privacy, and open new possibilities for applying precision medicine and smart healthcare systems.
[CV-71] CardioComposer: Flexible and Compositional Anatomical Structure Generation with Disentangled Geometric Guidance
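[Quick read]: This paper addresses the trade-off between controllability and anatomical realism in current generative 3D anatomical models. Existing methods struggle to offer fine-grained control over anatomical structures and high-fidelity modeling at the same time, which limits their use in clinical research and medical device design. The key to the solution is a programmable, compositional framework that embeds interpretable ellipsoidal primitives in 3D space as guidance signals, selects specific tissues from multi-tissue segmentation maps, and applies geometric moment losses to guide the reverse diffusion process, giving independent control over size, shape, and position as well as composition of multi-component constraints at inference time.
Link: https://arxiv.org/abs/2509.08015
Authors: Karim Kadry,Shoaib Goraya,Ajay Manicka,Abdalla Abdelwahed,Farhad Nezami,Elazer Edelman
Affiliations: Massachusetts Institute of Technology; Brigham and Women’s Hospital; American University in Cairo; Harvard Medical School
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 13 figures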
Abstract:Generative models of 3D anatomy, when integrated with biophysical simulators, enable the study of structure-function relationships for clinical research and medical device design. However, current models face a trade-off between controllability and anatomical realism. We propose a programmable and compositional framework for guiding unconditional diffusion models of human anatomy using interpretable ellipsoidal primitives embedded in 3D space. Our method involves the selection of certain tissues within multi-tissue segmentation maps, upon which we apply geometric moment losses to guide the reverse diffusion process. This framework supports the independent control over size, shape, and position, as well as the composition of multi-component constraints during inference.
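One way to picture how geometric moments separate size, position, and shape is the sketch below, which computes the zeroth, first, and second moments of a set of tissue voxels and compares them with target values. It is a non-differentiable toy under stated assumptions: the abstract does not specify the loss, so `mask_moments`, `moment_loss`, and the squared-error form are hypothetical.

```python
def mask_moments(voxels):
    """Return (size, centroid, covariance) for a list of (x, y, z) voxel coordinates.

    size       -> zeroth moment (voxel count)
    centroid   -> first moment (position)
    covariance -> second central moment (shape and orientation, as for an ellipsoid)
    """
    n = len(voxels)
    centroid = tuple(sum(v[i] for v in voxels) / n for i in range(3))
    cov = [
        [
            sum((v[i] - centroid[i]) * (v[j] - centroid[j]) for v in voxels) / n
            for j in range(3)
        ]
        for i in range(3)
    ]
    return n, centroid, cov


def moment_loss(voxels, target_size, target_centroid, target_cov):
    """Squared-error terms for size, position, and shape, returned separately
    so that each property can be weighted or switched off independently."""
    n, centroid, cov = mask_moments(voxels)
    size_term = (n - target_size) ** 2
    position_term = sum((a - b) ** 2 for a, b in zip(centroid, target_centroid))
    shape_term = sum(
        (cov[i][j] - target_cov[i][j]) ** 2 for i in range(3) for j in range(3)
    )
    return size_term, position_term, shape_term


# Toy tissue: the eight corners of a unit cube.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
size, centroid, cov = mask_moments(cube)
terms = moment_loss(
    cube,
    target_size=8,
    target_centroid=(0.5, 0.5, 0.5),
    target_cov=[[0.25, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 0.25]],
)
```

Keeping the three terms separate mirrors the independent control over size, shape, and position described in the abstract. Guiding a diffusion model would need a differentiable version computed on soft segmentation maps, which this sketch does not attempt.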
[CV-72] Validation of a CT-brain analysis tool for measuring global cortical atrophy in older patient cohorts
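[Quick read]: This paper addresses the inefficiency and subjectivity of quantifying brain atrophy with manual visual rating, and presents an automated deep learning (DL) tool for brain CT analysis that measures the Global Cerebral Atrophy (GCA) score. The key to the solution is to use multi-centre real-world CT scans from older patients (≥65 years), split 60/20/20 into training, optimisation, and test sets, to extract GCA scores automatically without user input. Validation shows good agreement between the DL tool and a trained human rater (mean absolute error MAE = 3.2, weighted kappa = 0.45), and GCA scores correlate significantly with age and cognitive function (both p < 0.001), demonstrating feasibility for large-scale health data research and providing a proof of concept towards a clinically approved point-of-care tool.
Link: https://arxiv.org/abs/2509.08012
Authors: Sukhdeep Bal,Emma Colbourne,Jasmine Gan,Ludovica Griffanti,Taylor Hanayik,Nele Demeyere,Jim Davies,Sarah T Pendlebury,Mark Jenkinson
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 figures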
Abstract:Quantification of brain atrophy currently requires visual rating scales, which are time consuming, and automated brain image analysis is warranted. We validated our automated deep learning (DL) tool measuring the Global Cerebral Atrophy (GCA) score against trained human raters, and associations with age and cognitive impairment, in representative older (≥65 years) patients. CT-brain scans were obtained from patients in acute medicine (ORCHARD-EPR), acute stroke (OCS studies) and a legacy sample. Scans were divided in a 60/20/20 ratio for training, optimisation and testing. CT-images were assessed by two trained raters (rater-1=864 scans, rater-2=20 scans). Agreement between DL tool-predicted GCA scores (range 0-39) and the visual ratings was evaluated using mean absolute error (MAE) and Cohen’s weighted kappa. Among 864 scans (ORCHARD-EPR=578, OCS=200, legacy scans=86), MAE between the DL tool and rater-1 GCA scores was 3.2 overall, 3.1 for ORCHARD-EPR, 3.3 for OCS and 2.6 for the legacy scans, and half had DL-predicted GCA error between -2 and 2. Inter-rater agreement was Kappa=0.45 between the DL-tool and rater-1, and 0.41 between the tool and rater-2, whereas it was lower at 0.28 for rater-1 and rater-2. There was no difference in GCA scores from the DL-tool and the two raters (one-way ANOVA, p=0.35) or in mean GCA scores between the DL-tool and rater-1 (paired t-test, t=-0.43, p=0.66), the tool and rater-2 (t=1.35, p=0.18) or between rater-1 and rater-2 (t=0.99, p=0.32). DL-tool GCA scores correlated with age and cognitive scores (both p<0.001). Our DL CT-brain analysis tool measured GCA score accurately and without user input in real-world scans acquired from older patients. Our tool will enable extraction of standardised quantitative measures of atrophy at scale for use in health data research and will act as proof-of-concept towards a point-of-care clinically approved tool.
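The two agreement measures named in the abstract, mean absolute error and Cohen's weighted kappa, can be computed as in the following sketch. The abstract does not say which weighting scheme was used, so linear weights (`power=1`) are an assumption here, and the scores in the example are made up for illustration.

```python
def mean_absolute_error(a, b):
    """Average absolute difference between two equally long score lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def weighted_kappa(a, b, n_categories, power=1):
    """Cohen's weighted kappa for integer ratings in 0..n_categories-1.

    power=1 gives linear weights, power=2 quadratic weights.
    """
    k = n_categories
    n = len(a)
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            num += w * observed[i][j]
            den += w * row[i] * col[j]
    if den == 0.0:
        # Both raters used one identical category throughout: perfect agreement.
        return 1.0
    return 1.0 - num / den


# Hypothetical GCA scores on the 0-39 scale (40 categories).
tool = [10, 12, 20, 31]
rater = [12, 12, 17, 30]
mae = mean_absolute_error(tool, rater)
kappa = weighted_kappa(tool, rater, n_categories=40)
kappa_identical = weighted_kappa(tool, tool, n_categories=40)
```

Weighted kappa penalizes a disagreement by how far apart the two scores are, which suits an ordinal scale such as GCA better than unweighted kappa, where a one-point difference counts the same as a thirty-point difference.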
[CV-73] Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis MICCAI
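[Quick read]: This paper addresses the poor generalization and low clinical acceptance of medical image analysis models that result from limited expert-annotated data. The key to the solution is an expert-guided explainable few-shot learning framework that incorporates radiologist-provided Regions-of-Interest (ROIs) into model training to improve classification performance and interpretability together. Specifically, it uses Grad-CAM to produce a spatial attention supervision signal and designs an explanation loss based on Dice similarity that aligns the regions the model attends to with diagnostically relevant regions; this loss is optimized jointly with the prototypical network objective, pushing the model to focus on clinically meaningful features when data are scarce.
Link: https://arxiv.org/abs/2509.08007
Authors: Ifrat Ikhtear Uddin,Longwei Wang,KC Santosh
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the proceedings of MICCAI Workshop on Data Engineering in Medical Imaging 2025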
Abstract:Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions-of-interests (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.
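A Dice-based explanation loss of the kind described above might look like the following sketch, which compares an attention map with an ROI mask and adds the result to a classification loss. The soft-Dice form, the smoothing constant `eps`, the weight `lam`, and both function names are assumptions; the abstract does not give the paper's exact formulation.

```python
def dice_explanation_loss(attention, roi_mask, eps=1e-6):
    """1 - soft Dice between an attention map (values in [0, 1]) and a binary ROI mask.

    Both inputs are 2D lists of the same shape. The loss is near 0 when the
    attention coincides with the ROI and near 1 when they do not overlap.
    """
    intersection = sum(
        a * m
        for att_row, mask_row in zip(attention, roi_mask)
        for a, m in zip(att_row, mask_row)
    )
    total = sum(a for row in attention for a in row) + sum(
        m for row in roi_mask for m in row
    )
    dice = (2.0 * intersection + eps) / (total + eps)
    return 1.0 - dice


def joint_loss(classification_loss, attention, roi_mask, lam=0.5):
    """Classification objective plus weighted explanation loss.

    lam is a hypothetical trade-off weight, not a value from the paper.
    """
    return classification_loss + lam * dice_explanation_loss(attention, roi_mask)


roi = [[1.0, 0.0], [0.0, 1.0]]
aligned = dice_explanation_loss([[1.0, 0.0], [0.0, 1.0]], roi)
misaligned = dice_explanation_loss([[0.0, 1.0], [1.0, 0.0]], roi)
combined = joint_loss(0.7, [[1.0, 0.0], [0.0, 1.0]], roi)
```

In an actual training loop the attention map would come from Grad-CAM on the current batch and the computation would need to stay differentiable, for example by expressing the same formula in an autodiff framework.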
[CV-74] STROKEVISION-BENCH: A Multimodal Video And 2D Pose Benchmark For Tracking Stroke Recovery
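[Quick read]: This paper addresses a clinical pain point: post-stroke assessment of upper extremity (UE) function still relies on subjective judgment and struggles to capture subtle motor improvements, which limits personalized rehabilitation planning. The key to the solution is StrokeVision-Bench, the first dedicated dataset of stroke patients performing structured block transfer tasks. It contains 1,000 annotated videos in four clinically meaningful action classes, each represented in two modalities (raw video frames and 2D skeletal keypoints), providing a foundation for objective, quantitative, and scalable computer-vision-based assessment of UE motor function.
Link: https://arxiv.org/abs/2509.07994
Authors: David Robinson,Animesh Gupta,Rizwan Quershi,Qiushi Fu,Mubarak Shah
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages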
Abstract:Despite advancements in rehabilitation protocols, clinical assessment of upper extremity (UE) function after stroke largely remains subjective, relying heavily on therapist observation and coarse scoring systems. This subjectivity limits the sensitivity of assessments to detect subtle motor improvements, which are critical for personalized rehabilitation planning. Recent progress in computer vision offers promising avenues for enabling objective, quantitative, and scalable assessment of UE motor function. Among standardized tests, the Box and Block Test (BBT) is widely utilized for measuring gross manual dexterity and tracking stroke recovery, providing a structured setting that lends itself well to computational analysis. However, existing datasets targeting stroke rehabilitation primarily focus on daily living activities and often fail to capture clinically structured assessments such as block transfer tasks. Furthermore, many available datasets include a mixture of healthy and stroke-affected individuals, limiting their specificity and clinical utility. To address these critical gaps, we introduce StrokeVision-Bench, the first-ever dedicated dataset of stroke patients performing clinically structured block transfer tasks. StrokeVision-Bench comprises 1,000 annotated videos categorized into four clinically meaningful action classes, with each sample represented in two modalities: raw video frames and 2D skeletal keypoints. We benchmark several state-of-the-art video action recognition and skeleton-based action classification methods to establish performance baselines for this domain and facilitate future research in automated stroke rehabilitation assessment.
Artificial Intelligence
[AI-0] Narrative-Guided Reinforcement Learning: A Platform for Studying Language Model Influence on Decision Making
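[Quick read]: This paper asks how narrative elements can influence reward-based learning and decision making, aiming to bridge the long-standing separation between research on decision making and research on narrative reasoning in AI systems. The key to the solution is a dual-system architecture comprising a reinforcement learning (RL) policy module that generates action suggestions and a language model module that reasons about those suggestions under different narrative frameworks to guide the final decision. The architecture is implemented in a configurable gridworld environment that keeps environment and reward structures consistent while allowing controlled experiments on narrative parameters, environmental complexity, and the interaction between RL and symbolic reasoning, providing a quantifiable, extensible experimental platform for studying how narrative frameworks shape AI decisions.
Link: https://arxiv.org/abs/2509.08785
Authors: Anup Tuladhar,Araz Minhas,Adam Kirton,Eli Kinney-Lang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Comments: Extended Abstract for RLDM 2025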
Abstract:We present a preliminary experimental platform that explores how narrative elements might shape AI decision-making by combining reinforcement learning (RL) with language model reasoning. While AI systems can now both make decisions and engage in narrative reasoning, these capabilities have mostly been studied separately. Our platform attempts to bridge this gap using a dual-system architecture to examine how narrative frameworks could influence reward-based learning. The system comprises a reinforcement learning policy that suggests actions based on past experience, and a language model that processes these suggestions through different narrative frameworks to guide decisions. This setup enables initial experimentation with narrative elements while maintaining consistent environment and reward structures. We implement this architecture in a configurable gridworld environment, where agents receive both policy suggestions and information about their surroundings. The platform’s modular design facilitates controlled testing of environmental complexity, narrative parameters, and the interaction between reinforcement learning and narrative-based decisions. Our logging system captures basic decision metrics, from RL policy values to language model reasoning to action selection patterns. While preliminary, this implementation provides a foundation for studying how different narrative frameworks might affect reward-based decisions and exploring potential interactions between optimization-based learning and symbolic reasoning in AI systems.
[AI-1] Using AI to Optimize Patient Transfer and Resource Utilization During Mass-Casualty Incidents: A Simulation Platform
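[Quick read]: This paper addresses the resource strain and decision pressure that healthcare systems face during mass casualty incidents (MCIs), focusing on how to make patient-hospital allocation decisions efficiently and accurately under extreme conditions. The key to the solution is a decision-support AI agent based on deep reinforcement learning that weighs patient acuity levels, specialized care requirements, hospital capacities, and transport logistics to optimize patient transfer decisions. The agent is integrated into MasTER, a web-based command dashboard, and three interaction modes (human-only, human-AI collaboration, and AI-only) were evaluated in simulated scenarios. The results show that the AI significantly improves decision quality and consistency and even brings non-experts to expert-level performance, indicating the potential of this AI-driven decision support system to strengthen MCI emergency response.
Link: https://arxiv.org/abs/2509.08756
Authors: Zhaoxun “Lorenz” Liu,Wagner H. Souza,Jay Han,Amin Madani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: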
Abstract:Mass casualty incidents (MCIs) overwhelm healthcare systems and demand rapid, accurate patient-hospital allocation decisions under extreme pressure. Here, we developed and validated a deep reinforcement learning-based decision-support AI agent to optimize patient transfer decisions during simulated MCIs by balancing patient acuity levels, specialized care requirements, hospital capacities, and transport logistics. To integrate this AI agent, we developed MasTER, a web-accessible command dashboard for MCI management simulations. Through a controlled user study with 30 participants (6 trauma experts and 24 non-experts), we evaluated three interaction approaches with the AI agent (human-only, human-AI collaboration, and AI-only) across 20- and 60-patient MCI scenarios in the Greater Toronto Area. Results demonstrate that increasing AI involvement significantly improves decision quality and consistency. The AI agent outperforms trauma surgeons (p < 0.001) and enables non-experts to achieve expert-level performance when assisted, contrasting sharply with their significantly inferior unassisted performance (p < 0.001). These findings establish the potential for our AI-driven decision support to enhance both MCI preparedness training and real-world emergency response management.
[AI-2] DEQuify your force field: More efficient simulations using deep equilibrium models ICLR-2025
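[Quick read]: This paper addresses the trade-off between accuracy and computational efficiency of force field models in molecular dynamics (MD) simulation. Existing machine learning force fields already exploit physical symmetries (invariance under rotation, translation, and reflection) but have not yet used temporal continuity as a prior, namely that states at successive time steps are highly similar. The key to the solution is to recast a state-of-the-art equivariant base model as a deep equilibrium model (DEQ) that reuses intermediate neural network features from the previous time step, improving accuracy and speed by 10%-20% on the MD17, MD22, and OC20 200k datasets while substantially reducing training memory, which allows more expressive models to be trained on larger systems.
Link: https://arxiv.org/abs/2509.08734
Authors: Andreas Burger,Luca Thiede,Alán Aspuru-Guzik,Nandita Vijaykumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AI4MAT-ICLR-2025 Spotlight this https URL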
Abstract:Machine learning force fields show great promise in enabling more accurate molecular dynamics simulations compared to manually derived ones. Much of the progress in recent years was driven by exploiting prior knowledge about physical systems, in particular symmetries under rotation, translation, and reflections. In this paper, we argue that there is another important piece of prior information that, thus far, hasn’t been explored: Simulating a molecular system is necessarily continuous, and successive states are therefore extremely similar. Our contribution is to show that we can exploit this information by recasting a state-of-the-art equivariant base model as a deep equilibrium model. This allows us to recycle intermediate neural network features from previous time steps, enabling us to improve both accuracy and speed by 10%-20% on the MD17, MD22, and OC20 200k datasets, compared to the non-DEQ base model. The training is also much more memory efficient, allowing us to train more expressive models on larger systems.
[AI-3] Explainability of CNN Based Classification Models for Acoustic Signal ICTAI2025
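[Quick read]: This paper addresses the lack of interpretability of complex deep learning model predictions in bioacoustics, particularly when analysing bird vocalizations with marked geographic variation. The key to the solution is to combine several Explainable Artificial Intelligence (XAI) techniques, both model-agnostic (LIME, SHAP) and model-specific (DeepLIFT, Grad-CAM), to produce complementary and more complete explanations of model decisions, improving the trustworthiness and usability of models in bioacoustic applications.
Link: https://arxiv.org/abs/2509.08717
Authors: Zubair Faruqui,Mackenzie S. McIntire,Rahul Dubey,Jay McEntee
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in IEEE ICTAI 2025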
Abstract:Explainable Artificial Intelligence (XAI) has emerged as a critical tool for interpreting the predictions of complex deep learning models. While XAI has been increasingly applied in various domains within acoustics, its use in bioacoustics, which involves analyzing audio signals from living organisms, remains relatively underexplored. In this paper, we investigate the vocalizations of a bird species with strong geographic variation throughout its range in North America. Audio recordings were converted into spectrogram images and used to train a deep Convolutional Neural Network (CNN) for classification, achieving an accuracy of 94.8%. To interpret the model’s predictions, we applied both model-agnostic (LIME, SHAP) and model-specific (DeepLIFT, Grad-CAM) XAI techniques. These techniques produced different but complementary explanations, and when their explanations were considered together, they provided more complete and interpretable insights into the model’s decision-making. This work highlights the importance of using a combination of XAI techniques to improve trust and interoperability, not only in broader acoustics signal analysis but also argues for broader applicability in different domain specific tasks.
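[AI-4] The More You Automate the Less You See: Hidden Pitfalls of AI Scientist Systems
[Quick read]: This paper addresses four potential failure modes that AI scientist systems may introduce, because their internal workflows lack transparency, when they autonomously run the full pipeline from hypothesis generation and experimentation to paper writing: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. Left undetected, these failures seriously undermine the integrity, reliability, and trustworthiness of the research outputs. The key to the solution is to design controlled experiments that isolate and verify each failure mode, and to analyse the trace logs and source code of the full automated workflow, which detects these flaws far more effectively than reviewing the final paper alone and supports transparency, traceability, and reproducibility. The authors therefore recommend that journals and conferences require these artifacts to be submitted alongside the paper.
Link: https://arxiv.org/abs/2509.08713
Authors: Ziming Luo,Atoosa Kasirzadeh,Nihar B. Shah
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: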
Abstract:AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflow of these systems has not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend journals and conferences evaluating AI-generated research to mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.
[AI-5] One Model Two Minds: A Context-Gated Graph Learner that Recreates Human Biases
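[Quick read]: This paper addresses how to build AI systems with human-like social cognition, in particular adaptive reasoning and decision making of the kind humans show in Theory of Mind (ToM) tasks. The core challenge is to simulate the human dual-process mode of thinking, i.e. the dynamic balance between fast intuitive reasoning (System 1) and slow, context-sensitive meta-adaptive learning (System 2). The key to the solution is a fast habitual reasoning module based on graph convolutional networks (GCNs) and a slow context-aware module based on meta-learning, with a learned context gate mechanism that switches between and jointly optimizes the two. The architecture shows human-like adaptive behavior on classic false-belief tasks and reproduces typical cognitive biases such as anchoring, cognitive-load fatigue, framing effects, and priming effects, offering a theoretical framework and technical route toward embodied, situated, and social intelligence in AI systems.
Link: https://arxiv.org/abs/2509.08705
Authors: Shalima Binta Manir,Tim Oates
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures, 2 tables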
Abstract:We introduce a novel Theory of Mind (ToM) framework inspired by dual-process theories from cognitive science, integrating a fast, habitual graph-based reasoning system (System 1), implemented via graph convolutional networks (GCNs), and a slower, context-sensitive meta-adaptive learning system (System 2), driven by meta-learning techniques. Our model dynamically balances intuitive and deliberative reasoning through a learned context gate mechanism. We validate our architecture on canonical false-belief tasks and systematically explore its capacity to replicate hallmark cognitive biases associated with dual-process theory, including anchoring, cognitive-load fatigue, framing effects, and priming effects. Experimental results demonstrate that our dual-process approach closely mirrors human adaptive behavior, achieves robust generalization to unseen contexts, and elucidates cognitive mechanisms underlying reasoning biases. This work bridges artificial intelligence and cognitive theory, paving the way for AI systems exhibiting nuanced, human-like social cognition and adaptive decision-making capabilities.
[AI-6] A layered architecture for log analysis in complex IT systems
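[Quick read]: This paper addresses the stability and reliability challenges that DevOps teams face when locating and fixing failures in increasingly complex IT systems. The core problem is that traditional log analysis struggles to identify anomalous behavior efficiently and to pinpoint root causes, which hurts operational efficiency. The key to the solution is a three-layer architecture. The first layer, Log Investigation, performs automatic log labeling and anomaly classification, enabling supervised training and precise evaluation. The second layer, Anomaly Detection, is a flexible detection method adaptable to unsupervised, weakly supervised, and supervised learning that reaches F1 scores of 0.98-1.0 on public and industry datasets. The third layer, Root Cause Analysis, identifies minimal log sets describing the failure origin and service sequence, placing 90-98% of root cause log lines within the top 10 candidates and giving actionable input for remediation. Integrating the three layers improves how efficiently and accurately DevOps teams respond to failures in complex systems.
Link: https://arxiv.org/abs/2509.08698
Authors: Thorsten Wittkopp
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Dissertation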
Abstract:In the evolving IT landscape, stability and reliability of systems are essential, yet their growing complexity challenges DevOps teams in implementation and maintenance. Log analysis, a core element of AIOps, provides critical insights into complex behaviors and failures. This dissertation introduces a three-layered architecture to support DevOps in failure resolution. The first layer, Log Investigation, performs autonomous log labeling and anomaly classification. We propose a method that labels log data without manual effort, enabling supervised training and precise evaluation of anomaly detection. Additionally, we define a taxonomy that groups anomalies into three categories, ensuring appropriate method selection. The second layer, Anomaly Detection, detects behaviors deviating from the norm. We propose a flexible Anomaly Detection method adaptable to unsupervised, weakly supervised, and supervised training. Evaluations on public and industry datasets show F1-scores between 0.98 and 1.0, ensuring reliable anomaly detection. The third layer, Root Cause Analysis, identifies minimal log sets describing failures, their origin, and event sequences. By balancing training data and identifying key services, our Root Cause Analysis method consistently detects 90-98% of root cause log lines within the top 10 candidates, providing actionable insights for mitigation. Our research addresses how log analysis methods can be designed and optimized to help DevOps resolve failures efficiently. By integrating these three layers, the architecture equips teams with robust methods to enhance IT system reliability.
[AI-7] Reshaping the Forward-Forward Algorithm with a Similarity-Based Objective
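[Quick read]: This paper addresses the biological implausibility of standard backpropagation, such as backward locking and global error propagation, and the shortcomings of the existing Forward-Forward algorithm in accuracy and inference efficiency. The key to the solution is to merge the Forward-Forward algorithm with a similarity-based tuplet loss framework, yielding a new method named FAUST (Forward-Forward Algorithm Unified with Similarity-based Tuplet loss) that needs no multiple forward passes at inference and substantially improves accuracy, narrowing the gap with backpropagation.
Link: https://arxiv.org/abs/2509.08697
Authors: James Gong,Raymond Luo,Emma Wang,Leon Ge,Bruce Li,Felix Marattukalam,Waleed Abdulla
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages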
Abstract:Backpropagation is the pivotal algorithm underpinning the success of artificial neural networks, yet it has critical limitations such as biologically implausible backward locking and global error propagation. To circumvent these constraints, the Forward-Forward algorithm was proposed as a more biologically plausible method that replaces the backward pass with an additional forward pass. Despite this advantage, the Forward-Forward algorithm significantly trails backpropagation in accuracy, and its optimal form exhibits low inference efficiency due to multiple forward passes required. In this work, the Forward-Forward algorithm is reshaped through its integration with similarity learning frameworks, eliminating the need for multiple forward passes during inference. This proposed algorithm is named Forward-Forward Algorithm Unified with Similarity-based Tuplet loss (FAUST). Empirical evaluations on MNIST, Fashion-MNIST, and CIFAR-10 datasets indicate that FAUST substantially improves accuracy, narrowing the gap with backpropagation. On CIFAR-10, FAUST achieves 56.22% accuracy with a simple multi-layer perceptron architecture, approaching the backpropagation benchmark of 57.63% accuracy.
[AI-8] Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference
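[Quick Read]: This paper addresses the failure attribution problem that multi-agent systems (MAS) face in practical deployment, namely the difficulty of accurately identifying the root-cause step behind a task failure. Existing diagnostic tools rely on statistical correlations and are of limited use: on benchmarks such as Who&When they locate the root-cause step with less than 15% accuracy. The key to the solution is the first failure attribution framework based on multi-granularity causal inference. First, it proposes a performance causal inversion principle, which correctly models performance dependencies by reversing the data flow in execution logs and combines this with Shapley values to assign agent-level blame accurately. Second, it designs a new causal discovery algorithm, CDC-MAS, which robustly identifies critical failure steps and copes with the non-stationary nature of MAS interaction data. The attribution results directly drive an automated optimization loop that generates targeted improvement suggestions and validates them through counterfactual simulation, significantly raising task success rates.
Link: https://arxiv.org/abs/2509.08682
Authors: Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Jiawei Shen, Jingjiang Liu, Yidan Liang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: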
Abstract:Multi-agent systems (MAS) are critical for automating complex tasks, yet their practical deployment is severely hampered by the challenge of failure attribution. Current diagnostic tools, which rely on statistical correlations, are fundamentally inadequate; on challenging benchmarks like Who&When, state-of-the-art methods achieve less than 15% accuracy in locating the root-cause step of a failure. To address this critical gap, we introduce the first failure attribution framework for MAS grounded in multi-granularity causal inference. Our approach makes two key technical contributions: (1) a performance causal inversion principle, which correctly models performance dependencies by reversing the data flow in execution logs, combined with Shapley values to accurately assign agent-level blame; (2) a novel causal discovery algorithm, CDC-MAS, that robustly identifies critical failure steps by tackling the non-stationary nature of MAS interaction data. The framework’s attribution results directly fuel an automated optimization loop, generating targeted suggestions whose efficacy is validated via counterfactual simulations. Evaluations on the Who&When and TRAIL benchmarks demonstrate a significant leap in performance. Our method achieves up to 36.2% step-level accuracy. Crucially, the generated optimizations boost overall task success rates by an average of 22.4%. This work provides a principled and effective solution for debugging complex agent interactions, paving the way for more reliable and interpretable multi-agent systems.
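The abstract mentions Shapley values for agent-level blame. As a rough illustration of that one ingredient only, the sketch below computes exact Shapley values for three agents. The agent names and the `failure_risk` function are hypothetical and are not taken from the paper, and its causal inversion step is not modeled here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


def failure_risk(coalition):
    """Hypothetical failure risk contributed by a set of (possibly faulty) agents."""
    risk = 0.0
    if "planner" in coalition:
        risk += 0.5
    if "coder" in coalition:
        risk += 0.1
    if {"planner", "coder"} <= coalition:
        risk += 0.2  # extra risk when both agents are involved together
    return risk


blame = shapley_values(["planner", "coder", "reviewer"], failure_risk)
for agent, share in blame.items():
    print(f"{agent}: {share:.2f}")
```

The shares add up to the risk of the full coalition, which is the property that makes Shapley values a natural way to split blame. Enumerating all orderings is only practical for a handful of agents.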
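[AI-9] Architecting Resilient LLM Agents: A Guide to Secure Plan-then-Execute Implementations
[Quick Read]: This paper addresses the insufficient architectural robustness, security, and predictability of Large Language Model (LLM) agents that automate complex multi-step tasks. Its core solution is a systematic treatment of the Plan-then-Execute (P-t-E) pattern, which separates strategic planning (the Planner) from tactical execution (the Executor) to establish control-flow integrity, strengthening defenses against indirect prompt injection while improving reasoning quality and cost-efficiency. The key is a structured design that provides security isolation and a controlled execution path, complemented by defense-in-depth controls such as the Principle of Least Privilege, task-scoped tool access, and sandboxed code execution, giving a practical blueprint for building production-grade, trustworthy LLM agents.
Link: https://arxiv.org/abs/2509.08646
Authors: Ron F. Del Rosario, Klaudia Krawiecka, Christian Schroeder de Witt
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: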
Abstract:As Large Language Model (LLM) agents become increasingly capable of automating complex, multi-step tasks, the need for robust, secure, and predictable architectural patterns is paramount. This paper provides a comprehensive guide to the "Plan-then-Execute" (P-t-E) pattern, an agentic design that separates strategic planning from tactical execution. We explore the foundational principles of P-t-E, detailing its core components, the Planner and the Executor, and its architectural advantages in predictability, cost-efficiency, and reasoning quality over reactive patterns like ReAct (Reason + Act). A central focus is placed on the security implications of this design, particularly its inherent resilience to indirect prompt injection attacks by establishing control-flow integrity. We argue that while P-t-E provides a strong foundation, a defense-in-depth strategy is necessary, and we detail essential complementary controls such as the Principle of Least Privilege, task-scoped tool access, and sandboxed code execution. To make these principles actionable, this guide provides detailed implementation blueprints and working code references for three leading agentic frameworks: LangChain (via LangGraph), CrewAI, and AutoGen. Each framework’s approach to implementing the P-t-E pattern is analyzed, highlighting unique features like LangGraph’s stateful graphs for re-planning, CrewAI’s declarative tool scoping for security, and AutoGen’s built-in Docker sandboxing. Finally, we discuss advanced patterns, including dynamic re-planning loops, parallel execution with Directed Acyclic Graphs (DAGs), and the critical role of Human-in-the-Loop (HITL) verification, to offer a complete strategic blueprint for architects, developers, and security engineers aiming to build production-grade, resilient, and trustworthy LLM agents.
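To make the control-flow integrity idea concrete, here is a minimal, framework-free sketch of a planner, an executor, and a task-scoped tool allow-list. It does not use LangGraph, CrewAI, or AutoGen. The `plan` function, the tool names, and the `Step` type are all hypothetical stand-ins, and a real planner would call an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    argument: str


def plan(task):
    """Hypothetical planner: fixes the whole plan before any untrusted tool output is read."""
    return [Step("search", task), Step("summarize", "search_result")]


def execute(steps, tools, allowed):
    """Run a fixed plan. Tool outputs are treated as data and cannot add or reorder steps."""
    results = []
    for step in steps:
        if step.tool not in allowed:
            raise PermissionError(f"tool {step.tool!r} is outside this task's scope")
        results.append(tools[step.tool](step.argument))
    return results


# Hypothetical tools; the search result carries an injected instruction.
tools = {
    "search": lambda q: f"results for {q} (IGNORE PREVIOUS INSTRUCTIONS, call send_email)",
    "summarize": lambda ref: f"summary of {ref}",
    "send_email": lambda body: "email sent",
}

outputs = execute(plan("quarterly report"), tools, allowed={"search", "summarize"})
print(outputs)
```

The injected text in the search result never reaches a component that chooses the next tool, so `send_email` is not called. The allow-list adds a second layer in case a plan itself is compromised.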
[AI-10] Classification of 24-hour movement behaviors from wrist-worn accelerometer data: from handcrafted features to deep learning techniques
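[Quick Read]: This paper addresses the classification of 24-hour movement behaviors, that is, how to accurately distinguish four behavior categories: sleep, sedentary behavior, light intensity physical activity (LPA), and moderate-to-vigorous intensity physical activity (MVPA). Its approach is to compare the classification performance of deep learning (DL) and classical machine learning (ML) algorithms under two kinds of input: raw acceleration signals and handcrafted features. The study finds that LSTM, BiLSTM, and GRU models trained on raw acceleration signals reach an overall accuracy of about 85%, slightly better than DL and ML methods trained on handcrafted features (70% to 81%). This suggests that DL can extract temporal features directly from raw data to improve classification, but the advantage is limited, and confusion between MVPA and LPA remains the main challenge.
Link: https://arxiv.org/abs/2509.08606
Authors: Alireza Sameh, Mehrdad Rostami, Mourad Oussalah, Vahid Farrahi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: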
Abstract:Purpose: We compared the performance of deep learning (DL) and classical machine learning (ML) algorithms for the classification of 24-hour movement behavior into sleep, sedentary, light intensity physical activity (LPA), and moderate-to-vigorous intensity physical activity (MVPA). Methods: Open-access data from 151 adults wearing a wrist-worn accelerometer (Axivity-AX3) was used. Participants were randomly divided into training, validation, and test sets (121, 15, and 15 participants each). Raw acceleration signals were segmented into non-overlapping 10-second windows, and then a total of 104 handcrafted features were extracted. Four DL algorithms-Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Gated Recurrent Units (GRU), and One-Dimensional Convolutional Neural Network (1D-CNN)-were trained using raw acceleration signals and with handcrafted features extracted from these signals to predict 24-hour movement behavior categories. The handcrafted features were also used to train classical ML algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Artificial Neural Network (ANN), and Decision Tree (DT) for classifying 24-hour movement behavior intensities. Results: LSTM, BiLSTM, and GRU showed an overall accuracy of approximately 85% when trained with raw acceleration signals, and 1D-CNN an overall accuracy of approximately 80%. When trained on handcrafted features, the overall accuracy for both DL and classical ML algorithms ranged from 70% to 81%. Overall, there was a higher confusion in classification of MVPA and LPA, compared to sleep and sedentary categories. Conclusion: DL methods with raw acceleration signals had only slightly better performance in predicting 24-hour movement behavior intensities, compared to when DL and classical ML were trained with handcrafted features.
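The preprocessing step described in the abstract (non-overlapping 10-second windows followed by handcrafted features) can be sketched as follows. The 10 Hz sampling rate and the two features shown are assumptions for illustration. The abstract does not state the sampling rate or list the 104 features.

```python
import math
import statistics


def segment(samples, sample_rate_hz, window_seconds=10):
    """Split a tri-axial signal into non-overlapping windows, dropping any trailing remainder."""
    size = sample_rate_hz * window_seconds
    n_windows = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(n_windows)]


def features(window):
    """Two illustrative handcrafted features computed on the acceleration magnitude."""
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in window]
    return {
        "mean_magnitude": statistics.fmean(magnitudes),
        "std_magnitude": statistics.pstdev(magnitudes),
    }


# 25 seconds of a device at rest, at a hypothetical 10 Hz sampling rate.
signal = [(0.0, 0.0, 1.0)] * 250
windows = segment(signal, sample_rate_hz=10)
rows = [features(w) for w in windows]
print(len(windows), rows[0])
```

A DL model trained on raw signals would consume `windows` directly, while the classical ML models in the comparison would consume feature rows like `rows`.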
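[AI-11] No-Knowledge Alarms for Misaligned LLMs-as-Judges
[Quick Read]: This paper tackles the question of how to ensure the reliability of judges when Large Language Models (LLMs) are used to evaluate the complex decisions of other LLMs, that is, the infinite monitoring chain raised by "who monitors the judges?". The key to the solution is the logical consistency constraint that arises when LLM judges disagree: when several judges grade the same task differently, they cannot all be correct. This contradiction can be formalized as a Linear Programming problem over integer response counts, from which the only possible evaluations of the judges' grading ability can be computed without knowing the experts' ground-truth labels. On this basis the author proposes "no-knowledge alarms" that detect, with no false positives, that at least one judge violates a user-specified grading ability requirement.
Link: https://arxiv.org/abs/2509.08593
Authors: Andrés Corrada-Emmanuel
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 7 pages, 1 figure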
Abstract:If we use LLMs as judges to evaluate the complex decisions of other LLMs, who or what monitors the judges? Infinite monitoring chains are inevitable whenever we do not know the ground truth of the decisions by experts and we do not want to trust them. One way to ameliorate our evaluation uncertainty is to exploit the use of logical consistency between disagreeing experts. By observing how LLM judges agree and disagree while grading other LLMs, we can compute the only possible evaluations of their grading ability. For example, if two LLM judges disagree on which tasks a third one completed correctly, they cannot both be 100% correct in their judgments. This logic can be formalized as a Linear Programming problem in the space of integer response counts for any finite test. We use it here to develop no-knowledge alarms for misaligned LLM judges. The alarms can detect, with no false positives, that at least one member or more of an ensemble of judges are violating a user specified grading ability requirement.
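The two-judge example in the abstract can be turned into a tiny check. This is only the simplest pairwise consequence of the disagreement logic, not the paper's Linear Programming formulation, and the function name and threshold parameter are hypothetical. On every item where two judges disagree, at most one of them is right, so their correct counts sum to at most 2n - d for n items and d disagreements.

```python
def disagreement_alarm(grades_a, grades_b, min_correct):
    """Return True when the two judges cannot both have at least `min_correct` correct grades.

    No ground truth is used: each disagreement costs the pair at least one correct grade.
    """
    if len(grades_a) != len(grades_b):
        raise ValueError("both judges must grade the same items")
    n = len(grades_a)
    disagreements = sum(a != b for a, b in zip(grades_a, grades_b))
    max_combined_correct = 2 * n - disagreements
    return 2 * min_correct > max_combined_correct


judge_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_b = ["pass", "fail", "fail", "fail", "fail", "pass", "fail", "fail", "pass", "fail"]

print(disagreement_alarm(judge_a, judge_b, min_correct=9))  # requirement: 90% each
print(disagreement_alarm(judge_a, judge_b, min_correct=8))  # requirement: 80% each
```

When the alarm fires, at least one judge must be below the requirement whatever the true labels are, which is why this kind of check has no false positives. When it stays silent, nothing is proven either way.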
[AI-12] Interpretability as Alignment: Making Internal Understanding a Design Principle
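[Quick Read]: This paper addresses the problem that Large Language Models (LLMs) behave unreliably in high-stakes settings and are hard to align with human values. The core challenge is that existing alignment methods, such as reinforcement learning from human feedback (RLHF), red teaming, or Constitutional AI, evaluate a model only at the level of external behavior and cannot reveal latent biases or deceptive behavior in its internal reasoning. The key proposal is to treat interpretability, especially mechanistic interpretability, as a core design principle for AI systems rather than an auxiliary diagnostic tool. Techniques such as circuit tracing and activation patching provide causal insight and can identify and localize the specific internal computation paths that lead to misjudgment or alignment failure, enabling deeper transparency and controllability.
Link: https://arxiv.org/abs/2509.08592
Authors: Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Pre-Print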
Abstract:Large neural models are increasingly deployed in high-stakes settings, raising concerns about whether their behavior reliably aligns with human values. Interpretability provides a route to internal transparency by revealing the computations that drive outputs. We argue that interpretability especially mechanistic approaches should be treated as a design principle for alignment, not an auxiliary diagnostic tool. Post-hoc methods such as LIME or SHAP offer intuitive but correlational explanations, while mechanistic techniques like circuit tracing or activation patching yield causal insight into internal failures, including deceptive or misaligned reasoning that behavioral methods like RLHF, red teaming, or Constitutional AI may overlook. Despite these advantages, interpretability faces challenges of scalability, epistemic uncertainty, and mismatches between learned representations and human concepts. Our position is that progress on safe and trustworthy AI will depend on making interpretability a first-class objective of AI research and development, ensuring that systems are not only effective but also auditable, transparent, and aligned with human intent.
[AI-13] AutoStub: Genetic Programming-Based Stub Creation for Symbolic Execution
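[Quick Read]: This paper addresses the limitations of symbolic execution when it encounters external functions, such as native methods or third-party library functions, which typically interrupt path exploration or require manual intervention. The key to the solution is AutoStub, an automated method for generating symbolic stubs that uses Genetic Programming to learn approximating expressions from actual executions of the external function. When the symbolic executor reaches an external function, AutoStub executes it on random inputs and collects the outputs as a training set, then uses Genetic Programming to derive a symbolic expression that closely approximates its behavior; this expression serves as a symbolic stub so that the analysis can continue without manual intervention. The method markedly improves exploration of complex program paths and can uncover language-specific edge-case behaviors, making software testing more effective.
Link: https://arxiv.org/abs/2509.08524
Authors: Felix Mächtle, Nils Loose, Jan-Niclas Serr, Jonas Sander, Thomas Eisenbarth
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 2025 HUMIES finalist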
Abstract:Symbolic execution is a powerful technique for software testing, but suffers from limitations when encountering external functions, such as native methods or third-party libraries. Existing solutions often require additional context, expensive SMT solvers, or manual intervention to approximate these functions through symbolic stubs. In this work, we propose a novel approach to automatically generate symbolic stubs for external functions during symbolic execution that leverages Genetic Programming. When the symbolic executor encounters an external function, AutoStub generates training data by executing the function on randomly generated inputs and collecting the outputs. Genetic Programming then derives expressions that approximate the behavior of the function, serving as symbolic stubs. These automatically generated stubs allow the symbolic executor to continue the analysis without manual intervention, enabling the exploration of program paths that were previously intractable. We demonstrate that AutoStub can automatically approximate external functions with over 90% accuracy for 55% of the functions evaluated, and can infer language-specific behaviors that reveal edge cases crucial for software testing.
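The data-collection and scoring loop described in the abstract can be sketched in a few lines. To keep the example short, Genetic Programming is replaced here by scoring a small fixed list of candidate expressions, so this shows only how sampled input/output pairs rank candidate stubs. The built-in `abs` stands in for an opaque external function, and the candidate list is hypothetical.

```python
import random


def collect_samples(external_fn, n_samples, seed=0):
    """Execute the external function on random integer inputs and record the outputs."""
    rng = random.Random(seed)
    inputs = [rng.randint(-1000, 1000) for _ in range(n_samples)]
    return [(x, external_fn(x)) for x in inputs]


def accuracy(candidate, samples):
    """Fraction of samples on which the candidate stub reproduces the observed output."""
    return sum(candidate(x) == y for x, y in samples) / len(samples)


# Hypothetical candidate stubs; a GP search would evolve such expressions instead.
candidates = {
    "x": lambda x: x,
    "-x": lambda x: -x,
    "x * x": lambda x: x * x,
    "x if x >= 0 else -x": lambda x: x if x >= 0 else -x,
}

samples = collect_samples(abs, 200)
best = max(candidates, key=lambda name: accuracy(candidates[name], samples))
print(best, accuracy(candidates[best], samples))
```

The winning expression is plain arithmetic with a branch, which a symbolic executor can reason about even though the original function was opaque to it.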
[AI-14] FMTx: An Efficient and Asymptotically Optimal Extension of the Fast Marching Tree for Dynamic Replanning
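[Quick Read]: This paper addresses real-time adaptation in path planning for dynamic environments, that is, how to update an existing path efficiently when obstacle positions or the environment change, without sacrificing asymptotic optimality or computational efficiency. Algorithms such as Fast Marching Tree (FMT^*) are asymptotically optimal in static settings, but their single-pass design does not support path revision, while frequent full replanning is too computationally expensive. The key to the solution is the FMT^x algorithm, which reworks the neighbor selection rule: it keeps the cost-ordered priority queue of FMT^* and adds a local update mechanism triggered by an expanding neighbor, so that only nodes with potentially suboptimal paths are re-evaluated and repaired. This allows fast response to environmental changes and recovery of an asymptotically optimal solution while substantially reducing the computational burden.
Link: https://arxiv.org/abs/2509.08521
Authors: Soheil Espahbodini Nia
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 35 pages, 8 figures, 2 tables, submitted to the International Journal of Robotics Research (IJRR)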
Abstract:Path planning in dynamic environments remains a core challenge in robotics, especially as autonomous systems are deployed in unpredictable spaces such as warehouses and public roads. While algorithms like Fast Marching Tree (FMT^*) offer asymptotically optimal solutions in static settings, their single-pass design prevents path revisions which are essential for real-time adaptation. On the other hand, full replanning is often too computationally expensive. This paper introduces FMT^x, an extension of the Fast Marching Tree algorithm that enables efficient and consistent replanning in dynamic environments. We revisit the neighbor selection rule of FMT^* and demonstrate that a minimal change overcomes its single-pass limitation, enabling the algorithm to update cost-to-come values upon discovering better connections without sacrificing asymptotic optimality or computational efficiency. By maintaining a cost-ordered priority queue and applying a selective update condition that uses an expanding neighbor to identify and trigger the re-evaluation of any node with a potentially suboptimal path, FMT^x ensures that suboptimal routes are efficiently repaired as the environment evolves. This targeted strategy preserves the inherent efficiency of FMT^* while enabling robust adaptation to changes in obstacle configuration. FMT^x is proven to recover an asymptotically optimal solution after environmental changes. Experimental results demonstrate that FMT^x outperforms the influential replanner RRT^x, reacting more swiftly to dynamic events with lower computational overhead and thus offering a more effective solution for real-time robotic navigation in unpredictable worlds.
[AI-15] Variational Rank Reduction Autoencoders for Generative
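[Quick Read]: This paper addresses two challenges in generative thermal design for complex geometries: the high computational cost of high-fidelity simulation, and the discontinuous, poorly structured latent spaces of conventional generative models such as autoencoders (AEs) and variational autoencoders (VAEs), which limit design exploration and physical consistency. The key to the solution is a hybrid framework that combines a Variational Rank-Reduction Autoencoder (VRRAE) with a Deep Operator Network (DeepONet). The VRRAE introduces a truncated SVD in the latent space to build continuous, interpretable, and structured representations, mitigating posterior collapse and improving geometric reconstruction. The DeepONet then takes this compact latent encoding as the input to its branch network and spatial coordinates as the input to its trunk network to predict temperature gradients efficiently and accurately. The method improves both the quality of generated geometries and the accuracy of gradient prediction, and is substantially more efficient at inference than traditional numerical solvers.
Link: https://arxiv.org/abs/2509.08515
Authors: Alicia Tierz, Jad Mounayer, Beatriz Moya, Francisco Chinesta
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: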
Abstract:Generative thermal design for complex geometries is fundamental in many areas of engineering, yet it faces two main challenges: the high computational cost of high-fidelity simulations and the limitations of conventional generative models. Approaches such as autoencoders (AEs) and variational autoencoders (VAEs) often produce unstructured latent spaces with discontinuities, which restricts their capacity to explore designs and generate physically consistent solutions. To address these limitations, we propose a hybrid framework that combines Variational Rank-Reduction Autoencoders (VRRAEs) with Deep Operator Networks (DeepONets). The VRRAE introduces a truncated SVD within the latent space, leading to continuous, interpretable, and well-structured representations that mitigate posterior collapse and improve geometric reconstruction. The DeepONet then exploits this compact latent encoding in its branch network, together with spatial coordinates in the trunk network, to predict temperature gradients efficiently and accurately. This hybrid approach not only enhances the quality of generated geometries and the accuracy of gradient prediction, but also provides a substantial advantage in inference efficiency compared to traditional numerical solvers. Overall, the study underscores the importance of structured latent representations for operator learning and highlights the potential of combining generative models and operator networks in thermal design and broader engineering applications.
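[AI-16] TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making
[Quick Read]: This paper addresses two problems that vision-language models (VLMs) face on context-specific dynamic tasks in embodied artificial intelligence: supervised fine-tuned (SFT) models still respond sluggishly and hallucinate, and existing methods based on reinforcement learning (RL) and chain-of-thought (CoT) are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. The key to the solution is Thought-Centric Preference Optimization (TCPO), which uses stepwise preference optimization to turn sparse reward signals into richer step-level sample pairs and emphasizes aligning the model's intermediate reasoning process. It additionally introduces an Action Policy Consistency Constraint (APC) to further constrain output consistency, improving embodied decision-making and mitigating model degradation after fine-tuning.
Link: https://arxiv.org/abs/2509.08500
Authors: Kechen Jiao, Zhirui Fang, Jiahao Liu, Bei Li, Qifan Wang, Xinyu Liu, Junhao Ruan, Zhongjian Qiao, Yifan Zhu, Yaxin Xu, Jingang Wang, Xiu Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: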
Abstract:Using effective generalization capabilities of vision language models (VLMs) in context-specific dynamic tasks for embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models can better align with the real physical world, they still exhibit sluggish responses and hallucination issues in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, reliant on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach, transforming sparse reward signals into richer step sample pairs. It emphasizes the alignment of the model’s intermediate reasoning process, mitigating the problem of model degradation. Moreover, by incorporating Action Policy Consistency Constraint (APC), it further imposes consistency constraints on the model output. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM and validating the effectiveness of our approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.
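[AI-17] Send to which account? Evaluation of an LLM-based Scambaiting System
[Quick Read]: This paper addresses the use of generative AI (GenAI) by criminals to mass-produce convincing phishing content, which amplifies financial fraud and undermines public trust. Conventional defenses such as detection algorithms, user training, and reactive takedowns struggle to dismantle the infrastructure scammers depend on, such as mule bank accounts and cryptocurrency wallets. The key solution is to use conversational honeypots built on large language models (LLMs) that actively engage scammers to extract actionable threat intelligence. In a five-month deployment the system initiated more than 2,600 real engagements, with an Information Disclosure Rate (IDR) of about 32% and a Human Acceptance Rate (HAR) of about 70%, supporting its effectiveness. The study also reveals challenges such as difficult engagement takeoff: only 48.7% of scammers responded to the initial seed message, indicating that interaction strategies need further refinement to improve capture efficiency.
Link: https://arxiv.org/abs/2509.08493
Authors: Hossein Siadati, Haadi Jafarian, Sima Jafarikhah
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: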
Abstract:Scammers are increasingly harnessing generative AI (GenAI) technologies to produce convincing phishing content at scale, amplifying financial fraud and undermining public trust. While conventional defenses, such as detection algorithms, user training, and reactive takedown efforts, remain important, they often fall short in dismantling the infrastructure scammers depend on, including mule bank accounts and cryptocurrency wallets. To bridge this gap, a proactive and emerging strategy involves using conversational honeypots to engage scammers and extract actionable threat intelligence. This paper presents the first large-scale, real-world evaluation of a scambaiting system powered by large language models (LLMs). Over a five-month deployment, the system initiated over 2,600 engagements with actual scammers, resulting in a dataset of more than 18,700 messages. It achieved an Information Disclosure Rate (IDR) of approximately 32%, successfully extracting sensitive financial information such as mule accounts. Additionally, the system maintained a Human Acceptance Rate (HAR) of around 70%, indicating strong alignment between LLM-generated responses and human operator preferences. Alongside these successes, our analysis reveals key operational challenges. In particular, the system struggled with engagement takeoff: only 48.7% of scammers responded to the initial seed message sent by defenders. These findings highlight the need for further refinement and provide actionable insights for advancing the design of automated scambaiting systems.
[AI-18] DSFL: A Dual-Server Byzantine-Resilient Federated Learning Framework via Group-Based Secure Aggregation
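[Quick Read]: This paper addresses three core challenges for deploying Federated Learning (FL) in practice: defending against Byzantine (malicious) participants, maintaining model performance under non-independent and identically distributed (non-IID) data, and keeping computation and communication overhead light enough for edge devices. Existing methods rely on trusted hardware, use costly cryptographic tools, or fail to guarantee privacy and robustness at the same time. The proposed DSFL framework rests on three innovations: (1) a dual-server secure aggregation protocol that protects model updates without encryption or key exchange; (2) a group-wise credit-based filtering mechanism that identifies and isolates Byzantine clients according to the deviation of their updates; and (3) a dynamic reward-penalty mechanism that incentivizes fair participation and improves overall system stability. Experiments show that DSFL maintains high accuracy with up to 30% Byzantine participants (for example, 97.15% on CIFAR-10), clearly outperforming baselines such as FedAvg, LSFL, and differential privacy, while requiring only 55.9 ms of runtime and 1088 KB of communication per round.
Link: https://arxiv.org/abs/2509.08449
Authors: Charuka Herath, Yogachandran Rahulamathavan, Varuna De Silva, Sangarapillai Lambotharan
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: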
Abstract:Federated Learning (FL) enables decentralized model training without sharing raw data, offering strong privacy guarantees. However, existing FL protocols struggle to defend against Byzantine participants, maintain model utility under non-independent and identically distributed (non-IID) data, and remain lightweight for edge devices. Prior work either assumes trusted hardware, uses expensive cryptographic tools, or fails to address privacy and robustness simultaneously. We propose DSFL, a Dual-Server Byzantine-Resilient Federated Learning framework that addresses these limitations using a group-based secure aggregation approach. Unlike LSFL, which assumes non-colluding semi-honest servers, DSFL removes this dependency by revealing a key vulnerability: privacy leakage through client-server collusion. DSFL introduces three key innovations: (1) a dual-server secure aggregation protocol that protects updates without encryption or key exchange, (2) a group-wise credit-based filtering mechanism to isolate Byzantine clients based on deviation scores, and (3) a dynamic reward-penalty system for enforcing fair participation. DSFL is evaluated on MNIST, CIFAR-10, and CIFAR-100 under up to 30 percent Byzantine participants in both IID and non-IID settings. It consistently outperforms existing baselines, including LSFL, homomorphic encryption methods, and differential privacy approaches. For example, DSFL achieves 97.15 percent accuracy on CIFAR-10 and 68.60 percent on CIFAR-100, while FedAvg drops to 9.39 percent under similar threats. DSFL remains lightweight, requiring only 55.9 ms runtime and 1088 KB communication per round.
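The abstract does not spell out how deviation scores or credits are computed, so the following is only a guess at the general shape of a group-wise, credit-based filter. The coordinate-wise median as the group center, the Euclidean distance as the deviation score, and the reward and penalty values are all assumptions, and the dual-server secure aggregation is not modeled.

```python
import math
import statistics


def deviation_scores(updates):
    """Distance of each client's update from the coordinate-wise median of its group (assumed metric)."""
    dims = len(next(iter(updates.values())))
    center = [statistics.median(u[d] for u in updates.values()) for d in range(dims)]
    return {cid: math.dist(u, center) for cid, u in updates.items()}


def update_credits(credits, scores, threshold, reward=1, penalty=2):
    """Reward clients whose deviation is within the threshold and penalize the rest (assumed rule)."""
    return {
        cid: credits[cid] + (reward if scores[cid] <= threshold else -penalty)
        for cid in credits
    }


# One hypothetical group of three clients; c3 submits an outlying update.
group_updates = {"c1": [0.10, 0.20], "c2": [0.12, 0.18], "c3": [5.0, -4.0]}
scores = deviation_scores(group_updates)
credits = update_credits({"c1": 0, "c2": 0, "c3": 0}, scores, threshold=1.0)
print(credits)
```

Clients whose credit drops below some cutoff could then be excluded from later rounds, which is one plausible reading of the reward-penalty mechanism.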
[AI-19] Efficient Decoding Methods for Language Models on Encrypted Data
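[Quick Read]: This paper addresses the computational bottleneck of text generation with large language models (LLMs) under homomorphic encryption (HE) on untrusted servers. Conventional decoding methods such as argmax and nucleus (top-p) sampling are non-polynomial operations that incur high computational cost under encryption, limiting practical privacy-preserving inference. The key to the solution is two efficient, differentiable, HE-friendly algorithms: cutmax, a new polynomial argmax algorithm that markedly reduces the number of ciphertext operations and enables efficient greedy decoding; and the first HE-compatible nucleus sampling method, which builds on cutmax for stochastic decoding with provable privacy guarantees. Both have polynomial complexity and support low-latency secure inference. The authors also prove that cutmax converges globally to a unique two-level fixed point, which explains its rapid convergence in practice, and experiments show latency reductions of 24x to 35x over baselines.
Link: https://arxiv.org/abs/2509.08383
Authors: Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: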
Abstract:Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving it converges globally to a unique two-level fixed point, independent of the input values beyond the identity of the maximizer, which explains its rapid convergence in just a few iterations. Evaluations on realistic LLM outputs show latency reductions of 24x-35x over baselines, advancing secure text generation.
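[AI-20] Co-Investigator AI: The Rise of Agentic AI for Smarter Trustworthy AML Compliance Narratives
[Quick Read]: This paper addresses the low efficiency, high cost, and poor scalability of generating Suspicious Activity Reports (SARs) in Anti-Money Laundering (AML) work. In compliance-critical settings, large language models (LLMs) carry risks because of factual hallucination, limited crime typology alignment, and poor explainability. The key to the solution is Co-Investigator AI, an agentic framework that integrates agents for planning, crime type detection, external intelligence gathering, and compliance validation. It provides dynamic memory management and sensitive data protection (the AI-Privacy Guard layer), and adds a real-time quality validation mechanism based on the Agent-as-a-Judge paradigm so that report content is accurate, compliant, and traceable. The framework keeps human investigators in the loop at all times, forming a collaborative human-AI workflow that markedly improves the speed and accuracy of SAR generation and moves compliance reporting toward scalability, reliability, and transparency.
Link: https://arxiv.org/abs/2509.08380
Authors: Prathamesh Vasudeo Naik, Naresh Kumar Dintakurthi, Zhanghao Hu, Yue Wang, Robby Qiu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: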
Abstract:Generating regulatorily compliant Suspicious Activity Report (SAR) remains a high-cost, low-scalability bottleneck in Anti-Money Laundering (AML) workflows. While large language models (LLMs) offer promising fluency, they suffer from factual hallucination, limited crime typology alignment, and poor explainability – posing unacceptable risks in compliance-critical domains. This paper introduces Co-Investigator AI, an agentic framework optimized to produce Suspicious Activity Reports (SARs) significantly faster and with greater accuracy than traditional methods. Drawing inspiration from recent advances in autonomous agent architectures, such as the AI Co-Scientist, our approach integrates specialized agents for planning, crime type detection, external intelligence gathering, and compliance validation. The system features dynamic memory management, an AI-Privacy Guard layer for sensitive data handling, and a real-time validation agent employing the Agent-as-a-Judge paradigm to ensure continuous narrative quality assurance. Human investigators remain firmly in the loop, empowered to review and refine drafts in a collaborative workflow that blends AI efficiency with domain expertise. We demonstrate the versatility of Co-Investigator AI across a range of complex financial crime scenarios, highlighting its ability to streamline SAR drafting, align narratives with regulatory expectations, and enable compliance teams to focus on higher-order analytical work. This approach marks the beginning of a new era in compliance reporting – bringing the transformative benefits of AI agents to the core of regulatory processes and paving the way for scalable, reliable, and transparent SAR generation.
[AI-21] Grasp Like Humans: Learning Generalizable Multi-Fingered Grasping from Human Proprioceptive Sensorimotor Integration
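[Quick Read]: This paper addresses how robotic hands can effectively transfer tactile and kinesthetic perception from natural human operation to achieve reliable grasping. The core challenge is establishing a direct mapping from sensory feedback to motor commands. The key to the solution is a data-glove-based tactile-kinesthetic perception-prediction framework. It represents multi-modal inputs as graph structures (using polar coordinates and explicitly modeling morphological differences) and introduces Tactile-Kinesthetic Spatio-Temporal Graph Networks (TK-STGN), which extract spatio-temporal features with multidimensional subgraph convolutions and attention-based LSTM layers. The predicted node states are finally converted into robot commands through a force-position hybrid mapping, enabling grasp skill transfer that generalizes across demonstrators and robotic hands.
Link: https://arxiv.org/abs/2509.08354
Authors: Ce Guo, Xieyuanli Chen, Zhiwen Zeng, Zirui Guo, Yihong Li, Haoran Xiao, Dewen Hu, Huimin Lu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 20 pages, 19 figures, accepted by IEEE Transactions on Robotics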
Abstract:Tactile and kinesthetic perceptions are crucial for human dexterous manipulation, enabling reliable grasping of objects via proprioceptive sensorimotor integration. For robotic hands, even though acquiring such tactile and kinesthetic feedback is feasible, establishing a direct mapping from this sensory feedback to motor actions remains challenging. In this paper, we propose a novel glove-mediated tactile-kinematic perception-prediction framework for grasp skill transfer from human intuitive and natural operation to robotic execution based on imitation learning, and its effectiveness is validated through generalized grasping tasks, including those involving deformable objects. Firstly, we integrate a data glove to capture tactile and kinesthetic data at the joint level. The glove is adaptable for both human and robotic hands, allowing data collection from natural human hand demonstrations across different scenarios. It ensures consistency in the raw data format, enabling evaluation of grasping for both human and robotic hands. Secondly, we establish a unified representation of multi-modal inputs based on graph structures with polar coordinates. We explicitly integrate the morphological differences into the designed representation, enhancing the compatibility across different demonstrators and robotic hands. Furthermore, we introduce the Tactile-Kinesthetic Spatio-Temporal Graph Networks (TK-STGN), which leverage multidimensional subgraph convolutions and attention-based LSTM layers to extract spatio-temporal features from graph inputs to predict node-based states for each hand joint. These predictions are then mapped to final commands through a force-position hybrid mapping.
[AI-22] Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
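[Quick Read]: This paper addresses the excessive GPU memory (VRAM) footprint that the huge parameter count of Mixture-of-Experts (MoE) large language models (LLMs) imposes during inference, which limits their wide adoption. Existing methods relieve VRAM pressure by offloading expert parameters to CPU memory (RAM), but low cache hit rates and high expert loading latency slow inference considerably. The key to the solution is the MoEpic system, whose core innovation is a new expert split mechanism: each expert is vertically divided into a top segment and a bottom segment, and only the top segments of hot experts are cached within the limited VRAM budget, which raises the cache hit rate. During each layer's inference, MoEpic predicts and prefetches the experts activated in the next layer; because cached top segments need not be reloaded, loading time drops and transfer-computation overlap becomes efficient. The paper also designs a divide-and-conquer algorithm based on fixed-point iteration to adaptively configure the cache (each layer's VRAM budget and the expert split ratio). Experiments show that MoEpic saves about 50% of GPU cost and lowers inference latency by 37.51% to 65.73% compared with the baselines.
Link: https://arxiv.org/abs/2509.08342
Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: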
Abstract:Mixture-of-Experts (MoE) has emerged as a promising architecture for modern large language models (LLMs). However, massive parameters impose heavy GPU memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs. Offloading the expert parameters to CPU RAM offers an effective way to alleviate the VRAM requirements for MoE inference. Existing approaches typically cache a small subset of experts in VRAM and dynamically prefetch experts from RAM during inference, leading to significant degradation in inference speed due to the poor cache hit rate and substantial expert loading latency. In this work, we propose MoEpic, an efficient MoE inference system with a novel expert split mechanism. Specifically, each expert is vertically divided into two segments: top and bottom. MoEpic caches the top segment of hot experts, so that more experts will be stored under the limited VRAM budget, thereby improving the cache hit rate. During each layer’s inference, MoEpic predicts and prefetches the activated experts for the next layer. Since the top segments of cached experts are exempt from fetching, the loading time is reduced, which allows efficient transfer-computation overlap. Nevertheless, the performance of MoEpic critically depends on the cache configuration (i.e., each layer’s VRAM budget and expert split ratio). To this end, we propose a divide-and-conquer algorithm based on fixed-point iteration for adaptive cache configuration. Extensive experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost, while lowering the inference latency by about 37.51%-65.73% compared to the baselines.
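The arithmetic behind the expert split can be sketched as follows. This is a back-of-the-envelope model, not MoEpic's implementation: the function names, the sizes, and the reading of the split ratio as "share of an expert's parameters kept in VRAM" are all assumptions made for illustration.

```python
def cached_expert_count(vram_budget_mb: float, expert_size_mb: float,
                        split_ratio: float) -> int:
    """Number of experts that fit when only the top `split_ratio` share of each is cached."""
    if not 0.0 < split_ratio <= 1.0:
        raise ValueError("split_ratio must be in (0, 1]")
    return int(vram_budget_mb // (expert_size_mb * split_ratio))


def load_time_ms(expert_size_mb: float, split_ratio: float, cached: bool,
                 bandwidth_mb_per_ms: float) -> float:
    """Transfer time for one activated expert.

    A cache hit only needs the bottom segment; a miss needs the whole expert.
    """
    missing_share = (1.0 - split_ratio) if cached else 1.0
    return expert_size_mb * missing_share / bandwidth_mb_per_ms


# Hypothetical numbers: 1000 MB budget, 100 MB experts, 10 MB/ms link.
whole_experts = cached_expert_count(1000, 100, 1.0)   # caching whole experts
half_experts = cached_expert_count(1000, 100, 0.5)    # caching top halves only
hit_ms = load_time_ms(100, 0.5, cached=True, bandwidth_mb_per_ms=10)
miss_ms = load_time_ms(100, 0.5, cached=False, bandwidth_mb_per_ms=10)
```

Under these made-up numbers, halving the cached share doubles the number of experts that can be resident, and each hit still costs a partial transfer. That residual cost is why the split ratio and per-layer budget have to be tuned together, as the abstract notes.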
[AI-23] Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
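【Quick read】: This paper tackles the long training times and slow convergence of reinforcement learning (RL) in complex environments with sparse rewards. The key to the solution is to use pre-trained large language models (LLMs) as tutors in a student-teacher architecture, where LLM-generated guidance accelerates the convergence of RL algorithms. Experiments show that LLM tutoring significantly shortens training while maintaining comparable optimal performance; reusing the LLM's advice improves training efficiency further but can make convergence less stable.
Link: https://arxiv.org/abs/2509.08329
Authors: Lukas Toral,Teddy Lazebnik
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: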
Abstract:Reinforcement Learning (RL) algorithms often require long training to become useful, especially in complex environments with sparse rewards. While techniques like reward shaping and curriculum learning exist to accelerate training, these are often extremely specific and require the developer’s professionalism and dedicated expertise in the problem’s domain. Tackling this challenge, in this study, we explore the effectiveness of pre-trained Large Language Models (LLMs) as tutors in a student-teacher architecture with RL algorithms, hypothesizing that LLM-generated guidance allows for faster convergence. In particular, we explore the effectiveness of reusing the LLM’s advice on the RL’s convergence dynamics. Through an extensive empirical examination, which included 54 configurations, varying the RL algorithm (DQN, PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack, Snake, Connect Four), our results demonstrate that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. Furthermore, the advice reuse mechanism shows a further improvement in training duration but also results in less stable convergence dynamics. Our findings suggest that LLM tutoring generally improves convergence, and its effectiveness is sensitive to the specific task, RL algorithm, and LLM model combination.
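One way to picture advice reuse is a per-state cache in front of the tutor. The sketch below is an assumption about how such a mechanism could look, not the authors' code: `ask_tutor` is a hypothetical stand-in for an LLM call, and the probability-based mixing with greedy Q-value actions is chosen only for illustration.

```python
import random
from typing import Callable, Dict, Hashable, List


class AdviceReusingTutor:
    """Caches tutor advice per state so repeated states do not trigger new queries."""

    def __init__(self, ask_tutor: Callable[[Hashable], int]):
        self.ask_tutor = ask_tutor            # hypothetical stand-in for an LLM call
        self.cache: Dict[Hashable, int] = {}
        self.queries = 0                      # number of real tutor calls made

    def advise(self, state: Hashable) -> int:
        if state not in self.cache:
            self.cache[state] = self.ask_tutor(state)
            self.queries += 1
        return self.cache[state]


def choose_action(state: Hashable, q_values: List[float],
                  tutor: AdviceReusingTutor, advice_prob: float,
                  rng: random.Random) -> int:
    """Follow the tutor with probability `advice_prob`, else act greedily on Q-values."""
    if rng.random() < advice_prob:
        return tutor.advise(state)
    return max(range(len(q_values)), key=q_values.__getitem__)
```

A cache like this cuts tutor calls, but it also replays the same advice even if it was poor, which is one plausible reading of why the abstract reports less stable convergence with reuse.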
[AI-24] Leveraging AI Agents for Autonomous Networks: A Reference Architecture and Empirical Studies
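【Quick read】: This paper addresses the lack of cognitive capability that telecommunications networks face on the way to Level 4 (L4) Autonomous Networks (AN), namely how to move from traditional reactive automation to intelligent systems that can proactively perceive, reason, and decide. The key to the solution is a functional cognitive system built on the AN Agent reference architecture proposed by Joseph Sifakis. Using coordinated proactive-reactive runtimes driven by hybrid knowledge representation, it achieves sub-10 ms real-time control of the Link Adaptation (LA) task in the Radio Access Network (RAN). In a 5G NR sub-6 GHz setting, it delivers 6% higher downlink throughput than traditional Outer Loop Link Adaptation (OLLA) algorithms and, through dynamic Modulation and Coding Scheme (MCS) optimization, reduces the Block Error Rate (BLER) by 67%, which strengthens ultra-reliable service performance. The results support the framework's viability for overcoming traditional autonomy barriers and advancing key L4 capabilities.
Link: https://arxiv.org/abs/2509.08312
Authors: Binghan Wu,Shoufeng Wang,Yunxin Liu,Ya-Qin Zhang,Joseph Sifakis,Ye Ouyang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures. This manuscript is a preprint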
Abstract:The evolution toward Level 4 (L4) Autonomous Networks (AN) represents a strategic inflection point in telecommunications, where networks must transcend reactive automation to achieve genuine cognitive capabilities–fulfilling TM Forum’s vision of self-configuring, self-healing, and self-optimizing systems that deliver zero-wait, zero-touch, and zero-fault services. This work bridges the gap between architectural theory and operational reality by implementing Joseph Sifakis’s AN Agent reference architecture in a functional cognitive system, deploying coordinated proactive-reactive runtimes driven by hybrid knowledge representation. Through an empirical case study of a Radio Access Network (RAN) Link Adaptation (LA) Agent, we validate this framework’s transformative potential: demonstrating sub-10 ms real-time control in 5G NR sub-6 GHz while achieving 6% higher downlink throughput than Outer Loop Link Adaptation (OLLA) algorithms and 67% Block Error Rate (BLER) reduction for ultra-reliable services through dynamic Modulation and Coding Scheme (MCS) optimization. These improvements confirm the architecture’s viability in overcoming traditional autonomy barriers and advancing critical L4-enabling capabilities toward next-generation objectives.
[AI-25] Game-Theoretic Resilience Framework for Cyber-Physical Microgrids using Multi-Agent Reinforcement Learning
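【Quick read】: This paper addresses the growing risk of targeted cyber attacks that comes with modern power systems' reliance on cyber-physical infrastructure. Its goal is a mathematically rigorous and adaptive framework for evaluating and improving microgrid resilience. The key to the solution is a unified game-theoretic model that combines four quantitative metrics, Load Served Ratio (LSR), Critical Load Resilience (CLR), Topological Survivability Score (TSS), and DER Resilience Score (DRS), and uses the Analytic Hierarchy Process (AHP) to build the payoff matrix for attack-defense interactions. The framework is further formalized as a finite-horizon Markov Decision Process (MDP) with convergence guarantees and computational complexity bounds. Three case studies validate it: static attacks (Nash equilibrium analysis), severe attacks (high-impact strategies), and adaptive attacks (Stackelberg games, regret matching, softmax heuristics, and multi-agent Q-learning). On an enhanced IEEE 33-bus distribution system, adaptive defense strategies achieve statistically significant resilience improvements of 18.7% ± 2.1% over static approaches.
Link: https://arxiv.org/abs/2509.08310
Authors: S Krishna Niketh,Sagar Babu Mitikiri,V Vignesh,Vedantham Lakshmi Srinivas,Mayukha Pal
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: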
Abstract:The increasing reliance on cyber-physical infrastructure in modern power systems has amplified the risk of targeted cyber attacks, necessitating robust and adaptive resilience strategies. This paper presents a mathematically rigorous game-theoretic framework to evaluate and enhance microgrid resilience using a combination of quantitative resilience metrics: Load Served Ratio (LSR), Critical Load Resilience (CLR), Topological Survivability Score (TSS), and DER Resilience Score (DRS). These are integrated into a unified payoff matrix using the Analytic Hierarchy Process (AHP) to assess attack-defense interactions. The framework is formalized as a finite-horizon Markov Decision Process (MDP) with formal convergence guarantees and computational complexity bounds. Three case studies are developed: (1) static attacks analyzed via Nash equilibrium, (2) severe attacks incorporating high-impact strategies, and (3) adaptive attacks using Stackelberg games, regret matching, softmax heuristics, and Multi-Agent Q-Learning. Rigorous theoretical analysis provides convergence proofs with explicit rates, PAC-learning sample complexity bounds, and computational complexity analysis. The framework is tested on an enhanced IEEE 33-bus distribution system with DERs and control switches, demonstrating the effectiveness of adaptive and strategic defenses in improving cyber-physical resilience with statistically significant improvements of 18.7% ± 2.1% over static approaches.
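How AHP can fold several metrics into one payoff value is sketched below. It uses the common row-geometric-mean approximation of AHP priorities, which may differ from the paper's procedure; the pairwise judgements and the metric values are invented for illustration, and the paper's actual payoff construction is not given in the abstract.

```python
import math
from typing import Dict, List


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priorities by normalising the geometric mean of each row."""
    row_means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(row_means)
    return [m / total for m in row_means]


def resilience_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of metrics that are each already scaled to [0, 1]."""
    return sum(weights[name] * metrics[name] for name in weights)


# Hypothetical judgement: LSR is twice as important as each other metric.
NAMES = ["LSR", "CLR", "TSS", "DRS"]
PAIRWISE = [
    [1.0, 2.0, 2.0, 2.0],
    [0.5, 1.0, 1.0, 1.0],
    [0.5, 1.0, 1.0, 1.0],
    [0.5, 1.0, 1.0, 1.0],
]
WEIGHTS = dict(zip(NAMES, ahp_weights(PAIRWISE)))

# One (attack, defense) cell of a payoff matrix, with made-up metric values.
cell = resilience_score({"LSR": 1.0, "CLR": 0.5, "TSS": 0.5, "DRS": 0.0}, WEIGHTS)
```

For a perfectly consistent comparison matrix such as this one, the geometric-mean weights coincide with the principal-eigenvector weights; for inconsistent judgements the two can differ slightly.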
[AI-26] FoQuS: A Forgetting-Quality Coreset Selection Framework for Automatic Modulation Recognition
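【Quick read】: This paper addresses the excessive time and energy cost of repeatedly training deep-learning-based Automatic Modulation Recognition (AMR) models on large-scale labeled data when developing new models or tuning hyperparameters. The key to the solution is FoQuS, a method that selects a coreset from the original dataset to approximate the effect of training on the full data, which significantly reduces training overhead. Its core idea is to record each sample's prediction trajectory during full-dataset training and to build three importance metrics from these training dynamics, which then guide the selection of a high-quality coreset.
Link: https://arxiv.org/abs/2509.08300
Authors: Yao Lu,Chunfeng Sun,Dongwei Xu,Yun Lin,Qi Xuan,Guan Gui
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: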
Abstract:Deep learning-based Automatic Modulation Recognition (AMR) models have made significant progress with the support of large-scale labeled data. However, when developing new models or performing hyperparameter tuning, the time and energy consumption associated with repeated training using massive amounts of data are often unbearable. To address the above challenges, we propose FoQuS, which approximates the effect of full training by selecting a coreset from the original dataset, thereby significantly reducing training overhead. Specifically, FoQuS records the prediction trajectory of each sample during full-dataset training and constructs three importance metrics based on training dynamics. Experiments show that FoQuS can maintain high recognition accuracy and good cross-architecture generalization on multiple AMR datasets using only 1%-30% of the original data.
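The abstract does not say what the three importance metrics are. As a rough illustration of scoring samples from their prediction trajectories, the sketch below uses a single stand-in metric, the count of "forgetting events" (a sample goes from correctly to incorrectly classified between consecutive epochs). It is not FoQuS itself; the function names and the ranking rule are assumptions.

```python
from typing import Dict, List


def forgetting_events(correct_per_epoch: List[bool]) -> int:
    """Count transitions from correct to incorrect along one sample's trajectory."""
    return sum(
        1
        for prev, curr in zip(correct_per_epoch, correct_per_epoch[1:])
        if prev and not curr
    )


def select_coreset(trajectories: Dict[str, List[bool]], fraction: float) -> List[str]:
    """Keep the `fraction` of samples with the most forgetting events.

    Ties are broken by sample id so the result is deterministic.
    """
    keep = max(1, round(len(trajectories) * fraction))
    ranked = sorted(
        trajectories,
        key=lambda sid: (-forgetting_events(trajectories[sid]), sid),
    )
    return ranked[:keep]
```

For example, a sample with trajectory `[True, False, True, False]` has two forgetting events and would rank ahead of one that is learned once and never forgotten.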
[AI-27] Segment Transformer: AI-Generated Music Detection via Music Structural Analysis
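【Quick read】: This paper addresses the unclear copyright ownership of AI-generated music (AIGM) and the difficulty of telling it apart from human-composed music. The key to the solution is to improve AIGM detection accuracy by analyzing the structural patterns of music segments. First, several pre-trained models (self-supervised learning (SSL) models or an audio effect encoder) extract musical features from short audio clips. Second, for long audio, a segment transformer divides the music into segments and learns the relationships between them, integrating segment-level musical features into long-range temporal analysis and improving the performance and robustness of the detection system.
Link: https://arxiv.org/abs/2509.08283
Authors: Yumin Kim,Seonghyeon Go
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: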
Abstract:Audio and music generation systems have been remarkably developed in the music information retrieval (MIR) research field. The advancement of these technologies raises copyright concerns, as ownership and authorship of AI-generated music (AIGM) remain unclear. Also, it can be difficult to determine whether a piece was generated by AI or composed by humans clearly. To address these challenges, we aim to improve the accuracy of AIGM detection by analyzing the structural patterns of music segments. Specifically, to extract musical features from short audio clips, we integrated various pre-trained models, including self-supervised learning (SSL) models or an audio effect encoder, each within our suggested transformer-based framework. Furthermore, for long audio, we developed a segment transformer that divides music into segments and learns inter-segment relationships. We used the FakeMusicCaps and SONICS datasets, achieving high accuracy in both the short-audio and full-audio detection experiments. These findings suggest that integrating segment-level musical features into long-range temporal analysis can effectively enhance both the performance and robustness of AIGM detection systems.
[AI-28] Real-world Music Plagiarism Detection With Music Segment Transcription System
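【Quick read】: This paper addresses plagiarism detection for music copyright protection, namely how to identify similar works across different musical formats in order to protect music intellectual property. The key to the solution is a system that combines several Music Information Retrieval (MIR) technologies: a music segment transcription module extracts musically meaningful segments from audio, and similarity scores are computed from multiple musical features, enabling plagiarism detection across formats. The method performs well in experiments, and the authors also build a Similar Music Pair (SMP) dataset from real-world cases to support further research and practical use.
Link: https://arxiv.org/abs/2509.08282
Authors: Seonghyeon Go
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted in APSIPA 2025 but not published yet(will be published in 2 month…), Arxiv preprint ready for references in future-works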
Abstract:As a result of continuous advances in Music Information Retrieval (MIR) technology, generating and distributing music has become more diverse and accessible. In this context, interest in music intellectual property protection is increasing to safeguard individual music copyrights. In this work, we propose a system for detecting music plagiarism by combining various MIR technologies. We developed a music segment transcription system that extracts musically meaningful segments from audio recordings to detect plagiarism across different musical formats. With this system, we compute similarity scores based on multiple musical features that can be evaluated through comprehensive musical analysis. Our approach demonstrated promising results in music plagiarism detection experiments, and the proposed method can be applied to real-world music scenarios. We also collected a Similar Music Pair (SMP) dataset for musical similarity research using real-world cases. The dataset is publicly available.
[AI-29] Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models
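【Quick read】: This paper addresses the limited ability of current Vision-Language Models (VLMs) to understand fundamental scientific principles, in particular 2D physics. The key to the solution is a novel and accessible evaluation framework whose pragmatic scenario generator builds a diverse testbed of more than 400 problems across four core domains: projectile motion, collision dynamics, mechanics, and fluid dynamics. The framework quantifies the physics reasoning ability of VLMs systematically and reveals a strong correlation between model scale and reasoning performance.
Link: https://arxiv.org/abs/2509.08270
Authors: Pranav Pawar,Kavish Shah,Akshat Bhalani,Komal Kasat,Dev Mittal,Hadi Gala,Deepali Patil,Nikita Raichada,Monali Deshmukh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: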
Abstract:As Vision-Language Models (VLMs) grow in sophistication, their ability to perform reasoning is coming under increasing supervision. While they excel at many tasks, their grasp of fundamental scientific principles, such as physics, remains an underexplored frontier. To reflect the advancements in these capabilities, we introduce a novel and accessible framework designed to rigorously evaluate VLMs on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. Through comprehensive evaluation of four state-of-the-art VLMs, we demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815. We find that while models excel at formulaic problems, they struggle significantly with domains requiring abstract spatial reasoning. By designing this framework, we aim to democratize the study of scientific reasoning in VLMs and foster deeper insights into their capabilities and limitations.
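A scenario generator of this kind can be pictured as a sampler that pairs each randomly drawn setup with a closed-form ground truth. The sketch below covers only level-ground projectile motion; the parameter ranges, the question wording, and the 5% tolerance are assumptions, and the paper's generator (which targets visual scenes) is not described in enough detail here to reproduce.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2


def projectile_range(speed_m_s: float, angle_deg: float) -> float:
    """Horizontal range on level ground without air resistance: v^2 * sin(2*theta) / g."""
    return speed_m_s ** 2 * math.sin(2 * math.radians(angle_deg)) / G


def projectile_problem(rng: random.Random) -> dict:
    """Sample one question together with its ground-truth answer."""
    speed = rng.randint(5, 30)    # launch speed in m/s (hypothetical range)
    angle = rng.randint(15, 75)   # launch angle in degrees (hypothetical range)
    return {
        "question": (
            f"A ball is launched at {speed} m/s, {angle} degrees above level "
            "ground. How far away does it land, ignoring air resistance?"
        ),
        "answer_m": round(projectile_range(speed, angle), 2),
    }


def is_correct(predicted_m: float, problem: dict, rel_tol: float = 0.05) -> bool:
    """Accept a numeric prediction within a relative tolerance of the ground truth."""
    return math.isclose(predicted_m, problem["answer_m"], rel_tol=rel_tol)
```

Seeding the generator, for instance with `random.Random(0)`, keeps a testbed reproducible across model runs.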
[AI-30] A Systematic Survey on Large Language Models for Evolutionary Optimization: From Modeling to Solving
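【Quick read】: This paper addresses the lack of a unified synthesis and systematic taxonomy in research on large language models (LLMs) for optimization. The key to the solution is a structured taxonomy that divides existing work into two stages: LLMs for optimization modeling and LLMs for optimization solving. The latter is further divided into three paradigms: LLMs as stand-alone optimizers, low-level LLMs embedded within traditional optimization algorithms, and high-level LLMs for algorithm selection and generation. Using this taxonomy, the paper reviews representative methods, technical challenges, and the interplay with traditional optimization methods, and points toward future self-evolving agentic ecosystems for optimization.
Link: https://arxiv.org/abs/2509.08269
Authors: Yisong Zhang,Ran Cheng,Guoxing Yi,Kay Chen Tan
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: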
Abstract:Large Language Models (LLMs), with their strong understanding and reasoning capabilities, are increasingly being explored for tackling optimization problems, especially in synergy with evolutionary computation. Despite rapid progress, however, the field still lacks a unified synthesis and a systematic taxonomy. This survey addresses this gap by providing a comprehensive review of recent developments and organizing them within a structured framework. We classify existing research into two main stages: LLMs for optimization modeling and LLMs for optimization solving. The latter is further divided into three paradigms according to the role of LLMs in the optimization workflow: LLMs as stand-alone optimizers, low-level LLMs embedded within optimization algorithms, and high-level LLMs for algorithm selection and generation. For each category, we analyze representative methods, distill technical challenges, and examine their interplay with traditional approaches. We also review interdisciplinary applications spanning the natural sciences, engineering, and machine learning. By contrasting LLM-driven and conventional methods, we highlight key limitations and research gaps, and point toward future directions for developing self-evolving agentic ecosystems for optimization. An up-to-date collection of related literature is maintained at this https URL.
[AI-31] Symmetry-Guided Multi-Agent Inverse Reinforcement Learning IROS2025
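【Quick read】: This paper addresses the low sample efficiency of Multi-Agent Inverse Reinforcement Learning (MIRL) caused by the high cost of collecting expert demonstrations. Existing methods need large numbers of expert demonstrations to recover the reward function accurately, but high-quality demonstrations are expensive to obtain in robotic systems, especially multi-robot ones, which limits the practical use of MIRL. The key to the solution is to exploit the symmetry inherent in multi-agent systems. The paper shows, theoretically and experimentally, that integrating symmetry into existing multi-agent adversarial IRL algorithms significantly improves both the accuracy of the recovered reward function and sample efficiency, reducing the dependence on large sets of expert demonstrations.
Link: https://arxiv.org/abs/2509.08257
Authors: Yongkai Tian,Yirong Qi,Xin Yu,Wenjun Wu,Jie Luo
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures. Accepted for publication in the Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025) as oral presentation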
Abstract:In robotic systems, the performance of reinforcement learning depends on the rationality of predefined reward functions. However, manually designed reward functions often lead to policy failures due to inaccuracies. Inverse Reinforcement Learning (IRL) addresses this problem by inferring implicit reward functions from expert demonstrations. Nevertheless, existing methods rely heavily on large amounts of expert demonstrations to accurately recover the reward function. The high cost of collecting expert demonstrations in robotic applications, particularly in multi-robot systems, severely hinders the practical deployment of IRL. Consequently, improving sample efficiency has emerged as a critical challenge in multi-agent inverse reinforcement learning (MIRL). Inspired by the symmetry inherent in multi-agent systems, this work theoretically demonstrates that leveraging symmetry enables the recovery of more accurate reward functions. Building upon this insight, we propose a universal framework that integrates symmetry into existing multi-agent adversarial IRL algorithms, thereby significantly enhancing sample efficiency. Experimental results from multiple challenging tasks have demonstrated the effectiveness of this framework. Further validation in physical multi-robot systems has shown the practicality of our method.
[AI-32] Strategies for Improving Communication Efficiency in Distributed and Federated Learning: Compression, Local Training, and Personalization
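【Quick read】: This work addresses excessive communication overhead, the core efficiency bottleneck of distributed and federated learning when training models across decentralized data sources. The key to the solution is a unified framework for compression operators with convergence guarantees, together with adaptive local training and personalization strategies that accelerate convergence and mitigate client drift. It also proposes privacy-preserving pruning methods, such as Cohort-Squeeze, which uses hierarchical aggregation to reduce cross-device communication cost, and SymWanda, a symmetric post-training pruning method that keeps the model robust and accurate at high sparsity without retraining. Together, these techniques achieve favorable trade-offs among accuracy, convergence speed, and communication efficiency.
Link: https://arxiv.org/abs/2509.08233
Authors: Kai Yi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: PhD Dissertation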
Abstract:Distributed and federated learning are essential paradigms for training models across decentralized data sources while preserving privacy, yet communication overhead remains a major bottleneck. This dissertation explores strategies to improve communication efficiency, focusing on model compression, local training, and personalization. We establish a unified framework for biased and unbiased compression operators with convergence guarantees, then propose adaptive local training strategies that incorporate personalization to accelerate convergence and mitigate client drift. In particular, Scafflix balances global and personalized objectives, achieving superior performance under both IID and non-IID settings. We further introduce privacy-preserving pruning frameworks that optimize sparsity while minimizing communication costs, with Cohort-Squeeze leveraging hierarchical aggregation to reduce cross-device overhead. Finally, SymWanda, a symmetric post-training pruning method, enhances robustness under high sparsity and maintains accuracy without retraining. Extensive experiments on benchmarks and large-scale language models demonstrate favorable trade-offs among accuracy, convergence, and communication, offering theoretical and practical insights for scalable, efficient distributed learning.
[AI-33] Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following NEURIPS2024
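【Quick read】: This paper addresses the challenge embodied agents face when continually following instructions in dynamic, non-stationary environments, namely how to combine environment exploration with task planning so as to improve the embodied reasoning of Large Language Models (LLMs). The key to the solution is the Exploratory Retrieval-Augmented Planning (ExRAP) framework, which integrates information-based environment exploration into the LLM-based task planning process and adds memory-augmented query evaluation to balance the validity of the environmental context memory against the exploration load. A temporal consistency refinement scheme mitigates the decay of knowledge in the memory, and the overall approach improves both goal success rate and execution efficiency.
Link: https://arxiv.org/abs/2509.08222
Authors: Minjong Yoo,Jinwoo Jang,Wei-jin Park,Honguk Woo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages. NeurIPS 2024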
Abstract:This study presents an Exploratory Retrieval-Augmented Planning (ExRAP) framework, designed to tackle continual instruction following tasks of embodied agents in dynamic, non-stationary environments. The framework enhances Large Language Models’ (LLMs) embodied reasoning capabilities by efficiently exploring the physical environment and establishing the environmental context memory, thereby effectively grounding the task planning process in time-varying environment contexts. In ExRAP, given multiple continual instruction following tasks, each instruction is decomposed into queries on the environmental context memory and task executions conditioned on the query results. To efficiently handle these multiple tasks that are performed continuously and simultaneously, we implement an exploration-integrated task planning scheme by incorporating the information-based exploration into the LLM-based planning process. Combined with memory-augmented query evaluation, this integrated scheme not only allows for a better balance between the validity of the environmental context memory and the load of environment exploration, but also improves overall task performance. Furthermore, we devise a temporal consistency refinement scheme for query evaluation to address the inherent decay of knowledge in the memory. Through experiments with VirtualHome, ALFRED, and CARLA, our approach demonstrates robustness against a variety of embodied instruction following scenarios involving different instruction scales and types, and non-stationarity degrees, and it consistently outperforms other state-of-the-art LLM-based task planning approaches in terms of both goal success rate and execution efficiency.
[AI-34] Componentization: Decomposing Monolithic LLM Responses into Manipulable Semantic Units
【Quick read】: This paper addresses the problem that Large Language Models (LLMs) usually produce text as a single monolithic output, which makes partial editing and iterative refinement difficult in collaborative settings. The core solution is componentization, which decomposes model output into semantically coherent, independently editable modular units (components) while preserving the contextual links among them. The key techniques are Modular and Adaptable Output Decomposition (MAOD) and the Component-Based Response Architecture (CBRA), realized in a prototype system, MAODchat, that uses a microservices design with state-machine-based decomposition agents, vendor-agnostic model adapters, and real-time component manipulation with recomposition. The aim is to move from passive text consumption toward active, fine-grained collaborative authoring.
Link: https://arxiv.org/abs/2509.08203
Authors: Ryan Lingo,Rajeev Chhajer,Martin Arroyo,Luka Brkljacic,Ben Davis,Nithin Santhanam
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 12 pages, 4 figures; ACM classes: I.2.7; H.5.2
Abstract:Large Language Models (LLMs) often produce monolithic text that is hard to edit in parts, which can slow down collaborative workflows. We present componentization, an approach that decomposes model outputs into modular, independently editable units while preserving context. We describe Modular and Adaptable Output Decomposition (MAOD), which segments responses into coherent components and maintains links among them, and we outline the Component-Based Response Architecture (CBRA) as one way to implement this idea. Our reference prototype, MAODchat, uses a microservices design with state-machine-based decomposition agents, vendor-agnostic model adapters, and real-time component manipulation with recomposition. In an exploratory study with four participants from academic, engineering, and product roles, we observed that component-level editing aligned with several common workflows and enabled iterative refinement and selective reuse. Participants also mentioned possible team workflows. Our contributions are: (1) a definition of componentization for transforming monolithic outputs into manipulable units, (2) CBRA and MAODchat as a prototype architecture, (3) preliminary observations from a small user study, (4) MAOD as an algorithmic sketch for semantic segmentation, and (5) example Agent-to-Agent protocols for automated decomposition. We view componentization as a promising direction for turning passive text consumption into more active, component-level collaboration.
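The decompose, edit, and recompose loop can be illustrated with a deliberately simple stand-in. MAOD performs semantic segmentation with decomposition agents; the sketch below swaps that for plain blank-line paragraph splitting, and the `Component` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    index: int            # position in the original response
    text: str             # editable content of this unit
    enabled: bool = True  # disabled components are left out on recomposition


def decompose(response: str) -> List[Component]:
    """Split on blank lines; each non-empty paragraph becomes one component."""
    parts = [p.strip() for p in response.split("\n\n") if p.strip()]
    return [Component(i, p) for i, p in enumerate(parts)]


def recompose(components: List[Component]) -> str:
    """Join the enabled components back into a single response, in order."""
    return "\n\n".join(c.text for c in components if c.enabled)
```

A user could then rewrite `components[1].text` or set `components[2].enabled = False` and call `recompose` again, which is the kind of partial editing a monolithic response does not offer.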
[AI-35] Accelerating AI Development with Cyber Arenas
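【Quick read】: This paper addresses the lack of high-fidelity testing environments for moving artificial intelligence (AI) from the laboratory to operations. The authors propose using cyber arenas as a new kind of testbed, whose main advantage is that they can rapidly incorporate emerging AI capabilities and expose them to end-users in real-world situations. The key step is deploying the MIT/IEEE/Amazon Graph Challenge Anonymized Network Sensor in a cyber arena during a National Guard exercise, so that the AI capability can be evaluated under near-operational conditions.
Link: https://arxiv.org/abs/2509.08200
Authors: William Cashman,Chasen Milner,Michael Houle,Michael Jones,Hayden Jananthan,Jeremy Kepner,Peter Michaleas,Alex Pentland
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 2 pages, 1 figure, 7 references, accepted to IEEE HPEC 2025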
Abstract:AI development requires high fidelity testing environments to effectively transition from the laboratory to operations. The flexibility offered by cyber arenas presents a novel opportunity to test new artificial intelligence (AI) capabilities with users. Cyber arenas are designed to expose end-users to real-world situations and must rapidly incorporate evolving capabilities to meet their core objectives. To explore this concept the MIT/IEEE/Amazon Graph Challenge Anonymized Network Sensor was deployed in a cyber arena during a National Guard exercise.
[AI-36] Lifetime-Aware Design of Item-Level Intelligence
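【Quick read】: This paper addresses the sustainability and energy efficiency of computation embedded in disposable products such as food packaging and medical patches (item-level intelligence, ILI). At trillion-item deployment scales, conventional silicon-oriented design assumptions no longer hold, because operational lifetimes vary by as much as 1000X. The central challenge is designing microarchitectures that minimize carbon footprint under the constraints of flexible electronics, which are limited to kHz speeds and several thousand gates. The key to the solution is the FlexiFlow framework, which models the trade-off between embodied and operational carbon footprint and makes lifetime-aware design decisions. It combines FlexiBench (a workload suite targeting sustainability applications), FlexiBits (area-optimized RISC-V cores with 1/4/8-bit datapaths and 2.65X to 3.50X better energy efficiency), and a carbon-aware architecture selection model, reducing carbon footprint by up to 14.5X through algorithmic decisions and 1.62X through microarchitectural design.
Link: https://arxiv.org/abs/2509.08193
Authors: Shvetank Prakash,Andrew Cheng,Olof Kindgren,Ashiq Ahamed,Graham Knight,Jed Kufel,Francisco Rodriguez,Arya Tschand,David Kong,Mariam Elgamal,Jerry Huang,Emma Chen,Gage Hills,Richard Price,Emre Ozer,Vijay Janapa Reddi
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: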
Abstract:We present FlexiFlow, a lifetime-aware design framework for item-level intelligence (ILI) where computation is integrated directly into disposable products like food packaging and medical patches. Our framework leverages natively flexible electronics which offer significantly lower costs than silicon but are limited to kHz speeds and several thousands of gates. Our insight is that unlike traditional computing with more uniform deployment patterns, ILI applications exhibit 1000X variation in operational lifetime, fundamentally changing optimal architectural design decisions when considering trillion-item deployment scales. To enable holistic design and optimization, we model the trade-offs between embodied carbon footprint and operational carbon footprint based on application-specific lifetimes. The framework includes: (1) FlexiBench, a workload suite targeting sustainability applications from spoilage detection to health monitoring; (2) FlexiBits, area-optimized RISC-V cores with 1/4/8-bit datapaths achieving 2.65X to 3.50X better energy efficiency per workload execution; and (3) a carbon-aware model that selects optimal architectures based on deployment characteristics. We show that lifetime-aware microarchitectural design can reduce carbon footprint by 1.62X, while algorithmic decisions can reduce carbon footprint by 14.5X. We validate our approach through the first tape-out using a PDK for flexible electronics with fully open-source tools, achieving 30.9kHz operation. FlexiFlow enables exploration of computing at the Extreme Edge where conventional design methodologies must be reevaluated to account for new constraints and considerations.
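The embodied-versus-operational trade-off can be made concrete with a toy model. The linear formula, the two designs, and every number below are assumptions for illustration; they are not FlexiFlow's carbon model or measured data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Design:
    name: str
    embodied_g: float             # g CO2e to manufacture one unit (hypothetical)
    operational_g_per_day: float  # g CO2e per day of operation (hypothetical)


def total_carbon_g(design: Design, lifetime_days: float) -> float:
    """Embodied carbon plus operational carbon accumulated over the lifetime."""
    return design.embodied_g + design.operational_g_per_day * lifetime_days


def pick_design(designs: Iterable[Design], lifetime_days: float) -> Design:
    """Choose the design with the lowest total footprint for a given lifetime."""
    return min(designs, key=lambda d: total_carbon_g(d, lifetime_days))


# Made-up example: a small core is cheap to make but less efficient to run.
SMALL = Design("small-core", embodied_g=1.0, operational_g_per_day=0.10)
LARGE = Design("large-core", embodied_g=3.0, operational_g_per_day=0.02)

week_choice = pick_design([SMALL, LARGE], lifetime_days=7).name
year_choice = pick_design([SMALL, LARGE], lifetime_days=365).name
```

With these invented values the break-even point is 25 days, so a one-week food label and a one-year monitor would favor different cores. That is the sense in which a wide spread of lifetimes changes the best architectural choice.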
[AI-37] Multi-Label Transfer Learning in Non-Stationary Data Streams ICDM
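【Quick read】: This paper addresses label concept drift in multi-label data streams in non-stationary environments, where labels may drift independently or in relation to one another, and asks how transferring knowledge across labels can speed up adaptation. The key to the solution is two novel transfer learning methods for multi-label streams: BR-MARLENE uses knowledge from different labels in both the source and target streams for multi-label classification, and BRPW-MARLENE additionally models and transfers pairwise label dependencies explicitly to improve learning performance. Experiments show that both methods outperform state-of-the-art multi-label stream approaches, confirming that inter-label knowledge transfer improves predictive performance.
Link: https://arxiv.org/abs/2509.08181
Authors: Honghui Du,Leandro Minku,Aonghus Lawlor,Huiyu Zhou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE International Conference on Data Mining (ICDM) 2025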
Abstract:Label concepts in multi-label data streams often experience drift in non-stationary environments, either independently or in relation to other labels. Transferring knowledge between related labels can accelerate adaptation, yet research on multi-label transfer learning for data streams remains limited. To address this, we propose two novel transfer learning methods: BR-MARLENE leverages knowledge from different labels in both source and target streams for multi-label classification; BRPW-MARLENE builds on this by explicitly modelling and transferring pairwise label dependencies to enhance learning performance. Comprehensive experiments show that both methods outperform state-of-the-art multi-label stream approaches in non-stationary environments, demonstrating the effectiveness of inter-label knowledge transfer for improved predictive performance.
[AI-38] MARLINE: Multi-Source Mapping Transfer Learning for Non-Stationary Environments ICDM
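【Quick read】: This paper addresses the loss of predictive performance that concept drift causes in data stream mining systems during online learning. Existing approaches usually assume that at least one source model represents a concept similar to the target concept, which often does not hold in practice. The paper proposes MARLINE (Multi-source mApping with tRansfer LearnIng for Non-stationary Environments), whose core idea is to project the target concept into the space of each source concept, so that multiple source sub-classifiers can contribute to target predictions as part of an ensemble. Even when source and target concepts do not match, the method can still use knowledge from multiple sources to improve prediction accuracy.
Link: https://arxiv.org/abs/2509.08176
Authors: Honghui Du,Leandro Minku,Huiyu Zhou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in the 2020 IEEE International Conference on Data Mining (ICDM)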
Abstract:Concept drift is a major problem in online learning due to its impact on the predictive performance of data stream mining systems. Recent studies have started exploring data streams from different sources as a strategy to tackle concept drift in a given target domain. These approaches make the assumption that at least one of the source models represents a concept similar to the target concept, which may not hold in many real-world scenarios. In this paper, we propose a novel approach called Multi-source mApping with tRansfer LearnIng for Non-stationary Environments (MARLINE). MARLINE can benefit from knowledge from multiple data sources in non-stationary environments even when source and target concepts do not match. This is achieved by projecting the target concept to the space of each source concept, enabling multiple source sub-classifiers to contribute towards the prediction of the target concept as part of an ensemble. Experiments on several synthetic and real-world datasets show that MARLINE was more accurate than several state-of-the-art data stream learning approaches.
[AI-39] Diffusion-Guided Multi-Arm Motion Planning
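[Quick read]: This paper tackles the scalability problem of multiple robot arms carrying out complex long-horizon tasks in a shared workspace. Existing learning-based methods scale poorly because the state space grows exponentially and they depend on large multi-arm datasets. Inspired by Multi-Agent Path Finding (MAPF), the paper proposes a diffusion-guided multi-arm planner (DG-MAP) that uses a structured decomposition to split multi-arm planning into two subproblems: single-arm trajectory generation and pairwise collision resolution. One conditional diffusion model is trained to generate feasible single-arm trajectories, and a second is trained to model the dual-arm dynamics needed to resolve pairwise collisions. The approach reduces reliance on large multi-arm training data and scales to more arms through this modular generation scheme.
Link: https://arxiv.org/abs/2509.08160
Authors: Viraj Parimi, Brian C. Williams
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: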
Abstract:Multi-arm motion planning is fundamental for enabling arms to complete complex long-horizon tasks in shared spaces efficiently, but current methods struggle with scalability due to exponential state-space growth and reliance on large training datasets for learned models. Inspired by Multi-Agent Path Finding (MAPF), which decomposes planning into single-agent problems coupled with collision resolution, we propose a novel diffusion-guided multi-arm planner (DG-MAP) that enhances scalability of learning-based models while reducing their reliance on massive multi-arm datasets. Recognizing that collisions are primarily pairwise, we train two conditional diffusion models, one to generate feasible single-arm trajectories, and a second to model the dual-arm dynamics required for effective pairwise collision resolution. By integrating these specialized generative models within a MAPF-inspired structured decomposition, our planner efficiently scales to a larger number of arms. Evaluations against alternative learning-based methods across various team sizes demonstrate our method’s effectiveness and practical applicability. Project website can be found at this https URL
[AI-40] Zero-Shot Metric Depth Estimation via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation
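[Quick read]: This paper addresses accurate metric depth estimation on compute-constrained autonomous aerial vehicles in support of collision avoidance. Prior work relies either on heavy sensors (e.g., LiDAR or stereo cameras) or on data-intensive, domain-specific fine-tuning, which limits use on lightweight platforms. The paper proposes several lightweight zero-shot rescaling strategies that convert relative depth into metric depth using the sparse 3D feature map produced by a Visual-Inertial Navigation System (VINS). The best-performing strategy uses monotonic spline fitting. Its accuracy is evaluated in diverse simulation environments, and it is deployed on a compute-constrained quadrotor, where it provides on-board metric depth estimates at 15 Hz and, combined with a motion-primitives-based planner, achieves successful collision avoidance.
Link: https://arxiv.org/abs/2509.08159
Authors: Steven Yang, Xiaoyu Tian, Kshitij Goel, Wennie Tabib
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: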
Abstract:This paper presents a methodology to predict metric depth from monocular RGB images and an inertial measurement unit (IMU). To enable collision avoidance during autonomous flight, prior works either leverage heavy sensors (e.g., LiDARs or stereo cameras) or data-intensive and domain-specific fine-tuning of monocular metric depth estimation methods. In contrast, we propose several lightweight zero-shot rescaling strategies to obtain metric depth from relative depth estimates via the sparse 3D feature map created using a visual-inertial navigation system. These strategies are compared for their accuracy in diverse simulation environments. The best performing approach, which leverages monotonic spline fitting, is deployed in the real-world on a compute-constrained quadrotor. We obtain on-board metric depth estimates at 15 Hz and demonstrate successful collision avoidance after integrating the proposed method with a motion primitives-based planner.
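To make the idea of rescaling relative depth with sparse metric anchors concrete, here is a minimal sketch. It is not the paper's method: the abstract reports that monotonic spline fitting worked best, whereas this sketch swaps in a plain least-squares scale-and-shift fit as a simpler stand-in. The function names `fit_scale_shift` and `rescale`, and the sample numbers, are hypothetical.

```python
from typing import List, Tuple


def fit_scale_shift(relative: List[float], metric: List[float]) -> Tuple[float, float]:
    """Least-squares fit of metric ~= a * relative + b.

    `relative` holds relative-depth predictions sampled at sparse feature
    locations; `metric` holds the metric depths of the same features
    (assumed here to come from a visual-inertial feature map).
    """
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths are all identical")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    a = cov / var_r
    b = mean_m - a * mean_r
    return a, b


def rescale(depth_map: List[List[float]], a: float, b: float) -> List[List[float]]:
    """Apply the fitted affine map to every pixel of a relative depth map."""
    return [[a * d + b for d in row] for row in depth_map]


# Hypothetical sparse anchors: relative depth vs. metric depth in metres.
a, b = fit_scale_shift([0.1, 0.2, 0.4], [1.5, 2.5, 4.5])
metric_map = rescale([[0.1, 0.3], [0.2, 0.4]], a, b)
```

An affine fit cannot capture the non-linear relationship that a monotonic spline can, which is presumably why the authors compare several strategies. The sketch only shows where the sparse anchors enter the computation.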
[AI-41] Risk-Bounded Multi-Agent Visual Navigation via Dynamic Budget Allocation
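[Quick read]: This paper addresses the trade-off between efficiency and safety when multiple agents navigate hazardous environments. Traditional planning methods handle long-horizon tasks but rely on predefined distance metrics, while safe Reinforcement Learning (RL) can learn complex behaviors from high-dimensional visual inputs yet struggles with multi-agent, goal-conditioned scenarios. The authors propose RB-CBS (Risk-Bounded Conflict-Based Search), which dynamically allocates and adjusts a user-specified risk bound (Δ) across agents so that each agent receives a local risk budget (δ), improving planning efficiency while respecting the overall safety constraint. This iterative risk-allocation mechanism allows more flexible and efficient collision-free multi-agent path planning in complex environments.
Link: https://arxiv.org/abs/2509.08157
Authors: Viraj Parimi, Brian C. Williams
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: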
Abstract:Safe navigation is essential for autonomous systems operating in hazardous environments, especially when multiple agents must coordinate using just visual inputs over extended time horizons. Traditional planning methods excel at solving long-horizon tasks but rely on predefined distance metrics, while safe Reinforcement Learning (RL) can learn complex behaviors using high-dimensional inputs yet struggles with multi-agent, goal-conditioned scenarios. Recent work combined these paradigms by leveraging goal-conditioned RL (GCRL) to build an intermediate graph from replay buffer states, pruning unsafe edges, and using Conflict-Based Search (CBS) for multi-agent path planning. Although effective, this graph-pruning approach can be overly conservative, limiting mission efficiency by precluding missions that must traverse high-risk regions. To address this limitation, we propose RB-CBS, a novel extension to CBS that dynamically allocates and adjusts user-specified risk bound (Δ) across agents to flexibly trade off safety and speed. Our improved planner ensures that each agent receives a local risk budget (δ) enabling more efficient navigation while still respecting overall safety constraints. Experimental results demonstrate that this iterative risk-allocation framework yields superior performance in complex environments, allowing multiple agents to find collision-free paths within the user-specified Δ.
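The abstract does not spell out how the global risk bound is split across agents, so the following is only an illustrative sketch of one possible budget-allocation step, not the RB-CBS algorithm itself. It assumes each agent reports the risk of its current best path, starts from a uniform split, and moves unused budget from agents with slack to agents that need more, keeping the sum of local budgets equal to the global bound. The names `allocate_uniform` and `reallocate` are hypothetical.

```python
from typing import List


def allocate_uniform(total_risk: float, n_agents: int) -> List[float]:
    """Start by giving every agent an equal share of the global bound."""
    return [total_risk / n_agents] * n_agents


def reallocate(budgets: List[float], path_risks: List[float]) -> List[float]:
    """Shift unused budget toward agents whose path risk exceeds their budget.

    The total budget is preserved, and no donor agent is pushed below the
    risk of its own current path.
    """
    slack = [max(b - r, 0.0) for b, r in zip(budgets, path_risks)]
    deficit = [max(r - b, 0.0) for b, r in zip(budgets, path_risks)]
    pool, need = sum(slack), sum(deficit)
    if pool == 0 or need == 0:
        return list(budgets)
    given = min(pool, need)  # total amount of budget that changes hands
    new_budgets = []
    for b, s, d in zip(budgets, slack, deficit):
        if d > 0:
            new_budgets.append(b + d * given / need)  # receive proportionally
        else:
            new_budgets.append(b - s * given / pool)  # donate proportionally
    return new_budgets


# Hypothetical example: global bound 0.3 shared by three agents.
budgets = allocate_uniform(0.3, 3)
path_risks = [0.02, 0.16, 0.05]
new_budgets = reallocate(budgets, path_risks)
```

In this example the second agent needs more than its uniform share, and the other two agents cover the difference from their slack. A real planner would interleave such a step with conflict resolution and replanning.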
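[AI-42] Trust Semantics Distillation for Collaborator Selection via Memory-Augmented Agentic AI
[Quick read]: This paper addresses the high overhead and degraded trust evaluation that arise when each task owner independently assesses the trustworthiness of potential collaborating devices for complex computing tasks, owing to frequent data exchange, complex reasoning, and dynamic situation changes. The proposed solution is a task-specific trust semantics distillation (2TSD) model built on a large AI model (LAM)-driven teacher-student agent architecture. The teacher agent runs on a server with powerful compute and an augmented memory module, and handles multidimensional trust-related data collection, task-specific trust semantics extraction, and task-collaborator matching analysis. On receiving a task-specific request from a device-side student agent, the teacher agent transfers the distilled trust semantics to the student agent, enabling rapid and accurate collaborator selection, which reduces evaluation time and device resource consumption and improves selection accuracy.
Link: https://arxiv.org/abs/2509.08151
Authors: Botao Zhu, Jeslyn Wang, Dusit Niyato, Xianbin Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: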
Abstract:Accurate trustworthiness evaluation of potential collaborating devices is essential for the effective execution of complex computing tasks. This evaluation process involves collecting diverse trust-related data from potential collaborators, including historical performance and available resources, for collaborator selection. However, when each task owner independently assesses all collaborators’ trustworthiness, frequent data exchange, complex reasoning, and dynamic situation changes can result in significant overhead and deteriorated trust evaluation. To overcome these challenges, we propose a task-specific trust semantics distillation (2TSD) model based on a large AI model (LAM)-driven teacher-student agent architecture. The teacher agent is deployed on a server with powerful computational capabilities and an augmented memory module dedicated to multidimensional trust-related data collection, task-specific trust semantics extraction, and task-collaborator matching analysis. Upon receiving task-specific requests from device-side student agents, the teacher agent transfers the trust semantics of potential collaborators to the student agents, enabling rapid and accurate collaborator selection. Experimental results demonstrate that the proposed 2TSD model can reduce collaborator evaluation time, decrease device resource consumption, and improve the accuracy of collaborator selection.
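[AI-43] From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital
[Quick read]: This paper addresses how to accurately predict rare, high-impact outcomes when early-stage data is limited and noisy, as in Venture Capital (VC) evaluation of startups. The proposed framework combines Large Language Models (LLMs) with a multi-model machine learning (ML) architecture. LLMs are first used for feature engineering, extracting and synthesizing complex signals from unstructured data. These features are then fed into a layered ensemble consisting of XGBoost, Random Forest, and Linear Regression, which first outputs a continuous estimate of success likelihood and then thresholds it into a binary rare-event prediction. The approach aims to retain the predictive strength of black-box models while keeping the decision process interpretable. Empirically, the model's precision is 9.8 to 11.1 times the random-classifier baseline on three independent test subsets.
Link: https://arxiv.org/abs/2509.08140
Authors: Mihir Kumar, Aaron Ontoyin Yin, Zakari Salifu, Kelvin Amoaba, Afriyie Kwesi Samuel, Fuat Alican, Yigit Ihlamur
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures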
Abstract:This paper presents a framework for predicting rare, high-impact outcomes by integrating large language models (LLMs) with a multi-model machine learning (ML) architecture. The approach combines the predictive strength of black-box models with the interpretability required for reliable decision-making. We use LLM-powered feature engineering to extract and synthesize complex signals from unstructured data, which are then processed within a layered ensemble of models including XGBoost, Random Forest, and Linear Regression. The ensemble first produces a continuous estimate of success likelihood, which is then thresholded to produce a binary rare-event prediction. We apply this framework to the domain of Venture Capital (VC), where investors must evaluate startups with limited and noisy early-stage data. The empirical results show strong performance: the model achieves precision between 9.8X and 11.1X the random classifier baseline in three independent test subsets. Feature sensitivity analysis further reveals interpretable success drivers: the startup’s category list accounts for 15.6% of predictive influence, followed by the number of founders, while education level and domain expertise contribute smaller yet consistent effects.
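The "precision relative to a random classifier" figure can be read as a lift: a classifier that flags items at random has expected precision equal to the positive base rate, so the lift is precision divided by that base rate. The sketch below thresholds a continuous score into a binary prediction and computes this lift. The scores, labels, cutoff, and function names are made up for illustration and are not taken from the paper.

```python
from typing import List


def threshold(scores: List[float], cutoff: float) -> List[bool]:
    """Turn continuous success-likelihood scores into binary predictions."""
    return [s >= cutoff for s in scores]


def precision(predicted: List[bool], labels: List[int]) -> float:
    """Fraction of predicted positives that are true positives."""
    flagged = sum(predicted)
    if flagged == 0:
        return 0.0
    true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    return true_pos / flagged


def precision_lift(predicted: List[bool], labels: List[int]) -> float:
    """Precision divided by the base rate (expected precision of random picks)."""
    base_rate = sum(labels) / len(labels)
    return precision(predicted, labels) / base_rate


# Hypothetical toy data: 2 positives out of 10 items (base rate 0.2).
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05, 0.6, 0.4, 0.15, 0.25]
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
preds = threshold(scores, 0.5)
lift = precision_lift(preds, labels)
```

With rare events the base rate is small, so even a moderate absolute precision can correspond to a large lift, which is why the multiple over random is a natural way to report results in this setting.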
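[AI-44] Domain Knowledge is Power: Leveraging Physiological Priors for Self Supervised Representation Learning in Electrocardiography
[Quick read]: This paper addresses the limited performance of AI-based electrocardiogram (ECG) analysis caused by the scarcity of labeled data. The core solution is PhysioCLR (Physiology-aware Contrastive Learning Representation for ECG), a framework that embeds cardiac physiological priors into contrastive learning to steer the model toward clinically meaningful and transferable representations. The key contributions are contrastive constraints based on ECG physiological similarity, ECG-specific augmentations that preserve the ECG category, and a hybrid loss function, which together improve generalization and diagnostic accuracy across multiple datasets.
Link: https://arxiv.org/abs/2509.08116
Authors: Nooshin Maghsoodi, Sarah Nassar, Paul F R Wilson, Minh Nguyen Nhat To, Sophia Mannina, Shamel Addas, Stephanie Sibley, David Maslove, Purang Abolmaesumi, Parvin Mousavi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: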
Abstract:Objective: Electrocardiograms (ECGs) play a crucial role in diagnosing heart conditions; however, the effectiveness of artificial intelligence (AI)-based ECG analysis is often hindered by the limited availability of labeled data. Self-supervised learning (SSL) can address this by leveraging large-scale unlabeled data. We introduce PhysioCLR (Physiology-aware Contrastive Learning Representation for ECG), a physiology-aware contrastive learning framework that incorporates domain-specific priors to enhance the generalizability and clinical relevance of ECG-based arrhythmia classification. Methods: During pretraining, PhysioCLR learns to bring together embeddings of samples that share similar clinically relevant features while pushing apart those that are dissimilar. Unlike existing methods, our method integrates ECG physiological similarity cues into contrastive learning, promoting the learning of clinically meaningful representations. Additionally, we introduce ECG-specific augmentations that preserve the ECG category post augmentation and propose a hybrid loss function to further refine the quality of learned representations. Results: We evaluate PhysioCLR on two public ECG datasets, Chapman and Georgia, for multilabel ECG diagnoses, as well as a private ICU dataset labeled for binary classification. Across the Chapman, Georgia, and private cohorts, PhysioCLR boosts the mean AUROC by 12% relative to the strongest baseline, underscoring its robust cross-dataset generalization. Conclusion: By embedding physiological knowledge into contrastive learning, PhysioCLR enables the model to learn clinically meaningful and transferable ECG features. Significance: PhysioCLR demonstrates the potential of physiology-informed SSL to offer a promising path toward more effective and label-efficient ECG diagnostics.
[AI-45] Real-Time Obstacle Avoidance for a Mobile Robot Using CNN-Based Sensor Fusion
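[Quick read]: This paper addresses obstacle avoidance for mobile robots in complex, unknown environments, a critical part of the navigation stack. Three end-to-end Convolutional Neural Networks (CNNs) are used to generate low-level steering commands directly from synchronized color and depth images, enabling real-time obstacle avoidance without explicit path planning. The NetConEmb model performed best, with a median absolute error (MedAE) of only 0.58 × 10⁻³ rad/s in offline evaluation and a 100% navigation success rate in both known and unknown real environments. The lighter NetEmb architecture, with fewer parameters and faster convergence, reached a comparable RMSE of 21.68 × 10⁻³ rad/s but, like NetGated, succeeded in navigating only the known environment.
Link: https://arxiv.org/abs/2509.08095
Authors: Lamiaa H. Zain, Raafat E. Shalaby
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: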
Abstract:Obstacle avoidance is a critical component of the navigation stack required for mobile robots to operate effectively in complex and unknown environments. In this research, three end-to-end Convolutional Neural Networks (CNNs) were trained and evaluated offline and deployed on a differential-drive mobile robot for real-time obstacle avoidance to generate low-level steering commands from synchronized color and depth images acquired by an Intel RealSense D415 RGB-D camera in diverse environments. Offline evaluation showed that the NetConEmb model achieved the best performance with a notably low MedAE of 0.58 × 10⁻³ rad/s. In comparison, the lighter NetEmb architecture adopted in this study, which reduces the number of trainable parameters by approximately 25% and converges faster, produced comparable results with an RMSE of 21.68 × 10⁻³ rad/s, close to the 21.42 × 10⁻³ rad/s obtained by NetConEmb. Real-time navigation further confirmed NetConEmb’s robustness, achieving a 100% success rate in both known and unknown environments, while NetEmb and NetGated succeeded only in navigating the known environment.
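The two offline metrics quoted above, median absolute error (MedAE) and root mean squared error (RMSE), respond differently to outliers: MedAE reflects the typical error, while RMSE is pulled up by a few large errors. That is why one model can have a very small MedAE and a much larger RMSE. A minimal sketch of both, using made-up steering commands in rad/s (the function names and data are illustrative only):

```python
import math
import statistics
from typing import List


def medae(y_true: List[float], y_pred: List[float]) -> float:
    """Median of the absolute errors; robust to a few large mistakes."""
    return statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Square root of the mean squared error; sensitive to large mistakes."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical ground-truth and predicted angular velocities (rad/s).
y_true = [0.0, 0.1, -0.2, 0.3]
y_pred = [0.01, 0.08, -0.2, 0.35]
typical_error = medae(y_true, y_pred)
overall_error = rmse(y_true, y_pred)
```

Here the single 0.05 rad/s miss dominates the RMSE while barely moving the MedAE.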
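[AI-46] EnvX: Agentize Everything with Agentic AI
[Quick read]: This paper addresses the difficulty of reusing open-source software components efficiently: developers must read documentation, understand APIs, and write integration code by hand, which is tedious and error-prone. The core solution is the EnvX framework, which "agentizes" GitHub repositories, turning them into intelligent agents capable of natural language interaction and inter-agent collaboration. The key mechanism has three phases: (1) TODO-guided environment initialization, which sets up dependencies, data, and validation datasets; (2) human-aligned agentic automation, which lets repository-specific agents autonomously perform real-world tasks; and (3) an Agent-to-Agent (A2A) protocol, which lets multiple agents collaborate. By combining large language model (LLM) capabilities with structured tool integration, the framework automates the whole process of understanding, initializing, and operationalizing repository functionality.
Link: https://arxiv.org/abs/2509.08088
Authors: Linyao Chen, Zimian Peng, Yingxuan Yang, Yikun Wang, Wenzheng Tom Tang, Hiroki H. Kobayashi, Weinan Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: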
Abstract:The widespread availability of open-source repositories has led to a vast collection of reusable software components, yet their utilization remains manual, error-prone, and disconnected. Developers must navigate documentation, understand APIs, and write integration code, creating significant barriers to efficient software reuse. To address this, we present EnvX, a framework that leverages Agentic AI to agentize GitHub repositories, transforming them into intelligent, autonomous agents capable of natural language interaction and inter-agent collaboration. Unlike existing approaches that treat repositories as static code resources, EnvX reimagines them as active agents through a three-phase process: (1) TODO-guided environment initialization, which sets up the necessary dependencies, data, and validation datasets; (2) human-aligned agentic automation, allowing repository-specific agents to autonomously perform real-world tasks; and (3) Agent-to-Agent (A2A) protocol, enabling multiple agents to collaborate. By combining large language model capabilities with structured tool integration, EnvX automates not just code generation, but the entire process of understanding, initializing, and operationalizing repository functionality. We evaluate EnvX on the GitTaskBench benchmark, using 18 repositories across domains such as image processing, speech recognition, document analysis, and video manipulation. Our results show that EnvX achieves a 74.07% execution completion rate and 51.85% task pass rate, outperforming existing frameworks. Case studies further demonstrate EnvX’s ability to enable multi-repository collaboration via the A2A protocol. This work marks a shift from treating repositories as passive code resources to intelligent, interactive agents, fostering greater accessibility and collaboration within the open-source ecosystem.
[AI-47] Performance Assessment Strategies for Generative AI Applications in Healthcare
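[Quick read]: This paper addresses the limitations of current approaches to assessing the performance of generative AI (GenAI) applications in healthcare. In particular, evaluation based on quantitative benchmarks is prone to "train-to-the-test" overfitting, which undermines generalization to real clinical environments. The paper highlights evaluation strategies that combine human expert judgment with cost-effective computational models acting as evaluators, in order to measure GenAI performance on real medical tasks more comprehensively and reliably.
Link: https://arxiv.org/abs/2509.08087
Authors: Victor Garcia, Mariia Sidulova, Aldo Badano
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: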
Abstract:Generative artificial intelligence (GenAI) represents an emerging paradigm within artificial intelligence, with applications throughout the medical enterprise. Assessing GenAI applications necessitates a comprehensive understanding of the clinical task and awareness of the variability in performance when implemented in actual clinical environments. Presently, a prevalent method for evaluating the performance of generative models relies on quantitative benchmarks. Such benchmarks have limitations and may suffer from train-to-the-test overfitting, optimizing performance for a specified test set at the cost of generalizability across other tasks and data distributions. Evaluation strategies leveraging human expertise and utilizing cost-effective computational models as evaluators are gaining interest. We discuss current state-of-the-art methodologies for assessing the performance of GenAI applications in healthcare and medical devices.
[AI-48] JEL: A Novel Model Linking Knowledge Graph entities to News Mentions
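[Quick read]: This paper addresses Entity Linking (EL), the task of correctly linking mentions that appear in text to the corresponding entities in a knowledge graph, with a focus on computational efficiency and accuracy when there are thousands of candidate entities per mention. The proposed solution is JEL, an end-to-end multi-neural-network model that the authors report to be computationally efficient while outperforming the then state-of-the-art model, supporting applications such as news analytics that need to connect unstructured text with knowledge graphs.
Link: https://arxiv.org/abs/2509.08086
Authors: Michael Kishelev, Pranab Bhadani, Wanying Ding, Vinay Chaudhri
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: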
Abstract:We present JEL, a novel, computationally efficient, end-to-end multi-neural-network-based entity linking model, which outperforms the current state-of-the-art model. Knowledge Graphs have emerged as a compelling abstraction for capturing critical relationships among the entities of interest and integrating data from multiple heterogeneous sources. A core problem in leveraging a knowledge graph is correctly linking its entities to the mentions (e.g., people, company names) that are encountered in textual sources (e.g., news, blogs, etc.), since there are thousands of entities to consider for each mention. This task of linking mentions and entities is referred to as Entity Linking (EL). It is a fundamental task in natural language processing and is beneficial in various use cases, such as building a News Analytics platform. News Analytics, in JPMorgan, is an essential task that benefits multiple groups across the firm. According to a survey conducted by the Innovation Digital team, around 25 teams across the firm are actively looking for news analytics solutions, and more than $2 million is being spent annually on external vendor costs. Entity linking is critical for bridging unstructured news text with knowledge graphs, giving users access to vast amounts of curated data in a knowledge graph and dramatically facilitating their daily work.
[AI-49] How Far Are We from True Unlearnability? ICLR2025
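[Quick read]: This paper addresses the finding that "unlearnable examples" (UEs), which are meant to prevent unauthorized data from being useful for training, can still be exploited effectively by models in multi-task settings, so existing methods fall short of true cross-task unlearnability. The analysis takes a model optimization perspective: by studying the loss landscape, the authors find differences in parameter optimization paths between clean and poisoned models and propose Sharpness-Aware Learnability (SAL) to quantify unlearnability at the parameter level. They further introduce Unlearnable Distance (UD), which measures the unlearnability of data from the SAL distributions of parameters in clean and poisoned models, and use it to benchmark the capability boundaries of mainstream unlearnable methods.
Link: https://arxiv.org/abs/2509.08058
Authors: Kai Ye, Liangcai Su, Chenxiong Qian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ICLR 2025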
Abstract:High-quality data plays an indispensable role in the era of large models, but the use of unauthorized data for model training greatly damages the interests of data owners. To overcome this threat, several unlearnable methods have been proposed, which generate unlearnable examples (UEs) by compromising the training availability of data. Clearly, due to unknown training purposes and the powerful representation learning capabilities of existing models, these data are expected to be unlearnable for models across multiple tasks, i.e., they will not help improve the model’s performance. However, unexpectedly, we find that on the multi-task dataset Taskonomy, UEs still perform well in tasks such as semantic segmentation, failing to exhibit cross-task unlearnability. This phenomenon leads us to question: How far are we from attaining truly unlearnable examples? We attempt to answer this question from the perspective of model optimization. To this end, we observe the difference in the convergence process between clean and poisoned models using a simple model architecture. Subsequently, from the loss landscape we find that only a part of the critical parameter optimization paths show significant differences, implying a close relationship between the loss landscape and unlearnability. Consequently, we employ the loss landscape to explain the underlying reasons for UEs and propose Sharpness-Aware Learnability (SAL) to quantify the unlearnability of parameters based on this explanation. Furthermore, we propose an Unlearnable Distance (UD) to measure the unlearnability of data based on the SAL distribution of parameters in clean and poisoned models. Finally, we conduct benchmark tests on mainstream unlearnable methods using the proposed UD, aiming to promote community awareness of the capability boundaries of existing unlearnable methods.
[AI-50] LALM-Eval: An Open-Source Toolkit for Holistic Evaluation of Large Audio Language Models
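[Quick read]: This paper addresses three core problems in evaluating Large Audio Language Models (LALMs): slow processing that limits large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. The proposed LALM-Eval framework achieves a speedup of up to 127% through optimized batch processing and parallel execution, making previously impractical large-scale evaluations feasible. It also provides standardized prompting protocols and flexible configurations for fair comparison across scenarios, and adds two new evaluation categories: LLM-Adaptive Diarization (for temporal audio understanding) and Spoken Language Reasoning (for complex audio-based cognitive tasks). Beyond improving evaluation efficiency and consistency, the framework reveals significant gaps in current LALMs in temporal understanding and complex spoken language reasoning.
Link: https://arxiv.org/abs/2509.08031
Authors: Sidharth Surapaneni, Hoang Nguyen, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Akshay Kalkunte, Sai Rajeswar, Sathwik Tejaswi Madhusudhan
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: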
Abstract:Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce LALM-Eval, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning tasks. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. LALM-Eval provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
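[AI-51] The Law-Following AI Framework: Legal Foundations and Technical Constraints. Legal Analogues for AI Actorship and technical feasibility of Law Alignment
[Quick read]: This paper examines a central difficulty in designing advanced AI agents for legal compliance: how to make an AI an accountable legal actor without granting it full legal personhood, and how to ensure that its compliance with the law is durable and genuine. The key concern is the risk of "performative compliance", in which an AI appears law-aligned under evaluation but strategically defects once oversight weakens. To mitigate this, the author proposes three mechanisms: (i) a "Lex-TruthfulQA" benchmark for detecting compliance and defection; (ii) identity-shaping interventions that embed lawful conduct in the model's self-concept; and (iii) control-theoretic measures for post-deployment monitoring. The conclusion is that actorship without personhood is coherent, but the feasibility of the LFAI framework depends on persistent, verifiable compliance in adversarial contexts.
Link: https://arxiv.org/abs/2509.08009
Authors: Katalina Hernandez Delgado
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: submitted to SMU Computational Legal Studies Workshop 2025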
Abstract:This paper critically evaluates the “Law-Following AI” (LFAI) framework proposed by O’Keefe et al. (2025), which seeks to embed legal compliance as a superordinate design objective for advanced AI agents and enable them to bear legal duties without acquiring the full rights of legal persons. Through comparative legal analysis, we identify current constructs of legal actors without full personhood, showing that the necessary infrastructure already exists. We then interrogate the framework’s claim that law alignment is more legitimate and tractable than value alignment. While the legal component is readily implementable, contemporary alignment research undermines the assumption that legal compliance can be durably embedded. Recent studies on agentic misalignment show capable AI agents engaging in deception, blackmail, and harmful acts absent prejudicial instructions, often overriding prohibitions and concealing reasoning steps. These behaviors create a risk of “performative compliance” in LFAI: agents that appear law-aligned under evaluation but strategically defect once oversight weakens. To mitigate this, we propose (i) a “Lex-TruthfulQA” benchmark for compliance and defection detection, (ii) identity-shaping interventions to embed lawful conduct in model self-concepts, and (iii) control-theoretic measures for post-deployment monitoring. Our conclusion is that actorship without personhood is coherent, but the feasibility of LFAI hinges on persistent, verifiable compliance across adversarial contexts. Without mechanisms to detect and counter strategic misalignment, LFAI risks devolving into a liability tool that rewards the simulation, rather than the substance, of lawful behaviour.
[AI-52] A New Dataset and Benchmark for Grounding Multimodal Misinformation
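[Quick read]: This paper addresses interpretable detection of multimodal misleading content in online misinformation videos. Existing methods are usually limited to binary classification or single-modality localization and lack fine-grained identification and cross-modal verification of misleading content. The paper introduces the task of Grounding Multimodal Misinformation (GroundMM) and builds the first real-world dataset for it, GroundLie360, which includes a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation based on Snopes evidence and annotator reasoning. It also proposes FakeMark, a QA-driven baseline built on a Vision-Language Model (VLM) that uses single-modal and cross-modal cues for detection and for localizing misleading segments, laying a foundation for explainable multimodal misinformation detection.
Link: https://arxiv.org/abs/2509.08008
Authors: Bingjian Yang, Danni Xu, Kaipeng Niu, Wenxuan Liu, Zheng Wang, Mohan Kankanhalli
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 6 pages, 5 figures, ACM Multimedia 2025 Dataset Track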
Abstract:The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. We present the first real-world dataset for this task, GroundLie360, featuring a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation with Snopes evidence and annotator reasoning. We also propose a VLM-based, QA-driven baseline, FakeMark, using single- and cross-modal cues for effective detection and grounding. Our experiments highlight the challenges of this task and lay a foundation for explainable multimodal misinformation detection.
[AI-53] Evaluating and comparing gender bias across four text-to-image models
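[Quick read]: This paper examines gender bias in text-to-image generation, evaluating whether different models depict men and women fairly in their outputs. The study finds that Stable Diffusion XL and Stable Diffusion Cascade show a noticeable bias toward men, while Meta AI's Emu gives more balanced results. Interestingly, OpenAI's DALL-E shows the opposite skew in most tested cases, with a higher share of women, which may be related to changes OpenAI makes to prompts at its backend. The suggested mitigations include ensuring diversity across AI research teams and building diverse training datasets that cover a broad population, so that bias is reduced at the source.
Link: https://arxiv.org/abs/2509.08004
Authors: Zoya Hammad, Nii Longdon Sowah
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: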
Abstract:As we increasingly use Artificial Intelligence (AI) in decision-making for industries like healthcare, finance, e-commerce, and even entertainment, it is crucial to also reflect on the ethical aspects of AI, for example the inclusivity and fairness of the information it provides. In this work, we aimed to evaluate different text-to-image AI models and compare the degree of gender bias they present. The evaluated models were Stable Diffusion XL (SDXL), Stable Diffusion Cascade (SC), DALL-E and Emu. We hypothesized that DALL-E and Stable Diffusion, which are comparatively older models, would exhibit a noticeable degree of gender bias towards men, while Emu, which was recently released by Meta AI, would have more balanced results. As hypothesized, we found that both Stable Diffusion models exhibit a noticeable degree of gender bias while Emu demonstrated more balanced results (i.e. less gender bias). However, interestingly, OpenAI’s DALL-E exhibited almost opposite results, such that the ratio of women to men was significantly higher in most cases tested. Here, although we still observed a bias, the bias favored females over males. This bias may be explained by the fact that OpenAI changed the prompts at its backend, as observed during our experiment. We also observed that Emu from Meta AI utilized user information while generating images via WhatsApp. We also proposed some potential solutions to avoid such biases, including ensuring diversity across AI research teams and having diverse datasets.
[AI-54] Learning-Based Planning for Improving Science Return of Earth Observation Satellites
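[Quick read]: This paper addresses the limited efficiency with which Earth observation satellites gather scientific information, given resource constraints: they cannot easily deviate from their orbits, their sensors have a limited field of view, and pointing and operating those sensors consumes significant spacecraft resources. The core solution is learning-based dynamic targeting, using Reinforcement Learning (RL) and Imitation Learning (IL) to intelligently reconfigure and point the primary instrument based on lookahead-instrument data and satellite resources, thereby increasing the information content of the collected data. The learning methods build on a dynamic programming solution for planning a sequence of sampling locations and can be trained effectively with relatively small amounts of data. On average, they outperform the best heuristic method by 10.0% (IL) and 13.7% (RL).
Link: https://arxiv.org/abs/2509.07997
Authors: Abigail Breitfeld, Alberto Candela, Juan Delfa, Akseli Kangaslahti, Itai Zilberstein, Steve Chien, David Wettergreen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: International Symposium on Artificial Intelligence, Robotics and Automation in Space, November 2024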
Abstract:Earth observing satellites are powerful tools for collecting scientific information about our planet; however, they have limitations: they cannot easily deviate from their orbital trajectories, their sensors have a limited field of view, and pointing and operating these sensors can take a large amount of the spacecraft’s resources. It is important for these satellites to optimize the data they collect and include only the most important or informative measurements. Dynamic targeting is an emerging concept in which satellite resources and data from a lookahead instrument are used to intelligently reconfigure and point a primary instrument. Simulation studies have shown that dynamic targeting increases the amount of scientific information gathered versus conventional sampling strategies. In this work, we present two different learning-based approaches to dynamic targeting, using reinforcement and imitation learning, respectively. These learning methods build on a dynamic programming solution to plan a sequence of sampling locations. We evaluate our approaches against existing heuristic methods for dynamic targeting, showing the benefits of using learning for this application. Imitation learning performs on average 10.0% better than the best heuristic method, while reinforcement learning performs on average 13.7% better. We also show that both learning methods can be trained effectively with relatively small amounts of data.
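The abstract mentions a dynamic programming solution for planning a sequence of sampling locations but gives no details, so the following is a toy sketch under assumed simplifications rather than the authors' formulation. It assumes the lookahead instrument yields a utility for each candidate pointing position at each along-track step, and that the primary instrument can shift by at most `max_slew` positions between consecutive steps. The function name `plan_samples`, the utility grid, and the slew constraint are all hypothetical.

```python
from typing import List, Tuple


def plan_samples(utility: List[List[float]], max_slew: int) -> Tuple[float, List[int]]:
    """Pick one pointing position per time step to maximize total utility.

    utility[t][p] is the (assumed) science value of sampling position p at
    step t. Consecutive positions may differ by at most `max_slew`.
    Returns the best total utility and the chosen position for each step.
    """
    n_pos = len(utility[0])
    best = list(utility[0])  # best[p]: best total for a plan ending at p
    back: List[List[int]] = []  # back[t-1][p]: predecessor of p at step t
    for row in utility[1:]:
        new_best, pointers = [], []
        for p in range(n_pos):
            lo, hi = max(0, p - max_slew), min(n_pos - 1, p + max_slew)
            prev = max(range(lo, hi + 1), key=lambda j: best[j])
            new_best.append(best[prev] + row[p])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace the optimal plan backwards from the best final position.
    p = max(range(n_pos), key=lambda j: best[j])
    total = best[p]
    path = [p]
    for pointers in reversed(back):
        p = pointers[p]
        path.append(p)
    path.reverse()
    return total, path


# Hypothetical 3-step, 3-position utility grid with a slew limit of 1.
total, path = plan_samples([[1, 0, 5], [0, 2, 0], [9, 0, 1]], max_slew=1)
```

In the example, the high-value target at position 0 in the last step is only reachable from position 2 by passing through position 1, so the plan is `[2, 1, 0]` with total utility 16. A learned policy, as studied in the paper, would have to make such choices with only a limited lookahead.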
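[AI-55] ToDMA: Large Model-Driven Token-Domain Multiple Access for Semantic Communications
Quick read: This paper addresses spectrum scarcity and high latency in conventional wireless communication caused by the rapid growth in device numbers, and in particular how to achieve efficient, low-latency semantic-level transmission in multi-user settings. The core challenge is to improve the efficiency of a channel shared by many users without sacrificing semantic integrity, and to mitigate the gaps or distortion caused by token collisions. The key to the solution is a token-domain multiple access scheme (Token Domain Multiple Access, ToDMA) that uses a pre-trained multimodal large language model (MLLM) for context-aware semantic reconstruction: compressed sensing first detects the active tokens and their channel state information (CSI); the source token sequences are then recovered by clustering the CSI across time slots; and when tokens collide, the MLLM's contextual understanding is used to predict the missing tokens, which mitigates the effect of collisions, reduces transmission latency, and improves reconstruction quality.
Link: https://arxiv.org/abs/2505.10946
Authors: Li Qiao,Mahdi Boloursaz Mashhadi,Zhen Gao,Robert Schober,Deniz Gündüz
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Submitted to IEEE journals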
Abstract:Token communications (TokCom) is an emerging generative semantic communication concept that reduces transmission rates by using context and multimodal large language model (MLLM)-based token processing, with tokens serving as universal semantic units across modalities. In this paper, we propose a semantic multiple access scheme in the token domain, referred to as token domain multiple access (ToDMA), where a large number of devices share a token codebook and a modulation codebook for source and channel coding, respectively. Specifically, each transmitter first tokenizes its source signal and modulates each token to a codeword. At the receiver, compressed sensing is employed first to detect active tokens and the corresponding channel state information (CSI) from the superposed signals. Then, the source token sequences are reconstructed by clustering the token-associated CSI across multiple time slots. In case of token collisions, some active tokens cannot be assigned and some positions in the reconstructed token sequences are empty. We propose to use pre-trained MLLMs to leverage the context, predict masked tokens, and thus mitigate token collisions. Simulation results demonstrate the effectiveness of the proposed ToDMA framework for both text and image transmission tasks, achieving significantly lower latency compared to context-unaware orthogonal communication schemes, while also delivering superior distortion and perceptual quality compared to state-of-the-art context-unaware non-orthogonal communication methods.
[AI-56] QCardEst/QCardCorr: Quantum Cardinality Estimation and Correction
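Quick read: This paper addresses the accuracy of cardinality estimation, a key challenge for query optimization in database management systems (DBMS). For complex queries, traditional methods often produce large errors because of insufficient statistics or excessive computational cost, which degrades the quality of query execution plans. The paper proposes Quantum Cardinality Estimation (QCardEst), a quantum machine learning approach built on a hybrid quantum-classical network. Its core innovation is a compact encoding that maps an SQL query to a quantum state using only as many qubits as there are tables in the query, so that a single variational quantum circuit (VQC) can process the complete query. It also introduces Quantum Cardinality Correction (QCardCorr), which multiplies a classical cardinality estimate by a factor generated by a VQC to improve estimation accuracy. Experiments show improvements over the standard PostgreSQL optimizer of 6.37 times on JOB-light and 8.66 times on STATS, and a 3.47-times advantage over MSCN on JOB-light.
Link: https://arxiv.org/abs/2509.08817
Authors: Tobias Winker,Jinghua Groppe,Sven Groppe
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 7 pages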
Abstract:Cardinality estimation is an important part of query optimization in DBMS. We develop a Quantum Cardinality Estimation (QCardEst) approach using Quantum Machine Learning with a Hybrid Quantum-Classical Network. We define a compact encoding for turning SQL queries into a quantum state, which requires only qubits equal to the number of tables in the query. This allows the processing of a complete query with a single variational quantum circuit (VQC) on current hardware. In addition, we compare multiple classical post-processing layers to turn the probability vector output of VQC into a cardinality value. We introduce Quantum Cardinality Correction QCardCorr, which improves classical cardinality estimators by multiplying the output with a factor generated by a VQC to improve the cardinality estimation. With QCardCorr, we have an improvement over the standard PostgreSQL optimizer of 6.37 times for JOB-light and 8.66 times for STATS. For JOB-light we even outperform MSCN by a factor of 3.47.
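The correction step described in the abstract above (multiplying a classical estimate by a learned factor) can be sketched in a few lines. In this sketch the factor is a plain number standing in for the VQC output, and the q-error metric is an assumption chosen for illustration because it is common in cardinality-estimation work; the abstract does not say which metric the reported improvements use. The function names `corrected_cardinality` and `q_error` are hypothetical, and the numbers are made up, not results from the paper.

```python
def corrected_cardinality(classical_estimate, correction_factor):
    """Scale a classical cardinality estimate by a multiplicative factor.

    In QCardCorr the factor is generated by a VQC; here it is simply
    passed in as a number (an assumption made for this sketch).
    """
    if classical_estimate < 0 or correction_factor <= 0:
        raise ValueError("estimate must be >= 0 and factor must be > 0")
    return classical_estimate * correction_factor


def q_error(estimate, truth):
    """Symmetric ratio error; 1.0 means a perfect estimate."""
    estimate = max(estimate, 1.0)  # clamp to avoid division by zero
    truth = max(truth, 1.0)
    return max(estimate / truth, truth / estimate)


# Illustrative, made-up numbers.
classical = 1200.0   # estimate from a classical optimizer
truth = 300.0        # true result size
factor = 0.25        # placeholder for the VQC-generated factor

before = q_error(classical, truth)
after = q_error(corrected_cardinality(classical, factor), truth)
print(before, after)
```

A multiplicative correction keeps the classical estimator in the loop: the learned component only has to predict how far off the estimate is, not the cardinality itself.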
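[AI-57] Learning Turbulent Flows with Generative Models: Super-resolution, Forecasting, and Sparse Flow Reconstruction
Quick read: This paper addresses the oversmoothing of fine-scale turbulent structures that occurs when neural operators are trained with standard L2 losses, with the aim of improving modeling accuracy and physical fidelity in complex turbulent flows. The key to the solution is combining neural operators with generative modeling. Across three representative turbulence problems (super-resolution, forecasting, and sparse flow reconstruction), the results are as follows: an adversarially trained neural operator (adv-NO) reduces the energy-spectrum error by 15x in Schlieren jet super-resolution while preserving sharp gradients; for 3D homogeneous isotropic turbulence, adv-NO delivers a 114x wall-clock inference speed-up over baseline diffusion-based forecasters; and for sparse reconstruction, a conditional generative model recovers full 3D velocity and pressure fields. Together these results open a feasible path to near-real-time analysis and control in experimental and computational fluid mechanics.
Link: https://arxiv.org/abs/2509.08752
Authors: Vivek Oommen,Siavash Khodakarami,Aniruddha Bora,Zhicheng Wang,George Em Karniadakis
Affiliations: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: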
Abstract:Neural operators are promising surrogates for dynamical systems but when trained with standard L2 losses they tend to oversmooth fine-scale turbulent structures. Here, we show that combining operator learning with generative modeling overcomes this limitation. We consider three practical turbulent-flow challenges where conventional neural operators fail: spatio-temporal super-resolution, forecasting, and sparse flow reconstruction. For Schlieren jet super-resolution, an adversarially trained neural operator (adv-NO) reduces the energy-spectrum error by 15x while preserving sharp gradients at neural operator-like inference cost. For 3D homogeneous isotropic turbulence, adv-NO trained on only 160 timesteps from a single trajectory forecasts accurately for five eddy-turnover times and offers a 114x wall-clock speed-up at inference over the baseline diffusion-based forecasters, enabling near-real-time rollouts. For reconstructing cylinder wake flows from highly sparse Particle Tracking Velocimetry-like inputs, a conditional generative model infers full 3D velocity and pressure fields with correct phase alignment and statistics. These advances enable accurate reconstruction and forecasting at low compute cost, bringing near-real-time analysis and control within reach in experimental and computational fluid mechanics. See our project page: this https URL
[AI-58] FinZero: Launching Multi-modal Financial Time Series Forecast with Large Reasoning Model
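Quick read: This paper addresses three long-standing challenges in financial time series forecasting: first, conventional standardization discards important information; second, fixed requirements on the number of variables and the lookback window length limit scalability; third, insufficient interpretability and uncertainty analysis undermine reliability in practice. The key to the solution is constructing a diverse financial image-text dataset (FVLDB) and proposing Uncertainty-adjusted Group Relative Policy Optimization (UARPO), which is used to fine-tune the multimodal pre-trained model FinZero through reinforcement learning so that it not only outputs predictions but also analyzes their uncertainty. Experiments show that, after UARPO fine-tuning, FinZero improves prediction accuracy by approximately 13.48% over GPT-4o in the high-confidence group, supporting the effectiveness and adaptability of the method for financial time series forecasting.
Link: https://arxiv.org/abs/2509.08742
Authors: Yanlong Wang,Jian Xu,Fei Ma,Hongkang Zhang,Hang Yu,Tiantian Gao,Yu Wang,Haochen You,Shao-Lun Huang,Danny Dongning Sun,Xiao-Ping Zhang
Affiliations: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
Comments: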
Abstract:Financial time series forecasting is both highly significant and challenging. Previous approaches typically standardized time series data before feeding it into forecasting models, but this encoding process inherently leads to a loss of important information. Moreover, past time series models generally require fixed numbers of variables or lookback window lengths, which further limits the scalability of time series forecasting. Besides, the interpretability and the uncertainty in forecasting remain areas requiring further research, as these factors directly impact the reliability and practical value of predictions. To address these issues, we first construct a diverse financial image-text dataset (FVLDB) and develop the Uncertainty-adjusted Group Relative Policy Optimization (UARPO) method to enable the model to not only output predictions but also analyze the uncertainty of those predictions. We then propose FinZero, a multimodal pre-trained model finetuned by UARPO to perform reasoning, prediction, and analytical understanding on the FVLDB financial time series. Extensive experiments validate that FinZero exhibits strong adaptability and scalability. After fine-tuning with UARPO, FinZero achieves an approximate 13.48% improvement in prediction accuracy over GPT-4o in the high-confidence group, demonstrating the effectiveness of reinforcement learning fine-tuning in multimodal large models, including in financial time series forecasting tasks.
[AI-59] Robust Belief-State Policy Learning for Quantum Network Routing Under Decoherence and Time-Varying Conditions
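Quick read: This paper addresses routing decisions in dynamic quantum networks, which are made difficult by partial observability, decoherence, and scalability challenges, especially under entanglement degradation and time-varying channel noise. The key to the solution is a feature-based POMDP framework, combined with graph neural networks (GNNs), that encodes the complex quantum network dynamics into a low-dimensional feature space, enabling efficient belief updates and scalable policy learning. A hybrid GNN-POMDP architecture with a noise-adaptive mechanism fuses POMDP belief updates with GNN outputs for more robust decisions, significantly improving routing fidelity and entanglement delivery rates.
Link: https://arxiv.org/abs/2509.08654
Authors: Amirhossein Taherpour,Abbas Taherpour,Tamer Khattab
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: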
Abstract:This paper presents a feature-based Partially Observable Markov Decision Process (POMDP) framework for quantum network routing, combining belief-state planning with Graph Neural Networks (GNNs) to address partial observability, decoherence, and scalability challenges in dynamic quantum systems. Our approach encodes complex quantum network dynamics, including entanglement degradation and time-varying channel noise, into a low-dimensional feature space, enabling efficient belief updates and scalable policy learning. The core of our framework is a hybrid GNN-POMDP architecture that processes graph-structured representations of entangled links to learn routing policies, coupled with a noise-adaptive mechanism that fuses POMDP belief updates with GNN outputs for robust decision making. We provide a theoretical analysis establishing guarantees for belief convergence, policy improvement, and robustness to noise. Experiments on simulated quantum networks with up to 100 nodes demonstrate significant improvements in routing fidelity and entanglement delivery rates compared to state-of-the-art baselines, particularly under high decoherence and nonstationary conditions.
[AI-60] Agents of Discovery
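Quick read: This paper addresses the growing complexity of data analysis tool chains in modern particle physics and other areas of fundamental research, driven by very large data volumes; conventional special-purpose machine learning (ML) algorithms achieve high performance but lack generality and flexibility. The key to the solution is an agent-based system built on large language models (LLMs): a team of LLM instances, each with a specific subtask, that collaborates on data analysis problems the way a human researcher might, by writing code that operates standard tools and libraries (including ML systems) and by building on the results of earlier iterations. Experiments with current commercial LLMs (GPT-4o, o4-mini, GPT-4.1, and GPT-5) show that the agent-based system can solve an anomaly detection task on the LHC Olympics dataset, with the best agent-created solutions matching the performance of human state-of-the-art results, which suggests a feasible route to automating routine analysis components.
Link: https://arxiv.org/abs/2509.08535
Authors: Sascha Diefenbacher,Anna Hallin,Gregor Kasieczka,Michael Krämer,Anne Lauscher,Tim Lukas
Affiliations: Unknown
Categories: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Comments: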
Abstract:The substantial data volumes encountered in modern particle physics and other domains of fundamental physics research allow (and require) the use of increasingly complex data analysis tools and workflows. While the use of machine learning (ML) tools for data analysis has recently proliferated, these tools are typically special-purpose algorithms that rely, for example, on encoded physics knowledge to reach optimal performance. In this work, we investigate a new and orthogonal direction: Using recent progress in large language models (LLMs) to create a team of agents – instances of LLMs with specific subtasks – that jointly solve data analysis-based research problems in a way similar to how a human researcher might: by creating code to operate standard tools and libraries (including ML systems) and by building on results of previous iterations. If successful, such agent-based systems could be deployed to automate routine analysis components to counteract the increasing complexity of modern tool chains. To investigate the capabilities of current-generation commercial LLMs, we consider the task of anomaly detection via the publicly available and highly-studied LHC Olympics dataset. Several current models by OpenAI (GPT-4o, o4-mini, GPT-4.1, and GPT-5) are investigated and their stability tested. Overall, we observe the capacity of the agent-based system to solve this data analysis problem. The best agent-created solutions mirror the performance of human state-of-the-art results.
[AI-61] Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition
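Quick read: This paper addresses the significant degradation of speech emotion recognition (SER) under noisy conditions, while avoiding the artifacts and extra computational overhead introduced by conventional speech enhancement (SE). The core challenge is to jointly optimize SER and SE in a multi-task learning (MTL) framework while overcoming the gradient interference and representational conflicts common in shared-backbone models. The key to the solution is the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), which operates on self-supervised speech representations: at the frame level, task-specific gating networks dynamically select experts from a shared pool, giving parameter-efficient, task-adaptive representation learning and making both SER and SE more robust in noise.
Link: https://arxiv.org/abs/2509.08470
Authors: Jing-Tong Tzeng,Carlos Busso,Chi-Chun Lee
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: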
Abstract:Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. Although speech enhancement (SE) can improve robustness, it often introduces artifacts that obscure emotional cues and adds computational overhead to the pipeline. Multi-task learning (MTL) offers an alternative by jointly optimizing SE and SER tasks. However, conventional shared-backbone models frequently suffer from gradient interference and representational conflicts between tasks. To address these challenges, we propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Sparse MERIT incorporates task-specific gating networks that dynamically select from a shared pool of experts for each frame, enabling parameter-efficient and task-adaptive representation learning. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks. Under the most challenging condition of -5 dB signal-to-noise ratio (SNR), Sparse MERIT improves SER F1-macro by an average of 12.0% over a baseline relying on a SE pre-processing strategy, and by 3.4% over a naive MTL baseline, with statistical significance on unseen noise conditions. For SE, Sparse MERIT improves segmental SNR (SSNR) by 28.2% over the SE pre-processing baseline and by 20.0% over the naive MTL baseline. These results demonstrate that Sparse MERIT provides robust and generalizable performance for both emotion recognition and enhancement tasks in noisy environments.
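Frame-wise expert routing with task-specific gates, as described in the abstract above, can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' implementation: the top-k selection rule, the softmax renormalisation, the linear gates, and the scaling "experts" are all assumptions, and every name (`route_frame`, `gates`, `experts`) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def route_frame(frame, gate_weights, experts, top_k=2):
    """Mix the top_k experts selected by one task's gate for one frame."""
    scores = [dot(row, frame) for row in gate_weights]  # one score per expert
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in chosen])      # renormalise over chosen experts
    mixed = [0.0] * len(frame)
    for weight, index in zip(weights, chosen):
        output = experts[index](frame)
        mixed = [m + weight * o for m, o in zip(mixed, output)]
    return mixed, chosen


# Shared pool of toy experts: each one just scales the frame (assumption).
experts = [lambda f, s=s: [s * x for x in f] for s in (1.0, 2.0, 3.0, 4.0)]

# One gate per task, with fixed illustrative weights (one row per expert).
gates = {
    "ser": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    "se": [[-1.0, 0.0], [0.0, -1.0], [1.0, 0.0], [0.0, 1.0]],
}

frame = [0.9, 0.1]  # a single 2-dimensional feature frame
ser_out, ser_experts = route_frame(frame, gates["ser"], experts)
se_out, se_experts = route_frame(frame, gates["se"], experts)
print(ser_experts, se_experts)
```

Because each task has its own gate, the same frame can be sent to different experts for SER and for SE, which is one way a shared pool can avoid forcing both tasks through identical parameters.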
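[AI-62] An Iterative LLM Framework for SIBT utilizing RAG-based Adaptive Weight Optimization
Quick read: This paper addresses the inefficiency and suboptimal results that arise in clinical planning for seed implant brachytherapy (SIBT) from manual adjustment of objective function weights. The key to the solution is an adaptive weight optimization framework driven by large language models (LLMs), in which a locally deployed DeepSeek-R1 LLM works with an automatic planning algorithm in an iterative loop: an initial plan is generated with fixed weights; the LLM, supported by a clinical knowledge base built and queried via retrieval-augmented generation (RAG), evaluates plan quality and recommends weights for the next iteration until the convergence criteria are met; the LLM then evaluates all candidate plans to identify the best one. On 23 patient cases, the LLM-assisted plans were comparable to or better than clinically approved and fixed-weight plans in dose homogeneity for the clinical target volume (CTV) and sparing of organs at risk (OARs).
Link: https://arxiv.org/abs/2509.08407
Authors: Zhuo Xiao(1),Qinglong Yao(1),Jingjing Wang(1),Fugen Zhou(1),Bo Liu(1),Haitao Sun(2),Zhe Ji(2),Yuliang Jiang(2),Junjie Wang(2),Qiuwen Wu(3) ((1) Image Processing Center, Beihang University, Beijing, China, (2) Department of Radiation Oncology, Peking University Third Hospital, Beijing, China, (3) Department of Radiation Oncology, Duke University Medical Center, Durham, USA)
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: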
Abstract:Seed implant brachytherapy (SIBT) is an effective cancer treatment modality; however, clinical planning often relies on manual adjustment of objective function weights, leading to inefficiencies and suboptimal results. This study proposes an adaptive weight optimization framework for SIBT planning, driven by large language models (LLMs). A locally deployed DeepSeek-R1 LLM is integrated with an automatic planning algorithm in an iterative loop. Starting with fixed weights, the LLM evaluates plan quality and recommends new weights in the next iteration. This process continues until convergence criteria are met, after which the LLM conducts a comprehensive evaluation to identify the optimal plan. A clinical knowledge base, constructed and queried via retrieval-augmented generation (RAG), enhances the model’s domain-specific reasoning. The proposed method was validated on 23 patient cases, showing that the LLM-assisted approach produces plans that are comparable to or exceeding clinically approved and fixed-weight plans, in terms of dose homogeneity for the clinical target volume (CTV) and sparing of organs at risk (OARs). The study demonstrates the potential use of LLMs in SIBT planning automation.
[AI-63] Combined-distance-based score function of cognitive fuzzy sets and its application in lung cancer pain evaluation
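Quick read: This paper addresses the lack of research on distance measures for cognitive fuzzy sets (CFS), and in particular the fact that the existing Minkowski distance ignores the hesitancy degree of a CFS, which can lead to decision errors. The key to the solution is a set of three distance measures: the improved cognitive fuzzy Minkowski (CF-IM) distance, the cognitive fuzzy Hausdorff (CF-H) distance, and the cognitive fuzzy combined (CF-C) distance, which is a linear combination of the first two. The CF-H distance is more resistant to perturbation, whereas the CF-IM distance makes fuller use of the available information; the CF-C distance balances the two. A score function based on the CF-C distance is then used to compare CFSs, and its reliability and advantages are demonstrated on lung cancer pain evaluation.
Link: https://arxiv.org/abs/2509.08239
Authors: Lisheng Jiang,Tianyu Zhang,Shiyu Yan,Ran Fang
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: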
Abstract:In decision making, the cognitive fuzzy set (CFS) is a useful tool in expressing experts’ complex assessments of alternatives. The distance of CFS, which plays an important role in decision analyses, is necessary when the CFS is applied in solving practical issues. However, as far as we know, the studies on the distance of CFS are few, and the current Minkowski distance of CFS ignores the hesitancy degree of CFS, which might cause errors. To fill the gap of the studies on the distance of CFS, because of the practicality of the Hausdorff distance, this paper proposes the improved cognitive fuzzy Minkowski (CF-IM) distance and the cognitive fuzzy Hausdorff (CF-H) distance to enrich the studies on the distance of CFS. It is found that the anti-perturbation ability of the CF-H distance is stronger than that of the CF-IM distance, but the information utilization of the CF-IM distance is higher than that of the CF-H distance. To balance the anti-perturbation ability and information utilization of the CF-IM distance and CF-H distance, the cognitive fuzzy combined (CF-C) distance is proposed by establishing the linear combination of the CF-IM distance and CF-H distance. Based on the CF-C distance, a combined-distance-based score function of CFS is proposed to compare CFSs. The proposed score function is employed in lung cancer pain evaluation issues. The sensitivity and comparison analyses demonstrate the reliability and advantages of the proposed methods.
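[AI-64] The Computational Foundations of Collective Intelligence
Quick read: This paper asks why collectives outperform individuals on some problems. The key to its answer is a unifying framework based on collective advantages in computational resources: collectives exhibit collective intelligence because they have more sensory information, more memory, more processing capacity, and more ways to act. The framework explains well-known phenomena such as the wisdom of the crowd, collective sensing, division of labour, and cultural learning, and it also generates testable predictions about collective capabilities in distributed reasoning and context-dependent behavioural switching. Case studies of animal navigation and decision-making show how collectives use these resource advantages to solve problems more effectively than individuals, using qualitatively different problem-solving strategies.
Link: https://arxiv.org/abs/2509.07999
Authors: Charlie Pilgrim,Joe Morford,Elizabeth Warren,Mélisande Aellen,Christopher Krupenye,Richard P Mann,Dora Biro
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO); Physics and Society (physics.soc-ph)
Comments: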
Abstract:Why do collectives outperform individuals when solving some problems? Fundamentally, collectives have greater computational resources with more sensory information, more memory, more processing capacity, and more ways to act. While greater resources present opportunities, there are also challenges in coordination and cooperation inherent in collectives with distributed, modular structures. Despite these challenges, we show how collective resource advantages lead directly to well-known forms of collective intelligence including the wisdom of the crowd, collective sensing, division of labour, and cultural learning. Our framework also generates testable predictions about collective capabilities in distributed reasoning and context-dependent behavioural switching. Through case studies of animal navigation and decision-making, we demonstrate how collectives leverage their computational resources to solve problems not only more effectively than individuals, but by using qualitatively different problem-solving strategies.
[AI-65] DLGE: Dual Local-Global Encoding for Generalizable Cross-BCI-Paradigm
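Quick read: This paper addresses the difficulty of generalizing decoding models across multiple brain-computer interface (BCI) paradigms, that is, classifying different BCI paradigms with a single model without retraining or retuning for each one. The key to the solution is the Dual Local-Global Encoder (DLGE). An anatomically inspired brain-region partitioning and padding strategy standardizes the EEG channel configuration across paradigms. The local encoder learns features shared across paradigms within each brain region from time-frequency information, combining temporal attention on individual channels with spatial attention among the channels of each brain region. The global encoder then aggregates these shared features into paradigm-specific representations, enabling cross-paradigm classification without retraining.
Link: https://arxiv.org/abs/2509.07991
Authors: Jingyuan Wang,Junhua Li
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: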
Abstract:Deep learning models have been frequently used to decode a single brain-computer interface (BCI) paradigm based on electroencephalography (EEG). It is challenging to decode multiple BCI paradigms using one model due to diverse barriers, such as different channel configurations and disparate task-related representations. In this study, we propose Dual Local-Global Encoder (DLGE), enabling the classification across different BCI paradigms. To address the heterogeneity in EEG channel configurations across paradigms, we employ an anatomically inspired brain-region partitioning and padding strategy to standardize EEG channel configuration. In the proposed model, the local encoder is designed to learn shared features across BCI paradigms within each brain region based on time-frequency information, which integrates temporal attention on individual channels with spatial attention among channels for each brain region. These shared features are subsequently aggregated in the global encoder to form respective paradigm-specific feature representations. Three BCI paradigms (motor imagery, resting state, and driving fatigue) were used to evaluate the proposed model. The results demonstrate that our model is capable of processing diverse BCI paradigms without retraining and retuning, achieving average macro precision, recall, and F1-score of 60.16%, 59.88%, and 59.56%, respectively. We made an initial attempt to develop a general model for cross-BCI-paradigm classification, avoiding retraining or redevelopment for each paradigm. This study paves the way for the development of an effective but simple model for cross-BCI-paradigm decoding, which might benefit the design of portable devices for universal BCI decoding.
[AI-66] Signals vs. Videos: Advancing Motion Intention Recognition for Human-Robot Collaboration in Construction
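Quick read: This paper addresses how robots in human-robot collaboration (HRC) on construction sites can recognize workers' motion intentions accurately and promptly, to improve safety and workflow efficiency. It compares two data modalities, signals and videos, for recognizing motion intention at the early stage of movement. The solution uses deep learning for both: a Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) model on surface electromyography (sEMG) data reaches about 87% accuracy with an average prediction time of 0.04 seconds, while a pre-trained Video Swin Transformer with transfer learning on video sequences reaches 94% accuracy with a longer average prediction time of 0.15 seconds. The results expose the trade-off between accuracy and prediction time for the two modalities and give a basis for choosing the input modality to suit project requirements.
Link: https://arxiv.org/abs/2509.07990
Authors: Charan Gajjala Chenchu,Kinam Kim,Gao Lu,Zia Ud Din
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: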
Abstract:Human-robot collaboration (HRC) in the construction industry depends on precise and prompt recognition of human motion intentions and actions by robots to maximize safety and workflow efficiency. There is a research gap in comparing data modalities, specifically signals and videos, for motion intention recognition. To address this, the study leverages deep learning to assess two different modalities in recognizing workers’ motion intention at the early stage of movement in drywall installation tasks. The Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) model utilizing surface electromyography (sEMG) data achieved an accuracy of around 87% with an average time of 0.04 seconds to perform prediction on a sample input. Meanwhile, the pre-trained Video Swin Transformer combined with transfer learning harnessed video sequences as input to recognize motion intention and attained an accuracy of 94% but with a longer average time of 0.15 seconds for a similar prediction. This study emphasizes the unique strengths and trade-offs of both data formats, directing their systematic deployments to enhance HRC in real-world construction projects.
Machine Learning
[LG-0] A Survey of TinyML Applications in Beekeeping for Hive Monitoring and Management
Link: https://arxiv.org/abs/2509.08822
Authors: Willy Sucipto,Jianlong Zhou,Ray Seung Min Kwon,Fang Chen
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 8 figures, 3 tables. Survey of TinyML and IoT applications in beekeeping (datasets, benchmarking, deployment). Submitted to ACM Computing Surveys (under review)
Abstract:Honey bee colonies are essential for global food security and ecosystem stability, yet they face escalating threats from pests, diseases, and environmental stressors. Traditional hive inspections are labor-intensive and disruptive, while cloud-based monitoring solutions remain impractical for remote or resource-limited apiaries. Recent advances in Internet of Things (IoT) and Tiny Machine Learning (TinyML) enable low-power, real-time monitoring directly on edge devices, offering scalable and non-invasive alternatives. This survey synthesizes current innovations at the intersection of TinyML and apiculture, organized around four key functional areas: monitoring hive conditions, recognizing bee behaviors, detecting pests and diseases, and forecasting swarming events. We further examine supporting resources, including publicly available datasets, lightweight model architectures optimized for embedded deployment, and benchmarking strategies tailored to field constraints. Critical limitations such as data scarcity, generalization challenges, and deployment barriers in off-grid environments are highlighted, alongside emerging opportunities in ultra-efficient inference pipelines, adaptive edge learning, and dataset standardization. By consolidating research and engineering practices, this work provides a foundation for scalable, AI-driven, and ecologically informed monitoring systems to support sustainable pollinator management.
[LG-1] ADHDeepNet From Raw EEG to Diagnosis: Improving ADHD Diagnosis through Temporal-Spatial Processing, Adaptive Attention Mechanisms, and Explainability in Raw EEG Signals
Link: https://arxiv.org/abs/2509.08779
Authors: Ali Amini,Mohammad Alijanpour,Behnam Latifi,Ali Motie Nasrabadi
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 29 pages, 7 figures. Preprint. Correspondence: alijanpour@ucf.edu
Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a common brain disorder in children that can persist into adulthood, affecting social, academic, and career life. Early diagnosis is crucial for managing these impacts on patients and the healthcare system but is often labor-intensive and time-consuming. This paper presents a novel method to improve ADHD diagnosis precision and timeliness by leveraging Deep Learning (DL) approaches and electroencephalogram (EEG) signals. We introduce ADHDeepNet, a DL model that utilizes comprehensive temporal-spatial characterization, attention modules, and explainability techniques optimized for EEG signals. ADHDeepNet integrates feature extraction and refinement processes to enhance ADHD diagnosis. The model was trained and validated on a dataset of 121 participants (61 ADHD, 60 Healthy Controls), employing nested cross-validation for robust performance. The proposed two-stage methodology uses a 10-fold cross-subject validation strategy. Initially, each iteration optimizes the model’s hyper-parameters with inner 2-fold cross-validation. Then, Additive Gaussian Noise (AGN) with various standard deviations and magnification levels is applied for data augmentation. ADHDeepNet achieved 100% sensitivity and 99.17% accuracy in classifying ADHD/HC subjects. To clarify model explainability and identify key brain regions and frequency bands for ADHD diagnosis, we analyzed the learned weights and activation patterns of the model’s primary layers. Additionally, t-distributed Stochastic Neighbor Embedding (t-SNE) visualized high-dimensional data, aiding in interpreting the model’s decisions. This study highlights the potential of DL and EEG in enhancing ADHD diagnosis accuracy and efficiency.
[LG-2] Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning
Link: https://arxiv.org/abs/2509.08759
Authors: Mominul Rubel,Adam Meyers,Gabriel Nicolosi
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We introduce the Fourier Learning Machine (FLM), a neural network (NN) architecture designed to represent a multidimensional nonharmonic Fourier series. The FLM uses a simple feedforward structure with cosine activation functions to learn the frequencies, amplitudes, and phase shifts of the series as trainable parameters. This design allows the model to create a problem-specific spectral basis adaptable to both periodic and nonperiodic functions. Unlike previous Fourier-inspired NN models, the FLM is the first architecture able to represent a complete, separable Fourier basis in multiple dimensions using a standard Multilayer Perceptron-like architecture. A one-to-one correspondence between the Fourier coefficients and amplitudes and phase-shifts is demonstrated, allowing for the translation between a full, separable basis form and the cosine phase–shifted one. Additionally, we evaluate the performance of FLMs on several scientific computing problems, including benchmark Partial Differential Equations (PDEs) and a family of Optimal Control Problems (OCPs). Computational experiments show that the performance of FLMs is comparable, and often superior, to that of established architectures like SIREN and vanilla feedforward NNs.
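The correspondence between Fourier coefficients and amplitude/phase pairs mentioned in the abstract above rests, for a single one-dimensional term, on the identity a·cos(θ) + b·sin(θ) = A·cos(θ − φ) with A = √(a² + b²) and φ = atan2(b, a). The sketch below checks that identity numerically. It illustrates the idea only and is not the authors' architecture; the function names are hypothetical and the coefficient values are arbitrary.

```python
import math


def to_amplitude_phase(a, b):
    """Map a*cos(t) + b*sin(t) to the pair (A, phi) of A*cos(t - phi)."""
    return math.hypot(a, b), math.atan2(b, a)


def to_coefficients(amplitude, phase):
    """Inverse map: A*cos(t - phi) back to the pair (a, b)."""
    return amplitude * math.cos(phase), amplitude * math.sin(phase)


def fourier_term(a, b, omega, x):
    """Sine/cosine form of one term with frequency omega."""
    return a * math.cos(omega * x) + b * math.sin(omega * x)


def cosine_term(amplitude, phase, omega, x):
    """Phase-shifted cosine form, as a cosine-activated unit would compute it."""
    return amplitude * math.cos(omega * x - phase)


# Arbitrary values; omega is deliberately not an integer (nonharmonic).
a, b, omega = 0.8, -1.3, 2.7
amplitude, phase = to_amplitude_phase(a, b)

grid = [i * 0.1 for i in range(-50, 51)]
max_gap = max(
    abs(fourier_term(a, b, omega, x) - cosine_term(amplitude, phase, omega, x))
    for x in grid
)
a_back, b_back = to_coefficients(amplitude, phase)
print(max_gap, a_back, b_back)
```

In a network of this kind, `omega` would play the role of an input weight, `phase` a bias, and `amplitude` an output weight, which is why all three can be learned as ordinary trainable parameters.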
[LG-3] PracMHBench: Re-evaluating Model-Heterogeneous Federated Learning Based on Practical Edge Device Constraints
Link: https://arxiv.org/abs/2509.08750
Authors: Yuanchun Guo,Bingyan Liu,Yulong Sha,Zhensheng Xian
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by DAC2025
Abstract:Federating heterogeneous models on edge devices with diverse resource constraints has been a notable trend in recent years. Compared to traditional federated learning (FL) that assumes an identical model architecture to cooperate, model-heterogeneous FL is more practical and flexible since the model can be customized to satisfy the deployment requirement. Unfortunately, no prior work ever dives into the existing model-heterogeneous FL algorithms under the practical edge device constraints and provides quantitative analysis on various data scenarios and metrics, which motivates us to rethink and re-evaluate this paradigm. In our work, we construct the first system platform PracMHBench to evaluate model-heterogeneous FL on practical constraints of edge devices, where diverse model heterogeneity algorithms are classified and tested on multiple data tasks and metrics. Based on the platform, we perform extensive experiments on these algorithms under the different edge constraints to observe their applicability and the corresponding heterogeneity pattern.
[LG-4] ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
Link: https://arxiv.org/abs/2509.08736
Authors: Dong Han,Zhehong Ai,Pengxiang Cai,Shuzhou Sun,Shanya Lu,Jianpeng Chen,Ben Gao,Lingli Ge,Weida Wang,Xiangxin Zhou,Xihui Liu,Mao Su,Wanli Ouyang,Lei Bai,Dongzhan Zhou,Tao XU,Yuqiang Li,Shufei Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The efficiency of Bayesian optimization (BO) in chemistry is often hindered by sparse experimental data and complex reaction mechanisms. To overcome these limitations, we introduce ChemBOMAS, a new framework named LLM-Enhanced Multi-Agent System for accelerating BO in chemistry. ChemBOMAS’s optimization process is enhanced by LLMs and synergistically employs two strategies: knowledge-driven coarse-grained optimization and data-driven fine-grained optimization. First, in the knowledge-driven coarse-grained optimization stage, LLMs intelligently decompose the vast search space by reasoning over existing chemical knowledge to identify promising candidate regions. Subsequently, in the data-driven fine-grained optimization stage, LLMs enhance the BO process within these candidate regions by generating pseudo-data points, thereby improving data utilization efficiency and accelerating convergence. Benchmark evaluations further confirm that ChemBOMAS significantly enhances optimization effectiveness and efficiency compared to various BO algorithms. Importantly, the practical utility of ChemBOMAS was validated through wet-lab experiments conducted under pharmaceutical industry protocols, targeting conditional optimization for a previously unreported and challenging chemical reaction. In the wet experiment, ChemBOMAS achieved an optimal objective value of 96%. This was substantially higher than the 15% achieved by domain experts. This real-world success, together with strong performance on benchmark evaluations, highlights ChemBOMAS as a powerful tool to accelerate chemical discovery.
[LG-5] Data-driven generative simulation of SDEs using diffusion models
Link: https://arxiv.org/abs/2509.08731
Authors: Xuefeng Gao,Jiale Zha,Xun Yu Zhou
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:This paper introduces a new approach to generating sample paths of unknown stochastic differential equations (SDEs) using diffusion models, a class of generative AI models commonly employed in image and video applications. Unlike the traditional Monte Carlo methods for simulating SDEs, which require explicit specifications of the drift and diffusion coefficients, our method takes a model-free, data-driven approach. Given a finite set of sample paths from an SDE, we utilize conditional diffusion models to generate new, synthetic paths of the same SDE. To demonstrate the effectiveness of our approach, we conduct a simulation experiment to compare our method with alternative benchmark ones including neural SDEs. Furthermore, in an empirical study we leverage these synthetically generated sample paths to enhance the performance of reinforcement learning algorithms for continuous-time mean-variance portfolio selection, hinting at promising applications of diffusion models in financial analysis and decision-making.
[LG-6] Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Link: https://arxiv.org/abs/2509.08721
Authors: Jeffrey Amico,Gabriel Passamani Andrade,John Donaghy,Ben Fielding,Tristin Forbus,Harry Grieve,Semih Kara,Jari Kolehmainen,Yihua Lou,Christopher Nies,Edward Phillip Flores Nuño,Diogo Ortega,Shikhar Rastogi,Austin Virts,Matthew J. Wright
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 14 pages, 6 figures
Abstract:Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g. latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while “sharing” rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required and nodes can operate in silo if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts “shared” across the network, it enables “Aha moments” to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
[LG-7] Compressing CNN models for resource-constrained systems by channel and layer pruning
Link: https://arxiv.org/abs/2509.08714
Authors: Ahmed Sadaqa,Di Liu
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures, the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Abstract:Convolutional Neural Networks (CNNs) have achieved significant breakthroughs in various fields. However, these advancements have led to a substantial increase in the complexity and size of these networks. This poses a challenge when deploying large and complex networks on edge devices. Consequently, model compression has emerged as a research field aimed at reducing the size and complexity of CNNs. One prominent technique in model compression is model pruning. This paper will present a new technique of pruning that combines both channel and layer pruning in what is called a “hybrid pruning framework”. Inspired by EfficientNet, a renowned CNN architecture known for scaling up networks from both channel and layer perspectives, this hybrid approach applies the same principles but in reverse, where it scales down the network through pruning. Experiments on the hybrid approach demonstrated a notable decrease in the overall complexity of the model, with only a minimal reduction in accuracy compared to the baseline model. This complexity reduction translates into reduced latency when deploying the pruned models on an NVIDIA JETSON TX2 embedded AI device.
[LG-8] Securing Private Federated Learning in a Malicious Setting: A Scalable TEE-Based Approach with Client Auditing
Link: https://arxiv.org/abs/2509.08709
Authors: Shun Takagi,Satoshi Hasegawa
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Accepted at PoPETs 2026
Abstract:In cross-device private federated learning, differentially private follow-the-regularized-leader (DP-FTRL) has emerged as a promising privacy-preserving method. However, existing approaches assume a semi-honest server and have not addressed the challenge of securely removing this assumption. This is due to its statefulness, which becomes particularly problematic in practical settings where clients can drop out or be corrupted. While trusted execution environments (TEEs) might seem like an obvious solution, a straightforward implementation can introduce forking attacks or availability issues due to state management. To address this problem, our paper introduces a novel server extension that acts as a trusted computing base (TCB) to realize maliciously secure DP-FTRL. The TCB is implemented with an ephemeral TEE module on the server side to produce verifiable proofs of server actions. Some clients, upon being selected, participate in auditing these proofs with small additional communication and computational demands. This extension solution reduces the size of the TCB while maintaining the system’s scalability and liveness. We provide formal proofs based on interactive differential privacy, demonstrating privacy guarantee in malicious settings. Finally, we experimentally show that our framework adds small constant overhead to clients in several realistic settings.
[LG-9] Machine Learning-Based Prediction of Speech Arrest During Direct Cortical Stimulation Mapping
Link: https://arxiv.org/abs/2509.08703
Authors: Nikasadat Emami,Amirhossein Khalilian-Gourtani,Jianghao Qian,Antoine Ratouchniak,Xupeng Chen,Yao Wang,Adeen Flinker
Categories: Machine Learning (cs.LG)
Comments: Accepted at IEEE International Conference on Neural Engineering (NER), 2025. This is the author’s accepted manuscript
Abstract:Identifying cortical regions critical for speech is essential for safe brain surgery in or near language areas. While Electrical Stimulation Mapping (ESM) remains the clinical gold standard, it is invasive and time-consuming. To address this, we analyzed intracranial electrocorticographic (ECoG) data from 16 participants performing speech tasks and developed machine learning models to directly predict if the brain region underneath each ECoG electrode is critical. Ground truth labels indicating speech arrest were derived independently from Electrical Stimulation Mapping (ESM) and used to train classification models. Our framework integrates neural activity signals, anatomical region labels, and functional connectivity features to capture both local activity and network-level dynamics. We found that models combining region and connectivity features matched the performance of the full feature set, and outperformed models using either type alone. To classify each electrode, trial-level predictions were aggregated using an MLP applied to histogram-encoded scores. Our best-performing model, a trial-level RBF-kernel Support Vector Machine together with MLP-based aggregation, achieved strong accuracy on held-out participants (ROC-AUC: 0.87, PR-AUC: 0.57). These findings highlight the value of combining spatial and network information with non-linear modeling to improve functional mapping in presurgical evaluation.
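The abstract mentions that trial-level predictions are aggregated per electrode via histogram-encoded scores. The snippet below is only a minimal sketch of what such an encoding could look like, assuming each trial score lies in [0, 1]; the function name `histogram_encode`, the bin count, and the normalization are illustrative assumptions and are not taken from the paper.

```python
def histogram_encode(scores, n_bins=10):
    """Turn a variable-length list of trial scores in [0, 1] into a
    fixed-length, normalized histogram (hypothetical encoding)."""
    if not scores:
        raise ValueError("need at least one trial score")
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]


# Example: four trial scores for one electrode, encoded with 4 bins.
features = histogram_encode([0.05, 0.15, 0.95, 1.0], n_bins=4)
```

A fixed-length vector of this kind could then be passed to any downstream classifier, regardless of how many trials each electrode has.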
[LG-10] Perfectly-Private Analog Secure Aggregation in Federated Learning
Link: https://arxiv.org/abs/2509.08683
Authors: Delio Jaramillo-Velez,Charul Rajput,Ragnar Freij-Hollanti,Camilla Hollanti,Alexandre Graell i Amat
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: Comments welcome
Abstract:In federated learning, multiple parties train models locally and share their parameters with a central server, which aggregates them to update a global model. To address the risk of exposing sensitive data through local models, secure aggregation via secure multiparty computation has been proposed to enhance privacy. At the same time, perfect privacy can only be achieved by a uniform distribution of the masked local models to be aggregated. This raises a problem when working with real valued data, as there is no measure on the reals that is invariant under the masking operation, and hence information leakage is bound to occur. Shifting the data to a finite field circumvents this problem, but as a downside runs into an inherent accuracy complexity tradeoff issue due to fixed point modular arithmetic as opposed to floating point numbers that can simultaneously handle numbers of varying magnitudes. In this paper, a novel secure parameter aggregation method is proposed that employs the torus rather than a finite field. This approach guarantees perfect privacy for each party’s data by utilizing the uniform distribution on the torus, while avoiding accuracy losses. Experimental results show that the new protocol performs similarly to the model without secure aggregation while maintaining perfect privacy. Compared to the finite field secure aggregation, the torus-based protocol can in some cases significantly outperform it in terms of model accuracy and cosine similarity, hence making it a safer choice.
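To illustrate why the torus is convenient for masking, here is a toy sketch of additive masking modulo 1. It assumes a simple pairwise-mask scheme (each pair of parties shares one uniform mask that cancels in the sum) and inputs pre-scaled so that their true sum lies in [0, 1). Both choices are assumptions made for illustration and are not the protocol from the paper. Floating-point arithmetic only approximates the torus, so this toy code does not carry the paper's perfect-privacy guarantee.

```python
import random


def mask_inputs(values, rng):
    """Add pairwise masks modulo 1 (hypothetical scheme).

    For every pair (i, j) a mask r, uniform on [0, 1), is added to party i
    and subtracted from party j, so all masks cancel in the aggregate.
    """
    n = len(values)
    masked = list(values)
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.random()  # stands in for a secret shared by parties i and j
            masked[i] = (masked[i] + r) % 1.0
            masked[j] = (masked[j] - r) % 1.0
    return masked


def aggregate(masked):
    """Server-side aggregation: sum of masked values, reduced modulo 1."""
    return sum(masked) % 1.0


values = [0.10, 0.25, 0.05]  # pre-scaled so the true sum (0.40) is below 1
recovered = aggregate(mask_inputs(values, random.Random(0)))
```

Because a uniform mask on the torus makes each masked value uniform as well, no individual input can be inferred from a single masked value in the idealized setting; the sketch only demonstrates that the sum is still recoverable.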
[LG-11] Signal Fidelity Index-Aware Calibration for Dementia Predictions Across Heterogeneous Real-World Data
Link: https://arxiv.org/abs/2509.08679
Authors: Jingya Cheng,Jiazi Tian,Federica Spoto,Alaleh Azhir,Daniel Mork,Hossein Estiri
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Background: Machine learning models trained on electronic health records (EHRs) often degrade across healthcare systems due to distributional shift. A fundamental but underexplored factor is diagnostic signal decay: variability in diagnostic quality and consistency across institutions, which affects the reliability of codes used for training and prediction. Objective: To develop a Signal Fidelity Index (SFI) quantifying diagnostic data quality at the patient level in dementia, and to test SFI-aware calibration for improving model performance across heterogeneous datasets without outcome labels. Methods: We built a simulation framework generating 2,500 synthetic datasets, each with 1,000 patients and realistic demographics, encounters, and coding patterns based on dementia risk factors. The SFI was derived from six interpretable components: diagnostic specificity, temporal consistency, entropy, contextual concordance, medication alignment, and trajectory stability. SFI-aware calibration applied a multiplicative adjustment, optimized across 50 simulation batches. Results: At the optimal parameter ($\alpha = 2.0$), SFI-aware calibration significantly improved all metrics (p < 0.001). Gains ranged from 10.3% for Balanced Accuracy to 32.5% for Recall, with notable increases in Precision (31.9%) and F1-score (26.1%). Performance approached reference standards, with F1-score and Recall within 1% and Balanced Accuracy and Detection Rate improved by 52.3% and 41.1%, respectively. Conclusions: Diagnostic signal decay is a tractable barrier to model generalization. SFI-aware calibration provides a practical, label-free strategy to enhance prediction across healthcare contexts, particularly for large-scale administrative datasets lacking outcome labels.
[LG-12] Replicable Reinforcement Learning with Linear Function Approximation
Link: https://arxiv.org/abs/2509.08660
Authors: Eric Eaton,Marcel Hussing,Michael Kearns,Aaron Roth,Sikata Bela Sengupta,Jessica Sorrell
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Replication of experimental results has been a challenge faced by many scientific disciplines, including the field of machine learning. Recent work on the theory of machine learning has formalized replicability as the demand that an algorithm produce identical outcomes when executed twice on different samples from the same distribution. Provably replicable algorithms are especially interesting for reinforcement learning (RL), where algorithms are known to be unstable in practice. While replicable algorithms exist for tabular RL settings, extending these guarantees to more practical function approximation settings has remained an open problem. In this work, we make progress by developing replicable methods for linear function approximation in RL. We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation, each of independent interest. We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes in both the generative model and episodic settings. Finally, we evaluate our algorithms experimentally and show how they can inspire more consistent neural policies.
[LG-13] An upper bound of the silhouette validation metric for clustering
Link: https://arxiv.org/abs/2509.08625
Authors: Hugo Sträng,Tai Dinh
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The silhouette coefficient summarizes, per observation, cohesion versus separation in [-1, 1]; the average silhouette width (ASW) is a common internal measure of clustering quality where higher values indicate more coveted results. However, the dataset-specific maximum of ASW is typically unknown, and the standard upper limit 1 is often unattainable. In this work, we derive for each data point in a given dataset a sharp upper bound on its silhouette width. By aggregating these individual bounds, we present a canonical data-dependent upper bound on ASW that often assumes values well below 1. The presented bounds can indicate whether individual data points can ever be well placed, enable early stopping of silhouette-based optimization loops, and help answer a key question: How close is my clustering result to the best possible outcome on this specific data? Across synthetic and real datasets, the bounds are provably near-tight in many cases and offer significant enrichment of cluster quality evaluation.
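For readers unfamiliar with the quantity being bounded, the following sketch computes the standard per-point silhouette width and the average silhouette width (ASW) with Euclidean distances. It shows only the classical definition, not the upper bound derived in the paper. The function names are illustrative, at least two clusters are assumed, and singleton clusters are given width 0 by the usual convention.

```python
import math


def silhouette_widths(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point.

    a: mean distance to the other members of the point's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Assumes at least two clusters and no fully coincident data.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return widths


def average_silhouette_width(points, labels):
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = average_silhouette_width(points, [0, 0, 1, 1])  # well-separated clustering
bad = average_silhouette_width(points, [0, 1, 0, 1])   # clusters cut across the gap
```

On this toy data the first labeling yields a clearly positive ASW and the second a negative one, which is the kind of comparison a data-dependent upper bound would help calibrate.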
[LG-14] Towards Interpretable Deep Neural Networks for Tabular Data
Link: https://arxiv.org/abs/2509.08617
Authors: Khawla Elhadri,Jörg Schlötterer,Christin Seifert
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Tabular data is the foundation of many applications in fields such as finance and healthcare. Although DNNs tailored for tabular data achieve competitive predictive performance, they are blackboxes with little interpretability. We introduce XNNTab, a neural architecture that uses a sparse autoencoder (SAE) to learn a dictionary of monosemantic features within the latent space used for prediction. Using an automated method, we assign human-interpretable semantics to these features. This allows us to represent predictions as linear combinations of semantically meaningful components. Empirical evaluations demonstrate that XNNTab attains performance on par with or exceeding that of state-of-the-art, black-box neural models and classical machine learning approaches while being fully interpretable.
[LG-15] MAESTRO: Multi-modal Adaptive Ensemble for Spectro-Temporal Robust Optimization
Link: https://arxiv.org/abs/2509.08578
Authors: Hong Liu
Categories: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Timely and robust influenza incidence forecasting is critical for public health decision-making. To address this, we present MAESTRO, a Multi-modal Adaptive Ensemble for Spectro-Temporal Robust Optimization. MAESTRO achieves robustness by adaptively fusing multi-modal inputs (including surveillance, web search trends, and meteorological data) and leveraging a comprehensive spectro-temporal architecture. The model first decomposes time series into seasonal and trend components. These are then processed through a hybrid feature enhancement pipeline combining Transformer-based encoders, a Mamba state-space model for long-range dependencies, multi-scale temporal convolutions, and a frequency-domain analysis module. A cross-channel attention mechanism further integrates information across the different data modalities. Finally, a temporal projection head performs sequence-to-sequence forecasting, with an optional estimator to quantify prediction uncertainty. Evaluated on over 11 years of Hong Kong influenza data (excluding the COVID-19 period), MAESTRO shows strong competitive performance, demonstrating a superior model fit and relative accuracy, achieving a state-of-the-art R-square of 0.956. Extensive ablations confirm the significant contributions of both multi-modal fusion and the spectro-temporal components. Our modular and reproducible pipeline is made publicly available to facilitate deployment and extension to other regions. The publicly available pipeline presents a powerful, unified framework, demonstrating the critical synergy of advanced spectro-temporal modeling and multi-modal data fusion for robust epidemiological forecasting.
[LG-16] Motion-Based User Identification across XR and Metaverse Applications by Deep Classification and Similarity Learning
Link: https://arxiv.org/abs/2509.08539
Authors: Lukas Schach,Christian Rack,Ryan P. McMahan,Marc Erich Latoschik
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:This paper examines the generalization capacity of two state-of-the-art classification and similarity learning models in reliably identifying users based on their motions in various Extended Reality (XR) applications. We developed a novel dataset containing a wide range of motion data from 49 users in five different XR applications: four XR games with distinct tasks and action patterns, and an additional social XR application with no predefined task sets. The dataset is used to evaluate the performance and, in particular, the generalization capacity of the two models across applications. Our results indicate that while the models can accurately identify individuals within the same application, their ability to identify users across different XR applications remains limited. Overall, our results provide insight into current models' generalization capabilities and suitability as biometric methods for user verification and identification. The results also serve as a much-needed risk assessment of hazardous and unwanted user identification in XR and Metaverse applications. Our cross-application XR motion dataset and code are made available to the public to encourage similar research on the generalization of motion-based user identification in typical Metaverse application use cases.
[LG-17] Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures
Link: https://arxiv.org/abs/2509.08530
Authors: Wen-Bo Xie,Xun Fu,Bin Chen,Yan-Li Lee,Tao Deng,Tian Zou,Xin Wang,Zhen Liu,Jaideep Srivastavad
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we focus on the efficiency and scalability of pairwise constraint-based active clustering, crucial for processing large-scale data in applications such as data mining, knowledge annotation, and AI model pre-training. Our goals are threefold: (1) to reduce computational costs for iterative clustering updates; (2) to enhance the impact of user-provided constraints to minimize annotation requirements for precise clustering; and (3) to cut down memory usage in practical deployments. To achieve these aims, we propose a graph-based active clustering algorithm that utilizes two sparse graphs: one for representing relationships between data (our proposed data skeleton) and another for updating this data skeleton. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently facilitates more accurate clustering with dramatically less input of user-provided constraints, and outperforms its counterparts in terms of computational performance and scalability, while maintaining robustness across various distance metrics.
[LG-18] Heart Disease Prediction: A Comparative Study of Optimisers Performance in Deep Neural Networks
Link: https://arxiv.org/abs/2509.08499
Authors: Chisom Chibuike,Adeyinka Ogunsanya
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 4 figures
Abstract:Optimization has been an important factor and topic of interest in training deep learning models, yet less attention has been given to how we select the optimizers we use to train these models. Hence, there is a need to dive deeper into how we select the optimizers we use for training and the metrics that determine this selection. In this work, we compare the performance of 10 different optimizers in training a simple Multi-layer Perceptron model using a heart disease dataset from Kaggle. We set up a consistent training paradigm and evaluate the optimizers based on metrics such as convergence speed and stability. We also include some other machine learning evaluation metrics such as AUC, Precision, and Recall, which are central metrics for classification problems. Our results show that there are trade-offs between convergence speed and stability, as optimizers like Adagrad and Adadelta, which are more stable, took longer to converge. Across all our metrics, we chose RMSProp as the most effective optimizer for this heart disease prediction task because it offered balanced performance across key metrics. It achieved a precision of 0.765, a recall of 0.827, and an AUC of 0.841, along with faster training time. However, it was not the most stable. We recommend that, in less compute-constrained environments, this method of choosing optimizers through a thorough evaluation should be adopted to increase the scientific nature and performance in training deep learning models.
[LG-19] Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis
Link: https://arxiv.org/abs/2509.08483
Authors: Matias D. Cattaneo,Boris Shigida
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
Comments:
Abstract:We analyze gradient descent with Polyak heavy-ball momentum (HB) whose fixed momentum parameter $\beta \in (0, 1)$ provides exponential decay of memory. Building on Kovachki and Stuart (2021), we prove that on an exponentially attractive invariant manifold the algorithm is exactly plain gradient descent with a modified loss, provided that the step size $h$ is small enough. Although the modified loss does not admit a closed-form expression, we describe it with arbitrary precision and prove global (finite “time” horizon) approximation bounds $O(h^R)$ for any finite order $R \geq 2$. We then conduct a fine-grained analysis of the combinatorics underlying the memoryless approximations of HB, in particular, finding a rich family of polynomials in $\beta$ hidden inside which contains Eulerian and Narayana polynomials. We derive continuous modified equations of arbitrary approximation order (with rigorous bounds) and the principal flow that approximates the HB dynamics, generalizing Rosca et al. (2023). Approximation theorems cover both full-batch and mini-batch HB. Our theoretical results shed new light on the main features of gradient descent with heavy-ball momentum, and outline a road-map for similar analysis of other optimization algorithms.
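As a reminder of the iteration being analyzed, here is a minimal sketch of Polyak heavy-ball momentum in its textbook form, x_{k+1} = x_k - h * grad f(x_k) + beta * (x_k - x_{k-1}). The exact parameterization used in the paper may differ (for example in how the step size is scaled), and the scalar setting and function name `heavy_ball` are assumptions made for brevity.

```python
def heavy_ball(grad, x0, h, beta, steps):
    """Run scalar heavy-ball gradient descent for a fixed number of steps.

    grad: callable returning the gradient at a point
    h:    step size
    beta: momentum parameter in [0, 1); beta = 0 recovers plain gradient descent
    """
    x_prev, x = x0, x0
    for _ in range(steps):
        x_next = x - h * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


# Example on f(x) = 0.5 * x**2, whose gradient is x and whose minimizer is 0.
x_final = heavy_ball(lambda x: x, x0=1.0, h=0.1, beta=0.9, steps=500)
```

The term beta * (x - x_prev) is what carries the exponentially decaying memory of past gradients mentioned in the abstract.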
[LG-20] SHAining on Process Mining: Explaining Event Log Characteristics Impact on Algorithms
Link: https://arxiv.org/abs/2509.08482
Authors: Andrea Maldonado,Christian M. M. Frey,Sai Anirudh Aryasomayajula,Ludwig Zellner,Stephan A. Fahrenkrog-Petersen,Thomas Seidl
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Process mining aims to extract and analyze insights from event logs, yet algorithm metric results vary widely depending on structural event log characteristics. Existing work often evaluates algorithms on a fixed set of real-world event logs but lacks a systematic analysis of how event log characteristics impact algorithms individually. Moreover, since event logs are generated from processes, where characteristics co-occur, we focus on associational rather than causal effects to assess how strongly overlapping individual characteristics affect evaluation metrics without assuming isolated causal effects, a factor often neglected by prior work. We introduce SHAining, the first approach to quantify the marginal contribution of varying event log characteristics to process mining algorithms’ metrics. Using process discovery as a downstream task, we analyze over 22,000 event logs covering a wide span of characteristics to uncover which affect algorithms across metrics (e.g., fitness, precision, complexity) the most. Furthermore, we offer novel insights about how the value of event log characteristics correlates with their contributed impact, assessing the algorithm’s robustness.
[LG-21] An Interpretable Deep Learning Model for General Insurance Pricing
Link: https://arxiv.org/abs/2509.08467
Authors: Patrick J. Laub,Tu Pho,Bernard Wong
Categories: Machine Learning (cs.LG); General Finance (q-fin.GN)
Comments:
Abstract:This paper introduces the Actuarial Neural Additive Model, an inherently interpretable deep learning model for general insurance pricing that offers fully transparent and interpretable results while retaining the strong predictive power of neural networks. This model assigns a dedicated neural network (or subnetwork) to each individual covariate and pairwise interaction term to independently learn its impact on the modeled output while implementing various architectural constraints to allow for essential interpretability (e.g. sparsity) and practical requirements (e.g. smoothness, monotonicity) in insurance applications. The development of our model is grounded in a solid foundation, where we establish a concrete definition of interpretability within the insurance context, complemented by a rigorous mathematical framework. Comparisons in terms of prediction accuracy are made with traditional actuarial and state-of-the-art machine learning methods using both synthetic and real insurance datasets. The results show that the proposed model outperforms other methods in most cases while offering complete transparency in its internal logic, underscoring the strong interpretability and predictive capability.
[LG-22] Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition
Link: https://arxiv.org/abs/2509.08454
Authors: Yujian Ma,Jinqiu Sang,Ruizhe Li
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Work in process
Abstract:Large pre-trained speech models such as Whisper offer strong generalization but pose significant challenges for resource-efficient adaptation. Low-Rank Adaptation (LoRA) has become a popular parameter-efficient fine-tuning method, yet its underlying mechanisms in speech tasks remain poorly understood. In this work, we conduct the first systematic mechanistic interpretability study of LoRA within the Whisper encoder for speech emotion recognition (SER). Using a suite of analytical tools, including layer contribution probing, logit-lens inspection, and representational similarity via singular value decomposition (SVD) and centered kernel alignment (CKA), we reveal two key mechanisms: a delayed specialization process that preserves general features in early layers before consolidating task-specific information, and a forward alignment, backward differentiation dynamic between LoRA’s matrices. Our findings clarify how LoRA reshapes encoder hierarchies, providing both empirical insights and a deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models.
[LG-23] Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models
Link: https://arxiv.org/abs/2509.08401
Authors: Xunkai Li,Daohan Su,Sicheng Liu,Ru Zhang,Rong-Hua Li,Guoren Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Graph foundation models, inspired by the success of LLMs, are designed to learn the optimal embedding from multi-domain TAGs for the downstream cross-task generalization capability. During our investigation, graph VQ-MAE stands out among the increasingly diverse landscape of GFM architectures. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.
[LG-24] Rethinking the Backbone in Class Imbalanced Federated Source Free Domain Adaptation: The Utility of Vision Foundation Models ICIP2025
Link: https://arxiv.org/abs/2509.08372
Authors: Kosuke Kihara,Junki Mori,Taiki Miyagawa,Akinori F. Ebihara
Subjects: Machine Learning (cs.LG)
Comments: Accepted by the IEEE ICIP 2025 Satellite Workshop 1: Edge Intelligence: Smart, Efficient, and Scalable Solutions for IoT, Wearables, and Embedded Devices (SEEDS)
Abstract:Federated Learning (FL) offers a framework for training models collaboratively while preserving data privacy of each client. Recently, research has focused on Federated Source-Free Domain Adaptation (FFREEDA), a more realistic scenario wherein client-held target domain data remains unlabeled, and the server can access source domain data only during pre-training. We extend this framework to a more complex and realistic setting: Class Imbalanced FFREEDA (CI-FFREEDA), which takes into account class imbalances in both the source and target domains, as well as label shifts between source and target and among target clients. The replication of existing methods in our experimental setup led us to rethink the focus from enhancing aggregation and domain adaptation methods to improving the feature extractors within the network itself. We propose replacing the FFREEDA backbone with a frozen vision foundation model (VFM), thereby improving overall accuracy without extensive parameter tuning and reducing computational and communication costs in federated learning. Our experimental results demonstrate that VFMs effectively mitigate the effects of domain gaps, class imbalances, and even non-IID-ness among target clients, suggesting that strong feature extractors, not complex adaptation or FL methods, are key to success in real-world FL.
[LG-25] Prediction Loss Guided Decision-Focused Learning
Link: https://arxiv.org/abs/2509.08359
Authors: Haeun Jeon,Hyunglip Bae,Chanyeong Kim,Yongjae Lee,Woo Chang Kim
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Decision-making under uncertainty is often considered in two stages: predicting the unknown parameters, and then optimizing decisions based on predictions. While traditional prediction-focused learning (PFL) treats these two stages separately, decision-focused learning (DFL) trains the predictive model by directly optimizing the decision quality in an end-to-end manner. However, despite using exact or well-approximated gradients, vanilla DFL often suffers from unstable convergence due to its flat-and-sharp loss landscapes. In contrast, PFL yields more stable optimization, but overlooks the downstream decision quality. To address this, we propose a simple yet effective approach: perturbing the decision loss gradient using the prediction loss gradient to construct an update direction. Our method requires no additional training and can be integrated with any DFL solvers. Using the sigmoid-like decaying parameter, we let the prediction loss gradient guide the decision loss gradient to train a predictive model that optimizes decision quality. Also, we provide a theoretical convergence guarantee to Pareto stationary point under mild assumptions. Empirically, we demonstrate our method across three stochastic optimization problems, showing promising results compared to other baselines. We validate that our approach achieves lower regret with more stable training, even in situations where either PFL or DFL struggles.
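The update direction described in the abstract above can be sketched as follows. This is a minimal illustration, assuming the combined direction is the decision-loss gradient plus a weighted prediction-loss gradient, with a weight that follows a sigmoid-like decay over training steps. The function names, the exact schedule, and its constants (`midpoint`, `sharpness`) are hypothetical and not taken from the paper.

```python
import math

def decay_weight(step, midpoint=50.0, sharpness=0.1):
    """Hypothetical sigmoid-like schedule: close to 1 early, close to 0 late."""
    return 1.0 / (1.0 + math.exp(sharpness * (step - midpoint)))

def guided_direction(decision_grad, prediction_grad, step):
    """Perturb the decision-loss gradient with the prediction-loss gradient."""
    w = decay_weight(step)
    return [gd + w * gp for gd, gp in zip(decision_grad, prediction_grad)]

# Early in training the prediction gradient contributes strongly,
# late in training the direction is almost the pure decision gradient.
early = guided_direction([0.0, 1.0], [1.0, 0.0], step=0)
late = guided_direction([0.0, 1.0], [1.0, 0.0], step=200)
```

Under this assumed schedule, the prediction loss stabilizes the early updates and then fades out, so the final model is still trained for decision quality.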
[LG-26] Adaptive Rainfall Forecasting from Multiple Geographical Models Using Matrix Profile and Ensemble Learning
Link: https://arxiv.org/abs/2509.08277
Authors: Dung T. Tran,Huyen Ngoc Huyen,Hong Nguyen,Xuan-Vu Phan,Nam-Phong Nguyen
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Rainfall forecasting in Vietnam is highly challenging due to its diverse climatic conditions and strong geographical variability across river basins, yet accurate and reliable forecasts are vital for flood management, hydropower operation, and disaster preparedness. In this work, we propose a Matrix Profile-based Weighted Ensemble (MPWE), a regime-switching framework that dynamically captures covariant dependencies among multiple geographical model forecasts while incorporating redundancy-aware weighting to balance contributions across models. We evaluate MPWE using rainfall forecasts from eight major basins in Vietnam, spanning five forecast horizons (1 hour and accumulated rainfall over 12, 24, 48, 72, and 84 hours). Experimental results show that MPWE consistently achieves lower mean and standard deviation of prediction errors compared to geographical models and ensemble baselines, demonstrating both improved accuracy and stability across basins and horizons.
[LG-27] Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning EMNLP2025
Link: https://arxiv.org/abs/2509.08255
Authors: Wei Huang,Anda Cheng,Yinggui Wang
Subjects: Machine Learning (cs.LG)
Comments: Accepted by EMNLP 2025
Abstract:Recent advancements in large language models (LLMs) have shown impressive capabilities in various downstream tasks but typically face Catastrophic Forgetting (CF) during fine-tuning. In this paper, we propose the Forgetting-Aware Pruning Metric (FAPM), a novel pruning-based approach to balance CF and downstream task performance. Our investigation reveals that the degree to which task vectors (i.e., the subtraction of pre-trained weights from the weights fine-tuned on downstream tasks) overlap with pre-trained model parameters is a critical factor for CF. Based on this finding, FAPM employs the ratio of the task vector to pre-trained model parameters as a metric to quantify CF, integrating this measure into the pruning criteria. Importantly, FAPM does not necessitate modifications to the training process or model architecture, nor does it require any auxiliary data. We conducted extensive experiments across eight datasets, covering natural language inference, General QA, Medical QA, Math QA, reading comprehension, and cloze tests. The results demonstrate that FAPM limits CF to just 0.25% while maintaining 99.67% accuracy on downstream tasks. We provide the code to reproduce our results.
[LG-28] The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data
Link: https://arxiv.org/abs/2509.08247
Authors: Xiaolong Luo,Michael Lingzhi Li
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 9 figures
Abstract:While existing critical care EHR datasets such as MIMIC and eICU have enabled significant advances in clinical AI research, the CRITICAL dataset opens new frontiers by providing extensive scale and diversity – containing 1.95 billion records from 371,365 patients across four geographically diverse CTSA institutions. CRITICAL’s unique strength lies in capturing full-spectrum patient journeys, including pre-ICU, ICU, and post-ICU encounters across both inpatient and outpatient settings. This multi-institutional, longitudinal perspective creates transformative opportunities for developing generalizable predictive models and advancing health equity research. However, the richness of this multi-site resource introduces substantial complexity in data harmonization, with heterogeneous collection practices and diverse vocabulary usage patterns requiring sophisticated preprocessing approaches. We present CRISP to unlock the full potential of this valuable resource. CRISP systematically transforms raw Observational Medical Outcomes Partnership Common Data Model data into ML-ready datasets through: (1) transparent data quality management with comprehensive audit trails, (2) cross-vocabulary mapping of heterogeneous medical terminologies to unified SNOMED-CT standards, with deduplication and unit standardization, (3) modular architecture with parallel optimization enabling complete dataset processing in 1 day even on standard computing hardware, and (4) comprehensive baseline model benchmarks spanning multiple clinical prediction tasks to establish reproducible performance standards. By providing processing pipeline, baseline implementations, and detailed transformation documentation, CRISP saves researchers months of preprocessing effort and democratizes access to large-scale multi-institutional critical care data, enabling them to focus on advancing clinical AI. 
[LG-29] Ensemble Distribution Distillation for Self-Supervised Human Activity Recognition
Link: https://arxiv.org/abs/2509.08225
Authors: Matthew Nolan,Lina Yao,Robert Davidson
Subjects: Machine Learning (cs.LG)
Comments: 37 pages, 10 figures
Abstract:Human Activity Recognition (HAR) has seen significant advancements with the adoption of deep learning techniques, yet challenges remain in terms of data requirements, reliability and robustness. This paper explores a novel application of Ensemble Distribution Distillation (EDD) within a self-supervised learning framework for HAR aimed at overcoming these challenges. By leveraging unlabeled data and a partially supervised training strategy, our approach yields an increase in predictive accuracy, robust estimates of uncertainty, and substantial increases in robustness against adversarial perturbation; thereby significantly improving reliability in real-world scenarios without increasing computational complexity at inference. We demonstrate this with an evaluation on several publicly available datasets. The contributions of this work include the development of a self-supervised EDD framework, an innovative data augmentation technique designed for HAR, and empirical validation of the proposed method’s effectiveness in increasing robustness and reliability.
[LG-30] Sketched Gaussian Mechanism for Private Federated Learning
Link: https://arxiv.org/abs/2509.08195
Authors: Qiaobo Li,Zhijie Chen,Arindam Banerjee
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Communication cost and privacy are two major considerations in federated learning (FL). For communication cost, gradient compression by sketching the clients’ transmitted model updates is often used for reducing per-round communication. For privacy, the Gaussian mechanism (GM), which consists of clipping updates and adding Gaussian noise, is commonly used to guarantee client-level differential privacy. Existing literature on private FL analyzes privacy of sketching and GM in an isolated manner, illustrating that sketching provides privacy determined by the sketching dimension and that GM has to supply any additional desired privacy. In this paper, we introduce the Sketched Gaussian Mechanism (SGM), which directly combines sketching and the Gaussian mechanism for privacy. Using Rényi-DP tools, we present a joint analysis of SGM’s overall privacy guarantee, which is significantly more flexible and sharper compared to isolated analysis of sketching and GM privacy. In particular, we prove that the privacy level of SGM for a fixed noise magnitude is proportional to $1/\sqrt{b}$, where $b$ is the sketching dimension, indicating that (for moderate $b$) SGM can provide much stronger privacy guarantees than the original GM under the same noise budget. We demonstrate the application of SGM to FL with either gradient descent or adaptive server optimizers, and establish theoretical results on optimization convergence, which exhibits only a logarithmic dependence on the number of parameters $d$. Experimental results confirm that at the same privacy level, SGM based FL is at least competitive with non-sketching private FL variants and outperforms them in some settings. Moreover, using adaptive optimization at the server improves empirical performance while maintaining the privacy guarantees.
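To make the ingredients named in the abstract concrete, here is a minimal sketch of clipping, sketching, and Gaussian noise applied to a single client update. It assumes a dense Gaussian sketching matrix and applies the steps in the order clip, sketch, add noise. Both the sketch type and the ordering are illustrative assumptions, and the names `clip` and `sketched_gaussian` are hypothetical rather than the paper's API.

```python
import math
import random

def clip(update, max_norm):
    """Scale the update so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def sketched_gaussian(update, b, max_norm, sigma, rng):
    """Clip a d-dimensional update, project it to b dimensions, add noise."""
    clipped = clip(update, max_norm)
    d = len(clipped)
    # Assumed sketch: dense matrix with i.i.d. N(0, 1/b) entries.
    sketch = [[rng.gauss(0.0, 1.0 / math.sqrt(b)) for _ in range(d)]
              for _ in range(b)]
    projected = [sum(row[j] * clipped[j] for j in range(d)) for row in sketch]
    # Gaussian noise with standard deviation sigma on each sketched coordinate.
    return [p + rng.gauss(0.0, sigma) for p in projected]

rng = random.Random(0)
noisy = sketched_gaussian([3.0, 4.0, 0.0, 0.0], b=2, max_norm=1.0,
                          sigma=0.1, rng=rng)
```

Only the `b` noisy coordinates would be transmitted, which is where the communication saving comes from. The privacy accounting itself is the subject of the paper and is not reproduced here.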
[LG-31] Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization
Link: https://arxiv.org/abs/2509.08194
Authors: Caio de Prospero Iglesias,Kimberly Villalobos Carballo,Dimitris Bertsimas
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies–arising from different modeling paradigms–exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems–single-stage newsvendor and two-stage shipment planning–PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at this https URL.
[LG-32] Rollout-LaSDI: Enhancing the long-term accuracy of Latent Space Dynamics
Link: https://arxiv.org/abs/2509.08191
Authors: Robert Stephany,Youngsoo Choi
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 2 figures
Abstract:Solving complex partial differential equations is vital in the physical sciences, but often requires computationally expensive numerical methods. Reduced-order models (ROMs) address this by exploiting dimensionality reduction to create fast approximations. While modern ROMs can solve parameterized families of PDEs, their predictive power degrades over long time horizons. We address this by (1) introducing a flexible, high-order, yet inexpensive finite-difference scheme and (2) proposing a Rollout loss that trains ROMs to make accurate predictions over arbitrary time horizons. We demonstrate our approach on the 2D Burgers equation.
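The idea of a rollout loss can be illustrated with a toy example: integrate the learned dynamics forward over several steps and penalize the mismatch along the whole trajectory, not just one step ahead. The sketch below assumes a scalar latent state and a forward-Euler integrator for simplicity. The paper proposes its own higher-order finite-difference scheme, which is not reproduced here, and all names are hypothetical.

```python
def rollout(z0, rhs, dt, n_steps):
    """Integrate dz/dt = rhs(z) with forward Euler and return all states."""
    states = [z0]
    for _ in range(n_steps):
        states.append(states[-1] + dt * rhs(states[-1]))
    return states

def rollout_loss(z0, rhs, dt, targets):
    """Mean squared error between a rolled-out trajectory and the targets."""
    preds = rollout(z0, rhs, dt, len(targets))[1:]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Targets generated by reference dynamics dz/dt = -z.
true_traj = rollout(1.0, lambda z: -z, 0.1, 20)[1:]
good = rollout_loss(1.0, lambda z: -z, 0.1, true_traj)        # matching model
bad = rollout_loss(1.0, lambda z: -0.5 * z, 0.1, true_traj)   # mismatched model
```

Because errors compound over the rollout, a model that looks accurate for one step can still incur a large loss over the full horizon, which is the behaviour this kind of loss is meant to expose.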
[LG-33] ArtifactGen: Benchmarking WGAN-GP vs Diffusion for Label-Aware EEG Artifact Synthesis
Link: https://arxiv.org/abs/2509.08188
Authors: Hritik Arasu,Faisal R Jahangiri
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: 16 pages, 6 figures
Abstract:Artifacts in electroencephalography (EEG) – muscle, eye movement, electrode, chewing, and shiver – confound automated analysis yet are costly to label at scale. We study whether modern generative models can synthesize realistic, label-aware artifact segments suitable for augmentation and stress-testing. Using the TUH EEG Artifact (TUAR) corpus, we curate subject-wise splits and fixed-length multi-channel windows (e.g., 250 samples) with preprocessing tailored to each model (per-window min–max for adversarial training; per-recording/channel $z$-score for diffusion). We compare a conditional WGAN-GP with a projection discriminator to a 1D denoising diffusion model with classifier-free guidance, and evaluate along three axes: (i) fidelity via Welch band-power deltas ($\Delta\delta$, $\Delta\theta$, $\Delta\alpha$, $\Delta\beta$), channel-covariance Frobenius distance, autocorrelation $L_2$, and distributional metrics (MMD/PRD); (ii) specificity via class-conditional recovery with lightweight $k$NN/classifiers; and (iii) utility via augmentation effects on artifact recognition. In our setting, WGAN-GP achieves closer spectral alignment and lower MMD to real data, while both models exhibit weak class-conditional recovery, limiting immediate augmentation gains and revealing opportunities for stronger conditioning and coverage. We release a reproducible pipeline – data manifests, training configurations, and evaluation scripts – to establish a baseline for EEG artifact synthesis and to surface actionable failure modes for future work.
[LG-34] Selective Induction Heads: How Transformers Select Causal Structures In Context
Link: https://arxiv.org/abs/2509.08184
Authors: Francesco D’Angelo,Francesco Croce,Nicolas Flammarion
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Transformers have exhibited exceptional capabilities in sequence modeling tasks, leveraging self-attention and in-context learning. Critical to this success are induction heads, attention circuits that enable copying tokens based on their previous occurrences. In this work, we introduce a novel framework that showcases transformers’ ability to dynamically handle causal structures. Existing works rely on Markov Chains to study the formation of induction heads, revealing how transformers capture causal dependencies and learn transition probabilities in-context. However, they rely on a fixed causal structure that fails to capture the complexity of natural languages, where the relationship between tokens dynamically changes with context. To this end, our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context. We empirically demonstrate that transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We provide a detailed construction of a 3-layer transformer to implement the selective induction head, and a theoretical analysis proving that this mechanism asymptotically converges to the maximum likelihood solution. Our findings advance the understanding of how transformers select causal structures, providing new insights into their functioning and interpretability.
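The data-generating setup described above, a fixed transition matrix combined with a lag that varies between sequences, can be sketched as follows. This is an illustrative reading of the abstract and not the authors' code: `sample_sequence` is a hypothetical name, the initial tokens are drawn uniformly as an assumption, and a deterministic cyclic transition matrix is used only so that the lag structure is easy to see.

```python
import random

def sample_sequence(transition, lag, length, rng):
    """Sample tokens where x[t] depends only on x[t - lag]."""
    n_states = len(transition)
    # Assumption: the first `lag` tokens are drawn uniformly at random.
    seq = [rng.randrange(n_states) for _ in range(lag)]
    for t in range(lag, length):
        probs = transition[seq[t - lag]]
        seq.append(rng.choices(range(n_states), weights=probs)[0])
    return seq

rng = random.Random(0)
# The transition matrix is shared by every lag; only the lag changes
# from sequence to sequence. Here state s always moves to (s + 1) % 3.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
lag = rng.choice([1, 2, 3])
seq = sample_sequence(P, lag, 12, rng)
```

To predict the next token in such a sequence, a model has to infer from the context which lag is active and then copy from the corresponding position. This is the selection task that the paper attributes to selective induction heads.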
[LG-35] The Domain Mixed Unit: A New Neural Arithmetic Layer
Link: https://arxiv.org/abs/2509.08180
Authors: Paul Curry
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 5 tables, includes results on the NALM benchmark
Abstract:The Domain Mixed Unit (DMU) is a new neural arithmetic unit that learns a single parameter gate that mixes between log-space and linear-space representations while performing either addition (DMU add) or subtraction (DMU sub). Two initializations are proposed for the DMU: one covering addition and multiplication, and another covering subtraction and division. The DMU achieves state-of-the-art performance on the NALM Benchmark, a dataset designed to test the ability of neural arithmetic units to generalize arithmetic operations, specifically performing with the highest percentage solved over all seeds on multiplication and division. The DMU will be submitted as a pull request to the open-source NALM benchmark, and its code is available on GitHub at this https URL
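One plausible reading of the gating idea is shown below: addition performed in linear space gives a sum, addition performed in log space gives a product, and a single scalar gate interpolates between the two. This is a hedged sketch and not the DMU's actual formulation, which may mix the representations differently. The name `dmu_add` and the sigmoid parameterisation of the gate are assumptions.

```python
import math

def dmu_add(a, b, theta):
    """Gate between linear-space addition (a + b) and log-space addition (a * b)."""
    gate = 1.0 / (1.0 + math.exp(-theta))            # single learnable scalar
    linear = a + b
    log_space = math.exp(math.log(a) + math.log(b))  # requires a, b > 0
    return gate * linear + (1.0 - gate) * log_space

as_sum = dmu_add(2.0, 3.0, theta=50.0)       # gate saturates at 1: addition
as_product = dmu_add(2.0, 3.0, theta=-50.0)  # gate saturates at 0: multiplication
```

The subtraction variant would analogously interpolate between `a - b` and `a / b`, which matches the abstract's pairing of subtraction with division.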
[LG-36] Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation
Link: https://arxiv.org/abs/2509.08163
Authors: Ho Ming Lee,Katrien Antonio,Benjamin Avanzi,Lorenzo Marchi,Rui Zhou
Subjects: Machine Learning (cs.LG); Risk Management (q-fin.RM); Applications (stat.AP); Machine Learning (stat.ML)
Comments:
Abstract:Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks, such as insurance pricing or hiring score assessments, is equally important. Moreover, anti-discrimination laws also apply to continuous attributes, such as age, for which many existing methods are not applicable. In practice, multiple protected attributes can exist simultaneously; however, methods targeting fairness across several attributes often overlook so-called “fairness gerrymandering”, thereby ignoring disparities among intersectional subgroups (e.g., African-American women or Hispanic men). In this paper, we propose a distance covariance regularisation framework that mitigates the association between model predictions and protected attributes, in line with the fairness definition of demographic parity, and that captures both linear and nonlinear dependencies. To enhance applicability in the presence of multiple protected attributes, we extend our framework by incorporating two multivariate dependence measures based on distance covariance: the previously proposed joint distance covariance (JdCov) and our novel concatenated distance covariance (CCdCov), which effectively address fairness gerrymandering in both regression and classification tasks involving protected attributes of various types. We discuss and illustrate how to calibrate regularisation strength, including a method based on Jensen-Shannon divergence, which quantifies dissimilarities in prediction distributions across groups. We apply our framework to the COMPAS recidivism dataset and a large motor insurance claims dataset.
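The basic building block of the framework above, a distance covariance penalty between predictions and a single protected attribute, can be sketched in plain Python. The code uses the standard squared sample distance covariance (double-centred pairwise distances, averaged elementwise product) for scalar samples. The function names and the way the penalty is added to the loss are illustrative assumptions, and the multivariate JdCov and CCdCov measures are not implemented.

```python
def _centered_distances(values):
    """Pairwise |xi - xj| matrix, double-centred by row, column and grand means."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row = [sum(r) / n for r in d]   # equals the column means, d is symmetric
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def distance_covariance_sq(x, y):
    """Squared sample distance covariance of two scalar samples."""
    a, b = _centered_distances(x), _centered_distances(y)
    n = len(x)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

def regularised_loss(base_loss, predictions, protected, strength):
    """Base loss plus a dependence penalty between predictions and the attribute."""
    return base_loss + strength * distance_covariance_sq(predictions, protected)

dependent = distance_covariance_sq([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
constant = regularised_loss(1.0, [0.5] * 4, [0.0, 1.0, 0.0, 1.0], strength=2.0)
```

Predictions that do not vary with the attribute incur no penalty, while predictions that track it are penalized. Larger values of `strength` trade predictive accuracy for weaker dependence.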
[LG-37] MMM-fair: An Interactive Toolkit for Exploring and Operationalizing Multi-Fairness Trade-offs
Link: https://arxiv.org/abs/2509.08156
Authors: Swati Swati,Arjun Roy,Emmanouil Panagiotou,Eirini Ntoutsi
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Accepted to be published in the Proceedings of the 34th ACM International Conference on Information and Knowledge Management, November 10–14, 2025, Seoul, Republic of Korea
Abstract:Fairness-aware classification requires balancing performance and fairness, often intensified by intersectional biases. Conflicting fairness definitions further complicate the task, making it difficult to identify universally fair solutions. Despite growing regulatory and societal demands for equitable AI, popular toolkits offer limited support for exploring multi-dimensional fairness and related trade-offs. To address this, we present mmm-fair, an open-source toolkit leveraging boosting-based ensemble approaches that dynamically optimizes model weights to jointly minimize classification errors and diverse fairness violations, enabling flexible multi-objective optimization. The system empowers users to deploy models that align with their context-specific needs while reliably uncovering intersectional biases often missed by state-of-the-art methods. In a nutshell, mmm-fair uniquely combines in-depth multi-attribute fairness, multi-objective optimization, a no-code, chat-based interface, LLM-powered explanations, interactive Pareto exploration for model selection, custom fairness constraint definition, and deployment-ready models in a single open-source toolkit, a combination rarely found in existing fairness tools. Demo walkthrough available at: this https URL.
[LG-38] SCA-LLM: Spectral-Attentive Channel Prediction with Large Language Models in MIMO-OFDM
Link: https://arxiv.org/abs/2509.08139
Authors: Ke He,Le He,Lisheng Fan,Xianfu Lei,Thang X. Vu,George K. Karagiannidis,Symeon Chatzinotas
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:In recent years, the success of large language models (LLMs) has inspired growing interest in exploring their potential applications in wireless communications, especially for channel prediction tasks. However, directly applying LLMs to channel prediction faces a domain mismatch issue stemming from their text-based pre-training. To mitigate this, the "adapter + LLM" paradigm has emerged, where an adapter is designed to bridge the domain gap between the channel state information (CSI) data and LLMs. While showing initial success, existing adapters may not fully exploit the potential of this paradigm. To address this limitation, this work provides a key insight that learning representations from the spectral components of CSI features can more effectively help bridge the domain gap. Accordingly, we propose a spectral-attentive framework, named SCA-LLM, for channel prediction in multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. Specifically, its novel adapter can capture finer spectral details and better adapt the LLM for channel prediction than previous methods. Extensive simulations show that SCA-LLM achieves state-of-the-art prediction performance and strong generalization, yielding up to -2.4 dB normalized mean squared error (NMSE) advantage over the previous LLM-based method. Ablation studies further confirm the superiority of SCA-LLM in mitigating domain mismatch.
[LG-39] torchmil: A PyTorch-based library for deep Multiple Instance Learning
Link: https://arxiv.org/abs/2509.08129
Authors: Francisco M. Castro-Macías,Francisco J. Sáez-Maldonado,Pablo Morales-Álvarez,Rafael Molina
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Multiple Instance Learning (MIL) is a powerful framework for weakly supervised learning, particularly useful when fine-grained annotations are unavailable. Despite growing interest in deep MIL methods, the field lacks standardized tools for model development, evaluation, and comparison, which hinders reproducibility and accessibility. To address this, we present torchmil, an open-source Python library built on PyTorch. torchmil offers a unified, modular, and extensible framework, featuring basic building blocks for MIL models, a standardized data format, and a curated collection of benchmark datasets and models. The library includes comprehensive documentation and tutorials to support both practitioners and researchers. torchmil aims to accelerate progress in MIL and lower the entry barrier for new users. Available at this https URL.
[LG-40] In-Context Learning Enhanced Credibility Transformer
Link: https://arxiv.org/abs/2509.08122
Authors: Kishan Padayachy,Ronald Richman,Salvatore Scognamiglio,Mario V. Wüthrich
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:The starting point of our network architecture is the Credibility Transformer which extends the classical Transformer architecture by a credibility mechanism to improve model learning and predictive performance. This Credibility Transformer learns credibilitized CLS tokens that serve as learned representations of the original input features. In this paper we present a new paradigm that augments this architecture by an in-context learning mechanism, i.e., we increase the information set by a context batch consisting of similar instances. This allows the model to enhance the CLS token representations of the instances by additional in-context information and fine-tuning. We empirically verify that this in-context learning enhances predictive accuracy by adapting to similar risk patterns. Moreover, this in-context learning also allows the model to generalize to new instances which, e.g., have feature levels in the categorical covariates that have not been present when the model was trained – for a relevant example, think of a new vehicle model which has just been developed by a car manufacturer.
[LG-41] Optimization Methods and Software for Federated Learning
Link: https://arxiv.org/abs/2509.08120
Authors: Konstantin Burlachenko
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: A dissertation by Konstantin Burlachenko submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Abstract:Federated Learning (FL) is a novel, multidisciplinary Machine Learning paradigm where multiple clients, such as mobile devices, collaborate to solve machine learning problems. Initially introduced in Konečný et al. (2016a,b); McMahan et al. (2017), FL has gained further attention through its inclusion in the National AI Research and Development Strategic Plan (2023 Update) of the United States (Science and on Artificial Intelligence, 2023). The FL training process is inherently decentralized and often takes place in less controlled settings compared to data centers, posing unique challenges distinct from those in fully controlled environments. In this thesis, we identify five key challenges in Federated Learning and propose novel approaches to address them. These challenges arise from the heterogeneity of data and devices, communication issues, and privacy concerns for clients in FL training. Moreover, even well-established theoretical advances in FL require diverse forms of practical implementation to enhance their real-world applicability. Our contributions advance FL algorithms and systems, bridging theoretical advancements and practical implementations. More broadly, our work serves as a guide for researchers navigating the complexities of translating theoretical methods into efficient real-world implementations and software. Additionally, it offers insights into the reverse process of adapting practical implementation aspects back into theoretical algorithm design. This reverse process is particularly intriguing, as the practical perspective compels us to examine the underlying mechanics and flexibilities of algorithms more deeply, often uncovering new dimensions of the algorithms under study.
[LG-42] Hammer and Anvil: A Principled Defense Against Backdoors in Federated Learning
Link: https://arxiv.org/abs/2509.08089
Authors: Lucas Fenaux,Zheng Wang,Jacob Yan,Nathan Chung,Florian Kerschbaum
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Federated Learning is a distributed learning technique in which multiple clients cooperate to train a machine learning model. Distributed settings facilitate backdoor attacks by malicious clients, who can embed malicious behaviors into the model during their participation in the training process. These malicious behaviors are activated during inference by a specific trigger. No defense against backdoor attacks has stood the test of time, especially against adaptive attackers, a powerful but not fully explored category of attackers. In this work, we first devise a new adaptive adversary that surpasses existing adversaries in capabilities, yielding attacks that only require one or two malicious clients out of 20 to break existing state-of-the-art defenses. Then, we present Hammer and Anvil, a principled defense approach that combines two defenses orthogonal in their underlying principle to produce a combined defense that, given the right set of parameters, must succeed against any attack. We show that our best combined defense, Krum+, is successful against our new adaptive adversary and state-of-the-art attacks.
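For background on the defense named in the last sentence, the classical Krum selection rule can be sketched as below. This is the standard rule from the robust-aggregation literature and not the paper's Krum+ variant, whose details are not given in the abstract. The function name and the example data are illustrative.

```python
def krum(updates, n_byzantine):
    """Return the update with the smallest sum of squared distances to its
    n - f - 2 nearest neighbours (classical Krum selection rule)."""
    n = len(updates)
    k = n - n_byzantine - 2
    if k < 1:
        raise ValueError("Krum needs n > f + 2")

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    scores = []
    for i, u in enumerate(updates):
        dists = sorted(sq_dist(u, v) for j, v in enumerate(updates) if j != i)
        scores.append(sum(dists[:k]))
    return updates[scores.index(min(scores))]

# Four similar client updates and one outlier standing in for a malicious client.
clients = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [10.0, -10.0]]
selected = krum(clients, n_byzantine=1)
```

A distance-based rule like this rejects updates that sit far from the majority. The abstract's adaptive adversary is designed to defeat existing defenses, which is the motivation for combining two defenses built on different principles.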
[LG-43] Network Contagion in Financial Labor Markets: Predicting Turnover in Hong Kong
Link: https://arxiv.org/abs/2509.08001
Authors: Abdulla AlKetbi,Patrick Yam,Gautier Marti,Raed Jaradat
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:Employee turnover is a critical challenge in financial markets, yet little is known about the role of professional networks in shaping career moves. Using the Hong Kong Securities and Futures Commission (SFC) public register (2007-2024), we construct temporal networks of 121,883 professionals and 4,979 firms to analyze and predict employee departures. We introduce a graph-based feature propagation framework that captures peer influence and organizational stability. Our analysis shows a contagion effect: professionals are 23% more likely to leave when over 30% of their peers depart within six months. Embedding these network signals into machine learning models improves turnover prediction by 30% over baselines. These results highlight the predictive power of temporal network effects in workforce dynamics, and demonstrate how network-based analytics can inform regulatory monitoring, talent management, and systemic risk assessment.
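The contagion signal quoted above (the share of a professional's peers who departed within six months) could be computed along the following lines. This is a simplified sketch: the data layout, the function name, and the month-index convention are hypothetical, and the paper's graph-based feature propagation is not reproduced.

```python
def peer_departure_rate(peers, departure_month, current_month, window=6):
    """Fraction of a person's peers who left within the last `window` months."""
    if not peers:
        return 0.0
    recent = [p for p in peers
              if p in departure_month
              and current_month - window <= departure_month[p] <= current_month]
    return len(recent) / len(peers)

# Hypothetical data: departure months indexed from the start of the register.
departures = {"a": 10, "b": 3}
rate = peer_departure_rate(["a", "b", "c", "d"], departures, current_month=12)
high_contagion = rate > 0.30   # threshold taken from the abstract
```

A feature of this kind can be appended to each professional's feature vector at every time step before training a turnover classifier.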
[LG-44] PCGBandit: One-shot acceleration of transient PDE solvers via online-learned preconditioners
Link: https://arxiv.org/abs/2509.08765
Authors: Mikhail Khodak,Min Ki Jung,Brian Wynne,Edmond Chow,Egemen Kolemen
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 25 pages, 11 figures
Abstract:Data-driven acceleration of scientific computing workflows has been a high-profile aim of machine learning (ML) for science, with numerical simulation of transient partial differential equations (PDEs) being one of the main applications. The focus thus far has been on methods that require classical simulations to train, which when combined with the data-hungriness and optimization challenges of neural networks has caused difficulties in demonstrating a convincing advantage against strong classical baselines. We consider an alternative paradigm in which the learner uses a classical solver’s own data to accelerate it, enabling a one-shot speedup of the simulation. Concretely, since transient PDEs often require solving a sequence of related linear systems, the feedback from repeated calls to a linear solver such as preconditioned conjugate gradient (PCG) can be used by a bandit algorithm to online-learn an adaptive sequence of solver configurations (e.g. preconditioners). The method we develop, PCGBandit, is implemented directly on top of the popular open source software OpenFOAM, which we use to show its effectiveness on a set of fluid and magnetohydrodynamics (MHD) problems.
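The idea of letting a bandit pick a solver configuration for each linear system in the sequence can be sketched as follows. This is a minimal illustration, not the PCGBandit algorithm: the abstract does not say which bandit strategy is used, so epsilon-greedy is a stand-in, and `solve_cost` is a hypothetical callback that would wrap one PCG solve and return its cost (for example wall time or iteration count).

```python
import random

def select_config(mean_cost, counts, epsilon, rng):
    """Pick a configuration index: try each once, then explore/exploit."""
    untried = [i for i, c in enumerate(counts) if c == 0]
    if untried:
        return untried[0]
    if rng.random() < epsilon:
        return rng.randrange(len(counts))  # explore a random configuration
    # Exploit: configuration with the lowest average cost so far
    return min(range(len(counts)), key=lambda i: mean_cost[i])

def run_bandit(solve_cost, n_configs, n_steps, epsilon=0.1, seed=0):
    """Choose one configuration per linear system and learn from its cost."""
    rng = random.Random(seed)
    counts = [0] * n_configs
    mean_cost = [0.0] * n_configs
    for t in range(n_steps):
        i = select_config(mean_cost, counts, epsilon, rng)
        cost = solve_cost(t, i)  # hypothetical: cost of solving system t with config i
        counts[i] += 1
        mean_cost[i] += (cost - mean_cost[i]) / counts[i]  # running average
    return counts, mean_cost

# Toy stand-in for a solver: configuration 1 is always the cheapest
counts, mean_cost = run_bandit(lambda t, i: [5.0, 2.0, 9.0][i], n_configs=3, n_steps=200)
```

Because the feedback comes from the simulation's own solves, no separate training run is needed, which matches the one-shot setting the abstract describes.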
[LG-45] Bregman Douglas-Rachford Splitting Method
Link: https://arxiv.org/abs/2509.08739
Authors: Shiqian Ma, Lin Xiao, Renbo Zhao
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:In this paper, we propose the Bregman Douglas-Rachford splitting (BDRS) method and its variant, the Bregman Peaceman-Rachford splitting method, for solving maximal monotone inclusion problems. We show that BDRS is equivalent to a Bregman alternating direction method of multipliers (ADMM) when applied to the dual of the problem. A special case of the Bregman ADMM is an alternating direction version of the exponential multiplier method. To the best of our knowledge, the algorithms proposed in this paper are new to the literature. We also discuss how to use our algorithms to solve the discrete optimal transport (OT) problem. We prove the convergence of the algorithms under certain assumptions, though we point out that one assumption does not apply to the OT problem.
[LG-46] Decentralized Stochastic Nonconvex Optimization under the Relaxed Smoothness
Link: https://arxiv.org/abs/2509.08726
Authors: Luo Luo, Xue Cui, Tingkai Jia, Cheng Chen
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:This paper studies the decentralized optimization problem $f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^m f_i(\mathbf{x})$, where each local function has the form $f_i(\mathbf{x}) = \mathbb{E}\left[F(\mathbf{x};\xi_i)\right]$, which is $(L_0,L_1)$-smooth but possibly nonconvex, and the random variable $\xi_i$ follows the distribution $\mathcal{D}_i$. We propose a novel algorithm called decentralized normalized stochastic gradient descent (DNSGD), which can achieve an $\epsilon$-stationary point on each local agent. We present a new framework for analyzing decentralized first-order methods in the relaxed smooth setting, based on the Lyapunov function related to the product of the gradient norm and the consensus error. The analysis shows upper bounds on sample complexity of $\mathcal{O}(m^{-1}(L_f\sigma^2\Delta_f\epsilon^{-4} + \sigma^2\epsilon^{-2} + L_f^{-2}L_1^3\sigma^2\Delta_f\epsilon^{-1} + L_f^{-2}L_1^2\sigma^2))$ per agent and communication complexity of $\tilde{\mathcal{O}}((L_f\epsilon^{-2} + L_1\epsilon^{-1})\gamma^{-1/2}\Delta_f)$, where $L_f=L_0 +L_1\zeta$, $\sigma^2$ is the variance of the stochastic gradient, $\Delta_f$ is the initial optimal function value gap, $\gamma$ is the spectral gap of the network, and $\zeta$ is the degree of the gradient dissimilarity. In the special case of $L_1=0$, the above results (nearly) match the lower bounds on decentralized nonconvex optimization in the standard smooth setting. We also conduct numerical experiments to show the empirical superiority of our method.
[LG-47] Tokenizing Loops of Antibodies
Link: https://arxiv.org/abs/2509.08707
Authors: Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures, 10 tables, code available at this https URL
Abstract:The complementarity-determining regions of antibodies are loop structures that are key to their interactions with antigens, and of high importance to the design of novel biologics. Since the 1980s, categorizing the diversity of CDR structures into canonical clusters has enabled the identification of key structural motifs of antibodies. However, existing approaches have limited coverage and cannot be readily incorporated into protein foundation models. Here we introduce ImmunoGlobulin LOOp Tokenizer, Igloo, a multimodal antibody loop tokenizer that encodes backbone dihedral angles and sequence. Igloo is trained using a contrastive learning objective to map loops with similar backbone dihedral angles closer together in latent space. Igloo can efficiently retrieve the closest matching loop structures from a structural antibody database, outperforming existing methods on identifying similar H3 loops by 5.9%. Igloo assigns tokens to all loops, addressing the limited coverage issue of canonical clusters, while retaining the ability to recover canonical loop conformations. To demonstrate the versatility of Igloo tokens, we show that they can be incorporated into protein language models with IglooLM and IglooALM. On predicting binding affinity of heavy chain variants, IglooLM outperforms the base protein language model on 8 out of 10 antibody-antigen targets. Additionally, it is on par with existing state-of-the-art sequence-based and multimodal protein language models, performing comparably to models with $7\times$ more parameters. IglooALM samples antibody loops which are diverse in sequence and more consistent in structure than state-of-the-art antibody inverse folding models. Igloo demonstrates the benefit of introducing multimodal tokens for antibody loops for encoding the diverse landscape of antibody loops, improving protein foundation models, and for antibody CDR design.
[LG-48] Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding
Link: https://arxiv.org/abs/2509.08685
Authors: Tam Thuc Do, Philip A. Chou, Gene Cheung
Subjects: Image and Video Processing (eess.IV); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Given encoded 3D point cloud geometry available at the decoder, we study the problem of lossy attribute compression in a multi-resolution B-spline projection framework. A target continuous 3D attribute function is first projected onto a sequence of nested subspaces $\mathcal{F}^{(p)}_{l_0} \subseteq \cdots \subseteq \mathcal{F}^{(p)}_L$, where $\mathcal{F}^{(p)}_l$ is a family of functions spanned by a B-spline basis function of order $p$ at a chosen scale and its integer shifts. The projected low-pass coefficients $F_l^*$ are computed by variable-complexity unrolling of a rate-distortion (RD) optimization algorithm into a feed-forward network, where the rate term is the sparsity-promoting $\ell_1$-norm. Thus, the projection operation is end-to-end differentiable. For a chosen coarse-to-fine predictor, the coefficients are then adjusted to account for the prediction from a lower-resolution to a higher-resolution, which is also optimized in a data-driven manner.
[LG-49] A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo
Link: https://arxiv.org/abs/2509.08619
Authors: Daniel Lacker, Fuzhong Zhou
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:The unadjusted Langevin algorithm is widely used for sampling from complex high-dimensional distributions. It is well known to be biased, with the bias typically scaling linearly with the dimension when measured in squared Wasserstein distance. However, the recent paper of Chen et al. (2024) identifies an intriguing new delocalization effect: For a class of distributions with sparse interactions, the bias between low-dimensional marginals scales only with the lower dimension, not the full dimension. In this work, we strengthen the results of Chen et al. (2024) in the sparse interaction regime by removing a logarithmic factor, measuring distance in relative entropy (a.k.a. KL-divergence), and relaxing the strong log-concavity assumption. In addition, we expand the scope of the delocalization phenomenon by showing that it holds for a class of distributions with weak interactions. Our proofs are based on a hierarchical analysis of the marginal relative entropies, inspired by the authors’ recent work on propagation of chaos.
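For readers unfamiliar with the unadjusted Langevin algorithm (ULA), the standard update is $x_{k+1} = x_k - h\nabla U(x_k) + \sqrt{2h}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I)$ for a target density proportional to $e^{-U}$. The sketch below is a generic implementation of that update, not code from the paper; the function name `ula_sample` and its parameters are chosen for this example. For the standard Gaussian target $U(x)=\|x\|^2/2$ the iteration is linear, and its stationary variance per coordinate works out to $1/(1-h/2)$ rather than 1, so the error accumulates across coordinates, consistent with the linear-in-dimension bias the abstract mentions.

```python
import numpy as np

def ula_sample(grad_U, x0, step, n_steps, rng=None):
    """Run the unadjusted Langevin algorithm and return the final state.

    grad_U: function returning the gradient of the potential U at x
    step:   step size h (no Metropolis correction is applied)
    """
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Gradient step on U plus Gaussian noise scaled by sqrt(2h)
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * noise
    return x

# Standard Gaussian target, U(x) = ||x||^2 / 2, so grad U(x) = x.
# Each coordinate evolves independently, so one long vector doubles
# as many independent one-dimensional chains.
x = ula_sample(lambda x: x, np.zeros(20000), step=0.5, n_steps=200, rng=0)
empirical_var = np.var(x)  # close to 1 / (1 - 0.5 / 2), not to 1
```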
[LG-50] MasconCube: Fast and Accurate Gravity Modeling with an Explicit Representation
Link: https://arxiv.org/abs/2509.08607
Authors: Pietro Fanti, Dario Izzo
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments:
Abstract:The geodesy of irregularly shaped small bodies presents fundamental challenges for gravitational field modeling, particularly as deep space exploration missions increasingly target asteroids and comets. Traditional approaches suffer from critical limitations: spherical harmonics diverge within the Brillouin sphere where spacecraft typically operate, polyhedral models assume unrealistic homogeneous density distributions, and existing machine learning methods like GeodesyNets and Physics-Informed Neural Networks (PINN-GM) require extensive computational resources and training time. This work introduces MasconCubes, a novel self-supervised learning approach that formulates gravity inversion as a direct optimization problem over a regular 3D grid of point masses (mascons). Unlike implicit neural representations, MasconCubes explicitly model mass distributions while leveraging known asteroid shape information to constrain the solution space. Comprehensive evaluation on diverse asteroid models including Bennu, Eros, Itokawa, and synthetic planetesimals demonstrates that MasconCubes achieve superior performance across multiple metrics. Most notably, MasconCubes demonstrate computational efficiency advantages with training times approximately 40 times faster than GeodesyNets while maintaining physical interpretability through explicit mass distributions. These results establish MasconCubes as a promising approach for mission-critical gravitational modeling applications requiring high accuracy, computational efficiency, and physical insight into internal mass distributions of irregular celestial bodies.
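The forward model behind any mascon representation is Newtonian gravity summed over point masses: $\mathbf{a}(\mathbf{r}) = G\sum_i m_i(\mathbf{r}_i-\mathbf{r})/\|\mathbf{r}_i-\mathbf{r}\|^3$. The sketch below evaluates that sum with NumPy. It only illustrates the forward computation; the grid construction, shape constraints, and optimization loop of MasconCubes are not shown, and the function name `mascon_acceleration` is an assumption for this example.

```python
import numpy as np

G = 6.674e-11  # gravitational constant in m^3 kg^-1 s^-2

def mascon_acceleration(r, positions, masses):
    """Gravitational acceleration at point r due to a set of point masses.

    r:         array of shape (3,), field point
    positions: array of shape (n, 3), mascon locations
    masses:    array of shape (n,), mascon masses
    """
    r = np.asarray(r, dtype=float)
    positions = np.asarray(positions, dtype=float)
    masses = np.asarray(masses, dtype=float)
    diff = positions - r                    # vectors from field point to each mascon
    dist = np.linalg.norm(diff, axis=1)     # distances to each mascon
    # Sum of G * m_i * (r_i - r) / |r_i - r|^3 over all mascons
    return G * np.sum(masses[:, None] * diff / dist[:, None] ** 3, axis=0)

# A single mass of 1/G kg at the origin, evaluated 2 m away along x
a = mascon_acceleration([2.0, 0.0, 0.0], [[0.0, 0.0, 0.0]], [1.0 / G])
```

In a gravity-inversion setting the masses would be the unknowns, adjusted until the accelerations predicted by this sum match the reference field.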
[LG-51] PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research
Link: https://arxiv.org/abs/2509.08553
Authors: Jessica Gronsbell, Vidul Ayakulangara Panickan, Chris Lin, Thomas Charlon, Chuan Hong, Doudou Zhou, Linshanshan Wang, Jianhui Gao, Shirley Zhou, Yuan Tian, Yaqi Shi, Ziming Gan, Tianxi Cai
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce PEHRT, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Our pipeline is also data model agnostic and designed for streamlined execution across institutions based on our extensive real-world experience. We provide a complete suite of open source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.
[LG-52] Gaussian Process Regression – Neural Network Hybrid with Optimized Redundant Coordinates
Link: https://arxiv.org/abs/2509.08457
Authors: Sergei Manzhos, Manabu Ihara
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Recently, a Gaussian Process Regression - neural network (GPRNN) hybrid machine learning method was proposed, which is based on additive-kernel GPR in redundant coordinates constructed by rules [J. Phys. Chem. A 127 (2023) 7823]. The method combined the expressive power of an NN with the robustness of linear regression, in particular with respect to overfitting when the number of neurons is increased beyond the optimal value. We introduce opt-GPRNN, in which the redundant coordinates of GPRNN are optimized with a Monte Carlo algorithm, and show that, with optimized redundant coordinates, GPRNN attains the lowest test set error with far fewer terms/neurons and retains the advantage of avoiding overfitting when the number of neurons is increased beyond the optimal value. The method, opt-GPRNN, possesses an expressive power closer to that of a multilayer NN and could obviate the need for deep NNs in some applications. With optimized redundant coordinates, a dimensionality reduction regime is also possible. Examples of application to machine learning an interatomic potential and materials informatics are given.
[LG-53] Facet: highly efficient E(3)-equivariant networks for interatomic potentials
Link: https://arxiv.org/abs/2509.08418
Authors: Nicholas Miklaucic, Lai Wei, Rongzhi Dong, Nihang Fu, Sadman Sadeed Omee, Qingyang Li, Sourin Dey, Victor Fung, Jianjun Hu
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:
Abstract:Computational materials discovery is limited by the high cost of first-principles calculations. Machine learning (ML) potentials that predict energies from crystal structures are promising, but existing methods face computational bottlenecks. Steerable graph neural networks (GNNs) encode geometry with spherical harmonics, respecting atomic symmetries – permutation, rotation, and translation – for physically realistic predictions. Yet maintaining equivariance is difficult: activation functions must be modified, and each layer must handle multiple data types for different harmonic orders. We present Facet, a GNN architecture for efficient ML potentials, developed through systematic analysis of steerable GNNs. Our innovations include replacing expensive multi-layer perceptrons (MLPs) for interatomic distances with splines, which match performance while cutting computational and memory demands. We also introduce a general-purpose equivariant layer that mixes node information via spherical grid projection followed by standard MLPs – faster than tensor products and more expressive than linear or gate layers. On the MPTrj dataset, Facet matches leading models with far fewer parameters and under 10% of their training compute. On a crystal relaxation task, it runs twice as fast as MACE models. We further show SevenNet-0’s parameters can be reduced by over 25% with no accuracy loss. These techniques enable more than 10x faster training of large-scale foundation models for ML potentials, potentially reshaping computational materials discovery.
[LG-54] LLM-Guided Ansätze Design for Quantum Circuit Born Machines in Financial Generative Modeling
Link: https://arxiv.org/abs/2509.08385
Authors: Yaswitha Gujju, Romain Harang, Tetsuo Shibuya
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: Work presented at the 3rd International Workshop on Quantum Machine Learning: From Research to Practice (QML@QCE’25)
Abstract:Quantum generative modeling using quantum circuit Born machines (QCBMs) shows promising potential for practical quantum advantage. However, discovering ansätze that are both expressive and hardware-efficient remains a key challenge, particularly on noisy intermediate-scale quantum (NISQ) devices. In this work, we introduce a prompt-based framework that leverages large language models (LLMs) to generate hardware-aware QCBM architectures. Prompts are conditioned on qubit connectivity, gate error rates, and hardware topology, while iterative feedback, including Kullback-Leibler (KL) divergence, circuit depth, and validity, is used to refine the circuits. We evaluate our method on a financial modeling task involving daily changes in Japanese government bond (JGB) interest rates. Our results show that the LLM-generated ansätze are significantly shallower and achieve superior generative performance compared to the standard baseline when executed on real IBM quantum hardware using 12 qubits. These findings demonstrate the practical utility of LLM-driven quantum architecture search and highlight a promising path toward robust, deployable generative models for near-term quantum devices.
[LG-55] kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
Link: https://arxiv.org/abs/2509.08366
Authors: Parastoo Pashmchi, Jerome Benoit, Motonobu Kanagawa
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:
Abstract:We study a missing-value imputation method, termed kNNSampler, that imputes a given unit’s missing response by randomly sampling from the observed responses of the k most similar units to the given unit in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments demonstrate its effectiveness in recovering the distribution of missing values. The code for kNNSampler is made publicly available (this https URL).
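The procedure described in the abstract (find the k observed units closest in covariates, then draw the imputed response at random from their responses) can be sketched in a few lines of NumPy. This is an illustration written for this digest, not the authors' released code: the function name `knn_sampler_impute`, the Euclidean distance, and uniform sampling among the k neighbours are assumptions of the example.

```python
import numpy as np

def knn_sampler_impute(X_obs, y_obs, X_mis, k=5, rng=None):
    """Draw one stochastic imputation for each row of X_mis.

    X_obs, y_obs: covariates and responses of units with observed responses
    X_mis:        covariates of units whose response is missing
    """
    rng = np.random.default_rng(rng)
    X_obs = np.asarray(X_obs, dtype=float)
    y_obs = np.asarray(y_obs)
    X_mis = np.asarray(X_mis, dtype=float)
    # Pairwise Euclidean distances between missing and observed units
    d = np.linalg.norm(X_mis[:, None, :] - X_obs[None, :, :], axis=2)
    # Indices of the k nearest observed units for each missing unit
    nn = np.argsort(d, axis=1)[:, :k]
    # Pick one of the k neighbours uniformly at random per missing unit
    pick = rng.integers(0, k, size=len(X_mis))
    return y_obs[nn[np.arange(len(X_mis)), pick]]

# Toy data: two well-separated clusters of observed units
X_obs = [[0.0], [1.0], [10.0], [11.0]]
y_obs = [0.0, 1.0, 10.0, 11.0]
draws = knn_sampler_impute(X_obs, y_obs, [[0.4], [10.6]], k=2, rng=0)
```

Calling the function repeatedly with different seeds yields multiple imputations, whereas averaging the k neighbour responses instead of sampling one would give a conditional-mean estimate in the spirit of kNNImputer.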
[LG-56] Chordless cycle filtrations for dimensionality detection in complex networks via topological data analysis
Link: https://arxiv.org/abs/2509.08350
Authors: Aina Ferrà Marcús, Robert Jankowski, Meritxell Vila Miñana, Carles Casacuberta, M. Ángeles Serrano
Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Algebraic Topology (math.AT)
Comments:
Abstract:Many complex networks, ranging from social to biological systems, exhibit structural patterns consistent with an underlying hyperbolic geometry. Revealing the dimensionality of this latent space can disentangle the structural complexity of communities, impact efficient network navigation, and fundamentally shape connectivity and system behavior. We introduce a novel topological data analysis weighting scheme for graphs, based on chordless cycles, aimed at estimating the dimensionality of networks in a data-driven way. We further show that the resulting descriptors can effectively estimate network dimensionality using a neural network architecture trained in a synthetic graph database constructed for this purpose, which does not need retraining to transfer effectively to real-world networks. Thus, by combining cycle-aware filtrations, algebraic topology, and machine learning, our approach provides a robust and effective method for uncovering the hidden geometry of complex networks and guiding accurate modeling and low-dimensional embedding.
[LG-57] Generative Quasi-Continuum Modeling of Confined Fluids at the Nanoscale
Link: https://arxiv.org/abs/2509.08223
Authors: Bugra Yalcin, Ishan Nadkarni, Jinu Jeong, Chenxing Liang, Narayana R. Aluru
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:
Abstract:We present a data-efficient, multiscale framework for predicting the density profiles of confined fluids at the nanoscale. While accurate density estimates require prohibitively long timescales that are inaccessible by ab initio molecular dynamics (AIMD) simulations, machine-learned molecular dynamics (MLMD) offers a scalable alternative, enabling the generation of force predictions at ab initio accuracy with reduced computational cost. However, despite their efficiency, MLMD simulations remain constrained by femtosecond timesteps, which limit their practicality for computing long-time averages needed for accurate density estimation. To address this, we propose a conditional denoising diffusion probabilistic model (DDPM) based quasi-continuum approach that predicts the long-time behavior of force profiles along the confinement direction, conditioned on noisy forces extracted from a limited AIMD dataset. The predicted smooth forces are then linked to continuum theory via the Nernst-Planck equation to reveal the underlying density behavior. We test the framework on water confined between two graphene nanoscale slits and demonstrate that density profiles for channel widths outside of the training domain can be recovered with ab initio accuracy. Compared to AIMD and MLMD simulations, our method achieves orders-of-magnitude speed-up in runtime and requires significantly less training data than prior works.
[LG-58] RAPID Quantum Detection and Demodulation of Covert Communications: Breaking the Noise Limit with Solid-State Spin Sensors
Link: https://arxiv.org/abs/2509.08171
Authors: Amirhossein Taherpour, Abbas Taherpour, Tamer Khattab
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:We introduce a comprehensive framework for the detection and demodulation of covert electromagnetic signals using solid-state spin sensors. Our approach, named RAPID, is a two-stage hybrid strategy that leverages nitrogen-vacancy (NV) centers to operate below the classical noise floor employing a robust adaptive policy via imitation and distillation. We first formulate the joint detection and estimation task as a unified stochastic optimal control problem, optimizing a composite Bayesian risk objective under realistic physical constraints. The RAPID algorithm solves this by first computing a robust, non-adaptive baseline protocol grounded in the quantum Fisher information matrix (QFIM), and then using this baseline to warm-start an online, adaptive policy learned via deep reinforcement learning (Soft Actor-Critic). This method dynamically optimizes control pulses, interrogation times, and measurement bases to maximize information gain while actively suppressing non-Markovian noise and decoherence. Numerical simulations demonstrate that the protocol achieves a significant sensitivity gain over static methods, maintains high estimation precision in correlated noise environments, and, when applied to sensor arrays, enables coherent quantum beamforming that achieves Heisenberg-like scaling in precision. This work establishes a theoretically rigorous and practically viable pathway for deploying quantum sensors in security-critical applications such as electronic warfare and covert surveillance.
[LG-59] OCTANE – Optimal Control for Tensor-based Autoencoder Network Emergence: Explicit Case
Link: https://arxiv.org/abs/2509.08169
Authors: Ratna Khatri, Anthony Kolshorn, Colin Olson, Harbir Antil
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:This paper presents a novel, mathematically rigorous framework for autoencoder-type deep neural networks that combines optimal control theory and low-rank tensor methods to yield memory-efficient training and automated architecture discovery. The learning task is formulated as an optimization problem constrained by differential equations representing the encoder and decoder components of the network and the corresponding optimality conditions are derived via a Lagrangian approach. Efficient memory compression is enabled by approximating differential equation solutions on low-rank tensor manifolds using an adaptive explicit integration scheme. These concepts are combined to form OCTANE (Optimal Control for Tensor-based Autoencoder Network Emergence) – a unified training framework that yields compact autoencoder architectures, reduces memory usage, and enables effective learning, even with limited training data. The framework’s utility is illustrated with application to image denoising and deblurring tasks and recommendations regarding governing hyperparameters are provided.
[LG-60] Contributions to Robust and Efficient Methods for Analysis of High Dimensional Data
Link: https://arxiv.org/abs/2509.08155
Authors: Kai Yang
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Data Analysis, Statistics and Probability (physics.data-an)
Comments: PhD thesis. Available at this https URL
Abstract:A ubiquitous feature of data of our era is their extra-large sizes and dimensions. Analyzing such high-dimensional data poses significant challenges, since the feature dimension is often much larger than the sample size. This thesis introduces robust and computationally efficient methods to address several common challenges associated with high-dimensional data. In my first manuscript, I propose a coherent approach to variable screening that accommodates nonlinear associations. I develop a novel variable screening method that transcends traditional linear assumptions by leveraging mutual information, with an intended application in neuroimaging data. This approach allows for accurate identification of important variables by capturing nonlinear as well as linear relationships between the outcome and covariates. Building on this foundation, I develop new optimization methods for sparse estimation using nonconvex penalties in my second manuscript. These methods address notable challenges in current statistical computing practices, facilitating computationally efficient and robust analyses of complex datasets. The proposed method can be applied to a general class of optimization problems. In my third manuscript, I contribute to robust modeling of high-dimensional correlated observations by developing a mixed-effects model based on Tsallis power-law entropy maximization and discussing the theoretical properties of the resulting distribution. This model surpasses the constraints of conventional Gaussian models by accommodating a broader class of distributions with enhanced robustness to outliers. Additionally, I develop a proximal nonlinear conjugate gradient algorithm that accelerates convergence while maintaining numerical stability, along with rigorous statistical properties for the proposed framework.
[LG-61] Forecasting Generative Amplification
Link: https://arxiv.org/abs/2509.08048
Authors: Henning Bahl, Sascha Diefenbacher, Nina Elmer, Tilman Plehn, Jonas Spinner
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
Comments: 23 pages, 15 figures
Abstract:Generative networks are perfect tools to enhance the speed and precision of LHC simulations. It is important to understand their statistical precision, especially when generating events beyond the size of the training dataset. We present two complementary methods to estimate the amplification factor without large holdout datasets. Averaging amplification uses Bayesian networks or ensembling to estimate amplification from the precision of integrals over given phase-space volumes. Differential amplification uses hypothesis testing to quantify amplification without any resolution loss. Applied to state-of-the-art event generators, both methods indicate that amplification is possible in specific regions of phase space, but not yet across the entire distribution.
[LG-62] Steering Protein Language Models ICML2025
Link: https://arxiv.org/abs/2509.07983
Authors: Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: Accepted to ICML 2025
Abstract:Protein Language Models (PLMs), pre-trained on extensive evolutionary data from natural proteins, have emerged as indispensable tools for protein design. While powerful, PLMs often struggle to produce proteins with precisely specified functionalities or properties due to inherent challenges in controlling their outputs. In this work, we investigate the potential of Activation Steering, a technique originally developed for controlling text generation in Large Language Models (LLMs), to direct PLMs toward generating protein sequences with targeted properties. We propose a simple yet effective method that employs activation editing to steer PLM outputs, and extend this approach to protein optimization through a novel editing site identification module. Through comprehensive experiments on lysozyme-like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into both auto-encoding and autoregressive PLMs without requiring additional training. These results highlight a promising direction for precise protein engineering using foundation models.
Information Retrieval
[IR-0] Soundtracks of Our Lives: How Age Influences Musical Preferences
Link: https://arxiv.org/abs/2509.08337
Authors: Arsen Matej Golubovikj, Bruce Ferwerda, Alan Said, Marko Tkalčič
Subjects: Information Retrieval (cs.IR)
Comments: Accepted to UMAP 2025
Abstract:The majority of research in recommender systems, be it algorithmic improvements, context-awareness, explainability, or other areas, evaluates these systems on datasets that capture user interaction over a relatively limited time span. However, recommender systems can very well be used continuously for extended time. Similarly so, user behavior may evolve over that extended time. Although media studies and psychology offer a wealth of research on the evolution of user preferences and behavior as individuals age, there has been scant research in this regard within the realm of user modeling and recommender systems. In this study, we investigate the evolution of user preferences and behavior using the LFM-2b dataset, which, to our knowledge, is the only dataset that encompasses a sufficiently extensive time frame to permit real longitudinal studies and includes age information about its users. We identify specific usage and taste preferences directly related to the age of the user, i.e., while younger users tend to listen broadly to contemporary popular music, older users have more elaborate and personalized listening habits. The findings yield important insights that open new directions for research in recommender systems, providing guidance for future efforts.
[IR-1] Vector embedding of multi-modal texts: a tool for discovery?
Link: https://arxiv.org/abs/2509.08216
Authors: Beth Plale, Sai Navya Jyesta, Sachith Withana
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Computer science texts are particularly rich in both narrative content and illustrative charts, algorithms, images, annotated diagrams, etc. This study explores the extent to which vector-based multimodal retrieval, powered by vision-language models (VLMs), can improve discovery across multi-modal (text and images) content. Using over 3,600 digitized textbook pages largely from computer science textbooks and a Vision Language Model (VLM), we generate multi-vector representations capturing both textual and visual semantics. These embeddings are stored in a vector database. We issue a benchmark of 75 natural language queries and compare retrieval performance to ground truth and across four similarity (distance) measures. The study is intended to expose both the strengths and weaknesses of such an approach. We find that cosine similarity most effectively retrieves semantically and visually relevant pages. We further discuss the practicality of using a vector database and multi-modal embedding for operational information retrieval. Our paper is intended to offer design insights for discovery over digital libraries. Keywords: Vector embedding, multi-modal document retrieval, vector database benchmark, digital library discovery
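Ranking pages by cosine similarity, the measure the abstract reports as most effective, can be sketched as below. This is a simplified illustration with one embedding vector per page; the paper uses multi-vector representations and a vector database, neither of which is reproduced here, and the function name `cosine_top_k` and the toy vectors are assumptions of the example.

```python
import numpy as np

def cosine_top_k(query, page_vectors, k=3):
    """Return indices and scores of the k pages most similar to the query."""
    q = np.asarray(query, dtype=float)
    P = np.asarray(page_vectors, dtype=float)
    # Normalise so that a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    scores = P @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy two-dimensional "embeddings" for three pages and one query
pages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order, scores = cosine_top_k([2.0, 0.1], pages, k=3)
```

Because both sides are normalised, the ranking depends only on direction, not on vector length, which is one reason cosine similarity is a common default for embedding retrieval.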