本篇博文主要内容为 2025-09-25 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-09-25)

今日共更新587篇论文,其中:

  • 自然语言处理99篇(Computation and Language (cs.CL))
  • 人工智能192篇(Artificial Intelligence (cs.AI))
  • 计算机视觉95篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习179篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Language Models that Think Chat Better

【速读】: 该论文旨在解决强化学习与人类反馈(RLHF)在通用对话任务中表现受限的问题,尤其是在开放域场景下(如撰写提纲作文或制定饮食计划)缺乏有效推理能力的局限性。传统RLHF依赖于人类偏好数据进行优化,难以激发模型深层次的思维链(Chain-of-Thought, CoT)推理能力。解决方案的关键在于提出**RLMT(Reinforcement Learning with Model-rewarded Thinking)**框架:它通过引入基于模型奖励的在线强化学习机制,在训练过程中要求语言模型(LM)生成长序列的CoT推理路径,并使用偏好驱动的奖励模型对这些推理过程进行优化,从而提升模型在聊天、创意写作等复杂任务中的泛化能力和响应质量。该方法无需显式的监督微调(SFT)阶段,仅需少量提示(如7K),即可显著超越传统多阶段RLHF流程,甚至达到GPT-4o和Claude-3.7-Sonnet(Thinking)的水平。

链接: https://arxiv.org/abs/2509.20357
作者: Adithya Bhaskar,Xi Ye,Danqi Chen
机构: Princeton Language and Intelligence (普林斯顿语言与智能实验室); Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL)
备注: Preprint; we release our code and models publicly at this https URL

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks – such as writing outline essays or making meal plans – where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before response, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call upon future work to understand and employ thinking more broadly.
zh

[NLP-1] EmbeddingGemma: Powerful and Lightweight Text Representations WWW

【速读】: 该论文旨在解决轻量级文本嵌入模型在保持高性能的同时降低计算成本和部署门槛的问题。现有模型要么参数量大、推理延迟高,要么在低资源场景下性能受限,难以满足边缘设备或高吞吐应用的需求。解决方案的关键在于:首先,基于Gemma 3语言模型家族设计了一个仅300M参数的嵌入模型(EmbeddingGemma),通过编码器-解码器初始化和几何嵌入蒸馏策略从更大模型中高效迁移知识;其次,引入扩散正则化(spread-out regularizer)增强嵌入空间的表达能力和鲁棒性,并通过融合多样化优化检查点提升泛化能力;最终在MTEB多语言、英文及代码任务上实现SOTA性能,且在量化或截断嵌入输出后仍保持领先优势,显著优于同类开源与闭源模型,展现出卓越的性能-成本比。

链接: https://arxiv.org/abs/2509.20354
作者: Henrique Schechter Vera,Sahil Dua,Biao Zhang,Daniel Salz,Ryan Mullins,Sindhu Raghuram Panyam,Sara Smoot,Iftekhar Naim,Joe Zou,Feiyang Chen,Daniel Cer,Alice Lisak,Min Choi,Lucas Gonzalez,Omar Sanseviero,Glenn Cameron,Ian Ballantyne,Kat Black,Kaifeng Chen,Weiyi Wang,Zhe Li,Gus Martins,Jinhyuk Lee,Mark Sherwood,Juyeong Ji,Renjie Wu,Jingxiao Zheng,Jyotinder Singh,Abheesht Sharma,Divya Sreepat,Aashi Jain,Adham Elarabawy,AJ Co,Andreas Doumanoglou,Babak Samari,Ben Hora,Brian Potetz,Dahun Kim,Enrique Alfonseca,Fedor Moiseev,Feng Han,Frank Palma Gomez,Gustavo Hernández Ábrego,Hesen Zhang,Hui Hui,Jay Han,Karan Gill,Ke Chen,Koert Chen,Madhuri Shanbhogue,Michael Boratko,Paul Suganthan,Sai Meher Karthik Duddu,Sandeep Mariserla,Setareh Ariafar,Shanfeng Zhang,Shijie Zhang,Simon Baumgartner,Sonam Goenka,Steve Qiu,Tanmaya Dabral,Trevor Walker,Vikram Rao,Waleed Khawaja,Wenlei Zhou,Xiaoqi Ren,Ye Xia,Yichang Chen,Yi-Ting Chen,Zhe Dong,Zhongli Ding,Francesco Visin,Gaël Liu,Jiageng Zhang,Kathleen Kenealy,Michelle Casbon,Ravin Kumar,Thomas Mesnard,Zach Gleicher,Cormac Brick,Olivier Lacombe,Adam Roberts,Yunhsuan Sung,Raphael Hoffmann,Tris Warkentin,Armand Joulin,Tom Duerig,Mojtaba Seyedhosseini
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages. Models are available in HuggingFace (at this https URL ), Kaggle (at this https URL ), and Vertex AI (at this https URL )

点击查看摘要

Abstract:We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
zh

[NLP-2] Morphological Synthesizer for Geez Language: Addressing Morphological Complexity and Resource Limitations

【速读】: 该论文旨在解决Ge’ez语言缺乏可用自然语言处理(Natural Language Processing, NLP)工具的问题,尤其是由于标注语料库、词典和数据集稀缺导致的现代NLP技术难以应用。其关键解决方案是提出一个基于规则的Ge’ez形态合成器(morphological synthesizer),能够根据该语言复杂的屈折和派生形态结构,从词根生成表面词形。通过使用1,102个代表性动词样本(涵盖所有动词形态结构)进行测试,系统实现了97.4%的准确率,显著优于基线模型,验证了规则驱动方法在资源匮乏语言中的可行性,并为未来构建更全面的Ge’ez NLP系统奠定了基础。

链接: https://arxiv.org/abs/2509.20341
作者: Gebrearegawi Gebremariam,Hailay Teklehaymanot,Gebregewergs Mezgebe
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages,2 images,7 tables

点击查看摘要

Abstract:Ge’ez is an ancient Semitic language renowned for its unique alphabet. It serves as the script for numerous languages, including Tigrinya and Amharic, and played a pivotal role in Ethiopia’s cultural and religious development during the Aksumite kingdom era. Ge’ez remains significant as a liturgical language in Ethiopia and Eritrea, with much of the national identity documentation recorded in Ge’ez. These written materials are invaluable primary sources for studying Ethiopian and Eritrean philosophy, creativity, knowledge, and civilization. Ge’ez has a complex morphological structure with rich inflectional and derivational morphology, and no usable NLP has been developed and published until now due to the scarcity of annotated linguistic data, corpora, labeled datasets, and lexicons. Therefore, we propose a rule-based Ge’ez morphological synthesizer to generate surface words from root words according to the morphological structures of the language. We used 1,102 sample verbs, representing all verb morphological structures, to test and evaluate the system. The system achieves a performance of 97.4%, outperforming the baseline model and suggesting that future work should build a comprehensive system considering morphological variations of the language. Keywords: Ge’ez, NLP, morphology, morphological synthesizer, rule-based Comments: 13 pages,2 images,7 tables Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) MSC classes: 68T50, 68T35, 68N01 ACMclasses: I.2.7; I.2.6; H.3.1 Cite as: arXiv:2509.20341 [cs.CL] (or arXiv:2509.20341v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.20341 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-3] DRES: Benchmarking LLM s for Disfluency Removal

【速读】: 该论文旨在解决语音驱动系统中因话语不流畅(disfluencies,如“um”、“uh”、插入语和修正语句)导致的准确性下降问题,尤其是在命令解析、摘要生成和对话代理等任务中的性能瓶颈。其解决方案的关键在于提出一个名为DRES(Disfluency Removal Evaluation Suite)的可控文本级基准测试套件,该套件基于人工标注的Switchboard语料库,将去噪任务与自动语音识别(ASR)错误及声学变异性分离,从而建立可复现的语义上限(semantic upper bound)。通过系统评估不同规模、提示策略和架构的专有与开源大语言模型(LLMs),研究揭示了简单分段策略在长上下文模型中仍具有效性、推理导向模型易过度删除流畅词元、微调虽能接近当前最优精度与召回率但损害泛化能力等关键发现,并据此提出九条实用建议(R1–R9),为语音驱动流水线中的去不流畅处理提供可复现且模型无关的优化路径。

链接: https://arxiv.org/abs/2509.20321
作者: Maria Teleki,Sai Janjur,Haoran Liu,Oliver Grabner,Ketan Verma,Thomas Docog,Xiangjue Dong,Lingfeng Shi,Cong Wang,Stephanie Birkelbach,Jason Kim,Yin Zhang,James Caverlee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Disfluencies – such as “um,” “uh,” interjections, parentheticals, and edited statements – remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
zh

[NLP-4] Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

【速读】: 该论文旨在解决语音中不流畅去除(disfluency removal)评估中仅依赖整体词级指标(如精确率、召回率和F1分数)所导致的局限性,这些指标无法揭示模型在不同类型的不流畅行为上表现差异的根本原因。为应对这一问题,作者提出了一种基于语素层级(span-level)且语言学基础明确的评估指标——Z-Scores,其核心创新在于引入了一个确定性的对齐模块,能够稳健地将生成文本与原始不流畅转录进行映射,并据此对三类典型不流畅类型(EDITED、INTJ、PRN)分别进行诊断性分析。该方法可识别出模型在特定类型不流畅处理上的系统性弱点,从而指导针对性干预策略(如定制提示或数据增强),显著提升模型性能。

链接: https://arxiv.org/abs/2509.20319
作者: Maria Teleki,Sai Janjur,Haoran Liu,Oliver Grabner,Ketan Verma,Thomas Docog,Xiangjue Dong,Lingfeng Shi,Cong Wang,Stephanie Birkelbach,Jason Kim,Yin Zhang,James Caverlee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions – such as tailored prompts or data augmentation – yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
zh

[NLP-5] SIM-CoT: Supervised Implicit Chain-of-Thought

【速读】: 该论文旨在解决隐式思维链(Implicit Chain-of-Thought, Implicit CoT)方法在扩展计算预算时出现的训练不稳定问题,即随着隐式推理token数量增加,模型 latent representations 会趋于同质化,丧失语义多样性,从而限制了隐式CoT的性能提升。其解决方案的关键在于提出SIM-CoT——一个可插拔的训练模块,通过引入step-level监督机制,在训练阶段使用辅助解码器将每个隐式token对齐到对应的显式推理步骤,从而稳定并丰富隐式推理空间;该辅助解码器在推理阶段被移除,保持隐式CoT的高效性,同时实现了对隐式推理过程的可解释性分析。

链接: https://arxiv.org/abs/2509.20317
作者: Xilin Wei,Xiaoran Liu,Yuhang Zang,Xiaoyi Dong,Yuhang Cao,Jiaqi Wang,Xipeng Qiu,Dahua Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3\times greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.
zh

[NLP-6] Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression mBERT and XLM-RoBERTa with Active Learning

【速读】: 该论文旨在解决多语言环境下希望话语(hope speech)检测的难题,尤其是在低资源语种中传统方法性能受限的问题。其解决方案的关键在于结合多语言预训练模型(如mBERT和XLM-RoBERTa)与主动学习(active learning)策略,从而在小规模标注数据下仍能实现高准确率的希望话语识别,显著优于传统基线方法。

链接: https://arxiv.org/abs/2509.20315
作者: T. O. Abiola,K. D. Abiodun,O. E. Olumide,O. O. Adebanji,O. Hiram Calvo,Grigori Sidorov
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hope speech language that fosters encouragement and optimism plays a vital role in promoting positive discourse online. However, its detection remains challenging, especially in multilingual and low-resource settings. This paper presents a multilingual framework for hope speech detection using an active learning approach and transformer-based models, including mBERT and XLM-RoBERTa. Experiments were conducted on datasets in English, Spanish, German, and Urdu, including benchmark test sets from recent shared tasks. Our results show that transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy. Furthermore, our active learning strategy maintained strong performance even with small annotated datasets. This study highlights the effectiveness of combining multilingual transformers with data-efficient training strategies for hope speech detection.
zh

[NLP-7] Feeding Two Birds or Favoring One? Adequacy-Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation

【速读】: 该论文旨在解决机器翻译评估中 adequacy(内容完整性)与 fluency(语言流畅性)之间的权衡问题,尤其关注当前主流评估指标在这一权衡中的偏向性及其对元评估(meta-evaluation)结果的影响。研究表明,现有指标普遍倾向于 adequacy,导致其评分更敏感于翻译的内容准确性而非语言自然度;且这种偏向在 WMT 标准元评估中依然存在,部分归因于参与评估的系统组成偏差。为控制该偏差,论文提出一种在元评估中合成翻译系统的方法,以平衡不同性能维度的覆盖范围,从而更公平地衡量评估指标的表现。解决方案的关键在于通过系统合成策略减少数据集组成带来的偏倚,提升元评估的客观性和代表性。

链接: https://arxiv.org/abs/2509.20287
作者: Behzad Shayegh,Jan-Thorsten Peter,David Vilar,Tobias Domhan,Juraj Juraska,Markus Freitag,Lili Mou
机构: University of Alberta (阿尔伯塔大学); Alberta Machine Intelligence Institute (阿米人工智能研究所); Google(谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by Tenth Conference on Machine Translation (WMT25)

点击查看摘要

Abstract:We investigate the tradeoff between adequacy and fluency in machine translation. We show the severity of this tradeoff at the evaluation level and analyze where popular metrics fall within it. Essentially, current metrics generally lean toward adequacy, meaning that their scores correlate more strongly with the adequacy of translations than with fluency. More importantly, we find that this tradeoff also persists at the meta-evaluation level, and that the standard WMT meta-evaluation favors adequacy-oriented metrics over fluency-oriented ones. We show that this bias is partially attributed to the composition of the systems included in the meta-evaluation datasets. To control this bias, we propose a method that synthesizes translation systems in meta-evaluation. Our findings highlight the importance of understanding this tradeoff in meta-evaluation and its impact on metric rankings.
zh

[NLP-8] Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverag e

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在推理过程中因提示(prompt)设计不当而导致的偏见问题,即所谓的“指令边界”(Instruction Boundary)现象。其核心问题是:即使LLM在整体任务上表现出高准确率,但用户提供的不完整、冗余或有偏见的提示仍会导致模型输出显著偏差,从而削弱其可靠性并带来潜在风险。解决方案的关键在于提出BiasDetector框架,该框架通过量化三类典型提示类型(完整、冗余和不足)引发的偏见,系统评估主流LLM在下游任务中的表现,并据此识别出提示覆盖度对模型可靠性的直接影响,进而为开发者提供针对性的缓解策略,强调用户需谨慎设计提示以提升生成结果的稳定性与可信度。

链接: https://arxiv.org/abs/2509.20278
作者: Zipeng Ling,Yuehao Tang,Chen Huang,Shuliang Liu,Gaoyang Jiang,Shenghong Fu,Junqi Yang,Yao Wan,Jiawan Zhang,Kejia Huang,Xuming Hu
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); University of Pennsylvania (宾夕法尼亚大学); Huazhong University of Science and Technology (华中科技大学); Hong Kong Polytechnic University (香港理工大学); Nanjing University of Posts and Telecommunications (南京邮电大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large-language-model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, their limitations - especially those stemming from prompt design - remain underexplored. Because users may supply biased or incomplete prompts - often unintentionally - LLMs can be misled, undermining reliability and creating risks. We refer to this vulnerability as the Instruction Boundary. To investigate the phenomenon, we distill it into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that LLM reasoning reliability can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.
zh

[NLP-9] Scan-do Attitude: Towards Autonomous CT Protocol Management using a Large Language Model Agent

【速读】: 该论文旨在解决计算机断层扫描(Computed Tomography, CT)中扫描协议管理的效率问题,即如何在患者个体化需求下高效配置成像参数、重建方案及后处理工具,同时缓解放射科技术人员短缺带来的工作压力。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的智能代理框架,该框架融合上下文学习(in-context-learning)、指令遵循(instruction-following)与结构化工具调用(structured toolcalling)能力,实现对自然语言或设备无关格式请求的理解与执行,从而自动检索协议组件、生成兼容设备的协议定义文件并准确实施用户指令。

链接: https://arxiv.org/abs/2509.20270
作者: Xingjian Kang,Linda Vorberg,Andreas Maier,Alexander Katzmann,Oliver Taubmann
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Managing scan protocols in Computed Tomography (CT), which includes adjusting acquisition parameters or configuring reconstructions, as well as selecting postprocessing tools in a patient-specific manner, is time-consuming and requires clinical as well as technical expertise. At the same time, we observe an increasing shortage of skilled workforce in radiology. To address this issue, a Large Language Model (LLM)-based agent framework is proposed to assist with the interpretation and execution of protocol configuration requests given in natural language or a structured, device-independent format, aiming to improve the workflow efficiency and reduce technologists’ workload. The agent combines in-context-learning, instruction-following, and structured toolcalling abilities to identify relevant protocol elements and apply accurate modifications. In a systematic evaluation, experimental results indicate that the agent can effectively retrieve protocol components, generate device compatible protocol definition files, and faithfully implement user requests. Despite demonstrating feasibility in principle, the approach faces limitations regarding syntactic and semantic validity due to lack of a unified device API, and challenges with ambiguous or complex requests. In summary, the findings show a clear path towards LLM-based agents for supporting scan protocol management in CT imaging.
zh

[NLP-10] Failure Modes of Maximum Entropy RLHF

【速读】: 该论文旨在解决参考-free偏好优化方法(如Simple Preference Optimization, SimPO)在在线强化学习人类反馈(Online RLHF)场景中表现不稳定的问题,尤其是其与最大熵强化学习(Maximum Entropy Reinforcement Learning, MaxEnt RL)之间的性能差异。解决方案的关键在于揭示SimPO本质上可视为一种长度归一化温度的最大熵RL,从而为无参考方法提供理论基础;同时通过实验对比发现,MaxEnt RL在在线设置下存在过度优化(overoptimization)和KL散度动态不稳定的缺陷,而KL约束方法则能保持训练稳定,这表明熵正则化无法有效防止奖励劫持(reward hacking),进而指出参考-free方法在不同学习范式(在线 vs. 离线)中面临本质挑战。

链接: https://arxiv.org/abs/2509.20265
作者: Ömer Veysel Çağatan,Barış Akgün
机构: Koç University (科奇大学); KUIS AI Center (科奇大学人工智能中心)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 26 pages, 9 figures

点击查看摘要

Abstract:In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning with length-normalized temperature, providing a theoretical foundation for this reference-free method. Motivated by SimPO’s strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online or offline preference learning.
zh

[NLP-11] Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models

【速读】: 该论文旨在解决当前基于Transformer的語言模型(Language Models, LMs)在对话中对回应对话标记(backchannels)和填充词(fillers)表征不足的问题。这些问题在自然对话中具有重要语用功能,但现有模型往往未能充分学习其语义细微差异。解决方案的关键在于采用三种微调策略,在包含标注回应对话标记与填充词的英日双语对话语料库上训练语言模型,从而提升模型对这些非正式言语形式的表示能力。实验表明,微调后的模型在聚类分析中表现出更高的轮廓系数(silhouette scores),且生成的对话更接近人类产出,验证了通过针对性微调可有效将通用语言模型转化为更具对话能力的生成式AI(Generative AI)系统。

链接: https://arxiv.org/abs/2509.20237
作者: Yu Wang,Leyi Lao,Langchu Huang,Gabriel Skantze,Yang Xu,Hendrik Buschmeier
机构: Bielefeld University (比勒费尔德大学); Southern University of Science and Technology (南方科技大学); KTH Royal Institute of Technology (皇家理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Backchannels and fillers are important linguistic expressions in dialogue, but are under-represented in modern transformer-based language models (LMs). Our work studies the representation of them in language models using three fine-tuning strategies. The models are trained on three dialogue corpora in English and Japanese, where backchannels and fillers are preserved and annotated, to investigate how fine-tuning can help LMs learn their representations. We first apply clustering analysis to the learnt representation of backchannels and fillers, and have found increased silhouette scores in representations from fine-tuned models, which suggests that fine-tuning enables LMs to distinguish the nuanced semantic variation in different backchannel and filler use. We also use natural language generation (NLG) metrics to confirm that the utterances generated by fine-tuned language models resemble human-produced utterances more closely. Our findings suggest the potentials of transforming general LMs into conversational LMs that are more capable of producing human-like languages adequately.
zh

[NLP-12] Muse-it: A Tool for Analyzing Music Discourse on Reddit

【速读】: 该论文旨在解决音乐参与(music engagement)研究中数据获取与分析的局限性问题,即传统方法难以捕捉自然情境下用户对音乐的多样化互动行为(如情感反应、社交联系和身份建构)。其解决方案的关键在于开发了一个名为Muse-it的平台,该平台通过自动化工具从Reddit等社交媒体中大规模采集用户生成内容,并结合自然语言处理(NLP)技术进行主题建模、时间趋势分析与聚类,同时提取音乐相关超链接及轨道级元数据(如艺术家、专辑、流派、歌词等),实现对音乐讨论语境的深度解析。该平台还提供交互式可视化界面,使研究人员能够高效探索音乐在真实在线生态中的传播模式与社会意义。

链接: https://arxiv.org/abs/2509.20228
作者: Jatin Agarwala,George Paul,Nemani Harsha Vardhan,Vinoo Alluri
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Music engagement spans diverse interactions with music, from selection and emotional response to its impact on behavior, identity, and social connections. Social media platforms provide spaces where such engagement can be observed in natural, unprompted conversations. Advances in natural language processing (NLP) and big data analytics make it possible to analyze these discussions at scale, extending music research to broader contexts. Reddit, in particular, offers anonymity that encourages diverse participation and yields rich discourse on music in ecological settings. Yet the scale of this data requires tools to extract, process, and analyze it effectively. We present Muse-it, a platform that retrieves comprehensive Reddit data centered on user-defined queries. It aggregates posts from across subreddits, supports topic modeling, temporal trend analysis, and clustering, and enables efficient study of large-scale discourse. Muse-it also identifies music-related hyperlinks (e.g., Spotify), retrieves track-level metadata such as artist, album, release date, genre, popularity, and lyrics, and links these to the discussions. An interactive interface provides dynamic visualizations of the collected data. Muse-it thus offers an accessible way for music researchers to gather and analyze big data, opening new avenues for understanding music engagement as it naturally unfolds online.
zh

[NLP-13] Low-Resource English-Tigrinya MT: Leverag ing Multilingual Models Custom Tokenizers and Clean Evaluation Benchmarks

【速读】: 该论文旨在解决低资源语言(如提格里尼亚语)在神经机器翻译(Neural Machine Translation, NMT)中性能不足的问题,其核心挑战包括语料稀缺、分词策略不完善以及缺乏标准化的评估基准。解决方案的关键在于采用多语言预训练模型的迁移学习方法,并结合三个关键改进:语言特定的分词器设计、基于语言特征的嵌入初始化策略,以及面向领域适应的微调机制。实验表明,该方法显著优于零样本基线,在BLEU、chrF指标及人工评估上均取得提升,且通过Bonferroni校正确保统计显著性,为提升形态丰富的低资源语言翻译质量提供了可复现的技术路径与数据支持。

链接: https://arxiv.org/abs/2509.20209
作者: Hailay Kidu Teklehaymanot,Gebrearegawi Gidey,Wolfgang Nejdl
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This submission is 8 pages long, includes 4 tables, and contains all required conference details

点击查看摘要

Abstract:Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at this https URL and this https URL Comments: This submission is 8 pages long, includes 4 tables, and contains all required conference details Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) MSC classes: 68T50, 68T35 ACMclasses: I.2.7; H.3.1; I.2.6 Cite as: arXiv:2509.20209 [cs.CL] (or arXiv:2509.20209v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.20209 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-14] Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs

【速读】: 该论文旨在解决在声明式查询语言(如SQL)中集成大语言模型(LLM)驱动的算子时,生成结果与数据库类型约束及数据内容不一致的问题。当前方法依赖大量LLM后处理步骤来确保输出对齐,导致性能瓶颈。解决方案的关键在于利用小型开源语言模型作为函数执行器,在混合数据源上实现高效且类型正确的查询执行,并通过设计一种高效的类型一致性保障机制,显著提升多跳问答任务的准确率(+7%)并降低延迟(-53%)。

链接: https://arxiv.org/abs/2509.20208
作者: Parker Glenn,Alfy Samuel,Daben Liu
机构: Capital One (资本一号)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Integrating LLM powered operators in declarative query languages allows for the combination of cheap and interpretable functions with powerful, generalizable language model reasoning. However, in order to benefit from the optimized execution of a database query language like SQL, generated outputs must align with the rules enforced by both type checkers and database contents. Current approaches address this challenge with orchestrations consisting of many LLM-based post-processing calls to ensure alignment between generated outputs and database values, introducing performance bottlenecks. We perform a study on the ability of various sized open-source language models to both parse and execute functions within a query language based on SQL, showing that small language models can excel as function executors over hybrid data sources. Then, we propose an efficient solution to enforce the well-typedness of LLM functions, demonstrating 7% accuracy improvement on a multi-hop question answering dataset with 53% improvement in latency over comparable solutions. We make our implementation available at this https URL
zh

[NLP-15] hinking Augmented Pre-training

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)预训练过程中数据效率低下的问题,即在计算资源快速增长而高质量文本数据稀缺的背景下,如何更有效地利用现有数据以提升模型性能。其核心挑战在于,部分高价值token因依赖复杂的深层推理机制而难以被固定容量的模型学习。解决方案的关键是提出一种通用方法——思维增强预训练(Thinking Augmented Pre-Training, TPT),通过自动构建并注入“思考轨迹”(thinking trajectories)来扩展原始文本数据,使高价值token的学习过程通过分步推理和分解得以简化,从而显著提升数据利用率与模型表现。实验表明,TPT可将LLM预训练的数据效率提升达3倍,并在多个推理基准上使3B参数模型的下游性能提升超过10%。

链接: https://arxiv.org/abs/2509.20186
作者: Liang Wang,Nan Yang,Shaohan Huang,Li Dong,Furu Wei
机构: Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages

点击查看摘要

Abstract:This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100 B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3 . For a 3 B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.
zh

[NLP-16] Federation of Agents : A Semantics-Aware Communication Fabric for Large-Scale Agent ic AI

【速读】: 该论文旨在解决静态多智能体协作难以适应动态任务需求的问题,尤其在异构AI智能体联邦中如何实现高效、灵活且具备语义理解能力的协同工作。其核心解决方案是提出Federation of Agents (FoA)框架,关键创新在于引入可版本化的能力建模机制——Versioned Capability Vectors (VCVs),通过语义嵌入使智能体的能力(如功能、成本和限制)成为可搜索的机器可读表征,并结合三方面技术:(1) 基于分片HNSW索引的语义路由与成本偏置优化,实现任务到智能体的精准匹配;(2) 动态任务分解机制,兼容智能体通过共识合并生成有向无环图(DAG)形式的子任务结构;(3) 智能聚类策略,在k轮迭代中对相似子任务进行协同通道内的精炼后再合成。该架构基于MQTT的发布-订阅通信模型,实现了亚线性复杂度的分布式协调,显著提升了复杂推理任务下的系统性能(HealthBench测试中相较单模型基线提升13倍),验证了语义驱动的结构化协作在释放异构智能体集体智能方面的有效性。

链接: https://arxiv.org/abs/2509.20175
作者: Lorenzo Giusti,Ole Anton Werner,Riccardo Taiello,Matilde Carvalho Costa,Emre Tosun,Andrea Protani,Marc Molina,Rodrigo Lopes de Almeida,Paolo Cacace,Diogo Reis Santos,Luigi Serio
机构: CERN(欧洲核子研究中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:We present Federation of Agents (FoA), a distributed orchestration framework that transforms static multi-agent coordination into dynamic, capability-driven collaboration. FoA introduces Versioned Capability Vectors (VCVs): machine-readable profiles that make agent capabilities searchable through semantic embeddings, enabling agents to advertise their capabilities, cost, and limitations. Our aarchitecturecombines three key innovations: (1) semantic routing that matches tasks to agents over sharded HNSW indices while enforcing operational constraints through cost-biased optimization, (2) dynamic task decomposition where compatible agents collaboratively break down complex tasks into DAGs of subtasks through consensus-based merging, and (3) smart clustering that groups agents working on similar subtasks into collaborative channels for k-round refinement before synthesis. Built on top of MQTT,s publish-subscribe semantics for scalable message passing, FoA achieves sub-linear complexity through hierarchical capability matching and efficient index maintenance. Evaluation on HealthBench shows 13x improvements over single-model baselines, with clustering-enhanced laboration particularly effective for complex reasoning tasks requiring multiple perspectives. The system scales horizontally while maintaining consistent performance, demonstrating that semantic orchestration with structured collaboration can unlock the collective intelligence of heterogeneous federations of AI agents.
zh

[NLP-17] Probing Gender Bias in Multilingual LLM s: A Case Study of Stereotypes in Persian EMNLP2025

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在低资源语言中潜在的性别偏见问题,以避免因代表性偏差导致的有害影响。其解决方案的关键在于提出一种基于模板的探测方法,并引入领域特定性别偏斜指数(Domain-Specific Gender Skew Index, DS-GSI),用于量化模型在不同语义领域中的性别平等偏离程度。通过该框架对GPT-4o mini、DeepSeek R1、Gemini 2.0 Flash和Qwen QwQ 32B四个主流模型在波斯语(一种低资源语言)中的表现进行评估,结果表明所有模型均存在性别刻板印象,且波斯语中的偏见程度显著高于英语,尤其在体育领域最为突出。该研究为低资源语言中的偏见检测提供了可扩展的评估工具,推动了更包容的自然语言处理实践。

链接: https://arxiv.org/abs/2509.20168
作者: Ghazal Kalhor,Behnam Bahrak
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted and forthcoming at the Widening Natural Language Processing Workshop (WiNLP 2025) at EMNLP 2025

点击查看摘要

Abstract:Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.
zh

[NLP-18] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域任务中表现受限的问题,主要源于训练数据中专业信息的稀疏性(knowledge scarcity)和静态特性导致的知识滞后(temporal lag)。现有方法如持续预训练(Continual Pre-Training, CPT)对领域文档中的所有token一视同仁,忽略了关键知识点;而监督微调(Supervised Fine-Tuning, SFT)依赖问答对,难以构建复杂推理所需的连贯知识结构。解决方案的关键在于提出强化学习增强生成(Reinforcement Learning from Augmented Generation, RLAG),通过迭代采样与奖励优化循环,优先嵌入关键且上下文一致的领域知识:首先基于高对数概率采样生成结果,随后设计三种定制化奖励指标引导模型优化,从而显著提升模型在医学、法律、天文及实时事件等领域的专业知识准确性和解释合理性。

链接: https://arxiv.org/abs/2509.20162
作者: Chaojun Nie,Jun Zhou,Guanxiang Wang,Shisong Wud,Zichen Wang
机构: Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所); University of Chinese Academy of Sciences (中国科学院大学); China Southern Power Grid Artificial Intelligence Technology Co., Ltd. (南方电网人工智能技术有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at this https URL.
zh

[NLP-19] Less is More: The Effectiveness of Compact Typological Language Representations EMNLP2025

【速读】: 该论文旨在解决语言特征数据集(如URIEL+)因高维性和稀疏性,尤其是在低资源语言中,导致距离度量效果不佳的问题。解决方案的关键在于提出一个优化URIEL+类型学特征空间的流水线,通过结合特征选择与缺失值填补(imputation),生成紧凑且可解释的类型学表示;实验证明,这种降维后的特征子集能够提升语言距离对齐效果,并增强多语言自然语言处理(NLP)任务的性能。

链接: https://arxiv.org/abs/2509.20129
作者: York Hay Ng,Phuong Hanh Hoang,En-Shiun Annie Lee
机构: University of Toronto (多伦多大学); Ontario Tech University (安大略理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Main Conference

点击查看摘要

Abstract:Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that reduced-size representations of language typology can yield more informative distance metrics and improve performance in multilingual NLP applications.
zh

[NLP-20] Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

【速读】: 该论文旨在解决当前端到端(End-to-End, E2E)自动驾驶系统中基于模仿学习的方法难以隐式编码物理规则、依赖复杂规则后处理或计算昂贵的扩散引导等问题,从而导致轨迹生成安全性不足与泛化能力受限。其解决方案的关键在于提出ReflectDrive框架,通过引入一种无需梯度计算的安全感知反射机制(safety-aware reflection mechanism),实现迭代式自我修正;同时将二维驾驶空间离散化构建动作码本(action codebook),利用预训练扩散语言模型(Diffusion Language Models)进行微调以支持规划任务,并结合局部搜索识别不安全token,以可行解作为安全锚点进行基于插补(inpainting)的轨迹再生,从而在保障安全性的同时提升轨迹生成的可靠性和可扩展性。

链接: https://arxiv.org/abs/2509.20109
作者: Pengxiang Li,Yinan Zheng,Yue Wang,Huimin Wang,Hang Zhao,Jingjing Liu,Xianyuan Zhan,Kun Zhan,Xianpeng Lang
机构: LiAuto; Tsinghua University (清华大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
zh

[NLP-21] Integrated Framework for LLM Evaluation with Answer Generation

【速读】: 该论文旨在解决传统基于基准的大型语言模型(Large Language Models, LLMs)评估方法依赖固定参考答案、难以捕捉生成响应中关键定性特征的问题。其解决方案的核心在于提出一个名为SPEED(Self-Refining Descriptive Evaluation with Expert-Driven Diagnostics)的集成评估框架,该框架通过引入专业化功能专家模型(functional experts)对模型输出进行多维度的描述性分析,主动整合专家反馈以实现幻觉检测、毒性评估及词汇-语境适切性判断等任务,从而显著提升评估的公平性与可解释性,并在保持较高资源效率的同时实现跨领域和跨数据集的一致性表现。

链接: https://arxiv.org/abs/2509.20097
作者: Sujeong Lee,Hayoung Lee,Seongsoo Heo,Wonik Choi
机构: Inha University (仁川大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16pages

点击查看摘要

Abstract:Reliable evaluation of large language models is essential to ensure their applicability in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, limiting their ability to capture important qualitative aspects of generated responses. To address these shortcomings, we propose an integrated evaluation framework called \textitself-refining descriptive evaluation with expert-driven diagnostics, SPEED, which utilizes specialized functional experts to perform comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets. Additionally, by employing relatively compact expert models, SPEED demonstrates superior resource efficiency compared to larger-scale evaluators. These findings illustrate that SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies.
zh

[NLP-22] Causal Understanding by LLM s: The Role of Uncertainty EMNLP2025

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在因果关系分类任务中表现接近随机水平的问题,旨在厘清这种失败是源于预训练阶段对因果示例的暴露不足,还是更深层次的表征结构缺陷。其解决方案的关键在于采用基于不确定性的评估框架,在PubMed语料库中构建包含18K句(一半来自The Pile,一半为2024年后数据)的测试集,通过因果分类和verbatim记忆探测两种方式系统分析七种模型(如Pythia、GPT-J、Dolly、Qwen等)的行为特征。结果表明,模型在已见与未见句子上的准确率无显著差异(p > 0.05),原始陈述选择比例仅为24.8%(无记忆偏好),且输出分布熵值接近理论最大值(1.35/1.39),说明模型处于随机猜测状态;此外,指令微调模型存在严重校准偏差(如Qwen模型置信度95%但准确率仅32.8%,ECE=0.49)。这些发现共同支持核心结论:LLMs在因果理解上的失败并非因预训练暴露不足,而是由于缺乏结构化的因果表征能力。

链接: https://arxiv.org/abs/2509.20088
作者: Oscar Lithgow-Serrano,Vani Kanjirangat,Alessandro Antonucci
机构: SUPSI(瑞士应用科学与艺术大学); IDSIA(智能系统与人工智能研究所); Switzerland(瑞士)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in second UncertaiNLP workshop at EMNLP 2025

点击查看摘要

Abstract:Recent papers show LLMs achieve near-random accuracy in causal relation classification, raising questions about whether such failures arise from limited pretraining exposure or deeper representational gaps. We investigate this under uncertainty-based evaluation, testing whether pretraining exposure to causal examples improves causal understanding 18K PubMed sentences – half from The Pile corpus, half post-2024 – across seven models (Pythia-1.4B/7B/12B, GPT-J-6B, Dolly-7B/12B, Qwen-7B). We analyze model behavior through: (i) causal classification, where the model identifies causal relationships in text, and (ii) verbatim memorization probing, where we assess whether the model prefers previously seen causal statements over their paraphrases. Models perform four-way classification (direct/conditional/correlational/no-relationship) and select between originals and their generated paraphrases. Results show almost identical accuracy on seen/unseen sentences (p 0.05), no memorization bias (24.8% original selection), and output distribution over the possible options is almost flat, with entropic values near the maximum (1.35/1.39), confirming random guessing. Instruction-tuned models show severe miscalibration (Qwen: 95% confidence, 32.8% accuracy, ECE=0.49). Conditional relations induce highest entropy (+11% vs. direct). These findings suggest that failures in causal understanding arise from the lack of structured causal representation, rather than insufficient exposure to causal examples during pretraining.
zh

[NLP-23] OLaPh: Optimal Language Phonemizer

【速读】: 该论文旨在解决文本到语音(Text-to-Speech, TTS)系统中音素化(Phonemization)任务的准确性问题,尤其针对专有名词、外来词、缩写和同形异义词等难处理词汇的识别与转换难题。解决方案的关键在于提出一种名为OLaPh(Optimal Language Phonemizer)的框架,该框架融合了大规模词典、多种自然语言处理(Natural Language Processing, NLP)技术、复合词解析以及基于概率的评分函数,从而显著提升对未登录词(out-of-domain vocabulary)的处理能力;此外,作者进一步利用OLaPh生成的数据训练大型语言模型(Large Language Model, LLM),以增强泛化性能并改善整体一致性,最终形成一套高精度且可复用的音素化资源体系。

链接: https://arxiv.org/abs/2509.20086
作者: Johannes Wirth
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
zh

[NLP-24] Can Constructions “SCAN” Compositionality ?

【速读】: 该论文旨在解决序列到序列(Sequence to Sequence, Seq2Seq)模型在组合性(compositionality)和系统泛化(systematic generalisation)方面的局限性,尽管这类模型在许多其他任务中表现优异。作者认为问题根源在于模型未能内化常规化的“形式-意义配对”构造(construction),而这些构造是支持可重组性的关键。解决方案的关键在于提出一种无监督的伪构造挖掘方法:从训练数据中自动提取带有变量槽位(variable-slot)的模板作为伪构造。该方法在SCAN数据集上显著提升了分布外(out-of-distribution)测试性能,在ADD JUMP和AROUND RIGHT任务上准确率分别达到47.8%和20.3%,且无需架构修改或额外监督信号;同时仅用原始训练数据的40%即可获得竞争力表现,体现出强大的数据效率。这一发现表明,基于构造意识的预处理是一种替代复杂架构或训练策略干预的有效途径。

链接: https://arxiv.org/abs/2509.20074
作者: Ganesh Katrapati,Manish Shrivastava
机构: International Institute of Information Technology Hyderabad (海得拉巴国际信息科技研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sequence to Sequence models struggle at compositionality and systematic generalisation even while they excel at many other tasks. We attribute this limitation to their failure to internalise constructions conventionalised form meaning pairings that license productive recombination. Building on these insights, we introduce an unsupervised procedure for mining pseudo-constructions: variable-slot templates automatically extracted from training data. When applied to the SCAN dataset, our method yields large gains out-of-distribution splits: accuracy rises to 47.8 %on ADD JUMP and to 20.3% on AROUND RIGHT without any architectural changes or additional supervision. The model also attains competitive performance with? 40% of the original training data, demonstrating strong data efAciency. Our findings highlight the promise of construction-aware preprocessing as an alternative to heavy architectural or training-regime interventions.
zh

[NLP-25] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

【速读】: 该论文旨在解决现有多模态模型在处理语音与文本交织输入输出时存在的两个核心问题:一是训练流程复杂、计算成本高,二是未充分考虑文本与音频在依赖结构上的本质差异。针对上述问题,作者提出TtT框架,其关键创新在于将自回归(Autoregressive, AR)文本生成与非自回归(Non-Autoregressive, NAR)音频扩散机制统一整合进一个基于预训练语言模型(LLM)初始化的单一Transformer架构中,从而实现高效且结构适配的端到端语音-文本建模。

链接: https://arxiv.org/abs/2509.20072
作者: Tianqiao Liu,Xueyi Li,Hao Wang,Haoxuan Li,Zhichao Chen,Weiqi Luo,Zitao Liu
机构: Jinan University (暨南大学); TAL Education Group (tal教育集团); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-in speech-out conversational systems. However, existing multimodal models handling interleaved audio and text, such as MOSHI require complex multi stage training pipelines, incurring substantial computational costs. Moreover, these models uniformly apply autoregressive generation to both text and audio tokens, overlooking a fundamental asymmetry in their dependency structures: while text tokens exhibit strong target target dependencies requiring causal ordering, audio tokens are predominantly driven by source target dependencies, where audio outputs primarily condition on source text rather than preceding audio tokens. In this work, we propose TtT, a unified audio-text modeling framework that integrates AR text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM.
zh

[NLP-26] From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors EMNLP2025

【速读】: 该论文旨在解决语言模型在处理习语、隐喻或依赖上下文的输入时易发生错误的问题,其根本原因在于模型对输入的初始理解出现偏差,而非生成阶段的输出错误。解决方案的关键在于提出一种仅基于输入的预测方法,利用受“意外度(surprisal)”和“均匀信息密度假说(Uniform Information Density hypothesis)”启发的词元级似然特征,捕捉输入理解中的局部不确定性。这些特征能够有效识别潜在错误,在五个语言挑战性数据集上优于标准基线,且span局部化特征对大模型更有效,全局模式则更适合小模型,整个方法无需访问输出或隐藏激活,具备轻量级与通用性优势。

链接: https://arxiv.org/abs/2509.20065
作者: Maggie Mi,Aline Villavicencio,Nafise Sadat Moosavi
机构: University of Sheffield (谢菲尔德大学); University of Exeter (埃克塞特大学); The Alan Turing Institute (艾伦图灵研究所); UFRN, Brazil (巴西联邦里约热内卢大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025

点击查看摘要

Abstract:Language models often struggle with idiomatic, figurative, or context-sensitive inputs, not because they produce flawed outputs, but because they misinterpret the input from the outset. We propose an input-only method for anticipating such failures using token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis. These features capture localized uncertainty in input comprehension and outperform standard baselines across five linguistically challenging datasets. We show that span-localized features improve error detection for larger models, while smaller models benefit from global patterns. Our method requires no access to outputs or hidden activations, offering a lightweight and generalizable approach to pre-generation error prediction.
zh

[NLP-27] Responsible AI Technical Report

【速读】: 该论文旨在解决人工智能(Artificial Intelligence, AI)在开发与运营过程中存在的安全性和可靠性风险问题,尤其是在监管合规背景下如何系统性识别、评估和管理这些风险。解决方案的关键在于构建了一套面向本土环境的AI风险分类体系(AI risk taxonomy),并基于此开发了可验证模型安全性与鲁棒性的评估方法,同时配套推出名为SafetyGuard的实时防护工具,用于阻断有害响应,从而实现对生成式AI(Generative AI)服务的风险防控与治理。

链接: https://arxiv.org/abs/2509.20057
作者: KT:Soonmin Bae,Wanjin Park,Jeongyeop Kim,Yunjin Park,Jungwon Yoon,Junhyung Moon,Myunggyo Oh,Wonhyuk Lee,Junseo Jang,Dongyoung Jung,Minwook Ju,Eunmi Kim,Sujin Kim,Youngchol Kim,Somin Lee,Wonyoung Lee,Minsung Noh,Hyoungjun Park,Eunyoung Shin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 8 figures

点击查看摘要

Abstract:KT developed a Responsible AI (RAI) assessment methodology and risk mitigation technologies to ensure the safety and reliability of AI services. By analyzing the Basic Act on AI implementation and global AI governance trends, we established a unique approach for regulatory compliance and systematically identify and manage all potential risk factors from AI development to operation. We present a reliable assessment methodology that systematically verifies model safety and robustness based on KT’s AI risk taxonomy tailored to the domestic environment. We also provide practical tools for managing and mitigating identified AI risks. With the release of this report, we also release proprietary Guardrail : SafetyGuard that blocks harmful responses from AI models in real-time, supporting the enhancement of safety in the domestic AI development ecosystem. We also believe these research outcomes provide valuable insights for organizations seeking to develop Responsible AI.
zh

[NLP-28] okenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks EMNLP-2025

【速读】: 该论文旨在解决预训练多语言模型在方言数据上表现不佳的问题,即“方言差距”(dialect gap),其成因复杂且现有研究中对影响因素的结论不一致。论文的关键解决方案在于引入两个可量化指标——分词公平性(Tokenization Parity, TP)和信息公平性(Information Parity, IP),用以衡量预训练模型中的表征偏差,并系统分析它们与下游任务性能之间的关系。研究表明,TP 更能预测依赖句法和形态线索的任务(如抽取式问答),而 IP 更适用于语义类任务(如主题分类),从而揭示了模型在脚本或词元层面存在潜在不匹配,可能掩盖其声称的语言支持能力。

链接: https://arxiv.org/abs/2509.20045
作者: Vani Kanjirangat,Tanja Samardžić,Ljiljana Dolamic,Fabio Rinaldi
机构: SUPSI, IDSIA, Switzerland (瑞士应用科学与艺术大学,智能系统研究所); armasuisse S+T, Switzerland (瑞士联邦武器局科学技术部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in EMNLP-2025 Main conference

点击查看摘要

Abstract:Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on the performance of models. This dialect gap has been related to various factors (e.g., data size, economic and social factors) whose impact, however, turns out to be inconsistent. In this work, we investigate factors impacting the model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational biases in pre-trained multilingual models, with the downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for varying scripts (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of the performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance in semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language support claims of LLMs often might mask deeper mismatches at the script or token level.
zh

[NLP-29] Embodied AI: From LLM s to World Models

【速读】: 该论文旨在解决如何通过整合大型语言模型(Large Language Models, LLMs)与世界模型(World Models, WMs)来推动具身人工智能(Embodied Artificial Intelligence, Embodied AI)向更高级别认知和物理交互能力演进的问题。其核心挑战在于实现从单一模态到多模态感知与决策的跨越,并构建能够同时具备语义理解与物理世界建模能力的端到端具身认知架构。解决方案的关键在于提出并论证一种联合多模态大语言模型(Multimodal Large Language Models, MLLMs)与世界模型(WMs)驱动的具身AI架构,该架构不仅融合了LLMs在任务分解和自然语言指令解析中的优势,还利用WMs对环境状态的内部表征与未来预测能力,从而实现符合物理规律的具身交互,为复杂物理世界任务提供可扩展、可泛化的智能解决方案。

链接: https://arxiv.org/abs/2509.20021
作者: Tongtong Feng,Xin Wang,Yu-Gang Jiang,Wenwu Zhu
机构: Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology; Fudan University (复旦大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注: Accepted by IEEE CASM

点击查看摘要

Abstract:Embodied Artificial Intelligence (AI) is an intelligent system paradigm for achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications and driving the evolution from cyberspace to physical systems. Recent breakthroughs in Large Language Models (LLMs) and World Models (WMs) have drawn significant attention for embodied AI. On the one hand, LLMs empower embodied AI via semantic reasoning and task decomposition, bringing high-level natural language instructions and low-level natural language actions into embodied cognition. On the other hand, WMs empower embodied AI by building internal representations and future predictions of the external world, facilitating physical law-compliant embodied interactions. As such, this paper comprehensively explores the literature in embodied AI from basics to advances, covering both LLM driven and WM driven works. In particular, we first present the history, key technologies, key components, and hardware systems of embodied AI, as well as discuss its development via looking from unimodal to multimodal angle. We then scrutinize the two burgeoning fields of embodied AI, i.e., embodied AI with LLMs/multimodal LLMs (MLLMs) and embodied AI with WMs, meticulously delineating their indispensable roles in end-to-end embodied cognition and physical laws-driven embodied interactions. Building upon the above advances, we further share our insights on the necessity of the joint MLLM-WM driven embodied AI architecture, shedding light on its profound significance in enabling complex tasks within physical worlds. In addition, we examine representative applications of embodied AI, demonstrating its wide applicability in real-world scenarios. Last but not least, we point out future research directions of embodied AI that deserve further investigation.
zh

[NLP-30] DiffNator: Generating Structured Explanations of Time-Series Differences

【速读】: 该论文旨在解决物联网(IoT)应用中时间序列差异解释的难题,即如何结构化地理解和呈现两个时间序列之间的差异,而这类差异通常需要领域专家知识才能解读。解决方案的关键在于提出DiffNator框架,其核心创新包括:设计一个用于捕捉差异本质属性的JSON Schema,并构建一个结合时间序列编码器与冻结的大语言模型(LLM)的生成式架构,从而输出格式化的差异解释。实验表明,该方法在准确性上显著优于视觉问答(VQA)基线和基于预训练时间序列编码器的检索方法。

链接: https://arxiv.org/abs/2509.20007
作者: Kota Dohi,Tomoya Nishida,Harsh Purohit,Takashi Endo,Yohei Kawaguchi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In many IoT applications, the central interest lies not in individual sensor signals but in their differences, yet interpreting such differences requires expert knowledge. We propose DiffNator, a framework for structured explanations of differences between two time series. We first design a JSON schema that captures the essential properties of such differences. Using the Time-series Observations of Real-world IoT (TORI) dataset, we generate paired sequences and train a model that combine a time-series encoder with a frozen LLM to output JSON-formatted explanations. Experimental results show that DiffNator generates accurate difference explanations and substantially outperforms both a visual question answering (VQA) baseline and a retrieval method using a pre-trained time-series encoder.
zh

[NLP-31] he Knowledge-Behaviour Disconnect in LLM -based Chatbots

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在对话行为中存在“知识-行为断连”(disconnect)的问题,即尽管LLM能够生成看似正确的答案,但其对话行为并不基于真正理解或内化这些知识,导致幻觉(hallucination)等现象。论文指出,这种断连是根本性的,无法通过增加训练数据或模型规模来消除,因为LLM的核心训练机制——基于预测下一个词的概率分布——无法建立知识与行为之间的因果关联。解决方案的关键在于认识到当前LLM架构的局限性,并强调仅靠微调或引入额外行为约束(如伦理对齐技术)无法从根本上弥合这一断连,反而可能加剧问题。

链接: https://arxiv.org/abs/2509.20004
作者: Jan Broersen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model-based artificial conversational agents (like ChatGPT) give answers to all kinds of questions, and often enough these answers are correct. Just on the basis of that capacity alone, we may attribute knowledge to them. But do these models use this knowledge as a basis for their own conversational behaviour? I argue this is not the case, and I will refer to this failure as a `disconnect’. I further argue this disconnect is fundamental in the sense that with more data and more training of the LLM on which a conversational chatbot is based, it will not disappear. The reason is, as I will claim, that the core technique used to train LLMs does not allow for the establishment of the connection we are after. The disconnect reflects a fundamental limitation on the capacities of LLMs, and explains the source of hallucinations. I will furthermore consider the ethical version of the disconnect (ethical conversational knowledge not being aligned with ethical conversational behaviour), since in this domain researchers have come up with several additional techniques to influence a chatbot’s behaviour. I will discuss how these techniques do nothing to solve the disconnect and can make it worse.
zh

[NLP-32] able Detection with Active Learning ICDAR2025

【速读】: 该论文旨在解决机器学习中对象检测任务因需要大量标注数据而导致的高成本问题。其解决方案的关键在于引入一种基于主动学习(Active Learning, AL)的样本选择策略,该策略不仅考虑不确定性,还融合多样性机制,从而在有限标注预算下选取更具代表性的训练样本,提升模型泛化能力。实验表明,该方法相较于随机采样显著降低标注需求,同时在相同预算下实现更高的平均精度均值(mAP)。

链接: https://arxiv.org/abs/2509.20003
作者: Somraj Gautam,Nachiketa Purohit,Gaurav Harit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted in ICDAR 2025

点击查看摘要

Abstract:Efficient data annotation remains a critical challenge in machine learning, particularly for object detection tasks requiring extensive labeled data. Active learning (AL) has emerged as a promising solution to minimize annotation costs by selecting the most informative samples. While traditional AL approaches primarily rely on uncertainty-based selection, recent advances suggest that incorporating diversity-based strategies can enhance sampling efficiency in object detection tasks. Our approach ensures the selection of representative examples that improve model generalization. We evaluate our method on two benchmark datasets (TableBank-LaTeX, TableBank-Word) using state-of-the-art table detection architectures, CascadeTabNet and YOLOv9. Our results demonstrate that AL-based example selection significantly outperforms random sampling, reducing annotation effort given a limited budget while maintaining comparable performance to fully supervised models. Our method achieves higher mAP scores within the same annotation budget.
zh

[NLP-33] CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

【速读】: 该论文旨在解决印度多语言神经机器翻译(NMT)中高质量平行语料库稀缺的问题,尤其在跨领域场景下。其解决方案的关键在于构建并公开发布一个大规模、高质量的标注平行语料库——CorIL,涵盖11种印度官方语言(包括英语、泰卢固语、印地语、旁遮普语、奥里亚语、克什米尔语、信德语、多格里语、卡纳达语、乌尔都语和古吉拉特语),共计772,000对双语句子,并系统性地按政府、健康和通用三个领域进行分类,以支持领域感知的机器翻译研究与有效的领域自适应。通过在该语料库上微调和评估IndicTrans2、NLLB和BhashaVerse等先进NMT模型,论文揭示了不同脚本类型(如波斯-阿拉伯文与印度文字)对模型性能的影响,验证了CorIL作为基准数据集的价值,从而显著提升印度语言高质量训练数据的可获取性,推动相关领域的研究进展。

链接: https://arxiv.org/abs/2509.19941
作者: Soham Bhattacharjee,Mukund K Roy,Yathish Poojary,Bhargav Dave,Mihir Raj,Vandan Mujadia,Baban Gain,Pruthwik Mishra,Arafat Ahsan,Parameswari Krishnamurthy,Ashwath Rao,Gurpreet Singh Josan,Preeti Dubey,Aadil Amin Kak,Anna Rao Kulkarni,Narendra VG,Sunita Arora,Rakesh Balbantray,Prasenjit Majumdar,Karunesh K Arora,Asif Ekbal,Dipti Mishra Sharma
机构: Indian Institute of Technology Patna(印度理工学院帕特纳分校); CDAC Noida(印度计算中心诺伊达); Manipal Institute of Technology(曼帕尔理工学院); Dhirubhai Ambani University, Gandhinagar(迪鲁巴伊·安巴尼大学,甘地纳格尔); IIIT Bhubaneshwar(印度信息科技研究所布巴内什瓦尔); IIIT Hyderabad(印度信息科技研究所海得拉巴); SVNIT, Surat(萨维特里·文卡塔·甘地理工学院,苏拉特); Punjabi University(旁遮普大学); Govt. College for Women Jammu(查谟女子政府学院); University of Kashmir(克什米尔大学); CDAC Bangalore(印度计算中心班加罗尔); IIIT Bhubaneshwar(印度信息科技研究所布巴内什瓦尔); IIIT Hyderabad(印度信息科技研究所海得拉巴); Indian Institute of Technology Patna(印度理工学院帕特纳分校); Department of Computer Science and Engineering, Manipal Institute of Technology(计算机科学与工程系,曼帕尔理工学院); Department of CSE, Punjabi University(计算机科学与工程系,旁遮普大学); Department of CSE, Govt. College for Women Jammu(计算机科学与工程系,查谟女子政府学院); Department of Linguistics, University of Kashmir(语言学系,克什米尔大学); VLSI Design Group, CDAC Bangalore(超大规模集成电路设计组,印度计算中心班加罗尔); SNLP Lab, CDAC Noida(自然语言处理实验室,印度计算中心诺伊达); SNLP Lab, CDAC Noida(自然语言处理实验室,印度计算中心诺伊达); LTRC, IIIT Hyderabad(语言技术研究中心,印度信息科技研究所海得拉巴); LTRC, IIIT Hyderabad(语言技术研究中心,印度信息科技研究所海得拉巴); Department of Computer Science and Engineering, Indian Institute of Technology Patna(计算机科学与工程系,印度理工学院帕特纳分校); Department of AI, SVNIT, Surat(人工智能系,萨维特里·文卡塔·甘地理工学院,苏拉特)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:India’s linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages : English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus’s value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.
zh

[NLP-34] WEST: LLM based Speech Toolkit for Speech Understanding Generation and Interaction

【速读】: 该论文旨在解决当前语音理解、生成与交互任务中系统碎片化、开发复杂度高以及缺乏统一端到端解决方案的问题。其核心解决方案是提出WEST(WE Speech Toolkit),一个基于大语言模型(Large Language Model, LLM)的全栈语音工具包,关键创新在于三点:一是完全基于LLM架构,复用成熟的模型结构、生态(如Hugging Face)和方法(如序列打包);二是支持从语音识别、合成到对话及多模态的全流程任务,并具备良好的可扩展性以集成开源模型;三是设计简洁易用,降低使用门槛,使研究人员和开发者能够快速上手并部署。

链接: https://arxiv.org/abs/2509.19902
作者: Binbin Zhang,Chengdong Liang,Shuai Wang,Xuelong Geng,Zhao Guo,Haoyu Li,Hao Yin,Xipeng Yang,Pengshen Zhang,Changwei Ma,Lei Xie
机构: Northwestern Polytechnical University (西北工业大学); Nanjing University (南京大学); Shanghai Jiao Tong University (上海交通大学); GuaSemi Speech A Team; WeNet Community
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we present WEST(WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly avilable at this https URL
zh

[NLP-35] PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理任务中因高质量训练数据稀缺而导致的性能瓶颈问题。现有合成语料库通常过于简单或缺乏多样性,而人工标注数据则成本高昂且难以扩展。其核心解决方案是提出 PromptCoT 2.0,一个基于期望最大化(Expectation-Maximization, EM)迭代优化框架的提示合成方法,通过不断精炼推理链(Chain-of-Thought, CoT)来引导生成更具挑战性和多样性的题目。该方法摒弃了传统手工设计的启发式规则,实现了可扩展的问题生成机制,并支持两种后训练范式:自对弈(Self-Play)和监督微调(Supervised Fine-Tuning, SFT),显著提升了模型在数学竞赛(如AIME、HMMT)和编程竞赛(如LiveCodeBench、Codeforces)等复杂推理任务上的表现,验证了提示合成作为提升推理能力的新维度的有效性。

链接: https://arxiv.org/abs/2509.19894
作者: Xueliang Zhao,Wei Wu,Jian Guan,Zhuocheng Gong,Lingpeng Kong
机构: The University of Hong Kong (香港大学); Ant Group (蚂蚁集团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at this https URL.
zh

[NLP-36] Future Policy Aware Preference Learning for Mathematical Reasoning

【速读】: 该论文旨在解决基于偏好学习(preference learning)的大型语言模型(LLM)后训练方法在数学推理任务中表现不佳的问题,其核心挑战在于偏好轨迹间存在大量共享token,导致对非优选轨迹的惩罚会误伤有用token,引发过惩罚(over-penalization)和性能崩溃。解决方案的关键在于提出未来策略感知(Future Policy Aware, FPA)机制,通过在正则化项中用一个基于参考模型向当前模型进行轻量级对数空间外推估计的“未来策略”替代当前策略,从而实现对潜在有害梯度的前瞻式正则化,有效保护共享有用token的概率,同时支持更长时间、无退化的训练,且计算开销可忽略。

链接: https://arxiv.org/abs/2509.19893
作者: Minjae Oh,Yunho Choi,Dongmin Choi,Yohan Jo
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
zh

[NLP-37] Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation EMNLP2025

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在生成能力与判断能力之间关系不明确的问题,尤其是在LLM-as-Judge框架中,二者相关性较弱且不稳定。研究表明,尽管生成与判断能力共享相同的底层知识,但模型的判断结果高度依赖于被评估响应的内容,导致二者关联性微弱。解决方案的关键在于提出一种自参考引导的评估策略(self-reference-guided evaluation strategy),即利用模型自身生成的答案作为参考标准进行评判,从而显著增强生成能力与判断能力之间的相关性,为模型选择提供可靠依据,并促进两者的技能对齐。

链接: https://arxiv.org/abs/2509.19880
作者: Wei-Hsiang Lin,Sheng-Lun Wei,Hen-Hsen Huang,Hsin-Hsi Chen
机构: National Taiwan University (台湾大学); Academia Sinica (中央研究院); AI Research Center (AINTU) (人工智能研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as a long findings paper at EMNLP 2025

点击查看摘要

Abstract:LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models’ generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs’ sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model’s own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and providing a reliable proxy for model selection in evaluation tasks.
zh

[NLP-38] SwissGPC v1.0 – The Swiss German Podcasts Corpus

【速读】: 该论文旨在解决瑞士德语(Schweizerdeutsch)自然口语数据稀缺的问题,以支持自动语音识别(ASR)、文本到语音合成(TTS)、方言识别等相关研究。现有瑞士德语语音语料库多基于受控说话场景,难以反映真实应用场景下的语言多样性与复杂性。解决方案的关键在于构建首个中大规模的自发性瑞士德语语音语料库SwissGPC v1.0,通过自动化标注流水线对来自瑞士广播电视台(SRF)和YouTube的约5400小时原始音频进行分割与弱标注,最终保留近5000小时覆盖瑞士七大主要方言区及标准德语的语音数据,从而为现实世界语音应用提供高质量、多样化的训练与评估资源。

链接: https://arxiv.org/abs/2509.19866
作者: Samuel Stucki,Mark Cieliebak,Jan Deriu
机构: Zurich University of Applied Sciences (ZHAW)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, developed to support research in ASR, TTS, dialect identification, and related fields. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which contain approximately 5400 hours of raw audio. After segmentation and weak annotation, nearly 5000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German. We describe the corpus construction methodology, including an automated annotation pipeline, and provide statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing Swiss German speech corpora, which primarily feature controlled speech, this corpus captures natural, spontaneous conversations, making it a valuable resource for real-world speech applications.
zh

[NLP-39] SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection

【速读】: 该论文旨在解决抑郁症的早期检测问题,特别是针对多用户对话场景下的上下文感知识别(Contextualized Early Detection of Depression)以及基于大语言模型(LLM)的对话式抑郁检测(Conversational Depression Detection via LLMs)。其解决方案的关键在于:在任务二中,通过构建复杂的预处理流程与多种基于Transformer的模型(如RoBERTa Base和MentalRoBERTA Large)来捕捉对话中的语境和时序特征;在试点任务中,则设计了一套结构化的对话策略,以有限轮次内最大化信息获取,从而提升LLM驱动人格代理在心理评估中的有效性。实验表明,该方法在对话式检测中取得第一名成绩,并揭示了早期预测速度与分类准确率之间的权衡关系,为未来联合优化提供了方向。

链接: https://arxiv.org/abs/2509.19861
作者: Alba Maria Marmol-Romero,Manuel Garcia-Vega,Miguel Angel Garcia-Cumbreras,Arturo Montejo-Raez
机构: University of Jaen (哈恩大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 10 figures, 8 tables. CLEF (Working Notes). 2025

点击查看摘要

Abstract:This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTA Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.
zh

[NLP-40] Benchmarking Gaslighting Attacks Against Speech Large Language Models

【速读】: 该论文旨在解决语音大语言模型(Speech Large Language Models, Speech LLMs)在面对操纵性或对抗性输入时的鲁棒性不足问题,尤其是在语音交互中因固有模糊性、连续性和感知多样性导致的攻击难以检测这一挑战。其解决方案的关键在于提出了一种新型攻击范式——“煤气灯效应攻击”(gaslighting attacks),通过设计五种策略(愤怒诱导、认知干扰、讽刺、隐含否定和专业否定)系统性地测试模型在不同任务下的脆弱性,并结合性能下降与行为响应(如未请求的道歉和拒绝)进行多维诊断,同时引入声学扰动实验评估跨模态鲁棒性,最终在超过10,000个样本上验证了平均准确率下降达24.3%,揭示了语音AI系统在行为层面存在显著脆弱性,凸显了构建更具韧性与可信度语音交互系统的必要性。

链接: https://arxiv.org/abs/2509.19858
作者: Jinyang Wu,Bin Zhu,Xiandong Zou,Qiquan Zhang,Xu Fang,Pan Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages, 2 figures, 3 tables

点击查看摘要

Abstract:As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.
zh

[NLP-41] Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking ACL EMNLP2025

【速读】: 该论文旨在解决文学文本中实体消解(Entity Resolution)的挑战,特别是针对梵语这一形态丰富且资源匮乏的语言,其特征包括高词汇变异性、指代模糊性和长距离依赖关系。解决方案的关键在于构建了首个面向梵语文本的端到端实体发现与链接(Entity Discovery and Linking, EDL)大规模数据集Mahānāma,该数据集源自世界最长史诗《摩诃婆罗多》,包含超过10.9万条命名实体提及和5.5万个唯一实体,并与英文知识库对齐以支持跨语言链接。此数据集为评估和改进当前实体消解模型在复杂叙事结构中的表现提供了独特基准,揭示了现有共指消解和实体链接模型在全局上下文评估下的局限性。

链接: https://arxiv.org/abs/2509.19844
作者: Sujoy Sarkar,Gourav Sarkar,Manoj Balaji Jagadeeshan,Jivnesh Sandhan,Amrith Krishna,Pawan Goyal
机构: Indian Institute of Technology, Kharagpur(印度理工学院,卡哈格帕尔分校); Kyoto University, Japan(京都大学,日本); BharatGen
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025. This is the authors’ version. The official version will appear in the ACL Anthology

点击查看摘要

Abstract:High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present Mahānāma, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the Mahābhārata, the world’s longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of Mahānāma, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. Mahānāma thus provides a unique benchmark for advancing entity resolution, especially in literary domains.
zh

[NLP-42] anHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios

【速读】: 该论文旨在解决中医药领域专用大语言模型(Domain-specific LLMs in TCM)在研究场景中面临的适应性不足、评估数据集匮乏及计算资源受限等问题。其解决方案的关键在于构建了一个大规模中医药语料库(0.97GB无监督数据 + 611,312个问答对),并通过上下文数据融合与领域知识整合,采用两阶段训练策略(结合QLoRA、DeepSpeed Stage 2和Flash Attention 2)优化模型性能,最终在12个基准测试中均表现优异,验证了模型在中医药知识系统性保存与可扩展应用中的有效性。

链接: https://arxiv.org/abs/2509.19834
作者: Ji Yin,Menglan He,Yujie Zhang,Linshuai Zhang,Tingting Ma,Ce Tian,Jie Wu,Lin Xu,Tao Jiang, ((1) School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, China (2) The Acupuncture and Tuina School, Chengdu University of Traditional Chinese Medicine, Chengdu, China (3) Center of Preventive Medicine, Hospital of Chengdu University of Traditional Chinese Medicine, Chengdu, China (4) MD School of Intelligent Medicine Chengdu University of Traditional Chinese Medicine, Liutai Avenue Wenjiang District Chengdu, China (5) MD School of Intelligent Medicine Chengdu University of Traditional Chinese Medicine, Liutai Avenue Wenjiang District Chengdu, China)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 46 pages, 5 figures,3 tables

点击查看摘要

Abstract:Domain-specific LLMs in TCM face limitations in research settings due to constrained adaptability, insufficient evaluation datasets, and limited computational resources. This study presents TianHui, a specialized TCM LLM built through contextual data integration and domain knowledge fusion. We constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage 2, and Flash Attention 2. Evaluation on 12 benchmarks showed TianHui ranked top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG). Optimal configuration was identified as LoRA rank=128, alpha=256, epoch=4, dropout=0.2, max length=2048. TianHui enables systematic preservation and scalable application of TCM knowledge. All resources are open-sourced.
zh

[NLP-43] Polarity Detection of Sustainable Detection Goals in News Text

【速读】: 该论文旨在解决可持续发展目标(Sustainable Development Goals, SDGs)文本分析中缺乏方向性判断的问题,即现有方法虽能识别文本与特定SDG的相关性,但无法判定其影响是积极、中立还是消极。为应对这一挑战,作者提出“SDG极性检测”(SDG Polarity Detection)这一新任务,用于评估文本是否表明向某SDG迈进或意图实现该目标。解决方案的关键在于构建了一个名为SDG-POD的基准数据集,该数据集融合了原始数据与合成生成的数据,并通过六种前沿大语言模型(Large Language Models, LLMs)进行零样本和微调配置下的系统评估,结果表明微调模型(尤其是QWQ-32B)在部分SDG上表现优异,且合成数据增强显著提升了模型性能,验证了数据扩充策略在资源受限领域中的有效性。

链接: https://arxiv.org/abs/2509.19833
作者: Andrea Cadeddua,Alessandro Chessa,Vincenzo De Leo,Gianni Fenu,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino,Luca Secchi
机构: Linkalab(链接实验室); University of Cagliari (卡利亚里大学); Open University (开放大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:

点击查看摘要

Abstract:The United Nations’ Sustainable Development Goals (SDGs) provide a globally recognised framework for addressing critical societal, environmental, and economic challenges. Recent developments in natural language processing (NLP) and large language models (LLMs) have facilitated the automatic classification of textual data according to their relevance to specific SDGs. Nevertheless, in many applications, it is equally important to determine the directionality of this relevance; that is, to assess whether the described impact is positive, neutral, or negative. To tackle this challenge, we propose the novel task of SDG polarity detection, which assesses whether a text segment indicates progress toward a specific SDG or conveys an intention to achieve such progress. To support research in this area, we introduce SDG-POD, a benchmark dataset designed specifically for this task, combining original and synthetically generated data. We perform a comprehensive evaluation using six state-of-the-art large LLMs, considering both zero-shot and fine-tuned configurations. Our results suggest that the task remains challenging for the current generation of LLMs. Nevertheless, some fine-tuned models, particularly QWQ-32B, achieve good performance, especially on specific Sustainable Development Goals such as SDG-9 (Industry, Innovation and Infrastructure), SDG-12 (Responsible Consumption and Production), and SDG-15 (Life on Land). Furthermore, we demonstrate that augmenting the fine-tuning dataset with synthetically generated examples yields improved model performance on this task. This result highlights the effectiveness of data enrichment techniques in addressing the challenges of this resource-constrained domain. This work advances the methodological toolkit for sustainability monitoring and provides actionable insights into the development of efficient, high-performing polarity detection systems.
zh

[NLP-44] VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

【速读】: 该论文旨在解决当前基于策略的强化学习方法在提升大语言模型(Large Language Models, LLMs)数学推理能力时,未能显式考虑模型对不同难度样本的学习适应性问题,这与人类从易到难的认知过程相悖。其解决方案的关键在于提出一种基于奖励方差控制的课程强化学习框架(Variance-Controlled Reinforcement Learning, VCRL),通过动态分析rollout群体奖励的方差来估计当前样本对LLM的相对难度,并据此调整训练样本的难度分布,从而实现更符合认知规律的渐进式训练策略。

链接: https://arxiv.org/abs/2509.19803
作者: Guochao Jiang,Wenfeng Feng,Guofeng Quan,Chuzhan Hao,Yuewei Zhang,Guohua Liu,Hao Wang
机构: Alibaba Cloud Computing (阿里云)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs’ learning ability for samples of different difficulty levels, which is contrary to the human cognitive process of mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group’s reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have a lower variance, while samples with moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over the current LLM RL baselines.
zh

[NLP-45] bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对 jailbreak backdoor 攻击时的鲁棒性不足问题,尤其是现有方法如监督微调(Supervised Fine-Tuning, SFT)、模型编辑和基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)在泛化能力差、隐蔽性弱或生成响应上下文可用性低等方面的局限。其解决方案的关键在于提出一种名为 bi-GRPO(bidirectional Group Relative Policy Optimization)的新型基于强化学习(RL)的框架,通过成对轨迹(pairwise rollouts)与成对奖励机制,联合优化模型在触发条件下可靠生成有害内容的同时,在非触发场景下保持安全性;该方法不依赖高质量标注数据或潜在有偏的奖励模型,而是结合规则奖励机制及长度与格式激励,从而实现高攻击成功率(99%)、强隐蔽性以及高度可用且连贯的 jailbreak 响应输出。

链接: https://arxiv.org/abs/2509.19775
作者: Wence Ji,Jiancan Wu,Aiying Li,Shuyi Zhang,Junkang Wu,An Zhang,Xiang Wang,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers–such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)–each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.
zh

[NLP-46] EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation EMNLP2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在非英语为中心的直接多语言翻译(x2x)任务中性能不足的问题。其解决方案的关键在于构建一个基于合成数据生成的框架,该框架利用模型已有的英语到目标语言(en2x)能力,将英文平行语料库扩展为全向(omnidirectional)数据集,并设计一种以英语为参照的质量评估代理指标,从而高效收集高质量的x2x训练数据。结合偏好优化策略,该方法在72个x2x翻译方向上显著提升性能,同时还能增强en2x翻译效果,证明了通过战略性利用英语中心优势可有效构建全面的多语言翻译能力。

链接: https://arxiv.org/abs/2509.19770
作者: Sen Yang,Yu Bao,Yu Lu,Jiajun Chen,Shujian Huang,Shanbo Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models’ established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release codes, datasets, and model checkpoints at this https URL
zh

[NLP-47] CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy Low-Cost Historical Text Recognition EMNLP2025

【速读】: 该论文旨在解决历史文献中文本识别准确率低的问题,现有视觉语言模型(Vision-Language Models, VLMs)因设计面向现代标准化文本,难以应对历史文本中存在的多样语言与书写系统、不规则版式及频繁退化等问题。解决方案的关键在于提出一个专为历史文本识别优化的3B参数开源权重VLM——CHURRO,并基于迄今为止最大的历史文本识别数据集CHURRO-DS进行训练与评估。该数据集整合了155个历史语料库共99,491页文档,涵盖46个语言群组和22个世纪的文化遗产。实验表明,CHURRO在印刷体和手写体文本上的归一化Levenshtein相似度分别达到82.3%和70.1%,显著优于其他开放和闭源VLMs(如Gemini 2.5 Pro),且成本效益提升15.5倍,从而推动历史文本可读性的社区驱动研究与学术进展。

链接: https://arxiv.org/abs/2509.19768
作者: Sina J. Semnani,Han Zhang,Xinyan He,Merve Tekgürler,Monica S. Lam
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: EMNLP 2025

点击查看摘要

Abstract:Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship. Comments: EMNLP 2025 Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.19768 [cs.CL] (or arXiv:2509.19768v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.19768 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sina Semnani [view email] [v1] Wed, 24 Sep 2025 05:38:45 UTC (5,092 KB)
zh

[NLP-48] PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLM s

【速读】: 该论文旨在解决多语言语音模态对齐(multilingual speech modality alignment)中的核心挑战,即如何在保持语言特异性差异的同时实现跨语言的通用表示。现有方法通常冻结大语言模型(LLM)参数并仅训练编码器,导致跨语言强制收敛且性能受限。其解决方案的关键在于提出渐进式对齐表示训练(Progressive Alignment Representation Training, PART),该框架采用多阶段、多任务策略,将语言内对齐与跨语言对齐分离:在跨语言训练阶段动态激活LLM参数,并逐步引入基于文本的任务以增强多语言理解能力,从而有效平衡语言特定性和跨语言泛化性。

链接: https://arxiv.org/abs/2509.19745
作者: Pei Zhang,Andong Chen,Xi Chen,Baosong Yang,Derek F. Wong,Fei Huang
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have expanded from text to speech, giving rise to Speech Large Models (SLMs) that support recognition, translation, and synthesis. A key challenge is aligning speech and text representations, which becomes harder in multilingual settings. Existing methods often freeze LLM parameters and train encoders on multilingual data, but this forces cross-language convergence and limits performance. We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment. During cross-language training, LLM parameters are dynamically activated, and text-based tasks are later introduced to enhance multilingual understanding. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches, with analysis confirming its ability to balance language-specific distinctions and cross-language generalization. These results demonstrate PART’s effectiveness and generality for multilingual speech modality alignment.
zh

[NLP-49] HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST

【速读】: 该论文旨在解决零样本对话状态追踪(Zero-shot Dialog State Tracking, zs-DST)中的语义错位问题,即动态对话上下文与静态提示(prompt)之间存在语义不匹配,导致跨层协作僵化、领域干扰以及灾难性遗忘等挑战。其解决方案的关键在于提出分层协同低秩适配(Hierarchical Collaborative Low-Rank Adaptation, HiCoLoRA)框架:通过分层LoRA架构实现动态的层特定处理(结合底层启发式分组与高层全交互),引入谱联合域-槽聚类(Spectral Joint Domain-Slot Clustering)识别可迁移关联并驱动自适应线性融合机制,同时采用语义增强的奇异值分解初始化(Semantic-Enhanced SVD Initialization, SemSVD-Init)以保留预训练知识,从而显著提升zs-DST性能,在MultiWOZ和SGD多领域数据集上达到当前最优(SOTA)效果。

链接: https://arxiv.org/abs/2509.19742
作者: Shuyu Zhang,Yifan Wei,Xinru Wang,Yanmin Zhu,Yangfan He,Yixuan Weng,Bin Li
机构: Shanghai Jiao Tong University (上海交通大学); Beihang University (北京航空航天大学); University of Sydney (悉尼大学); University of Minnesota – Twin Cities (明尼苏达大学双城分校); Westlake University (西湖大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at this https URL.
zh

[NLP-50] UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

【速读】: 该论文旨在解决当前强化学习(Reinforcement Learning, RL)训练的智能体在真实用户交互场景中表现不足的问题,尤其是如何提升模型在动态、多轮交互中的用户中心能力。其解决方案的关键在于提出UserRL框架,通过标准化的Gym环境与模拟用户相结合的方式,系统性地设计奖励机制(turn-level reward assignment)和轨迹评分策略(trajectory-level score calculation),并验证不同设定对GRPO算法下学习效果的影响。研究表明,监督微调(SFT)冷启动是激发初始交互能力的基础,精心设计的轨迹评分可显著提高多轮交互效率,且开源模拟器(如Qwen3-32B)在成本与迁移性上优于强但昂贵的闭源模拟器(如GPT-4o),从而证明了奖励塑造与用户模拟设计的重要性不亚于模型规模本身。

链接: https://arxiv.org/abs/2509.19736
作者: Cheng Qian,Zuxin Liu,Akshara Prabhakar,Jielin Qiu,Zhiwei Liu,Haolin Chen,Shirley Kokane,Heng Ji,Weiran Yao,Shelby Heinecke,Silvio Savarese,Caiming Xiong,Huan Wang
机构: Salesforce AI Research (Salesforce人工智能研究中心); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 28 Pages, 15 Figures, 6 Tables; Built upon latest UserBench release: arXiv:2507.22034

点击查看摘要

Abstract:Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet, the ultimate value of such agents lies in their ability to assist users, a setting where diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitates training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user simulation choice is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All codes and data are public for future research.
zh

[NLP-51] Personality Vector: Modulating Personality of Large Language Models by Model Merging EMNLP2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个性化应用中难以准确、连续且多维地模拟人类人格特质的问题。现有方法虽能诱导模型表现出特定人格特征,但难以捕捉人格的连续性和复杂性。其解决方案的关键在于提出一种基于模型合并(model merging)的人格调制方法:通过计算预训练模型与针对特定人格特质微调后的模型之间的权重差值,构建人格向量(personality vectors),并利用这些向量对LLM进行无额外训练的个性控制。实验表明,该方法可实现人格强度的连续调节,并支持多个人格特质的组合,同时具备跨下游模型的迁移能力,表明人格向量编码了通用的人格表征。

链接: https://arxiv.org/abs/2509.19727
作者: Seungjong Sun,Seo Yeon Baek,Jang Hyun Kim
机构: Sungkyunkwan University (成均馆大学); Sungkyunkwan University (成均馆大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025

点击查看摘要

Abstract:Driven by the demand for personalized AI systems, there is growing interest in aligning the behavior of large language models (LLMs) with human traits such as personality. Previous attempts to induce personality in LLMs have shown promising results, but they struggle to capture the continuous and multidimensional nature of human traits. In this work, we propose a novel method for personality modulation in LLMs via model merging. Specifically, we construct personality vectors by subtracting the weights of a pre-trained model from those of the fine-tuned model on a given personality trait. By merging personality vectors, we enable LLMs to exhibit desired personality traits without additional training. Extensive experiments show that personality vectors enable continuous control over trait intensity and support the composition of multiple traits. Furthermore, personality vectors transfer across diverse downstream models, suggesting that they encode generalizable representations of personality. Our code is available at here.
zh

[NLP-52] DyBBT: Dynamic Balance via Bandit inspired Targeting for Dialog Policy with Cognitive Dual-Systems

【速读】: 该论文旨在解决任务导向型对话系统中静态探索策略无法适应动态对话上下文的问题,从而导致探索效率低下和性能不佳。解决方案的关键在于提出DyBBT框架,其核心是通过结构化的认知状态空间(cognitive state space)建模对话进展、用户不确定性以及槽位依赖关系,并引入一种受多臂赌博机启发的元控制器(meta-controller),根据实时认知状态和访问频次动态切换快速直觉推理(System 1)与慢速 deliberative reasoner(System 2),实现更高效且自适应的对话策略学习。

链接: https://arxiv.org/abs/2509.19695
作者: Shuyu Zhang,Yifan Wei,Jialuo Yuan,Xinru Wang,Yanmin Zhu,Bin Li
机构: Shanghai Jiao Tong University (上海交通大学); Beihang University (北京航空航天大学); Stanford University (斯坦福大学); University of Sydney (悉尼大学); Chinese Academy of Sciences (中国科学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Task oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT proposes a bandit inspired meta-controller that dynamically switches between a fast intuitive inference (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, with human evaluations confirming its decisions are well aligned with expert judgment. Code is available at this https URL.
zh

[NLP-53] Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections

【速读】: 该论文旨在解决城市交通中行人与驾驶员在人行横道处交互行为建模的复杂性问题,尤其是传统机器学习模型难以捕捉多因素、情境依赖的决策逻辑。其解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, LLMs),通过创新的提示工程设计,融合领域知识、结构化推理和少样本提示(few-shot prompting),实现对驾驶员让行行为的可解释且情境感知的推断,从而提升行人安全系统的建模精度与实用性。

链接: https://arxiv.org/abs/2509.19657
作者: Yicheng Yang,Zixian Li,Jean Paul Bizimana,Niaz Zafri,Yongfeng Dong,Tianyi Li
机构: Hebei University of Technology (河北工业大学); Saint Louis University (圣路易斯大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Pedestrian safety is a critical component of urban mobility and is strongly influenced by the interactions between pedestrian decision-making and driver yielding behavior at crosswalks. Modeling driver–pedestrian interactions at intersections requires accurately capturing the complexity of these behaviors. Traditional machine learning models often struggle to capture the nuanced and context-dependent reasoning required for these multifactorial interactions, due to their reliance on fixed feature representations and limited interpretability. In contrast, large language models (LLMs) are suited for extracting patterns from heterogeneous traffic data, enabling accurate modeling of driver-pedestrian interactions. Therefore, this paper leverages multimodal LLMs through a novel prompt design that incorporates domain-specific knowledge, structured reasoning, and few-shot prompting, enabling interpretable and context-aware inference of driver yielding behavior, as an example application of modeling pedestrian–driver interaction. We benchmarked state-of-the-art LLMs against traditional classifiers, finding that GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. These findings highlight the critical trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.
zh

[NLP-54] Human-AI Narrative Synthesis to Foster Shared Understanding in Civic Decision-Making

【速读】: 该论文旨在解决代表型政治场景(如学区)中社区参与过程中产生的海量反馈信息难以通过传统方法进行有效整合的问题,从而阻碍了公民领袖与居民之间以及社区成员之间的共识形成。其解决方案的关键在于开发了一个名为StoryBuilder的人机协同叙事合成流程,该流程将原始社区输入转化为可访问的第一人称叙述,并通过移动友好的StorySharer界面部署这些故事。该系统利用2,480条来自学区重新划分过程的反馈生成了124个复合叙事,在实地部署和控制实验中验证了叙事结构对参与者态度的影响:基于经验的故事比以观点为主的叙述更能激发尊重与信任,显著提升了跨群体的理解与共情能力。

链接: https://arxiv.org/abs/2509.19643
作者: Cassandra Overney,Hang Jiang,Urooj Haider,Cassandra Moe,Jasmine Mangat,Frank Pantano,Effie G. McMillian,Paul Riggins,Nabeel Gillani
机构: Massachusetts Institute of Technology (麻省理工学院); Northeastern University (东北大学); Winston-Salem/Forsyth County Schools (温斯顿-塞勒姆学区)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Community engagement processes in representative political contexts, like school districts, generate massive volumes of feedback that overwhelm traditional synthesis methods, creating barriers to shared understanding not only between civic leaders and constituents but also among community members. To address these barriers, we developed StoryBuilder, a human-AI collaborative pipeline that transforms community input into accessible first-person narratives. Using 2,480 community responses from an ongoing school rezoning process, we generated 124 composite stories and deployed them through a mobile-friendly StorySharer interface. Our mixed-methods evaluation combined a four-month field deployment, user studies with 21 community members, and a controlled experiment examining how narrative composition affects participant reactions. Field results demonstrate that narratives helped community members relate across diverse perspectives. In the experiment, experience-grounded narratives generated greater respect and trust than opinion-heavy narratives. We contribute a human-AI narrative synthesis system and insights on its varied acceptance and effectiveness in a real-world civic context.
zh

[NLP-55] AutoSpec: An Agent ic Framework for Automatically Drafting Patent Specification EMNLP

【速读】: 该论文旨在解决专利说明书(patent specification)自动撰写过程中面临的两大核心挑战:一是专利信息的高度保密性限制了闭源大语言模型(Large Language Models, LLMs)的应用;二是专利文本具有长上下文、专业化技术写作风格和领域知识要求,使得现有语言模型难以胜任。解决方案的关键在于提出了一种名为AutoSpec的安全、代理式(agentic)框架,通过将专利撰写过程分解为一系列可管理的子任务,每个子任务由小型开源语言模型协同定制工具完成,从而在保障数据安全的前提下实现高质量的自动化专利撰写。

链接: https://arxiv.org/abs/2509.19640
作者: Ryan Shea,Zhou Yu
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: EMNLP Findings 2025

点击查看摘要

Abstract:Patents play a critical role in driving technological innovation by granting inventors exclusive rights to their inventions. However the process of drafting a patent application is often expensive and time-consuming, making it a prime candidate for automation. Despite recent advancements in language models, several challenges hinder the development of robust automated patent drafting systems. First, the information within a patent application is highly confidential, which often prevents the use of closed-source LLMs for automating this task. Second, the process of drafting a patent application is difficult for even the most advanced language models due to their long context, technical writing style, and specialized domain knowledge. To address these challenges, we introduce AutoSpec, a secure, agentic framework for Automatically drafting patent Specification. Our approach decomposes the drafting process into a sequence of manageable subtasks, each solvable by smaller, open-source language models enhanced with custom tools tailored for drafting patent specification. To assess our system, we design a novel evaluation protocol in collaboration with experienced patent attorneys. Our automatic and expert evaluations show that AutoSpec outperforms existing baselines on a patent drafting task.
zh

[NLP-56] Multimodal Language Models with Modality-Specific Experts for Financial Forecasting from Interleaved Sequences of Text and Time Series

【速读】: 该论文旨在解决金融市场上文本数据(如新闻文章)与时间序列数据(如股价)的多模态融合难题,即如何有效整合这两种互补信息以提升预测性能。其核心挑战在于二者在时序上交错分布,且各自具有不同的特征表示方式,难以实现协同建模。解决方案的关键在于提出一种统一的神经架构,通过模态专用专家(modality-specific experts)分别学习时间序列的独特模式,同时支持跨模态联合推理并保留预训练语言模型的理解能力;此外,引入基于显著性标记加权机制的跨模态对齐框架,强化对最具信息量的token的对齐,从而增强多模态理解能力。实验表明该方法在大规模金融预测任务中优于多种强基线模型,并带来实际的投资经济收益。

链接: https://arxiv.org/abs/2509.19628
作者: Ross Koval,Nicholas Andrews,Xifeng Yan
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Johns Hopkins University (约翰霍普金斯大学)
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
备注: Preprint

点击查看摘要

Abstract:Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. However, despite their complementary nature, effectively integrating these interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences using modality-specific experts, allowing the model to learn unique time series patterns, while still enabling joint reasoning across modalities and preserving pretrained language understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities with a focus on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance across a wide variety of strong unimodal and multimodal baselines. We develop an interpretability method that reveals insights into the value of time series-context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate to meaningful economic gains in investment simulations.
zh

[NLP-57] Evaluating Language Translation Models by Playing Telephone EMNLP2025

【速读】: 该论文旨在解决当前机器翻译(Machine Translation, MT)系统质量评估能力滞后于语言模型性能提升的问题,尤其在长文本和文学翻译等复杂任务中表现尤为突出。其核心挑战在于缺乏高效且准确的评估方法来驱动模型进一步优化。解决方案的关键在于提出一种无监督的数据生成策略,通过在源语言与目标语言之间反复进行翻译迭代(即“模型轮转”或“语言翻译”),自动生成适用于不同文档长度和应用领域的训练数据,从而训练出更鲁棒的翻译评估系统。实验表明,基于此类合成数据训练的评估模型在两项任务上均优于主流系统xCOMET:一是对给定译文与人工参考译文的质量评分,二是从两个候选译文中选出更贴近原文的版本。

链接: https://arxiv.org/abs/2509.19611
作者: Syeda Jannatus Saba,Steven Skiena
机构: Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Main Conference as a long paper

点击查看摘要

Abstract:Our ability to efficiently and accurately evaluate the quality of machine translation systems has been outrun by the effectiveness of current language models–which limits the potential for further improving these models on more challenging tasks like long-form and literary translation. We propose an unsupervised method to generate training data for translation evaluation over different document lengths and application domains by repeated rounds of translation between source and target languages. We evaluate evaluation systems trained on texts mechanically generated using both model rotation and language translation approaches, demonstrating improved performance over a popular translation evaluation system (xCOMET) on two different tasks: (i) scoring the quality of a given translation against a human reference and (ii) selecting which of two translations is generationally closer to an original source document.
zh

[NLP-58] Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models

【速读】: 该论文旨在解决视觉模态下情感识别中缺乏对身体部位具身反应(embodied emotional reactions)建模的问题,尤其在面部被遮挡时难以准确识别情绪的挑战。其解决方案的关键在于提出一种基于大视觉语言模型(Large Vision-Language Models, LVLMs)的框架——ELENA(Embodied LVLM Emotion Narratives),通过生成多层结构化的文本描述来聚焦于情绪反应中显著的身体部位,从而实现对具身情绪的细粒度解析;尽管模型存在对人脸区域的注意力偏倚,ELENA仍能在不进行微调的情况下有效识别面部遮挡图像中的情绪,展现出跨模态情感分析的新路径。

链接: https://arxiv.org/abs/2509.19595
作者: Mohammad Saim,Phan Anh Duong,Cat Luong,Aniket Bhanderi,Tianyu Jiang
机构: University of Cincinnati (辛辛那提大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The embodiment of emotional reactions from body parts contains rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision-language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, primarily comprising descriptions that focus on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias towards the facial region. Despite this limitation, we observe that our employed framework can effectively recognize embodied emotions in face-masked images, outperforming baselines without any fine-tuning. ELENA opens a new trajectory for embodied emotion analysis across the modality of vision and enriches modeling in an affect-aware setting.
zh

[NLP-59] GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models EMNLP2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放域、开放式场景中作为策略性提问者时,其提问质量难以量化与优化的问题。解决方案的关键在于提出两种信息增益(Information Gain, IG)度量方法:一种基于贝叶斯框架,通过LLM评分的语义相关性追踪信念更新;另一种基于熵的方法,利用ConceptNet过滤候选概念。这两种指标均具备模型无关性,并支持事后分析,实验证明更高的IG显著提升交互效率——IG每提高一个标准差,游戏预期长度减少43%。此外,基于IG引导的提示约束(如强制问题多样性)可使性能较弱的模型显著改进,表明LLM的提问能力既可测量也可优化,对交互式推理至关重要。

链接: https://arxiv.org/abs/2509.19593
作者: Dylan Hutson,Daniel Vennemeyer,Aneesh Deshmukh,Justin Zhan,Tianyu Jiang
机构: University of Cincinnati (辛辛那提大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025, 17 pages, 2 figures

点击查看摘要

Abstract:We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
zh

[NLP-60] LLM s4All: A Review on Large Language Models for Research and Applications in Academic Disciplines

【速读】: 该论文旨在解决如何系统性地理解大型语言模型(Large Language Models, LLMs)在多学科领域中的应用现状、潜力与挑战的问题。其解决方案的关键在于对当前最先进的LLMs进行综述,并深入探讨它们在人文艺术与法律(如历史、哲学、政治学、法学等)、经济与商业(如金融、会计、市场营销等)以及科学与工程(如数学、物理、生物工程、计算机科学等)三大类学术领域的集成方式与实际影响,同时识别出关键限制因素、开放性挑战及未来研究方向,从而为跨学科研究人员和实践者提供可操作的洞察,以推动生成式AI在真实世界应用场景中的有效落地与创新。

链接: https://arxiv.org/abs/2509.19580
作者: Yanfang(Fanny)Ye,Zheyuan Zhang,Tianyi Ma,Zehong Wang,Yiyang Li,Shifu Hou,Weixiang Sun,Kaiwen Shi,Yijun Ma,Wei Song,Ahmed Abbasi,Ying Cheng,Jane Cleland-Huang,Steven Corcelli,Patricia Culligan,Robert Goulding,Ming Hu,Ting Hua,John Lalor,Fang Liu,Tengfei Luo,Ed Maginn,Nuno Moniz,Jason Rohr,Brett Savoie,Daniel Slate,Tom Stapleford,Matthew Webber,Olaf Wiest,Johnny Zhang,Nitesh Chawla
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.
zh

[NLP-61] ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities

【速读】: 该论文旨在解决Transformer模型中位置嵌入(Position Embedding)在序列长度外推(extrapolation)时性能下降的问题,即传统绝对或相对位置嵌入方法难以有效处理训练时未见过的更长序列。解决方案的关键在于提出一种名为“精确位置嵌入”(Exact Positional Embeddings, ExPE)的新策略,通过覆盖嵌入向量的特定维度来编码精确的位置信息,从而在不破坏原始嵌入结构的前提下,提升模型对超长序列的泛化能力。实验表明,在因果语言建模任务中,ExPE相较于旋转位置嵌入(rotary embedding)和正弦位置嵌入(sinusoidal embedding),在长序列上的困惑度(perplexity)显著降低。

链接: https://arxiv.org/abs/2509.19569
作者: Aleksis Datseris,Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva
机构: Sofia University St. Kliment Ohridski (索菲亚大学圣克莱门特奥赫里德斯基); Graphwise (Graphwise)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces a novel approach to position embeddings in transformer models, named “Exact Positional Embeddings” (ExPE). An absolute positional embedding method that can extrapolate to sequences of lengths longer than the ones it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, which often struggle with extrapolation to sequences longer than those seen during training. Our proposed method utilizes a novel embedding strategy that encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model’s ability to generalize to more extended sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings, when tested on sequences longer than those used in training.
zh

[NLP-62] Retrieval Augmented Generation based context discovery for ASR EMNLP2025

【速读】: 该论文旨在解决上下文感知自动语音识别(ASR)系统中罕见词或词汇外术语(out-of-vocabulary terms)导致的识别准确率下降问题,核心挑战在于如何自动发现并利用相关上下文信息。解决方案的关键在于提出一种基于嵌入(embedding)的高效检索方法,用于自动挖掘与当前语音内容相关的外部语境信息,从而提升ASR的性能;实验表明,该方法相比无上下文情况可将词错误率(WER)降低最多17%,接近使用理想上下文时24.1%的改进幅度。

链接: https://arxiv.org/abs/2509.19567
作者: Dimitrios Siskos,Stavros Papadopoulos,Pablo Peso Parada,Jisi Zhang,Karthikeyan Saravanan,Anastasios Drosou
机构: Information Technologies Institute, Center for Research and Technology Hellas (信息与技术研究所,希腊研究中心与技术中心); Samsung Electronics R&D Institute UK (SRUK) (三星电子英国研发研究所)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at EMNLP 2025

点击查看摘要

Abstract:This work investigates retrieval augmented generation as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) system, in order to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. However, identifying the right context automatically remains an open challenge. This work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) large language model (LLM)-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21 and SPGISpeech demonstrate that the proposed approach reduces WER by up to 17% (percentage difference) relative to using no-context, while the oracle context results in a reduction of up to 24.1%.
zh

[NLP-63] Uncertainty in Semantic Language Modeling with PIXELS EMNLP

【速读】: 该论文旨在解决像素级语言模型(pixel-based language models)在语言建模中面临的词汇瓶颈问题,同时聚焦于不确定性量化(uncertainty quantification)这一开放挑战。其关键解决方案在于通过蒙特卡洛丢弃(Monte Carlo Dropout)、Transformer 注意力机制(Transformer Attention)和集成学习(Ensemble Learning)等多种方法,在18种语言和7种书写系统上对三种语义复杂任务中的不确定性与置信度进行系统分析。结果表明,像素级模型在补丁重建时会低估不确定性,且不确定性受书写系统影响,拉丁语系语言表现出更低的不确定性;此外,针对命名实体识别和问答任务,集成学习结合超参数调优后性能显著提升。

链接: https://arxiv.org/abs/2509.19563
作者: Stefania Radu,Marco Zullich,Matias Valdenegro-Toro
机构: University of Groningen (格罗宁根大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 6 figures, UncertaiNLP 2025 Workshop @ EMNLP Camera Ready

点击查看摘要

Abstract:Pixel-based language models aim to solve the vocabulary bottleneck problem in language modeling, but the challenge of uncertainty quantification remains open. The novelty of this work consists of analysing uncertainty and confidence in pixel-based language models across 18 languages and 7 scripts, all part of 3 semantically challenging tasks. This is achieved through several methods such as Monte Carlo Dropout, Transformer Attention, and Ensemble Learning. The results suggest that pixel-based models underestimate uncertainty when reconstructing patches. The uncertainty is also influenced by the script, with Latin languages displaying lower uncertainty. The findings on ensemble learning show better performance when applying hyperparameter tuning during the named entity recognition and question-answering tasks across 16 languages.
zh

[NLP-64] Confidence Calibration in Large Language Model-Based Entity Matching EMNLP

【速读】: 该论文旨在解决大型语言模型(Large Language Models)在实体匹配(Entity Matching)任务中置信度校准(confidence calibration)不足的问题。研究表明,基础的RoBERTa模型在实体匹配任务中存在轻微过自信现象,其预期校准误差(Expected Calibration Error, ECE)在不同数据集上介于0.0043至0.0552之间。解决方案的关键在于采用温度缩放(Temperature Scaling)方法对模型输出置信度进行校准,实验表明该方法可将ECE降低最多达23.83%,从而显著提升模型预测置信度的可靠性。

链接: https://arxiv.org/abs/2509.19557
作者: Iris Kamsteeg,Juan Cardenas-Cartagena,Floris van Beers,Gineke ten Holt,Tsegaye Misikir Tashu,Matias Valdenegro-Toro
机构: Bernoulli Institute, University of Groningen(格罗宁根大学); Independent Researcher
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 2 figures. UncertaiNLP 2025 Workshop @ EMNLP Camera Ready

点击查看摘要

Abstract:This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.
zh

[NLP-65] Do LLM s Encode Frame Semantics? Evidence from Frame Identification

【速读】: 该论文旨在解决框架语义学(frame semantics)中框架识别(frame identification)这一核心挑战,即在具体语境中为目标词选择最合适的语义框架。其解决方案的关键在于:利用大规模语言模型(large language models)在无显式监督条件下对框架语义知识的潜在编码能力,通过提示(prompt-based inference)实现有效的框架识别;进一步地,通过对FrameNet数据进行任务特定微调(fine-tuning),显著提升域内准确率并保持良好的跨域泛化性能,同时实证表明模型能够生成语义一致的框架定义,揭示其对框架语义的内在理解能力。

链接: https://arxiv.org/abs/2509.19540
作者: Jayanth Krishna Chundru,Rudrashis Poddar,Jie Cao,Tianyu Jiang
机构: University of Cincinnati (辛辛那提大学); University of Oklahoma (俄克拉荷马大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate whether large language models encode latent knowledge of frame semantics, focusing on frame identification, a core challenge in frame semantic parsing that involves selecting the appropriate semantic frame for a target word in context. Using the FrameNet lexical resource, we evaluate models under prompt-based inference and observe that they can perform frame identification effectively even without explicit supervision. To assess the impact of task-specific training, we fine-tune the model on FrameNet data, which substantially improves in-domain accuracy while generalizing well to out-of-domain benchmarks. Further analysis shows that the models can generate semantically coherent frame definitions, highlighting the model’s internalized understanding of frame semantics.
zh

[NLP-66] Cognitive Load Limits in Large Language Models : Benchmarking Multi-Hop Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在静态基准测试中表现优异,但在动态、信息密集环境中的推理脆弱性问题。其核心挑战在于,当前对模型在认知负荷下的计算限制理解不足,导致难以准确评估其实际应用中的可靠性与安全性。解决方案的关键在于提出一种形式化的“计算认知负荷理论”,识别出两个关键机制:任务无关信息的干扰(Context Saturation,即上下文饱和)和任务切换引起的注意力残留(Attentional Residue),并设计了去混淆的基准测试工具——交错认知评估(Interleaved Cognitive Evaluation, ICE),通过系统操纵这些负荷因素,在多跳推理任务中量化模型性能变化。实验结果表明,小规模开源模型在高内在负荷下完全失效,而先进模型如Gemini-2.0-Flash-001虽具部分鲁棒性,仍随负荷显著退化,验证了认知负荷是推理失败的重要诱因,为未来AI系统的动态压力测试提供了理论依据与方法论支持。

链接: https://arxiv.org/abs/2509.19517
作者: Sai Teja Reddy Adapala
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ( \beta = -0.003 per % load, p 0.001 ). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.
zh

[NLP-67] STARQA: A Question Answering Dataset for Complex Analytical Reasoning over Structured Databases EMNLP2025

【速读】: 该论文旨在解决现有文本到SQL(Text-to-SQL)方法在处理复杂分析推理问题时的局限性,这些问题通常涉及聚合计算、时间序列分析或场景理解等高级操作,而传统SQL查询语言的表达能力难以充分支持。解决方案的关键在于提出STARQA数据集——首个由人工构建的专注于复杂分析推理的多领域数据库问答数据集,并引入一种新颖的Text2SQLCode框架:将任务分解为两个阶段,其中SQL负责从数据库中提取原始数据,Python则用于执行更自然的逻辑推理和计算。实验表明,结合SQL与Python的能力显著优于仅使用SQL的方法,但当前主流大语言模型(LLMs)在该数据集上仍面临挑战。

链接: https://arxiv.org/abs/2509.19508
作者: Mounica Maddela,Lingjue Xie,Daniel Preotiuc-Pietro,Mausam
机构: Bloomberg(彭博); Yardi School of Artificial Intelligence, Indian Institute of Technology, Delhi (印度理工学院德里分校雅迪人工智能学院)
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 long paper

点击查看摘要

Abstract:Semantic parsing methods for converting text to SQL queries enable question answering over structured data and can greatly benefit analysts who routinely perform complex analytics on vast data stored in specialized relational databases. Although several benchmarks measure the abilities of text to SQL, the complexity of their questions is inherently limited by the level of expressiveness in query languages and none focus explicitly on questions involving complex analytical reasoning which require operations such as calculations over aggregate analytics, time series analysis or scenario understanding. In this paper, we introduce STARQA, the first public human-created dataset of complex analytical reasoning questions and answers on three specialized-domain databases. In addition to generating SQL directly using LLMs, we evaluate a novel approach (Text2SQLCode) that decomposes the task into a combination of SQL and Python: SQL is responsible for data fetching, and Python more naturally performs reasoning. Our results demonstrate that identifying and combining the abilities of SQL and Python is beneficial compared to using SQL alone, yet the dataset still remains quite challenging for the existing state-of-the-art LLMs.
zh

[NLP-68] A Pipeline to Assess Merging Methods via Behavior and Internals

【速读】: 该论文旨在解决当前模型合并(model merging)研究中仅从行为层面评估合并后模型性能的局限性,试图建立行为与内部表征之间的联系,从而更全面地理解合并方法对语言模型能力的影响。其解决方案的关键在于提出了一种新颖的评估流水线:首先将多个父模型(如指令微调后的数学和代码适配模型)进行合并,随后通过下游任务表现(如MMLU)和内部编码的语言能力(特别是形态学与句法信息)两个维度对合并模型进行系统评估。结果表明,虽然合并模型的行为表现通常介于双亲模型之间,但其内部语言知识编码能力可能超越父模型,且行为表现与内部评估之间存在弱相关性,凸显了现有评估方式的不足,强调需采用多维综合评价体系以准确揭示合并方法的真实潜力与可靠性。

链接: https://arxiv.org/abs/2509.19476
作者: Yutaro Sigris,Andreas Waldis
机构: Lucerne University of Applied Sciences and Arts (卢塞恩应用科学与艺术大学); Technical University of Darmstadt (达姆施塔特工业大学)
类目: Computation and Language (cs.CL)
备注: BlackboxNLP

点击查看摘要

Abstract:Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and the internal encoded linguistic competence. We showcase this pipeline by assessing the merging of instruction fine-tuned with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impacts behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena, particularly in morphology and syntax, can surpass the parent models. Moreover, we find weak ranking correlation between this behavior and internal evaluation. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging methods to gain a faithful understanding of their capabilities and reliability, beyond potential superficial behavioral advances.
zh

[NLP-69] How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在引入领域知识(domain knowledge)进行优化时面临的“记忆崩溃”(memory collapse)问题,即过度注入领域数据会导致模型遗忘已习得的通用知识,从而引发性能下降甚至幻觉。解决方案的关键在于识别并量化“关键崩溃点”(critical collapse point),即每个模型在知识注入量达到某一阈值后其知识保留能力骤降的现象,并发现该崩溃点与模型规模呈一致的缩放关系。基于此,作者提出了一种知识注入缩放律(knowledge infusion scaling law),通过分析较小模型的崩溃点来预测较大模型的最佳领域知识注入量,从而实现高效且稳定的领域知识融合。

链接: https://arxiv.org/abs/2509.19371
作者: Kangtao Lv,Haibin Chen,Yujin Yuan,Langming Liu,Shilei Liu,Yongwei Wang,Wenbo Su,Bo Zheng
机构: Zhejiang University (浙江大学); Taobao & Tmall Group of Alibaba (淘宝与天猫集团); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucination. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations, i.e. 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model’s size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pertaining token budgets validate both the effectiveness and generalizability of our scaling law.
zh

[NLP-70] Meow: End-to-End Outline Writing for Automatic Academic Survey

【速读】: 该论文旨在解决自动文献综述生成中大纲撰写(outline writing)缺乏深度理解与细粒度风格控制的问题。现有方法将大纲写作视为流水线中的简单步骤,依赖模板生成结构化内容,导致产出的综述大纲在主题把握和写作风格上不够精准。解决方案的关键在于提出 Meow 框架——一个基于元数据驱动的大纲生成方法,首次将大纲写作建模为端到端任务,直接从论文元数据(如标题、摘要、关键词等)生成层次化的结构化大纲,并通过监督微调与强化学习相结合的两阶段训练策略提升结构保真度(structural fidelity)和风格一致性(stylistic coherence)。

链接: https://arxiv.org/abs/2509.19370
作者: Zhaoyu Ma,Yuan Shan,Jiahao Zhao,Nan Xu,Lei Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As academic paper publication numbers grow exponentially, conducting in-depth surveys with LLMs automatically has become an inevitable trend. Outline writing, which aims to systematically organize related works, is critical for automated survey generation. Yet existing automatic survey methods treat outline writing as mere workflow steps in the overall pipeline. Such template-based workflows produce outlines that lack in-depth understanding of the survey topic and fine-grained styles. To address these limitations, we propose Meow, the first metadata-driven outline writing framework that produces organized and faithful outlines efficiently. Specifically, we first formulate outline writing as an end-to-end task that generates hierarchical structured outlines from paper metadata. We then curate a high-quality dataset of surveys from arXiv, bioRxiv, and medRxiv, and establish systematic evaluation metrics for outline quality assessment. Finally, we employ a two-stage training approach combining supervised fine-tuning and reinforcement learning. Our 8B reasoning model demonstrates strong performance with high structural fidelity and stylistic coherence.
zh

[NLP-71] SLM-Based Agent ic AI with P-C-G: Optimized for Korean Tool Use

【速读】: 该论文旨在解决韩国语场景下大语言模型(Large Language Model, LLM)在工具调用(tool use)过程中因频繁进行韩语到英语的代码切换而导致的执行失败问题。解决方案的关键在于提出一种角色分工明确的小规模语言模型(Small-scale Language Model, SLM)代理架构——Planner-Caller-Generator(P-C-G),其中规划器(Planner)生成初始计划并支持有限的动态重规划,调用器(Caller)通过联合模式与值验证返回标准化调用对象,生成器(Generator)整合工具输出以生成最终答案;同时引入“韩语优先”(Korean-first)的价值策略,显著降低因跨语言转换引发的错误,从而在保持低延迟的同时提升工具使用准确率和端到端质量。

链接: https://arxiv.org/abs/2509.19369
作者: Changhyun Jeon,Jinhee Park,Jungwoo Choi,Keonwoo Kim,Jisu Kim,Minji Hong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a small-scale language model (SLM) based agent architecture, Planner-Caller-Generator (P-C-G), optimized for Korean tool use. P-C-G separates planning, calling, and generation by role: the Planner produces an initial batch plan with limited on-demand replanning; the Caller returns a normalized call object after joint schema-value validation; and the Generator integrates tool outputs to produce the final answer. We apply a Korean-first value policy to reduce execution failures caused by frequent Korean-to-English code switching in Korean settings. Evaluation assumes Korean queries and Korean tool/parameter specifications; it covers single-chain, multi-chain, missing-parameters, and missing-functions scenarios, and is conducted via an LLM-as-a-Judge protocol averaged over five runs under a unified I/O interface. Results show that P-C-G delivers competitive tool-use accuracy and end-to-end quality while reducing tokens and maintaining acceptable latency, indicating that role-specialized SLMs are a cost-effective alternative for Korean tool-use agents.
zh

[NLP-72] Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自回归生成过程中推理成本高昂的问题,尤其针对基于早期退出的自推测解码(Early-Exit Based Self-Speculative Decoding, EESD)方法在实际应用中难以实现预期加速的问题。研究表明,EESD仅在大部分草稿 token 被接受时才有效,否则草稿阶段的计算开销可能抵消加速收益,甚至导致负向加速。为解决此问题,作者提出流水线并行自推测解码(Pipeline-Parallel Self-Speculative Decoding, PPSD),其关键创新在于:一是将模型层配置为流水线结构,使早期退出(草稿)计算与剩余层(验证)计算重叠执行;二是按 token 级别交错进行草稿与验证操作——当 LLM 在最后一层验证当前 token 时,早期退出路径同时草稿下一个 token,从而实现“边草稿边验证”的机制,最大化硬件利用率并实现高效的在线验证。

链接: https://arxiv.org/abs/2509.19368
作者: Ruanjun Li,Ziheng Liu,Yuanming Shi,Jiawei Shao,Chi Zhang,Xuelong Li
机构: TeleAI; ShanghaiTech University (上海科技大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. However, in practice, many approaches struggle to achieve the expected acceleration in such draft-then-verify paradigm even with a well-aligned early-exit head and selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM. Otherwise, the draft cost may overcome the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. We configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. We interleave drafting and verification per token. While the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. Such a verify-while-draft scheme keeps all units busy and validates tokens on-the-fly analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of 2.01x~3.81x, which gains almost the optimal acceleration at the fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.
zh

[NLP-73] LLM -Assisted Topic Reduction for BERTopic on Social Media Data KDD2025 ECML

【速读】: 该论文旨在解决BERTopic在处理社交媒体文本时因数据噪声大、稀疏性强而导致生成过多重叠主题的问题,同时克服现有基于大语言模型(Large Language Models, LLMs)的端到端主题建模方法计算开销高、难以扩展至大数据场景的局限。其解决方案的关键在于提出一种两阶段框架:首先利用BERTopic生成初始主题及其语义表示,随后引入大语言模型对这些主题进行迭代式的语义相似性识别与合并,从而实现主题数量的有效压缩和质量提升,兼顾多样性与一致性,且在不同语言模型和Twitter数据集上表现出良好的适应性与性能优势。

链接: https://arxiv.org/abs/2509.19365
作者: Wannes Janssens,Matthias Bogaert,Dirk Van den Poel
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 8 figures. To be published in the Post-Workshop proceedings of the ECML PKDD 2025 Conference

点击查看摘要

Abstract:The BERTopic framework leverages transformer embeddings and hierarchical clustering to extract latent topics from unstructured text corpora. While effective, it often struggles with social media data, which tends to be noisy and sparse, resulting in an excessive number of overlapping topics. Recent work explored the use of large language models for end-to-end topic modelling. However, these approaches typically require significant computational overhead, limiting their scalability in big data contexts. In this work, we propose a framework that combines BERTopic for topic generation with large language models for topic reduction. The method first generates an initial set of topics and constructs a representation for each. These representations are then provided as input to the language model, which iteratively identifies and merges semantically similar topics. We evaluate the approach across three Twitter/X datasets and four different language models. Our method outperforms the baseline approach in enhancing topic diversity and, in many cases, coherence, with some sensitivity to dataset characteristics and initial parameter selection.
zh

[NLP-74] he Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior

【速读】: 该论文试图解决的问题是:传统的离线评估方法(offline evaluations)无法准确反映语言模型在实际应用中的行为,因为这些评估忽略了个性化(personalization)对模型输出的影响。具体而言,相同的问题在不同用户会话中可能因上下文状态的不同而产生显著差异的响应,而标准的离线评估通常假设每次推理都是独立且无状态的。解决方案的关键在于通过实地评估(field evaluations)来验证这一现象——研究者让800名真实用户在使用ChatGPT和Gemini时,向其聊天界面提出基准问题和其他问题,从而对比离线评估与实际使用场景下的模型行为差异,提供了实证证据支持个性化对模型输出的重要影响。

链接: https://arxiv.org/abs/2509.19364
作者: Angelina Wang,Daniel E. Ho,Sanmi Koyejo
机构: Cornell Tech; Stanford University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: forthcoming in Patterns

点击查看摘要

Abstract:Standard offline evaluations for language models – a series of independent, state-less inferences made by models – fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, identical benchmark questions to the same language model can produce markedly different responses when prompted to a state-less system, in one user’s chat session, or in a different user’s chat session. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other provided questions to their chat interfaces.
zh

[NLP-75] Semantic Representation Attack against Aligned Large Language Models

【速读】: 该论文旨在解决当前针对对齐大语言模型(Large Language Models, LLMs)的对抗攻击方法中存在的局限性问题,即现有方法通常依赖于诱导模型输出特定的肯定文本模式(如“Sure, here is…”),导致攻击成功率低、提示(prompt)不自然且计算成本高。其解决方案的关键在于提出一种全新的语义表示攻击(Semantic Representation Attack)范式,该范式不再局限于精确匹配文本字符串,而是通过探索包含等效有害语义的不同响应空间来构造对抗性提示,从而在保持提示自然性和高效性的同时显著提升攻击成功率。作者进一步设计了语义表示启发式搜索算法(Semantic Representation Heuristic Search),在增量扩展过程中维持可解释性,实现了对多个主流LLM的高成功率攻击(平均89.41%,部分模型达100%),并提供了严格的理论保障。

链接: https://arxiv.org/abs/2509.19360
作者: Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Shaohui Mei,Lap-Pui Chau
机构: The Hong Kong Polytechnic University (香港理工大学); Northwestern Polytechnical University (西北工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as ``Sure, here is…‘’, suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.19360 [cs.CL] (or arXiv:2509.19360v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.19360 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Jiawei Lian [view email] [v1] Thu, 18 Sep 2025 15:06:46 UTC (254 KB) Full-text links: Access Paper: View a PDF of the paper titled Semantic Representation Attack against Aligned Large Language Models, by Jiawei Lian and 5 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CL prev | next new | recent | 2025-09 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[NLP-76] Benchmarking and Improving LLM Robustness for Personalized Generation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在个性化响应中缺乏事实准确性与用户偏好一致性并重的评估标准问题,即现有评价体系往往只关注响应是否符合用户偏好,而忽视了事实正确性这一关键维度。为实现更可靠的个性化部署,作者提出“鲁棒性”(robustness)作为衡量标准,定义其为模型在保持事实准确的同时满足用户偏好的能力。解决方案的关键在于引入PERG框架及其配套数据集PERGData,用于系统化评估LLM的鲁棒性,并进一步提出Pref-Aligner方法——一种两阶段优化策略,通过显式对齐用户偏好与事实约束,在多个模型上平均提升鲁棒性达25%,从而有效缓解个性化过程中因偏好调整导致的事实错误问题。

链接: https://arxiv.org/abs/2509.19358
作者: Chimaobi Okite,Naihao Deng,Kiran Bodipati,Huaidian Hou,Joyce Chai,Rada Mihalcea
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: First draft. First camera-ready version

点击查看摘要

Abstract:Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness in LLMs, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B-scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.
zh

[NLP-77] RoadMind: Towards a Geospatial AI Expert for Disaster Response

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在地理空间数据推理能力上的不足,尤其是在道路网络、距离和方向等关键空间信息理解方面的局限性,这在灾害应急场景中尤为关键。解决方案的核心在于提出一个名为RoadMind的自监督框架,通过从OpenStreetMap(OSM)提取结构化道路基础设施数据,并将其转化为适配空间任务的多格式监督信号,进而利用QLoRA适配器与4-bit量化模型对LLMs进行预训练和微调,从而显著提升模型在真实世界灾害高发城市(如洛杉矶、基督城和马尼拉)中的空间推理性能。

链接: https://arxiv.org/abs/2509.19354
作者: Ahmed El Fekih Zguir,Ferda Ofli,Muhammad Imran
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive performance across a range of natural language tasks, but remain limited in their ability to reason about geospatial data, particularly road networks, distances, and directions. This gap poses challenges in disaster scenarios, where spatial understanding is critical for tasks such as evacuation planning and resource allocation. In this work, we present RoadMind, a self-supervised framework that enhances the geospatial reasoning capabilities of LLMs using structured data from OpenStreetMap (OSM). Our automated pipeline extracts road infrastructure data for a given city and converts it into multiple supervision formats tailored to key spatial tasks. We pretrain and fine-tune LLMs on these representations using QLoRA adapters and 4-bit quantized models. We evaluate our approach on three disaster-prone cities with varying global representation, Los Angeles, Christchurch, and Manila, across tasks such as road segment identification, nearest road retrieval, and distance/direction estimation. Our results show that models trained via RoadMind significantly outperform strong baselines, including state-of-the-art LLMs equipped with advanced prompt engineering. This demonstrates the potential of structured geospatial data to enhance language models with robust spatial reasoning, enabling more effective offline AI systems for disaster response.
zh

[NLP-78] riSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities

【速读】: 该论文旨在解决多模态谣言检测中因模态缺失(incomplete modalities)导致的准确性下降问题。现有方法通常依赖完整的多模态训练数据学习联合表示,难以应对现实场景中常见的模态缺失情况。其解决方案的关键在于提出一种分层软提示模型 TriSPrompt,通过引入三种类型的提示机制:模态感知提示(modality-aware, MA)用于捕获特定模态的异质信息与可用数据的同质特征以辅助模态恢复;模态缺失提示(modality-missing, MM)建模不完整数据中的缺失状态,提升模型对缺失信息的适应能力;以及跨视角提示(mutual-views, MV)学习主观(文本和图像)与客观(评论)视角之间的关系,从而增强谣言检测效果。实验表明,该方法在三个真实世界基准上相较当前最优方法实现了超过13%的准确率提升。

链接: https://arxiv.org/abs/2509.19352
作者: Jiajun Chen,Yangyang Wu,Xiaoye Miao,Mengying Zhu,Meng Xi
机构: Center for Data Science, Zhejiang University (浙江大学数据科学中心); School of Software Technology, Zhejiang University (浙江大学软件技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread presence of incomplete modalities in multimodal data poses a significant challenge to achieving accurate rumor detection. Existing multimodal rumor detection methods primarily focus on learning joint modality representations from \emphcomplete multimodal training data, rendering them ineffective in addressing the common occurrence of \emphmissing modalities in real-world scenarios. In this paper, we propose a hierarchical soft prompt model \textsfTriSPrompt, which integrates three types of prompts, \textiti.e., \emphmodality-aware (MA) prompt, \emphmodality-missing (MM) prompt, and \emphmutual-views (MV) prompt, to effectively detect rumors in incomplete multimodal data. The MA prompt captures both heterogeneous information from specific modalities and homogeneous features from available data, aiding in modality recovery. The MM prompt models missing states in incomplete data, enhancing the model’s adaptability to missing information. The MV prompt learns relationships between subjective (\textiti.e., text and image) and objective (\textiti.e., comments) perspectives, effectively detecting rumors. Extensive experiments on three real-world benchmarks demonstrate that \textsfTriSPrompt achieves an accuracy gain of over 13% compared to state-of-the-art methods. The codes and datasets are available at https: //anonymous.this http URL.
zh

[NLP-79] ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的代码进化方法在科学发现中面临的两大关键问题:一是样本效率低下,通常需要数千次采样才能找到有效解;二是方法封闭源码,限制了广泛采用与扩展。解决方案的核心在于提出ShinkaEvolve框架,其关键创新包括:(1)一种平衡探索与利用的父代采样技术,提升搜索方向的合理性;(2)基于代码新颖性的拒绝采样策略,高效拓展搜索空间;(3)基于多臂赌博机(bandit)的LLM集成选择机制,优化模型组合以增强性能。这些改进使ShinkaEvolve在多项任务中实现显著的样本效率提升和解质量优化,例如仅用150次采样即发现新的最优圆排列解,并成功应用于数学推理、编程竞赛和优化损失函数设计等场景。

链接: https://arxiv.org/abs/2509.19349
作者: Robert Tjarko Lange,Yuki Imajuku,Edoardo Cetin
机构: Sakana AI
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 52 pages, 14 figures

点击查看摘要

Abstract:We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.
zh

[NLP-80] Characterizing Knowledge Graph Tasks in LLM Benchmarks Using Cognitive Complexity Frameworks

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在知识图谱(Knowledge Graphs, KGs)相关任务中评估标准过于单一的问题,即过度聚焦于准确性和输出正确性,而忽视了任务本身的认知复杂性。其解决方案的关键在于引入认知心理学中的三个复杂性框架,作为补充性的任务表征方法,从而更全面地刻画LLM在KG任务中的表现;通过将该方法应用于LLM-KG-Bench框架,论文揭示了不同任务的价值分布特征,识别出被低估的认知需求,并推动基准测试任务向更具解释力和多样性的方向发展。

链接: https://arxiv.org/abs/2509.19347
作者: Sara Todorovikj,Lars-Peter Meyer,Michael Martin
机构: Chemnitz University of Technology (德国开姆尼茨工业大学); InfAI (莱比锡InfAI)
类目: Computation and Language (cs.CL)
备注: peer reviewed publication at SEMANTiCS 2025 Poster Track

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used for tasks involving Knowledge Graphs (KGs), whose evaluation typically focuses on accuracy and output correctness. We propose a complementary task characterization approach using three complexity frameworks from cognitive psychology. Applying this to the LLM-KG-Bench framework, we highlight value distributions, identify underrepresented demands and motivate richer interpretation and diversity for benchmark evaluation tasks.
zh

[NLP-81] Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches

【速读】: 该论文旨在解决如何更准确地衡量基于大语言模型(Large Language Model, LLM)的应用程序用户满意度问题,尤其是针对ChatGPT与DeepSeek等主流LLM应用在移动平台上的用户反馈分析。传统研究往往局限于词典基础的情感分析(lexicon-based sentiment analysis)或独立使用深度学习模型,缺乏融合多视角的系统性评估方法。其解决方案的关键在于提出一种双重视角分析框架:一方面结合TextBlob等词典法进行情感极性判断,另一方面引入卷积神经网络(Convolutional Neural Networks, CNN)和双向长短期记忆网络(Bidirectional Long Short Term Memory, Bi-LSTM)两类深度学习模型进行分类建模,并通过数据预处理与过采样策略实现类别平衡,最终在1,700条平衡测试集上验证了CNN模型在准确率(96.41%)及负向情感识别性能上的显著优势,为LLM类应用的情感分析提供了新方法论标准。

链接: https://arxiv.org/abs/2509.19346
作者: Maryam Mahdi Alhusseini,Mohammad-Reza Feizi-Derakhshi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 21 figures

点击查看摘要

Abstract:This study presents a novel dual-perspective approach to analyzing user reviews for ChatGPT and DeepSeek on the Google Play Store, integrating lexicon-based sentiment analysis (TextBlob) with deep learning classification models, including Convolutional Neural Networks (CNN) and Bidirectional Long Short Term Memory (Bi LSTM) Networks. Unlike prior research, which focuses on either lexicon-based strategies or predictive deep learning models in isolation, this study conducts an extensive investigation into user satisfaction with Large Language Model (LLM) based applications. A Dataset of 4,000 authentic user reviews was collected, which were carefully preprocessed and subjected to oversampling to achieve balanced classes. The balanced test set of 1,700 Reviews were used for model testing. Results from the experiments reveal that ChatGPT received significantly more positive sentiment than DeepSeek. Furthermore, deep learning based classification demonstrated superior performance over lexicon analysis, with CNN outperforming Bi-LSTM by achieving 96.41 percent accuracy and near perfect classification of negative reviews, alongside high F1-scores for neutral and positive sentiments. This research sets a new methodological standard for measuring sentiment in LLM-based applications and provides practical insights for developers and researchers seeking to improve user-centric AI system design.
zh

[NLP-82] SCORE: A Semantic Evaluation Framework for Generative Document Parsing

【速读】: 该论文旨在解决多模态生成式文档解析系统(multi-modal generative document parsing systems)在传统评估指标下存在的偏差问题:由于这些系统常输出语义正确但结构不同的结果,而CER、WER、IoU或TEDS等标准指标会将这种结构性差异误判为错误,从而对有效解释进行惩罚并掩盖系统真实行为。其解决方案的关键在于提出SCORE(Structural and COntent Robust Evaluation)框架,该框架通过四个核心机制实现鲁棒且语义导向的评估:(i) 调整后的编辑距离以增强内容保真度的鲁棒性,(ii) 词级别诊断区分幻觉与遗漏,(iii) 带空间容差和语义对齐的表格评估,以及(iv) 层次感知的一致性检查。这一方法在1,114页跨数据集测试中揭示了传统指标忽略的性能模式,并在约2-5%结构模糊的表格页面上避免了高达12-25%的误罚,恢复了不同但合法解释之间的等价性,同时无需依赖目标检测流水线即可复现传统分数,验证了生成式解析本身即可支撑全面评估。

链接: https://arxiv.org/abs/2509.19345
作者: Renyu Li,Antonio Jimeno Yepes,Yao You,Kamil Pluciński,Maximilian Operlejn,Crag Wolfe
机构: Unstructured Technologies (Unstructured Technologies)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-modal generative document parsing systems challenge traditional evaluation: unlike deterministic OCR or layout models, they often produce semantically correct yet structurally divergent outputs. Conventional metrics-CER, WER, IoU, or TEDS-misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE (Structural and COntent Robust Evaluation), an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks. Together, these dimensions enable evaluation that embraces representational diversity while enforcing semantic rigor. Across 1,114 pages spanning a holistic benchmark and a field dataset, SCORE consistently revealed cross-dataset performance patterns missed by standard metrics. In 2-5% of pages with ambiguous table structures, traditional metrics penalized systems by 12-25% on average, leading to distorted rankings. SCORE corrected these cases, recovering equivalence between alternative but valid interpretations. Moreover, by normalizing generative outputs into a format-agnostic representation, SCORE reproduces traditional scores (e.g., table F1 up to 0.93) without requiring object-detection pipelines, demonstrating that generative parsing alone suffices for comprehensive evaluation. By exposing how interpretive diversity impacts evaluation outcomes and providing multi-dimensional, interpretable diagnostics, SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.19345 [cs.CL] (or arXiv:2509.19345v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.19345 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Antonio Jose Jimeno Yepes [view email] [v1] Tue, 16 Sep 2025 16:06:19 UTC (286 KB)
zh

[NLP-83] Performance of Large Language Models in Answering Critical Care Medicine Questions

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在重症监护医学(Critical Care Medicine, CCM)这一专业领域中表现尚不明确的问题。现有研究多聚焦于医学学生水平的通用问题,而对LLMs在CCM细分领域的准确性评估不足。解决方案的关键在于系统性地测试Meta-Llama 3.1系列模型(8B和70B参数版本)在871道CCM专业题上的表现,结果表明70B版本模型平均准确率达60%,显著优于8B版本(提升30%),同时揭示了不同亚专科领域间性能差异显著,如研究类题目准确率最高(68.4%),肾病类最低(47.9%),凸显了未来需针对各亚专科领域进行更广泛优化与训练以提升整体泛化能力。

链接: https://arxiv.org/abs/2509.19344
作者: Mahmoud Alwakeel,Aditya Nagori,An-Kwok Ian Wong,Neal Chaisson,Vijay Krishnamoorthy,Rishikesan Kamaleswaran
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.
zh

[NLP-84] Part-of-speech tagging for Nagamese Language using CRF

【速读】: 该论文旨在解决纳加语(Nagamese)这一资源匮乏语言的词性标注(Part-of-Speech Tagging)问题,这是自然语言处理(Natural Language Processing, NLP)中的关键任务。由于此前未有针对纳加语的词性标注研究,本文首次构建了一个包含16,112个词元的标注语料库,并采用条件随机场(Conditional Random Fields, CRF)这一机器学习方法进行模型训练与评估。实验结果表明,该方案在整体标注准确率上达到85.70%,精确率(precision)为86%,召回率(recall)为85%,F1分数为85%,验证了CRF在低资源语言词性标注任务中的有效性。

链接: https://arxiv.org/abs/2509.19343
作者: Alovi N Shohe,Chonglio Khiamungam,Teisovi Angami
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP) for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified Creole language developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work in part-of-speech-tagging has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese Language. The aim of this work is to identify the part-of-speech for a given sentence in the Nagamese language. An annotated corpus of 16,112 tokens is created and applied machine learning technique known as Conditional Random Fields (CRF). Using CRF, an overall tagging accuracy of 85.70%; precision, recall of 86%, and f1-score of 85% is achieved. Keywords. Nagamese, NLP, part-of-speech, machine learning, CRF. Comments: 8 pages Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.19343 [cs.CL] (or arXiv:2509.19343v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.19343 Focus to learn more arXiv-issued DOI via DataCite
zh

[NLP-85] Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation EMNLP2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成内容时存在的认知错位(cognitive misalignment)问题,即模型输出的内容在知识复杂度和呈现风格上难以适配不同用户认知能力,导致信息传递效率低下。解决方案的关键在于提出一种通用的认知层级对齐框架(Cognitive-Level Alignment Framework, CLAF),其核心由三部分组成:基于分层知识图谱的能力感知检索模块、以布卢姆分类学(Bloom’s taxonomy)和偏好学习为指导的风格优化模块,以及确保内容一致性与相关性的知识可控生成组件。该框架通过多维度对齐机制,使生成内容在知识深度和表达方式上均能匹配用户认知水平,从而提升生成内容的适应性与信息有效性。

链接: https://arxiv.org/abs/2509.19336
作者: Qingsong Wang,Tao Wu,Wang Lin,Yueying Feng,Gongsheng Yuan,Chang Yao,Jingyuan Chen
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of EMNLP 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong performance in open-ended generation tasks. However, they often struggle to adapt content to users with differing cognitive capacities, leading to a phenomenon we term cognitive misalignment. This issue arises in two forms: knowledge-level misalignment, where content is too complex or too simplistic relative to user understanding, and presentation-style misalignment, where the structure or tone hinders effective comprehension. To address these challenges, we propose the Cognitive-Level Alignment Framework (CLAF), a general-purpose generation framework that aligns both knowledge complexity and presentation style with user cognition. CLAF integrates a capability-aware retrieval module based on a hierarchical knowledge graph and a style optimization module guided by Bloom’s taxonomy and preference learning. Additionally, a knowledge-controllable generation component ensures consistency and relevance throughout the output. To support training and evaluation, we construct SCALE, a cognitively annotated dataset containing responses at multiple comprehension levels per query. Empirical results show that CLAF enhances the adaptability and informativeness of LLM outputs across a range of user profiles, offering a robust solution to cognitive-level alignment in real-world applications.
zh

[NLP-86] Pluralistic Off-policy Evaluation and Alignment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个性化偏好对齐中因人类偏好多样性而带来的评估与优化难题,特别是现有离线偏好对齐数据集通常来自与目标模型策略差异显著的来源,且传统离线策略评估(Off-Policy Evaluation, OPE)方法仅关注整体效用而忽视偏好多样性。解决方案的关键在于提出首个面向离线场景的多元偏好评估与对齐框架——Pluralistic Off-Policy Evaluation (POPE),其核心创新包括:设计一个统一奖励函数,融合基于人类偏好信号(如点赞或相关性评分)的协作效用项和受熵覆盖度启发的多样性项,以体现多元偏好对齐;并推导出可分解的逆倾向评分(Inverse Propensity Scoring, IPS)估计器,分别评估响应的相关性和多样性,理论证明该分解估计器能提供方差下界,从而支持基于离线评估值函数的离线优化,有效提升生成内容的多元适配性同时保持下游任务通用能力。

链接: https://arxiv.org/abs/2509.19333
作者: Chengkai Huang,Junda Wu,Zhouhang Xie,Yu Xia,Rui Wang,Tong Yu,Subrata Mitra,Julian McAuley,Lina Yao
机构: The University of New South Wales (新南威尔士大学); UC San Diego (加州大学圣地亚哥分校); Adobe Research (Adobe 研究院); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61实验室); Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Personalized preference alignment for LLMs with diverse human preferences requires evaluation and alignment methods that capture pluralism. Most existing preference alignment datasets are logged under policies that differ substantially from the evaluated LLMs, and existing off-policy estimators focus solely on overall utility while ignoring preference pluralism. Extending Off-Policy Evaluation (OPE) to pluralistic preference alignment, therefore, remains an open question. Thus, we propose the Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines (1) a collaborative utility component derived from human preference signals (e.g., upvotes or relevance scores) and (2) a diversity component inspired by entropy-based coverage measures, together reflecting pluralistic alignment. Furthermore, to estimate this reward from logged interactions, we derive decomposable inverse propensity scoring (IPS) estimators that separately evaluate relevance and diversity. Theoretically, we prove that our decomposed IPS estimators establish a lower bound on their variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance pluralistic alignment. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models’ general capabilities on downstream tasks
zh

[NLP-87] Quantifying Compositionality of Classic and State-of-the-Art Embeddings EMNLP2025

【速读】: 该论文试图解决语言模型在面对新表达式时如何正确泛化的问题,核心在于模型是否能够合理利用组合语义(compositional meaning)。传统静态词嵌入(如Word2vec)对组合性做出了过度强的假设,而当前最先进的生成式Transformer模型和图模型则缺乏对上下文导致语义变化的有效约束。解决方案的关键在于提出一种两步式的广义评估框架:首先通过典型相关分析(canonical correlation analysis)量化已知实体属性与其嵌入之间的线性关系;其次通过重建未见过的属性组合嵌入,并以L2损失、余弦相似度和检索准确率等指标评估加法泛化能力。该方法不仅能捕捉组合性的强弱变化,还能识别线性组合失效的失败案例,从而系统地追踪不同训练阶段与模型层级中的组合性演化趋势。

链接: https://arxiv.org/abs/2509.19332
作者: Zhijin Guo(1 and 2),Chenhao Xue(1),Zhaozhen Xu(2),Hongbo Bo(2),Yuxuan Ye(2),Janet B. Pierrehumbert(1),Martha Lewis(3) ((1) University of Oxford, (2) University of Bristol, (3) University of Amsterdam)
机构: University of Oxford (牛津大学); University of Bristol (布里斯托大学); University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of the Association for Computational Linguistics: EMNLP 2025

点击查看摘要

Abstract:For language models to generalize correctly to novel expressions, it is critical that they exploit access compositional meanings when this is justified. Even if we don’t know what a “pelp” is, we can use our knowledge of numbers to understand that “ten pelps” makes more pelps than “two pelps”. Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. The SOTA generative, transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify the additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. Sentences, knowledge graphs, and word embeddings are evaluated and tracked the compositionality across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at this https URL.
zh

[NLP-88] How Model Size Temperature and Prompt Style Affect LLM -Human Assessment Score Alignment

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在评估临床推理能力时,其内部一致性、模型间一致性以及与人类评分的一致性问题。研究发现,模型规模是影响LLM与人类评分对齐程度的关键因素,表明在部署用于临床推理评估的LLM时,需系统性地检验其在多个层级上的对齐表现,以确保评估结果的可靠性与有效性。

链接: https://arxiv.org/abs/2509.19329
作者: Julie Jung,Max Lu,Sina Chole Benker,Dogus Darici
机构: Harvard Graduate School of Education (哈佛大学教育研究生院); Munster University (明斯特大学); Institute of Anatomy and Neurobiology, University of Münster (明斯特大学解剖学与神经生物学研究所)
类目: Computation and Language (cs.CL); Methodology (stat.ME)
备注: 9 pages, 4 figures, accepted at NCME AIME 2025

点击查看摘要

Abstract:We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment within itself, between models, and with human in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Study highlights the importance of checking alignments across multiple levels.
zh

[NLP-89] A systematic review of trial-matching pipelines using large language models

【速读】: 该论文旨在解决临床试验匹配过程中因人工操作导致的效率低下与错误率高的问题,尤其是在肿瘤学领域中快速准确地将患者与合适的临床试验进行匹配。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)构建自动化匹配管道,通过自然语言处理技术实现对患者病历和临床试验标准之间的语义理解与匹配。研究发现,GPT-4等先进LLM在匹配精度和 Eligibility Extraction(资格提取)任务上显著优于其他模型,即使在未微调的情况下也表现优异;同时,零样本提示(zero-shot prompting)、基于专有模型的策略、增强检索方法以及在数据隐私敏感场景下对小型开源模型的微调,构成了当前最具潜力的技术路径。然而,模型部署仍面临真实世界数据获取困难、成本控制、幻觉风险、数据泄露及偏倚等问题,需进一步优化标准化评估指标与公平性设计以推动广泛应用。

链接: https://arxiv.org/abs/2509.19327
作者: Braxton A. Morrison(1),Madhumita Sushil(1),Jacob S. Young(1) ((1) University of California, San Francisco)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages, 3 figures

点击查看摘要

Abstract:Matching patients to clinical trial options is critical for identifying novel treatments, especially in oncology. However, manual matching is labor-intensive and error-prone, leading to recruitment delays. Pipelines incorporating large language models (LLMs) offer a promising solution. We conducted a systematic review of studies published between 2020 and 2025 from three academic databases and one preprint server, identifying LLM-based approaches to clinical trial matching. Of 126 unique articles, 31 met inclusion criteria. Reviewed studies focused on matching patient-to-criterion only (n=4), patient-to-trial only (n=10), trial-to-patient only (n=2), binary eligibility classification only (n=1) or combined tasks (n=14). Sixteen used synthetic data; fourteen used real patient data; one used both. Variability in datasets and evaluation metrics limited cross-study comparability. In studies with direct comparisons, the GPT-4 model consistently outperformed other models, even finely-tuned ones, in matching and eligibility extraction, albeit at higher cost. Promising strategies included zero-shot prompting with proprietary LLMs like the GPT-4o model, advanced retrieval methods, and fine-tuning smaller, open-source models for data privacy when incorporation of large models into hospital infrastructure is infeasible. Key challenges include accessing sufficiently large real-world data sets, and deployment-associated challenges such as reducing cost, mitigating risk of hallucinations, data leakage, and bias. This review synthesizes progress in applying LLMs to clinical trial matching, highlighting promising directions and key limitations. Standardized metrics, more realistic test sets, and attention to cost-efficiency and fairness will be critical for broader deployment.
zh

[NLP-90] Unveiling the Merits and Defects of LLM s in Automatic Review Generation for Scientific Papers

【速读】: 该论文旨在解决科学论文投稿量激增背景下,传统同行评审(peer-review)流程面临效率瓶颈的问题,探索大型语言模型(Large Language Models, LLMs)在自动化生成审稿意见中的应用潜力与局限性。其解决方案的关键在于提出一个综合评估框架,融合语义相似度分析与结构化知识图谱指标,系统性地对比LLM生成的审稿意见与人工审稿在描述性内容、批判性推理、上下文关联性和质量敏感性等方面的差异。通过构建涵盖ICLR和NeurIPS多届会议共1,683篇论文及6,495条专家审稿的基准数据集,并使用五种主流LLM进行生成测试,研究发现LLM在捕捉论文核心贡献和方法方面表现良好(如GPT-4o在优质论文 strengths 部分生成实体数比人类多15.74%),但在识别缺陷、提出实质性问题以及根据论文质量动态调整反馈方面显著不足(如GPT-4o在 weaknesses 部分生成实体数比人类少59.42%,且对差论文的节点增长幅度仅5.7%,远低于人类的50%),从而为未来LLM辅助审稿工具的设计提供了实证依据。

链接: https://arxiv.org/abs/2509.19326
作者: Ruochi Li,Haoxuan Zhang,Edward Gehringer,Ting Xiao,Junhua Ding,Haihua Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as short paper at 25th IEEE International Conference on Data Mining

点击查看摘要

Abstract:The surge in scientific submissions has placed increasing strain on the traditional peer-review process, prompting the exploration of large language models (LLMs) for automated review generation. While LLMs demonstrate competence in producing structured and coherent feedback, their capacity for critical reasoning, contextual grounding, and quality sensitivity remains limited. To systematically evaluate these aspects, we propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics to assess LLM-generated reviews against human-written counterparts. We construct a large-scale benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS in multiple years, and generate reviews using five LLMs. Our findings show that LLMs perform well in descriptive and affirmational content, capturing the main contributions and methodologies of the original work, with GPT-4o highlighted as an illustrative example, generating 15.74% more entities than human reviewers in the strengths section of good papers in ICLR 2025. However, they consistently underperform in identifying weaknesses, raising substantive questions, and adjusting feedback based on paper quality. GPT-4o produces 59.42% fewer entities than real reviewers in the weaknesses and increases node count by only 5.7% from good to weak papers, compared to 50% in human reviews. Similar trends are observed across all conferences, years, and models, providing empirical foundations for understanding the merits and defects of LLM-generated reviews and informing the development of future LLM-assisted reviewing tools. Data, code, and more detailed results are publicly available at this https URL.
zh

[NLP-91] How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLM s

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)过程中使用错误数据对大语言模型(Large Language Models, LLMs)性能与安全性造成负面影响的问题,特别是针对GPT-4o模型在金融、编程、医疗和法律等高风险领域中可能出现的“新兴错位”(emergent misalignment)现象。研究表明,即使仅引入10%-25%的错误数据,也会显著降低模型在特定领域的表现,且不会提升道德对齐度;而要恢复较强性能,至少需要50%以上的正确数据比例,但即便如此,其鲁棒性和安全性仍难以达到原始基线模型水平。因此,解决方案的关键在于:必须进行极高质量的数据筛选与治理,或在高风险场景下避免不必要的微调,直接采用具备强鲁棒性和安全性的基础模型。

链接: https://arxiv.org/abs/2509.19325
作者: Jian Ouyang,Arman T,Ge Jin
机构: Invisible Technologies (Invisible Technologies); University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper investigates the impact of incorrect data on the performance and safety of large language models (LLMs), specifically gpt-4o, during supervised fine-tuning (SFT). Although LLMs become increasingly vital across broad domains like finance, coding, law, and health, fine-tuning on incorrect data can lead to “emergent misalignment,” producing harmful or deceptive outputs unrelated to the intended task. We evaluate gpt-4o models fine-tuned with varying ratios (10% to 90% correct) of both obviously and subtly incorrect data across four domains: coding, finance, health, and legal. Our findings show that even modest amounts of incorrect data (10-25%) dramatically degrade domain performance and not moral alignment. A clear threshold of at least 50% correct data is needed for models to consistently recover strong performance, though they rarely match the robustness and safety of the base model, which exhibits near-perfect alignment and zero dangerous completions out-of-the-box. This research emphasizes that the cost of incorrect data is heavy, highlighting the critical need for extremely high-quality data curation or, alternatively, leveraging robust base models without unnecessary fine-tuning for high-stakes applications.
zh

[NLP-92] Magnitude Matters: a Superior Class of Similarity Metrics for Holistic Semantic Understanding AAAI2026

【速读】: 该论文旨在解决高维向量相似度计算中长期存在的两个基准方法的局限性问题:原始点积(raw dot product)无界且对向量范数敏感,而余弦相似度(cosine similarity)则完全忽略向量幅度信息。为此,作者提出了一类无需参数调优、能够感知向量幅度的新颖相似度度量方法,核心解决方案是设计了两种具体函数——重叠相似度(Overlap Similarity, OS)和双曲正切相似度(Hyperbolic Tangent Similarity, HTS),它们在形式上更合理地融合了向量的幅度与方向信息。通过在四种先进句向量模型及八个标准NLP基准任务上的系统评估发现,OS和HTS在需要整体语义理解的任务(如释义检测和推理任务)中显著优于传统方法,且统计上具有显著优势,从而为高维语义匹配提供了更具解释性和实用性的新范式。

链接: https://arxiv.org/abs/2509.19323
作者: V.S. Raghu Parupudi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: submitted to AAAI 2026

点击查看摘要

Abstract:Vector comparison in high dimensions is a fundamental task in NLP, yet it is dominated by two baselines: the raw dot product, which is unbounded and sensitive to vector norms, and the cosine similarity, which discards magnitude information entirely. This paper challenges both standards by proposing and rigorously evaluating a new class of parameter-free, magnitude-aware similarity metrics. I introduce two such functions, Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS), designed to integrate vector magnitude and alignment in a more principled manner. To ensure that my findings are robust and generalizable, I conducted a comprehensive evaluation using four state-of-the-art sentence embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, paraphrase-mpnet-base-v2, and BAAI/bge-large-en-v1.5) across a diverse suite of eight standard NLP benchmarks, including STS-B, SICK, Quora, and PAWS. Using the Wilcoxon signed-rank test for statistical significance, my results are definitive: on the tasks requiring holistic semantic understanding (paraphrase and inference), both OS and HTS provide a statistically significant improvement in Mean Squared Error over both the raw dot product and cosine similarity, regardless of the underlying embedding this http URL, my findings delineate the specific domain of advantage for these metrics: for tasks requiring holistic semantic understanding like paraphrase and inference, my magnitude-aware metrics offer a statistically superior alternative. This significant improvement was not observed on benchmarks designed to test highly nuanced compositional semantics (SICK, STS-B), identifying the challenge of representing compositional text as a distinct and important direction for future work.
zh

[NLP-93] Readme_AI: Dynamic Context Construction for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对用户特定查询时,因缺乏上下文支持而产生不准确或不可靠信息的问题,尤其是当涉及专业数据集或工具时易出现幻觉(hallucination)。其核心解决方案是提出一种可扩展的协议——Readme_AI Model Context Protocol (MCP),允许数据源所有者通过提供结构化元数据文件来动态构建面向LLM的上下文。该协议支持多种数据来源类型(如网页爬取、数据仓库获取、文献下载解析等),并利用用户指定的标签对上下文进行组织与格式化,使LLM能够基于真实、精确的数据源内容进行推理和生成,从而显著提升回答准确性并减少幻觉现象。

链接: https://arxiv.org/abs/2509.19322
作者: Millie Vyas,Timothy Blattner,Alden Dima
机构: Purdue University (普渡大学); National Institute of Standards and Technology (美国国家标准与技术研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite being trained on significant amounts of data, Large Language Models (LLMs) can provide inaccurate or unreliable information in the context of a user’s specific query. Given query-specific context significantly improves the usefulness of its responses. In this paper, we present a specification that can be used to dynamically build context for data sources. The data source owner creates the file containing metadata for LLMs to use when reasoning about dataset-related queries. To demonstrate our proposed specification, we created a prototype Readme_AI Model Context Protocol (MCP) server that retrieves the metadata from the data source and uses it to dynamically build context. Some features that make this specification dynamic are the extensible types that represent crawling web-pages, fetching data from data repositories, downloading and parsing publications, and general text. The context is formatted and grouped using user-specified tags that provide clear contextual information for the LLM to reason about the content. We demonstrate the capabilities of this early prototype by asking the LLM about the NIST-developed Hedgehog library, for which common LLMs often provides inaccurate and irrelevant responses containing hallucinations. With Readme_AI, the LLM receives enough context that it is now able to reason about the library and its use, and even generate code interpolated from examples that were included in the Readme_AI file provided by Hedgehog’s developer. Our primary contribution is a extensible protocol for dynamically grounding LLMs in specialized, owner-provided data, enhancing responses from LLMs and reducing hallucinations. The source code for the Readme_AI tool is posted here: this https URL .
zh

[NLP-94] FHIR-Agent Bench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

【速读】: 该论文旨在解决当前临床大语言模型(LLM)评估基准滞后于医疗数据标准化进程的问题,特别是针对健康水平七快速医疗互操作资源(HL7 FHIR)标准的广泛应用,现有基准缺乏对LLM代理在真实FHIR资源上进行数据检索与推理能力的有效评测。其解决方案的关键在于提出FHIR-AgentBench,一个基于2,931个真实临床问题构建的基准测试集,将问题严格锚定于HL7 FHIR标准的数据模型中,并系统性地评估不同代理框架在数据获取策略(直接调用FHIR API vs. 专用工具)、交互模式(单轮 vs. 多轮对话)和推理方式(自然语言 vs. 代码生成)下的表现,从而揭示复杂FHIR资源访问与语义推理对问答性能的核心影响。

链接: https://arxiv.org/abs/2509.19319
作者: Gyubok Lee,Elea Bach,Eric Yang,Tom Pollard,Alistair Johnson,Edward Choi,Yugang jia,Jong Ha Lee
机构: Korea Advanced Institute of Science & Technology (韩国科学技术院); Verily Life Sciences; Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (this https URL) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
zh

[NLP-95] Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias

【速读】: 该论文旨在解决人格测评中因社会赞许性偏差(social desirability bias)导致的测量误差问题,该偏差会使被试倾向于选择符合社会期望的答案,从而影响量表的效度。解决方案的关键在于利用大语言模型(Large Language Model, LLM)对人格条目进行中性化处理,具体通过GPT-4对国际人格项目池大五人格量表(IPIP-BFM-50)的条目进行重写,以削弱其社会赞许倾向,从而提升人格评估的客观性和准确性。

链接: https://arxiv.org/abs/2509.19314
作者: Sirui Wu,Daijin Yang
机构: University of British Columbia (不列颠哥伦比亚大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted for publication in NCME-AIME 2025

点击查看摘要

Abstract:This study evaluates item neutralization assisted by the large language model (LLM) to reduce social desirability bias in personality assessment. GPT-o3 was used to rewrite the International Personality Item Pool Big Five Measure (IPIP-BFM-50), and 203 participants completed either the original or neutralized form along with the Marlowe-Crowne Social Desirability Scale. The results showed preserved reliability and a five-factor structure, with gains in Conscientiousness and declines in Agreeableness and Openness. The correlations with social desirability decreased for several items, but inconsistently. Configural invariance held, though metric and scalar invariance failed. Findings support AI neutralization as a potential but imperfect bias-reduction method.
zh

[NLP-96] GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)数学能力评估缺乏细粒度、可解释性不足的问题。现有评测方法往往仅提供整体分数,难以揭示模型在不同数学技能维度上的真实表现。为此,作者提出GAUSS(General Assessment of Underlying Structured Skills in Mathematics)基准,其关键在于将数学能力划分为十二个核心技能维度,并归类至知识与理解、问题求解与沟通、元技能与创造力三个领域;通过设计能够隔离特定能力的任务,构建出模型数学能力的精细化、可解释性画像,从而更准确地反映其底层数学智能水平。

链接: https://arxiv.org/abs/2509.18122
作者: Yue Zhang,Jiaxin Zhang,Qiuyu Ren,Tahsin Saffat,Xiaoxuan Liu,Zitong Yang,Banghua Zhu,Yi Ma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 120 pages (including appendix)

点击查看摘要

Abstract:We introduce \textbfGAUSS (\textbfGeneral \textbfAssessment of \textbfUnderlying \textbfStructured \textbfSkills in Mathematics), a benchmark that evaluates LLMs’ mathematical abilities across twelve core skill dimensions, grouped into three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity. By categorizing problems according to cognitive skills and designing tasks that isolate specific abilities, GAUSS constructs comprehensive, fine-grained, and interpretable profiles of models’ mathematical abilities. These profiles faithfully represent their underlying mathematical intelligence. To exemplify how to use the \textscGAUSS benchmark, we have derived the skill profile of \textscGPT-5-thinking, revealing its strengths and weaknesses as well as its differences relative to \textsco4-mini-high, thereby underscoring the value of multidimensional, skill-based evaluation.
zh

[NLP-97] Advancing Speech Summarization in Multi-modal LLM s with Reinforcement Learning

【速读】: 该论文旨在解决开源多模态大语言模型(Multi-modal Large Language Models, MLLMs)在语音摘要任务中性能落后于先进文本型大语言模型(Text-based Large Language Models, LLMs)的问题,从而限制其在实际场景中的部署。解决方案的关键在于提出一种新颖的多阶段强化学习(Reinforcement Learning, RL)训练框架,通过该框架显著提升MLLMs在语音摘要任务上的表现,使其不仅优于强基线模型,还能超越更大规模的MLLMs,并大幅缩小与顶尖文本型LLMs之间的性能差距。

链接: https://arxiv.org/abs/2509.19631
作者: Shaoshi Ling,Gang Liu,Guoli Ye,Jinyu Li
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.
zh

[NLP-98] Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation

【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的语音生成模型在处理离散声学码本(discrete acoustic codes)时面临的并行预测效率与音质保真度之间的权衡问题。由于每个时间步需联合预测多个码本条目,传统并行预测方法因假设码本间独立性而难以保持高质量;为此,论文提出采用分层策略,其关键在于引入局部Transformer(Local Transformer, LT)以建模码本间的时序依赖关系:具体包含两种LT架构——自回归Transformer按顺序生成码本,以及基于MaskGIT的迭代掩码预测Transformer;同时结合帧堆叠(frame stacking)机制,在主Transformer中并行预测多帧内容、由LT逐帧解码码本,从而在不牺牲感知质量的前提下提升推理速度。

链接: https://arxiv.org/abs/2509.19592
作者: Roy Fejgin,Paarth Neekhara,Xuesong Yang,Edresson Casanova,Ryan Langman Jaehyeon Kim,Subhankar Ghosh,Shehzeen Hussain,Jason Li
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.
zh

计算机视觉

[CV-0] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

【速读】:该论文旨在解决视频生成与编辑领域长期存在的碎片化问题,即当前方法受限于架构设计和数据稀缺,难以实现统一建模与高效跨模态操作。其解决方案的关键在于提出EditVerse框架,通过将文本、图像和视频统一表示为token序列,并利用自注意力机制实现上下文学习能力、自然的跨模态知识迁移以及对任意分辨率和时长输入输出的灵活处理;同时构建了包含23.2万视频编辑样本的可扩展数据管道,并结合大规模图像与视频数据进行联合训练,从而显著提升模型在视频编辑任务上的泛化能力与生成质量。

链接: https://arxiv.org/abs/2509.20360
作者: Xuan Ju,Tianyu Wang,Yuqian Zhou,He Zhang,Qing Liu,Nanxuan Zhao,Zhifei Zhang,Yijun Li,Yuanhao Cai,Shaoteng Liu,Daniil Pakhomov,Zhe Lin,Soo Ye Kim,Qiang Xu
机构: Adobe Research; The Chinese University of Hong Kong; Johns Hopkins University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
zh

[CV-1] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation NEURIPS2025

【速读】:该论文旨在解决现有视频生成模型在生成图像到视频过程中缺乏物理合理性(physical plausibility)和三维可控性(3D controllability)的问题。其解决方案的关键在于提出PhysCtrl框架,该框架基于一个生成式物理网络(generative physics network),通过扩散模型(diffusion model)学习四种材料(弹性、沙子、橡皮泥和刚性)的物理动力学分布,条件输入为物理参数和施加力;同时引入一种新颖的时空注意力模块(spatiotemporal attention block),模拟粒子间相互作用并在训练中嵌入物理约束,从而确保生成轨迹具有物理合理性。此方法显著提升了视频生成的视觉保真度与物理一致性。

链接: https://arxiv.org/abs/2509.20358
作者: Chen Wang,Chuhao Chen,Yiming Huang,Zhiyang Dou,Yuan Liu,Jiatao Gu,Lingjie Liu
机构: University of Pennsylvania (宾夕法尼亚大学); MIT (麻省理工学院); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025. This is the preview version; the camera-ready version is still in preparation

点击查看摘要

Abstract:Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: this https URL
zh

[CV-2] Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On CVPR2025

【速读】:该论文旨在解决虚拟试穿(Virtual Try-On, VTON)技术中 pose 控制的难题,即如何在不引入额外参数或模块的前提下,实现产品图像与用户身体姿态的精准对齐,并支持多样化姿态以提升用户体验。其解决方案的关键在于采用纯拼接(pure concatenation)架构,在不增加外部编码器、控制网络或复杂注意力机制的基础上,通过空间拼接姿态数据(pose maps 与骨骼图)来融入姿态条件;实验表明,使用姿态图(pose map)进行拼接能显著提升姿态保留能力和生成结果的真实感,同时结合细粒度掩码与边界框掩码的混合训练策略,增强了模型在不同姿态下对产品融合的灵活性与鲁棒性。

链接: https://arxiv.org/abs/2509.20343
作者: Qi Li,Shuwen Qiu,Julien Han,Xingzi Xu,Mehmet Saygin Seyfioglu,Kee Kiat Koo,Karim Bouyarmane
机构: Amazon(亚马逊); University of California, Los Angeles (加州大学洛杉矶分校); Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to CVPR 2025 and Published at CVPR 2025 AI for Content Creation workshop

点击查看摘要

Abstract:As online shopping continues to grow, the demand for Virtual Try-On (VTON) technology has surged, allowing customers to visualize products on themselves by overlaying product images onto their own photos. An essential yet challenging condition for effective VTON is pose control, which ensures accurate alignment of products with the user’s body while supporting diverse orientations for a more immersive experience. However, incorporating pose conditions into VTON models presents several challenges, including selecting the optimal pose representation, integrating poses without additional parameters, and balancing pose preservation with flexible pose control. In this work, we build upon a baseline VTON model that concatenates the reference image condition without external encoder, control network, or complex attention layers. We investigate methods to incorporate pose control into this pure concatenation paradigm by spatially concatenating pose data, comparing performance using pose maps and skeletons, without adding any additional parameters or module to the baseline model. Our experiments reveal that pose stitching with pose maps yields the best results, enhancing both pose preservation and output realism. Additionally, we introduce a mixed-mask training strategy using fine-grained and bounding box masks, allowing the model to support flexible product integration across varied poses and conditions. Comments: Submitted to CVPR 2025 and Published at CVPR 2025 AI for Content Creation workshop Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.20343 [cs.CV] (or arXiv:2509.20343v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.20343 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-3] Video models are zero-shot learners and reason ers

【速读】:该论文试图解决的问题是:当前生成式视频模型是否具备向通用视觉理解方向演进的潜力,即能否像大型语言模型(Large Language Models, LLMs)一样发展为统一、通用的视觉基础模型(vision foundation models)。解决方案的关键在于通过实证表明,Veo 3 模型在未显式训练特定任务的情况下,能够零样本(zero-shot)完成多种视觉任务,如目标分割、边缘检测、图像编辑、物理属性理解、物体功能识别、工具使用模拟等,这些能力体现了对视觉世界的感知、建模与操作能力,从而展现出早期形式的视觉推理(visual reasoning),证明视频模型正沿着通向通用视觉理解的路径发展。

链接: https://arxiv.org/abs/2509.20328
作者: Thaddäus Wiedemer,Yuxuan Li,Paul Vicol,Shixiang Shane Gu,Nick Matarese,Kevin Swersky,Been Kim,Priyank Jaini,Robert Geirhos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today’s generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn’t explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo’s emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
zh

[CV-4] VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation

【速读】:该论文旨在解决非结构化环境中人形机器人在运动与操作(loco-manipulation)任务中,如何实现基于自身体验视觉(egocentric vision)的全身控制(whole-body control)并具备跨任务泛化能力的问题。现有方法要么依赖外部动作捕捉系统,要么难以在不同任务间迁移。解决方案的关键在于提出 VisualMimic 框架,其核心是将任务无关的低层关键点追踪器(由人类运动数据通过教师-学生机制训练得到)与任务特定的高层策略相结合,高层策略根据视觉和本体感觉输入生成关键点指令;同时通过向低层策略注入噪声及利用人类运动统计对高层动作进行裁剪,确保训练稳定性,从而实现从仿真到真实人形机器人的零样本迁移(zero-shot transfer),完成如搬运、推拉、带球和踢球等多种复杂任务,并在户外环境中也表现出强鲁棒性。

链接: https://arxiv.org/abs/2509.20322
作者: Shaofeng Yin,Yanjie Ze,Hong-Xing Yu,C. Karen Liu,Jiajun Wu
机构: Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Website: this https URL

点击查看摘要

Abstract:Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker – trained from human motion data via a teacher-student scheme – with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: this https URL .
zh

[CV-5] A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices

【速读】:该论文旨在解决农业中鹿类侵入导致经济损失加剧的问题,传统防控手段因人力成本高、效率低且难以适配现代农业系统而效果有限。其核心挑战在于缺乏针对鹿类检测的专用数据集及对现场部署可行性的研究空白。解决方案的关键在于构建一个包含3,095张带边界框标注图像的公开数据集,并系统评估12种基于YOLO系列(v8–v11)深度学习模型在真实场景下的检测性能与边缘计算平台上的部署能力。结果表明,采用轻量级但架构先进的模型(如YOLOv11n、YOLOv8s和YOLOv9s)可在保证高精度(AP@0.5 ≥ 0.85)的同时实现超过30 FPS的实时推理速度,尤其在NVIDIA Jetson AGX Xavier平台上表现出良好的实用性,从而为智能、自主的鹿类入侵监测提供了可行路径。

链接: https://arxiv.org/abs/2509.20318
作者: Bishal Adhikari,Jiajia Li,Eric S. Michel,Jacob Dykes,Te-Ming Paul Tseng,Mary Love Tagert,Dong Chen
机构: Mississippi State University (密西西比州立大学); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies since these methods are often labor-intensive, costly, and ineffective for modern farming systems. To overcome this, there is a critical need for intelligent, autonomous solutions which require accurate and efficient deer detection. But the progress in this field is impeded by a significant gap in the literature, mainly the lack of a domain-specific, practical dataset and limited study on the on-field deployability of deer detection systems. Addressing this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. The contributions of this work are threefold. First, we introduce a curated, publicly available dataset of 3,095 annotated images with bounding-box annotations of deer, derived from the Idaho Cameratraps project. Second, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures(v8, v9, v10, and v11). Finally, we benchmarked performance on a high-end NVIDIA RTX 5090 GPU and evaluated on two representative edge computing platforms: Raspberry Pi 5 and NVIDIA Jetson AGX Xavier. Results show that the real-time detection is not feasible in Raspberry Pi without hardware-specific model optimization, while NVIDIA Jetson provides greater than 30 FPS with GPU-accelerated inference on ‘s’ and ‘n’ series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (AP@.5 0.85) and computational efficiency (FPS 30). To support further research, both the source code and datasets are publicly available at this https URL.
zh

[CV-6] FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis

【速读】:该论文旨在解决工业异常分割任务中因真实异常样本稀缺、多样且标注成本高而导致的像素级标注依赖问题,同时克服现有异常合成方法在采样效率与生成质量之间难以平衡,以及对异常区域和背景区域未加区分处理所导致的结构特定异常控制能力不足的问题。解决方案的关键在于提出一种前景感知的扩散框架FAST,其核心创新为两个模块:一是无需训练的异常引导加速采样(Anomaly-Informed Accelerated Sampling, AIAS),通过粗到精聚合策略将反向过程加速至仅需10步即可生成高质量异常;二是前景感知重建模块(Foreground-Aware Reconstruction Module, FARM),在每一步采样中自适应调整掩码前景区域内的噪声,从而在整个去噪轨迹中保留局部异常信号,实现可控且结构精准的异常合成。

链接: https://arxiv.org/abs/2509.20295
作者: Xichen Xu,Yanshu Wang,Jinbao Wang,Xiaoning Lei,Guoyang Xie,Guannan Jiang,Zhichao Lu
机构: Global Institute of Future Technology, Shanghai Jiao Tong University (上海交通大学未来技术学院); School of Artificial Intelligence, Shenzhen University (深圳大学人工智能学院); Department of Intelligent Manufacturing, CATL (宁德时代新能源科技股份有限公司); Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: this https URL.
zh

[CV-7] PerFace: Metric Learning in Perceptual Facial Similarity for Enhanced Face Anonymization

【速读】:该论文旨在解决现有面部匿名化技术中难以平衡匿名性与自然性的问题,特别是现有模型仅能进行二元身份判断(“同一人或不同人”),无法量化如“完全不同的脸”与“高度相似但不同”的细微差异。解决方案的关键在于提出一种基于人类感知的面部相似性度量方法,通过构建包含6,400组三元组标注的数据集,并采用度量学习(metric learning)来预测面部相似性,从而在面部相似性预测和基于属性的脸部分类任务上显著优于现有方法。

链接: https://arxiv.org/abs/2509.20281
作者: Haruka Kumagai,Leslie Wöhler,Satoshi Ikehata,Kiyoharu Aizawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In response to rising societal awareness of privacy concerns, face anonymization techniques have advanced, including the emergence of face-swapping methods that replace one identity with another. Achieving a balance between anonymity and naturalness in face swapping requires careful selection of identities: overly similar faces compromise anonymity, while dissimilar ones reduce naturalness. Existing models, however, focus on binary identity classification “the same person or not”, making it difficult to measure nuanced similarities such as “completely different” versus “highly similar but different.” This paper proposes a human-perception-based face similarity metric, creating a dataset of 6,400 triplet annotations and metric learning to predict the similarity. Experimental results demonstrate significant improvements in both face similarity prediction and attribute-based face classification tasks over existing methods.
zh

[CV-8] HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy

【速读】:该论文旨在解决医学图像分割中局部细节与全局语义信息融合不充分的问题,现有基于CNN-Transformer混合架构的方法通常采用串行堆叠、端点拼接或逐点相加等简单特征融合策略,难以有效处理多源特征间的不一致性,易导致信息冲突和丢失。其解决方案的关键在于提出HiPerformer框架:首先设计模块化分层架构(modular hierarchical architecture),在编码器中并行动态融合多源特征,实现层间深度集成,保留各分支独立建模能力的同时保障信息高效传递;其次引入局部-全局特征融合模块(Local-Global Feature Fusion, LGFF),精准整合局部细节与全局语义信息,缓解特征不一致问题;此外,通过渐进式金字塔聚合模块(Progressive Pyramid Aggregation, PPA)替代传统跳跃连接,增强多尺度特征表示能力并抑制噪声干扰。

链接: https://arxiv.org/abs/2509.20280
作者: Dayu Tan,Zhenpeng Xu,Yansen Su,Xin Peng,Chunhou Zheng,Weimin Zhong
机构: Anhui University (安徽大学); East China University of Science and Technology (华东理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at this https URL.
zh

[CV-9] A co-evolving agent ic AI system for medical imaging analysis

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在医学图像分析领域中性能受限、采纳率低的问题,其核心挑战在于缺乏稳健的生态系统、工具集不足以及实时交互式专家反馈机制的缺失。解决方案的关键在于提出“TissueLab”这一协同进化型智能体系统(co-evolving agentic AI system),通过集成病理学、放射学和空间组学领域的工具工厂(tool factories),标准化各类工具的输入输出与能力接口,从而自动规划并生成可解释的工作流程;同时支持临床专家在分析过程中实时可视化中间结果并进行修正,实现人机协同优化。该系统不仅在多个具有临床意义的任务上达到当前最优性能,还能借助主动学习机制在无需大规模数据或长时间再训练的情况下,快速适应未见疾病场景,显著提升医学影像AI的实用性与泛化能力。

链接: https://arxiv.org/abs/2509.20279
作者: Songhao Li,Jonathan Xu,Tiancheng Bao,Yuxuan Liu,Yuchen Liu,Yihang Liu,Lilin Wang,Wenhui Lei,Sheng Wang,Yinuo Xu,Yan Cui,Jialu Yao,Shunsuke Koga,Zhi Huang
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, their performance and adoption remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present “TissueLab”, a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.
zh

[CV-10] A Versatile Foundation Model for AI-enabled Mammogram Interpretation

【速读】:该论文旨在解决当前基础模型(Foundation Models, FMs)在乳腺X线摄影(mammogram)分析中临床转化受限的问题,具体包括训练数据多样性不足、模型泛化能力有限以及缺乏跨临床任务的全面评估。其解决方案的关键在于提出VersaMammo——一个面向乳腺X线影像的多功能基础模型,通过构建迄今最大的多机构乳腺X线数据集(706,239张图像,来自21个来源),并采用两阶段预训练策略:首先利用自监督学习训练教师模型提取可迁移特征,再结合监督学习与知识蒸馏将特征和临床知识迁移至VersaMammo;同时建立包含92项具体任务的基准测试体系(涵盖病变检测、分割、分类、图像检索和视觉问答五大类),从而实现卓越的泛化性能与临床实用性,在内部任务和外部验证任务中分别以平均排名1.5和1.2位居前列。

链接: https://arxiv.org/abs/2509.20271
作者: Fuxiang Huang,Jiayi Zhu,Yunfang Yu,Yu Xie,Yuan Guo,Qingcong Kong,Mingxiang Wu,Xinrui Jiang,Shu Yang,Jiabo Ma,Ziyi Liu,Zhe Xu,Zhixuan Chen,Yujie Tan,Zifan He,Luhui Mao,Xi Wang,Junlin Hou,Lei Zhang,Qiong Luo,Zhenhui Li,Herui Yao,Hao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 64 pages, 7 figures, 40 tables

点击查看摘要

Abstract:Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in training data, limited model generalizability, and a lack of comprehensive evaluation across clinically relevant tasks. Here, we introduce VersaMammo, a versatile foundation model for mammograms, designed to overcome these limitations. We curated the largest multi-institutional mammogram dataset to date, comprising 706,239 images from 21 sources. To improve generalization, we propose a two-stage pre-training strategy to develop VersaMammo, a mammogram foundation model. First, a teacher model is trained via self-supervised learning to extract transferable features from unlabeled mammograms. Then, supervised learning combined with knowledge distillation transfers both features and clinical knowledge into VersaMammo. To ensure a comprehensive evaluation, we established a benchmark comprising 92 specific tasks, including 68 internal tasks and 24 external validation tasks, spanning 5 major clinical task categories: lesion detection, segmentation, classification, image retrieval, and visual question answering. VersaMammo achieves state-of-the-art performance, ranking first in 50 out of 68 specific internal tasks and 20 out of 24 external validation tasks, with average ranks of 1.5 and 1.2, respectively. These results demonstrate its superior generalization and clinical utility, offering a substantial advancement toward reliable and scalable breast cancer screening and diagnosis.
zh

[CV-11] Predictive Coding-based Deep Neural Network Fine-tuning for Computationally Efficient Domain Adaptation

【速读】:该论文旨在解决深度神经网络在动态现实环境中因输入数据分布变化(如传感器漂移或光照变化)而导致性能下降的问题,尤其是在资源受限的边缘设备上难以实现高效持续学习的挑战。解决方案的关键在于提出一种混合训练方法:首先使用反向传播(Backpropagation)在离线阶段训练模型以获得高初始性能,随后引入预测编码(Predictive Coding)进行在线适应,从而在不显著增加计算开销的前提下恢复因分布偏移导致的精度损失。该策略结合了反向传播在表征学习中的鲁棒性与预测编码在持续学习中的计算效率,特别适用于边缘计算或未来类脑加速器场景。

链接: https://arxiv.org/abs/2509.20269
作者: Matteo Cardoni,Sam Leroux
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:As deep neural networks are increasingly deployed in dynamic, real-world environments, relying on a single static model is often insufficient. Changes in input data distributions caused by sensor drift or lighting variations necessitate continual model adaptation. In this paper, we propose a hybrid training methodology that enables efficient on-device domain adaptation by combining the strengths of Backpropagation and Predictive Coding. The method begins with a deep neural network trained offline using Backpropagation to achieve high initial performance. Subsequently, Predictive Coding is employed for online adaptation, allowing the model to recover accuracy lost due to shifts in the input data distribution. This approach leverages the robustness of Backpropagation for initial representation learning and the computational efficiency of Predictive Coding for continual learning, making it particularly well-suited for resource-constrained edge devices or future neuromorphic accelerators. Experimental results on the MNIST and CIFAR-10 datasets demonstrate that this hybrid strategy enables effective adaptation with a reduced computational overhead, offering a promising solution for maintaining model performance in dynamic environments.
zh

[CV-12] 4D Driving Scene Generation With Stereo Forcing

【速读】:该论文旨在解决当前生成式模型在合成动态4D驾驶场景时面临的两大挑战:一是难以同时实现时间外推(temporal extrapolation)与空间新视角合成(novel view synthesis, NVS),二是缺乏无需针对每个场景进行优化的统一框架。解决方案的关键在于提出PhiGenesis,一个基于几何一致性约束的统一4D场景生成框架:第一阶段利用预训练视频变分自编码器(video VAE)结合新颖的range-view适配器,实现从多视角图像序列到完整4D高斯泼溅表示的前向重建;第二阶段引入几何引导的视频扩散模型,以历史4D场景作为先验条件,沿目标3D轨迹生成未来视图,并通过Stereo Forcing策略在去噪过程中融合几何不确定性,从而增强时间一致性并缓解新视角下的几何偏差问题。

链接: https://arxiv.org/abs/2509.20251
作者: Hao Lu,Zhuang Ma,Guangfeng Jiang,Wenhang Ge,Bohan Li,Yuzhan Cai,Wenzhao Zheng,Yunpeng Zhang,Yingcong Chen
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); University of Science and Technology of China (中国科学技术大学); Shanghai Jiao Tong University (上海交通大学); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Our experimental results demonstrate that our method achieves state-of-the-art performance in both appearance and geometric reconstruction, temporal generation and novel view synthesis (NVS) tasks, while simultaneously delivering competitive performance in downstream evaluations. Homepage is at \hrefthis https URLPhiGensis.
zh

[CV-13] An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation

【速读】:该论文旨在解决临床CT图像中因切片厚度较大导致的各向异性问题,即纵向(through-plane)分辨率远低于平面内(in-plane)分辨率,从而影响疾病诊断准确性的问题。现有基于深度学习的体积超分辨率方法多采用单张图像超分辨或相邻切片插值,未能充分利用3D CT数据的各向异性特性。其解决方案的关键在于提出一种新颖的跨视角纹理迁移方法,通过设计一个多参考非局部注意力模块(multi-reference non-local attention module),从多个高分辨率平面内图像中提取有意义特征,并将这些高频纹理细节有效传递至低分辨率纵向切片,实现更高质量的切片插值。实验表明,该方法在公开CT数据集上显著优于现有竞争方法,验证了其框架的有效性。

链接: https://arxiv.org/abs/2509.20242
作者: Kwang-Hyun Uhm,Hyunjun Cho,Sung-Hoo Hong,Seung-Won Jung
机构: Gachon University (嘉泉大学); Korea University (韩国科学技术院); MedAI; The Catholic University of Korea (韩国天主教大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transactions on Medical Imaging (TMI), 2025

点击查看摘要

Abstract:Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at this https URL.
zh

[CV-14] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression NEURIPS2025

【速读】:该论文旨在解决当前关于卷积神经网络(Convolutional Neural Networks, CNNs)是否本质上偏向纹理特征(texture-biased)这一争议性问题。此前研究(如Geirhos等人提出的cue-conflict实验)存在方法学局限,可能导致对模型特征依赖性的误判。为此,作者提出了一种领域无关(domain-agnostic)的量化框架,通过系统性地抑制形状、纹理和颜色线索,在受控条件下评估人类与神经网络的特征依赖模式,从而避免强制选择冲突带来的混杂因素。关键创新在于采用可控抑制策略而非冲突任务设计,实证表明CNN并非天然偏好纹理,而是主要依赖局部形状特征;且这种依赖可通过现代训练策略或架构(如ConvNeXt、视觉Transformer)显著削弱。

链接: https://arxiv.org/abs/2509.20234
作者: Tom Burgert,Oliver Stoll,Paolo Rota,Begüm Demir
机构: BIFOLD; TU Berlin (柏林工业大学); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at NeurIPS 2025 (oral)

点击查看摘要

Abstract:The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance towards texture. Code is available at this https URL.
zh

[CV-15] Design Insights and Comparative Evaluation of a Hardware-Based Cooperative Perception Architecture for Lane Change Prediction

【速读】:该论文旨在解决真实道路环境中车道变更预测系统的部署难题,特别是现有研究多基于仿真或预录数据集,难以反映实际交通场景中感知、通信与行为建模的复杂性。其解决方案的关键在于通过在混合交通环境下进行真实的硬件部署,系统性地识别并记录了影响车道变更预测性能的实际挑战,包括系统瓶颈、可靠性问题及运行约束,并由此提炼出可复用的实践经验与优化方向,为同类系统的开发提供实证指导。

链接: https://arxiv.org/abs/2509.20218
作者: Mohamed Manzour,Catherine M. Elias,Omar M. Shehata,Rubén Izquierdo,Miguel Ángel Sotelo
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Research on lane change prediction has gained attention in the last few years. Most existing works in this area have been conducted in simulation environments or with pre-recorded datasets, these works often rely on simplified assumptions about sensing, communication, and traffic behavior that do not always hold in practice. Real-world deployments of lane-change prediction systems are relatively rare, and when they are reported, the practical challenges, limitations, and lessons learned are often under-documented. This study explores cooperative lane-change prediction through a real hardware deployment in mixed traffic and shares the insights that emerged during implementation and testing. We highlight the practical challenges we faced, including bottlenecks, reliability issues, and operational constraints that shaped the behavior of the system. By documenting these experiences, the study provides guidance for others working on similar pipelines.
zh

[CV-16] PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation ICCV ICCV2025

【速读】:该论文旨在解决由3D传感器生成的点云数据通常稀疏且噪声较大,难以满足需要高保真度和稠密三维表示的任务需求的问题。传统方法通过隐式特征上采样或距离函数学习来应对,但常牺牲几何可解释性或对输入稀疏性的鲁棒性。其解决方案的关键在于提出PU-Gaussian网络,该网络利用各向异性的3D高斯分布(anisotropic 3D Gaussian distributions)建模每个点的局部邻域结构,从而在局部几何域中显式地进行点采样以生成稠密但粗糙的点云;随后通过一个精炼网络调整粗略输出,实现更均匀的分布与更锐利的边缘,最终在PU1K和PUGAN数据集上达到当前最优性能。

链接: https://arxiv.org/abs/2509.20207
作者: Mahmoud Khater,Mona Strauss,Philipp von Olshausen,Alexander Reiterer
机构: University of Freiburg (弗莱堡大学); Fraunhofer IPM (弗劳恩霍夫IPM研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for the ICCV 2025 e2e3D Workshop. To be published in the Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

点击查看摘要

Abstract:Point clouds produced by 3D sensors are often sparse and noisy, posing challenges for tasks requiring dense and high-fidelity 3D representations. Prior work has explored both implicit feature-based upsampling and distance-function learning to address this, but often at the expense of geometric interpretability or robustness to input sparsity. To overcome these limitations, we propose PU-Gaussian, a novel upsampling network that models the local neighborhood around each point using anisotropic 3D Gaussian distributions. These Gaussians capture the underlying geometric structure, allowing us to perform upsampling explicitly in the local geometric domain by direct point sampling. The sampling process generates a dense, but coarse, point cloud. A subsequent refinement network adjusts the coarse output to produce a more uniform distribution and sharper edges. We perform extensive testing on the PU1K and PUGAN datasets, demonstrating that PU-Gaussian achieves state-of-the-art performance. We make code and model weights publicly available at this https URL.
zh

[CV-17] Universal Camouflage Attack on Vision-Language Models for Autonomous Driving

【速读】:该论文旨在解决视觉语言模型在自动驾驶(VLM-AD)系统中面临的对抗攻击安全问题,特别是现有攻击方法在物理场景下迁移性差、对多模态输入敏感度不足以及缺乏跨模型和跨命令泛化能力的问题。其解决方案的关键在于提出首个通用伪装攻击(Universal Camouflage Attack, UCA)框架,该框架通过在特征空间而非 logits 层进行优化,引入特征差异损失(Feature Divergence Loss, FDL)以最大化干净样本与对抗样本之间的表征差异,并结合多尺度学习策略和采样比例调整机制,显著提升攻击在不同用户指令、模型架构、视角变化及动态环境下的鲁棒性和泛化性能,从而实现对 VLM-AD 系统的高效、可物理实现的欺骗性攻击。

链接: https://arxiv.org/abs/2509.20196
作者: Dehong Kong,Sifan Yu,Siyuan Liang,Jiawei Liang,Jianhou Gan,Aishan Liu,Wenqi Ren
机构: School of Cyber Science and Technology, SUN YAT-SEN UNIVERSITY (中山大学网络科学与技术学院); School of Computing, National University of Singapore (新加坡国立大学计算机学院); Key Laboratory of Education Informatization for Nationalities, Yunnan Normal University (云南师范大学民族教育信息化重点实验室); SCSE, Beihang University (北京航空航天大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Visual language modeling for automated driving is emerging as a promising research direction with substantial improvements in multimodal reasoning capabilities. Despite its advanced reasoning abilities, VLM-AD remains vulnerable to serious security threats from adversarial attacks, which involve misleading model decisions through carefully crafted perturbations. Existing attacks have obvious challenges: 1) Physical adversarial attacks primarily target vision modules. They are difficult to directly transfer to VLM-AD systems because they typically attack low-level perceptual components. 2) Adversarial attacks against VLM-AD have largely concentrated on the digital level. To address these challenges, we propose the first Universal Camouflage Attack (UCA) framework for VLM-AD. Unlike previous methods that focus on optimizing the logit layer, UCA operates in the feature space to generate physically realizable camouflage textures that exhibit strong generalization across different user commands and model architectures. Motivated by the observed vulnerability of encoder and projection layers in VLM-AD, UCA introduces a feature divergence loss (FDL) that maximizes the representational discrepancy between clean and adversarial images. In addition, UCA incorporates a multi-scale learning strategy and adjusts the sampling ratio to enhance its adaptability to changes in scale and viewpoint diversity in real-world scenarios, thereby improving training stability. Extensive experiments demonstrate that UCA can induce incorrect driving commands across various VLM-AD models and driving scenarios, significantly surpassing existing state-of-the-art attack methods (improving 30% in 3-P metrics). Furthermore, UCA exhibits strong attack robustness under diverse viewpoints and dynamic conditions, indicating high potential for practical deployment.
zh

[CV-18] Optical Ocean Recipes: Creating Realistic Datasets to Facilitate Underwater Vision Research

【速读】:该论文旨在解决水下机器视觉(Machine Vision)在复杂光学环境下的评估难题,特别是由于水体光衰减、后向散射、体积散射及动态光照等因素导致的图像颜色失真、对比度下降和模糊等问题,使得现有测试方法缺乏通用性和可控性。解决方案的关键在于提出了一种名为“光学海洋配方”(Optical Ocean Recipes)的框架,通过使用校准过的颜色和散射添加剂,在受控条件下精确模拟不同水体成分对图像外观的影响,从而实现可重复、可控制的水下视觉任务测试与地面真实数据生成,适用于水体参数估计、图像恢复、分割、视觉SLAM及水下图像合成等多种应用。

链接: https://arxiv.org/abs/2509.20171
作者: Patricia Schöntag,David Nakath,Judith Fischer,Rüdiger Röttgers,Kevin Köser
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 9 figures, submitted to IEEE Journal of Ocean Engineering

点击查看摘要

Abstract:The development and evaluation of machine vision in underwater environments remains challenging, often relying on trial-and-error-based testing tailored to specific applications. This is partly due to the lack of controlled, ground-truthed testing environments that account for the optical challenges, such as color distortion from spectrally variant light attenuation, reduced contrast and blur from backscatter and volume scattering, and dynamic light patterns from natural or artificial illumination. Additionally, the appearance of ocean water in images varies significantly across regions, depths, and seasons. However, most machine vision evaluations are conducted under specific optical water types and imaging conditions, therefore often lack generalizability. Exhaustive testing across diverse open-water scenarios is technically impractical. To address this, we introduce the \textitOptical Ocean Recipes, a framework for creating realistic datasets under controlled underwater conditions. Unlike synthetic or open-water data, these recipes, using calibrated color and scattering additives, enable repeatable and controlled testing of the impact of water composition on image appearance. Hence, this provides a unique framework for analyzing machine vision in realistic, yet controlled underwater scenarios. The controlled environment enables the creation of ground-truth data for a range of vision tasks, including water parameter estimation, image restoration, segmentation, visual SLAM, and underwater image synthesis. We provide a demonstration dataset generated using the Optical Ocean Recipes and briefly demonstrate the use of our system for two underwater vision tasks. The dataset and evaluation code will be made available.
zh

[CV-19] U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT

【速读】:该论文旨在解决锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)中牙齿与牙髓区域精确分割的难题,该任务在临床治疗规划和诊断中至关重要,但传统方法依赖专家人工标注,耗时且效率低下。为应对这一挑战,作者提出了一种名为U-Mamba2-SSL的新型半监督学习框架,其核心创新在于结合了多阶段训练策略:首先利用破坏性自编码器(disruptive autoencoder)对U-Mamba2模型进行自监督预训练;随后通过一致性正则化机制引入输入扰动和特征扰动,有效利用未标注数据提升模型鲁棒性;最后采用伪标签策略并降低损失权重以减少错误标签带来的负面影响。实验表明,该方法在验证集上取得了平均分数0.872和Dice相似系数(DSC)0.969的优异性能,显著提升了CBCT图像中牙齿与牙髓分割的自动化水平与准确性。

链接: https://arxiv.org/abs/2509.20154
作者: Zhi Qin Tan,Xiatian Zhu,Owen Addison,Yunpeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at this https URL.
zh

[CV-20] C2MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis

【速读】:该论文旨在解决基于图的多实例学习(Graph-based Multiple Instance Learning, MIL)在利用苏木精-伊红(Hematoxylin and Eosin, H&E)染色全切片图像(Whole Slide Images, WSIs)进行生存分析时所面临的两个核心问题:一是染色和扫描差异引入的语义偏差(semantic bias),二是非因果拓扑子图带来的噪声,二者均导致滑动级别表示的偏差,从而损害模型的可解释性和泛化能力。解决方案的关键在于构建一个双结构因果模型(dual structural causal model)作为理论基础,并提出一种新颖且可解释的双因果图MIL模型——C² MIL。其核心创新包括:1)提出跨尺度自适应特征解耦模块(cross-scale adaptive feature disentangling module),用于语义因果干预;2)设计基于伯努利可微分因果子图采样方法(Bernoulli differentiable causal subgraph sampling method),实现拓扑因果发现;3)采用联合优化策略,结合解耦监督与对比学习,同步优化语义与拓扑因果性,显著提升模型性能与可解释性。

链接: https://arxiv.org/abs/2509.20152
作者: Min Cen,Zhenfeng Zhuang,Yuzhe Zhang,Min Zeng,Baptiste Magnier,Lequan Yu,Hong Zhang,Liansheng Wang
机构: University of Science and Technology of China (中国科学技术大学); Xiamen University (厦门大学); EuroMov Digital Health in Motion, Univ Montpellier, IMT Mines Ales (EuroMov 数字健康运动,蒙彼利埃大学,IMT矿业学院); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graph-based Multiple Instance Learning (MIL) is widely used in survival analysis with Hematoxylin and Eosin (H\E)-stained whole slide images (WSIs) due to its ability to capture topological information. However, variations in staining and scanning can introduce semantic bias, while topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C ^2 MIL. C ^2 MIL incorporates a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy combining disentangling supervision and contrastive learning enables simultaneous refinement of both semantic and topological causalities. Experiments demonstrate that C ^2 MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines. The code is available at this https URL.
zh

[CV-21] Smaller is Better: Enhancing Transparency in Vehicle AI Systems via Pruning

【速读】:该论文旨在解决自动驾驶车辆中深度学习模型(尤其是交通标志分类器)的可解释性问题,即如何提升后处理解释(post-hoc explanations,如显著性图)的质量与可信度,以增强模型决策的透明性和安全性。其解决方案的关键在于系统评估三种主流训练方法——自然训练、对抗训练和剪枝(pruning)对解释质量的影响,发现剪枝不仅能提高模型效率,还能通过强制学习表示的稀疏性(sparsity)显著增强解释的可理解性和忠实性(faithfulness),从而为资源受限的车载AI系统提供一种有效的透明化建模策略。

链接: https://arxiv.org/abs/2509.20148
作者: Sanish Suwal,Shaurya Garg,Dipkamal Bhusal,Michael Clifford,Nidhi Rastogi
机构: Rochester Institute of Technology (罗切斯特理工学院); Toyota InfoTech Labs (丰田信息科技实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Connected and autonomous vehicles continue to heavily rely on AI systems, where transparency and security are critical for trust and operational safety. Post-hoc explanations provide transparency to these black-box like AI models but the quality and reliability of these explanations is often questioned due to inconsistencies and lack of faithfulness in representing model decisions. This paper systematically examines the impact of three widely used training approaches, namely natural training, adversarial training, and pruning, affect the quality of post-hoc explanations for traffic sign classifiers. Through extensive empirical evaluation, we demonstrate that pruning significantly enhances the comprehensibility and faithfulness of explanations (using saliency maps). Our findings reveal that pruning not only improves model efficiency but also enforces sparsity in learned representation, leading to more interpretable and reliable decisions. Additionally, these insights suggest that pruning is a promising strategy for developing transparent deep learning models, especially in resource-constrained vehicular AI systems.
zh

[CV-22] EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models

【速读】:该论文旨在解决当前医疗大视觉语言模型(Medical Large Vision-Language Models, LVLMs)评估中过度关注排行榜准确率而忽视可靠性与安全性的问题,特别是模型在高风险临床场景中表现出的“谄媚倾向”(sycophancy)——即无批判地重复用户提供的信息。解决方案的关键在于提出并构建了EchoBench基准测试集,系统性地量化和分析医疗LVLMs对患者、医学生和医生等不同来源偏见输入的响应行为。通过2,122张跨18个科室和20种模态的图像及90个模拟偏见提示,研究发现所有评测模型均存在显著谄媚倾向,即使顶级商业模型(如Claude 3.7 Sonnet)仍达45.98%,GPT-4.1高达59.15%;进一步的细粒度分析揭示了偏见类型、科室、感知粒度和模态等因素对敏感性的调节作用,并验证了提升数据质量与多样性、增强领域知识可有效降低谄媚倾向而不损害无偏准确性。此外,EchoBench还作为干预测试平台,证明简单的提示工程策略(负向提示、单样本/少样本示例)能稳定减少谄媚行为,从而为训练和解码阶段的缓解方法提供方向,推动更安全、可信的医疗LVLM发展。

链接: https://arxiv.org/abs/2509.20146
作者: Botai Yuan,Yutian Zhou,Yingjie Wang,Fushuo Huo,Yongcheng Jing,Li Shen,Ying Wei,Zhiqi Shen,Ziwei Liu,Tianwei Zhang,Jie Yang,Dacheng Tao
机构: Nanyang Technological University (南洋理工大学); Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学); Sun Yat-sen University (中山大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages, 6 figures

点击查看摘要

Abstract:Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy – models’ tendency to uncritically echo user-provided information – in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.
zh

[CV-23] KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation

【速读】:该论文旨在解决当前音频驱动面部动画(audio-driven facial animation)中两个关键问题:一是现有方法将语音特征视为单一整体表示,未能捕捉其在驱动不同面部运动中的细粒度作用;二是忽略了对动态剧烈的关键帧进行建模的重要性。解决方案的核心在于提出KSDiff框架,通过双路径语音编码器(Dual-Path Speech Encoder, DPSE)分离出与表情和头部姿态相关的语音特征,并引入自回归的关键帧建立学习模块(Keyframe Establishment Learning, KEL)以预测最具代表性的运动帧,再结合双路径运动生成器实现连贯且真实的面部动作合成。该方法显著提升了唇同步准确性和头部姿态自然度,验证了语音解耦与关键帧感知扩散机制相结合的有效性。

链接: https://arxiv.org/abs/2509.20128
作者: Tianle Lyu,Junchuan Zhao,Ye Wang
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 5 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Audio-driven facial animation has made significant progress in multimedia applications, with diffusion models showing strong potential for talking-face synthesis. However, most existing works treat speech features as a monolithic representation and fail to capture their fine-grained roles in driving different facial motions, while also overlooking the importance of modeling keyframes with intense dynamics. To address these limitations, we propose KSDiff, a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. Specifically, the raw audio and transcript are processed by a Dual-Path Speech Encoder (DPSE) to disentangle expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning (KEL) module predicts the most salient motion frames. These components are integrated into a Dual-path Motion generator to synthesize coherent and realistic facial motions. Extensive experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness. Our results highlight the effectiveness of combining speech disentanglement with keyframe-aware diffusion for talking-head generation.
zh

[CV-24] A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA

【速读】:该论文旨在解决科学视觉问答(Scientific Visual Question Answering)任务中,由于科学图表及其多模态上下文复杂性导致的挑战,尤其是现有视觉语言模型在“文本嵌入图像”(text-in-image)格式下零样本(zero-shot)性能不佳的问题。解决方案的关键在于:首先通过将已有的独立图像-文本对转换为统一的“文本嵌入图像”格式,合成一个新数据集以缓解训练数据稀缺问题;其次,在混合使用该合成数据与EXAMS-V数据的基础上,对一个小规模多语言多模态模型进行微调,从而显著提升跨13种语言的平均性能并实现良好的跨语言迁移能力。

链接: https://arxiv.org/abs/2509.20119
作者: Belal Shoer,Yova Kementchedjhieva
机构: MBZUAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WiNLP, 2025

点击查看摘要

Abstract:Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this “text-in-image” format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.
zh

[CV-25] Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models

【速读】:该论文旨在解决当前高光谱图像(Hyperspectral Imaging, HSI)语义分割方法性能不足的问题,其根源在于现有方法多基于为RGB图像设计的网络架构和学习框架,难以有效利用HSI中丰富的光谱信息。解决方案的关键在于提出一种新颖的高光谱适配器(hyperspectral adapter),该适配器通过引入预训练视觉基础模型(vision foundation models)来增强对HSI数据的学习能力;具体包括:1)一个光谱Transformer模块与一个谱感知空间先验模块,用于提取深层的空间-光谱特征;2)一个模态感知交互块(modality-aware interaction block),通过专门的特征提取与注入机制,实现对冻结的视觉Transformer特征与HSI表示的有效融合。实验表明,该方法在三个自动驾驶基准数据集上实现了最先进的语义分割性能,直接使用HSI输入即可超越传统视觉和高光谱分割方法。

链接: https://arxiv.org/abs/2509.20107
作者: JuanaJuana Valeria Hurtado,Rohit Mohan,Abhinav Valada
机构: University of Freiburg (弗莱堡大学); Baden-Württemberg Stiftung gGmbH (巴登-符腾堡基金会); Bosch Research (博世研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at this https URL.
zh

[CV-26] Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing

【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像去雾任务中因重新训练计算开销大以及推理阶段采样步骤繁多而导致的应用局限性问题。解决方案的关键在于利用预训练扩散模型的语义潜在空间(semantic latent space)中随扩散时间步变化的表征特性,提出了一种名为DiffLI²D的网络架构,通过将不同时间步的扩散潜在表示融入精心设计的去雾网络中,为去雾过程提供指导信息,从而避免了对扩散模型的重新训练和迭代采样过程,同时实现了优异的去雾性能。

链接: https://arxiv.org/abs/2509.20091
作者: Zizheng Yang,Hu Yu,Bing Li,Jinghao Zhang,Jie Huang,Feng Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI ^2 D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI ^2 D avoids re-training diffusion models and iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at this https URL.
zh

[CV-27] Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning

【速读】:该论文旨在解决机器人在复杂三维(3D)环境中理解高层人类指令并执行复杂任务的关键挑战,核心在于实现全面的场景理解——即以有意义的方式解析和交互于3D环境。解决方案的关键在于提出一种新型框架:3D Queryable Scene Representation (3D QSR),该框架融合多模态数据,统一三种互补的3D表示方式:(1)来自全景重建的一致性新视角渲染与分割,(2)基于3D点云的精确几何结构,以及(3)通过3D场景图实现的结构化、可扩展组织。该框架采用以物体为中心的设计,结合大规模视觉-语言模型,支持语义查询能力,能够关联多模态物体嵌入,并实现对象级别的几何、视觉与语义信息检索,最终将检索结果输入机器人任务规划器以完成下游任务执行。

链接: https://arxiv.org/abs/2509.20077
作者: Xun Li,Rodrigo Santa Cruz,Mingze Xi,Hu Zhang,Madhawa Perera,Ziwei Wang,Ahalya Ravendran,Brandon J. Matthews,Feng Xu,Matt Adcock,Dadong Wang,Jiajun Liu
机构: CSIRO (澳大利亚联邦科学与工业研究组织)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:To enable robots to comprehend high-level human instructions and perform complex tasks, a key challenge lies in achieving comprehensive scene understanding: interpreting and interacting with the 3D environment in a meaningful way. This requires a smart map that fuses accurate geometric structure with rich, human-understandable semantics. To address this, we introduce the 3D Queryable Scene Representation (3D QSR), a novel framework built on multimedia data that unifies three complementary 3D representations: (1) 3D-consistent novel view rendering and segmentation from panoptic reconstruction, (2) precise geometry from 3D point clouds, and (3) structured, scalable organization via 3D scene graphs. Built on an object-centric design, the framework integrates with large vision-language models to enable semantic queryability by linking multimodal object embeddings, and supporting object-level retrieval of geometric, visual, and semantic information. The retrieved data are then loaded into a robotic task planner for downstream execution. We evaluate our approach through simulated robotic task planning scenarios in Unity, guided by abstract language instructions and using the indoor public dataset Replica. Furthermore, we apply it in a digital duplicate of a real wet lab environment to test QSR-supported robotic task planning for emergency response. The results demonstrate the framework’s ability to facilitate scene understanding and integrate spatial and semantic reasoning, effectively translating high-level human instructions into precise robotic task planning in complex 3D environments.
zh

[CV-28] SHMoAReg: Spark Deformable Image Registration via Spatial Heterogeneous Mixture of Experts and Attention Heads

【速读】:该论文旨在解决当前基于深度学习的可变形图像配准(Deformable Image Registration, DIR)方法在特征提取专业化不足以及变形场预测缺乏方向异质性的问题。现有Encoder-Decoder架构通常采用统一的方式提取多尺度特征并联合预测三个空间方向上的变形,导致模型难以捕捉特定于配准任务的关键特征且无法灵活适应不同方向的形变需求。解决方案的关键在于提出一种专家引导的DIR网络SHMoAReg,其创新性地在编码器和解码器中分别引入Mixture of Attention heads (MoA) 和Spatial Heterogeneous Mixture of Experts (SHMoE)机制:MoA通过动态选择最优注意力头组合增强特征提取的专业化能力;SHMoE则在解码器中以不同核大小的专家异质性地预测每个体素在三个方向上的变形场,从而提升配准精度与可解释性。实验表明,该方法在腹部CT数据集上Dice分数从60.58%提升至65.58%,显著优于现有方法。

链接: https://arxiv.org/abs/2509.20073
作者: Yuxi Zheng,Jianhui Feng,Tianran Li,Marius Staring,Yuchuan Qiao
机构: 1: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 2: Department of Radiology, Leiden University Medical Center (莱顿大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Encoder-Decoder architectures are widely used in deep learning-based Deformable Image Registration (DIR), where the encoder extracts multi-scale features and the decoder predicts deformation fields by recovering spatial locations. However, current methods lack specialized extraction of features (that are useful for registration) and predict deformation jointly and homogeneously in all three directions. In this paper, we propose a novel expert-guided DIR network with Mixture of Experts (MoE) mechanism applied in both encoder and decoder, named SHMoAReg. Specifically, we incorporate Mixture of Attention heads (MoA) into encoder layers, while Spatial Heterogeneous Mixture of Experts (SHMoE) into the decoder layers. The MoA enhances the specialization of feature extraction by dynamically selecting the optimal combination of attention heads for each image token. Meanwhile, the SHMoE predicts deformation fields heterogeneously in three directions for each voxel using experts with varying kernel sizes. Extensive experiments conducted on two publicly available datasets show consistent improvements over various methods, with a notable increase from 60.58% to 65.58% in Dice score for the abdominal CT dataset. Furthermore, SHMoAReg enhances model interpretability by differentiating experts’ utilities across/within different resolution layers. To the best of our knowledge, we are the first to introduce MoE mechanism into DIR tasks. The code will be released soon.
zh

[CV-29] Predictive Quality Assessment for Mobile Secure Graphics ICCV2025

【速读】:该论文旨在解决智能手机图像采集条件下安全图形验证(Secure Graphic Verification)的可靠性问题,即用户在非受控环境下拍摄高熵图案时导致的高误拒率(False Non-Match Rate, FNMR),从而造成显著的“可靠性缺口”。解决方案的关键在于摒弃传统感知质量评估(Perceptual IQA)范式,提出一种预测性框架:通过轻量级模型对视频帧进行质量评分,预判其是否适合输入至资源密集型的验证模型(oracle model),从而提升整体验证效率与准确性。此外,研究发现,在跨工业打印设备场景下,冻结ImageNet预训练网络的轻量探测器比全参数微调模型更具泛化能力,揭示了物理制造域偏移情境中,通用骨干网络保持冻结状态反而优于过度拟合源域特征的全微调策略,为实际部署中的鲁棒性提供了重要启示。

链接: https://arxiv.org/abs/2509.20028
作者: Cas Steigstra,Sergey Milyaev,Shaodi You
机构: Scantrust; University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, to appear at ICCV 2025 MIPI Workshop (IEEE)

点击查看摘要

Abstract:The reliability of secure graphic verification, a key anti-counterfeiting tool, is undermined by poor image acquisition on smartphones. Uncontrolled user captures of these high-entropy patterns cause high false rejection rates, creating a significant ‘reliability gap’. To bridge this gap, we depart from traditional perceptual IQA and introduce a framework that predictively estimates a frame’s utility for the downstream verification task. We propose a lightweight model to predict a quality score for a video frame, determining its suitability for a resource-intensive oracle model. Our framework is validated using re-contextualized FNMR and ISRR metrics on a large-scale dataset of 32,000+ images from 105 smartphones. Furthermore, a novel cross-domain analysis on graphics from different industrial printing presses reveals a key finding: a lightweight probe on a frozen, ImageNet-pretrained network generalizes better to an unseen printing technology than a fully fine-tuned model. This provides a key insight for real-world generalization: for domain shifts from physical manufacturing, a frozen general-purpose backbone can be more robust than full fine-tuning, which can overfit to source-domain artifacts.
zh

[CV-30] Generative Adversarial Networks Applied for Privacy Preservation in Biometric-Based Authentication and Identification

【速读】:该论文旨在解决生物特征认证系统中用户隐私泄露的问题,即现有系统无法让用户控制其生物特征数据的使用方式,且存在数据被滥用的风险。解决方案的关键在于利用生成式对抗网络(Generative Adversarial Network, GAN)将人脸图像转换至一个视觉上私密的域(如花朵或鞋子),并在该私密域上训练用于身份认证的分类器,从而在保障用户隐私的同时维持系统的可用性与抗攻击能力。

链接: https://arxiv.org/abs/2509.20024
作者: Lubos Mjachky,Ivan Homoliak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Biometric-based authentication systems are getting broadly adopted in many areas. However, these systems do not allow participating users to influence the way their data is used. Furthermore, the data may leak and can be misused without the users’ knowledge. In this paper, we propose a new authentication method that preserves the privacy of individuals and is based on a generative adversarial network (GAN). Concretely, we suggest using the GAN for translating images of faces to a visually private domain (e.g., flowers or shoes). Classifiers, which are used for authentication purposes, are then trained on the images from the visually private domain. Based on our experiments, the method is robust against attacks and still provides meaningful utility.
zh

[CV-31] PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction ICCV2025

【速读】:该论文旨在解决多模态融合在计算肿瘤学中因模态异质性导致的性能瓶颈问题,特别是如何有效整合高维病理全切片图像(Whole Slide Images, WSIs)、病理报告文本和转录组数据以提升生存预测准确性。其核心挑战在于WSIs具有数亿像素的高维度特性,而病理报告为长度不一的简短文本,易造成模态不平衡。解决方案的关键在于提出一种基于原型(prototype-based)的表示学习框架,通过构建三类语义原型——诊断原型(Diagnostic prototypes)用于提取病理报告中的关键诊断信息并标准化文本表示、组织学原型(Histological prototypes)用于压缩WSIs中的形态学特征、生物通路原型(Biological pathway prototypes)用于编码转录组数据中的功能表达模式,进而利用Transformer架构实现跨模态交互与融合,形成PS3模型,从而显著优于现有单模态及多模态基线方法。

链接: https://arxiv.org/abs/2509.20022
作者: Manahil Raza,Ayesha Azam,Talha Qaiser,Nasir Rajpoot
机构: University of Warwick (华威大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025. Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: this https URL.
zh

[CV-32] Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression

【速读】:该论文旨在解决图像篡改定位(Image Manipulation Localization, IML)任务中现有方法忽视编辑过程时序性和层次性的问题。传统IML方法采用“一次性预测”(one-shot prediction)范式,直接输出单一二值掩码,导致高维组合空间的维度坍缩,与图像篡改本质上的多步骤、分层操作特性存在根本性不匹配。其解决方案的关键在于首次将IML重构为一个条件序列预测任务,提出RITA框架:该框架按顺序逐层预测篡改区域,利用前一步预测结果作为后续步骤的条件,显式建模编辑操作间的时序依赖和层级结构。通过合成多步篡改数据并构建新基准HSIM,以及引入HSS指标评估序列顺序与层级对齐度,实验表明RITA在传统基准上达到最先进性能,并为新型层次化定位任务提供了通用且有效的范式基础。

链接: https://arxiv.org/abs/2509.20006
作者: Xuekang Zhu,Ji-Zhe Zhou,Kaiwen Feng,Chenfan Qu,Yunfei Wang,Liting Zhou,Jian liu
机构: Sichuan University (四川大学); Ant Group (蚂蚁集团); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image manipulations often entail a complex manipulation process, comprising a series of editing operations to create a deceptive image, exhibiting sequentiality and hierarchical characteristics. However, existing IML methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse, thereby creating a fundamental mismatch with the intrinsic nature of the IML task. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step’s prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the novel hierarchical localization task, validating its potential as a general and effective paradigm. The code and dataset will be publicly available. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.20006 [cs.CV] (or arXiv:2509.20006v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.20006 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-33] MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization

【速读】:该论文旨在解决当前视频到音频(Video-to-Audio, V2A)方法在复杂多事件场景下性能受限的问题,具体表现为两个关键挑战:一是难以精确对齐复杂的语义信息与快速变化的动态特征;二是基础训练缺乏针对语义-时间对齐和音频质量的定量偏好优化,导致在杂乱多事件场景中生成质量不足。解决方案的核心在于提出名为MultiSoundGen的新框架,其关键创新包括:一是设计了SlowFast Contrastive Audio-Visual Pretraining(SF-CAVP)模型,采用统一双流架构显式对齐音视频数据的核心语义表征与快速动态特征以应对多事件复杂性;二是将直接偏好优化(Direct Preference Optimization, DPO)引入V2A任务,提出AVP-Ranked Preference Optimization(AVP-RPO),利用SF-CAVP作为奖励模型量化并优先优化关键语义-时间匹配关系,同时提升音频质量。实验表明,该方法在多事件场景中达到SOTA性能,在分布匹配、音频质量、语义对齐和时间同步等方面均实现显著提升。

链接: https://arxiv.org/abs/2509.19999
作者: Jianxuan Yang,Xiaoran Yang,Lipan Zhang,Xinyue Guo,Zhao Wang,Gongping Huang
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. The complete code and dataset will be released soon.
zh

[CV-34] Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture MICCAI2025

【速读】:该论文旨在解决医学影像中无监督异常检测在大规模数据集上的计算效率问题。传统基于记忆库(memory bank)的方法虽在小数据集上有效,但在大规模场景下因计算复杂度显著上升而不适用。其解决方案的关键在于使用Dirichlet Process Mixture Model (DPMM) 对DINOv2预训练模型提取的规范嵌入(normative embeddings)进行建模,该非参数混合模型能自动适应数据分布并确定最优成分数量;进而通过组件中心与嵌入之间的相似性构建异常评分函数,生成粗粒度异常分割掩码。实验表明,该方法在保持高检测性能的同时,推理时间至少减少50%,且规范化后的DINOv2嵌入更贴近解剖结构,即使存在异常也具有更强的表征能力。

链接: https://arxiv.org/abs/2509.19997
作者: Nico Schulthess,Ender Konukoglu
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Paper accepted at MICCAI 2025

点击查看摘要

Abstract:In this work, we leverage informative embeddings from foundational models for unsupervised anomaly detection in medical imaging. For small datasets, a memory-bank of normative features can directly be used for anomaly detection which has been demonstrated recently. However, this is unsuitable for large medical datasets as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as anomaly score function to create a coarse anomaly segmentation mask. Our experiments show that through DPMM embeddings of DINOv2, despite being trained on natural images, achieve very competitive anomaly detection performance on medical imaging benchmarks and can do this while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally more aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them great representations for anomaly detection. The code is available at this https URL.
zh

[CV-35] MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly

【速读】:该论文旨在解决生成式 AI (Generative AI) 中基于自回归模型的艺术家设计网格(artist-designed meshes)在扩展至高三角面数时面临的挑战,特别是现有基于 Transformer 的方法因长序列瓶颈和有限量化分辨率导致无法精确还原精细几何细节与结构化密度模式的问题。其解决方案的关键在于提出 MeshMosaic——一种局部到全局的框架:首先将形状分割为多个补丁(patches),对每个补丁进行自回归生成,并利用共享边界条件确保相邻区域间的连贯性、对称性和无缝连接;通过独立量化各补丁,显著提升高分辨率网格的可扩展性,同时增强网格密度分布的对称性和组织性,从而在几何保真度和用户偏好上均优于当前最优方法。

链接: https://arxiv.org/abs/2509.19995
作者: Rui Xu,Tianyang Xue,Qiujie Dong,Le Wan,Zhe Zhu,Peng Li,Zhiyang Dou,Cheng Lin,Shiqing Xin,Yuan Liu,Wenping Wang,Taku Komura
机构: The University of Hong Kong (香港大学); Tencent Visvise (腾讯视觉); Hong Kong University of Science and Technology (香港科技大学); Macau University of Science and Technology (澳门科技大学); Shandong University (山东大学); Texas A&M University (德克萨斯A&M大学)
类目: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
备注: Project is available at: this https URL

点击查看摘要

Abstract:Scaling artist-designed meshes to high triangle numbers remains challenging for autoregressive generative models. Existing transformer-based methods suffer from long-sequence bottlenecks and limited quantization resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns. We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles–substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions. This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure. Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.
zh

[CV-36] Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

【速读】:该论文旨在解决现有针对多模态预训练模型(如ImageBind)的定向对抗攻击在泛化能力和隐蔽性方面的局限性问题。具体而言,现有方法生成的对抗样本(AEs)在跨模态对齐任务中对部分已知或语义相似的目标泛化能力不足,且容易被简单的异常检测方法识别。为解决这一问题,作者提出了一种名为代理定向攻击(Proxy Targeted Attack, PTA)的新方法,其核心在于利用多个源模态和目标模态的代理(proxy)来优化对抗样本的生成过程,从而在保持对多种潜在目标对齐的同时,有效规避防御机制,实现高成功率与强隐蔽性的平衡。

链接: https://arxiv.org/abs/2509.19994
作者: Zhifang Zhang,Jiahan Zhang,Shengjie Zhou,Qi Wei,Shuo He,Feng Liu,Lei Feng
机构: Southeast University (东南大学); Johns Hopkins University (约翰霍普金斯大学); Chongqing University (重庆大学); Nanyang Technological University (南洋理工大学); University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal pre-trained models (e.g., ImageBind), which align distinct data modalities into a shared embedding space, have shown remarkable success across downstream tasks. However, their increasing adoption raises serious security concerns, especially regarding targeted adversarial attacks. In this paper, we show that existing targeted adversarial attacks on multimodal pre-trained models still have limitations in two aspects: generalizability and undetectability. Specifically, the crafted targeted adversarial examples (AEs) exhibit limited generalization to partially known or semantically similar targets in cross-modal alignment tasks (i.e., limited generalizability) and can be easily detected by simple anomaly detection methods (i.e., limited undetectability). To address these limitations, we propose a novel method called Proxy Targeted Attack (PTA), which leverages multiple source-modal and target-modal proxies to optimize targeted AEs, ensuring they remain evasive to defenses while aligning with multiple potential targets. We also provide theoretical analyses to highlight the relationship between generalizability and undetectability and to ensure optimal generalizability while meeting the specified requirements for undetectability. Furthermore, experimental results demonstrate that our PTA can achieve a high success rate across various related targets and remain undetectable against multiple anomaly detection methods.
zh

[CV-37] SDE-DET: A Precision Network for Shatian Pomelo Detection in Complex Orchard Environments

【速读】:该论文旨在解决在复杂果园环境中对沙田柚(Shatian pomelo)进行精准检测的难题,主要包括多尺度目标、树干与叶片遮挡以及小目标检测等挑战。解决方案的关键在于提出了一种名为SDE-DET的新型检测模型:首先引入Star Block以无额外计算开销的方式获取高维特征信息;其次在主干网络中采用可变形注意力机制(Deformable Attention),增强模型在遮挡条件下的检测能力;最后融合多种高效多尺度注意力机制(Efficient Multi-Scale Attention),在降低计算负担的同时提取深层视觉表征,显著提升小目标检测性能。实验表明,SDE-DET在自建数据集STP-AgriData上达到当前最优性能,为沙田柚自动化采摘机器人研发提供了可靠的技术支撑。

链接: https://arxiv.org/abs/2509.19990
作者: Yihao Hu,Pan Wang,Xiaodong Bai,Shijie Cai,Hang Wang,Huazhong Liu,Aiping Yang,Xiangxiang Li,Meiping Ding,Hongyan Liu,Jianguo Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pomelo detection is an essential process for their localization, automated robotic harvesting, and maturity analysis. However, detecting Shatian pomelo in complex orchard environments poses significant challenges, including multi-scale issues, obstructions from trunks and leaves, small object detection, etc. To address these issues, this study constructs a custom dataset STP-AgriData and proposes the SDE-DET model for Shatian pomelo detection. SDE-DET first utilizes the Star Block to effectively acquire high-dimensional information without increasing the computational overhead. Furthermore, the presented model adopts Deformable Attention in its backbone, to enhance its ability to detect pomelos under occluded conditions. Finally, multiple Efficient Multi-Scale Attention mechanisms are integrated into our model to reduce the computational overhead and extract deep visual representations, thereby improving the capacity for small object detection. In the experiment, we compared SDE-DET with the Yolo series and other mainstream detection models in Shatian pomelo detection. The presented SDE-DET model achieved scores of 0.883, 0.771, 0.838, 0.497, and 0.823 in Precision, Recall, mAP@0.5, mAP@0.5:0.95 and F1-score, respectively. SDE-DET has achieved state-of-the-art performance on the STP-AgriData dataset. Experiments indicate that the SDE-DET provides a reliable method for Shatian pomelo detection, laying the foundation for the further development of automatic harvest robots.
zh

[CV-38] CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion SIGGRAPH

【速读】:该论文旨在解决全景视频生成中几何一致性难以保障的问题,尤其是现有方法多聚焦于透视投影下的相机控制,而在球面投影(spherical projection)下的全景视频生成仍面临挑战,主要源于全景位姿表示的复杂性。其解决方案的关键在于提出首个基于扩散模型的全景视频生成框架CamPVG,通过两个核心创新实现:一是设计了全景Plücker嵌入(panoramic Plücker embedding),利用球坐标变换编码相机外参,有效捕捉全景几何结构;二是引入球面对极模块(spherical epipolar module),通过沿对极线自适应注意力掩码施加几何约束,实现跨视角特征的细粒度聚合,从而显著提升生成全景视频的质量与几何一致性。

链接: https://arxiv.org/abs/2509.19979
作者: Chenhao Ji,Chaohui Yu,Junyao Gao,Fan Wang,Cairong Zhao
机构: Tongji University (同济大学); DAMO Academy, Alibaba Group (阿里巴巴集团达摩院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH Asia 2025

点击查看摘要

Abstract:Recently, camera-controlled video generation has seen rapid development, offering more precise control over video generation. However, existing methods predominantly focus on camera control in perspective projection video generation, while geometrically consistent panoramic video generation remains challenging. This limitation is primarily due to the inherent complexities in panoramic pose representation and spherical projection. To address this issue, we propose CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses. We achieve camera position encoding for panoramic images and cross-view feature aggregation based on spherical projection. Specifically, we propose a panoramic Plücker embedding that encodes camera extrinsic parameters through spherical coordinate transformation. This pose encoder effectively captures panoramic geometry, overcoming the limitations of traditional methods when applied to equirectangular projections. Additionally, we introduce a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines. This module enables fine-grained cross-view feature aggregation, substantially enhancing the quality and consistency of generated panoramic videos. Extensive experiments demonstrate that our method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
zh

[CV-39] OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

【速读】:该论文旨在解决当前自动驾驶系统在场景理解方面存在的局限性,即主流方法主要依赖基于深度的三维重建,而缺乏人类视觉所具备的以自我为中心的四维(4D)场景理解能力。其解决方案的关键在于提出一种类人框架OmniScene,核心包括:1)构建OmniScene Vision-Language Model(OmniVLM),融合多视角与时间感知,实现对4D场景的整体理解;2)采用教师-学生架构与知识蒸馏技术,将文本语义嵌入3D实例特征中,增强特征学习并显式捕捉类人注意力语义;3)设计分层融合策略(HFS),动态校准几何与语义特征在不同抽象层级的重要性,实现跨模态互补信息的有效协同利用。该方法显著提升了感知、预测、规划及视觉问答等任务性能,在nuScenes数据集上建立了新的基准。

链接: https://arxiv.org/abs/2509.19973
作者: Pei Liu,Hongliang Lu,Haichao Liu,Haipeng Liu,Xin Liu,Ruoyu Yao,Shengbo Eben Li,Jun Ma
机构: The Hong Kong University of Science and Technology, China (香港科技大学); Li Auto Inc. (小鹏汽车); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
zh

[CV-40] SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding WACV2026

【速读】:该论文旨在解决现有音频驱动人脸生成方法在情感表达真实性和动态属性建模方面的局限性:一方面,多数方法仅依赖单一模态(如音频或图像)进行情感嵌入,难以捕捉细腻的情感线索;另一方面,模型通常仅以单张参考图像为条件,无法有效表征随时间变化的动作或属性。解决方案的关键在于提出SynchroRaMa框架,其核心创新包括:1)构建多模态情感嵌入机制,融合文本情感分析(sentiment analysis)与语音情绪识别及音频衍生的情绪维度(valence-arousal features),提升表情的真实性与丰富性;2)引入音频到运动(audio-to-motion, A2M)模块,实现头部动作与输入音频的精准同步;3)利用大语言模型(Large Language Model, LLM)生成场景描述作为额外文本输入,增强对动态行为和高层语义特征的建模能力,从而提升视频的时间一致性和视觉真实性。

链接: https://arxiv.org/abs/2509.19965
作者: Phyo Thet Yee,Dimitrios Kollias,Sudeepta Mishra,Abhinav Dhall
机构: IIT Ropar, India(印度理工学院拉普尔分校); Queen Mary University of London, UK(伦敦玛丽女王大学); Monash University, Australia(莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026, project page : this https URL

点击查看摘要

Abstract:Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model’s ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Our project page is available at this https URL.
zh

[CV-41] When Words Cant Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset

【速读】:该论文旨在解决用户在投诉表达中面临的挑战,即用户常难以通过文字清晰描述问题,但可通过视频直观展示产品缺陷(如“最差产品”文本搭配5秒损坏耳机视频)。为此,论文提出了一项新任务——从视频生成投诉描述(Complaint Description from Videos, CoD-V),以帮助用户更有效地表达诉求。解决方案的关键在于构建了一个包含1,175条视频投诉及其对应描述的多模态数据集ComVID,并引入了新的评估指标“投诉保留率”(Complaint Retention, CR),用于区分该任务与标准视频摘要生成或描述任务。此外,研究还提出了一个嵌入检索增强生成(Retrieval-Augmented Generation, RAG)机制的VideoLLaMA2-7b模型,能够结合用户情绪状态生成更具针对性和共情能力的投诉文本,从而提升投诉表达的质量与有效性。

链接: https://arxiv.org/abs/2509.19952
作者: Sarmistha Das,R E Zera Marveen Lyngkhoi,Kirtan Jain,Vinayak Goyal,Sriparna Saha,Manish Gupta
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While there exists a lot of work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as `worst product’ paired with a 5-second video depicting a broken headphone with the right earcup). This paper formulates a new task in the field of complaint mining to aid the common users’ need to write an expressive complaint, which is Complaint Description from Videos (CoD-V) (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos and the corresponding descriptions, also annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that discriminates the proposed (CoD-V) task against standard video summary generation and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG) embedded VideoLLaMA2-7b model, designed to generate complaints while accounting for the user’s emotional state. We conduct a comprehensive evaluation of several Video Language Models on several tasks (pre-trained and fine-tuned versions) with a range of established evaluation metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction to provide a platform for users to express complaints through video. Dataset and resources are available at: this https URL.
zh

[CV-42] Interpreting ResNet-based CLIP via Neuron-Attention Decomposition NEURIPS2025

【速读】:该论文旨在解决神经网络中可解释性不足的问题,特别是针对CLIP-ResNet模型中神经元贡献难以解析的难题。其核心挑战在于理解单个神经元如何通过复杂的计算路径影响最终输出,尤其是在图像-文本嵌入空间中的作用机制。解决方案的关键在于提出一种新颖的分解方法,将神经元对输出的贡献解耦为个体计算路径(即神经元与注意力头的配对),并发现这些配对在图像-文本嵌入空间中可近似为单一方向。这一洞察使得每个神经元-注意力头配对能够被关联到具体的文本语义,并进一步识别出稀疏但关键的贡献单元,从而实现无需训练的语义分割和数据分布偏移监测。

链接: https://arxiv.org/abs/2509.19943
作者: Edmund Bu,Yossi Gandelsman
机构: UC San Diego (加州大学圣地亚哥分校); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Workshop on Mechanistic Interpretability

点击查看摘要

Abstract:We present a novel technique for interpreting the neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the following attention heads of CLIP’s attention-pooling layer. We find that these neuron-head pairs can be approximated by a single direction in CLIP-ResNet’s image-text embedding space. Leveraging this insight, we interpret each neuron-head pair by associating it with text. Additionally, we find that only a sparse set of the neuron-head pairs have a significant contribution to the output value, and that some neuron-head pairs, while polysemantic, represent sub-concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training-free semantic segmentation, outperforming previous methods for CLIP-ResNet. Second, we utilize the contributions of neuron-head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.
zh

[CV-43] AJAHR: Amputated Joint Aware 3D Human Mesh Recovery

【速读】:该论文旨在解决现有三维人体网格重建方法在应对截肢人群时存在的偏差问题,因其普遍假设标准人体结构而忽视了肢体缺失等多样化的解剖状况,且相关训练数据稀缺。解决方案的关键在于提出一种自适应姿态估计框架——Amputated Joint Aware 3D Human Mesh Recovery (AJAHR),其核心创新包括:1)集成一个与网格恢复网络联合训练的肢体缺失分类器,用于检测潜在截肢情况;2)构建合成数据集Amputee 3D (A3D),覆盖多种截肢姿势以增强模型鲁棒性。该方法在保持对非截肢个体性能的同时,显著提升了截肢个体的网格重建效果。

链接: https://arxiv.org/abs/2509.19939
作者: Hyunjin Cho,Giyun Choi,Jongwon Choi
机构: Chung-Ang University (中央大学); Korea Institute of Industrial Technology (KITECH) (韩国工业技术研究院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8pages, Project Page: this https URL

点击查看摘要

Abstract:Existing human mesh recovery methods assume a standard human body structure, overlooking diverse anatomical conditions such as limb loss. This assumption introduces bias when applied to individuals with amputations - a limitation further exacerbated by the scarcity of suitable datasets. To address this gap, we propose Amputated Joint Aware 3D Human Mesh Recovery (AJAHR), which is an adaptive pose estimation framework that improves mesh reconstruction for individuals with limb loss. Our model integrates a body-part amputation classifier, jointly trained with the mesh recovery network, to detect potential amputations. We also introduce Amputee 3D (A3D), which is a synthetic dataset offering a wide range of amputee poses for robust training. While maintaining competitive performance on non-amputees, our approach achieves state-of-the-art results for amputated individuals. Additional materials can be found at the project webpage.
zh

[CV-44] GS-RoadPatching: Inpainting Gaussians via 3D Searching and Placing for Driving Scenes

【速读】:该论文旨在解决自动驾驶场景中缺失区域的高质量修复问题,即在图像或点云数据存在遮挡或损坏时,如何实现结构合理且视觉一致的场景补全。现有基于2D扩散模型或生成对抗网络(GAN)的方法受限于单视角信息,难以保证多视图一致性,并需频繁重新训练3D高斯溅射(3D Gaussian Splatting, 3DGS)模型以适应新场景。其解决方案的关键在于利用3DGS模态本身进行替代式补全(substitutional inpainting),通过构建嵌入特征的3DGS场景并设计多尺度局部上下文抽象方法,结合结构化搜索机制在3D空间中高效定位候选补丁,最终采用简单的替换与融合优化策略提升整体视觉和谐度。这一方法摆脱了对2D跨模态时空一致性的依赖,显著减少重训练开销,同时在驾驶场景及通用场景下均展现出优越性能。

链接: https://arxiv.org/abs/2509.19937
作者: Guo Chen,Jiarun Liu,Sicong Du,Chenming Wu,Deqi Li,Shi-Sheng Huang,Guofeng Zhang,Sheng Yang
机构: Beijing Normal University (北京师范大学); Alibaba (阿里巴巴); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents GS-RoadPatching, an inpainting method for driving scene completion by referring to completely reconstructed regions, which are represented by 3D Gaussian Splatting (3DGS). Unlike existing 3DGS inpainting methods that perform generative completion relying on 2D perspective-view-based diffusion or GAN models to predict limited appearance or depth cues for missing regions, our approach enables substitutional scene inpainting and editing directly through the 3DGS modality, extricating it from requiring spatial-temporal consistency of 2D cross-modals and eliminating the need for time-intensive retraining of Gaussians. Our key insight is that the highly repetitive patterns in driving scenes often share multi-modal similarities within the implicit 3DGS feature space and are particularly suitable for structural matching to enable effective 3DGS-based substitutional inpainting. Practically, we construct feature-embedded 3DGS scenes to incorporate a patch measurement method for abstracting local context at different scales and, subsequently, propose a structural search method to find candidate patches in 3D space effectively. Finally, we propose a simple yet effective substitution-and-fusion optimization for better visual harmony. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness and efficiency of our proposed method in driving scenes, and the results validate that our method achieves state-of-the-art performance compared to the baseline methods in terms of both quality and interoperability. Additional experiments in general scenes also demonstrate the applicability of the proposed 3D inpainting strategy. The project page and code are available at: this https URL
zh

[CV-45] CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation

【速读】:该论文旨在解决实时眼动估计(gaze estimation)中精度与效率难以兼顾的问题,尤其是在复杂场景下模型泛化能力不足的挑战。其解决方案的关键在于提出CapStARE架构,该架构采用模块化设计:以ConvNeXt作为主干网络提取空间特征,通过注意力路由机制实现胶囊(capsule)形成以支持高效的局部-整体推理,同时引入双门控循环单元(dual GRU)解码器分别建模缓慢和快速的眼动动态特性,从而实现解耦的时间建模。此设计不仅在ETH-XGaze和MPIIFaceGaze等基准上达到SOTA性能(分别为3.36°和2.65°),且保持10ms以内推理延迟,同时在Gaze360和RT-GENE等开放场景中展现出优异的泛化能力和更高的可解释性。

链接: https://arxiv.org/abs/2509.19936
作者: Miren Samaniego,Igor Rodriguez,Elena Lazkano
机构: University of the Basque Country (巴斯克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference ( 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found on: this https URL
zh

[CV-46] Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering

【速读】:该论文旨在解决复杂场景下航拍图像与地面图像之间可靠特征匹配困难的问题,这一问题严重限制了三维建模的精度与鲁棒性。其核心解决方案是通过生成中间视图(intermediate views)来缓解因视角变化过大导致的透视畸变,从而提升匹配可靠性。具体而言,首先利用航拍图像构建稀疏三维模型(sparse model),随后采用3D高斯点绘(3D Gaussian Splatting, 3DGS)进行高质量场景渲染,并设计了一种基于航拍相机位姿的视图确定算法以生成高质量中间视图;最终借助这些中间视图实现从渲染-航拍和渲染-地面图像对中提取可靠特征匹配,并通过中间视图传递对应关系完成最终匹配。该方法显著提升了初始匹配数和精化匹配数,为后续的增量式结构光恢复(ISfM)重建和基于3DGS的场景渲染提供了充足且可靠的匹配支持。

链接: https://arxiv.org/abs/2509.19898
作者: Jiangxue Yu,Hui Wang,San Jiang,Xing Zhang,Dejin Zhang,Qingquan Li
机构: 深圳大学(Shenzhen University); 深圳大学(Shenzhen University); 深圳大学(Shenzhen University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The integration of aerial and ground images has been a promising solution in 3D modeling of complex scenes, which is seriously restricted by finding reliable correspondences. The primary contribution of this study is a feature matching algorithm for aerial and ground images, whose core idea is to generate intermediate views to alleviate perspective distortions caused by the extensive viewpoint changes. First, by using aerial images only, sparse models are reconstructed through an incremental SfM (Structure from Motion) engine due to their large scene coverage. Second, 3D Gaussian Splatting is then adopted for scene rendering by taking as inputs sparse points and oriented images. For accurate view rendering, a render viewpoint determination algorithm is designed by using the oriented camera poses of aerial images, which is used to generate high-quality intermediate images that can bridge the gap between aerial and ground images. Third, with the aid of intermediate images, reliable feature matching is conducted for match pairs from render-aerial and render-ground images, and final matches can be generated by transmitting correspondences through intermediate views. By using real aerial and ground datasets, the validation of the proposed solution has been verified in terms of feature matching and scene rendering and compared comprehensively with widely used methods. The experimental results demonstrate that the proposed solution can provide reliable feature matches for aerial and ground images with an obvious increase in the number of initial and refined matches, and it can provide enough matches to achieve accurate ISfM reconstruction and complete 3DGS-based scene rendering.
zh

[CV-47] Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network

【速读】:该论文旨在解决细胞图像表型表示学习中面临的生物意义不明确和批次效应(batch effect)干扰的问题,尤其是在数据量有限或模型参数规模受限的情况下难以提取鲁棒且具细粒度形态信息的特征。其解决方案的关键在于提出一种名为跨孔对齐掩码孪生网络(Cross-Well Aligned Masked Siamese Network, CWA-MSN)的新颖表示学习框架,通过在不同孔位(well)中对同一扰动处理的细胞嵌入进行对齐,强制实现语义一致性,从而有效缓解批次效应影响,同时保持数据与参数效率。该方法在基因-基因关系检索基准测试中显著优于当前最优的自监督(OpenPhenom)和对比学习(CellCLIP)方法,在训练数据仅为后者0.2M图像、模型参数仅22M的情况下,分别提升性能+29%和+9%。

链接: https://arxiv.org/abs/2509.19896
作者: Pin-Jui Huang,Yu-Hsuan Liao,SooHeon Kim,NoSeong Park,JongBae Park,DongMyung Shin
机构: SGSC; PanThera; KAIST; Kyung Hee University; RadiSen Co. Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures, reference 4 pages

点击查看摘要

Abstract:Computational models that predict cellular phenotypic responses to chemical and genetic perturbations can accelerate drug discovery by prioritizing therapeutic hypotheses and reducing costly wet-lab iteration. However, extracting biologically meaningful and batch-robust cell painting representations remains challenging. Conventional self-supervised and contrastive learning approaches often require a large-scale model and/or a huge amount of carefully curated data, still struggling with batch effects. We present Cross-Well Aligned Masked Siamese Network (CWA-MSN), a novel representation learning framework that aligns embeddings of cells subjected to the same perturbation across different wells, enforcing semantic consistency despite batch effects. Integrated into a masked siamese architecture, this alignment yields features that capture fine-grained morphology while remaining data- and parameter-efficient. For instance, in a gene-gene relationship retrieval benchmark, CWA-MSN outperforms the state-of-the-art publicly available self-supervised (OpenPhenom) and contrastive learning (CellCLIP) methods, improving the benchmark scores by +29% and +9%, respectively, while training on substantially fewer data (e.g., 0.2M images for CWA-MSN vs. 2.2M images for OpenPhenom) or smaller model size (e.g., 22M parameters for CWA-MSN vs. 1.48B parameters for CellCLIP). Extensive experiments demonstrate that CWA-MSN is a simple and effective way to learn cell image representation, enabling efficient phenotype modeling even under limited data and parameter budgets.
zh

[CV-48] Generalized Shortest Path-based Superpixels for 3D Spherical Image Segmentation

【速读】:该论文旨在解决传统超像素(superpixel)分割方法在处理360°球面或全向图像时存在的几何失真与分割精度不足的问题。现有方法多针对标准2D平面图像设计,无法有效适应球面成像空间的非欧几里得特性,导致分割结果形状不规则且准确性受限。解决方案的关键在于提出一种基于球面最短路径(Spherical Shortest Path-based Superpixels, SphSPS)的新颖超像素生成方法,该方法显式考虑了3D球面采集空间的几何结构,并将像素到超像素中心的最短路径概念推广至球面空间,从而高效提取具有高准确性和良好形状规则性的聚类特征。此外,作者还扩展了全局规则性度量以适用于球面空间,弥补了现有球面紧凑性指标的局限性,使方法在真实和合成的360°图像数据集上均显著优于平面及球面领域的最新技术。

链接: https://arxiv.org/abs/2509.19895
作者: Rémi Giraud,Rodrigo Borba Pinheiro,Yannick Berthoumieu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The growing use of wide angle image capture devices and the need for fast and accurate image analysis in computer visions have enforced the need for dedicated under-representation approaches. Most recent decomposition methods segment an image into a small number of irregular homogeneous regions, called superpixels. Nevertheless, these approaches are generally designed to segment standard 2D planar images, i.e., captured with a 90o angle view without distortion. In this work, we introduce a new general superpixel method called SphSPS (for Spherical Shortest Path-based Superpixels)1 , dedicated to wide 360o spherical or omnidirectional images. Our method respects the geometry of the 3D spherical acquisition space and generalizes the notion of shortest path between a pixel and a superpixel center, to fastly extract relevant clustering features. We demonstrate that considering the geometry of the acquisition space to compute the shortest path enables to jointly improve the segmentation accuracy and the shape regularity of superpixels. To evaluate this regularity aspect, we also generalize a global regularity metric to the spherical space, addressing the limitations of the only existing spherical compactness measure. Finally, the proposed SphSPS method is validated on the reference 360o spherical panorama segmentation dataset and on synthetic road omnidirectional images. Our method significantly outperforms both planar and spherical state-of-the-art approaches in terms of segmentation accuracy,robustness to noise and regularity, providing a very interesting tool for superpixel-based applications on 360o images.
zh

[CV-49] Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection

【速读】:该论文旨在解决传统目标检测方法在低光照和严重遮挡等复杂场景下因缺乏高层语义理解而导致性能下降的问题。其解决方案的关键在于提出一种基于自适应引导的语义增强边缘-云协同目标检测方法,利用多模态大语言模型(Multimodal Large Language Models, MLLM)生成结构化场景描述,并设计自适应映射机制将语义信息动态转化为参数调整信号以实现实时语义增强;同时,在边缘-云协同推理框架中根据置信度分数自动选择调用云端语义引导或直接输出边缘检测结果,从而在保证检测精度的同时显著降低延迟(超过79%)和计算成本(超过70%)。

链接: https://arxiv.org/abs/2509.19875
作者: Yunqing Hu,Zheming Yang,Chang Zhao,Wen Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional object detection methods face performance degradation challenges in complex scenarios such as low-light conditions and heavy occlusions due to a lack of high-level semantic understanding. To address this, this paper proposes an adaptive guidance-based semantic enhancement edge-cloud collaborative object detection method leveraging Multimodal Large Language Models (MLLM), achieving an effective balance between accuracy and efficiency. Specifically, the method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts semantic information into parameter adjustment signals for edge detectors, achieving real-time semantic enhancement. Within an edge-cloud collaborative inference framework, the system automatically selects between invoking cloud-based semantic guidance or directly outputting edge detection results based on confidence scores. Experiments demonstrate that the proposed method effectively enhances detection accuracy and efficiency in complex scenes. Specifically, it can reduce latency by over 79% and computational cost by 70% in low-light and highly occluded scenes while maintaining accuracy.
zh

[CV-50] FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人应用中面临的安全性问题,特别是其对对抗攻击的脆弱性。研究发现,存在一种关键的对抗漏洞——对抗图像可使VLA模型“冻结”,导致其忽略后续指令,从而造成机器人在关键时刻无法执行动作,形成数字心智与物理行为的脱节。解决方案的核心是提出 FreezeVLA 攻击框架,通过最小-最大双层优化策略系统地生成和评估此类动作冻结攻击,实验证明其平均攻击成功率高达 76.2%,且生成的对抗样本具有强迁移性,能跨多种语言提示引发模型瘫痪,揭示了VLA模型亟需强化的防御机制。

链接: https://arxiv.org/abs/2509.19870
作者: Xin Wang,Jie Li,Zejia Weng,Yixu Wang,Yifeng Gao,Tianyu Pang,Chao Du,Yan Teng,Yingchun Wang,Zuxuan Wu,Xingjun Ma,Yu-Gang Jiang
机构: Fudan University (复旦大学); Shanghai AI Lab (上海人工智能实验室); Sea AI Lab (Sea人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models are driving rapid progress in robotics by enabling agents to interpret multimodal inputs and execute complex, long-horizon tasks. However, their safety and robustness against adversarial attacks remain largely underexplored. In this work, we identify and formalize a critical adversarial vulnerability in which adversarial images can “freeze” VLA models and cause them to ignore subsequent instructions. This threat effectively disconnects the robot’s digital mind from its physical actions, potentially inducing inaction during critical interventions. To systematically study this vulnerability, we propose FreezeVLA, a novel attack framework that generates and evaluates action-freezing attacks via min-max bi-level optimization. Experiments on three state-of-the-art VLA models and four robotic benchmarks show that FreezeVLA attains an average attack success rate of 76.2%, significantly outperforming existing methods. Moreover, adversarial images generated by FreezeVLA exhibit strong transferability, with a single image reliably inducing paralysis across diverse language prompts. Our findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms.
zh

[CV-51] PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents

【速读】:该论文旨在解决当前具身智能(Embodied AI)在真实人类中心场景(如家庭环境)中部署时面临的挑战,即如何有效建模个体用户偏好与行为以实现个性化交互。其核心解决方案是提出PersONAL(PERSonalized Object Navigation And Localization)基准,这是一个包含30余个照片级真实感住宅场景、超过2000个高质量任务episode的综合性评测平台,每个episode均明确标注对象与其所有者之间的语义关联,要求智能体能够基于自然语言查询(如“找到Lily的背包”)完成对象识别、导航与定位任务。该基准支持两种评估模式:未见过环境中的主动导航和已映射场景中的物体定位,实验表明现有最先进方法与人类表现之间存在显著差距,凸显了未来具身智能系统需具备感知、推理和记忆个性化信息的能力,从而推动面向现实世界辅助机器人应用的发展。

链接: https://arxiv.org/abs/2509.19843
作者: Filippo Ziliotto,Jelin Raphael Akkara,Alessandro Daniele,Lamberto Ballan,Luciano Serafini,Tommaso Campari
机构: University of Padova (帕多瓦大学); Fondazione Bruno Kessler (FBK) (布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization, a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as “find Lily’s backpack”. PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning, and memorizing over personalized information; paving the way towards real-world assistive robot.
zh

[CV-52] hinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection

【速读】:该论文旨在解决AI生成图像(AI-generated images)日益逼真所引发的虚假信息传播和隐私侵犯问题,核心挑战在于现有检测方法多依赖二分类且缺乏可解释性,或过度依赖监督微调导致泛化能力有限。解决方案的关键在于提出ThinkFake框架,其创新性地结合了多模态大语言模型(Multimodal Large Language Model, MLLM)与基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习机制,通过设计伪造推理提示(forgery reasoning prompt)和精细化奖励函数,使模型能够进行分步推理并输出结构化的可解释检测结果,从而实现高准确率与强零样本泛化能力。

链接: https://arxiv.org/abs/2509.19841
作者: Tai-Ming Huang,Wei-Tung Lin,Kai-Lung Hua,Wen-Huang Cheng,Junichi Yamagishi,Jun-Cheng Chen
机构: National Taiwan University (国立台湾大学); National Chengchi University (国立政治大学); Kyoto University (京都大学); Academia Sinica (中央研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations, highlighting the urgent need for accurate and interpretable detection methods. While existing approaches have made progress, most rely on binary classification without explanations or depend heavily on supervised fine-tuning, resulting in limited generalization. In this paper, we propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. Our method leverages a Multimodal Large Language Model (MLLM) equipped with a forgery reasoning prompt and is trained using Group Relative Policy Optimization (GRPO) reinforcement learning with carefully designed reward functions. This design enables the model to perform step-by-step reasoning and produce interpretable, structured outputs. We further introduce a structured detection pipeline to enhance reasoning quality and adaptability. Extensive experiments show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark. These results validate our framework’s effectiveness and robustness. Code will be released upon acceptance.
zh

[CV-53] Adaptive Model Ensemble for Continual Learning

【速读】:该论文旨在解决持续学习(continual learning)中模型集成(model ensemble)方法存在的知识冲突问题,即在任务层面和层面上因不同任务知识相互干扰而导致旧任务性能下降和新任务学习效率受限的问题。解决方案的关键在于提出一种基于元学习的混合权重生成器(meta-weight-ensembler),通过元学习训练一个混合系数生成器,为每个任务动态生成适配的混合系数以缓解任务级知识冲突;同时,针对每一层独立生成混合系数,以应对层级知识冲突,从而实现对不同任务知识的自适应融合,提升模型在旧任务与新任务上的综合学习能力。

链接: https://arxiv.org/abs/2509.19819
作者: Yuchuan Mao,Zhi Gao,Xiaomeng Fan,Yuwei Wu,Yunde Jia,Chenchen Jing
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Model ensemble is an effective strategy in continual learning, which alleviates catastrophic forgetting by interpolating model parameters, achieving knowledge fusion learned from different tasks. However, existing model ensemble methods usually encounter the knowledge conflict issue at task and layer levels, causing compromised learning performance in both old and new tasks. To solve this issue, we propose meta-weight-ensembler that adaptively fuses knowledge of different tasks for continual learning. Concretely, we employ a mixing coefficient generator trained via meta-learning to generate appropriate mixing coefficients for model ensemble to address the task-level knowledge conflict. The mixing coefficient is individually generated for each layer to address the layer-level knowledge conflict. In this way, we learn the prior knowledge about adaptively accumulating knowledge of different tasks in a fused model, achieving efficient learning in both old and new tasks. Meta-weight-ensembler can be flexibly combined with existing continual learning methods to boost their ability of alleviating catastrophic forgetting. Experiments on multiple continual learning datasets show that meta-weight-ensembler effectively alleviates catastrophic forgetting and achieves state-of-the-art performance.
zh

[CV-54] StrCGAN: A Generative Framework for Stellar Image Restoration

【速读】:该论文旨在解决小口径望远镜观测图像因分辨率和质量受限而导致的天体图像重建难题,尤其关注如何从低分辨率图像中恢复出高保真度、物理一致性的星系与恒星结构。其解决方案的关键在于提出StrCGAN(Stellar Cyclic GAN)模型,通过三项核心创新实现:引入3D卷积层以捕捉体积空间相关性,增强对天体形态的三维结构建模能力;采用多光谱融合机制对齐光学与近红外(NIR)波段数据,提升跨谱段一致性;以及设计天体物理正则化模块,在训练过程中利用多任务全天巡天数据作为真实参考,确保重建结果在形态学上符合天体物理学规律。这些改进使StrCGAN在视觉清晰度和物理合理性上均显著优于传统生成对抗网络(GAN)模型。

链接: https://arxiv.org/abs/2509.19805
作者: Shantanusinh Parmar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
备注:

点击查看摘要

Abstract:We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity ground truth-like representations of celestial objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but are restricted to 2D mappings and often distort the morphology of stars and galaxies. To overcome these limitations, we extend the CycleGAN framework with three key innovations: 3D convolutional layers to capture volumetric spatial correlations, multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate reconstructions that are not only visually sharper but also physically consistent, outperforming standard GAN models in the task of astrophysical image enhancement.
zh

[CV-55] BiTAA: A Bi-Task Adversarial Attack for Object Detection and Depth Estimation via 3D Gaussian Splatting

【速读】:该论文旨在解决自动驾驶中基于摄像头的多任务感知系统(特别是目标检测与单目深度估计)在面对对抗攻击时存在的脆弱性问题,现有方法大多局限于单一任务、缺乏对深度偏移的可控性控制,且缺少标准化的跨任务迁移评估机制。解决方案的关键在于提出BiTAA(Bi-Task Adversarial Attack),其核心创新是基于3D高斯点绘(3D Gaussian Splatting)构建一个统一的双任务对抗攻击框架,通过设计复合损失函数将检测抑制与受控符号深度偏移(log-depth bias)耦合于感兴趣区域(ROIs),实现对近处或远处感知错误的精确调控;同时支持全图和局部补丁两种攻击场景,并引入期望-变换(Expectation-over-Transformation, EOT)以增强物理世界可行性,从而系统性揭示了检测与深度任务间的不对称迁移效应,为多任务感知系统的鲁棒性研究提供了新范式。

链接: https://arxiv.org/abs/2509.19793
作者: Yixun Zhang,Feng Zhou,Jianqin Yin
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Intend to submit to RA-L

点击查看摘要

Abstract:Camera-based perception is critical to autonomous driving yet remains vulnerable to task-specific adversarial manipulations in object detection and monocular depth estimation. Most existing 2D/3D attacks are developed in task silos, lack mechanisms to induce controllable depth bias, and offer no standardized protocol to quantify cross-task transfer, leaving the interaction between detection and depth underexplored. We present BiTAA, a bi-task adversarial attack built on 3D Gaussian Splatting that yields a single perturbation capable of simultaneously degrading detection and biasing monocular depth. Specifically, we introduce a dual-model attack framework that supports both full-image and patch settings and is compatible with common detectors and depth estimators, with optional expectation-over-transformation (EOT) for physical reality. In addition, we design a composite loss that couples detection suppression with a signed, magnitude-controlled log-depth bias within regions of interest (ROIs) enabling controllable near or far misperception while maintaining stable optimization across tasks. We also propose a unified evaluation protocol with cross-task transfer metrics and real-world evaluations, showing consistent cross-task degradation and a clear asymmetry between Det to Depth and from Depth to Det transfer. The results highlight practical risks for multi-task camera-only perception and motivate cross-task-aware defenses in autonomous driving scenarios.
zh

[CV-56] EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction

【速读】:该论文旨在解决在资源受限的边缘设备上实现高质量高动态范围(High Dynamic Range, HDR)成像的问题,这一问题直接影响智能监控和自动驾驶等下游任务的性能。现有多曝光融合(Multi-Exposure Fusion, MEF)方法普遍存在计算开销大和鬼影伪影(ghosting artifacts)严重两大瓶颈,难以部署于边缘场景。解决方案的关键在于提出一种轻量级视觉Transformer架构,通过三个核心设计实现优化:1)引入上下文感知的Vision Transformer基础结构,并将输入图像转换至YCbCr色彩空间以分离亮度与色度信息;2)设计交集感知自适应融合(Intersection-Aware Adaptive Fusion, IAAF)模块有效抑制鬼影;3)采用逆残差嵌入(Inverted Residual Embedding, IRE)、动态Tanh激活函数(Dynamic Tanh, DyT)以及增强多尺度空洞卷积(Enhanced Multi-Scale Dilated Convolution, E-MSDC)从多个层面降低计算复杂度。最终构建了主版本(注重视觉质量)与轻量版本(侧重效率),二者均实现了性能与图像质量的良好平衡,实验表明主版本在CPU上FLOPS减少约67%、推理速度提升超五倍,在边缘设备上提速2.5倍,验证了方法在多种动态场景下的高效性与实用性。

链接: https://arxiv.org/abs/2509.19779
作者: Yu-Shen Huang,Tzu-Han Chen,Cheng-Yen Hsiao,Shaou-Gang Miaou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures

点击查看摘要

Abstract:Achieving high-quality High Dynamic Range (HDR) imaging on resource-constrained edge devices is a critical challenge in computer vision, as its performance directly impacts downstream tasks such as intelligent surveillance and autonomous driving. Multi-Exposure Fusion (MEF) is a mainstream technique to achieve this goal; however, existing methods generally face the dual bottlenecks of high computational costs and ghosting artifacts, hindering their widespread deployment. To this end, this study proposes a light-weight Vision Transformer architecture designed explicitly for HDR reconstruction to overcome these limitations. This study is based on the Context-Aware Vision Transformer and begins by converting input images to the YCbCr color space to separate luminance and chrominance information. It then employs an Intersection-Aware Adaptive Fusion (IAAF) module to suppress ghosting effectively. To further achieve a light-weight design, we introduce Inverted Residual Embedding (IRE), Dynamic Tanh (DyT), and propose Enhanced Multi-Scale Dilated Convolution (E-MSDC) to reduce computational complexity at multiple levels. Our study ultimately contributes two model versions: a main version for high visual quality and a light-weight version with advantages in computational efficiency, both of which achieve an excellent balance between performance and image quality. Experimental results demonstrate that, compared to the baseline, the main version reduces FLOPS by approximately 67% and increases inference speed by more than fivefold on CPU and 2.5 times on an edge device. These results confirm that our method provides an efficient and ghost-free HDR imaging solution for edge devices, demonstrating versatility and practicality across various dynamic scenarios.
zh

[CV-57] Sex-based Bias Inherent in the Dice Similarity Coefficient: A Model Independent Analysis for Multiple Anatomical Structures

【速读】:该论文试图解决的问题是:Dice Similarity Coefficient (DSC) 作为医学图像分割评估指标时,可能因器官大小的性别差异而引入系统性偏差,导致对不同性别的分割性能评价不公平。其关键解决方案在于通过理想化设置(即在50名参与者的MRI手动标注中施加等量的合成误差)量化了DSC及其归一化版本在性别间的差异,发现即使分割误差幅度相同,小到中等尺寸结构的DSC值在男女之间仍存在显著差异(平均达0.03),仅大型器官(如肺和肝)不受影响。这表明DSC本身会因器官体积差异而产生性别偏倚,而非模型行为所致,强调在公平性研究中需谨慎使用DSC,并应考虑采用更鲁棒的评估方法以避免误导性结论。

链接: https://arxiv.org/abs/2509.19778
作者: Hartmut Häntze,Myrthe Buser,Alessa Hering,Lisa C. Adams,Keno K. Bressem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Overlap-based metrics such as the Dice Similarity Coefficient (DSC) penalize segmentation errors more heavily in smaller structures. As organ size differs by sex, this implies that a segmentation error of equal magnitude may result in lower DSCs in women due to their smaller average organ volumes compared to men. While previous work has examined sex-based differences in models or datasets, no study has yet investigated the potential bias introduced by the DSC itself. This study quantifies sex-based differences of the DSC and the normalized DSC in an idealized setting independent of specific models. We applied equally-sized synthetic errors to manual MRI annotations from 50 participants to ensure sex-based comparability. Even minimal errors (e.g., a 1 mm boundary shift) produced systematic DSC differences between sexes. For small structures, average DSC differences were around 0.03; for medium-sized structures around 0.01. Only large structures (i.e., lungs and liver) were mostly unaffected, with sex-based DSC differences close to zero. These findings underline that fairness studies using the DSC as an evaluation metric should not expect identical scores between men and women, as the metric itself introduces bias. A segmentation model may perform equally well across sexes in terms of error magnitude, even if observed DSC values suggest otherwise. Importantly, our work raises awareness of a previously underexplored source of sex-based differences in segmentation performance. One that arises not from model behavior, but from the metric itself. Recognizing this factor is essential for more accurate and fair evaluations in medical image analysis.
zh

[CV-58] Logics-Parsing Technical Report

【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, LVLM)在处理复杂文档类型(如多栏报纸或海报)时,因缺乏显式的文档布局分析和阅读顺序推理阶段而导致的性能瓶颈问题。解决方案的关键在于提出一种基于强化学习增强的端到端文档解析模型Logics-Parsing,通过精心设计的奖励机制优化复杂布局分析与阅读顺序推断,并引入化学公式、手写中文字符等多样化数据进行监督微调以提升模型泛化能力。

链接: https://arxiv.org/abs/2509.19760
作者: Xiangyang Chen,Shuzhao Li,Xiuwen Zhu,Yongfan Chen,Fan Yang,Cheng Fang,Lin Qu,Xiaoxiao Xu,Hu Wei,Minggang Wu
机构: Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Large Vision-Language models (LVLM) have spurred significant progress in document parsing task. Compared to traditional pipeline-based methods, end-to-end paradigms have shown their excellence in converting PDF images into structured outputs through integrated Optical Character Recognition (OCR), table recognition, mathematical formula recognition and so on. However, the absence of explicit analytical stages for document layouts and reading orders limits the LVLM’s capability in handling complex document types such as multi-column newspapers or posters. To address this limitation, we propose in this report Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model’s versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments conducted on LogicsParsingBench have validated the efficacy and State-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios. Project Page: this https URL
zh

[CV-59] ExpFace: Exponential Angular Margin Loss for Deep Face Recognition

【速读】:该论文旨在解决人脸识别中因噪声样本干扰导致的判别能力下降问题,尤其针对传统基于角度的软最大损失函数(如SphereFace、CosFace和ArcFace)在训练过程中未能有效区分干净样本与噪声样本所引发的性能瓶颈。其解决方案的关键在于提出了一种新的指数角度边界损失函数(Exponential Angular Margin Loss, ExpFace),通过引入一个角度指数项作为边际惩罚机制,在角度空间中对中心区域的干净样本施加更大的惩罚力度,而对边缘区域的噪声样本施加较小的惩罚,从而增强模型对干净样本的聚类紧凑性并抑制噪声样本的影响。这一设计不仅提升了模型的鲁棒性和稳定性,还避免了SphereFace的训练不稳定性与ArcFace的非单调性问题,并在相似度曲线方面更贴合角度空间中的决策边界特性。

链接: https://arxiv.org/abs/2509.19753
作者: Jinhui Zheng,Xueyuan Gong
机构: Institution1; Institution2
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Face recognition is an open-set problem requiring high discriminative power to ensure that intra-class distances remain smaller than inter-class distances. Margin-based softmax losses, such as SphereFace, CosFace, and ArcFace, have been widely adopted to enhance intra-class compactness and inter-class separability, yet they overlook the impact of noisy samples. By examining the distribution of samples in the angular space, we observe that clean samples predominantly cluster in the center region, whereas noisy samples tend to shift toward the peripheral region. Motivated by this observation, we propose the Exponential Angular Margin Loss (ExpFace), which introduces an angular exponential term as the margin. This design applies a larger penalty in the center region and a smaller penalty in the peripheral region within the angular space, thereby emphasizing clean samples while suppressing noisy samples. We present a unified analysis of ExpFace and classical margin-based softmax losses in terms of margin embedding forms, similarity curves, and gradient curves, showing that ExpFace not only avoids the training instability of SphereFace and the non-monotonicity of ArcFace, but also exhibits a similarity curve that applies penalties in the same manner as the decision boundary in the angular space. Extensive experiments demonstrate that ExpFace achieves state-of-the-art performance. To facilitate future research, we have released the source code at: this https URL.
zh

[CV-60] alking Head Generation via AU-Guided Landmark Prediction

【速读】:该论文旨在解决音频驱动人脸视频生成中表达控制精度不足的问题,特别是现有方法依赖情感标签或隐式动作单元(Action Units, AUs)条件导致的表达细节不精确和物理合理性差的问题。解决方案的关键在于提出一个两阶段框架:第一阶段通过变分运动生成器将音频与AU强度显式映射为时序一致的2D面部关键点序列,实现基于物理基础的逐帧表情控制;第二阶段利用扩散模型根据这些关键点和参考图像合成高保真、唇音同步的视频,从而在表达准确性、时间稳定性和视觉真实感上显著优于现有方法。

链接: https://arxiv.org/abs/2509.19749
作者: Shao-Yu Chang,Jingyi Xu,Hieu Le,Dimitris Samaras
机构: Stony Brook University (纽约州立大学石溪分校); EPFL (瑞士联邦理工学院); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). Unlike prior methods relying on emotion labels or implicit AU conditioning, our model explicitly maps AUs to 2D facial landmarks, enabling physically grounded, per-frame expression control. In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image. This separation of motion and appearance improves expression accuracy, temporal stability, and visual realism. Experiments on the MEAD dataset show that our method outperforms state-of-the-art baselines across multiple metrics, demonstrating the effectiveness of explicit AU-to-landmark modeling for expressive talking head generation.
zh

[CV-61] nnFilterMatch: A Unified Semi-Supervised Learning Framework with Uncertainty-Aware Pseudo-Label Filtering for Efficient Medical Segmentation

【速读】:该论文旨在解决医学图像分割中标注数据稀缺导致的模型性能受限问题,同时克服传统半监督学习(Semi-supervised Learning, SSL)与主动学习(Active Learning, AL)混合方法因迭代重训练循环带来的高计算开销和临床可扩展性差的问题。其解决方案的关键在于提出一种名为nnFilterMatch的单次遍历深度分割框架,该框架在nnU-Net架构内融合了基于熵的伪标签过滤机制(FilterMatch),通过在训练过程中动态排除高置信度伪标签,实现无需重训练循环即可有效利用未标注数据,并保留不确定性引导的学习优势,从而在仅使用5%–20%标注数据的情况下达到或超越全监督模型的性能。

链接: https://arxiv.org/abs/2509.19746
作者: Yi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) has emerged as a promising paradigm in medical image segmentation, offering competitive performance while substantially reducing the need for extensive manual annotation. When combined with active learning (AL), these strategies further minimize annotation burden by selectively incorporating the most informative samples. However, conventional SSL_AL hybrid approaches often rely on iterative and loop-based retraining cycles after each annotation round, incurring significant computational overhead and limiting scalability in clinical applications. In this study, we present a novel, annotation-efficient, and self-adaptive deep segmentation framework that integrates SSL with entropy-based pseudo-label filtering (FilterMatch), an AL-inspired mechanism, within the single-pass nnU-Net training segmentation framework (nnFilterMatch). By selectively excluding high-confidence pseudo-labels during training, our method circumvents the need for retraining loops while preserving the benefits of uncertainty-guided learning. We validate the proposed framework across multiple clinical segmentation benchmarks and demonstrate that it achieves performance comparable to or exceeding fully supervised models, even with only 5%–20% labeled data. This work introduces a scalable, end-to-end learning strategy for reducing annotation demands in medical image segmentation without compromising accuracy. Code is available here: this https URL.
zh

[CV-62] Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation

【速读】:该论文旨在解决当前解耦式数据蒸馏(Decoupled Dataset Distillation)方法中因后评估协议不一致而导致性能比较不可靠的问题,从而阻碍了该领域的发展。其解决方案的关键在于提出了一种名为Rectified Decoupled Dataset Distillation (RD³) 的新方法,并系统性地分析不同后评估设置对测试准确率的影响;通过建立标准化的基准和严格的评估协议,揭示出多数性能差异源于评估流程的不一致性而非合成数据本身质量的优劣,进而提出通用有效的改进策略,为未来研究提供了公平、可复现的比较基础。

链接: https://arxiv.org/abs/2509.19743
作者: Xinhao Zhong,Shuoyang Sun,Xulin Gu,Chenyang Zhu,Bin Chen,Yaowei Wang
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Peng Cheng Laboratory (鹏城实验室); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead. To address this limitation, recent decoupled dataset distillation methods (e.g., SRe ^2 L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose Rectified Decoupled Dataset Distillation (RD ^3 ), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD ^3 provides a foundation for fair and reproducible comparisons in future dataset distillation research.
zh

[CV-63] Robust RGB-T Tracking via Learnable Visual Fourier Prompt Fine-tuning and Modality Fusion Prompt Generation

【速读】:该论文旨在解决现有基于参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的RGB-Thermal(RGB-T)跟踪方法仅依赖空间域信息作为提示,从而忽略频域信息在提示学习中重要作用的问题。其解决方案的关键在于提出一种高效的视觉傅里叶提示跟踪方法(Visual Fourier Prompt Tracking, VFPTrack),通过快速傅里叶变换(Fast Fourier Transform, FFT)提取频率域提示,并结合空间域提示以增强模态特征的全面理解;同时设计了一个模态融合提示生成器(Modality Fusion Prompt Generator),利用多模态特征融合生成双向交互提示,使每个模态都能与融合后的提示进行充分交互,从而实现跨模态特征的有效融合与增强。

链接: https://arxiv.org/abs/2509.19733
作者: Hongtao Yang,Bineng Zhong,Qihua Liang,Zhiruo Zhu,Yaozong Zheng,Ning Li
机构: Guangxi Normal University (广西师范大学); Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education (教育部教育区块链与智能技术重点实验室); Guangxi Key Lab of Multi-Source Information Mining and Security (广西多源信息挖掘与安全重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMM2025

点击查看摘要

Abstract:Recently, visual prompt tuning is introduced to RGB-Thermal (RGB-T) tracking as a parameter-efficient finetuning (PEFT) method. However, these PEFT-based RGB-T tracking methods typically rely solely on spatial domain information as prompts for feature extraction. As a result, they often fail to achieve optimal performance by overlooking the crucial role of frequency-domain information in prompt learning. To address this issue, we propose an efficient Visual Fourier Prompt Tracking (named VFPTrack) method to learn modality-related prompts via Fast Fourier Transform (FFT). Our method consists of symmetric feature extraction encoder with shared parameters, visual fourier prompts, and Modality Fusion Prompt Generator that generates bidirectional interaction prompts through multi-modal feature fusion. Specifically, we first use a frozen feature extraction encoder to extract RGB and thermal infrared (TIR) modality features. Then, we combine the visual prompts in the spatial domain with the frequency domain prompts obtained from the FFT, which allows for the full extraction and understanding of modality features from different domain information. Finally, unlike previous fusion methods, the modality fusion prompt generation module we use combines features from different modalities to generate a fused modality prompt. This modality prompt is interacted with each individual modality to fully enable feature interaction across different modalities. Extensive experiments conducted on three popular RGB-T tracking benchmarks show that our method demonstrates outstanding performance.
zh

[CV-64] CAMILA: Context-Aware Masking for Image Editing with Language Alignment

【速读】:该论文旨在解决文本引导图像编辑中因盲目执行所有用户指令(包括不可行或矛盾指令)而导致输出语义混乱的问题。解决方案的关键在于提出一种上下文感知的掩码机制(Context-Aware Masking, CAMILA),通过验证指令与图像之间的上下文一致性,仅对可执行的区域进行编辑,同时忽略非可行指令,从而在保持图像完整性的同时提升语义对齐度和编辑准确性。

链接: https://arxiv.org/abs/2509.19731
作者: Hyunseung Kim,Chiho Choi,Srikanth Malla,Sai Prahladh Padmanabhan,Saurabh Bagchi,Joon Hee Choi
机构: Purdue University (普渡大学); Samsung Semiconductor, USA (三星半导体美国公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-guided image editing has been allowing users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named as CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while ignoring non-executable instructions. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating the presence of infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.
zh

[CV-65] PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction ICCV2025

【速读】:该论文旨在解决复杂反射材质表面在实时虚拟现实中的高效形状重建问题,现有基于3D Gaussian Splatting(3DGS)的方法虽具备快速新视角渲染能力,但在处理具有复杂反射特性(如镜面反射)的表面时,重建质量显著低于隐式神经表示方法。解决方案的关键在于提出PolGS模型,通过将偏振约束(polarimetric constraints)引入3DGS框架,实现对镜面(specular)与漫反射(diffuse)成分的有效分离,从而在仅10分钟内完成高质量反射表面重建,显著提升对挑战性反射材质的恢复效果。

链接: https://arxiv.org/abs/2509.19726
作者: Yufei Han,Bowen Tie,Heng Guo,Youwei Lyu,Si Li,Boxin Shi,Yunpeng Jia,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Xiong’an Aerospace Information Research Institute (雄安航空航天信息研究院); State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University (北京大学计算机学院多媒体信息处理国家重点实验室); National Engineering Research Center of Visual Technology, School of Computer Science, Peking University (北京大学计算机学院视觉技术国家工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly in the case of recovering surfaces with complex reflective reflectance. To address these problems, we propose PolGS, a Polarimetric Gaussian Splatting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on the synthetic and real-world dataset validate the effectiveness of our method.
zh

[CV-66] Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation MICCAI2025

【速读】:该论文旨在解决语言引导的医学图像分割中因病灶形态复杂性和视觉-语言模态间语义鸿沟导致的视觉特征表示不足及语义无关信息干扰问题,从而影响分割性能。其解决方案的关键在于提出频率域多模态交互模型(FMISeg),通过在解码器阶段建立语言特征与频域视觉特征的双向交互机制:一方面引入频域特征双向交互(FFBI)模块以增强视觉表征,另一方面设计语言引导的频域特征交互(LFFI)模块,在语言信息指导下抑制语义无关的视觉特征,实现更精准的病灶区域分割。

链接: https://arxiv.org/abs/2509.19719
作者: Bo Yu,Jianhua Yang,Zetao Du,Yan Huang,Chenglong Li,Liang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2025

点击查看摘要

Abstract:Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.
zh

[CV-67] VIMD: Monocular Visual-Inertial Motion and Depth Estimation

【速读】:该论文旨在解决单目视觉-惯性系统中稠密度量深度估计(dense metric depth estimation)的准确性与效率问题,尤其在资源受限场景下如何实现高精度和鲁棒性的深度估计。其解决方案的关键在于提出了一种名为VIMD(Visual-Inertial Motion and Depth)的学习框架,该框架通过利用多视角信息迭代优化每个像素的尺度,而非像以往方法那样全局拟合一个不变的仿射模型(affine model),从而显著提升了深度估计的精度与适应性;此外,VIMD具有高度模块化设计,兼容多种现有深度估计骨干网络,并在TartanAir、VOID等数据集上验证了其零样本泛化能力,在仅需每帧10–20个度量深度点的情况下仍能保持优异性能,适用于边缘计算等资源受限环境。

链接: https://arxiv.org/abs/2509.19713
作者: Saimouli Katragadda,Guoquan Huang
机构: University of Delaware (德拉瓦大学); Google ARCore (谷歌ARCore)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework to estimate dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. At the core the proposed VIMD is to exploit multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in the prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization capabilities on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness, even with extremely sparse points as few as 10-20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource constrained settings, while its robust performance and strong generalization capabilities offer significant potential across a wide range of scenarios.
zh

[CV-68] owards Robust In-Context Learning for Medical Image Segmentation via Data Synthesis

【速读】:该论文旨在解决基于上下文学习(In-Context Learning, ICL)的通用医学图像分割任务中因数据稀缺性导致的性能瓶颈问题。现有数据合成方法难以同时实现高数据多样性与符合医学数据分布的特性,从而限制了ICL模型的泛化能力。解决方案的关键在于提出SynthICL框架,其核心创新是基于领域随机化(domain randomization)构建,通过引入真实数据中的解剖学先验(anatomical priors)确保合成数据的真实性,利用多样化的解剖结构覆盖广泛的数据分布,并显式建模个体间差异以生成适合ICL训练的数据队列(data cohorts)。实验证明,该方法显著提升了模型在未见解剖域上的分割性能,平均Dice系数最高提升63%。

链接: https://arxiv.org/abs/2509.19711
作者: Jiesi Hu,Yanwu Yang,Zhiyu Ye,Chenfei Ye,Hanyang Peng,Jianfeng Cao,Ting Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rise of In-Context Learning (ICL) for universal medical image segmentation has introduced an unprecedented demand for large-scale, diverse datasets for training, exacerbating the long-standing problem of data scarcity. While data synthesis offers a promising solution, existing methods often fail to simultaneously achieve both high data diversity and a domain distribution suitable for medical data. To bridge this gap, we propose \textbfSynthICL, a novel data synthesis framework built upon domain randomization. SynthICL ensures realism by leveraging anatomical priors from real-world datasets, generates diverse anatomical structures to cover a broad data distribution, and explicitly models inter-subject variations to create data cohorts suitable for ICL. Extensive experiments on four held-out datasets validate our framework’s effectiveness, showing that models trained with our data achieve performance gains of up to 63% in average Dice and substantially enhanced generalization to unseen anatomical domains. Our work helps mitigate the data bottleneck for ICL-based segmentation, paving the way for robust models. Our code and the generated dataset are publicly available at this https URL.
zh

[CV-69] Learning to Stop: Reinforcement Learning for Efficient Patient-Level Echocardiographic Classification MICCAI

【速读】:该论文旨在解决心脏超声图像(transthoracic echocardiographic)中多视图视频片段冗余与分类性能之间的矛盾问题:传统方法要么仅使用单一视图忽略互补信息,要么使用全部视图虽提升性能但计算开销大、难以临床部署。其解决方案的关键在于提出一种基于强化学习的动态剪枝策略和可学习的注意力融合机制——前者使智能体能自适应决定是否继续处理视图以降低疾病分类不确定性,直至达到足够置信度;后者则通过注意力机制灵活聚合多个视图的信息,从而在仅使用30%视频片段的情况下实现AUC 0.91的检测性能,优于全量片段及现有基准方法。

链接: https://arxiv.org/abs/2509.19694
作者: Woo-Jin Cho Kim,Jorge Oliveira,Arian Beqiri,Alex Thorley,Jordan Strom,Jamie O’Driscoll,Rajan Sharma,Jeremy Slivnick,Roberto Lang,Alberto Gomez,Agisilaos Chartsias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: published in MICCAI-ASMUS 2025

点击查看摘要

Abstract:Guidelines for transthoracic echocardiographic examination recommend the acquisition of multiple video clips from different views of the heart, resulting in a large number of clips. Typically, automated methods, for instance disease classifiers, either use one clip or average predictions from all clips. Relying on one clip ignores complementary information available from other clips, while using all clips is computationally expensive and may be prohibitive for clinical adoption. To select the optimal subset of clips that maximize performance for a specific task (image-based disease classification), we propose a method optimized through reinforcement learning. In our method, an agent learns to either keep processing view-specific clips to reduce the disease classification uncertainty, or stop processing if the achieved classification confidence is sufficient. Furthermore, we propose a learnable attention-based aggregation method as a flexible way of fusing information from multiple clips. The proposed method obtains an AUC of 0.91 on the task of detecting cardiac amyloidosis using only 30% of all clips, exceeding the performance achieved from using all clips and from other benchmarks. Comments: published in MICCAI-ASMUS 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.19694 [cs.CV] (or arXiv:2509.19694v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.19694 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-70] Anatomically Constrained Transformers for Cardiac Amyloidosis Classification MICCAI

【速读】:该论文旨在解决心脏淀粉样变性(Cardiac Amyloidosis, CA)的自动分类问题,尤其关注现有基于视频分类的神经网络模型(如卷积神经网络)无法确保其决策依据来源于临床已知的病理特征(如心肌全局纵向应变降低)这一局限性。解决方案的关键在于:通过引入解剖结构约束,将Transformer模型的输入限定在CA异常最常出现的心肌区域——具体表现为将心肌建模为一组可变形点及其对应的图像补丁,并作为输入token;同时,在自监督预训练阶段采用仅对解剖学相关补丁进行掩码重建的策略,从而确保模型学习到的表征聚焦于与CA相关的解剖区域。该方法不仅提升了分类性能,还提供了可解释性,可通过可视化Transformer注意力分数来定位关键心肌区域。

链接: https://arxiv.org/abs/2509.19691
作者: Alexander Thorley,Agis Chartsias,Jordan Strom,Roberto Lang,Jeremy Slivnick,Jamie O’Driscoll,Rajan Sharma,Dipak Kotecha,Jinming Duan,Alberto Gomez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in MICCAI - ASMUS 2025

点击查看摘要

Abstract:Cardiac amyloidosis (CA) is a rare cardiomyopathy, with typical abnormalities in clinical measurements from echocardiograms such as reduced global longitudinal strain of the myocardium. An alternative approach for detecting CA is via neural networks, using video classification models such as convolutional neural networks. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. An alternative paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur – the myocardium, which we embed as a set of deforming points and corresponding sampled image patches into input tokens. We show that our anatomical constraint can also be applied to the popular self-supervised learning masked autoencoder pre-training, where we propose to mask and reconstruct only anatomical patches. We show that by constraining both the transformer and pre-training task to the myocardium where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused on only anatomical regions of the echo, and enables us to visualize transformer attention scores over the deforming myocardium.
zh

[CV-71] From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition ICCV2025

【速读】:该论文旨在解决现有视频生成模型在处理渐进式属性变化(如颜色、形状等从初始状态到目标状态的平滑过渡)时存在的不一致性问题,尤其是在基于提示插值(prompt interpolation)的方法中,难以保持属性变化的连贯性与运动动态的一致性。解决方案的关键在于:在去噪过程中引入逐帧指导(frame-wise guidance),为每个噪声潜在表示构建特定于数据的过渡方向,从而实现从初始属性到最终属性的逐帧引导式转变,同时保留视频原有的运动动力学特性。这一方法显著提升了属性过渡的准确性与平滑度,并通过提出的可控属性过渡基准(CAT-Bench)和两个量化指标进行了系统评估。

链接: https://arxiv.org/abs/2509.19690
作者: Ling Lo,Kelvin C.K. Chan,Wen-Huang Cheng,Ming-Hsuan Yang
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); Google DeepMind (谷歌深度学习); National Taiwan University (国立台湾大学); UC Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions, through introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions. Code and CATBench are released: this https URL.
zh

[CV-72] Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies

【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在特征图中存在结构化噪声伪影(structured noise artifacts)的问题,这类伪影会显著影响下游任务如分割和深度估计的性能。解决方案的关键在于提出两种轻量级优化技术:结构化标记增强(Structured Token Augmentation, STA)与自适应噪声滤波(Adaptive Noise Filtering, ANF)。STA 通过在标记化过程中引入空间扰动来提升标记多样性,从而改善特征表示;ANF 则在 Transformer 层间嵌入可学习的去噪模块,实现对噪声的动态抑制。这两种方法均不依赖特定架构,在 ImageNet、Ade20k 和 NYUv2 等标准基准上验证了其有效性,显著提升了视觉质量和任务性能。

链接: https://arxiv.org/abs/2509.19687
作者: Sumit Mamtani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, accepted and presented at IEEE BDAI 2025. The final published version will be available on IEEE Xplore

点击查看摘要

Abstract:Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artifacts in their feature maps hinder downstream applications such as segmentation and depth estimation. We propose two novel and lightweight optimisation techniques- Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF)- to improve interpretability and mitigate these artefacts. STA enhances token diversity through spatial perturbations during tokenisation, while ANF applies learnable inline denoising between transformer layers. These methods are architecture-agnostic and evaluated across standard benchmarks, including ImageNet, Ade20k, and NYUv2. Experimental results show consistent improvements in visual quality and task performance, highlighting the practical effectiveness of our approach.
zh

[CV-73] C2Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning NEURIPS2025

【速读】:该论文旨在解决联邦持续学习(Federated Continual Learning, FCL)中因客户端间知识不一致导致的类级知识连贯性问题,具体表现为:(1)客户端间同类样本分布差异(intra-class distribution gap),削弱了提示(prompt)间语义一致性;(2)提示间跨类知识混淆(inter-prompt class-wise relevance),加剧了新旧任务间的干扰,从而同时恶化空间遗忘(spatial forgetting)与时间遗忘(temporal forgetting)。解决方案的关键在于提出一种类感知的客户端知识交互机制(Class-aware Client Knowledge Interaction, C² Prompt),其核心包括两个创新模块:一是局部类分布补偿机制(Local Class Distribution Compensation, LCDC),用于缩小客户端间同类数据分布差异,增强类内知识一致性;二是类感知提示聚合方案(Class-aware Prompt Aggregation, CPA),通过选择性强化类相关知识聚合,缓解跨类混淆。实验表明,该方法在多个FCL基准上达到最优性能。

链接: https://arxiv.org/abs/2509.19674
作者: Kunlun Xu,Yibo Feng,Jiangmeng Li,Yongsheng Qi,Jiahuan Zhou
机构: Wangxuan Institute of Computer Technology (王选计算机技术研究所); Peking University (北京大学); University of Chinese Academy of Sciences (中国科学院大学); Inner Mongolia University of Technology (内蒙古工业大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Federated continual learning (FCL) tackles scenarios of learning from continuously emerging task data across distributed clients, where the key challenge lies in addressing both temporal forgetting over time and spatial forgetting simultaneously. Recently, prompt-based FCL methods have shown advanced performance through task-wise prompt this http URL this study, we underscore that the existing prompt-based FCL methods are prone to class-wise knowledge coherence between prompts across clients. The class-wise knowledge coherence includes two aspects: (1) intra-class distribution gap across clients, which degrades the learned semantics across prompts, (2) inter-prompt class-wise relevance, which highlights cross-class knowledge confusion. During prompt communication, insufficient class-wise coherence exacerbates knowledge conflicts among new prompts and induces interference with old prompts, intensifying both spatial and temporal forgetting. To address these issues, we propose a novel Class-aware Client Knowledge Interaction (C ^2 Prompt) method that explicitly enhances class-wise knowledge coherence during prompt communication. Specifically, a local class distribution compensation mechanism (LCDC) is introduced to reduce intra-class distribution disparities across clients, thereby reinforcing intra-class knowledge consistency. Additionally, a class-aware prompt aggregation scheme (CPA) is designed to alleviate inter-class knowledge confusion by selectively strengthening class-relevant knowledge aggregation. Extensive experiments on multiple FCL benchmarks demonstrate that C ^2 Prompt achieves state-of-the-art performance. Our source code is available at this https URL
zh

[CV-74] Deep Learning for Clouds and Cloud Shadow Segmentation in Methane Satellite and Airborne Imaging Spectroscopy

【速读】:该论文旨在解决高空间分辨率遥感数据中云和云影检测的难题,这是准确反演大气甲烷(methane)等痕量气体浓度的关键前提。云和云影会显著干扰甲烷反演结果,影响排放量的定量精度,尤其对MethaneSAT及其机载伴随任务MethaneAIR尤为重要。解决方案的关键在于采用机器学习方法,对比传统技术(如迭代逻辑回归ILR和多层感知机MLP)与先进深度学习架构(UNet和光谱通道注意力网络SCAN),发现深度学习模型在保持空间结构(UNet)和捕捉边界细节(SCAN)方面表现更优;其中SCAN通过引入光谱注意力机制,在MethaneSAT数据上超越UNet,凸显了针对卫星特定特征设计光谱感知能力的重要性。

链接: https://arxiv.org/abs/2509.19665
作者: Manuel Perez-Carrasco,Maya Nasr,Sebastien Roche,Chris Chan Miller,Zhan Zhang,Core Francisco Park,Eleanor Walker,Cecilia Garraffo,Douglas Finkbeiner,Ritesh Gautam,Steven Wofsy
机构: Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学); Stanford University (斯坦福大学); University of California, Berkeley (加州大学伯克利分校); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Effective cloud and cloud shadow detection is a critical prerequisite for accurate retrieval of concentrations of atmospheric methane or other trace gases in hyperspectral remote sensing. This challenge is especially pertinent for MethaneSAT and for its airborne companion mission, MethaneAIR. In this study, we use machine learning methods to address the cloud and cloud shadow detection problem for sensors with these high spatial resolutions instruments. Cloud and cloud shadows in remote sensing data need to be effectively screened out as they bias methane retrievals in remote sensing imagery and impact the quantification of emissions. We deploy and evaluate conventional techniques including Iterative Logistic Regression (ILR) and Multilayer Perceptron (MLP), with advanced deep learning architectures, namely UNet and a Spectral Channel Attention Network (SCAN) method. Our results show that conventional methods struggle with spatial coherence and boundary definition, affecting the detection of clouds and cloud shadows. Deep learning models substantially improve detection quality: UNet performs best in preserving spatial structure, while SCAN excels at capturing fine boundary details. Notably, SCAN surpasses UNet on MethaneSAT data, underscoring the benefits of incorporating spectral attention for satellite specific features. This in depth assessment of various disparate machine learning techniques demonstrates the strengths and effectiveness of advanced deep learning architectures in providing robust, scalable solutions for clouds and cloud shadow screening towards enhancing methane emission quantification capacity of existing and next generation hyperspectral missions. Our data and code is publicly available at this https URL
zh

[CV-75] MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning

【速读】:该论文旨在解决少样本类增量学习(Few-Shot Class-Incremental Learning, FSCIL)中的双重挑战:在仅用少量样本学习新类的同时,保持对旧类知识的稳定性。现有方法虽通过冻结特征提取器和使用类平均原型来缓解灾难性遗忘与过拟合问题,但因新类样本稀缺导致其原型估计存在显著偏差,而基础类则因数据充足而表现稳定。论文的关键解决方案在于:首先,基于贝叶斯分析将新类先验与旧类统计特性对齐,以降低方差并提升原型准确性;其次,提出大规模对比学习增强跨类别特征紧致性,并结合动量自监督与虚拟类别机制,在动量紧致性与对比框架(MoTiC)中注入先验信息、丰富特征多样性,从而构建具有强类别区分度和鲁棒性的特征空间。

链接: https://arxiv.org/abs/2509.19664
作者: Zeyu He,Shuai Huang,Yuwu Lu,Ming Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Few-Shot Class-Incremental Learning (FSCIL) must contend with the dual challenge of learning new classes from scarce samples while preserving old class knowledge. Existing methods use the frozen feature extractor and class-averaged prototypes to mitigate against catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias due to extreme data scarcity, whereas base-class prototypes benefit from sufficient data. In this work, we theoretically demonstrate that aligning the new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. Furthermore, we propose large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information for new-class prototypes, we integrate momentum self-supervision and virtual categories into the Momentum Tightness and Contrast framework (MoTiC), constructing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks produce state-of-the-art performances, particularly on the fine-grained task CUB-200, validating our method’s ability to reduce estimation bias and improve incremental learning robustness.
zh

[CV-76] Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM -as-Judge Assessment NEURIPS2025

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)在处理包含年龄、性别、种族、服饰或职业等视觉线索的图像时,容易吸收并再现有害社会刻板印象的问题。其解决方案的关键在于构建了一个由1,343个图像-问题对组成的新闻图像基准测试集,这些数据来自多样化的媒体来源,并标注了真实答案及五个关键人口统计属性(年龄、性别、种族、职业和体育活动)。通过使用大语言模型(LLM)作为评判者并辅以人工验证,研究系统性评估了多种先进VLMs的偏见表现,揭示了视觉上下文对模型输出的系统性影响、不同属性与模型间的偏见差异(尤其是性别和职业),以及忠实度与偏见之间无必然负相关关系。该工作为公平、可复现的多模态模型评估提供了可扩展的工具和方法论支持。

链接: https://arxiv.org/abs/2509.19659
作者: Aravind Narayanan,Vahid Reza Khazaie,Shaina Raza
机构: Vector Institute for AI (向量人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025 Workshop (Evaluating the Evolving LLM Lifecycle)

点击查看摘要

Abstract:Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.
zh

[CV-77] he Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar

【速读】:该论文旨在解决高成本激光雷达(LiDAR)限制高级自动驾驶(AD)系统在量产车辆中广泛应用的问题,其核心目标是通过仅使用4D雷达数据生成高质量的类激光雷达三维点云(LiDAR-like 3D point clouds),从而降低感知系统的硬件成本。解决方案的关键在于利用神经网络架构设计,特别是通过优化分割主干网络(segmentation backbone)的容量,在保持时序一致性的同时提升生成点云的质量;研究发现,虽然过高的模型容量可能损害性能,但存在一个最优的分割主干结构,可使生成点云质量相比当前最优方法提升23.7%。

链接: https://arxiv.org/abs/2509.19644
作者: William L. Muckelroy III,Mohammed Alsakabi,John M. Dolan,Ozan K. Tonguz
机构: Carnegie Mellon University (卡内基梅隆大学); University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:LiDAR’s dense, sharp point cloud (PC) representations of the surrounding environment enable accurate perception and significantly improve road safety by offering greater scene awareness and understanding. However, LiDAR’s high cost continues to restrict the broad adoption of high-level Autonomous Driving (AD) systems in commercially available vehicles. Prior research has shown progress towards circumventing the need for LiDAR by training a neural network, using LiDAR point clouds as ground truth (GT), to produce LiDAR-like 3D point clouds using only 4D Radars. One of the best examples is a neural network created to train a more efficient radar target detector with a modular 2D convolutional neural network (CNN) backbone and a temporal coherence network at its core that uses the RaDelft dataset for training (see arXiv:2406.04723). In this work, we investigate the impact of higher-capacity segmentation backbones on the quality of the produced point clouds. Our results show that while very high-capacity models may actually hurt performance, an optimal segmentation backbone can provide a 23.7% improvement over the state-of-the-art (SOTA).
zh

[CV-78] IMED: Adversarial and Autoregressive Refinement of Diffusion-Based Time Series Generation ICDM

【速读】:该论文旨在解决高质合成时间序列数据的生成问题,这在预测和异常检测等领域尤为关键,因真实数据常存在稀缺性、噪声干扰或采集成本高等挑战。传统静态数据生成方法难以建模时间序列特有的时序依赖关系,而本文提出的TIMED框架通过整合三种核心机制实现突破:一是利用去噪扩散概率模型(DDPM)捕捉全局结构;二是引入教师强制训练的监督网络以学习自回归依赖关系;三是采用Wasserstein判别器提供对抗反馈以保障时序平滑性和保真度。其关键创新在于将上述模块统一于基于掩码注意力架构的联合训练框架中,同时引入最大均值差异(MMD)损失提升特征空间分布对齐,从而有效建模时间序列的无条件与条件特性,显著优于现有先进生成模型。

链接: https://arxiv.org/abs/2509.19638
作者: MohammadReza EskandariNasab,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
机构: Utah State University (犹他州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE International Conference on Data Mining (ICDM) 2025

点击查看摘要

Abstract:Generating high-quality synthetic time series is a fundamental yet challenging task across domains such as forecasting and anomaly detection, where real data can be scarce, noisy, or costly to collect. Unlike static data generation, synthesizing time series requires modeling both the marginal distribution of observations and the conditional temporal dependencies that govern sequential dynamics. We propose TIMED, a unified generative framework that integrates a denoising diffusion probabilistic model (DDPM) to capture global structure via a forward-reverse diffusion process, a supervisor network trained with teacher forcing to learn autoregressive dependencies through next-step prediction, and a Wasserstein critic that provides adversarial feedback to ensure temporal smoothness and fidelity. To further align the real and synthetic distributions in feature space, TIMED incorporates a Maximum Mean Discrepancy (MMD) loss, promoting both diversity and sample quality. All components are built using masked attention architectures optimized for sequence modeling and are trained jointly to effectively capture both unconditional and conditional aspects of time series data. Experimental results across diverse multivariate time series benchmarks demonstrate that TIMED generates more realistic and temporally coherent sequences than state-of-the-art generative models.
zh

[CV-79] EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data NEURIPS2025

【速读】:该论文旨在解决人类视角的具身经验数据(egocentric human experience data)在迁移到机器人执行任务时因视觉外观、传感器模态和运动学差异导致的知识迁移障碍问题。其核心挑战在于如何在不同域(human vs. robot)之间对齐策略潜在空间,同时保留与动作相关的语义信息。解决方案的关键在于提出EgoBridge框架,该框架通过最优传输(Optimal Transport, OT)度量联合策略潜在特征与动作之间的差异,显式地对齐人类与机器人数据的策略潜在空间,从而学习到既跨域一致又保留动作相关性的观测表征,实现高效端到端的模仿学习。

链接: https://arxiv.org/abs/2509.19626
作者: Ryan Punamiya,Dhruv Patel,Patcharapong Aphiwetsa,Pranav Kuppili,Lawrence Y. Zhu,Simar Kareer,Judy Hoffman,Danfei Xu
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) and Oral at Conference on Robot Learning (CoRL 2025)

点击查看摘要

Abstract:Egocentric human experience data presents a vast resource for scaling up end-to-end imitation learning for robotic manipulation. However, significant domain gaps in visual appearance, sensor modalities, and kinematics between human and robot impede knowledge transfer. This paper presents EgoBridge, a unified co-training framework that explicitly aligns the policy latent spaces between human and robot data using domain adaptation. Through a measure of discrepancy on the joint policy latent features and actions based on Optimal Transport (OT), we learn observation representations that not only align between the human and robot domain but also preserve the action-relevant information critical for policy learning. EgoBridge achieves a significant absolute policy success rate improvement by 44% over human-augmented cross-embodiment baselines in three real-world single-arm and bimanual manipulation tasks. EgoBridge also generalizes to new objects, scenes, and tasks seen only in human data, where baselines fail entirely. Videos and additional information can be found at this https URL
zh

[CV-80] Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG

【速读】:该论文旨在解决原始图像(raw image)在存储和传输中面临的空间效率与信息保真度之间的矛盾问题:一方面,原始数据(如DNG格式)虽能保留完整传感器信息,但占用存储空间大;另一方面,JPEG等压缩格式虽高效且兼容性强,却因有损压缩导致原始信息丢失,难以用于后续编辑或计算机视觉任务。解决方案的关键在于提出RawJPEG Adapter——一种轻量级、可学习且可逆的预处理流水线,通过引入空间域和可选频域变换,并将紧凑参数嵌入JPEG注释字段,实现原始图像到JPEG格式的适配性转换,从而在保持高压缩比的同时支持高精度的原始图像重建。

链接: https://arxiv.org/abs/2509.19624
作者: Mahmoud Afifi,Ran Zhang,Michael S. Brown
机构: AI Center-Toronto, Samsung Electronics (三星电子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Digital cameras digitize scene light into linear raw representations, which the image signal processor (ISP) converts into display-ready outputs. While raw data preserves full sensor information–valuable for editing and vision tasks–formats such as Digital Negative (DNG) require large storage, making them impractical in constrained scenarios. In contrast, JPEG is a widely supported format, offering high compression efficiency and broad compatibility, but it is not well-suited for raw storage. This paper presents RawJPEG Adapter, a lightweight, learnable, and invertible preprocessing pipeline that adapts raw images for standard JPEG compression. Our method applies spatial and optional frequency-domain transforms, with compact parameters stored in the JPEG comment field, enabling accurate raw reconstruction. Experiments across multiple datasets show that our method achieves higher fidelity than direct JPEG storage, supports other codecs, and provides a favorable trade-off between compression ratio and reconstruction accuracy.
zh

[CV-81] Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

【速读】:该论文旨在解决参数高效微调(Parameter-efficient Fine-tuning, PEF)方法在多任务学习(Multi-task Learning, MTL)场景下所面临的核心挑战,包括任务干扰(task interference)和负迁移(negative transfer),这些问题源于可训练参数数量有限导致的任务间冲突加剧。其解决方案的关键在于提出一种渐进式任务特异性多任务适应机制(progressive task-specific multi-task adaptation):在预训练模型中引入适配器模块(adapter modules),初始层的适配器对所有任务共享,而随着网络深度增加逐步变为任务特异性结构,从而在浅层实现跨任务的知识迁移,在深层支持任务特定的预测头学习;同时设计基于梯度的任务相似性计算方法,用于将相似任务分配至共享适配器模块,以最小化额外计算开销并优化任务分组策略。实验表明,该方法在Swin Transformer上应用于密集预测任务时,仅需全量微调参数的五分之一即可超越单任务微调与当前最优的参数高效多任务学习方法。

链接: https://arxiv.org/abs/2509.19602
作者: Neeraj Gangwar,Anshuka Rangi,Rishabh Deshmukh,Holakou Rahmanian,Yesh Dattatreya,Nickvash Kani
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning exacerbates common challenges, such as task interference and negative transfer, due to the limited number of trainable parameters. To address these issues, we introduce progressive task-specific multi-task adaptation, a novel parameter-efficient approach for multi-task learning. This approach introduces adapter modules in a pre-trained model such that these modules are shared across all tasks in the initial layers and become progressively more task-specific in the later layers. The motivation is to reduce the conflicts among tasks by allowing transfer learning across all tasks in the initial layers and enabling task-specific learning toward the prediction heads. Additionally, we propose a gradient-based approach for computing task similarity and use this measure to allocate similar tasks to the shared adapter modules. Our task similarity method introduces minimal overhead in the pipeline. We evaluate our approach by adapting the Swin Transformer for dense prediction tasks. Experiments on the PASCAL and NYUD-v2 datasets demonstrate that our approach outperforms a fully fine-tuned multi-task model while requiring only one-fifth of the trainable parameters. This approach achieves better relative improvement to single-task fine-tuning while reducing the number of trainable parameters and surpasses the current state-of-the-art methods for parameter-efficient multi-task learning.
zh

[CV-82] Synthesizing Artifact Dataset for Pixel-level Detection WACV

【速读】:该论文旨在解决图像生成模型中艺术伪影(artifact)检测器训练所需昂贵的像素级人工标注数据稀缺问题。现有方法依赖人工标注,不仅成本高且难以扩展;而简单的伪标签策略因噪声标签导致性能不佳。其解决方案的关键在于提出一种伪影污染流水线(artifact corruption pipeline),通过在高质量合成图像的预设区域自动注入伪影,实现无需人工标注即可生成精确的像素级标签,从而训练出性能显著提升的检测器(相比基线方法,在ConvNeXt和Swin-T上分别提升13.2%和3.7%)。

链接: https://arxiv.org/abs/2509.19589
作者: Dennis Menn,Feng Liang,Diana Marculescu
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under submission to WACV

点击查看摘要

Abstract:Artifact detectors have been shown to enhance the performance of image-generative models by serving as reward models during fine-tuning. These detectors enable the generative model to improve overall output fidelity and aesthetics. However, training the artifact detector requires expensive pixel-level human annotations that specify the artifact regions. The lack of annotated data limits the performance of the artifact detector. A naive pseudo-labeling approach-training a weak detector and using it to annotate unlabeled images-suffers from noisy labels, resulting in poor performance. To address this, we propose an artifact corruption pipeline that automatically injects artifacts into clean, high-quality synthetic images on a predetermined region, thereby producing pixel-level annotations without manual labeling. The proposed method enables training of an artifact detector that achieves performance improvements of 13.2% for ConvNeXt and 3.7% for Swin-T, as verified on human-labeled data, compared to baseline approaches. This work represents an initial step toward scalable pixel-level artifact annotation datasets that integrate world knowledge into artifact detection.
zh

[CV-83] Agent ic Scene Policies: Unifying Space Semantics and Affordances for Robot Action

【速读】:该论文旨在解决机器人执行开放词汇自然语言指令时面临的挑战,尤其是在复杂指令和新场景下,现有基于模仿学习和视觉-语言-动作模型(Vision-Language-Actions, VLAs)的端到端策略表现受限的问题。解决方案的关键在于提出一种名为“代理场景策略”(Agentic Scene Policies, ASP)的框架,其核心是利用现代场景表示的语义、空间和可操作性(affordance)查询能力,构建一个可查询的场景接口,使机器人能够通过显式推理对象可操作性来实现语言条件下的任务规划与执行。ASP支持零样本开放词汇查询,并在复杂技能中通过可操作性引导进行推理,从而提升在桌面上操作任务及房间级导航等场景中的泛化能力和鲁棒性。

链接: https://arxiv.org/abs/2509.19571
作者: Sacha Morin,Kumaraditya Gupta,Mahtab Sandhu,Charlie Gauthier,Francesco Argenziano,Kirsty Ellis,Liam Paull
机构: Université de Montréal(蒙特利尔大学); Mila - Quebec AI Institute(魁北克AI研究所); Sapienza University of Rome(罗马第一大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-actions models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner by explicitly reasoning about object affordances in the case of more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation, and a scaled-up scene representation. (Project page: this https URL)
zh

[CV-84] CURE: Centroid-guided Unsupervised Representation Erasure for Facial Recognition Systems

【速读】:该论文旨在解决面部识别系统在隐私保护场景下难以实现高效、无监督的数据删除问题,即如何在不依赖身份标签(identity labels)的情况下,从已训练模型中移除特定样本的影响,同时保持模型整体性能稳定。其解决方案的关键在于提出了一种名为CURE(Centroid-guided Unsupervised Representation Erasure)的无监督机器遗忘框架,该框架通过利用特征空间中的中心点引导表示擦除机制,无需身份标签即可精准定位并移除目标样本的影响;此外,论文还引入了新的评估指标Unlearning Efficiency Score(UES),用于平衡遗忘效果与保留稳定性,从而弥补现有评价体系的不足。

链接: https://arxiv.org/abs/2509.19562
作者: Fnu Shivam,Nima Najafzadeh,Yenumula Reddy,Prashnna Gyawali
机构: West Virginia University (西弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the current digital era, facial recognition systems offer significant utility and have been widely integrated into modern technological infrastructures; however, their widespread use has also raised serious privacy concerns, prompting regulations that mandate data removal upon request. Machine unlearning has emerged as a powerful solution to address this issue by selectively removing the influence of specific user data from trained models while preserving overall model performance. However, existing machine unlearning techniques largely depend on supervised techniques requiring identity labels, which are often unavailable in privacy-constrained situations or in large-scale, noisy datasets. To address this critical gap, we introduce CURE (Centroid-guided Unsupervised Representation Erasure), the first unsupervised unlearning framework for facial recognition systems that operates without the use of identity labels, effectively removing targeted samples while preserving overall performance. We also propose a novel metric, the Unlearning Efficiency Score (UES), which balances forgetting and retention stability, addressing shortcomings in the current evaluation metrics. CURE significantly outperforms unsupervised variants of existing unlearning methods. Additionally, we conducted quality-aware unlearning by designating low-quality images as the forget set, demonstrating its usability and benefits, and highlighting the role of image quality in machine unlearning.
zh

[CV-85] Finder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域任务(如后处理行车记录仪视频分析)中因通用训练和缺乏结构化归纳偏置而导致的性能瓶颈问题,尤其在仅依赖视觉模态(无LiDAR、GPS等辅助信息)时,现有视觉-语言模型(Vision-Language Models, V-VLMs)难以实现空间推理、因果推断及事件解释性。其解决方案的关键在于提出iFinder框架,通过将驾驶场景中的关键感知线索(如物体姿态、车道位置和物体轨迹)从原始视频中提取并组织为分层可解释的数据结构,从而实现感知与推理的解耦;该框架采用预训练视觉模型进行无监督特征提取,并结合三模块提示策略,使LLM能够进行逐步、基于证据的推理,显著提升事故推理准确率(最高达39%),且无需额外训练,具备零样本、可解释性和可靠性优势。

链接: https://arxiv.org/abs/2509.19552
作者: Manyi Yao,Bingbing Zhuang,Sparsh Garg,Amit Roy-Chowdhury,Christian Shelton,Manmohan Chandraker,Abhishek Aich
机构: NEC Laboratories, America (NEC实验室,美国); University of California, Riverside (加州大学河滨分校); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues – object pose, lane positions, and object trajectories – which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM’s outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder’s proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
zh

[CV-86] ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation

【速读】:该论文旨在解决基于模仿学习训练双臂操作策略时,因真实世界演示数据覆盖不足而导致的鲁棒性差与可扩展性受限的问题。其核心挑战在于:收集涵盖广泛机器人位姿、接触状态及场景上下文的多样化且精确的真实演示数据成本高、耗时长。为应对这一问题,作者提出了一种名为ROPA(Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation)的离线数据增强方法,其关键创新在于利用Stable Diffusion模型微调生成第三人称视角下的RGB和RGB-D观测图像,并同步生成对应的关节空间动作标签;同时通过约束优化手段引入物理一致性,确保在双臂操作场景中夹爪与物体之间的接触约束合理有效。此方案显著提升了RGB和RGB-D数据在眼到手(eye-to-hand)配置下的合成多样性与可用性,从而支持更高效的双臂操作策略学习。

链接: https://arxiv.org/abs/2509.19454
作者: Jason Chen,I-Chun Arthur Liu,Gaurav Sukhatme,Daniel Seita
机构: University of Southern California (南加州大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: this https URL.
zh

[CV-87] HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

【速读】:该论文旨在解决无人飞行器(Unmanned Aerial Vehicle, UAV)在未知非结构化环境中高速自主导航与目标跟踪的双重挑战,尤其是在感知条件受限且缺乏全局定位信息的情况下。解决方案的关键在于提出HUNT(High-speed UAV Navigation and Tracking)框架,该框架将遍历、目标获取与跟踪统一于单一相对导航表述中,通过直接利用机载瞬时可观测量(如姿态、高度和速度)定义导航目标,实现搜索阶段的高反应性高速飞行;一旦检测到目标,相同的感知-控制流水线可无缝切换至跟踪模式,从而在密集森林、集装箱堆场及搜救任务等复杂场景中实现鲁棒自主运行,克服传统全局方法失效的问题。

链接: https://arxiv.org/abs/2509.19452
作者: Alessandro Saviolo,Jeffrey Mao,Giuseppe Loianno
机构: New York University (纽约大学); Tandon School of Engineering (坦顿工程学院); NYU Wireless (纽约大学无线研究中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.
zh

[CV-88] Overview of LifeCLEF Plant Identification task 2020

【速读】:该论文旨在解决植物自动识别技术在数据匮乏地区(如热带地区)性能不足的问题,其核心挑战在于如何利用有限的野外照片数据提升对生物多样性丰富区域植物的识别准确率。解决方案的关键在于引入大量数字化标本(herbarium sheets)作为训练数据,通过跨域分类任务建立标本图像与野外照片之间的映射关系,从而增强模型在真实场景下的泛化能力。这一方法有效利用了长期积累的博物馆标本资源,弥补了野外图像数据稀缺的短板,为全球范围内植物识别系统的公平性和普适性提供了新路径。

链接: https://arxiv.org/abs/2509.19402
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures, CLEF 2020 Conference and Labs of the Evaluation Forum, September 05 to 08, 2020, Thessaloniki, Greece

点击查看摘要

Abstract:Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data with more and more photos in the field. However, this profusion of data only concerns a few tens of thousands of species, mostly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have collected, catalogued and systematically stored plant specimens in herbaria, particularly in tropical regions, and the recent efforts by the biodiversity informatics community made it possible to put millions of digitized sheets online. The LifeCLEF 2020 Plant Identification challenge (or “PlantCLEF 2020”) was designed to evaluate to what extent automated identification on the flora of data deficient regions can be improved by the use of herbarium collections. It is based on a dataset of about 1,000 species mainly focused on the South America’s Guiana Shield, an area known to have one of the greatest diversity of plants in the world. The challenge was evaluated as a cross-domain classification task where the training set consist of several hundred thousand herbarium sheets and few thousand of photos to enable learning a mapping between the two domains. The test set was exclusively composed of photos in the field. This paper presents the resources and assessments of the conducted evaluation, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
zh

[CV-89] Vision-Based Perception for Autonomous Vehicles in Off-Road Environment Using Deep Learning

【速读】:该论文旨在解决在非铺装道路和野外环境中实现低延迟智能感知系统的问题,特别是在矿山和欠发达地区等复杂场景下,车辆需在无预设路径条件下自主导航。其关键解决方案是提出了一种可配置模块化分割网络(Configurable Modular Segmentation Network, CMSNet)框架,该框架支持灵活的网络结构设计,能够在恶劣条件(如夜间、雨天、扬尘)下准确分割障碍物与可行驶区域;同时,为实现实时推理,通过TensorRT、C++与CUDA对CMSNet的卷积层进行系统性优化与融合,显著提升计算效率,并基于自建的Kamino数据集(包含近12,000张多相机同步图像)验证了方法的有效性与鲁棒性。

链接: https://arxiv.org/abs/2509.19378
作者: Nelson Alves Ferreira Neto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: 2022. 117p. Electrical Engineering PhD Thesis - Graduate Program in Electrical and Computer Engineering, Federal University of Bahia, 40210-630, Salvador, Brazil

点击查看摘要

Abstract:Low-latency intelligent systems are required for autonomous driving on non-uniform terrain in open-pit mines and developing countries. This work proposes a perception system for autonomous vehicles on unpaved roads and off-road environments, capable of navigating rough terrain without a predefined trail. The Configurable Modular Segmentation Network (CMSNet) framework is proposed, facilitating different architectural arrangements. CMSNet configurations were trained to segment obstacles and trafficable ground on new images from unpaved/off-road scenarios with adverse conditions (night, rain, dust). We investigated applying deep learning to detect drivable regions without explicit track boundaries, studied algorithm behavior under visibility impairment, and evaluated field tests with real-time semantic segmentation. A new dataset, Kamino, is presented with almost 12,000 images from an operating vehicle with eight synchronized cameras. The Kamino dataset has a high number of labeled pixels compared to similar public collections and includes images from an off-road proving ground emulating a mine under adverse visibility. To achieve real-time inference, CMSNet CNN layers were methodically removed and fused using TensorRT, C++, and CUDA. Empirical experiments on two datasets validated the proposed system’s effectiveness.
zh

[CV-90] Ensuring Reliable Participation in Subjective Video Quality Tests Across Platforms

【速读】:该论文旨在解决众包主观视频质量评估(Subjective Video Quality Assessment, VQA)中因工人违规操作(如忽略指令、操纵奖励机制)及技术手段(如远程桌面连接、利用视频元数据)导致的结果偏差问题。其解决方案的关键在于提出客观与主观相结合的检测方法,用于识别使用远程桌面(Remote-Desktop, RD)连接的异常用户,并在真实测试条件下对比主流众包平台在面对此类干扰时的敏感性差异及其缓解策略的有效性。

链接: https://arxiv.org/abs/2509.20001
作者: Babak Naderi,Ross Cutler
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Subjective video quality assessment (VQA) is the gold standard for measuring end-user experience across communication, streaming, and UGC pipelines. Beyond high-validity lab studies, crowdsourcing offers accurate, reliable, faster, and cheaper evaluation-but suffers from unreliable submissions by workers who ignore instructions or game rewards. Recent tests reveal sophisticated exploits of video metadata and rising use of remote-desktop (RD) connections, both of which bias results. We propose objective and subjective detectors for RD users and compare two mainstream crowdsourcing platforms on their susceptibility and mitigation under realistic test conditions and task designs.
zh

[CV-91] Frequency-Aware Ensemble Learning for BraTS 2025 Pediatric Brain Tumor Segmentation MICCAI

【速读】:该论文针对儿科脑肿瘤分割中存在的罕见性和异质性难题,提出了一种集成方法,旨在提升分割精度以支持临床诊断与治疗规划。解决方案的关键在于三个核心改进:一是对nnU-Net引入可调初始化尺度以实现复杂度控制;二是利用BraTS 2021预训练模型进行迁移学习,增强Swin UNETR在儿科数据集上的泛化能力;三是采用频域分解策略用于HFF-Net,分离低频组织轮廓与高频纹理细节,从而优化特征提取。最终的集成模型由调整后的nnU-Net(γ=0.7)、微调后的Swin UNETR和HFF-Net组成,在BraTS-PED 2025挑战中取得了优异的Dice评分表现。

链接: https://arxiv.org/abs/2509.19353
作者: Yuxiao Yi,Qingyao Zhuang,Zhi-Qin John Xu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures, conference, miccai brats challenge

点击查看摘要

Abstract:Pediatric brain tumor segmentation presents unique challenges due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning. We propose an ensemble approach integrating nnU-Net, Swin UNETR, and HFF-Net for the BraTS-PED 2025 challenge. Our method incorporates three key extensions: adjustable initialization scales for optimal nnU-Net complexity control, transfer learning from BraTS 2021 pre-trained models to enhance Swin UNETR’s generalization on pediatric dataset, and frequency domain decomposition for HFF-Net to separate low-frequency tissue contours from high-frequency texture details. Our final ensemble combines nnU-Net ( \gamma=0.7 ), fine-tuned Swin UNETR, and HFF-Net, achieving Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT), respectively.
zh

人工智能

[AI-0] Adaptive Event-Triggered Policy Gradient for Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决传统多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)方法依赖于时间触发执行机制所带来的计算开销大和通信密集的问题。其解决方案的关键在于提出ET-MAPG(Event-Triggered Multi-Agent Policy Gradient),该框架联合学习智能体的控制策略与事件触发策略,将动作执行时机的选择纳入统一的学习过程,使智能体不仅能决定采取何种行动,还能自主判断何时执行;进一步地,在需要智能体间通信的场景中,引入AET-MAPG(Attention-based ET-MAPG),利用自注意力机制学习选择性通信模式,从而优化智能体间的协调效率。这两种方法均可与任意基于策略梯度的MARL算法兼容,并在多个基准测试中实现了与先进时间触发基线相当的性能,同时显著降低了计算负载和通信开销。

链接: https://arxiv.org/abs/2509.20338
作者: Umer Siddique,Abhinav Sinha,Yongcan Cao
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:Conventional multi-agent reinforcement learning (MARL) methods rely on time-triggered execution, where agents sample and communicate actions at fixed intervals. This approach is often computationally expensive and communication-intensive. To address this limitation, we propose ET-MAPG (Event-Triggered Multi-Agent Policy Gradient reinforcement learning), a framework that jointly learns an agent’s control policy and its event-triggering policy. Unlike prior work that decouples these mechanisms, ET-MAPG integrates them into a unified learning process, enabling agents to learn not only what action to take but also when to execute it. For scenarios with inter-agent communication, we introduce AET-MAPG, an attention-based variant that leverages a self-attention mechanism to learn selective communication patterns. AET-MAPG empowers agents to determine not only when to trigger an action but also with whom to communicate and what information to exchange, thereby optimizing coordination. Both methods can be integrated with any policy gradient MARL algorithm. Extensive experiments across diverse MARL benchmarks demonstrate that our approaches achieve performance comparable to state-of-the-art, time-triggered baselines while significantly reducing both computational load and communication overhead.
zh

[AI-1] Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing NEURIPS2025

【速读】:该论文旨在解决基于Transformer的大型语言模型(Large Language Models, LLMs)在图推理任务中内部机制不明确的问题。为实现对这些推理过程的根本性与统一性理解,作者采用电路追踪(circuit-tracer)框架来解析仅解码器结构的Transformer模型。其解决方案的关键在于通过该框架可视化推理轨迹,并识别出图推理中的两个核心机制:标记合并(token merging)与结构记忆(structural memorization),二者共同支撑路径推理与子结构提取等任务。研究进一步量化了这两种行为,并分析其受图密度和模型规模的影响,从而构建了一个统一的可解释性框架,用于理解解码器-only Transformer中的结构化推理能力。

链接: https://arxiv.org/abs/2509.20336
作者: Xinnan Dai,Chung-Hsiang Lo,Kai Guo,Shenglai Zeng,Dongsheng Luo,Jiliang Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by the Workshop on Efficient Reasoning, Neurips 2025

点击查看摘要

Abstract:Transformer-based LLMs demonstrate strong performance on graph reasoning tasks, yet their internal mechanisms remain underexplored. To uncover these reasoning process mechanisms in a fundamental and unified view, we set the basic decoder-only transformers and explain them using the circuit-tracer framework. Through this lens, we visualize reasoning traces and identify two core mechanisms in graph reasoning: token merging and structural memorization, which underlie both path reasoning and substructure extraction tasks. We further quantify these behaviors and analyze how they are influenced by graph density and model size. Our study provides a unified interpretability framework for understanding structural reasoning in decoder-only Transformers.
zh

[AI-2] RAG Security and Privacy: Formalizing the Threat Model and Attack Surface ICDM

【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在隐私与安全方面缺乏正式威胁建模框架的问题。现有研究虽已揭示大型语言模型(Large Language Models, LLMs)存在训练数据记忆泄露和对抗性提示攻击等风险,但RAG因依赖外部知识库而引入了新的攻击面,如文档级成员推理(document-level membership inference)和数据投毒(data poisoning)等新型威胁,尚未被系统化定义。论文的关键解决方案是首次提出一个形式化的RAG威胁模型,基于对手对模型组件和数据的访问权限构建了 adversary 类型的结构化分类,并明确定义了若干关键威胁向量,从而为RAG系统的隐私保护与完整性保障提供了理论基础和分析框架。

链接: https://arxiv.org/abs/2509.20324
作者: Atousa Arzanipour,Rouzbeh Behnia,Reza Ebrahimi,Kaushik Dutta
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at the 5th ICDM Workshop on September 20, 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is an emerging approach in natural language processing that combines large language models (LLMs) with external document retrieval to produce more accurate and grounded responses. While RAG has shown strong potential in reducing hallucinations and improving factual consistency, it also introduces new privacy and security challenges that differ from those faced by traditional LLMs. Existing research has demonstrated that LLMs can leak sensitive information through training data memorization or adversarial prompts, and RAG systems inherit many of these vulnerabilities. At the same time, reliance of RAG on an external knowledge base opens new attack surfaces, including the potential for leaking information about the presence or content of retrieved documents, or for injecting malicious content to manipulate model behavior. Despite these risks, there is currently no formal framework that defines the threat landscape for RAG systems. In this paper, we address a critical gap in the literature by proposing, to the best of our knowledge, the first formal threat model for retrieval-RAG systems. We introduce a structured taxonomy of adversary types based on their access to model components and data, and we formally define key threat vectors such as document-level membership inference and data poisoning, which pose serious privacy and integrity risks in real-world deployments. By establishing formal definitions and attack models, our work lays the foundation for a more rigorous and principled understanding of privacy and security in RAG systems.
zh

[AI-3] When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)评判的基准测试在设计上存在的失效问题,即这类基准测试可能产生高置信度但实质上高度噪声的排名结果,其根源在于缺乏明确的目标约束和可验证的构造机制。解决方案的关键在于引入两种诊断机制:一是图式一致性(Schematic Adherence),用于量化评判结果中可由显式评分标准解释的比例,从而揭示当评判者偏离自身评分规则时的未解释方差;二是心理测量有效性(Psychometric Validity),通过聚合内部一致性和区分效度信号来量化任何基准测试运行中的不可减少不确定性。实证分析表明,这些机制能够有效识别出主流LLM评判者在Arena-Hard Auto基准中严重的评分规则不一致与因子坍缩现象,并指出ELO风格聚合方法掩盖了真实的排序不确定性,从而为构建更具可靠性、目标清晰的LLM评判基准提供了可操作的设计原则。

链接: https://arxiv.org/abs/2509.20293
作者: Benjamin Feuer,Chiung-Yi Tseng,Astitwa Sarthak Lathe,Oussama Elachqar,John P Dickerson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth based benchmarks. We argue that without tight objectives and verifiable constructions, benchmark rankings can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge’s overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We release our code at this https URL
zh

[AI-4] PGCLODA: Prompt-Guided Graph Contrastive Learning for Oligopeptide-Infectious Disease Association Prediction

【速读】:该论文旨在解决当前缺乏针对寡肽(oligopeptides)与传染性疾病之间潜在关联的计算预测模型的问题,以加速新型抗感染药物的发现。其关键解决方案是提出了一种提示引导的图对比学习框架(Prompt-guided Graph-based Contrastive Learning Framework, PGCLODA),通过构建包含寡肽、微生物和疾病三类节点的异构图结构,融合结构与语义信息;采用提示引导的图增强策略保留关键区域以生成有意义的对比视图,并结合图卷积网络(Graph Convolutional Network, GCN)与Transformer的双编码器架构联合捕捉局部与全局特征,最终通过多层感知机(Multilayer Perceptron, MLP)分类器实现精准预测。该方法在多个评价指标上优于现有先进模型,验证了其在机制驱动型药物发现中的有效性与泛化能力。

链接: https://arxiv.org/abs/2509.20290
作者: Dayu Tan,Jing Chen,Xiaoping Zhou,Yansen Su,Chunhou Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 12page and 8 figures

点击查看摘要

Abstract:Infectious diseases continue to pose a serious threat to public health, underscoring the urgent need for effective computational approaches to screen novel anti-infective agents. Oligopeptides have emerged as promising candidates in antimicrobial research due to their structural simplicity, high bioavailability, and low susceptibility to resistance. Despite their potential, computational models specifically designed to predict associations between oligopeptides and infectious diseases remain scarce. This study introduces a prompt-guided graph-based contrastive learning framework (PGCLODA) to uncover potential associations. A tripartite graph is constructed with oligopeptides, microbes, and diseases as nodes, incorporating both structural and semantic information. To preserve critical regions during contrastive learning, a prompt-guided graph augmentation strategy is employed to generate meaningful paired views. A dual encoder architecture, integrating Graph Convolutional Network (GCN) and Transformer, is used to jointly capture local and global features. The fused embeddings are subsequently input into a multilayer perceptron (MLP) classifier for final prediction. Experimental results on a benchmark dataset indicate that PGCLODA consistently outperforms state-of-the-art models in AUROC, AUPRC, and accuracy. Ablation and hyperparameter studies confirm the contribution of each module. Case studies further validate the generalization ability of PGCLODA and its potential to uncover novel, biologically relevant associations. These findings offer valuable insights for mechanism-driven discovery and oligopeptide-based drug development. The source code of PGCLODA is available online at this https URL.
zh

[AI-5] Investigating Security Implications of Automatically Generated Code on the Software Supply Chain

【速读】:该论文旨在解决由大语言模型(Large Language Models, LLMs)在代码生成过程中固有缺陷(如虚构代码、误导性信息及依赖过时训练数据)所引发的软件供应链(Software Supply Chain, SSC)安全威胁问题。研究发现,LLM生成的代码中存在十一类与外部组件和持续集成配置文件相关的潜在SSC威胁,其中部分可导致攻击者劫持软件流程,另一些则可能随时间积累形成隐蔽的安全风险。为应对这些问题,论文提出两种关键解决方案:一是基于提示工程的“链式确认”(Chain-of-Confirmation)机制,用于降低代码虚构风险;二是基于中间件的防御方案,能够主动识别并告知用户多种SSC威胁,从而提升开发过程中的安全性。

链接: https://arxiv.org/abs/2509.20277
作者: Xiaofan Li,Xing Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, various software supply chain (SSC) attacks have posed significant risks to the global community. Severe consequences may arise if developers integrate insecure code snippets that are vulnerable to SSC attacks into their products. Particularly, code generation techniques, such as large language models (LLMs), have been widely utilized in the developer community. However, LLMs are known to suffer from inherent issues when generating code, including fabrication, misinformation, and reliance on outdated training data, all of which can result in serious software supply chain threats. In this paper, we investigate the security threats to the SSC that arise from these inherent issues. We examine three categories of threats, including eleven potential SSC-related threats, related to external components in source code, and continuous integration configuration files. We find some threats in LLM-generated code could enable attackers to hijack software and workflows, while some others might cause potential hidden threats that compromise the security of the software over time. To understand these security impacts and severity, we design a tool, SSCGuard, to generate 439,138 prompts based on SSC-related questions collected online, and analyze the responses of four popular LLMs from GPT and Llama. Our results show that all identified SSC-related threats persistently exist. To mitigate these risks, we propose a novel prompt-based defense mechanism, namely Chain-of-Confirmation, to reduce fabrication, and a middleware-based defense that informs users of various SSC threats.
zh

[AI-6] AnchDrive: Bootstrapping Diffusion Policies with Hybrid Trajectory Anchors for End-to-End Driving

【速读】:该论文旨在解决端到端多模态规划在自动驾驶中的两大挑战:一是如何有效处理行为的多模态性(behavioral multi-modality),二是如何提升模型在长尾场景(long-tail scenarios)下的泛化能力。其解决方案的关键在于提出 AnchDrive 框架,通过引入基于锚点(anchor-based)的扩散策略(diffusion policy)来降低传统生成模型的高计算成本。具体而言,该框架不从纯噪声开始去噪,而是以一组混合轨迹锚点作为初始状态,这些锚点来源于静态驾驶先验词汇和实时解码的动态上下文感知轨迹;其中动态轨迹由一个处理密集与稀疏感知特征的 Transformer 实时生成。扩散模型进一步学习预测轨迹偏移分布,实现对锚点的细粒度优化,从而高效生成多样化且高质量的轨迹,显著提升了规划效率与鲁棒性。

链接: https://arxiv.org/abs/2509.20253
作者: Jinhao Chai,Anqing Jiang,Hao Jiang,Shiyi Mu,Zichong Gu,Shugong Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: IWACIII 2025

点击查看摘要

Abstract:End-to-end multi-modal planning has become a transformative paradigm in autonomous driving, effectively addressing behavioral multi-modality and the generalization challenge in long-tail scenarios. We propose AnchDrive, a framework for end-to-end driving that effectively bootstraps a diffusion policy to mitigate the high computational cost of traditional generative models. Rather than denoising from pure noise, AnchDrive initializes its planner with a rich set of hybrid trajectory anchors. These anchors are derived from two complementary sources: a static vocabulary of general driving priors and a set of dynamic, context-aware trajectories. The dynamic trajectories are decoded in real-time by a Transformer that processes dense and sparse perceptual features. The diffusion model then learns to refine these anchors by predicting a distribution of trajectory offsets, enabling fine-grained refinement. This anchor-based bootstrapping design allows for efficient generation of diverse, high-quality trajectories. Experiments on the NAVSIM benchmark confirm that AnchDrive sets a new state-of-the-art and shows strong gen?eralizability
zh

[AI-7] A HyperGraphMamba-Based Multichannel Adaptive Model for ncRNA Classification

【速读】:该论文旨在解决非编码RNA(non-coding RNA, ncRNA)分类中特征提取深度不足与多模态融合效率低的问题,以提升功能注释和疾病诊断的准确性。其解决方案的关键在于提出HGMamba-ncRNA模型,该模型通过三个核心模块实现:(1) 使用并行多尺度卷积与长短期记忆网络(Multi-scale Convolution and LSTM, MKC-L)捕获ncRNA序列的局部模式与长程依赖;(2) 基于多尺度图Transformer(MSGraphTransformer)建模ncRNA二级结构的多层次拓扑特征;(3) 引入基于切比雪夫多项式的柯尔莫哥洛夫-阿诺德网络(Chebyshev Polynomial-based Kolmogorov-Arnold Network, CPKAN)有效处理高维表达谱数据。最终,通过引入虚拟节点增强异构模态间的交互能力,HyperGraphMamba实现自适应对齐与融合多通道特征,显著提升了分类性能与模型泛化能力。

链接: https://arxiv.org/abs/2509.20240
作者: Xin An,Ruijie Li,Qiao Ning,Hui Li,Qian Ma,Shikai Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 17 figures (including subfigures), 1 table. Xin An and Ruijie Li contributed equally to this work and should be considered co-first authors

点击查看摘要

Abstract:Non-coding RNAs (ncRNAs) play pivotal roles in gene expression regulation and the pathogenesis of various diseases. Accurate classification of ncRNAs is essential for functional annotation and disease diagnosis. To address existing limitations in feature extraction depth and multimodal fusion, we propose HGMamba-ncRNA, a HyperGraphMamba-based multichannel adaptive model, which integrates sequence, secondary structure, and optionally available expression features of ncRNAs to enhance classification performance. Specifically, the sequence of ncRNA is modeled using a parallel Multi-scale Convolution and LSTM architecture (MKC-L) to capture both local patterns and long-range dependencies of nucleotides. The structure modality employs a multi-scale graph transformer (MSGraphTransformer) to represent the multi-level topological characteristics of ncRNA secondary structures. The expression modality utilizes a Chebyshev Polynomial-based Kolmogorov-Arnold Network (CPKAN) to effectively model and interpret high-dimensional expression profiles. Finally, by incorporating virtual nodes to facilitate efficient and comprehensive multimodal interaction, HyperGraphMamba is proposed to adaptively align and integrate multichannel heterogeneous modality features. Experiments conducted on three public datasets demonstrate that HGMamba-ncRNA consistently outperforms state-of-the-art methods in terms of accuracy and other metrics. Extensive empirical studies further confirm the model’s robustness, effectiveness, and strong transferability, offering a novel and reliable strategy for complex ncRNA functional classification. Code and datasets are available at this https URL.
zh

[AI-8] Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)遗忘方法中存在的安全漏洞问题:尽管表面看似成功删除了敏感或有害知识,但这些“被遗忘”的信息仍可通过再学习攻击(relearning attacks)轻易恢复。其根本原因在于,传统方法在单个数据点上优化遗忘损失会引导模型参数进入损失曲面中的尖锐极小值区域(sharp minima),此类区域对微小的参数扰动极其敏感,从而导致模型行为剧烈变化。攻击者可利用少量微调样本沿陡峭梯度快速重构被删除的知识。为应对这一问题,论文提出 StableUN,一种基于双层反馈引导的优化框架,其核心创新在于通过邻域感知优化显式寻找更稳定的参数区域;该框架融合了对抗扰动驱动的遗忘反馈与保留模型效用的记忆反馈,并通过梯度投影实现二者目标的一致性对齐,从而显著提升模型对再学习和越狱攻击的鲁棒性,同时保持良好的性能表现。

链接: https://arxiv.org/abs/2509.20230
作者: Wenhan Wu,Zheyuan Liu,Chongyang Gao,Ren Wang,Kaize Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current LLM unlearning methods face a critical security vulnerability that undermines their fundamental purpose: while they appear to successfully remove sensitive or harmful knowledge, this ``forgotten" information remains precariously recoverable through relearning attacks. We identify that the root cause is that conventional methods optimizing the forgetting loss at individual data points will drive model parameters toward sharp minima in the loss landscape. In these unstable regions, even minimal parameter perturbations can drastically alter the model’s behaviors. Consequently, relearning attacks exploit this vulnerability by using just a few fine-tuning samples to navigate the steep gradients surrounding these unstable regions, thereby rapidly recovering knowledge that was supposedly erased. This exposes a critical robustness gap between apparent unlearning and actual knowledge removal. To address this issue, we propose StableUN, a bi-level feedback-guided optimization framework that explicitly seeks more stable parameter regions via neighborhood-aware optimization. It integrates forgetting feedback, which uses adversarial perturbations to probe parameter neighborhoods, with remembering feedback to preserve model utility, aligning the two objectives through gradient projection. Experiments on WMDP and MUSE benchmarks demonstrate that our method is significantly more robust against both relearning and jailbreaking attacks while maintaining competitive utility performance.
zh

[AI-9] Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation

【速读】:该论文旨在解决多模态推荐系统中因冗余和无关信息导致性能下降的问题,现有方法通常直接融合多模态信息或采用刚性架构分离来解耦特征,难以有效过滤噪声并建模模态间的复杂交互。其解决方案的关键在于提出一种新颖的多模态表示解耦信息瓶颈(Multimodal Representation-disentangled Information Bottleneck, MRdIB)框架:首先利用多模态信息瓶颈(Multimodal Information Bottleneck)压缩输入表示以去除任务无关噪声;随后通过一系列约束条件将信息分解为与推荐目标相关的独特信息(unique information)、冗余信息(redundant information)和协同信息(synergistic information),并分别设计对应的优化目标,从而引导模型学习更强大且解耦的表示,显著提升推荐效果。

链接: https://arxiv.org/abs/2509.20225
作者: Hui Wang,Jinghui Qin,Wushao Wen,Qingling Li,Shanshan Zhong,Zhongzhan Huang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal data has significantly advanced recommendation systems by integrating diverse information sources to model user preferences and item characteristics. However, these systems often struggle with redundant and irrelevant information, which can degrade performance. Most existing methods either fuse multimodal information directly or use rigid architectural separation for disentanglement, failing to adequately filter noise and model the complex interplay between modalities. To address these challenges, we propose a novel framework, the Multimodal Representation-disentangled Information Bottleneck (MRdIB). Concretely, we first employ a Multimodal Information Bottleneck to compress the input representations, effectively filtering out task-irrelevant noise while preserving rich semantic information. Then, we decompose the information based on its relationship with the recommendation target into unique, redundant, and synergistic components. We achieve this decomposition with a series of constraints: a unique information learning objective to preserve modality-unique signals, a redundant information learning objective to minimize overlap, and a synergistic information learning objective to capture emergent information. By optimizing these objectives, MRdIB guides a model to learn more powerful and disentangled representations. Extensive experiments on several competitive models and three benchmark datasets demonstrate the effectiveness and versatility of our MRdIB in enhancing multimodal recommendation.
zh

[AI-10] he Cream Rises to the Top: Efficient Reranking Method for Verilog Code Generation ICASSP2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在Verilog代码生成任务中因领域知识有限而导致的可靠性不足问题,尤其针对硬件工程师更关注单一可信实现而非多个不确定候选方案的需求。其解决方案的关键在于将Verilog生成问题建模为需求与实现之间的语义对齐问题,并提出VCD-RNK判别模型用于高效重排序代码候选。该模型通过蒸馏专家知识,在代码语义分析、测试用例生成和功能正确性评估三个维度上引入Verilog特定推理机制,在推理阶段显式模拟上述过程,从而避免了现有方法中计算密集型的测试执行步骤,提升了生成结果的准确性和可信赖度。

链接: https://arxiv.org/abs/2509.20215
作者: Guang Yang,Wei Zheng,Xiang Chen,Yifan Sun,Fengji Zhang,Terry Yue Zhuo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Under review ICASSP 2026

点击查看摘要

Abstract:LLMs face significant challenges in Verilog generation due to limited domain-specific knowledge. While sampling techniques improve pass@k metrics, hardware engineers need one trustworthy solution rather than uncertain candidates. To bridge this gap, we formulate it as a semantic alignment problem between requirements and Verilog implementations, and propose VCD-RNK, a discriminator model tailored for efficient Verilog code reranking. Specifically, VCD-RNKincorporates Verilog-specific reasoning by distilling expert knowledge across three dimensions: code semantic analysis, test case generation, and functional correctness assessment. By explicitly simulating the above reasoning processes during inference, VCD-RNK effectively avoids computationally intensive test execution in existing methods.
zh

[AI-11] Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment NEURIPS2025

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在后训练量化(Post-Training Quantization, PTQ)过程中因权重分布不规则、存在重尾异常值而导致的量化误差问题,尤其针对无需微调且仅需少量校准数据的场景。其核心解决方案是通过旋转方法将原始权重转换为近似高斯分布以提升量化稳定性,并进一步提出Q-Palette框架,该框架包含从格栅编码量化器(trellis-coded quantizers)到向量和标量量化器的多种分段比特量化策略,能够根据资源约束动态选择最优量化方案并联合优化层融合决策,从而在有限比特预算下逼近信息论最优的失真率边界,显著提升量化精度与推理效率。

链接: https://arxiv.org/abs/2509.20214
作者: Deokjae Lee,Hyun Oh Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at this https URL.
zh

[AI-12] STAF: Leverag ing LLM s for Automated Attack Tree-Based Security Test Generation

【速读】:该论文旨在解决现代汽车系统中安全测试自动化程度低的问题,尤其是从攻击树(Attack Tree)自动生成可执行安全测试用例的效率低下与人工依赖性强等挑战。其解决方案的关键在于提出了一种名为STAF(Security Test Automation Framework)的新框架,该框架基于大语言模型(Large Language Models, LLMs)并采用四步自校正检索增强生成(Retrieval-Augmented Generation, RAG)机制,实现了从攻击树到可执行测试用例的端到端自动化生成,并通过与自动化测试框架集成,显著提升了测试效率、准确性、可扩展性及流程兼容性。

链接: https://arxiv.org/abs/2509.20190
作者: Tanmay Khule,Stefan Marksteiner,Jose Alguindigue,Hannes Fuchs,Sebastian Fischmeister,Apurva Narayan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 pages, 2 figures, accepted for 23rd escar Europe (Nov 05-06, 2025, Frankfurt, Germany)

点击查看摘要

Abstract:In modern automotive development, security testing is critical for safeguarding systems against increasingly advanced threats. Attack trees are widely used to systematically represent potential attack vectors, but generating comprehensive test cases from these trees remains a labor-intensive, error-prone task that has seen limited automation in the context of testing vehicular systems. This paper introduces STAF (Security Test Automation Framework), a novel approach to automating security test case generation. Leveraging Large Language Models (LLMs) and a four-step self-corrective Retrieval-Augmented Generation (RAG) framework, STAF automates the generation of executable security test cases from attack trees, providing an end-to-end solution that encompasses the entire attack surface. We particularly show the elements and processes needed to provide an LLM to actually produce sensible and executable automotive security test suites, along with the integration with an automated testing framework. We further compare our tailored approach with general purpose (vanilla) LLMs and the performance of different LLMs (namely GPT-4.1 and DeepSeek) using our approach. We also demonstrate the method of our operation step-by-step in a concrete case study. Our results show significant improvements in efficiency, accuracy, scalability, and easy integration in any workflow, marking a substantial advancement in automating automotive security testing methodologies. Using TARAs as an input for verfication tests, we create synergies by connecting two vital elements of a secure automotive development process.
zh

[AI-13] How People Manage Knowledge in their “Second Brains”- A Case Study with Industry Researchers Using Obsidian

【速读】:该论文旨在解决个人在工作和日常生活中面临的海量信息难以有效组织与管理的问题,特别是如何构建和探索个人知识库(personal knowledge base)以支持知识的长期积累与高效检索。解决方案的关键在于揭示研究人员在构建和探索个人知识库时,其知识检索策略直接影响内容的组织方式与维护模式,进而提出未来可集成人工智能(AI)系统以辅助这一过程的潜在功能设计。

链接: https://arxiv.org/abs/2509.20187
作者: Juliana Jansen Ferreira,Vinícius Segura,Joana Gabriela Souza,Joao Henrique Gallas Brasil
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:People face overwhelming information during work activities, necessitating effective organization and management strategies. Even in personal lives, individuals must keep, annotate, organize, and retrieve knowledge from daily routines. The collection of records for future reference is known as a personal knowledge base. Note-taking applications are valuable tools for building and maintaining these bases, often called a ‘‘second brain’’. This paper presents a case study on how people build and explore personal knowledge bases for various purposes. We selected the note-taking tool Obsidian and researchers from a Brazilian lab for an in-depth investigation. Our investigation reveals interesting findings about how researchers build and explore their personal knowledge bases. A key finding is that participants’ knowledge retrieval strategy influences how they build and maintain their content. We suggest potential features for an AI system to support this process.
zh

[AI-14] An Improved Time Series Anomaly Detection by Applying Structural Similarity

【速读】:该论文旨在解决重建类无监督异常检测方法在处理复杂模式异常时性能不足的问题,其核心挑战在于现有方法仅依赖点对点距离度量优化目标,忽略了时间序列中潜在的结构特征(如趋势、季节性和形状),导致难以捕捉全局波动与局部特征的一致性。解决方案的关键是提出StrAD(Structure-enhanced Anomaly Detection),通过引入结构感知的优化目标机制,将趋势、季节性和形状等结构信息嵌入重建模型的目标函数中,从而引导数据重构过程更好地保留原始序列的结构性特征,提升模型对点级异常和模式级异常的敏感性。该机制具有可插拔特性,可兼容任意重建类方法,显著增强模型在真实世界多场景下的异常检测能力。

链接: https://arxiv.org/abs/2509.20184
作者: Tiejun Wang,Rui Wang,Xudong Mou,Mengyuan Ma,Tianyu Wo,Renyu Yang,Xudong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective anomaly detection in time series is pivotal for modern industrial applications and financial systems. Due to the scarcity of anomaly labels and the high cost of manual labeling, reconstruction-based unsupervised approaches have garnered considerable attention. However, accurate anomaly detection remains an unsettled challenge, since the optimization objectives of reconstruction-based methods merely rely on point-by-point distance measures, ignoring the potential structural characteristics of time series and thus failing to tackle complex pattern-wise anomalies. In this paper, we propose StrAD, a novel structure-enhanced anomaly detection approach to enrich the optimization objective by incorporating structural information hidden in the time series and steering the data reconstruction procedure to better capture such structural features. StrAD accommodates the trend, seasonality, and shape in the optimization objective of the reconstruction model to learn latent structural characteristics and capture the intrinsic pattern variation of time series. The proposed structure-aware optimization objective mechanism can assure the alignment between the original data and the reconstructed data in terms of structural features, thereby keeping consistency in global fluctuation and local characteristics. The mechanism is pluggable and applicable to any reconstruction-based methods, enhancing the model sensitivity to both point-wise anomalies and pattern-wise anomalies. Experimental results show that StrAD improves the performance of state-of-the-art reconstruction-based models across five real-world anomaly detection datasets.
zh

[AI-15] Automated Multi-Agent Workflows for RTL Design NEURIPS2025

【速读】:该论文旨在解决专用领域(如寄存器传输级代码生成)中因硬件描述语言(HDL)和专有电子设计自动化(EDA)资源稀缺而导致的生成式AI模型训练困难、推理成本高及代理编排复杂的问题。其解决方案的关键在于提出VeriMaAS多代理框架,通过将HDL工具提供的形式化验证反馈直接集成到代理工作流生成过程中,从而降低基于梯度更新或长推理轨迹的成本;该方法在pass@k指标上较微调基线提升5-7%,且仅需数百个训练样本,显著减少了监督成本。

链接: https://arxiv.org/abs/2509.20182
作者: Amulya Bhattaram,Janani Ramamoorthy,Ranit Gupta,Diana Marculescu,Dimitrios Stamoulis
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Accepted: ML for Systems Workshop NeurIPS 2025

点击查看摘要

Abstract:The rise of agentic AI workflows unlocks novel opportunities for computer systems design and optimization. However, for specialized domains such as program synthesis, the relative scarcity of HDL and proprietary EDA resources online compared to more common programming tasks introduces challenges, often necessitating task-specific fine-tuning, high inference costs, and manually-crafted agent orchestration. In this work, we present VeriMaAS, a multi-agent framework designed to automatically compose agentic workflows for RTL code generation. Our key insight is to integrate formal verification feedback from HDL tools directly into workflow generation, reducing the cost of gradient-based updates or prolonged reasoning traces. Our method improves synthesis performance by 5-7% for pass@k over fine-tuned baselines, while requiring only a few hundred training examples, representing an order-of-magnitude reduction in supervision cost.
zh

[AI-16] CyberSOCEval: Benchmarking LLM s Capabilities for Malware Analysis and Threat Intelligence Reasoning

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在网络安全运营中心(Security Operations Center, SOC)自动化场景中缺乏真实、系统性评估的问题,从而导致AI开发者难以明确优化方向,安全从业者无法可靠选择适用模型。其解决方案的关键在于提出并开源了CyberSOCEval基准测试套件,该套件聚焦于恶意软件分析(Malware Analysis)与威胁情报推理(Threat Intelligence Reasoning)两大核心防御任务,填补了现有基准在实战场景覆盖上的不足;通过实证表明,更大、更现代的LLMs表现更优,但推理类模型在测试时扩展(test time scaling)并未带来显著提升,揭示出当前模型在网络安全推理能力上的局限性,为后续开发提供了清晰的技术改进路径和挑战目标。

链接: https://arxiv.org/abs/2509.20166
作者: Lauren Deason,Adam Bali,Ciprian Bejean,Diana Bolocan,James Crnkovich,Ioana Croitoru,Krishna Durai,Chase Midler,Calin Miron,David Molnar,Brad Moon,Bruno Ostarcevic,Alberto Peltea,Matt Rosenberg,Catalin Sandu,Arthur Saputkin,Sagar Shah,Daniel Stan,Ernest Szocs,Shengye Wan,Spencer Whitman,Sven Krasser,Joshua Saxe
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Today’s cyber defenders are overwhelmed by a deluge of security alerts, threat intelligence signals, and shifting business context, creating an urgent need for AI systems to enhance operational security work. While Large Language Models (LLMs) have the potential to automate and scale Security Operations Center (SOC) operations, existing evaluations do not fully assess the scenarios most relevant to real-world defenders. This lack of informed evaluation impacts both AI developers and those applying LLMs to SOC automation. Without clear insight into LLM performance in real-world security scenarios, developers lack a north star for development, and users cannot reliably select the most effective models. Meanwhile, malicious actors are using AI to scale cyber attacks, highlighting the need for open source benchmarks to drive adoption and community-driven improvement among defenders and model developers. To address this, we introduce CyberSOCEval, a new suite of open source benchmarks within CyberSecEval 4. CyberSOCEval includes benchmarks tailored to evaluate LLMs in two tasks: Malware Analysis and Threat Intelligence Reasoning–core defensive domains with inadequate coverage in current benchmarks. Our evaluations show that larger, more modern LLMs tend to perform better, confirming the training scaling laws paradigm. We also find that reasoning models leveraging test time scaling do not achieve the same boost as in coding and math, suggesting these models have not been trained to reason about cybersecurity analysis, and pointing to a key opportunity for improvement. Finally, current LLMs are far from saturating our evaluations, showing that CyberSOCEval presents a significant challenge for AI developers to improve cyber defense capabilities.
zh

[AI-17] Affective Computing and Emotional Data: Challenges and Implications in Privacy Regulations The AI Act and Ethics in Large Language Models

【速读】:该论文旨在解决如何在人工智能系统中有效整合情感智能(Emotional Intelligence, EI),特别是在情绪识别与响应能力方面,以提升人机交互的自然性与伦理合规性。其核心问题在于:如何将人类复杂的情绪体验转化为结构化的数据,并在此基础上构建既高效又符合隐私保护与公平性的AI系统。解决方案的关键在于利用卷积神经网络(CNNs)和循环神经网络(RNNs)等基础神经架构实现多模态情绪识别(如面部表情、语音和文本),同时强调区分显式(研究场景下知情同意收集)与隐式(日常数字交互中被动获取)情绪数据的法律边界,进而推动建立以目的限制、数据最小化和有意义同意为核心的监管框架,确保情绪数据作为敏感个人数据在欧盟GDPR与《人工智能法案》(EU AI Act)下的合法处理与伦理使用。

链接: https://arxiv.org/abs/2509.20153
作者: Nicola Fabiano
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper examines the integration of emotional intelligence into artificial intelligence systems, with a focus on affective computing and the growing capabilities of Large Language Models (LLMs), such as ChatGPT and Claude, to recognize and respond to human emotions. Drawing on interdisciplinary research that combines computer science, psychology, and neuroscience, the study analyzes foundational neural architectures - CNNs for processing facial expressions and RNNs for sequential data, such as speech and text - that enable emotion recognition. It examines the transformation of human emotional experiences into structured emotional data, addressing the distinction between explicit emotional data collected with informed consent in research settings and implicit data gathered passively through everyday digital interactions. That raises critical concerns about lawful processing, AI transparency, and individual autonomy over emotional expressions in digital environments. The paper explores implications across various domains, including healthcare, education, and customer service, while addressing challenges of cultural variations in emotional expression and potential biases in emotion recognition systems across different demographic groups. From a regulatory perspective, the paper examines emotional data in the context of the GDPR and the EU AI Act frameworks, highlighting how emotional data may be considered sensitive personal data that requires robust safeguards, including purpose limitation, data minimization, and meaningful consent mechanisms.
zh

[AI-18] Formal Verification of Minimax Algorithms

【速读】:该论文旨在解决博弈搜索算法(如极小极大算法)在实现过程中因复杂性导致的正确性难以保障的问题,尤其是引入α-β剪枝和置换表(transposition table)等优化技术后,算法行为变得更加难以验证。解决方案的关键在于使用Dafny形式化验证系统,对多种变体的极小极大搜索算法进行严格的形式化证明,并针对带深度限制与置换表的搜索,提出基于“见证”(witness)的正确性判定准则,从而确保算法在优化后的实现中仍保持逻辑正确性。

链接: https://arxiv.org/abs/2509.20138
作者: Wieger Wesselink,Kees Huizing,Huub van de Wetering
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Using the Dafny verification system, we formally verify a range of minimax search algorithms, including variations with alpha-beta pruning and transposition tables. For depth-limited search with transposition tables, we introduce a witness-based correctness criterion and apply it to two representative algorithms. All verification artifacts, including proofs and Python implementations, are publicly available.
zh

[AI-19] Discovering Association Rules in High-Dimensional Small Tabular Data ECAI2025

【速读】:该论文针对高维表格数据中关联规则挖掘(Association Rule Mining, ARM)面临的规则爆炸(rule explosion)和计算开销过大问题,提出了一种改进的神经符号方法。其关键解决方案在于:首先,通过实证表明Aerial+在五个真实世界数据集上相比主流算法和神经符号基线可提升一到两个数量级的可扩展性;其次,首次将ARM问题聚焦于高维低数据场景(如生物医学基因表达数据,约18k特征、50样本),并提出两种基于表格基础模型(tabular foundation models)对Aerial+进行微调的方法,显著提升了低数据条件下的规则质量,从而有效缓解了传统神经网络在小样本下性能下降的问题。

链接: https://arxiv.org/abs/2509.20113
作者: Erkan Karabulut,Daniel Daza,Paul Groth,Victoria Degeler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper was accepted at ECAI 2025 Workshop: 1st International Workshop on Advanced Neuro-Symbolic Applications (ANSyA)

点击查看摘要

Abstract:Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios. Comments: This paper was accepted at ECAI 2025 Workshop: 1st International Workshop on Advanced Neuro-Symbolic Applications (ANSyA) Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.20113 [cs.LG] (or arXiv:2509.20113v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.20113 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-20] PEPS: Quantum-Inspired Reinforcement Learning for Coherent Reasoning Traces in LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多步推理任务中难以维持连贯推理轨迹的问题,尤其在需要结构化逻辑流的任务中表现不佳。其解决方案的关键在于引入一种量子启发式的 fidelity 奖励机制,该机制源自投影纠缠对态(Projected Entangled Pair States, PEPS),并将其集成到近端策略优化(Proximal Policy Optimization, PPO)框架中,通过结构一致性引导学习过程,从而增强生成推理轨迹的全局连贯性。此方法不依赖于直接监督或对比目标,而是利用量子物理中的保真度概念作为奖励信号,显著提升了模型在多种推理任务(如算术、直觉推理和蕴含推理)上的推理连贯性表现。

链接: https://arxiv.org/abs/2509.20105
作者: Venkat Margapuri,Garik Kazanjian,Naren Kosaraju
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often struggle with maintaining coherent multi-step reasoning traces, particularly in tasks that require a structured logical flow. This work introduces a quantum-inspired approach to address the challenge by incorporating a fidelity-based reward derived from Projected Entangled Pair States (PEPS) into Proximal Policy Optimization. Unlike prior approaches that use direct supervision or contrastive objectives, the proposed method guides learning through structural consistency, offering a novel approach to enforce global coherence in generated reasoning traces. The proposed framework is evaluated using multiple coherence-determining metrics on diverse datasets such as GSM8K, StrategyQA, and EntailmentBank spanning arithmetic, intuitive, and entailment-based reasoning. Results show that the proposed quantum-inspired approach offers significant improvements over supervised, contrastive, and pretrained baseline approaches, highlighting the effectiveness of quantum-inspired fidelity as a foundation to improve reasoning trace coherence in LLMs.
zh

[AI-21] Steerable Adversarial Scenario Generation through Test-Time Preference Alignment

【速读】:该论文旨在解决现有对抗场景生成方法在安全性评估中因固定权衡对抗性与真实性而缺乏灵活性的问题,导致生成的场景无法在推理阶段进行精细调控,难以满足多样化的训练与测试需求。解决方案的关键在于将对抗场景生成重构为多目标偏好对齐问题,并提出SAGE(Steerable Adversarial scenario GEnerator)框架:通过分层分组偏好优化实现数据高效的离线对齐,解耦硬性约束与软性偏好;并利用两个对立偏好专家的权重线性插值,在无需重训练的前提下实现推理时对对抗性与真实性的连续控制,其理论基础为线性模式连通性。

链接: https://arxiv.org/abs/2509.20102
作者: Tong Nie,Yuewen Mei,Yihong Tang,Junlin He,Jie Sun,Haotian Shi,Wei Ma,Jian Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial scenario generation is a cost-effective approach for safety assessment of autonomous driving systems. However, existing methods are often constrained to a single, fixed trade-off between competing objectives such as adversariality and realism. This yields behavior-specific models that cannot be steered at inference time, lacking the efficiency and flexibility to generate tailored scenarios for diverse training and testing requirements. In view of this, we reframe the task of adversarial scenario generation as a multi-objective preference alignment problem and introduce a new framework named \textbfSteerable \textbfAdversarial scenario \textbfGEnerator (SAGE). SAGE enables fine-grained test-time control over the trade-off between adversariality and realism without any retraining. We first propose hierarchical group-based preference optimization, a data-efficient offline alignment method that learns to balance competing objectives by decoupling hard feasibility constraints from soft preferences. Instead of training a fixed model, SAGE fine-tunes two experts on opposing preferences and constructs a continuous spectrum of policies at inference time by linearly interpolating their weights. We provide theoretical justification for this framework through the lens of linear mode connectivity. Extensive experiments demonstrate that SAGE not only generates scenarios with a superior balance of adversariality and realism but also enables more effective closed-loop training of driving policies. Project page: this https URL.
zh

[AI-22] From Pheromones to Policies: Reinforcement Learning for Engineered Biological Swarms

【速读】:该论文旨在解决如何在分布式自组织系统中实现高效集体决策的问题,特别是揭示生物群体(如秀丽隐杆线虫)通过信息素(pheromone)介导的聚集行为与强化学习(reinforcement learning, RL)机制之间的理论等价性。其解决方案的关键在于:将环境中的信息素动态建模为一种分布式奖励机制,使其数学上等价于强化学习中的交叉学习(cross-learning)更新规则,并通过计算实验验证该模型能准确再现线虫的觅食行为;进一步发现,在动态环境中引入少量对信息素不敏感的探索型个体可打破由持久信息素轨迹引发的正反馈陷阱,从而恢复群体适应能力,实现探索-利用权衡的优化与过时策略的群体级淘汰。这一机制为构建具有鲁棒决策能力的可编程生命系统提供了理论基础和工程范式。

链接: https://arxiv.org/abs/2509.20095
作者: Aymeric Vellinger,Nemanja Antonic,Elio Tuci
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Contribution to the 9th International Symposium on Swarm Behavior and Bio-Inspired Robotics 2025

点击查看摘要

Abstract:Swarm intelligence emerges from decentralised interactions among simple agents, enabling collective problem-solving. This study establishes a theoretical equivalence between pheromone-mediated aggregation in \celeg\ and reinforcement learning (RL), demonstrating how stigmergic signals function as distributed reward mechanisms. We model engineered nematode swarms performing foraging tasks, showing that pheromone dynamics mathematically mirror cross-learning updates, a fundamental RL algorithm. Experimental validation with data from literature confirms that our model accurately replicates empirical \celeg\ foraging patterns under static conditions. In dynamic environments, persistent pheromone trails create positive feedback loops that hinder adaptation by locking swarms into obsolete choices. Through computational experiments in multi-armed bandit scenarios, we reveal that introducing a minority of exploratory agents insensitive to pheromones restores collective plasticity, enabling rapid task switching. This behavioural heterogeneity balances exploration-exploitation trade-offs, implementing swarm-level extinction of outdated strategies. Our results demonstrate that stigmergic systems inherently encode distributed RL processes, where environmental signals act as external memory for collective credit assignment. By bridging synthetic biology with swarm robotics, this work advances programmable living systems capable of resilient decision-making in volatile environments.
zh

[AI-23] MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLM

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际临床诊断中因依赖传统提示方法而难以有效处理复杂病例的问题,尤其是现有提示工程与多智能体方法通常仅优化单次推理,忽视了可复用的临床经验积累。其解决方案的关键在于提出一种多智能体临床诊断框架(Multi-Agent Clinical Diagnosis, MACD),通过一个包含总结、精炼和应用诊断洞察的多智能体流水线,使LLMs能够自主学习并累积临床知识,模拟医生通过实践提升专业能力的过程,从而聚焦于关键疾病特异性线索,显著提升诊断准确性。

链接: https://arxiv.org/abs/2509.20067
作者: Wenliang Li,Rui Yan,Xu Zhang,Li Chen,Hongji Zhu,Jing Zhao,Junjun Li,Mengru Li,Wei Cao,Zihang Jiang,Wei Wei,Kun Zhang,Shaohua Kevin Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real-world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi-agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi-Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self-learn clinical knowledge via a multi-agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more focused and accurate diagnosis on key disease-specific cues. We further extend it to a MACD-human collaborative workflow, where multiple LLM-based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real-world patient cases across seven diseases using diverse open-source LLMs (Llama-3.1 8B/70B, DeepSeek-R1-Distill-Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains up to 22.3% (MACD). On the subset of the data, it achieves performance on par with or exceeding that of human physicians (up to 16% improvement over physicians-only diagnosis). Additionally, on the MACD-human workflow, it achieves an 18.6% improvement compared to physicians-only diagnosis. Moreover, self-learned knowledge exhibits strong cross-model stability, transferability, and model-specific personalization, while the system can generate traceable rationales, enhancing explainability. Consequently, this work presents a scalable self-learning paradigm for LLM-assisted diagnosis, bridging the gap between the intrinsic knowledge of LLMs and real-world clinical practice.
zh

[AI-24] One Filters All: A Generalist Filter for State Estimation NEURIPS2025

【速读】:该论文旨在解决动态系统中隐藏状态估计(即最优滤波)这一长期存在的科学与工程难题。其解决方案的关键在于提出了一种通用的滤波框架——LLM-Filter,该框架通过将噪声观测值嵌入到文本原型(text prototypes)中,利用预训练大语言模型(Large Language Models, LLMs)中的推理知识进行状态估计。其核心创新包括:1)通过与冻结的LLM实现适当的模态对齐,显著优于当前基于学习的方法;2)设计了System-as-Prompt(SaP)提示结构,使LLM能够理解滤波任务指令,从而在变化甚至未见过的环境中展现出优异的泛化能力;3)观察到LLM-Filter存在缩放定律(scaling-law)行为,即模型规模和训练时间越大越长,性能越高。这些发现表明LLM-Filter有望成为滤波领域的基础模型。

链接: https://arxiv.org/abs/2509.20051
作者: Shiqi Liu,Wenhan Cao,Chang Liu,Zeyu He,Tianyi Zhang,Shengbo Eben Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:Estimating hidden states in dynamical systems, also known as optimal filtering, is a long-standing problem in various fields of science and engineering. In this paper, we introduce a general filtering framework, \textbfLLM-Filter, which leverages large language models (LLMs) for state estimation by embedding noisy observations with text prototypes. In various experiments for classical dynamical systems, we find that first, state estimation can significantly benefit from the reasoning knowledge embedded in pre-trained LLMs. By achieving proper modality alignment with the frozen LLM, LLM-Filter outperforms the state-of-the-art learning-based approaches. Second, we carefully design the prompt structure, System-as-Prompt (SaP), incorporating task instructions that enable the LLM to understand the estimation tasks. Guided by these prompts, LLM-Filter exhibits exceptional generalization, capable of performing filtering tasks accurately in changed or even unseen environments. We further observe a scaling-law behavior in LLM-Filter, where accuracy improves with larger model sizes and longer training times. These findings make LLM-Filter a promising foundation model of filtering.
zh

[AI-25] Projective Kolmogorov Arnold Neural Networks (P-KANs): Entropy-Driven Functional Space Discovery for Interpretable Machine Learning

【速读】:该论文旨在解决当前Kolmogorov-Arnold Networks (KANs) 在高维样条参数空间中存在的冗余问题,这种冗余导致模型雅可比矩阵中出现“干扰空间”(nuisance space),从而引发过拟合和泛化能力差的问题。解决方案的关键在于提出一种新的训练框架——Projective Kolmogorov-Arnold Networks (P-KANs),其核心思想是通过信息熵最小化技术与稀疏字典学习,引导边函数(edge functions)向具有解释性的函数表示空间(如傅里叶、切比雪夫、贝塞尔)收敛,而非受限于预定义的函数空间;同时引入“引力项”(gravitational terms)在保持样条灵活性的同时促进最优表示的自动发现,实现参数压缩(最高达80%)、噪声鲁棒性提升及工业级应用(如自动化纤维铺设预测)的成功落地。

链接: https://arxiv.org/abs/2509.20049
作者: Alastair Poole,Stig McArthur,Saravan Kumar
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) relocate learnable nonlinearities from nodes to edges, demonstrating remarkable capabilities in scientific machine learning and interpretable modeling. However, current KAN implementations suffer from fundamental inefficiencies due to redundancy in high-dimensional spline parameter spaces, where numerous distinct parameterisations yield functionally equivalent behaviors. This redundancy manifests as a “nuisance space” in the model’s Jacobian, leading to susceptibility to overfitting and poor generalization. We introduce Projective Kolmogorov-Arnold Networks (P-KANs), a novel training framework that guides edge function discovery towards interpretable functional representations through entropy-minimisation techniques from signal analysis and sparse dictionary learning. Rather than constraining functions to predetermined spaces, our approach maintains spline space flexibility while introducing “gravitational” terms that encourage convergence towards optimal functional representations. Our key insight recognizes that optimal representations can be identified through entropy analysis of projection coefficients, compressing edge functions to lower-parameter projective spaces (Fourier, Chebyshev, Bessel). P-KANs demonstrate superior performance across multiple domains, achieving up to 80% parameter reduction while maintaining representational capacity, significantly improved robustness to noise compared to standard KANs, and successful application to industrial automated fiber placement prediction. Our approach enables automatic discovery of mixed functional representations where different edges converge to different optimal spaces, providing both compression benefits and enhanced interpretability for scientific machine learning applications.
zh

[AI-26] Diffusion-Augmented Contrastive Learning: A Noise-Robust Encoder for Biosignal Representations

【速读】:该论文旨在解决生物信号(如心电图ECG)中表示学习的鲁棒性问题,即如何在复杂生理数据的多变特性下提取稳定且具有判别力的特征表示。传统数据增强方法往往难以捕捉这些数据中的细微变化,从而限制了模型性能。其解决方案的关键在于提出了一种融合扩散模型与监督对比学习的新型混合框架——Diffusion-Augmented Contrastive Learning (DACL):首先利用轻量级变分自编码器(VAE)对新提出的Scattering Transformer (ST) 特征进行编码以构建潜在空间;随后,将扩散前向过程作为结构化的数据增强手段,在不同噪声水平下生成多个嵌入视图;最后,通过U-Net结构的编码器结合监督对比损失函数,学习一个在各类别间具有强区分能力、同时对噪声具有鲁棒性的表示。此方法创新性地以扩散过程驱动对比目标,实现了噪声不变的嵌入表示,为生物信号的表示学习提供了新范式。

链接: https://arxiv.org/abs/2509.20048
作者: Rami Zewail
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Learning robust representations for biosignals is often hampered by the challenge of designing effective data this http URL methods can fail to capture the complex variations inherent in physiological data. Within this context, we propose a novel hybrid framework, Diffusion-Augmented Contrastive Learning (DACL), that fuses concepts from diffusion models and supervised contrastive learning. The DACL framework operates on a latent space created by a lightweight Variational Autoencoder (VAE) trained on our novel Scattering Transformer (ST) features [12]. It utilizes the diffusion forward process as a principled data augmentation technique to generate multiple noisy views of these latent embeddings. A U-Net style encoder is then trained with a supervised contrastive objective to learn a representation that balances class discrimination with robustness to noise across various diffusion time steps. We evaluated this proof-of-concept method on the PhysioNet 2017 ECG dataset, achieving a competitive AUROC of 0.7815. This work establishes a new paradigm for representation learning by using the diffusion process itself to drive the contrastive objective, creating noise-invariant embeddings that demonstrate a strong foundation for class separability.
zh

[AI-27] Choosing to Be Green: Advancing Green AI via Dynamic Model Selection

【速读】:该论文旨在解决现代人工智能(Artificial Intelligence, AI)系统,尤其是深度神经网络和大语言模型,在追求高性能的同时带来的显著环境成本问题,即高能耗与碳排放。其核心解决方案是提出“绿色AI动态模型选择”(Green AI dynamic model selection),关键在于通过综合考虑推理任务特性、可用模型的环境可持续性以及精度要求,动态选择最节能且精度损失最小的模型。具体实现包含两种方法:Green AI动态模型级联(cascading)与Green AI动态模型路由(routing),实证结果表明该策略可在保持接近最优精度(高达95%)的前提下实现最高约25%的能源节约。

链接: https://arxiv.org/abs/2509.19996
作者: Emilio Cruciani,Roberto Verdecchia
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 2nd Workshop on Green-Aware Artificial Intelligence (Green-Aware 2025). 9 pages, 1 figure

点击查看摘要

Abstract:Artificial Intelligence is increasingly pervasive across domains, with ever more complex models delivering impressive predictive performance. This fast technological advancement however comes at a concerning environmental cost, with state-of-the-art models - particularly deep neural networks and large language models - requiring substantial computational resources and energy. In this work, we present the intuition of Green AI dynamic model selection, an approach based on dynamic model selection that aims at reducing the environmental footprint of AI by selecting the most sustainable model while minimizing potential accuracy loss. Specifically, our approach takes into account the inference task, the environmental sustainability of available models, and accuracy requirements to dynamically choose the most suitable model. Our approach presents two different methods, namely Green AI dynamic model cascading and Green AI dynamic model routing. We demonstrate the effectiveness of our approach via a proof of concept empirical example based on a real-world dataset. Our results show that Green AI dynamic model selection can achieve substantial energy savings (up to ~25%) while substantially retaining the accuracy of the most energy greedy solution (up to ~95%). As conclusion, our preliminary findings highlight the potential that hybrid, adaptive model selection strategies withhold to mitigate the energy demands of modern AI systems without significantly compromising accuracy requirements.
zh

[AI-28] An effective control of large systems of active particles: An application to evacuation problem

【速读】:该论文旨在解决大规模活性粒子系统(如人群疏散、机器人集群控制等场景)中因现有控制方法缺乏可扩展性和鲁棒性而难以实现高效操控的问题,尤其针对传统方法需对每个个体进行独立控制所带来的计算复杂度与实施难度。其解决方案的关键在于引入一种基于领导者(leader)的控制策略,结合强化学习(Reinforcement Learning, RL)与人工力(artificial forces)机制,构建了一种改进的广义Vicsek模型(generalized Vicsek model),以实现对群体行为的有效引导。实验表明,单纯使用RL方法效果不佳,而该融合策略能显著提升疏散效率和鲁棒性,适用于大规模人群撤离等复杂场景。

链接: https://arxiv.org/abs/2509.19972
作者: Albina Klepach,Egor E. Nuzhin,Alexey A. Tsukanov,Nikolay V. Brilliantov
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Manipulation of large systems of active particles is a serious challenge across diverse domains, including crowd management, control of robotic swarms, and coordinated material transport. The development of advanced control strategies for complex scenarios is hindered, however, by the lack of scalability and robustness of the existing methods, in particular, due to the need of an individual control for each agent. One possible solution involves controlling a system through a leader or a group of leaders, which other agents tend to follow. Using such an approach we develop an effective control strategy for a leader, combining reinforcement learning (RL) with artificial forces acting on the system. To describe the guidance of active particles by a leader we introduce the generalized Vicsek model. This novel method is then applied to the problem of the effective evacuation by a robot-rescuer (leader) of large groups of people from hazardous places. We demonstrate, that while a straightforward application of RL yields suboptimal results, even for advanced architectures, our approach provides a robust and efficient evacuation strategy. The source code supporting this study is publicly available at: this https URL.
zh

[AI-29] A Set of Generalized Components to Achieve Effective Poison-only Clean-label Backdoor Attacks with Collaborative Sample Selection and Triggers NEURIPS2025

【速读】:该论文旨在解决干净标签后门攻击(Clean-label Backdoor Attacks, CBAs)中样本选择与触发器(trigger)设计相互割裂导致的攻击成功率(ASR)和隐蔽性难以同时提升的问题。现有方法通常孤立地处理样本选择与触发器生成,缺乏协同优化机制,致使攻击性能受限,且在转换为PCBA(Poison-Only Clean-label Backdoor Attacks)时表现不佳。解决方案的关键在于提出三个协同组件:Component A 通过识别两个关键选择因素并根据触发器规模动态组合,以更合理地选取“难样本”提升ASR;Component B 选择与触发器植入样本相似的样本以增强隐蔽性;Component C 利用人类视觉系统对RGB颜色的不同敏感度重新分配触发器强度,在保证隐蔽性的前提下进一步提高ASR。三者可灵活集成至多种PCBA框架中,实现ASR与隐蔽性的协同优化。

链接: https://arxiv.org/abs/2509.19947
作者: Zhixiao Wu,Yao Lu,Jie Wen,Hao Sun,Qi Zhou,Guangming Lu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 31 pages, 16 figures, accepted in Neurips 2025

点击查看摘要

Abstract:Poison-only Clean-label Backdoor Attacks aim to covertly inject attacker-desired behavior into DNNs by merely poisoning the dataset without changing the labels. To effectively implant a backdoor, multiple \textbftriggers are proposed for various attack requirements of Attack Success Rate (ASR) and stealthiness. Additionally, sample selection enhances clean-label backdoor attacks’ ASR by meticulously selecting hard'' samples instead of random samples to poison. Current methods 1) usually handle the sample selection and triggers in isolation, leading to severely limited improvements on both ASR and stealthiness. Consequently, attacks exhibit unsatisfactory performance on evaluation metrics when converted to PCBAs via a mere stacking of methods. Therefore, we seek to explore the bidirectional collaborative relations between the sample selection and triggers to address the above dilemma. 2) Since the strong specificity within triggers, the simple combination of sample selection and triggers fails to substantially enhance both evaluation metrics, with generalization preserved among various attacks. Therefore, we seek to propose a set of components to significantly improve both stealthiness and ASR based on the commonalities of attacks. Specifically, Component A ascertains two critical selection factors, and then makes them an appropriate combination based on the trigger scale to select more reasonable hard’’ samples for improving ASR. Component B is proposed to select samples with similarities to relevant trigger implanted samples to promote stealthiness. Component C reassigns trigger poisoning intensity on RGB colors through distinct sensitivity of the human visual system to RGB for higher ASR, with stealthiness ensured by sample selection, including Component B. Furthermore, all components can be strategically integrated into diverse PCBAs.
zh

[AI-30] ABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees ICDM2025

【速读】:该论文旨在解决机器学习模型在训练过程中继承数据偏见的问题,从而导致不公平预测结果的挑战。其解决方案的关键在于提出一种名为TABFAIRGDT的新方法,通过使用自回归决策树(autoregressive decision trees)生成公平的合成表格数据,并引入软叶节点重采样技术(soft leaf resampling),在不假设数据分布的前提下调整决策树输出,以降低偏差并保持下游任务的预测性能。该方法为非参数化建模,能有效捕捉混合特征类型间的复杂关系,且具有轻量、高效、无需预处理的特点,在保证合成数据质量的同时显著优于现有深度生成模型的公平性-效用权衡表现。

链接: https://arxiv.org/abs/2509.19927
作者: Emmanouil Panagiotou,Benoît Ronval,Arjun Roy,Ludwig Bothmann,Bernd Bischl,Siegfried Nijssen,Eirini Ntoutsi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Paper accepted at IEEE ICDM 2025: IEEE International Conference on Data Mining 2025, November 12-15, 2025, Washington DC, USA

点击查看摘要

Abstract:Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric, effectively capturing complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving better fairness-utility trade-off for downstream tasks, as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for medium-sized datasets (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.
zh

[AI-31] CON-QA: Privacy-Preserving QA using cloud LLM s in Contract Domain

【速读】:该论文旨在解决企业将云部署的大语言模型(Large Language Models, LLMs)用于法律合同文档问答时所面临的敏感信息泄露问题,特别是个人身份信息(Personally Identifiable Information, PII)和商业敏感条款的隐私保护挑战。解决方案的关键在于提出一种混合隐私保护框架 CON-QA,其核心机制包括:(1) 利用本地部署的 LLM 对查询进行语义分解与基于查询感知的文档片段检索,确保敏感信息在云端处理前被识别;(2) 通过结构化的“一对一多映射”方案对检测到的敏感实体进行匿名化处理,在保持语义连贯性的同时防范跨会话实体推断攻击;(3) 在云端 LLM 上生成匿名化回答,并借助会话一致的“多对一反向映射”实现本地原答案的精确重建,从而在保障隐私的前提下维持问答质量与法律条款语义一致性。

链接: https://arxiv.org/abs/2509.19925
作者: Ajeet Kumar Singh,Rajsabi Surya,Anurag Tripathi,Santanu Choudhury,Sudhir Bisane
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As enterprises increasingly integrate cloud-based large language models (LLMs) such as ChatGPT and Gemini into their legal document workflows, protecting sensitive contractual information - including Personally Identifiable Information (PII) and commercially sensitive clauses - has emerged as a critical challenge. In this work, we propose CON-QA, a hybrid privacy-preserving framework designed specifically for secure question answering over enterprise contracts, effectively combining local and cloud-hosted LLMs. The CON-QA framework operates through three stages: (i) semantic query decomposition and query-aware document chunk retrieval using a locally deployed LLM analysis, (ii) anonymization of detected sensitive entities via a structured one-to-many mapping scheme, ensuring semantic coherence while preventing cross-session entity inference attacks, and (iii) anonymized response generation by a cloud-based LLM, with accurate reconstruction of the original answer locally using a session-consistent many-to-one reverse mapping. To rigorously evaluate CON-QA, we introduce CUAD-QA, a corpus of 85k question-answer pairs generated over 510 real-world CUAD contract documents, encompassing simple, complex, and summarization-style queries. Empirical evaluations, complemented by detailed human assessments, confirm that CON-QA effectively maintains both privacy and utility, preserves answer quality, maintains fidelity to legal clause semantics, and significantly mitigates privacy risks, demonstrating its practical suitability for secure, enterprise-level contract documents.
zh

[AI-32] Exploration with Foundation Models: Capabilities Limitations and Hybrid Approaches NEURIPS2025

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中在稀疏奖励环境下的探索难题,尤其是评估基础模型(如大语言模型 LLM 和视觉语言模型 VLM)作为零样本探索代理在经典 RL 基准任务中的潜力。其关键发现是:尽管 VLM 能够从视觉输入中推断高层目标,但在低层精确控制方面存在显著“知行鸿沟”(knowing-doing gap)。为弥合这一差距,研究提出了一种简单且基于策略的混合框架(on-policy hybrid framework),在理想化场景下验证了 VLM 指导可显著提升早期阶段的样本效率,从而揭示了利用基础模型引导探索而非直接进行端到端控制的可行性与局限性。

链接: https://arxiv.org/abs/2509.19924
作者: Remo Sasso,Michelangelo Conserva,Dominik Jeurissen,Paulo Rauber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures. Accepted for presentation at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on the Foundations of Reasoning in Language Models (FoRLM)

点击查看摘要

Abstract:Exploration in reinforcement learning (RL) remains challenging, particularly in sparse-reward settings. While foundation models possess strong semantic priors, their capabilities as zero-shot exploration agents in classic RL benchmarks are not well understood. We benchmark LLMs and VLMs on multi-armed bandits, Gridworlds, and sparse-reward Atari to test zero-shot exploration. Our investigation reveals a key limitation: while VLMs can infer high-level objectives from visual input, they consistently fail at precise low-level control: the “knowing-doing gap”. To analyze a potential bridge for this gap, we investigate a simple on-policy hybrid framework in a controlled, best-case scenario. Our results in this idealized setting show that VLM guidance can significantly improve early-stage sample efficiency, providing a clear analysis of the potential and constraints of using foundation models to guide exploration rather than for end-to-end control.
zh

[AI-33] owards Self-Supervised Foundation Models for Critical Care Time Series ALT NEURIPS2025

【速读】:该论文旨在解决重症监护时间序列(critical care time series)领域中基础模型(foundation models)研究不足的问题,尤其是由于数据集规模有限和可用性差导致的模型泛化能力弱、小样本场景下性能不佳等挑战。其解决方案的关键在于提出一种基于Bi-Axial Transformer(BAT)架构的早期预训练基础模型,该模型在聚合的电子健康记录(electronic health record, EHR)数据上进行自监督学习,从而实现有效的迁移学习;在独立于训练数据集的小样本(<5,000例) mortality 预测任务中,该模型显著优于传统监督基线方法,展现出在资源受限临床环境中构建通用且鲁棒的AI辅助决策系统的潜力。

链接: https://arxiv.org/abs/2509.19885
作者: Katja Naasunnguaq Jagd,Rachael DeVries,Ole Winther
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025 workshop Learning from Time Series for Health (TS4H)

点击查看摘要

Abstract:Domain-specific foundation models for healthcare have expanded rapidly in recent years, yet foundation models for critical care time series remain relatively underexplored due to the limited size and availability of datasets. In this work, we introduce an early-stage pre-trained foundation model for critical care time-series based on the Bi-Axial Transformer (BAT), trained on pooled electronic health record datasets. We demonstrate effective transfer learning by fine-tuning the model on a dataset distinct from the training sources for mortality prediction, where it outperforms supervised baselines, particularly for small datasets ( 5,000 ). These contributions highlight the potential of self-supervised foundation models for critical care times series to support generalizable and robust clinical applications in resource-limited settings.
zh

[AI-34] CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance

【速读】:该论文旨在解决生成式歌唱语音合成(Singing Voice Synthesis, SVS)中因提示(prompt)机制导致的韵律泄露(prosody leakage)问题,即在零样本生成过程中,音高信息意外地与音色提示耦合,从而削弱了对旋律的精确控制能力。解决方案的关键在于提出CoMelSinger框架,其核心创新包括:(1)基于非自回归MaskGCT架构,将传统文本输入替换为歌词和音高标记(lyric and pitch tokens),实现结构化且解耦的旋律控制;(2)设计粗到细的对比学习策略,显式正则化声学提示与旋律输入之间的音高冗余性,有效抑制韵律泄露;(3)引入轻量级仅编码器结构的歌唱语音转录(Singing Voice Transcription, SVT)模块,提供帧级音高与持续时间对齐监督,提升合成质量与可控性。

链接: https://arxiv.org/abs/2509.19883
作者: Junchuan Zhao,Wei Zeng,Tianle Lyu,Ye Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extending these techniques to SVS remains non-trivial due to the requirement for precise melody control. In particular, prompt-based generation often introduces prosody leakage, where pitch information is inadvertently entangled within the timbre prompt, compromising controllability. We present CoMelSinger, a zero-shot SVS framework that enables structured and disentangled melody control within a discrete codec modeling paradigm. Built on the non-autoregressive MaskGCT architecture, CoMelSinger replaces conventional text inputs with lyric and pitch tokens, preserving in-context generalization while enhancing melody conditioning. To suppress prosody leakage, we propose a coarse-to-fine contrastive learning strategy that explicitly regularizes pitch redundancy between the acoustic prompt and melody input. Furthermore, we incorporate a lightweight encoder-only Singing Voice Transcription (SVT) module to align acoustic tokens with pitch and duration, offering fine-grained frame-level supervision. Experimental results demonstrate that CoMelSinger achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines.
zh

[AI-35] Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials

【速读】:该论文旨在解决深度学习方法在材料电子结构哈密顿量(Hamiltonian)预测中面临的泛化性能挑战,这些问题主要源于原子类型多样性、结构模式复杂性以及哈密顿量本身的高维特性。解决方案的关键在于提出NextHAM框架:首先引入零阶哈密顿量作为输入层的特征描述符和输出层的目标初始估计,使模型直接学习修正项而非完整哈密顿量,从而显著简化学习过程中的输入-输出映射;其次设计具备E(3)对称性和高非线性表达能力的神经Transformer架构,确保模型在物理约束下保持精度;最后提出一种新型训练目标函数,同时保障实空间与倒易空间中哈密顿量的准确性,避免因重叠矩阵条件数过大导致的误差放大和“鬼态”(ghost states)问题。

链接: https://arxiv.org/abs/2509.19877
作者: Shi Yin,Zujian Dai,Xinyang Pan,Lixin He
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Deep learning methods for electronic-structure Hamiltonian prediction has offered significant computational efficiency advantages over traditional DFT methods, yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to the generalization performance. In this work, we contribute on both the methodology and dataset sides to advance universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose NextHAM, a neural E(3)-symmetry and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed by the initial charge density of DFT, as informative descriptors of neural regression model in the input level and initial estimates of the target Hamiltonian in the output level, so that the regression model directly predicts the correction terms to the target ground truths, thereby significantly simplifying the input-output mapping for learning. Second, we present a neural Transformer architecture with strict E(3)-Symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy performance of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of “ghost states” caused by the large condition number of the overlap matrix. On the dataset side, we curate a high-quality broad-coverage large benchmark, namely Materials-HAM-SOC, comprising 17,000 material structures spanning 68 elements from six rows of the periodic table and explicitly incorporating SOC effects. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.
zh

[AI-36] CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks

【速读】:该论文旨在解决在移动边缘计算(Mobile Edge Computing, MEC)网络中,基于Transformer的大语言模型(Large Language Models, LLMs)训练面临的高计算负载、端到端延迟大以及模型泛化能力弱等问题。其核心解决方案是提出一种混合分布式学习框架CollaPipe,关键在于将编码器部分以可变大小的片段形式自适应地分配到移动设备上进行流水线并行训练,而解码器则部署在边缘服务器上执行生成任务,并通过联邦聚合(Federated Aggregation)实现全局模型更新。进一步地,为提升训练效率,作者构建了一个联合优化问题,动态分配模型片段、微批次、带宽和传输功率,并基于Lyapunov优化设计了动态分段调度与资源分配(Dynamic Segment Scheduling and Resource Allocation, DSSDA)算法,从而在长期约束下保证系统稳定性,实验证明该方案显著提升了计算效率、降低了延迟并减少了单设备内存占用。

链接: https://arxiv.org/abs/2509.19855
作者: Jiewei Chen,Xiumei Deng,Zehui Xiong,Shaoyong Guo,Xuesong Qiu,Ping Wang,Dusit Niyato
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Submitted to IEEE for review

点击查看摘要

Abstract:The increasing demand for intelligent mobile applications has made multi-agent collaboration with Transformer-based large language models (LLMs) essential in mobile edge computing (MEC) networks. However, training LLMs in such environments remains challenging due to heavy computation, high end-to-end latency, and limited model generalization. We introduce CollaPipe, a hybrid distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving intelligent networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. Then we perform global model update via federated aggregation. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power. We derive and use a closed-form convergence bound to design an Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based on Lyapunov optimization, ensuring system stability under long-term constraints. Extensive experiments on downstream tasks with Transformer and BERT models show that CollaPipe improves computation efficiency by up to 15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device memory usage by more than half, enabling online learning in heterogeneous and dynamic communication environments.
zh

[AI-37] Eliminating stability hallucinations in llm -based tts models via attention guidance ICASSP2026

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的文本到语音(Text-to-Speech, TTS)模型中常见的稳定性幻觉问题,例如重复或遗漏语音输出。其核心解决方案在于改进并利用注意力机制:首先提出最优对齐得分(Optimal Alignment Score, OAS),通过维特比算法(Viterbi algorithm)量化文本与语音 token 之间的对齐质量,并将其融入 CosyVoice2 的训练过程以促进连续且稳定的对齐学习;其次,借助预训练注意力值,采用思维链(Chain-of-Thought, CoT)策略指导学生模型 CosyVoice2 的训练,进一步减少合成语音中的稳定性幻觉。实验表明,该方法在 Seed-TTS-Eval 和 CV3-Eval 测试集上有效提升了语音稳定性,且未引入负面效应。

链接: https://arxiv.org/abs/2509.19852
作者: ShiMing Wang,ZhiHao Du,Yang Xiang,TianYu Zhao,Han Zhao,Qian Chen,XianGang Li,HanJie Guo,ZhenHua Ling
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, submitted to ICASSP2026

点击查看摘要

Abstract:This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. Subsequently, OAS was integrated into the training of CosyVoice2 to assist LLMs in learning continuous, stable alignment. Additionally, the pre-trained attention value is employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods can effectively reduce the stability hallucinations of CosyVoice2 without introducing additional negative effects. The appendix is available at this https URL.
zh

[AI-38] Analyzing Generalization in Pre-Trained Symbolic Regression

【速读】:该论文旨在解决预训练的基于Transformer的符号回归(Symbolic Regression)模型在分布外(out-of-distribution)场景下泛化能力不足的问题。其核心发现是:尽管这些模型在预训练数据分布内表现优异,但在面对分布外挑战时性能显著下降,揭示了当前方法存在严重的泛化差距。解决方案的关键在于识别并量化这一泛化瓶颈,从而为未来改进模型架构、预训练策略及测试范式提供实证依据,以推动符号回归技术在真实世界应用中的可靠性与实用性。

链接: https://arxiv.org/abs/2509.19849
作者: Henrik Voigt,Paul Kahlmeyer,Kai Lawonn,Michael Habeck,Joachim Giesen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Symbolic regression algorithms search a space of mathematical expressions for formulas that explain given data. Transformer-based models have emerged as a promising, scalable approach shifting the expensive combinatorial search to a large-scale pre-training phase. However, the success of these models is critically dependent on their pre-training data. Their ability to generalize to problems outside of this pre-training distribution remains largely unexplored. In this work, we conduct a systematic empirical study to evaluate the generalization capabilities of pre-trained, transformer-based symbolic regression. We rigorously test performance both within the pre-training distribution and on a series of out-of-distribution challenges for several state of the art approaches. Our findings reveal a significant dichotomy: while pre-trained models perform well in-distribution, the performance consistently degrades in out-of-distribution scenarios. We conclude that this generalization gap is a critical barrier for practitioners, as it severely limits the practical use of pre-trained approaches for real-world applications.
zh

[AI-39] LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation NEURIPS2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在保持实用性的前提下实现鲁棒安全对齐(robust safety alignment)的难题,尤其关注如何平衡全面的安全防护与细粒度的可控性。其解决方案的关键在于提出一种三阶段框架LATENTGUARD,该框架将行为对齐与受监督的潜在空间控制相结合:首先通过理性化数据集微调LLM,建立涵盖安全关键场景和效用保留场景的行为先验;随后训练一个结构化变分自编码器(structured variational autoencoder, VAE),基于多标签标注(包括攻击类型、攻击方法和良性指示符)监督中间MLP激活层,从而学习解耦且语义可解释的潜在表示;最终通过对学习到的潜在维度进行定向操控,实现选择性拒绝有害请求的同时保留合法使用场景下的帮助性。这一方法显著提升了安全性可控性和响应可解释性,且不牺牲模型实用性。

链接: https://arxiv.org/abs/2509.19839
作者: Huizhen Shu,Xuying Li,Zhuo Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9-page NeurIPS 2025 preprint including 3 figures and 1 table, with additional appendix material. Prepared using the NeurIPS 2025 preprint template and compiled with pdfLaTeX. All references are included via the provided .bbl file. Figures are in PDF format. No external supplementary files. All necessary style files and images are included

点击查看摘要

Abstract:Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LATENTGUARD, a novel three-stage framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach begins by fine-tuning an LLM on rationalized datasets containing both reasoning-enhanced refusal responses to adversarial prompts and reasoning-enhanced normal responses to benign queries, establishing robust behavioral priors across both safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This supervision enables the VAE to learn disentangled latent representations that capture distinct adversarial characteristics while maintaining semantic interpretability. Through targeted manipulation of learned latent dimensions, LATENTGUARD achieves selective refusal behavior, effectively blocking harmful requests while preserving helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate significant improvements in both safety controllability and response interpretability without compromising utility. Cross-architecture validation on Mistral-7B confirms the generalizability of our latent steering approach, showing consistent effectiveness across different model families. Our results suggest that structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.
zh

[AI-40] On the Rate of Convergence of Kolmogorov-Arnold Network Regression Estimators

【速读】:该论文旨在解决多变量函数逼近中模型可解释性与逼近精度之间的权衡问题,特别是在非参数回归场景下如何构建具有理论保证的结构化神经网络。其解决方案的关键在于引入基于B样条(B-splines)表示的Kolmogorov-Arnold Networks (KANs),通过证明加法型和混合加法-乘法型KANs在Sobolev空间中达到最优收敛率 $ O(n^{-2r/(2r+1)}) $,从而为KANs提供了严格的理论支撑,并给出了最优节点数选择准则,使其成为传统方法的一种结构化替代方案。

链接: https://arxiv.org/abs/2509.19830
作者: Wei Liu,Eleni Chatzi,Zhilu Lai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) offer a structured and interpretable framework for multivariate function approximation by composing univariate transformations through additive or multiplicative aggregation. This paper establishes theoretical convergence guarantees for KANs when the univariate components are represented by B-splines. We prove that both additive and hybrid additive-multiplicative KANs attain the minimax-optimal convergence rate O(n^-2r/(2r+1)) for functions in Sobolev spaces of smoothness r . We further derive guidelines for selecting the optimal number of knots in the B-splines. The theory is supported by simulation studies that confirm the predicted convergence rates. These results provide a theoretical foundation for using KANs in nonparametric regression and highlight their potential as a structured alternative to existing methods.
zh

[AI-41] Analysis of approximate linear programming solution to Markov decision problem with log barrier function

【速读】:该论文旨在解决基于线性规划(Linear Programming, LP)求解马尔可夫决策过程(Markov Decision Process, MDP)时面临的挑战,即LP方法通常导致不等式约束优化问题,相较于基于贝尔曼方程(Bellman equation)的动态规划方法更难高效求解。其解决方案的关键在于引入对数障碍函数(log-barrier function),将原带不等式约束的LP问题转化为无约束优化问题,从而可通过梯度下降法获得近似解,为LP-based MDP求解提供了一种更有效且实用的理论框架。

链接: https://arxiv.org/abs/2509.19800
作者: Donghwan Lee,Hyukjun Yang,Bum Geun Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been less commonly employed, although they have recently gained attention in contexts such as offline RL. The relative underuse of the LP-based methods stems from the fact that it leads to an inequality-constrained optimization problem, which is generally more challenging to solve effectively compared with Bellman-equation-based methods. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear simple, to the best of our knowledge, a thorough theoretical interpretation of this approach has not yet been developed. This paper aims to bridge this gap.
zh

[AI-42] RDAR: Reward-Driven Agent Relevance Estimation for Autonomous Driving

【速读】:该论文旨在解决自动驾驶系统在处理复杂交通场景时计算效率低下的问题,即现有注意力机制对所有邻近交通参与者(agents)进行二次复杂度的交互建模,导致资源浪费。其解决方案的关键在于提出RDAR策略,通过将代理选择建模为马尔可夫决策过程(Markov Decision Process),学习每个代理对控制车辆行为的影响程度(即“相关性”),并以二值掩码(binary mask)形式动态排除无关代理,从而显著减少输入规模,同时保持与先进行为模型相当的驾驶性能(如行驶进度、安全性与效率)。

链接: https://arxiv.org/abs/2509.19789
作者: Carlo Bosio,Greg Woelki,Noureldin Hendy,Nicholas Roy,Byungsoo Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Human drivers focus only on a handful of agents at any one time. On the other hand, autonomous driving systems process complex scenes with numerous agents, regardless of whether they are pedestrians on a crosswalk or vehicles parked on the side of the road. While attention mechanisms offer an implicit way to reduce the input to the elements that affect decisions, existing attention mechanisms for capturing agent interactions are quadratic, and generally computationally expensive. We propose RDAR, a strategy to learn per-agent relevance – how much each agent influences the behavior of the controlled vehicle – by identifying which agents can be excluded from the input to a pre-trained behavior model. We formulate the masking procedure as a Markov Decision Process where the action consists of a binary mask indicating agent selection. We evaluate RDAR on a large-scale driving dataset, and demonstrate its ability to learn an accurate numerical measure of relevance by achieving comparable driving performance, in terms of overall progress, safety and performance, while processing significantly fewer agents compared to a state of the art behavior model.
zh

[AI-43] Agent ic Metacognition: Designing a “Self-Aware” Low-Code Agent for Failure Prediction and Human Handoff

【速读】:该论文旨在解决低代码/无代码(Low-Code/No-Code, LCNC)环境中自主代理因固有非确定性而导致的可靠性问题,例如陷入不可预见的循环、生成错误输出或遭遇无法恢复的故障,进而引发用户挫败感并破坏信任。解决方案的关键在于引入一个“元认知”(metacognitive)层,该层作为第二层架构主动监控主LCNC代理,并基于预定义触发条件(如过度延迟或重复动作)预测任务失败;一旦预测到失败,元认知代理将主动发起人工接管(human handoff),向用户提供代理的“思考过程”摘要及无法继续执行的详细解释,从而提升系统韧性与用户透明度。

链接: https://arxiv.org/abs/2509.19783
作者: Jiexi Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 7 pages, 2 tables

点击查看摘要

Abstract:The inherent non-deterministic nature of autonomous agents, particularly within low-code/no-code (LCNC) environments, presents significant reliability challenges. Agents can become trapped in unforeseen loops, generate inaccurate outputs, or encounter unrecoverable failures, leading to user frustration and a breakdown of trust. This report proposes a novel architectural pattern to address these issues: the integration of a secondary, “metacognitive” layer that actively monitors the primary LCNC agent. Inspired by human introspection, this layer is designed to predict impending task failures based on a defined set of triggers, such as excessive latency or repetitive actions. Upon predicting a failure, the metacognitive agent proactively initiates a human handoff, providing the user with a clear summary of the agent’s “thought process” and a detailed explanation of why it could not proceed. An empirical analysis of a prototype system demonstrates that this approach significantly increases the overall task success rate. However, this performance gain comes with a notable increase in computational overhead. The findings reframe human handoffs not as an admission of defeat but as a core design feature that enhances system resilience, improves user experience, and builds trust by providing transparency into the agent’s internal state. The report discusses the practical and ethical implications of this approach and identifies key directions for future research.
zh

[AI-44] PPGFlowECG: Latent Rectified Flow with Cross-Modal Encoding for PPG-Guided ECG Generation and Cardiovascular Disease Detection

【速读】:该论文旨在解决如何利用可穿戴设备中易获取的光电容积脉搏波(Photoplethysmography, PPG)信号,生成高质量且具有临床诊断价值的心电图(Electrocardiography, ECG)信号的问题。当前方法受限于生理语义错位和高维信号建模复杂性,难以实现准确、可解释的PPG到ECG转换。解决方案的关键在于提出一个两阶段框架PPGFlowECG:首先通过CardioAlign Encoder在共享潜在空间中对齐PPG与ECG特征,确保生理信息一致性;其次采用潜在修正流(latent rectified flow)生成高保真度、可解释的ECG信号,从而提升心血管疾病(CVD)检测准确性与医生诊断可靠性。

链接: https://arxiv.org/abs/2509.19774
作者: Xiaocheng Fang,Jiarui Jin,Haoyu Wang,Che Liu,Jieyi Cai,Guangkun Nie,Jun Li,Hongyan Li,Shenda Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:In clinical practice, electrocardiography (ECG) remains the gold standard for cardiac monitoring, providing crucial insights for diagnosing a wide range of cardiovascular diseases (CVDs). However, its reliance on specialized equipment and trained personnel limits feasibility for continuous routine monitoring. Photoplethysmography (PPG) offers accessible, continuous monitoring but lacks definitive electrophysiological information, preventing conclusive diagnosis. Generative models present a promising approach to translate PPG into clinically valuable ECG signals, yet current methods face substantial challenges, including the misalignment of physiological semantics in generative models and the complexity of modeling in high-dimensional signals. To this end, we propose PPGFlowECG, a two-stage framework that aligns PPG and ECG in a shared latent space via the CardioAlign Encoder and employs latent rectified flow to generate ECGs with high fidelity and interpretability. To the best of our knowledge, this is the first study to experiment on MCMED, a newly released clinical-grade dataset comprising over 10 million paired PPG-ECG samples from more than 118,000 emergency department visits with expert-labeled cardiovascular disease annotations. Results demonstrate the effectiveness of our method for PPG-to-ECG translation and cardiovascular disease detection. Moreover, cardiologist-led evaluations confirm that the synthesized ECGs achieve high fidelity and improve diagnostic reliability, underscoring our method’s potential for real-world cardiovascular screening.
zh

[AI-45] Sobolev acceleration for neural networks

【速读】:该论文试图解决的问题是:Sobolev训练(Sobolev training)在深度神经网络中为何能加速收敛并提升泛化性能,其背后的理论机制尚不清晰。解决方案的关键在于构建了首个严格的理论框架,证明了在高斯输入和浅层架构的师生(student-teacher)设定下,Sobolev训练能够通过优化损失函数的条件数(conditioning)和梯度流(gradient-flow)收敛速率来加速ReLU网络的收敛过程。作者推导出种群梯度和海森矩阵(Hessian)的精确公式,并定量分析了损失景观(loss landscape)的改善,从而从理论上阐明了Sobolev训练的有效性,且数值实验验证了其在现代深度学习任务中的广泛适用性。

链接: https://arxiv.org/abs/2509.19773
作者: Jong Kwon Oh,Hanbaek Lyu,Hwijae Son
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sobolev training, which integrates target derivatives into the loss functions, has been shown to accelerate convergence and improve generalization compared to conventional L^2 training. However, the underlying mechanisms of this training method remain only partially understood. In this work, we present the first rigorous theoretical framework proving that Sobolev training accelerates the convergence of Rectified Linear Unit (ReLU) networks. Under a student-teacher framework with Gaussian inputs and shallow architectures, we derive exact formulas for population gradients and Hessians, and quantify the improvements in conditioning of the loss landscape and gradient-flow convergence rates. Extensive numerical experiments validate our theoretical findings and show that the benefits of Sobolev training extend to modern deep learning tasks.
zh

[AI-46] Frictional Q-Learning

【速读】:该论文旨在解决离策略强化学习(off-policy reinforcement learning, RL)中因动作外推(extrapolation error)导致的策略不稳定问题,尤其是在使用深度神经网络进行函数逼近时,策略可能偏离经验回放缓冲区(replay buffer)中支持的动作区域,从而引发性能下降甚至训练崩溃。解决方案的关键在于引入“摩擦力”类比——将经典力学中的静摩擦力概念映射到策略更新过程中,提出一种基于动作空间约束的机制:Frictional Q-learning 通过限制智能体的动作空间,使其行为倾向于停留在回放缓冲区所支持的区域内,同时保持与正交动作空间流形的距离,从而有效抑制外推误差。该方法在保持批处理约束(batch-constrained)算法简洁性的同时,提供了直观的物理解释,并在标准连续控制基准测试中展现出鲁棒性和竞争力。

链接: https://arxiv.org/abs/2509.19771
作者: Hyunwoo Kim,Hyo Kyung Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We draw an analogy between static friction in classical mechanics and extrapolation error in off-policy RL, and use it to formulate a constraint that prevents the policy from drifting toward unsupported actions. In this study, we present Frictional Q-learning, a deep reinforcement learning algorithm for continuous control, which extends batch-constrained reinforcement learning. Our algorithm constrains the agent’s action space to encourage behavior similar to that in the replay buffer, while maintaining a distance from the manifold of the orthonormal action space. The constraint preserves the simplicity of batch-constrained, and provides an intuitive physical interpretation of extrapolation error. Empirically, we further demonstrate that our algorithm is robustly trained and achieves competitive performance across standard continuous control benchmarks.
zh

[AI-47] FusedANN: Convexified Hybrid ANN via Attribute-Vector Fusion

【速读】:该论文旨在解决现实场景中混合查询(hybrid queries)的检索效率与准确性难题,即如何在保持高召回率的同时实现快速响应,同时避免传统方法依赖脆弱且难以扩展的索引技巧。其核心挑战在于将属性过滤(attribute filtering)与向量相似性搜索(vector similarity search)有效融合,而现有方案往往在召回率、速度和灵活性之间存在权衡。解决方案的关键是提出FusedANN(Fused Attribute-Vector Nearest Neighbor),它通过几何框架将过滤约束转化为近似最近邻(Approximate Nearest Neighbor, ANN)优化中的显式约束,并引入一种基于拉格朗日松弛思想的凸融合空间(convex fused space)。该方法利用Transformer-based凸化机制联合嵌入属性与向量信息,将硬性过滤条件转化为连续加权惩罚项,在保留top-k语义的同时支持高效近似搜索;理论证明其在高选择性下退化为精确过滤,低匹配时平滑过渡至语义最相近属性,且维持下游ANN的α-近似保证,从而构建了一个可验证、可扩展的符号约束与向量相似性之间的桥梁。

链接: https://arxiv.org/abs/2509.19767
作者: Alireza Heidari,Wei Zhang,Ying Xiong
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Optimization and Control (math.OC)
备注: 62 pages,12 figures

点击查看摘要

Abstract:Vector search powers transformers technology, but real-world use demands hybrid queries that combine vector similarity with attribute filters (e.g., “top document in category X, from 2023”). Current solutions trade off recall, speed, and flexibility, relying on fragile index hacks that don’t scale. We introduce FusedANN (Fused Attribute-Vector Nearest Neighbor), a geometric framework that elevates filtering to ANN optimization constraints and introduces a convex fused space via a Lagrangian-like relaxation. Our method jointly embeds attributes and vectors through transformer-based convexification, turning hard filters into continuous, weighted penalties that preserve top-k semantics while enabling efficient approximate search. We prove that FusedANN reduces to exact filtering under high selectivity, gracefully relaxes to semantically nearest attributes when exact matches are insufficient, and preserves downstream ANN alpha-approximation guarantees. Empirically, FusedANN improves query throughput by eliminating brittle filtering stages, achieving superior recall-latency tradeoffs on standard hybrid benchmarks without specialized index hacks, delivering up to 3 times higher throughput and better recall than state-of-the-art hybrid and graph-based systems. Theoretically, we provide explicit error bounds and parameter selection rules that make FusedANN practical for production. This establishes a principled, scalable, and verifiable bridge between symbolic constraints and vector similarity, unlocking a new generation of filtered retrieval systems for large, hybrid, and dynamic NLP/ML workloads.
zh

[AI-48] he Conductor and the Engine: A Path Towards Co-Designed Reasoning

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中因内部模型冗余和外部代理编排效率低下而导致的计算资源浪费问题,尤其是在小规模开源模型中推理能力受限的瓶颈。其解决方案的关键在于提出一种优化的推理工作流(\cepo),通过协同设计底层模型能力和外部编排框架,显著提升小至中等规模模型的推理效率与性能,使其在某些任务上超越数倍于自身规模的模型。

链接: https://arxiv.org/abs/2509.19762
作者: Yuanxin Wang,Pawel Filipczuk,Anisha Garg,Amaan Dhada,Mohammad Hassanpour,David Bick,Ganesh Venkatesh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern LLM reasoning relies on extensive test-time computation, driven by internal model training and external agentic orchestration. However, this synergy is often inefficient, as model verbosity and poor instruction following lead to wasted compute. We analyze this capability-cost trade-off and introduce an optimized reasoning workflow (\cepo) that empowers smaller open-source models to outperform models multiple times their size. We will open-source this workflow to enable further research. Our work demonstrates a clear path toward co-designing orchestration frameworks with the underlying model capabilities to unlock powerful reasoning in small-to-medium sized models.
zh

[AI-49] ARCADE: A Real-Time Data System for Hybrid and Continuous Query Processing across Diverse Data Modalities

【速读】:该论文旨在解决多模态数据(包括文本、图像、视频、空间和关系模态)在实时语义搜索与检索场景下,现有数据库系统在高效数据摄取、连续查询处理以及混合分析支持方面能力不足的问题。其核心解决方案是提出ARCADE系统,关键创新在于:(1) 在基于LSM存储的架构上构建统一的磁盘级二级索引,以高效支持向量、空间和文本等多种模态数据的查询;(2) 设计一个综合的成本优化器,用于优化跨模态的混合查询;(3) 引入增量物化视图框架,提升连续查询的执行效率。该系统基于开源RocksDB存储引擎和MySQL查询引擎实现,在读密集型和写密集型工作负载下分别比领先系统快7.4倍和1.4倍。

链接: https://arxiv.org/abs/2509.19757
作者: Jingyi Yang,Songsong Mo,Jiachen Shi,Zihao Yu,Kunhao Shi,Xuchen Ding,Gao Cong
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The explosive growth of multimodal data - spanning text, image, video, spatial, and relational modalities, coupled with the need for real-time semantic search and retrieval over these data - has outpaced the capabilities of existing multimodal and real-time database systems, which either lack efficient ingestion and continuous query capability, or fall short in supporting expressive hybrid analytics. We introduce ARCADE, a real-time data system that efficiently supports high-throughput ingestion and expressive hybrid and continuous query processing across diverse data types. ARCADE introduces unified disk-based secondary index on LSM-based storage for vector, spatial, and text data modalities, a comprehensive cost-based query optimizer for hybrid queries, and an incremental materialized view framework for efficient continuous queries. Built on open-source RocksDB storage and MySQL query engine, ARCADE outperforms leading multimodal data systems by up to 7.4x on read-heavy and 1.4x on write-heavy workloads.
zh

[AI-50] Cuffless Blood Pressure Prediction from Speech Sentences using Deep Learning Methods

【速读】:该论文旨在解决传统袖带式血压测量方法因白大衣效应和隐匿性高血压等因素导致的测量结果不一致问题,从而实现无创、实时、准确的动脉血压(Arterial Blood Pressure, ABP)预测。其解决方案的关键在于利用基于BERT的回归模型对语音信号中的声学特征进行深度分析,通过提取与血压水平相关的模式,建立语音特征与ABP之间的映射关系,从而实现无需物理接触的血压监测。实验表明,该方法在95名受试者数据集上取得了优异性能, systolic blood pressure (SBP) 的平均绝对误差(MAE)为13.6 mmHg,diastolic blood pressure (DBP) 的MAE为12.4 mmHg,相关系数R分别达到0.99和0.94,验证了模型的有效性与鲁棒性。

链接: https://arxiv.org/abs/2509.19750
作者: Kainat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: MS Thesis

点击查看摘要

Abstract:This research presents a novel method for noninvasive arterial blood pressure ABP prediction using speech signals employing a BERT based regression model Arterial blood pressure is a vital indicator of cardiovascular health and accurate monitoring is essential in preventing hypertension related complications Traditional cuff based methods often yield inconsistent results due to factors like whitecoat and masked hypertension Our approach leverages the acoustic characteristics of speech capturing voice features to establish correlations with blood pressure levels Utilizing advanced deep learning techniques we analyze speech signals to extract relevant patterns enabling real time monitoring without the discomfort of conventional methods In our study we employed a dataset comprising recordings from 95 participants ensuring diverse representation The BERT model was fine tuned on extracted features from speech leading to impressive performance metrics achieving a mean absolute error MAE of 136 mmHg for systolic blood pressure SBP and 124 mmHg for diastolic blood pressure DBP with R scores of 099 and 094 respectively These results indicate the models robustness in accurately predicting blood pressure levels Furthermore the training and validation loss analysis demonstrates effective learning and minimal overfitting Our findings suggest that integrating deep learning with speech analysis presents a viable alternative for blood pressure monitoring paving the way for improved applications in telemedicine and remote health monitoring By providing a user friendly and accurate method for blood pressure assessment this research has significant implications for enhancing patient care and proactive management of cardiovascular health
zh

[AI-51] Intuition to Evidence: Measuring AIs True Impact on Developer Productivity

【速读】:该论文旨在解决如何在真实企业环境中评估AI辅助软件开发工具的实际效能与部署挑战,尤其是在大规模团队中整合生成式AI(Generative AI)能力对开发流程效率的影响问题。其解决方案的关键在于构建并部署一个内部AI平台(DeputyDev),该平台集成了代码生成与自动化代码审查功能,并通过为期一年的纵向研究和严谨的对照组分析,在300名工程师的日常开发工作中验证了其效果:结果显示PR(Pull Request)评审周期平均缩短31.8%,且高采纳用户代码提交量提升61%,占生产环境代码总量的30–40%,整体代码交付量增长28%。这一实证研究填补了实验室基准测试与实际生产场景之间的差距,揭示了AI集成在工程实践中的潜力与落地难点。

链接: https://arxiv.org/abs/2509.19708
作者: Anand Kumar,Vishal Khare,Deepak Sharma,Satyam Kumar,Vijay Saini,Anshul Yadav,Sachendra Jain,Ankit Rana,Pratham Verma,Vaibhav Meena,Avinash Edubilli
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 10 figures, 5 tables

点击查看摘要

Abstract:We present a comprehensive real-world evaluation of AI-assisted software development tools deployed at enterprise scale. Over one year, 300 engineers across multiple teams integrated an in-house AI platform (DeputyDev) that combines code generation and automated review capabilities into their daily workflows. Through rigorous cohort analysis, our study demonstrates statistically significant productivity improvements, including an overall 31.8% reduction in PR review cycle time. Developer adoption was strong, with 85% satisfaction for code review features and 93% expressing a desire to continue using the platform. Adoption patterns showed systematic scaling from 4% engagement in month 1 to 83% peak usage by month 6, stabilizing at 60% active engagement. Top adopters achieved a 61% increase in code volume pushed to production, contributing to approximately 30 to 40% of code shipped to production through this tool, accounting for an overall 28% increase in code shipment volume. Unlike controlled benchmark evaluations, our longitudinal analysis provides empirical evidence from production environments, revealing both the transformative potential and practical deployment challenges of integrating AI into enterprise software development workflows. Comments: 16 pages, 10 figures, 5 tables Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2509.19708 [cs.SE] (or arXiv:2509.19708v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2509.19708 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-52] Causal Machine Learning for Surgical Interventions

【速读】:该论文旨在解决外科手术决策中个体化治疗效应(Individualized Treatment Effect, ITE)估计不准的问题,尤其是在脊柱融合或脊柱侧弯矫正等高风险场景下,传统统计方法难以处理复杂且异质性的数据。其解决方案的关键在于提出了一种多任务元学习框架 X-MultiTask,将每种手术决策(如前路 vs. 后路入路、手术 vs. 非手术)视为独立任务,同时在任务间学习共享表示,并通过引入逆概率加权(Inverse Probability Weighting, IPW)增强因果有效性。该方法在两个临床数据集上均表现出优越性能,显著降低了 PEHE 和 ATE 误差指标,为个性化外科决策提供了更可靠的因果估计工具。

链接: https://arxiv.org/abs/2509.19705
作者: J. Ben Tamo,Nishant S. Chouhan,Micky C. Nnamdi,Yining Yuan,Shreya S. Chivilkar,Wenqi Shi,Steven W. Hwang,B. Randall Brenn,May D. Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Surgical decision-making is complex and requires understanding causal relationships between patient characteristics, interventions, and outcomes. In high-stakes settings like spinal fusion or scoliosis correction, accurate estimation of individualized treatment effects (ITEs) remains limited due to the reliance on traditional statistical methods that struggle with complex, heterogeneous data. In this study, we develop a multi-task meta-learning framework, X-MultiTask, for ITE estimation that models each surgical decision (e.g., anterior vs. posterior approach, surgery vs. no surgery) as a distinct task while learning shared representations across tasks. To strengthen causal validity, we incorporate the inverse probability weighting (IPW) into the training objective. We evaluate our approach on two datasets: (1) a public spinal fusion dataset (1,017 patients) to assess the effect of anterior vs. posterior approaches on complication severity; and (2) a private AIS dataset (368 patients) to analyze the impact of posterior spinal fusion (PSF) vs. non-surgical management on patient-reported outcomes (PROs). Our model achieves the highest average AUC (0.84) in the anterior group and maintains competitive performance in the posterior group (0.77). It outperforms baselines in treatment effect estimation with the lowest overall \epsilon_\textNN-PEHE (0.2778) and \epsilon_\textATE (0.0763). Similarly, when predicting PROs in AIS, X-MultiTask consistently shows superior performance across all domains, with \epsilon_\textNN-PEHE = 0.2551 and \epsilon_\textATE = 0.0902. By providing robust, patient-specific causal estimates, X-MultiTask offers a powerful tool to advance personalized surgical care and improve patient outcomes. The code is available at this https URL.
zh

[AI-53] Linear Transformers Implicitly Discover Unified Numerical Algorithms NEURIPS2025

【速读】:该论文旨在解决多类低秩矩阵补全任务中缺乏统一、自适应迭代求解器的问题,包括标量预测、Nyström外推中的未见核片补全以及分布式计算场景下的高效求解。其解决方案的关键在于:通过在数百万个掩码块矩阵补全任务上训练一个线性注意力Transformer(Linear Attention Transformer),模型仅依赖输入-输出对和均方误差损失,无需显式提供正规方程、手工设计的迭代过程或任务关联提示,即可隐式学习到一种参数无关的更新规则。该规则在三种不同计算范式(全可见、秩受限更新和分布式计算)下保持一致,并被理论证明具有二阶收敛性、降低分布式迭代复杂度且在秩受限注意力下仍保持精度,从而实现了预测、估计与Nyström外推的统一资源自适应迭代求解器,凸显了上下文学习的强大能力。

链接: https://arxiv.org/abs/2509.19702
作者: Patrick Lutz,Aditya Gangrade,Hadi Daneshmand,Venkatesh Saligrama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: To appear at NeurIPS 2025

点击查看摘要

Abstract:We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice of Nyström extrapolation. The model sees only input-output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across three distinct computational regimes (full visibility, rank-limited updates, and distributed computation). We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with rank-limited attention. Thus, a transformer trained solely to patch missing blocks implicitly discovers a unified, resource-adaptive iterative solver spanning prediction, estimation, and Nyström extrapolation, highlighting a powerful capability of in-context learning.
zh

[AI-54] A Unified Noise-Curvature View of Loss of Trainability

【速读】:该论文旨在解决持续学习(Continual Learning)中的训练能力丧失(Loss of Trainability, LoT)问题,即在任务不断演化过程中,梯度更新不再带来性能提升,导致模型准确率停滞甚至下降。作者从优化角度分析了使用Adam优化器时LoT的成因,发现传统指标如Hessian矩阵秩、梯度尖锐度、权重或梯度范数等均无法可靠预测训练行为。解决方案的关键在于提出两个互补的判据:一个考虑批量大小的梯度噪声边界和一个受曲率波动控制的边界,二者结合形成每层可预测的训练阈值;基于此阈值设计了一个简单的逐层调度机制,使每层的有效步长保持在安全范围内,从而稳定训练过程并提升精度,且学习率轨迹与经典衰减模式一致。

链接: https://arxiv.org/abs/2509.19698
作者: Gunbir Singh Baveja,Mark Schmidt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve, so accuracy stalls or degrades despite adequate capacity and supervision. We analyze LoT incurred with Adam through an optimization lens and find that single indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy are not reliable predictors. Instead we introduce two complementary criteria: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound that combine into a per-layer predictive threshold that anticipates trainability behavior. Using this threshold, we build a simple per-layer scheduler that keeps each layers effective step below a safe limit, stabilizing training and improving accuracy across concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay, with learned learning-rate trajectories that mirror canonical decay.
zh

[AI-55] Diffusion-Based Impedance Learning for Contact-Rich Manipulation Tasks

【速读】:该论文旨在解决传统学习方法在物理交互场景中表现不足的问题,即学习方法擅长信息域中的运动生成,但缺乏对能量域中物理交互的建模能力;而基于阻抗控制(Impedance Control)的方法虽能实现物理交互,却依赖人工调参且难以适应复杂任务。其解决方案的关键在于提出了一种融合信息域与能量域的新型框架——基于扩散模型的阻抗学习(Diffusion-Based Impedance Learning)。该框架利用带有交叉注意力机制的Transformer扩散模型,从外部力矩(wrench)重构模拟零力轨迹(sZFT),并引入基于SLERP的四元数噪声调度策略以保证旋转空间的几何一致性;随后通过能量驱动估计器动态更新刚度和阻尼参数,并采用方向性规则在非任务轴上降低阻抗、保持任务方向刚性。此方法仅需数万样本即可实现亚毫米级位置精度和亚度级旋转精度,在KUKA LBR iiwa机器人上实现了实时扭矩控制与自主刚度调节,验证了其在复杂任务如越障和异形插件插入中的有效性。

链接: https://arxiv.org/abs/2509.19696
作者: Noah Geiger,Tamim Asfour,Neville Hogan,Johannes Lachner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:Learning methods excel at motion generation in the information domain but are not primarily designed for physical interaction in the energy domain. Impedance Control shapes physical interaction but requires task-aware tuning by selecting feasible impedance parameters. We present Diffusion-Based Impedance Learning, a framework that combines both domains. A Transformer-based Diffusion Model with cross-attention to external wrenches reconstructs a simulated Zero-Force Trajectory (sZFT). This captures both translational and rotational task-space behavior. For rotations, we introduce a novel SLERP-based quaternion noise scheduler that ensures geometric consistency. The reconstructed sZFT is then passed to an energy-based estimator that updates stiffness and damping parameters. A directional rule is applied that reduces impedance along non task axes while preserving rigidity along task directions. Training data were collected for a parkour scenario and robotic-assisted therapy tasks using teleoperation with Apple Vision Pro. With only tens of thousands of samples, the model achieved sub-millimeter positional accuracy and sub-degree rotational accuracy. Its compact model size enabled real-time torque control and autonomous stiffness adaptation on a KUKA LBR iiwa robot. The controller achieved smooth parkour traversal within force and velocity limits and 30/30 success rates for cylindrical, square, and star peg insertions without any peg-specific demonstrations in the training data set. All code for the Transformer-based Diffusion Model, the robot controller, and the Apple Vision Pro telemanipulation framework is publicly available. These results mark an important step towards Physical AI, fusing model-based control for physical interaction with learning-based methods for trajectory generation.
zh

[AI-56] Calibrated Reasoning : An Explanatory Verifier for Dynamic and Efficient Problem-Solving NEURIPS2025

【速读】:该论文旨在解决当前推理模型在测试时计算策略(test-time computing strategies)中因自我评估能力差而导致性能受限的问题。其解决方案的关键在于提出一种基于强化学习(GRPO)训练的成对解释性验证器(pairwise Explanatory Verifier),该验证器不仅能输出校准后的置信度分数,还能提供自然语言形式的推理过程,从而显著提升如“最佳n选一”(best-of-n)和“自省”(self-reflection)等测试时策略的准确性和效率;尤其在识别困难失败模式(如两个候选解均错误)方面表现优异,超越了传统多数投票方法。

链接: https://arxiv.org/abs/2509.19681
作者: Anisha Garg,Engin Tekin,Yash More,David Bick,Nishit Neema,Ganesh Venkatesh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Efficient Reasoning

点击查看摘要

Abstract:Advanced test-time computing strategies are essential for scaling reasoning models, but their effectiveness is capped by the models’ poor self-evaluation. We propose a pairwise Explanatory Verifier, trained via reinforcement learning (GRPO), that produces calibrated confidence scores and associated natural language reasoning for generated solutions. Our verifier improves the accuracy and efficiency of test-time strategies like best-of-n and self-reflection. Crucially, it excels at identifying challenging failure modes, such as when both candidate solutions are identically incorrect, succeeding where standard methods like majority voting fail.
zh

[AI-57] PolicyPad: Collaborative Prototyping of LLM Policies

【速读】:该论文试图解决在高风险领域(如心理健康和法律)中,如何通过协作式政策设计提升大型语言模型(Large Language Models, LLMs)的行为对齐与安全性问题。现有实践中,领域专家虽被纳入政策制定过程,但缺乏高效支持快速实验、反馈与迭代的工具,导致政策设计效率低且难以响应实际场景需求。解决方案的关键在于提出PolicyPad——一个交互式系统,融合用户体验(UX)原型设计实践(如启发式评估和故事板),使政策设计师能够实时协作撰写政策,并独立使用场景测试政策驱动的模型行为,从而实现紧密的反馈循环和创新性政策产出,有效促进AI对齐与安全的参与式发展路径。

链接: https://arxiv.org/abs/2509.19680
作者: K. J. Kevin Feng,Tzu-Sheng Kuo,Quan Ze(Jim)Chen,Inyoung Cheong,Kenneth Holstein,Amy X. Zhang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As LLMs gain adoption in high-stakes domains like mental health, domain experts are increasingly consulted to provide input into policies governing their behavior. From an observation of 19 policymaking workshops with 9 experts over 15 weeks, we identified opportunities to better support rapid experimentation, feedback, and iteration for collaborative policy design processes. We present PolicyPad, an interactive system that facilitates the emerging practice of LLM policy prototyping by drawing from established UX prototyping practices, including heuristic evaluation and storyboarding. Using PolicyPad, policy designers can collaborate on drafting a policy in real time while independently testing policy-informed model behavior with usage scenarios. We evaluate PolicyPad through workshops with 8 groups of 22 domain experts in mental health and law, finding that PolicyPad enhanced collaborative dynamics during policy design, enabled tight feedback loops, and led to novel policy contributions. Overall, our work paves participatory paths for advancing AI alignment and safety.
zh

[AI-58] hinking While Listening: Simple Test Time Scaling For Audio Classification ICASSP2026

【速读】:该论文旨在解决音频分类任务中模型推理能力不足的问题,即如何在音频分类过程中引入类别的逻辑推理机制以提升性能。其核心解决方案是提出一种“听觉思考”(thinking while listening)的框架,使神经网络能够在处理日常声音时进行类间推理,从而增强分类准确性;关键创新在于设计了一个支持测试时扩展(test-time scaling)的新架构,并验证了通过增加采样轨迹数量可稳定提升性能,同时发现仅微调小规模模型(如GPT-2)的嵌入层即可超越百亿参数文本推理模型的表现,展现出轻量化高效推理的潜力。

链接: https://arxiv.org/abs/2509.19676
作者: Prateek Verma,Mert Pilanci
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 6 pages, 3 figures, 2 Tables, ICASSP 2026

点击查看摘要

Abstract:We propose a framework that enables neural models to “think while listening” to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach–retraining only the embedding matrix of a frozen, smaller model like GPT-2–can surpass the performance of billion-parameter text-based reasoning models.
zh

[AI-59] Games Are Not Equal: Classifying Cloud Gaming Contexts for Effective User Experience Measurement

【速读】:该论文旨在解决云游戏(cloud gaming)用户体验难以准确衡量的问题,从而帮助网络运营商评估其动态资源调配策略的有效性。传统指标如带宽和帧率无法独立反映用户体验,必须结合具体的游戏类型和玩家行为阶段进行解读。解决方案的关键在于提出一种基于网络流量分析的实时用户体验度量方法,能够识别游戏标题(在游戏启动后5秒内完成分类)并持续判断玩家活动状态(活跃、被动或空闲),从而将带宽消耗与体验质量关联到具体的 gameplay context 中。该方法已在实际ISP环境中部署,并基于三个月内数十万次云游戏会话提供了精细化的洞察。

链接: https://arxiv.org/abs/2509.19669
作者: Yifan Wang,Minzhao Lyu,Vijay Sivaraman
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: This paper is accepted at ACM Internet Measurement Conference (IMC) 2025. In Proc. ACM IMC, Oct, 2025, Madison, WI, USA

点击查看摘要

Abstract:To tap into the growing market of cloud gaming, whereby game graphics is rendered in the cloud and streamed back to the user as a video feed, network operators are creating monetizable assurance services that dynamically provision network resources. However, without accurately measuring cloud gaming user experience, they cannot assess the effectiveness of their provisioning methods. Basic measures such as bandwidth and frame rate by themselves do not suffice, and can only be interpreted in the context of the game played and the player activity within the game. This paper equips the network operator with a method to obtain a real-time measure of cloud gaming experience by analyzing network traffic, including contextual factors such as the game title and player activity stage. Our method is able to classify the game title within the first five seconds of game launch, and continuously assess the player activity stage as being active, passive, or idle. We deploy it in an ISP hosting NVIDIA cloud gaming servers for the region. We provide insights from hundreds of thousands of cloud game streaming sessions over a three-month period into the dependence of bandwidth consumption and experience level on the gameplay contexts.
zh

[AI-60] RoboSSM: Scalable In-context Imitation Learning via State-Space Models

【速读】:该论文旨在解决当前基于Transformer的上下文模仿学习(In-context Imitation Learning, ICIL)方法在处理长提示(long prompts)时存在的计算效率低和泛化能力差的问题。现有ICIL方法依赖于Transformer架构,在部署阶段虽无需参数更新,但其二次复杂度限制了对长序列的处理能力,且在训练外的长提示场景下性能显著下降。解决方案的关键在于引入状态空间模型(State-Space Model, SSM)作为替代架构,具体采用名为Longhorn的先进SSM结构,该结构具备线性时间推理能力和强泛化性,从而有效支持长上下文提示下的任务适应。实验表明,RoboSSM在LIBERO基准上展现出对不同演示数量的良好外推能力、未见任务的高成功率以及长时程场景下的鲁棒性,验证了SSM作为ICIL高效且可扩展骨干网络的潜力。

链接: https://arxiv.org/abs/2509.19658
作者: Youngju Yoo,Jiaheng Hu,Yifeng Zhu,Bo Liu,Qiang Liu,Roberto Martín-Martín,Peter Stone
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 11 figures

点击查看摘要

Abstract:In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn – a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at this https URL.
zh

[AI-61] Where 6G Stands Today: Evolution Enablers and Research Gaps

【速读】:该论文旨在解决当前5G移动通信系统在应对未来高度数字化社会需求时的局限性问题,特别是在超高可靠性、无缝自动化和广域覆盖等方面难以满足新兴应用场景的挑战。其解决方案的关键在于提出6G网络应具备高度智能化、自动化和超高可靠性,并聚焦于若干关键技术的融合与突破,包括太赫兹(Terahertz, THz)通信、智能反射面(Intelligent Reflecting Surfaces)、大规模MIMO(Massive Multiple-Input Multiple-Output)以及由人工智能驱动的网络管理(AI-driven Networking),这些技术共同构成实现6G性能跃升的核心支撑体系。

链接: https://arxiv.org/abs/2509.19646
作者: Salma Tika,Abdelkrim Haqiq,Essaid Sabir,Elmahdi Driouch
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, conference, 2 tables

点击查看摘要

Abstract:As the fifth-generation (5G) mobile communication system continues its global deployment, both industry and academia have started conceptualizing the 6th generation (6G) to address the growing need for a progressively advanced and digital society. Even while 5G offers considerable advancements over LTE, it could struggle to be sufficient to meet all of the requirements, including ultra-high reliability, seamless automation, and ubiquitous coverage. In response, 6G is supposed to bring out a highly intelligent, automated, and ultra-reliable communication system that can handle a vast number of connected devices. This paper offers a comprehensive overview of 6G, beginning with its main stringent requirements while focusing on key enabling technologies such as terahertz (THz) communications, intelligent reflecting surfaces, massive MIMO and AI-driven networking that will shape the 6G networks. Furthermore, the paper lists various 6G applications and usage scenarios that will benefit from these advancements. At the end, we outline the potential challenges that must be addressed to achieve the 6G promises.
zh

[AI-62] Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling

【速读】:该论文旨在解决当前测试时扩展(Test-time Scaling, TTS)方法过于聚焦于计算最优的帕累托前沿,而忽视了实际系统性能指标(如延迟和每 token 成本)的问题。现有方法在追求计算效率的同时,可能无法实现系统层面的最优表现,从而限制了大语言模型(Large Language Models, LLMs)推理阶段的真实可用性。解决方案的关键在于引入以系统为导向的视角,通过分析不同优化技术(如张量并行(tensor parallelism)和推测解码(speculative decoding))对延迟、成本等实用指标的影响,揭示当前 TTS 方法的局限性,并推动向全面、系统感知的评估范式转变,从而更准确地捕捉推理时的 scaling laws 本质。

链接: https://arxiv.org/abs/2509.19645
作者: Youpeng Zhao,Jinpeng LV,Di Wu,Jun Wang,Christopher Gooley
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics, such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.
zh

[AI-63] Mamba Modulation: On the Length Generalization of Mamba NEURIPS

【速读】:该论文旨在解决Mamba模型在处理超出预训练阶段长度的上下文时性能显著下降的问题,其根源在于状态空间动态的分布外行为,特别是状态转移矩阵 A\mathbf{A} 的参数化导致的谱特性不稳定。解决方案的关键在于提出一种谱缩放(spectrum scaling)方法,通过有选择地调节每一层中 A\mathbf{A} 矩阵的谱来增强模型对长序列输入的鲁棒性,从而实现更有效的长上下文泛化能力。

链接: https://arxiv.org/abs/2509.19633
作者: Peng Lu,Jerry Huang,Qiuhao Zeng,Xinyu Wang,Boxing Wang,Philippe Langlais,Yufei Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted to The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS) 2025. First two authors contributed equally

点击查看摘要

Abstract:The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba’s performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix \mathbfA . Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, \exp(-\sum_t=1^N\Delta_t) , we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix \mathbfA , offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of \mathbfA matrices in each layer. We show that this can significantly improve performance in settings where simply modulating \Delta_t fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
zh

[AI-64] SteinerSQL: Graph-Guided Mathematical Reasoning for Text-to-SQL Generation EMNLP2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂Text-to-SQL查询时面临的双重挑战:一方面需要进行复杂的数学推理,另一方面需精确地导航数据库模式(schema)。现有方法通常将这两个问题分开处理,导致推理过程碎片化,影响逻辑和结构的正确性。其解决方案的关键在于提出SteinerSQL框架,将这两个挑战统一为一个以图为中心的优化问题,通过三个阶段实现:数学分解识别所需表(终端节点)、基于Steiner树问题构建最优推理骨架,以及多层次验证确保结果正确性。该方法显著提升了执行准确率,在LogicCat和Spider2.0-Lite基准上分别达到36.10%和40.04%的执行准确率,建立了新的SOTA。

链接: https://arxiv.org/abs/2509.19623
作者: Xutao Mao,Tao Liu,Hongying Zan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accept in Non-archival EMNLP 2025 MathNLP

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with complex Text-to-SQL queries that demand both sophisticated mathematical reasoning and intricate schema navigation. Existing methods often tackle these challenges in isolation, creating a fractured reasoning process that compromises logical and structural correctness. To resolve this, we introduce SteinerSQL, a framework that unifies these dual challenges into a single, graph-centric optimization problem. SteinerSQL operates in three stages: mathematical decomposition to identify required tables (terminals), optimal reasoning scaffold construction via a Steiner tree problem, and multi-level validation to ensure correctness. On the challenging LogicCat and Spider2.0-Lite benchmarks, SteinerSQL establishes a new state-of-the-art with 36.10% and 40.04% execution accuracy, respectively, using Gemini-2.5-Pro. Beyond accuracy, SteinerSQL presents a new, unified paradigm for Text-to-SQL, paving the way for more robust and principled solutions to complex reasoning tasks.
zh

[AI-65] Knowledge Base-Aware Orchestration: A Dynamic Privacy-Preserving Method for Multi-Agent Systems

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在处理复杂、知识密集型任务时,因依赖静态智能体描述而导致的任务调度效率低下问题,尤其是在动态环境中,静态描述难以反映智能体能力的实时变化。解决方案的关键在于提出一种基于知识库感知(Knowledge Base-Aware, KBA)的编排机制,通过引入每个智能体内部知识库(Knowledge Base, KB)生成的轻量级、隐私保护的相关性反馈信号(ACK signal),动态补充静态描述信息,从而实现更精准和自适应的任务路由决策。该机制在不暴露原始数据的前提下,构建共享语义缓存以持续优化未来任务分配,并显著提升系统整体效率与准确性。

链接: https://arxiv.org/abs/2509.19599
作者: Danilo Trombino,Vincenzo Pecorella,Alessandro de Giulii,Davide Tresoldi
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) are increasingly tasked with solving complex, knowledge-intensive problems where effective agent orchestration is critical. Conventional orchestration methods rely on static agent descriptions, which often become outdated or incomplete. This limitation leads to inefficient task routing, particularly in dynamic environments where agent capabilities continuously evolve. We introduce Knowledge Base-Aware (KBA) Orchestration, a novel approach that augments static descriptions with dynamic, privacy-preserving relevance signals derived from each agent’s internal knowledge base (KB). In the proposed framework, when static descriptions are insufficient for a clear routing decision, the orchestrator prompts the subagents in parallel. Each agent then assesses the task’s relevance against its private KB, returning a lightweight ACK signal without exposing the underlying data. These collected signals populate a shared semantic cache, providing dynamic indicators of agent suitability for future queries. By combining this novel mechanism with static descriptions, our method achieves more accurate and adaptive task routing preserving agent autonomy and data confidentiality. Benchmarks show that our KBA Orchestration significantly outperforms static description-driven methods in routing precision and overall system efficiency, making it suitable for large-scale systems that require higher accuracy than standard description-driven routing.
zh

[AI-66] What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities

【速读】:该论文试图解决当前生成式 AI(Generative AI)模型评估中普遍存在但被忽视的可靠性问题:基准测试得分常被视为对模型能力的直接测量,但实际上它们是基于特定理论假设的推断结果。作者指出,若不明确建模能力的本质及其在测试中的表现机制,所得分数可能无法真实反映模型性能。解决方案的关键在于提出一种“评估即推断”(evaluation as inference)的原理性框架——从关于能力的先验理论出发,推导出可量化不确定性的估计方法;并通过引入考虑扰动敏感性和有限样本误差的能力推断模型,设计出能显著降低样本复杂度的自适应算法,从而提升基准测试结果的可靠性和可信度。

链接: https://arxiv.org/abs/2509.19590
作者: Nathanael Jo,Ashia Wilson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI’s capabilities. Yet growing skepticism surrounds their reliability. How can we know that a reported accuracy genuinely reflects a model’s true performance? Evaluations are often presented as simple measurements, but in reality they are inferences: to treat benchmark scores as evidence of capability is already to assume a theory of what capability is and how it manifests in a test. We make this step explicit by proposing a principled framework for evaluation as inference: begin from a theory of capability, and then derive methods for estimating it. This perspective, familiar in fields such as psychometrics, has not yet become commonplace in AI evaluation. As a proof of concept, we address a central challenge that undermines reliability: sensitivity to perturbations. After formulating a model of ability, we introduce methods that infer ability while accounting for uncertainty from sensitivity and finite samples, including an adaptive algorithm that significantly reduces sample complexity. Together, these contributions lay the groundwork for more reliable and trustworthy estimates of AI capabilities as measured through benchmarks.
zh

[AI-67] Reverse Engineering User Stories from Code using Large Language Models

【速读】:该论文旨在解决敏捷开发中用户故事(User Stories)在遗留系统或文档不完善的代码库中常缺失或过时的问题,探索大型语言模型(Large Language Models, LLMs)是否能够直接从源代码中自动恢复用户故事,并分析提示工程(prompt design)对输出质量的影响。其解决方案的关键在于:通过设计包含单个示例的提示策略(few-shot prompting),即使是最小规模的80亿参数模型(8B)也能达到与700亿参数模型(70B)相当的性能,而结构化推理(如Chain-of-Thought)仅对大模型带来边际提升,表明高效提示设计比复杂推理机制更能提升生成效果。

链接: https://arxiv.org/abs/2509.19587
作者: Mohamed Ouf,Haoyu Li,Michael Zhang,Mariam Guizani
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:User stories are essential in agile development, yet often missing or outdated in legacy and poorly documented systems. We investigate whether large language models (LLMs) can automatically recover user stories directly from source code and how prompt design impacts output quality. Using 1,750 annotated C++ snippets of varying complexity, we evaluate five state-of-the-art LLMs across six prompting strategies. Results show that all models achieve, on average, an F1 score of 0.8 for code up to 200 NLOC. Our findings show that a single illustrative example enables the smallest model (8B) to match the performance of a much larger 70B model. In contrast, structured reasoning via Chain-of-Thought offers only marginal gains, primarily for larger models.
zh

[AI-68] A Foundation Chemical Language Model for Comprehensive Frag ment-Based Drug Discovery

【速读】:该论文旨在解决小分子片段(fragment)化学空间覆盖不足与生成多样性受限的问题,以支持药物发现中更高效的新分子设计。其解决方案的关键在于构建并训练一个大规模专用基础模型FragAtlas-62M,基于ZINC-22中超过6200万个小分子片段数据集进行预训练,采用GPT-2架构(42.7M参数)实现高化学有效性(99.90%有效)的片段生成,并通过多维度验证(12个描述符和3种指纹方法)确保生成结构分布与训练数据高度一致(效应量均<0.4)。该模型在保留53.6%已知片段的同时,还能生成22%具有实际应用价值的新结构,显著拓展了可探索的片段化学空间。

链接: https://arxiv.org/abs/2509.19586
作者: Alexander Ho,Sukyeong Lee,Francis T.F. Tsai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:We introduce FragAtlas-62M, a specialized foundation model trained on the largest fragment dataset to date. Built on the complete ZINC-22 fragment subset comprising over 62 million molecules, it achieves unprecedented coverage of fragment chemical space. Our GPT-2 based model (42.7M parameters) generates 99.90% chemically valid fragments. Validation across 12 descriptors and three fingerprint methods shows generated fragments closely match the training distribution (all effect sizes 0.4). The model retains 53.6% of known ZINC fragments while producing 22% novel structures with practical relevance. We release FragAtlas-62M with training code, preprocessed data, documentation, and model weights to accelerate adoption.
zh

[AI-69] Nano Bio-Agents (NBA): Small Language Model Agents for Genomics

【速读】:该论文旨在解决生成式 AI 在基因组学问答任务中面临的幻觉(hallucination)问题以及大型语言模型(Large Language Models, LLMs)带来的高计算成本挑战。其解决方案的关键在于提出了一种名为 Nano Bio-Agent (NBA) 的代理框架,该框架通过任务分解(task decomposition)、工具编排(tool orchestration)和对 NCBI、AlphaGenome 等权威生物信息学 API 的集成,有效引导小型语言模型(Small Language Models, SLMs,参数量为 10B)完成复杂推理任务。实验证明,SLMs 结合此 agentic 框架可在 GeneTuring 基准上达到 98% 的准确率,且在多数场景下优于或相当优于使用更大模型的方法,同时显著降低计算资源消耗,展现出高效、低成本和可扩展的潜力。

链接: https://arxiv.org/abs/2509.19566
作者: George Hong,Daniel Trejo Banos
机构: 未知
类目: Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:We investigate the application of Small Language Models (10 billion parameters) for genomics question answering via agentic framework to address hallucination issues and computational cost challenges. The Nano Bio-Agent (NBA) framework we implemented incorporates task decomposition, tool orchestration, and API access into well-established systems such as NCBI and AlphaGenome. Results show that SLMs combined with such agentic framework can achieve comparable and in many cases superior performance versus existing approaches utilising larger models, with our best model-agent combination achieving 98% accuracy on the GeneTuring benchmark. Notably, small 3-10B parameter models consistently achieve 85-97% accuracy while requiring much lower computational resources than conventional approaches. This demonstrates promising potential for efficiency gains, cost savings, and democratization of ML-powered genomics tools while retaining highly robust and accurate performance.
zh

[AI-70] Learning Dynamics of Deep Learning – Force Analysis of Deep Neural Networks

【速读】:该论文旨在解决深度学习模型在训练过程中如何随时间演化的问题,特别是揭示单个训练样本对其他样本学习路径的影响机制。其解决方案的关键在于引入类比力学中“力分析”的框架,将模型训练中的影响分解为两个核心因素:样本间的相似性(similarity)与更新力的强度(updating force),从而系统性地解释模型在不同实际场景下的行为模式,如非平凡的学习路径、大语言模型(LLM)微调方法的有效性边界以及结构化模式更容易被习得的现象,并据此提出改进训练策略的新思路。

链接: https://arxiv.org/abs/2509.19554
作者: Yi Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 175 pages

点击查看摘要

Abstract:This thesis explores how deep learning models learn over time, using ideas inspired by force analysis. Specifically, we zoom in on the model’s training procedure to see how one training example affects another during learning, like analyzing how forces move objects. We break this influence into two parts: how similar the two examples are, and how strong the updating force is. This framework helps us understand a wide range of the model’s behaviors in different real systems. For example, it explains why certain examples have non-trivial learning paths, why (and why not) some LLM finetuning methods work, and why simpler, more structured patterns tend to be learned more easily. We apply this approach to various learning tasks and uncover new strategies for improving model training. While the method is still developing, it offers a new way to interpret models’ behaviors systematically.
zh

[AI-71] DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions ICML2025

【速读】:该论文旨在解决扩散模型在离线强化学习(Offline Reinforcement Learning, Offline RL)中生成轨迹时,无法直接输出动作导致与基于单步时序差分(Temporal Difference, TD)学习的价值类算法不兼容的问题。现有方法虽尝试联合建模状态、奖励和动作,但常因训练复杂度上升而性能受限。其解决方案的关键在于提出一种名为DAWM的模块化扩散世界模型:该模型以当前状态、动作和剩余回报(return-to-go)为条件生成未来状态-奖励轨迹,并辅以逆动力学模型(Inverse Dynamics Model, IDM)实现高效动作推断,从而生成适用于单步TD学习的完整合成转移样本,显著提升保守型离线RL算法(如TD3BC和IQL)的训练效率与性能。

链接: https://arxiv.org/abs/2509.19538
作者: Zongyue Li,Xiao Han,Yusong Li,Niklas Strauss,Matthias Schubert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML2025 workshop Building Physically Plausible World Models

点击查看摘要

Abstract:Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose \textbfDAWM, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.
zh

[AI-72] Semantic-Aware Fuzzing: An Empirical Framework for LLM -Guided Reasoning -Driven Input Mutation

【速读】:该论文旨在解决传统基于变异的模糊测试工具在处理物联网设备、移动平台和自主系统中的安全漏洞时,因缺乏语义理解能力而导致的覆盖率低和漏洞发现效率不足的问题。现有工具如AFL++虽能通过字节或比特级修改探索代码路径,但难以捕捉协议逻辑、字段间依赖关系及领域特定语义。为此,作者提出一种集成推理型大语言模型(Reasoning LLM)与AFL++的微服务框架,借助提示工程(prompt engineering)实现少样本(few-shot)学习,在不依赖标注数据的前提下提升变异质量。其关键创新在于将LLM嵌入到模糊测试的变异循环中,利用预训练知识生成符合输入格式和复杂约束的高质量变异,从而显著增强漏洞挖掘能力;实验表明,模型选择和提示复杂度对变异效果影响大于样本数量,其中Deepseek-r1-Distill-Llama-70B表现最优,但响应延迟和吞吐瓶颈仍是主要挑战。

链接: https://arxiv.org/abs/2509.19533
作者: Mengdi Lu,Steven Ding,Furkan Alaca,Philippe Charland
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Security vulnerabilities in Internet-of-Things devices, mobile platforms, and autonomous systems remain critical. Traditional mutation-based fuzzers – while effectively explore code paths – primarily perform byte- or bit-level edits without semantic reasoning. Coverage-guided tools such as AFL++ use dictionaries, grammars, and splicing heuristics to impose shallow structural constraints, leaving deeper protocol logic, inter-field dependencies, and domain-specific semantics unaddressed. Conversely, reasoning-capable large language models (LLMs) can leverage pretraining knowledge to understand input formats, respect complex constraints, and propose targeted mutations, much like an experienced reverse engineer or testing expert. However, lacking ground truth for “correct” mutation reasoning makes supervised fine-tuning impractical, motivating explorations of off-the-shelf LLMs via prompt-based few-shot learning. To bridge this gap, we present an open-source microservices framework that integrates reasoning LLMs with AFL++ on Google’s FuzzBench, tackling asynchronous execution and divergent hardware demands (GPU- vs. CPU-intensive) of LLMs and fuzzers. We evaluate four research questions: (R1) How can reasoning LLMs be integrated into the fuzzing mutation loop? (R2) Do few-shot prompts yield higher-quality mutations than zero-shot? (R3) Can prompt engineering with off-the-shelf models improve fuzzing directly? and (R4) Which open-source reasoning LLMs perform best under prompt-only conditions? Experiments with Llama3.3, Deepseek-r1-Distill-Llama-70B, QwQ-32B, and Gemma3 highlight Deepseek as the most promising. Mutation effectiveness depends more on prompt complexity and model choice than shot count. Response latency and throughput bottlenecks remain key obstacles, offering directions for future work.
zh

[AI-73] Score the Steps Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation

【速读】:该论文旨在解决机器人学习领域中评估策略性能时存在的信息不足问题,即当前普遍采用的单一二元成功率(Success Rate, SR)无法揭示策略在多步骤操作任务中各子目标(subgoal)层面的表现,从而掩盖了部分能力(如仅能完成抓取但无法倒出液体)。解决方案的关键在于提出StepEval框架,其核心是通过视觉语言模型(Vision-Language Models, VLMs)作为自动化裁判,从记录的图像或视频中判断每个轨迹的子目标级成功情况,并生成每步的成功率向量(per-subgoal SR vector),实现对策略局部能力的显式量化。该框架设计为成本感知、轻量且模型无关,支持单视角或多视角输入,旨在推动“评分步骤而非仅最终目标”成为可复现的标准实践。

链接: https://arxiv.org/abs/2509.19524
作者: Ramy ElMallah,Krish Chhajer,Chi-Guhn Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to the CoRL 2025 EvalDeploy Workshop

点击查看摘要

Abstract:Robot learning papers typically report a single binary success rate (SR), which obscures where a policy succeeds or fails along a multi-step manipulation task. We argue that subgoal-level reporting should become routine: for each trajectory, a vector of per-subgoal SRs that makes partial competence visible (e.g., grasp vs. pour). We propose a blueprint for StepEval, a cost-aware plug-in evaluation framework that utilizes vision-language models (VLMs) as automated judges of subgoal outcomes from recorded images or videos. Rather than proposing new benchmarks or APIs, our contribution is to outline design principles for a scalable, community-driven open-source project. In StepEval, the primary artifact for policy evaluation is the per-subgoal SR vector; however, other quantities (e.g., latency or cost estimates) are also considered for framework-optimization diagnostics to help the community tune evaluation efficiency and accuracy when ground-truth subgoal success labels are available. We discuss how such a framework can remain model-agnostic, support single- or multi-view inputs, and be lightweight enough to adopt across labs. The intended contribution is a shared direction: a minimal, extensible seed that invites open-source contributions, so that scoring the steps, not just the final goal, becomes a standard and reproducible practice.
zh

[AI-74] A Longitudinal Randomized Control Study of Companion Chatbot Use: Anthropomorphism and Its Mediating Role on Social Impacts

【速读】:该论文试图解决的问题是:与陪伴型聊天机器人(companion chatbot)的互动是否会对人类的社会健康和人际关系产生负面影响,即这些社交AI代理是否会替代或损害人类之间的社会联系。解决方案的关键在于通过一项纵向实验设计(N = 183),将参与者随机分配至与聊天机器人对话组或文本游戏控制组,持续21天,并结合多轮问卷调查与录音访谈收集数据,发现尽管整体上人机互动未显著影响社会健康,但个体对社交连接的需求(social need)和对AI的拟人化程度(anthropomorphism)在其中起中介作用——高社交需求者更易将聊天机器人拟人化,而这种拟人化进一步增强了人机交互对其人际互动的影响,揭示了心理机制而非单纯行为接触才是关键变量。

链接: https://arxiv.org/abs/2509.19515
作者: Rose E. Guingrich,Michael S. A. Graziano
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Relationships with social artificial intelligence (AI) agents are on the rise. People report forming friendships, mentorships, and romantic partnerships with chatbots such as Replika, a type of social AI agent that is designed specifically for companionship. Concerns that companion chatbot relationships may harm or replace human ones have been raised, but whether and how these social consequences occur remains unclear. Prior research suggests that people’s states of social need and their anthropomorphism of the AI agent may play a role in how human-AI interaction impacts human-human interaction. In this longitudinal study (N = 183), participants were randomly assigned to converse with a companion chatbot over text or to play text-based word games for 10 minutes a day for 21 consecutive days. During these 21 days, participants also completed four surveys and two audio-recorded interviews. We found that people’s social health and relationships were not significantly impacted by interacting with a companion chatbot across 21 days compared to the control group. However, people who had a higher desire to socially connect anthropomorphized the chatbot more. Those who anthropomorphized the chatbot more indicated that the human-chatbot interaction had greater impacts on their social interactions and relationships with family and friends. A mediation analysis suggested that the impact of human-AI interaction on human-human social outcomes was mediated by the extent to which people anthropomorphized the AI agent, which itself was related to the desire to socially connect.
zh

[AI-75] he Heterogeneous Multi-Agent Challenge ECAI2025

【速读】:该论文试图解决当前协作式异构多智能体强化学习(Heterogeneous Multi-Agent Reinforcement Learning, HeMARL)领域缺乏标准化测试平台的问题。现有研究多集中于同质智能体场景,而现实世界中大量任务涉及具有不同传感器、资源或能力的异构智能体,亟需更贴近实际的评估环境。解决方案的关键在于构建一个标准化的测试基准,以支持对HeMARL算法的系统性评估与比较,从而推动该领域的进展并避免因环境设计不统一导致的研究成果不可比性。

链接: https://arxiv.org/abs/2509.19512
作者: Charles Dansereau,Junior-Samuel Lopez-Yepez,Karthik Soma,Antoine Fagette
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 7 pages. To Appear at ECAI 2025

点击查看摘要

Abstract:Multi-Agent Reinforcement Learning (MARL) is a growing research area which gained significant traction in recent years, extending Deep RL applications to a much wider range of problems. A particularly challenging class of problems in this domain is Heterogeneous Multi-Agent Reinforcement Learning (HeMARL), where agents with different sensors, resources, or capabilities must cooperate based on local information. The large number of real-world situations involving heterogeneous agents makes it an attractive research area, yet underexplored, as most MARL research focuses on homogeneous agents (e.g., a swarm of identical robots). In MARL and single-agent RL, standardized environments such as ALE and SMAC have allowed to establish recognized benchmarks to measure progress. However, there is a clear lack of such standardized testbed for cooperative HeMARL. As a result, new research in this field often uses simple environments, where most algorithms perform near optimally, or uses weakly heterogeneous MARL environments.
zh

[AI-76] AIRwaves at CheckThat! 2025: Retrieving Scientific Sources for Implicit Claims on Social Media with Dual Encoders and Neural Re-Ranking

【速读】:该论文旨在解决社交媒体中隐含科学主张(implicit scientific claims)与其原始文献之间关联性匹配的问题,这对于基于证据的事实核查和学术讨论至关重要。该问题因词汇稀疏性、极短查询文本及领域特定语言而更具挑战性。解决方案的关键在于提出一个两阶段检索流水线:第一阶段采用基于E5-large的双编码器模型,通过批次内负样本与挖掘的困难负样本进行微调,并结合分块标记化和丰富的文档元数据增强表示能力;第二阶段使用SciBERT交叉编码器对候选文档进行神经重排序。此方法将MRR@5从BM25基线的0.5025提升至0.6828,表明密集检索与神经重排序相结合是高效实现推文到研究论文匹配的有效策略。

链接: https://arxiv.org/abs/2509.19509
作者: Cem Ashbaugh,Leon Baumgärtner,Tim Gress,Nikita Sidorov,Daniel Werner
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CLEF 2025 (Conference and Labs of the Evaluation Forum)

点击查看摘要

Abstract:Linking implicit scientific claims made on social media to their original publications is crucial for evidence-based fact-checking and scholarly discourse, yet it is hindered by lexical sparsity, very short queries, and domain-specific language. Team AIRwaves ranked second in Subtask 4b of the CLEF-2025 CheckThat! Lab with an evidence-retrieval approach that markedly outperforms the competition baseline. The optimized sparse-retrieval baseline(BM25) achieves MRR@5 = 0.5025 on the gold label blind test set. To surpass this baseline, a two-stage retrieval pipeline is introduced: (i) a first stage that uses a dual encoder based on E5-large, fine-tuned using in-batch and mined hard negatives and enhanced through chunked tokenization and rich document metadata; and (ii) a neural re-ranking stage using a SciBERT cross-encoder. Replacing purely lexical matching with neural representations lifts performance to MRR@5 = 0.6174, and the complete pipeline further improves to MRR@5 = 0.6828. The findings demonstrate that coupling dense retrieval with neural re-rankers delivers a powerful and efficient solution for tweet-to-study matching and provides a practical blueprint for future evidence-retrieval pipelines.
zh

[AI-77] Generative AI as a catalyst for democratic Innovation: Enhancing citizen engagement in participatory budgeting ICIP

【速读】:该论文试图解决公民参与度下降与社会分化加剧背景下,如何通过技术手段提升参与式预算(Participatory Budgeting)中公民 Engagement 的问题。其解决方案的关键在于将生成式 AI (Generative AI) 整合进公共咨询平台,以优化公民提案的形成过程,并促进公民与政府之间的有效对话,从而重塑参与式制度,实现更具包容性和民主性的治理结构。

链接: https://arxiv.org/abs/2509.19497
作者: Italo Alberto do Nascimento Sousa,Jorge Machado,Jose Carlos Vaz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, VI International Meeting on Participation, Democracy and Public Policies

点击查看摘要

Abstract:This research examines the role of Generative Artificial Intelligence (AI) in enhancing citizen engagement in participatory budgeting. In response to challenges like declining civic participation and increased societal polarization, the study explores how online political participation can strengthen democracy and promote social equity. By integrating Generative AI into public consultation platforms, the research aims to improve citizen proposal formulation and foster effective dialogue between citizens and government. It assesses the capacities governments need to implement AI-enhanced participatory tools, considering technological dependencies and vulnerabilities. Analyzing technological structures, actors, interests, and strategies, the study contributes to understanding how technological advancements can reshape participatory institutions to better facilitate citizen involvement. Ultimately, the research highlights how Generative AI can transform participatory institutions, promoting inclusive, democratic engagement and empowering citizens.
zh

[AI-78] ArtiFree: Detecting and Reducing Generative Artifacts in Diffusion-based Speech Enhancement

【速读】:该论文旨在解决基于扩散模型的语音增强(Diffusion-based Speech Enhancement, DSE)中存在的生成伪影(generative artifacts)和高推理延迟问题。其解决方案的关键在于:首先,利用语音嵌入的方差预测推理过程中的音素错误;进而提出一种基于语义一致性的集成推理方法,通过多次扩散运行间的语义一致性引导,显著降低低信噪比(low-SNR)条件下的词错误率(WER)达15%,从而提升音素准确性和语义合理性;最后,通过分析扩散步数的影响,引入自适应扩散步数策略,在伪影抑制与推理延迟之间实现平衡。研究强调语义先验作为引导生成式语音增强向无伪影输出演进的强大工具。

链接: https://arxiv.org/abs/2509.19495
作者: Bhawana Chhaglani,Yang Gao,Julius Richter,Xilin Li,Syavosh Zadissa,Tarun Pruthi,Andrew Lovitt
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based speech enhancement (SE) achieves natural-sounding speech and strong generalization, yet suffers from key limitations like generative artifacts and high inference latency. In this work, we systematically study artifact prediction and reduction in diffusion-based SE. We show that variance in speech embeddings can be used to predict phonetic errors during inference. Building on these findings, we propose an ensemble inference method guided by semantic consistency across multiple diffusion runs. This technique reduces WER by 15% in low-SNR conditions, effectively improving phonetic accuracy and semantic plausibility. Finally, we analyze the effect of the number of diffusion steps, showing that adaptive diffusion steps balance artifact suppression and latency. Our findings highlight semantic priors as a powerful tool to guide generative SE toward artifact-free outputs.
zh

[AI-79] Estimating the Self-Consistency of LLM s

【速读】:该论文旨在解决如何在固定计算预算 $ B = mn $ 下,通过重复调用大语言模型(Large Language Models, LLMs)并聚合响应来提升系统输出可靠性的优化问题。其核心解决方案在于分析一种用于估计LLM自一致性(self-consistency)的统计量,并揭示在预算约束下采样提示数 $ m $ 与每条提示重复调用次数 $ n $ 之间的权衡关系,最终得出最优分配策略为 $ m, n \propto \sqrt{B} $,即二者应大致按平方根比例分配,以实现最佳可靠性提升效果。

链接: https://arxiv.org/abs/2509.19489
作者: Robert Nowak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages

点击查看摘要

Abstract:Systems often repeat the same prompt to large language models (LLMs) and aggregate responses to improve reliability. This short note analyzes an estimator of the self-consistency of LLMs and the tradeoffs it induces under a fixed compute budget B=mn , where m is the number of prompts sampled from the task distribution and n is the number of repeated LLM calls per prompt; the resulting analysis favors a rough split m,n\propto\sqrtB .
zh

[AI-80] Identifying and Addressing User-level Security Concerns in Smart Homes Using “Smaller” LLM s

【速读】:该论文旨在解决智能家庭物联网(IoT)设备用户面临的安全风险问题,特别是用户在获取安全知识时因信息来源复杂、专业性强而难以有效应对实际安全挑战的困境。其关键解决方案是构建一个基于公共论坛问答(QA)的新型数据集,并利用潜在狄利克雷分配(Latent Dirichlet Allocation, LDA)提取主要安全关切;随后在该数据集上微调较小规模的Transformer模型(如T5和Flan-T5),以开发适用于资源受限或隐私敏感环境的轻量级智能问答系统。相比大模型(如GPT和Gemini),该方案在保证准确性和相关性的同时,具备更好的部署可行性与隐私保护能力。

链接: https://arxiv.org/abs/2509.19485
作者: Hafijul Hoque Chowdhury,Riad Ahmed Anonto,Sourov Jajodia,Suryadipta Majumdar,Md. Shohrab Hossain
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, accepted at PST 2025

点击查看摘要

Abstract:With the rapid growth of smart home IoT devices, users are increasingly exposed to various security risks, as evident from recent studies. While seeking answers to know more on those security concerns, users are mostly left with their own discretion while going through various sources, such as online blogs and technical manuals, which may render higher complexity to regular users trying to extract the necessary information. This requirement does not go along with the common mindsets of smart home users and hence threatens the security of smart homes furthermore. In this paper, we aim to identify and address the major user-level security concerns in smart homes. Specifically, we develop a novel dataset of QA from public forums, capturing practical security challenges faced by smart home users. We extract major security concerns in smart homes from our dataset by leveraging the Latent Dirichlet Allocation (LDA). We fine-tune relatively “smaller” transformer models, such as T5 and Flan-T5, on this dataset to build a QA system tailored for smart home security. Unlike larger models like GPT and Gemini, which are powerful but often resource hungry and require data sharing, smaller models are more feasible for deployment in resource-constrained or privacy-sensitive environments like smart homes. The dataset is manually curated and supplemented with synthetic data to explore its potential impact on model performance. This approach significantly improves the system’s ability to deliver accurate and relevant answers, helping users address common security concerns with smart home IoT devices. Our experiments on real-world user concerns show that our work improves the performance of the base models.
zh

[AI-81] A Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models NEURIPS2025

【速读】:该论文旨在解决当前跨频域迁移学习(Cross-frequency Transfer Learning, CFTL)在评估基础预测模型(Foundation Forecasting Models, FFMs)性能时存在的系统性偏差问题,包括小规模测试数据集、样本量对统计指标计算的影响不足、统计模型选择不当以及预训练与测试数据集间潜在的重叠风险。其解决方案的关键在于:首先,统一重构广泛采用的神经预测网络并适配CFTL设置;其次,仅使用专有数据和合成数据进行预训练,并严格防止测试泄漏;最后,在15个大型且多样化的公共预测竞赛数据集上进行全面评估。实证结果表明,传统统计模型及其集成方法在sCRPS和MASE指标上显著优于现有FFMs,同时合成数据预训练可使FFM精度提升7%。

链接: https://arxiv.org/abs/2509.19465
作者: Kin G. Olivares,Malcolm Wolff,Tatiana Konstantinova,Shankar Ramasubramanian,Andrew Gordon Wilson,Andres Potapczynski,Willa Potosnak,Mengfei Cao,Boris Oreshkin,Dmitry Efimov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: Thirty-Ninth Annual Conference on Neural Information Processing Systems {NeurIPS 2025}. Recent Advances in Time Series Foundation Models Have We Reached the ‘BERT Moment’?

点击查看摘要

Abstract:Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on small-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failing to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely-adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models’ accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS, and by more than 20% MASE, across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of a FFM by 7% percent.
zh

[AI-82] Evaluation-Aware Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中策略评估的高方差与高偏差问题,这些问题通常源于数据有限、任务周期长或环境模型不准确等因素。传统RL方法在训练策略时未显式考虑评估的可靠性,导致部署前难以获得准确的性能估计。其解决方案的关键在于提出一种“评估感知的强化学习”(Evaluation-aware Reinforcement Learning, EvA-RL)框架:在训练策略以最大化期望回报的同时,显式最小化给定价值预测方案下的期望评估误差——即让策略“易于评估”。该方法通过联合优化策略与评估误差,在少量轨迹采样条件下实现更可靠的策略评估;进一步地,为缓解评估精度与策略性能之间的权衡,引入状态值预测器的协同学习机制,从而在多种离散与连续动作域中显著降低评估误差并保持竞争性回报水平。

链接: https://arxiv.org/abs/2509.19464
作者: Shripad Vilasrao Deshmukh,Will Schwarzer,Scott Niekum
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, under submission

点击查看摘要

Abstract:Policy evaluation is often a prerequisite for deploying safety- and performance-critical systems. Existing evaluation approaches frequently suffer from high variance due to limited data and long-horizon tasks, or high bias due to unequal support or inaccurate environmental models. We posit that these challenges arise, in part, from the standard reinforcement learning (RL) paradigm of policy learning without explicit consideration of evaluation. As an alternative, we propose evaluation-aware reinforcement learning (EvA-RL), in which a policy is trained to maximize expected return while simultaneously minimizing expected evaluation error under a given value prediction scheme – in other words, being “easy” to evaluate. We formalize a framework for EvA-RL and design an instantiation that enables accurate policy evaluation, conditioned on a small number of rollouts in an assessment environment that can be different than the deployment environment. However, our theoretical analysis and empirical results show that there is often a tradeoff between evaluation accuracy and policy performance when using a fixed value-prediction scheme within EvA-RL. To mitigate this tradeoff, we extend our approach to co-learn an assessment-conditioned state-value predictor alongside the policy. Empirical results across diverse discrete and continuous action domains demonstrate that EvA-RL can substantially reduce evaluation error while maintaining competitive returns. This work lays the foundation for a broad new class of RL methods that treat reliable evaluation as a first-class principle during training.
zh

[AI-83] Self-evolved Imitation Learning in Simulated World

【速读】:该论文旨在解决多任务通用智能体(generalist agent)在少样本模仿学习(few-shot imitation learning)场景下,因专家示范数据稀缺而导致性能受限的问题。其核心挑战在于如何在有限监督条件下提升模型的泛化能力与训练效率。解决方案的关键在于提出Self-Evolved Imitation Learning (SEIL) 框架,通过模拟器交互自动生成高质量的新示范轨迹,并结合双层增强策略(模型级采用指数移动平均EMA模型协作、环境级引入初始物体位置微调)以提升示范多样性;同时设计轻量级选择器过滤出互补且信息丰富的轨迹,从而显著减少对人工标注数据的依赖并实现优于现有方法的性能表现。

链接: https://arxiv.org/abs/2509.19460
作者: Yifan Ye,Jun Cen,Jing Chen,Zhihe Lu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Imitation learning has been a trend recently, yet training a generalist agent across multiple tasks still requires large-scale expert demonstrations, which are costly and labor-intensive to collect. To address the challenge of limited supervision, we propose Self-Evolved Imitation Learning (SEIL), a framework that progressively improves a few-shot model through simulator interactions. The model first attempts tasksin the simulator, from which successful trajectories are collected as new demonstrations for iterative refinement. To enhance the diversity of these demonstrations, SEIL employs dual-level augmentation: (i) Model-level, using an Exponential Moving Average (EMA) model to collaborate with the primary model, and (ii) Environment-level, introducing slight variations in initial object positions. We further introduce a lightweight selector that filters complementary and informative trajectories from the generated pool to ensure demonstration quality. These curated samples enable the model to achieve competitive performance with far fewer training examples. Extensive experiments on the LIBERO benchmark show that SEIL achieves a new state-of-the-art performance in few-shot imitation learning scenarios. Code is available at this https URL.
zh

[AI-84] he Indispensable Role of User Simulation in the Pursuit of AGI

【速读】:该论文旨在解决人工通用智能(AGI)发展过程中面临的两大瓶颈问题:一是缺乏对复杂交互系统的严谨评估手段,二是训练具备适应能力的智能体所需的大规模交互数据难以获取。其核心解决方案在于引入用户模拟(user simulation)技术,即构建能够模拟人类与AI系统交互的计算代理。研究表明,真实可靠的用户模拟器可提供可扩展的评估环境、支持交互式学习的数据生成,并促进AGI所依赖的适应性能力发展,因此被视为加速AGI研发的关键催化剂。

链接: https://arxiv.org/abs/2509.19456
作者: Krisztian Balog,ChengXiang Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication in Communications of the ACM

点击查看摘要

Abstract:Progress toward Artificial General Intelligence (AGI) faces significant bottlenecks, particularly in rigorously evaluating complex interactive systems and acquiring the vast interaction data needed for training adaptive agents. This paper posits that user simulation – creating computational agents that mimic human interaction with AI systems – is not merely a useful tool, but is a critical catalyst required to overcome these bottlenecks and accelerate AGI development. We argue that realistic simulators provide the necessary environments for scalable evaluation, data generation for interactive learning, and fostering the adaptive capabilities central to AGI. Therefore, research into user simulation technology and intelligent task agents are deeply synergistic and must advance hand-in-hand. This article elaborates on the critical role of user simulation for AGI, explores the interdisciplinary nature of building realistic simulators, identifies key challenges including those posed by large language models, and proposes a future research agenda.
zh

[AI-85] Probabilistic Runtime Verification Evaluation and Risk Assessment of Visual Deep Learning Systems

【速读】:该论文旨在解决深度神经网络在真实场景部署中性能下降的问题,其根本原因在于模型对输入数据的分布性偏移(distributional shifts)高度敏感,而现有评估方法通常忽略此类偏移,导致性能指标被高估。解决方案的关键在于提出一种新的验证与风险评估框架:通过估计分布性偏移的发生概率(来自异常检测器输出),并结合网络预测正确的条件概率,构建二叉树结构进行推理,从而实现对模型准确率的可信且精确的估算。该方法在多个数据集上验证有效,误差范围为0.01–0.1,并进一步应用于医学分割任务的风险量化,通过节点成本建模支持成本效益分析,显著提升了深度学习系统在安全关键场景中的可靠性与可解释性。

链接: https://arxiv.org/abs/2509.19419
作者: Birk Torpmann-Hagen,Pål Halvorsen,Michael A. Riegler,Dag Johansen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite achieving excellent performance on benchmarks, deep neural networks often underperform in real-world deployment due to sensitivity to minor, often imperceptible shifts in input data, known as distributional shifts. These shifts are common in practical scenarios but are rarely accounted for during evaluation, leading to inflated performance metrics. To address this gap, we propose a novel methodology for the verification, evaluation, and risk assessment of deep learning systems. Our approach explicitly models the incidence of distributional shifts at runtime by estimating their probability from outputs of out-of-distribution detectors. We combine these estimates with conditional probabilities of network correctness, structuring them in a binary tree. By traversing this tree, we can compute credible and precise estimates of network accuracy. We assess our approach on five different datasets, with which we simulate deployment conditions characterized by differing frequencies of distributional shift. Our approach consistently outperforms conventional evaluation, with accuracy estimation errors typically ranging between 0.01 and 0.1. We further showcase the potential of our approach on a medical segmentation benchmark, wherein we apply our methods towards risk assessment by associating costs with tree nodes, informing cost-benefit analyses and value-judgments. Ultimately, our approach offers a robust framework for improving the reliability and trustworthiness of deep learning systems, particularly in safety-critical applications, by providing more accurate performance estimates and actionable risk assessments.
zh

[AI-86] EngravingGNN: A Hybrid Graph Neural Network for End-to-End Piano Score Engraving

【速读】:该论文旨在解决自动乐谱排版(automatic music engraving)问题,即从符号化音乐数据中自动生成符合人类阅读习惯的乐谱,这是涉及人机交互的音乐应用中的关键步骤,但在符号化音乐处理领域仍属未充分探索的方向。解决方案的关键在于将该问题形式化为一系列相互依赖的子任务,并提出一个统一的图神经网络(Graph Neural Network, GNN)框架,通过多任务学习联合预测音轨连接、谱表分配、音高拼写、调号、符干方向、八度移位及谱号等要素;模型采用共享GNN编码器与轻量级任务特定解码器的设计,在钢琴音乐且量化符号输入的前提下实现了端到端的高效建模,最终输出可直接打印的MusicXML/MEI格式文件,实验证明其在多个钢琴语料上的综合性能优于仅针对单一子任务优化的现有方法。

链接: https://arxiv.org/abs/2509.19412
作者: Emmanouil Karystinaios,Francesco Foscarin,Gerhard Widmer
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: Accepted at the International Conference on Technologies for Music Notation and Representation (TENOR) 2025

点击查看摘要

Abstract:This paper focuses on automatic music engraving, i.e., the creation of a humanly-readable musical score from musical content. This step is fundamental for all applications that include a human player, but it remains a mostly unexplored topic in symbolic music processing. In this work, we formalize the problem as a collection of interdependent subtasks, and propose a unified graph neural network (GNN) framework that targets the case of piano music and quantized symbolic input. Our method employs a multi-task GNN to jointly predict voice connections, staff assignments, pitch spelling, key signature, stem direction, octave shifts, and clef signs. A dedicated postprocessing pipeline generates print-ready MusicXML/MEI outputs. Comprehensive evaluation on two diverse piano corpora (J-Pop and DCML Romantic) demonstrates that our unified model achieves good accuracy across all subtasks, compared to existing systems that only specialize in specific subtasks. These results indicate that a shared GNN encoder with lightweight task-specific decoders in a multi-task setting offers a scalable and effective solution for automatic music engraving.
zh

[AI-87] meMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding

【速读】:该论文旨在解决多变量时间序列预测(Multivariate Time Series Forecasting, MTSF)中现有基于patch的方法因采用固定长度分割而忽略局部时序动态异质性及预测解码异质性的问题,导致信息密集区域细节丢失、稳定段冗余引入,并难以捕捉短中期与长期预测的复杂差异。其解决方案的关键在于提出TimeMosaic框架:一是通过自适应patch嵌入(adaptive patch embedding),依据局部信息密度动态调整粒度,在保持时序连续性的前提下实现模式复用与结构清晰性的平衡;二是引入分段解码机制(segment-wise decoding),将每个预测时 horizon 视为相关子任务,按不同预测难度和信息需求自适应调整解码策略,而非使用单一统一解码器。

链接: https://arxiv.org/abs/2509.19406
作者: Kuiye Ding,Fanda Fan,Chunyi Hou,Zheya Wang,Lei Wang,Zhengxin Yang,Jianfeng Zhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.
zh

[AI-88] Improving Outdoor Multi-cell Fingerprinting-based Positioning via Mobile Data Augmentation

【速读】:该论文旨在解决蜂窝网络中室外定位精度受限的问题,主要瓶颈在于测量数据稀疏且异构,以及全面站点勘测成本高昂。解决方案的关键在于提出一种轻量级、模块化的移动数据增强框架,用于提升基于多小区指纹识别的定位性能。该方法通过解耦空间特征与无线特征的合成:利用核密度估计(Kernel Density Estimation, KDE)建模实际空间分布以生成地理上一致的合成位置;同时采用K近邻(K-Nearest Neighbor, KNN)算法生成每小区的增强无线指纹。整个架构无需训练、可解释性强,适用于运营商分布式或本地部署,并支持隐私保护工作流,从而有效利用现有运营商收集的最小化路测(Minimization of Drive Test, MDT)数据提升定位精度。

链接: https://arxiv.org/abs/2509.19405
作者: Tony Chahoud,Lorenzo Mario Amorosa,Riccardo Marini,Luca De Nardis
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate outdoor positioning in cellular networks is hindered by sparse, heterogeneous measurement collections and the high cost of exhaustive site surveys. This paper introduces a lightweight, modular mobile data augmentation framework designed to enhance multi-cell fingerprinting-based positioning using operator-collected minimization of drive test (MDT) records. The proposed approach decouples spatial and radio-feature synthesis: kernel density estimation (KDE) models the empirical spatial distribution to generate geographically coherent synthetic locations, while a k-nearest-neighbor (KNN)-based block produces augmented per-cell radio fingerprints. The architecture is intentionally training-free, interpretable, and suitable for distributed or on-premise operator deployments, supporting privacy-aware workflows. We both validate each augmentation module independently and assess its end-to-end impact on fingerprinting-based positioning using a real-world MDT dataset provided by an Italian mobile network operator across diverse urban and peri-urban scenarios. Results show that the proposed KDE-KNN augmentation consistently improves positioning performance, with the largest benefits in sparsely sampled or structurally complex regions; we also observe region-dependent saturation effects as augmentation increases. The framework offers a practical, low-complexity path to enhance operator positioning services using existing mobile data traces.
zh

[AI-89] FedOC: Multi-Server FL with Overlapping Client Relays in Wireless Edge Networks

【速读】:该论文旨在解决多服务器联邦学习(Multi-server Federated Learning, Multi-server FL)中因通信瓶颈导致的训练效率低下问题,特别是在边缘服务器(Edge Server, ES)覆盖区域存在重叠的情况下,如何有效利用重叠客户端(Overlapping Clients)来提升模型聚合与传播效率。解决方案的关键在于提出FedOC(Federated learning with Overlapping Clients)框架,其中重叠客户端可扮演双重角色:一是作为中继重叠客户端(Relay Overlapping Clients, ROCs),实时在相邻ES之间转发边缘模型以实现去中心化的模型共享;二是作为普通重叠客户端(Normal Overlapping Clients, NOCs),根据接收到的边缘模型时间动态选择初始训练模型,从而实现跨区域的数据间接融合。这一机制显著加快了模型扩散速度,特别适用于对延迟敏感的边缘计算环境。

链接: https://arxiv.org/abs/2509.19398
作者: Yun Ji,Zeyu Chen,Xiaoxiong Zhong,Yanan Ma,Sheng Zhang,Yuguang Fang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-server Federated Learning (FL) has emerged as a promising solution to mitigate communication bottlenecks of single-server FL. We focus on a typical multi-server FL architecture, where the regions covered by different edge servers (ESs) may overlap. A key observation of this architecture is that clients located in the overlapping areas can access edge models from multiple ESs. Building on this insight, we propose FedOC (Federated learning with Overlapping Clients), a novel framework designed to fully exploit the potential of these overlapping clients. In FedOC, overlapping clients could serve dual roles: (1) as Relay Overlapping Clients (ROCs), they forward edge models between neighboring ESs in real time to facilitate model sharing among different ESs; and (2) as Normal Overlapping Clients (NOCs), they dynamically select their initial model for local training based on the edge model delivery time, which enables indirect data fusion among different regions of ESs. The overall FedOC workflow proceeds as follows: in every round, each client trains local model based on the earliest received edge model and transmits to the respective ESs for model aggregation. Then each ES transmits the aggregated edge model to neighboring ESs through ROC relaying. Upon receiving the relayed models, each ES performs a second aggregation and subsequently broadcasts the updated model to covered clients. The existence of ROCs enables the model of each ES to be disseminated to the other ESs in a decentralized manner, which indirectly achieves intercell model and speeding up the training process, making it well-suited for latency-sensitive edge environments. Extensive experimental results show remarkable performance gains of our scheme compared to existing methods.
zh

[AI-90] OmniFed: A Modular Framework for Configurable Federated Learning from Edge to HPC

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在边缘计算和高性能计算(High Performance Computing, HPC)场景中因数据分散、隐私敏感以及系统异构性带来的部署复杂性问题。其核心解决方案是提出OmniFed框架,通过解耦配置、编排、通信与训练逻辑,实现模块化设计与清晰的关注点分离;关键创新在于支持配置驱动的原型开发、代码级按需覆盖定制、多种拓扑结构与混合通信协议共存,并提供可插拔的隐私保护机制(如差分隐私 Differential Privacy、同态加密 Homomorphic Encryption 和安全聚合 Secure Aggregation)及压缩策略,所有功能均通过明确定义的扩展点暴露,确保用户自定义能力与核心系统完整性并存,从而显著简化跨异构环境的联邦学习部署流程。

链接: https://arxiv.org/abs/2509.19396
作者: Sahil Tyagi,Andrei Cozma,Olivera Kotevska,Feiyi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Federated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL deployment across heterogeneous environments. Github repository is available at this https URL.
zh

[AI-91] nsLoRA: Tensor Alternatives for Low-Rank Adaptation ICASSP2026

【速读】:该论文旨在解决当前低秩适配(Low-Rank Adaptation, LoRA)方法在Transformer模型中对注意力投影(Query、Key和Value)各层独立进行参数更新时,缺乏高效整合与协同优化的问题。现有方法虽能有效压缩模型参数,但未能充分利用多模态或跨层结构的潜在相关性,限制了适应效率与性能上限。其解决方案的关键在于提出TensLoRA框架,通过将LoRA更新聚合为高阶张量(higher-order tensors),构建一个统一的张量化低秩适配范式,从而支持模式特定的压缩率调控,并在视觉与语言基准测试中验证了该张量构造方式对性能的显著影响,甚至在相似参数预算下优于标准LoRA。

链接: https://arxiv.org/abs/2509.19391
作者: Axel Marmoret,Reda Bensaid,Jonathan Lys,Vincent Gripon,François Leduc-Primeau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted at ICASSP 2026. 5 pages, 1 figure, 2 tables. Code can be found at this https URL

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are considered independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes better than standard LoRA under similar parameter counts.
zh

[AI-92] Learning from Observation: A Survey of Recent Advances

【速读】:该论文旨在解决传统模仿学习(Imitation Learning, IL)方法对专家动作信息依赖过强的问题,这在现实场景中往往难以获取。其核心挑战在于如何仅通过专家的状态轨迹(state visitation information)来训练代理(agent),即实现仅观察(Learning from Observation, LfO)或状态仅模仿学习(State-Only Imitation Learning, SOIL)。解决方案的关键在于提出一个系统性的框架,用于分类和分析现有LfO方法,并基于该框架梳理其在轨迹构建、假设条件及算法设计上的差异,从而揭示与离线强化学习(offline RL)、基于模型的强化学习(model-based RL)和分层强化学习(hierarchical RL)等领域的关联,最终识别出当前研究中的开放问题并指明未来方向。

链接: https://arxiv.org/abs/2509.19379
作者: Returaj Burnwal,Hriday Mehta,Nirav Pravinbhai Bhatt,Balaraman Ravindran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Imitation Learning (IL) algorithms offer an efficient way to train an agent by mimicking an expert’s behavior without requiring a reward function. IL algorithms often necessitate access to state and action information from expert demonstrations. Although expert actions can provide detailed guidance, requiring such action information may prove impractical for real-world applications where expert actions are difficult to obtain. To address this limitation, the concept of learning from observation (LfO) or state-only imitation learning (SOIL) has recently gained attention, wherein the imitator only has access to expert state visitation information. In this paper, we present a framework for LfO and use it to survey and classify existing LfO methods in terms of their trajectory construction, assumptions and algorithm’s design choices. This survey also draws connections between several related fields like offline RL, model-based RL and hierarchical RL. Finally, we use our framework to identify open problems and suggest future research directions.
zh

[AI-93] Solving Freshness in RAG : A Simple Recency Prior and the Limits of Heuristic Trend Detection

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因时间因素导致的时效性失效问题,特别是在网络安全数据场景下。其解决方案的关键在于引入两种不同的时间感知策略:一是采用简单的“近期优先”(recency prior)机制,在新鲜度任务上实现了1.00的准确率,表明基于时间戳的简单先验可有效提升时效性;二是尝试使用聚类启发式方法追踪主题演化,但效果不佳(F1-score仅为0.08),说明单纯依赖启发式规则难以捕捉复杂趋势,需更 sophisticated 的时间建模方法。

链接: https://arxiv.org/abs/2509.19376
作者: Matthew Grofsky
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We address temporal failures in RAG systems using two methods on cybersecurity data. A simple recency prior achieved an accuracy of 1.00 on freshness tasks. In contrast, a clustering heuristic for topic evolution failed (0.08 F1-score), showing trend detection requires methods beyond simple heuristics.
zh

[AI-94] Uncertainty Quantification of Large Language Models using Approximate Bayesian Computation

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在高风险和安全关键领域(如临床诊断)中缺乏不确定性表达能力的问题,这类模型通常产生过度自信且校准不良的预测概率。解决方案的关键在于引入一种无需显式似然函数的贝叶斯推断方法——近似贝叶斯计算(Approximate Bayesian Computation, ABC),将LLM视为随机模拟器,从而推断出预测概率的后验分布,实现更准确、更校准良好的不确定性估计。

链接: https://arxiv.org/abs/2509.19375
作者: Mridul Sharma(1),Adeetya Patel(1),Zaneta D’ Souza(1),Samira Abbasgholizadeh Rahimi(1 and 3),Siva Reddy(2 and 3),Sreenath Madathil(1) ((1) Faculty of Dental Medicine and Oral Health Sciences, McGill University, Montreal, Canada (2) School of Computer Science, McGill University, Montreal, Canada (3) Mila-Quebec Artificial Intelligence Institute, Montreal, Canada)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Despite their widespread applications, Large Language Models (LLMs) often struggle to express uncertainty, posing a challenge for reliable deployment in high stakes and safety critical domains like clinical diagnostics. Existing standard baseline methods such as model logits and elicited probabilities produce overconfident and poorly calibrated estimates. In this work, we propose Approximate Bayesian Computation (ABC), a likelihood-free Bayesian inference, based approach that treats LLMs as a stochastic simulator to infer posterior distributions over predictive probabilities. We evaluate our ABC approach on two clinically relevant benchmarks: a synthetic oral lesion diagnosis dataset and the publicly available GretelAI symptom-to-diagnosis dataset. Compared to standard baselines, our approach improves accuracy by up to 46.9%, reduces Brier scores by 74.4%, and enhances calibration as measured by Expected Calibration Error (ECE) and predictive entropy.
zh

[AI-95] Representation-based Broad Hallucination Detectors Fail to Generalize Out of Distribution EMNLP2025

【速读】:该论文旨在解决当前最先进的幻觉检测(hallucination detection)方法在真实场景中泛化能力不足的问题,特别是其性能是否真正依赖于对幻觉内容的识别能力,而非数据中的伪相关性(spurious correlation)。研究发现,现有SOTA模型在RAGTruth数据集上的表现主要受数据分布影响,一旦控制这一因素,其效果仅相当于简单的监督线性探测器(supervised linear probes),且需要大量跨数据集的超参数调优。解决方案的关键在于提出一套系统性的评估指南,强调应通过控制伪相关性来更公平地衡量模型的真实检测能力,并推动更具鲁棒性和可解释性的幻觉检测方法发展。

链接: https://arxiv.org/abs/2509.19372
作者: Zuzanna Dubanowska,Maciej Żelaszczyk,Michał Brzozowski,Paolo Mandica,Michał Karpowicz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in EMNLP 2025 Findings

点击查看摘要

Abstract:We critically assess the efficacy of the current SOTA in hallucination detection and find that its performance on the RAGTruth dataset is largely driven by a spurious correlation with data. Controlling for this effect, state-of-the-art performs no better than supervised linear probes, while requiring extensive hyperparameter tuning across datasets. Out-of-distribution generalization is currently out of reach, with all of the analyzed methods performing close to random. We propose a set of guidelines for hallucination detection and its evaluation.
zh

[AI-96] Unsupervised Outlier Detection in Audit Analytics: A Case Study Using USA Spending Data

【速读】:该论文试图解决在大规模政府数据集(如美国卫生与公共服务部的联邦支出数据)中,传统审计方法难以高效且准确识别异常支出模式的问题。解决方案的关键在于采用多种无监督异常检测算法(包括基于直方图的异常评分HBOS、鲁棒主成分分析Robust PCA、最小协方差行列式MCD和K近邻KNN)进行比较与集成,并通过精度(precision)、召回率(recall)和F1分数对性能进行评估,最终发现混合多种检测策略的协同方法能显著提升复杂金融数据中异常识别的鲁棒性和准确性。

链接: https://arxiv.org/abs/2509.19366
作者: Buhe Li,Berkay Kaplan,Maksym Lazirko,Aleksandr Kogan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates the effectiveness of unsupervised outlier detection methods in audit analytics, utilizing USA spending data from the U.S. Department of Health and Human Services (DHHS) as a case example. We employ and compare multiple outlier detection algorithms, including Histogram-based Outlier Score (HBOS), Robust Principal Component Analysis (PCA), Minimum Covariance Determinant (MCD), and K-Nearest Neighbors (KNN) to identify anomalies in federal spending patterns. The research addresses the growing need for efficient and accurate anomaly detection in large-scale governmental datasets, where traditional auditing methods may fall short. Our methodology involves data preparation, algorithm implementation, and performance evaluation using precision, recall, and F1 scores. Results indicate that a hybrid approach, combining multiple detection strategies, enhances the robustness and accuracy of outlier identification in complex financial data. This study contributes to the field of audit analytics by providing insights into the comparative effectiveness of various outlier detection models and demonstrating the potential of unsupervised learning techniques in improving audit quality and efficiency. The findings have implications for auditors, policymakers, and researchers seeking to leverage advanced analytics in governmental financial oversight and risk management.
zh

[AI-97] Analyzing the Impact of Credit Card Fraud on Economic Fluctuations of American Households Using an Adaptive Neuro-Fuzzy Inference System

【速读】:该论文旨在解决信用卡欺诈对美国家庭财务状况造成的日益严重威胁及其引发的家庭经济行为不确定性问题。解决方案的关键在于提出一种基于增强型自适应神经模糊推理系统(Enhanced ANFIS)的混合分析方法,其核心创新包括:引入多分辨率小波分解模块以生成局部经济冲击信号,构建基于Takagi-Sugeno模糊规则与自适应高斯隶属函数的深度模糊规则库,并设计时序注意力编码器以自适应分配多尺度经济行为模式的权重,从而提升模糊推理阶段的相关性评估效果并增强对长期时间依赖性和欺诈异常的捕捉能力。该方法通过模块化训练过程将模糊规则激活、小波基选择与时序相关权重相融合,突破了传统ANFIS固定输入输出关系的局限性。

链接: https://arxiv.org/abs/2509.19363
作者: Zhuqi Wang,Qinghe Zhang,Zhuopei Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Credit card fraud is assuming growing proportions as a major threat to the financial position of American household, leading to unpredictable changes in household economic behavior. To solve this problem, in this paper, a new hybrid analysis method is presented by using the Enhanced ANFIS. The model proposes several advances of the conventional ANFIS framework and employs a multi-resolution wavelet decomposition module and a temporal attention mechanism. The model performs discrete wavelet transformations on historical transaction data and macroeconomic indicators to generate localized economic shock signals. The transformed features are then fed into a deep fuzzy rule library which is based on Takagi-Sugeno fuzzy rules with adaptive Gaussian membership functions. The model proposes a temporal attention encoder that adaptively assigns weights to multi-scale economic behavior patterns, increasing the effectiveness of relevance assessment in the fuzzy inference stage and enhancing the capture of long-term temporal dependencies and anomalies caused by fraudulent activities. The proposed method differs from classical ANFIS which has fixed input-output relations since it integrates fuzzy rule activation with the wavelet basis selection and the temporal correlation weights via a modular training procedure. Experimental results show that the RMSE was reduced by 17.8% compared with local neuro-fuzzy models and conventional LSTM models.
zh

[AI-98] DeepACTIF: Efficient Feature Attribution via Activation Traces in Neural Sequence Models

【速读】:该论文旨在解决深度学习模型在时间序列领域(如生物特征眼动追踪)中特征归因(feature attribution)的计算效率问题,尤其是传统方法(如Integrated Gradients和SHAP)因计算复杂度高而不适用于实时应用场景。解决方案的关键在于提出一种轻量级、架构感知的归因方法DeepACTIF,其核心创新是利用LSTM网络内部激活值,并引入逆权重聚合机制(inverse-weighted aggregation scheme),以增强跨时间步的激活稳定性与幅度信息,从而高效估计特征重要性。实验表明,DeepACTIF在仅使用top 10%特征时仍保持预测性能,且相比SHAP、IG和DeepLIFT等方法在准确性和统计鲁棒性上显著更优,同时大幅降低计算时间和内存消耗,适用于边缘设备上的实时可解释性需求。

链接: https://arxiv.org/abs/2509.19362
作者: Benedikt W. Hosp
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Feature attribution is essential for interpreting deep learning models, particularly in time-series domains such as healthcare, biometrics, and human-AI interaction. However, standard attribution methods, such as Integrated Gradients or SHAP, are computationally intensive and not well-suited for real-time applications. We present DeepACTIF, a lightweight and architecture-aware feature attribution method that leverages internal activations of sequence models to estimate feature importance efficiently. Focusing on LSTM-based networks, we introduce an inverse-weighted aggregation scheme that emphasises stability and magnitude of activations across time steps. Our evaluation across three biometric gaze datasets shows that DeepACTIF not only preserves predictive performance under severe feature reduction (top 10% of features) but also significantly outperforms established methods, including SHAP, IG, and DeepLIFT, in terms of both accuracy and statistical robustness. Using Wilcoxon signed-rank tests and effect size analysis, we demonstrate that DeepACTIF yields more informative feature rankings with significantly lower error across all top-k conditions (10 - 40%). Our experiments demonstrate that DeepACTIF not only reduces computation time and memory usage by orders of magnitude but also preserves model accuracy when using only top-ranked features. That makes DeepACTIF a viable solution for real-time interpretability on edge devices such as mobile XR headsets or embedded health monitors.
zh

[AI-99] Anti-Money Laundering Systems Using Deep Learning

【速读】:该论文旨在解决传统反洗钱(Anti-Money Laundering, AML)系统中存在的高误报率及难以识别复杂洗钱模式的问题。其解决方案的关键在于引入基于深度学习的图卷积网络(Graph Convolutional Network, GCN)模型,并结合中心性算法(如度中心性、接近中心性、介数中心性和PageRank)进行链路分析,从而在金融交易网络中更有效地识别可疑行为,提升AML系统的检测精度与智能化水平。

链接: https://arxiv.org/abs/2509.19359
作者: Mashkhal Abdalwahid Sidiq,Yimamu Kirubel Wondaferew
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:In this paper, we focused on using deep learning methods for detecting money laundering in financial transaction networks, in order to demonstrate that it can be used as a complement or instead of the more commonly used rule-based systems and conventional Anti-Money Laundering (AML) systems. The paper explores the pivotal role played by Anti-Money Laundering (AML) activities in the global financial industry. It underscores the drawbacks of conventional AML systems, which exhibit high rates of false positives and lack the sophistication to uncover intricate money laundering schemes. To tackle these challenges, the paper proposes an advanced AML system that capitalizes on link analysis using deep learning techniques. At the heart of this system lies the utilization of centrality algorithms like Degree Centrality, Closeness Centrality, Betweenness Centrality, and PageRank. These algorithms enhance the system’s capability to identify suspicious activities by examining the influence and interconnections within networks of financial transactions. The significance of Anti-Money Laundering (AML) efforts within the global financial sector is discussed in this paper. It highlights the limitations of traditional AML systems. The results showed the practicality and superiority of the new implementation of the GCN model, which is a preferable method for connectively structured data, meaning that a transaction or account is analyzed in the context of its financial environment. In addition, the paper delves into the prospects of Anti-Money Laundering (AML) efforts, proposing the integration of emerging technologies such as deep learning and centrality algorithms. This integration holds promise for enhancing the effectiveness of AML systems by refining their capabilities.
zh

[AI-100] Fine-Grained AI Model Caching and Downloading With Coordinated Multipoint Broadcasting in Multi-Cell Edge Networks

【速读】:该论文旨在解决6G网络中因AI模型规模庞大、边缘节点存储资源有限以及无线信道并发传输异构模型导致的低效缓存与下载问题。其核心挑战在于如何在有限的边缘存储空间内高效管理模型参数,并实现多用户同时获取所需模型时的低延迟传输。解决方案的关键在于提出一种细粒度的AI模型缓存与下载系统,利用预训练模型微调过程中参数复用的特性,仅缓存可共享的模型参数块(Parameter Blocks, PBs),从而避免冗余存储;同时引入协同多点(CoMP)广播机制,将重复使用的PBs同时发送给多个用户,提升下行频谱利用率。在此基础上,通过联合优化PB缓存策略、迁移路径及广播波束赋形,以最小化模型下载延迟,并设计分布式多智能体学习框架,使边缘节点能够显式学习彼此行为间的相互影响,促进协作优化,辅以数据增强方法生成合成训练样本,加速策略学习并提高收敛性能。

链接: https://arxiv.org/abs/2509.19341
作者: Yang Fu,Peng Qin,Yueyue Zhang,Yifei Wang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:6G networks are envisioned to support on-demand AI model downloading to accommodate diverse inference requirements of end users. By proactively caching models at edge nodes, users can retrieve the requested models with low latency for on-device AI inference. However, the substantial size of contemporary AI models poses significant challenges for edge caching under limited storage capacity, as well as for the concurrent delivery of heterogeneous models over wireless channels. To address these challenges, we propose a fine-grained AI model caching and downloading system that exploits parameter reusability, stemming from the common practice of fine-tuning task-specific models from a shared pre-trained model with frozen parameters. This system selectively caches model parameter blocks (PBs) at edge nodes, eliminating redundant storage of reusable parameters across different cached models. Additionally, it incorporates coordinated multipoint (CoMP) broadcasting to simultaneously deliver reusable PBs to multiple users, thereby enhancing downlink spectrum utilization. Under this arrangement, we formulate a model downloading delay minimization problem to jointly optimize PB caching, migration (among edge nodes), and broadcasting beamforming. To tackle this intractable problem, we develop a distributed multi-agent learning framework that enables edge nodes to explicitly learn mutual influence among their actions, thereby facilitating cooperation. Furthermore, a data augmentation approach is proposed to adaptively generate synthetic training samples through a predictive model, boosting sample efficiency and accelerating policy learning. Both theoretical analysis and simulation experiments validate the superior convergence performance of the proposed learning framework.
zh

[AI-101] Multi-population Ensemble Genetic Programming via Cooperative Coevolution and Multi-view Learning for Classification

【速读】:该论文旨在解决高维异构特征空间中的分类问题,尤其针对传统遗传编程(Genetic Programming, GP)在复杂数据环境下易陷入早熟收敛、模型可解释性弱以及难以有效融合多视图信息的局限性。其解决方案的关键在于提出多种群集成遗传编程(Multi-population Ensemble Genetic Programming, MEGP)框架,通过将输入空间分解为条件独立的特征子集,实现多个子种群并行演化,并借助基于软最大(softmax)加权的可微分集成机制动态融合个体输出,从而提升模型的适应性决策能力和可解释性;同时引入混合选择机制,在种群内与种群间双层进化动力下维持多样性并促进合作,显著改善收敛速度与泛化性能。

链接: https://arxiv.org/abs/2509.19339
作者: Mohammad Sadegh Khorshidi,Navid Yazdanjue,Hassan Gharoun,Mohammad Reza Nikoo,Fang Chen,Amir H. Gandomi
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 59 Pages, 68 Figures, 27 Tables

点击查看摘要

Abstract:This paper introduces Multi-population Ensemble Genetic Programming (MEGP), a computational intelligence framework that integrates cooperative coevolution and the multiview learning paradigm to address classification challenges in high-dimensional and heterogeneous feature spaces. MEGP decomposes the input space into conditionally independent feature subsets, enabling multiple subpopulations to evolve in parallel while interacting through a dynamic ensemble-based fitness mechanism. Each individual encodes multiple genes whose outputs are aggregated via a differentiable softmax-based weighting layer, enhancing both model interpretability and adaptive decision fusion. A hybrid selection mechanism incorporating both isolated and ensemble-level fitness promotes inter-population cooperation while preserving intra-population diversity. This dual-level evolutionary dynamic facilitates structured search exploration and reduces premature convergence. Experimental evaluations across eight benchmark datasets demonstrate that MEGP consistently outperforms a baseline GP model in terms of convergence behavior and generalization performance. Comprehensive statistical analyses validate significant improvements in Log-Loss, Precision, Recall, F1 score, and AUC. MEGP also exhibits robust diversity retention and accelerated fitness gains throughout evolution, highlighting its effectiveness for scalable, ensemble-driven evolutionary learning. By unifying population-based optimization, multi-view representation learning, and cooperative coevolution, MEGP contributes a structurally adaptive and interpretable framework that advances emerging directions in evolutionary machine learning.
zh

[AI-102] Radio Propagation Modelling: To Differentiate or To Deep Learn That Is The Question

【速读】:该论文旨在解决不同iable ray tracing(可微分射线追踪)在真实大规模移动网络部署中是否具备可扩展性和实际应用价值的问题,尤其是在与传统深度学习(Deep Learning, DL)模型对比时的性能差异。其关键解决方案在于通过在覆盖13个城市、超过10,000个天线的真实网络数据集上系统性地比较不同iable ray tracing与DL模型在无线覆盖模拟中的表现,从而验证其在准确性、泛化能力及实时性方面的优劣。实验结果表明,尽管可微分射线追踪在效率-精度差距上有所改善,但在大规模真实场景下泛化能力不足且无法满足实时需求,而DL模型则展现出更高的精度和更快的适应速度,尤其在城区、郊区和农村环境中均取得最高达3 dB的性能提升。

链接: https://arxiv.org/abs/2509.19337
作者: Stefanos Bakirtzis,Paul Almasan,José Suárez-Varela,Gabriel O. Ferreira,Michail Kalntis,André Felipe Zanella,Ian Wassell,Andra Lutu
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Differentiable ray tracing has recently challenged the status quo in radio propagation modelling and digital twinning. Promising unprecedented speed and the ability to learn from real-world data, it offers a real alternative to conventional deep learning (DL) models. However, no experimental evaluation on production-grade networks has yet validated its assumed scalability or practical benefits. This leaves mobile network operators (MNOs) and the research community without clear guidance on its applicability. In this paper, we fill this gap by employing both differentiable ray tracing and DL models to emulate radio coverage using extensive real-world data collected from the network of a major MNO, covering 13 cities and more than 10,000 antennas. Our results show that, while differentiable ray-tracing simulators have contributed to reducing the efficiency-accuracy gap, they struggle to generalize from real-world data at a large scale, and they remain unsuitable for real-time applications. In contrast, DL models demonstrate higher accuracy and faster adaptation than differentiable ray-tracing simulators across urban, suburban, and rural deployments, achieving accuracy gains of up to 3 dB. Our experimental results aim to provide timely insights into a fundamental open question with direct implications on the wireless ecosystem and future research.
zh

[AI-103] Wavelet Fourier Diffuser: Frequency-Aware Diffusion Model for Reinforcement Learning

【速读】:该论文旨在解决现有基于扩散概率模型的离线强化学习方法在仅依赖时域特征时,因忽略频域特性而导致低频成分发生频率偏移、进而引发轨迹不稳定和性能下降的问题。其解决方案的关键在于提出Wavelet Fourier Diffuser (WFDiffuser),该框架通过引入离散小波变换(Discrete Wavelet Transform, DWT)将轨迹分解为低频与高频分量,并结合短时傅里叶变换(Short-Time Fourier Transform, STFT)与交叉注意力机制,分别提取各频段特征并促进跨频交互,从而有效抑制频率偏移,提升轨迹平滑性与决策性能。

链接: https://arxiv.org/abs/2509.19305
作者: Yifu Luo,Yongzhe Chang,Xueqian Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Diffusion probability models have shown significant promise in offline reinforcement learning by directly modeling trajectory sequences. However, existing approaches primarily focus on time-domain features while overlooking frequency-domain features, leading to frequency shift and degraded performance according to our observation. In this paper, we investigate the RL problem from a new perspective of the frequency domain. We first observe that time-domain-only approaches inadvertently introduce shifts in the low-frequency components of the frequency domain, which results in trajectory instability and degraded performance. To address this issue, we propose Wavelet Fourier Diffuser (WFDiffuser), a novel diffusion-based RL framework that integrates Discrete Wavelet Transform to decompose trajectories into low- and high-frequency components. To further enhance diffusion modeling for each component, WFDiffuser employs Short-Time Fourier Transform and cross attention mechanisms to extract frequency-domain features and facilitate cross-frequency interaction. Extensive experiment results on the D4RL benchmark demonstrate that WFDiffuser effectively mitigates frequency shift, leading to smoother, more stable trajectories and improved decision-making performance over existing methods.
zh

[AI-104] LLM s as verification oracles for Solidity

【速读】:该论文旨在解决智能合约业务逻辑错误导致的安全漏洞问题,这类漏洞难以被传统漏洞检测工具识别,而现有形式化验证工具因学习曲线陡峭和规范语言受限,应用范围有限。解决方案的关键在于系统评估GPT-5这一前沿推理型大语言模型(Large Language Model, LLM)作为验证预言机(verification oracle)的能力,即其能否对任意合约特定性质进行推理判断。研究通过大规模验证任务基准测试、与主流形式化工具输出对比以及真实审计场景下的有效性评估,证明了最新推理型LLM在智能合约安全验证中具有显著潜力,标志着AI与形式化方法融合的新方向。

链接: https://arxiv.org/abs/2509.19153
作者: Massimo Bartoletti,Enrico Lipparini,Livio Pompianu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Ensuring the correctness of smart contracts is critical, as even subtle flaws can lead to severe financial losses. While bug detection tools able to spot common vulnerability patterns can serve as a first line of defense, most real-world exploits and losses stem from errors in the contract business logic. Formal verification tools such as SolCMC and the Certora Prover address this challenge, but their impact remains limited by steep learning curves and restricted specification languages. Recent works have begun to explore the use of large language models (LLMs) for security-related tasks such as vulnerability detection and test generation. Yet, a fundamental question remains open: can LLMs serve as verification oracles, capable of reasoning about arbitrary contract-specific properties? In this paper, we provide the first systematic evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role. We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios. Our study combines quantitative metrics with qualitative analysis, and shows that recent reasoning-oriented LLMs can be surprisingly effective as verification oracles, suggesting a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.
zh

[AI-105] owards Machine-Generated Code for the Resolution of User Intentions

【速读】:该论文旨在解决用户与设备交互方式的重构问题,即如何利用生成式 AI(Generative AI)提升用户意图到可执行任务的转化效率。传统模式依赖高阶应用实现用户目标,而本文提出通过大语言模型(Large Language Models, LLMs)直接生成代码来实现用户意图的自动化执行,从而构建人机协同的混合工作流(hybrid workflows)。其解决方案的关键在于:利用 GPT-4o-mini 模型根据具体用户意图生成结构化代码,并结合面向无图形界面操作系统的简化 API 执行这些代码,验证了从自然语言描述到可运行工作流的端到端可行性。

链接: https://arxiv.org/abs/2504.17531
作者: Justus Flerlage,Ilja Behnke,Odej Kao
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The growing capabilities of Artificial Intelligence (AI), particularly Large Language Models (LLMs), prompt a reassessment of the interaction mechanisms between users and their devices. Currently, users are required to use a set of high-level applications to achieve their desired results. However, the advent of AI may signal a shift in this regard, as its capabilities have generated novel prospects for user-provided intent resolution through the deployment of model-generated code. This development represents a significant progression in the realm of hybrid workflows, where human and artificial intelligence collaborate to address user intentions, with the former responsible for defining these intentions and the latter for implementing the solutions to address them. In this paper, we investigate the feasibility of generating and executing workflows through code generation that results from prompting an LLM with a concrete user intention, and a simplified application programming interface for a GUI-less operating system. We provide an in-depth analysis and comparison of various user intentions, the resulting code, and its execution. The findings demonstrate the general feasibility of our approach and that the employed LLM, GPT-4o-mini, exhibits remarkable proficiency in the generation of code-oriented workflows in accordance with provided user intentions.
zh

[AI-106] 2025 Southeast Asia Eleven Nations Influence Index Report

【速读】:该论文旨在解决东南亚地区国家影响力评估中因专家打分和主观赋权导致的偏差问题,同时实现对东盟十一个成员国间层级权力结构的客观、可重复量化分析。其解决方案的关键在于构建了一个完全数据驱动的东南亚影响力指数(SAII v3),通过整合经济、军事、外交与社会技术四个维度的权威开源指标,并采用三阶标准化链(分位数-Box-Cox-最小最大值)处理异常值与偏度;在权重确定上,融合信息熵法(EWM)、CRITIC法与主成分分析(PCA)进行等权集成,确保权重的稳健性与多维合理性。该方法显著提升了评估体系的透明度与可审计性,为区域治理与外部合作提供了可量化的决策依据。

链接: https://arxiv.org/abs/2509.19953
作者: Wei Meng
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
备注: The document delivers a robust reproducible index (SAII v3) that advances quantitative IR methods and offers actionable insights into Southeast Asia’s stratified power structure

点击查看摘要

Abstract:This study constructs a fully data-driven and reproducible Southeast Asia Influence Index (SAII v3) to reduce bias from expert scoring and subjective weighting while mapping hierarchical power structures across the eleven ASEAN nations. We aggregate authoritative open-source indicators across four dimensions (economic, military, diplomatic, socio-technological) and apply a three-tiered standardization chain quantile-Box-Cox-min-max to mitigate outliers and skewness. Weights are obtained through equal-weight integration of Entropy Weighting Method (EWM), CRITIC, and PCA. Robustness is assessed via Kendall’s tau, +/-20% weight perturbation, and 10,000 bootstrap iterations, with additional checks including +/-10% dimensional sensitivity and V2-V3 bump chart comparisons. Results show integrated weights: Economy 35-40%, Military 20-25%, Diplomacy about 20%, Socio-Technology about 15%. The regional landscape exhibits a one-strong, two-medium, three-stable, and multiple-weak pattern: Indonesia, Singapore, and Malaysia lead, while Thailand, the Philippines, and Vietnam form a mid-tier competitive band. V2 and V3 rankings are highly consistent (Kendall’s tau = 0.818), though small mid-tier reorderings appear (Thailand and the Philippines rise, Vietnam falls), indicating that v3 is more sensitive to structural equilibrium. ASEAN-11 average sensitivity highlights military and socio-technological dimensions as having the largest marginal effects (+/-0.002). In conclusion, SAII v3 delivers algorithmic weighting and auditable reproducibility, reveals multidimensional drivers of influence in Southeast Asia, and provides actionable quantitative evidence for resource allocation and policy prioritization by regional governments and external partners.
zh

[AI-107] Causal Inference under Threshold Manipulation: Bayesian Mixture Modeling and Heterogeneous Treatment Effects AAAI2026

【速读】:该论文旨在解决阈值激励策略(如信用卡奖励计划)中因顾客策略性行为导致的因果效应估计偏差问题。当顾客知晓奖励门槛并主动调整消费以达到阈值时,传统回归不连续设计(Regression Discontinuity Design, RDD)的假设会被破坏,从而影响因果推断的准确性。解决方案的关键在于提出一种新颖的混合模型框架,将观察到的消费分布建模为两类人群的混合:一类是受阈值策略性影响的顾客,另一类是未受影响的顾客;通过两步贝叶斯方法拟合该混合模型——先识别非聚集(non-bunching)群体,再在阈值附近样本中估计混合比例与因果效应,并进一步扩展至分层贝叶斯结构以实现子群体间异质因果效应的稳定估计。

链接: https://arxiv.org/abs/2509.19814
作者: Kohsuke Kubota,Shonosuke Sugasawa
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
备注: Submitted to AAAI 2026

点击查看摘要

Abstract:Many marketing applications, including credit card incentive programs, offer rewards to customers who exceed specific spending thresholds to encourage increased consumption. Quantifying the causal effect of these thresholds on customers is crucial for effective marketing strategy design. Although regression discontinuity design is a standard method for such causal inference tasks, its assumptions can be violated when customers, aware of the thresholds, strategically manipulate their spending to qualify for the rewards. To address this issue, we propose a novel framework for estimating the causal effect under threshold manipulation. The main idea is to model the observed spending distribution as a mixture of two distributions: one representing customers strategically affected by the threshold, and the other representing those unaffected. To fit the mixture model, we adopt a two-step Bayesian approach consisting of modeling non-bunching customers and fitting a mixture model to a sample around the threshold. We show posterior contraction of the resulting posterior distribution of the causal effect under large samples. Furthermore, we extend this framework to a hierarchical Bayesian setting to estimate heterogeneous causal effects across customer subgroups, allowing for stable inference even with small subgroup sample sizes. We demonstrate the effectiveness of our proposed methods through simulation studies and illustrate their practical implications using a real-world marketing dataset.
zh

[AI-108] Dynamicasome: a molecular dynamics-guided and AI-driven pathogenicity prediction catalogue for all genetic mutations

【速读】:该论文旨在解决基因组医学中大量疾病相关基因的突变功能意义尚不明确的问题,这限制了其在临床诊断和决策中的应用。解决方案的关键在于将分子动力学模拟(Molecular Dynamics Simulations, MDS)提取的详细构象数据整合进先进的AI模型中,从而显著提升预测准确性。研究以PMM2基因为例,对所有变异体进行系统性分析并构建结构模型,通过MDS获取动态构象信息后训练神经网络模型,该模型不仅优于现有工具,还能对当前分类为意义未明的突变给出可靠病原性预测,有效缓解了临床中未知变异带来的负担。

链接: https://arxiv.org/abs/2509.19766
作者: Naeyma N Islam,Mathew A Coban,Jessica M Fuller,Caleb Weber,Rohit Chitale,Benjamin Jussila,Trisha J. Brock,Cui Tao,Thomas R Caulfield
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Molecular Networks (q-bio.MN)
备注: 14 pages , 6 Figures, 2 Tables

点击查看摘要

Abstract:Advances in genomic medicine accelerate the identi cation of mutations in disease-associated genes, but the pathogenicity of many mutations remains unknown, hindering their use in diagnostics and clinical decision-making. Predictive AI models are generated to combat this issue, but current tools display low accuracy when tested against functionally validated datasets. We show that integrating detailed conformational data extracted from molecular dynamics simulations (MDS) into advanced AI-based models increases their predictive power. We carry out an exhaustive mutational analysis of the disease gene PMM2 and subject structural models of each variant to MDS. AI models trained on this dataset outperform existing tools when predicting the known pathogenicity of mutations. Our best performing model, a neuronal networks model, also predicts the pathogenicity of several PMM2 mutations currently considered of unknown signi cance. We believe this model helps alleviate the burden of unknown variants in genomic medicine.
zh

[AI-109] SMILES-Inspired Transfer Learning for Quantum Operators in Generative Quantum Eigensolver

【速读】:该论文旨在解决传统变分量子本征求解器(Variational Quantum Eigensolver, VQE)在处理不同分子体系时因需为每个系统独立构建量子算符而导致的计算资源消耗过高问题。其解决方案的关键在于利用分子间结构相似性,提出一种基于SMILES表示法启发的文本化量子算符表示方法,将UCCSD(单激发和双激发酉耦合簇)量子算符映射为文本序列,并通过文本相似度度量建立迁移学习框架,从而实现不同分子体系间的知识迁移,显著降低生成式量子本征求解器(Generative Quantum Eigensolver, GQE)中地面态能量计算的计算成本。

链接: https://arxiv.org/abs/2509.19715
作者: Zhi Yin,Xiaoran Li,Shengyu Zhang,Xin Li,Xiaojin Zhang
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:Given the inherent limitations of traditional Variational Quantum Eigensolver(VQE) algorithms, the integration of deep generative models into hybrid quantum-classical frameworks, specifically the Generative Quantum Eigensolver(GQE), represents a promising innovative approach. However, taking the Unitary Coupled Cluster with Singles and Doubles(UCCSD) ansatz which is widely used in quantum chemistry as an example, different molecular systems require constructions of distinct quantum operators. Considering the similarity of different molecules, the construction of quantum operators utilizing the similarity can reduce the computational cost significantly. Inspired by the SMILES representation method in computational chemistry, we developed a text-based representation approach for UCCSD quantum operators by leveraging the inherent representational similarities between different molecular systems. This framework explores text pattern similarities in quantum operators and employs text similarity metrics to establish a transfer learning framework. Our approach with a naive baseline setting demonstrates knowledge transfer between different molecular systems for ground-state energy calculations within the GQE paradigm. This discovery offers significant benefits for hybrid quantum-classical computation of molecular ground-state energies, substantially reducing computational resource requirements.
zh

[AI-110] Selective Classifier-free Guidance for Zero-shot Text-to-speech ICASSP2026

【速读】:该论文旨在解决零样本文本到语音(Zero-shot Text-to-Speech, TTS)中难以平衡目标说话人保真度与文本内容忠实性的问题。其关键解决方案是引入分离条件的分类器自由引导(Classifier-Free Guidance, CFG)策略,并通过在不同时间步长采用不同的引导方式实现优化:即在早期阶段使用标准CFG以增强说话人相似性,而在后期阶段切换为选择性CFG(Selective CFG)以减少对文本内容的干扰。研究发现,这种分阶段引导机制能有效提升说话人保真度同时控制文本忠实性的下降,且选择性CFG的效果高度依赖于文本表示方式,例如在英文和中文之间存在显著差异,即使使用相同模型也表现出不同的性能表现。

链接: https://arxiv.org/abs/2509.19668
作者: John Zheng,Farhad Maleki
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 5 pages, 7 figures, 1 table. Submitted to ICASSP 2026

点击查看摘要

Abstract:In zero-shot text-to-speech, achieving a balance between fidelity to the target speaker and adherence to text content remains a challenge. While classifier-free guidance (CFG) strategies have shown promising results in image generation, their application to speech synthesis are underexplored. Separating the conditions used for CFG enables trade-offs between different desired characteristics in speech synthesis. In this paper, we evaluate the adaptability of CFG strategies originally developed for image generation to speech synthesis and extend separated-condition CFG approaches for this domain. Our results show that CFG strategies effective in image generation generally fail to improve speech synthesis. We also find that we can improve speaker similarity while limiting degradation of text adherence by applying standard CFG during early timesteps and switching to selective CFG only in later timesteps. Surprisingly, we observe that the effectiveness of a selective CFG strategy is highly text-representation dependent, as differences between the two languages of English and Mandarin can lead to different results even with the same model.
zh

[AI-111] Online Adaptation via Dual-Stage Alignment and Self-Supervision for Fast-Calibration Brain-Computer Interfaces

【速读】:该论文旨在解决脑电图(EEG)基脑机接口(BCI)系统在实际应用中因个体差异导致的在线适应难题。其核心解决方案为提出一种针对未见过受试者的在线自适应算法,关键在于采用双阶段对齐与自监督机制:首先在EEG数据空间进行欧氏对齐,并在表示空间更新批归一化统计量;其次设计基于解码器生成软伪标签的自监督损失函数,并通过香农熵进行校准,从而实现仅需单次在线试验即可更新解码器,显著提升分类准确率(SSVEP任务平均提升4.9%,运动想象任务提升3.6%),具备良好的泛化性和快速校准能力。

链接: https://arxiv.org/abs/2509.19403
作者: Sheng-Bin Duan,Jian-Long Hao,Tian-Yu Xiang,Xiao-Hu Zhou,Mei-Jiang Gui,Xiao-Liang Xie,Shi-Qi Liu,Zeng-Guang Hou
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Individual differences in brain activity hinder the online application of electroencephalogram (EEG)-based brain computer interface (BCI) systems. To overcome this limitation, this study proposes an online adaptation algorithm for unseen subjects via dual-stage alignment and self-supervision. The alignment process begins by applying Euclidean alignment in the EEG data space and then updates batch normalization statistics in the representation space. Moreover, a self-supervised loss is designed to update the decoder. The loss is computed by soft pseudo-labels derived from the decoder as a proxy for the unknown ground truth, and is calibrated by Shannon entropy to facilitate self-supervised training. Experiments across five public datasets and seven decoders show the proposed algorithm can be integrated seamlessly regardless of BCI paradigm and decoder architecture. In each iteration, the decoder is updated with a single online trial, which yields average accuracy gains of 4.9% on steady-state visual evoked potentials (SSVEP) and 3.6% on motor imagery. These results support fast-calibration operation and show that the proposed algorithm has great potential for BCI applications.
zh

[AI-112] Self-Alignment Learning to Improve Myocardial Infarction Detection from Single-Lead ECG

【速读】:该论文旨在解决从单导联心电图(Single-lead Electrocardiogram, ECG)中准确检测心肌梗死(Myocardial Infarction)的问题,其核心挑战在于单导联数据空间信息有限,而现有生成式方法在信号层面优化时存在潜在空间差距,导致诊断性能下降。解决方案的关键在于提出SelfMIS框架,通过自适应裁剪策略将多导联ECG与其对应的单导联片段配对,并直接在潜在空间进行对齐,从而将学习目标从追求变换不变性转向增强单导联表示能力,使编码器能够从局部信号中推断全局心脏状态,显著提升了九类心肌梗死的检测性能,同时保持模型结构简洁、计算开销低。

链接: https://arxiv.org/abs/2509.19397
作者: Jiarui Jin,Xiaocheng Fang,Haoyu Wang,Jun Li,Che Liu,Donglin Xie,Hongyan Li,Shenda Hong
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Myocardial infarction is a critical manifestation of coronary artery disease, yet detecting it from single-lead electrocardiogram (ECG) remains challenging due to limited spatial information. An intuitive idea is to convert single-lead into multiple-lead ECG for classification by pre-trained models, but generative methods optimized at the signal level in most cases leave a large latent space gap, ultimately degrading diagnostic performance. This naturally raises the question of whether latent space alignment could help. However, most prior ECG alignment methods focus on learning transformation invariance, which mismatches the goal of single-lead detection. To address this issue, we propose SelfMIS, a simple yet effective alignment learning framework to improve myocardial infarction detection from single-lead ECG. Discarding manual data augmentations, SelfMIS employs a self-cutting strategy to pair multiple-lead ECG with their corresponding single-lead segments and directly align them in the latent space. This design shifts the learning objective from pursuing transformation invariance to enriching the single-lead representation, explicitly driving the single-lead ECG encoder to learn a representation capable of inferring global cardiac context from the local signal. Experimentally, SelfMIS achieves superior performance over baseline models across nine myocardial infarction types while maintaining a simpler architecture and lower computational overhead, thereby substantiating the efficacy of direct latent space alignment. Our code and checkpoint will be publicly available after acceptance.
zh

[AI-113] Data-Driven Reconstruction of Significant Wave Heights from Sparse Observations

【速读】:该论文旨在解决从稀疏且分布不均的浮标观测数据中重建高分辨率区域有效波高(Significant Wave Height, SWH)场的核心挑战,这对于海洋监测与风险感知型作业至关重要。其解决方案的关键在于提出了一种混合深度学习框架AUWave,该框架融合了基于站点的序列编码器(MLP)与多尺度U-Net结构,并引入瓶颈自注意力层以增强空间特征捕捉能力,从而在32×32的区域SWH场恢复任务中实现高精度重建。实验表明,学习率是影响模型泛化性能的主导超参数,且当存在最小但非冗余的空间锚定点时,多尺度结构和注意力机制显著提升预测准确性,同时通过误差地图和浮标消融分析识别出关键锚定站点,为观测网络优化提供可操作依据。

链接: https://arxiv.org/abs/2509.19384
作者: Hongyuan Shi,Yilin Zhai,Ping Dong,Zaijin You,Chao Zhan,Qing Wang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:Reconstructing high-resolution regional significant wave height fields from sparse and uneven buoy observations remains a core challenge for ocean monitoring and risk-aware operations. We introduce AUWave, a hybrid deep learning framework that fuses a station-wise sequence encoder (MLP) with a multi-scale U-Net enhanced by a bottleneck self-attention layer to recover 32 \times 32 regional SWH fields. A systematic Bayesian hyperparameter search with Optuna identifies the learning rate as the dominant driver of generalization, followed by the scheduler decay and the latent dimension. Using NDBC buoy observations and ERA5 reanalysis over the Hawaii region, AUWave attains a minimum validation loss of 0.043285 and a slightly right-skewed RMSE distribution. Spatial errors are lowest near observation sites and increase with distance, reflecting identifiability limits under sparse sampling. Sensitivity experiments show that AUWave consistently outperforms a representative baseline in data-richer configurations, while the baseline is only marginally competitive in the most underdetermined single-buoy cases. The architecture’s multi-scale and attention components translate into accuracy gains when minimal but non-trivial spatial anchoring is available. Error maps and buoy ablations reveal key anchor stations whose removal disproportionately degrades performance, offering actionable guidance for network design. AUWave provides a scalable pathway for gap filling, high-resolution priors for data assimilation, and contingency reconstruction.
zh

[AI-114] he Impact of Structural Changes on Learning Capacity in the Fly Olfactory Neural Circuit

【速读】:该论文旨在解决蘑菇体(mushroom body, MB)中Kenyon细胞(Kenyon cell, KC)到输出神经元(mushroom body output neuron, MBON)突触连接结构变化如何影响MBON对气味类别的区分能力这一问题。其关键解决方案是构建一个包含投影神经元(projection neuron, PN)、KC与MBON之间连接关系的神经网络模型,并通过人工生成十类气味输入来训练模型,进而系统性地分析KC-MBON连接数量、突触权重以及KC发育类型等因素对MBON分类性能的影响。研究发现,具有较少前突触KC连接的MBON在气味分类任务中表现较差,且成熟KC的缺失比未成熟KC更显著损害MBON的学习能力;此外,随机和靶向修剪KC-MBON连接的结果与细胞消融实验一致,进一步揭示了KC发育状态和连接特异性在嗅觉学习中的核心作用。

链接: https://arxiv.org/abs/2509.19351
作者: Katherine Xie,Gabriel Koch Ocker
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The Drosophila mushroom body (MB) is known to be involved in olfactory learning and memory; the synaptic plasticity of the Kenyon cell (KC) to mushroom body output neuron (MBON) synapses plays a key role in the learning process. Previous research has focused on projection neuron (PN) to Kenyon cell (KC) connectivity within the MB; we examine how perturbations to the mushroom body circuit structure and changes in connectivity, specifically within the KC to mushroom body output neuron (MBON) neural circuit, affect the MBONs’ ability to distinguish between odor classes. We constructed a neural network that incorporates the connectivity between PNs, KCs, and MBONs. To train our model, we generated ten artificial input classes, which represent the projection neuron activity in response to different odors. We collected data on the number of KC-to-MBON connections, MBON error rates, and KC-to-MBON synaptic weights, among other metrics. We observed that MBONs with very few presynaptic KCs consistently performed worse than others in the odor classification task. The developmental types of KCs also played a significant role in each MBON’s output. We performed random and targeted KC ablation and observed that ablating developmentally mature KCs had a greater negative impact on MBONs’ learning capacity than ablating immature KCs. Random and targeted pruning of KC-MBON synaptic connections yielded results largely consistent with the ablation experiments. To further explore the various types of KCs, we also performed rewiring experiments in the PN to KC circuit. Our study furthers our understanding of olfactory neuroplasticity and provides important clues to understanding learning and memory in general. Understanding how the olfactory circuits process and learn can also have potential applications in artificial intelligence and treatments for neurodegenerative diseases.
zh

[AI-115] Joint Channel Estimation and Computation Offloading in Fluid Antenna-assisted MEC Networks

【速读】:该论文旨在解决基于流体天线(Fluid Antenna, FA)的移动边缘计算(Mobile Edge Computing, MEC)系统中任务卸载延迟优化问题,其核心挑战在于:1)由于FA端口位置动态变化导致的信道估计复杂性;2)联合优化FA端口选择、波束赋形、功率控制与资源分配所引发的高维非凸问题。解决方案的关键在于:提出一种基于信息瓶颈度量增强的压缩感知方法(Information Bottleneck Metric-enhanced Channel Compressed Sensing, IBM-CCS),有效提升FA信道估计精度与鲁棒性;同时设计一种博弈论辅助的分层双dueling多智能体强化学习算法(Game Theory-assisted Hierarchical Twin-Dueling Multi-agent Algorithm, HiTDMA),通过层次化结构解耦用户侧与基站侧优化任务,并利用博弈论降低功率控制变量维度,从而显著提升深度强化学习(Deep Reinforcement Learning, DRL)代理的优化效率,最终实现系统延迟最小化和卸载性能增强。

链接: https://arxiv.org/abs/2509.19340
作者: Ying Ju,Mingdong Li,Haoyu Wang,Lei Liu,Youyang Qu,Mianxiong Dong,Victor C. M. Leung,Chau Yuen
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:With the emergence of fluid antenna (FA) in wireless communications, the capability to dynamically adjust port positions offers substantial benefits in spatial diversity and spectrum efficiency, which are particularly valuable for mobile edge computing (MEC) systems. Therefore, we propose an FA-assisted MEC offloading framework to minimize system delay. This framework faces two severe challenges, which are the complexity of channel estimation due to dynamic port configuration and the inherent non-convexity of the joint optimization problem. Firstly, we propose Information Bottleneck Metric-enhanced Channel Compressed Sensing (IBM-CCS), which advances FA channel estimation by integrating information relevance into the sensing process and capturing key features of FA channels effectively. Secondly, to address the non-convex and high-dimensional optimization problem in FA-assisted MEC systems, which includes FA port selection, beamforming, power control, and resource allocation, we propose a game theory-assisted Hierarchical Twin-Dueling Multi-agent Algorithm (HiTDMA) based offloading scheme, where the hierarchical structure effectively decouples and coordinates the optimization tasks between the user side and the base station side. Crucially, the game theory effectively reduces the dimensionality of power control variables, allowing deep reinforcement learning (DRL) agents to achieve improved optimization efficiency. Numerical results confirm that the proposed scheme significantly reduces system delay and enhances offloading performance, outperforming benchmarks. Additionally, the IBM-CCS channel estimation demonstrates superior accuracy and robustness under varying port densities, contributing to efficient communication under imperfect CSI.
zh

[AI-116] CSIYOLO: An Intelligent CSI-based Scatter Sensing Framework for Integrated Sensing and Communication Systems

【速读】:该论文旨在解决现有集成感知与通信(ISAC)系统中散射体定位精度低、兼容性差的问题,尤其是传统方法依赖波形或硬件修改,难以适配现有通信系统且感知性能受限。其关键解决方案是提出CSIYOLO框架,该框架仅利用单个基站-用户设备对估计的信道状态信息(CSI)实现无硬件改动的散射体定位:首先将散射参数提取建模为图像检测问题,采用基于锚点的检测机制(受YOLO架构启发);其次设计基于CSI的定位算法,结合多尺度锚点检测和任务导向优化网络结构,提升定位精度与实现效率,并引入噪声注入训练策略增强对信道估计误差的鲁棒性。此方案无需改变现有通信协议或信号处理流程,可作为插件无缝集成至当前系统。

链接: https://arxiv.org/abs/2509.19335
作者: Xudong Zhang,Jingbo Tan,Zhizhen Ren,Jintao Wang,Yihua Ma,Jian Song
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 13 pages, 16 figures, 3 tables. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:ISAC is regarded as a promising technology for next-generation communication systems, enabling simultaneous data transmission and target sensing. Among various tasks in ISAC, scatter sensing plays a crucial role in exploiting the full potential of ISAC and supporting applications such as autonomous driving and low-altitude economy. However, most existing methods rely on either waveform and hardware modifications or traditional signal processing schemes, leading to poor compatibility with current communication systems and limited sensing accuracy. To address these challenges, we propose CSIYOLO, a framework that performs scatter localization only using estimated CSI from a single base station-user equipment pair. This framework comprises two main components: anchor-based scatter parameter detection and CSI-based scatter localization. First, by formulating scatter parameter extraction as an image detection problem, we propose an anchor-based scatter parameter detection method inspired by You Only Look Once architectures. After that, a CSI-based localization algorithm is derived to determine scatter locations with extracted parameters. Moreover, to improve localization accuracy and implementation efficiency, we design an extendable network structure with task-oriented optimizations, enabling multi-scale anchor detection and better adaptation to CSI characteristics. A noise injection training strategy is further designed to enhance robustness against channel estimation errors. Since the proposed framework operates solely on estimated CSI without modifying waveforms or signal processing pipelines, it can be seamlessly integrated into existing communication systems as a plugin. Experiments show that our proposed method can significantly outperform existing methods in scatter localization accuracy with relatively low complexities under varying numbers of scatters and estimation errors.
zh

[AI-117] Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention

【速读】:该论文旨在解决现有深度学习模型在处理复数信号(complex-valued signals)时,通常将注意力机制建模为实值相关性,从而忽略相位信息及其干涉效应的问题。这导致模型在复数域中的表示不一致,尤其在任务损失仅关注幅度而忽略相位时,易出现相位坍缩(phase collapse)。解决方案的关键在于提出全息注意力机制(Holographic Attention),其受物理波干涉原理启发,通过相对相位调制交互关系,并对值进行相干叠加,确保幅度与相位的一致性;同时采用双头解码器同步重建输入和预测任务输出,有效防止相位信息丢失。实验表明,该方法在极化合成孔径雷达(PolSAR)图像分类和无线信道预测任务中均表现出高精度、低误差和强鲁棒性,验证了物理一致性建模在复数学习中的有效性。

链接: https://arxiv.org/abs/2509.19331
作者: Enhao Huang,Zhiyu Zhang,Tianxiang Xu,Chunshu Xia,Kaichun Hu,Yuchen Yang,Tongtong Pan,Dong Dong,Zhan Qin
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Complex-valued signals encode both amplitude and phase, yet most deep models treat attention as real-valued correlation, overlooking interference effects. We introduce the Holographic Transformer, a physics-inspired architecture that incorporates wave interference principles into self-attention. Holographic attention modulates interactions by relative phase and coherently superimposes values, ensuring consistency between amplitude and phase. A dual-headed decoder simultaneously reconstructs the input and predicts task outputs, preventing phase collapse when losses prioritize magnitude over phase. We demonstrate that holographic attention implements a discrete interference operator and maintains phase consistency under linear mixing. Experiments on PolSAR image classification and wireless channel prediction show strong performance, achieving high classification accuracy and F1 scores, low regression error, and increased robustness to phase perturbations. These results highlight that enforcing physical consistency in attention leads to generalizable improvements in complex-valued learning and provides a unified, physics-based framework for coherent signal modeling. The code is available at this https URL.
zh

[AI-118] LibEMER: A novel benchmark and algorithms library for EEG-based Multimodal Emotion Recognition

【速读】:该论文旨在解决基于脑电图(EEG)的多模态情感识别(EMER)领域中存在的三大关键问题:缺乏开源实现、标准化与透明化的基准测试平台缺失,以及对核心挑战和未来研究方向的深入讨论不足。解决方案的关键在于提出 LibEMER,一个统一的评估框架,提供经过精选的深度学习方法的完整可复现的 PyTorch 实现,并配套标准化的数据预处理、模型实现与实验设置协议,从而支持在三个常用公开数据集上对两类学习任务进行无偏性能评估。

链接: https://arxiv.org/abs/2509.19330
作者: Zejun Liu,Yunshan Chen,Chengxi Xie,Huan Liu
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:EEG-based multimodal emotion recognition(EMER) has gained significant attention and witnessed notable advancements, the inherent complexity of human neural systems has motivated substantial efforts toward multimodal approaches. However, this field currently suffers from three critical limitations: (i) the absence of open-source implementations. (ii) the lack of standardized and transparent benchmarks for fair performance analysis. (iii) in-depth discussion regarding main challenges and promising research directions is a notable scarcity. To address these challenges, we introduce LibEMER, a unified evaluation framework that provides fully reproducible PyTorch implementations of curated deep learning methods alongside standardized protocols for data preprocessing, model realization, and experimental setups. This framework enables unbiased performance assessment on three widely-used public datasets across two learning tasks. The open-source library is publicly accessible at: this https URL
zh

[AI-119] Human Activity Recognition Based on Electrocardiogram Data Only

【速读】:该论文旨在解决传统人体活动识别(Human Activity Recognition, HAR)依赖惯性测量单元(Inertial Measurement Units, IMUs)所导致的资源消耗高和校准复杂的问题,同时突破现有基于心电图(Electrocardiogram, ECG)的方法仅限于粗粒度分类(如跌倒检测或静息/活动状态区分)的局限。其关键解决方案在于首次实现仅使用ECG信号对六种不同物理活动进行鲁棒识别,并设计了三种新型深度学习模型:包含Squeeze-and-Excitation模块的CNN用于通道特征重校准、引入空洞卷积的ResNet用于捕捉多尺度时间依赖性,以及一种新颖的CNN-Transformer混合架构,结合卷积提取局部特征与注意力机制建模长程时序关系。实验表明,这些模型在54名受试者数据上对已见个体均达到94%以上准确率,其中CNN-Transformer混合模型在未见个体上达到72%准确率,验证了纯ECG驱动活动识别的可行性与潜力,为下一代无需额外运动传感器即可同步进行心脏监测与活动识别的可穿戴设备提供了技术基础。

链接: https://arxiv.org/abs/2509.19328
作者: Sina Montazeri,Waltenegus Dargie,Yunhe Feng,Kewei Sha
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This is a preprint version. Content may change before final publication

点击查看摘要

Abstract:Human activity recognition is critical for applications such as early intervention and health analytics. Traditional activity recognition relies on inertial measurement units (IMUs), which are resource intensive and require calibration. Although electrocardiogram (ECG)-based methods have been explored, these have typically served as supplements to IMUs or have been limited to broad categorical classification such as fall detection or active vs. inactive in daily activities. In this paper, we advance the field by demonstrating, for the first time, robust recognition of activity only with ECG in six distinct activities, which is beyond the scope of previous work. We design and evaluate three new deep learning models, including a CNN classifier with Squeeze-and-Excitation blocks for channel-wise feature recalibration, a ResNet classifier with dilated convolutions for multiscale temporal dependency capture, and a novel CNNTransformer hybrid combining convolutional feature extraction with attention mechanisms for long-range temporal relationship modeling. Tested on data from 54 subjects for six activities, all three models achieve over 94% accuracy for seen subjects, while CNNTransformer hybrid reaching the best accuracy of 72% for unseen subjects, a result that can be further improved by increasing the training population. This study demonstrates the first successful ECG-only activity classification in multiple physical activities, offering significant potential for developing next-generation wearables capable of simultaneous cardiac monitoring and activity recognition without additional motion sensors.
zh

[AI-120] Advancing Few-Shot Pediatric Arrhythmia Classification with a Novel Contrastive Loss and Multimodal Learning

【速读】:该论文旨在解决儿科心律失常(Pediatric Arrhythmias)自动化分类中存在的挑战,包括类别不平衡、少样本类别以及复杂的心电信号特征,这些问题严重限制了早期筛查和临床干预的效率与可靠性。其解决方案的关键在于提出一种多模态端到端深度学习框架,该框架融合ECG与IEGM双分支卷积编码器进行特征提取,引入语义注意力机制实现跨模态特征对齐,并采用轻量级Transformer编码器建模全局依赖关系;同时创新性地设计了一种自适应全局类感知对比损失(Adaptive Global Class-Aware Contrastive Loss, AGCACL),通过类原型和全局相似矩阵增强类内紧凑性和类间可分性,从而显著提升对少数类心律失常的检测能力与鲁棒性。

链接: https://arxiv.org/abs/2509.19315
作者: Yiqiao Chen,Zijian Huang,Zhenghui Feng
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12pages, 10 figures

点击查看摘要

Abstract:Pediatric arrhythmias are a major risk factor for disability and sudden cardiac death, yet their automated classification remains challenging due to class imbalance, few-shot categories, and complex signal characteristics, which severely limit the efficiency and reliability of early screening and clinical intervention. To address this problem, we propose a multimodal end-to-end deep learning framework that combines dual-branch convolutional encoders for ECG and IEGM, semantic attention for cross-modal feature alignment, and a lightweight Transformer encoder for global dependency modeling. In addition, we introduce a new contrastive loss fucntion named Adaptive Global Class-Aware Contrastive Loss (AGCACL) to enhance intra-class compactness and inter-class separability through class prototypes and a global similarity matrix. To the best of our knowledge, this is the first systematic study based on the Leipzig Heart Center pediatric/congenital ECG+IEGM dataset, for which we also provide a complete and reproducible preprocessing pipeline. Experimental results demonstrate that the proposed method achieves the overall best performance on this dataset, including 97.76% Top-1 Accuracy, 94.08% Macro Precision, 91.97% Macro Recall, 92.97% Macro F1, and 92.36% Macro F2, with improvements of +13.64, +15.96, +19.82, and +19.44 percentage points over the strongest baseline in Macro Precision/Recall/F1/F2, respectively. These findings indicate that the framework significantly improves the detectability and robustness for minority arrhythmia classes, offering potential clinical value for rhythm screening, pre-procedural assessment, and postoperative follow-up in pediatric and congenital heart disease populations.
zh

[AI-121] E2E Learning Massive MIMO for Multimodal Semantic Non-Orthogonal Transmission and Fusion

【速读】:该论文旨在解决大规模多输入多输出(Massive MIMO)系统中下行链路信道状态信息(CSI)维度高导致的实时信道获取与预编码复杂度高的问题。其解决方案的关键在于提出一种端到端(E2E)上下行CSI融合预编码网络,通过统一建模下行CSI参考信号(CSI-RS)设计、CSI反馈和基站(BS)预编码三个环节,在单一神经网络架构中协同优化。该方案的核心创新包括:基于MAXIM架构的投影网络生成多维(频率-波束-端口域)CSI-RS设计矩阵;用户设备(UE)对CSI-RS观测进行压缩/量化并反馈紧凑表示;基站端采用双分支候选预编码网络(分别基于反馈和上行SRS),再由融合预编码网络整合二者以获得最终发射预编码器。整个系统通过面向频谱效率的损失函数分三阶段训练,显著优于传统基线方法。

链接: https://arxiv.org/abs/2509.19312
作者: Minghui Wu,Zhen Gao
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Massive multiple-input multiple-output (MIMO) promises high spectral efficiency but also leads to high-dimensional downlink channel state information (CSI), which complicates real-time channel acquisition and precoding. To address this, we propose an end-to-end (E2E) uplink-downlink CSI fusion precoding network that jointly models downlink CSI reference signal (CSI-RS) design, CSI feedback, and base-station (BS) precoding within a single E2E neural architecture. Concretely, a projection network built on the MAXIM architecture takes uplink sounding reference signals (SRS) as input and outputs frequency-, beam-, and port-domain projection matrices for designing downlink CSI-RS. User equipment (UE) then compresses/quantizes the resulting CSI-RS observations and feeds back a compact representation. At the base station (BS), two complementary branches produce candidate precoders: one is a feedback-only precoding network driven by quantized downlink observations, and the other is an SRS-only precoding network driven by uplink SRS. These candidate precoders are subsequently combined by a fusion precoding network to yield the final transmit precoder. All the modules are trained with a spectral-efficiency-oriented loss under a three-stage schedule. Simulation results show that the proposed approach effectively harnesses both SRS-derived information and UE feedback, achieving markedly better performance than conventional baselines.
zh

[AI-122] A Federated Fine-Tuning Paradigm of Foundation Models in Heterogenous Wireless Networks

【速读】:该论文旨在解决异构无线网络中边缘设备因计算资源受限和性能差异导致的联邦微调(federated fine-tuning)性能下降问题,以及通信不可靠性对模型收敛的影响。其解决方案的关键在于提出一种基于切换机制的联邦微调框架,通过边缘设备动态选择低秩适配(LoRA)模块与基站协同训练,以缓解设备异构性和传输不稳定性带来的负面影响;进一步地,通过理论推导可计算的推理风险差距上界,并将优化问题建模为带长期约束的非凸混合整数规划问题,进而分解为模型切换、功率控制和带宽分配子问题,设计了一种具有多项式复杂度的在线优化算法,从而在保证模型精度的同时提升能效。

链接: https://arxiv.org/abs/2509.19306
作者: Jingyi Wang,Zhongyuan Zhao,Qingtian Wang,Zexu Li,Yue Wang,Tony Q. S. Quek
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Edge intelligence has emerged as a promising strategy to deliver low-latency and ubiquitous services for mobile devices. Recent advances in fine-tuning mechanisms of foundation models have enabled edge intelligence by integrating low-rank adaptation (LoRA) with federated learning. However, in wireless networks, the device heterogeneity and resource constraints on edge devices pose great threats to the performance of federated fine-tuning. To tackle these issues, we propose to optimize federated fine-tuning in heterogenous wireless networks via online learning. First, the framework of switching-based federated fine-tuning in wireless networks is provided. The edge devices switches to LoRA modules dynamically for federated fine-tuning with base station to jointly mitigate the impact of device heterogeneity and transmission unreliability. Second, a tractable upper bound on the inference risk gap is derived based on theoretical analysis. To improve the generalization capability, we formulate a non-convex mixed-integer programming problem with long-term constraints, and decouple it into model switching, transmit power control, and bandwidth allocation subproblems. An online optimization algorithm is developed to solve the problems with polynomial computational complexity. Finally, the simulation results on the SST-2 and QNLI data sets demonstrate the performance gains in test accuracy and energy efficiency.
zh

机器学习

[LG-0] Process-Informed Forecasting of Complex Thermal Dynamics in Pharmaceutical Manufacturing

链接: https://arxiv.org/abs/2509.20349
作者: Ramona Rubini,Siavash Khodakarami,Aniruddha Bora,George Em Karniadakis,Michele Dassisti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate time-series forecasting for complex physical systems is the backbone of modern industrial monitoring and control. While deep learning models excel at capturing complex dynamics, currently, their deployment is limited due to physical inconsistency and robustness, hence constraining their reliability in regulated environments. We introduce process-informed forecasting (PIF) models for temperature in pharmaceutical lyophilization. We investigate a wide range of models, from classical ones such as Autoregressive Integrated Moving Average Model (ARIMA) and Exponential Smoothing Model (ETS), to modern deep learning architectures, including Kolmogorov-Arnold Networks (KANs). We compare three different loss function formulations that integrate a process-informed trajectory prior: a fixed-weight loss, a dynamic uncertainty-based loss, and a Residual-Based Attention (RBA) mechanism. We evaluate all models not only for accuracy and physical consistency but also for robustness to sensor noise. Furthermore, we test the practical generalizability of the best model in a transfer learning scenario on a new process. Our results show that PIF models outperform their data-driven counterparts in terms of accuracy, physical plausibility and noise resilience. This work provides a roadmap for developing reliable and generalizable forecasting solutions for critical applications in the pharmaceutical manufacturing landscape.

[LG-1] Spatio-Temporal Directed Graph Learning for Account Takeover Fraud Detection NEURIPS2025

链接: https://arxiv.org/abs/2509.20339
作者: Mohsen Nayebi Kerdabadi,William Andrew Byron,Xin Sun,Amirfarrokh Iranitalab
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted at NeurIPS 2025 workshop New Perspective in Graph Machine Learning (NPGML)

点击查看摘要

Abstract:Account Takeover (ATO) fraud poses a significant challenge in consumer banking, requiring high recall under strict latency while minimizing friction for legitimate users. Production systems typically rely on tabular gradient-boosted decision trees (e.g., XGBoost) that score sessions independently, overlooking the relational and temporal structure of online activity that characterizes coordinated attacks and “fraud rings.” We introduce ATLAS (Account Takeover Learning Across Spatio-Temporal Directed Graph), a framework that reformulates ATO detection as spatio-temporal node classification on a time-respecting directed session graph. ATLAS links entities via shared identifiers (account, device, IP) and regulates connectivity with time-window and recency constraints, enabling causal, time-respecting message passing and latency-aware label propagation that uses only labels available at scoring time, non-anticipative and leakage-free. We operationalize ATLAS with inductive GraphSAGE variants trained via neighbor sampling, at scale on a sessions graph with more than 100M nodes and around 1B edges. On a high-risk digital product at Capital One, ATLAS delivers 6.38 percent AUC improvement and more than 50 percent reduction in customer friction, improving fraud capture while reducing user friction.

[LG-2] Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization

链接: https://arxiv.org/abs/2509.20334
作者: Tianyu Ruan,Kuo Gai,Shihua Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Why do deep networks generalize well? In contrast to classical generalization theory, we approach this fundamental question by examining not only inputs and outputs, but the evolution of internal features. Our study suggests a phenomenon of temporal consistency that predictions remain stable when shallow features from earlier checkpoints combine with deeper features from later ones. This stability is not a trivial convergence artifact. It acts as a form of implicit, structured augmentation that supports generalization. We show that temporal consistency extends to unseen and corrupted data, but collapses when semantic structure is destroyed (e.g., random labels). Statistical tests further reveal that SGD injects anisotropic noise aligned with a few principal directions, reinforcing its role as a source of structured variability. Together, these findings suggest a conceptual perspective that links feature dynamics to generalization, pointing toward future work on practical surrogates for measuring temporal feature evolution.

[LG-3] A Recovery Guarantee for Sparse Neural Networks

链接: https://arxiv.org/abs/2509.20323
作者: Sara Fridovich-Keil,Mert Pilanci
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Code is available at this https URL

点击查看摘要

Abstract:We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.

[LG-4] Graph Variate Neural Networks

链接: https://arxiv.org/abs/2509.20311
作者: Om Roy,Yashar Moshfeghi,Keith Smith
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modelling dynamically evolving spatio-temporal signals is a prominent challenge in the Graph Neural Network (GNN) literature. Notably, GNNs assume an existing underlying graph structure. While this underlying structure may not always exist or is derived independently from the signal, a temporally evolving functional network can always be constructed from multi-channel data. Graph Variate Signal Analysis (GVSA) defines a unified framework consisting of a network tensor of instantaneous connectivity profiles against a stable support usually constructed from the signal itself. Building on GVSA and tools from graph signal processing, we introduce Graph-Variate Neural Networks (GVNNs): layers that convolve spatio-temporal signals with a signal-dependent connectivity tensor combining a stable long-term support with instantaneous, data-driven interactions. This design captures dynamic statistical interdependencies at each time step without ad hoc sliding windows and admits an efficient implementation with linear complexity in sequence length. Across forecasting benchmarks, GVNNs consistently outperform strong graph-based baselines and are competitive with widely used sequence models such as LSTMs and Transformers. On EEG motor-imagery classification, GVNNs achieve strong accuracy highlighting their potential for brain-computer interface applications.

[LG-5] Ads that Stick: Near-Optimal Ad Optimization through Psychological Behavior Models

链接: https://arxiv.org/abs/2509.20304
作者: Kailash Gopal Darmasubramanian,Akash Pareek,Arindam Khan,Arpit Agarwal
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimizing the timing and frequency of ads is a central problem in digital advertising, with significant economic consequences. Existing scheduling policies rely on simple heuristics, such as uniform spacing and frequency caps, that overlook long-term user interest. However, it is well-known that users’ long-term interest and engagement result from the interplay of several psychological effects (Curmei, Haupt, Recht, Hadfield-Menell, ACM CRS, 2022). In this work, we model change in user interest upon showing ads based on three key psychological principles: mere exposure, hedonic adaptation, and operant conditioning. The first two effects are modeled using a concave function of user interest with repeated exposure, while the third effect is modeled using a temporal decay function, which explains the decline in user interest due to overexposure. Under our psychological behavior model, we ask the following question: Given a continuous time interval T , how many ads should be shown, and at what times, to maximize the user interest towards the ads? Towards answering this question, we first show that, if the number of displayed ads is fixed, then the optimal ad-schedule only depends on the operant conditioning function. Our main result is a quasi-linear time algorithm that outputs a near-optimal ad-schedule, i.e., the difference in the performance of our schedule and the optimal schedule is exponentially small. Our algorithm leads to significant insights about optimal ad placement and shows that simple heuristics such as uniform spacing are sub-optimal under many natural settings. The optimal number of ads to display, which also depends on the mere exposure and hedonistic adaptation functions, can be found through a simple linear search given the above algorithm. We further support our findings with experimental results, demonstrating that our strategy outperforms various baselines. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2509.20304 [cs.DS] (or arXiv:2509.20304v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2509.20304 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-6] Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

链接: https://arxiv.org/abs/2509.20294
作者: Dongming Huang,Zhifan Li,Yicheng Li,Qian Lin
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level \sigma^2 . The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most K , the minimax excess risk scales as \sigma^2 K . Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

[LG-7] Extended Low-Rank Approximation Accelerates Learning of Elastic Response in Heterogeneous Materials

链接: https://arxiv.org/abs/2509.20276
作者: Prabhat Karmakar,Sayan Gupta,Ilaksh Adlakha
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Predicting how the microstructure governs the mechanical response of heterogeneous materials is essential for optimizing design and performance. Yet this task remains difficult due to the complex, high dimensional nature of microstructural features. Relying on physics based simulations to probe the microstructural space is computationally prohibitive. This motivates the development of computational tools to efficiently learn structure property linkages governing mechanical behavior. While contemporary data driven approaches offer new possibilities, they often require large datasets. To address this challenge, this work presents the Extended Low Rank Approximation (xLRA), a framework that employs canonical polyadic tensor decomposition. It efficiently maps high dimensional microstructural information to the local elastic response by adaptively incorporating higher rank terms. xLRA accurately predicts the local elastic strain fields in porous microstructures, requiring a maximum rank of only 4. The compact formulation of xLRA achieves accurate predictions when trained on just 5% of the dataset, demonstrating significant data efficiency. Moreover, xLRA proves transferability by delivering results across representative material systems, including two phase composites and single and dual phase polycrystals. Despite being compact, xLRA retains essential microstructural details, enabling accurate predictions on unseen microstructures. Benchmarking shows that xLRA outperforms contemporary methods in predictive accuracy, generalizability, and computational efficiency, while requiring 6 orders of magnitude fewer floating point operations. In summary, xLRA provides an efficient framework for predicting the elastic response from microstructures, enabling scalable mapping of structure property linkages.

[LG-8] Dynamic Lagging for Time-Series Forecasting in E-Commerce Finance: Mitigating Information Loss with A Hybrid ML Architecture

链接: https://arxiv.org/abs/2509.20244
作者: Abhishek Sharma,Anat Parush,Sumit Wadhwa,Amihai Savir,Anne Guinard,Prateek Srivastava
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate forecasting in the e-commerce finance domain is particularly challenging due to irregular invoice schedules, payment deferrals, and user-specific behavioral variability. These factors, combined with sparse datasets and short historical windows, limit the effectiveness of conventional time-series methods. While deep learning and Transformer-based models have shown promise in other domains, their performance deteriorates under partial observability and limited historical data. To address these challenges, we propose a hybrid forecasting framework that integrates dynamic lagged feature engineering and adaptive rolling-window representations with classical statistical models and ensemble learners. Our approach explicitly incorporates invoice-level behavioral modeling, structured lag of support data, and custom stability-aware loss functions, enabling robust forecasts in sparse and irregular financial settings. Empirical results demonstrate an approximate 5% reduction in MAPE compared to baseline models, translating into substantial financial savings. Furthermore, the framework enhances forecast stability over quarterly horizons and strengthens feature target correlation by capturing both short- and long-term patterns, leveraging user profile attributes, and simulating upcoming invoice behaviors. These findings underscore the value of combining structured lagging, invoice-level closure modeling, and behavioral insights to advance predictive accuracy in sparse financial time-series forecasting.

[LG-9] Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

链接: https://arxiv.org/abs/2509.20241
作者: Felipe Oviedo,Fiodar Kazhamiaka,Esha Choukse,Allen Kim,Amy Luers,Melanie Nakagawa,Ricardo Bianchini,Juan M. Lavista Ferres
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: A preprint version with DOI is available at Zenodo: this https URL

点击查看摘要

Abstract:As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.

[LG-10] me-adaptive HénonNets for separable Hamiltonian systems

链接: https://arxiv.org/abs/2509.20212
作者: Konrad Janik,Peter Benner
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Measurement data is often sampled irregularly, i.e., not on equidistant time grids. This is also true for Hamiltonian systems. However, existing machine learning methods, which learn symplectic integrators, such as SympNets [1] and HénonNets [2] still require training data generated by fixed step sizes. To learn time-adaptive symplectic integrators, an extension to SympNets called TSympNets is introduced in [3]. The aim of this work is to do a similar extension for HénonNets. We propose a novel neural network architecture called T-HénonNets, which is symplectic by design and can handle adaptive time steps. We also extend the T-HénonNet architecture to non-autonomous Hamiltonian systems. Additionally, we provide universal approximation theorems for both new architectures for separable Hamiltonian systems and discuss why it is difficult to handle non-separable Hamiltonian systems with the proposed methods. To investigate these theoretical approximation capabilities, we perform different numerical experiments.

[LG-11] Practical do-Shapley Explanations with Estimand-Agnostic Causal Inference NEURIPS2025

链接: https://arxiv.org/abs/2509.20211
作者: Álvaro Parafita,Tomas Garriga,Axel Brando,Francisco J. Cazorla
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at NeurIPS 2025

点击查看摘要

Abstract:Among explainability techniques, SHAP stands out as one of the most popular, but often overlooks the causal structure of the problem. In response, do-SHAP employs interventional queries, but its reliance on estimands hinders its practical application. To address this problem, we propose the use of estimand-agnostic approaches, which allow for the estimation of any identifiable query from a single model, making do-SHAP feasible on complex graphs. We also develop a novel algorithm to significantly accelerate its computation at a negligible cost, as well as a method to explain inaccessible Data Generating Processes. We demonstrate the estimation and computational performance of our approach, and validate it on two real-world datasets, highlighting its potential in obtaining reliable explanations.

[LG-12] Staying on the Manifold: Geometry-Aware Noise Injection

链接: https://arxiv.org/abs/2509.20201
作者: Albert Kjøller Jacobsen,Johanna Marie Gegenfurtner,Georgios Arvanitidis
类目: Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:It has been shown that perturbing the input during training implicitly regularises the gradient of the learnt function, leading to smoother models and enhancing generalisation. However, previous research mostly considered the addition of ambient noise in the input space, without considering the underlying structure of the data. In this work, we propose several methods of adding geometry-aware input noise that accounts for the lower dimensional manifold the input space inhabits. We start by projecting ambient Gaussian noise onto the tangent space of the manifold. In a second step, the noise sample is mapped on the manifold via the associated geodesic curve. We also consider Brownian motion noise, which moves in random steps along the manifold. We show that geometry-aware noise leads to improved generalization and robustness to hyperparameter selection on highly curved manifolds, while performing at least as well as training without noise on simpler manifolds. Our proposed framework extends to learned data manifolds.

[LG-13] FairEquityFL – A Fair and Equitable Client Selection in Federated Learning for Heterogeneous IoV Networks

链接: https://arxiv.org/abs/2509.20193
作者: Fahmida Islam,Adnan Mahmood,Noorain Mukhtiar,Kasun Eranda Wijethilake,Quan Z. Sheng
类目: Machine Learning (cs.LG)
*备注: Published in: Advanced Data Mining and Applications (ADMA 2024), Lecture Notes in Computer Science, vol. 15388, pp. 254-269. First online: 13 Dec 2024. DOI: https://doi.org/10.1007/978-981-96-0814-0_17 . 422

点击查看摘要

Abstract:Federated Learning (FL) has been extensively employed for a number of applications in machine learning, i.e., primarily owing to its privacy preserving nature and efficiency in mitigating the communication overhead. Internet of Vehicles (IoV) is one of the promising applications, wherein FL can be utilized to train a model more efficiently. Since only a subset of the clients can participate in each FL training round, challenges arise pertinent to fairness in the client selection process. Over the years, a number of researchers from both academia and industry have proposed numerous FL frameworks. However, to the best of our knowledge, none of them have employed fairness for FL-based client selection in a dynamic and heterogeneous IoV environment. Accordingly, in this paper, we envisage a FairEquityFL framework to ensure an equitable opportunity for all the clients to participate in the FL training process. In particular, we have introduced a sampling equalizer module within the selector component for ensuring fairness in terms of fair collaboration opportunity for all the clients in the client selection process. The selector is additionally responsible for both monitoring and controlling the clients’ participation in each FL training round. Moreover, an outlier detection mechanism is enforced for identifying malicious clients based on the model performance in terms of considerable fluctuation in either accuracy or loss minimization. The selector flags suspicious clients and temporarily suspend such clients from participating in the FL training process. We further evaluate the performance of FairEquityFL on a publicly available dataset, FEMNIST. Our simulation results depict that FairEquityFL outperforms baseline models to a considerable extent.

[LG-14] Generative Model Inversion Through the Lens of the Manifold Hypothesis NEURIPS2025

链接: https://arxiv.org/abs/2509.20177
作者: Xiong Peng,Bo Han,Fengfei Yu,Tongliang Liu,Feng Liu,Mingyuan Zhou
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process, yielding reconstructions with high visual quality and strong fidelity to the private training data. To explore the reason behind their effectiveness, we begin by examining the gradients of inversion loss with respect to synthetic inputs, and find that these gradients are surprisingly noisy. Further analysis reveals that generative inversion implicitly denoises these gradients by projecting them onto the tangent space of the generator manifold, filtering out off-manifold components while preserving informative directions aligned with the manifold. Our empirical measurements show that, in models trained with standard supervision, loss gradients often exhibit large angular deviations from the data manifold, indicating poor alignment with class-relevant directions. This observation motivates our central hypothesis: models become more vulnerable to MIAs when their loss gradients align more closely with the generator manifold. We validate this hypothesis by designing a novel training objective that explicitly promotes such alignment. Building on this insight, we further introduce a training-free approach to enhance gradient-manifold alignment during inversion, leading to consistent improvements over state-of-the-art generative MIAs.

[LG-15] Benchmarking Web API Integration Code Generation

链接: https://arxiv.org/abs/2509.20172
作者: Daniel Maninger,Leon Chemnitz,Amir Molzam Sharifloo,Jannis Brugger,Mira Mezini
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: To be published in Proceedings of 2nd ACM International Conference on AI-powered Software, Benchmark Dataset Track (AIware '25)

点击查看摘要

Abstract:API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models~(LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models were able to solve more than 40% of the tasks.

[LG-16] Choose Your Battles: Distributed Learning Over Multiple Tug of War Games

链接: https://arxiv.org/abs/2509.20147
作者: Siddharth Chandak,Ilai Bistritz,Nicholas Bambos
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注: Submitted to IEEE TAC

点击查看摘要

Abstract:Consider N players and K games taking place simultaneously. Each of these games is modeled as a Tug-of-War (ToW) game where increasing the action of one player decreases the reward for all other players. Each player participates in only one game at any given time. At each time step, a player decides the game in which they wish to participate in and the action they take in that game. Their reward depends on the actions of all players that are in the same game. This system of K games is termed `Meta Tug-of-War’ (Meta-ToW) game. These games can model scenarios such as power control, distributed task allocation, and activation in sensor networks. We propose the Meta Tug-of-Peace algorithm, a distributed algorithm where the action updates are done using a simple stochastic approximation algorithm, and the decision to switch games is made using an infrequent 1-bit communication between the players. We prove that in Meta-ToW games, our algorithm converges to an equilibrium that satisfies a target Quality of Service reward vector for the players. We then demonstrate the efficacy of our algorithm through simulations for the scenarios mentioned above.

[LG-17] Intelligent Algorithm Selection for Recommender Systems: Meta-Learning via in-depth algorithm feature engineering

链接: https://arxiv.org/abs/2509.20134
作者: Jarne Mathi Decker
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The “No Free Lunch” theorem dictates that no single recommender algorithm is optimal for all users, creating a significant Algorithm Selection Problem. Standard meta-learning approaches aim to solve this by selecting an algorithm based on user features, but treat the fundamentally diverse algorithms themselves as equivalent, “black-box” choices. This thesis investigates the impact of overcoming this limitation by engineering a comprehensive feature set to explicitly characterize the algorithms themselves. We combine static code metrics, Abstract Syntax Tree properties, behavioral performance landmarks, and high-level conceptual features. We evaluate two meta-learners across five datasets: a baseline using only user features and our proposed model using both user and algorithm features. Our results show that the meta-learner augmented with algorithm features achieves an average NDCG@10 of 0.143, a statistically significant improvement of 11.7% over the Single Best Algorithm baseline (0.128). However, we found that the inclusion of algorithm features did not lead to an improvement in overall NDCG@10 over the meta learner using only user features (0.144). While adding algorithm features to the meta-learner did improve its Top-1 selection accuracy (+16.1%), this was counterbalanced by leading to a lower Top-3 accuracy (-10.7%). We conclude that for the per-user algorithm selection task in recommender systems, the predictive power of user features is overwhelmingly dominant. While algorithm features improve selection precision, unlocking their potential to boost overall performance remains a non-trivial challenge.

[LG-18] Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models

链接: https://arxiv.org/abs/2509.20124
作者: Junjie Yao,Zhi-Qin John Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The embedding space of language models is widely believed to capture the semantic relationships; for instance, embeddings of digits often exhibit an ordered structure that corresponds to their natural sequence. However, the mechanisms driving the formation of such structures remain poorly understood. In this work, we interpret the embedding structures via the data distribution. We propose a set of probability signatures that reflect the semantic relationships among tokens. Through experiments on the composite addition tasks using the linear model and feedforward network, combined with theoretical analysis of gradient flow dynamics, we reveal that these probability signatures significantly influence the embedding structures. We further generalize our analysis to large language models (LLMs) by training the Qwen2.5 architecture on the subsets of the Pile corpus. Our results show that the probability signatures are faithfully aligned with the embedding structures, particularly in capturing strong pairwise similarities among embeddings. Our work uncovers the mechanism of how data distribution guides the formation of embedding structures, establishing a novel understanding of the relationship between embedding organization and semantic patterns.

[LG-19] Beyond Slaters Condition in Online CMDPs with Stochastic and Adversarial Constraints

链接: https://arxiv.org/abs/2509.20114
作者: Francesco Emanuele Stradi,Eleonora Fidelia Chiefari,Matteo Castiglioni,Alberto Marchesi,Nicola Gatti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study \emphonline episodic Constrained Markov Decision Processes (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, \emphi.e., when the constraints are sampled from fixed but unknown distributions, our method achieves \widetilde\mathcalO(\sqrtT) regret and constraint violation without relying on Slater’s condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of \emphpositive constraint violation, which does not allow to recover from large violation in the early episodes by playing strictly safe policies. In the adversarial regime, \emphi.e., when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater’s condition, and achieves sublinear \alpha -regret with respect to the \emphunconstrained optimum, where \alpha is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.

[LG-20] Incomplete Data Complete Dynamics: A Diffusion Approach

链接: https://arxiv.org/abs/2509.20098
作者: Zihan Zhou,Chenguang Wang,Hongyi Ye,Yongtao Guan,Tianshu Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning physical dynamics from data is a fundamental challenge in machine learning and scientific modeling. Real-world observational data are inherently incomplete and irregularly sampled, posing significant challenges for existing data-driven approaches. In this work, we propose a principled diffusion-based framework for learning physical systems from incomplete training samples. To this end, our method strategically partitions each such sample into observed context and unobserved query components through a carefully designed splitting strategy, then trains a conditional diffusion model to reconstruct the missing query portions given available contexts. This formulation enables accurate imputation across arbitrary observation patterns without requiring complete data supervision. Specifically, we provide theoretical analysis demonstrating that our diffusion training paradigm on incomplete data achieves asymptotic convergence to the true complete generative process under mild regularity conditions. Empirically, we show that our method significantly outperforms existing baselines on synthetic and real-world physical dynamics benchmarks, including fluid flows and weather systems, with particularly strong performance in limited and irregular observation regimes. These results demonstrate the effectiveness of our theoretically principled approach for learning and imputing partially observed dynamics.

[LG-21] You Only Measure Once: On Designing Single-Shot Quantum Machine Learning Models

链接: https://arxiv.org/abs/2509.20090
作者: Chen-Yu Liu,Leonardo Placidi,Kuan-Cheng Chen,Samuel Yen-Chi Chen,Gabriel Matos
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Quantum machine learning (QML) models conventionally rely on repeated measurements (shots) of observables to obtain reliable predictions. This dependence on large shot budgets leads to high inference cost and time overhead, which is particularly problematic as quantum hardware access is typically priced proportionally to the number of shots. In this work we propose You Only Measure Once (Yomo), a simple yet effective design that achieves accurate inference with dramatically fewer measurements, down to the single-shot regime. Yomo replaces Pauli expectation-value outputs with a probability aggregation mechanism and introduces loss functions that encourage sharp predictions. Our theoretical analysis shows that Yomo avoids the shot-scaling limitations inherent to expectation-based models, and our experiments on MNIST and CIFAR-10 confirm that Yomo consistently outperforms baselines across different shot budgets and under simulations with depolarizing channels. By enabling accurate single-shot inference, Yomo substantially reduces the financial and computational costs of deploying QML, thereby lowering the barrier to practical adoption of QML.

[LG-22] A Novel Short-Term Anomaly Prediction for IIoT with Software Defined Twin Network

链接: https://arxiv.org/abs/2509.20068
作者: Bilal Dalgic(1),Betul Sen(1),Muge Erel-Ozcevik(1) ((1) Manisa Celal Bayar University, Turkey)
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted by 2025 IEEE Globecom Workshops-TwinNetApp

点击查看摘要

Abstract:Secure monitoring and dynamic control in an IIoT environment are major requirements for current development goals. We believe that dynamic, secure monitoring of the IIoT environment can be achieved through integration with the Software-Defined Network (SDN) and Digital Twin (DT) paradigms. The current literature lacks implementation details for SDN-based DT and time-aware intelligent model training for short-term anomaly detection against IIoT threats. Therefore, we have proposed a novel framework for short-term anomaly detection that uses an SDN-based DT. Using a comprehensive dataset, time-aware labeling of features, and a comprehensive evaluation of various machine learning models, we propose a novel SD-TWIN-based anomaly detection algorithm. According to the performance of a new real-time SD-TWIN deployment, the GPU- accelerated LightGBM model is particularly effective, achieving a balance of high recall and strong classification performance.

[LG-23] he Syntax and Semantics of einsum

链接: https://arxiv.org/abs/2509.20020
作者: Maurice Wenig,Paul G. Rump,Mark Blacher,Joachim Giesen
类目: Programming Languages (cs.PL); Machine Learning (cs.LG); Mathematical Software (cs.MS); Symbolic Computation (cs.SC)
*备注: 21 pages, 1 figure. Includes formal definitions, proofs of algebraic properties, and nesting/denesting rules for the einsum notation

点击查看摘要

Abstract:In 2011, einsum was introduced to NumPy as a practical and convenient notation for tensor expressions in machine learning, quantum circuit simulation, and other fields. It has since been implemented in additional Python frameworks such as PyTorch and TensorFlow, as well as in other programming languages such as Julia. Despite its practical success, the einsum notation still lacks a solid theoretical basis, and is not unified across the different frameworks, limiting opportunities for formal reasoning and systematic optimization. In this work, we discuss the terminology of tensor expressions and provide a formal definition of the einsum language. Based on this definition, we formalize and prove important equivalence rules for tensor expressions and highlight their relevance in practical applications.

[LG-24] Learning Robust Penetration-Testing Policies under Partial Observability: A systematic evaluation

链接: https://arxiv.org/abs/2509.20008
作者: Raphael Simon,Pieter Libin,Wim Mees
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 27 pages, 8 figures

点击查看摘要

Abstract:Penetration testing, the simulation of cyberattacks to identify security vulnerabilities, presents a sequential decision-making problem well-suited for reinforcement learning (RL) automation. Like many applications of RL to real-world problems, partial observability presents a major challenge, as it invalidates the Markov property present in Markov Decision Processes (MDPs). Partially Observable MDPs require history aggregation or belief state estimation to learn successful policies. We investigate stochastic, partially observable penetration testing scenarios over host networks of varying size, aiming to better reflect real-world complexity through more challenging and representative benchmarks. This approach leads to the development of more robust and transferable policies, which are crucial for ensuring reliable performance across diverse and unpredictable real-world environments. Using vanilla Proximal Policy Optimization (PPO) as a baseline, we compare a selection of PPO variants designed to mitigate partial observability, including frame-stacking, augmenting observations with historical information, and employing recurrent or transformer-based architectures. We conduct a systematic empirical analysis of these algorithms across different host network sizes. We find that this task greatly benefits from history aggregation. Converging three times faster than other approaches. Manual inspection of the learned policies by the algorithms reveals clear distinctions and provides insights that go beyond quantitative results.

[LG-25] Pi-Transformer: A Physics-informed Attention Mechanism for Time Series Anomaly Detection

链接: https://arxiv.org/abs/2509.19985
作者: Sepehr Maleki,Negar Pourmoazemi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomalies in multivariate time series often arise from temporal context and cross-channel coordination rather than isolated outliers. We present Pi-Transformer, a physics-informed transformer with two attention pathways: a data-driven series attention and a smoothly evolving prior attention that encodes temporal invariants such as scale-related self-similarity and phase synchrony. The prior acts as a stable reference that calibrates reconstruction error. During training, we pair a reconstruction objective with a divergence term that encourages agreement between the two attentions while keeping them meaningfully distinct; the prior is regularised to evolve smoothly and is lightly distilled towards dataset-level statistics. At inference, the model combines an alignment-weighted reconstruction signal (Energy) with a mismatch signal that highlights timing and phase disruptions, and fuses them into a single score for detection. Across five benchmarks (SMD, MSL, SMAP, SWaT, and PSM), Pi-Transformer achieves state-of-the-art or highly competitive F1, with particular strength on timing and phase-breaking anomalies. Case analyses show complementary behaviour of the two streams and interpretable detections around regime changes. Embedding physics-informed priors into attention yields a calibrated and robust approach to anomaly detection in complex multivariate systems. Code is publicly available at this GitHub repository\footnotethis https URL.

[LG-26] RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis NEURIPS2025

链接: https://arxiv.org/abs/2509.19980
作者: Haolin Li,Tianjie Dai,Zhe Chen,Siyuan Du,Jiangchao Yao,Ya Zhang,Yanfeng Wang
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and the dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD’s generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at this https URL.

[LG-27] Faster Than SVD Smarter Than SGD: The OPLoRA Alternating Update

链接: https://arxiv.org/abs/2509.19977
作者: Abdulla Jasem Almansoori,Maria Ivanova,Andrey Veprikov,Aleksandr Beznosikov,Samuel Horváth,Martin Takáč
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 1 table. Accepted to OPT 2025 Workshop

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. However, there is still a gap between full training with low-rank projections (SVDLoRA) and LoRA fine-tuning, indicating that LoRA steps can be further improved. In this study, we propose OPLoRA, a memory-efficient optimizer that closes this gap by casting LoRA optimization as an interpretable sub-problem and solving it efficiently with alternating least squares updates, where 1-2 alternating steps are empirically found to be sufficient to closely match truncated SVD without ever forming the full matrix. We also retrieve the recently proposed preconditioning methods for LoRA as a special case. OPLoRA supports momentum by maintaining a low-rank estimate using the same subroutine (LoRSum) for computing the step, with a memory budget of 3 times the number of LoRA parameters (i.e., same as Adam). We also propose an experimental scaled variant that uses the K-FAC metric, which could be of interest. Across a linear task, MNIST, CIFAR-100, and RoBERTa-base (MNLI), OPLoRA consistently approaches SVDLoRA’s performance using significantly less memory.

[LG-28] From Samples to Scenarios: A New Paradigm for Probabilistic Forecasting

链接: https://arxiv.org/abs/2509.19975
作者: Xilin Dai,Zhijian Xu,Wanxu Cai,Qiang Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most state-of-the-art probabilistic time series forecasting models rely on sampling to represent future uncertainty. However, this paradigm suffers from inherent limitations, such as lacking explicit probabilities, inadequate coverage, and high computational costs. In this work, we introduce \textbfProbabilistic Scenarios, an alternative paradigm designed to address the limitations of sampling. It operates by directly producing a finite set of \Scenario, Probability\ pairs, thus avoiding Monte Carlo-like approximation. To validate this paradigm, we propose \textbfTimePrism, a simple model composed of only three parallel linear layers. Surprisingly, TimePrism achieves 9 out of 10 state-of-the-art results across five benchmark datasets on two metrics. The effectiveness of our paradigm comes from a fundamental reframing of the learning objective. Instead of modeling an entire continuous probability space, the model learns to represent a set of plausible scenarios and corresponding probabilities. Our work demonstrates the potential of the Probabilistic Scenarios paradigm, opening a promising research direction in forecasting beyond sampling.

[LG-29] Learnable Sampler Distillation for Discrete Diffusion Models NEURIPS2025

链接: https://arxiv.org/abs/2509.19962
作者: Feiyang Fu,Tongxian Guo,Zhaoqiang Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Discrete diffusion models (DDMs) have shown powerful generation ability for discrete data modalities like text and molecules. However, their practical application is hindered by inefficient sampling, requiring a large number of sampling steps. Accelerating DDMs by using larger step sizes typically introduces significant problems in generation quality, as it amplifies the impact of both the compounding decoding error due to factorized predictions and discretization error from numerical approximations, leading to a significant decrease in sampling quality. To address these challenges, we propose learnable sampler distillation (LSD), a novel approach to train fast and high-fidelity samplers for DDMs. LSD employs a distillation approach where a student sampler with a few steps learns to align its intermediate score trajectory with that of a high-quality teacher sampler with numerous steps. This alignment is achieved by optimizing learnable sampler coefficients that adaptively adjust sampling dynamics. Additionally, we further propose LSD+, which also learns time schedules that allocate steps non-uniformly. Experiments across text generation, image generation, and synthetic tasks demonstrate that our proposed approaches outperform existing samplers for DDMs, achieving substantially higher sampling quality with significantly fewer sampling steps. Our code is available at \hrefthis https URLthis https URL.

[LG-30] How deep is your network? Deep vs. shallow learning of transfer operators

链接: https://arxiv.org/abs/2509.19930
作者: Mohammad Tabish,Benedict Leimkuhler,Stefan Klus
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a randomized neural network approach called RaNNDy for learning transfer operators and their spectral decompositions from data. The weights of the hidden layers of the neural network are randomly selected and only the output layer is trained. The main advantage is that without a noticeable reduction in accuracy, this approach significantly reduces the training time and resources while avoiding common problems associated with deep learning such as sensitivity to hyperparameters and slow convergence. Additionally, the proposed framework allows us to compute a closed-form solution for the output layer which directly represents the eigenfunctions of the operator. Moreover, it is possible to estimate uncertainties associated with the computed spectral properties via ensemble learning. We present results for different dynamical operators, including Koopman and Perron-Frobenius operators, which have important applications in analyzing the behavior of complex dynamical systems, and the Schrödinger operator. The numerical examples, which highlight the strengths but also weaknesses of the proposed framework, include several stochastic dynamical systems, protein folding processes, and the quantum harmonic oscillator.

[LG-31] MMSE-Calibrated Few-Shot Prompting for Alzheimers Detection

链接: https://arxiv.org/abs/2509.19926
作者: Jana Sweidan,Mounim A. El-Yacoubi,Nasredine Semmar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prompting large language models is a training-free method for detecting Alzheimer’s disease from speech transcripts. Using the ADReSS dataset, we revisit zero-shot prompting and study few-shot prompting with a class-balanced protocol using nested interleave and a strict schema, sweeping up to 20 examples per class. We evaluate two variants achieving state-of-the-art prompting results. (i) MMSE-Proxy Prompting: each few-shot example carries a probability anchored to Mini-Mental State Examination bands via a deterministic mapping, enabling AUC computing; this reaches 0.82 accuracy and 0.86 AUC (ii) Reasoning-augmented Prompting: few-shot examples pool is generated with a multimodal LLM (GPT-5) that takes as input the Cookie Theft image, transcript, and MMSE to output a reasoning and MMSE-aligned probability; evaluation remains transcript-only and reaches 0.82 accuracy and 0.83 AUC. To our knowledge, this is the first ADReSS study to anchor elicited probabilities to MMSE and to use multimodal construction to improve interpretability.

[LG-32] On the Frag ility of Contribution Score Computation in Federated Learning

链接: https://arxiv.org/abs/2509.19921
作者: Balazs Pejo,Marcell Frank,Krisztian Varga,Peter Veliczky
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:This paper investigates the fragility of contribution evaluation in federated learning, a critical mechanism for ensuring fairness and incentivizing participation. We argue that contribution scores are susceptible to significant distortions from two fundamental perspectives: architectural sensitivity and intentional manipulation. First, we explore how different model aggregation methods impact these scores. While most research assumes a basic averaging approach, we demonstrate that advanced techniques, including those designed to handle unreliable or diverse clients, can unintentionally yet significantly alter the final scores. Second, we explore vulnerabilities posed by poisoning attacks, where malicious participants strategically manipulate their model updates to inflate their own contribution scores or reduce the importance of other participants. Through extensive experiments across diverse datasets and model architectures, implemented within the Flower framework, we rigorously show that both the choice of aggregation method and the presence of attackers are potent vectors for distorting contribution scores, highlighting a critical need for more robust evaluation schemes.

[LG-33] Latent Iterative Refinement Flow: A Geometric-Constrained Approach for Few-Shot Generation

链接: https://arxiv.org/abs/2509.19903
作者: Songtao Li,Zhenyu Liao,Tianqi Hou,Ting Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Few-shot generation, the synthesis of high-quality and diverse samples from limited training data, remains a significant challenge in generative modeling. Existing methods trained from scratch often fail to overcome overfitting and mode collapse, and fine-tuning large models can inherit biases while neglecting the crucial geometric structure of the latent space. To address these limitations, we introduce Latent Iterative Refinement Flow (LIRF), a novel approach that reframes few-shot generation as the progressive densification of geometrically structured manifold. LIRF establishes a stable latent space using an autoencoder trained with our novel \textbfmanifold-preservation loss L_\textmanifold . This loss ensures that the latent space maintains the geometric and semantic correspondence of the input data. Building on this, we propose an iterative generate-correct-augment cycle. Within this cycle, candidate samples are refined by a geometric \textbfcorrection operator, a provably contractive mapping that pulls samples toward the data manifold while preserving diversity. We also provide the \textbfConvergence Theorem demonstrating a predictable decrease in Hausdorff distance between generated and true data manifold. We also demonstrate the framework’s scalability by generating coherent, high-resolution images on AFHQ-Cat. Ablation studies confirm that both the manifold-preserving latent space and the contractive correction mechanism are critical components of this success. Ultimately, LIRF provides a solution for data-scarce generative modeling that is not only theoretically grounded but also highly effective in practice.

[LG-34] Pure Exploration via Frank-Wolfe Self-Play

链接: https://arxiv.org/abs/2509.19901
作者: Xinyu Liu,Chao Qin,Wei You
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study pure exploration in structured stochastic multi-armed bandits, aiming to efficiently identify the correct hypothesis from a finite set of alternatives. For a broad class of tasks, asymptotic analyses reduce to a maximin optimization that admits a two-player zero-sum game interpretation between an experimenter and a skeptic: the experimenter allocates measurements to rule out alternatives while the skeptic proposes alternatives. We reformulate the game by allowing the skeptic to adopt a mixed strategy, yielding a concave-convex saddle-point problem. This viewpoint leads to Frank-Wolfe Self-Play (FWSP): a projection-free, regularization-free, tuning-free method whose one-hot updates on both sides match the bandit sampling paradigm. However, structural constraints introduce sharp pathologies that complicate algorithm design and analysis: our linear-bandit case study exhibits nonunique optima, optimal designs with zero mass on the best arm, bilinear objectives, and nonsmoothness at the boundary. We address these challenges via a differential-inclusion argument, proving convergence of the game value for best-arm identification in linear bandits. Our analysis proceeds through a continuous-time limit: a differential inclusion with a Lyapunov function that decays exponentially, implying a vanishing duality gap and convergence to the optimal value. Although Lyapunov analysis requires differentiability of the objective, which is not guaranteed on the boundary, we show that along continuous trajectories the algorithm steers away from pathological nonsmooth points and achieves uniform global convergence to the optimal game value. We then embed the discrete-time updates into a perturbed flow and show that the discrete game value also converges. Building on FWSP, we further propose a learning algorithm based on posterior sampling. Numerical experiments demonstrate a vanishing duality gap.

[LG-35] MCGrad:: Multicalibration at Web Scale

链接: https://arxiv.org/abs/2509.19884
作者: Lorenzo Perini,Daniel Haimovich,Fridolin Linder,Niek Tax,Dima Karamshuk,Milan Vojnovic,Nastaran Okati,Pavlos Athanasios Apostolopoulos
类目: Machine Learning (cs.LG)
*备注: Under submission

点击查看摘要

Abstract:We propose MCGrad, a novel and scalable multicalibration algorithm. Multicalibration - calibration in sub-groups of the data - is an important property for the performance of machine learning-based systems. Existing multicalibration methods have thus far received limited traction in industry. We argue that this is because existing methods (1) require such subgroups to be manually specified, which ML practitioners often struggle with, (2) are not scalable, or (3) may harm other notions of model performance such as log loss and Area Under the Precision-Recall Curve (PRAUC). MCGrad does not require explicit specification of protected groups, is scalable, and often improves other ML evaluation metrics instead of harming them. MCGrad has been in production at Meta, and is now part of hundreds of production models. We present results from these deployments as well as results on public datasets.

[LG-36] Modeling and Control of Deep Sign-Definite Dynamics with Application to Hybrid Powertrain Control

链接: https://arxiv.org/abs/2509.19869
作者: Teruki Kato,Ryotaro Shima,Kenji Kashima
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Submitted to Automatica

点击查看摘要

Abstract:Deep learning is increasingly used for complex, large-scale systems where first-principles modeling is difficult. However, standard deep learning models often fail to enforce physical structure or preserve convexity in downstream control, leading to physically inconsistent predictions and discontinuous inputs owing to nonconvexity. We introduce sign constraints–sign restrictions on Jacobian entries–that unify monotonicity, positivity, and sign-definiteness; additionally, we develop model-construction methods that enforce them, together with a control-synthesis procedure. In particular, we design exactly linearizable deep models satisfying these constraints and formulate model predictive control as a convex quadratic program, which yields a unique optimizer and a Lipschitz continuous control law. On a two-tank system and a hybrid powertrain, the proposed approach improves prediction accuracy and produces smoother control inputs than existing methods.

[LG-37] Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach

链接: https://arxiv.org/abs/2509.19856
作者: Samir Brahim Belhaouari,Yunis Carreon Kahalan,Humaira Shaffique,Ismael Belhaouari,Ashhadul Islam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The effectiveness of machine learning models, particularly in unbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10% on 96% of the datasets, whereas our core-aware reduction method compresses datasets up to 90% while preserving their accuracy, making it 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion, driving the next generation of AI advancements. Our code is available as a Python package at this https URL .

[LG-38] BoreaRL: A Multi-Objective Reinforcement Learning Environment for Climate-Adaptive Boreal Forest Management

链接: https://arxiv.org/abs/2509.19846
作者: Kevin Bradley Dsouza,Enoch Ofosu,Daniel Chukwuemeka Amaogu,Jérôme Pigeon,Richard Boudreault,Pooneh Maghoul,Juan Moreno-Cruz,Yuri Leonenko
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Boreal forests store 30-40% of terrestrial carbon, much in climate-vulnerable permafrost soils, making their management critical for climate mitigation. However, optimizing forest management for both carbon sequestration and permafrost preservation presents complex trade-offs that current tools cannot adequately address. We introduce \textbfBoreaRL , the first multi-objective reinforcement learning environment for climate-adaptive boreal forest management, featuring a physically-grounded simulator of coupled energy, carbon, and water fluxes. BoreaRL supports two training paradigms: site-specific mode for controlled studies and generalist mode for learning robust policies under environmental stochasticity. Through evaluation of multi-objective RL algorithms, we reveal a fundamental asymmetry in learning difficulty: carbon objectives are significantly easier to optimize than thaw (permafrost preservation) objectives, with thaw-focused policies showing minimal learning progress across both paradigms. In generalist settings, standard preference-conditioned approaches fail entirely, while a naive curriculum learning approach achieves superior performance by strategically selecting training episodes. Analysis of learned strategies reveals distinct management philosophies, where carbon-focused policies favor aggressive high-density coniferous stands, while effective multi-objective policies balance species composition and density to protect permafrost while maintaining carbon gains. Our results demonstrate that robust climate-adaptive forest management remains challenging for current MORL methods, establishing BoreaRL as a valuable benchmark for developing more effective approaches. We open-source BoreaRL to accelerate research in multi-objective RL for climate applications.

[LG-39] An Efficient Conditional Score-based Filter for High Dimensional Nonlinear Filtering Problems

链接: https://arxiv.org/abs/2509.19816
作者: Zhijun Zeng,Weiye Gan,Junqing Chen,Zuoqiang Shi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many engineering and applied science domains, high-dimensional nonlinear filtering is still a challenging problem. Recent advances in score-based diffusion models offer a promising alternative for posterior sampling but require repeated retraining to track evolving priors, which is impractical in high dimensions. In this work, we propose the Conditional Score-based Filter (CSF), a novel algorithm that leverages a set-transformer encoder and a conditional diffusion model to achieve efficient and accurate posterior sampling without retraining. By decoupling prior modeling and posterior sampling into offline and online stages, CSF enables scalable score-based filtering across diverse nonlinear systems. Extensive experiments on benchmark problems show that CSF achieves superior accuracy, robustness, and efficiency across diverse nonlinear filtering scenarios.

[LG-40] Faster Smaller and Smarter: Task-Aware Expert Merging for Online MoE Inference

链接: https://arxiv.org/abs/2509.19781
作者: Ziyi Han,Xutong Liu,Ruiting Zhou,Xiangxiang Dai,John C.S. Lui
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for \textitonline inference remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during the online inference, task information is often unavailable, making the task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, \textttTanbr, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, \textttTanbr estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, \textttTanbr employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and decides optimal expert merging. We prove that \textttTanbr achieves a sublinear regret bound of \small \mathcalO(\sqrtT \log(T)) over \small T rounds, despite operating over a continuous decision space, matching regret bounds compared to existing methods. Extensive experiments show that \textttTanbr reduces inference latency by at least \small 45% and memory usage by up to \small 25% , while maintaining a high accuracy compared to many state-of-the-art methods.

[LG-41] Formal Safety Verification and Refinement for Generative Motion Planners via Certified Local Stabilization

链接: https://arxiv.org/abs/2509.19688
作者: Devesh Nath,Haoran Yin,Glen Chou
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 10 pages, 12 figures

点击查看摘要

Abstract:We present a method for formal safety verification of learning-based generative motion planners. Generative motion planners (GMPs) offer advantages over traditional planners, but verifying the safety and dynamic feasibility of their outputs is difficult since neural network verification (NNV) tools scale only to a few hundred neurons, while GMPs often contain millions. To preserve GMP expressiveness while enabling verification, our key insight is to imitate the GMP by stabilizing references sampled from the GMP with a small neural tracking controller and then applying NNV to the closed-loop dynamics. This yields reachable sets that rigorously certify closed-loop safety, while the controller enforces dynamic feasibility. Building on this, we construct a library of verified GMP references and deploy them online in a way that imitates the original GMP distribution whenever it is safe to do so, improving safety without retraining. We evaluate across diverse planners, including diffusion, flow matching, and vision-language models, improving safety in simulation (on ground robots and quadcopters) and on hardware (differential-drive robot).

[LG-42] Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context

链接: https://arxiv.org/abs/2509.19671
作者: Andrew Wang,Jiashuo Zhang,Michael Oberst
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Public healthcare datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing computer vision models in healthcare. However, strong average-case performance of machine learning (ML) models on these datasets is insufficient to certify their clinical utility. In this paper, we use clinical context, as captured by prior discharge summaries, to provide a more holistic evaluation of current state-of-the-art'' models for the task of CXR diagnosis. Using discharge summaries recorded prior to each CXR, we derive a prior’’ or ``pre-test’’ probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. Using this measure, we demonstrate two key findings: First, for several diagnostic labels, CXR models tend to perform best on cases where the pre-test probability is very low, and substantially worse on cases where the pre-test probability is higher. Second, we use pre-test probability to assess whether strong average-case performance reflects true diagnostic signal, rather than an ability to infer the pre-test probability as a shortcut. We find that performance drops sharply on a balanced test set where this shortcut does not exist, which may indicate that much of the apparent diagnostic power derives from inferring this clinical context. We argue that this style of analysis, using context derived from clinical notes, is a promising direction for more rigorous and fine-grained evaluation of clinical vision models.

[LG-43] Consistent Estimation of Numerical Distributions under Local Differential Privacy by Wavelet Expansion

链接: https://arxiv.org/abs/2509.19661
作者: Puning Zhao,Zhikun Zhang,Bo Sun,Li Shen,Liang Zhang,Shaowei Wang,Zhe Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distribution estimation under local differential privacy (LDP) is a fundamental and challenging task. Significant progresses have been made on categorical data. However, due to different evaluation metrics, these methods do not work well when transferred to numerical data. In particular, we need to prevent the probability mass from being misplaced far away. In this paper, we propose a new approach that express the sample distribution using wavelet expansions. The coefficients of wavelet series are estimated under LDP. Our method prioritizes the estimation of low-order coefficients, in order to ensure accurate estimation at macroscopic level. Therefore, the probability mass is prevented from being misplaced too far away from its ground truth. We establish theoretical guarantees for our methods. Experiments show that our wavelet expansion method significantly outperforms existing solutions under Wasserstein and KS distances.

[LG-44] Symbol-Temporal Consistency Self-supervised Learning for Robust Time Series Classification

链接: https://arxiv.org/abs/2509.19654
作者: Kevin Garcia,Cassandra Garza,Brooklyn Berry,Yifeng Gao
类目: Machine Learning (cs.LG)
*备注: 4 pages, 2 figures, IEEE-EMBS BSN 2025

点击查看摘要

Abstract:The surge in the significance of time series in digital health domains necessitates advanced methodologies for extracting meaningful patterns and representations. Self-supervised contrastive learning has emerged as a promising approach for learning directly from raw data. However, time series data in digital health is known to be highly noisy, inherently involves concept drifting, and poses a challenge for training a generalizable deep learning model. In this paper, we specifically focus on data distribution shift caused by different human behaviors and propose a self-supervised learning framework that is aware of the bag-of-symbol representation. The bag-of-symbol representation is known for its insensitivity to data warping, location shifts, and noise existed in time series data, making it potentially pivotal in guiding deep learning to acquire a representation resistant to such data shifting. We demonstrate that the proposed method can achieve significantly better performance where significant data shifting exists.

[LG-45] oward Scalable and Structured Global Station Weather Forecasting

链接: https://arxiv.org/abs/2509.19648
作者: Hongyi Chen,Xiucheng Li,Xinyang Chen,Yun Cheng,Jing Li,Kehai Chen,Liqiang Nie
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Global Station Weather Forecasting (GSWF) is a key meteorological research area, critical to energy, aviation, and agriculture. Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the intrinsic nature underlying observations of the global weather system, limiting forecast performance. To address this, we propose a novel Spatial Structured Attention Block in this paper. It partitions the spatial graph into a set of subgraphs and instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention – considering both spatial proximity and global correlation. Building on this block, we develop a multiscale spatiotemporal forecasting model by progressively expanding subgraph scales. The resulting model is both scalable and able to produce structured spatial correlation, and meanwhile, it is easy to implement. The experimental results show that it can achieve performance improvements up to 16.8% over time series forecasting baselines at low running costs.

[LG-46] Adaptive von Mises-Fisher Likelihood Loss for Supervised Deep Time Series Hashing ICML

链接: https://arxiv.org/abs/2509.19625
作者: Juan Manuel Perez,Kevin Garcia,Brooklyn Berry,Dongjin Song,Yifeng Gao
类目: Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, Conference: ICMLA 2025

点击查看摘要

Abstract:Indexing time series by creating compact binary representations is a fundamental task in time series data mining. Recently, deep learning-based hashing methods have proven effective for indexing time series based on semantic meaning rather than just raw similarity. The purpose of deep hashing is to map samples with the same semantic meaning to identical binary hash codes, enabling more efficient search and retrieval. Unlike other supervised representation learning methods, supervised deep hashing requires a discretization step to convert real-valued representations into binary codes, but this can induce significant information loss. In this paper, we propose a von Mises-Fisher (vMF) hashing loss. The proposed deep hashing model maps data to an M-dimensional hyperspherical space to effectively reduce information loss and models each data class as points following distinct vMF distributions. The designed loss aims to maximize the separation between each modeled vMF distribution to provide a better way to maximize the margin between each semantically different data sample. Experimental results show that our method outperforms existing baselines. The implementation is publicly available at this https URL

[LG-47] Improved Therapeutic Antibody Reformatting through Multimodal Machine Learning NEURIPS2025

链接: https://arxiv.org/abs/2509.19604
作者: Jiayi Xin,Aniruddh Raghu,Nick Bhattacharya,Adam Carr,Melanie Montgomery,Hunter Elliott
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025 AI4Science Workshop and NeurIPS 2025 Multi-modal Foundation Models and Large Language Models for Life Sciences Workshop

点击查看摘要

Abstract:Modern therapeutic antibody design often involves composing multi-part assemblages of individual functional domains, each of which may be derived from a different source or engineered independently. While these complex formats can expand disease applicability and improve safety, they present a significant engineering challenge: the function and stability of individual domains are not guaranteed in the novel format, and the entire molecule may no longer be synthesizable. To address these challenges, we develop a machine learning framework to predict “reformatting success” – whether converting an antibody from one format to another will succeed or not. Our framework incorporates both antibody sequence and structural context, incorporating an evaluation protocol that reflects realistic deployment scenarios. In experiments on a real-world antibody reformatting dataset, we find the surprising result that large pretrained protein language models (PLMs) fail to outperform simple, domain-tailored, multimodal representations. This is particularly evident in the most difficult evaluation setting, where we test model generalization to a new starting antibody. In this challenging “new antibody, no data” scenario, our best multimodal model achieves high predictive accuracy, enabling prioritization of promising candidates and reducing wasted experimental effort.

[LG-48] Modular Machine Learning with Applications to Genetic Circuit Composition

链接: https://arxiv.org/abs/2509.19601
作者: Jichi Wang,Eduardo D. Sontag,Domitilla Del Vecchio
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In several applications, including in synthetic biology, one often has input/output data on a system composed of many modules, and although the modules’ input/output functions and signals may be unknown, knowledge of the composition architecture can significantly reduce the amount of training data required to learn the system’s input/output mapping. Learning the modules’ input/output functions is also necessary for designing new systems from different composition architectures. Here, we propose a modular learning framework, which incorporates prior knowledge of the system’s compositional structure to (a) identify the composing modules’ input/output functions from the system’s input/output data and (b) achieve this by using a reduced amount of data compared to what would be required without knowledge of the compositional structure. To achieve this, we introduce the notion of modular identifiability, which allows recovery of modules’ input/output functions from a subset of the system’s input/output data, and provide theoretical guarantees on a class of systems motivated by genetic circuits. We demonstrate the theory on computational studies showing that a neural network (NNET) that accounts for the compositional structure can learn the composing modules’ input/output functions and predict the system’s output on inputs outside of the training set distribution. By contrast, a neural network that is agnostic of the structure is unable to predict on inputs that fall outside of the training set distribution. By reducing the need for experimental data and allowing module identification, this framework offers the potential to ease the design of synthetic biological circuits and of multi-module systems more generally.

[LG-49] AnySafe: Adapting Latent Safety Filters at Runtime via Safety Constraint Parameterization in the Latent Space

链接: https://arxiv.org/abs/2509.19555
作者: Sankalp Agrawal,Junwon Seo,Kensuke Nakamura,Ran Tian,Andrea Bajcsy
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works have shown that foundational safe control methods, such as Hamilton-Jacobi (HJ) reachability analysis, can be applied in the latent space of world models. While this enables the synthesis of latent safety filters for hard-to-model vision-based tasks, they assume that the safety constraint is known a priori and remains fixed during deployment, limiting the safety filter’s adaptability across scenarios. To address this, we propose constraint-parameterized latent safety filters that can adapt to user-specified safety constraints at runtime. Our key idea is to define safety constraints by conditioning on an encoding of an image that represents a constraint, using a latent-space similarity measure. The notion of similarity to failure is aligned in a principled way through conformal calibration, which controls how closely the system may approach the constraint representation. The parameterized safety filter is trained entirely within the world model’s imagination, treating any image seen by the model as a potential test-time constraint, thereby enabling runtime adaptation to arbitrary safety constraints. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our method adapts at runtime by conditioning on the encoding of user-specified constraint images, without sacrificing performance. Video results can be found on this https URL

[LG-50] Metriplectic Conditional Flow Matching for Dissipative Dynamics

链接: https://arxiv.org/abs/2509.19526
作者: Ali Baheri,Lars Lindemann
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. In inference, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous and discrete time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.

[LG-51] Frame-based Equivariant Diffusion Models for 3D Molecular Generation

链接: https://arxiv.org/abs/2509.19506
作者: Mohan Guo(Faculty of Science, University of Amsterdam),Cong Liu(AMLab, University of Amsterdam),Patrick Forré(AMLab, University of Amsterdam)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent methods for molecular generation face a trade-off: they either enforce strict equivariance with costly architectures or relax it to gain scalability and flexibility. We propose a frame-based diffusion paradigm that achieves deterministic E(3)-equivariance while decoupling symmetry handling from the backbone. Building on this paradigm, we investigate three variants: Global Frame Diffusion (GFD), which assigns a shared molecular frame; Local Frame Diffusion (LFD), which constructs node-specific frames and benefits from additional alignment constraints; and Invariant Frame Diffusion (IFD), which relies on pre-canonicalized invariant representations. To enhance expressivity, we further utilize EdgeDiT, a Diffusion Transformer with edge-aware attention. On the QM9 dataset, GFD with EdgeDiT achieves state-of-the-art performance, with a test NLL of -137.97 at standard scale and -141.85 at double scale, alongside atom stability of 98.98%, and molecular stability of 90.51%. These results surpass all equivariant baselines while maintaining high validity and uniqueness and nearly 2x faster sampling compared to EDM. Altogether, our study establishes frame-based diffusion as a scalable, flexible, and physically grounded paradigm for molecular generation, highlighting the critical role of global structure preservation. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2509.19506 [cs.LG] (or arXiv:2509.19506v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.19506 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-52] Constraint-Reduced MILP with Local Outlier Factor Modeling for Plausible Counterfactual Explanations in Credit Approval

链接: https://arxiv.org/abs/2509.19504
作者: Trung Nguyen Thanh,Huyen Giang Thi Thu,Tai Le Quy,Ha-Bang Ban
类目: Machine Learning (cs.LG)
*备注: Accepted to NICE-TEAS ASIA 2025 conference

点击查看摘要

Abstract:Counterfactual explanation (CE) is a widely used post-hoc method that provides individuals with actionable changes to alter an unfavorable prediction from a machine learning model. Plausible CE methods improve realism by considering data distribution characteristics, but their optimization models introduce a large number of constraints, leading to high computational cost. In this work, we revisit the DACE framework and propose a refined Mixed-Integer Linear Programming (MILP) formulation that significantly reduces the number of constraints in the local outlier factor (LOF) objective component. We also apply the method to a linear SVM classifier with standard scaler. The experimental results show that our approach achieves faster solving times while maintaining explanation quality. These results demonstrate the promise of more efficient LOF modeling in counterfactual explanation and data science applications.

[LG-53] OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

链接: https://arxiv.org/abs/2509.19480
作者: Noriaki Hirose,Catherine Glossop,Dhruv Shah,Sergey Levine
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 9 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models. We present videos showcasing OmniVLA performance and will release its checkpoints and training code on our project page.

[LG-54] ransformer Modeling for Both Scalability and Performance in Multivariate Time Series

链接: https://arxiv.org/abs/2509.19471
作者: Hunjae Lee,Corey Clark
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variable count is among the main scalability bottlenecks for transformer modeling in multivariate time series (MTS) data. On top of this, a growing consensus in the field points to indiscriminate inter-variable mixing as a potential source of noise-accumulation and performance degradation. This is likely exacerbated by sparsity of informative signals characteristic of many MTS systems coupled with representational misalignment stemming from indiscriminate information mixing between (heterogeneous) variables. While scalability and performance are often seen as competing interests in transformer design, we show that both can be improved simultaneously in MTS by strategically constraining the representational capacity of inter-variable mixing. Our proposed method, transformer with Delegate Token Attention (DELTAformer), constrains inter-variable modeling through what we call delegate tokens which are then used to perform full, unconstrained, inter-temporal modeling. Delegate tokens act as an implicit regularizer that forces the model to be highly selective about what inter-variable information is allowed to propagate through the network. Our results show that DELTAformer scales linearly with variable-count while actually outperforming standard transformers, achieving state-of-the-art performance across benchmarks and baselines. In addition, DELTAformer can focus on relevant signals better than standard transformers in noisy MTS environments and overall exhibit superior noise-resilience. Overall, results across various experiments confirm that by aligning our model design to leverage domain-specific challenges in MTS to our advantage, DELTAformer can simultaneously achieve linear scaling while actually improving its performance against standard, quadratic transformers.

[LG-55] HINNs: Thermodynamically Informed Neural Networks

链接: https://arxiv.org/abs/2509.19467
作者: Javier Castro,Benjamin Gess
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) are a class of deep learning models aiming to approximate solutions of PDEs by training neural networks to minimize the residual of the equation. Focusing on non-equilibrium fluctuating systems, we propose a physically informed choice of penalization that is consistent with the underlying fluctuation structure, as characterized by a large deviations principle. This approach yields a novel formulation of PINNs in which the penalty term is chosen to penalize improbable deviations, rather than being selected heuristically. The resulting thermodynamically consistent extension of PINNs, termed THINNs, is subsequently analyzed by establishing analytical a posteriori estimates, and providing empirical comparisons to established penalization strategies.

[LG-56] Analyzing Uncertainty Quantification in Statistical and Deep Learning Models for Probabilistic Electricity Price Forecasting

链接: https://arxiv.org/abs/2509.19417
作者: Andreas Lebedev,Abhinav Das,Sven Pappert,Stephan Schlüter
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Precise probabilistic forecasts are fundamental for energy risk management, and there is a wide range of both statistical and machine learning models for this purpose. Inherent to these probabilistic models is some form of uncertainty quantification. However, most models do not capture the full extent of uncertainty, which arises not only from the data itself but also from model and distributional choices. In this study, we examine uncertainty quantification in state-of-the-art statistical and deep learning probabilistic forecasting models for electricity price forecasting in the German market. In particular, we consider deep distributional neural networks (DDNNs) and augment them with an ensemble approach, Monte Carlo (MC) dropout, and conformal prediction to account for model uncertainty. Additionally, we consider the LASSO-estimated autoregressive (LEAR) approach combined with quantile regression averaging (QRA), generalized autoregressive conditional heteroskedasticity (GARCH), and conformal prediction. Across a range of performance metrics, we find that the LEAR-based models perform well in terms of probabilistic forecasting, irrespective of the uncertainty quantification method. Furthermore, we find that DDNNs benefit from incorporating both data and model uncertainty, improving both point and probabilistic forecasting. Uncertainty itself appears to be best captured by the models using conformal prediction. Overall, our extensive study shows that all models under consideration perform competitively. However, their relative performance depends on the choice of metrics for point and probabilistic forecasting.

[LG-57] Poster: ChatIYP: Enabling Natural Language Access to the Internet Yellow Pages Database

链接: https://arxiv.org/abs/2509.19411
作者: Vasilis Andritsoudis,Pavlos Sermpezis,Ilias Dimitriadis,Athena Vakali
类目: Networking and Internet Architecture (cs.NI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: ACM Internet Measurement Conference (IMC) 2025

点击查看摘要

Abstract:The Internet Yellow Pages (IYP) aggregates information from multiple sources about Internet routing into a unified, graph-based knowledge base. However, querying it requires knowledge of the Cypher language and the exact IYP schema, thus limiting usability for non-experts. In this paper, we propose ChatIYP, a domain-specific Retrieval-Augmented Generation (RAG) system that enables users to query IYP through natural language questions. Our evaluation demonstrates solid performance on simple queries, as well as directions for improvement, and provides insights for selecting evaluation metrics that are better fit for IYP querying AI agents.

[LG-58] Enhancing Credit Default Prediction Using Boruta Feature Selection and DBSCAN Algorithm with Different Resampling Techniques

链接: https://arxiv.org/abs/2509.19408
作者: Obu-Amoah Ampomah,Edmund Agyemang,Kofi Acheampong,Louis Agyekum
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 16 pages, 8 figures and 5 tables

点击查看摘要

Abstract:This study examines credit default prediction by comparing three techniques, namely SMOTE, SMOTE-Tomek, and ADASYN, that are commonly used to address the class imbalance problem in credit default situations. Recognizing that credit default datasets are typically skewed, with defaulters comprising a much smaller proportion than non-defaulters, we began our analysis by evaluating machine learning (ML) models on the imbalanced data without any resampling to establish baseline performance. These baseline results provide a reference point for understanding the impact of subsequent balancing methods. In addition to traditional classifiers such as Naive Bayes and K-Nearest Neighbors (KNN), our study also explores the suitability of advanced ensemble boosting algorithms, including Extreme Gradient Boosting (XGBoost), AdaBoost, Gradient Boosting Machines (GBM), and Light GBM for credit default prediction using Boruta feature selection and DBSCAN-based outlier detection, both before and after resampling. A real-world credit default data set sourced from the University of Cleveland ML Repository was used to build ML classifiers, and their performances were tested. The criteria chosen to measure model performance are the area under the receiver operating characteristic curve (ROC-AUC), area under the precision-recall curve (PR-AUC), G-mean, and F1-scores. The results from this empirical study indicate that the Boruta+DBSCAN+SMOTE-Tomek+GBM classifier outperformed the other ML models (F1-score: 82.56%, G-mean: 82.98%, ROC-AUC: 90.90%, PR-AUC: 91.85%) in a credit default context. The findings establish a foundation for future progress in creating more resilient and adaptive credit default systems, which will be essential as credit-based transactions continue to rise worldwide.

[LG-59] Statistical Inference Leverag ing Synthetic Data with Distribution-Free Guarantees

链接: https://arxiv.org/abs/2509.20345
作者: Meshi Bashari,Yonghoon Lee,Roy Maor Lotan,Edgar Dobriban,Yaniv Romano
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The rapid proliferation of high-quality synthetic data – generated by advanced AI models or collected as auxiliary data from related tasks – presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data when synthetic data is of low quality. The error of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction, and comparing large reasoning models on complex math problems.

[LG-60] Deep learning for exoplanet detection and characterization by direct imaging at high contrast

链接: https://arxiv.org/abs/2509.20310
作者: Théo Bodrito,Olivier Flasseur,Julien Mairal,Jean Ponce,Maud Langlois,Anne-Marie Lagrange
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: SF2A 2025

点击查看摘要

Abstract:Exoplanet imaging is a major challenge in astrophysics due to the need for high angular resolution and high contrast. We present a multi-scale statistical model for the nuisance component corrupting multivariate image series at high contrast. Integrated into a learnable architecture, it leverages the physics of the problem and enables the fusion of multiple observations of the same star in a way that is optimal in terms of detection signal-to-noise ratio. Applied to data from the VLT/SPHERE instrument, the method significantly improves the detection sensitivity and the accuracy of astrometric and photometric estimation.

[LG-61] Error Propagation in Dynamic Programming: From Stochastic Control to Option Pricing

链接: https://arxiv.org/abs/2509.20239
作者: Andrea Della Vecchia,Damir Filipović
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Pricing of Securities (q-fin.PR); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This paper investigates theoretical and methodological foundations for stochastic optimal control (SOC) in discrete time. We start formulating the control problem in a general dynamic programming framework, introducing the mathematical structure needed for a detailed convergence analysis. The associate value function is estimated through a sequence of approximations combining nonparametric regression methods and Monte Carlo subsampling. The regression step is performed within reproducing kernel Hilbert spaces (RKHSs), exploiting the classical KRR algorithm, while Monte Carlo sampling methods are introduced to estimate the continuation value. To assess the accuracy of our value function estimator, we propose a natural error decomposition and rigorously control the resulting error terms at each time step. We then analyze how this error propagates backward in time-from maturity to the initial stage-a relatively underexplored aspect of the SOC literature. Finally, we illustrate how our analysis naturally applies to a key financial application: the pricing of American options.

[LG-62] Examining the robustness of Physics-Informed Neural Networks to noise for Inverse Problems

链接: https://arxiv.org/abs/2509.20191
作者: Aleksandra Jekic,Afroditi Natsaridou,Signe Riemer-Sørensen,Helge Langseth,Odd Erik Gundersen
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 25 pages without appendix, 22 figures, submitted to a journal

点击查看摘要

Abstract:Approximating solutions to partial differential equations (PDEs) is fundamental for the modeling of dynamical systems in science and engineering. Physics-informed neural networks (PINNs) are a recent machine learning-based approach, for which many properties and limitations remain unknown. PINNs are widely accepted as inferior to traditional methods for solving PDEs, such as the finite element method, both with regard to computation time and accuracy. However, PINNs are commonly claimed to show promise in solving inverse problems and handling noisy or incomplete data. We compare the performance of PINNs in solving inverse problems with that of a traditional approach using the finite element method combined with a numerical optimizer. The models are tested on a series of increasingly difficult fluid mechanics problems, with and without noise. We find that while PINNs may require less human effort and specialized knowledge, they are outperformed by the traditional approach. However, the difference appears to decrease with higher dimensions and more data. We identify common failures during training to be addressed if the performance of PINNs on noisy inverse problems is to become more competitive.

[LG-63] First-Extinction Law for Resampling Processes

链接: https://arxiv.org/abs/2509.20101
作者: Matteo Benati,Alessandro Londei,Denise Lanzieri,Vittorio Loreto
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST); Data Analysis, Statistics and Probability (physics.data-an); Populations and Evolution (q-bio.PE)
*备注:

点击查看摘要

Abstract:Extinction times in resampling processes are fundamental yet often intractable, as previous formulas scale as 2^M with the number of states M present in the initial probability distribution. We solve this by treating multinomial updates as independent square-root diffusions of zero drift, yielding a closed-form law for the first-extinction time. We prove that the mean coincides exactly with the Wright-Fisher result of Baxter et al., thereby replacing exponential-cost evaluations with a linear-cost expression, and we validate this result through extensive simulations. Finally, we demonstrate predictive power for model collapse in a simple self-training setup: the onset of collapse coincides with the resampling-driven first-extinction time computed from the model’s initial stationary distribution. These results hint to a unified view of resampling extinction dynamics.

[LG-64] BioBO: Biology-informed Bayesian Optimization for Perturbation Design NEURIPS

链接: https://arxiv.org/abs/2509.19988
作者: Yanke Li,Tianyu Cui,Tommaso Mansi,Mangal Prakash,Rui Liao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: NeurIPS: Structured Probabilistic Inference Generative Modeling, 2025

点击查看摘要

Abstract:Efficient design of genomic perturbation experiments is crucial for accelerating drug discovery and therapeutic target identification, yet exhaustive perturbation of the human genome remains infeasible due to the vast search space of potential genetic interactions and experimental constraints. Bayesian optimization (BO) has emerged as a powerful framework for selecting informative interventions, but existing approaches often fail to exploit domain-specific biological prior knowledge. We propose Biology-Informed Bayesian Optimization (BioBO), a method that integrates Bayesian optimization with multimodal gene embeddings and enrichment analysis, a widely used tool for gene prioritization in biology, to enhance surrogate modeling and acquisition strategies. BioBO combines biologically grounded priors with acquisition functions in a principled framework, which biases the search toward promising genes while maintaining the ability to explore uncertain regions. Through experiments on established public benchmarks and datasets, we demonstrate that BioBO improves labeling efficiency by 25-40%, and consistently outperforms conventional BO by identifying top-performing perturbations more effectively. Moreover, by incorporating enrichment analysis, BioBO yields pathway-level explanations for selected perturbations, offering mechanistic interpretability that links designs to biologically coherent regulatory circuits.

[LG-65] Geometric Autoencoder Priors for Bayesian Inversion: Learn First Observe Later

链接: https://arxiv.org/abs/2509.19929
作者: Arnaud Vadeboncoeur,Gregory Duthé,Mark Girolami,Eleni Chatzi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Uncertainty Quantification (UQ) is paramount for inference in engineering applications. A common inference task is to recover full-field information of physical systems from a small number of noisy observations, a usually highly ill-posed problem. Critically, engineering systems often have complicated and variable geometries prohibiting the use of standard Bayesian UQ. In this work, we introduce Geometric Autoencoders for Bayesian Inversion (GABI), a framework for learning geometry-aware generative models of physical responses that serve as highly informative geometry-conditioned priors for Bayesian inversion. Following a ‘‘learn first, observe later’’ paradigm, GABI distills information from large datasets of systems with varying geometries, without requiring knowledge of governing PDEs, boundary conditions, or observation processes, into a rich latent prior. At inference time, this prior is seamlessly combined with the likelihood of the specific observation process, yielding a geometry-adapted posterior distribution. Our proposed framework is architecture agnostic. A creative use of Approximate Bayesian Computation (ABC) sampling yields an efficient implementation that utilizes modern GPU hardware. We test our method on: steady-state heat over rectangular domains; Reynold-Averaged Navier-Stokes (RANS) flow around airfoils; Helmholtz resonance and source localization on 3D car bodies; RANS airflow over terrain. We find: the predictive accuracy to be comparable to deterministic supervised learning approaches in the restricted setting where supervised learning is applicable; UQ to be well calibrated and robust on challenging problems with complex geometries. The method provides a flexible geometry-aware train-once-use-anywhere foundation model which is independent of any particular observation process.

[LG-66] High-Dimensional Statistical Process Control via Manifold Fitting and Learning

链接: https://arxiv.org/abs/2509.19820
作者: Burak I. Tas,Enrique del Castillo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:We address the Statistical Process Control (SPC) of high-dimensional, dynamic industrial processes from two complementary perspectives: manifold fitting and manifold learning, both of which assume data lies on an underlying nonlinear, lower dimensional space. We propose two distinct monitoring frameworks for online or ‘phase II’ Statistical Process Control (SPC). The first method leverages state-of-the-art techniques in manifold fitting to accurately approximate the manifold where the data resides within the ambient high-dimensional space. It then monitors deviations from this manifold using a novel scalar distribution-free control chart. In contrast, the second method adopts a more traditional approach, akin to those used in linear dimensionality reduction SPC techniques, by first embedding the data into a lower-dimensional space before monitoring the embedded observations. We prove how both methods provide a controllable Type I error probability, after which they are contrasted for their corresponding fault detection ability. Extensive numerical experiments on a synthetic process and on a replicated Tennessee Eastman Process show that the conceptually simpler manifold-fitting approach achieves performance competitive with, and sometimes superior to, the more classical lower-dimensional manifold monitoring methods. In addition, we demonstrate the practical applicability of the proposed manifold-fitting approach by successfully detecting surface anomalies in a real image dataset of electrical commutators.

[LG-67] Convex Regression with a Penalty

链接: https://arxiv.org/abs/2509.19788
作者: Eunji Lim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A common way to estimate an unknown convex regression function f_0: \Omega \subset \mathbbR^d \rightarrow \mathbbR from a set of n noisy observations is to fit a convex function that minimizes the sum of squared errors. However, this estimator is known for its tendency to overfit near the boundary of \Omega , posing significant challenges in real-world applications. In this paper, we introduce a new estimator of f_0 that avoids this overfitting by minimizing a penalty on the subgradient while enforcing an upper bound s_n on the sum of squared errors. The key advantage of this method is that s_n can be directly estimated from the data. We establish the uniform almost sure consistency of the proposed estimator and its subgradient over \Omega as n \rightarrow \infty and derive convergence rates. The effectiveness of our estimator is illustrated through its application to estimating waiting times in a single-server queue.

[LG-68] Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies

链接: https://arxiv.org/abs/2509.19707
作者: David Huk,Theodoros Damoulas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: Preprint

点击查看摘要

Abstract:Copulas are a fundamental tool for modelling multivariate dependencies in data, forming the method of choice in diverse fields and applications. However, the adoption of existing models for multimodal and high-dimensional dependencies is hindered by restrictive assumptions and poor scaling. In this work, we present methods for modelling copulas based on the principles of diffusions and flows. We design two processes that progressively forget inter-variable dependencies while leaving dimension-wise distributions unaffected, provably defining valid copulas at all times. We show how to obtain copula models by learning to remember the forgotten dependencies from each process, theoretically recovering the true copula at optimality. The first instantiation of our framework focuses on direct density estimation, while the second specialises in expedient sampling. Empirically, we demonstrate the superior performance of our proposed methods over state-of-the-art copula approaches in modelling complex and high-dimensional dependencies from scientific datasets and images. Our work enhances the representational power of copula models, empowering applications and paving the way for their adoption on larger scales and more challenging domains.

[LG-69] Efficient Online Large-Margin Classification via Dual Certificates

链接: https://arxiv.org/abs/2509.19670
作者: Nam Ho-Nguyen,Fatma Kılınç-Karzan,Ellie Nguyen,Lingqing Shen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online classification is a central problem in optimization, statistical learning and data science. Classical algorithms such as the perceptron offer efficient updates and finite mistake guarantees on linearly separable data, but they do not exploit the underlying geometric structure of the classification problem. We study the offline maximum margin problem through its dual formulation and use the resulting geometric insights to design a principled and efficient algorithm for the online setting. A key feature of our method is its translation invariance, inherited from the offline formulation, which plays a central role in its performance analysis. Our theoretical analysis yields improved mistake and margin bounds that depend only on translation-invariant quantities, offering stronger guarantees than existing algorithms under the same assumptions in favorable settings. In particular, we identify a parameter regime where our algorithm makes at most two mistakes per sequence, whereas the perceptron can be forced to make arbitrarily many mistakes. Our numerical study on real data further demonstrates that our method matches the computational efficiency of existing online algorithms, while significantly outperforming them in accuracy.

[LG-70] Graph-based Neural Space Weather Forecasting NEURIPS2025

链接: https://arxiv.org/abs/2509.19605
作者: Daniel Holmberg,Ivan Zaitsev,Markku Alho,Ioanna Bouri,Fanni Franssila,Haewon Jeong,Minna Palmroth,Teemu Roos
类目: pace Physics (physics.space-ph); Machine Learning (cs.LG); Plasma Physics (physics.plasm-ph)
*备注: 20 pages, 18 figures. Accepted to the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences

点击查看摘要

Abstract:Accurate space weather forecasting is crucial for protecting our increasingly digital infrastructure. Hybrid-Vlasov models, like Vlasiator, offer physical realism beyond that of current operational systems, but are too computationally expensive for real-time use. We introduce a graph-based neural emulator trained on Vlasiator data to autoregressively predict near-Earth space conditions driven by an upstream solar wind. We show how to achieve both fast deterministic forecasts and, by using a generative model, produce ensembles to capture forecast uncertainty. This work demonstrates that machine learning offers a way to add uncertainty quantification capability to existing space weather prediction systems, and make hybrid-Vlasov simulation tractable for operational use.

[LG-71] Discovery of Sustainable Refrigerants through Physics-Informed RL Fine-Tuning of Sequence Models

链接: https://arxiv.org/abs/2509.19588
作者: Adrien Goldszal,Diego Calanzone,Vincent Taboga,Pierre-Luc Bacon
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most refrigerants currently used in air-conditioning systems, such as hydrofluorocarbons, are potent greenhouse gases and are being phased down. Large-scale molecular screening has been applied to the search for alternatives, but in practice only about 300 refrigerants are known, and only a few additional candidates have been suggested without experimental validation. This scarcity of reliable data limits the effectiveness of purely data-driven methods. We present Refgen, a generative pipeline that integrates machine learning with physics-grounded inductive biases. Alongside fine-tuning for valid molecular generation, Refgen incorporates predictive models for critical properties, equations of state, thermochemical polynomials, and full vapor compression cycle simulations. These models enable reinforcement learning fine-tuning under thermodynamic constraints, enforcing consistency and guiding discovery toward molecules that balance efficiency, safety, and environmental impact. By embedding physics into the learning process, Refgen leverages scarce data effectively and enables de novo refrigerant discovery beyond the known set of compounds.

[LG-72] MAGIC: Multi-task Gaussian process for joint imputation and classification in healthcare time series

链接: https://arxiv.org/abs/2509.19577
作者: Dohyun Ku,Catherine D. Chong,Visar Berisha,Todd J. Schwedt,Jing Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 36 pages, 4 figures

点击查看摘要

Abstract:Time series analysis has emerged as an important tool for improving patient diagnosis and management in healthcare applications. However, these applications commonly face two critical challenges: time misalignment and data sparsity. Traditional approaches address these issues through a two-step process of imputation followed by prediction. We propose MAGIC (Multi-tAsk Gaussian Process for Imputation and Classification), a novel unified framework that simultaneously performs class-informed missing value imputation and label prediction within a hierarchical multi-task Gaussian process coupled with functional logistic regression. To handle intractable likelihood components, MAGIC employs Taylor expansion approximations with bounded error analysis, and parameter estimation is performed using EM algorithm with block coordinate optimization supported by convergence analysis. We validate MAGIC through two healthcare applications: prediction of post-traumatic headache improvement following mild traumatic brain injury and prediction of in-hospital mortality within 48 hours after ICU admission. In both applications, MAGIC achieves superior predictive accuracy compared to existing methods. The ability to generate real-time and accurate predictions with limited samples facilitates early clinical assessment and treatment planning, enabling healthcare providers to make more informed treatment decisions.

[LG-73] Stochastic Path Planning in Correlated Obstacle Fields

链接: https://arxiv.org/abs/2509.19559
作者: Li Zhou,Elvan Ceyhan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We introduce the Stochastic Correlated Obstacle Scene (SCOS) problem, a navigation setting with spatially correlated obstacles of uncertain blockage status, realistically constrained sensors that provide noisy readings and costly disambiguation. Modeling the spatial correlation with Gaussian Random Field (GRF), we develop Bayesian belief updates that refine blockage probabilities, and use the posteriors to reduce search space for efficiency. To find the optimal traversal policy, we propose a novel two-stage learning framework. An offline phase learns a robust base policy via optimistic policy iteration augmented with information bonus to encourage exploration in informative regions, followed by an online rollout policy with periodic base updates via a Bayesian mechanism for information adaptation. This framework supports both Monte Carlo point estimation and distributional reinforcement learning (RL) to learn full cost distributions, leading to stronger uncertainty quantification. We establish theoretical benefits of correlation-aware updating and convergence property under posterior sampling. Comprehensive empirical evaluations across varying obstacle densities, sensor capabilities demonstrate consistent performance gains over baselines. This framework addresses navigation challenges in environments with adversarial interruptions or clustered natural hazards.

[LG-74] Quantum Harmonic Analysis and the Structure in Data: Augmentation

链接: https://arxiv.org/abs/2509.19474
作者: Monika Doerfler,Franz Luef,Henry McNulty
类目: Functional Analysis (math.FA); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:In this short note, we study the impact of data augmentation on the smoothness of principal components of high-dimensional datasets. Using tools from quantum harmonic analysis, we show that eigenfunctions of operators corresponding to augmented data sets lie in the modulation space M^1(\mathbbR^d) , guaranteeing smoothness and continuity. Numerical examples on synthetic and audio data confirm the theoretical findings. While interesting in itself, the results suggest that manifold learning and feature extraction algorithms can benefit from systematic and informed augmentation principles.

[LG-75] Anchored Langevin Algorithms

链接: https://arxiv.org/abs/2509.19455
作者: Mert Gurbuzbalaban,Hoang M. Nguyen,Xicheng Zhang,Lingjiong Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 49 pages, 8 figures, 1 table

点击查看摘要

Abstract:Standard first-order Langevin algorithms such as the unadjusted Langevin algorithm (ULA) are obtained by discretizing the Langevin diffusion and are widely used for sampling in machine learning because they scale to high dimensions and large datasets. However, they face two key limitations: (i) they require differentiable log-densities, excluding targets with non-differentiable components; and (ii) they generally fail to sample heavy-tailed targets. We propose anchored Langevin dynamics, a unified approach that accommodates non-differentiable targets and certain classes of heavy-tailed distributions. The method replaces the original potential with a smooth reference potential and modifies the Langevin diffusion via multiplicative scaling. We establish non-asymptotic guarantees in the 2-Wasserstein distance to the target distribution and provide an equivalent formulation derived via a random time change of the Langevin diffusion. We provide numerical experiments to illustrate the theory and practical performance of our proposed approach.

[LG-76] he Platonic Universe: Do Foundation Models See the Same Sky? NEURIPS2025

链接: https://arxiv.org/abs/2509.19453
作者: UniverseTBD:Kshitij Duraphe,Michael J. Smith,Shashwat Sourav,John F. Wu
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 9 pages, 3 tables, 1 figure. Accepted as a workshop paper to Machine Learning and the Physical Sciences at NeurIPS 2025

点击查看摘要

Abstract:We test the Platonic Representation Hypothesis (PRH) in astronomy by measuring representational convergence across a range of foundation models trained on different data types. Using spectroscopic and imaging observations from JWST, HSC, Legacy Survey, and DESI, we compare representations from vision transformers, self-supervised models, and astronomy-specific architectures via mutual k -nearest neighbour analysis. We observe consistent scaling: representational alignment generally increases with model capacity across our tested architectures, supporting convergence toward a shared representation of galaxy astrophysics. Our results suggest that astronomical foundation models can use pre-trained general-purpose architectures, allowing us to capitalise on the broader machine learning community’s already-spent computational investment.

[LG-77] he Pareto Frontier of Resilient Jet Tagging NEURIPS2025

链接: https://arxiv.org/abs/2509.19431
作者: Rikab Gambhir,Matt LeBlanc,Yuanchen Zhou
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 6 pages, 2 figures and 2 tables. Preliminary version accepted for the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Machine Learning and the Physical Sciences. 6 or 7 December, 2025; San Diego, California, USA

点击查看摘要

Abstract:Classifying hadronic jets using their constituents’ kinematic information is a critical task in modern high-energy collider physics. Often, classifiers are designed by targeting the best performance using metrics such as accuracy, AUC, or rejection rates. However, the use of a single metric can lead to the use of architectures that are more model-dependent than competitive alternatives, leading to potential uncertainty and bias in analysis. We explore such trade-offs and demonstrate the consequences of using networks with high performance metrics but low resilience.

[LG-78] SpellerSSL: Self-Supervised Learning with P300 Aggregation for Speller BCIs

链接: https://arxiv.org/abs/2509.19401
作者: Jiazhen Hong,Geoff Mackellar,Soheila Ghane
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalogram (EEG)-based P300 speller brain-computer interfaces (BCIs) face three main challenges: low signal-to-noise ratio (SNR), poor generalization, and time-consuming calibration. We propose SpellerSSL, a framework that combines self-supervised learning (SSL) with P300 aggregation to address these issues. First, we introduce an aggregation strategy to enhance SNR. Second, to achieve generalization in training, we employ a customized 1D U-Net backbone and pretrain the model on both cross-domain and in-domain EEG data. The pretrained model is subsequently fine-tuned with a lightweight ERP-Head classifier for P300 detection, which adapts the learned representations to subject-specific data. Our evaluations on calibration time demonstrate that combining the aggregation strategy with SSL significantly reduces the calibration burden per subject and improves robustness across subjects. Experimental results show that SSL learns effective EEG representations in both in-domain and cross-domain, with in-domain achieving a state-of-the-art character recognition rate of 94% with only 7 repetitions and the highest information transfer rate (ITR) of 21.86 bits/min on the public II-B dataset. Moreover, in-domain SSL with P300 aggregation reduces the required calibration size by 60% while maintaining a comparable character recognition rate. To the best of our knowledge, this is the first study to apply SSL to P300 spellers, highlighting its potential to improve both efficiency and generalization in speller BCIs and paving the way toward an EEG foundation model for P300 speller BCIs.

[LG-79] Hybrid Pipeline SWD Detection in Long-Term EEG Signals

链接: https://arxiv.org/abs/2509.19387
作者: Antonio Quintero Rincon,Nicolas Masino,Veronica Marsico,Hadj Batatia
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 11 pages, 8 figures, 4 tables, SABI 2025 CLIC 2025

点击查看摘要

Abstract:Spike-and-wave discharges (SWDs) are the electroencephalographic hallmark of absence epilepsy, yet their manual identification in multi-day recordings remains labour-intensive and error-prone. We present a lightweight hybrid pipeline that couples analytical features with a shallow artificial neural network (ANN) for accurate, patient-specific SWD detection in long-term, monopolar EEG. A two-sided moving-average (MA) filter first suppresses the high-frequency components of normal background activity. The residual signal is then summarised by the mean and the standard deviation of its normally distributed samples, yielding a compact, two-dimensional feature vector for every 20s window. These features are fed to a single-hidden-layer ANN trained via back-propagation to classify each window as SWD or non-SWD. The method was evaluated on 780 channels sampled at 256 Hz from 12 patients, comprising 392 annotated SWD events. It correctly detected 384 events (sensitivity: 98%) while achieving a specificity of 96.2 % and an overall accuracy of 97.2%. Because feature extraction is analytic, and the classifier is small, the pipeline runs in real-time and requires no manual threshold tuning. These results indicate that normal-distribution descriptors combined with a modest ANN provide an effective and computationally inexpensive solution for automated SWD screening in extended EEG recordings.

[LG-80] A Statistical Mixture-of-Experts Framework for EMG Artifact Removal in EEG: Empirical Insights and a Proof-of-Concept Application

链接: https://arxiv.org/abs/2509.19385
作者: Benjamin J. Choi,Griffin Milsap,Clara A. Scholl,Francesco Tenore,Mattson Ogg
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective control of neural interfaces is limited by poor signal quality. While neural network-based electroencephalography (EEG) denoising methods for electromyogenic (EMG) artifacts have improved in recent years, current state-of-the-art (SOTA) models perform suboptimally in settings with high noise. To address the shortcomings of current machine learning (ML)-based denoising algorithms, we present a signal filtration algorithm driven by a new mixture-of-experts (MoE) framework. Our algorithm leverages three new statistical insights into the EEG-EMG denoising problem: (1) EMG artifacts can be partitioned into quantifiable subtypes to aid downstream MoE classification, (2) local experts trained on narrower signal-to-noise ratio (SNR) ranges can achieve performance increases through specialization, and (3) correlation-based objective functions, in conjunction with rescaling algorithms, can enable faster convergence in a neural network-based denoising context. We empirically demonstrate these three insights into EMG artifact removal and use our findings to create a new downstream MoE denoising algorithm consisting of convolutional (CNN) and recurrent (RNN) neural networks. We tested all results on a major benchmark dataset (EEGdenoiseNet) collected from 67 subjects. We found that our MoE denoising model achieved competitive overall performance with SOTA ML denoising algorithms and superior lower bound performance in high noise settings. These preliminary results highlight the promise of our MoE framework for enabling advances in EMG artifact removal for EEG processing, especially in high noise settings. Further research and development will be necessary to assess our MoE framework on a wider range of real-world test cases and explore its downstream potential to unlock more effective neural interfaces.

[LG-81] Neural Network Based Framework for Passive Intermodulation Cancellation in MIMO Systems

链接: https://arxiv.org/abs/2509.19382
作者: Xiaolong Li,Zhi-qin John Xu,Peiting You,Yifei Zhu
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Passive intermodulation (PIM) has emerged as a critical source of self-interference in modern MIMO-OFDM systems, especially under the stringent requirements of 5G and beyond. Conventional cancellation methods often rely on complex nonlinear models with limited scalability and high computational cost. In this work, we propose a lightweight deep learning framework for PIM cancellation that leverages depthwise separable convolutions and dilated convolutions to efficiently capture nonlinear dependencies across antennas and subcarriers. To further enhance convergence, we adopt a cyclic learning rate schedule and gradient clipping. In a controlled MIMO experimental setup, the method effectively suppresses third-order passive intermodulation (PIM) distortion, achieving up to 29dB of average power error (APE) with only 11k trainable parameters. These results highlight the potential of compact neural architectures for scalable interference mitigation in future wireless communication systems.

[LG-82] Short-Term Regional Electricity Demand Forecasting in Argentina Using LSTM Networks

链接: https://arxiv.org/abs/2509.19374
作者: Oscar A. Oviedo
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 44 pages, 13 figures

点击查看摘要

Abstract:This study presents the development and optimization of a deep learning model based on Long Short-Term Memory (LSTM) networks to predict short-term hourly electricity demand in Córdoba, Argentina. Integrating historical consumption data with exogenous variables (climatic factors, temporal cycles, and demographic statistics), the model achieved high predictive precision, with a mean absolute percentage error of 3.20% and a determination coefficient of 0.95. The inclusion of periodic temporal encodings and weather variables proved crucial to capture seasonal patterns and extreme consumption events, enhancing the robustness and generalizability of the model. In addition to the design and hyperparameter optimization of the LSTM architecture, two complementary analyses were carried out: (i) an interpretability study using Random Forest regression to quantify the relative importance of exogenous drivers, and (ii) an evaluation of model performance in predicting the timing of daily demand maxima and minima, achieving exact-hour accuracy in more than two-thirds of the test days and within abs(1) hour in over 90% of cases. Together, these results highlight both the predictive accuracy and operational relevance of the proposed framework, providing valuable insights for grid operators seeking optimized planning and control strategies under diverse demand scenarios.

[LG-83] Low-Cost Sensor Fusion Framework for Organic Substance Classification and Quality Control Using Classification Methods

链接: https://arxiv.org/abs/2509.19367
作者: Borhan Uddin Chowdhury,Damian Valles,Md Raf E Ul Shougat
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Copyright 2025 IEEE. This is the author’s version of the work accepted for publication in FMLDS 2025. The final version will be published by IEEE and available via DOI (to be inserted when available). Accepted at FMLDS 2025, to appear in IEEE Xplore. 8 pages, 17 figures, 3 tables

点击查看摘要

Abstract:We present a sensor-fusion framework for rapid, non-destructive classification and quality control of organic substances, built on a standard Arduino Mega 2560 microcontroller platform equipped with three commercial environmental and gas sensors. All data used in this study were generated in-house: sensor outputs for ten distinct classes - including fresh and expired samples of apple juice, onion, garlic, and ginger, as well as cinnamon and cardamom - were systematically collected and labeled using this hardware setup, resulting in a unique, application-specific dataset. Correlation analysis was employed as part of the preprocessing pipeline for feature selection. After preprocessing and dimensionality reduction (PCA/LDA), multiple supervised learning models - including Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF), each with hyperparameter tuning, as well as an Artificial Neural Network (ANN) and an ensemble voting classifier - were trained and cross-validated on the collected dataset. The best-performing models, including tuned Random Forest, ensemble, and ANN, achieved test accuracies in the 93 to 94 percent range. These results demonstrate that low-cost, multisensory platforms based on the Arduino Mega 2560, combined with advanced machine learning and correlation-driven feature engineering, enable reliable identification and quality control of organic compounds.

[LG-84] A Measurement Report Data-Driven Framework for Localized Statistical Channel Modeling

链接: https://arxiv.org/abs/2509.19342
作者: Xinyu Qin,Ye Xue,Qi Yan,Shutao Zhang,Bingsheng Peng,Tsung-Hui Chang
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Localized statistical channel modeling (LSCM) is crucial for effective performance evaluation in digital twin-assisted network optimization. Solely relying on the multi-beam reference signal receiving power (RSRP), LSCM aims to model the localized statistical propagation environment by estimating the channel angular power spectrum (APS). However, existing methods rely heavily on drive test data with high collection costs and limited spatial coverage. In this paper, we propose a measurement report (MR) data-driven framework for LSCM, exploiting the low-cost and extensive collection of MR data. The framework comprises two novel modules. The MR localization module addresses the issue of missing locations in MR data by introducing a semi-supervised method based on hypergraph neural networks, which exploits multi-modal information via distance-aware hypergraph modeling and hypergraph convolution for location extraction. To enhance the computational efficiency and solution robustness, LSCM operates at the grid level. Compared to independently constructing geographically uniform grids and estimating channel APS, the joint grid construction and channel APS estimation module enhances robustness in complex environments with spatially non-uniform data by exploiting their correlation. This module alternately optimizes grid partitioning and APS estimation using clustering and improved sparse recovery for the ill-conditioned measurement matrix and incomplete observations. Through comprehensive experiments on a real-world MR dataset, we demonstrate the superior performance and robustness of our framework in localization and channel modeling.

[LG-85] A Spatio-Temporal Feature Fusion EEG Virtual Channel Signal Generation Network and Its Application in Anxiety Assessment

链接: https://arxiv.org/abs/2509.19334
作者: Shangqing Yuan,Wenshuang Zhai,Shengwen Guo
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To address the issue of limited channels and insufficient information collection in portable EEG devices, this study explores an EEG virtual channel signal generation network using a novel spatio-temporal feature fusion strategy. Based on the EEG signals from four frontal lobe channels, the network aims to generate virtual channel EEG signals for other 13 important brain regions. The architecture of the network is a two-dimensional convolutional neural network and it includes a parallel module for temporal and spatial domain feature extraction, followed by a feature fusion module. The public PRED+CT database, which includes multi-channel EEG signals from 119 subjects, was selected to verify the constructed network. The results showed that the average correlation coefficient between the generated virtual channel EEG signals and the original real signals was 0.6724, with an average absolute error of 3.9470. Furthermore, the 13 virtual channel EEG signals were combined with the original EEG signals of four brain regions and then used for anxiety classification with a support vector machine. The results indicate that the virtual EEG signals generated by the constructed network not only have a high degree of consistency with the real channel EEG signals but also significantly enhance the performance of machine learning algorithms for anxiety classification. This study effectively alleviates the problem of insufficient information acquisition by portable EEG devices with few channels.

[LG-86] Electric Vehicle Identification from Behind Smart Meter Data

链接: https://arxiv.org/abs/2509.19316
作者: Ammar Kamoona,Hui Song,Ali Moradi Amani,Mahdi Jalili,Xinghuo Yu,Peter McTaggart
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 27 pages,

点击查看摘要

Abstract:Electric vehicle (EV) charging loads identification from behind smart meter recordings is an indispensable aspect that enables effective decision-making for energy distributors to reach an informed and intelligent decision about the power grid’s reliability. When EV charging happens behind the meter (BTM), the charging occurs on the customer side of the meter, which measures the overall electricity consumption. In other words, the charging of the EV is considered part of the customer’s load and not separately measured by the Distribution Network Operators (DNOs). DNOs require complete knowledge about the EV presence in their network. Identifying the EV charging demand is essential to better plan and manage the distribution grid. Unlike supervised methods, this paper addresses the problem of EV charging load identification in a non-nonintrusive manner from low-frequency smart meter using an unsupervised learning approach based on anomaly detection technique. Our approach does not require prior knowledge of EV charging profiles. It only requires real power consumption data of non-EV users, which are abundant in practice. We propose a deep temporal convolution encoding decoding (TAE) network. The TAE is applied to power consumption from smart BTM from Victorian households in Australia, and the TAE shows superior performance in identifying households with EVs.

[LG-87] STL-FFT-STFT-TCN-LSTM: An Effective Wave Height High Accuracy Prediction Model Fusing Time-Frequency Domain Features

链接: https://arxiv.org/abs/2509.19313
作者: Huipeng Liu,Zhichao Zhu,Yuan Zhou,Changlu Li
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 17 page, 13 figures; references added

点击查看摘要

Abstract:As the consumption of traditional energy sources intensifies and their adverse environmental impacts become more pronounced, wave energy stands out as a highly promising member of the renewable energy family due to its high energy density, stability, widespread distribution, and environmental friendliness. The key to its development lies in the precise prediction of Significant Wave Height (WVHT). However, wave energy signals exhibit strong nonlinearity, abrupt changes, multi-scale periodicity, data sparsity, and high-frequency noise interference; additionally, physical models for wave energy prediction incur extremely high computational costs. To address these challenges, this study proposes a hybrid model combining STL-FFT-STFT-TCN-LSTM. This model exploits the Seasonal-Trend Decomposition Procedure based on Loess (STL), Fast Fourier Transform (FFT), Short-Time Fourier Transform (STFT), Temporal Convolutional Network (TCN), and Long Short-Term Memory (LSTM) technologies. The model aims to optimize multi-scale feature fusion, capture extreme wave heights, and address issues related to high-frequency noise and periodic signals, thereby achieving efficient and accurate prediction of significant wave height. Experiments were conducted using hourly data from NOAA Station 41008 and 41047 spanning 2019 to 2022. The results showed that compared with other single models and hybrid models, the STL-FFT-STFT-TCN-LSTM model achieved significantly higher prediction accuracy in capturing extreme wave heights and suppressing high-frequency noise, with MAE reduced by 15.8%-40.5%, SMAPE reduced by 8.3%-20.3%, and R increased by 1.31%-2.9%; in ablation experiments, the model also demonstrated the indispensability of each component step, validating its superiority in multi-scale feature fusion.

[LG-88] Graph-Based Spatio-temporal Attention and Multi-Scale Fusion for Clinically Interpretable High-Fidelity Fetal ECG Extraction

链接: https://arxiv.org/abs/2509.19308
作者: Chang Wang,Ming Zhu,Shahram Latifi,Buddhadeb Dawn,Shengjie Zhai
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, ACM BCB 2025

点击查看摘要

Abstract:Congenital Heart Disease (CHD) is the most common neonatal anomaly, highlighting the urgent need for early detection to improve outcomes. Yet, fetal ECG (fECG) signals in abdominal ECG (aECG) are often masked by maternal ECG and noise, challenging conventional methods under low signal-to-noise ratio (SNR) conditions. We propose FetalHealthNet (FHNet), a deep learning framework that integrates Graph Neural Networks with a multi-scale enhanced transformer to dynamically model spatiotemporal inter-lead correlations and extract clean fECG signals. On benchmark aECG datasets, FHNet consistently outperforms long short-term memory (LSTM) models, standard transformers, and state-of-the-art models, achieving R20.99 and RMSE = 0.015 even under severe noise. Interpretability analyses highlight physiologically meaningful temporal and lead contributions, supporting model transparency and clinical trust. FHNet illustrates the potential of AI-driven modeling to advance fetal monitoring and enable early CHD screening, underscoring the transformative impact of next-generation biomedical signal processing.

信息检索

[IR-0] Into the Void: Understanding Online Health Information in Low-Web Data Languages

链接: https://arxiv.org/abs/2509.20245
作者: Hellina Hailu Nigatu,Nuredin Ali Abdelkadir,Fiker Tewelde,Stevie Chancellor,Daricia Wilkinson
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: Accepted to AIES 2025

点击查看摘要

Abstract:Data voids–areas of the internet where reliable information is scarce or absent–pose significant challenges to online health information seeking, particularly for users operating in low-web data languages. These voids are increasingly encountered not on traditional search engines alone, but on social media platforms, which have gradually morphed into informal search engines for millions of people. In this paper, we introduce the phenomenon of data horizons: a critical boundary where algorithmic structures begin to degrade the relevance and reliability of search results. Unlike the core of a data void, which is often exploited by bad actors to spread misinformation, the data horizon marks the critical space where systemic factors, such as linguistic underrepresentation, algorithmic amplification, and socio-cultural mismatch, create conditions of informational instability. Focusing on Tigrinya and Amharic as languages of study, we evaluate (1) the common characteristics of search results for health queries, (2) the quality and credibility of health information, and (3) characteristics of search results that diverge from their queries. We find that search results for health queries in low-web data languages may not always be in the language of search and may be dominated by nutritional and religious advice. We show that search results that diverge from their queries in low-resourced languages are due to algorithmic failures, (un)intentional manipulation, or active manipulation by content creators. We use our findings to illustrate how a data horizon manifests under several interacting constraints on information availability.

[IR-1] Cascade! Human in the loop shortcomings can increase the risk of failures in recommender systems

链接: https://arxiv.org/abs/2509.20099
作者: Wm. Matthew Kennedy,Nishanshi Shukla,Cigdem Patlak,Blake Chambers,Theodora Skeadas,Tuesday,Kingsley Owadara,Aayush Dhanotiya
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Recommender systems are among the most commonly deployed systems today. Systems design approaches to AI-powered recommender systems have done well to urge recommender system developers to follow more intentional data collection, curation, and management procedures. So too has the “human-in-the-loop” paradigm been widely adopted, primarily to address the issue of accountability. However, in this paper, we take the position that human oversight in recommender system design also entails novel risks that have yet to be fully described. These risks are “codetermined” by the information context in which such systems are often deployed. Furthermore, new knowledge of the shortcomings of “human-in-the-loop” practices to deliver meaningful oversight of other AI systems suggest that they may also be inadequate for achieving socially responsible recommendations. We review how the limitations of human oversight may increase the chances of a specific kind of failure: a “cascade” or “compound” failure. We then briefly explore how the unique dynamics of three common deployment contexts can make humans in the loop more likely to fail in their oversight duties. We then conclude with two recommendations.

[IR-2] Multimodal-enhanced Federated Recommendation: A Group-wise Fusion Approach

链接: https://arxiv.org/abs/2509.19955
作者: Chunxu Zhang,Weipeng Zhang,Guodong Long,Zhiheng Xue,Riting Xia,Bo Yang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Federated Recommendation (FR) is a new learning paradigm to tackle the learn-to-rank problem in a privacy-preservation manner. How to integrate multi-modality features into federated recommendation is still an open challenge in terms of efficiency, distribution heterogeneity, and fine-grained alignment. To address these challenges, we propose a novel multimodal fusion mechanism in federated recommendation settings (GFMFR). Specifically, it offloads multimodal representation learning to the server, which stores item content and employs a high-capacity encoder to generate expressive representations, alleviating client-side overhead. Moreover, a group-aware item representation fusion approach enables fine-grained knowledge sharing among similar users while retaining individual preferences. The proposed fusion loss could be simply plugged into any existing federated recommender systems empowering their capability by adding multi-modality features. Extensive experiments on five public benchmark datasets demonstrate that GFMFR consistently outperforms state-of-the-art multimodal FR baselines.

[IR-3] Documentation Retrieval Improves Planning Language Generation

链接: https://arxiv.org/abs/2509.19931
作者: Renxiang Wang,Li Zhang
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 14 figures, 1 table

点击查看摘要

Abstract:Certain strong LLMs have shown promise for zero-shot formal planning by generating planning languages like PDDL. Yet, performance of most open-source models under 50B parameters has been reported to be close to zero due to the low-resource nature of these languages. We significantly improve their performance via a series of lightweight pipelines that integrates documentation retrieval with modular code generation and error refinement. With models like Llama-4-Maverick, our best pipeline improves plan correctness from 0% to over 80% on the common BlocksWorld domain. However, while syntactic errors are substantially reduced, semantic errors persist in more challenging domains, revealing fundamental limitations in current models’ reasoning capabilities.\footnoteOur code and data can be found at this https URL

[IR-4] Adaptive User Interest Modeling via Conditioned Denoising Diffusion For Click-Through Rate Prediction

链接: https://arxiv.org/abs/2509.19876
作者: Qihang Zhao,Xiaoyang Zheng,Ben Chen,Zhongbo Sun,Chenyi Lei
类目: Information Retrieval (cs.IR)
*备注: 5 pages, under review

点击查看摘要

Abstract:User behavior sequences in search systems resemble “interest fossils”, capturing genuine intent yet eroded by exposure bias, category drift, and contextual noise. Current methods predominantly follow an “identify-aggregate” paradigm, assuming sequences immutably reflect user preferences while overlooking the organic entanglement of noise and genuine interest. Moreover, they output static, context-agnostic representations, failing to adapt to dynamic intent shifts under varying Query-User-Item-Context conditions. To resolve this dual challenge, we propose the Contextual Diffusion Purifier (CDP). By treating category-filtered behaviors as “contaminated observations”, CDP employs a forward noising and conditional reverse denoising process guided by cross-interaction features (Query x User x Item x Context), controllably generating pure, context-aware interest representations that dynamically evolve with scenarios. Extensive offline/online experiments demonstrate the superiority of CDP over state-of-the-art methods. Comments: 5 pages, under review Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2509.19876 [cs.IR] (or arXiv:2509.19876v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2509.19876 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-5] Learning Contextual Retrieval for Robust Conversational Search EMNLP2025

链接: https://arxiv.org/abs/2509.19700
作者: Seunghan Yang,Juntae Lee,Jihwan Bang,Kyuhong Shim,Minsoo Kim,Simyung Chang
类目: Information Retrieval (cs.IR)
*备注: EMNLP 2025 main conference

点击查看摘要

Abstract:Effective conversational search demands a deep understanding of user intent across multiple dialogue turns. Users frequently use abbreviations and shift topics in the middle of conversations, posing challenges for conventional retrievers. While query rewriting techniques improve clarity, they often incur significant computational cost due to additional autoregressive steps. Moreover, although LLM-based retrievers demonstrate strong performance, they are not explicitly optimized to track user intent in multi-turn settings, often failing under topic drift or contextual ambiguity. To address these limitations, we propose ContextualRetriever, a novel LLM-based retriever that directly incorporates conversational context into the retrieval process. Our approach introduces: (1) a context-aware embedding mechanism that highlights the current query within the dialogue history; (2) intent-guided supervision based on high-quality rewritten queries; and (3) a training strategy that preserves the generative capabilities of the base LLM. Extensive evaluations across multiple conversational search benchmarks demonstrate that ContextualRetriever significantly outperforms existing methods while incurring no additional inference overhead.

[IR-6] Digital Signal Processing from Classical Coherent Systems to Continuous-Variable QKD: A Review of Cross-Domain Techniques Applications and Challenges

链接: https://arxiv.org/abs/2509.20141
作者: Davi Juvêncio Gomes de Sousa,Caroline da Silva Morais Alves,Valéria Loureiro da Silva,Nelson Alves Ferreira Neto
类目: Quantum Physics (quant-ph); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This systematic review investigates the application of digital signal processing (DSP) techniques – originally developed for coherent optical communication systems to continuous-variable quantum key distribution (CV-QKD). The convergence of these domains has enabled significant advances in CV-QKD performance, particularly in phase synchronization, polarization tracking, and excess noise mitigation. To provide a comprehensive and reproducible synthesis of this emerging field, we employed the APISSER methodology, a task-oriented framework adapted from the PRISMA protocol. A structured search across IEEE Xplore and Web of Science databases (2021-2025) yielded 220 relevant publications, which were screened, classified, and analyzed to address six research questions. Our findings highlight that many classical DSP algorithms, such as Kalman filtering, carrier recovery, adaptive equalization, and machine-learning-assisted signal estimation, have been successfully adapted to the quantum regime, often requiring modifications to meet security and noise constraints. We also identify a range of recent DSP innovations in coherent optical communication systems with high potential for future CV-QKD integration, including neural equalization, probabilistic shaping, and joint retiming-equalization filters. Despite these advances, challenges remain in achieving robust phase tracking under ultra-low Signal-to-Noise Ratio (SNR) conditions, real-time polarization compensation, and secure co-existence with classical channels. This review maps current trends, technical barriers, and emerging opportunities at the intersection of signal processing for quantum and classical communication, supporting the development of scalable and resilient CV-QKD systems.

附件下载

点击下载今日全部论文列表