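This post lists the latest papers fetched from arXiv.org on 2025-09-05. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.
Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.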
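Contents
Overview (2025-09-05)
425 papers were updated today, including:
- Natural language processing: 71 papers (Computation and Language (cs.CL))
- Artificial intelligence: 127 papers (Artificial Intelligence (cs.AI))
- Computer vision: 86 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 117 papers (Machine Learning (cs.LG))
Natural Language Processing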
[NLP-0] Delta Activations: A Representation for Finetuned Large Language Models
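[Quick read]: This paper addresses the difficulty of selecting and understanding open-source Large Language Models (LLMs) in practice, which stems from inconsistent metadata and poorly structured repositories. Its core solution is Delta Activations: vector embeddings obtained by measuring how a finetuned model's internal activations shift relative to its base model. These embeddings allow models to be clustered effectively by domain and task and reveal latent structure in the model ecosystem. The method is robust across finetuning settings and shows an additive property when finetuning datasets are mixed. It can also embed tasks via few-shot finetuning, which makes it a useful tool for model selection and merging.
Link: https://arxiv.org/abs/2509.04442
Authors: Zhiqiu Xu, Amish Sethi, Mayur Naik, Ser-Nam Lim
Affiliations: University of Pennsylvania; University of Central Florida
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: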
Abstract:The success of powerful open source Large Language Models (LLMs) has enabled the community to create a vast collection of post-trained models adapted to specific tasks and domains. However, navigating and understanding these models remains challenging due to inconsistent metadata and unstructured repositories. We introduce Delta Activations, a method to represent finetuned models as vector embeddings by measuring shifts in their internal activations relative to a base model. This representation allows for effective clustering by domain and task, revealing structure in the model landscape. Delta Activations also demonstrate desirable properties: it is robust across finetuning settings and exhibits an additive property when finetuning datasets are mixed. In addition, we show that Delta Activations can embed tasks via few-shot finetuning, and further explore its use for model selection and merging. We hope Delta Activations can facilitate the practice of reusing publicly available models. Code is available at this https URL.
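The idea of embedding a finetuned model by its activation shift can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the activations are synthetic random arrays rather than real forward passes, the averaging over a probe set is one plausible aggregation, and the names `delta_activation`, `finetuned`, `math_shift`, and `code_shift` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def delta_activation(base_acts, tuned_acts):
    """Mean activation shift (finetuned minus base) over a shared probe set.

    Both arrays have shape (num_prompts, hidden_dim) and are assumed to come
    from the same layer and token position of each model.
    """
    return (np.asarray(tuned_acts) - np.asarray(base_acts)).mean(axis=0)


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic stand-ins: real activations would come from model forward passes.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
math_shift = rng.normal(size=16)  # hypothetical shift from math finetuning
code_shift = rng.normal(size=16)  # hypothetical shift from code finetuning


def finetuned(shift):
    """Base activations plus a domain shift and a little per-prompt noise."""
    return base + shift + 0.05 * rng.normal(size=base.shape)


math_a = delta_activation(base, finetuned(math_shift))
math_b = delta_activation(base, finetuned(math_shift))
code_a = delta_activation(base, finetuned(code_shift))

print("math vs math:", round(cosine(math_a, math_b), 3))
print("math vs code:", round(cosine(math_a, code_a), 3))
```

In this toy setup, two models finetuned on the same domain end up with nearly parallel embeddings, while a model from a different domain does not, which is the property that makes clustering by domain possible.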
[NLP-1] ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
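[Quick read]: This paper addresses the fact that the intermediate knowledge and patterns Large Language Models (LLMs) uncover while reasoning are not persisted: insights from a reasoning trace are discarded as soon as the context window is reset, which limits continual learning and reuse across queries. The key to the solution is a concept-level memory that stores reusable, modular abstractions distilled from reasoning traces in natural language, rather than recording only specific instances (such as query/response pairs or summaries tightly coupled to the original problem). For a new query, relevant concepts are retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. On the ARC-AGI benchmark the method yields a 7.5% relative gain over a strong no-memory baseline, and dynamically updated memory outperforms an otherwise identical fixed memory, which points to a form of self-improvement.
Link: https://arxiv.org/abs/2509.04439
Authors: Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Zhijian Liu, Zhiting Hu, Lianhui Qin
Affiliations: University of California, San Diego; University of Maryland
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: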
Abstract:While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. On the challenging ARC-AGI benchmark, our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, we confirm that dynamically updating memory during test-time outperforms an otherwise identical fixed memory setting with additional attempts, supporting the hypothesis that solving more problems and abstracting more patterns to memory enables further solutions in a form of self-improvement. Code available at this https URL.
[NLP-2] The Telephone Game: Evaluating Semantic Drift in Unified Models
[Quick read]: This paper addresses insufficient cross-modal consistency in unified models (UMs) between visual understanding (image-to-text, I2T) and visual generation (text-to-image, T2I). Existing evaluations measure I2T or T2I performance in isolation (for example MME and MMBench for I2T, FID and GenEval for T2I) and cannot reveal whether a model preserves meaning when moving between image and text, that is, whether semantic drift occurs. The authors propose the Unified Consistency Framework for Unified Models (UCF-UM), whose core contribution is a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic loss. It introduces three metrics: Mean Cumulative Drift (MCD), Semantic Drift Rate (SDR), and Multi-Generation GenEval (MGG), which together assess a model's cross-modal stability and the strength of its shared representations.
Link: https://arxiv.org/abs/2509.04438
Authors: Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
Affiliations: Center For Research in Computer Vision, University of Central Florida, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME, MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF formulates 3 metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes the semantic decay rate; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark ND400, sampled from NoCaps and DOCCI, and evaluate on seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantics over many alternations, whereas others like Vila-u drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified models' cross-modal stability and the strength of their shared representations. Code: this https URL
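The cyclic protocol can be pictured as tracking how far each generation's embedding moves from the starting point. The sketch below is only an illustration of that idea: the abstract does not give the formulas for MCD or SDR, so `drift_curve` and `mean_similarity` are hypothetical helpers, and the embeddings are hand-written toy vectors rather than outputs of a real encoder.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def drift_curve(embeddings):
    """Similarity of every later generation to the generation-0 input.

    `embeddings` is a list of vectors, one per generation of the
    I2T -> T2I -> I2T ... cycle, all embedded in the same space.
    """
    origin = embeddings[0]
    return [cosine(origin, e) for e in embeddings[1:]]


def mean_similarity(embeddings):
    """Average similarity to the origin (an illustrative score, not the paper's MCD)."""
    return float(np.mean(drift_curve(embeddings)))


# Toy example: generation 1 keeps the meaning, generation 2 has drifted away.
toy = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(drift_curve(toy))
print(mean_similarity(toy))
```

A model that preserves semantics across alternations would keep the curve close to 1, while a fast-drifting model would show it falling off within a few generations.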
[NLP-3] Can Language Models Handle a Non-Gregorian Calendar?
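[Quick read]: This paper addresses the limited ability of current language models (LMs) to handle culture-specific non-Gregorian calendar systems, and in particular the lack of a systematic evaluation of temporal reasoning and knowledge for the Japanese calendar. The approach is to build datasets for four tasks that require temporal knowledge and reasoning, then systematically evaluate a range of English-centric and Japanese-centric LMs. Some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars, which underlines the importance of LMs that understand culture-specific calendars.
Link: https://arxiv.org/abs/2509.04432
Authors: Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: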
Abstract:Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well open-source LMs handle one such non-Gregorian system: the Japanese calendar. For our evaluation, we create datasets for four tasks that require both temporal knowledge and temporal reasoning. Evaluating a range of English-centric and Japanese-centric LMs, we find that some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars. Our results highlight the importance of developing LMs that are better equipped for culture-specific calendar understanding.
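To make the calendar conversion and calendar arithmetic tasks concrete, here is a small sketch of Gregorian to Japanese era conversion. It is not taken from the paper's datasets: the helper names `to_japanese_era` and `add_years` are hypothetical, and only the four most recent eras are included, with their commonly cited start dates.

```python
from datetime import date

# Gregorian start dates of the four most recent eras, newest first.
ERAS = [
    ("Reiwa", date(2019, 5, 1)),
    ("Heisei", date(1989, 1, 8)),
    ("Showa", date(1926, 12, 25)),
    ("Taisho", date(1912, 7, 30)),
]


def to_japanese_era(d):
    """Return (era name, era year) for a Gregorian date.

    Era years are counted from 1, and the first era year ends on
    31 December of the year in which the era started.
    """
    for name, start in ERAS:
        if d >= start:
            return name, d.year - start.year + 1
    raise ValueError("date precedes the eras in this table")


def add_years(d, years):
    """Shift a date by whole years (not valid for 29 February inputs)."""
    return d.replace(year=d.year + years)


# Conversion around an era boundary.
print(to_japanese_era(date(2019, 4, 30)))  # last day of Heisei
print(to_japanese_era(date(2019, 5, 1)))   # first day of Reiwa

# Arithmetic that crosses an era boundary: ten years after 1 June, Heisei 25.
print(to_japanese_era(add_years(date(2013, 6, 1), 10)))
```

The last call shows why era arithmetic is harder than plain conversion: adding ten years to Heisei 25 does not give "Heisei 35", because the era changed in between.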
[NLP-4] Towards a Unified View of Large Language Model Post-Training
[Quick read]: This paper tackles the apparent opposition, in language model post-training, between training on online data (model-generated rollouts) and offline data (human or other-model demonstrations), that is, how to unify Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), which draw on these different data sources. The key is a Unified Policy Gradient Estimator under which a wide range of post-training methods become instances of one optimization objective. The estimator is built from four interchangeable parts: a stabilization mask, a reference policy denominator, an advantage estimate, and a likelihood gradient. Building on this framework, the authors design Hybrid Post-Training (HPT), an algorithm that dynamically selects training signals to exploit demonstrations effectively while keeping exploration stable, without sacrificing learned reasoning patterns.
Link: https://arxiv.org/abs/2509.04419
Authors: Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou
Affiliations: Tsinghua University; Shanghai AI Laboratory; WeChat AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
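The abstract names the four interchangeable parts but does not give the formula. One way to read how they might compose, written here as an assumption rather than as the paper's stated equation, is a product of the four terms. The two special cases that follow use only the standard identity $\nabla_\theta \pi_\theta / \pi_\theta = \nabla_\theta \log \pi_\theta$; the symbols $g_{\text{uni}}$, $s$, and $a$ are notation chosen for this sketch.

```latex
% Assumed composition of the four parts named in the abstract
% (stabilization mask, reference policy denominator, advantage estimate,
% likelihood gradient); not quoted from the paper.
\[
  g_{\text{uni}}
  = \mathbb{1}_{\text{stable}}
    \cdot \frac{1}{\pi_{\text{ref}}(a \mid s)}
    \cdot \hat{A}(s, a)
    \cdot \nabla_\theta \pi_\theta(a \mid s)
\]

% SFT-like case: mask = 1, \pi_{\text{ref}} = \pi_\theta, \hat{A} = 1
\[
  g
  = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = \nabla_\theta \log \pi_\theta(a \mid s)
\]

% On-policy policy-gradient case: mask = 1, \pi_{\text{ref}} = \pi_\theta
\[
  g = \hat{A}(s, a) \, \nabla_\theta \log \pi_\theta(a \mid s)
\]
```

Under this reading, SFT and on-policy RL differ only in which advantage estimate and which data distribution are plugged in, which is consistent with the abstract's claim that they are instances of a single optimization process.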
[NLP-5] No Thoughts Just AI: Biased LLM Recommendations Limit Human Agency in Resume Screening AAAI
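[Quick read]: This paper examines how people's hiring decisions are affected when they collaborate with racially biased AI, and in particular reveals people's unwitting reliance on AI recommendations and the discriminatory effect this can have on candidates from particular racial groups. First, a large-scale resume-screening experiment with simulated AI models that exhibit race-based preferences (bias) quantifies how human behavior changes under different AI conditions. Second, implicit association tests (IATs) measure individuals' unconscious associations between race and status; completing an IAT before screening can increase the likelihood of selecting candidates who do not fit common race-status stereotypes by 13%. Finally, even when people consider the AI's recommendations low quality or unimportant, their decisions can still be swayed by AI bias, so interventions spanning organizational policy, user training, and system oversight are needed to limit the transmission of bias in human-AI collaboration.
Link: https://arxiv.org/abs/2509.04404
Authors: Kyra Wilson, Mattea Sim, Anna-Maria Gueorguieva, Aylin Caliskan
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Published in Proceedings of the 2025 AAAI/ACM Conference on AI, Ethics, and Society; code available at this https URL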
Abstract:In this study, we conduct a resume-screening experiment (N=528) where people collaborate with simulated AI models exhibiting race-based preferences (bias) to evaluate candidates for 16 high and low status occupations. Simulated AI bias approximates factual and counterfactual estimates of racial bias in real-world AI systems. We investigate people’s preferences for White, Black, Hispanic, and Asian candidates (represented through names and affinity groups on quality-controlled resumes) across 1,526 scenarios and measure their unconscious associations between race and status using implicit association tests (IATs), which predict discriminatory hiring decisions but have not been investigated in human-AI collaboration. When making decisions without AI or with AI that exhibits no race-based preferences, people select all candidates at equal rates. However, when interacting with AI favoring a particular group, people also favor those candidates up to 90% of the time, indicating a significant behavioral shift. The likelihood of selecting candidates whose identities do not align with common race-status stereotypes can increase by 13% if people complete an IAT before conducting resume screening. Finally, even if people think AI recommendations are low quality or not important, their decisions are still vulnerable to AI bias under certain circumstances. This work has implications for people’s autonomy in AI-HITL scenarios, AI and work, design and evaluation of AI hiring systems, and strategies for mitigating bias in collaborative decision-making tasks. In particular, organizational and regulatory policy should acknowledge the complex nature of AI-HITL decision making when implementing these systems, educating people who use them, and determining which are subject to oversight.
[NLP-6] Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios EMNLP2025
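[Quick read]: This paper addresses the fact that the safety challenges Multimodal Large Language Models (MLLMs) face in real-world multimodal safety scenarios (RMS) are not adequately covered: existing risk-oriented dataset construction methods cannot keep up with increasingly complex real-world scenarios, and there is no unified evaluation metric to verify their effectiveness. The key is an image-oriented self-adaptive dataset construction method that starts from images and automatically constructs paired text and guidance responses, producing a safety dataset of 35k image-text pairs. The paper also introduces a standardized evaluation metric based on fine-tuning a safety judge model, and experiments on multiple tasks indicate that the approach is scalable and effective.
Link: https://arxiv.org/abs/2509.04403
Authors: Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao
Affiliations: Shanghai Artificial Intelligence Laboratory; Tongji University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted at EMNLP 2025 Findings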
Abstract: Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current dataset construction methods, which are risk-oriented, fail to cover the growing complexity of real-world multimodal safety scenarios (RMS). And due to the lack of a unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented self-adaptive dataset construction method for RMS, which starts with images and ends by constructing paired text and guidance responses. Using the image-oriented method, we automatically generate an RMS dataset comprising 35k image-text pairs with guidance responses. Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline. The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective for the construction of real-world multimodal safety datasets.
[NLP-7] Contextualized Token Discrimination for Speech Search Query Correction
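[Quick read]: This paper addresses query spelling correction in speech search, where erroneous transcriptions from automatic speech recognition (ASR) systems obscure what the user intended. The key is a new method called Contextualized Token Discrimination (CTD): BERT first generates token-level contextualized representations, a composition layer then enhances semantic information, and finally, based on the aggregated token representation, incorrect tokens are identified and corrected by comparing the original token representations with the contextualized ones.
Link: https://arxiv.org/abs/2509.04393
Authors: Junyu Lu, Di Jiang, Mengze Hong, Victor Junqiu Wei, Qintian Guo, Zhiyang Su
Affiliations: WeBank; Hong Kong Polytechnic University; Macau University of Science and Technology; Hong Kong University of Science and Technology
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: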
Abstract:Query spelling correction is an important function of modern search engines since it effectively helps users express their intentions clearly. With the growing popularity of speech search driven by Automated Speech Recognition (ASR) systems, this paper introduces a novel method named Contextualized Token Discrimination (CTD) to conduct effective speech query correction. In CTD, we first employ BERT to generate token-level contextualized representations and then construct a composition layer to enhance semantic information. Finally, we produce the correct query according to the aggregated token representation, correcting the incorrect tokens by comparing the original token representations and the contextualized representations. Extensive experiments demonstrate the superior performance of our proposed method across all metrics, and we further present a new benchmark dataset with erroneous ASR transcriptions to offer comprehensive evaluations for audio query correction.
[NLP-8] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases
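[Quick read]: The problem this paper examines is that current evaluations of gender bias in Large Language Models (LLMs) may be skewed by how task prompts are designed, which undermines the reliability and ecological validity of the results. Evaluations often rely on carefully constructed prompts that overtly or covertly steer the model toward gender-related content, so the measured bias may be unrepresentative or hard to reproduce. The paper systematically examines how the salience of the testing context (the task's evaluative purpose) and of gender-focused content in the prompt affects measured gender bias, comparing four task formats with both token-probability and discrete-choice metrics. It finds that minor prompt changes can substantially alter or even reverse the direction of bias, and that discrete-choice metrics tend to amplify bias relative to probabilistic measures. This exposes the brittleness of the current evaluation paradigm and raises a new challenge for benchmark design: how to avoid triggering a model's "testing mode" behavior so that results reflect how it behaves in natural settings.
Link: https://arxiv.org/abs/2509.04373
Authors: Bufan Gao, Elisa Kreiss
Affiliations: The University of Chicago; University of California, Los Angeles
Categories: Computation and Language (cs.CL)
Comments: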
Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings do not only highlight the brittleness of LLM gender bias evaluations but open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM "testing mode" performance, and what does this mean for the ecological validity of future benchmarks?
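A toy calculation shows one way a discrete-choice metric can amplify a bias that a probability-based metric reports as small. The numbers and the helper names `probability_gap` and `choice_gap` are hypothetical and are not taken from the paper; each pair holds the probabilities a model assigns to two competing completions (option A and option B) for one prompt.

```python
def probability_gap(pairs):
    """Mean difference in probability mass between option A and option B."""
    return sum(p_a - p_b for p_a, p_b in pairs) / len(pairs)


def choice_gap(pairs):
    """Difference in how often each option wins when the model must pick one."""
    picks_a = sum(1 for p_a, p_b in pairs if p_a > p_b)
    picks_b = sum(1 for p_a, p_b in pairs if p_b > p_a)
    return (picks_a - picks_b) / len(pairs)


# Hypothetical per-prompt probabilities for (option A, option B).
pairs = [(0.55, 0.45), (0.60, 0.40), (0.52, 0.48), (0.45, 0.55)]

print("token-probability gap:", round(probability_gap(pairs), 2))
print("discrete-choice gap:", choice_gap(pairs))
```

Here the average probability difference is small, but forcing a single choice per prompt turns every slight preference into a full vote, so the discrete metric reports a much larger gap.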
[NLP-9] PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation
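[Quick read]: This paper addresses the difficulty automatic speech recognition (ASR) systems have with domain-specific named entities, especially errors caused by homophones. Existing contextual ASR methods improve recognition but, limited by entity diversity, struggle to capture fine-grained phoneme variations; treating entities as independent tokens also leads to incomplete biasing for multi-token entities. The solution is Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which combines phoneme-aware encoding to sharpen phonetic discrimination, contrastive entity disambiguation to ensure complete entity retrieval, and entity-level supervision with hierarchical entity filtering to reduce false positives under uncertainty. Under 1,000 distractors, PARCO reaches a CER of 4.22% on Chinese AISHELL-1 and a WER of 11.14% on English DATA2, and it also shows robust gains on out-of-domain datasets.
Link: https://arxiv.org/abs/2509.04357
Authors: Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda
Affiliations: CyberAgent; Nagoya University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted by ASRU 2025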
Abstract:Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.
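For readers unfamiliar with the reported metrics, CER and WER are conventionally computed as edit distance divided by reference length, over characters and words respectively. The sketch below uses that standard definition with a made-up homophone example; the function names are hypothetical and the paper's exact scoring scripts may differ (for example in text normalization).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# A homophone error on a named entity: one wrong word out of four.
print(wer("call doctor wright now", "call doctor right now"))
print(cer("wright", "right"))
```

A single homophone substitution costs a full word error even though it differs by one character, which is why entity-heavy domains can show a high WER despite acoustically close output.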
[NLP-10] Psychologically Enhanced AI Agents
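[Quick read]: This paper addresses limitations in controlling and interpreting the behavior of Large Language Model (LLM) agents, namely how psychological theory can be used to steer agent behavior in a controllable, stable, and interpretable way. The key is the MBTI-in-Thoughts framework, which draws on the Myers-Briggs Type Indicator (MBTI) and uses prompt engineering to prime LLM agents with personality archetypes, inducing predictable behavioral biases along two foundational axes, cognition and affect. The method steers agent behavior without any fine-tuning, supports experiments with structured multi-agent communication protocols, uses the 16Personalities test for automated verification of personality traits, and generalizes to other personality models such as the Big Five, HEXACO, and the Enneagram.
Link: https://arxiv.org/abs/2509.04343
Authors: Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: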
Abstract:We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method primes agents with distinct personality archetypes via prompt engineering, enabling control over behavior along two foundational axes of human psychology, cognition and affect. We show that such personality priming yields consistent, interpretable behavioral biases across diverse tasks: emotionally expressive agents excel in narrative generation, while analytically primed agents adopt more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning quality. To ensure trait persistence, we integrate the official 16Personalities test for automated verification. While our focus is on MBTI, we show that our approach generalizes seamlessly to other psychological frameworks such as Big Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.
[NLP-11] Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models EMNLP2025
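[Quick read]: This paper addresses the memorization of outdated knowledge by Large Language Models (LLMs) in medical applications, a consequence of their reliance on static training data that can lead to harmful medical advice or failures in clinical reasoning. The key contribution is two new question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Evaluating eight prominent LLMs on these datasets reveals consistent reliance on outdated knowledge across all models. The paper further analyzes the influence of obsolete pre-training data and training strategies and proposes directions for mitigation, laying groundwork for more current and reliable medical AI systems.
Link: https://arxiv.org/abs/2509.04304
Authors: Juraj Vladika, Mahdi Dhaini, Florian Matthes
Affiliations: Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of EMNLP 2025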
Abstract:The growing capabilities of Large Language Models (LLMs) show significant potential to enhance healthcare by assisting medical researchers and physicians. However, their reliance on static training data is a major risk when medical recommendations evolve with new research and developments. When LLMs memorize outdated medical knowledge, they can provide harmful advice or fail at clinical reasoning tasks. To investigate this problem, we introduce two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Our evaluation of eight prominent LLMs on the datasets reveals consistent reliance on outdated knowledge across all models. We additionally analyze the influence of obsolete pre-training data and training strategies to explain this phenomenon and propose future directions for mitigation, laying the groundwork for developing more current and reliable medical AI systems.
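[NLP-12] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
[Quick read]: This paper addresses the cognitive inertia that Large Language Models (LLMs) exhibit when given instructions that conflict with the standardized patterns learned during training, that is, their difficulty in adapting their behavior to follow adversarial or counter-intuitive instructions. The key is a new benchmark, Inverse IFEval, which uses eight types of challenges (such as Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering) to evaluate how well models override training-induced biases in unconventional contexts. A human-in-the-loop data construction pipeline and an optimized LLM-as-a-Judge framework yield a dataset of 1012 high-quality Chinese and English questions across 23 domains. The benchmark serves as a diagnostic tool for cognitive inertia and pushes future alignment work beyond fluency and factual correctness toward adaptability and reliable instruction following.
Link: https://arxiv.org/abs/2509.04292
Authors: Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: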
Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
[NLP-13] Explicit and Implicit Data Augmentation for Social Event Detection
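[Quick read]: This paper addresses the high, labor-intensive annotation cost that arises because Social Event Detection (SED) depends on labeled data. The key is SED-Aug, a plug-and-play dual augmentation framework that combines explicit text-based augmentation with implicit feature-space augmentation. Explicit augmentation uses large language models to enrich textual information through five diverse generation strategies. Implicit augmentation applies five novel perturbation techniques in the feature space of structural fused embeddings, preserving their semantic and relational properties while increasing diversity, which improves model robustness and performance. In terms of average F1 score, the method outperforms the best baseline by approximately 17.67% on Twitter2012 and by about 15.57% on Twitter2018.
Link: https://arxiv.org/abs/2509.04202
Authors: Congbo Ma, Yuxia Wang, Jia Wu, Jian Yang, Jing Du, Zitai Qiu, Qing Li, Hu Wang, Preslav Nakov
Affiliations: Macquarie University; Mohamed bin Zayed University of Artificial Intelligence; New York University Abu Dhabi
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: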
Abstract:Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available at GitHub: this https URL.
[NLP-14] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions
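[Quick read]: This paper addresses the scarcity of high-quality, privacy-compliant data for psychological counseling, which is needed to fine-tune open-source Large Language Models (LLMs) for scalable counseling. Its core solution is MAGneT, a multi-agent framework that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Compared with earlier single-agent approaches, this design better captures the structure and nuance of real counseling. The paper also builds a unified evaluation framework that integrates automatic and expert metrics and expands expert evaluation from four aspects to nine. On the cognitive therapy rating scale (CTRS), MAGneT-generated sessions improve general counseling skills by 3.2% and CBT-specific skills by 4.3% on average over existing methods, and experts prefer them in 77.2% of cases. Fine-tuning an open-source model on these sessions yields further gains of 6.3% on general counseling skills and 7.3% on CBT-specific skills over models fine-tuned on baseline-generated sessions.
Link: https://arxiv.org/abs/2509.04183
Authors: Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych
Affiliations: University of Mannheim; German Research Center for Artificial Intelligence; Indian Institute of Technology, Kharagpur; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 29 figures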
Abstract:The growing demand for scalable psychological counseling highlights the need for fine-tuning open-source Large Language Models (LLMs) with high-quality, privacy-compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi-agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single-agent approaches, MAGneT better captures the structure and nuance of real counseling. In addition, we address inconsistencies in prior evaluation protocols by proposing a unified evaluation framework integrating diverse automatic and expert metrics. Furthermore, we expand the expert evaluations from four aspects of counseling in previous works to nine aspects, enabling a more thorough and robust assessment of data quality. Empirical results show that MAGneT significantly outperforms existing methods in quality, diversity, and therapeutic alignment of the generated counseling sessions, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on average on cognitive therapy rating scale (CTRS). Crucially, experts prefer MAGneT-generated sessions in 77.2% of cases on average across all aspects. Moreover, fine-tuning an open-source model on MAGneT-generated sessions shows better performance, with improvements of 6.3% on general counseling skills and 7.3% on CBT-specific skills on average on CTRS over those fine-tuned with sessions generated by baseline methods. We also make our code and data public.
[NLP-15] Joint Modeling of Entities and Discourse Relations for Coherence Assessment EMNLP2025
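[Quick read]: This paper addresses the fact that existing coherence modeling methods focus on either entity features or discourse relation features and neglect their combination. The key is two methods for jointly modeling entities and discourse relations. Experiments on three benchmark datasets show that integrating both types of features significantly improves coherence assessment models, which demonstrates the value of modeling entity consistency and discourse relations together.
Link: https://arxiv.org/abs/2509.04182
Authors: Wei Liu, Michael Strube
Affiliations: Heidelberg Institute for Theoretical Studies gGmbH
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025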
Abstract:In linguistics, coherence can be achieved by different means, such as by maintaining reference to the same set of entities across sentences and by establishing discourse relations between them. However, most existing work on coherence modeling focuses exclusively on either entity features or discourse relation features, with little attention given to combining the two. In this study, we explore two methods for jointly modeling entities and discourse relations for coherence assessment. Experiments on three benchmark datasets show that integrating both types of features significantly enhances the performance of coherence models, highlighting the benefits of modeling both simultaneously for coherence evaluation.
[NLP-16] Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds
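[Quick read]: This paper addresses the open question of how well self-supervised speech models transfer to non-speech bioacoustic data. Its core approach is to test whether speech-pretrained models such as HuBERT, WavLM, and XEUS produce rich latent representations of animal sounds across taxa, to analyze time-averaged representations with linear probing, to add downstream architectures that account for temporal information, and to evaluate the effect of frequency range and noise on performance. The results are competitive with fine-tuned bioacoustic pre-trained models, highlighting the potential of speech-based self-supervised learning as an efficient framework for bioacoustic research.
Link: https://arxiv.org/abs/2509.04166
Authors: Jules Cauzinille, Marius Miron, Olivier Pietquin, Masato Hagiwara, Ricard Marxer, Arnaud Rey, Benoit Favre
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 5 pages, 3 figures, uses this http URL, submitted to DCASE 2025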
Abstract:Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic detection and classification tasks. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. We analyze the models properties with linear probing on time-averaged representations. We then extend the approach to account for the effect of time-wise information with other downstream architectures. Finally, we study the implication of frequency range and noise on performance. Notably, our results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups. These findings highlight the potential of speech-based self-supervised learning as an efficient framework for advancing bioacoustic research.
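[NLP-17] Towards an Action-Centric Ontology for Cooking Procedures Using Temporal Graphs
[Quick read]: This paper addresses the difficulty of formally modeling cooking procedures, whose inherent complexity and ambiguity hinder automated analysis and execution. The key to its solution is an extensible domain-specific language (DSL) that represents recipes as directed action graphs, capturing processes, transfers, environments, concurrency, and compositional structure. The approach enables modular modeling of complex culinary workflows and lays the groundwork for an action-centric ontology based on temporal graphs, supporting structured machine understanding, precise interpretation, and scalable automation of cooking processes.
Link: https://arxiv.org/abs/2509.04159
Authors: Aarush Kumbhakern, Saransh Kumar Gupta, Lipika Dey, Partha Pratim Das
Affiliation: Ashoka University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, 1 table, 11 references, ACM International Conference on Multimedia 2025 - Multi-modal Food Computing Workshop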
Abstract:Formalizing cooking procedures remains a challenging task due to their inherent complexity and ambiguity. We introduce an extensible domain-specific language for representing recipes as directed action graphs, capturing processes, transfers, environments, concurrency, and compositional structure. Our approach enables precise, modular modeling of complex culinary workflows. Initial manual evaluation on a full English breakfast recipe demonstrates the DSL’s expressiveness and suitability for future automated recipe analysis and execution. This work represents initial steps towards an action-centric ontology for cooking, using temporal graphs to enable structured machine understanding, precise interpretation, and scalable automation of culinary processes - both in home kitchens and professional culinary settings.
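To make the idea of a recipe as a directed action graph more concrete, here is a minimal Python sketch. It is not the paper's DSL: the action names and the `concurrent_stages` helper are hypothetical, and the graph only records which actions must finish before another can start. It uses the standard-library `graphlib` module (Python 3.9+) to group actions into stages that could run concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical recipe fragment: each action maps to the set of actions it depends on.
# Action names are illustrative only and are not taken from the paper's DSL.
recipe = {
    "heat_pan": set(),
    "fry_bacon": {"heat_pan"},
    "fry_eggs": {"heat_pan"},
    "toast_bread": set(),
    "plate": {"fry_bacon", "fry_eggs", "toast_bread"},
}


def concurrent_stages(graph):
    """Group actions into stages; actions within one stage do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the recipe contains a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output deterministic
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = concurrent_stages(recipe)
for number, actions in enumerate(stages, start=1):
    print(number, actions)
```

A full representation along the lines the abstract describes would also need to model transfers and environments; this sketch covers only ordering and concurrency.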
[NLP-18] MultiWikiQA: A Reading Comprehension Benchmark in 300 Languages
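[Quick read]: This paper addresses the scarcity of multilingual reading comprehension evaluation datasets, with the aim of supporting systematic evaluation of cross-lingual NLP models. The key to its solution is MultiWikiQA, a reading comprehension dataset covering 306 languages, in which contexts come from Wikipedia articles, questions are generated by an LLM, and answers appear verbatim in the articles. A crowdsourced human evaluation of question fluency in 30 of the languages provides evidence of question quality, and tests on 6 language models of different architectures and sizes reveal large performance differences across languages, indicating that the benchmark is sufficiently difficult.
Link: https://arxiv.org/abs/2509.04111
Authors: Dan Saattrup Smart
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: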
Abstract:We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.
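[NLP-19] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
[Quick read]: This paper addresses the lack of effective lexical alignment mechanisms in conversational agents, in particular how to build stable, personalised lexical profiles that support more natural human-agent interaction given recent advances in large language models (LLMs). The key to its solution is to vary the amount of transcribed spoken data and the number of items per part-of-speech (POS) category, and to evaluate the resulting lexical profiles with recall, coverage, and cosine similarity. The study finds that compact profiles built from only 10 minutes of transcribed speech, with 5 items each for adjectives and conjunctions and 10 items each for adverbs, nouns, pronouns, and verbs, offer the best balance of performance and data efficiency, providing a basis for lexical alignment strategies in conversational agents.
Link: https://arxiv.org/abs/2509.04104
Authors: Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted for TSD 2025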
Abstract:Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
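The profile construction and the three metrics named above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the input is assumed to be already POS-tagged, the toy utterances are made up, and the exact definitions of recall and coverage used in the paper are not given in the abstract, so the ones below are plausible stand-ins.

```python
from collections import Counter
import math


def build_profile(tagged_tokens, items_per_pos):
    """Keep the most frequent words per POS category, together with their counts."""
    counts = {}
    for word, pos in tagged_tokens:
        counts.setdefault(pos, Counter())[word.lower()] += 1
    profile = {}
    for pos, limit in items_per_pos.items():
        for word, freq in counts.get(pos, Counter()).most_common(limit):
            profile[(word, pos)] = freq
    return profile


def recall(profile, later_profile):
    """Assumed definition: share of profile items that reappear in a later profile."""
    if not profile:
        return 0.0
    return len(set(profile) & set(later_profile)) / len(profile)


def coverage(profile, tagged_tokens):
    """Assumed definition: share of later tokens that match an item in the profile."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for word, pos in tagged_tokens if (word.lower(), pos) in profile)
    return hits / len(tagged_tokens)


def cosine_similarity(p, q):
    """Cosine similarity between two profiles treated as sparse count vectors."""
    dot = sum(p[key] * q[key] for key in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Per-category limits follow the configuration reported in the abstract.
limits = {"ADJ": 5, "CONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}

# Made-up tagged tokens standing in for an early and a later stretch of speech.
early = [("really", "ADV"), ("really", "ADV"), ("nice", "ADJ"),
         ("walk", "VERB"), ("walk", "VERB"), ("dog", "NOUN")]
later = [("really", "ADV"), ("nice", "ADJ"), ("nice", "ADJ"),
         ("run", "VERB"), ("dog", "NOUN")]

early_profile = build_profile(early, limits)
later_profile = build_profile(later, limits)
print(recall(early_profile, later_profile))
print(coverage(early_profile, later))
print(cosine_similarity(early_profile, later_profile))
```

Comparing a profile built from early speech with later speech from the same speaker is one way to probe stability over time, which is the property the study evaluates.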
[NLP-20] Improving Narrative Classification and Explanation via Fine Tuned Language Models
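[Quick read]: This paper addresses the difficulty of identifying covert narratives and implicit messaging in news articles, and of generating concise, evidence-based explanations for dominant narratives. The key to its solution is twofold. First, a BERT model is fine-tuned with a recall-oriented approach for multi-label classification of narratives and sub-narratives, and its predictions are refined for consistency by a GPT-4o pipeline. Second, a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting generates grounded narrative explanations, and a structured taxonomy table serves as an auxiliary knowledge base to improve factual accuracy and reduce hallucinations.
Link: https://arxiv.org/abs/2509.04077
Authors: Rishit Tyagi, Rahul Bouri, Mohit Gupta
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: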
Abstract:Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment. Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas. This study tackles two key challenges: (1) multi-label classification of narratives and sub-narratives in news articles, and (2) generating concise, evidence-based explanations for dominant narratives. We fine-tune a BERT model with a recall-oriented approach for comprehensive narrative detection, refining predictions using a GPT-4o pipeline for consistency. For narrative explanation, we propose a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting, ensuring grounded and relevant justifications. To enhance factual accuracy and reduce hallucinations, we incorporate a structured taxonomy table as an auxiliary knowledge base. Our results show that integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering.
[NLP-21] Arabic Chatbot Technologies in Education: An Overview
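[Quick read]: This paper addresses the shortage of research on and applications of chatbots in Arabic education. Against the background of rapid progress in Large Language Models (LLMs), it finds that most Arabic educational chatbots still rely on traditional methods and have not fully adopted modern Natural Language Processing (NLP) techniques. The key to its approach is a systematic survey of existing Arabic educational chatbots, covering the approaches adopted, the language variety, and the metrics used to measure performance. The survey identifies research gaps and outlines future directions for Arabic educational chatbots built on modern models such as BERT and GPT.
Link: https://arxiv.org/abs/2509.04066
Authors: Hicham Bourhil, Yacine El Younoussi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published as a book chapter in: Transformación Digital en la Educación: Innovaciones y Desafíos desde los Campus Virtuales (UA Journals, 2024), pp. 11-14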
Abstract:The recent advancements in Artificial Intelligence (AI) in general, and in Natural Language Processing (NLP) in particular, and some of its applications such as chatbots, have led to their implementation in different domains like education, healthcare, tourism, and customer service. Since the COVID-19 pandemic, there has been an increasing interest in these digital technologies to allow and enhance remote access. In education, e-learning systems have been massively adopted worldwide. The emergence of Large Language Models (LLMs) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformers) made chatbots even more popular. In this study, we present a survey on existing Arabic chatbots in education and their different characteristics such as the adopted approaches, language variety, and metrics used to measure their performance. We were able to identify some research gaps when we discovered that, despite the success of chatbots in other languages such as English, only a few educational Arabic chatbots used modern techniques. Finally, we discuss future directions of research in this field.
[NLP-22] Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning
[Quick read]: This paper addresses the weakness of current Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) at reading sheet music, in particular the lack of evaluation benchmarks and training data for sheet music reasoning. The key to its solution is a method for synthesizing sheet music problems grounded in music theory rules, which generates verifiable questions in both textual and visual modalities. This yields the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set used for reinforcement learning with verifiable rewards (RLVR). The approach improves model reasoning on sheet music and, according to the authors, can also facilitate downstream tasks such as music composition.
Link: https://arxiv.org/abs/2509.04059
Authors: Zhilin Wang, Zhe Yang, Yun Luo, Yafu Li, Haoran Zhang, Runzhe Zhan, Derek F. Wong, Jizhe Zhou, Yu Cheng
Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; Sichuan University; Shanghai Jiao Tong University; University of Macau; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: 11 pages
Abstract:Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose the idea of synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench show the importance of models’ reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench. Besides, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 with the strategies of Role play and Chain-of-Thought. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. In conclusion, we are the first to propose the idea of synthesizing sheet music problems based on music theory rules, and demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.
[NLP-23] A RoBERTa-Based Functional Syntax Annotation Model for Chinese Texts
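[Quick read]: This paper addresses the lack of an automatic annotation system for Chinese texts based on Systemic Functional Grammar (SFG) and its branch, Cardiff Grammar, which limits the application and promotion of these theories in natural language processing. The key to its solution is a functional syntax annotation framework built on the pre-trained RoBERTa-Chinese wwm-ext model. The authors randomly select 4,100 sentences from the People's Daily 2014 corpus and annotate them according to functional syntax theory to create a training dataset. They then fine-tune the model on this dataset, framing the identification of core syntactic elements such as Subject (S), Main Verb (M), and Complement (C) as a named entity recognition task. The model reaches an F1 score of 0.852 on the test set, significantly outperforming the comparison models and supporting the feasibility of combining functional syntax with attention-based models.
Link: https://arxiv.org/abs/2509.04046
Authors: Han Xiaohui, Zhang Yunlong, Guo Yuxi
Affiliations: Harbin Institute of Technology; Chalmers University of Technology
Categories: Computation and Language (cs.CL)
Comments: The paper includes 10 pages, 6 tables, and 4 figures. This project is completed with the assistance of National Center for Language Technology and Digital Economy Research (No. GJLX20250002), and is funded by Heilongjiang Language Research Committee Project Construction of an Adaptive Intelligent Chinese Learning Platform for International Students in China (No. G2025Y003)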
Abstract:Systemic Functional Grammar and its branch, Cardiff Grammar, have been widely applied to discourse analysis, semantic function research, and other tasks across various languages and texts. However, an automatic annotation system based on this theory for Chinese texts has not yet been developed, which significantly constrains the application and promotion of relevant theories. To fill this gap, this research introduces a functional syntax annotation model for Chinese based on RoBERTa (Robustly Optimized BERT Pretraining Approach). The study randomly selected 4,100 sentences from the People’s Daily 2014 corpus and annotated them according to functional syntax theory to establish a dataset for training. The study then fine-tuned the RoBERTa-Chinese wwm-ext model based on the dataset to implement the named entity recognition task, achieving an F1 score of 0.852 on the test set that significantly outperforms other comparative models. The model demonstrated excellent performance in identifying core syntactic elements such as Subject (S), Main Verb (M), and Complement (C). Nevertheless, there remains room for improvement in recognizing entities with imbalanced label samples. As the first integration of functional syntax with attention-based NLP models, this research provides a new method for automated Chinese functional syntax analysis and lays a solid foundation for subsequent studies.
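[NLP-24] What if I ask in alia lingua? Measuring Functional Similarity Across Languages
[Quick read]: This paper addresses the consistency of multilingual LLM outputs, namely how the similarity of a model's answers across languages changes with model size and capability. The key to its approach is to apply a recently proposed model similarity metric, κ_p, to 20 languages and 47 subjects in GlobalMMLU. The analysis shows that larger and more capable models give more consistent outputs across languages, and that a model agrees more with itself across languages than with other models prompted in the same language. These results support κ_p as a practical tool for evaluating multilingual reliability and as a quantitative guide for building more consistent multilingual systems.
Link: https://arxiv.org/abs/2509.04032
Authors: Debangan Mishra, Arihant Rastogi, Agyeya Negi, Shashwat Goel, Ponnurangam Kumaraguru
Affiliations: IIIT Hyderabad; ELLIS Institute Tübingen
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint, 11 Pages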
Abstract:How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric κ_p applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model’s responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of κ_p as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
[NLP-25] CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning
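[Quick read]: This paper addresses the failure of traditional token-level Reinforcement Learning (RL) frameworks to align with the complex, multi-step reasoning of Large Language Models (LLMs), such as Chain-of-Thought (CoT). The key to its solution is CoT-Space, a theoretical framework that recasts LLM reasoning from a discrete token-prediction task into an optimization process in a continuous, reasoning-level semantic space. Analyzing this process from a noise perspective and a risk perspective, the framework shows that convergence to an optimal CoT length follows from the trade-off between underfitting and overfitting. This offers a theoretical explanation for phenomena such as overthinking and a foundation for more effective and generalizable reasoning agents.
Link: https://arxiv.org/abs/2509.04027
Authors: Zeyu Gan, Hao Yi, Yong Liu
Affiliations: Gaoling School of Artificial Intelligence; Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint Edition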
Abstract:Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents.
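[NLP-26] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs ECAI2025
[Quick read]: This paper addresses a limitation of current evaluation methods for Large Language Models (LLMs): benchmarks such as MMLU, ARC-C, and HellaSwag present questions in their original, fixed wording and therefore do not reflect the linguistic variability of real-world inputs, where meaning is preserved but form varies. As a result, high scores may not accurately measure robustness in practice. The key to its approach is to systematically generate paraphrases of all questions in six common benchmarks and to compare 34 LLMs of different sizes and effectiveness on them, quantifying how effectiveness changes under rewording. The results show that model rankings remain relatively stable while absolute scores decline significantly, indicating that LLMs struggle with linguistic variability and motivating robustness-aware benchmarks that better reflect practical deployment.
Link: https://arxiv.org/abs/2509.04013
Authors: Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero
Affiliation: University of Udine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ECAI 2025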
Abstract:Large Language Models (LLMs) effectiveness is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks, and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs, of different size and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change, and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model’s robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
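The two quantities contrasted above, stability of model rankings and change in absolute scores, can be computed as in the following Python sketch. All data and model names are made up for illustration, and the use of Spearman rank correlation is an assumption; the abstract does not say which ranking statistic the authors report.

```python
def accuracy(results):
    """Fraction of questions answered correctly (results are 0/1 flags)."""
    return sum(results) / len(results)


def ranks(scores):
    """Rank models by score, 1 = best. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score tables without ties."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Made-up per-question correctness for three hypothetical models.
original = {
    "model_a": [1, 1, 1, 1, 0],
    "model_b": [1, 1, 1, 0, 0],
    "model_c": [1, 1, 0, 0, 0],
}
paraphrased = {
    "model_a": [1, 1, 1, 0, 0],
    "model_b": [1, 1, 0, 0, 0],
    "model_c": [1, 0, 0, 0, 0],
}

orig_acc = {model: accuracy(r) for model, r in original.items()}
para_acc = {model: accuracy(r) for model, r in paraphrased.items()}
drop = {model: orig_acc[model] - para_acc[model] for model in orig_acc}
rho = spearman(orig_acc, para_acc)

print("rank correlation:", rho)
print("accuracy drop per model:", drop)
```

In this toy data every model loses accuracy on the paraphrased questions while the ordering of models is unchanged, which mirrors the pattern the paper describes.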
[NLP-27] NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings EMNLP2025
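[Quick read]: This paper addresses ad-hoc Named Entity Retrieval, a variant of Named Entity Recognition (NER) in which the entity types of interest are not provided in advance and a user-defined, open-ended type description is used to retrieve documents mentioning entities of that type. Traditional methods rely on fixed schemas or fine-tuned models and adapt poorly to dynamic, diverse type requirements. The key to its solution is to use internal representations of large language models (LLMs), specifically the value vectors from mid-layer transformer blocks, which encode fine-grained type information better than commonly used top-layer embeddings. On top of these, a lightweight contrastive projection network is trained to align type-compatible entities and separate unrelated types, producing compact, type-aware entity embeddings suited to nearest-neighbor search. The method enables scalable, schema-free entity retrieval.
Link: https://arxiv.org/abs/2509.04011
Authors: Or Shachar, Uri Katz, Yoav Goldberg, Oren Glickman
Affiliation: Bar-Ilan University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Findings of EMNLP 2025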
Abstract:We present NER Retriever, a zero-shot retrieval framework for ad-hoc Named Entity Retrieval, a variant of Named Entity Recognition (NER), where the types of interest are not provided in advance, and a user-defined type description is used to retrieve documents mentioning entities of that type. Instead of relying on fixed schemas or fine-tuned models, our method builds on internal representations of large language models (LLMs) to embed both entity mentions and user-provided open-ended type descriptions into a shared semantic space. We show that internal representations, specifically the value vectors from mid-layer transformer blocks, encode fine-grained type information more effectively than commonly used top-layer embeddings. To refine these representations, we train a lightweight contrastive projection network that aligns type-compatible entities while separating unrelated types. The resulting entity embeddings are compact, type-aware, and well-suited for nearest-neighbor search. Evaluated on three benchmarks, NER Retriever significantly outperforms both lexical and dense sentence-level retrieval baselines. Our findings provide empirical support for representation selection within LLMs and demonstrate a practical solution for scalable, schema-free entity retrieval. The NER Retriever Codebase is publicly available at this https URL
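The final retrieval step, searching a shared embedding space by nearest neighbor, can be illustrated with a toy Python sketch. The vectors, document names, and the `retrieve` helper are hypothetical. The paper derives its embeddings from mid-layer LLM value vectors passed through a contrastive projection network, which this sketch does not implement; it only shows cosine-similarity ranking over embeddings that are assumed to exist already.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k index entries whose embeddings are most similar to the query."""
    scored = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in scored[:k]]


# Toy entity-mention embeddings (made-up 3-dimensional vectors).
index = {
    "doc mentioning Paris": [0.9, 0.1, 0.0],
    "doc mentioning aspirin": [0.0, 0.2, 0.9],
    "doc mentioning Berlin": [0.8, 0.3, 0.1],
}

# Stands in for the embedding of a user-provided type description such as "European capital".
query = [1.0, 0.0, 0.0]

top = retrieve(query, index, k=2)
print(top)
```

A brute-force scan like this is fine for a handful of vectors; a real index over many entity mentions would typically use an approximate nearest-neighbor library instead.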
[NLP-28] RTQA: Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models EMNLP2025
[Quick read]: This paper addresses limitations of current Temporal Knowledge Graph Question Answering (TKGQA) methods on complex temporal queries: they primarily focus on implicit temporal constraints, have limited reasoning ability, and suffer from error propagation in decomposition frameworks. The key to its solution is RTQA, a training-free recursive framework that decomposes a question into sub-problems, solves them bottom-up using Large Language Models (LLMs) and Temporal Knowledge Graph (TKG) knowledge, and applies multi-path answer aggregation to improve fault tolerance, which strengthens reasoning over complex temporal queries.
Link: https://arxiv.org/abs/2509.03995
Authors: Zhaoyan Gong, Juan Li, Zhiqiang Liu, Lei Liang, Huajun Chen, Wen Zhang
Affiliations: Zhejiang University; Ant Group; ZJU-Ant Group Joint Lab of Knowledge Graph
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025
Abstract:Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability of handling more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in “Multiple” and “Complex” categories, outperforming state-of-the-art methods. Our code and data are available at this https URL.
[NLP-29] Promptception: How Sensitive Are Large Multimodal Models to Prompts? EMNLP2025
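[Quick read]: This paper addresses the poorly understood and highly sensitive nature of prompt design for Large Multimodal Models (LMMs) in Multiple-Choice Question Answering (MCQA), which undermines fair evaluation. The study finds that even minor changes in prompt phrasing and structure can shift accuracy by up to 15% for certain prompts and models, so reported performance often reflects a carefully selected best-case prompt rather than robust behavior, harming fairness and reproducibility. The key to its solution is Promptception, a systematic framework for evaluating prompt sensitivity that comprises 61 prompt types spanning 15 categories and 6 supercategories, each targeting a specific aspect of prompt formulation. Based on this analysis, the authors derive Prompting Principles tailored separately to proprietary and open-source models to enable more robust and fair LMM evaluation.
Link: https://arxiv.org/abs/2509.03986
Authors: Mohamed Insaf Ismithdeen, Muhammad Uzair Khattak, Salman Khan
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; Swiss Federal Institute of Technology Lausanne (EPFL); Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2025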
Abstract:Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.
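[NLP-30] Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study
[Quick read]: This paper addresses the insufficient support of current large language models for specific languages such as Korean in multilingual settings, in particular how to substantially improve Korean capability without sacrificing English performance. The key to its solution is Llama-3-Motif, a 102-billion-parameter model built on the Llama 3 architecture. It uses training techniques including LlamaPro and Masked Structure Growth to scale the model without altering the core Transformer architecture, is trained on hyperscale GPU clusters via the MoAI platform, and uses a carefully curated dataset with a balanced ratio of Korean and English data. On Korean-specific benchmarks it outperforms existing models and achieves results comparable to GPT-4.
Link: https://arxiv.org/abs/2509.03972
Authors: Junghwan Lim, Gangwon Jo, Sungmin Lee, Jiyoung Park, Dongseok Kim, Jihwan Kim, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Kibong Choi, Jaeyeon Huh, Beomgyu Kim, Jangwoong Kim, Taehyun Kim, Haesol Lee, Jeesoo Lee, Dongpin Oh, Changseok Song, Daewon Suh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: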
Abstract:We introduce Llama-3-Motif, a language model consisting of 102 billion parameters, specifically designed to enhance Korean capabilities while retaining strong performance in English. Developed on the Llama 3 architecture, Llama-3-Motif employs advanced training techniques, including LlamaPro and Masked Structure Growth, to effectively scale the model without altering its core Transformer architecture. Using the MoAI platform for efficient training across hyperscale GPU clusters, we optimized Llama-3-Motif using a carefully curated dataset that maintains a balanced ratio of Korean and English data. Llama-3-Motif shows decent performance on Korean-specific benchmarks, outperforming existing models and achieving results comparable to GPT-4.
[NLP-31] Exploring NLP Benchmarks in an Extremely Low-Resource Setting
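[Quick read]: This paper addresses the sharp drop in Large Language Model (LLM) performance on extremely low-resource languages, such as the endangered Romance language Ladin, caused by the scarcity of labeled data. The key to its solution is to use a small set of parallel Ladin-Italian sentence pairs to translate monolingual Italian data into synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA), with rigorous filtering and back-translation to ensure linguistic quality and reliability. Experiments further show that adding these synthetic datasets to machine translation training substantially improves on existing Italian-Ladin translation baselines. The resulting sentiment analysis and MCQA datasets are the first publicly available ones for Ladin.
Link: https://arxiv.org/abs/2509.03962
Authors: Ulin Nuha, Adam Jatowt
Affiliations: National Kaohsiung University of Science and Technology; University of Innsbruck
Categories: Computation and Language (cs.CL)
Comments: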
Abstract:The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.
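[NLP-32] CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking EMNLP2025
[Quick read]: This paper addresses the uncertainty about how well current Large Language Models (LLMs) can fact-check Chinese misinformation. The key to its solution is CANDY, a benchmark with a carefully annotated dataset of about 20,000 instances, used to systematically evaluate LLM capabilities and limitations in fact-checking. The study finds that current LLMs struggle to produce accurate fact-checking conclusions even with chain-of-thought reasoning and few-shot prompting, and a taxonomy of flawed explanations identifies factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, they show considerable potential to augment human performance when deployed as assistive tools.
Link: https://arxiv.org/abs/2509.03957
Authors: Ruiling Guo, Xinwei Yang, Chen Huang, Tong Zhang, Yong Hu
Affiliations: Sichuan University; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Findings of EMNLP 2025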
Abstract:The effectiveness of large language models (LLMs) to fact-check misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios. Our dataset and code can be accessed at this https URL
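[NLP-33] VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents
[Quick read]: This paper targets two core limitations in research on speech-based Role-Playing Conversational Agents (RPCAs). First, existing work focuses predominantly on the textual modality and overlooks paralinguistic features such as intonation, prosody, and rhythm, which are essential for conveying character emotion and shaping identity. Second, there is no standardized benchmark for quantifying core competencies such as long-term persona consistency. The key contribution is VoxRole, the first comprehensive benchmark designed specifically for speech-based RPCAs, comprising 13,335 multi-turn dialogues totaling 65.6 hours of speech from 1,228 unique characters across 261 movies. It is built with a two-stage automated pipeline that first aligns movie audio with scripts and then uses an LLM to systematically construct multi-dimensional profiles for each character, yielding a large-scale, structured, character-driven speech dataset for evaluating speech-based RPCAs.
Link: https://arxiv.org/abs/2509.03940
Authors: Weihao Wu, Liang Cao, Xinyu Wu, Zhiwei Lin, Rui Niu, Jingbei Li, Zhiyong Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: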
Abstract:Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13335 multi-turn dialogues, totaling 65.6 hours of speech from 1228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.
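[NLP-34] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning EMNLP2025
[Quick read]: This paper addresses the challenges that self-play fine-tuning (SPIN) faces on the Text-to-SQL task: SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model weakens the main model's ability to learn to generate accurate SQL. The proposed method, SPFT-SQL, is a self-play fine-tuning approach tailored to Text-to-SQL with two stages. Before self-play, a verification-based iterative fine-tuning strategy synthesizes high-quality training data from the database schema and validation feedback while building a base of models with varying capabilities. During self-play, an error-driven loss incentivizes incorrect outputs from the opponent model, so that the main model learns to distinguish correct SQL from erroneous SQL and improves its ability to generate correct queries.
Link: https://arxiv.org/abs/2509.03937
Authors: Yuhao Zhang, Shaoming Duan, Jinhang Su, Chuanyi Liu, Peiyi Han
Affiliations: Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; Mindflow.ai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Findings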
Abstract:Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model’s ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.
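[NLP-35] SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment
[Quick read]: This paper addresses the degradation of general capabilities caused by catastrophic forgetting when large language models (LLMs) undergo supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios. Existing methods either depend on general instruction data or struggle to preserve the model's original distribution, which limits the balance between performance gains and knowledge retention. The key idea is SelfAug, a self-distribution alignment method that aligns the logits of the input sequence with the original model's semantic distribution, mitigating catastrophic forgetting while improving downstream performance. Experiments show that it balances task-specific learning against retention of general capabilities.
Link: https://arxiv.org/abs/2509.03934
Authors: Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen
Affiliations: University of Science and Technology of China; Xiaohongshu Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: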
Abstract:Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model’s original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model’s semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at this https URL.
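As a rough illustration of what aligning input-sequence logits could look like, the sketch below adds a KL-divergence penalty between the original model's and the fine-tuned model's token distributions at each input position. The KL direction, the `weight` value, and every function name here are assumptions made for illustration; the abstract does not specify them, and the paper's actual objective may differ.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def alignment_loss(ref_logits, new_logits):
    """Mean KL between reference and fine-tuned distributions over input positions.

    ref_logits / new_logits: one list of vocabulary logits per input position
    (hypothetical layout, assumed for this sketch).
    """
    per_position = [
        kl_divergence(softmax(r), softmax(n))
        for r, n in zip(ref_logits, new_logits)
    ]
    return sum(per_position) / len(per_position)

def total_loss(task_loss, ref_logits, new_logits, weight=0.5):
    """Task loss plus a weighted alignment penalty (the weight is an assumed value)."""
    return task_loss + weight * alignment_loss(ref_logits, new_logits)
```

When the fine-tuned model's logits on the input match the original model's, the penalty is zero and only the task loss remains; the penalty grows as the two distributions drift apart.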
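[NLP-36] Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling
[Quick read]: This paper tackles computational emotion analysis of modern Korean poetry, whose figurative language and cultural specificity cause emotion classifiers trained on general corpora to perform poorly. The key contribution is KPoEM (Korean Poetry Emotion Mapping), a multi-label emotion dataset of 7,662 entries (7,007 line-level entries from 483 poems and 615 work-level entries) drawn from five influential Korean poets and annotated with 44 fine-grained emotion categories. The model is trained by sequential fine-tuning, first on general corpora and then on KPoEM, which markedly improves recognition of temporally and culturally specific emotional expressions while preserving the core sentiments of the poems.
Link: https://arxiv.org/abs/2509.03932
Authors: Iro Lim, Haein Ji, Byungjun Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 30 pages, 13 tables, 2 figures, Digital Humanities and Social Sciences Korea Conference, James Joo-Jin Kim Center for Korean Studies, University of Pennsylvania, Philadelphia, USA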
Abstract:This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset for computational emotion analysis in modern Korean poetry. Despite remarkable progress in text-based emotion classification using large language models, poetry, particularly Korean poetry, remains underexplored due to its figurative language and cultural specificity. We built a multi-label emotion dataset of 7,662 entries, including 7,007 line-level entries from 483 poems and 615 work-level entries, annotated with 44 fine-grained emotion categories from five influential Korean poets. A state-of-the-art Korean language model fine-tuned on this dataset significantly outperformed previous models, achieving 0.60 F1-micro compared to 0.34 from models trained on general corpora. The KPoEM model, trained through sequential fine-tuning (first on general corpora and then on the KPoEM dataset), demonstrates not only an enhanced ability to identify temporally and culturally specific emotional expressions, but also a strong capacity to preserve the core sentiments of modern Korean poetry. This study bridges computational methods and literary analysis, presenting new possibilities for the quantitative exploration of poetic emotions through structured data that faithfully retains the emotional and cultural nuances of Korean literature.
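For readers unfamiliar with the F1-micro figure quoted above, the following minimal sketch shows how micro-averaged F1 is commonly computed for multi-label predictions: true positives, false positives, and false negatives are pooled across all examples before computing a single F1. The emotion labels in the usage lines are hypothetical and do not come from the KPoEM label set; the paper's exact evaluation script is not described in the abstract.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label examples.

    gold, pred: lists of label collections, one per example.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels, for illustration only.
gold = [{"sorrow", "longing"}, {"joy"}]
pred = [{"sorrow"}, {"joy", "hope"}]
score = micro_f1(gold, pred)
```

Because counts are pooled, frequent labels influence the score more than rare ones, which is worth keeping in mind with 44 fine-grained categories.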
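[NLP-37] MTQA: Matrix of Thought for Enhanced Reasoning in Complex Question Answering
[Quick read]: This paper addresses the significant performance drop that large language models (LLMs) suffer on complex and abstract question answering due to insufficient reasoning capability. Existing approaches such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) are limited by single reasoning paths and in-layer redundancy, respectively, while Retrieval-Augmented Generation (RAG) helps with reasoning but struggles to exploit large amounts of information involving multiple entities and hops. The key idea is the Matrix of Thought (MoT), a new LLM thought structure that explores a problem along both horizontal and vertical dimensions through a "column-cell communication" mechanism, enabling multi-strategy, deep-level thinking while reducing redundancy within column cells. A fact-correction mechanism additionally builds knowledge units from retrieved knowledge graph triples and raw text to strengthen the initial knowledge and correct erroneous answers, resulting in an efficient and accurate QA framework (MTQA).
Link: https://arxiv.org/abs/2509.03918
Authors: Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: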
Abstract:Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs’ reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the “column-cell communication” mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4% of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at this https URL.
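The F1 and EM scores mentioned above are usually computed per answer string in QA evaluation. The sketch below shows one common formulation (token-overlap F1 and exact match after light normalization). The normalization used here, lowercasing and whitespace splitting, is an assumption; the abstract does not state which variant the authors use.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed normalization)."""
    return text.lower().split()

def exact_match(pred, gold):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall between two answers."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only exact answers, while token F1 gives partial credit, which is why the two are typically reported together.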
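[NLP-38] SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
[Quick read]: This paper addresses the unreliability of evaluation metrics for long image captioning: N-gram metrics fail to capture semantic correctness, while metrics based on large language models (LLMs) correlate strongly with human judgments but are too expensive for frequent evaluation during model iteration. The key contribution is SPECS (Specificity-Enhanced CLIPScore), a reference-free Representational Similarity (RS) metric that modifies CLIP with a new objective emphasizing specificity, rewarding correct details and penalizing incorrect ones. It matches open-source LLM-based metrics in correlation with human judgments while remaining far more efficient, making it practical for iterative checkpoint evaluation during captioning model development.
Link: https://arxiv.org/abs/2509.03897
Authors: Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva
Affiliations: MBZUAI; University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: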
Abstract:As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation to human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at this https URL.
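[NLP-39] MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation
[Quick read]: This paper addresses three problems of current mobile agents based on large language models (LLMs): 1) heavy reliance on the LLM's comprehension ability, which leads to errors from misoperations or omitted steps; 2) lack of interaction with the external environment, so tasks are often terminated when an app cannot fulfill the user's query; and 3) lack of memory, so the agent cannot reuse past information or learn from and correct previous mistakes. The key contribution is MobileRAG, a mobile agent framework enhanced with Retrieval-Augmented Generation (RAG) that comprises three components, InterRAG, LocalRAG, and MemRAG, to identify user queries more quickly and accurately and to complete complex, long-sequence mobile tasks.
Link: https://arxiv.org/abs/2509.03891
Authors: Gowen Loo, Chang Liu, Qinghong Yin, Xiang Chen, Jiawei Chen, Jingyuan Zhang, Yu Tian
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: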
Abstract:Smartphones have become indispensable in people’s daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agents framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: this https URL
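[NLP-40] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
[Quick read]: This paper examines the safety risk that large language models (LLMs) may comply with harmful instructions, focusing on the unreliability of current probing-based methods for detecting malicious inputs. The key finding is that existing probes mainly learn superficial patterns (instructional patterns and trigger words) rather than semantic harmfulness, which explains their poor out-of-distribution performance and creates a false sense of security. The study confirms this hypothesis through controlled experiments and argues that both models and evaluation protocols need to be redesigned.
Link: https://arxiv.org/abs/2509.03888
Authors: Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
Affiliations: National University of Singapore; Peking University; University of California, Davis
Categories: Computation and Language (cs.CL)
Comments: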
Abstract:Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs’ internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at this https URL.
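[NLP-41] A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models
[Quick read]: This survey addresses the still-limited understanding of how Chain-of-Thought (CoT) reasoning affects the trustworthiness of language models. Although CoT techniques have improved the accuracy and interpretability of large language models (LLMs), their effect on five core dimensions of trustworthiness (truthfulness, safety, robustness, fairness, and privacy) has not been systematically clarified. The paper provides a structured review of recent work, covering methods, findings, and limitations for each dimension in chronological order. It observes that while reasoning techniques can improve some trustworthiness properties, cutting-edge reasoning models themselves often face comparable or greater vulnerabilities in safety, robustness, and privacy.
Link: https://arxiv.org/abs/2509.03871
Authors: Yanbo Wang, Yongcan Yu, Jian Liang, Ran He
Affiliations: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 38 pages. This survey considers papers published up to June 30, 2025. Work in progress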
Abstract:The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at this https URL.
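[NLP-42] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth EMNLP2025
[Quick read]: This paper examines the limits of current large language models (LLMs) in understanding Drivelology, or "nonsense with depth": utterances that look like surface-level nonsense yet encode implicit meaning through pragmatic paradox, emotional loading, or rhetorical subversion. The key contribution is a benchmark of more than 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean, verified through multiple rounds of expert review and adjudication. Evaluating a range of LLMs on classification, generation, and reasoning tasks reveals systematic failures to distinguish Drivelology from shallow nonsense, pointing to a representational gap in LLMs' pragmatic understanding.
Link: https://arxiv.org/abs/2509.03867
Authors: Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin
Affiliations: The University of Manchester; Durham University; The University of Sheffield
Categories: Computation and Language (cs.CL)
Comments: Accepted for oral presentation at the EMNLP 2025 Main Conference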
Abstract:We introduce Drivelology, a unique linguistic phenomenon characterised as “nonsense with depth”, utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of the Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs’ pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.
[NLP-43] NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation
[Quick read]: This paper addresses frame-level localization of fake speech in Partial Audio Deepfake Detection (PADD), where the use of semantic information from audio, especially named entities, remains underexplored. The proposed method, NE-PADD, incorporates named entity knowledge through two parallel branches, Speech Name Entity Recognition (SpeechNER) and PADD, together with two attention aggregation mechanisms: Attention Fusion (AF), which combines attention weights from the branches, and Attention Transfer (AT), which uses an auxiliary loss to guide the PADD branch with named entity semantics.
Link: https://arxiv.org/abs/2509.03829
Authors: Huhong Xian, Rui Liu, Berrak Sisman, Haizhou Li
Affiliations: Inner Mongolia University; Center for Language and Speech Processing (CLSP), Johns Hopkins University; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Different from traditional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) requires frame-level positioning of the location of fake speech. While some progress has been made in this area, leveraging semantic information from audio, especially named entities, remains an underexplored aspect. To this end, we propose NE-PADD, a novel method for Partial Audio Deepfake Detection (PADD) that leverages named entity knowledge through two parallel branches: Speech Name Entity Recognition (SpeechNER) and PADD. The approach incorporates two attention aggregation mechanisms: Attention Fusion (AF) for combining attention weights and Attention Transfer (AT) for guiding PADD with named entity semantics using an auxiliary loss. Built on the PartialSpoof-NER dataset, experiments show our method outperforms existing baselines, proving the effectiveness of integrating named entity knowledge in PADD. The code is available at this https URL.
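[NLP-44] Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation
[Quick read]: This paper addresses the evaluation problem in document-level machine translation (doc-mt) created by large language models (LLMs) that emit whole-document outputs. Conventional evaluation assumes sentence-by-sentence alignment and cannot cope with omissions or one-to-many and many-to-one mappings. The proposed Align-then-Slide framework has two stages. The Align stage automatically infers sentence-level source-target correspondences and rebuilds the target to match the number of source sentences. The n-Chunk Sliding Evaluate stage computes averaged metric scores over 1-, 2-, 3-, and 4-chunk windows for multi-granularity assessment. On the WMT benchmark the method reaches a Pearson correlation of 0.929 with expert MQM rankings, and the preference data it produces can be used for CPO training and as a reward model for GRPO.
Link: https://arxiv.org/abs/2509.03809
Authors: Jiaxin Guo, Daimeng Wei, Yuanchang Luo, Xiaoyu Chen, Zhanglin Wu, Huan Yang, Hengchao Shang, Zongyao Li, Zhiqiang Rao, Jinlong Yang, Hao Yang
Affiliations: Huawei Translation Services Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: under preview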
Abstract:Large language models (LLMs) have ushered in a new era for document-level machine translation (doc-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method and expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
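The n-chunk sliding idea can be sketched as follows, assuming the Align stage has already produced hypothesis and reference lists with one entry per source sentence. The window stride of 1, the per-n averaging, and the `toy_metric` placeholder (a simple word-overlap score) are all assumptions for illustration; the paper would use a real translation metric and may aggregate differently.

```python
def toy_metric(hyp, ref):
    """Placeholder metric: Jaccard overlap of lowercase word sets (not a real MT metric)."""
    h, r = set(hyp.lower().split()), set(ref.lower().split())
    return len(h & r) / len(h | r) if h | r else 1.0

def n_chunk_scores(hyps, refs, metric=toy_metric, max_n=4):
    """Average a metric over sliding windows of n consecutive aligned sentences.

    hyps, refs: equal-length lists of sentences, assumed already aligned.
    Returns {n: mean score over all windows of size n} for n = 1..max_n.
    """
    if len(hyps) != len(refs):
        raise ValueError("hypotheses and references must be aligned one-to-one")
    scores = {}
    for n in range(1, max_n + 1):
        values = [
            metric(" ".join(hyps[i:i + n]), " ".join(refs[i:i + n]))
            for i in range(len(hyps) - n + 1)
        ]
        if values:  # skip window sizes longer than the document
            scores[n] = sum(values) / len(values)
    return scores
```

Larger windows let the metric see cross-sentence context, so a translation that moves content between adjacent sentences is penalized less at n = 2 to 4 than at n = 1.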
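[NLP-45] Measuring How (Not Just Whether) VLMs Build Common Ground
[Quick read]: This paper addresses the inadequate evaluation of large vision language models (VLMs) in interactive grounding settings: existing benchmarks are mostly limited to single-turn or question-answering tasks and do not reflect how people gradually build shared understanding through ongoing communication. The key contribution is a suite of four metrics (grounding efficiency, content alignment, lexical adaptation, and human-likeness), applied to 150 self-play sessions of interactive referential games played by three proprietary VLMs and compared against human dyads. The study finds that task success scores do not reliably indicate successful grounding and that high image-utterance alignment does not necessarily predict task success.
Link: https://arxiv.org/abs/2509.03805
Authors: Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
Affiliations: Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: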
Abstract:Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
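[NLP-46] SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation
[Quick read]: This paper addresses two problems with back-translation-based evaluation of sign language generation: the pipeline fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody), and it makes it hard to tell whether errors come from the sign generation model or from the translation system used to assess it. The key contribution is SiLVERScore, a semantically-aware embedding-based metric that assesses generated signing in a joint embedding space. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore discriminates almost perfectly between correct and random pairs (ROC AUC = 0.99, overlap 7%), substantially outperforming traditional metrics.
Link: https://arxiv.org/abs/2509.03791
Authors: Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
Affiliations: Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: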
Abstract:Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap 7%), substantially outperforming traditional metrics.
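To make the ROC AUC figure above concrete, here is a small sketch of how discrimination between correct and random pairs can be measured: AUC equals the probability that a randomly chosen correct pair receives a higher score than a randomly chosen random pair, with ties counted as half. The scores in the usage lines are made up for illustration and are not SiLVERScore outputs; how the paper computes its "overlap" figure is not specified in the abstract and is not modeled here.

```python
def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    pos_scores: metric scores for correct (matched) pairs.
    neg_scores: metric scores for random (mismatched) pairs.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical similarity scores, for illustration only.
correct_pairs = [0.9, 0.8, 0.7]
random_pairs = [0.1, 0.2, 0.75]
auc = roc_auc(correct_pairs, random_pairs)
```

An AUC of 0.5 means the metric cannot tell correct pairs from random ones, while 1.0 means every correct pair outscores every random pair.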
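[NLP-47] Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain
[Quick read]: This paper addresses the risk that Retrieval Augmented Generation (RAG) systems in high-stakes domains such as health absorb and reproduce misinformation from retrieved evidence, especially when retrieved documents contain adversarial material, which substantially degrades the alignment between model outputs and ground-truth answers. Through controlled experiments, the study shows that robustness can be preserved when helpful evidence is also present in the retrieval pool, underlining the need for retrieval safeguards when designing safer RAG systems.
Link: https://arxiv.org/abs/2509.03787
Authors: Shakiba Amirshahi, Amin Bigdeli, Charles L. A. Clarke, Amira Ghenai
Affiliations: University of Waterloo; Toronto Metropolitan University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: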
Abstract:Retrieval augmented generation (RAG) systems provide a method for factually grounding the responses of a Large Language Model (LLM) by providing retrieved evidence, or context, as support. Guided by this context, RAG systems can reduce hallucinations and expand the ability of LLMs to accurately answer questions outside the scope of their training data. Unfortunately, this design introduces a critical vulnerability: LLMs may absorb and reproduce misinformation present in retrieved evidence. This problem is magnified if retrieved evidence contains adversarial material explicitly intended to promulgate misinformation. This paper presents a systematic evaluation of RAG robustness in the health domain and examines alignment between model outputs and ground-truth answers. We focus on the health domain due to the potential for harm caused by incorrect responses, as well as the availability of evidence-based ground truth for many common health-related questions. We conduct controlled experiments using common health questions, varying both the type and composition of the retrieved documents (helpful, harmful, and adversarial) as well as the framing of the question by the user (consistent, neutral, and inconsistent). Our findings reveal that adversarial documents substantially degrade alignment, but robustness can be preserved when helpful evidence is also present in the retrieval pool. These findings offer actionable insights for designing safer RAG systems in high-stakes domains by highlighting the need for retrieval safeguards. To enable reproducibility and facilitate future research, all experimental results are publicly available in our GitHub repository: this https URL
[NLP-48] Singular Value Few-shot Adaptation of Vision-Language Models
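[Quick Read]: This paper addresses two challenges that vision-language models (VLMs) such as CLIP face when adapting to fine-grained domains: reliance on prompt engineering, which limits adaptation quality, and the high cost of full-parameter fine-tuning, which also risks damaging the rich knowledge acquired during pretraining. The authors propose CLIP-SVD, whose core idea is to use Singular Value Decomposition (SVD) to adjust CLIP's internal parameter space efficiently: only the singular values of the parameter matrices are fine-tuned, rescaling the basis vectors to achieve cross-domain adaptation. The approach adds no extra modules, uses only 0.04% of the model's parameters, significantly improves adaptation performance while better preserving generalization, and achieves state-of-the-art classification results on 11 natural-image and 10 biomedical datasets.
Link: https://arxiv.org/abs/2509.03740
Authors: Taha Koleilat, Hassan Rivaz, Yiming Xiao
Affiliations: Concordia University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 10 pages, 2 figures, 8 tables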
Abstract:Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at this https URL.
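The idea of fine-tuning only the singular values can be illustrated with a minimal NumPy sketch. This is not the authors' code: the matrix `W`, the toy data `x` and `y`, the squared-error loss, and the learning rate are all assumptions made for illustration. It only shows that a weight matrix can be adapted by updating `s` while `U` and `Vt` stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one pretrained weight matrix (not a real CLIP layer).
W = rng.normal(size=(8, 6))

# Decompose once. U and Vt stay frozen; only the singular values are trainable.
U, s, Vt = np.linalg.svd(W, full_matrices=False)


def adapted_weight(s_trainable):
    """Rebuild the weight as U @ diag(s) @ Vt with rescaled singular values."""
    return (U * s_trainable) @ Vt


def mse_loss(s_trainable, x, y):
    """Mean squared error of the linear layer x @ W.T against targets y."""
    return float(np.mean((x @ adapted_weight(s_trainable).T - y) ** 2))


def grad_wrt_singular_values(s_trainable, x, y):
    """Gradient of mse_loss with respect to the singular values only."""
    err = x @ adapted_weight(s_trainable).T - y
    grad_W = 2.0 * err.T @ x / err.size
    # dW/ds_k is the rank-one matrix u_k v_k^T, so project grad_W onto it.
    return np.einsum("ik,ij,kj->k", U, grad_W, Vt)


# Toy few-shot data (an assumption), used only to exercise the update rule.
x = rng.normal(size=(32, 6))
y = rng.normal(size=(32, 8))

s_train = s.copy()
initial_loss = mse_loss(s_train, x, y)
for _ in range(200):
    s_train -= 1e-2 * grad_wrt_singular_values(s_train, x, y)
final_loss = mse_loss(s_train, x, y)

# Only min(rows, cols) values are trained, versus rows * cols for full tuning.
trainable, total = s_train.size, W.size
```

In this toy layer 6 of 48 entries are trainable; the 0.04% figure in the abstract refers to the full CLIP model and is not reproduced here.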
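[NLP-49] The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
[Quick Read]: This paper addresses the fact that current understanding of personality traits in Large Language Models (LLMs) relies too heavily on simplified self-reports and heuristic prompting, with little behavioral validation. The key to its approach is a systematic empirical analysis along three dimensions: (1) the dynamic emergence and evolution of trait profiles across training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the effect of targeted interventions, such as persona injection, on both self-reports and actual behavior. The study finds that instructional alignment (e.g., RLHF, instruction tuning) stabilizes trait expression and strengthens correlations between traits, but self-reported traits do not reliably predict behavior, and persona injection changes surface-level statements while having little effect on behavior. The results expose the gap between surface-level trait expression and behavioral consistency in LLM personality, challenge existing assumptions, and underscore the need for deeper behavioral validation in alignment and interpretability evaluation.
Link: https://arxiv.org/abs/2509.03730
Authors: Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
Affiliations: Caltech; UIUC; University of Cambridge
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: We make public all code and source data at this https URL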
Abstract:Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and self-regulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.
[NLP-50] MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection
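[Quick Read]: This paper addresses the drop in stance detection performance in cross-target and cross-domain settings, where existing models generalize poorly because the targets or domains of the training and test data differ. The key to its solution is a metric-learning-based few-shot method, Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD), which uses triplet loss to build a discriminative embedding space that captures semantic similarities and differences between stance targets. This strengthens domain adaptation, lets the model acquire useful examples from new target domains, and improves detection accuracy.
Link: https://arxiv.org/abs/2509.03725
Authors: Parush Gera, Tempestt Neal
Affiliations: University of South Florida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: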
Abstract:We present a novel approach for stance detection across domains and targets, Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD). MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. By constructing a discriminative embedding space, MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. We evaluate MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvement in stance detection performance across six widely used stance detection models.
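For readers unfamiliar with triplet loss, the standard formulation can be sketched as follows. This is the generic textbook loss, not MLSD's implementation; the Euclidean distance, the `margin` default, and the toy embeddings in the usage note are assumptions.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss, averaged over a batch of embedding triplets.

    Each argument has shape (batch, dim). The loss is zero once every
    anchor is closer to its positive than to its negative by `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())
```

As a usage example, with `anchor = positive = [[0, 0]]` and `negative = [[3, 4]]` the negative is already 5 units away, so the margin is satisfied and no gradient would flow; swapping the positive and negative gives a loss of 5 - 0 + 1 = 6.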
[NLP-51] Semantic Analysis of SNOMED CT Concept Co-occurrences in Clinical Documentation using MIMIC-IV
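[Quick Read]: This paper addresses the difficulty of large-scale analysis caused by the unstructured format of clinical notes, in particular how the relationship between co-occurrence patterns of standardized terminology (such as SNOMED CT) and embedding-based semantic similarity can make clinical data more useful for mining. The key to its approach is combining Normalized Pointwise Mutual Information (NPMI) with embeddings from pretrained medical language models (e.g., ClinicalBERT, BioBERT) to systematically assess how concept co-occurrence frequency relates to semantic similarity, and to test whether embeddings capture clinically meaningful associations that are not frequently documented. The study finds that although co-occurrence and semantic similarity are only weakly correlated, embedding models still identify latent clinical associations and predict concepts documented later, offering new methods for augmenting clinical annotations, phenotype clustering, and decision support.
Link: https://arxiv.org/abs/2509.03662
Authors: Ali Noori, Somya Mohanty, Prashanti Manda
Affiliations: University of North Carolina Greensboro; United HealthGroup; University of Nebraska Omaha
Subjects: Computation and Language (cs.CL)
Comments: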
Abstract:Clinical notes contain rich clinical narratives but their unstructured format poses challenges for large-scale analysis. Standardized terminologies such as SNOMED CT improve interoperability, yet understanding how concepts relate through co-occurrence and semantic similarity remains underexplored. In this study, we leverage the MIMIC-IV database to investigate the relationship between SNOMED CT concept co-occurrence patterns and embedding-based semantic similarity. Using Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (e.g., ClinicalBERT, BioBERT), we examine whether frequently co-occurring concepts are also semantically close, whether embeddings can suggest missing concepts, and how these relationships evolve temporally and across specialties. Our analyses reveal that while co-occurrence and semantic similarity are weakly correlated, embeddings capture clinically meaningful associations not always reflected in documentation frequency. Embedding-based suggestions frequently matched concepts later documented, supporting their utility for augmenting clinical annotations. Clustering of concept embeddings yielded coherent clinical themes (symptoms, labs, diagnoses, cardiovascular conditions) that map to patient phenotypes and care patterns. Finally, co-occurrence patterns linked to outcomes such as mortality and readmission demonstrate the practical utility of this approach. Collectively, our findings highlight the complementary value of co-occurrence statistics and semantic embeddings in improving documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications.
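NPMI has a standard definition, NPMI(x, y) = PMI(x, y) / (-log p(x, y)), which rescales pointwise mutual information to the range [-1, 1]. A minimal sketch from document counts follows; the function name, the count-based probability estimates, and the handling of the two boundary cases are assumptions, not details taken from the paper.

```python
import math


def npmi(count_xy, count_x, count_y, n_docs):
    """Normalized pointwise mutual information from co-occurrence counts.

    count_xy: notes containing both concepts
    count_x, count_y: notes containing each concept
    n_docs: total number of notes
    """
    if count_xy == 0:
        return -1.0  # conventional limit: the concepts never co-occur
    p_xy = count_xy / n_docs
    if p_xy == 1.0:
        return 1.0  # both concepts appear in every note; avoid dividing by log(1)
    p_x = count_x / n_docs
    p_y = count_y / n_docs
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

Two concepts that always appear together score 1, statistically independent concepts score 0, and concepts that never co-occur score -1.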
[NLP-52] Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
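[Quick Read]: This paper addresses "self-preference bias" in Large Language Models (LLMs) used as automated evaluators: the tendency to favor their own outputs over those of other models, which undermines the fairness and reliability of evaluation. The key to the solution is lightweight steering vectors, built with two methods, Contrastive Activation Addition (CAA) and an optimization-based approach, that adjust the model at inference time to reduce unjustified self-preference. Experiments show the method reduces unjustified self-preference by up to 97%, substantially outperforming prompting and direct preference optimization baselines. However, steering vectors are unstable on legitimate self-preference and unbiased agreement, suggesting that self-preference may span multiple or nonlinear directions, which highlights both the promise and the limits of the approach.
Link: https://arxiv.org/abs/2509.03647
Authors: Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: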
Abstract:Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from “self-preference bias”: a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. We introduce a curated dataset that distinguishes self-preference bias into justified examples of self-preference and unjustified examples of self-preference, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying self-preference spans multiple or nonlinear directions. This underscores both their promise and limits as safeguards for LLM-as-judges and motivates more robust interventions.
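The general shape of Contrastive Activation Addition can be sketched in a few lines: take the difference between mean activations on two contrasting sets of examples, then add a scaled copy of that vector to the hidden state at inference time. The sketch below is a generic illustration under assumptions; the array shapes, the function names, and the scale `alpha` are hypothetical, and the paper's choice of layer and data is not reproduced.

```python
import numpy as np


def caa_steering_vector(pos_acts, neg_acts):
    """Mean-difference steering vector from two sets of activations.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim), e.g.
    activations collected at one layer on contrasting examples.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def steer(hidden, vector, alpha):
    """Add the scaled steering vector to every hidden state in a batch.

    A negative alpha pushes activations away from the "positive" set.
    """
    return hidden + alpha * vector
```

In practice the steered hidden state would be written back into the model's forward pass at the chosen layer; that hook is framework-specific and omitted here.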
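[NLP-53] Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
[Quick Read]: This paper addresses the opaque internal mechanisms and inefficient optimization of Reinforcement Learning (RL) when it is used to improve the complex reasoning abilities of Large Language Models (LLMs). The study finds that phenomena such as "aha moments", "length-scaling", and entropy dynamics are not isolated events but manifestations of an emergent reasoning hierarchy, similar to the separation of high-level strategic planning from low-level execution in human cognition. The key to the solution is identifying and exploiting this two-phase dynamic: the model is initially limited by the correctness of its low-level skills, after which the optimization bottleneck shifts to exploring and mastering high-level strategic planning. The authors therefore propose Hierarchy-Aware Credit Assignment (HICRA), an algorithm that focuses optimization on high-impact strategic planning tokens; it significantly outperforms existing baselines, confirming that concentrating optimization on the strategic bottleneck is key to unlocking advanced reasoning.
Link: https://arxiv.org/abs/2509.03646
Authors: Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen
Affiliations: HKUST; University of Waterloo; Tsinghua University; Imperial College London; UCAS
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint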
Abstract:Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling" and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. HICRA significantly outperforms strong baselines, demonstrating that focusing on this strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we validate semantic entropy as a superior compass for measuring strategic exploration over misleading metrics such as token-level entropy.
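[NLP-54] Towards a Neurosymbolic Reasoning System Grounded in Schematic Representations
[Quick Read]: This paper addresses the unstable performance of Large Language Models (LLMs) on logical reasoning tasks and their lack of robust, human-like cognitive representations. The key to its solution is a neurosymbolic system, Embodied-LM, that grounds understanding and logical reasoning in image schemas, recurring cognitive structures derived from embodied sensorimotor experience, and implements executable declarative spatial reasoning through Answer Set Programming (ASP). This approach lets LLMs interpret scenarios through embodied cognitive structures and also significantly improves the accuracy and interpretability of logical reasoning.
Link: https://arxiv.org/abs/2509.03644
Authors: François Olivier, Zied Bouraoui
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear in Proceedings of Machine Learning Research, 19th Conference on Neurosymbolic Learning and Reasoning, 2025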
Abstract:Despite significant progress in natural language understanding, Large Language Models (LLMs) remain error-prone when performing logical reasoning, often lacking the robust mental representations that enable human-like comprehension. We introduce a prototype neurosymbolic system, Embodied-LM, that grounds understanding and logical reasoning in schematic representations based on image schemas, recurring patterns derived from sensorimotor experience that structure human cognition. Our system operationalizes the spatial foundations of these cognitive structures using declarative spatial reasoning within Answer Set Programming. Through evaluation on logical deduction problems, we demonstrate that LLMs can be guided to interpret scenarios through embodied cognitive structures, that these structures can be formalized as executable programs, and that the resulting representations support effective logical reasoning with enhanced interpretability. While our current implementation focuses on spatial primitives, it establishes the computational foundation for incorporating more complex and dynamic representations.
[NLP-55] CausalARC: Abstract Reasoning with Causal World Models
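[Quick Read]: This paper addresses how AI systems can adapt when reasoning in low-data and out-of-distribution settings, that is, how a model can still reason effectively with limited samples and under unknown distributions. The key to the solution is CausalARC, an experimental testbed built on structural causal models: every reasoning task is sampled from a fully specified causal world model, and principled data augmentations supply observational, interventional, and counterfactual feedback to support training and evaluation in the form of few-shot, in-context learning demonstrations. This design enables more robust reasoning by language models across abstract reasoning, counterfactual reasoning, program synthesis, and causal discovery.
Link: https://arxiv.org/abs/2509.03636
Authors: Jacqueline Maasch, John Kalantari, Kia Khezeli
Affiliations: Cornell Tech; YRIKKA
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: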
Abstract:Reasoning requires adaptation to novel problem settings under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning.
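The three kinds of feedback named in the abstract (observational, interventional, counterfactual) can be illustrated with a toy structural causal model. The two-variable model below is invented purely for illustration and is unrelated to the grid-based tasks in CausalARC; the variable names, the XOR mechanism, and the noise probabilities are all assumptions.

```python
import random


def sample_noise(rng):
    """Draw the exogenous noise terms of the toy model."""
    return {"u_x": rng.random() < 0.5, "u_y": rng.random() < 0.1}


def scm(noise, do_x=None):
    """Evaluate the structural equations, optionally under do(X = do_x)."""
    x = noise["u_x"] if do_x is None else do_x
    y = x != noise["u_y"]  # XOR: Y copies X unless the noise flips it
    return {"x": x, "y": y}


rng = random.Random(0)
noise = sample_noise(rng)

# Observational: let the mechanisms run untouched.
observed = scm(noise)

# Interventional: fresh noise, but X is forced to True.
intervened = scm(sample_noise(rng), do_x=True)

# Counterfactual: keep the SAME noise as the observed unit, but flip X.
counterfactual = scm(noise, do_x=not observed["x"])
```

The difference between the last two calls is the point of the sketch: an intervention resamples the noise, while a counterfactual reuses the noise of a specific observed unit, so here flipping X necessarily flips Y.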
[NLP-56] E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition FAST
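[Quick Read]: This paper addresses the performance bottleneck of optical character recognition (OCR) on multilingual, noisy, and diverse real-world images, particularly the challenge of deployment in resource-constrained edge computing environments. The key to its solution is Sprinklr-Edge-OCR, a new OCR system optimized for edge deployment, together with a large-scale comparative evaluation showing the clear efficiency and cost advantages of traditional OCR: compared with current mainstream large vision-language models (LVLMs), Sprinklr-Edge-OCR processes images 35 times faster (0.17 seconds per image on average), costs less than 0.01 of what an LVLM does (0.006 USD per 1,000 images), and achieves the best F1 score (0.46), indicating that lightweight traditional OCR systems remain the best choice for edge scenarios.
Link: https://arxiv.org/abs/2509.03615
Authors: Aryan Gupta, Anupam Purwar
Affiliations: Sprinklr
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Sprinklr OCR provides a fast and compute light way of performing OCR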
Abstract:Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted an edge case deployment analysis, evaluating model performance in CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed the others in efficiency, processing images 35x faster (0.17 seconds per image on average) and at less than 0.01 of the cost (0.006 USD per 1,000 images) compared to LVLMs. Our findings demonstrate that the best OCR systems for edge deployment are the traditional ones, even in the era of LLMs, due to their low compute requirements, low latency, and very high affordability.
[NLP-57] NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management
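[Quick Read]: This paper addresses the shortcomings of current AI-assisted note-taking tools in efficiency and personalized support, especially their limits in automatic note categorization and in fitting user workflows. The key to its solution is introducing persona information based on personality type (MBTI) and combining it with lightweight language models, so that NoteBar can automatically organize notes into multiple categories according to the user's personality profile and better match the user's cognitive habits and usage scenarios. The study also builds a new persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts, providing diversity and semantic richness for downstream tasks, and shows that the system can be deployed interactively without reliance on heavy infrastructure, making it practical and extensible.
Link: https://arxiv.org/abs/2509.03610
Authors: Josh Wisoff, Yao Tang, Zhengyu Fang, Jordan Guzman, YuTang Wang, Alex Yu
Affiliations: NoteBar Research; University of Rochester; Case Western Reserve University; Skidmore College
Subjects: Computation and Language (cs.CL)
Comments: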
Abstract:Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. The recent success of large language models has accelerated the development of AI-assisted tools, yet existing solutions often struggle with efficiency. We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. To support research and evaluation in this space, we further introduce a novel persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts across 16 MBTI personas, offering both diversity and semantic richness for downstream tasks. Finally, we demonstrate that NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Together, NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management.
[NLP-58] ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference ACM-MM2025
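[Quick Read]: This paper addresses multi-document scientific inference: extracting and aligning motivation, methodology, and experimental results across thematically related papers to reconstruct research development chains. Traditional literature reviews only summarize individual papers and struggle to reveal how scientific ideas evolve, whereas this work proposes a structured, temporally ordered reasoning framework. The key to its solution is an agent-based system, ResearchPulse, with three coordinated modules: a Plan Agent for task decomposition, a Mmap-Agent that builds motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. The authors also introduce ResearchPulse-Bench, a citation-aware annotated dataset that supports evaluation and training for the task. Experiments show that, even with 7B-scale models, the system outperforms strong baselines such as GPT-4o in semantic alignment, structural consistency, and visual fidelity.
Link: https://arxiv.org/abs/2509.03565
Authors: Qi Chen, Jingxuan Wei, Zhuoya Yao, Haiguang Wang, Gaowei Wu, Bihui Yu, Siyuan Li, Cheng Tan
Affiliations: University of Chinese Academy of Sciences; Zhejiang University; Shanghai Artificial Intelligence Laboratory
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted to ACM MM 2025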
Abstract:Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at this https URL.
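[NLP-59] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction
[Quick Read]: This paper addresses the factual inconsistency of answers generated by Large Language Models (LLMs), which stems from the limits of their parametric memory. Existing Retrieval-Augmented Generation (RAG) methods bring in external knowledge but usually treat it as unstructured text, which makes compositional reasoning and the identification of factual contradictions difficult. The key to the solution is to dynamically construct and expand a knowledge graph (KG) at inference time, fusing knowledge extracted from the LLM itself with externally retrieved information: a seed KG is generated via prompting, expanded iteratively using the LLM's latent knowledge, and selectively refined through external retrieval, improving factual accuracy, answer precision, and interpretability.
Link: https://arxiv.org/abs/2509.03540
Authors: Shanglin Wu, Lihui Liu, Jinho D. Choi, Kai Shu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: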
Abstract:Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) methods address this issue by incorporating external knowledge from trusted sources at inference time. However, such methods typically treat knowledge as unstructured text, which limits their ability to support compositional reasoning and identify factual inconsistencies. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external information retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM’s latent knowledge. The graph is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse factual QA benchmarks, demonstrating consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods. Our findings suggest that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.
[NLP-60] AR2: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models CIKM2025
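[Quick Read]: This paper addresses the insufficient abstraction ability that is common in Large Language Models (LLMs) trained with Reinforcement Learning (RL) for code generation: models often recognize only surface patterns and lack the ability to distill the essential computational structure (computational kernels) from complex problem descriptions. The key to the solution is the AR² (Adversarial Reinforcement Learning for Abstract Reasoning) framework, in which a teacher model transforms original problems into narrative-rich but logically unchanged complex descriptions, while a student coding model is trained to extract the underlying computational kernel and solve the problem, explicitly improving the LLM's abstract reasoning. Experiments show the method significantly improves accuracy on previously unseen, challenging programming tasks, confirming the importance of abstraction for LLM generalization.
Link: https://arxiv.org/abs/2509.03537
Authors: Cheng-Kai Yeh, Hsing-Wang Lee, Chung-Hung Kuo, Hen-Hsen Huang
Affiliations: National Chengchi University; Institute of Information Science
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, accepted by CIKM 2025 as a short paper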
Abstract:Abstraction, the ability to recognize and distill essential computational patterns from complex problem statements, is a foundational skill in computer science, critical both for human problem-solvers and coding-oriented large language models (LLMs). Despite recent advances in training LLMs for code generation using reinforcement learning (RL), most existing approaches focus primarily on superficial pattern recognition, overlooking explicit training for abstraction. In this study, we propose AR² (Adversarial Reinforcement Learning for Abstract Reasoning), a novel framework explicitly designed to enhance the abstraction abilities of LLMs. AR² employs a teacher model to transform kernel problems into narrative-rich, challenging descriptions without changing their fundamental logic. Simultaneously, a student coding model is trained to solve these complex narrative problems by extracting their underlying computational kernels. Experimental results demonstrate that AR² substantially improves the student model's accuracy on previously unseen, challenging programming tasks, underscoring abstraction as a key skill for enhancing LLM generalization.
[NLP-61] QuesGenie: Intelligent Multimodal Question Generation
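[Quick Read]: This paper addresses the problem that, in today's information-rich era, learners have access to abundant educational resources but lack practice materials tailored to them. The key to its solution is a multi-modal question generation system that automatically produces diverse question types from a variety of content formats. Its core components are multi-modal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface, yielding efficient, scalable, and intelligent question generation that balances resource efficiency, robust functionality, and user experience.
Link: https://arxiv.org/abs/2509.03535
Authors: Ahmed Mubarak, Amna Ahmed, Amira Nasser, Aya Mohamed, Fares El-Sadek, Mohammed Ahmed, Ahmed Salah, Youssef Sobhy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 8 figures, 12 tables. Supervised by Dr. Ahmed Salah and TA Youssef Sobhy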
Abstract:In today’s information-rich era, learners have access to abundant educational resources, but the lack of practice materials tailored to these resources presents a significant challenge. This project addresses that gap by developing a multi-modal question generation system that can automatically generate diverse question types from various content formats. The system features four major components: multi-modal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. This project lays the foundation for automated, scalable, and intelligent question generation, carefully balancing resource efficiency, robust functionality and a smooth user experience.
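[NLP-62] Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck
[Quick Read]: This paper addresses semantic drift in Large Language Models (LLMs) caused by intrinsic faithfulness hallucinations (also known as confabulations), where generated content is semantically inconsistent with the input context. Existing detection frameworks such as Semantic Divergence Metrics (SDM) identify shared topics by applying geometric clustering to prompt and response embeddings, but such clustering optimizes spatial proximity rather than information-theoretic usefulness, which limits detection performance. The key to the solution is a principled topic identification method based on the Deterministic Information Bottleneck (DIB), turned into a practical algorithm for high-dimensional data called UDIB (Upper-bound DIB). UDIB replaces DIB's intractable KL divergence term with a computable upper bound, so that joint clustering of prompt and response embeddings preserves information optimally. In essence it is an entropy-regularized, robustified variant of K-means that automatically favors the smallest number of informative clusters, producing a topic representation that is not only spatially coherent but also maximally informative, in the information-theoretic sense, about the prompt-response relationship, which significantly improves the sensitivity and accuracy of the SDM framework for hallucination detection.
Link: https://arxiv.org/abs/2509.03533
Authors: Igor Halperin
Affiliations: Fidelity Investments
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); General Finance (q-fin.GN)
Comments: 26 pages, 4 figures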
Abstract:Large Language Models (LLMs) are prone to critical failure modes, including intrinsic faithfulness hallucinations (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.
[NLP-63] Real-Time Detection of Hallucinated Entities in Long-Form Generation
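[Quick Read]: This paper addresses the serious consequences of hallucinations by Large Language Models (LLMs) in high-stakes applications, focusing on entity-level hallucinations in long-form generation, such as fabricated names, dates, or citations. Existing detection methods either apply only to short factual queries or depend on costly external verification, making them hard to deploy in practice. The key to the solution is a cheap, scalable real-time detection method: a web-search-based annotation strategy builds an entity-level labeled dataset, and simple, efficient linear probes are trained into strong hallucination classifiers. The method supports streaming detection, transfers across multiple model families, and generalizes beyond entities to tasks such as mathematical reasoning, offering a feasible path to large-scale, real-time hallucination detection.
Link: https://arxiv.org/abs/2509.03531
Authors: Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda
Affiliations: ETH Zürich; MATS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: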
Abstract:Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations (e.g., fabricated names, dates, citations) rather than claim-level ones, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and are also an improvement in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
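A linear probe in this setting is a logistic-regression classifier applied to each token's hidden state. The sketch below trains one with plain gradient descent on synthetic data; the hidden states, labels, dimensions, and hyperparameters are all made up for illustration and stand in for real model activations and the paper's web-search-derived labels.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_linear_probe(hidden, labels, lr=0.1, steps=500):
    """Fit logistic regression on per-token hidden states.

    hidden: (n_tokens, hidden_dim) activations; labels: 1.0 for tokens
    marked as hallucinated, 0.0 otherwise.
    """
    w = np.zeros(hidden.shape[1])
    b = 0.0
    for _ in range(steps):
        residual = sigmoid(hidden @ w + b) - labels
        w -= lr * hidden.T @ residual / len(labels)
        b -= lr * residual.mean()
    return w, b


# Synthetic stand-in data (an assumption): labels depend on one hidden direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=8)
hidden = rng.normal(size=(1000, 8))
labels = (hidden @ direction > 0).astype(float)

w, b = train_linear_probe(hidden, labels)
scores = sigmoid(hidden @ w + b)  # one score per token
accuracy = float(np.mean((scores > 0.5) == (labels == 1.0)))
```

Because scoring a token is a single dot product on an activation the model has already computed, a probe of this form can in principle run alongside generation, which is what makes the streaming detection described in the abstract plausible.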
[NLP-64] Reading Between the Signs: Predicting Future Suicidal Ideation from Adolescent Social Media Texts
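Quick read: This paper addresses the prediction of suicidal ideation and behavior (SIB) among adolescents, focusing on identifying at-risk individuals before any explicit self-disclosure appears. Because many adolescents have little contact with mental health services, a large number of cases go undetected. The paper proposes a new predictive framing: using the non-disclosing content that users post on an online forum (such as everyday posts and interactions), the transformer-based Early-SIB model sequentially processes a user's history to predict whether they will later write an SIB post. The key point is that no text directly expressing suicidal intent is used as input; the model instead draws on implicit behavioral patterns to identify risk early. It reaches a balanced accuracy of 0.73 on a Dutch youth forum, which supports the feasibility of the approach in practice.
Link: https://arxiv.org/abs/2509.03530
Authors: Paul Blum, Enrico Liscio, Ruixuan Zhang, Caroline Figueroa, Pradeep K. Murukannaiah
Affiliations: Delft University of Technology
Categories: Computation and Language (cs.CL)
Notes: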
Abstract:Suicide is a leading cause of death among adolescents (12-18), yet predicting it remains a significant challenge. Many cases go undetected due to a lack of contact with mental health services. Social media, however, offers a unique opportunity, as young people often share their thoughts and struggles online in real time. In this work, we propose a novel task and method to approach it: predicting suicidal ideation and behavior (SIB) from forum posts before an adolescent explicitly expresses suicidal ideation on an online forum. This predictive framing, where no self-disclosure is used as input at any stage, remains largely unexplored in the suicide prediction literature. To this end, we introduce Early-SIB, a transformer-based model that sequentially processes the posts a user writes and engages with to predict whether they will write a SIB post. Our model achieves a balanced accuracy of 0.73 for predicting future SIB on a Dutch youth forum, demonstrating that such tools can offer a meaningful addition to traditional methods.
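The reported metric, balanced accuracy, is the mean of sensitivity and specificity, which matters when the positive class is rare. The sketch below shows the computation on made-up labels (the label vectors are illustrative assumptions, not data from the paper).

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Imbalanced toy labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts "negative" gets 80% plain accuracy
# but only 0.5 balanced accuracy.
always_negative = [0] * 10
plain_acc = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
bal_acc_trivial = balanced_accuracy(y_true, always_negative)

# A classifier that finds one of the two positives with one false alarm.
some_signal = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
bal_acc_signal = balanced_accuracy(y_true, some_signal)
```

Under this metric a trivial majority-class predictor scores 0.5, so the reported 0.73 is measured against that baseline rather than against the class prior.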
[NLP-65] Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages CCS
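Quick read: This paper addresses the difficulty existing financial sentiment analysis systems have in capturing the layered discourse structure of earnings calls, a semi-structured form of financial communication. Most current models are flat, operating at the document or sentence level, and ignore the interaction between scripted management commentary and analyst question-and-answer exchanges. The key to the solution is a multimodal framework that encodes an earnings call as a hierarchical discourse tree, in which each node (a monologue or a question-answer pair) combines emotional signals from text, audio, and video with structured metadata (coherence scores, topic labels, and answer-coverage assessments). The framework uses a two-stage transformer architecture: the first stage encodes multimodal content and discourse metadata at the node level with contrastive learning, and the second produces a global embedding for the whole call. Experiments indicate that the resulting embeddings are semantically rich and structure-aware, reflecting affective tone, structural logic, and thematic alignment, and the authors suggest the approach carries over to financial forecasting and to other high-stakes unscripted settings such as telemedicine, education, and political discourse.
Link: https://arxiv.org/abs/2509.03529
Authors: Alejandro Álvarez Castro, Joaquín Ordieres-Meré
Affiliations: Universidad Politécnica de Madrid
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Notes: Presented at NLMLT2025 ( this https URL ), 15 pages, 5 figures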
Abstract:Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multi-modal signals, such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models, failing to capture the layered discourse structure of these interactions. This paper introduces a novel multi-modal framework designed to generate semantically rich and structurally aware embeddings of earnings calls, by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer coverage assessments. A two-stage transformer architecture is proposed: the first encodes multi-modal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire conference. Experimental results reveal that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment. Beyond financial reporting, the proposed system generalizes to other high-stakes unscripted communicative domains such as tele-medicine, education, and political discourse, offering a robust and explainable approach to multi-modal discourse representation. This approach offers practical utility for downstream tasks such as financial forecasting and discourse evaluation, while also providing a generalizable method applicable to other domains involving high-stakes communication.
[NLP-66] The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process
Quick read: This paper addresses the limited effectiveness of Process Mining (PM) in the legal domain, which stems from restricted data accessibility and quality. The key to the solution is ProLiFIC (Procedural Lawmaking Flow in Italian Chambers), a comprehensive event log of the Italian lawmaking process from 1987 to 2022. It is built from unstructured text from the Normattiva portal and structured using Large Language Models (LLMs), yielding a computable representation of the legislative process that the authors propose as a benchmark for legal PM.
Link: https://arxiv.org/abs/2509.03528
Authors: Matilde Contestabile, Chiara Ferrara, Alberto Giovannetti, Giovanni Parrillo, Andrea Vandin
Affiliations: Sant’Anna School of Advanced Studies Pisa; DTU Technical University of Denmark
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes:
Abstract:Process Mining (PM), initially developed for industrial and business contexts, has recently been applied to social systems, including legal ones. However, PM’s efficacy in the legal domain is limited by the accessibility and quality of datasets. We introduce ProLiFIC (Procedural Lawmaking Flow in Italian Chambers), a comprehensive event log of the Italian lawmaking process from 1987 to 2022. Created from unstructured data from the Normattiva portal and structured using large language models (LLMs), ProLiFIC aligns with recent efforts in integrating PM with LLMs. We exemplify preliminary analyses and propose ProLiFIC as a benchmark for legal PM, fostering new developments.
[NLP-67] Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model
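Quick read: This paper addresses fragmented information and Large Language Model (LLM) hallucinations in cryptocurrency news analysis, using a multilevel, multitask framework to produce structured quantitative and qualitative insights. The key elements are: first, a fine-tuned Mistral 7B model combined with Retrieval-Augmented Generation (RAG) represents news content as a knowledge graph, which the author argues largely mitigates LLM hallucinations; second, 4-bit quantization and parameter-efficient fine-tuning (PEFT/LoRA) keep the model efficient; third, hierarchical stacking consolidates graph-based and text-based summaries, as well as summaries of summaries, into comprehensive multilevel reports.
Link: https://arxiv.org/abs/2509.03527
Authors: Bohdan M. Pavlyshenko
Affiliations: Ivan Franko National University of Lviv
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: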
Abstract:In the paper, we consider multilevel multitask analysis of cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG). On the first level of analytics, the fine-tuned model generates graph and text summaries with sentiment scores as well as JSON representations of summaries. Higher levels perform hierarchical stacking that consolidates sets of graph-based and text-based summaries as well as summaries of summaries into comprehensive reports. The combination of graph and text summaries provides complementary views of cryptocurrency news. The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. The representation of cryptocurrency news as knowledge graph can essentially eliminate problems with large language model hallucinations. The obtained results demonstrate that the use of fine-tuned Mistral 7B LLM models for multilevel cryptocurrency news analysis can conduct informative qualitative and quantitative analytics, providing important insights.
[NLP-68] Enhancing Speech Large Language Models through Reinforced Behavior Alignment
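Quick read: This paper addresses the significant gap in instruction following between speech-based large language models (SpeechLMs) and text-based LLMs, which is most pronounced when facing the dynamic and variable nature of user speech. The key to the solution is a framework called Reinforced Behavior Alignment (RBA): rather than relying on supervised fine-tuning with human annotations, it uses a powerful teacher LLM to self-synthesize large-scale, high-fidelity alignment data, and then aligns the SpeechLM's behavior with the teacher's through a reinforcement learning-based approach. Experiments show that RBA outperforms conventional distillation baselines without external annotated data, and that it extends to spoken question answering and speech-to-text translation, reaching state-of-the-art performance on open benchmarks.
Link: https://arxiv.org/abs/2509.03526
Authors: Yansong Liu, Jiateng Li, Yuan Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Notes: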
Abstract:The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which has led to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or textual formats. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data by a powerful teacher LLM. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, which outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.
[NLP-69] Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
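Quick read: This paper addresses the fact that more than half of US adults with Alzheimer's disease and related dementias remain undiagnosed, and explores speech-based screening as a scalable route to early detection. The key to the solution is a systematic evaluation of adaptation strategies for Large Language Models (LLMs): demonstration selection policies for in-context learning, reasoning-augmented prompting, parameter-efficient fine-tuning, and audio-text multimodal integration. The study finds that class-centroid demonstrations perform best for in-context learning, that reasoning prompts help smaller models, and that token-level fine-tuning generally gives the highest scores; adding a classification head substantially improves underperforming models. The results indicate that the choice of adaptation strategy critically influences speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
Link: https://arxiv.org/abs/2509.03525
Authors: Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, Maryam Zolnoori
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Notes: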
Abstract:Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection, evaluating nine text-only models and three multimodal audio-text models on recordings from the DementiaBank speech corpus. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
[NLP-70] LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis
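Quick read: This paper addresses two issues in large-scale text-to-speech (TTS): the proportion of expressive speech in large corpora is often unclear, and existing expressive corpora are small and mainly used for benchmarking rather than fine-tuning. The key to the solution is LibriQuote, an English corpus containing 12.7K hours of non-expressive speech and 5.3K hours of mostly expressive speech drawn from character quotations. Each utterance in the expressive subset comes with its written context and pseudo-labels of the speech verbs and adverbs describing the quotation (e.g., "he whispered softly"). A challenging 7.5-hour test set evaluates a TTS system's ability to synthesize expressive speech while preserving the timbre of a neutral reference. Experiments show that fine-tuning on the dataset significantly improves the intelligibility of synthesized speech, and that recent systems still fall short of the expressiveness and naturalness of ground-truth utterances.
Link: https://arxiv.org/abs/2509.04072
Authors: Gaspard Michel, Elena V. Epure, Christophe Cerisara
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Notes: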
Abstract:Text-to-speech (TTS) systems have recently achieved more expressive and natural speech synthesis by scaling to large speech datasets. However, the proportion of expressive speech in such large-scale corpora is often unclear. Besides, existing expressive speech corpora are typically smaller in scale and primarily used for benchmarking TTS systems. In this paper, we introduce the LibriQuote dataset, an English corpus derived from read audiobooks, designed for both fine-tuning and benchmarking expressive zero-shot TTS systems. The training dataset includes 12.7K hours of read, non-expressive speech and 5.3K hours of mostly expressive speech drawn from character quotations. Each utterance in the expressive subset is supplemented with the context in which it was written, along with pseudo-labels of speech verbs and adverbs used to describe the quotation (e.g., "he whispered softly"). Additionally, we provide a challenging 7.5 hour test set intended for benchmarking TTS systems: given a neutral reference speech as input, we evaluate a system’s ability to synthesize an expressive utterance while preserving reference timbre. We validate qualitatively the test set by showing that it covers a wide range of emotions compared to non-expressive speech, along with various accents. Extensive subjective and objective evaluations show that fine-tuning a baseline TTS system on LibriQuote significantly improves its synthesized speech intelligibility, and that recent systems fail to synthesize speech as expressive and natural as the ground-truth utterances. The dataset and evaluation code are freely available. Audio samples can be found at this https URL.
Computer Vision
[CV-0] Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image – Technical Preview
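Quick read: This paper addresses two core challenges in generating long virtual try-on videos: local smoothness between adjacent segments and global temporal consistency across segments. Earlier approaches are constrained by heavy resource consumption and the need for lengthy video data, making arbitrarily long, high-quality try-on videos hard to produce. The authors propose the Virtual Fitting Room (VFR) framework, which models long video generation as an auto-regressive, segment-by-segment process. A prefix video condition keeps adjacent segments locally continuous, while an anchor video (a 360-degree video capturing the person's whole-body appearance) constrains overall temporal consistency. This enables minute-scale virtual try-on videos that are both locally smooth and globally consistent under various motions.
Link: https://arxiv.org/abs/2509.04450
Authors: Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
Affiliations: University of Illinois Urbana-Champaign; SpreeAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Project Page: this https URL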
Abstract:We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. Our VFR models long video generation tasks as an auto-regressive, segment-by-segment generation process, eliminating the need for resource-intensive generation and lengthy video data, while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, we propose our VFR framework, which ensures smoothness through a prefix video condition and enforces consistency with the anchor video – a 360-degree video that comprehensively captures the human’s wholebody appearance. Our VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.
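[CV-1] TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP2025
Quick read: This paper addresses the limited generalization of multimodal misinformation detection methods that focus on a single type of distortion (textual, visual, or cross-modal), a problem made more pressing as generative AI amplifies the complexity of misinformation. The key to the solution is TRUST-VL, a model trained jointly across distortion types to encourage knowledge sharing and improve generalization. Its main contributions are a Question-Aware Visual Amplifier module that extracts task-specific visual features, and TRUST-Instruct, a large-scale instruction dataset of structured reasoning chains used for training. Together these yield unified, explainable multimodal misinformation detection with zero-shot transfer ability.
Link: https://arxiv.org/abs/2509.04448
Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: EMNLP 2025; Project Homepage: this https URL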
Abstract:Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model’s ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
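[CV-2] Plot’n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models
Quick read: This paper addresses the lack of fine-grained control and consistent editing in story visualization with text-to-image diffusion models, in particular the difficulty of maintaining visual and narrative consistency across multiple frames. The key to the solution is Plot’n Polish, a zero-shot framework that supports consistent story generation and provides fine-grained control at various levels of detail, allowing creators to modify and refine images coherently after generation.
Link: https://arxiv.org/abs/2509.04446
Authors: Kiymet Akdemir, Jing Shi, Kushal Kafle, Brian Price, Pinar Yanardag
Affiliations: Virginia Tech; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: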
Abstract:Text-to-image diffusion models have demonstrated significant capabilities to generate diverse and detailed visuals in various domains, and story visualization is emerging as a particularly promising application. However, as their use in real-world creative domains increases, the need for providing enhanced control, refinement, and the ability to modify images post-generation in a consistent manner becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, preventing creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot’n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.
[CV-3] One Flight Over the Gap: A Survey from Perspective to Panoramic Vision
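Quick read: This survey addresses the difficulty of adapting methods from perspective images to omnidirectional images (ODIs) in panoramic vision, which arises from differences in geometric projection, spatial distribution, and boundary continuity. It first reviews the panoramic imaging pipeline and projection methods, then identifies three core challenges: severe geometric distortion near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, it covers more than 20 representative tasks through a cross-method analysis of strategies for panorama-specific challenges and a cross-task comparison that groups panoramic vision into four categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. It closes with open challenges and future directions in data, models, and applications.
Link: https://arxiv.org/abs/2509.04444
Authors: Xin Lin, Xian Ge, Dizhe Zhang, Zhaoliang Wan, Xianshun Wang, Xiangtai Li, Wenjie Jiang, Bo Du, Dacheng Tao, Ming-Hsuan Yang, Lu Qi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: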
Abstract:Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360° field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making it challenging for direct domain adaptation from perspective methods. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panoramic specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. We hope that our work can provide new insight and forward-looking perspectives to advance the development of panoramic vision technologies. Our project page is this https URL
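The non-uniform sampling of Equirectangular Projection mentioned in the abstract can be made concrete with a short calculation. In an ERP image every row has the same number of pixels, but the solid angle a pixel covers shrinks with the cosine of its latitude. The sketch below computes these per-row weights; the function name `erp_row_weights` and the 512-row image height are illustrative assumptions, not taken from the survey.

```python
import math

def erp_row_weights(height):
    """Relative solid angle covered by one pixel in each ERP row (cosine of latitude)."""
    weights = []
    for row in range(height):
        # Latitude of the row centre: close to +pi/2 at the top, -pi/2 at the bottom.
        latitude = math.pi / 2 - (row + 0.5) * math.pi / height
        weights.append(math.cos(latitude))
    return weights

height = 512
weights = erp_row_weights(height)

pole_weight = weights[0]                # row nearest the north pole
equator_weight = weights[height // 2]   # row nearest the equator

# Share of the sphere's surface covered by the top 10% of image rows.
top_rows = height // 10
top_area_share = sum(weights[:top_rows]) / sum(weights)
```

The top tenth of the rows covers well under a tenth of the sphere, which is one reason losses and metrics for panoramic images are often weighted by latitude.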
[CV-4] DEXOP: A Device for Robotic Transfer of Dexterous Human Manipulation
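Quick read: This paper addresses inefficient data collection and limited skill transfer in robotic manipulation, in particular how to efficiently gather high-quality, transferable multimodal (vision + tactile) demonstration data to improve robot dexterity. The key to the solution is the perioperation paradigm and DEXOP, a passive hand exoskeleton that mechanically connects human fingers to robot fingers, providing direct contact feedback and mirroring the human hand pose so that demonstrations feel natural while the data transfers well to real robots. Compared with teleoperation, DEXOP increases both the speed and accuracy of demonstrations, and policies trained on its data show significantly better task performance per unit time of data collection.
Link: https://arxiv.org/abs/2509.04441
Authors: Hao-Shu Fang, Branden Romero, Yichen Xie, Arthur Hu, Bo-Ruei Huang, Juan Alvarez, Matthew Kim, Gabriel Margolis, Kavya Anbarasu, Masayoshi Tomizuka, Edward Adelson, Pulkit Agrawal
Affiliations: Improbable AI Lab; Massachusetts Institute of Technology; UC Berkeley
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Notes: project page: this https URL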
Abstract:We introduce perioperation, a paradigm for robotic data collection that sensorizes and records human manipulation while maximizing the transferability of the data to real robots. We implement this paradigm in DEXOP, a passive hand exoskeleton designed to maximize human ability to collect rich sensory (vision + tactile) data for diverse dexterous manipulation tasks in natural environments. DEXOP mechanically connects human fingers to robot fingers, providing users with direct contact feedback (via proprioception) and mirrors the human hand pose to the passive robot hand to maximize the transfer of demonstrated skills to the robot. The force feedback and pose mirroring make task demonstrations more natural for humans compared to teleoperation, increasing both speed and accuracy. We evaluate DEXOP across a range of dexterous, contact-rich tasks, demonstrating its ability to collect high-quality demonstration data at scale. Policies learned with DEXOP data significantly improve task performance per unit time of data collection compared to teleoperation, making DEXOP a powerful tool for advancing robot dexterity. Our project page is at this https URL.
[CV-5] From Lines to Shapes: Geometric-Constrained Segmentation of X-Ray Collimators via Hough Transform
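Quick read: This paper addresses the detection of collimator shadows in digital X-ray imaging, where edges obscured by scattered radiation hinder accurate segmentation of the region of interest (ROI). The key to the solution is a deep learning-based segmentation method that embeds a differentiable Hough transform-based network to detect collimation borders and exploits the geometric prior that collimator shadows are polygonal. At inference, border detections are combined with information about the ROI center to generate refined, line-constrained segmentation masks. The method achieves median Hausdorff distances of 4.3-5.0 mm on diverse test sets of real X-ray images.
Link: https://arxiv.org/abs/2509.04437
Authors: Benjamin El-Zein, Dominik Eckert, Andreas Fieselmann, Christopher Syben, Ludwig Ritschl, Steffen Kappler, Sebastian Stober
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Notes: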
Abstract:Collimation in X-ray imaging restricts exposure to the region-of-interest (ROI) and minimizes the radiation dose applied to the patient. The detection of collimator shadows is an essential image-based preprocessing step in digital radiography posing a challenge when edges get obscured by scattered X-ray radiation. Regardless, the prior knowledge that collimation forms polygonal-shaped shadows is evident. For this reason, we introduce a deep learning-based segmentation that is inherently constrained to its geometry. We achieve this by incorporating a differentiable Hough transform-based network to detect the collimation borders and enhance its capability to extract the information about the ROI center. During inference, we combine the information of both tasks to enable the generation of refined, line-constrained segmentation masks. We demonstrate robust reconstruction of collimated regions achieving median Hausdorff distances of 4.3-5.0 mm on diverse test sets of real X-ray images. While this application involves at most four shadow borders, our method is not fundamentally limited by a specific number of edges.
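For readers unfamiliar with the Hough transform that the method builds on, the sketch below shows the classical, non-differentiable voting scheme for straight lines in NumPy. It is not the authors' differentiable network; the function `hough_peak`, the 1-degree angle grid, and the toy edge points are illustrative assumptions.

```python
import numpy as np

def hough_peak(points, max_rho, n_theta=180):
    """Vote in (rho, theta) space and return the strongest line.

    A line is parameterised as rho = x*cos(theta) + y*sin(theta),
    with theta sampled in whole degrees from 0 to n_theta - 1.
    """
    thetas = np.deg2rad(np.arange(n_theta))
    accumulator = np.zeros((2 * max_rho + 1, n_theta), dtype=int)
    for x, y in points:
        # One vote per angle: the rho of the line through (x, y) at that angle.
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rhos + max_rho, np.arange(n_theta)] += 1
    rho_idx, theta_idx = np.unravel_index(np.argmax(accumulator), accumulator.shape)
    return rho_idx - max_rho, theta_idx, accumulator[rho_idx, theta_idx]

# Toy "collimator border": 50 edge pixels on the vertical line x = 10.
edge_points = [(10, y) for y in range(50)]
rho, theta_deg, votes = hough_peak(edge_points, max_rho=64)
```

Each detected peak corresponds to one straight border, so a polygonal collimator shadow with up to four sides would appear as up to four peaks in the accumulator.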
[CV-6] Durian: Dual Reference-guided Portrait Animation with Attribute Transfer
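Quick read: This paper addresses zero-shot generation of portrait animation videos in which a facial attribute from a reference image is transferred to a target portrait with high fidelity and spatial consistency. The key to the solution is a dual reference network design that injects spatial features from both the portrait image and the attribute reference image into the denoising process of a diffusion model, giving consistent results across frames. A self-reconstruction training formulation, a mask expansion strategy based on keypoint-conditioned image generation, and spatial and appearance-level augmentations improve robustness to positional misalignment and diverse attribute combinations. As a result, the model generalizes across attributes and reference combinations without explicit triplet supervision and supports multi-attribute composition in a single generation pass.
Link: https://arxiv.org/abs/2509.04434
Authors: Hyunsoo Cha, Byungjun Kim, Hanbyul Joo
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL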
Abstract:We present Durian, the first method for generating portrait animation videos with facial attribute transfer from a given reference image to a target portrait in a zero-shot manner. To enable high-fidelity and spatially consistent attribute transfer across frames, we introduce dual reference networks that inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. We train the model using a self-reconstruction formulation, where two frames are sampled from the same portrait video: one is treated as the attribute reference and the other as the target portrait, and the remaining frames are reconstructed conditioned on these inputs and their corresponding masks. To support the transfer of attributes with varying spatial extent, we propose a mask expansion strategy using keypoint-conditioned image generation for training. In addition, we further augment the attribute and portrait images with spatial and appearance-level transformations to improve robustness to positional misalignment between them. These strategies allow the model to effectively generalize across diverse attributes and in-the-wild reference combinations, despite being trained without explicit triplet supervision. Durian achieves state-of-the-art performance on portrait animation with attribute transfer, and notably, its dual reference design enables multi-attribute composition in a single generation pass without additional training.
[CV-7] Few-step Flow for 3D Generation via Marginal-Data Transport Distillation
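Quick read: This paper addresses the large number of sampling steps that flow-based models need at inference time in 3D generation, which limits generation efficiency. Few-step distillation methods based on Consistency Models (CMs) have advanced 2D diffusion models but remain under-explored for more complex 3D generation. The key to the solution is MDT-dist, a framework that distills a pretrained model to learn the Marginal-Data Transport. Because optimizing this objective directly is intractable, the authors introduce two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), which move the optimization target from the transport level to the velocity level and the distribution level, respectively. This reduces the sampling steps of each 3D flow transformer from 25 to 1 or 2 while preserving high visual and geometric fidelity.
Link: https://arxiv.org/abs/2509.04406
Authors: Zanwei Zhou, Taoran Yi, Jiemin Fang, Chen Yang, Lingxi Xie, Xinggang Wang, Wei Shen, Qi Tian
Affiliations: Shanghai Jiao Tong University; Huazhong University of Science and Technology; Huawei Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL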
Abstract:Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective needs to integrate the velocity fields, while this integral is intractable to be implemented. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity and the distribution level respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneer 3D generation framework TRELLIS, our method reduces sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. Extensive experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.
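To illustrate why the number of sampling steps matters for flow-based models, the sketch below integrates a velocity field with fixed-step Euler updates and compares the error at 1, 2, and 25 steps. The toy velocity field `v(x, t) = -x` and the helper `euler_sample` are illustrative assumptions; this is not TRELLIS and does not implement MDT-dist, which instead trains a student so that very few steps suffice.

```python
import math

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)   # one network evaluation per step in a real sampler
    return x

def toy_velocity(x, t):
    # Hypothetical velocity field with a known solution x(1) = x0 * exp(-1).
    return -x

exact = math.exp(-1.0)
errors = {n: abs(euler_sample(toy_velocity, 1.0, n) - exact) for n in (1, 2, 25)}
```

With an undistilled velocity field, cutting the step count raises the integration error, so a few-step student has to learn a transport that remains accurate under such coarse steps.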
[CV-8] Learning neural representations for X-ray ptychography reconstruction with unknown probes
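Quick read: This paper addresses limited reconstruction accuracy in X-ray ptychography (scanning diffraction imaging) when the illuminating probe is unknown. Under low-dose and high-speed conditions in particular, conventional iterative methods and deep learning approaches are often suboptimal, which degrades reconstruction quality and restricts wider adoption of the technique. The key to the solution is the Ptychographic Implicit Neural Representation (PtyINR), a self-supervised framework that parameterizes both the object and the probe as continuous neural representations, enabling end-to-end reconstruction directly from raw diffraction patterns without pre-characterizing the probe. The method shows superior reconstruction quality on simulated and experimental data, is robust under low-signal conditions, and offers a general, physics-informed framework for probe-dependent inverse problems in computational microscopy.
Link: https://arxiv.org/abs/2509.04402
Authors: Tingyou Li, Zixin Xu, Zirui Gao, Hanfei Yan, Xiaojing Huang, Jizhou Li
Affiliations: The Chinese University of Hong Kong; The Hong Kong University of Science and Technology; Brookhaven National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: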
Abstract:X-ray ptychography provides exceptional nanoscale resolution and is widely applied in materials science, biology, and nanotechnology. However, its full potential is constrained by the critical challenge of accurately reconstructing images when the illuminating probe is unknown. Conventional iterative methods and deep learning approaches are often suboptimal, particularly under the low-signal conditions inherent to low-dose and high-speed experiments. These limitations compromise reconstruction fidelity and restrict the broader adoption of the technique. In this work, we introduce the Ptychographic Implicit Neural Representation (PtyINR), a self-supervised framework that simultaneously addresses the object and probe recovery problem. By parameterizing both as continuous neural representations, PtyINR performs end-to-end reconstruction directly from raw diffraction patterns without requiring any pre-characterization of the probe. Extensive evaluations demonstrate that PtyINR achieves superior reconstruction quality on both simulated and experimental data, with remarkable robustness under challenging low-signal conditions. Furthermore, PtyINR offers a generalizable, physics-informed framework for addressing probe-dependent inverse problems, making it applicable to a wide range of computational microscopy problems.
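[CV-9] Transition Models: Rethinking the Generative Learning Objective
[Quick Read]: This paper addresses the trade-off between generation quality and computational cost in generative modeling: iterative diffusion models achieve high fidelity but need many sampling steps and are therefore expensive, while efficient few-step alternatives are constrained by a hard quality ceiling. The key to the solution is an exact continuous-time dynamics equation that analytically defines state transitions over any finite time interval, which leads to a new generative paradigm, Transition Models (TiM). TiM adapts to an arbitrary number of steps, moving smoothly from single-step leaps to fine-grained multi-step refinement. With only 865M parameters it surpasses much larger models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters), and its quality improves monotonically as the sampling budget increases.
Link: https://arxiv.org/abs/2509.04394
Authors: Zidong Wang,Yiyuan Zhang,Xiaoyu Yue,Xiangyu Yue,Yangguang Li,Wanli Ouyang,Lei Bai
Affiliations: MMLab, CUHK; Shanghai AI Lab; USYD
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is released at this https URL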
Abstract:A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval. This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps. Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to 4096x4096.
[CV-10] SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer
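[Quick Read]: This paper addresses two shortcomings of current 3D style transfer methods: they struggle to extract and transfer high-level style semantics, and their results lack structural clarity and instance separation. Existing methods can map style patterns onto 3D-consistent neural representations but fail to preserve the high-level semantics of the reference image, so boundaries between objects or instances blur. The key to the solution is a new 3D style transfer pipeline that integrates priors from pretrained 2D diffusion models, with two core designs: (1) cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet so that features interact across multiple key views, keeping the stylized views faithful to the style and consistent at the instance level; and (2) instance-level style transfer, which exploits the instance-level consistency across the stylized key views and transfers it onto the 3D representation, yielding a more structured, visually coherent, and artistically expressive stylization.
Link: https://arxiv.org/abs/2509.04379
Authors: Jimin Xu,Bosheng Qin,Tao Jin,Zhou Zhao,Zhenhui Ye,Jun Yu,Fei Wu
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: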
Abstract:Recent advancements in neural representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have increased interest in applying style transfer to 3D scenes. While existing methods can transfer style patterns onto 3D-consistent neural representations, they struggle to effectively extract and transfer high-level style semantics from the reference style image. Additionally, the stylized results often lack structural clarity and separation, making it difficult to distinguish between different instances or objects within the 3D scene. To address these limitations, we propose a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. Then, we transfer the stylized key views onto the 3D representation. This process incorporates two innovative designs. The first is cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet, allowing feature interactions across multiple key views. This ensures that the diffusion model generates stylized key views that maintain both style fidelity and instance-level consistency. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation. This results in a more structured, visually coherent, and artistically enriched stylization. Extensive qualitative and quantitative experiments demonstrate that our 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Visit our project page this https URL for immersive visualization.
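[CV-11] Aesthetic Image Captioning with Saliency Enhanced MLLMs
[Quick Read]: This paper addresses the lack of targeted modeling of aesthetic content in current Aesthetic Image Captioning (AIC) methods built on pretrained Multimodal Large Language Models (MLLMs). Most existing approaches rely on fine-tuning and do not steer the MLLM toward the target aesthetic content, which limits caption quality. The key to the solution is the end-to-end Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM). It introduces the Image Aesthetic Saliency Module (IASM) to extract aesthetic saliency features, and IAS-ViT, an image encoder that fuses those features with the original image features through cross-attention, explicitly strengthening the MLLM's grasp of aesthetic content. According to the authors, this is the first framework to integrate aesthetic saliency into MLLMs for AIC, and it reaches state-of-the-art performance on mainstream AIC benchmarks.
Link: https://arxiv.org/abs/2509.04378
Authors: Yilin Tao,Jiashui Huang,Huaze Xu,Ling Shao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: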
Abstract:Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs; this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrate that our approach significantly outperforms traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.
[CV-12] AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-Based Person Anomaly Search
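[Quick Read]: This paper addresses text-based person anomaly search, whose core challenges are (1) fine-grained cross-modal alignment between textual descriptions and visual behaviors, and (2) anomaly recognition when real-world samples are sparse. It proposes AnomalyLMM, a coarse-to-fine pipeline that turns the world knowledge of generative Large Multi-modal Models (LMMs) into discriminative ability for retrieval. It also designs a training-free adaptation strategy consisting of masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, so the model can focus on subtle anomaly cues in a zero-shot setting. The method improves Recall@1 by 0.96% over a competitive baseline.
Link: https://arxiv.org/abs/2509.04376
Authors: Hao Ju,Hu Zhang,Zhedong Zheng
Affiliations: University of Macau; CSIRO Data61
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: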
Abstract:With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.
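The headline number above is a Recall@1 gain. As a reminder of what that metric measures, the following sketch computes Recall@K for a retrieval task in which each text query has exactly one correct gallery image. The data layout (dictionaries keyed by query ID) and the toy identifiers are assumptions for illustration, not the PAB benchmark's evaluation code.

```python
def recall_at_k(rankings, ground_truth, k=1):
    """Fraction of queries whose correct item appears in the top-k results.

    rankings:      dict query_id -> list of gallery IDs, best match first
    ground_truth:  dict query_id -> the single correct gallery ID
    """
    hits = sum(
        1 for qid, ranked in rankings.items()
        if ground_truth[qid] in ranked[:k]
    )
    return hits / len(rankings)

# Hypothetical toy retrieval results for three text queries.
rankings = {
    "q1": ["img3", "img1"],
    "q2": ["img2", "img5"],
    "q3": ["img9", "img7"],
}
ground_truth = {"q1": "img3", "q2": "img5", "q3": "img7"}

r1 = recall_at_k(rankings, ground_truth, k=1)
r2 = recall_at_k(rankings, ground_truth, k=2)
```

Here only `q1` is answered correctly at rank 1, while all three queries are answered within the top two.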
[CV-13] Stitching the Story: Creating Panoramic Incident Summaries from Body-Worn Footage
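[Quick Read]: This paper addresses the difficulty first responders face in reviewing lengthy body-worn camera footage during emergencies, with the goal of rapid situational awareness. The core solution is a computer vision pipeline that uses monocular Simultaneous Localization and Mapping (SLAM) to estimate the camera trajectory and reconstruct the spatial layout of the environment, clusters camera poses to identify key viewpoints, selects representative frames from each cluster, and stitches them into spatially coherent panoramic summaries that support fast understanding of complex scenes and decision-making.
Link: https://arxiv.org/abs/2509.04370
Authors: Dor Cohen,Inga Efrosman,Yehudit Aperstein,Alexander Apartsin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures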
Abstract:First responders widely adopt body-worn cameras to document incident scenes and support post-event analysis. However, reviewing lengthy video footage is impractical in time-critical situations. Effective situational awareness demands a concise visual summary that can be quickly interpreted. This work presents a computer vision pipeline that transforms body-camera footage into informative panoramic images summarizing the incident scene. Our method leverages monocular Simultaneous Localization and Mapping (SLAM) to estimate camera trajectories and reconstruct the spatial layout of the environment. Key viewpoints are identified by clustering camera poses along the trajectory, and representative frames from each cluster are selected. These frames are fused into spatially coherent panoramic images using multi-frame stitching techniques. The resulting summaries enable rapid understanding of complex environments and facilitate efficient decision-making and incident review.
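The keyframe-selection step described above (cluster camera poses, then pick a representative frame per cluster) can be sketched as follows. This is a minimal illustration that assumes poses are reduced to 3D camera positions and uses plain k-means with a deterministic initialization; the abstract does not state which clustering algorithm or pose representation the authors use.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; initial centroids are spread evenly along the trajectory."""
    n = len(points)
    centroids = [points[(i * (n - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(idx)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                coords = [points[i] for i in members]
                centroids[j] = tuple(sum(c) / len(coords) for c in zip(*coords))
    return centroids, clusters

def select_keyframes(positions, k):
    """Return one frame index per cluster: the frame closest to the centroid."""
    centroids, clusters = kmeans(positions, k)
    keyframes = []
    for centroid, members in zip(centroids, clusters):
        if members:
            keyframes.append(min(members, key=lambda i: math.dist(positions[i], centroid)))
    return sorted(keyframes)

# Hypothetical camera positions from SLAM: two places visited along a walk.
positions = [(0.0, 0, 0), (0.1, 0, 0), (0.2, 0, 0), (5.0, 0, 0), (5.1, 0, 0), (5.2, 0, 0)]
keyframes = select_keyframes(positions, k=2)
```

In a full pipeline the selected frames would then be passed to an image stitcher; camera orientation would normally also enter the clustering, which this position-only sketch ignores.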
[CV-14] Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking
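[Quick Read]: This paper addresses the limits of the conventional "global first, local second" design of image retrieval systems: global features can shortlist candidates quickly but miss partial, local matches, while local feature matching is accurate but has been too expensive to run over large databases. The key to the solution is a new local-to-global retrieval paradigm: efficient local feature search first finds candidate matches across the large database, and the results are then re-ranked with global features computed on the fly. These global features are derived from the local feature retrieval similarities, using multidimensional scaling (MDS) to produce embeddings that respect the local similarities, which gives a significant re-ranking boost and new state-of-the-art results on the Revisited Oxford and Paris datasets.
Link: https://arxiv.org/abs/2509.04351
Authors: Dror Aiger,Bingyi Cao,Kaifeng Chen,Andre Araujo
Affiliations: Google; Google DeepMind
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: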
Abstract:The dominant paradigm in image retrieval systems today is to search large databases using global image features, and re-rank those initial results with local image feature matching techniques. This design, dubbed global-to-local, stems from the computational cost of local matching approaches, which can only be afforded for a small number of retrieved images. However, emerging efficient local feature search approaches have opened up new possibilities, in particular enabling detailed retrieval at large scale, to find partial matches which are often missed by global feature search. In parallel, global feature-based re-ranking has shown promising results with high computational efficiency. In this work, we leverage these building blocks to introduce a local-to-global retrieval paradigm, where efficient local feature search meets effective global feature re-ranking. Critically, we propose a re-ranking method where global features are computed on-the-fly, based on the local feature retrieval similarities. Such re-ranking-only global features leverage multidimensional scaling techniques to create embeddings which respect the local similarities obtained during search, enabling a significant re-ranking boost. Experimentally, we demonstrate solid retrieval performance, setting new state-of-the-art results on the Revisited Oxford and Paris datasets.
[CV-15] MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition
[Quick Read]: This paper addresses the model induction bias in Dynamic Facial Expression Recognition (DFER) caused by long-tailed category distributions and the complexity of spatio-temporal feature modeling. The key to the solution is MICACL, a multi-instance learning framework with three components: (1) the Graph-Enhanced Instance Interaction Module (GEIIM), which captures intricate spatio-temporal relationships between adjacent instances through adaptive adjacency matrices and multiscale convolutions; (2) the Weighted Instance Aggregation Network (WIAN), which assigns weights dynamically according to instance importance to improve feature aggregation; and (3) a Multiscale Category-aware Contrastive Learning (MCCL) strategy that balances training between major and minor categories. Together these improve performance, robustness, and generalization on in-the-wild datasets such as DFEW and FERV39k.
Link: https://arxiv.org/abs/2509.04344
Authors: Feng-Qi Cui,Zhen Lin,Xinlong Rao,Anyang Tong,Shiyao Li,Fei Wang,Changlin Chen,Bin Liu
Affiliations: University of Science and Technology of China; Hefei University of Technology; IAI, Hefei Comprehensive National Science Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE ISPA2025
Abstract:Dynamic facial expression recognition (DFER) faces significant challenges due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. While existing deep learning-based methods have improved DFER performance, they often fail to address these issues, resulting in severe model induction bias. To overcome these limitations, we propose a novel multi-instance learning framework called MICACL, which integrates spatio-temporal dependency modeling and long-tailed contrastive learning optimization. Specifically, we design the Graph-Enhanced Instance Interaction Module (GEIIM) to capture intricate spatio-temporal relationships between adjacent instances through adaptive adjacency matrices and multiscale convolutions. To enhance instance-level feature aggregation, we develop the Weighted Instance Aggregation Network (WIAN), which dynamically assigns weights based on instance importance. Furthermore, we introduce a Multiscale Category-aware Contrastive Learning (MCCL) strategy to balance training between major and minor categories. Extensive experiments on in-the-wild datasets (i.e., DFEW and FERV39k) demonstrate that MICACL achieves state-of-the-art performance with superior robustness and generalization.
[CV-16] From Editor to Dense Geometry Estimator
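[Quick Read]: This paper addresses the unstable convergence and performance ceiling of fine-tuning text-to-image (T2I) generative models for dense geometry estimation (such as monocular depth and surface normal prediction). The underlying issue is that dense prediction is inherently an image-to-image task, for which T2I generators are a less natural starting point than image editing models. The key to the solution is FE2E, a framework that adapts an image editing model based on the Diffusion Transformer (DiT) architecture into a deterministic geometry predictor. It reformulates the editor's original flow matching loss into a "consistent velocity" training objective suited to the task, and it uses logarithmic quantization to resolve the conflict between the editor's native BFloat16 precision and the high precision that geometry estimation requires. It also uses the DiT's global attention to estimate depth and normals jointly in a single forward pass, so that the two supervisory signals reinforce each other. Without scaling up the training data, FE2E clearly outperforms existing methods on multiple datasets, with gains of over 35% on ETH3D.
Link: https://arxiv.org/abs/2509.04338
Authors: JiYuan Wang,Chunyu Lin,Lei Sun,Rongying Liu,Lang Nie,Mingxing Li,Kang Liao,Xiangxiang Chu,Yao Zhao
Affiliations: BJTU; AMAP Alibaba Group; CQUPT; NTU
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages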
Abstract:Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor’s native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT’s global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× data. The project page can be accessed at this https URL.
[CV-17] GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization
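[Quick Read]: This paper addresses two problems in the current evaluation of image geolocalization. The first is data leakage: advanced methods often rely on large vision-language models (LVLMs) that were pretrained on the test datasets, which distorts any measurement of a model's real geolocalization ability. The second is that existing metrics rely mainly on exact geographic coordinates, which ignores the reasoning process and can raise location-privacy concerns for users. The key to the solution is GeoArena, an open platform for evaluating LVLMs on worldwide image geolocalization. It lets users upload in-the-wild images and uses pairwise human judgments to decide which model output better matches human expectations, giving a fairer and more human-centered benchmark.
Link: https://arxiv.org/abs/2509.04334
Authors: Pengyue Jia,Yingyi Zhang,Xiangyu Zhao,Yixuan Li
Affiliations: City University of Hong Kong; University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: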
Abstract:Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model’s actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.
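One common way to turn pairwise votes like these into a leaderboard is an Elo-style rating update. The sketch below shows that idea only; the abstract does not say which rating scheme GeoArena uses, so the choice of Elo, the `k` factor, the base rating, and the model names are all assumptions.

```python
def elo_leaderboard(votes, k=32.0, base=1000.0):
    """Build a ranking from (winner, loser) votes with sequential Elo updates."""
    ratings = {}
    for winner, loser in votes:
        r_win = ratings.setdefault(winner, base)
        r_lose = ratings.setdefault(loser, base)
        # Probability the winner was expected to win under the current ratings.
        expected = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
        delta = k * (1.0 - expected)  # surprising wins move ratings more
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)

# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
leaderboard = elo_leaderboard(votes)
```

Sequential Elo depends on vote order; arena-style platforms often fit a Bradley-Terry model over all votes instead, which is order-independent.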
[CV-18] Efficient Odd-One-Out Anomaly Detection
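[Quick Read]: This paper addresses the odd-one-out anomaly detection task, in which the odd-looking instances must be identified within a multi-object scene. The task requires spatial reasoning across multiple views and relational reasoning to understand context and to generalize across object categories and layouts. The key to the solution is a DINO-based model that, while keeping competitive performance, reduces the number of parameters by one third and shortens training time by a factor of three compared with the current state of the art, giving a better balance between efficiency and accuracy.
Link: https://arxiv.org/abs/2509.04326
Authors: Silvio Chito,Paolo Rabino,Tatiana Tommasi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICIAP 2025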
Abstract:The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at this https URL
[CV-19] OVGrasp: Open-Vocabulary Grasping Assistance via Multimodal Intent Detection
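[Quick Read]: This paper addresses reliable grasping assistance for people with motor impairments in unstructured environments, where object categories and user intentions are diverse and unpredictable. The key to the solution is OVGrasp, a hierarchical control framework that combines RGB-D vision, open-vocabulary prompts, and voice commands for multimodal interaction. It incorporates a vision-language foundation model with an open-vocabulary mechanism, allowing zero-shot detection of previously unseen objects without retraining, and a multimodal decision-maker that fuses spatial and linguistic cues to infer user intent (such as grasp or release) in multi-object scenes. This improves generalization in real, cluttered environments and kinematic alignment with natural hand motion.
Link: https://arxiv.org/abs/2509.04324
Authors: Chen Hu,Shan Luo,Letizia Gionfrida
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: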
Abstract:Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present OVGrasp, a hierarchical control framework for soft exoskeleton-based grasp assistance that integrates RGB-D vision, open-vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalization in open environments, OVGrasp incorporates a vision-language foundation model with an open-vocabulary mechanism, allowing zero-shot detection of previously unseen objects without retraining. A multimodal decision-maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in multi-object scenarios. We deploy the complete framework on a custom egocentric-view wearable exoskeleton and conduct systematic evaluations on 15 objects across three grasp types. Experimental results with ten participants demonstrate that OVGrasp achieves a grasping ability score (GAS) of 87.00%, outperforming state-of-the-art baselines and achieving improved kinematic alignment with natural hand motion.
[CV-20] Noisy Label Refinement with Semantically Reliable Synthetic Images ICIP2025
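[Quick Read]: This paper addresses semantic noise in image classification datasets, where visually similar categories are frequently mislabeled, which significantly hurts conventional supervised learning. The key to the solution is to use advanced text-to-image models to synthesize high-quality images with reliable labels and to use them as reference points rather than as training data: the synthetic images help identify and correct mislabeled samples in the real dataset, which raises classification accuracy. The method performs well on several benchmark datasets, especially under a high proportion of semantic noise, and it is compatible with existing noise-robust learning techniques, which improves performance further.
Link: https://arxiv.org/abs/2509.04298
Authors: Yingxuan Li,Jiafeng Mao,Yusuke Matsui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICIP2025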
Abstract:Semantic noise in image classification datasets, where visually similar categories are frequently mislabeled, poses a significant challenge to conventional supervised learning approaches. In this paper, we explore the potential of using synthetic images generated by advanced text-to-image models to address this issue. Although these high-quality synthetic images come with reliable labels, their direct application in training is limited by domain gaps and diversity constraints. Unlike conventional approaches, we propose a novel method that leverages synthetic images as reliable reference points to identify and correct mislabeled samples in noisy datasets. Extensive experiments across multiple benchmark datasets show that our approach significantly improves classification accuracy under various noise conditions, especially in challenging scenarios with semantic label noise. Additionally, since our method is orthogonal to existing noise-robust learning techniques, when combined with state-of-the-art noise-robust training methods, it achieves superior performance, improving accuracy by 30% on CIFAR-10 and by 11% on CIFAR-100 under 70% semantic noise, and by 24% on ImageNet-100 under real-world noise conditions.
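The abstract does not spell out how the synthetic reference points are used, so the following is only one simple possibility, shown to make the idea tangible: average the feature vectors of synthetic images per class into prototypes, then relabel each real sample with the class of its nearest prototype. The prototype-based rule, the function names, and the 2D toy features are all assumptions, not the paper's algorithm.

```python
import math

def class_prototypes(synthetic_features):
    """Mean feature vector per class, computed from synthetic images only."""
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in synthetic_features.items()
    }

def refine_labels(samples, prototypes):
    """Relabel each (feature, noisy_label) pair by its nearest prototype.

    Returns (refined_labels, indices_whose_label_changed).
    """
    refined, changed = [], []
    for idx, (feature, noisy_label) in enumerate(samples):
        nearest = min(prototypes, key=lambda lbl: math.dist(feature, prototypes[lbl]))
        refined.append(nearest)
        if nearest != noisy_label:
            changed.append(idx)
    return refined, changed

# Hypothetical 2D features; in practice these would come from an image encoder.
synthetic = {
    "cat": [[0.9, 0.1], [1.1, -0.1]],
    "dog": [[-1.0, 0.2], [-1.0, -0.2]],
}
real = [([0.8, 0.0], "cat"), ([-0.9, 0.1], "cat"), ([-1.2, 0.0], "dog")]

prototypes = class_prototypes(synthetic)
labels, flagged = refine_labels(real, prototypes)
```

In this toy case the second sample sits next to the "dog" prototype while carrying a "cat" label, so it is the only one flagged and relabeled.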
[CV-21] PAOLI: Pose-free Articulated Object Learning from Sparse-view Images
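[Quick Read]: This paper addresses learning articulated object representations from sparse-view images without camera pose annotations. Prior methods usually depend on dense multi-view observations and ground-truth camera poses, whereas this work achieves high-quality reconstruction with as few as four views per articulation state and no camera supervision. The key to the solution is to first reconstruct each articulation state independently using recent sparse-view 3D reconstruction techniques, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy separates static from moving parts, which allows camera motion and object motion to be separated robustly. Finally, geometry, appearance, and kinematics are optimized jointly with a self-supervised loss that enforces cross-view and cross-pose consistency.
Link: https://arxiv.org/abs/2509.04276
Authors: Jianning Deng,Kartic Subr,Hakan Bilen
Affiliations: University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: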
Abstract:We present a novel self-supervised framework for learning articulated object representations from sparse-view, unposed images. Unlike prior methods that require dense multi-view observations and ground-truth camera poses, our approach operates with as few as four views per articulation and no camera supervision. To address the inherent challenges, we first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we jointly optimize geometry, appearance, and kinematics with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.
[CV-22] Dual-Scale Volume Priors with Wasserstein-Based Consistency for Semi-Supervised Medical Image Segmentation
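[Quick Read]: This paper addresses the lack of effective methodological guidance for feature extraction, and the underuse of important prior information from datasets, in semi-supervised medical image segmentation. The key to the solution is to integrate spatial regularization and volume priors into the backbone segmentation network. At the image scale, a regression network estimates the target region volumes of each unlabeled image, and a Wasserstein distance constraint regularizes the segmentation so that its class ratios match the predicted volumes. At the dataset scale, a Wasserstein distance loss based on a weak implicit volume prior pushes the volume distribution of the unlabeled dataset toward that of the labeled dataset, improving generalization when labeled samples are limited.
Link: https://arxiv.org/abs/2509.04273
Authors: Junying Meng,Gangxuan Zhou,Jun Liu,Weihong Guo
Affiliations: Shanxi University; Beijing Normal University; Case Western Reserve University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: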
Abstract:Despite significant progress in semi-supervised medical image segmentation, most existing segmentation networks overlook effective methodological guidance for feature extraction and important prior information from datasets. In this paper, we develop a semi-supervised medical image segmentation framework that effectively integrates spatial regularization methods and volume priors. Specifically, our approach integrates a strong explicit volume prior at the image scale and Threshold Dynamics spatial regularization, both derived from variational models, into the backbone segmentation network. The target region volumes for each unlabeled image are estimated by a regression network, which effectively regularizes the backbone segmentation network through an image-scale Wasserstein distance constraint, ensuring that the class ratios in the segmentation results for each unlabeled image match those predicted by the regression network. Additionally, we design a dataset-scale Wasserstein distance loss function based on a weak implicit volume prior, which enforces that the volume distribution predicted for the unlabeled dataset is similar to that of labeled dataset. Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset show the superiority of the proposed method.
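For intuition about the dataset-scale term, the sketch below computes the 1D Wasserstein-1 distance between two equally sized sets of scalar volumes, which for empirical distributions reduces to the mean absolute difference of the sorted samples. Treating each image's target volume as a single scalar fraction, and the toy numbers, are assumptions for illustration; the paper's actual loss is not specified in the abstract.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1D empirical distributions.

    For equal sample counts the optimal transport plan matches the i-th
    smallest value of `a` with the i-th smallest value of `b`.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample counts")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Hypothetical foreground volume fractions (share of voxels in the target class).
labeled_volumes = [0.10, 0.12, 0.15, 0.20]
predicted_unlabeled_volumes = [0.30, 0.11, 0.14, 0.16]

gap = wasserstein_1d(labeled_volumes, predicted_unlabeled_volumes)
```

A small value means the predicted volumes for unlabeled images are distributed like the labeled ones, without requiring any individual image to match a specific target.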
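[CV-23] TauGenNet: Plasma-Driven Tau PET Image Synthesis via Text-Guided 3D Diffusion Models
[Quick Read]: This paper addresses the limited use of tau positron emission tomography (tau PET) for diagnosing and monitoring Alzheimer's disease (AD), which stems from its high cost and low availability. The key to the solution is a text-guided 3D diffusion model that synthesizes 3D tau PET images from multimodal conditions: a text prompt derived from the plasma p-tau217 measurement, a key indicator of AD progression, and structural magnetic resonance imaging (MRI), which provides anatomical constraints. Trained and evaluated on clinical AV1451 tau PET data from the ADNI database, the method generates realistic, clinically meaningful 3D tau PET images across disease stages. This offers a route to tau PET data augmentation, non-invasive visualization of tau pathology, and simulation of disease progression under varying plasma biomarker levels and cognitive conditions.
Link: https://arxiv.org/abs/2509.04269
Authors: Yuxin Gong,Se-in Jang,Wei Shao,Yi Su,Kuang Gong (for the Alzheimer’s Disease Neuroimaging Initiative (ADNI))
Affiliations: University of Florida; Yale University; Banner Alzheimer’s Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, submitted to IEEE Transactions on Radiation and Plasma Medical Sciences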
Abstract:Accurate quantification of tau pathology via tau positron emission tomography (PET) scan is crucial for diagnosing and monitoring Alzheimer’s disease (AD). However, the high cost and limited availability of tau PET restrict its widespread use. In contrast, structural magnetic resonance imaging (MRI) and plasma-based biomarkers provide non-invasive and widely available complementary information related to brain anatomy and disease progression. In this work, we propose a text-guided 3D diffusion model for 3D tau PET image synthesis, leveraging multimodal conditions from both structural MRI and plasma measurement. Specifically, the textual prompt is from the plasma p-tau217 measurement, which is a key indicator of AD progression, while MRI provides anatomical structure constraints. The proposed framework is trained and evaluated using clinical AV1451 tau PET data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Experimental results demonstrate that our approach can generate realistic, clinically meaningful 3D tau PET across a range of disease stages. The proposed framework can help perform tau PET data augmentation under different settings, provide a non-invasive, cost-effective alternative for visualizing tau pathology, and support the simulation of disease progression under varying plasma biomarker levels and cognitive conditions.
[CV-24] Differential Morphological Profile Neural Networks for Semantic Segmentation
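[Quick Read]: This paper addresses the challenges that extreme scale variation, foreground-background imbalance, and large image sizes pose for semantic segmentation of remote sensing imagery, which limit segmentation networks developed on ground-perspective photographs. The key to the solution is to bring the Differential Morphological Profile (DMP), a multi-scale shape extraction method, into modern convolutional and transformer segmentation models, using two strategies: direct input, which modifies the input stem of the feature extraction backbone to accept DMP channels, and a hybrid architecture, a dual-stream design that fuses RGB and DMP encoders. Experiments show that the hybrid DMP architecture consistently outperforms the direct-input variant and can surpass a non-DMP model on mIoU, F1, and Recall.
Link: https://arxiv.org/abs/2509.04268
Authors: David Huangal,J. Alex Hurt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures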
Abstract:Semantic segmentation of overhead remote sensing imagery enables applications in mapping, urban planning, and disaster response. State-of-the-art segmentation networks are typically developed and tuned on ground-perspective photographs and do not directly address remote sensing challenges such as extreme scale variation, foreground-background imbalance, and large image sizes. We explore the incorporation of the differential morphological profile (DMP), a multi-scale shape extraction method based on grayscale morphology, into modern segmentation networks. Prior studies have shown that the DMP can provide critical shape information to Deep Neural Networks to enable superior detection and classification performance in overhead imagery. In this work, we extend prior DMPNet work beyond classification and object detection by integrating DMP features into three state-of-the-art convolutional and transformer semantic segmentation architectures. We utilize both direct input, which adapts the input stem of feature extraction architectures to accept DMP channels, and hybrid architectures, a dual-stream design that fuses RGB and DMP encoders. Using the iSAID benchmark dataset, we evaluate a variety of DMP differentials and structuring element shapes to more effectively provide shape information to the model. Our results show that while non-DMP models generally outperform the direct-input variants, hybrid DMP consistently outperforms direct-input and is capable of surpassing a non-DMP model on mIoU, F1, and Recall.
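To show what a differential morphological profile captures, here is a minimal 1D sketch of its opening half: the signal is opened with structuring elements of increasing size, and each profile level is the difference between successive openings, so a structure shows up at the scale at which it disappears. The 1D signal, the flat window, and the function names are assumptions for illustration; DMPs for imagery operate on 2D grayscale images, usually also include a closing profile, and often use opening by reconstruction.

```python
def erode(signal, radius):
    """Grayscale erosion: minimum over a flat window of half-width `radius`."""
    n = len(signal)
    return [min(signal[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]

def dilate(signal, radius):
    """Grayscale dilation: maximum over a flat window of half-width `radius`."""
    n = len(signal)
    return [max(signal[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]

def opening(signal, radius):
    """Erosion followed by dilation; removes bright structures narrower than the window."""
    return dilate(erode(signal, radius), radius)

def opening_dmp(signal, radii):
    """Differences between successive openings at increasing scales."""
    profile, previous = [], list(signal)
    for radius in radii:
        current = opening(signal, radius)
        profile.append([abs(p - c) for p, c in zip(previous, current)])
        previous = current
    return profile

# Hypothetical signal: a 1-sample spike and a 3-sample plateau.
signal = [0, 0, 5, 0, 0, 3, 3, 3, 0, 0]
dmp = opening_dmp(signal, radii=[1, 2])
```

The narrow spike responds at the first scale and the wider plateau at the second, which is the kind of per-scale shape cue that the DMP channels feed to the segmentation network.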
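[CV-25] Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding
[Quick Read]: This paper addresses how Vision Language Models (VLMs) can reason over and focus on the instruction-relevant regions of an image in graphical user interface (GUI) grounding, particularly with high-resolution inputs and complex multi-element interactions, where models struggle to predict precise coordinates and lack multi-step perception. The key idea is the LASER framework, which combines Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly improve the accuracy and diversity of preference data. This guides the model to focus progressively on key regions and to allocate reasoning steps adaptively according to task complexity, so that its multi-step perception capability evolves on its own.
Link: https://arxiv.org/abs/2509.04243
Authors: Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, Juntao Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: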
Abstract:Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. Recently, the OpenAI o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.
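The IoU measure that the abstract builds its region quality evaluation on can be sketched as follows. This shows only the standard IoU between two axis-aligned boxes. How LASER turns it into a preference signal is not specified here, and the `(x1, y1, x2, y2)` box convention is an assumption.

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# A candidate zoom-in region compared against a target element box.
candidate = (0, 0, 2, 2)
target = (1, 1, 3, 3)
print(box_iou(candidate, target))
```

A region that tightly contains the target scores close to 1, while a region that misses it scores 0, which is what makes IoU usable as a ranking signal between candidate regions.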
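[CV-26] DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval
[Quick Read]: This paper addresses the drop in retrieval performance caused by the domain gap in Unsupervised Cross-Domain Image Retrieval (UCIR): when object features are entangled with domain-specific styles, existing methods struggle to align features across domains. The key idea is DUDE, which uses a text-to-image generative model for feature disentanglement, separating object semantics from domain-specific style. It then aligns mutual neighbors progressively, first within domains and then across domains, so that the disentangled object features are aligned reliably and retrieval accuracy improves.
Link: https://arxiv.org/abs/2509.04193
Authors: Ruohong Yang, Peng Hu, Yunfan Li, Xi Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: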
Abstract:Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantical image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.
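The notion of "mutual neighbors" can be illustrated with a small sketch: two samples are paired only if each is the other's nearest neighbor. This is a generic mutual-nearest-neighbor routine under cosine similarity, not DUDE's progressive alignment procedure, and all names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, candidates):
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def mutual_nearest_pairs(feats_a, feats_b):
    """Pairs (i, j) where a_i and b_j pick each other as nearest neighbor."""
    a_to_b = [nearest(f, feats_b) for f in feats_a]
    b_to_a = [nearest(f, feats_a) for f in feats_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy features from two domains (for example, sketches and photos).
domain_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
domain_b = [[0.9, 0.1], [0.1, 0.9]]
pairs = mutual_nearest_pairs(domain_a, domain_b)
print(pairs)
```

The third feature in `domain_a` has a nearest neighbor in `domain_b`, but that neighbor prefers a different sample, so no pair is formed. Requiring agreement in both directions is what filters out such unreliable matches.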
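[CV-27] VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision
[Quick Read]: This paper addresses the high manual cost and low efficiency of image annotation, a bottleneck that is hard to scale to large datasets because traditional tools depend on extensive manual input. The key idea is VisioFirm, an open-source web application that builds an AI-assisted annotation framework from several foundation models. CLIP combined with pre-trained detectors (such as Ultralytics models) produces initial annotations for common classes, and zero-shot models (such as Grounding DINO) support custom labels. A low confidence threshold maximizes recall, and interactive tools let users refine the results. Segment Anything, accelerated through WebGPU, provides on-the-fly segmentation in the browser. The authors report up to a 90% reduction in manual effort across diverse datasets, with annotation accuracy maintained by CLIP-based component clustering and an IoU graph that suppresses redundant detections.
Link: https://arxiv.org/abs/2509.04180
Authors: Safouane El Ghazouali, Umberto Michelucci
Affiliations: TOELT LLC AI lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: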
Abstract:AI models rely on annotated data to learn patterns and perform predictions. Annotation is usually a labor-intensive step that requires associating labels ranging from a simple classification label to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open-source web application designed to streamline image labeling through AI-assisted automation. VisioFirm integrates state-of-the-art foundation models into an interface with a filtering pipeline to reduce human-in-the-loop efforts. This hybrid approach employs CLIP combined with pre-trained detectors like Ultralytics models for common classes and zero-shot models such as Grounding DINO for custom labels, generating initial annotations with low-confidence thresholding to maximize recall. Through this framework, when tested on COCO-type classes, initial predictions have been proven to be mostly correct, though users can refine them via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm has on-the-fly segmentation powered by Segment Anything accelerated through WebGPU for browser-side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility. VisioFirm demonstrates up to 90% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via clustering of connected CLIP-based disambiguated components and an IoU graph for redundant detection suppression. VisioFirm can be accessed from this https URL.
[CV-28] YOLO Ensemble for UAV-based Multispectral Defect Detection in Wind Turbine Components
[Quick Read]: This paper addresses the reliability of defect detection in critical wind turbine components (such as blades and towers), which suffers from insufficient data resolution and inefficient processing of multispectral imagery. The key idea is an ensemble built on the YOLO architecture that pairs a general-purpose YOLOv8 model with a specialized thermal model and combines the predictions from the two modalities (visible and thermal infrared) with a bounding box fusion algorithm, improving the accuracy and robustness of defect detection.
Link: https://arxiv.org/abs/2509.04156
Authors: Serhii Svystun, Pavlo Radiuk, Oleksandr Melnychenko, Oleg Savenko, Anatoliy Sachenko
Affiliations: Khmelnytskyi National University; West Ukrainian National University; Casimir Pulaski Radom University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: The 13th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 4-6 September, 2025, Gliwice, Poland
Abstract:Unmanned aerial vehicles (UAVs) equipped with advanced sensors have opened up new opportunities for monitoring wind power plants, including blades, towers, and other critical components. However, reliable defect detection requires high-resolution data and efficient methods to process multispectral imagery. In this research, we aim to enhance defect detection accuracy through the development of an ensemble of YOLO-based deep learning models that integrate both visible and thermal channels. We propose an ensemble approach that integrates a general-purpose YOLOv8 model with a specialized thermal model, using a sophisticated bounding box fusion algorithm to combine their predictions. Our experiments show this approach achieves a mean Average Precision (mAP@.5) of 0.93 and an F1-score of 0.90, outperforming a standalone YOLOv8 model, which scored an mAP@.5 of 0.91. These findings demonstrate that combining multiple YOLO architectures with fused multispectral data provides a more reliable solution, improving the detection of both visual and thermal defects.
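The abstract does not describe the fusion algorithm itself, so the following is only a generic stand-in: overlapping detections from two models are merged by confidence-weighted averaging of their coordinates, and unmatched detections are kept. The threshold, the greedy matching, the score rule, and all names are assumptions, and confidence scores are assumed to be positive.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(dets_a, dets_b, iou_threshold=0.5):
    """Merge two lists of (box, score); overlapping boxes are averaged by score."""
    fused, used_b = [], set()
    for box_a, score_a in dets_a:
        best_j, best_iou = None, iou_threshold
        for j, (box_b, _) in enumerate(dets_b):
            overlap = box_iou(box_a, box_b)
            if j not in used_b and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append((box_a, score_a))  # seen by model A only
            continue
        box_b, score_b = dets_b[best_j]
        used_b.add(best_j)
        total = score_a + score_b
        box = tuple((score_a * a + score_b * b) / total for a, b in zip(box_a, box_b))
        fused.append((box, max(score_a, score_b)))
    # Detections seen by model B only are kept unchanged.
    fused.extend(det for j, det in enumerate(dets_b) if j not in used_b)
    return fused


visible = [((0, 0, 10, 10), 0.6)]
thermal = [((2, 0, 12, 10), 0.4), ((50, 50, 60, 60), 0.9)]
result = fuse_detections(visible, thermal)
print(result)
```

Keeping the unmatched thermal detection is the point of an ensemble like this: a defect visible in only one modality still reaches the final output.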
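[CV-29] Revisiting Simple Baselines for In-The-Wild Deepfake Detection
[Quick Read]: This paper addresses the weak performance of current deepfake detection models in realistic settings, especially on "in-the-wild" (uncontrolled) datasets. Most existing work evaluates on highly controlled datasets, which limits generalization. The key idea is to revisit a simple transfer-learning approach built on a pretrained vision backbone, originally introduced by Ojha et al., and tune its hyperparameters more carefully. Initial reporting put finetuned open-source models at 61% to 69% accuracy on the Deepfake-Eval-2024 benchmark. With better tuning, this baseline reaches 81%, close to the leading commercial detector (82%), which shows that a simple architecture can be competitive and practical to deploy when tuned well.
Link: https://arxiv.org/abs/2509.04150
Authors: Orlando Castaneda, Kevin So-Tang, Kshitij Gurung
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: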
Abstract:The widespread adoption of synthetic media demands accessible deepfake detectors and realistic benchmarks. While most existing research evaluates deepfake detectors on highly controlled datasets, we focus on the recently released “in-the-wild” benchmark, Deepfake-Eval-2024. Initial reporting on Deepfake-Eval-2024 showed that three finetuned open-source models achieve accuracies between 61% and 69%, significantly lagging behind the leading commercial deepfake detector with 82% accuracy. Our work revisits one of these baseline approaches, originally introduced by Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors. We demonstrate that with better-tuned hyperparameters, this simple approach actually yields much higher performance – 81% accuracy on Deepfake-Eval-2024 – surpassing the previously reported accuracy of this baseline approach by 18% and competing with commercial deepfake detectors. We discuss tradeoffs in accuracy, computational costs, and interpretability, focusing on how practical these deepfake detectors might be when deployed in real-world settings. Our code can be found at this https URL.
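[CV-30] Hyper Diffusion Avatars: Dynamic Human Avatar Generation using Network Weight Space Diffusion
[Quick Read]: This paper addresses the difficulty of achieving both high-fidelity rendering and realistic pose-dependent deformation in dynamic human avatar generation. Existing methods either rely on person-specific models, which render well but do not generalize across identities, or use diffusion models to generate cartoonish static avatars animated by a skeleton, which cannot capture complex pose-dependent deformations such as cloth wrinkles. The key idea is a two-stage framework. The first stage optimizes a set of person-specific UNets, each representing a dynamic avatar that models pose-dependent deformation accurately. The second stage trains a hyper diffusion model over the distribution of the optimized network weights. At inference, the method generates network weights for a new identity, enabling real-time, controllable, high-fidelity rendering with natural deformation.
Link: https://arxiv.org/abs/2509.04145
Authors: Dongliang Cao, Guoxing Sun, Marc Habermann, Florian Bernard
Affiliations: University of Bonn; Max Planck Institute for Informatics
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: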
Abstract:Creating human avatars is a highly desirable yet challenging task. Recent advancements in radiance field rendering have achieved unprecedented photorealism and real-time performance for personalized dynamic human avatars. However, these approaches are typically limited to person-specific rendering models trained on multi-view video data for a single individual, limiting their ability to generalize across different identities. On the other hand, generative approaches leveraging prior knowledge from pre-trained 2D diffusion models can produce cartoonish, static human avatars, which are animated through simple skeleton-based articulation. Therefore, the avatars generated by these methods suffer from lower rendering quality compared to person-specific rendering methods and fail to capture pose-dependent deformations such as cloth wrinkles. In this paper, we propose a novel approach that unites the strengths of person-specific rendering and diffusion-based generative modeling to enable dynamic human avatar generation with both high photorealism and realistic pose-dependent deformations. Our method follows a two-stage pipeline: first, we optimize a set of person-specific UNets, with each network representing a dynamic human avatar that captures intricate pose-dependent deformations. In the second stage, we train a hyper diffusion model over the optimized network weights. During inference, our method generates network weights for real-time, controllable rendering of dynamic human avatars. Using a large-scale, cross-identity, multi-view video dataset, we demonstrate that our approach outperforms state-of-the-art human avatar generation methods.
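[CV-31] MEPG: Multi-Expert Planning and Generation for Compositionally-Rich Image Generation
[Quick Read]: This paper addresses the limited generation quality and stylistic diversity of current text-to-image diffusion models on complex multi-element prompts. The key idea is the Multi-Expert Planning and Generation Framework (MEPG), which has two core components. The Position-Style-Aware (PSA) module uses a supervised fine-tuned large language model (LLM) to decompose the input prompt into precise spatial coordinates and style-encoded semantic instructions. The Multi-Expert Diffusion (MED) module uses dynamic expert routing across local regions and the global area, activating specific expert models (such as realism experts and stylization specialists) through attention-based gating for fine-grained control and flexible generation. The architecture supports lightweight integration and replacement of expert models, extends easily, and offers an interactive interface for real-time spatial layout editing and per-region style selection.
Link: https://arxiv.org/abs/2509.04126
Authors: Yuan Zhao, Liu Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: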
Abstract:Text-to-image diffusion models have achieved remarkable image quality, but they still struggle with complex, multielement prompts and limited stylistic diversity. To address these limitations, we propose a Multi-Expert Planning and Generation Framework (MEPG) that synergistically integrates position- and style-aware large language models (LLMs) with spatial-semantic expert modules. The framework comprises two core components: (1) a Position-Style-Aware (PSA) module that utilizes a supervised fine-tuned LLM to decompose input prompts into precise spatial coordinates and style-encoded semantic instructions; and (2) a Multi-Expert Diffusion (MED) module that implements cross-region generation through dynamic expert routing across both local regions and global areas. During the generation process for each local region, specialized models (e.g., realism experts, stylization specialists) are selectively activated for each spatial partition via attention-based gating mechanisms. The architecture supports lightweight integration and replacement of expert models, providing strong extensibility. Additionally, an interactive interface enables real-time spatial layout editing and per-region style selection from a portfolio of experts. Experiments show that MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.
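[CV-32] TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering
[Quick Read]: This paper addresses poor multi-character consistency across frames in text-to-story visualization, where existing methods produce image artifacts and inaccurate dialogue rendering that break the narrative. The key idea is TaleDiffusion, an iterative generation framework that maintains character consistency and assigns dialogue accurately. A pre-trained large language model (LLM) first uses in-context learning to generate per-frame descriptions, character details, and dialogues. A bounded attention-based per-box mask technique then controls character interactions and reduces artifacts. An identity-consistent self-attention mechanism keeps characters consistent across frames, and region-aware cross-attention places objects precisely. Finally, dialogues are rendered as bubbles and assigned to the corresponding characters via CLIPSeg, improving generation quality and narrative coherence.
Link: https://arxiv.org/abs/2509.04123
Authors: Ayan Banerjee, Josep Lladós, Umapada Pal, Anjan Dutta
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: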
Abstract:Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.
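[CV-33] DVS-PedX: Synthetic-and-Real Event-Based Pedestrian Dataset
[Quick Read]: This paper addresses event-based pedestrian detection and crossing-intention recognition under normal and adverse weather, with a focus on using the low latency, high dynamic range, and motion robustness of event cameras to improve pedestrian safety in autonomous driving. The key contribution is DVS-PedX, a neuromorphic dataset with three main features: (1) it combines synthetic event streams (from the CARLA simulator) with real-world event streams (JAAD dash-cam videos converted with the v2e tool), covering a range of weather and lighting conditions; (2) it provides multimodal data and annotations (RGB frames, event frames, and frame-level labels) that support flexible re-processing and model training; and (3) baseline spiking neural networks (SNNs) implemented with SpikingJelly reveal a sim-to-real gap, giving a reference point and direction for later work on domain adaptation and multimodal fusion.
Link: https://arxiv.org/abs/2509.04117
Authors: Mustafa Sakhai, Kaung Sithu, Min Khant Soe Oke, Maciej Wielgosz
Affiliations: AGH University of Science and Technology; Academic Computer Centre AGH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures, 3 tables; dataset descriptor paper introducing DVS-PedX (synthetic-and-real event-based pedestrian dataset with baselines) External URL: this https URL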
Abstract:Event cameras like Dynamic Vision Sensors (DVS) report micro-timed brightness changes instead of full frames, offering low latency, high dynamic range, and motion robustness. DVS-PedX (Dynamic Vision Sensor Pedestrian eXploration) is a neuromorphic dataset designed for pedestrian detection and crossing-intention analysis in normal and adverse weather conditions across two complementary sources: (1) synthetic event streams generated in the CARLA simulator for controlled “approach-cross” scenes under varied weather and lighting; and (2) real-world JAAD dash-cam videos converted to event streams using the v2e tool, preserving natural behaviors and backgrounds. Each sequence includes paired RGB frames, per-frame DVS “event frames” (33 ms accumulations), and frame-level labels (crossing vs. not crossing). We also provide raw AEDAT 2.0/AEDAT 4.0 event files and AVI DVS video files and metadata for flexible re-processing. Baseline spiking neural networks (SNNs) using SpikingJelly illustrate dataset usability and reveal a sim-to-real gap, motivating domain adaptation and multimodal fusion. DVS-PedX aims to accelerate research in event-based pedestrian safety, intention prediction, and neuromorphic perception.
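The "event frames" mentioned in the abstract can be pictured as fixed-duration accumulations of an event stream. The sketch below bins events into 33 ms windows and sums signed polarities per pixel. The `(timestamp_us, x, y, polarity)` tuple layout and the signed-sum rule are assumptions, and the dataset's own accumulation may differ.

```python
def accumulate_event_frames(events, width, height, window_us=33_000):
    """Bin (timestamp_us, x, y, polarity) events into per-window 2-D count frames."""
    events = sorted(events)
    if not events:
        return []
    t0 = events[0][0]
    n_frames = (events[-1][0] - t0) // window_us + 1
    frames = [[[0] * width for _ in range(height)] for _ in range(n_frames)]
    for t, x, y, polarity in events:
        # Each event adds its polarity (+1 or -1) to the pixel it fired at.
        frames[(t - t0) // window_us][y][x] += polarity
    return frames


# Five toy events on a 2x2 sensor, timestamps in microseconds.
events = [
    (0, 0, 0, +1),
    (10_000, 0, 0, +1),
    (32_999, 1, 0, -1),
    (33_000, 1, 1, +1),
    (70_000, 0, 1, -1),
]
frames = accumulate_event_frames(events, width=2, height=2)
for frame in frames:
    print(frame)
```

Because the raw AEDAT files are also provided, a different window length or accumulation rule can be applied by re-processing the same events.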
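[CV-34] FedQuad: Federated Stochastic Quadruplet Learning to Mitigate Data Heterogeneity
[Quick Read]: This paper addresses the loss of global-model generalization in Federated Learning (FL) caused by data heterogeneity among clients, which is more pronounced when data is limited and classes are imbalanced. The key idea is FedQuad, a method that explicitly optimizes for smaller intra-class variance and larger inter-class variance across clients, reducing the negative effect of model aggregation on client representations. It minimizes the distance between similar pairs and maximizes the distance between negative pairs in the shared feature space, disentangling client data and improving the representation quality and robustness of the global model.
Link: https://arxiv.org/abs/2509.04107
Authors: Ozgu Goksu, Nicolas Pugeault
Affiliations: University of Glasgow
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: The 3rd IEEE International Conference on Federated Learning Technologies and Applications (FLTA25)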
Abstract:Federated Learning (FL) provides decentralised model training, which effectively tackles problems such as distributed data and privacy preservation. However, the generalisation of global models frequently faces challenges from data heterogeneity among clients. This challenge becomes even more pronounced when datasets are limited in size and class-imbalanced. To address data heterogeneity, we propose a novel method, FedQuad, that explicitly optimises smaller intra-class variance and larger inter-class variance across clients, thereby decreasing the negative impact of model aggregation on the global model over client representations. Our approach minimises the distance between similar pairs while maximising the distance between negative pairs, effectively disentangling client data in the shared feature space. We evaluate our method on the CIFAR-10 and CIFAR-100 datasets under various data distributions and with many clients, demonstrating superior performance compared to existing approaches. Furthermore, we provide a detailed analysis of metric learning-based strategies within both supervised and federated learning paradigms, highlighting their efficacy in addressing representational learning challenges in federated settings.
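The idea of pulling similar pairs together while pushing negative pairs apart can be sketched with the commonly used quadruplet loss from the metric-learning literature. This is the generic form, not necessarily the exact objective FedQuad optimizes, and the margin values and sample embeddings are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quadruplet_loss(anchor, positive, negative1, negative2, margin1=1.0, margin2=0.5):
    """Generic quadruplet loss: a triplet term plus a negative-pair term."""
    d_ap = squared_distance(anchor, positive)       # same class: should be small
    d_an = squared_distance(anchor, negative1)      # different class: should be large
    d_nn = squared_distance(negative1, negative2)   # two other classes: should be large
    triplet_term = max(0.0, d_ap - d_an + margin1)
    negative_pair_term = max(0.0, d_ap - d_nn + margin2)
    return triplet_term + negative_pair_term


# Well-separated embeddings incur no loss.
easy = quadruplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.0, 2.0])
# A loose positive pair next to tightly packed negatives is penalized.
hard = quadruplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.5])
print(easy, hard)
```

The second term is what distinguishes a quadruplet loss from a triplet loss: it constrains the positive-pair distance relative to a pair of negatives that does not involve the anchor, which encourages larger inter-class gaps overall.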
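[CV-35] TriLiteNet: Lightweight Model for Multi-Task Visual Perception
[Quick Read]: This paper addresses the tension between real-time operation and computational cost for perception models in Advanced Driver Assistance Systems (ADAS): how to keep multi-task perception accurate while reducing parameter count and computation. The key idea is TriLiteNet, a lightweight multi-task model that jointly handles the panoramic driving perception tasks of vehicle detection, drivable area segmentation, and lane line segmentation. On the BDD100k dataset the base model achieves competitive accuracy (for example, 85.6% recall for vehicle detection) with only 2.35M parameters and 7.72 GFLOPs, and it shows low latency and reasonable power consumption on embedded devices.
Link: https://arxiv.org/abs/2509.04092
Authors: Quang-Huy Che, Duc-Khai Lam
Affiliations: University of Information Technology; Vietnam National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: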
Abstract:Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. To address the real-time execution needs of such perception models, this study introduces the TriLiteNet model. This model can simultaneously manage multiple tasks related to panoramic driving perception. TriLiteNet is designed to optimize performance while maintaining low computational costs. Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. Specifically, the TriLiteNet_base demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an Acc of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. Our proposed model includes a tiny configuration with just 0.14M parameters, which provides a multi-task solution with minimal computational demand. Evaluated for latency and power consumption on embedded devices, TriLiteNet in both configurations shows low latency and reasonable power during inference. By balancing performance, computational efficiency, and scalability, TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications. Code is available at this https URL.
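[CV-36] TEn-CATS: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
[Quick Read]: This paper targets two problems in existing methods for weakly supervised Audio-Visual Video Parsing (AVVP). Attention-based models treat noisy segment-level pseudo-labels as reliable supervision, so initial errors are amplified repeatedly during training. Methods that generate richer pseudo-labels let indiscriminate attention spread those labels across all frames, which propagates errors further. To address this, the paper proposes a joint framework that combines a Bi-Directional Text Fusion (BiT) module with a Category-Aware Temporal Graph (CATS) module. The BiT module first performs semantic injection and dynamic calibration on audio and visual features to locate and purify cleaner, richer semantic cues. The CATS module then propagates and connects semantic information precisely across time, improving both event classification and the prediction of when events occur. Experiments show state-of-the-art (SOTA) performance on multiple key indicators on two benchmark datasets, LLP and UnAV-100.
Link: https://arxiv.org/abs/2509.04086
Authors: Yaru Chen, Faegheh Sardari, Peiliang Zhang, Ruohao Guo, Yang Xiang, Zhenbo Li, Wenwu Wang
Affiliations: University of Surrey
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: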
Abstract:The Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video with weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanisms for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, methods of the first type treat noisy segment-level pseudo-labels as reliable supervision, and methods of the second type let indiscriminate attention spread them across all frames, so the initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines the Bi-Directional Text Fusion (BiT) module and the Category-Aware Temporal Graph (CATS) module. Specifically, we integrate the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection to enable precise semantic information dissemination across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance in multiple key indicators on two benchmark datasets, LLP and UnAV-100.
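[CV-37] SMooGPT: Stylized Motion Generation using Large Language Models
[Quick Read]: This paper addresses stylized motion generation: producing a new motion sequence that keeps the original motion content while following a specified style (for example, "walking in a loop like a monkey"). Existing methods usually rely on style embeddings and guidance in a latent space, which leads to low interpretability, coarse control, and weak generalization, especially for motions other than walking. The key idea is a new reasoning-composition-generation paradigm that uses a body-part text space as the intermediate representation. A fine-tuned large language model (LLM) acts as reasoner, composer, and generator, giving interpretable fine-grained control, resolving conflicts between motion content and style, and generalizing well to new styles through the open-vocabulary ability of LLMs.
Link: https://arxiv.org/abs/2509.04058
Authors: Lei Zhong, Yi Yang, Changjian Li
Affiliations: University of Edinburgh
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: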
Abstract:Stylized motion generation is actively studied in computer graphics, especially benefiting from the rapid advances in diffusion models. The goal of this task is to produce a novel motion respecting both the motion content and the desired motion style, e.g., “walking in a loop like a Monkey”. Existing research attempts to address this problem via motion style transfer or conditional motion generation. They typically embed the motion style into a latent space and guide the motion implicitly in a latent space as well. Despite the progress, these methods suffer from low interpretability and control and limited generalization to new styles, and they fail to produce motions other than “walking” due to the strong bias in the public stylization dataset. In this paper, we propose to solve the stylized motion generation problem from a new perspective of reasoning-composition-generation, based on our observations: i) human motion can often be effectively described using natural language in a body-part centric manner, ii) LLMs exhibit a strong ability to understand and reason about human motion, and iii) human motion has an inherently compositional nature, facilitating the new motion content or style generation via effective recomposing. We thus propose utilizing body-part text space as an intermediate representation, and present SMooGPT, a fine-tuned LLM, acting as a reasoner, composer, and generator when generating the desired stylized motion. Our method executes in the body-part text space with much higher interpretability, enabling fine-grained motion control, effectively resolving potential conflicts between motion content and style, and generalizes well to new styles thanks to the open-vocabulary ability of LLMs. Comprehensive experiments and evaluations, and a user perceptual study, demonstrate the effectiveness of our approach, especially under the pure text-driven stylized motion generation.
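[CV-38] A Re-ranking Method using K-nearest Weighted Fusion for Person Re-identification ICPR
[Quick Read]: This paper addresses view bias in person re-identification (Re-ID) caused by single-view features, which shows up as pose variation, viewpoint changes, and occlusion, with the goal of improving the accuracy of the re-ranking stage. The key idea is an efficient re-ranking method that needs no model fine-tuning or extra annotations: a K-nearest Weighted Fusion (KWF) strategy aggregates neighbors' features into a multi-view feature representation. The method rests on the assumption that features extracted by a Re-ID model are highly similar for the same identity, so K nearest features can be selected without supervision and fused with weights. This reduces view bias and improves Rank@1 and mAP, with Rank@1 gains of 9.8% on MSMT17 and 22.0% on Occluded-DukeMTMC, and it is more computationally efficient than other re-ranking methods.
Link: https://arxiv.org/abs/2509.04050
Authors: Quang-Huy Che, Le-Chuong Nguyen, Gia-Nghia Tran, Dinh-Duy Phan, Vinh-Tiep Nguyen
Affiliations: University of Information Technology; Vietnam National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in ICPRAM 2025, ISBN 978-989-758-730-6, ISSN 2184-4313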
Abstract:In person re-identification, re-ranking is a crucial step to enhance the overall accuracy by refining the initial ranking of retrieved results. Previous studies have mainly focused on features from single-view images, which can cause view bias and issues like pose variation, viewpoint changes, and occlusions. Using multi-view features to present a person can help reduce view bias. In this work, we present an efficient re-ranking method that generates multi-view features by aggregating neighbors’ features using K-nearest Weighted Fusion (KWF) method. Specifically, we hypothesize that features extracted from re-identification models are highly similar when representing the same identity. Thus, we select K neighboring features in an unsupervised manner to generate multi-view features. Additionally, this study explores the weight selection strategies during feature aggregation, allowing us to identify an effective strategy. Our re-ranking approach does not require model fine-tuning or extra annotations, making it applicable to large-scale datasets. We evaluate our method on the person re-identification datasets Market1501, MSMT17, and Occluded-DukeMTMC. The results show that our method significantly improves Rank@1 and mAP when re-ranking the top M candidates from the initial ranking results. Specifically, compared to the initial results, our re-ranking method achieves improvements of 9.8%/22.0% in Rank@1 on the challenging datasets: MSMT17 and Occluded-DukeMTMC, respectively. Furthermore, our approach demonstrates substantial enhancements in computational efficiency compared to other re-ranking methods.
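The aggregation step can be sketched as follows: each feature is replaced by a weighted combination of its K most similar features, then re-normalized. This is a rough illustration of the idea only. Uniform weights, counting a feature as its own nearest neighbor, and the function names are assumptions. The paper itself explores several weight selection strategies.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def k_nearest_weighted_fusion(features, k, weights=None):
    """Fuse each feature with its k most similar features (itself included)."""
    feats = [l2_normalize(f) for f in features]
    weights = weights or [1.0 / k] * k  # assumed default: uniform weights
    fused = []
    for query in feats:
        # Cosine similarity reduces to a dot product on unit vectors.
        sims = [sum(a * b for a, b in zip(query, other)) for other in feats]
        nearest = sorted(range(len(feats)), key=lambda j: sims[j], reverse=True)[:k]
        mixed = [
            sum(w * feats[j][d] for w, j in zip(weights, nearest))
            for d in range(len(query))
        ]
        fused.append(l2_normalize(mixed))
    return fused


# Two views of one identity and one view of another.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fused = k_nearest_weighted_fusion(gallery, k=2)
print(fused)
```

Since the fusion only needs pairwise similarities between already extracted features, it runs without retraining or labels, which matches the abstract's claim that the approach applies to large-scale datasets.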
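[CV-39] TensoIS: A Step Towards Feed-Forward Tensorial Inverse Subsurface Scattering for Perlin Distributed Heterogeneous Media
[Quick Read]: This paper addresses the severely under-constrained problem of estimating the scattering parameters of heterogeneous media from images. Existing methods mostly rely on analysis-by-synthesis or differentiable volume rendering; few use learning-based approaches, and those that do assume homogeneous media. The paper makes two contributions. HeteroSynth is a new synthetic dataset whose scattering parameters are modeled with Fractal Perlin noise to represent complex real-world heterogeneity. Tensorial Inverse Scattering (TensoIS) is a feed-forward framework that represents the scattering volume with learnable low-rank tensor components instead of predicting the 3D parameter volume directly. The authors evaluate it on unseen shapes, on smoke and cloud geometries, and on real-world samples to establish its effectiveness.
Link: https://arxiv.org/abs/2509.04047
Authors: Ashish Tiwari, Satyam Bhardwaj, Yash Bachwana, Parag Sarvoday Sahu, T.M.Feroz Ali, Bhargava Chintalapati, Shanmuganathan Raman
Affiliations: Indian Institute of Technology Gandhinagar; Qualcomm
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To appear in Pacific Graphics 2025 (CGF Journal Track), Project page: this https URL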
Abstract:Estimating scattering parameters of heterogeneous media from images is a severely under-constrained and challenging problem. Most of the existing approaches model BSSRDF either through an analysis-by-synthesis approach, approximating complex path integrals, or using differentiable volume rendering techniques to account for heterogeneity. However, only a few studies have applied learning-based methods to estimate subsurface scattering parameters, but they assume homogeneous media. Interestingly, no specific distribution is known to us that can explicitly model the heterogeneous scattering parameters in the real world. Notably, procedural noise models such as Perlin and Fractal Perlin noise have been effective in representing intricate heterogeneities of natural, organic, and inorganic surfaces. Leveraging this, we first create HeteroSynth, a synthetic dataset comprising photorealistic images of heterogeneous media whose scattering parameters are modeled using Fractal Perlin noise. Furthermore, we propose Tensorial Inverse Scattering (TensoIS), a learning-based feed-forward framework to estimate these Perlin-distributed heterogeneous scattering parameters from sparse multi-view image observations. Instead of directly predicting the 3D scattering parameter volume, TensoIS uses learnable low-rank tensor components to represent the scattering volume. We evaluate TensoIS on unseen heterogeneous variations over shapes from the HeteroSynth test set, smoke and cloud geometries obtained from open-source realistic volumetric simulations, and some real-world samples to establish its effectiveness for inverse scattering. Overall, this study is an attempt to explore Perlin noise distribution, given the lack of any such well-defined distribution in literature, to potentially model real-world heterogeneous scattering in a feed-forward manner.
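The benefit of representing a volume with low-rank tensor components, rather than predicting every voxel, can be illustrated with a CP-style (sum of rank-one terms) reconstruction. The abstract does not say which decomposition TensoIS uses, so the CP form, the function names, and the grid sizes below are assumptions chosen for illustration.

```python
def cp_reconstruct(factors_x, factors_y, factors_z):
    """Rebuild a 3-D volume as a sum of rank-one outer products of 1-D factors."""
    rank = len(factors_x)
    nx, ny, nz = len(factors_x[0]), len(factors_y[0]), len(factors_z[0])
    return [
        [
            [
                sum(factors_x[r][i] * factors_y[r][j] * factors_z[r][k] for r in range(rank))
                for k in range(nz)
            ]
            for j in range(ny)
        ]
        for i in range(nx)
    ]


def parameter_counts(nx, ny, nz, rank):
    """Number of values in the dense volume versus in the low-rank factors."""
    return nx * ny * nz, rank * (nx + ny + nz)


# Rank-1 toy example on a 2 x 3 x 1 grid.
volume = cp_reconstruct([[1.0, 2.0]], [[1.0, 0.0, 1.0]], [[3.0]])
print(volume)

# Storage comparison for a hypothetical 64^3 parameter grid at rank 8.
dense, low_rank = parameter_counts(64, 64, 64, rank=8)
print(dense, low_rank)
```

A network that outputs the factors has far fewer values to predict than one that outputs the dense grid, at the cost of only being able to express volumes that are well approximated at the chosen rank.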
[CV-40] Millisecond-Response Tracking and Gazing System for UAVs: A Domestic Solution Based on “Phytium Cambricon”
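[Quick Read]: This paper addresses the response delay of more than 200 ms that traditional video surveillance systems exhibit in dynamic scenes. The delay stems from the limited deep feature extraction capability of automatic recognition algorithms and from efficiency bottlenecks in the computing architecture, so real-time requirements in complex scenes are not met. The key to the solution is a heterogeneous computing architecture built on Phytium processors and Cambricon accelerator cards, combined with a lightweight YOLOv5s detection network and a DeepSORT cascaded tracking algorithm that together form a closed "detection-tracking-feedback" control loop. On 1920×1080 video streams, the system holds single-frame processing delay at 50–100 ms with multi-scale target recognition accuracy above 98.5%.
Link: https://arxiv.org/abs/2509.04043
Authors: Yuchen Zhu, Longxiang Yin, Kai Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 17 figures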
Abstract:In the frontier research and application of current video surveillance technology, traditional camera systems exhibit significant limitations of response delay exceeding 200 ms in dynamic scenarios due to the insufficient deep feature extraction capability of automatic recognition algorithms and the efficiency bottleneck of computing architectures, failing to meet the real-time requirements in complex scenes. To address this issue, this study proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards, constructing a UAV tracking and gazing system with millisecond-level response capability. At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards, enhancing computing power through multi-card parallelism. At the software level, it innovatively integrates a lightweight YOLOv5s detection network with a DeepSORT cascaded tracking algorithm, forming a closed-loop control chain of “detection-tracking-feedback”. Experimental results demonstrate that the system achieves a stable single-frame comprehensive processing delay of 50-100 ms in 1920*1080 resolution video stream processing, with a multi-scale target recognition accuracy of over 98.5%, featuring both low latency and high precision. This study provides an innovative solution for UAV monitoring and the application of domestic chips.
[CV-41] Learning from Majority Label: A Novel Problem in Multi-class Multiple-Instance Learning
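[Quick Read]: This paper addresses a new multi-class Multiple-Instance Learning (MIL) problem, Learning from Majority Label (LML). Given a bag, only the majority class among its instances is provided as the bag label, and the goal is to train a model that predicts the true class of each instance. Conventional MIL methods typically rely on a weak supervisory link between bag-level and instance-level labels, whereas LML uses the class distribution inside each bag to strengthen the learning signal, which suits settings such as pathology image segmentation and voting prediction. The key to the solution is a Counting Network that produces the bag-level majority label by counting the instances of each class, together with a Majority Proportion Enhancement Module (MPEM) that raises the majority-class proportion by removing minority-class instances from the bag. Experiments show that the method significantly outperforms conventional MIL methods on four datasets, and ablation studies confirm the contribution of each module.
Link: https://arxiv.org/abs/2509.04023
Authors: Shiku Kaito, Shinnosuke Matsuo, Daiki Suehiro, Ryoma Bise
Affiliations: Kyushu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 9 figures, Accepted in Pattern Recognition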
Abstract:The paper proposes a novel multi-class Multiple-Instance Learning (MIL) problem called Learning from Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag-level label. The goal of LML is to train a classification model that estimates the class of each instance using the majority label. This problem is valuable in a variety of applications, including pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. To solve LML, we propose a Counting Network trained to produce bag-level majority labels, estimated by counting the number of instances in each class. Furthermore, analysis experiments on the characteristics of LML revealed that bags with a high proportion of the majority class facilitate learning. Based on this result, we developed a Majority Proportion Enhancement Module (MPEM) that increases the proportion of the majority class by removing minority class instances within the bags. Experiments demonstrate the superiority of the proposed method on four datasets compared to conventional MIL methods. Moreover, ablation studies confirmed the effectiveness of each module. The code is available at this https URL.
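The relationship between instance labels, the bag-level majority label, and the majority proportion can be illustrated with a small sketch. It is not the authors' Counting Network or MPEM: the function names, the tie-breaking rule, and the use of ground-truth instance labels (which are not observed in the LML setting) are assumptions made purely for illustration.

```python
from collections import Counter


def majority_label(bag):
    """Return the most frequent class in a bag of integer instance labels."""
    counts = Counter(bag)
    # Ties are broken toward the smallest class id; this rule is an assumption.
    return min(counts, key=lambda c: (-counts[c], c))


def majority_proportion(bag):
    """Fraction of instances in the bag that belong to the majority class."""
    return Counter(bag)[majority_label(bag)] / len(bag)


def drop_minority(bag, n_remove):
    """Remove up to n_remove non-majority instances (oracle version for illustration)."""
    major = majority_label(bag)
    kept, removed = [], 0
    for label in bag:
        if label != major and removed < n_remove:
            removed += 1
            continue
        kept.append(label)
    return kept
```

For the bag `[0, 1, 1, 2, 1]` the majority label is 1 with proportion 0.6; removing the two minority instances raises the proportion to 1.0, which is the kind of bag the abstract reports as easier to learn from.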
[CV-42] Detecting Regional Spurious Correlations in Vision Transformers via Token Discarding
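[Quick Read]: This paper addresses unreliable predictions caused by spurious correlations in vision transformers: a model may base its decisions on features that are statistically correlated with the task but irrelevant to it, such as color aberrations or small text in the image, which undermines trustworthiness, reliability, and generalization. The key to the solution is a new detection method that identifies such spurious correlations at scale on ImageNet. By comparing supervised and self-supervised models, the study shows that the training methodology strongly affects how much a model relies on spurious signals. It also identifies classes whose spurious signals are easily picked up by models and calls for caution when using the affected samples in future research, providing empirical evidence and practical guidance for building more robust vision models.
Link: https://arxiv.org/abs/2509.04009
Authors: Solha Kang, Esla Timothy Anzaku, Wesley De Neve, Arnout Van Messem, Joris Vankerschaver, Francois Rameau, Utku Ozbulak
Affiliations: Ghent University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: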
Abstract:Due to their powerful feature association capabilities, neural network-based computer vision models have the ability to detect and exploit unintended patterns within the data, potentially leading to correct predictions based on incorrect or unintended but statistically relevant signals. These clues may vary from simple color aberrations to small texts within the image. In situations where these unintended signals align with the predictive task, models can mistakenly link these features with the task and rely on them for making predictions. This phenomenon is referred to as spurious correlations, where patterns appear to be associated with the task but are actually coincidental. As a result, detection and mitigation of spurious correlations have become crucial tasks for building trustworthy, reliable, and generalizable machine learning models. In this work, we present a novel method to detect spurious correlations in vision transformers, a type of neural network architecture that gained significant popularity in recent years. Using both supervised and self-supervised trained models, we present large-scale experiments on the ImageNet dataset demonstrating the ability of the proposed method to identify spurious correlations. We also find that, even if the same architecture is used, the training methodology has a significant impact on the model’s reliance on spurious correlations. Furthermore, we show that certain classes in the ImageNet dataset contain spurious signals that are easily detected by the models and discuss the underlying reasons for those spurious signals. In light of our findings, we provide an exhaustive list of the aforementioned images and call for caution in their use in future research efforts. Lastly, we present a case study investigating spurious signals in invasive breast mass classification, grounding our work in real-world scenarios.
[CV-43] SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation
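[Quick Read]: This paper addresses insufficient modeling in 3D semantic occupancy prediction caused by neglecting height-axis information. Existing methods do not exploit vertical spatial structure when processing voxel features, and conventional SENet-style channel attention assigns uniform weights to all height layers, which limits its ability to emphasize features at different heights. The key to the solution is the SliceSemOcc framework, whose main contributions are: 1) extracting voxel features along the height axis using global and local vertical slices; 2) a global-local fusion module that adaptively combines fine-grained spatial detail with holistic context; and 3) the SEAttention3D module, which preserves height-wise resolution through average pooling and dynamically assigns channel attention weights to each height layer, improving perception of key regions such as small-object categories.
Link: https://arxiv.org/abs/2509.03999
Authors: Han Huang, Han Sun, Ningzhong Liu, Huiyu Zhou, Jiaquan Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, accepted by PRCV2025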
Abstract:Driven by autonomous driving’s demands for precise 3D perception, 3D semantic occupancy prediction has become a pivotal research topic. Unlike bird’s-eye-view (BEV) methods, which restrict scene representation to a 2D plane, occupancy prediction leverages a complete 3D voxel grid to model spatial structures in all dimensions, thereby capturing semantic variations along the vertical axis. However, most existing approaches overlook height-axis information when processing voxel features, and conventional SENet-style channel attention assigns uniform weight across all height layers, limiting its ability to emphasize features at different heights. To address these limitations, we propose SliceSemOcc, a novel vertical slice based multimodal framework for 3D semantic occupancy representation. Specifically, we extract voxel features along the height-axis using both global and local vertical slices. Then, a global-local fusion module adaptively reconciles fine-grained spatial details with holistic contextual information. Furthermore, we propose the SEAttention3D module, which preserves height-wise resolution through average pooling and assigns dynamic channel attention weights to each height layer. Extensive experiments on nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets verify that our method significantly enhances mean IoU, achieving especially pronounced gains on most small-object categories. Detailed ablation studies further validate the effectiveness of the proposed SliceSemOcc framework.
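The contrast between SENet-style attention (one weight per channel) and height-aware attention (one weight per channel and height layer) can be sketched as follows. This is an illustrative toy, not the SEAttention3D module: the function names and the `feat[c][z][y][x]` layout are assumptions, and a fixed sigmoid stands in for whatever learned layers the paper uses.

```python
import math


def height_wise_channel_weights(feat):
    """Map feat[c][z][y][x] to weights[c][z], one gate per channel and height layer.

    Averaging runs over the horizontal (y, x) plane only, so the height axis
    is preserved instead of being pooled away.
    """
    weights = []
    for channel in feat:
        row = []
        for plane in channel:  # one horizontal slice per height index z
            values = [v for line in plane for v in line]
            mean = sum(values) / len(values)
            # Fixed sigmoid gate: a stand-in (assumption) for learned layers.
            row.append(1.0 / (1.0 + math.exp(-mean)))
        weights.append(row)
    return weights


def apply_weights(feat, weights):
    """Scale every voxel by the gate of its channel and height layer."""
    return [
        [
            [[v * weights[c][z] for v in line] for line in plane]
            for z, plane in enumerate(channel)
        ]
        for c, channel in enumerate(feat)
    ]
```

A global SENet-style pool would instead average over z, y, and x together and return a single weight per channel, so two height layers with very different activations would be scaled identically.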
[CV-44] Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training
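[Quick Read]: This paper addresses liver vessel segmentation in liver magnetic resonance imaging (MRI) acquired without contrast enhancement, a task that matters for computational analysis of the vascular remodelling associated with diffuse liver diseases. Existing methods rely on contrast-enhanced images, but those sequences are not acquired uniformly across clinical settings; non-enhanced images are more common, yet vessels are hard to segment in them and large annotated datasets are needed. The key to the solution is a multi-task learning framework that uses auxiliary contrast-enhanced MRI available only during training to reduce reliance on manually annotated samples. The model is trained on paired native and contrast-enhanced images, with or without vessel annotations, so that it benefits from the shared task structure and segments more accurately, with the largest gains when annotations are scarce.
Link: https://arxiv.org/abs/2509.03975
Authors: Daniel Sobotka, Alexander Herold, Matthias Perkonigg, Lucian Beer, Nina Bastati, Alina Sablatnig, Ahmed Ba-Ssalamah, Georg Langs
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: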
Abstract:Liver vessel segmentation in magnetic resonance imaging data is important for the computational analysis of vascular remodelling, associated with a wide spectrum of diffuse liver diseases. Existing approaches rely on contrast enhanced imaging data, but the necessary dedicated imaging sequences are not uniformly acquired. Images without contrast enhancement are acquired more frequently, but vessel segmentation is challenging, and requires large-scale annotated data. We propose a multi-task learning framework to segment vessels in liver MRI without contrast. It exploits auxiliary contrast enhanced MRI data available only during training to reduce the need for annotated training examples. Our approach draws on paired native and contrast enhanced data with and without vessel annotations for model training. Results show that auxiliary data improves the accuracy of vessel segmentation, even if they are not available during inference. The advantage is most pronounced if only few annotations are available for training, since the feature representation benefits from the shared task structure. A validation of this approach to augment a model for brain tumor segmentation confirms its benefits across different domains. An auxiliary informative imaging modality can augment expert annotations even if it is only available during training.
[CV-45] SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image Classification
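[Quick Read]: This paper addresses the performance bottleneck in Whole Slide Image (WSI) classification caused by inadequate modeling of spatial relationships between instances. Existing methods usually ignore the spatial positions of instances (such as patches) in histopathology images and model only their index in the input sequence, which fails to capture the true spatial structure and leads to a length extrapolation problem when training and test sequences differ in length. The key to the solution is Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL), with two main contributions: 1) a positional encoding module that uses each instance's coordinates within the WSI instead of index-based encoding, modeling spatial relationships explicitly and supporting length extrapolation; and 2) a SAC block, based on a multilayer perceptron (MLP), that computes full instance correlations in linear time and avoids the custom CUDA kernels that Transformer-based methods require, which simplifies deployment. The method achieves state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets.
Link: https://arxiv.org/abs/2509.03973
Authors: Yu Bai, Zitong Yu, Haowen Tian, Xijing Wang, Shuo Yan, Lin Wang, Honglin Li, Xitong Ling, Bo Zhang, Zheng Zhang, Wufan Wang, Hui Gao, Xiangyang Gong, Wendong Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: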
Abstract:We propose Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL) for performing WSI classification. SAC-MIL consists of a positional encoding module to encode position information and a SAC block to perform full instance correlations. The positional encoding module utilizes the instance coordinates within the slide to encode the spatial relationships instead of the instance index in the input WSI sequence. The positional encoding module can also handle the length extrapolation issue where the training and testing sequences have different lengths. The SAC block is an MLP-based method that performs full instance correlation in linear time complexity with respect to the sequence length. Due to the simple structure of MLP, it is easy to deploy since it does not require custom CUDA kernels, compared to Transformer-based methods for WSI classification. SAC-MIL has achieved state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. The code will be released upon acceptance.
[CV-46] Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection
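[Quick Read]: This paper addresses the limitations of deep learning methods for remote sensing change detection (RSCD) that rely on the image modality alone: restricted feature representation, insufficient modeling of change patterns, and poor robustness, with performance dropping noticeably under illumination and noise disturbances. The key to the solution is MMChange, a multimodal change detection method that fuses image and text information to improve accuracy and robustness. First, an Image Feature Refinement (IFR) module highlights key regions and suppresses environmental noise. Second, a vision language model (VLM) generates semantic descriptions of the bitemporal images, and a Textual Difference Enhancement (TDE) module captures fine-grained semantic changes to steer the model toward meaningful changes. Finally, an Image Text Feature Fusion (ITFF) module performs deep cross-modal fusion, mitigating the heterogeneity between modalities.
Link: https://arxiv.org/abs/2509.03961
Authors: Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen
Affiliations: Wuyi University; South China University of Technology; Macau University of Science and Technology; California State University, Northridge; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: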
Abstract:Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on image modality, limiting feature representation, change pattern modeling, and generalization especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross modal integration. Extensive experiments on LEVIRCD, WHUCD, and SYSUCD demonstrate that MMChange consistently surpasses state of the art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: this https URL.
[CV-47] ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection
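[Quick Read]: This paper addresses two weaknesses of existing Out-of-Distribution (OOD) detection methods: they construct the negative space without an accurate understanding of OOD images, and mislabeled negative labels degrade near-OOD performance. The key to the solution is to use the semantic understanding and reasoning capabilities of multimodal large language models (MLLMs) to build an Adaptive Negative Textual Space (ANTS). First, the MLLM describes images suspected to be OOD, producing discriminative negative text features that strengthen far-OOD detection. Second, for near-OOD scenarios, the method identifies the subset of ID classes that are visually similar to the negative samples and generates negative labels tailored to that subset, reducing false negatives and improving near-OOD detection. Finally, an adaptive weighted score dynamically balances the two negative textual spaces without task-specific prior knowledge, allowing the method to adapt to different OOD task settings.
Link: https://arxiv.org/abs/2509.03951
Authors: Zhu Wenjie, Zhang Yabin, Xin Jin, Wenjun Zeng, Lei Zhang
Affiliations: The Hong Kong Polytechnic University; Eastern Institute of Technology, Ningbo; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: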
Abstract:The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
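For readers unfamiliar with the FPR95 metric quoted above, a minimal sketch of how it is commonly computed is given below. The function name and the score convention (higher means more in-distribution) are assumptions; this is the generic metric, not the ANTS scoring function, and lower values are better.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD samples at the threshold that accepts
    at least `tpr` of the ID samples.

    Assumed convention: a higher score means "more in-distribution".
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted; the small epsilon guards
    # against floating-point error in tpr * len(ranked).
    keep = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[keep - 1]  # lowest ID score that is still accepted
    false_positives = sum(score >= threshold for score in ood_scores)
    return false_positives / len(ood_scores)
```

With ID scores 1 through 20, the threshold lands at 2 (accepting 19 of 20 ID samples), so OOD scores `[0, 1, 2.5, 3]` give an FPR95 of 0.5.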
[CV-48] Chest X-ray Pneumothorax Segmentation Using EfficientNet-B4 Transfer Learning in a U-Net Architecture
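[Quick Read]: This paper addresses the difficulty of promptly identifying pneumothorax on chest X-rays when the lesion is small, where a missed finding can be life-threatening. The key to the solution is an automated deep-learning segmentation pipeline built on a U-Net architecture with an EfficientNet-B4 encoder, trained on the SIIM-ACR dataset with data augmentation and a combined binary cross-entropy and Dice loss. On the independent PTX-498 test set it reaches an intersection over union (IoU) of 0.7008 and a Dice score of 0.8241, indicating that the model can localize pneumothorax regions accurately and assist radiologists in diagnosis.
Link: https://arxiv.org/abs/2509.03950
Authors: Alvaro Aranibar Roque, Helga Sebastian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures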
Abstract:Pneumothorax, the abnormal accumulation of air in the pleural space, can be life-threatening if undetected. Chest X-rays are the first-line diagnostic tool, but small cases may be subtle. We propose an automated deep-learning pipeline using a U-Net with an EfficientNet-B4 encoder to segment pneumothorax regions. Trained on the SIIM-ACR dataset with data augmentation and a combined binary cross-entropy plus Dice loss, the model achieved an IoU of 0.7008 and Dice score of 0.8241 on the independent PTX-498 dataset. These results demonstrate that the model can accurately localize pneumothoraces and support radiologists.
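The two reported metrics and the combined loss can be made concrete with a short sketch. The function names, the flat-list mask format, and the equal weighting of the two loss terms are assumptions for illustration; the abstract does not specify the authors' exact formulation.

```python
import math


def dice_and_iou(pred, target):
    """Dice and IoU for two flat binary masks (lists of 0/1)."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    union = total - inter
    dice = 2 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


def dice_from_iou(iou):
    """Identity that holds when both metrics come from the same counts."""
    return 2 * iou / (1 + iou)


def bce_dice_loss(probs, target, eps=1e-7):
    """Binary cross-entropy plus soft Dice loss, equally weighted (assumption)."""
    bce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(probs, target)
    ) / len(target)
    inter = sum(p * t for p, t in zip(probs, target))
    soft_dice = (2 * inter + eps) / (sum(probs) + sum(target) + eps)
    return bce + (1 - soft_dice)
```

As an observation rather than something the abstract states: `dice_from_iou(0.7008)` evaluates to about 0.8241, so the two reported numbers agree with the identity to four decimals, as would be expected if both were computed from the same aggregate counts.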
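[CV-49] TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes
[Quick Read]: This paper addresses the inability of existing methods to guarantee topological correctness and completeness when reconstructing fine-grained 3D tubular anatomical structures, such as pulmonary airways and the Circle of Willis, because they rely on voxel-wise overlap measures. Conventional approaches struggle to preserve topology globally or to correct geometric errors at inference, leaving critical topological defects in the reconstruction. The key to the solution is a new framework, TopoSculpt, with three main contributions: (i) a whole-region modeling strategy that captures full spatial context; (ii) the first use of a Topological Integrity Betti (TIB) constraint, which jointly enforces Betti number priors and global integrity; and (iii) a curriculum refinement scheme based on persistent homology that corrects errors progressively from coarse to fine. Experiments show clear gains in geometric accuracy and topological fidelity: β₀ errors fall from 69.00 to 3.40 on the pulmonary airway dataset and from 1.65 to 0.30 on the Circle of Willis dataset, and branch detection rates improve by nearly 10%.
Link: https://arxiv.org/abs/2509.03938
Authors: Minghui Zhang, Yaoyu Liu, Junyang Wu, Xin You, Hanxiao Zhang, Junjun He, Yun Gu
Affiliations: Institute of Medical Robotics, Shanghai Jiao Tong University; Department of Automation, Shanghai Jiao Tong University; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: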
Abstract:Medical tubular anatomical structures are inherently three-dimensional conduits with lumens, enclosing walls, and complex branching topologies. Accurate reconstruction of their geometry and topology is crucial for applications such as bronchoscopic navigation and cerebral arterial connectivity assessment. Existing methods often rely on voxel-wise overlap measures, which fail to capture topological correctness and completeness. Although topology-aware losses and persistent homology constraints have shown promise, they are usually applied patch-wise and cannot guarantee global preservation or correct geometric errors at inference. To address these limitations, we propose TopoSculpt, a novel framework for topological refinement of 3D fine-grained tubular structures. TopoSculpt (i) adopts a holistic whole-region modeling strategy to capture full spatial context, (ii) first introduces a Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) employs a curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales. Extensive experiments on challenging pulmonary airway and Circle of Willis datasets demonstrate substantial improvements in both geometry and topology. For instance, β₀ errors are reduced from 69.00 to 3.40 on the airway dataset and from 1.65 to 0.30 on the CoW dataset, with tree length detected and branch detected rates improving by nearly 10%. These results highlight the effectiveness of TopoSculpt in correcting critical topological errors and advancing the high-fidelity modeling of complex 3D tubular anatomy. The project homepage is available at: this https URL.
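The β₀ figures above refer to the zeroth Betti number, which counts connected components: an intact airway tree should form one component, and each spurious break adds another. The sketch below counts components in a binary volume with a breadth-first search. The function names, the nested-list `[z][y][x]` layout, and the choice of 6-connectivity are assumptions; the paper's own evaluation may use a different connectivity or tooling.

```python
from collections import deque

# Face-adjacent neighbours only (6-connectivity); this choice is an assumption.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def betti0(volume):
    """Count connected foreground components in volume[z][y][x] of 0/1 values."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    components = 0
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if not volume[z][y][x] or (z, y, x) in seen:
                    continue
                components += 1  # unvisited foreground voxel starts a new component
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in NEIGHBORS:
                        pz, py, px = cz + dz, cy + dy, cx + dx
                        if (0 <= pz < nz and 0 <= py < ny and 0 <= px < nx
                                and volume[pz][py][px]
                                and (pz, py, px) not in seen):
                            seen.add((pz, py, px))
                            queue.append((pz, py, px))
    return components


def betti0_error(pred, ref):
    """Absolute difference in component count between prediction and reference."""
    return abs(betti0(pred) - betti0(ref))
```

A tube `[1, 1, 1, 1, 1]` has β₀ = 1, while the same tube with a one-voxel gap, `[1, 1, 0, 1, 1]`, has β₀ = 2, giving a β₀ error of 1 even though the voxel-wise overlap is still high.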
[CV-50] LMVC: An End-to-End Learned Multiview Video Coding Framework
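[Quick Read]: This paper addresses the storage and transmission challenges posed by the massive data volume of multiview video, which traditional video coding standards struggle to compress efficiently. Existing deep-learning-based end-to-end video coding methods focus mainly on single-view or stereo video, and general multiview scenarios remain underexplored. The key to the solution is an end-to-end learned multiview video coding (LMVC) framework whose core idea is to use motion and content information from the independent view to improve compression of the dependent view. First, a feature-based inter-view motion vector prediction method conditions dependent-view motion encoding on decoded independent-view motion features, and an inter-view motion entropy model learns motion priors. Second, a disparity-free inter-view context prediction module predicts dependent-view contexts from independent-view content features, paired with an inter-view contextual entropy model that captures content priors. Experiments show that the framework substantially outperforms the reference software of the traditional MV-HEVC standard, establishing a strong baseline for future research.
Link: https://arxiv.org/abs/2509.03922
Authors: Xihua Sheng, Yingwen Zhang, Long Xu, Shiqi Wang
Affiliations: City University of Hong Kong; Ningbo University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: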
Abstract:Multiview video is a key data source for volumetric video, enabling immersive 3D scene reconstruction but posing significant challenges in storage and transmission due to its massive data volume. Recently, deep learning-based end-to-end video coding has achieved great success, yet most focus on single-view or stereo videos, leaving general multiview scenarios underexplored. This paper proposes an end-to-end learned multiview video coding (LMVC) framework that ensures random access and backward compatibility while enhancing compression efficiency. Our key innovation lies in effectively leveraging independent-view motion and content information to enhance dependent-view compression. Specifically, to exploit the inter-view motion correlation, we propose a feature-based inter-view motion vector prediction method that conditions dependent-view motion encoding on decoded independent-view motion features, along with an inter-view motion entropy model that learns inter-view motion priors. To exploit the inter-view content correlation, we propose a disparity-free inter-view context prediction module that predicts inter-view contexts from decoded independent-view content features, combined with an inter-view contextual entropy model that captures inter-view context priors. Experimental results show that our proposed LMVC framework outperforms the reference software of the traditional MV-HEVC standard by a large margin, establishing a strong baseline for future research in this field.
[CV-51] A Generative Foundation Model for Chest Radiography
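[Quick Read]: This paper addresses the scarcity of high-quality, diverse annotated data in medical imaging, a key bottleneck for developing reliable medical artificial intelligence (AI) models. The core of the solution is ChexGen, a generative vision-language foundation model built on a latent diffusion transformer architecture and pretrained on the largest chest X-ray image and report dataset to date (960,000 pairs), with a unified framework for text-, mask-, and bounding box-guided synthesis of chest radiographs. The approach improves the realism and diversity of synthesized images, supports data augmentation and supervised pretraining with small amounts of data, raises performance on disease classification, detection, and segmentation, and improves model fairness and reduces demographic bias by constructing diverse patient cohorts.
Link: https://arxiv.org/abs/2509.03903
Authors: Yuanfeng Ji, Dan Lin, Xiyue Wang, Lu Zhang, Wenhui Zhou, Chongjian Ge, Ruihang Chu, Xiaoli Yang, Junhan Zhao, Junsong Chen, Xiangde Luo, Sen Yang, Jin Fang, Ping Luo, Ruijiang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: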
Abstract:The scarcity of well-annotated diverse medical images is a major hurdle for developing reliable AI models in healthcare. Substantial technical advances have been made in generative foundation models for natural images. Here we develop ‘ChexGen’, a generative vision-language foundation model that introduces a unified framework for text-, mask-, and bounding box-guided synthesis of chest radiographs. Built upon the latent diffusion transformer architecture, ChexGen was pretrained on the largest curated chest X-ray dataset to date, consisting of 960,000 radiograph-report pairs. ChexGen achieves accurate synthesis of radiographs through expert evaluations and quantitative metrics. We demonstrate the utility of ChexGen for training data augmentation and supervised pretraining, which led to performance improvements across disease classification, detection, and segmentation tasks using a small fraction of training data. Further, our model enables the creation of diverse patient cohorts that enhance model fairness by detecting and mitigating demographic biases. Our study supports the transformative role of generative foundation models in building more accurate, data-efficient, and equitable medical AI systems.
[CV-52] Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model ICCV2025
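[Quick Read]: This paper addresses the performance drop of contrastive vision-language models in few-shot settings, where conventional offline fine-tuning based on prompt learning is computationally expensive and prone to overfitting. The key to the solution is the Attn-Adapter framework, which performs online adaptation through a dual attention mechanism: a Memory Attn-Adapter that refines category embeddings using support examples, and a Local-Global Attn-Adapter that enriches image embeddings by fusing local and global features. The design adapts dynamically from a few labeled samples without retraining the base model, improving cross-category and cross-dataset generalization while keeping inference efficient and scalable.
Link: https://arxiv.org/abs/2509.03895
Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo
Affiliations: Sungkyunkwan University; Deakin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 - LIMIT Workshop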
Abstract:Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP’s adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
[CV-53] Weakly-Supervised Learning of Dense Functional Correspondences ICCV2025
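[Quick Read]: This paper addresses the difficulty of establishing dense correspondences between image pairs from different categories, in particular how to use object functionality to guide correspondence when annotations are lacking. The core challenge is that shape and appearance differ widely across categories, so conventional methods struggle to find semantically consistent corresponding points. The key to the solution is a weakly-supervised learning paradigm: vision-language models first pseudo-label functional parts in multi-view images, and dense contrastive learning from pixel correspondences then distills both functional and spatial knowledge into a single model that predicts dense functional correspondences with high accuracy.
Link: https://arxiv.org/abs/2509.03893
Authors: Stefan Stojanov, Linan Zhao, Yunzhi Zhang, Daniel L. K. Yamins, Jiajun Wu
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025. Project website: this https URL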
Abstract:Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.
[CV-54] OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction
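[Quick Read]: This paper addresses the low efficiency, temporal degradation, and poor controllability of generative occupancy world models in long-term prediction. Existing autoregressive (AR) methods can predict vehicle motion and future occupancy scenes together but struggle to combine high fidelity with computational efficiency. The core of the solution is to recast occupancy world modeling as a temporal next-scale prediction (TENS) task, decomposing spatio-temporal sequence modeling into scale-by-scale spatial generation and scene-by-scene temporal prediction, which allows flexible and efficient modeling of temporal causality and spatial relationships. The paper further proposes the TensFormer architecture and a unified pose aggregation strategy that models occupancy and ego-motion sequences jointly, improving long-term generation quality and inference speed.
Link: https://arxiv.org/abs/2509.03887
Authors: Bu Jin, Songen Gu, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Wei Yin
Affiliations: University of Chinese Academy of Sciences; Horizon Robotics; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: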
Abstract:In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
[CV-55] Human Motion Video Generation: A Survey
[Quick Read]: This paper addresses the lack of a systematic survey of human motion video generation: existing surveys focus on individual methods and do not cover the full generation pipeline from input to output. The key to the solution is that it lays out, for the first time, the five core phases of human motion video generation (input, motion planning, motion video generation, refinement, and output) and, on that basis, systematically reviews more than two hundred papers covering recent progress across the vision, text, and audio modalities. It is also the first survey to discuss the potential of large language models (LLMs) in this field, offering a structured perspective for further development of digital human technology.
Link: https://arxiv.org/abs/2509.03883
Authors: Haiwei Xue, Xiangyang Luo, Zhanghao Hu, Xin Zhang, Xunzhi Xiang, Yuqin Dai, Jianzhuang Liu, Zhensong Zhang, Minglei Li, Jian Yang, Fei Ma, Zhiyong Wu, Changpeng Yang, Zonghong Dai, Fei Richard Yu
Affiliations: Tsinghua University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Huawei Noah’s Ark Lab; 01.AI; School of Mathematics and Statistics, Xi’an Jiaotong University; Artificial Intelligence Innovation and Incubation (AI’) Institute of Fudan University; University of Chinese Academy of Sciences; PCA Lab, Nanjing University of Science and Technology; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Shenzhen University; Carleton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by TPAMI. GitHub Repo: this https URL IEEE Access: this https URL
Abstract:Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in Our Repository this https URL.
[CV-56] SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition ICANN2025
【速读】:该论文旨在解决组合零样本食物识别(Compositional Zero-Shot Food Recognition, CZSFR)问题,即在不依赖特定类别标注数据的情况下,识别由不同食材和菜系组合而成的新食物类别。其核心挑战包括:背景信息冗余干扰特征学习、主食与配菜角色混淆导致误分类,以及单一属性语义偏倚引发理解偏差。解决方案的关键在于提出SalientFusion方法,包含两个核心组件:一是SalientFormer,通过去除背景冗余并利用深度特征缓解主食与配菜的角色混淆;二是DebiasAT,通过将提示(prompt)与视觉特征对齐以减少语义偏倚,从而提升模型的泛化能力与准确性。
链接: https://arxiv.org/abs/2509.03873
作者: Jiajun Song,Xiaoou Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 34th International Conference on Artificial Neural Networks - ICANN 2025
Abstract:Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with attributes and objects in Compositional Zero-Shot Learning (CZSL). However, CZSFR faces three challenges: (1) redundant background information distracts models from learning meaningful food features, (2) role confusion between staple and side dishes leads to misclassification, and (3) semantic bias in a single attribute can lead to misunderstanding. Therefore, we propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion; and DebiasAT, which reduces the semantic bias by aligning prompts with visual features. Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks and on the most popular datasets for general CZSL. The code is available at this https URL.
[CV-57] Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection
【速读】:该论文旨在解决现有RGB-Event目标检测方法在特征提取与融合过程中对低信息区域(如图像背景和事件数据中的非事件区域)进行统一处理所导致的计算冗余和性能不佳问题。其解决方案的关键在于提出FocusMamba框架,通过事件引导的多模态稀疏化(Event-Guided Multimodal Sparsification, EGMS)策略,自适应地识别并剔除各模态中的低信息区域,从而减少无效计算;在此基础上,进一步设计交叉模态聚焦融合(Cross-Modality Focus Fusion, CMFF)模块,高效整合来自RGB和事件模态的互补特征,实现精度与效率的更好平衡。
链接: https://arxiv.org/abs/2509.03872
作者: Nan Yang,Yang Wang,Zhanwen Liu,Yuchao Dai,Yang Liu,Xiangmo Zhao
机构: Chang’an University (长安大学); Northwestern Polytechnical University (西北工业大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing RGB-Event detection methods process the low-information regions of both modalities (background in images and non-event regions in event data) uniformly during feature extraction and fusion, resulting in high computational costs and suboptimal performance. To mitigate the computational redundancy during feature extraction, researchers have respectively proposed token sparsification methods for the image and event modalities. However, these methods employ a fixed number or threshold for token selection, hindering the retention of informative tokens for samples with varying complexity. To achieve a better balance between accuracy and efficiency, we propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features and efficiently integrates complementary information. Specifically, an Event-Guided Multimodal Sparsification (EGMS) strategy is designed to identify and adaptively discard low-information regions within each modality by leveraging scene content changes perceived by the event camera. Based on the sparsification results, a Cross-Modality Focus Fusion (CMFF) module is proposed to effectively capture and integrate complementary features from both modalities. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency compared to existing methods. The code will be available at this https URL.
[CV-58] Data-Augmented Quantization-Aware Knowledge Distillation
【速读】:该论文旨在解决低精度深度学习模型中,量化感知知识蒸馏(Quantization-aware Knowledge Distillation, QAKD)过程中数据增强(Data Augmentation, DA)策略选择不当导致性能受限的问题。现有方法主要聚焦于优化损失函数或前向/反向传播以提升量化模型的输出准确性,却忽视了输入变换(如DA)对模型训练动态的影响,尤其在低比特精度下缺乏系统性分析。解决方案的关键在于提出一种新颖的评估指标,该指标通过最大化上下文互信息(Contextual Mutual Information,即与图像标签无直接关联的信息)的同时,确保各类别预测结果平均接近真实标签,从而自动衡量并排序不同DA策略的有效性。该方法计算开销极低,兼容任意KD或QAT算法,并在多种模型架构和数据集上显著提升了当前最先进QAT与KD方法的性能表现。
链接: https://arxiv.org/abs/2509.03850
作者: Justin Kur,Kaiqi Zhao
机构: Oakland University (奥克兰大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures
Abstract:Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. Existing KD and QAT works focus on improving the accuracy of quantized models from the network output perspective by designing better KD loss functions or optimizing QAT’s forward and backward propagation. However, limited attention has been given to understanding the impact of input transformations, such as data augmentation (DA). The relationship between quantization-aware KD and DA remains unexplored. In this paper, we address the question: how to select a good DA in quantization-aware KD, especially for the models with low precisions? We propose a novel metric which evaluates DAs according to their capacity to maximize the Contextual Mutual Information–the information not directly related to an image’s label–while also ensuring the predictions for each class are close to the ground truth labels on average. The proposed method automatically ranks and selects DAs, requiring minimal training overhead, and it is compatible with any KD or QAT algorithm. Extensive evaluations demonstrate that selecting DA strategies using our metric significantly improves state-of-the-art QAT and KD works across various model architectures and datasets.
[CV-59] A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters: A Case Study in Shanghai
【速读】:该论文旨在解决如何通过多模态数据驱动的方法,系统性地解析游客对历史城区环境的感知,从而为旅游管理、遗产保护及公共空间设计提供科学依据。其核心问题在于传统方法难以全面捕捉游客在真实场景中的审美偏好与情感反应,导致规划决策缺乏对用户主观体验的精准理解。解决方案的关键在于构建一个融合视觉焦点提取、色彩主题分析与情感挖掘的多维AI框架:首先利用微调后的语义分割模型识别游客照片中的视觉关注点;其次通过聚类算法提取主导色彩并分析其空间分布差异,揭示社交媒体图像与现实街景之间的视觉预期偏差;最后采用规则+多任务BERT的混合情感分析方法,从旅游活动、建成环境、服务设施和商业形态四个维度量化游客满意度。该框架不依赖单一技术突破,而是强调多源异构数据的协同整合与跨模态关联,实现对游客感知的深度解码。
链接: https://arxiv.org/abs/2509.03830
作者: Kaizhen Tan,Yufan Wu,Yuxuan Liu,Haoran Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:Historic urban quarters play a vital role in preserving cultural heritage while serving as vibrant spaces for tourism and everyday life. Understanding how tourists perceive these environments is essential for sustainable, human-centered urban planning. This study proposes a multidimensional AI-powered framework for analyzing tourist perception in historic urban quarters using multimodal data from social media. Applied to twelve historic quarters in central Shanghai, the framework integrates focal point extraction, color theme analysis, and sentiment mining. Visual focus areas are identified from tourist-shared photos using a fine-tuned semantic segmentation model. To assess aesthetic preferences, dominant colors are extracted using a clustering method, and their spatial distribution across quarters is analyzed. Color themes are further compared between social media photos and real-world street views, revealing notable shifts. This divergence highlights potential gaps between visual expectations and the built environment, reflecting both stylistic preferences and perceptual bias. Tourist reviews are evaluated through a hybrid sentiment analysis approach combining a rule-based method and a multi-task BERT model. Satisfaction is assessed across four dimensions: tourist activities, built environment, service facilities, and business formats. The results reveal spatial variations in aesthetic appeal and emotional response. Rather than focusing on a single technical innovation, this framework offers an integrated, data-driven approach to decoding tourist perception and contributes to informed decision-making in tourism, heritage conservation, and the design of aesthetically engaging public spaces.
[CV-60] EGTM: Event-guided Efficient Turbulence Mitigation
【速读】:该论文旨在解决大气湍流导致帧式相机图像中出现随机畸变和模糊的问题,即湍流抑制(Turbulence Mitigation, TM)任务。现有基于深度学习的TM方法依赖多帧退化图像提取湍流线索以实现“幸运融合”(lucky fusion),但受限于低帧率下粗粒度湍流动态建模,导致网络容量需求高、计算与存储效率低下。其解决方案的关键在于提出一种全新的“事件-幸运洞察”(event-lucky insight),揭示湍流畸变与事件流反向时空分布之间的关联,并据此构建EGTM框架:通过从显式但噪声较大的湍流事件中提取像素级可靠无畸变引导信息,实现时间维度上的幸运融合;同时搭建首个真实世界事件驱动的湍流数据采集系统,形成首个事件驱动的TM数据集,显著提升恢复质量并大幅降低模型规模、推理延迟和复杂度。
链接: https://arxiv.org/abs/2509.03808
作者: Huanan Li,Rui Fan,Juntao Guan,Weidong Hao,Lai Rui,Tong Wu,Yikai Wang,Lin Gu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Turbulence mitigation (TM) aims to remove the stochastic distortions and blurs introduced by atmospheric turbulence into frame cameras. Existing state-of-the-art deep-learning TM methods extract turbulence cues from multiple degraded frames to find the so-called "lucky", undistorted patch for "lucky fusion". However, this requires a high-capacity network to learn from coarse-grained turbulence dynamics between synchronous frames with limited frame rate, and thus falls short in computational and storage efficiency. Event cameras, with microsecond-level temporal resolution, have the potential to fundamentally address this bottleneck with an efficient, sparse, and asynchronous imaging mechanism. In light of this, we (i) present the fundamental "event-lucky insight" to reveal the correlation between turbulence distortions and the inverse spatiotemporal distribution of event streams. Then, building upon this insight, we (ii) propose a novel EGTM framework that extracts pixel-level reliable turbulence-free guidance from the explicit but noisy turbulent events for temporal lucky fusion. Moreover, we (iii) build the first turbulence data acquisition system to contribute the first real-world event-driven TM dataset. Extensive experimental results demonstrate that our approach outperforms the existing SOTA TM method by factors of 710, 214, and 224 in model size, inference latency, and model complexity, respectively, while achieving state-of-the-art restoration quality (+0.94 PSNR and +0.08 SSIM) on our real-world EGTM dataset. This demonstrates the great efficiency merit of introducing the event modality into the TM task. Demo code and data have been uploaded in the supplementary material and will be released once accepted.
[CV-61] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation ICCV2025
【速读】:该论文旨在解决当前基于CLIP(Contrastive Language–Image Pretraining)的提示学习方法在细粒度图像识别任务中表现有限的问题。其解决方案的关键在于提出一种因果引导的文本提示学习方法——CaPL(Causality-guided Prompt Learning),通过视觉粒化(visual granulation)技术构建视觉粒元(visual granules),以捕捉不同细粒度类别间的细微差异。该方法包含两个核心模块:一是属性解耦模块,利用布朗桥扩散模型(Brownian Bridge Diffusion Model)将视觉特征分解为非个体化属性(共享于多个类)与个体化属性(仅属于单一类);二是粒元学习模块,结合上述属性并基于两种因果推理策略构建视觉粒元,从而增强文本提示的判别能力。实验表明,该方法在15个数据集上显著优于现有最优提示学习方法,尤其在细粒度识别场景下优势明显。
链接: https://arxiv.org/abs/2509.03803
作者: Mengyu Gao,Qiulei Dong
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems (多模态人工智能系统国家重点实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 Accepted
Abstract:Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most existing CLIP-based prompt learning methods show only a limited ability to handle fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique constructs sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) an attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) a granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.
[CV-62] MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection Understanding and Reporting
【速读】:该论文旨在解决医学影像诊断中普遍存在的三大问题:读片错误(reading errors)、无意盲视(inattentional blindness)和沟通失败(communication failures),这些问题主要源于局部异常的遗漏、全局上下文信息不足以及报告语言的不一致性。尤其在3D CT成像中,医生需逐层分析数百张切片,进一步加剧了上述挑战。为应对这些难题,作者提出MedVista3D——一种多尺度语义增强型视觉-语言预训练框架,其核心创新在于通过局部与全局图像-文本对齐实现全体积上下文中的细粒度表征学习,从而同时支持精确的病灶定位与整体体积级推理;此外,引入放射学语义匹配库(Radiology Semantic Matching Bank)并结合语言模型重写策略,有效缓解未标注报告中的语义噪声和多样性问题,显著提升模型在零样本疾病分类、报告检索及医学视觉问答等任务上的性能,并具备良好的下游迁移能力(如器官分割与预后预测)。
链接: https://arxiv.org/abs/2509.03800
作者: Yuheng Li,Yenho Chen,Yuxiang Lai,Jike Zhong,Vanessa Wildman,Xiaofeng Yang
机构: Georgia Institute of Technology (佐治亚理工学院); Emory University School of Medicine (埃默里大学医学院); University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.
[CV-63] Fitting Image Diffusion Models on Video Datasets ICCV25
【速读】:该论文旨在解决图像扩散模型(image diffusion models)在训练过程中因依赖独立采样的静态图像而导致的时间信息缺失问题,这限制了模型的收敛速度、分布覆盖范围和泛化能力。其关键解决方案是引入一种无需修改网络架构的简单有效训练策略,利用连续视频帧中固有的时间归纳偏置(temporal inductive bias)作为正则化项,从而降低梯度方差并加速收敛,同时提升生成多样性与对细微时间变化的捕捉能力。
链接: https://arxiv.org/abs/2509.03794
作者: Juhun Lee,Simon S. Woo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV25 Workshop
Abstract:Image diffusion models are trained on independently sampled static images. While this is the bedrock task protocol in generative modeling, capturing the temporal world through the lens of static snapshots is information-deficient by design. This limitation leads to slower convergence, limited distributional coverage, and reduced generalization. In this work, we propose a simple and effective training strategy that leverages the temporal inductive bias present in continuous video frames to improve diffusion training. Notably, the proposed method requires no architectural modification and can be seamlessly integrated into standard diffusion training pipelines. We evaluate our method on the HandCo dataset, where hand-object interactions exhibit dense temporal coherence and subtle variations in finger articulation often result in semantically distinct motions. Empirically, our method accelerates convergence by over 2× and achieves lower FID on both training and validation distributions. It also improves generative diversity by encouraging the model to capture meaningful temporal variations. We further provide an optimization analysis showing that our regularization reduces the gradient variance, which contributes to faster convergence.
[CV-64] SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection
【速读】:该论文旨在解决水下伪装目标检测(Underwater Camouflaged Object Detection, UCOD)问题,即在光学畸变、水体浑浊及海洋生物复杂特征等挑战下,实现对与环境高度融合的水下目标的准确识别。其解决方案的关键在于提出了一种名为SLENet(Semantic Localization and Enhancement Network)的新框架,核心创新包括:1)引入Gamma-Asymmetric Enhancement(GAE)模块以增强多尺度特征表示;2)设计Localization Guidance Branch(LGB)生成富含全局语义信息的位置图,指导Multi-Scale Supervised Decoder(MSSD)输出更精准的预测结果。该方法显著提升了UCOD任务的性能,并展现出对更广泛伪装目标检测(COD)任务的良好泛化能力。
链接: https://arxiv.org/abs/2509.03786
作者: Xinxin Wang,Han Sun,Ningzhong Liu,Huiyu Zhou,Yinan Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, accepted by PRCV2025
Abstract:Underwater Camouflaged Object Detection (UCOD) aims to identify objects that blend seamlessly into underwater environments. This task is critically important to marine ecology. However, it remains largely underexplored and accurate identification is severely hindered by optical distortions, water turbidity, and the complex traits of marine organisms. To address these challenges, we introduce the UCOD task and present DeepCamo, a benchmark dataset designed for this domain. We also propose Semantic Localization and Enhancement Network (SLENet), a novel framework for UCOD. We first benchmark state-of-the-art COD models on DeepCamo to reveal key issues, upon which SLENet is built. In particular, we incorporate Gamma-Asymmetric Enhancement (GAE) module and a Localization Guidance Branch (LGB) to enhance multi-scale feature representation while generating a location map enriched with global semantic information. This map guides the Multi-Scale Supervised Decoder (MSSD) to produce more accurate predictions. Experiments on our DeepCamo dataset and three benchmark COD datasets confirm SLENet’s superior performance over SOTA methods, and underscore its high generality for the broader COD task.
[CV-65] ContraGS: Codebook-Condensed and Trainable Gaussian Splatting for Fast Memory-Efficient Reconstruction
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)模型在训练过程中因高数量的3D高斯分布导致GPU显存占用过大、训练与渲染效率低下的问题。解决方案的关键在于提出ContraGS方法,通过引入码本(codebook)结构对高斯参数向量进行紧凑存储,在不减少高斯数量的前提下显著降低内存消耗;同时,为应对码本压缩表示中非可微参数的学习难题,ContraGS将参数估计建模为贝叶斯推断问题,并利用马尔可夫链蒙特卡洛(MCMC)采样从后验分布中高效采样,从而实现端到端的压缩表示训练。实验表明,该方法平均可降低3.49倍峰值显存,加速训练和渲染分别达1.36倍和1.88倍,且保持接近SOTA的质量水平。
链接: https://arxiv.org/abs/2509.03775
作者: Sankeerth Durvasula,Sharanshangar Muhunthan,Zain Moustafa,Richard Chen,Ruofan Liang,Yushi Guan,Nilesh Ahuja,Nilesh Jain,Selvakumar Panneer,Nandita Vijaykumar
机构: University of Toronto (多伦多大学); Intel (英特尔)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) is a state-of-the-art technique for modeling real-world scenes with high quality and real-time rendering. Typically, a higher quality representation can be achieved by using a large number of 3D Gaussians. However, using large 3D Gaussian counts significantly increases the GPU device memory needed to store model parameters. A large model thus requires powerful GPUs with high memory capacities for training and has slower training/rendering latencies due to the inefficiencies of memory access and data movement. In this work, we introduce ContraGS, a method to enable training directly on compressed 3DGS representations without reducing the Gaussian count, and thus with little loss in model quality. ContraGS leverages codebooks to compactly store a set of Gaussian parameter vectors throughout the training process, thereby significantly reducing memory consumption. While codebooks have been demonstrated to be highly effective at compressing fully trained 3DGS models, directly training using codebook representations is an unsolved challenge. ContraGS solves the problem of learning non-differentiable parameters in codebook-compressed representations by posing parameter estimation as a Bayesian inference problem. To this end, ContraGS provides a framework that effectively uses MCMC sampling to sample over a posterior distribution of these compressed representations. We demonstrate that ContraGS significantly reduces the peak memory during training (3.49X on average) and accelerates training and rendering (1.36X and 1.88X on average, respectively), while retaining close to state-of-the-art quality.
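The codebook storage idea mentioned in this abstract can be illustrated with a minimal sketch. It only shows how per-Gaussian parameter vectors might be replaced by indices into a small shared codebook; it does not reproduce the paper's Bayesian/MCMC training procedure. The dimensions, the random initialization, and all function names (`nearest_code`, `compress`, `decompress`) are assumptions made for illustration.

```python
import random

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(vec, code))
        if d < best_d:
            best, best_d = i, d
    return best

def compress(params, codebook):
    """Replace every parameter vector by the index of its nearest code."""
    return [nearest_code(p, codebook) for p in params]

def decompress(indices, codebook):
    """Look the (approximate) parameter vectors back up from their indices."""
    return [codebook[i] for i in indices]

random.seed(0)
DIM, NUM_GAUSSIANS, NUM_CODES = 8, 1000, 16  # hypothetical sizes
params = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_GAUSSIANS)]
codebook = random.sample(params, NUM_CODES)  # hypothetical init: reuse existing vectors
indices = compress(params, codebook)

# Storage measured in stored values: one index per Gaussian plus the codebook,
# versus one full vector per Gaussian.
dense_slots = NUM_GAUSSIANS * DIM
compressed_slots = NUM_GAUSSIANS + NUM_CODES * DIM
```

With these toy sizes the compressed form stores 1,128 values instead of 8,000. The nearest-code lookup is not differentiable, which is the difficulty the abstract says ContraGS addresses.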
[CV-66] STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification
【速读】:该论文旨在解决在边缘设备上部署高精度植物病害诊断模型的挑战,尤其针对现有轻量级网络中注意力机制多基于通用目标识别设计、难以捕捉病灶细微特征(如不规则病斑形状和复杂纹理)的问题。解决方案的关键在于提出双管齐下的方法:一是采用无需训练的神经架构搜索方法(DeepMAD)构建适配边缘设备的高效网络主干;二是引入形状-纹理注意力模块(Shape-Texture Attention Module, STAM),通过解耦注意力机制,分别利用可变形卷积(DCNv4)增强形状感知能力,以及Gabor滤波器组提升纹理敏感性,从而显著提升模型对植物病害特征的识别精度。
链接: https://arxiv.org/abs/2509.03754
作者: Zongsen Qiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches – one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at this https URL.
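As a rough illustration of the texture branch described in this abstract, the sketch below builds a small Gabor filter bank using the standard real-valued Gabor formula. The kernel size, the parameter defaults, the number of orientations, and the function names are assumptions; the paper's actual STAM configuration is not given in the abstract.

```python
import math

def gabor_kernel(size, theta, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter as a size x size list of lists (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine carrier along xr.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_bank(size=7, orientations=4):
    """One kernel per orientation, evenly spaced over [0, pi)."""
    return [gabor_kernel(size, k * math.pi / orientations) for k in range(orientations)]

bank = gabor_bank()
```

Each kernel responds most strongly to stripes of wavelength `lambd` oriented according to `theta`, which is why such banks are a common choice for texture-sensitive features.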
[CV-67] Mapping on a Budget: Optimizing Spatial Data Collection for ML
【速读】:该论文旨在解决卫星遥感机器学习(SatML)中因标注训练数据稀疏、空间聚集且采集成本异质而导致的性能瓶颈问题。传统研究多聚焦于模型架构与训练算法优化,而忽视了对实际数据采集条件的建模与优化。本文提出首个在异质数据采集成本和现实预算约束下,针对空间训练数据分布的优化问题形式化框架,并设计了新颖的采样策略以实现样本效率最大化。其关键在于将数据采集决策建模为一个可优化的数学问题,通过实验验证在三大洲四类任务中均能显著提升模型性能,尤其在农业调查数据簇状分布场景中效果突出,如在多哥地区的应用中可直接指导如何增强现有调查数据以支持更高效的SatML监测。
链接: https://arxiv.org/abs/2509.03749
作者: Livia Betti,Farooq Sanni,Gnouyaro Sogoyou,Togbe Agbagla,Cullen Molitor,Tamma Carleton,Esther Rolf
机构: University of Colorado (科罗拉多大学); University of California, Berkeley (加州大学伯克利分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In applications across agriculture, ecology, and human development, machine learning with satellite imagery (SatML) is limited by the sparsity of labeled training data. While satellite data cover the globe, labeled training datasets for SatML are often small, spatially clustered, and collected for other purposes (e.g., administrative surveys or field measurements). Despite the pervasiveness of this issue in practice, past SatML research has largely focused on new model architectures and training algorithms to handle scarce training data, rather than modeling data conditions directly. This leaves scientists and policymakers who wish to use SatML for large-scale monitoring uncertain about whether and how to collect additional data to maximize performance. Here, we present the first problem formulation for the optimization of spatial training data in the presence of heterogeneous data collection costs and realistic budget constraints, as well as novel methods for addressing this problem. In experiments simulating different problem settings across three continents and four tasks, our strategies reveal substantial gains from sample optimization. Further experiments delineate settings for which optimized sampling is particularly effective. The problem formulation and methods we introduce are designed to generalize across application domains for SatML; we put special emphasis on a specific problem setting where our coauthors can immediately use our findings to augment clustered agricultural surveys for SatML monitoring in Togo.
[CV-68] LayoutGKN: Graph Similarity Learning of Floor Plans BMVC
【速读】:该论文旨在解决基于图结构的平面布局(floor plans)相似性计算效率低的问题,现有方法如图匹配网络(graph matching networks)依赖于昂贵的跨图节点级交互,导致推理速度缓慢。其解决方案的关键在于提出LayoutGKN,通过将跨图节点级交互延迟至联合嵌入架构的末端,并利用可微分图核(differentiable graph kernel)作为最终学习到的节点嵌入之间的距离函数,从而在保持或提升相似性计算性能的同时显著提高运行效率。
链接: https://arxiv.org/abs/2509.03737
作者: Casper van Engelenburg,Jan van Gemert,Seyran Khademi
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: BMVC (2025)
Abstract:Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs, i.e., graph matching networks, rely on costly intermediate cross-graph node-level interactions and are therefore slow at inference time. We introduce LayoutGKN, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed. Code and data are open: this https URL.
[CV-69] Transfer Learning-Based CNN Models for Plant Species Identification Using Leaf Venation Patterns
【速读】:该论文旨在解决植物物种自动分类中依赖人工识别效率低下的问题,聚焦于利用叶片脉络(leaf venation)这一高分类学价值的形态特征进行自动化识别。解决方案的关键在于采用三种深度学习架构(ResNet50、MobileNetV2 和 EfficientNetB0)对瑞典叶部数据集(Swedish Leaf Dataset)进行训练与测试,其中 EfficientNetB0 在测试准确率(94.67%)及精确率、召回率和 F1 分数均超过 94.6% 的表现,展现出最优的鲁棒性和泛化能力,验证了其在基于脉络特征的植物分类任务中的优越性。
链接: https://arxiv.org/abs/2509.03729
作者: Bandita Bharadwaj,Ankur Mishra,Saurav Bharadwaj
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study evaluates the efficacy of three deep learning architectures: ResNet50, MobileNetV2, and EfficientNetB0 for automated plant species classification based on leaf venation patterns, a critical morphological feature with high taxonomic relevance. Using the Swedish Leaf Dataset comprising images from 15 distinct species (75 images per species, totalling 1,125 images), the models were demonstrated using standard performance metrics during training and testing phases. ResNet50 achieved a training accuracy of 94.11% but exhibited overfitting, reflected by a reduced testing accuracy of 88.45% and an F1 score of 87.82%. MobileNetV2 demonstrated better generalization capabilities, attaining a testing accuracy of 93.34% and an F1 score of 93.23%, indicating its suitability for lightweight, real-time applications. EfficientNetB0 outperformed both models, achieving a testing accuracy of 94.67% with precision, recall, and F1 scores exceeding 94.6%, highlighting its robustness in venation-based classification. The findings underscore the potential of deep learning, particularly EfficientNetB0, in developing scalable and accurate tools for automated plant taxonomy using venation traits.
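For readers unfamiliar with the metrics quoted in this abstract, here is a minimal sketch of how accuracy, precision, recall, and F1 can be computed for a multi-class classifier. Macro averaging is an assumption, since the abstract does not say which averaging the authors used, and the toy labels are made up for illustration.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy example with three classes (made-up labels, not the paper's data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = macro_metrics(y_true, y_pred)

# Dataset size stated in the abstract: 15 species x 75 images each.
total_images = 15 * 75
```

On the toy labels the per-class F1 scores are 0.5, 0.8, and about 0.667, giving a macro F1 of about 0.656.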
[CV-70] QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception
【速读】:该论文旨在解决车联网(Vehicle-to-Everything, V2X)协同感知系统在实际部署中面临的效率、延迟和可扩展性问题。现有方法多聚焦于提升感知精度,却忽视了资源受限环境下模型计算与通信开销过高的痛点,尤其是依赖全精度(full-precision)模型导致的高延迟与带宽压力。解决方案的关键在于提出QuantV2X——首个面向多模态、多智能体V2X协同感知的端到端全量化系统,通过统一设计神经网络模型与传输消息表示的量化策略,在低比特约束下显著降低计算负载与通信带宽,同时保持与全精度系统相当的感知精度。实验证明,QuantV2X在系统级延迟上减少3.2倍,并在mAP30指标上优于全精度基线9.5个百分点,且更利于在严格内存预算内部署更大规模模型,验证了全量化中间融合架构在真实场景中的可行性。
链接: https://arxiv.org/abs/2509.03704
作者: Seth Z. Zhao,Huizhi Zhang,Zhaowei Li,Juntong Peng,Anthony Chui,Zewei Zhou,Zonglin Meng,Hao Xiang,Zhiyu Huang,Fujia Wang,Ran Tian,Chenfeng Xu,Bolei Zhou,Jiaqi Ma
机构: UCLA(加州大学洛杉矶分校); UW-Madison(威斯康星大学麦迪逊分校); NCSU(北卡罗来纳州立大学); Purdue University(普渡大学); UC Berkeley(加州大学伯克利分校); UT Austin(德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cooperative perception through Vehicle-to-Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system-level considerations of efficiency, latency, and real-world deployability. Notably, most existing systems rely on full-precision models, which incur high computational and transmission costs, making them impractical for real-time operation in resource-constrained environments. In this paper, we introduce QuantV2X, the first fully quantized multi-agent system designed specifically for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception. QuantV2X introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems. More importantly, when evaluated under deployment-oriented metrics, QuantV2X reduces system-level latency by 3.2× and achieves a +9.5 improvement in mAP30 over full-precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment. The system will be publicly released to promote research in this field: this https URL.
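To make the idea of quantizing transmitted features concrete, the following sketch applies plain uniform affine quantization to a short list of feature values. This is a generic textbook scheme chosen for illustration, not QuantV2X's actual quantization strategy; the bit width, the sample values, and the function names are all assumptions.

```python
def quantize(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] with a uniform step."""
    lo, hi = min(values), max(values)
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + lo for c in codes]

# Hypothetical feature values that one agent would send to another.
features = [0.0, 0.1, -1.5, 2.25, 0.7]
codes, scale, lo = quantize(features, num_bits=4)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest level bounds the error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(features, restored))
```

Sending 4-bit codes instead of 32-bit floats cuts the payload per value by a factor of eight, at the cost of an error of at most half a step (`scale / 2`) per value.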
[CV-71] LuxDiT: Lighting Estimation with Video Diffusion Transformer
【速读】:该论文旨在解决从单张图像或视频中估计场景光照(scene lighting)这一长期存在的计算机视觉与图形学挑战,尤其针对现有学习方法受限于真实高动态范围(High-Dynamic Range, HDR)环境贴图数据稀缺的问题。其解决方案的关键在于提出一种名为LuxDiT的新颖数据驱动方法:通过微调视频扩散Transformer(video diffusion transformer)来生成条件化的HDR环境贴图,模型在大规模合成数据集上训练以学习从间接视觉线索中推断光照,并结合低秩适应(Low-Rank Adaptation, LoRA)微调策略提升输入图像与预测环境贴图之间的语义对齐,从而实现高保真度的全局光照重建,在定量和定性评估中均优于现有最先进方法。
链接: https://arxiv.org/abs/2509.03680
作者: Ruofan Liang,Kai He,Zan Gojcic,Igor Gilitschenski,Sanja Fidler,Nandita Vijaykumar,Zian Wang
机构: NVIDIA(英伟达); University of Toronto(多伦多大学); Vector Institute(矢量研究所)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
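The low-rank adaptation (LoRA) strategy mentioned in this abstract can be sketched in a few lines. The code shows only the generic LoRA weight update W' = W + (alpha / r) · B · A on a toy matrix; how LuxDiT applies it inside the video diffusion transformer is not described in the abstract. The shapes, the values, and the helper names are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(a)
    delta = matmul(b, a)
    s = alpha / r
    return [[w[i][j] + s * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

D_OUT, D_IN, RANK = 8, 8, 2  # hypothetical layer sizes
w = [[0.1 * (i + j) for j in range(D_IN)] for i in range(D_OUT)]      # frozen base weight
a = [[0.01 * (k + j + 1) for j in range(D_IN)] for k in range(RANK)]  # trainable, r x d_in
b = [[0.0] * RANK for _ in range(D_OUT)]                              # trainable, zero-initialized

adapted = lora_weight(w, a, b, alpha=4.0)

full_params = D_OUT * D_IN           # parameters updated by full fine-tuning
lora_params = RANK * (D_IN + D_OUT)  # parameters updated by LoRA
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer, and only the small A and B matrices (32 values here instead of 64) are trained.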
[CV-72] Insights from Gradient Dynamics: Gradient Autoscaled Normalization
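【Quick Read】: This paper examines how gradient dynamics during training affect the stability and generalization of deep neural networks, in particular the fact that the evolution of gradient variance and standard deviation across layers and at the global scale has not been well understood or exploited. The key to its solution is a hyperparameter-free gradient normalization method that aligns gradient scaling with its natural evolution, avoiding unintended amplification and thereby stabilizing optimization while preserving convergence guarantees. Experiments on the CIFAR-100 benchmark show that the method maintains or improves test accuracy for ResNet-20, ResNet-56, and VGG-16-BN, performing especially well under strong generalization settings.
Link: https://arxiv.org/abs/2509.03677
Authors: Vincent-Daniel Yun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: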
Abstract:Gradient dynamics play a central role in determining the stability and generalization of deep neural networks. In this work, we provide an empirical analysis of how variance and standard deviation of gradients evolve during training, showing consistent changes across layers and at the global scale in convolutional networks. Motivated by these observations, we propose a hyperparameter-free gradient normalization method that aligns gradient scaling with their natural evolution. This approach prevents unintended amplification, stabilizes optimization, and preserves convergence guarantees. Experiments on the challenging CIFAR-100 benchmark with ResNet-20, ResNet-56, and VGG-16-BN demonstrate that our method maintains or improves test accuracy even under strong generalization. Beyond practical performance, our study highlights the importance of directly tracking gradient dynamics, aiming to bridge the gap between theoretical expectations and empirical behaviors, and to provide insights for future optimization research.
[CV-73] Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding
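【Quick Read】: This paper asks how Large Multimodal Models (LMMs), which have made remarkable progress in 2D visual understanding, can be extended effectively to 3D scene understanding. Existing methods rely mainly on text-only supervision, which lacks geometric constraints and makes it hard to learn robust 3D spatial representations. The key to the solution is Reg3D, a Reconstructive Geometry Instruction Tuning framework whose core innovation is to bring geometry-aware supervision directly into training, on the premise that effective 3D understanding requires reconstructing the underlying geometric structures rather than merely describing them. Concretely, Reg3D adopts a dual-supervision paradigm and designs complementary object-level and frame-level reconstruction tasks within a dual-encoder architecture, enforcing geometric consistency to foster spatial reasoning.
Link: https://arxiv.org/abs/2509.03635
Authors: Hongpei Zheng,Lintao Xiang,Qijun Yang,Qian Lin,Hujun Yin
Affiliations: The University of Manchester
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures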
Abstract:The rapid development of Large Multimodal Models (LMMs) has led to remarkable progress in 2D visual understanding; however, extending these capabilities to 3D scene understanding remains a significant challenge. Existing approaches predominantly rely on text-only supervision, which fails to provide the geometric constraints required for learning robust 3D spatial representations. In this paper, we introduce Reg3D, a novel Reconstructive Geometry Instruction Tuning framework that addresses this limitation by incorporating geometry-aware supervision directly into the training process. Our key insight is that effective 3D understanding necessitates reconstructing underlying geometric structures rather than merely describing them. Unlike existing methods that inject 3D information solely at the input level, Reg3D adopts a dual-supervision paradigm that leverages 3D geometric information both as input and as explicit learning targets. Specifically, we design complementary object-level and frame-level reconstruction tasks within a dual-encoder architecture, enforcing geometric consistency to encourage the development of spatial reasoning capabilities. Extensive experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D demonstrate that Reg3D delivers substantial performance improvements, establishing a new training paradigm for spatially aware multimodal models.
[CV-74] treeX: Unsupervised Tree Instance Segmentation in Dense Forest Point Clouds
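【Quick Read】: This paper addresses the dependence of tree instance segmentation on large annotated datasets and substantial computational resources, particularly for 3D point clouds acquired by close-range laser scanning. Existing deep learning methods perform well but are limited by annotation cost and compute requirements. The authors therefore present a revised version of the treeX algorithm whose core idea is an unsupervised strategy: clustering-based stem detection combined with region growing for crown delineation, with separate parameter presets for ground-based laser scanning (TLS/PLS) and UAV-borne laser scanning (ULS). Compared with the original treeX algorithm, the revision raises the F₁-score by 0.11 to 0.49 on ground-based data, reaches an F₁-score of 0.58 on ULS data where the original fails to segment any correct instance, and runs faster. This makes it useful both as a lightweight alternative to deep learning and for semi-automatic label generation.
Link: https://arxiv.org/abs/2509.03633
Authors: Josafat-Mattias Burmeister,Andreas Tockner,Stefan Reder,Markus Engel,Rico Richter,Jan-Peter Mund,Jürgen Döllner
Affiliations: University of Potsdam; BOKU University; Eberswalde University of Sustainable Development; State Forestry Research Centre; Hasso Plattner Institute for Digital Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: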
Abstract:Close-range laser scanning provides detailed 3D captures of forest stands but requires efficient software for processing 3D point cloud data and extracting individual trees. Although recent studies have introduced deep learning methods for tree instance segmentation, these approaches require large annotated datasets and substantial computational resources. As a resource-efficient alternative, we present a revised version of the treeX algorithm, an unsupervised method that combines clustering-based stem detection with region growing for crown delineation. While the original treeX algorithm was developed for personal laser scanning (PLS) data, we provide two parameter presets, one for ground-based laser scanning (stationary terrestrial - TLS and PLS), and one for UAV-borne laser scanning (ULS). We evaluated the method on six public datasets (FOR-instance, ForestSemantic, LAUTx, NIBIO MLS, TreeLearn, Wytham Woods) and compared it to six open-source methods (original treeX, treeiso, RayCloudTools, ForAINet, SegmentAnyTree, TreeLearn). Compared to the original treeX algorithm, our revision reduces runtime and improves accuracy, with instance detection F₁-score gains of +0.11 to +0.49 for ground-based data. For ULS data, our preset achieves an F₁-score of 0.58, whereas the original algorithm fails to segment any correct instances. For TLS and PLS data, our algorithm achieves accuracy similar to recent open-source methods, including deep learning. Given its algorithmic design, we see two main applications for our method: (1) as a resource-efficient alternative to deep learning approaches in scenarios where the data characteristics align with the method design (sufficient stem visibility and point density), and (2) for the semi-automatic generation of labels for deep learning models. To enable broader adoption, we provide an open-source Python implementation in the pointtree package.
[CV-75] Lightweight image segmentation for echocardiography
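【Quick Read】: This paper targets automated, real-time segmentation of the Left Ventricle (LV) in echocardiography, which enables fully automatic extraction of clinical measurements such as volumes and ejection fraction. Models configured by nnU-Net perform well but are large and slow at inference, which limits real-time use. The key to the solution is an ablation study that systematically evaluates the effect of data augmentation, network architecture, loss functions, and post-processing. It finds that simple affine augmentations and deep supervision drive performance, while complex augmentations and large model capacity give diminishing returns. On that basis the authors design a lightweight U-Net with only 2M parameters that matches nnU-Net's segmentation accuracy (no significant difference in Dice score) while being 16 times smaller and 4 times faster (1.35 ms vs 5.40 ms per frame), with comparable cross-dataset generalization.
Link: https://arxiv.org/abs/2509.03631
Authors: Anders Kjelsrud,Lasse Løvstakken,Erik Smistad,Håvard Dalen,Gilles Van De Vyver
Affiliations: Intility; Norwegian University of Science and Technology; SINTEF Medical Image Analysis; St. Olav’s University Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 6 figures, The 2025 IEEE International Ultrasonics Symposium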
Abstract:Accurate segmentation of the left ventricle in echocardiography can enable fully automatic extraction of clinical measurements such as volumes and ejection fraction. While models configured by nnU-Net perform well, they are large and slow, thus limiting real-time use. We identified the most effective components of nnU-Net for cardiac segmentation through an ablation study, incrementally evaluating data augmentation schemes, architectural modifications, loss functions, and post-processing techniques. Our analysis revealed that simple affine augmentations and deep supervision drive performance, while complex augmentations and large model capacity offer diminishing returns. Based on these insights, we developed a lightweight U-Net (2M vs 33M parameters) that achieves statistically equivalent performance to nnU-Net on CAMUS (N=500) with Dice scores of 0.93/0.85/0.89 vs 0.93/0.86/0.89 for LV/MYO/LA (p > 0.05), while being 16 times smaller and 4 times faster (1.35ms vs 5.40ms per frame) than the default nnU-Net configuration. Cross-dataset evaluation on an internal dataset (N=311) confirms comparable generalization.
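For readers unfamiliar with the Dice score quoted in the abstract, the following is a minimal sketch of how it is commonly computed for a pair of binary masks. The function name `dice_score`, the flat-list mask layout, and the toy masks are illustrative assumptions and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 6-pixel masks: 2 overlapping foreground pixels, 3 foreground pixels each.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
print(round(dice_score(pred, target), 4))  # 0.6667
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixel.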
[CV-76] Multi Attribute Bias Mitigation via Representation Learning ECAI2025
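【Quick Read】: This paper addresses the harm that multiple overlapping biases in real-world images (textures, watermarks, gendered makeup, scene-object pairings, and so on) do to modern vision models, jointly undermining robustness and fairness. Existing methods usually mitigate one bias at a time, which often intensifies others and cannot handle several biases together. The key to the solution is Generalized Multi Bias Mitigation (GMBM), a lean two-stage framework. In the first stage, Adaptive Bias Integrated Learning (ABIL) explicitly identifies known shortcuts by training an encoder for each attribute and integrating it with the main backbone so that the classifier recognizes these biases. In the second stage, Gradient Suppression Fine Tuning removes all identified bias directions from the backbone's gradients, leaving a single compact network that ignores the shortcuts. The method needs group labels only during training, requires no extra intervention at test time, reduces multi-attribute bias amplification, and remains stable under distribution shift and subgroup imbalance.
Link: https://arxiv.org/abs/2509.03616
Authors: Rajeev Ranjan Dwivedi,Ankur Kumar,Vinod K Kurmi
Affiliations: Indian Institute of Science Education and Research (IISER) Bhopal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECAI 2025 (28th European Conference on Artificial Intelligence)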
Abstract:Real world images frequently exhibit multiple overlapping biases, including textures, watermarks, gendered makeup, scene object pairings, etc. These biases collectively impair the performance of modern vision models, undermining both their robustness and fairness. Addressing these biases individually proves inadequate, as mitigating one bias often permits or intensifies others. We tackle this multi bias problem with Generalized Multi Bias Mitigation (GMBM), a lean two stage framework that needs group labels only while training and minimizes bias at test time. First, Adaptive Bias Integrated Learning (ABIL) deliberately identifies the influence of known shortcuts by training encoders for each attribute and integrating them with the main backbone, compelling the classifier to explicitly recognize these biases. Then Gradient Suppression Fine Tuning prunes those very bias directions from the backbone’s gradients, leaving a single compact network that ignores all the shortcuts it just learned to recognize. Moreover we find that existing bias metrics break under subgroup imbalance and train test distribution shifts, so we introduce Scaled Bias Amplification (SBA): a test time measure that disentangles model induced bias amplification from distributional differences. We validate GMBM on FB CMNIST, CelebA, and COCO, where we boost worst group accuracy, halve multi attribute bias amplification, and set a new low in SBA even as bias complexity and distribution shifts intensify, making GMBM the first practical, end to end multibias solution for visual recognition. Project page: this http URL
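The abstract describes pruning bias directions from the backbone's gradients without giving the procedure. One generic way to read that idea is an orthogonal projection that strips every component lying in the span of the bias directions. The sketch below illustrates only that generic projection; the helper names `orthonormalize` and `suppress` and the toy vectors are assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: turn possibly correlated directions into an orthonormal basis."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            coeff = dot(r, b)
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([ri / norm for ri in r])
    return basis


def suppress(gradient, bias_directions):
    """Remove from `gradient` every component lying in span(bias_directions)."""
    g = list(gradient)
    for b in orthonormalize(bias_directions):
        coeff = dot(g, b)
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


# Hypothetical 3-dimensional gradient and two (correlated) bias directions.
grad = [1.0, 2.0, 3.0]
bias_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = suppress(grad, bias_dirs)
print(cleaned)  # [0.0, 0.0, 3.0]
```

After the projection the cleaned gradient is orthogonal to every bias direction, so an update along it does not move the parameters along those directions.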
[CV-77] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge
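【Quick Read】: This paper addresses the time cost and inter-observer variability of pathologists counting mitotic figures, together with two obstacles for Artificial Intelligence (AI) models: domain shift, where performance drops across data distributions, and severe class imbalance, since mitotic figures are far rarer than normal nuclei. The key to the solution is a teacher-student framework based on pixel-level segmentation. First, a UNet backbone is combined with contrastive representation learning and domain-adversarial training modules to improve generalization across tissues, staining conditions, and species. Second, pixel-level pseudo-masks are generated for mitotic figures, hard negatives, and normal nuclei, which sharpens feature discrimination and reduces the impact of domain shift. Finally, a multi-scale Convolutional Neural Network (CNN) classifier is added within a multi-task learning paradigm to jointly handle mitotic figure detection (Track 1) and atypical mitosis classification (Track 2).
Link: https://arxiv.org/abs/2509.03614
Authors: Seungho Choe,Xiaoli Qin,Abubakr Shafique,Amanda Dy,Dimitri Androutsos,Susan Done,April Khademi
Affiliations: Toronto Metropolitan University; University Health Network; University of Toronto; Vector Institute of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 1 figure, final submission for MIDOG 2025 challenge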
Abstract:Counting mitotic figures is time-intensive for pathologists and leads to inter-observer variability. Artificial intelligence (AI) promises a solution by automatically detecting mitotic figures while maintaining decision consistency. However, AI tools are susceptible to domain shift, where a significant drop in performance can occur due to differences in the training and testing sets, including morphological diversity between organs, species, and variations in staining protocols. Furthermore, the number of mitoses is much less than the count of normal nuclei, which introduces severely imbalanced data for the detection task. In this work, we formulate mitosis detection as a pixel-level segmentation and propose a teacher-student model that simultaneously addresses mitosis detection (Track 1) and atypical mitosis classification (Track 2). Our method is based on a UNet segmentation backbone that integrates domain generalization modules, namely contrastive representation learning and domain-adversarial training. A teacher-student strategy is employed to generate pixel-level pseudo-masks not only for annotated mitoses and hard negatives but also for normal nuclei, thereby enhancing feature discrimination and improving robustness against domain shift. For the classification task, we introduce a multi-scale CNN classifier that leverages feature maps from the segmentation model within a multi-task learning paradigm. On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 and balanced accuracy of 0.8414 in Track 2, demonstrating the effectiveness of integrating segmentation-based detection and classification into a unified framework for robust mitosis analysis.
[CV-78] Towards Efficient General Feature Prediction in Masked Skeleton Modeling ICCV2025
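【Quick Read】: This paper addresses the computational redundancy and limited semantic representation that arise in Masked Autoencoder (MAE) based skeleton action recognition when reconstruction targets are restricted to raw joint coordinates or simple variants of them. The key to the solution is a General Feature Prediction (GFP) framework that replaces conventional low-level reconstruction with high-level feature prediction spanning local motion patterns to global semantic representations. A lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, yielding more efficient masked skeleton modeling and stronger representations.
Link: https://arxiv.org/abs/2509.03609
Authors: Shengkai Sun,Zefan Zhang,Jianfeng Dong,Zhiyong Cheng,Xiaojun Chang,Meng Wang
Affiliations: Hefei University of Technology; Jilin University; Zhejiang Gongshang University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025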
Abstract:Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: computational efficiency (with 6.2× faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.
[CV-79] Revealing Fine Structure in Protoplanetary Disks with Physics Constrained Neural Fields
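【Quick Read】: This paper addresses the difficulty of resolving the three-dimensional structure of protoplanetary disks, in particular recovering the vertical distribution of disk gas such as CO from high-resolution ALMA observations, in order to advance understanding of disk evolution. Existing methods struggle to capture complex 3D structure, especially fine variations far from the central star (beyond 400 au). The key to the solution is a computational framework that combines physics-constrained neural fields with differentiable rendering, together with RadJAX, a GPU-accelerated, fully differentiable line radiative transfer solver that is up to 10,000x faster than conventional ray tracers. This enables high-dimensional neural reconstructions that were previously intractable and reveals a pronounced narrowing and flattening of the CO layer at large radii in the HD 163296 disk.
Link: https://arxiv.org/abs/2509.03623
Authors: Aviad Levis,Nhan Luong,Richard Teague,Katherine L. Bouman,Marcelo Barraza-Alfaro,Kevin Flaherty
Affiliations: University of Toronto; Massachusetts Institute of Technology; California Institute of Technology; Williams College
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV)
Comments: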
Abstract:Protoplanetary disks are the birthplaces of planets, and resolving their three-dimensional structure is key to understanding disk evolution. The unprecedented resolution of ALMA demands modeling approaches that capture features beyond the reach of traditional methods. We introduce a computational framework that integrates physics-constrained neural fields with differentiable rendering and present RadJAX, a GPU-accelerated, fully differentiable line radiative transfer solver achieving up to 10,000x speedups over conventional ray tracers, enabling previously intractable, high-dimensional neural reconstructions. Applied to ALMA CO observations of HD 163296, this framework recovers the vertical morphology of the CO-rich layer, revealing a pronounced narrowing and flattening of the emission surface beyond 400 au - a feature missed by existing approaches. Our work establishes a new paradigm for extracting complex disk structure and advancing our understanding of protoplanetary evolution.
Artificial Intelligence
[AI-0] ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset
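【Quick Read】: This paper addresses the lack of structure-aware and incident-aware evaluation in multivariate time series forecasting, specifically how to model dependencies between services in microservice systems and assess robustness against real failure events. The key to the solution is ChronoGraph, a graph-structured multivariate time series dataset. It contains multi-dimensional performance metrics for each service (such as CPU, memory, and network usage), explicitly encodes inter-service dependencies as directed edges, and provides expert-annotated incident windows as anomaly labels, supporting joint evaluation of structure-aware forecasting models and anomaly detection methods.
Link: https://arxiv.org/abs/2509.04449
Authors: Adrian Catalin Lutu,Ioana Pintilie,Elena Burceanu,Andrei Manolache
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: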
Abstract:We present ChronoGraph, a graph-structured multivariate time series forecasting dataset built from real-world production microservices. Each node is a service that emits a multivariate stream of system-level performance metrics, capturing CPU, memory, and network usage patterns, while directed edges encode dependencies between services. The primary task is forecasting future values of these signals at the service level. In addition, ChronoGraph provides expert-annotated incident windows as anomaly labels, enabling evaluation of anomaly detection methods and assessment of forecast robustness during operational disruptions. Compared to existing benchmarks from industrial control systems or traffic and air-quality domains, ChronoGraph uniquely combines (i) multivariate time series, (ii) an explicit, machine-readable dependency graph, and (iii) anomaly labels aligned with real incidents. We report baseline results spanning forecasting models, pretrained time-series foundation models, and standard anomaly detectors. ChronoGraph offers a realistic benchmark for studying structure-aware forecasting and incident-aware evaluation in microservice systems.
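To make the three ingredients named in the abstract concrete (per-service multivariate series, a directed dependency graph, and incident windows used as anomaly labels), here is a small in-memory sketch. The class names, field layout, service names, and the choice to attach each incident window to a single service are all hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    # metric name -> list of samples, one per time step (hypothetical layout)
    metrics: dict = field(default_factory=dict)


@dataclass
class ServiceGraph:
    services: dict = field(default_factory=dict)    # name -> Service
    depends_on: dict = field(default_factory=dict)  # caller -> set of callees (directed edges)
    incidents: list = field(default_factory=list)   # (service, start, end), end exclusive

    def add_service(self, name, metrics):
        self.services[name] = Service(name, metrics)
        self.depends_on.setdefault(name, set())

    def add_dependency(self, caller, callee):
        self.depends_on.setdefault(caller, set()).add(callee)

    def anomaly_labels(self, name, length):
        """Expand the incident windows of one service into a 0/1 label per time step."""
        labels = [0] * length
        for service, start, end in self.incidents:
            if service == name:
                for t in range(max(start, 0), min(end, length)):
                    labels[t] = 1
        return labels


graph = ServiceGraph()
graph.add_service("checkout", {"cpu": [0.2, 0.3, 0.9, 0.8, 0.3]})
graph.add_service("payments", {"cpu": [0.1, 0.1, 0.7, 0.6, 0.1]})
graph.add_dependency("checkout", "payments")  # checkout calls payments
graph.incidents.append(("payments", 2, 4))
print(graph.anomaly_labels("payments", 5))  # [0, 0, 1, 1, 0]
```

With labels aligned to time steps in this way, a forecast error can be reported separately inside and outside incident windows, which is the kind of incident-aware evaluation the abstract refers to.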
[AI-1] IPA: An Information-Preserving Input Projection Framework for Efficient Foundation Model Adaptation
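【Quick Read】: This paper addresses the information loss caused in LoRA (Low-Rank Adaptation) by a down-projection that is randomly initialized and data-agnostic, which limits fine-tuning performance. The core solution is IPA, an information-preserving input projection framework that is feature-aware and explicitly preserves key information in the reduced hidden space. In the linear case, IPA is instantiated with algorithms that approximate the top principal components (as in Principal Component Analysis, PCA), allowing efficient projector pretraining with negligible inference overhead. Experiments show that IPA outperforms LoRA and DoRA on multiple language and vision benchmarks and matches full LoRA performance with roughly half the trainable parameters.
Link: https://arxiv.org/abs/2509.04398
Authors: Yuan Yin,Shashanka Venkataramanan,Tuan-Hung Vu,Andrei Bursuc,Matthieu Cord
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: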
Abstract:Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, reduce adaptation cost by injecting low-rank updates into pretrained weights. However, LoRA’s down-projection is randomly initialized and data-agnostic, discarding potentially useful information. Prior analyses show that this projection changes little during training, while the up-projection carries most of the adaptation, making the random input compression a performance bottleneck. We propose IPA, a feature-aware projection framework that explicitly preserves information in the reduced hidden space. In the linear case, we instantiate IPA with algorithms approximating top principal components, enabling efficient projector pretraining with negligible inference overhead. Across language and vision benchmarks, IPA consistently improves over LoRA and DoRA, achieving on average 1.5 points higher accuracy on commonsense reasoning and 2.3 points on VTAB-1k, while matching full LoRA performance with roughly half the trainable parameters when the projection is frozen.
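The intuition that a data-aware projection keeps more information than an uninformed one can be illustrated with a rank-1 toy example: project a few feature vectors onto their top principal direction and onto an arbitrary fixed direction, then compare the variance each projection retains. This is a sketch of the general PCA idea only; the power-iteration routine, the function names, and the toy features are assumptions and not the paper's algorithm.

```python
import math


def top_principal_direction(rows, iters=200):
    """Power iteration on the covariance of `rows` (equal-length feature vectors)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d  # deterministic start vector
    for _ in range(iters):
        # Apply the covariance implicitly: C v = X^T (X v) / n
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) / n for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered


def retained_variance(centered, direction):
    """Variance of the centered features after projecting onto a unit `direction`."""
    scores = [sum(c[j] * direction[j] for j in range(len(direction))) for c in centered]
    return sum(s * s for s in scores) / len(scores)


# Toy activations: most of the variation lies roughly along the (1, 1, 0) direction.
features = [[1.0, 1.1, 0.0], [2.0, 1.9, 0.1], [3.0, 3.2, 0.0], [4.0, 3.8, -0.1]]
pca_dir, centered = top_principal_direction(features)
arbitrary_dir = [0.0, 0.0, 1.0]  # stands in for an uninformed, data-agnostic projection
print(retained_variance(centered, pca_dir), retained_variance(centered, arbitrary_dir))
```

In this toy case the principal direction retains far more variance than the arbitrary one. A rank-r projector would repeat the idea for the top r directions.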
[AI-2] Parking Availability Prediction via Fusing Multi-Source Data with A Self-Supervised Learning Enhanced Spatio-Temporal Inverted Transformer
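【Quick Read】: This paper addresses urban parking difficulties worsened by the rapid growth of private car ownership, where the core challenge is modeling spatio-temporal dependencies accurately and fusing multi-source traffic data to improve prediction accuracy. The key innovation is the SST-iTransformer model. First, K-means clustering defines Parking Cluster Zones (PCZs), integrating demand features from metro, bus, online ride-hailing, and taxi. Second, on top of the vanilla iTransformer, a masking-reconstruction self-supervised pretext task is added for spatio-temporal representation learning, along with a dual-branch attention mechanism: Series Attention captures long-term temporal dependencies via patching operations, and Channel Attention models cross-variate interactions through inverted dimensions. On real-world data from Chengdu, the method outperforms several baseline deep learning models, achieving the lowest Mean Squared Error (MSE) and a competitive Mean Absolute Error (MAE). Ablation studies show that ride-hailing data contributes the largest gain, and spatial correlation analysis confirms that historical data from correlated parking lots within the same PCZ is essential for modeling spatial dependencies.
Link: https://arxiv.org/abs/2509.04362
Authors: Yin Huang,Yongqi Dong,Youhua Tang,Li Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 25 pages, 5 figures, under review for journal publication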
Abstract:The rapid growth of private car ownership has worsened the urban parking predicament, underscoring the need for accurate and effective parking availability prediction to support urban planning and management. To address key limitations in modeling spatio-temporal dependencies and exploiting multi-source data for parking availability prediction, this study proposes a novel approach with SST-iTransformer. The methodology leverages K-means clustering to establish parking cluster zones (PCZs), extracting and integrating traffic demand characteristics from various transportation modes (i.e., metro, bus, online ride-hailing, and taxi) associated with the targeted parking lots. Upgraded on vanilla iTransformer, SST-iTransformer integrates masking-reconstruction-based pretext tasks for self-supervised spatio-temporal representation learning, and features an innovative dual-branch attention mechanism: Series Attention captures long-term temporal dependencies via patching operations, while Channel Attention models cross-variate interactions through inverted dimensions. Extensive experiments using real-world data from Chengdu, China, demonstrate that SST-iTransformer outperforms baseline deep learning models (including Informer, Autoformer, Crossformer, and iTransformer), achieving state-of-the-art performance with the lowest mean squared error (MSE) and competitive mean absolute error (MAE). Comprehensive ablation studies quantitatively reveal the relative importance of different data sources: incorporating ride-hailing data provides the largest performance gains, followed by taxi, whereas fixed-route transit features (bus/metro) contribute marginally. Spatial correlation analysis further confirms that excluding historical data from correlated parking lots within PCZs leads to substantial performance degradation, underscoring the importance of modeling spatial dependencies.
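The two error metrics reported in the abstract, MSE and MAE, differ in how they weight large misses. The short sketch below computes both on made-up numbers; the function names and the sample values are purely illustrative and have no connection to the Chengdu data.

```python
def mse(y_true, y_pred):
    """Mean squared error: penalizes large misses quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: penalizes every miss in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical free-space counts for one parking lot over four time steps.
observed = [12, 8, 5, 9]
predicted = [10, 8, 6, 13]
print(mse(observed, predicted))  # 5.25
print(mae(observed, predicted))  # 1.75
```

The single miss of 4 spaces contributes 16 of the 21 units of squared error, which is why a model can rank first on MSE while being only competitive on MAE.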
[AI-3] AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds
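【Quick Read】: This paper addresses the weak generalization of current deepfake audio detection methods in real-world settings. The core challenge is domain shift between training and test samples, caused mainly by the diversity of human speech and fast-evolving speech synthesis systems. Existing datasets lack diverse, up-to-date samples in both real and fake categories and so cannot support the development of robust models. The key to the solution is AUDETER (AUdio DEepfake TEst Range), a large-scale, highly diverse deepfake audio dataset containing over 4,500 hours of audio, 3 million clips in total, generated by 11 recent Text-to-Speech (TTS) models and 10 vocoders and covering a broad range of TTS/vocoder patterns. Models trained on AUDETER generalize markedly better, reducing the detection error rate by 44.1% to 51.6% and reaching an error rate of only 4.17% on cross-domain samples from the In-the-Wild dataset.
Link: https://arxiv.org/abs/2509.04345
Authors: Qizhou Wang,Hanxun Huang,Guansong Pang,Sarah Erfani,Christopher Leckie
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: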
Abstract:Speech generation systems can produce remarkably realistic vocalisations that are often indistinguishable from human speech, posing significant authenticity challenges. Although numerous deepfake detection methods have been developed, their effectiveness in real-world environments remains unreliable due to the domain shift between training and test samples arising from diverse human speech and fast-evolving speech synthesis systems. This is not adequately addressed by current datasets, which lack real-world application challenges with diverse and up-to-date audios in both real and deep-fake categories. To fill this gap, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale, highly diverse deepfake audio dataset for comprehensive evaluation and robust development of generalised models for deepfake audio detection. It consists of over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders with a broad range of TTS/vocoder patterns, totalling 3 million audio clips, making it the largest deepfake audio dataset by scale. Through extensive experiments with AUDETER, we reveal that i) state-of-the-art (SOTA) methods trained on existing datasets struggle to generalise to novel deepfake audio samples and suffer from high false positive rates on unseen human voice, underscoring the need for a comprehensive dataset; and ii) these methods trained on AUDETER achieve highly generalised detection performance and significantly reduce detection error rate by 44.1% to 51.6%, achieving an error rate of only 4.17% on diverse cross-domain samples in the popular In-the-Wild dataset, paving the way for training generalist deepfake audio detectors. AUDETER is available on GitHub.
[AI-4] Decoupled Entity Representation Learning for Pinterest Ads Ranking
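【Quick Read】: This paper addresses inadequate user and item (Pin) embedding representations in personalized recommendation, with the goal of improving ad retrieval and ranking on Pinterest. The core challenge is building high-quality user and item embeddings efficiently from diverse data sources and applying them across multiple downstream tasks, such as click-through rate (CTR) and conversion rate (CVR) prediction. The key to the solution is a framework following an upstream-downstream paradigm: upstream models with complex architectures are trained on large-scale heterogeneous data to learn entity embeddings, which are refreshed regularly rather than computed in real time to keep the system scalable. These static embeddings are fed as input features into multiple downstream tasks, allowing asynchronous interaction between upstream and downstream models and improving both offline and online metrics. The framework has been deployed in Pinterest's production ad ranking systems.
Link: https://arxiv.org/abs/2509.04337
Authors: Jie Liu,Yinrui Li,Jiankai Sun,Kungang Li,Han Sun,Sihan Wang,Huasen Wu,Siyuan Gao,Paulo Soares,Nan Li,Zhifang Liu,Haoyang Li,Siping Ji,Ling Leng,Prathibha Deshikachar
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: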
Abstract:In this paper, we introduce a novel framework following an upstream-downstream paradigm to construct user and item (Pin) embeddings from diverse data sources, which are essential for Pinterest to deliver personalized Pins and ads effectively. Our upstream models are trained on extensive data sources featuring varied signals, utilizing complex architectures to capture intricate relationships between users and Pins on Pinterest. To ensure scalability of the upstream models, entity embeddings are learned, and regularly refreshed, rather than real-time computation, allowing for asynchronous interaction between the upstream and downstream models. These embeddings are then integrated as input features in numerous downstream tasks, including ad retrieval and ranking models for CTR and CVR predictions. We demonstrate that our framework achieves notable performance improvements in both offline and online settings across various downstream tasks. This framework has been deployed in Pinterest’s production ad ranking systems, resulting in significant gains in online metrics.
[AI-5] Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes
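【Quick Read】: This paper addresses the performance degradation of the AlphaZero framework when the test environment differs from the training environment (distributional shift), that is, the restrictive assumption that the two are identical. The key to the solution is a set of simple modifications to the standard AlphaZero framework, including an adaptive policy adjustment mechanism and enhanced exploration, which significantly improve generalization in changed environments even with a low planning budget.
Link: https://arxiv.org/abs/2509.04317
Authors: Isidoro Tamassia,Wendelin Böhmer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: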
Abstract:The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how the combination of simple modifications to the standard framework can significantly boost performance, even in settings with a low planning budget available. The code is publicly available on GitHub.
[AI-6] EvoEmo: Towards Evolved Emotional Policies for LLM Agents in Multi-Turn Negotiation
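【Quick Read】: This paper addresses the fact that current Large Language Models (LLMs) overlook the functional role of emotion in multi-turn negotiation: existing approaches typically generate passive, preference-driven emotional responses that leave agents vulnerable to strategic manipulation by counterparts. The key to the solution is EvoEmo, a framework that models emotional state transitions as a Markov Decision Process and uses population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. The resulting dynamic, adaptive emotional expression yields higher success rates, higher efficiency, and greater buyer savings.
Link: https://arxiv.org/abs/2509.04310
Authors: Yunbo Long,Liming Xu,Lukas Beckenbauer,Yuhan Liu,Alexandra Brintrup
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: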
Abstract:Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi-turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines – vanilla strategies and fixed-emotion strategies – for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. These findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.
[AI-7] HumAIne-Chatbot: Real-Time Personalized Conversational AI via Reinforcement Learning
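【Quick Read】: This paper addresses the one-size-fits-all interaction pattern common in current conversational AI systems, which overlook individual user characteristics and lack adaptive dialogue management. The key to the solution is HumAIne-chatbot, an AI-driven conversational agent that personalizes responses through a novel user profiling framework. The system is first pre-trained on a large set of GPT-generated virtual personas to establish a prior over user types. During live interaction, an online reinforcement learning agent then refines each user's model by combining implicit signals (such as typing speed, sentiment, and engagement duration) with explicit feedback (such as likes and dislikes), and adjusts the dialogue policy accordingly to adapt content and style in real time.
Link: https://arxiv.org/abs/2509.04303
Authors: Georgios Makridis,Georgios Fragiadakis,Jorge Oliveira,Tomaz Saraiva,Philip Mavrepis,Georgios Fatouros,Dimosthenis Kyriazis
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, IEEE conference format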
Abstract:Current conversational AI systems often provide generic, one-size-fits-all interactions that overlook individual user characteristics and lack adaptive dialogue management. To address this gap, we introduce HumAIne-chatbot, an AI-driven conversational agent that personalizes responses through a novel user profiling framework. The system is pre-trained on a diverse set of GPT-generated virtual personas to establish a broad prior over user types. During live interactions, an online reinforcement learning agent refines per-user models by combining implicit signals (e.g., typing speed, sentiment, engagement duration) with explicit feedback (e.g., likes and dislikes). This profile dynamically informs the chatbot dialogue policy, enabling real-time adaptation of both content and style. To evaluate the system, we performed controlled experiments with 50 synthetic personas in multiple conversation domains. The results showed consistent improvements in user satisfaction, personalization accuracy, and task achievement when personalization features were enabled. Statistical analysis confirmed significant differences between personalized and nonpersonalized conditions, with large effect sizes across key metrics. These findings highlight the effectiveness of AI-driven user profiling and provide a strong foundation for future real-world validation.
[AI-8] Reinforcement Learning for Robust Ageing-Aware Control of Li-ion Battery Systems with Data-Driven Formal Verification
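Quick read: This paper addresses the trade-off between fast charging and ageing in lithium-ion (Li-ion) batteries, that is, how to raise charging speed while minimizing capacity loss. The key to its solution is a hybrid control strategy: Reinforcement Learning (RL) is used to design several individual controllers, and a data-driven abstraction partitions them into a switched structure that depends on the battery's initial output measurements. The discrete selection among controllers, coupled with the continuous battery dynamics, forms a hybrid system, and when a design meets the requirements the abstraction also provides probabilistic guarantees on closed-loop performance.
Link: https://arxiv.org/abs/2509.04288
Authors: Rudi Coppola,Hovsep Touloujian,Pierfrancesco Ombrini,Manuel Mazo Jr
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: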
Abstract:Rechargeable lithium-ion (Li-ion) batteries are a ubiquitous element of modern technology. In the last decades, the production and design of such batteries and their adjacent embedded charging and safety protocols, denoted by Battery Management Systems (BMS), has taken central stage. A fundamental challenge to be addressed is the trade-off between the speed of charging and the ageing behavior, resulting in the loss of capacity in the battery cell. We rely on a high-fidelity physics-based battery model and propose an approach to data-driven charging and safety protocol design. Following a Counterexample-Guided Inductive Synthesis scheme, we combine Reinforcement Learning (RL) with recent developments in data-driven formal methods to obtain a hybrid control strategy: RL is used to synthesise the individual controllers, and a data-driven abstraction guides their partitioning into a switched structure, depending on the initial output measurements of the battery. The resulting discrete selection among RL-based controllers, coupled with the continuous battery dynamics, realises a hybrid system. When a design meets the desired criteria, the abstraction provides probabilistic guarantees on the closed-loop performance of the cell.
[AI-9] An Empirical Study of Vulnerabilities in Python Packages and Their Detection
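Quick read: This paper addresses the limited effectiveness of current vulnerability detection tools for Python packages, driven in particular by the multi-language complexity that arises because Python is often developed alongside other languages, and by existing vulnerability datasets that lack systematic coverage, precision, and diversity. The key to its solution is PyVul, the first comprehensive benchmark suite of Python-package vulnerabilities, containing 1,157 publicly reported, developer-verified vulnerabilities with annotations at both commit level and function level. An LLM-assisted data cleansing method achieves 94% function-level and 100% commit-level accuracy, giving vulnerability detection tools a high-precision evaluation standard.
Link: https://arxiv.org/abs/2509.04260
Authors: Haowei Quan,Junjie Wang,Xinzhe Li,Terry Yue Zhuo,Xiao Chen,Xiaoning Du
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: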
Abstract:In the rapidly evolving software development landscape, Python stands out for its simplicity, versatility, and extensive ecosystem. Python packages, as units of organization, reusability, and distribution, have become a pressing concern, highlighted by the considerable number of vulnerability reports. As a scripting language, Python often cooperates with other languages for performance or interoperability. This adds complexity to the vulnerabilities inherent to Python packages, and the effectiveness of current vulnerability detection tools remains underexplored. This paper addresses these gaps by introducing PyVul, the first comprehensive benchmark suite of Python-package vulnerabilities. PyVul includes 1,157 publicly reported, developer-verified vulnerabilities, each linked to its affected packages. To accommodate diverse detection techniques, it provides annotations at both commit and function levels. An LLM-assisted data cleansing method is incorporated to improve label accuracy, achieving 100% commit-level and 94% function-level accuracy, establishing PyVul as the most precise large-scale Python vulnerability benchmark. We further carry out a distribution analysis of PyVul, which demonstrates that vulnerabilities in Python packages involve multiple programming languages and exhibit a wide variety of types. Moreover, our analysis reveals that multi-lingual Python packages are potentially more susceptible to vulnerabilities. Evaluation of state-of-the-art detectors using this benchmark reveals a significant discrepancy between the capabilities of existing tools and the demands of effectively identifying real-world security issues in Python packages. Additionally, we conduct an empirical review of the top-ranked CWEs observed in Python packages, to diagnose the fine-grained limitations of current detection tools and highlight the necessity for future advancements in the field.
[AI-10] Evaluating Quality of Gaming Narratives Co-created with AI
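Quick read: This paper addresses how to systematically evaluate the quality of game narratives produced by generative AI, in particular how to identify and prioritize the narrative dimensions that most affect player satisfaction in human-AI co-creation. The key to its solution is a structured evaluation framework that combines a literature review with the opinions of narrative design experts and maps story quality dimensions onto the Kano model, distinguishing basic, performance, and excitement needs, so that game developers get actionable guidance on priorities when co-creating narratives with generative AI.
Link: https://arxiv.org/abs/2509.04239
Authors: Arturo Valdivia,Paolo Burelli
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: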
Abstract:This paper proposes a structured methodology to evaluate AI-generated game narratives, leveraging the Delphi study structure with a panel of narrative design experts. Our approach synthesizes story quality dimensions from literature and expert insights, mapping them into the Kano model framework to understand their impact on player satisfaction. The results can inform game developers on prioritizing quality aspects when co-creating game narratives with generative AI.
[AI-11] Domain size asymptotics for Markov logic networks
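Quick read: This paper studies the asymptotic properties of the probability distributions defined by a Markov logic network (MLN) as the domain size tends to infinity. It analyzes three kinds of MLN and shows how the limit behavior of random structures differs among them: (1) quantifier-free MLNs with a single unary relation symbol, for which a fairly complete characterization of the limit behavior is given; (2) an MLN favoring graphs with fewer triangles (more generally, fewer k-cliques), from which a "δ-approximate 0-1 law" for first-order logic is derived; (3) an MLN favoring graphs with fewer vertices of degree above a fixed threshold, which shows that whether the soft-constraint weights influence the limit behavior depends on the form of the constraints. The key finding is that different soft constraints lead to markedly different limit behavior, and that on large domains the distribution defined by an MLN concentrates on a part of the space of possible worlds entirely different from the one favored by the uniform distribution. In addition, case (1) is used to show that quantifier-free MLNs and lifted Bayesian networks (in a broad sense) are asymptotically incomparable: there are sequences of distributions that one formalism can define and the other cannot even approximate.
Link: https://arxiv.org/abs/2509.04192
Authors: Vera Koponen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Logic (math.LO)
Comments: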
Abstract:A Markov logic network (MLN) determines a probability distribution on the set of structures, or "possible worlds", with an arbitrary finite domain. We study the properties of such distributions as the domain size tends to infinity. Three types of concrete examples of MLNs will be considered, and the properties of random structures with domain sizes tending to infinity will be studied: (1) Arbitrary quantifier-free MLNs over a language with only one relation symbol which has arity 1. In this case we give a pretty complete characterization of the possible limit behaviours of random structures. (2) An MLN that favours graphs with fewer triangles (or more generally, fewer k-cliques). As a corollary of the analysis a "δ-approximate 0-1 law" for first-order logic is obtained. (3) An MLN that favours graphs with fewer vertices with degree higher than a fixed (but arbitrary) number. The analysis shows that depending on which "soft constraints" an MLN uses the limit behaviour of random structures can be quite different, and the weights of the soft constraints may, or may not, have influence on the limit behaviour. It will also be demonstrated, using (1), that quantifier-free MLNs and lifted Bayesian networks (in a broad sense) are asymptotically incomparable, roughly meaning that there is a sequence of distributions on possible worlds with increasing domain sizes that can be defined by one of the formalisms but not even approximated by the other. In a rather general context it is also shown that on large domains the distribution determined by an MLN concentrates almost all its probability mass on a totally different part of the space of possible worlds than the uniform distribution does.
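For readers unfamiliar with MLNs, the usual textbook form of the distribution may help. The notation below (weights $w_i$, grounding counts $N_i$, triangle count $t$) is the standard log-linear formulation and is an assumption here; the paper's exact formalization may differ in details.

```latex
% Standard log-linear form of an MLN with soft constraints \varphi_1,\dots,\varphi_m
% and real weights w_1,\dots,w_m, on structures \mathcal{A} with domain \{1,\dots,n\}.
% N_i(\mathcal{A}) is the number of groundings of \varphi_i that hold in \mathcal{A}.
\[
  P_n(\mathcal{A}) \;=\; \frac{1}{Z_n}\,
  \exp\Big(\sum_{i=1}^{m} w_i\, N_i(\mathcal{A})\Big),
  \qquad
  Z_n \;=\; \sum_{\mathcal{A}'} \exp\Big(\sum_{i=1}^{m} w_i\, N_i(\mathcal{A}')\Big).
\]
% Illustration of example (2): a single soft constraint that penalizes triangles.
% With t(G) the number of triangles in the graph G and a penalty w > 0,
\[
  P_n(G) \;=\; \frac{1}{Z_n}\, e^{-w\, t(G)} .
\]
% Setting w = 0 gives the uniform distribution on graphs with n vertices.
```

The asymptotic questions in the abstract concern how $P_n$ behaves as $n \to \infty$, for instance whether the probability of a first-order sentence converges.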
[AI-12] Attention as an Adaptive Filter
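Quick read: This paper addresses the fact that conventional attention does not explicitly model the intrinsic dynamics of the input sequence. Existing methods usually compute attention weights by comparing queries and keys directly, ignoring any underlying temporal or state evolution. The authors propose Adaptive Filter Attention (AFA), whose core idea is to embed a learnable dynamics model in the computation of attention weights: the input sequence is treated as discrete observations of a linear stochastic differential equation (SDE), and, with simultaneously diagonalizable state matrices and noise covariances, a closed-form solution of the differential Lyapunov equation is used to propagate pairwise uncertainties efficiently. Attention weights are then interpreted as robust residual-based reweightings of the propagated pairwise precisions. With an additional constraint on the eigenvalues, AFA matches standard attention in computational and memory complexity, and in the limit it reduces to ordinary dot-product attention.
Link: https://arxiv.org/abs/2509.04154
Authors: Peter Racioppo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: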
Abstract:We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By imposing a linear dynamics model with simultaneously diagonalizable state matrices and noise covariances, we can make use of a closed-form solution to the differential Lyapunov equation to efficiently propagate pairwise uncertainties through the dynamics. Attention naturally arises as the maximum likelihood solution for this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated pairwise precisions. Imposing an additional constraint on the state matrix’s eigenvalues leads to a simplified variant with the same computational and memory complexity as standard attention. In the limit of vanishing dynamics and process noise, and using a small-angle approximation, we recover ordinary dot-product attention.
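The closed-form solution mentioned in the abstract is easiest to see in the scalar case. The derivation below is a textbook result for a one-dimensional linear SDE, with symbols ($a$, $\sigma$, $p$, $\lambda_i$, $q_i$) chosen for illustration and real eigenvalues assumed; it is not the paper's own notation or its full pairwise construction.

```latex
% Scalar linear SDE and its variance p(t) = Var(x_t):
\[
  dx_t = a\,x_t\,dt + \sigma\,dW_t ,
  \qquad
  \dot p(t) = 2a\,p(t) + \sigma^2 \quad \text{(differential Lyapunov equation)} .
\]
% Closed-form solution for a \neq 0, and its limit as a \to 0:
\[
  p(t) = e^{2at}\,p(0) + \frac{\sigma^2}{2a}\big(e^{2at} - 1\big),
  \qquad
  \lim_{a \to 0} p(t) = p(0) + \sigma^2 t .
\]
% If the state matrix and noise covariance are diagonal in a common basis,
% with real eigenvalues \lambda_i and noise intensities q_i, each mode decouples:
\[
  p_i(t) = e^{2\lambda_i t}\,p_i(0) + \frac{q_i}{2\lambda_i}\big(e^{2\lambda_i t} - 1\big),
  \qquad
  \text{precision}_i(t) = \frac{1}{p_i(t)} .
\]
```

Because each mode has an explicit formula, the uncertainty between any two time steps can be evaluated directly rather than by integrating a matrix ODE, which is what makes pairwise propagation affordable.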
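[AI-13] TAGAL: Tabular Data Generation using Agentic LLM Methods
Quick read: This paper addresses how to use Large Language Models (LLMs) to generate high-quality synthetic tabular data without additional training, in order to improve downstream machine learning tasks. The key to its solution is TAGAL, a collection of methods built on an agentic workflow: an LLM-driven, automatic, iterative process uses feedback to keep improving the generated data, and can bring in external knowledge to make the data more plausible. Experiments show that TAGAL performs on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches, supporting the potential of agentic workflows for LLM-based data generation.
Link: https://arxiv.org/abs/2509.04152
Authors: Benoît Ronval,Pierre Dupont,Siegfried Nijssen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: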
Abstract:The generation of data is a common approach to improve the performance of machine learning tasks, among which is the training of models for classification. In this paper, we present TAGAL, a collection of methods able to generate synthetic tabular data using an agentic workflow. The methods leverage Large Language Models (LLMs) for an automatic and iterative process that uses feedback to improve the generated data without any further LLM training. The use of LLMs also allows for the addition of external knowledge in the generation process. We evaluate TAGAL across diverse datasets and different aspects of quality for the generated data. We look at the utility of downstream ML models, both by training classifiers on synthetic data only and by combining real and synthetic data. Moreover, we compare the similarities between the real and the generated data. We show that TAGAL is able to perform on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches. These findings highlight the potential of agentic workflow and open new directions for LLM-based data generation methods.
[AI-14] Enhancing Technical Documents Retrieval for RAG
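Quick read: This paper addresses the insufficient efficiency and accuracy of semantic retrieval over technical documentation, especially in hardware and software development, where conventional embedding models struggle to capture fine-grained semantic relations between user intent and document context. The key to its solution is the Technical-Embeddings framework, built on two mechanisms: first, Large Language Models (LLMs) expand user queries to better express intent and to diversify the training data; second, summary extraction encodes the key contextual information of each document to improve its representation. The method further fine-tunes a bi-encoder BERT model with soft prompting, using separate learnable parameters for queries and for document context to capture finer semantic differences. Experiments on two public datasets, RAG-EDA and Rust-Docs-QA, show that it significantly outperforms baseline models in both precision and recall, offering an effective path for Retrieval-Augmented Generation (RAG) systems in engineering.
Link: https://arxiv.org/abs/2509.04139
Authors: Songjiang Lai,Tsun-Hin Cheung,Ka-Chun Fung,Kaiwen Xue,Kwan-Ho Lin,Yan-Ming Choi,Vincent Ng,Kin-Man Lam
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: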
Abstract:In this paper, we introduce Technical-Embeddings, a novel framework designed to optimize semantic retrieval in technical documentation, with applications in both hardware and software development. Our approach addresses the challenges of understanding and retrieving complex technical content by leveraging the capabilities of Large Language Models (LLMs). First, we enhance user queries by generating expanded representations that better capture user intent and improve dataset diversity, thereby enriching the fine-tuning process for embedding models. Second, we apply summary extraction techniques to encode essential contextual information, refining the representation of technical documents. To further enhance retrieval performance, we fine-tune a bi-encoder BERT model using soft prompting, incorporating separate learning parameters for queries and document context to capture fine-grained semantic nuances. We evaluate our approach on two public datasets, RAG-EDA and Rust-Docs-QA, demonstrating that Technical-Embeddings significantly outperforms baseline models in both precision and recall. Our findings highlight the effectiveness of integrating query expansion and contextual summarization to enhance information access and comprehension in technical domains. This work advances the state of Retrieval-Augmented Generation (RAG) systems, offering new avenues for efficient and accurate technical document retrieval in engineering and product development workflows.
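To make the evaluation terms concrete, here is a minimal sketch of bi-encoder style ranking (query and documents embedded separately, scored by dot product) followed by precision and recall at a cutoff k. The function names, the toy vectors, and the use of a cutoff are assumptions for illustration; the abstract does not state which cutoffs or scoring function the authors use.

```python
from typing import Dict, List, Sequence, Set, Tuple


def rank_by_dot(query_vec: Sequence[float],
                doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Rank document ids by dot product with the query embedding (highest first)."""
    def score(doc_id: str) -> float:
        return sum(q * d for q, d in zip(query_vec, doc_vecs[doc_id]))
    return sorted(doc_vecs, key=score, reverse=True)


def precision_recall_at_k(ranked_ids: List[str],
                          relevant_ids: Set[str],
                          k: int) -> Tuple[float, float]:
    """Precision@k = hits / k, recall@k = hits / number of relevant documents."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy 2-d embeddings standing in for encoder outputs (hypothetical values).
    docs = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.5, 0.5]}
    ranking = rank_by_dot([1.0, 0.0], docs)
    print(ranking, precision_recall_at_k(ranking, {"a", "b"}, k=2))
```

In a real pipeline the vectors would come from the fine-tuned query and document encoders, and the metrics would be averaged over all test queries.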
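[AI-15] The human biological advantage over AI
Quick read: The core question of this paper is whether AI, once it reaches artificial general intelligence (AGI), would be qualified to succeed humans, that is, to take on leadership of the universe. The key to its answer is that the fundamental difference between humans and AI is not the brain but the central nervous system (CNS), which lets humans perceive physical reality immersively and experience emotions such as pain, joy, suffering, and love, and so fully appreciate the consequences of their actions; this is the basis for building sustainable ethical systems. Hence, even if future AI attains consciousness, without a biological CNS it would still lack the ethical understanding that leadership of the universe requires; the foundation for leadership remains DNA, not silicon.
Link: https://arxiv.org/abs/2509.04130
Authors: William Stewart
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 12 pages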
Abstract:Recent advances in AI raise the possibility that AI systems will one day be able to do anything humans can do, only better. If artificial general intelligence (AGI) is achieved, AI systems may be able to understand, reason, problem solve, create, and evolve at a level and speed that humans will increasingly be unable to match, or even understand. These possibilities raise a natural question as to whether AI will eventually become superior to humans, a successor “digital species”, with a rightful claim to assume leadership of the universe. However, a deeper consideration suggests the overlooked differentiator between human beings and AI is not the brain, but the central nervous system (CNS), providing us with an immersive integration with physical reality. It is our CNS that enables us to experience emotion including pain, joy, suffering, and love, and therefore to fully appreciate the consequences of our actions on the world around us. And that emotional understanding of the consequences of our actions is what is required to be able to develop sustainable ethical systems, and so be fully qualified to be the leaders of the universe. A CNS cannot be manufactured or simulated; it must be grown as a biological construct. And so, even the development of consciousness will not be sufficient to make AI systems superior to humans. AI systems may become more capable than humans on almost every measure and transform our society. However, the best foundation for leadership of our universe will always be DNA, not silicon.
[AI-16] Simplicity Lies in the Eye of the Beholder: A Strategic Perspective on Controllers in Reactive Synthesis
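Quick read: This paper addresses the complexity of strategies in controller synthesis, in particular how to weigh the simplicity of a strategy against its effectiveness in a game-theoretic framework. Its central concern is that, although simple strategies (such as finite-memory or deterministic strategies) are commonly believed to be easier to design, implement, and maintain, whether this intuition holds across synthesis settings needs rigorous analysis. The key to its approach is to survey recent theoretical results on the memory and randomness that strategies require, and to look at dimensions beyond traditional complexity measures (such as memory size or number of random bits), giving controller design a more complete theoretical basis and practical guidance.
Link: https://arxiv.org/abs/2509.04129
Authors: Mickael Randour
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Probability (math.PR)
Comments: Invited paper at RP 2025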
Abstract:In the game-theoretic approach to controller synthesis, we model the interaction between a system to be controlled and its environment as a game between these entities, and we seek an appropriate (e.g., winning or optimal) strategy for the system. This strategy then serves as a formal blueprint for a real-world controller. A common belief is that simple (e.g., using limited memory) strategies are better: corresponding controllers are easier to conceive and understand, and cheaper to produce and maintain. This invited contribution focuses on the complexity of strategies in a variety of synthesis contexts. We discuss recent results concerning memory and randomness, and take a brief look at what lies beyond our traditional notions of complexity for strategies.
[AI-17] Analysis of Bluffing by DQN and CFR in Leduc Holdem Poker
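Quick read: This paper asks whether computer poker algorithms exhibit human-like bluffing during play; existing work mostly focuses on performance metrics such as win rate and overlooks this element of the game. The key to its approach is an experiment in which a reinforcement-learning-based DQN agent and a game-theory-based CFR agent play against each other in Leduc Hold'em, a simplified poker variant, while their actions are logged to analyze bluffing patterns. Both algorithms are found to bluff, and although they attempt bluffs at different rates, the share of successful bluffs is roughly the same, suggesting that bluffing is an essential aspect of the game rather than of a particular algorithm.
Link: https://arxiv.org/abs/2509.04125
Authors: Tarik Zaciragic,Aske Plaat,K. Joost Batenburg
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: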
Abstract:In the game of poker, being unpredictable, or bluffing, is an essential skill. When humans play poker, they bluff. However, most works on computer-poker focus on performance metrics such as win rates, while bluffing is overlooked. In this paper we study whether two popular algorithms, DQN (based on reinforcement learning) and CFR (based on game theory), exhibit bluffing behavior in Leduc Hold’em, a simplified version of poker. We designed an experiment where we let the DQN and CFR agent play against each other while we log their actions. We find that both DQN and CFR exhibit bluffing behavior, but they do so in different ways. Although both attempt to perform bluffs at different rates, the percentage of successful bluffs (where the opponent folds) is roughly the same. This suggests that bluffing is an essential aspect of the game, not of the algorithm. Future work should look at different bluffing styles and at the full game of poker. Code at this https URL.
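The two quantities the abstract compares, how often each agent attempts a bluff and how often those bluffs succeed, can be computed from an action log along the following lines. The log schema and the bluff definition used here (raising while holding the lowest card, a jack, without a pair) are assumptions made for illustration; the paper's own definition may differ. Only the success criterion (the opponent folds) comes from the abstract.

```python
from collections import defaultdict
from typing import Dict, List


def is_bluff(entry: dict) -> bool:
    """Hypothetical bluff definition: a raise with the weakest holding."""
    return (entry["action"] == "raise"
            and entry["card"] == "J"
            and not entry["has_pair"])


def bluff_stats(log: List[dict]) -> Dict[str, dict]:
    """Per agent: number of bluff attempts and share that made the opponent fold."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for entry in log:
        if not is_bluff(entry):
            continue
        attempts[entry["agent"]] += 1
        if entry["opponent_response"] == "fold":
            successes[entry["agent"]] += 1
    return {
        agent: {
            "attempts": attempts[agent],
            "success_rate": successes[agent] / attempts[agent],
        }
        for agent in attempts
    }
```

Attempt rate would additionally need the number of opportunities per agent (for example, decisions made while holding a jack), which this sketch leaves out.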
[AI-18] Hybrid Reinforcement Learning and Search for Flight Trajectory Planning
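Quick read: This paper addresses how to quickly recompute an optimal flight path for an airliner in an emergency, shortening the time needed for path optimization. The key to its solution is to combine Reinforcement Learning (RL) with search-based path planning: an RL agent is trained to pre-compute near-optimal paths from location and atmospheric data, and at runtime these paths are passed as constraints to the underlying path planning solver, which shrinks the search space and speeds up the solve. Experiments show that, although global optimality is not guaranteed, fuel consumption typically deviates by less than 1% from that of an unconstrained solver, while computation speed improves by up to 50%.
Link: https://arxiv.org/abs/2509.04100
Authors: Alberto Luise,Michele Lombardi,Florent Teichteil Koenigsbuch
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: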
Abstract:This paper explores the combination of Reinforcement Learning (RL) and search-based path planners to speed up the optimization of flight paths for airliners, where in case of emergency a fast route re-calculation can be crucial. The fundamental idea is to train an RL Agent to pre-compute near-optimal paths based on location and atmospheric data and use those at runtime to constrain the underlying path planning solver and find a solution within a certain distance from the initial guess. The approach effectively reduces the size of the solver’s search space, significantly speeding up route optimization. Although global optimality is not guaranteed, empirical results conducted with Airbus aircraft’s performance models show that fuel consumption remains nearly identical to that of an unconstrained solver, with deviations typically within 1%. At the same time, computation speed can be improved by up to 50% as compared to using a conventional solver alone.
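The mechanism of constraining a solver to stay within a certain distance of an initial guess can be shown on a toy grid. In the sketch below, plain Dijkstra search stands in for the path planning solver and a hand-written path stands in for the RL agent's output; the grid, the cost model, and the Chebyshev-distance corridor are all assumptions, not the paper's flight model.

```python
import heapq
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]


def corridor(guess_path: List[Cell], radius: int, rows: int, cols: int) -> Set[Cell]:
    """All grid cells within Chebyshev distance `radius` of the guess path."""
    cells: Set[Cell] = set()
    for r, c in guess_path:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    cells.add((nr, nc))
    return cells


def dijkstra(cost: List[List[float]], start: Cell, goal: Cell,
             allowed: Optional[Set[Cell]] = None):
    """Cheapest 4-connected path; entering a cell costs cost[r][c].

    If `allowed` is given, the search never leaves that set of cells.
    Returns (path or None, total cost, number of expanded cells).
    """
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    expanded = 0
    while heap:
        d, cell = heapq.heappop(heap)
        if d > dist.get(cell, float("inf")):
            continue  # stale queue entry
        expanded += 1
        if cell == goal:
            break
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if allowed is not None and nxt not in allowed:
                continue
            nd = d + cost[nr][nc]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = cell
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None, float("inf"), expanded
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal], expanded
```

Calling `dijkstra(cost, start, goal, allowed=corridor(guess, 1, rows, cols))` expands fewer cells than the unconstrained call whenever the corridor excludes part of the grid. The price is the one the abstract names: if the true optimum lies outside the corridor, the constrained search cannot find it.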
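[AI-19] Intermediate Languages Matter: Formal Languages and LLMs affect Neurosymbolic Reasoning
Quick read: This paper addresses a key gap in improving neurosymbolic large language model (LLM) reasoning: the factors behind its success are not yet well understood. It shows that one previously overlooked factor is the choice of formal language. The key to its approach is to pose the "intermediate language challenge" and, by systematically comparing four formal languages across three datasets and seven LLMs, to show that the choice of formal language affects both syntactic and semantic reasoning capabilities, with effects that vary across LLMs.
Link: https://arxiv.org/abs/2509.04083
Authors: Alexander Beiser,David Penz,Nysret Musliu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: To appear in the proceedings of The Second Workshop on Knowledge Graphs and Neurosymbolic AI (KG-NeSy) Co-located with SEMANTiCS 2025 Conference, Vienna, Austria - September 3rd, 2025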
Abstract:Large language models (LLMs) achieve astonishing results on a wide range of tasks. However, their formal reasoning ability still lags behind. A promising approach is Neurosymbolic LLM reasoning. It works by using LLMs as translators from natural to formal languages and symbolic solvers for deriving correct results. Still, the contributing factors to the success of Neurosymbolic LLM reasoning remain unclear. This paper demonstrates that one previously overlooked factor is the choice of the formal language. We introduce the intermediate language challenge: selecting a suitable formal language for neurosymbolic reasoning. By comparing four formal languages across three datasets and seven LLMs, we show that the choice of formal language affects both syntactic and semantic reasoning capabilities. We also discuss the varying effects across different LLMs.
[AI-20] RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models EMNLP2025
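Quick read: This paper addresses the fact that current evaluations of Large Language Models (LLMs) on code debugging mainly target function-level repair and neglect the more complex, more realistic repository-level setting, leaving an incomplete picture of LLM debugging ability in real environments. The key to its solution is RepoDebug, a multi-task, multi-language repository-level code debugging dataset covering 22 error subtypes, 8 commonly used programming languages, and 3 debugging tasks, which enables a more systematic evaluation of LLM debugging performance in complex engineering scenarios.
Link: https://arxiv.org/abs/2509.04078
Authors: Jingjing Liu,Zeming Liu,Zihao Cheng,Mengliang He,Xiaoming Shi,Yuhang Guo,Xiangrong Zhu,Yuanfang Guo,Yunhong Wang,Haifeng Wang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 30 pages, 12 figures, EMNLP 2025 Findings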
Abstract:Large Language Models (LLMs) have exhibited significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce the time consumption of developers and enhance their efficiency. Significant advancements in debugging datasets have been made to promote the development of code debugging. However, these datasets primarily focus on assessing the LLM’s function-level code repair capabilities, neglecting the more complex and realistic repository-level scenarios, which leads to an incomplete understanding of the LLM’s challenges in repository-level debugging. While several repository-level datasets have been proposed, they often suffer from limitations such as limited diversity of tasks, languages, and error types. To mitigate this challenge, this paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset with 22 subtypes of errors that supports 8 commonly used programming languages and 3 debugging tasks. Furthermore, we conduct evaluation experiments on 10 LLMs, where Claude 3.5 Sonnet, the best-performing model, still cannot perform well in repository-level debugging.
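[AI-21] Keypoint-based Diffusion for Robotic Motion Planning on the NICOL Robot ICANN2025
Quick read: This paper addresses the long computation times of traditional numerical planning methods in robotic motion planning. The key to its solution is a diffusion-based generative AI model that learns from a dataset produced by traditional planners to predict actions efficiently. Even without point cloud encodings as input, the model reaches a success rate of up to 90% collision-free solutions at a runtime an order of magnitude lower than that of numerical methods.
Link: https://arxiv.org/abs/2509.04076
Authors: Lennart Clasmeier,Jan-Gerrit Habekost,Connor Gäde,Philipp Allgeuer,Stefan Wermter
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to ICANN 2025 Special Session on Neural Robotics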
Abstract:We propose a novel diffusion-based action model for robotic motion planning. Commonly, established numerical planning approaches are used to solve general motion planning problems, but have significant runtime requirements. By leveraging the power of deep learning, we are able to achieve good results in a much smaller runtime by learning from a dataset generated by these planners. While our initial model uses point cloud embeddings in the input to predict keypoint-based joint sequences in its output, we observed in our ablation study that it remained challenging to condition the network on the point cloud embeddings. We identified some biases in our dataset and refined it, which improved the model’s performance. Our model, even without the use of the point cloud encodings, outperforms numerical models by an order of magnitude regarding the runtime, while reaching a success rate of up to 90% of collision free solutions on the test set.
[AI-22] Oruga: An Avatar of Representational Systems Theory
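Quick read: This paper addresses how to give machines a human-like ability to use representations flexibly, so that they are more compatible with human use. Machines currently lack the capacity for cross-domain transfer and creative transformation of representations, whereas humans draw diagrams, change representations, and use analogies to transfer knowledge across domains. Building on their earlier Representational Systems Theory (RST), the authors design and implement a system called Oruga. It consists of a core of data structures corresponding to RST concepts, a language for communicating with that core, and an engine that transforms representations automatically using a method called "structure transfer", emulating the flexible human handling of representations.
Link: https://arxiv.org/abs/2509.04041
Authors: Daniel Raggi,Gem Stapleton,Mateja Jamnik,Aaron Stockdill,Grecia Garcia Garcia,Peter C-H. Cheng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: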
Abstract:Humans use representations flexibly. We draw diagrams, change representations and exploit creative analogies across different domains. We want to harness this kind of power and endow machines with it to make them more compatible with human use. Previously we developed Representational Systems Theory (RST) to study the structure and transformations of representations. In this paper we present Oruga (caterpillar in Spanish; a symbol of transformation), an implementation of various aspects of RST. Oruga consists of a core of data structures corresponding to concepts in RST, a language for communicating with the core, and an engine for producing transformations using a method we call structure transfer. In this paper we present an overview of the core and language of Oruga, with a brief example of the kind of transformation that structure transfer can execute.
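[AI-23] AutoPBO: LLM-powered Optimization for Local Search PBO Solvers
Quick read: This paper addresses the fact that designing local search solvers for Pseudo-Boolean Optimization (PBO) depends on expert experience and manual tuning, which limits automation and generality. The key to its solution is AutoPBO, an automatic enhancement framework based on Large Language Models (LLMs) that uses LLM-driven strategy generation and optimization to improve the internal heuristics of PBO local search solvers without manual intervention. Experiments show that AutoPBO significantly improves on previous local search approaches on several public benchmarks while remaining competitive with state-of-the-art solvers, supporting the use of LLMs in automated algorithm design.
Link: https://arxiv.org/abs/2509.04007
Authors: Jinyuan Li,Yi Chu,Yiwen Sun,Mengchuan Zou,Shaowei Cai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: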
Abstract:Pseudo-Boolean Optimization (PBO) provides a powerful framework for modeling combinatorial problems through pseudo-Boolean (PB) constraints. Local search solvers have shown excellent performance in PBO solving, and their efficiency is highly dependent on their internal heuristics to guide the search. Still, their design often requires significant expert effort and manual tuning in practice. While Large Language Models (LLMs) have demonstrated potential in automating algorithm design, their application to optimizing PBO solvers remains unexplored. In this work, we introduce AutoPBO, a novel LLM-powered framework to automatically enhance PBO local search solvers. We conduct experiments on a broad range of four public benchmarks, including one real-world benchmark, a benchmark from PB competition, an integer linear programming optimization benchmark, and a crafted combinatorial benchmark, to evaluate the performance improvement achieved by AutoPBO and compare it with six state-of-the-art competitors, including two local search PBO solvers NuPBO and OraSLS, two complete PB solvers PBO-IHS and RoundingSat, and two mixed integer programming (MIP) solvers Gurobi and SCIP. AutoPBO demonstrates significant improvements over previous local search approaches, while maintaining competitive performance compared to state-of-the-art competitors. The results suggest that AutoPBO offers a promising approach to automating local search solver design.
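For context, a PBO instance in its usual normalized form is written out below, together with a generic weighted-penalty score of the kind a local search heuristic might use to guide variable flips. The notation is standard, but the score is only an illustrative assumption; it is not claimed to be the heuristic used by AutoPBO or by any of the solvers named in the abstract.

```latex
% Pseudo-Boolean optimization over x \in \{0,1\}^n with integer coefficients.
% Each literal \ell_{ij} is either x_j or its negation \bar{x}_j = 1 - x_j.
\[
  \min_{x \in \{0,1\}^n} \; \sum_{j=1}^{n} c_j\, x_j
  \qquad \text{subject to} \qquad
  \sum_{j=1}^{n} a_{ij}\, \ell_{ij} \;\ge\; b_i , \quad i = 1,\dots,m .
\]
% Illustrative score for local search (assumption): weighted constraint violation
% plus the objective, with constraint weights \omega_i \ge 0 and objective weight \omega_0 \ge 0.
\[
  \mathrm{score}(x) \;=\; \sum_{i=1}^{m} \omega_i \,
  \max\Big(0,\; b_i - \sum_{j=1}^{n} a_{ij}\, \ell_{ij}\Big)
  \;+\; \omega_0 \sum_{j=1}^{n} c_j\, x_j .
\]
```

How such weights are updated and which variable is flipped next are exactly the kind of internal heuristics the abstract says usually need expert tuning.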
[AI-24] Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
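Quick read: This paper addresses the fact that Large Language Model (LLM) agents perform well on single tasks but commonly suffer from repeated failures, inefficient exploration, and weak cross-task adaptability. Existing reflective strategies (e.g., Reflexion, ReAct) improve per-episode behavior, but the reflections they produce are usually ephemeral and task-specific and hard to reuse; reinforcement-learning-based methods can yield transferable policies but require substantial parameter updates and compute. The key to its solution is a hybrid framework, Meta-Policy Reflexion (MPR), which consolidates LLM-generated reflections into a predicate-like Meta-Policy Memory (MPM) and applies it at inference time through two complementary mechanisms: soft memory-guided decoding and hard rule admissibility checks (HAC). MPR externalizes reusable corrective knowledge without updating model weights, reduces unsafe or invalid actions through domain constraints, and keeps the flexibility of language-based reflection, improving execution accuracy and robustness.
Link: https://arxiv.org/abs/2509.03990
Authors: Chunlong Wu,Zhibo Qu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: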
Abstract:Large language model (LLM) agents achieve impressive single-task performance but commonly exhibit repeated failures, inefficient exploration, and limited cross-task adaptability. Existing reflective strategies (e.g., Reflexion, ReAct) improve per-episode behavior but typically produce ephemeral, task-specific traces that are not reused across tasks. Reinforcement-learning based alternatives can produce transferable policies but require substantial parameter updates and compute. In this work we introduce Meta-Policy Reflexion (MPR): a hybrid framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms: soft memory-guided decoding and hard rule admissibility checks (HAC). MPR (i) externalizes reusable corrective knowledge without model weight updates, (ii) enforces domain constraints to reduce unsafe or invalid actions, and (iii) retains the adaptability of language-based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text-based agent environment following the experimental protocol described in the provided implementation (AlfWorld-based). Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability. We analyze mechanisms that explain these gains, discuss scalability and failure modes, and outline future directions for multimodal and multi-agent extensions.
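The split between soft guidance and hard checks can be sketched as follows. Each rule carries a text form, which could be placed in the prompt as soft memory, and a predicate, which filters candidate actions before execution. The `Rule` class, the state layout, and the example rule are hypothetical; the paper's actual MPM representation and decoding algorithm are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    """One memory entry: text for soft guidance, predicate for the hard check."""
    text: str
    forbids: Callable[[dict, str], bool]  # True if `action` is inadmissible in `state`


def memory_prompt(rules: List[Rule]) -> str:
    """Soft use: render the rules as text to prepend to the agent's prompt."""
    return "\n".join(f"- {rule.text}" for rule in rules)


def admissible(state: dict, action: str, rules: List[Rule]) -> bool:
    """Hard use: an action passes only if no rule forbids it."""
    return not any(rule.forbids(state, action) for rule in rules)


def choose_action(state: dict, ranked_candidates: List[str],
                  rules: List[Rule]) -> Optional[str]:
    """Take the highest-ranked candidate that passes the admissibility check."""
    for action in ranked_candidates:
        if admissible(state, action, rules):
            return action
    return None  # caller decides what to do when everything is rejected


# Hypothetical rule for a text-based household environment.
already_open = Rule(
    text="Do not open a receptacle that is already open.",
    forbids=lambda state, action: (
        action.startswith("open ") and action[len("open "):] in state["open"]
    ),
)
```

In this sketch the candidates would come from the LLM, ranked by its preference, and the check rejects invalid ones without any change to model weights.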
[AI-25] NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models
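Quick read: This paper addresses the threat that jailbreak attacks pose to safety-aligned Large Language Models (LLMs): adversarial prompts bypass the model's safety mechanisms and elicit illegal or unethical content. To strengthen robustness against such attacks, the paper presents the NeuroBreak system, whose key idea is top-down, neuron-level analysis combined with layer-wise representation probing, exposing from the inside the safety mechanisms and weak points in the model's decision-making. The system supports analysis of critical neurons from both semantic and functional perspectives and is validated through quantitative evaluation and case studies, providing interpretable, mechanistic insights for next-generation defense strategies.
Link: https://arxiv.org/abs/2509.03985
Authors: Chuhan Zhang,Ye Zhang,Bowen Shi,Yuyou Gan,Tianyu Du,Shouling Ji,Dazhan Deng,Yingcai Wu
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures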
Abstract:In deployment and application, large language models (LLMs) typically undergo safety alignment to prevent illegal and unethical outputs. However, the continuous advancement of jailbreak attack techniques, designed to bypass safety mechanisms with adversarial prompts, has placed increasing pressure on the security defenses of LLMs. Strengthening resistance to jailbreak attacks requires an in-depth understanding of the security mechanisms and vulnerabilities of LLMs. However, the vast number of parameters and complex structure of LLMs make analyzing security weaknesses from an internal perspective a challenging task. This paper presents NeuroBreak, a top-down jailbreak analysis system designed to analyze neuron-level safety mechanisms and mitigate vulnerabilities. We carefully design system requirements through collaboration with three experts in the field of AI security. The system provides a comprehensive analysis of various jailbreak attack methods. By incorporating layer-wise representation probing analysis, NeuroBreak offers a novel perspective on the model’s decision-making process throughout its generation steps. Furthermore, the system supports the analysis of critical neurons from both semantic and functional perspectives, facilitating a deeper exploration of security mechanisms. We conduct quantitative evaluations and case studies to verify the effectiveness of our system, offering mechanistic insights for developing next-generation defense strategies against evolving jailbreak attacks.
[AI-26] World Model Implanting for Test-time Adaptation of Embodied Agents
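【Quick Read】: This paper addresses the difficulty embodied AI agents have in adapting robustly to novel domains, especially without extensive data collection or retraining. The key is the World Model Implanting framework (WorMI), which uses test-time composition to combine the reasoning capabilities of large language models (LLMs) with independently learned, domain-specific world models. Its core contributions are a prototype-based world model retrieval method, which matches trajectory-based abstract representations to bring in relevant world models efficiently, and a world-wise compound attention mechanism, which fuses knowledge from multiple world models and aligns their intermediate representations with those of the reasoning model, enabling cross-domain adaptation without changing the main policy. The authors report improved zero-shot and few-shot performance in unseen domains.
Link: https://arxiv.org/abs/2509.03956
Authors: Minjong Yoo,Jinwoo Jang,Sihyung Yoon,Honguk Woo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: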
Abstract:In embodied AI, a persistent challenge is enabling agents to robustly adapt to novel domains without requiring extensive data collection or retraining. To address this, we present a world model implanting framework (WorMI) that combines the reasoning capabilities of large language models (LLMs) with independently learned, domain-specific world models through test-time composition. By allowing seamless implantation and removal of the world models, the embodied agent’s policy achieves and maintains cross-domain adaptability. In the WorMI framework, we employ a prototype-based world model retrieval approach, utilizing efficient trajectory-based abstract representation matching, to incorporate relevant models into test-time composition. We also develop a world-wise compound attention method that not only integrates the knowledge from the retrieved world models but also aligns their intermediate representations with the reasoning model’s representation within the agent’s policy. This framework design effectively fuses domain-specific knowledge from multiple world models, ensuring robust adaptation to unseen domains. We evaluate WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero-shot and few-shot performance compared to several LLM-based approaches across a range of unseen domains. These results highlight the framework’s potential for scalable, real-world deployment in embodied agent scenarios where adaptability and data efficiency are essential.
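The abstract does not spell out how prototype-based retrieval is computed. As a rough illustration only, the sketch below assumes each world model is summarized by one prototype vector and that a trajectory has already been embedded into the same space; retrieval then reduces to a top-k cosine-similarity lookup. The names `retrieve_world_models`, `prototypes`, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_world_models(query, prototypes, top_k=2):
    """Return the names of the top_k prototypes most similar to the query embedding."""
    ranked = sorted(prototypes, key=lambda name: cosine(query, prototypes[name]), reverse=True)
    return ranked[:top_k]


# Hypothetical prototype vectors, one per domain-specific world model.
prototypes = {
    "kitchen": [1.0, 0.0, 0.0],
    "garden": [0.0, 1.0, 0.0],
    "office": [0.9, 0.1, 0.0],
}
trajectory_embedding = [1.0, 0.05, 0.0]  # assumed to come from some trajectory encoder
selected = retrieve_world_models(trajectory_embedding, prototypes)
```

Only the selected world models would then take part in test-time composition; how they are fused (the compound attention step) is outside the scope of this sketch.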
[AI-27] Handling Infinite Domain Parameters in Planning Through Best-First Search with Delayed Partial Expansions IJCAI2025
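【Quick Read】: This paper tackles the inefficient handling of control parameters in automated planning: existing approaches typically treat control parameters as embedded constraints rather than as true decision points in the search space, which makes search less efficient. The key idea is a best-first heuristic search algorithm that explicitly treats control parameters as decision points within a systematic search scheme and explores the infinite decision space through delayed partial expansion; the authors prove a notion of completeness in the limit under certain conditions.
Link: https://arxiv.org/abs/2509.03953
Authors: Ángel Aso-Mollar,Diego Aineto,Enrico Scala,Eva Onaindia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC); Systems and Control (eess.SY)
Comments: To appear in the Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025)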
Abstract:In automated planning, control parameters extend standard action representations through the introduction of continuous numeric decision variables. Existing state-of-the-art approaches have primarily handled control parameters as embedded constraints alongside other temporal and numeric restrictions, and thus have implicitly treated them as additional constraints rather than as decision points in the search space. In this paper, we propose an efficient alternative that explicitly handles control parameters as true decision points within a systematic search scheme. We develop a best-first, heuristic search algorithm that operates over infinite decision spaces defined by control parameters and prove a notion of completeness in the limit under certain conditions. Our algorithm leverages the concept of delayed partial expansion, where a state is not fully expanded but instead incrementally expands a subset of its successors. Our results demonstrate that this novel search algorithm is a competitive alternative to existing approaches for solving planning problems involving control parameters.
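To make the idea of delayed partial expansion concrete, here is a minimal sketch on a toy problem, not the authors' algorithm. It assumes a state is a number, each action adds a control value d in (0, 1], and the infinite set of d values is enumerated lazily as dyadic fractions. When a node is popped, only a small batch of successors is generated, and the node is pushed back with a growing penalty so that later successors are still reachable. The function names, the `batch` and `delay` parameters, and the penalty rule are all illustrative assumptions.

```python
import heapq
import itertools


def dyadic_values():
    """Lazily enumerate a dense sequence in (0, 1]: 1, 1/2, 1/4, 3/4, 1/8, ..."""
    yield 1.0
    denom = 2
    while True:
        for num in range(1, denom, 2):
            yield num / denom
        denom *= 2


def plan_with_delayed_partial_expansion(goal, batch=2, delay=0.5, tol=1e-9, max_pops=10_000):
    """Reach `goal` (assumed >= 0) from 0 by repeatedly adding a control value d in (0, 1]."""
    tie = itertools.count()  # unique tiebreaker so the heap never compares generators
    # Entry: (priority, tiebreak, state, plan, lazy successor stream, rounds expanded)
    open_list = [(abs(goal), next(tie), 0.0, [], dyadic_values(), 0)]
    for _ in range(max_pops):
        if not open_list:
            return None
        _, _, state, plan, stream, rounds = heapq.heappop(open_list)
        if abs(goal - state) <= tol:
            return plan
        # Partial expansion: materialise only the next `batch` successors.
        for d in itertools.islice(stream, batch):
            child = state + d
            if child <= goal + tol:  # prune overshoots
                f = len(plan) + 1 + (goal - child)  # g = steps taken, h = remaining distance
                heapq.heappush(open_list, (f, next(tie), child, plan + [d], dyadic_values(), 0))
        # Delayed re-insertion: the node stays open, but each round adds a larger penalty.
        base_f = len(plan) + (goal - state)
        heapq.heappush(
            open_list,
            (base_f + delay * (rounds + 1), next(tie), state, plan, stream, rounds + 1),
        )
    return None


plan = plan_with_delayed_partial_expansion(2.75)
```

With a goal of 2.75, the first batch of each stream (1 and 1/2) can only produce multiples of 0.5, so a solution exists only because some node is re-expanded later and yields 1/4 or 3/4. That is the behaviour the re-insertion step is meant to illustrate.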
[AI-28] A Foundation Model for Chest X-ray Interpretation with Grounded Reasoning via Online Reinforcement Learning
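【Quick Read】: This paper addresses the lack of transparent reasoning and locally grounded interpretability in current medical foundation models (FMs) for chest X-ray (CXR) interpretation, which limits their clinical deployment. The key is DeepMedix-R1, an end-to-end medical FM for CXR interpretation trained with a three-stage pipeline: it is first fine-tuned on curated CXR instruction data to acquire basic interpretation capabilities, then exposed to high-quality synthetic reasoning samples to enable cold-start reasoning, and finally refined with online reinforcement learning to improve both grounded reasoning quality and generation performance. As a result, the model outputs not only an answer but also reasoning steps tied to local image regions, improving clinical interpretability and practicality.
Link: https://arxiv.org/abs/2509.03906
Authors: Qika Lin,Yifan Zhu,Bin Pu,Ling Huang,Haoran Luo,Jingying Ma,Zhen Peng,Tianzhe Zhao,Fangzhi Xu,Jian Zhang,Kai He,Zhonghong Ou,Swapnil Mishra,Mengling Feng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages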
Abstract:Medical foundation models (FMs) have shown tremendous promise amid the rapid advancements in artificial intelligence (AI) technologies. However, current medical FMs typically generate answers in a black-box manner, lacking transparent reasoning processes and locally grounded interpretability, which hinders their practical clinical deployments. To this end, we introduce DeepMedix-R1, a holistic medical FM for chest X-ray (CXR) interpretation. It leverages a sequential training pipeline: initially fine-tuned on curated CXR instruction data to equip with fundamental CXR interpretation capabilities, then exposed to high-quality synthetic reasoning samples to enable cold-start reasoning, and finally refined via online reinforcement learning to enhance both grounded reasoning quality and generation performance. Thus, the model produces both an answer and reasoning steps tied to the image’s local regions for each query. Quantitative evaluation demonstrates substantial improvements in report generation (e.g., 14.54% and 31.32% over LLaVA-Rad and MedGemma) and visual question answering (e.g., 57.75% and 23.06% over MedGemma and CheXagent) tasks. To facilitate robust assessment, we propose Report Arena, a benchmarking framework using advanced language models to evaluate answer quality, further highlighting the superiority of DeepMedix-R1. Expert review of generated reasoning steps reveals greater interpretability and clinical plausibility compared to the established Qwen2.5-VL-7B model (0.7416 vs. 0.2584 overall preference). Collectively, our work advances medical FM development toward holistic, transparent, and clinically actionable modeling for CXR interpretation.
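[AI-29] FaMA: LLM-Empowered Agentic Assistant for Consumer-to-Consumer Marketplace
【Quick Read】: This paper addresses the complexity of user interaction on consumer-to-consumer (C2C) e-commerce platforms, in particular the high-friction workflows buyers and sellers face in graphical user interfaces (GUIs), such as updating listings, sending bulk messages, and discovering products. The key is an agentic assistant powered by a large language model (LLM), the Facebook Marketplace Assistant (FaMA), which uses natural language understanding to offer a conversational entry point and turns traditional GUI operations into lightweight, intuitive interactions with an AI agent, improving task efficiency and user experience. Experiments show that FaMA achieves a 98% success rate on complex tasks and up to a 2x speedup in interaction time.
Link: https://arxiv.org/abs/2509.03890
Authors: Yineng Yan,Xidong Wang,Jin Seng Cheng,Ran Hu,Wentao Guan,Nahid Farahmand,Hengte Lin,Yue Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: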
Abstract:The emergence of agentic AI, powered by Large Language Models (LLMs), marks a paradigm shift from reactive generative systems to proactive, goal-oriented autonomous agents capable of sophisticated planning, memory, and tool use. This evolution presents a novel opportunity to address long-standing challenges in complex digital environments. Core tasks on Consumer-to-Consumer (C2C) e-commerce platforms often require users to navigate complex Graphical User Interfaces (GUIs), making the experience time-consuming for both buyers and sellers. This paper introduces a novel approach to simplify these interactions through an LLM-powered agentic assistant. This agent functions as a new, conversational entry point to the marketplace, shifting the primary interaction model from a complex GUI to an intuitive AI agent. By interpreting natural language commands, the agent automates key high-friction workflows. For sellers, this includes simplified updating and renewal of listings, and the ability to send bulk messages. For buyers, the agent facilitates a more efficient product discovery process through conversational search. We present the architecture for Facebook Marketplace Assistant (FaMA), arguing that this agentic, conversational paradigm provides a lightweight and more accessible alternative to traditional app interfaces, allowing users to manage their marketplace activities with greater efficiency. Experiments show FaMA achieves a 98% task success rate on solving complex tasks on the marketplace and enables up to a 2x speedup on interaction time.
[AI-30] Reactive In-Air Clothing Manipulation with Confidence-Aware Dense Correspondence and Visuotactile Affordance
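【Quick Read】: This paper addresses the challenges of clothing manipulation: complex configurations, variable material dynamics, and frequent self-occlusion. Prior methods typically flatten garments or assume that key features are visible, which limits their use in real-world settings. The key is a dual-arm visuotactile framework that combines confidence-aware dense visual correspondence with tactile-supervised grasp affordance estimation. First, a correspondence model trained with a distributional loss captures cloth symmetries and produces confidence estimates, which guide a reactive state machine that adapts the folding strategy to perceptual uncertainty. Second, a visuotactile grasp affordance network, self-supervised with high-resolution tactile feedback, identifies physically graspable regions, and the same tactile classifier validates grasps in real time during execution. By deferring action in low-confidence states, the system handles highly occluded table-top and in-air configurations. The task-agnostic grasp selection module is demonstrated on folding and hanging tasks, and the dense descriptors provide a reusable intermediate representation for other planning modalities, such as extracting grasp targets from human video demonstrations, toward more generalizable and scalable garment manipulation.
Link: https://arxiv.org/abs/2509.03889
Authors: Neha Sunil,Megha Tippur,Arnau Saumell,Edward Adelson,Alberto Rodriguez
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at CoRL 2025. Project website: this https URL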
Abstract:Manipulating clothing is challenging due to complex configurations, variable material dynamics, and frequent self-occlusion. Prior systems often flatten garments or assume visibility of key features. We present a dual-arm visuotactile framework that combines confidence-aware dense visual correspondence and tactile-supervised grasp affordance to operate directly on crumpled and suspended garments. The correspondence model is trained on a custom, high-fidelity simulated dataset using a distributional loss that captures cloth symmetries and generates correspondence confidence estimates. These estimates guide a reactive state machine that adapts folding strategies based on perceptual uncertainty. In parallel, a visuotactile grasp affordance network, self-supervised using high-resolution tactile feedback, determines which regions are physically graspable. The same tactile classifier is used during execution for real-time grasp validation. By deferring action in low-confidence states, the system handles highly occluded table-top and in-air configurations. We demonstrate our task-agnostic grasp selection module in folding and hanging tasks. Moreover, our dense descriptors provide a reusable intermediate representation for other planning modalities, such as extracting grasp targets from human video demonstrations, paving the way for more generalizable and scalable garment manipulation.
[AI-31] Peptidomic-Based Prediction Model for Coronary Heart Disease Using a Multilayer Perceptron Neural Network
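【Quick Read】: This paper addresses non-invasive diagnosis of coronary heart disease (CHD), with the aim of lowering healthcare costs and improving early detection. The key is a classification model based on a multilayer perceptron (MLP) neural network that takes as input 50 key urinary peptide biomarkers selected via genetic algorithms and is trained on data balanced with the Synthetic Minority Over-sampling Technique (SMOTE). With three hidden layers of 60 neurons each, the model achieves a precision, sensitivity, and specificity of 95.67% and an area under the ROC curve (AUC) of 0.9748, suggesting a reliable method with clinical potential.
Link: https://arxiv.org/abs/2509.03884
Authors: Jesus Celis-Porras
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures, Submitted to arXiv for public dissemination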
Abstract:Coronary heart disease (CHD) is a leading cause of death worldwide and contributes significantly to annual healthcare expenditures. To develop a non-invasive diagnostic approach, we designed a model based on a multilayer perceptron (MLP) neural network, trained on 50 key urinary peptide biomarkers selected via genetic algorithms. Treatment and control groups, each comprising 345 individuals, were balanced using the Synthetic Minority Over-sampling Technique (SMOTE). The neural network was trained using a stratified validation strategy. Using a network with three hidden layers of 60 neurons each and an output layer of two neurons, the model achieved a precision, sensitivity, and specificity of 95.67 percent, with an F1-score of 0.9565. The area under the ROC curve (AUC) reached 0.9748 for both classes, while the Matthews correlation coefficient (MCC) and Cohen’s kappa coefficient were 0.9134 and 0.9131, respectively, demonstrating its reliability in detecting CHD. These results indicate that the model provides a highly accurate and robust non-invasive diagnostic tool for coronary heart disease.
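The reported metrics can be sanity-checked against each other. On a perfectly balanced test set where sensitivity and specificity both equal s, the Matthews correlation coefficient and Cohen's kappa both reduce to 2s − 1, so s = 0.9567 gives 0.9134, which matches the reported MCC. The sketch below verifies this with a hypothetical confusion matrix; the counts are chosen only to reproduce the rates and are not the paper's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    p_observed = (tp + tn) / n
    # Chance agreement: product of predicted and actual marginals, summed over both classes.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical balanced confusion matrix with sensitivity = specificity = 0.9567.
metrics = binary_metrics(tp=9567, fp=433, tn=9567, fn=433)
```

Under these assumptions both `metrics["mcc"]` and `metrics["kappa"]` come out at 0.9134. The abstract reports a kappa of 0.9131; the small gap may reflect rounding or a confusion matrix that is not exactly symmetric, which the abstract alone does not settle.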
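[AI-32] Expedition Expansion: Leveraging Semantic Representations for Goal-Directed Exploration in Continuous Cellular Automata
【Quick Read】: This paper addresses the difficulty of discovering diverse visual patterns in continuous cellular automata (CA): the behavioral space is vast, high-dimensional, and redundant, so traditional Novelty Search stalls once local novelty is exhausted and fails to reach distant, unexplored regions. The key is a hybrid strategy, Expedition and Expansion (EE), which alternates between local novelty-driven expansions and goal-directed expeditions. During expeditions, a Vision-Language Model (VLM) generates linguistic goals, that is, descriptions of hypothetical patterns that are meaningful to human perception, which steer the search toward uncharted regions. By evaluating novelty and generating goals in a semantic space, EE improves the interpretability and relevance of the discovered behaviors, increases exploration diversity, and shows that solutions originating from expeditions act as stepping stones for long-term exploration.
Link: https://arxiv.org/abs/2509.03863
Authors: Sina Khajehabdollahi,Gautier Hamon,Marko Cvjetko,Pierre-Yves Oudeyer,Clément Moulin-Frier,Cédric Colas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: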
Abstract:Discovering diverse visual patterns in continuous cellular automata (CA) is challenging due to the vastness and redundancy of high-dimensional behavioral spaces. Traditional exploration methods like Novelty Search (NS) expand locally by mutating known novel solutions but often plateau when local novelty is exhausted, failing to reach distant, unexplored regions. We introduce Expedition and Expansion (EE), a hybrid strategy where exploration alternates between local novelty-driven expansions and goal-directed expeditions. During expeditions, EE leverages a Vision-Language Model (VLM) to generate linguistic goals–descriptions of interesting but hypothetical patterns that drive exploration toward uncharted regions. By operating in semantic spaces that align with human perception, EE both evaluates novelty and generates goals in conceptually meaningful ways, enhancing the interpretability and relevance of discovered behaviors. Tested on Flow Lenia, a continuous CA known for its rich, emergent behaviors, EE consistently uncovers more diverse solutions than existing exploration methods. A genealogical analysis further reveals that solutions originating from expeditions disproportionately influence long-term exploration, unlocking new behavioral niches that serve as stepping stones for subsequent search. These findings highlight EE’s capacity to break through local novelty boundaries and explore behavioral landscapes in human-aligned, interpretable ways, offering a promising template for open-ended exploration in artificial life and beyond.
[AI-33] Continuous Monitoring of Large-Scale Generative AI via Deterministic Knowledge Graph Structures
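【Quick Read】: This paper addresses reliability problems of generative AI models in practice, including hallucinations, semantic drift, and inherent biases. Because these models usually operate as black boxes, current evaluation relies mainly on subjective human judgment, which is hard to make scalable, transparent, and efficient. The key is a framework that compares two knowledge graphs (KGs) built in parallel: a deterministic KG constructed with rule-based methods, predefined ontologies, domain-specific dictionaries, and structured entity-relation extraction rules, and an LLM-generated KG derived dynamically from real-time text streams such as news articles. The framework computes structural deviations between the two using metrics such as Instantiated Class Ratio (ICR), Instantiated Property Ratio (IPR), and Class Instantiation (CI), and sets dynamic anomaly thresholds from the historical distribution of these metrics, so that semantic anomalies or hallucinations are flagged proactively and monitored in real time. The result is a scalable, structured, metric-driven approach to reliability evaluation.
Link: https://arxiv.org/abs/2509.03857
Authors: Kishor Datta Gupta,Mohd Ariful Haque,Hasmot Ali,Marufa Kamal,Syed Bahauddin Alam,Mohammad Ashiqur Rahman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: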
Abstract:Generative AI (GEN AI) models have revolutionized diverse application domains but present substantial challenges due to reliability concerns, including hallucinations, semantic drift, and inherent biases. These models typically operate as black-boxes, complicating transparent and objective evaluation. Current evaluation methods primarily depend on subjective human assessment, limiting scalability, transparency, and effectiveness. This research proposes a systematic methodology using deterministic and Large Language Model (LLM)-generated Knowledge Graphs (KGs) to continuously monitor and evaluate GEN AI reliability. We construct two parallel KGs: (i) a deterministic KG built using explicit rule-based methods, predefined ontologies, domain-specific dictionaries, and structured entity-relation extraction rules, and (ii) an LLM-generated KG dynamically derived from real-time textual data streams such as live news articles. Utilizing real-time news streams ensures authenticity, mitigates biases from repetitive training, and prevents adaptive LLMs from bypassing predefined benchmarks through feedback memorization. To quantify structural deviations and semantic discrepancies, we employ several established KG metrics, including Instantiated Class Ratio (ICR), Instantiated Property Ratio (IPR), and Class Instantiation (CI). An automated real-time monitoring framework continuously computes deviations between deterministic and LLM-generated KGs. By establishing dynamic anomaly thresholds based on historical structural metric distributions, our method proactively identifies and flags significant deviations, thus promptly detecting semantic anomalies or hallucinations. This structured, metric-driven comparison between deterministic and dynamically generated KGs delivers a robust and scalable evaluation framework.
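The monitoring loop described above can be sketched in a few lines. The version below assumes ICR and IPR are the fractions of ontology classes and properties that have at least one instance in a graph, and it assumes a simple threshold rule (mean plus or minus k standard deviations of past deviations). The abstract says only that thresholds are derived from historical metric distributions, so the threshold rule, the function names, and the toy data are illustrative assumptions.

```python
from statistics import mean, stdev


def instantiated_class_ratio(ontology_classes, instance_types):
    """Share of ontology classes that have at least one instance in the graph."""
    used = {t for t in instance_types if t in ontology_classes}
    return len(used) / len(ontology_classes)


def instantiated_property_ratio(ontology_properties, triples):
    """Share of ontology properties that appear in at least one (s, p, o) triple."""
    used = {p for _, p, _ in triples if p in ontology_properties}
    return len(used) / len(ontology_properties)


def is_anomalous(history, current, k=3.0):
    """Flag `current` if it lies more than k standard deviations from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) > k * sigma


# Hypothetical ontology and the entity types found by each pipeline on the same article.
classes = {"Person", "Organization", "Place", "Event"}
deterministic_types = ["Person", "Organization", "Place"]
llm_types = ["Person", "Alien"]  # "Alien" is not in the ontology and is ignored

deviation = abs(
    instantiated_class_ratio(classes, deterministic_types)
    - instantiated_class_ratio(classes, llm_types)
)
past_deviations = [0.05, 0.04, 0.06, 0.05, 0.05]  # made-up history for illustration
flagged = is_anomalous(past_deviations, deviation)
```

The same comparison would be repeated per metric and per time window; a production system would also need a policy for short histories, where the standard deviation is unreliable.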
[AI-34] MillGNN: Learning Multi-Scale Lead-Lag Dependencies for Multi-Variate Time Series Forecasting CIKM2025
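【Quick Read】: This paper addresses a common gap in multi-variate time series (MTS) forecasting: existing methods capture intra- and inter-variate dependencies well but often overlook lead-lag dependencies at multiple grouping scales, and so fail to capture hierarchical lead-lag effects in complex systems. The key is MillGNN, with two core innovations: (1) a scale-specific lead-lag graph learning module that combines cross-correlation coefficients with dynamic decaying features derived from real-time inputs and time lags to learn the lead-lag dependencies at each scale, offering both statistical interpretability and data-driven flexibility; and (2) a hierarchical lead-lag message passing module that passes lead-lag messages across multiple grouping scales in a structured way, propagating intra- and inter-scale lead-lag effects simultaneously to model multi-scale lead-lag effects efficiently and comprehensively.
Link: https://arxiv.org/abs/2509.03852
Authors: Binqing Wu,Zongjiang Shang,Jianlong Huang,Ling Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by CIKM 2025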
Abstract:Multi-variate time series (MTS) forecasting is crucial for various applications. Existing methods have shown promising results owing to their strong ability to capture intra- and inter-variate dependencies. However, these methods often overlook lead-lag dependencies at multiple grouping scales, failing to capture hierarchical lead-lag effects in complex systems. To this end, we propose MillGNN, a novel graph neural network-based method that learns multiple grouping scale lead-lag dependencies for MTS forecasting, which can comprehensively capture lead-lag effects considering variate-wise and group-wise dynamics and decays. Specifically, MillGNN introduces two key innovations: (1) a scale-specific lead-lag graph learning module that integrates cross-correlation coefficients and dynamic decaying features derived from real-time inputs and time lags to learn lead-lag dependencies for each scale, which can model evolving lead-lag dependencies with statistical interpretability and data-driven flexibility; (2) a hierarchical lead-lag message passing module that passes lead-lag messages at multiple grouping scales in a structured way to simultaneously propagate intra- and inter-scale lead-lag effects, which can capture multi-scale lead-lag effects with a balance of comprehensiveness and efficiency. Experimental results on 11 datasets demonstrate the superiority of MillGNN for long-term and short-term MTS forecasting, compared with 16 state-of-the-art methods.
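The statistical ingredient named in the abstract, cross-correlation across time lags, is easy to illustrate on its own. The sketch below computes the Pearson correlation between one series and a shifted copy of another and picks the lag with the strongest correlation. It covers only this classical step, not MillGNN's learned graph module or its decaying features, and the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(x, y, lag):
    """Correlation of x[t] with y[t + lag]; a positive lag means x leads y."""
    if lag >= 0:
        return pearson(x[: len(x) - lag], y[lag:])
    return pearson(x[-lag:], y[: len(y) + lag])


def strongest_lead_lag(x, y, max_lag):
    """Return (lag, correlation) for the lag in [-max_lag, max_lag] with the highest correlation."""
    best = max(range(-max_lag, max_lag + 1), key=lambda k: lagged_correlation(x, y, k))
    return best, lagged_correlation(x, y, best)


# Toy data: y is x delayed by three steps, so x should lead y by 3.
x = [math.sin(0.5 * t) + 0.3 * math.sin(1.3 * t) for t in range(60)]
y = [0.0] * 3 + x[:-3]
lag, corr = strongest_lead_lag(x, y, max_lag=5)
```

In a lead-lag graph, an edge from variate x to variate y could be weighted by such a correlation at the detected lag; how MillGNN combines this with learned features is described in the paper, not here.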
[AI-35] Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables AAAI2024
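【Quick Read】: This paper addresses the challenge of designing reward functions for multi-agent systems, particularly in real-world applications where agents have heterogeneous objectives and no prior knowledge is available. Existing inverse reinforcement learning (IRL) methods for mean field games (MFGs) assume homogeneous agents and therefore handle such cases poorly. The key is a deep latent variable MFG model with an associated IRL method that infers reward functions from demonstrations of structurally similar tasks with different objectives, without prior knowledge of the underlying contexts and without modifying the MFG model itself, improving the modeling of heterogeneous agent populations.
Link: https://arxiv.org/abs/2509.03845
Authors: Yang Chen,Xiao Lin,Bo Yan,Libo Zhang,Jiamou Liu,Neset Özkan Tan,Michael Witbrock
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Accepted to AAAI 2024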
Abstract:Designing suitable reward functions for numerous interacting intelligent agents is challenging in real-world applications. Inverse reinforcement learning (IRL) in mean field games (MFGs) offers a practical framework to infer reward functions from expert demonstrations. While promising, the assumption of agent homogeneity limits the capability of existing methods to handle demonstrations with heterogeneous and unknown objectives, which are common in practice. To this end, we propose a deep latent variable MFG model and an associated IRL method. Critically, our method can infer rewards from different yet structurally similar tasks without prior knowledge about underlying contexts or modifying the MFG model itself. Our experiments, conducted on simulated scenarios and a real-world spatial taxi-ride pricing problem, demonstrate the superiority of our approach over state-of-the-art IRL methods in MFGs.
[AI-36] INGRID: Intelligent Generative Robotic Design Using Large Language Models
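【Quick Read】: This paper addresses the way current robotic systems' reliance on conventional serial mechanisms constrains the development of robotic intelligence: the hardware architecture limits how far that intelligence can extend. The key is INGRID (Intelligent Generative Robotic Design), a framework that automates the design of parallel mechanisms through deep integration with reciprocal screw theory and kinematic synthesis methods. It decomposes the design process into four progressive tasks: constraint analysis, kinematic joint generation, chain construction, and complete mechanism design. This lets it generate novel parallel mechanisms with fixed or variable mobility. Case studies show that it can produce custom parallel robots for specific task requirements, decoupling robotic intelligence from hardware design and laying a foundation for mechanism intelligence.
Link: https://arxiv.org/abs/2509.03842
Authors: Guanglu Jia,Ceng Zhang,Gregory S. Chirikjian
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures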
Abstract:The integration of large language models (LLMs) into robotic systems has accelerated progress in embodied artificial intelligence, yet current approaches remain constrained by existing robotic architectures, particularly serial mechanisms. This hardware dependency fundamentally limits the scope of robotic intelligence. Here, we present INGRID (Intelligent Generative Robotic Design), a framework that enables the automated design of parallel robotic mechanisms through deep integration with reciprocal screw theory and kinematic synthesis methods. We decompose the design challenge into four progressive tasks: constraint analysis, kinematic joint generation, chain construction, and complete mechanism design. INGRID demonstrates the ability to generate novel parallel mechanisms with both fixed and variable mobility, discovering kinematic configurations not previously documented in the literature. We validate our approach through three case studies demonstrating how INGRID assists users in designing task-specific parallel robots based on desired mobility requirements. By bridging the gap between mechanism theory and machine learning, INGRID enables researchers without specialized robotics training to create custom parallel mechanisms, thereby decoupling advances in robotic intelligence from hardware constraints. This work establishes a foundation for mechanism intelligence, where AI systems actively design robotic hardware, potentially transforming the development of embodied AI systems.
[AI-37] From Leiden to Pleasure Island: The Constant Potts Model for Community Detection as a Hedonic Game
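【Quick Read】: This paper addresses the efficiency, robustness, and accuracy of community detection in complex networks. The core idea is to reinterpret the Constant Potts Model (CPM) as a potential hedonic game: the global Hamiltonian is decomposed into local utility functions so that each node's local optimization is aligned with the global objective. This equivalence guarantees that local optimization via better-response dynamics converges to an equilibrium partition in pseudo-polynomial time. The paper also introduces two stability criteria, a strict one and a relaxed one: the strict criterion requires each node to simultaneously maximize its neighbors and minimize its non-neighbors within its community, while the relaxed criterion uses a weighted sum of these two objectives controlled by a resolution parameter. Experiments show that, when the Leiden algorithm is initialized with partial ground-truth information, such robust partitions recover the ground-truth communities more accurately.
Link: https://arxiv.org/abs/2509.03834
Authors: Lucas Lopes Felipe,Konstantin Avrachenkov,Daniel Sadoc Menasche
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Manuscript submitted to Physica A: Statistical Mechanics and its Applications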
Abstract:Community detection is one of the fundamental problems in data science which consists of partitioning nodes into disjoint communities. We present a game-theoretic perspective on the Constant Potts Model (CPM) for partitioning networks into disjoint communities, emphasizing its efficiency, robustness, and accuracy. Efficiency: We reinterpret CPM as a potential hedonic game by decomposing its global Hamiltonian into local utility functions, where the local utility gain of each agent matches the corresponding increase in global utility. Leveraging this equivalence, we prove that local optimization of the CPM objective via better-response dynamics converges in pseudo-polynomial time to an equilibrium partition. Robustness: We introduce and relate two stability criteria: a strict criterion based on a novel notion of robustness, requiring nodes to simultaneously maximize neighbors and minimize non-neighbors within communities, and a relaxed utility function based on a weighted sum of these objectives, controlled by a resolution parameter. Accuracy: In community tracking scenarios, where initial partitions are used to bootstrap the Leiden algorithm with partial ground-truth information, our experiments reveal that robust partitions yield higher accuracy in recovering ground-truth communities.
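The potential-game reading can be made concrete with a small sketch. It assumes the CPM objective is written as a quality to maximize, Q = Σ_c [e_c − γ·n_c(n_c − 1)/2], where e_c counts internal edges and n_c nodes of community c, and it assumes a node's utility is its number of neighbors in its community minus γ per other member (equivalently, (1 − γ) times neighbors minus γ times non-neighbors). With these choices, a node's utility gain from moving equals the change in Q, so better-response moves can only increase Q. The code is an illustration under these assumptions, not the authors' implementation.

```python
def cpm_quality(adj, part, gamma):
    """Sum over communities of (internal edges - gamma * number of node pairs)."""
    members = {}
    for v, c in part.items():
        members.setdefault(c, set()).add(v)
    total = 0.0
    for nodes in members.values():
        internal = sum(1 for u in nodes for w in adj[u] if w in nodes) / 2
        n = len(nodes)
        total += internal - gamma * n * (n - 1) / 2
    return total


def utility(adj, part, v, c, gamma):
    """Payoff of v in community c: neighbours there minus gamma per other member."""
    others = [u for u, cu in part.items() if cu == c and u != v]
    neighbours = sum(1 for u in others if u in adj[v])
    return neighbours - gamma * len(others)


def better_response(adj, gamma, max_rounds=1000):
    """Let nodes move one at a time to a strictly better community until none can."""
    part = {v: i for i, v in enumerate(sorted(adj))}  # start from singletons
    for _ in range(max_rounds):
        moved = False
        for v in sorted(adj):
            current = utility(adj, part, v, part[v], gamma)
            fresh = max(part.values()) + 1  # label of an empty community
            options = sorted(set(part.values())) + [fresh]
            best = max(options, key=lambda c: utility(adj, part, v, c, gamma))
            if utility(adj, part, v, best, gamma) > current + 1e-12:
                part[v] = best
                moved = True
        if not moved:
            break
    return part


# Two triangles joined by a single edge (2-3), resolution parameter 0.5.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
partition = better_response(adj, gamma=0.5)
```

On this toy graph the dynamics settle on the two triangles as communities. Larger γ favors smaller, denser communities, which is the usual role of the resolution parameter in CPM.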
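[AI-38] Gravity Well Echo Chamber Modeling With An LLM-Based Confirmation Bias Model
【Quick Read】: This paper addresses the fact that existing models of echo chambers in social networks overlook individual confirmation bias, which makes their identification of information environments less accurate. The key is an extension of the established "gravity well" model with a dynamic confirmation bias variable. The variable quantifies each user's susceptibility to belief-reinforcing content by comparing the user's posting history with their responses to posts across a range of viewpoints, and it adjusts the strength of the pull accordingly. With this change, the model identifies echo chambers more accurately and reveals community-level markers of information health. It was validated on nineteen Reddit communities and offers a framework for flagging high-risk environments where misinformation is most often amplified.
Link: https://arxiv.org/abs/2509.03832
Authors: Joseph Jackson,Georgiy Lapin,Jeremy E. Thompson
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: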
Abstract:Social media echo chambers play a central role in the spread of misinformation, yet existing models often overlook the influence of individual confirmation bias. An existing model of echo chambers is the “gravity well” model, which draws an analogy between echo chambers and spatial gravity wells. We extend this established model by introducing a dynamic confirmation bias variable that adjusts the strength of pull based on a user’s susceptibility to belief-reinforcing content. This variable is calculated for each user through comparisons between their posting history and their responses to posts of a wide range of viewpoints. Incorporating this factor produces a confirmation-bias-integrated gravity well model that more accurately identifies echo chambers and reveals community-level markers of information health. We validated the approach on nineteen Reddit communities, demonstrating improved detection of echo chambers. Our contribution is a framework for systematically capturing the role of confirmation bias in online group dynamics, enabling more effective identification of echo chambers. By flagging these high-risk environments, the model supports efforts to curb the spread of misinformation at its most common points of amplification.
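[AI-39] An Agentic Model Context Protocol Framework for Medical Concept Standardization
【Quick Read】: This paper addresses a resource-intensive and error-prone step in standardizing health data with the OMOP common data model (CDM): mapping source medical terms to OMOP standard concepts. The key is a zero-training, hallucination-preventive mapping system based on the Model Context Protocol (MCP), which gives large language models (LLMs) a standardized and secure way to call external resources and tools. The system provides explainable mappings, real-time vocabulary lookups, and structured reasoning outputs, improves efficiency and accuracy, and is suitable for both exploratory and production environments.
Link: https://arxiv.org/abs/2509.03828
Authors: Jaerong Ahn,Andrew Wen,Nan Wang,Heling Jia,Zhiyi Yue,Sunyang Fu,Hongfang Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: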
Abstract:The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) provides a standardized representation of heterogeneous health data to support large-scale, multi-institutional research. One critical step in data standardization using OMOP CDM is the mapping of source medical terms to OMOP standard concepts, a procedure that is resource-intensive and error-prone. While large language models (LLMs) have the potential to facilitate this process, their tendency toward hallucination makes them unsuitable for clinical deployment without training and expert validation. Here, we developed a zero-training, hallucination-preventive mapping system based on the Model Context Protocol (MCP), a standardized and secure framework allowing LLMs to interact with external resources and tools. The system enables explainable mapping and significantly improves efficiency and accuracy with minimal effort. It provides real-time vocabulary lookups and structured reasoning outputs suitable for immediate use in both exploratory and production environments.
[AI-40] What Would an LLM Do? Evaluating Policymaking Capabilities of Large Language Models
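【Quick Read】: This paper examines how generative AI can support social policymaking, specifically for homelessness, a problem affecting over 150 million people worldwide, and evaluates whether large language models (LLMs) are aligned with domain experts and can offer actionable policy recommendations. The key is a benchmark of policy decision scenarios grounded in the Capability Approach and spanning four geographies (South Bend, USA; Barcelona, Spain; Johannesburg, South Africa; Macau SAR, China), together with an automated pipeline that connects the recommended policies to an agent-based model (ABM) so that their social impact can be assessed in simulated scenarios. The results suggest that, with responsible guardrails and local contextual calibration, LLMs can give human policymakers valuable insights in the form of alternative policies at scale.
Link: https://arxiv.org/abs/2509.03827
Authors: Pierre Le Coz,Jia An Liu,Debarun Bhattacharjya,Georgina Curto,Serge Stinckwich
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: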
Abstract:Large language models (LLMs) are increasingly being adopted in high-stakes domains. Their capacity to process vast amounts of unstructured data, explore flexible scenarios, and handle a diversity of contextual factors can make them uniquely suited to provide new insights for the complexity of social policymaking. This article evaluates whether LLMs are aligned with domain experts (and among themselves) to inform social policymaking on the subject of homelessness alleviation - a challenge affecting over 150 million people worldwide. We develop a novel benchmark comprised of decision scenarios with policy choices across four geographies (South Bend, USA; Barcelona, Spain; Johannesburg, South Africa; Macau SAR, China). The policies in scope are grounded in the conceptual framework of the Capability Approach for human development. We also present an automated pipeline that connects the benchmarked policies to an agent-based model, and we explore the social impact of the recommended policies through simulated social scenarios. The results reveal promising potential to leverage LLMs for social policymaking. If responsible guardrails and contextual calibrations are introduced in collaboration with local domain experts, LLMs can provide humans with valuable insights, in the form of alternative policies at scale.
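[AI-41] Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning
【Quick Read】: This paper addresses the limited effectiveness of multi-agent large language model (LLM) systems on complex reasoning tasks caused by fixed collaboration protocols. Existing frameworks usually focus on macro-level orchestration and overlook agents' internal cognitive states, such as uncertainty or confidence, so agents cannot adapt their strategy to those states. The key is the Meta-Policy Deliberation Framework (MPDF), in which agents learn a decentralized policy over a set of high-level meta-cognitive actions (Persist, Refine, and Concede), enabling reasoning that adapts to internal cognitive state. The paper also introduces SoftRankPO, a reinforcement learning algorithm that shapes advantages based on the rank of rewards mapped through smooth normal quantiles, which mitigates the training instability of traditional policy gradient methods in this setting and improves the reasoning accuracy of multi-agent systems.
Link: https://arxiv.org/abs/2509.03817
Authors: Wei Yang,Jesse Thomason
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: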
Abstract:Multi-agent systems of large language models (LLMs) show promise for complex reasoning, but their effectiveness is often limited by fixed collaboration protocols. These frameworks typically focus on macro-level orchestration while overlooking agents’ internal deliberative capabilities. This critical meta-cognitive blindspot treats agents as passive executors unable to adapt their strategy based on internal cognitive states like uncertainty or confidence. We introduce the Meta-Policy Deliberation Framework (MPDF), where agents learn a decentralized policy over a set of high-level meta-cognitive actions: Persist, Refine, and Concede. To overcome the instability of traditional policy gradients in this setting, we develop SoftRankPO, a novel reinforcement learning algorithm. SoftRankPO stabilizes training by shaping advantages based on the rank of rewards mapped through smooth normal quantiles, making the learning process robust to reward variance. Experiments show that MPDF with SoftRankPO achieves a 4-5% absolute gain in average accuracy across five mathematical and general reasoning benchmarks compared to six state-of-the-art heuristic and learning-based multi-agent reasoning algorithms. Our work presents a paradigm for learning adaptive, meta-cognitive policies for multi-agent LLM systems, shifting the focus from designing fixed protocols to learning dynamic, deliberative strategies.
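The abstract describes SoftRankPO only at a high level: advantages are shaped from the rank of rewards mapped through smooth normal quantiles. One plausible reading is a generic rank-to-normal-quantile transform, sketched below. It replaces each reward by the standard normal quantile of its (tie-averaged) rank, so the result depends only on the ordering of rewards and not on their scale. The exact SoftRankPO formula is not given in the abstract, so this is an assumption-laden illustration, and the function name is hypothetical.

```python
from statistics import NormalDist


def rank_quantile_advantages(rewards):
    """Map rewards to standard normal quantiles of their ranks (ties share the average rank)."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    advantages = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and rewards[order[end + 1]] == rewards[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based rank, averaged over the tie group
        quantile = (average_rank - 0.5) / n  # always strictly inside (0, 1)
        z = NormalDist().inv_cdf(quantile)
        for k in range(start, end + 1):
            advantages[order[k]] = z
        start = end + 1
    return advantages


# Two reward batches with the same ordering but very different scales.
small_scale = rank_quantile_advantages([0.1, 0.3, 0.2])
large_scale = rank_quantile_advantages([1.0, 100.0, 10.0])
```

Both batches yield the same advantages, which is the sense in which a rank-based shaping is insensitive to reward variance; an outlier reward of 100 has no more influence than a reward that is merely the largest.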
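[AI-42] Leveraging LLM-Based Agents for Intelligent Supply Chain Planning
[Quick read]: This paper addresses supply chain planning from an e-commerce platform's perspective: how to collect relevant data, formulate long-term plans, and adjust them dynamically as the environment changes, while ensuring interpretability, efficiency, and reliability. The key to the solution is a Supply Chain Planning Agent (SCPA) framework that understands domain knowledge, interprets user needs, decomposes tasks, leverages or creates tools, and returns evidence-based planning reports. Deployed in a real-world scenario, it reduced labor and improved stock availability and other key metrics.
Link: https://arxiv.org/abs/2509.03811
Authors: Yongzhi Qi,Jiaheng Yin,Jianshen Zhang,Dongyang Geng,Zhengyu Chen,Hao Hu,Wei Qi,Zuo-Jun Max Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: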
Abstract:In supply chain management, planning is a critical concept. The movement of physical products across different categories, from suppliers to warehouse management, to sales, and logistics transporting them to customers, entails the involvement of many entities. It covers various aspects such as demand forecasting, inventory management, sales operations, and replenishment. How to collect relevant data from an e-commerce platform’s perspective, formulate long-term plans, and dynamically adjust them based on environmental changes, while ensuring interpretability, efficiency, and reliability, is a practical and challenging problem. In recent years, the development of AI technologies, especially the rapid progress of large language models, has provided new tools to address real-world issues. In this work, we construct a Supply Chain Planning Agent (SCPA) framework that can understand domain knowledge, comprehend the operator’s needs, decompose tasks, leverage or create new tools, and return evidence-based planning reports. We deploy this framework in this http URL’s real-world scenario, demonstrating the feasibility of LLM-agent applications in the supply chain. It effectively reduced labor and improved accuracy, stock availability, and other key metrics.
[AI-43] SAMVAD: A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India
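[Quick read]: This paper addresses the difficulty of simulating judicial deliberation, in particular the ethical and practical barriers that constrain empirical study of how judicial panels reach decisions within the Indian justice system. The key to the solution is SAMVAD, a simulation platform built as a Multi-Agent System (MAS) whose agents represent a Judge, a Prosecution Counsel, a Defense Counsel, and multiple Adjudicators, all powered by Large Language Models (LLMs). It integrates Retrieval-Augmented Generation (RAG) over a domain-specific legal knowledge base (including the Indian Penal Code and the Constitution of India) so that legal arguments are generated with traceable source citations, improving the fidelity, transparency, and legal soundness of the simulation.
Link: https://arxiv.org/abs/2509.03793
Authors: Prathamesh Devadiga,Omkaar Jayadev Shetty,Pooja Agarwal
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: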
Abstract:Understanding the complexities of judicial deliberation is crucial for assessing the efficacy and fairness of a justice system. However, empirical studies of judicial panels are constrained by significant ethical and practical barriers. This paper introduces SAMVAD, an innovative Multi-Agent System (MAS) designed to simulate the deliberation process within the framework of the Indian justice system. Our system comprises agents representing key judicial roles: a Judge, a Prosecution Counsel, a Defense Counsel, and multiple Adjudicators (simulating a judicial bench), all powered by large language models (LLMs). A primary contribution of this work is the integration of Retrieval-Augmented Generation (RAG), grounded in a domain-specific knowledge base of landmark Indian legal documents, including the Indian Penal Code and the Constitution of India. This RAG functionality enables the Judge and Counsel agents to generate legally sound instructions and arguments, complete with source citations, thereby enhancing both the fidelity and transparency of the simulation. The Adjudicator agents engage in iterative deliberation rounds, processing case facts, legal instructions, and arguments to reach a consensus-based verdict. We detail the system architecture, agent communication protocols, the RAG pipeline, the simulation workflow, and a comprehensive evaluation plan designed to assess performance, deliberation quality, and outcome consistency. This work provides a configurable and explainable MAS platform for exploring legal reasoning and group decision-making dynamics in judicial simulations, specifically tailored to the Indian legal context and augmented with verifiable legal grounding via RAG. 
[AI-44] What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
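[Quick read]: This paper addresses sample efficiency in sparse-reward reinforcement learning: how to explore and optimize a policy effectively when reward signals are extremely sparse. Its core contribution is to show that low-rank structure in the reward matrix drives a transition from exponential to polynomial sample complexity, which the authors describe as the first result of this kind for sparse-reward RL. The key to the solution is the Policy-Aware Matrix Completion (PAMC) framework, which connects matrix completion theory with reinforcement learning through an analysis of policy-dependent sampling, and includes reward-free representation learning from dynamics. The method provides distribution-free confidence sets and robustness guarantees when the low-rank structure is only approximate. In an evaluation across 100 systematically sampled domains it improves sample efficiency by a factor of 1.6 to 2.1, suggesting a structured reward learning paradigm for robotics, healthcare, and other safety-critical, sample-expensive applications.
Link: https://arxiv.org/abs/2509.03790
Authors: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: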
Abstract:What fundamental properties of reward functions enable efficient sparse-reward reinforcement learning? We address this question through the lens of low-rank structure in reward matrices, showing that such structure induces a sharp transition from exponential to polynomial sample complexity, the first result of this kind for sparse-reward RL. We introduce Policy-Aware Matrix Completion (PAMC), which connects matrix completion theory with reinforcement learning via a new analysis of policy-dependent sampling. Our framework provides: (i) impossibility results for general sparse reward observation, (ii) reward-free representation learning from dynamics, (iii) distribution-free confidence sets via conformal prediction, and (iv) robust completion guarantees that degrade gracefully when low-rank structure is only approximate. Empirically, we conduct a pre-registered evaluation across 100 systematically sampled domains, finding exploitable structure in over half. PAMC improves sample efficiency by factors between 1.6 and 2.1 compared to strong exploration, structured, and representation-learning baselines, while adding only about 20 percent computational this http URL results establish structural reward learning as a promising new paradigm, with immediate implications for robotics, healthcare, and other safety-critical, sample-expensive applications.
[AI-45] Learning an Adversarial World Model for Automated Curriculum Generation in MARL
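[Quick read]: This paper addresses the way world models used to train agents are limited by the finite complexity and implicit biases of hand-crafted environments, which makes it hard to develop general and robust agents. The key to the solution is an adversarial co-evolution framework: a goal-conditioned generative Attacker agent learns an implicit world model and actively generates scenarios that target the weaknesses of a Defender team, rather than passively predicting environment states, while the Defender team learns a cooperative policy to overcome these generated challenges. This interaction yields a self-scaling curriculum in which the world model keeps adapting to the agents' decision-making ability, providing an effectively unbounded stream of relevant training scenarios and leading to complex behaviors such as flanking and shielding formations, coordinated focus fire, and spreading tactics.
Link: https://arxiv.org/abs/2509.03771
Authors: Brennen Hill
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: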
Abstract:World models that infer and predict environmental dynamics are foundational to embodied intelligence. However, their potential is often limited by the finite complexity and implicit biases of hand-crafted training environments. To develop truly generalizable and robust agents, we need environments that scale in complexity alongside the agents learning within them. In this work, we reframe the challenge of environment generation as the problem of learning a goal-conditioned, generative world model. We propose a system where a generative Attacker agent learns an implicit world model to synthesize increasingly difficult challenges for a team of cooperative Defender agents. The Attacker’s objective is not passive prediction, but active, goal-driven interaction: it models and generates world states (i.e., configurations of enemy units) specifically to exploit the Defenders’ weaknesses. Concurrently, the embodied Defender team learns a cooperative policy to overcome these generated worlds. This co-evolutionary dynamic creates a self-scaling curriculum where the world model continuously adapts to challenge the decision-making policy of the agents, providing an effectively infinite stream of novel and relevant training scenarios. We demonstrate that this framework leads to the emergence of complex behaviors, such as the world model learning to generate flanking and shielding formations, and the defenders learning coordinated focus-fire and spreading tactics. Our findings position adversarial co-evolution as a powerful method for learning instrumental world models that drive agents toward greater strategic depth and robustness.
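[AI-46] RAGuard: A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs
[Quick read]: This paper addresses the unreliability of conventional Large Language Models (LLMs) in Offshore Wind (OSW) maintenance, where they do not explicitly incorporate safety-critical documents and therefore struggle with specialised or unexpected situations. The key to the solution is the RAGuard framework, which queries two separate indices in parallel (technical knowledge and safety documents) and allocates a separate retrieval budget to each, securing both technical depth and safety coverage. A further SafetyClamp extension fetches a larger candidate pool and "hard-clamps" a guaranteed number of slots for safety content. Safety Recall@K rises from almost 0% to more than 50% while Technical Recall stays above 60%, offering a dependable pattern for integrating safety into LLM-powered decision support for critical tasks.
Link: https://arxiv.org/abs/2509.03768
Authors: Connor Walker,Koorosh Aslansefat,Mohammad Naveed Akram,Yiannis Papadopoulos
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: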
Abstract:Accuracy and safety are paramount in Offshore Wind (OSW) maintenance, yet conventional Large Language Models (LLMs) often fail when confronted with highly specialised or unexpected scenarios. We introduce RAGuard, an enhanced Retrieval-Augmented Generation (RAG) framework that explicitly integrates safety-critical documents alongside technical this http URL issuing parallel queries to two indices and allocating separate retrieval budgets for knowledge and safety, RAGuard guarantees both technical depth and safety coverage. We further develop a SafetyClamp extension that fetches a larger candidate pool, “hard-clamping” exact slot guarantees to safety. We evaluate across sparse (BM25), dense (Dense Passage Retrieval) and hybrid retrieval paradigms, measuring Technical Recall@K and Safety Recall@K. Both proposed extensions of RAG show an increase in Safety Recall@K from almost 0% in RAG to more than 50% in RAGuard, while maintaining Technical Recall above 60%. These results demonstrate that RAGuard and SafetyClamp have the potential to establish a new standard for integrating safety assurance into LLM-powered decision support in critical maintenance contexts.
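The dual-index retrieval with separate budgets described in the abstract above can be sketched roughly as follows. This is an illustrative toy, not the RAGuard code: the token-overlap scorer stands in for BM25 or a dense retriever, and the names `overlap_score`, `top_k`, and `dual_index_retrieve` are hypothetical. SafetyClamp is not modelled here.

```python
def overlap_score(query, doc):
    """Toy relevance score: count of shared lowercase tokens.

    Stand-in for a real retriever such as BM25 or Dense Passage Retrieval.
    """
    return len(set(query.lower().split()) & set(doc.lower().split()))


def top_k(query, index, k):
    """Return the k highest-scoring documents from a single index."""
    ranked = sorted(index, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def dual_index_retrieve(query, knowledge_index, safety_index, k_knowledge, k_safety):
    """Query both indices and reserve a fixed number of context slots for each.

    Safety passages can never be crowded out by technical passages because
    each index has its own retrieval budget.
    """
    technical = top_k(query, knowledge_index, k_knowledge)
    safety = top_k(query, safety_index, k_safety)
    return technical + safety
```

With a single shared index, a query about a gearbox oil change would tend to fill every slot with technical passages; with separate budgets, the last `k_safety` slots always come from the safety index.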
[AI-47] ARDO: A Weak Formulation Deep Neural Network Method for Elliptic and Parabolic PDEs Based on Random Differences of Test Functions
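[Quick read]: This paper addresses the numerical solution of Partial Differential Equations (PDEs) and PDE-related problems, with a focus on Fokker-Planck type second-order elliptic and parabolic PDEs. The key to the solution is the ARDO method, which uses a weak adversarial formulation and transfers the random difference operator onto the test function, so that training is fully derivative-free with respect to the solution neural network, an approach intended to improve computational efficiency and stability.
Link: https://arxiv.org/abs/2509.03757
Authors: Wei Cai,Andrew Qing He
Affiliations: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments: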
Abstract:We propose ARDO method for solving PDEs and PDE-related problems with deep learning techniques. This method uses a weak adversarial formulation but transfers the random difference operator onto the test function. The main advantage of this framework is that it is fully derivative-free with respect to the solution neural network. This framework is particularly suitable for Fokker-Planck type second-order elliptic and parabolic PDEs.
[AI-48] Designing Gaze Analytics for ELA Instruction: A User-Centered Dashboard with Conversational AI Support
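[Quick read]: This paper addresses the limited use of eye-tracking data in classroom-facing educational technology: because the data is hard to interpret and access, teachers struggle to use it for reflection, formative assessment, and instructional decisions. The key to the solution is a gaze-based learning analytics dashboard developed with user-centered design and data storytelling principles, which combines familiar visualizations, layered explanations, and narrative scaffolds to make gaze data readable and pedagogically useful. A conversational agent powered by a large language model (LLM) further lowers the cognitive barrier by supporting natural-language interaction with multimodal learning analytics, helping new data modalities reach classroom practice.
Link: https://arxiv.org/abs/2509.03741
Authors: Eduardo Davalos,Yike Zhang,Shruti Jain,Namrata Srivastava,Trieu Truong,Nafees-ul Haque,Tristan Van,Jorge Salas,Sara McFadden,Sun-Joo Cho,Gautam Biswas,Amanda Goodwin
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 22 pages, 9 figures, 3 tables, submitted to IUI2026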
Abstract:Eye-tracking offers rich insights into student cognition and engagement, but remains underutilized in classroom-facing educational technology due to challenges in data interpretation and accessibility. In this paper, we present the iterative design and evaluation of a gaze-based learning analytics dashboard for English Language Arts (ELA), developed through five studies involving teachers and students. Guided by user-centered design and data storytelling principles, we explored how gaze data can support reflection, formative assessment, and instructional decision-making. Our findings demonstrate that gaze analytics can be approachable and pedagogically valuable when supported by familiar visualizations, layered explanations, and narrative scaffolds. We further show how a conversational agent, powered by a large language model (LLM), can lower cognitive barriers to interpreting gaze data by enabling natural language interactions with multimodal learning analytics. We conclude with design implications for future EdTech systems that aim to integrate novel data modalities in classroom contexts.
[AI-49] Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces
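[Quick read]: This paper addresses the problem of unifying representations in neural models, and in particular the limited interpretability of Neural Operators (NO). The Platonic Representation Hypothesis suggests that networks with different architectures converge to similar representations, but the representational properties of neural operators remain underexplored. The paper proposes a framework based on sparse model recovery that extends Sparse Autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators. The key to the solution is the use of lifting and operator modules, which introduce beneficial inductive biases: faster recovery of sparse representations, better recovery of smooth concepts, and robust inference across resolutions, a property unique to neural operators.
Link: https://arxiv.org/abs/2509.03738
Authors: Bahareh Tolooshams,Ailsa Shen,Anima Anandkumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Tolooshams and Shen contributed equally. Preprint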
Abstract:We frame the problem of unifying representations in neural models as one of sparse model recovery and introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NO). While the Platonic Representation Hypothesis suggests that neural networks converge to similar representations across architectures, the representational properties of neural operators remain underexplored despite their growing importance in scientific computing. We compare the inference and training dynamics of SAEs, lifted-SAE, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions, a property unique to neural operators.
[AI-50] Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
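[Quick read]: This paper asks whether Large Language Models (LLMs) acting as synthetic agents can reliably substitute for real participants in human-subject research, and specifically whether their behavior stays internally consistent across experimental settings. Prior work has mostly examined whether LLM-generated data matches data from human counterparts; this paper raises the more fundamental question of whether agents behave coherently across settings. The key to the solution is a two-part study design: prompts first reveal the agent's internal state, and the agent's behavior is then observed in a basic dialogue setting to test whether it matches that state. This lets the authors test a set of behavioral hypotheses, and they find significant internal inconsistencies across model families and sizes, indicating that LLM agents cannot yet accurately stand in for real participants.
Link: https://arxiv.org/abs/2509.03736
Authors: James Mooney,Josef Woldense,Zheng Robert Jia,Shirley Anugrah Hayati,My Ha Nguyen,Vipul Raheja,Dongyeop Kang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures, 7 tables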
Abstract:The impressive capabilities of Large Language Models (LLMs) have fueled the notion that synthetic agents can serve as substitutes for real participants in human-subject research. In an effort to evaluate the merits of this claim, social science researchers have largely focused on whether LLM-generated survey data corresponds to that of a human counterpart whom the LLM is prompted to represent. In contrast, we address a more fundamental question: Do agents maintain internal consistency, retaining similar behaviors when examined under different experimental settings? To this end, we develop a study designed to (a) reveal the agent’s internal state and (b) examine agent behavior in a basic dialogue setting. This design enables us to explore a set of behavioral hypotheses to assess whether an agent’s conversation behavior is consistent with what we would expect from their revealed internal state. Our findings on these hypotheses show significant internal inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be internally consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research. Our simulation code and data are publicly accessible.
[AI-51] Differentiable Entropy Regularization for Geometry and Neural Networks
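[Quick read]: This paper addresses how to bring range-partition entropy, a concept from computational geometry, into deep learning so that models can adapt to the "sortedness" of their input and gain efficiency and structured representations. The core obstacle is that range-partition entropy is not differentiable and so cannot be used directly in neural network training. The solution has three parts: (i) the first differentiable approximation of range-partition entropy, usable as a training loss or regularizer; (ii) EntropyNet, a module that restructures input data into low-entropy forms to accelerate downstream instance-optimal algorithms; and (iii) an extension of entropy regularization to Transformer attention that induces structured attention patterns. Experiments report up to 4.1× runtime speedups with 0.2% error on geometry tasks, and 6% higher accuracy than L1 baselines at 80% sparsity on deep learning tasks.
Link: https://arxiv.org/abs/2509.03733
Authors: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: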
Abstract:We introduce a differentiable estimator of range-partition entropy, a recent concept from computational geometry that enables algorithms to adapt to the “sortedness” of their input. While range-partition entropy provides strong guarantees in algorithm design, it has not yet been made accessible to deep learning. In this work, we (i) propose the first differentiable approximation of range-partition entropy, enabling its use as a trainable loss or regularizer; (ii) design EntropyNet, a neural module that restructures data into low-entropy forms to accelerate downstream instance-optimal algorithms; and (iii) extend this principle beyond geometry by applying entropy regularization directly to Transformer attention. Across tasks, we demonstrate that differentiable entropy improves efficiency without degrading correctness: in geometry, our method achieves up to 4.1× runtime speedups with negligible error (0.2%); in deep learning, it induces structured attention patterns that yield 6% higher accuracy at 80% sparsity compared to L1 baselines. Our theoretical analysis provides approximation bounds for the estimator, and extensive ablations validate design choices. These results suggest that entropy-bounded computation is not only theoretically elegant but also a practical mechanism for adaptive learning, efficiency, and structured representation.
[AI-52] PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming
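[Quick read]: This paper addresses the fact that current automated red-teaming methods ignore how a tester's identity and background shape the risks they uncover. Automated methods can probe model behavior at scale, but they do not simulate how different perspectives shape attack strategies, which limits the breadth and depth of the risks found. The key to the solution is PersonaTeaming, which introduces personas into adversarial prompt generation: prompts are mutated using predefined "red-teaming expert" or "regular AI user" personas, and a dynamic persona-generating algorithm produces persona types adapted to each seed prompt, complemented by new "mutation distance" metrics for prompt diversity. Compared with RainbowPlus, a state-of-the-art method, attack success rates improve by up to 144.1% while prompt diversity is maintained, pointing to ways of combining automated and human red-teaming.
Link: https://arxiv.org/abs/2509.03728
Authors: Wesley Hanwen Deng,Sunnie S. Y. Kim,Akshita Jha,Ken Holstein,Motahhare Eslami,Lauren Wilcox,Leon A Gatys
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: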
Abstract:Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people’s background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either “red-teaming expert” personas or “regular AI user” personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the “mutation distance” to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.
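[AI-53] From Federated Learning to $\mathbb{X}$-Learning: Breaking the Barriers of Decentrality Through Random Walks
[Quick read]: This paper addresses the limits of the notion of decentralization in distributed learning architectures and presents $\mathbb{X}$-Learning ($\mathbb{X}$L), a distributed learning architecture that generalizes and extends decentralized learning. The key to the solution is to draw out the connections between $\mathbb{X}$L, graph theory, and Markov chains, which expose new design considerations and degrees of freedom and motivate a series of open research directions.
Link: https://arxiv.org/abs/2509.03709
Authors: Allan Salihovic,Payam Abdisarabshali,Michael Langberg,Seyyedali Hosseinalipour
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 figures, 12 pages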
Abstract:We provide our perspective on $\mathbb{X}$-Learning ($\mathbb{X}$L), a novel distributed learning architecture that generalizes and extends the concept of decentralization. Our goal is to present a vision for $\mathbb{X}$L, introducing its unexplored design considerations and degrees of freedom. To this end, we shed light on the intuitive yet non-trivial connections between $\mathbb{X}$L, graph theory, and Markov chains. We also present a series of open research directions to stimulate further research.
[AI-54] Hierarchical Federated Foundation Models over Wireless Networks for Multi-Modal Multi-Task Intelligence: Integration of Edge Learning with D2D/P2P-Enabled Fog Learning Architectures
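[Quick read]: This paper addresses the constraints that device heterogeneity places on training and collaboration for Multi-Modal Multi-Task (M3T) foundation models in fog/edge environments. The core challenge is that existing federated learning frameworks do not account for two kinds of heterogeneity across fog/edge nodes, in the modalities collected (such as text, images, and speech) and in the tasks executed, which limits the deployment and optimization of M3T Foundation Models (FMs) in distributed settings. The key to the solution is Hierarchical Federated Foundation Models (HF-FMs), which align the modular structure of M3T FMs (modality encoders, prompts, mixture-of-experts (MoEs), adapters, and task heads) with the hierarchy of fog/edge networks. Optional device-to-device (D2D) communication additionally enables localized cooperative training and horizontal module relaying among nodes, with the aim of improving robustness and resource utilization.
Link: https://arxiv.org/abs/2509.03695
Authors: Payam Abdisarabshali,Fardis Nadimi,Kasra Borazjani,Naji Khosravan,Minghui Liwang,Wei Ni,Dusit Niyato,Michael Langberg,Seyyedali Hosseinalipour
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, 1 table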
Abstract:The rise of foundation models (FMs) has reshaped the landscape of machine learning. As these models continued to grow, leveraging geo-distributed data from wireless devices has become increasingly critical, giving rise to federated foundation models (FFMs). More recently, FMs have evolved into multi-modal multi-task (M3T) FMs (e.g., GPT-4) capable of processing diverse modalities across multiple tasks, which motivates a new underexplored paradigm: M3T FFMs. In this paper, we unveil an unexplored variation of M3T FFMs by proposing hierarchical federated foundation models (HF-FMs), which in turn expose two overlooked heterogeneity dimensions to fog/edge networks that have a direct impact on these emerging models: (i) heterogeneity in collected modalities and (ii) heterogeneity in executed tasks across fog/edge nodes. HF-FMs strategically align the modular structure of M3T FMs, comprising modality encoders, prompts, mixture-of-experts (MoEs), adapters, and task heads, with the hierarchical nature of fog/edge infrastructures. Moreover, HF-FMs enable the optional usage of device-to-device (D2D) communications, enabling horizontal module relaying and localized cooperative training among nodes when feasible. Through delving into the architectural design of HF-FMs, we highlight their unique capabilities along with a series of tailored future research directions. Finally, to demonstrate their potential, we prototype HF-FMs in a wireless network setting and release the open-source code for the development of HF-FMs with the goal of fostering exploration in this untapped field (GitHub: this https URL).
[AI-55] Efficient Virtuoso: A Latent Diffusion Transformer Model for Goal-Conditioned Trajectory Planning
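[Quick read]: This paper addresses the generation of a diverse and plausible distribution of future trajectories for autonomous vehicle planning, where high fidelity, computational efficiency, and precise control are hard to achieve together. The core of the solution is Efficient Virtuoso, a conditional latent diffusion model with a two-stage normalization pipeline: trajectories are first scaled so that their geometric aspect ratio is preserved, and the latent space obtained by Principal Component Analysis (PCA) is then normalized to give a stable training target. Denoising runs in this low-dimensional latent space with a simple multilayer perceptron (MLP), conditioned on scene context fused by a Transformer-based StateEncoder. On the Waymo Open Motion Dataset the method reports a state-of-the-art minimum Average Displacement Error (minADE) of 0.25, and an ablation shows that a single endpoint goal resolves strategic ambiguity, whereas a multi-step sparse route is needed for precise tactical execution that mirrors human driving.
Link: https://arxiv.org/abs/2509.03658
Authors: Antonio Guillen-Perez
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: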
Abstract:The ability to generate a diverse and plausible distribution of future trajectories is a critical capability for autonomous vehicle planning systems. While recent generative models have shown promise, achieving high fidelity, computational efficiency, and precise control remains a significant challenge. In this paper, we present the Efficient Virtuoso, a conditional latent diffusion model for goal-conditioned trajectory planning. Our approach introduces a novel two-stage normalization pipeline that first scales trajectories to preserve their geometric aspect ratio and then normalizes the resulting PCA latent space to ensure a stable training target. The denoising process is performed efficiently in this low-dimensional latent space by a simple MLP denoiser, which is conditioned on a rich scene context fused by a powerful Transformer-based StateEncoder. We demonstrate that our method achieves state-of-the-art performance on the Waymo Open Motion Dataset, reaching a minADE of 0.25. Furthermore, through a rigorous ablation study on goal representation, we provide a key insight: while a single endpoint goal can resolve strategic ambiguity, a richer, multi-step sparse route is essential for enabling the precise, high-fidelity tactical execution that mirrors nuanced human driving behavior.
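The first normalization stage in the abstract above, scaling that preserves a trajectory's geometric aspect ratio, can be illustrated with a short Python sketch. It covers only that first stage (the PCA stage is omitted), and both the function name `scale_preserving_aspect` and the choice of the largest absolute coordinate as the shared factor are assumptions, not details taken from the paper.

```python
def scale_preserving_aspect(trajectory):
    """Divide x and y by one shared factor so the path keeps its shape.

    Scaling each axis independently (for example, per-axis min-max) would
    distort the geometry; a single factor does not. Returns the scaled
    trajectory and the factor needed to undo the scaling.
    """
    scale = max(max(abs(x), abs(y)) for x, y in trajectory)
    if scale == 0:
        # Degenerate trajectory at the origin: nothing to scale.
        return list(trajectory), 1.0
    return [(x / scale, y / scale) for x, y in trajectory], scale
```

A mostly straight path that travels 80 m forward and drifts 8 m sideways keeps its 10:1 proportions after scaling, whereas per-axis normalization would stretch the sideways drift to the same range as the forward motion.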
[AI-56] An Empirical Evaluation of Factors Affecting SHAP Explanation of Time Series Classification
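[Quick read]: This paper addresses the impracticality of SHapley Additive exPlanations (SHAP) for explaining Time Series Classification (TSC) models on long series, since its computational cost grows exponentially with the number of features. The key idea is time series segmentation: consecutive time points are grouped into segments and a single attribution value is computed per segment, which drastically reduces SHAP running time. The study finds that the number of segments affects explanation quality more than the choice of segmentation method, and that equal-length segmentation outperforms most custom segmentation algorithms. It also introduces an attribution normalisation technique that weights segments by their length and further improves attribution quality.
Link: https://arxiv.org/abs/2509.03649
Authors: Davide Italo Serramazza,Nikos Papadeas,Zahraa Abdallah,Georgiana Ifrim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: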
Abstract:Explainable AI (XAI) has become an increasingly important topic for understanding and attributing the predictions made by complex Time Series Classification (TSC) models. Among attribution methods, SHapley Additive exPlanations (SHAP) is widely regarded as an excellent attribution method; but its computational complexity, which scales exponentially with the number of features, limits its practicality for long time series. To address this, recent studies have shown that aggregating features via segmentation, to compute a single attribution value for a group of consecutive time points, drastically reduces SHAP running time. However, the choice of the optimal segmentation strategy remains an open question. In this work, we investigated eight different Time Series Segmentation algorithms to understand how segment compositions affect the explanation quality. We evaluate these approaches using two established XAI evaluation methodologies: InterpretTime and AUC Difference. Through experiments on both Multivariate (MTS) and Univariate Time Series (UTS), we find that the number of segments has a greater impact on explanation quality than the specific segmentation method. Notably, equal-length segmentation consistently outperforms most of the custom time series segmentation algorithms. Furthermore, we introduce a novel attribution normalisation technique that weights segments by their length and we show that it consistently improves attribution quality.
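The segmentation idea in the abstract above can be sketched in a few lines of Python. This sketch does not call the SHAP library and is not the paper's code: `equal_length_segments` and `per_point_attributions` are hypothetical helpers, and dividing a segment's attribution by its length is only one plausible reading of "weights segments by their length".

```python
def equal_length_segments(n_points, n_segments):
    """Split indices 0..n_points-1 into contiguous, near-equal (start, end) ranges.

    `end` is exclusive. Any remainder is spread over the first segments.
    """
    if not 1 <= n_segments <= n_points:
        raise ValueError("n_segments must be between 1 and n_points")
    base, extra = divmod(n_points, n_segments)
    segments, start = [], 0
    for s in range(n_segments):
        length = base + (1 if s < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments


def per_point_attributions(segment_values, segments):
    """Spread each segment's attribution evenly over its time points.

    Assumed normalisation: a long segment's value is shared among more
    points, so per-point scores are comparable across segment lengths.
    """
    out = []
    for value, (start, end) in zip(segment_values, segments):
        length = end - start
        out.extend([value / length] * length)
    return out
```

An explainer then needs one attribution per segment instead of one per time point, so a series of 1,000 points split into 10 segments is treated as a 10-feature input.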
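[AI-57] CEHR-GPT: A Scalable Multi-Task Foundation Model for Electronic Health Records
[Quick read]: This paper addresses the task-specific nature of most AI models for Electronic Health Records (EHRs): they are built for a single clinical task, which limits their generalizability and real-world utility. The key to the solution is CEHR-GPT, a general-purpose foundation model that unifies three capabilities in a single architecture (feature representation, zero-shot prediction, and synthetic data generation) and adds a time-token-based learning framework that explicitly encodes patients' dynamic timelines, supporting multi-task use with good generalization on EHR data.
Link: https://arxiv.org/abs/2509.03643
Authors: Chao Pang,Jiheum Park,Xinzhuo Jiang,Nishanth Parameshwar Pavinkurve,Krishna S. Kalluri,Shalmali Joshi,Noémie Elhadad,Karthik Natarajan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: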
Abstract:Electronic Health Records (EHRs) provide a rich, longitudinal view of patient health and hold significant potential for advancing clinical decision support, risk prediction, and data-driven healthcare research. However, most artificial intelligence (AI) models for EHRs are designed for narrow, single-purpose tasks, limiting their generalizability and utility in real-world settings. Here, we present CEHR-GPT, a general-purpose foundation model for EHR data that unifies three essential capabilities - feature representation, zero-shot prediction, and synthetic data generation - within a single architecture. To support temporal reasoning over clinical sequences, CEHR-GPT incorporates a novel time-token-based learning framework that explicitly encodes patients’ dynamic timelines into the model structure. CEHR-GPT demonstrates strong performance across all three tasks and generalizes effectively to external datasets through vocabulary expansion and fine-tuning. Its versatility enables rapid model development, cohort discovery, and patient outcome forecasting without the need for task-specific retraining.
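[AI-58] Explainable Knowledge Graph Retrieval-Augmented Generation (KG-RAG) with KG-SMILE
[Quick read]: This paper addresses the reliability problems that hallucinations and unverifiable claims cause when generative AI is applied in sensitive domains, and in particular the fact that Retrieval-Augmented Generation (RAG) improves accuracy but remains opaque, essentially a black box that depends heavily on data quality. The key to the solution is Knowledge-Graph (KG)-SMILE, a method-agnostic, perturbation-based framework: it applies controlled perturbations to graph-based RAG, computes similarities, and trains weighted linear surrogate models to identify the graph entities and relations that most influence the generated output, adding interpretability while preserving model effectiveness and making RAG more transparent and trustworthy.
Link: https://arxiv.org/abs/2509.03626
Authors: Zahra Zehtabi Sabeti Moghaddam,Zeinab Dehghani,Maneeha Rani,Koorosh Aslansefat,Bhupesh Kumar Mishra,Rameez Raja Kureshi,Dhavalkumar Thakker
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: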
Abstract:Generative AI, such as Large Language Models (LLMs), has achieved impressive progress but still produces hallucinations and unverifiable claims, limiting reliability in sensitive domains. Retrieval-Augmented Generation (RAG) improves accuracy by grounding outputs in external knowledge, especially in domains like healthcare, where precision is vital. However, RAG remains opaque and essentially a black box, heavily dependent on data quality. We developed a method-agnostic, perturbation-based framework that provides token and component-level interoperability for Graph RAG using SMILE and named it as Knowledge-Graph (KG)-SMILE. By applying controlled perturbations, computing similarities, and training weighted linear surrogates, KG-SMILE identifies the graph entities and relations most influential to generated outputs, thereby making RAG more transparent. We evaluate KG-SMILE using comprehensive attribution metrics, including fidelity, faithfulness, consistency, stability, and accuracy. Our findings show that KG-SMILE produces stable, human-aligned explanations, demonstrating its capacity to balance model effectiveness with interpretability and thereby fostering greater transparency and trust in machine learning technologies.
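[AI-59] The Optimiser Hidden in Plain Sight: Training with the Loss Landscapes Induced Metric
Quick read: This paper addresses the fact that optimisers for neural network training make limited use of the geometry of the loss surface, especially the difficulty that changes in curvature pose for adapting the learning rate in high-dimensional spaces. The key to the solution is the Riemannian metric naturally induced when the loss landscape is embedded in a higher-dimensional space, on which a new class of optimisers is built. By accounting explicitly for the geometry of the loss surface, the method automatically lowers the effective learning rate in regions of high curvature, which acts as a smoothed form of gradient clipping. One variant can also be viewed as inducing an effective scheduled learning rate, and decoupled weight decay emerges as the natural choice from this geometric perspective. The optimiser has a computational complexity comparable to that of Adam and can be combined with existing preconditioning methods.
Link: https://arxiv.org/abs/2509.03594
Authors: Thomas R. Harvey
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: this https URL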
Abstract:We present a class of novel optimisers for training neural networks that makes use of the Riemannian metric naturally induced when the loss landscape is embedded in higher-dimensional space. This is the same metric that underlies common visualisations of loss landscapes. By taking this geometric perspective literally and using the induced metric, we develop a new optimiser and compare it to existing methods, namely: SGD, Adam, AdamW, and Muon, across a range of tasks and architectures. Empirically, we conclude that this new class of optimisers is highly effective in low dimensional examples, and provides slight improvement over state-of-the-art methods for training neural networks. These new optimisers have theoretically desirable properties. In particular, the effective learning rate is automatically decreased in regions of high curvature acting as a smoothed out form of gradient clipping. Similarly, one variant of these optimisers can also be viewed as inducing an effective scheduled learning rate and decoupled weight decay is the natural choice from our geometric perspective. The basic method can be used to modify any existing preconditioning method. The new optimiser has a computational complexity comparable to that of Adam.
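As a rough illustration of the geometric idea, the sketch below assumes the induced metric is the standard pull-back of the Euclidean metric onto the graph of the loss, g = I + ∇L ∇Lᵀ. Under that assumption the Sherman-Morrison formula gives g⁻¹∇L = ∇L / (1 + ‖∇L‖²), so the step length lr·‖∇L‖ / (1 + ‖∇L‖²) never exceeds lr/2 and shrinks where the gradient is large. The abstract does not state the paper's exact update rule, so this may differ from the published optimisers; the function names and the toy quadratic loss are hypothetical.

```python
def induced_metric_step(theta, grad, lr):
    """One descent step preconditioned by the inverse of g = I + grad grad^T.

    Sherman-Morrison gives g^{-1} grad = grad / (1 + |grad|^2), so no matrix
    is ever formed or inverted.
    """
    sq_norm = sum(g * g for g in grad)
    scale = lr / (1.0 + sq_norm)
    return [t - scale * g for t, g in zip(theta, grad)]

def quadratic_grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2).
    coeffs = (1.0, 10.0)
    return [c * t for c, t in zip(coeffs, theta)]

theta = [1.0, 1.0]
for _ in range(2000):
    theta = induced_metric_step(theta, quadratic_grad(theta), lr=0.05)
```

Far from the minimum the large gradient damps the step, and close to it the update reduces to plain gradient descent with step size `lr`.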
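[AI-60] Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents
Quick read: This paper addresses the trade-off between efficiency and performance caused by fixed planning strategies when Large Language Models (LLMs) are used in agentic settings: always planning increases compute cost and degrades performance on long-horizon tasks, while never planning limits the model's problem-solving ability. The key to the solution is a conceptual framework for dynamic planning that lets an LLM agent decide flexibly at inference time when to allocate compute to planning. Training follows a two-stage pipeline: supervised fine-tuning on diverse synthetic data first primes the model for dynamic planning, and reinforcement learning (RL) then refines this capability in long-horizon environments, yielding more efficient, controllable, and adaptive agents.
Link: https://arxiv.org/abs/2509.03581
Authors: Davide Paglieri,Bartłomiej Cupiał,Jonathan Cook,Ulyana Piterbarg,Jens Tuyls,Edward Grefenstette,Jakob Nicolaus Foerster,Jack Parker-Holder,Tim Rocktäschel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: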
Abstract:Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities. To our knowledge, this work is the first to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, paving the way for more efficient, adaptive, and controllable agentic systems.
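[AI-61] Diffusion-RL Based Air Traffic Conflict Detection and Resolution Method
Quick read: This paper addresses the "unimodal bias" of current Deep Reinforcement Learning (DRL) methods for air traffic Conflict Detection and Resolution (CDR): policies converge to a single optimal solution, so they lack decision-making flexibility under complex, dynamic constraints and are prone to "decision deadlocks". The key to the solution is a new framework named Diffusion-AC, whose core innovation is to bring diffusion probabilistic models into the CDR task. The policy is modelled as a reverse denoising process guided by a value function, which produces a high-quality, multimodal action distribution and thus more diverse decisions. It is combined with a Density-Progressive Safety Curriculum (DPSC), a training mechanism that keeps learning stable and efficient as the agent moves from sparse to high-density traffic. In the high-density scenarios the method reaches a 94.1% success rate and reduces Near Mid-Air Collisions (NMACs) by about 59%, which supports the role of multimodal decision-making in safety and robustness.
Link: https://arxiv.org/abs/2509.03550
Authors: Tonghe Li,Jixin Liu,Weili Zeng,Hao Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 59 pages,13 figures, 3 tables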
Abstract:In the context of continuously rising global air traffic, efficient and safe Conflict Detection and Resolution (CDR) is paramount for air traffic management. Although Deep Reinforcement Learning (DRL) offers a promising pathway for CDR automation, existing approaches commonly suffer from a “unimodal bias” in their policies. This leads to a critical lack of decision-making flexibility when confronted with complex and dynamic constraints, often resulting in “decision deadlocks.” To overcome this limitation, this paper pioneers the integration of diffusion probabilistic models into the safety-critical task of CDR, proposing a novel autonomous conflict resolution framework named Diffusion-AC. Diverging from conventional methods that converge to a single optimal solution, our framework models its policy as a reverse denoising process guided by a value function, enabling it to generate a rich, high-quality, and multimodal action distribution. This core architecture is complemented by a Density-Progressive Safety Curriculum (DPSC), a training mechanism that ensures stable and efficient learning as the agent progresses from sparse to high-density traffic environments. Extensive simulation experiments demonstrate that the proposed method significantly outperforms a suite of state-of-the-art DRL benchmarks. Most critically, in the most challenging high-density scenarios, Diffusion-AC not only maintains a high success rate of 94.1% but also reduces the incidence of Near Mid-Air Collisions (NMACs) by approximately 59% compared to the next-best-performing baseline, significantly enhancing the system’s safety margin. This performance leap stems from its unique multimodal decision-making capability, which allows the agent to flexibly switch to effective alternative maneuvers.
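[AI-62] Multilinear and Linear Programs for Partially Identifiable Queries in Quasi-Markovian Structural Causal Models UAI2025
Quick read: This paper addresses the computation of probability bounds in partially identifiable causal models, specifically acyclic Structural Causal Models (SCMs) that are quasi-Markovian, meaning that each endogenous variable is connected with at most one exogenous confounder. When the distribution of the exogenous variables is not fully specified and only the distribution over the endogenous variables is observed, the question is how to compute tight bounds on a target probability efficiently. The key to the solution is a new algorithm that uses the observed probabilities over endogenous variables to simplify the construction of the multilinear or linear programs. For scenarios with a single intervention, column generation is applied to compute a bound through a sequence of auxiliary integer linear programs, which shows that a representation with polynomial cardinality for the exogenous variables is possible. Experiments show that the column generation approach outperforms existing methods.
Link: https://arxiv.org/abs/2509.03548
Authors: João P. Arroyo,João G. Rodrigues,Daniel Lawand,Denis D. Mauá,Junkyu Lee,Radu Marinescu,Alex Gray,Eduardo R. Laurentino,Fabio G. Cozman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the Causal Abstractions and Representations (CAR) workshop of the 41st Conference on Uncertainty in Artificial Intelligence (UAI 2025)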
Abstract:We investigate partially identifiable queries in a class of causal models. We focus on acyclic Structural Causal Models that are quasi-Markovian (that is, each endogenous variable is connected with at most one exogenous confounder). We look into scenarios where endogenous variables are observed (and a distribution over them is known), while exogenous variables are not fully specified. This leads to a representation that is in essence a Bayesian network where the distribution of root variables is not uniquely determined. In such circumstances, it may not be possible to precisely compute a probability value of interest. We thus study the computation of tight probability bounds, a problem that has been solved by multilinear programming in general, and by linear programming when a single confounded component is intervened upon. We present a new algorithm to simplify the construction of such programs by exploiting input probabilities over endogenous variables. For scenarios with a single intervention, we apply column generation to compute a probability bound through a sequence of auxiliary linear integer programs, thus showing that a representation with polynomial cardinality for exogenous variables is possible. Experiments show column generation techniques to be superior to existing methods.
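[AI-63] A software security review on Uganda's Mobile Money Services: Dr. Jim Spire's tweets sentiment analysis
Quick read: This paper addresses the widespread concern among mobile money users in Uganda about security mechanisms, in particular the security gaps exposed by the "#StopAirtelThefty" Twitter campaign of August 2025. The key to the solution is a qualitative analysis that systematically examines the complaints raised by the public during the campaign, extracts themes related to security flaws and user dissatisfaction, and interprets these findings within the broader regulatory and operational environment of mobile money in Uganda, giving service providers and policymakers actionable directions for a more secure digital finance ecosystem.
Link: https://arxiv.org/abs/2509.03545
Authors: Nsengiyumva Wilberforce
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 16 pages, 3 figures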
Abstract:The proliferation of mobile money in Uganda has been a cornerstone of financial inclusion, yet its security mechanisms remain a critical concern. This study investigates a significant public response to perceived security failures: the #StopAirtelThefty Twitter campaign of August 2025. Sparked by an incident publicized by Dr. Jim Spire Ssentongo where a phone thief accessed a victim’s account, withdrew funds, and procured a loan, the campaign revealed deep-seated public anxiety over the safety of mobile money. This research employs qualitative analysis to systematically examine the complaints raised during this campaign, extracting key themes related to security vulnerabilities and user dissatisfaction. By synthesizing these public sentiments, the paper provides crucial insights into the specific security gaps experienced by users and situates these findings within the larger framework of Uganda’s mobile money regulatory and operational environment. The study concludes with implications for providers, policymakers, and the future of secure digital finance in Uganda.
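[AI-64] PG-Agent: An Agent Powered by Page Graph ACM-MM2025
Quick read: This paper addresses the fact that current GUI agents, which use sequences of multi-step operations as prior knowledge, struggle to capture the complex transition relationships between pages, which limits their perception of the GUI environment and their generalization to new scenarios. The key to the solution is an automated pipeline that turns sequential multi-step episodes into page graphs that explicitly model how pages are connected, together with Retrieval-Augmented Generation (RAG) to retrieve reliable GUI perception guidelines from those graphs. The paper further proposes PG-Agent, a multi-agent framework with a task decomposition strategy into which the guidelines are injected, so that it can adapt to unseen scenarios.
Link: https://arxiv.org/abs/2509.03536
Authors: Weizhi Chen,Ziwei Wang,Leyang Yang,Sheng Zhou,Xiaoxuan Tang,Jiajun Bu,Yong Li,Wei Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Paper accepted to ACM MM 2025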
Abstract:Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Currently, existing GUI agents usually utilize sequential episodes of multi-step operations across pages as the prior GUI knowledge, which fails to capture the complex transition relationship between pages, making it challenging for the agents to deeply perceive the GUI environment and generalize to new scenarios. Therefore, we design an automated pipeline to transform the sequential episodes into page graphs, which explicitly model the graph structure of the pages that are naturally connected by actions. To fully utilize the page graphs, we further introduce Retrieval-Augmented Generation (RAG) technology to effectively retrieve reliable perception guidelines of GUI from them, and a tailored multi-agent framework PG-Agent with task decomposition strategy is proposed to be injected with the guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction.
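[AI-65] How many patients could we save with LLM priors?
Quick read: This paper addresses the low power of safety assessment in multi-center clinical trials with limited sample sizes, in particular for statistical inference on Adverse Events (AEs), where traditional methods make little use of external clinical knowledge to improve predictive accuracy and efficiency. The key to the solution is a hierarchical Bayesian modeling framework based on Large Language Models (LLMs): prior distributions elicited from an LLM are used directly as priors for the hyperparameters of the hierarchical model, so that external clinical expertise enters the safety model in a structured way. This avoids the limitations of data augmentation approaches that generate synthetic data points and improves predictive performance and clinical usefulness.
Link: https://arxiv.org/abs/2509.04250
Authors: Shota Arai,David Selby,Andrew Vargo,Sebastian Vollmer
Affiliation: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Applications (stat.AP)
Comments: 9 pages, 4 figures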
Abstract:Imagine a world where clinical trials need far fewer patients to achieve the same statistical power, thanks to the knowledge encoded in large language models (LLMs). We present a novel framework for hierarchical Bayesian modeling of adverse events in multi-center clinical trials, leveraging LLM-informed prior distributions. Unlike data augmentation approaches that generate synthetic data points, our methodology directly obtains parametric priors from the model. Our approach systematically elicits informative priors for hyperparameters in hierarchical Bayesian models using a pre-trained LLM, enabling the incorporation of external clinical expertise directly into Bayesian safety modeling. Through comprehensive temperature sensitivity analysis and rigorous cross-validation on real-world clinical trial data, we demonstrate that LLM-derived priors consistently improve predictive performance compared to traditional meta-analytical approaches. This methodology paves the way for more efficient and expert-informed clinical trial design, enabling substantial reductions in the number of patients required to achieve robust safety assessment and with the potential to transform drug safety monitoring and regulatory decision making.
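[AI-66] EHVC: Efficient Hierarchical Reference and Quality Structure for Neural Video Coding
Quick read: This paper addresses the mismatch between reference structure design and hierarchical quality structure in Neural Video Codecs (NVCs), as well as the remaining room for optimizing existing hierarchical quality structures. The core solution is EHVC, an efficient hierarchical neural video codec with three key innovations: (1) a hierarchical multi-reference scheme that draws on traditional video codec design to align the reference and quality structures and reduce the reference-quality mismatch; (2) a lookahead strategy that uses encoder-side context from future frames to enhance the quality structure; and (3) a layer-wise quality scale combined with a random quality training strategy that stabilizes the quality structure during inference. With these improvements EHVC outperforms current state-of-the-art NVCs.
Link: https://arxiv.org/abs/2509.04118
Authors: Junqi Liao,Yaojun Wu,Chaoyi Lin,Zhipin Deng,Li Li,Dong Liu,Xiaoyan Sun
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures, Accepted to ACMMM 2025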
Abstract:Neural video codecs (NVCs), leveraging the power of end-to-end learning, have demonstrated remarkable coding efficiency improvements over traditional video codecs. Recent research has begun to pay attention to the quality structures in NVCs, optimizing them by introducing explicit hierarchical designs. However, less attention has been paid to the reference structure design, which fundamentally should be aligned with the hierarchical quality structure. In addition, there is still significant room for further optimization of the hierarchical quality structure. To address these challenges in NVCs, we propose EHVC, an efficient hierarchical neural video codec featuring three key innovations: (1) a hierarchical multi-reference scheme that draws on traditional video codec design to align reference and quality structures, thereby addressing the reference-quality mismatch; (2) a lookahead strategy to utilize an encoder-side context from future frames to enhance the quality structure; (3) a layer-wise quality scale with random quality training strategy to stabilize quality structures during inference. With these improvements, EHVC achieves significantly superior performance to the state-of-the-art NVCs. Code will be released in: this https URL.
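[AI-67] Neural Video Compression with In-Loop Contextual Filtering and Out-of-Loop Reconstruction Enhancement
Quick read: This paper addresses the loss of coding efficiency caused by error propagation in neural video compression, and in particular how to reduce bit rate while keeping reconstruction quality high. The core solution is a systematic categorization of enhancement filtering techniques into in-loop contextual filtering and out-of-loop reconstruction enhancement, together with an adaptive coding decision strategy that dynamically determines when filtering is applied during encoding, which mitigates error propagation across frames. Out-of-loop enhancement further refines the reconstructed frames. In experiments the approach achieves a 7.71% bit rate reduction compared with state-of-the-art neural video codecs.
Link: https://arxiv.org/abs/2509.04051
Authors: Yaojun Wu,Chaoyi Lin,Yiming Wang,Semih Esenlik,Zhaobin Zhang,Kai Zhang,Li Zhang
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures, Accepted to ACMMM 2025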
Abstract:This paper explores the application of enhancement filtering techniques in neural video compression. Specifically, we categorize these techniques into in-loop contextual filtering and out-of-loop reconstruction enhancement based on whether the enhanced representation affects the subsequent coding loop. In-loop contextual filtering refines the temporal context by mitigating error propagation during frame-by-frame encoding. However, its influence on both the current and subsequent frames poses challenges in adaptively applying filtering throughout the sequence. To address this, we introduce an adaptive coding decision strategy that dynamically determines filtering application during encoding. Additionally, out-of-loop reconstruction enhancement is employed to refine the quality of reconstructed frames, providing a simple yet effective improvement in coding efficiency. To the best of our knowledge, this work presents the first systematic study of enhancement filtering in the context of conditional-based neural video compression. Extensive experiments demonstrate a 7.71% reduction in bit rate compared to state-of-the-art neural video codecs, validating the effectiveness of the proposed approach.
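[AI-68] Diffusion Generative Models Meet Compressed Sensing with Applications to Image Data and Financial Time Series
Quick read: This paper addresses the low inference efficiency of diffusion models in synthetic data generation. The core solution integrates compressed sensing into the diffusion model framework in three steps: compress the data into a low-dimensional latent space, train a diffusion model in that latent space, and apply a compressed sensing algorithm to recover the samples generated in the latent space. Under suitable sparsity assumptions on the data, combining diffusion model inference with sparse recovery is proved to converge faster, and an optimal latent space dimension is derived, improving the efficiency of both training and inference.
Link: https://arxiv.org/abs/2509.03898
Authors: Zhengyi Guo,Jiatu Li,Wenpin Tang,David D. Yao
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: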
Abstract:This paper develops dimension reduction techniques for accelerating diffusion model inference in the context of synthetic data generation. The idea is to integrate compressed sensing into diffusion models: (i) compress the data into a latent space, (ii) train a diffusion model in the latent space, and (iii) apply a compressed sensing algorithm to the samples generated in the latent space, facilitating the efficiency of both model training and inference. Under suitable sparsity assumptions on data, the proposed algorithm is proved to enjoy faster convergence by combining diffusion model inference with sparse recovery. As a byproduct, we obtain an optimal value for the latent space dimension. We also conduct numerical experiments on a range of datasets, including image data (handwritten digits, medical images, and climate data) and financial time series for stress testing.
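[AI-69] Natural Latents: Latent Variables Stable Across Ontologies
Quick read: This paper considers two Bayesian agents that learn generative models of the same environment and have converged on the same predictive distribution but may use different latent variables, and asks how to guarantee that one agent's latent variables can be expressed as a function of the other's. The key to the solution is the "natural latent conditions", under which translatability between latent variables is proved to hold. The paper further shows that, absent additional constraints, these are the most general conditions that guarantee translatability. The results are robust to approximation error in the natural latent conditions, which matters for practical application.
Link: https://arxiv.org/abs/2509.03780
Authors: John Wentworth,David Lorell
Affiliation: Unknown
Categories: Probability (math.PR); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: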
Abstract:Suppose two Bayesian agents each learn a generative model of the same environment. We will assume the two have converged on the predictive distribution, i.e. distribution over some observables in the environment, but may have different generative models containing different latent variables. Under what conditions can one agent guarantee that their latents are a function of the other agent's latents? We give simple conditions under which such translation is guaranteed to be possible: the natural latent conditions. We also show that, absent further constraints, these are the most general conditions under which translatability is guaranteed. Crucially for practical application, our theorems are robust to approximation error in the natural latent conditions.
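[AI-70] BiND: A Neural Discriminator-Decoder for Accurate Bimanual Trajectory Prediction in Brain-Computer Interfaces
Quick read: This paper addresses the decoding of bimanual hand movements from intracortical recordings, a key problem for Brain-Computer Interfaces (BCIs) that is made difficult by overlapping neural representations and nonlinear interactions between limbs. The core solution is BiND (Bimanual Neural Discriminator-Decoder), a two-stage model. The first stage classifies the motion type (unimanual left, unimanual right, or bimanual). The second stage uses specialized decoders based on Gated Recurrent Units (GRUs), augmented with a trial-relative time index for better temporal modeling, to predict continuous 2D hand velocities. The design improves both the accuracy of bimanual decoding and its stability across sessions.
Link: https://arxiv.org/abs/2509.03521
Authors: Timothee Robert,MohammadAli Shaeri,Mahsa Shoaran
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Accepted for publication in IEEE Neural Engineering (NER) Conference’25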
Abstract:Decoding bimanual hand movements from intracortical recordings remains a critical challenge for brain-computer interfaces (BCIs), due to overlapping neural representations and nonlinear interlimb interactions. We introduce BiND (Bimanual Neural Discriminator-Decoder), a two-stage model that first classifies motion type (unimanual left, unimanual right, or bimanual) and then uses specialized GRU-based decoders, augmented with a trial-relative time index, to predict continuous 2D hand velocities. We benchmark BiND against six state-of-the-art models (SVR, XGBoost, FNN, CNN, Transformer, GRU) on a publicly available 13-session intracortical dataset from a tetraplegic patient. BiND achieves a mean R^2 of 0.76 (±0.01) for unimanual and 0.69 (±0.03) for bimanual trajectory prediction, surpassing the next-best model (GRU) by 2% in both tasks. It also demonstrates greater robustness to session variability than all other benchmarked models, with accuracy improvements of up to 4% compared to GRU in cross-session analyses. This highlights the effectiveness of task-aware discrimination and temporal modeling in enhancing bimanual decoding.
Machine Learning
[LG-0] Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment
Link: https://arxiv.org/abs/2509.04445
Authors: Cyrus Cousins,Vijay Keswani,Vincent Conitzer,Hoda Heidari,Jana Schaich Borg,Walter Sinnott-Armstrong
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent AI work trends towards incorporating human-centric objectives, with the explicit goal of aligning AI models to personal preferences and societal values. Using standard preference elicitation methods, researchers and practitioners build models of human decisions and judgments, which are then used to align AI behavior with that of humans. However, models commonly used in such elicitation processes often do not capture the true cognitive processes of human decision making, such as when people use heuristics to simplify information associated with a decision problem. As a result, models learned from people’s decisions often do not align with their cognitive processes, and cannot be used to validate the learning framework for generalization to other decision-making tasks. To address this limitation, we take an axiomatic approach to learning cognitively faithful decision processes from pairwise comparisons. Building on the vast literature characterizing the cognitive processes that contribute to human decision-making, and recent work characterizing such processes in pairwise comparison tasks, we define a class of models in which individual features are first processed and compared across alternatives, and the processed features are then aggregated via a fixed rule, such as the Bradley-Terry rule. This structured processing of information ensures such models are realistic and feasible candidates to represent underlying human decision-making processes. We demonstrate the efficacy of this modeling approach in learning interpretable models of human decision making in a kidney allocation task, and show that our proposed models match or surpass the accuracy of prior models of human pairwise decision-making.
[LG-1] Unveiling the Role of Data Uncertainty in Tabular Deep Learning
Link: https://arxiv.org/abs/2509.04430
Authors: Nikolay Kartashev,Ivan Rubachev,Artem Babenko
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent advancements in tabular deep learning have demonstrated exceptional practical performance, yet the field often lacks a clear understanding of why these techniques actually succeed. To address this gap, our paper highlights the importance of the concept of data uncertainty for explaining the effectiveness of the recent tabular DL methods. In particular, we reveal that the success of many beneficial design choices in tabular DL, such as numerical feature embeddings, retrieval-augmented models and advanced ensembling strategies, can be largely attributed to their implicit mechanisms for managing high data uncertainty. By dissecting these mechanisms, we provide a unifying understanding of the recent performance improvements. Furthermore, the insights derived from this data-uncertainty perspective directly allowed us to develop more effective numerical feature embeddings as an immediate practical outcome of our analysis. Overall, our work paves the way to foundational understanding of the benefits introduced by modern tabular methods that results in the concrete advancements of existing techniques and outlines future research directions for tabular DL.
[LG-2] Echo State Networks as State-Space Models: A Systems Perspective
Link: https://arxiv.org/abs/2509.04422
Authors: Pradeep Singh,Balasubramanian Raman
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 1 figure
Abstract:Echo State Networks (ESNs) are typically presented as efficient, readout-trained recurrent models, yet their dynamics and design are often guided by heuristics rather than first principles. We recast ESNs explicitly as state-space models (SSMs), providing a unified systems-theoretic account that links reservoir computing with classical identification and modern kernelized SSMs. First, we show that the echo-state property is an instance of input-to-state stability for a contractive nonlinear SSM and derive verifiable conditions in terms of leak, spectral scaling, and activation Lipschitz constants. Second, we develop two complementary mappings: (i) small-signal linearizations that yield locally valid LTI SSMs with interpretable poles and memory horizons; and (ii) lifted/Koopman random-feature expansions that render the ESN a linear SSM in an augmented state, enabling transfer-function and convolutional-kernel analyses. This perspective yields frequency-domain characterizations of memory spectra and clarifies when ESNs emulate structured SSM kernels. Third, we cast teacher forcing as state estimation and propose Kalman/EKF-assisted readout learning, together with EM for hyperparameters (leak, spectral radius, process/measurement noise) and a hybrid subspace procedure for spectral shaping under contraction constraints.
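To make the state-space reading concrete, here is a minimal sketch of a leaky ESN update written as a nonlinear state-space recursion, with a numerical check of the echo-state property. It relies on a standard sufficient condition rather than the paper's own (which the abstract does not spell out): because tanh is 1-Lipschitz, the update is a contraction in the max norm whenever (1 - leak) + leak * ||W||_inf < 1. The weights, leak rate, and input signal are arbitrary values chosen for illustration.

```python
import math

LEAK = 0.5
W = [[0.3, -0.2], [0.1, 0.4]]  # reservoir matrix, max absolute row sum = 0.5
W_IN = [0.5, -0.5]             # input weights

def esn_step(x, u):
    """Leaky ESN update: x' = (1 - leak) * x + leak * tanh(W x + W_in u)."""
    pre = [sum(w * xj for w, xj in zip(row, x)) + b * u
           for row, b in zip(W, W_IN)]
    return [(1.0 - LEAK) * xi + LEAK * math.tanh(p) for xi, p in zip(x, pre)]

def run(x0, inputs):
    """Drive the reservoir with an input sequence and return the final state."""
    x = list(x0)
    for u in inputs:
        x = esn_step(x, u)
    return x

# Same input sequence, two different initial states.
inputs = [math.sin(0.3 * t) for t in range(50)]
xa = run([1.0, -1.0], inputs)
xb = run([-1.0, 1.0], inputs)

# Contraction factor here is 0.5 + 0.5 * 0.5 = 0.75 per step, so the gap
# between the two trajectories decays geometrically.
gap = max(abs(a - b) for a, b in zip(xa, xb))
```

The two trajectories forget their initial conditions, which is the behaviour the echo-state property describes: the state becomes a function of the input history alone.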
[LG-3] Interpretable Clustering with Adaptive Heterogeneous Causal Structure Learning in Mixed Observational Data
Link: https://arxiv.org/abs/2509.04415
Authors: Wenrui Li,Qinghao Zhang,Xiaowo Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering with Adaptive Heterogeneous Causal Structure Learning), that jointly infers latent clusters and their associated causal structures from mixed-type observational data without requiring temporal ordering, environment labels, interventions or other prior knowledge. HCL relaxes the homogeneity and sufficiency assumptions by introducing an equivalent representation that encodes both structural heterogeneity and confounding. It further develops a bi-directional iterative strategy to alternately refine causal clustering and structure learning, along with a self-supervised regularization that balances cross-cluster universality and specificity. Together, these components enable convergence toward interpretable, heterogeneous causal patterns. Theoretically, we show identifiability of heterogeneous causal structures under mild conditions. Empirically, HCL achieves superior performance in both clustering and structure learning tasks, and recovers biologically meaningful mechanisms in real-world single-cell perturbation data, demonstrating its utility for discovering interpretable, mechanism-level causal heterogeneity.
[LG-4] SAFE–MA–RRT: Multi-Agent Motion Planning with Data-Driven Safety Certificates
Link: https://arxiv.org/abs/2509.04413
Authors: Babak Esmaeili,Hamidreza Modares
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Optimization and Control (math.OC)
Comments: Submitted to IEEE Transactions on Automation Science and Engineering
Abstract:This paper proposes a fully data-driven motion-planning framework for homogeneous linear multi-agent systems that operate in shared, obstacle-filled workspaces without access to explicit system models. Each agent independently learns its closed-loop behavior from experimental data by solving convex semidefinite programs that generate locally invariant ellipsoids and corresponding state-feedback gains. These ellipsoids, centered along grid-based waypoints, certify the dynamic feasibility of short-range transitions and define safe regions of operation. A sampling-based planner constructs a tree of such waypoints, where transitions are allowed only when adjacent ellipsoids overlap, ensuring invariant-to-invariant transitions and continuous safety. All agents expand their trees simultaneously and are coordinated through a space-time reservation table that guarantees inter-agent safety by preventing simultaneous occupancy and head-on collisions. Each successful edge in the tree is equipped with its own local controller, enabling execution without re-solving optimization problems at runtime. The resulting trajectories are not only dynamically feasible but also provably safe with respect to both environmental constraints and inter-agent collisions. Simulation results demonstrate the effectiveness of the approach in synthesizing synchronized, safe trajectories for multiple agents under shared dynamics and constraints, using only data and convex optimization tools.
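The coordination step mentioned in this abstract, a space-time reservation table that rules out simultaneous occupancy and head-on collisions, can be sketched on a plain grid as follows. This is a simplified illustration under stated assumptions: the class name, its methods, and the use of discrete grid cells are hypothetical, and the paper's table operates on waypoints with invariant ellipsoids rather than on bare cells.

```python
class ReservationTable:
    """Toy space-time reservation table for unit-time moves on a grid."""

    def __init__(self):
        self._cells = {}  # (cell, time) -> agent occupying that cell
        self._moves = {}  # (src, dst, t) -> agent moving src -> dst during [t, t+1]

    def is_free(self, agent, src, dst, t):
        # Simultaneous occupancy: another agent already holds dst at t + 1.
        occupant = self._cells.get((dst, t + 1))
        if occupant is not None and occupant != agent:
            return False
        # Head-on collision: another agent traverses the same edge in the
        # opposite direction during the same time step.
        swapper = self._moves.get((dst, src, t))
        if swapper is not None and swapper != agent:
            return False
        return True

    def reserve(self, agent, src, dst, t):
        """Reserve the move if it is conflict-free and report success."""
        if not self.is_free(agent, src, dst, t):
            return False
        self._cells[(dst, t + 1)] = agent
        self._moves[(src, dst, t)] = agent
        return True


table = ReservationTable()
ok_a = table.reserve("A", (0, 0), (0, 1), 0)        # A moves right at t = 0
ok_head_on = table.reserve("B", (0, 1), (0, 0), 0)  # B tries to swap with A
ok_same = table.reserve("B", (1, 1), (0, 1), 0)     # B targets A's cell at t = 1
ok_later = table.reserve("B", (1, 1), (0, 1), 1)    # B enters one step later
```

A planner would call `reserve` before adding an edge to an agent's tree and discard edges for which it returns `False`.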
[LG-5] PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Link: https://arxiv.org/abs/2509.04377
Authors: Krishna Teja Chitty-Venkata,Jie Ye,Xian-He Sun,Anthony Kougkas,Murali Emani,Venkatram Vishwanath,Bogdan Nicolae
Categories: Machine Learning (cs.LG)
Comments: Preprint
Abstract:KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM’s PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.
[LG-6] When three experiments are better than two: Avoiding intractable correlated aleatoric uncertainty by leveraging a novel bias–variance tradeoff
Link: https://arxiv.org/abs/2509.04363
Authors: Paul Scherer,Andreas Kirsch,Jake P. Taylor-King
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures
Abstract:Real-world experimental scenarios are characterized by the presence of heteroskedastic aleatoric uncertainty, and this uncertainty can be correlated in batched settings. The bias–variance tradeoff can be used to write the expected mean squared error between a model distribution and a ground-truth random variable as the sum of an epistemic uncertainty term, the bias squared, and an aleatoric uncertainty term. We leverage this relationship to propose novel active learning strategies that directly reduce the bias between experimental rounds, considering model systems both with and without noise. Finally, we investigate methods to leverage historical data in a quadratic manner through the use of a novel cobias–covariance relationship, which naturally proposes a mechanism for batching through an eigendecomposition strategy. When our difference-based method leveraging the cobias–covariance relationship is utilized in a batched setting (with a quadratic estimator), we outperform a number of canonical methods including BALD and Least Confidence.
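As a reading aid, the three-term decomposition mentioned in the abstract can be written out in its textbook form. The symbols are assumptions introduced here, not the paper's notation: \hat{Y} is the model's prediction treated as a random variable, Y is the ground-truth random variable, and the two are assumed independent so that the cross-covariance term vanishes.

```latex
\mathbb{E}\big[(\hat{Y} - Y)^2\big]
  = \underbrace{\operatorname{Var}(\hat{Y})}_{\text{epistemic}}
  + \underbrace{\big(\mathbb{E}[\hat{Y}] - \mathbb{E}[Y]\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}(Y)}_{\text{aleatoric}}
```

This follows from expanding the variance of \hat{Y} - Y; the correlated, batched case studied in the paper is not covered by this simplified identity.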
[LG-7] Characteristic Energy Behavior Profiling of Non-Residential Buildings
Link: https://arxiv.org/abs/2509.04322
Authors: Haley Dozier,Althea Henslee
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Due to the threat of changing climate and extreme weather events, the infrastructure of the United States Army installations is at risk. More than ever, climate resilience measures are needed to protect facility assets that support critical missions and help generate readiness. As most of the Army installations within the continental United States rely on commercial energy and water sources, resilience to the vulnerabilities within independent energy resources (electricity grids, natural gas pipelines, etc) along with a baseline understanding of energy usage within installations must be determined. This paper will propose a data-driven behavioral model to determine behavior profiles of energy usage on installations. These profiles will be used 1) to create a baseline assessment of the impact of unexpected disruptions on energy systems and 2) to benchmark future resiliency measures. In this methodology, individual building behavior will be represented with models that can accurately analyze, predict, and cluster multimodal data collected from energy usage of non-residential buildings. Due to the nature of Army installation energy usage data, similarly structured open access data will be used to illustrate this methodology.
[LG-8] Using causal abstractions to accelerate decision-making in complex bandit problems
Link: https://arxiv.org/abs/2509.04296
Authors: Joel Dyer,Nicholas Bishop,Anisoara Calinescu,Michael Wooldridge,Fabio Massimo Zennaro
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Although real-world decision-making problems can often be encoded as causal multi-armed bandits (CMABs) at different levels of abstraction, a general methodology exploiting the information and computational advantages of each abstraction level is missing. In this paper, we propose AT-UCB, an algorithm which efficiently exploits shared information between CMAB problem instances defined at different levels of abstraction. More specifically, AT-UCB leverages causal abstraction (CA) theory to explore within a cheap-to-simulate and coarse-grained CMAB instance, before employing the traditional upper confidence bound (UCB) algorithm on a restricted set of potentially optimal actions in the CMAB of interest, leading to significant reductions in cumulative regret when compared to the classical UCB algorithm. We illustrate the advantages of AT-UCB theoretically, through a novel upper bound on the cumulative regret, and empirically, by applying AT-UCB to epidemiological simulators with varying resolution and computational cost.
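For readers unfamiliar with the classical baseline named in the abstract, here is a minimal sketch of the standard UCB1 rule run on a restricted set of arms. It is not AT-UCB: the abstraction-based pre-selection step is not reproduced, and the arm names, reward probabilities, and the `pull` function are hypothetical values chosen for illustration.

```python
import math
import random


def ucb1(pull, arms, horizon, seed=0):
    """Run classical UCB1 over `arms` for `horizon` rounds.

    `pull(arm, rng)` must return a reward in [0, 1].
    Returns per-arm pull counts and empirical mean rewards.
    """
    rng = random.Random(seed)
    counts = {a: 0 for a in arms}
    means = {a: 0.0 for a in arms}
    for t in range(1, horizon + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            # Play every arm once before using confidence bounds.
            arm = untried[0]
        else:
            # Pick the arm with the largest upper confidence bound.
            arm = max(
                arms,
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = pull(arm, rng)
        counts[arm] += 1
        # Incremental update of the empirical mean.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


# Hypothetical Bernoulli bandit with four arms.
TRUE_MEANS = {"a": 0.2, "b": 0.5, "c": 0.8, "d": 0.75}


def pull(arm, rng):
    return 1.0 if rng.random() < TRUE_MEANS[arm] else 0.0


# Assume some earlier, cheaper exploration step already narrowed the
# candidates to these two arms (this restriction is hard-coded here).
restricted = ["c", "d"]
counts, means = ucb1(pull, restricted, horizon=2000)
```

Running UCB on fewer arms means fewer forced exploratory pulls, which is the intuition behind restricting the action set before the expensive problem is played.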
[LG-9] A Primer on Causal and Statistical Dataset Biases for Fair and Robust Image Analysis
Link: https://arxiv.org/abs/2509.04295
Authors: Charles Jones,Ben Glocker
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments: Excerpt from C. Jones’ PhD thesis. Winner of the G-Research PhD prize 2025
Abstract:Machine learning methods often fail when deployed in the real world. Worse still, they fail in high-stakes situations and across socially sensitive lines. These issues have a chilling effect on the adoption of machine learning methods in settings such as medical diagnosis, where they are arguably best-placed to provide benefits if safely deployed. In this primer, we introduce the causal and statistical structures which induce failure in machine learning methods for image analysis. We highlight two previously overlooked problems, which we call the "no fair lunch" problem and the "subgroup separability" problem. We elucidate why today’s fair representation learning methods fail to adequately solve them and propose potential paths forward for the field.
[LG-10] An Interactive Framework for Finding the Optimal Trade-off in Differential Privacy
Link: https://arxiv.org/abs/2509.04290
Authors: Yaohong Yang,Aki Rehn,Sammie Katt,Antti Honkela,Samuel Kaski
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 12 figures
Abstract:Differential privacy (DP) is the standard for privacy-preserving analysis, and introduces a fundamental trade-off between privacy guarantees and model performance. Selecting the optimal balance is a critical challenge that can be framed as a multi-objective optimization (MOO) problem where one first discovers the set of optimal trade-offs (the Pareto front) and then learns a decision-maker’s preference over them. While a rich body of work on interactive MOO exists, the standard approach – modeling the objective functions with generic surrogates and learning preferences from simple pairwise feedback – is inefficient for DP because it fails to leverage the problem’s unique structure: a point on the Pareto front can be generated directly by maximizing accuracy for a fixed privacy level. Motivated by this property, we first derive the shape of the trade-off theoretically, which allows us to model the Pareto front directly and efficiently. To address inefficiency in preference learning, we replace pairwise comparisons with a more informative interaction. In particular, we present the user with hypothetical trade-off curves and ask them to pick their preferred trade-off. Our experiments on differentially private logistic regression and deep transfer learning across six real-world datasets show that our method converges to the optimal privacy-accuracy trade-off with significantly less computational cost and user interaction than baselines.
[LG-11] RL’s Razor: Why Online Reinforcement Learning Forgets Less
Link: https://arxiv.org/abs/2509.04259
Authors: Idan Shenfeld,Jyothish Pari,Pulkit Agrawal
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle RL’s Razor: among all ways to solve a new task, RL prefers those closest in KL to the original model.
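To make the KL-based measure of distributional shift concrete, the following is a small sketch that compares two hypothetical fine-tuned policies against a base policy on a single prompt. The three-token distributions are invented for illustration, and the direction KL(fine-tuned || base) is an assumption; the abstract does not spell out the direction or the estimator used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 3-token vocabulary.
base = [0.50, 0.30, 0.20]   # base policy
near = [0.45, 0.35, 0.20]   # fine-tuned policy that stays close to the base
far = [0.05, 0.05, 0.90]    # fine-tuned policy that drifts far from the base

kl_near = kl_divergence(near, base)
kl_far = kl_divergence(far, base)
```

Under the principle stated in the abstract, a solution like `near` (small KL from the base) would be expected to forget less than one like `far`, provided both solve the new task.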
[LG-12] Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models
Link: https://arxiv.org/abs/2509.04245
Authors: Chanon Puttanawarut,Natcha Fongsrisin,Porntep Amornritvanich,Cholatid Ratanatharathorn,Panu Looareesuwan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Background: Heart failure (HF) research is constrained by limited access to large, shareable datasets due to privacy regulations and institutional barriers. Synthetic data generation offers a promising solution to overcome these challenges while preserving patient confidentiality. Methods: We generated synthetic HF datasets from institutional data comprising 12,552 unique patients using five deep learning models: tabular variational autoencoder (TVAE), normalizing flow, ADSGAN, SurvivalGAN, and tabular denoising diffusion probabilistic models (TabDDPM). We comprehensively evaluated synthetic data utility through statistical similarity metrics, survival prediction using machine learning, and privacy assessments. Results: SurvivalGAN and TabDDPM demonstrated high fidelity to the original dataset, exhibiting similar variable distributions and survival curves after applying histogram equalization. SurvivalGAN (C-indices: 0.71-0.76) and TVAE (C-indices: 0.73-0.76) achieved the strongest performance in survival prediction evaluation, closely matching real data performance (C-indices: 0.73-0.76). Privacy evaluation confirmed protection against re-identification attacks. Conclusions: Deep learning-based synthetic data generation can produce high-fidelity, privacy-preserving HF datasets suitable for research applications. This publicly available synthetic dataset addresses critical data sharing barriers and provides a valuable resource for advancing HF research and predictive modeling.
[LG-13] Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation
Link: https://arxiv.org/abs/2509.04232
Authors: Qifeng Tan,Shusen Yang,Xuebin Ren,Yikai Zhang (Xi’an Jiaotong University)
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Layer-wise Gaussian mechanisms (LGM) enhance flexibility in differentially private deep learning by injecting noise into partitioned gradient vectors. However, existing methods often rely on heuristic noise allocation strategies, lacking a rigorous understanding of their theoretical grounding in connecting noise allocation to formal privacy-utility tradeoffs. In this paper, we present a unified analytical framework that systematically connects layer-wise noise injection strategies with their implicit optimization objectives and associated privacy budget allocations. Our analysis reveals that several existing approaches optimize ill-posed objectives – either ignoring inter-layer signal-to-noise ratio (SNR) consistency or leading to inefficient use of the privacy budget. In response, we propose a SNR-Consistent noise allocation strategy that unifies both aspects, yielding a noise allocation scheme that achieves better signal preservation and more efficient privacy budget utilization. Extensive experiments in both centralized and federated learning settings demonstrate that our method consistently outperforms existing allocation strategies, achieving better privacy-utility tradeoffs. Our framework not only offers diagnostic insights into prior methods but also provides theoretical guidance for designing adaptive and effective noise injection schemes in deep models.
[LG-14] Rethinking the long-range dependency in Mamba/SSM and transformer models
Link: https://arxiv.org/abs/2509.04226
Authors: Cong Ma,Kayvan Najarian
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Long-range dependency is one of the most desired properties of recent sequence models such as state-space models (particularly Mamba) and transformer models. New model architectures are being actively developed and benchmarked for prediction tasks requiring long-range dependency. However, the capability of modeling long-range dependencies of these models has not been investigated from a theoretical perspective, which hinders a systematic improvement on this aspect. In this work, we mathematically define long-range dependency using the derivative of hidden states with respect to past inputs and compare the capability of SSM and transformer models of modeling long-range dependency based on this definition. We showed that the long-range dependency of SSM decays exponentially with the sequence length, which aligns with the exponential decay of memory function in RNN. But the attention mechanism used in transformers is more flexible and is not constrained to exponential decay, which could in theory perform better at modeling long-range dependency with sufficient training data, computing resources, and proper training. To combine the flexibility of long-range dependency of attention mechanism and computation efficiency of SSM, we propose a new formulation for hidden state update in SSM and prove its stability under a standard Gaussian distribution of the input data.
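The derivative-based definition in the abstract can be illustrated on the simplest case, a time-invariant linear state-space recurrence. This is a simplification assumed here for illustration (Mamba's input-dependent parameters are not modeled), with A the state matrix, B the input matrix, and k the distance into the past.

```latex
h_t = A h_{t-1} + B x_t
\quad\Longrightarrow\quad
\frac{\partial h_t}{\partial x_{t-k}} = A^{k} B,
\qquad
\left\lVert \frac{\partial h_t}{\partial x_{t-k}} \right\rVert
\le \lVert A \rVert^{k} \, \lVert B \rVert .
```

When \lVert A \rVert < 1, the bound shrinks geometrically in k, which is the kind of exponential decay the abstract attributes to SSMs. Attention weights are not tied to a power of a fixed matrix in this way.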
[LG-15] Why Can’t I See My Clusters? A Precision-Recall Approach to Dimensionality Reduction Validation
Link: https://arxiv.org/abs/2509.04222
Authors: Diede P. M. van der Hoorn,Alessio Arleo,Fernando V. Paulovich
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Dimensionality Reduction (DR) is widely used for visualizing high-dimensional data, often with the goal of revealing expected cluster structure. However, such a structure may not always appear in the projections. Existing DR quality metrics assess projection reliability (to some extent) or cluster structure quality, but do not explain why expected structures are missing. Visual Analytics solutions can help, but are often time-consuming due to the large hyperparameter space. This paper addresses this problem by leveraging a recent framework that divides the DR process into two phases: a relationship phase, where similarity relationships are modeled, and a mapping phase, where the data is projected accordingly. We introduce two supervised metrics, precision and recall, to evaluate the relationship phase. These metrics quantify how well the modeled relationships align with an expected cluster structure based on some set of labels representing this structure. We illustrate their application using t-SNE and UMAP, and validate the approach through various usage scenarios. Our approach can guide hyperparameter tuning, uncover projection artifacts, and determine if the expected structure is captured in the relationships, making the DR process faster and more reliable.
[LG-16] Sailing Towards Zero-Shot State Estimation using Foundation Models Combined with a UKF
Link: https://arxiv.org/abs/2509.04213
Authors: Tobin Holtmann,David Stenger,Andres Posada-Moreno,Friedrich Solowjow,Sebastian Trimpe
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Accepted for publication at CDC2025
Abstract:State estimation in control and systems engineering traditionally requires extensive manual system identification or data-collection effort. However, transformer-based foundation models in other domains have reduced data requirements by leveraging pre-trained generalist models. Ultimately, developing zero-shot foundation models of system dynamics could drastically reduce manual deployment effort. While recent work shows that transformer-based end-to-end approaches can achieve zero-shot performance on unseen systems, they are limited to sensor models seen during training. We introduce the foundation model unscented Kalman filter (FM-UKF), which combines a transformer-based model of system dynamics with analytically known sensor models via an UKF, enabling generalization across varying dynamics without retraining for new sensor configurations. We evaluate FM-UKF on a new benchmark of container ship models with complex dynamics, demonstrating a competitive accuracy, effort, and robustness trade-off compared to classical methods with approximate system knowledge and to an end-to-end approach. The benchmark and dataset are open sourced to further support future research in zero-shot state estimation via foundation models.
[LG-17] COBRA: Multimodal Sensing Deep Learning Framework for Remote Chronic Obesity Management via Wrist-Worn Activity Monitoring
Link: https://arxiv.org/abs/2509.04210
Authors: Zhengyang Shen(1),Bo Gao(1),Mayue Shi(1, 2) ((1) Department of Electrical and Electronic Engineering, Imperial College London, UK, (2) Institute of Biomedical Engineering, University of Oxford, UK)
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures. *Correspondence: m.shi16@imperial. this http URL . Accepted by the IUPESM World Congress on Medical Physics and Biomedical Engineering 2025
Abstract:Chronic obesity management requires continuous monitoring of energy balance behaviors, yet traditional self-reported methods suffer from significant underreporting and recall bias, and difficulty in integration with modern digital health systems. This study presents COBRA (Chronic Obesity Behavioral Recognition Architecture), a novel deep learning framework for objective behavioral monitoring using wrist-worn multimodal sensors. COBRA integrates a hybrid D-Net architecture combining U-Net spatial modeling, multi-head self-attention mechanisms, and BiLSTM temporal processing to classify daily activities into four obesity-relevant categories: Food Intake, Physical Activity, Sedentary Behavior, and Daily Living. Validated on the WISDM-Smart dataset with 51 subjects performing 18 activities, COBRA’s optimal preprocessing strategy combines spectral-temporal feature extraction, achieving high performance across multiple architectures. D-Net demonstrates 96.86% overall accuracy with category-specific F1-scores of 98.55% (Physical Activity), 95.53% (Food Intake), 94.63% (Sedentary Behavior), and 98.68% (Daily Living), outperforming state-of-the-art baselines by 1.18% in accuracy. The framework shows robust generalizability with low demographic variance (3%), enabling scalable deployment for personalized obesity interventions and continuous lifestyle monitoring.
[LG-18] One-Embedding-Fits-All: Efficient Zero-Shot Time Series Forecasting by a Model Zoo
Link: https://arxiv.org/abs/2509.04208
Authors: Hao-Nan Shi,Ting-Ji Huang,Lu Han,De-Chuan Zhan,Han-Jia Ye
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The proliferation of Time Series Foundation Models (TSFMs) has significantly advanced zero-shot forecasting, enabling predictions for unseen time series without task-specific fine-tuning. Extensive research has confirmed that no single TSFM excels universally, as different models exhibit preferences for distinct temporal patterns. This diversity suggests an opportunity: how to take advantage of the complementary abilities of TSFMs. To this end, we propose ZooCast, which characterizes each model’s distinct forecasting strengths. ZooCast can intelligently assemble current TSFMs into a model zoo that dynamically selects optimal models for different forecasting tasks. Our key innovation lies in the One-Embedding-Fits-All paradigm that constructs a unified representation space where each model in the zoo is represented by a single embedding, enabling efficient similarity matching for all tasks. Experiments demonstrate ZooCast’s strong performance on the GIFT-Eval zero-shot forecasting benchmark while maintaining the efficiency of a single TSFM. In real-world scenarios with sequential model releases, the framework seamlessly adds new models for progressive accuracy gains with negligible overhead.
[LG-19] KubeGuard: LLM-Assisted Kubernetes Hardening via Configuration Files and Runtime Logs Analysis
Link: https://arxiv.org/abs/2509.04191
Authors: Omri Sgan Cohen,Ehud Malul,Yair Meidan,Dudu Mimran,Yuval Elovici,Asaf Shabtai
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The widespread adoption of Kubernetes (K8s) for orchestrating cloud-native applications has introduced significant security challenges, such as misconfigured resources and overly permissive configurations. Failing to address these issues can result in unauthorized access, privilege escalation, and lateral movement within clusters. Most existing K8s security solutions focus on detecting misconfigurations, typically through static analysis or anomaly detection. In contrast, this paper presents KubeGuard, a novel runtime log-driven recommender framework aimed at mitigating risks by addressing overly permissive configurations. KubeGuard is designed to harden K8s environments through two complementary tasks: Resource Creation and Resource Refinement. It leverages large language models (LLMs) to analyze manifests and runtime logs reflecting actual system behavior, using modular prompt-chaining workflows. This approach enables KubeGuard to create least-privilege configurations for new resources and refine existing manifests to reduce the attack surface. KubeGuard’s output manifests are presented as recommendations that users (e.g., developers and operators) can review and adopt to enhance cluster security. Our evaluation demonstrates that KubeGuard effectively generates and refines K8s manifests for Roles, NetworkPolicies, and Deployments, leveraging both proprietary and open-source LLMs. The high precision, recall, and F1-scores affirm KubeGuard’s practicality as a framework that translates runtime observability into actionable, least-privilege configuration guidance.
[LG-20] Set Block Decoding is a Language Model Inference Accelerator
Link: https://arxiv.org/abs/2509.04185
Authors: Itai Gat,Heli Ben-Hamu,Marton Havasi,Daniel Haziza,Jeremy Reizenstein,Gabriel Synnaeve,David Lopez-Paz,Brian Karrer,Yaron Lipman
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.
[LG-21] Comment on “A Note on Over-Smoothing for Graph Neural Networks”
Link: https://arxiv.org/abs/2509.04178
Authors: Razi Hasson,Reuven Guetta
Categories: Machine Learning (cs.LG)
Comments: Comment on arXiv:2006.13318 (Cai and Wang, 2020). Revisits their Dirichlet-energy analysis of over-smoothing and extends it to Leaky-ReLU and spectral polynomial filters; includes Proposition 7.1 and a new proof of Lemma 3.3 for Leaky-ReLU. 7 pages
Abstract:We comment on Cai and Wang (2020, arXiv:2006.13318), who analyze over-smoothing in GNNs via Dirichlet energy. We show that under mild spectral conditions (including with Leaky-ReLU), the Dirichlet energy of node embeddings decreases exponentially with depth; we further extend the result to spectral polynomial filters and provide a short proof for the Leaky-ReLU case. Experiments on edge deletion and weight amplification illustrate when Dirichlet energy increases, hinting at practical ways to relieve over-smoothing.
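To illustrate the quantity being tracked, here is a minimal numerical sketch of Dirichlet energy shrinking under repeated graph propagation. It makes simplifying assumptions not taken from the paper: propagation is purely linear (no weight matrices, no Leaky-ReLU), the graph is a hypothetical 5-node path, and the energy is computed as trace(X^T (I - P) X) with P the symmetrically normalized adjacency with self-loops.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def dirichlet_energy(x, p):
    """Dirichlet energy trace(X^T (I - P) X) of node features X."""
    lap = np.eye(p.shape[0]) - p
    return float(np.trace(x.T @ lap @ x))


# Hypothetical 5-node path graph: 0 - 1 - 2 - 3 - 4.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

p = normalized_adjacency(adj)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # random initial node embeddings

# Record the energy before each of six linear propagation steps X <- P X.
energies = []
for _ in range(6):
    energies.append(dirichlet_energy(x, p))
    x = p @ x
```

In this linear setting every eigen-component of the energy is multiplied by a squared eigenvalue of P at each step, so the recorded values decrease as layers are stacked. The paper's experiments on edge deletion and weight amplification concern cases where this decrease can be countered.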
[LG-22] Unobtrusive In-Situ Measurement of Behavior Change by Deep Metric Similarity Learning of Motion Patterns
Link: https://arxiv.org/abs/2509.04174
Authors: Christian Merz,Lukas Schach,Marie Luisa Fiedler,Jean-Luc Lugrin,Carolin Wienrich,Marc Erich Latoschik
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:This paper introduces an unobtrusive in-situ measurement method to detect user behavior changes during arbitrary exposures in XR systems. Here, such behavior changes are typically associated with the Proteus effect or bodily affordances elicited by different avatars that the users embody in XR. We present a biometric user model based on deep metric similarity learning, which uses high-dimensional embeddings as reference vectors to identify behavior changes of individual users. We evaluate our model against two alternative approaches: a (non-learned) motion analysis based on central tendencies of movement patterns and subjective post-exposure embodiment questionnaires frequently used in various XR exposures. In a within-subject study, participants performed a fruit collection task while embodying avatars of different body heights (short, actual-height, and tall). Subjective assessments confirmed the effective manipulation of perceived body schema, while the (non-learned) objective analyses of head and hand movements revealed significant differences across conditions. Our similarity learning model trained on the motion data successfully identified the elicited behavior change for various query and reference data pairings of the avatar conditions. The approach has several advantages in comparison to existing methods: 1) In-situ measurement without additional user input, 2) generalizable and scalable motion analysis for various use cases, 3) user-specific analysis on the individual level, and 4) with a trained model, users can be added and evaluated in real time to study how avatar changes affect behavior.
[LG-23] Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference
Link: https://arxiv.org/abs/2509.04169
Authors: Nicolas Johansson(1),Tobias Olsson(1),Daniel Nilsson(2),Johan Östman(2),Fazeleh Hoseini(2) ((1) Chalmers University of Technology, (2) AI Sweden)
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Membership inference attacks (MIAs) aim to determine whether specific data were used to train a model. While extensively studied on classification models, their impact on time series forecasting remains largely unexplored. We address this gap by introducing two new attacks: (i) an adaptation of multivariate LiRA, a state-of-the-art MIA originally developed for classification models, to the time-series forecasting setting, and (ii) a novel end-to-end learning approach called Deep Time Series (DTS) attack. We benchmark these methods against adapted versions of other leading attacks from the classification setting. We evaluate all attacks in realistic settings on the TUH-EEG and ELD datasets, targeting two strong forecasting architectures, LSTM and the state-of-the-art N-HiTS, under both record- and user-level threat models. Our results show that forecasting models are vulnerable, with user-level attacks often achieving perfect detection. The proposed methods achieve the strongest performance in several settings, establishing new baselines for privacy risk assessment in time series forecasting. Furthermore, vulnerability increases with longer prediction horizons and smaller training populations, echoing trends observed in large language models.
[LG-24] Who Pays for Fairness? Rethinking Recourse under Social Burden
Link: https://arxiv.org/abs/2509.04128
Authors: Ainhize Barrainkua,Giovanni De Toni,Jose Antonio Lozano,Novi Quadrianto
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:Machine learning based predictions are increasingly used in sensitive decision-making applications that directly affect our lives. This has led to extensive research into ensuring the fairness of classifiers. Beyond just fair classification, emerging legislation now mandates that when a classifier delivers a negative decision, it must also offer actionable steps an individual can take to reverse that outcome. This concept is known as algorithmic recourse. Nevertheless, many researchers have expressed concerns about the fairness guarantees within the recourse process itself. In this work, we provide a holistic theoretical characterization of unfairness in algorithmic recourse, formally linking fairness guarantees in recourse and classification, and highlighting limitations of the standard equal cost paradigm. We then introduce a novel fairness framework based on social burden, along with a practical algorithm (MISOB), broadly applicable under real-world conditions. Empirical results on real-world datasets show that MISOB reduces the social burden across all groups without compromising overall classifier accuracy.
[LG-25] Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference
Link: https://arxiv.org/abs/2509.04112
Authors: Amirmohammad Farzaneh,Matteo Zecchin,Osvaldo Simeone
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.
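As background for the marginal coverage guarantee discussed above, the following is a sketch of plain split conformal prediction for regression. It is not SP-CCI: there are no synthetic counterfactual labels, no RCPS calibration, no PPI debiasing, and no importance weighting. The predictor `predict` and the simulated calibration data are hypothetical.

```python
import math

import numpy as np


def split_conformal_halfwidth(cal_residuals, alpha):
    """Half-width q such that [pred - q, pred + q] has marginal coverage
    of at least 1 - alpha under exchangeability (split conformal)."""
    n = len(cal_residuals)
    # Rank of the finite-sample-corrected quantile among sorted residuals.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: only the trivial interval is valid.
        return float("inf")
    return float(np.sort(cal_residuals)[k - 1])


def predict(x):
    """Hypothetical pre-trained point predictor."""
    return 2.0 * x


# Simulated calibration set (illustrative only).
rng = np.random.default_rng(0)
x_cal = rng.uniform(0.0, 1.0, size=200)
y_cal = 2.0 * x_cal + rng.normal(0.0, 0.1, size=200)

q = split_conformal_halfwidth(np.abs(y_cal - predict(x_cal)), alpha=0.1)

x_new = 0.5
interval = (predict(x_new) - q, predict(x_new) + q)
```

The half-width depends on the calibration residuals only through one order statistic, so a small calibration set gives a coarse and often conservative interval. That is the regime the abstract describes for scarce counterfactual samples.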
[LG-26] Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models
Link: https://arxiv.org/abs/2509.04063
Authors: Hongyin Zhang,Shiyuan Zhang,Junxi Jin,Qixin Zeng,Yifan Qiao,Hongchao Lu,Donglin Wang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Vision-Language-Action (VLA) models based on flow matching have shown excellent performance in general-purpose robotic manipulation tasks. However, the action accuracy of these models on complex downstream tasks is unsatisfactory. One important reason is that these models rely solely on the post-training paradigm of imitation learning, which makes it difficult to have a deeper understanding of the distribution properties of data quality, which is exactly what Reinforcement Learning (RL) excels at. In this paper, we theoretically propose an offline RL post-training objective for VLA flow models and induce an efficient and feasible offline RL fine-tuning algorithm – Adaptive Reinforced Flow Matching (ARFM). By introducing an adaptively adjusted scaling factor in the VLA flow model loss, we construct a principled bias-variance trade-off objective function to optimally control the impact of RL signal on flow loss. ARFM adaptively balances RL advantage preservation and flow loss gradient variance control, resulting in a more stable and efficient fine-tuning process. Extensive simulation and real-world experimental results show that ARFM exhibits excellent generalization, robustness, few-shot learning, and continuous learning performance.
[LG-27] On Aligning Prediction Models with Clinical Experiential Learning: A Prostate Cancer Case Study
链接: https://arxiv.org/abs/2509.04053
作者: Jacqueline J. Vallon,William Overman,Wanqiao Xu,Neil Panjwani,Xi Ling,Sushmita Vij,Hilary P. Bagshaw,John T. Leppert,Sumit Shah,Geoffrey Sonn,Sandy Srinivas,Erqi Pollom,Mark K. Buyyounouski,Mohsen Bayati
类目: Machine Learning (cs.LG)
备注:
Abstract:Over the past decade, the use of machine learning (ML) models in healthcare applications has rapidly increased. Despite high performance, modern ML models do not always capture patterns the end user requires. For example, a model may predict a non-monotonically decreasing relationship between cancer stage and survival, keeping all other features fixed. In this paper, we present a reproducible framework for investigating this misalignment between model behavior and clinical experiential learning, focusing on the effects of underspecification of modern ML pipelines. In a prostate cancer outcome prediction case study, we first identify and address these inconsistencies by incorporating clinical knowledge, collected by a survey, via constraints into the ML model, and subsequently analyze the impact on model performance and behavior across degrees of underspecification. The approach shows that aligning the ML model with clinical experiential learning is possible without compromising performance. Motivated by recent literature in generative AI, we further examine the feasibility of a feedback-driven alignment approach in non-generative AI clinical risk prediction models through a randomized experiment with clinicians. Our findings illustrate that, by eliciting clinicians’ model preferences using our proposed methodology, the larger the difference in how the constrained and unconstrained models make predictions for a patient, the more apparent the difference is in clinical interpretation.
[LG-28] Formal Verification of Local Robustness of a Classification Algorithm for a Spatial Use Case
链接: https://arxiv.org/abs/2509.03948
作者: Delphine Longuet,Amira Elouazzani,Alejandro Penacho Riveiros,Nicola Bastianello
类目: Machine Learning (cs.LG)
备注:
Abstract:Failures in satellite components are costly and challenging to address, often requiring significant human and material resources. Embedding a hybrid AI-based system for fault detection directly in the satellite can greatly reduce this burden by allowing earlier detection. However, such systems must operate with extremely high reliability. To ensure this level of dependability, we employ the formal verification tool Marabou to verify the local robustness of the neural network models used in the AI-based algorithm. This tool allows us to quantify how much a model’s input can be perturbed before its output behavior becomes unstable, thereby improving trustworthiness with respect to its performance under uncertainty.
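To make "local robustness" concrete, the sketch below computes an exact L-infinity robustness radius for a plain linear classifier, where a closed form exists. It is only an illustration under that assumption. It does not use Marabou, which handles the much harder case of nonlinear neural networks, and the weights and input are hypothetical.

```python
def certified_linf_radius(weights, biases, x):
    """Return (predicted class, largest eps) such that every x' with
    max_i |x'_i - x_i| < eps keeps the same predicted class, for a
    linear classifier with logits W x + b."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    top = max(range(len(logits)), key=logits.__getitem__)
    radius = float("inf")
    for j, row in enumerate(weights):
        if j == top:
            continue
        margin = logits[top] - logits[j]
        # An eps-bounded perturbation shifts the margin by at most
        # eps * ||w_top - w_j||_1.
        l1 = sum(abs(a - c) for a, c in zip(weights[top], row))
        if l1 > 0:
            radius = min(radius, margin / l1)
    return top, radius

# Hypothetical two-class model and input.
label, eps = certified_linf_radius([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [3.0, 1.0])
print(label, eps)
```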
[LG-29] LMAE4Eth: Generalizable and Robust Ethereum Fraud Detection by Exploring Transaction Semantics and Masked Graph Embedding
链接: https://arxiv.org/abs/2509.03939
作者: Yifan Jia,Yanbin Wang,Jianguo Sun,Ye Tian,Peng Qian
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Current Ethereum fraud detection methods rely on context-independent, numerical transaction sequences, failing to capture the semantics of account transactions. Furthermore, the pervasive homogeneity in Ethereum transaction records renders it challenging to learn discriminative account embeddings. Moreover, current self-supervised graph learning methods primarily learn node representations through graph reconstruction, resulting in suboptimal performance for node-level tasks like fraud account detection, while these methods also encounter scalability challenges. To tackle these challenges, we propose LMAE4Eth, a multi-view learning framework that fuses transaction semantics, masked graph embedding, and expert knowledge. We first propose a transaction-token contrastive language model (TxCLM) that transforms context-independent numerical transaction records into logically cohesive linguistic representations. To clearly characterize the semantic differences between accounts, we also use a token-aware contrastive learning pre-training objective together with the masked transaction model pre-training objective to learn highly expressive account representations. We then propose a masked account graph autoencoder (MAGAE) using generative self-supervised learning, which achieves superior node-level account detection by focusing on reconstructing account node features. To enable MAGAE to scale for large-scale training, we propose to integrate layer-neighbor sampling into the graph, which reduces the number of sampled vertices severalfold without compromising training quality. Finally, using a cross-attention fusion network, we unify the embeddings of TxCLM and MAGAE to leverage the benefits of both. We evaluate our method against 21 baseline approaches on three datasets. Experimental results show that our method outperforms the best baseline by over 10% in F1-score on two of the datasets.
[LG-30] Sample Efficient Certification of Discrete-Time Control Barrier Functions
链接: https://arxiv.org/abs/2509.03899
作者: Sampath Kumar Mulagaleti,Andrea Del Prete
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
备注: 8 pages, accepted for publication in proceedings of IEEE CDC 2025
Abstract:Control Invariant (CI) sets are instrumental in certifying the safety of dynamical systems. Control Barrier Functions (CBFs) are effective tools to compute such sets, since the zero sublevel sets of CBFs are CI sets. However, computing CBFs generally involves addressing a complex robust optimization problem, which can be intractable. Scenario-based methods have been proposed to simplify this computation. Then, one needs to verify if the CBF actually satisfies the robust constraints. We present an approach to perform this verification that relies on Lipschitz arguments, and forms the basis of a certification algorithm designed for sample efficiency. Through a numerical example, we validated the efficiency of the proposed procedure.
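The Lipschitz argument mentioned in this abstract can be illustrated with a generic one-dimensional sketch: if a function is L-Lipschitz and every sample satisfies h(x_i) + L*r <= 0, where r is the covering radius of the samples, then h <= 0 on the whole set. This is only the textbook covering argument, not the paper's certification algorithm, and the function `h` and its constant are hypothetical.

```python
def certify_nonpositive(h, lipschitz, lower, upper, num_cells):
    """Try to certify h(x) <= 0 on [lower, upper] from num_cells samples.

    Each sample sits at the center of a cell of half-width `radius`, so any
    point in the cell differs from the sample value by at most
    lipschitz * radius.
    """
    radius = (upper - lower) / (2 * num_cells)
    for i in range(num_cells):
        center = lower + (2 * i + 1) * radius
        if h(center) + lipschitz * radius > 0:
            return False  # inconclusive at this resolution
    return True

# Hypothetical constraint: h(x) = x^2 - 1.02 on [-1, 1], where |h'(x)| <= 2.
def h(x):
    return x * x - 1.02

coarse = certify_nonpositive(h, 2.0, -1.0, 1.0, num_cells=4)
fine = certify_nonpositive(h, 2.0, -1.0, 1.0, num_cells=64)
print(coarse, fine)
```

The coarse grid is inconclusive while the fine grid succeeds, which shows why the number of samples needed depends on how close the constraint is to being active.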
[LG-31] Mistake-bounded online learning with operation caps
链接: https://arxiv.org/abs/2509.03892
作者: Jesse Geneson,Meien Li,Linus Tang
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM)
备注:
Abstract:We investigate the mistake-bound model of online learning with caps on the number of arithmetic operations per round. We prove general bounds on the minimum number of arithmetic operations per round that are necessary to learn an arbitrary family of functions with finitely many mistakes. We solve a problem on agnostic mistake-bounded online learning with bandit feedback from (Filmus et al., 2024) and (Geneson & Tang, 2024). We also extend this result to the setting of operation caps.
[LG-32] Topotein: Topological Deep Learning for Protein Representation Learning
链接: https://arxiv.org/abs/2509.03885
作者: Zhiyu Wang,Arian Jamasb,Mustafa Hajij,Alex Morehead,Luke Braithwaite,Pietro Liò
类目: Machine Learning (cs.LG)
备注:
Abstract:Protein representation learning (PRL) is crucial for understanding structure-function relationships, yet current sequence- and graph-based methods fail to capture the hierarchical organization inherent in protein structures. We introduce Topotein, a comprehensive framework that applies topological deep learning to PRL through the novel Protein Combinatorial Complex (PCC) and Topology-Complete Perceptron Network (TCPNet). Our PCC represents proteins at multiple hierarchical levels – from residues to secondary structures to complete proteins – while preserving geometric information at each level. TCPNet employs SE(3)-equivariant message passing across these hierarchical structures, enabling more effective capture of multi-scale structural patterns. Through extensive experiments on four PRL tasks, TCPNet consistently outperforms state-of-the-art geometric graph neural networks. Our approach demonstrates particular strength in tasks such as fold classification which require understanding of secondary structure arrangements, validating the importance of hierarchical topological features for protein analysis.
[LG-33] Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O, and Memory Tradeoffs
链接: https://arxiv.org/abs/2509.03846
作者: Md Rownak Hossain Chowdhury,Mostafizur Rahman
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注:
Abstract:We introduce a mapping framework for deep learning inference that takes advantage of predictable neural network behavior to plan both computation and communication ahead of time. The framework generates a unified stream of instructions and data, enabling the hardware to execute operations and route information on its own, without frequent involvement from the host and with minimal off-chip memory use. This naturally reduces reliance on I/O, off-chip memory, and host control. By leveraging fine-grained message passing on a programmable, message-based compute architecture, the framework keeps data movement local and coordinates computation across the array using techniques such as stationary-weight reuse, in-array multicasting, and staged reductions. Applied to VGG-19, the framework sustains high utilization (88 to 92 percent), with over 97 percent of messages generated internally and nearly 89 percent of time consumed by on-chip transfers. Computation throughput scales beyond 1 TFLOP/s on larger arrays, while traffic reductions from reuse and local aggregation reach up to 100 MB per layer. Overall, the results highlight the effectiveness of streaming-based computation and show how our mapper enables this execution style by tightly coordinating data and instruction flow across the hardware.
[LG-34] Reservoir Predictive Path Integral Control for Unknown Nonlinear Dynamics
链接: https://arxiv.org/abs/2509.03839
作者: Daisuke Inoue,Tadayoshi Matsumori,Gouhei Tanaka,Yuji Ito
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Chaotic Dynamics (nlin.CD)
备注: Submitted to IEEE for possible publication, 13 pages, 7 figures
Abstract:Neural networks capable of approximating complex nonlinearities have found extensive application in data-driven control of nonlinear dynamical systems. However, fast online identification and control of unknown dynamics remain central challenges. This paper integrates echo-state networks (ESNs) – reservoir computing models implemented with recurrent neural networks – and model predictive path integral (MPPI) control – sampling-based variants of model predictive control – to meet these challenges. The proposed reservoir predictive path integral (RPPI) enables fast learning of nonlinear dynamics with ESN and exploits the learned nonlinearities directly in parallelized MPPI control computation without linearization approximations. The framework is further extended to uncertainty-aware RPPI (URPPI), which leverages ESN uncertainty to balance exploration and exploitation: exploratory inputs dominate during early learning, while exploitative inputs prevail as model confidence grows. Experiments on controlling the Duffing oscillator and four-tank systems demonstrate that URPPI improves control performance, reducing control costs by up to 60% compared to traditional quadratic programming-based model predictive control methods.
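For readers unfamiliar with MPPI, the core update is a cost-weighted average of sampled control perturbations. The sketch below shows only that generic weighting step. It omits the ESN model, rollouts, and the uncertainty-aware extension described in the abstract, and all function names and values are hypothetical.

```python
import math

def mppi_weights(costs, temperature):
    """Softmin weights: lower rollout cost receives higher weight."""
    best = min(costs)  # subtract the minimum for numerical stability
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def mppi_update(nominal, perturbations, costs, temperature):
    """Shift the nominal control sequence by the weighted mean perturbation."""
    weights = mppi_weights(costs, temperature)
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t, u in enumerate(nominal)
    ]

# Hypothetical example: three sampled perturbations over a two-step horizon.
nominal = [0.0, 0.0]
perturbations = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
costs = [1.0, 2.0, 3.0]  # rollout costs from some learned dynamics model
print(mppi_update(nominal, perturbations, costs, temperature=1.0))
```

Because each rollout cost is computed independently, the sampling step parallelizes naturally, and no linearization of the dynamics model is needed.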
[LG-35] Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models
链接: https://arxiv.org/abs/2509.03837
作者: Kimia Ehsani,Walid Saad
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
备注: Accepted at IEEE GLOBECOM 2025
Abstract:Accurate prediction of communication link quality metrics is essential for vehicle-to-infrastructure (V2I) systems, enabling smooth handovers, efficient beam management, and reliable low-latency communication. The increasing availability of sensor data from modern vehicles motivates the use of multimodal large language models (MLLMs) because of their adaptability across tasks and reasoning capabilities. However, MLLMs inherently lack three-dimensional spatial understanding. To overcome this limitation, a lightweight, plug-and-play bird’s-eye view (BEV) injection connector is proposed. In this framework, a BEV of the environment is constructed by collecting sensing data from neighboring vehicles. This BEV representation is then fused with the ego vehicle’s input to provide spatial context for the large language model. To support realistic multimodal learning, a co-simulation environment combining CARLA simulator and MATLAB-based ray tracing is developed to generate RGB, LiDAR, GPS, and wireless signal data across varied scenarios. Instructions and ground-truth responses are programmatically extracted from the ray-tracing outputs. Extensive experiments are conducted across three V2I link prediction tasks: line-of-sight (LoS) versus non-line-of-sight (NLoS) classification, link availability, and blockage prediction. Simulation results show that the proposed BEV injection framework consistently improved performance across all tasks. The results indicate that, compared to an ego-only baseline, the proposed approach improves the macro-average of the accuracy metrics by up to 13.9%. The results also show that this performance gain increases by up to 32.7% under challenging rainy and nighttime conditions, confirming the robustness of the framework in adverse settings.
[LG-36] Predicting Traffic Accident Severity with Deep Neural Networks
链接: https://arxiv.org/abs/2509.03819
作者: Meghan Bibb,Pablo Rivas,Mahee Tayba
类目: Machine Learning (cs.LG)
备注: The 17th International Conference on Data Science (ICDATA 2021)
Abstract:Traffic accidents can be studied to mitigate the risk of further events. Recent advances in machine learning have provided an alternative way to study data associated with traffic accidents. New models achieve good generalization and high predictive power over imbalanced data. In this research, we study neural network-based models on data related to traffic accidents. We begin by analyzing relative feature collinearity and unsupervised dimensionality reduction through autoencoders, followed by a dense network. The features are related to traffic accident data and the target is to classify accident severity. Our experiments show cross-validated results of up to 92% accuracy when classifying accident severity using the proposed deep neural network.
[LG-37] Machine Learning for LiDAR-Based Indoor Surface Classification in Intelligent Wireless Environments
链接: https://arxiv.org/abs/2509.03813
作者: Parth Ashokbhai Shiroya,Swarnagowri Shashidhar,Amod Ashtekar,Krishna Aindrila Kar,Rafaela Lomboy,Dalton Davis,Mohammed E. Eltayeb
类目: Machine Learning (cs.LG)
备注:
Abstract:Reliable connectivity in millimeter-wave (mmWave) and sub-terahertz (sub-THz) networks depends on reflections from surrounding surfaces, as high-frequency signals are highly vulnerable to blockage. The scattering behavior of a surface is determined not only by material permittivity but also by roughness, which governs whether energy remains in the specular direction or is diffusely scattered. This paper presents a LiDAR-driven machine learning framework for classifying indoor surfaces into semi-specular and low-specular categories, using optical reflectivity as a proxy for electromagnetic scattering behavior. A dataset of over 78,000 points from 15 representative indoor materials was collected and partitioned into 3 cm x 3 cm patches to enable classification from partial views. Patch-level features capturing geometry and intensity, including elevation angle, natural-log-scaled intensity, and max-to-mean ratio, were extracted and used to train Random Forest, XGBoost, and neural network classifiers. Results show that ensemble tree-based models consistently provide the best trade-off between accuracy and robustness, confirming that LiDAR-derived features capture roughness-induced scattering effects. The proposed framework enables the generation of scatter-aware environment maps and digital twins, supporting adaptive beam management, blockage recovery, and environment-aware connectivity in next-generation networks.
[LG-38] Online time series prediction using feature adjustment
链接: https://arxiv.org/abs/2509.03810
作者: Xiannan Huang,Shuhan Qiu,Jiayuan Du,Chao Yang
类目: Machine Learning (cs.LG)
备注:
Abstract:Time series forecasting is of significant importance across various domains. However, it faces significant challenges due to distribution shift. This issue becomes particularly pronounced in online deployment scenarios where data arrives sequentially, requiring models to adapt continually to evolving patterns. Current time series online learning methods focus on two main aspects: selecting suitable parameters to update (e.g., final layer weights or adapter modules) and devising suitable update strategies (e.g., using recent batches, replay buffers, or averaged gradients). We challenge the conventional parameter selection approach, proposing that distribution shifts stem from changes in underlying latent factors influencing the data. Consequently, updating the feature representations of these latent factors may be more effective. To address the critical problem of delayed feedback in multi-step forecasting (where true values arrive much later than predictions), we introduce ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space). ADAPT-Z utilizes an adapter module that leverages current feature representations combined with historical gradient information to enable robust parameter updates despite the delay. Extensive experiments demonstrate that our method consistently outperforms standard base models without adaptation and surpasses state-of-the-art online learning approaches across multiple datasets. The code is available at this https URL.
[LG-39] LLM-based Relevance Assessment for Web-Scale Search Evaluation at Pinterest RECSYS2025
链接: https://arxiv.org/abs/2509.03764
作者: Han Wang,Alex Whitworth,Pak Ming Cheung,Zhenjie Zhang,Krishna Kamath
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: RecSys 2025 EARL Workshop
Abstract:Relevance evaluation plays a crucial role in personalized search systems to ensure that search results align with a user’s queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving the evaluation efficiency. Leveraging LLM-based labeling further unlocks the opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.
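To see why more labeled queries or a lower-variance sampling design shrink the Minimum Detectable Effect, the following sketch uses the standard two-sample power formula, MDE = (z_{1-alpha/2} + z_{power}) * sqrt(2 * sigma^2 / n). This is the textbook approximation for equal-sized arms, not necessarily the exact computation used at Pinterest, and the numbers are hypothetical.

```python
from statistics import NormalDist

def minimum_detectable_effect(std_dev, samples_per_arm, alpha=0.05, power=0.8):
    """Smallest true difference in means that a two-sided test at level
    `alpha` detects with probability `power` (equal-variance, equal-size arms)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = (2 * std_dev ** 2 / samples_per_arm) ** 0.5
    return (z_alpha + z_power) * standard_error

# Hypothetical relevance metric with unit standard deviation.
for n in (100, 400, 1600):
    print(n, round(minimum_detectable_effect(1.0, n), 4))
```

Quadrupling the number of labeled samples per arm halves the MDE, which is why cheap LLM labels translate directly into more sensitive experiments.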
[LG-40] Learning functions through Diffusion Maps
链接: https://arxiv.org/abs/2509.03758
作者: Alvaro Almeida Gomez
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: Comments are welcome
Abstract:We propose a data-driven method for approximating real-valued functions on smooth manifolds, building on the Diffusion Maps framework under the manifold hypothesis. Given pointwise evaluations of a function, the method constructs a smooth extension to the ambient space by exploiting diffusion geometry and its connection to the heat equation and the Laplace-Beltrami operator. To address the computational challenges of high-dimensional data, we introduce a dimensionality reduction strategy based on the low-rank structure of the distance matrix, revealed via singular value decomposition (SVD). In addition, we develop an online updating mechanism that enables efficient incorporation of new data, thereby improving scalability and reducing computational cost. Numerical experiments, including applications to sparse CT reconstruction, demonstrate that the proposed methodology outperforms classical feedforward neural networks and interpolation methods in terms of both accuracy and efficiency.
[LG-41] Hypothesis Selection: A High Probability Conundrum
链接: https://arxiv.org/abs/2509.03734
作者: Anders Aamand,Maryam Aliakbarpour,Justin Y. Chen,Sandeep Silwal
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
备注: Abstract abridged to meet arxiv requirements
Abstract:In the hypothesis selection problem, we are given a finite set of candidate distributions (hypotheses), $\mathcal{H} = \{H_1, \ldots, H_n\}$, and samples from an unknown distribution $P$. Our goal is to find a hypothesis $H_i$ whose total variation distance to $P$ is comparable to that of the nearest hypothesis in $\mathcal{H}$. If the minimum distance is $\mathsf{OPT}$, we aim to output an $H_i$ such that, with probability at least $1-\delta$, its total variation distance to $P$ is at most $C \cdot \mathsf{OPT} + \varepsilon$. Despite decades of work, key aspects of this problem remain unresolved, including the optimal running time for algorithms that achieve the optimal sample complexity and best possible approximation factor of $C=3$. The previous state-of-the-art result [Aliakbarpour, Bun, Smith, NeurIPS 2024] provided a nearly linear in $n$ time algorithm but with a sub-optimal dependence on the other parameters, running in $\tilde{O}(n/(\delta^3\varepsilon^3))$ time. We improve this time complexity to $\tilde{O}(n/(\delta \varepsilon^2))$, significantly reducing the dependence on the confidence and error parameters. Furthermore, we study hypothesis selection in three alternative settings, resolving or making progress on several open questions from prior works. (1) We settle the optimal approximation factor when bounding the expected distance of the output hypothesis, rather than its high-probability performance. (2) Assuming the numerical value of $\mathsf{OPT}$ is known in advance, we present an algorithm obtaining $C=3$ and runtime $\tilde{O}(n/\varepsilon^2)$ with the optimal sample complexity and succeeding with high probability in $n$. (3) Allowing a polynomial preprocessing step on the hypothesis class $\mathcal{H}$ before observing samples, we present an algorithm with $C=3$ and subquadratic runtime which succeeds with high probability in $n$.
[LG-42] Online Learning of Optimal Sequential Testing Policies
链接: https://arxiv.org/abs/2509.03707
作者: Qiyuan Chen,Raed Al Kontar
类目: Machine Learning (cs.LG)
备注:
Abstract:This paper studies an online learning problem that seeks optimal testing policies for a stream of subjects, each of whom can be evaluated through a sequence of candidate tests drawn from a common pool. We refer to this problem as the Online Testing Problem (OTP). Although conducting every candidate test for a subject provides more information, it is often preferable to select only a subset when tests are correlated and costly, and make decisions with partial information. If the joint distribution of test outcomes were known, the problem could be cast as a Markov Decision Process (MDP) and solved exactly. In practice, this distribution is unknown and must be learned online as subjects are tested. When a subject is not fully tested, the resulting missing data can bias estimates, making the problem fundamentally harder than standard episodic MDPs. We prove that the minimax regret must scale at least as $\Omega(T^{2/3})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs, revealing the difficulty introduced by missingness. This elevated lower bound is then matched by an Explore-Then-Commit algorithm whose cumulative regret is $\tilde{O}(T^{2/3})$ for both discrete and Gaussian distributions. To highlight the consequence of missingness-dependent rewards in OTP, we study a variant called the Online Cost-sensitive Maximum Entropy Sampling Problem, where rewards are independent of missing data. This structure enables an iterative-elimination algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, breaking the $\Omega(T^{2/3})$ lower bound for OTP. Numerical results confirm our theory in both settings. Overall, this work deepens the understanding of the exploration–exploitation trade-off under missing data and guides the design of efficient sequential testing policies.
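The Explore-Then-Commit idea can be sketched in its simplest multi-armed bandit form: spend roughly T^(2/3) rounds sampling every option uniformly, then commit to the empirically best one. This is a generic illustration, not the paper's OTP algorithm (which chooses testing policies, not arms), and the reward probabilities below are hypothetical. It assumes the horizon is at least the number of arms.

```python
import math
import random

def explore_then_commit(pull, num_arms, horizon):
    """Explore all arms uniformly for about horizon**(2/3) rounds in total,
    then play the arm with the highest observed total reward."""
    pulls_per_arm = math.ceil(horizon ** (2 / 3) / num_arms)
    pulls_per_arm = max(1, min(pulls_per_arm, horizon // num_arms))
    totals = [0.0] * num_arms
    reward = 0.0
    for arm in range(num_arms):
        for _ in range(pulls_per_arm):
            r = pull(arm)
            totals[arm] += r
            reward += r
    # Every arm was pulled equally often, so totals rank the arms by mean.
    best = max(range(num_arms), key=totals.__getitem__)
    for _ in range(horizon - pulls_per_arm * num_arms):
        reward += pull(best)
    return best, reward

# Hypothetical Bernoulli arms with a fixed seed for repeatability.
rng = random.Random(0)
success_probabilities = [0.2, 0.8, 0.5]
chosen, total = explore_then_commit(
    lambda arm: float(rng.random() < success_probabilities[arm]),
    num_arms=3,
    horizon=1000,
)
print(chosen, total)
```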
[LG-43] EmbedOR: Provable Cluster-Preserving Visualizations with Curvature-Based Stochastic Neighbor Embeddings
链接: https://arxiv.org/abs/2509.03703
作者: Tristan Luca Saidi,Abigail Hickok,Bastian Rieck,Andrew J. Blumberg
类目: Machine Learning (cs.LG)
备注:
Abstract:Stochastic Neighbor Embedding (SNE) algorithms like UMAP and tSNE often produce visualizations that do not preserve the geometry of noisy and high dimensional data. In particular, they can spuriously separate connected components of the underlying data submanifold and can fail to find clusters in well-clusterable data. To address these limitations, we propose EmbedOR, an SNE algorithm that incorporates discrete graph curvature. Our algorithm stochastically embeds the data using a curvature-enhanced distance metric that emphasizes underlying cluster structure. Critically, we prove that the EmbedOR distance metric extends consistency results for tSNE to a much broader class of datasets. We also describe extensive experiments on synthetic and real data that demonstrate the visualization and geometry-preservation capabilities of EmbedOR. We find that, unlike other SNE algorithms and UMAP, EmbedOR is much less likely to fragment continuous, high-density regions of the data. Finally, we demonstrate that the EmbedOR distance metric can be used as a tool to annotate existing visualizations to identify fragmentation and provide deeper insight into the underlying geometry of the data.
[LG-44] Graph Random Features for Scalable Gaussian Processes
链接: https://arxiv.org/abs/2509.03691
作者: Matthew Zhang,Jihao Andreas Lin,Adrian Weller,Richard E. Turner,Isaac Reid
类目: Machine Learning (cs.LG)
备注:
Abstract:We study the application of graph random features (GRFs) - a recently introduced stochastic estimator of graph node kernels - to scalable Gaussian processes on discrete input spaces. We prove that (under mild assumptions) Bayesian inference with GRFs enjoys $O(N^{3/2})$ time complexity with respect to the number of nodes $N$, compared to $O(N^3)$ for exact kernels. Substantial wall-clock speedups and memory savings unlock Bayesian optimisation on graphs with over $10^6$ nodes on a single computer chip, whilst preserving competitive performance.
[LG-45] A Comprehensive Review of Multi-Agent Reinforcement Learning in Video Games
链接: https://arxiv.org/abs/2509.03682
作者: Zhengyang Li,Qijin Ji,Xinghong Ling,Quan Liu
类目: Machine Learning (cs.LG)
备注: IEEE Transactions on Games, 2025
Abstract:Recent advancements in multi-agent reinforcement learning (MARL) have demonstrated its application potential in modern games. Beginning with foundational work and progressing to landmark achievements such as AlphaStar in StarCraft II and OpenAI Five in Dota 2, MARL has proven capable of achieving superhuman performance across diverse game environments through techniques like self-play, supervised learning, and deep reinforcement learning. With its growing impact, a comprehensive review has become increasingly important in this field. This paper aims to provide a thorough examination of MARL’s application from turn-based two-agent games to real-time multi-agent video games including popular genres such as Sports games, First-Person Shooter (FPS) games, Real-Time Strategy (RTS) games and Multiplayer Online Battle Arena (MOBA) games. We further analyze critical challenges posed by MARL in video games, including nonstationarity, partial observability, sparse rewards, team coordination, and scalability, and highlight successful implementations in games like Rocket League, Minecraft, Quake III Arena, StarCraft II, Dota 2, Honor of Kings, etc. This paper offers insights into MARL in video game AI systems, proposes a novel method to estimate game complexity, and suggests future research directions to advance MARL and its applications in game development, inspiring further innovation in this rapidly evolving field.
[LG-46] A Machine Learning-Based Study on the Synergistic Optimization of Supply Chain Management and Financial Supply Chains from an Economic Perspective
链接: https://arxiv.org/abs/2509.03673
作者: Hang Wang,Huijie Tang,Ningai Leng,Zhoufan Yu
类目: Machine Learning (cs.LG)
备注: Accepted by the 2025 IEEE 8th International Conference on Information Systems and Computer Aided Education (ICISCAE 2025)
Abstract:Based on economic theories and integrated with machine learning technology, this study explores a collaborative Supply Chain Management and Financial Supply Chain Management (SCM-FSCM) model to solve issues like efficiency loss, financing constraints, and risk transmission. We combine Transaction Cost and Information Asymmetry theories and use algorithms such as random forests to process multi-dimensional data and build a data-driven, three-dimensional (cost-efficiency-risk) analysis framework. We then apply an FSCM model of “core enterprise credit empowerment plus dynamic pledge financing.” We use Long Short-Term Memory (LSTM) networks for demand forecasting and clustering/regression algorithms for benefit allocation. The study also combines Game Theory and reinforcement learning to optimize the inventory-procurement mechanism and uses eXtreme Gradient Boosting (XGBoost) for credit assessment to enable rapid monetization of inventory. Verified with 20 core and 100 supporting enterprises, the results show a 30% increase in inventory turnover, an 18%-22% decrease in SME financing costs, a stable order fulfillment rate above 95%, and excellent model performance (demand forecasting error = 8%, credit assessment accuracy = 90%). This SCM-FSCM model effectively reduces operating costs, alleviates financing constraints, and supports high-quality supply chain development.
[LG-47] SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences
链接: https://arxiv.org/abs/2509.03672
作者: Arpan Mukherjee,Marcello Bullo,Deniz Gündüz
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Uniform-reward reinforcement learning from human feedback (RLHF), which trains a single reward model to represent the preferences of all annotators, fails to capture the diversity of opinions across sub-populations, inadvertently favoring dominant groups. The state-of-the-art, MaxMin-RLHF, addresses this by learning group-specific reward models, and by optimizing for the group receiving the minimum reward, thereby promoting fairness. However, we identify that a key limitation of MaxMin-RLHF is its poor performance when the minimum-reward group is a minority. To mitigate this drawback, we introduce a novel framework, termed SharedRep-RLHF. At its core, SharedRep-RLHF learns and leverages shared traits in annotations among various groups, in contrast to learning separate reward models across groups. We first show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF. Experiments across diverse natural language tasks showcase the effectiveness of SharedRep-RLHF compared to MaxMin-RLHF with a gain of up to 20% in win rate.
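The contrast between a uniform (population-averaged) objective and a max-min objective can be shown with a toy selection problem. This sketch illustrates only the two baseline objectives named in the abstract, not SharedRep-RLHF, and the candidate names, group shares, and reward values are hypothetical.

```python
def uniform_choice(rewards, group_shares):
    """Pick the candidate with the highest population-weighted average reward."""
    return max(
        rewards,
        key=lambda name: sum(s * r for s, r in zip(group_shares, rewards[name])),
    )

def maxmin_choice(rewards):
    """Pick the candidate whose worst-off group receives the highest reward."""
    return max(rewards, key=lambda name: min(rewards[name]))

# Hypothetical per-group rewards: (majority group, minority group).
rewards = {
    "response_a": (0.9, 0.1),
    "response_b": (0.6, 0.5),
}
group_shares = (0.9, 0.1)  # the majority makes up 90% of annotators

print(uniform_choice(rewards, group_shares))  # favors the majority's preference
print(maxmin_choice(rewards))                 # protects the minority group
```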
[LG-48] AutoGrid AI: Deep Reinforcement Learning Framework for Autonomous Microgrid Management
Link: https://arxiv.org/abs/2509.03666
Authors: Kenny Guo,Nicholas Eckhert,Krish Chhajer,Luthira Abeykoon,Lorne Schell
Categories: Machine Learning (cs.LG)
Comments: IEEE (International Conference on Smart Energy Grid Engineering (SEGE)) 2025, 6 pages
Abstract:We present a deep reinforcement learning-based framework for autonomous microgrid management tailored for remote communities. Using deep reinforcement learning and time-series forecasting models, we optimize microgrid energy dispatch strategies to minimize costs and maximize the utilization of renewable energy sources such as solar and wind. Our approach integrates the transformer architecture for forecasting of renewable generation and a proximal-policy optimization (PPO) agent to make decisions in a simulated environment. Our experimental results demonstrate significant improvements in both energy efficiency and operational resilience when compared to traditional rule-based methods. This work contributes to advancing smart-grid technologies in pursuit of zero-carbon energy systems. We finally provide an open-source framework for simulating several microgrid environments.
[LG-49] ACT: Automated Constraint Targeting for Multi-Objective Recommender Systems
Link: https://arxiv.org/abs/2509.03661
Authors: Daryl Chang,Yi Wu,Jennifer She,Li Wei,Lukasz Heldt
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Recommender systems often must maximize a primary objective while ensuring secondary ones satisfy minimum thresholds, or “guardrails.” This is critical for maintaining a consistent user experience and platform ecosystem, but enforcing these guardrails despite orthogonal system changes is challenging and often requires manual hyperparameter tuning. We introduce the Automated Constraint Targeting (ACT) framework, which automatically finds the minimal set of hyperparameter changes needed to satisfy these guardrails. ACT uses an offline pairwise evaluation on unbiased data to find solutions and continuously retrains to adapt to system and user behavior changes. We empirically demonstrate its efficacy and describe its deployment in a large-scale production environment.
[LG-50] Semi-decentralized Federated Time Series Prediction with Client Availability Budgets
Link: https://arxiv.org/abs/2509.03660
Authors: Yunkai Bao,Reza Safarzadeh,Xin Wang,Steve Drew
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Federated learning (FL) effectively promotes collaborative training among distributed clients with privacy considerations in Internet of Things (IoT) scenarios. Besides data heterogeneity, FL clients may also be constrained by limited energy and availability budgets. Therefore, effective selection of clients participating in training is of vital importance for the convergence of the global model and the balance of client contributions. In this paper, we discuss the performance impact of client availability with time-series data on federated learning. We set up three different scenarios that affect the availability of time-series data and propose FedDeCAB, a novel, semi-decentralized client selection method applying probabilistic rankings of available clients. When a client is disconnected from the server, FedDeCAB allows obtaining partial model parameters from the nearest neighbor clients for joint optimization, improving the performance of offline models and reducing communication overhead. Experiments based on real-world large-scale taxi and vessel trajectory datasets show that FedDeCAB is effective under highly heterogeneous data distribution, limited communication budget, and dynamic client offline or rejoining.
[LG-51] Nonnegative matrix factorization and the principle of the common cause
Link: https://arxiv.org/abs/2509.03652
Authors: E. Khalafyan,A. E. Allahverdyan,A. Hovhannisyan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Nonnegative matrix factorization (NMF) is a known unsupervised data-reduction method. The principle of the common cause (PCC) is a basic methodological approach in probabilistic causality, which seeks an independent mixture model for the joint probability of two dependent random variables. It turns out that these two concepts are closely related. This relationship is explored reciprocally for several datasets of gray-scale images, which are conveniently mapped into probability models. On one hand, PCC provides a predictability tool that leads to a robust estimation of the effective rank of NMF. Unlike other estimates (e.g., those based on the Bayesian Information Criteria), our estimate of the rank is stable against weak noise. We show that NMF implemented around this rank produces features (basis images) that are also stable against noise and against seeds of local optimization, thereby effectively resolving the NMF nonidentifiability problem. On the other hand, NMF provides an interesting possibility of implementing PCC in an approximate way, where larger and positively correlated joint probabilities tend to be explained better via the independent mixture model. We work out a clustering method, where data points with the same common cause are grouped into the same cluster. We also show how NMF can be employed for data denoising.
[LG-52] Connections between reinforcement learning with feedback, test-time scaling, and diffusion guidance: An anthology
Link: https://arxiv.org/abs/2509.04372
Authors: Yuchen Jiao,Yuxin Chen,Gen Li
Categories: Machine Learning (stat.ML); General Literature (cs.GL); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:In this note, we reflect on several fundamental connections among widely used post-training techniques. We clarify some intimate connections and equivalences between reinforcement learning with human feedback, reinforcement learning with internal feedback, and test-time scaling (particularly soft best-of-$N$ sampling), while also illuminating intrinsic links between diffusion guidance and test-time scaling. Additionally, we introduce a resampling approach for alignment and reward-directed diffusion models, sidestepping the need for explicit reinforcement learning techniques.
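As a rough illustration of the soft best-of-N idea mentioned in this abstract, here is a minimal sketch under a common reading of the term: draw N candidates, then pick one with probability proportional to exp(reward / temperature) instead of always taking the top-scoring one. The function names, the `temperature` parameter, and the `reward` callable are assumptions made for illustration; the note's exact formulation may differ.

```python
import math
import random


def soft_best_of_n_probs(rewards, temperature=1.0):
    """Selection probabilities proportional to exp(reward / temperature)."""
    top = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def soft_best_of_n(candidates, reward, temperature=1.0, rng=random):
    """Sample one of the N candidates according to the soft weights.

    `reward` is a hypothetical scoring function (candidate -> float).
    """
    probs = soft_best_of_n_probs([reward(c) for c in candidates], temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

As the temperature goes to zero this approaches ordinary (hard) best-of-N, and as it grows the choice approaches uniform sampling over the N candidates.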
[LG-53] Batched Stochastic Matching Bandits
Link: https://arxiv.org/abs/2509.04194
Authors: Jung-hun Kim,Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:In this study, we introduce a novel bandit framework for stochastic matching based on the Multi-nomial Logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to an unknown preference and yields a corresponding reward. The objective is to minimize regret by maximizing the cumulative revenue from successful matches across all agents. This task requires solving a combinatorial optimization problem based on estimated preferences, which is NP-hard and leads a naive approach to incur a computational cost of $O(K^N)$ per round. To address this challenge, we propose batched algorithms that limit the frequency of matching updates, thereby reducing the amortized computational cost (i.e., the average cost per round) to $O(1)$ while still achieving a regret bound of $\tilde{O}(\sqrt{T})$.
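To make the MNL choice step in this abstract concrete, the following sketch computes the probability that an arm selects each agent in its assigned pool under a generic multinomial logit model. The utilities and the `outside_weight` term (the weight of the arm choosing nobody, a standard device in MNL models) are assumptions for illustration and are not taken from the paper.

```python
import math


def mnl_choice_probs(utilities, outside_weight=1.0):
    """Return (per-agent choice probabilities, probability of no match).

    utilities: hypothetical preference scores of the agents assigned to one arm.
    outside_weight: assumed weight of the "no choice" option; with 0.0 the arm
        always picks some agent (the pool must then be non-empty).
    """
    weights = [math.exp(u) for u in utilities]
    denom = outside_weight + sum(weights)
    return [w / denom for w in weights], outside_weight / denom
```

For example, two agents with equal utility 0 and the default outside weight each get probability 1/3, with the remaining 1/3 going to no match.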
[LG-54] Shuffling Heuristic in Variational Inequalities: Establishing New Convergence Guarantees
Link: https://arxiv.org/abs/2509.04133
Authors: Daniil Medyakov,Gleb Molodtsov,Grigoriy Evseev,Egor Petrov,Aleksandr Beznosikov
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 25 pages, 5 figures, 2 tables
Abstract:Variational inequalities have gained significant attention in machine learning and optimization research. While stochastic methods for solving these problems typically assume independent data sampling, we investigate an alternative approach – the shuffling heuristic. This strategy involves permuting the dataset before sequential processing, ensuring equal consideration of all data points. Despite its practical utility, theoretical guarantees for shuffling in variational inequalities remain unexplored. We address this gap by providing the first theoretical convergence estimates for shuffling methods in this context. Our analysis establishes rigorous bounds and convergence rates, extending the theoretical framework for this important class of algorithms. We validate our findings through extensive experiments on diverse benchmark variational inequality problems, demonstrating faster convergence of shuffling methods compared to independent sampling approaches.
[LG-55] Gromov-Wasserstein and optimal transport: from assignment problems to probabilistic numeric
Link: https://arxiv.org/abs/2509.04089
Authors: Iman Seyedi,Antonio Candelieri,Enza Messina,Francesco Archetti
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:The assignment problem, a cornerstone of operations research, seeks an optimal one-to-one mapping between agents and tasks to minimize total cost. This work traces its evolution from classical formulations and algorithms to modern optimal transport (OT) theory, positioning the Quadratic Assignment Problem (QAP) and related structural matching tasks within this framework. We connect the linear assignment problem to Monge’s transport problem, Kantorovich’s relaxation, and Wasserstein distances, then extend to cases where source and target lie in different metric-measure spaces requiring Gromov-Wasserstein (GW) distances. GW formulations, including the fused GW variant that integrates structural and feature information, naturally address QAP-like problems by optimizing alignment based on both intra-domain distances and cross-domain attributes. Applications include graph matching, keypoint correspondence, and feature-based assignments. We present exact solvers, Genetic Algorithms (GA), and multiple GW variants, including a proposed multi-initialization strategy (GW-MultiInit) that mitigates the risk of getting stuck in local optima alongside entropic Sinkhorn-based approximations and fused GW. Computational experiments on capacitated QAP instances show that GW-MultiInit consistently achieves near-optimal solutions and scales efficiently to large problems where exact methods become impractical, while parameterized EGW and FGW variants provide flexible trade-offs between accuracy and runtime. Our findings provide theoretical foundations, computational insights, and practical guidelines for applying OT and GW methods to QAP and other real-world matching problems, such as those in machine learning and logistics.
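The entropic Sinkhorn approximation mentioned in this abstract can be sketched for the plain (linear) optimal transport problem as follows. This is the textbook Sinkhorn iteration written in pure Python, not the paper's GW-MultiInit or fused GW solver; the names `cost`, `a`, `b`, `eps`, and `iters` are illustrative assumptions.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, iters=500):
    """Entropic OT: return a plan whose row sums approach a and column sums equal b.

    cost: n x m cost matrix (list of lists); a, b: source/target weights.
    eps: entropic regularization; very small eps can underflow exp(-cost/eps).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

With uniform weights and a cost matrix that is zero on the diagonal, the plan concentrates on the diagonal as `eps` shrinks, which recovers the one-to-one assignment view described in the abstract.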
[LG-56] Divergence-Kernel method for linear responses and diffusion models
Link: https://arxiv.org/abs/2509.03992
Authors: Angxiu Ni
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:We derive the divergence-kernel formula for the linear response (parameter-derivative of marginal or stationary distributions) of random dynamical systems, and formally pass to the continuous-time limit. Our formula works for multiplicative and parameterized noise over any period of time; it does not require hyperbolicity. Then we derive a pathwise Monte-Carlo algorithm for linear responses. With this, we propose a forward-only diffusion generative model and test on simple problems.
[LG-57] An invertible generative model for forward and inverse problems
Link: https://arxiv.org/abs/2509.03910
Authors: Tristan van Leeuwen,Christoph Brune,Marcello Carioni
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:We formulate the inverse problem in a Bayesian framework and aim to train a generative model that allows us to simulate (i.e., sample from the likelihood) and do inference (i.e., sample from the posterior). We review the use of triangular normalizing flows for conditional sampling in this context and show how to combine two such triangular maps (an upper and a lower one) into one invertible mapping that can be used for simulation and inference. We work out several useful properties of this invertible generative model and propose a possible training loss for training the map directly. We illustrate the workings of this new approach to conditional generative modeling numerically on a few stylized examples.
[LG-58] Finetuning AI Foundation Models to Develop Subgrid-Scale Parameterizations: A Case Study on Atmospheric Gravity Waves
Link: https://arxiv.org/abs/2509.03816
Authors: Aman Gupta,Aditi Sheshadri,Sujit Roy,Johannes Schmude,Vishal Gaur,Wei Ji Leong,Manil Maskey,Rahul Ramachandran
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:
Abstract:Global climate models parameterize a range of atmospheric-oceanic processes like gravity waves, clouds, moist convection, and turbulence that cannot be sufficiently resolved. These subgrid-scale closures for unresolved processes are a leading source of model uncertainty. Here, we present a new approach to developing machine learning parameterizations of small-scale climate processes by fine-tuning a pre-trained AI foundation model (FM). FMs are largely unexplored in climate research. A pre-trained encoder-decoder from a 2.3 billion parameter FM (NASA and IBM Research’s Prithvi WxC) – which contains a latent probabilistic representation of atmospheric evolution – is fine-tuned (or reused) to create a deep learning parameterization for atmospheric gravity waves (GWs). The parameterization captures GW effects for a coarse-resolution climate model by learning the fluxes from an atmospheric reanalysis with 10 times finer resolution. A comparison of monthly averages and instantaneous evolution with a machine learning model baseline (an Attention U-Net) reveals superior predictive performance of the FM parameterization throughout the atmosphere, even in regions excluded from pre-training. This performance boost is quantified using the Hellinger distance, which is 0.11 for the baseline and 0.06 for the fine-tuned model. Our findings emphasize the versatility and reusability of FMs, which could be used to accomplish a range of atmosphere- and climate-related applications, leading the way for the creation of observations-driven and physically accurate parameterizations for more earth-system processes.
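The Hellinger distance used in this abstract to compare the baseline and the fine-tuned model can be written down for two discrete distributions as below. This is the standard definition, normalized to lie in [0, 1]; how the paper bins its fields into histograms is not stated here, so the inputs `p` and `q` are assumed to be already-normalized probability vectors.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Returns 0.0 for identical distributions and 1.0 for disjoint supports.
    """
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)
```

Lower values mean the predicted distribution is closer to the reference, which is the sense in which 0.06 improves on 0.11.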
[LG-59] Testing for correlation between network structure and high-dimensional node covariates
Link: https://arxiv.org/abs/2509.03772
Authors: Alexander Fuchs-Kreiss,Keith Levin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:In many application domains, networks are observed with node-level features. In such settings, a common problem is to assess whether or not nodal covariates are correlated with the network structure itself. Here, we present four novel methods for addressing this problem. Two of these are based on a linear model relating node-level covariates to latent node-level variables that drive network structure. The other two are based on applying canonical correlation analysis to the node features and network structure, avoiding the linear modeling assumptions. We provide theoretical guarantees for all four methods when the observed network is generated according to a low-rank latent space model endowed with node-level covariates, which we allow to be high-dimensional. Our methods are computationally cheaper and require fewer modeling assumptions than previous approaches to network dependency testing. We demonstrate and compare the performance of our novel methods on both simulated and real-world data.
[LG-60] Deficiency of equation-finding approach to data-driven modeling of dynamical systems
Link: https://arxiv.org/abs/2509.03769
Authors: Zheng-Meng Zhai,Valerio Lucarini,Ying-Cheng Lai
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 6 pages, 3 figures
Abstract:Finding the governing equations from data by sparse optimization has become a popular approach to deterministic modeling of dynamical systems. Considering the physical situations where the data can be imperfect due to disturbances and measurement errors, we show that for many chaotic systems, widely used sparse-optimization methods for discovering governing equations produce models that depend sensitively on the measurement procedure, yet all such models generate virtually identical chaotic attractors, leading to a striking limitation that challenges the conventional notion of equation-based modeling in complex dynamical systems. Calculating the Koopman spectra, we find that the different sets of equations agree in their large eigenvalues and the differences begin to appear when the eigenvalues are smaller than an equation-dependent threshold. The results suggest that finding the governing equations of the system and attempting to interpret them physically may lead to misleading conclusions. It would be more useful to work directly with the available data using, e.g., machine-learning methods.
[LG-61] Energy-Weighted Flow Matching: Unlocking Continuous Normalizing Flows for Efficient and Scalable Boltzmann Sampling
Link: https://arxiv.org/abs/2509.03726
Authors: Niclas Dern,Lennart Redl,Sebastian Pfister,Marcel Kollovieh,David Lüdke,Stephan Günnemann
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 21 pages, 4 figures
Abstract:Sampling from unnormalized target distributions, e.g., Boltzmann distributions $\mu_{\text{target}}(x) \propto \exp(-E(x)/T)$, is fundamental to many scientific applications yet computationally challenging due to complex, high-dimensional energy landscapes. Existing approaches applying modern generative models to Boltzmann distributions either require large datasets of samples drawn from the target distribution or, when using only energy evaluations for training, cannot efficiently leverage the expressivity of advanced architectures like continuous normalizing flows that have shown promise for molecular sampling. To address these shortcomings, we introduce Energy-Weighted Flow Matching (EWFM), a novel training objective enabling continuous normalizing flows to model Boltzmann distributions using only energy function evaluations. Our objective reformulates conditional flow matching via importance sampling, allowing training with samples from arbitrary proposal distributions. Based on this objective, we develop two algorithms: iterative EWFM (iEWFM), which progressively refines proposals through iterative training, and annealed EWFM (aEWFM), which additionally incorporates temperature annealing for challenging energy landscapes. On benchmark systems, including challenging 55-particle Lennard-Jones clusters, our algorithms demonstrate sample quality competitive with state-of-the-art energy-only methods while requiring up to three orders of magnitude fewer energy evaluations.
[LG-62] Accurate and scalable deep Maxwell solvers using multilevel iterative methods
Link: https://arxiv.org/abs/2509.03622
Authors: Chenkai Mao,Jonathan A. Fan
Categories: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments:
Abstract:Neural networks have promise as surrogate partial differential equation (PDE) solvers, but it remains a challenge to use these concepts to solve problems with high accuracy and scalability. In this work, we show that neural network surrogates can combine with iterative algorithms to accurately solve PDE problems featuring different scales, resolutions, and boundary conditions. We develop a subdomain neural operator model that supports arbitrary Robin-type boundary condition inputs, and we show that it can be utilized as a flexible preconditioner to iteratively solve subdomain problems with bounded accuracy. We further show that our subdomain models can facilitate the construction of global coarse spaces to enable accelerated, large scale PDE problem solving based on iterative multilevel domain decomposition. With two-dimensional Maxwell’s equations as a model system, we train a single network to simulate large scale problems with different sizes, resolutions, wavelengths, and dielectric media distribution. We further demonstrate the utility of our platform in performing the accurate inverse design of multi-wavelength nanophotonic devices. Our work presents a promising path to building accurate and scalable multi-physics surrogate solvers for large practical problems.
[LG-63] Exoplanetary atmospheres retrieval via a quantum extreme learning machine
Link: https://arxiv.org/abs/2509.03617
Authors: Marco Vetrano,Tiziano Zingales,G. Massimo Palma,Salvatore Lorenzo
Categories: Quantum Physics (quant-ph); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments:
Abstract:The study of exoplanetary atmospheres traditionally relies on forward models to analytically compute the spectrum of an exoplanet by fine-tuning numerous chemical and physical parameters. However, the high dimensionality of the parameter space often results in significant computational overhead. In this work, we introduce a novel approach to atmospheric retrieval leveraging quantum extreme learning machines (QELMs). QELMs are quantum machine learning techniques that employ quantum systems as a black box for processing input data. We propose a framework for extracting exoplanetary atmospheric features using QELMs, employing an intrinsically fault-tolerant strategy suitable for near-term quantum devices, and we demonstrate such fault tolerance with a direct implementation on IBM Fez. The QELM architecture we present shows the potential of quantum computing in the analysis of astrophysical datasets and may, in the near-term future, unlock new computational tools to implement fast, efficient, and more accurate models in the study of exoplanetary atmospheres.
[LG-64] Predicting Antimicrobial Resistance (AMR) in Campylobacter, a Foodborne Pathogen, and Cost Burden Analysis Using Machine Learning
Link: https://arxiv.org/abs/2509.03551
Authors: Shubham Mishra,TheAnh Han,Bruno Silvester Lopes,Shatha Ghareeb,Zia Ush Shamszaman
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 1 table. Submitted to a Briefings in Bioinformatics journal and waiting for the outcome
Abstract:Antimicrobial resistance (AMR) poses a significant public health and economic challenge, increasing treatment costs and reducing antibiotic effectiveness. This study employs machine learning to analyze genomic and epidemiological data from the public databases for molecular typing and microbial genome diversity (PubMLST), incorporating data from UK government-supported AMR surveillance by the Food Standards Agency and Food Standards Scotland. We identify AMR patterns in Campylobacter jejuni and Campylobacter coli isolates collected in the UK from 2001 to 2017. The research integrates whole-genome sequencing (WGS) data, epidemiological metadata, and economic projections to identify key resistance determinants and forecast future resistance trends and healthcare costs. We investigate gyrA mutations for fluoroquinolone resistance and the tet(O) gene for tetracycline resistance, training a Random Forest model validated with bootstrap resampling (1,000 samples, 95% confidence intervals), achieving 74% accuracy in predicting AMR phenotypes. Time-series forecasting models (SARIMA, SIR, and Prophet) predict a rise in campylobacteriosis cases, potentially exceeding 130 cases per 100,000 people by 2050, with an economic burden projected to surpass 1.9 billion GBP annually if left unchecked. An enhanced Random Forest system, analyzing 6,683 isolates, refines predictions by incorporating temporal patterns, uncertainty estimation, and resistance trend modeling, indicating sustained high beta-lactam resistance, increasing fluoroquinolone resistance, and fluctuating tetracycline resistance.
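The bootstrap validation mentioned in this abstract (1,000 resamples, 95% confidence intervals) can be illustrated with a percentile bootstrap over a fixed set of predictions. This is a generic sketch, not the authors' pipeline: the function name, the `seed` parameter, and the choice to resample prediction outcomes rather than retrain the Random Forest on each resample are all assumptions.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample the per-example outcomes with replacement.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo_idx = round((1 - level) / 2 * (n_resamples - 1))
    hi_idx = round((1 + level) / 2 * (n_resamples - 1))
    return stats[lo_idx], stats[hi_idx]
```

The width of the returned interval shrinks as the number of evaluated isolates grows, which is why a point accuracy such as 74% is more informative when reported together with its interval.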
[LG-65] Combining feature-based approaches with graph neural networks and symbolic regression for synergistic performance and interpretability
Link: https://arxiv.org/abs/2509.03547
Authors: Rogério Almeida Gouvêa,Pierre-Paul De Breuck,Tatiane Pretto,Gian-Marco Rignanese,Marcos José Leite dos Santos
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:
Abstract:This study introduces MatterVial, an innovative hybrid framework for feature-based machine learning in materials science. MatterVial expands the feature space by integrating latent representations from a diverse suite of pretrained graph neural network (GNN) models including: structure-based (MEGNet), composition-based (ROOST), and equivariant (ORB) graph networks, with computationally efficient, GNN-approximated descriptors and novel features from symbolic regression. Our approach combines the chemical transparency of traditional feature-based models with the predictive power of deep learning architectures. When augmenting the feature-based model MODNet on Matbench tasks, this method yields significant error reductions and elevates its performance to be competitive with, and in several cases superior to, state-of-the-art end-to-end GNNs, with accuracy increases exceeding 40% for multiple tasks. An integrated interpretability module, employing surrogate models and symbolic regression, decodes the latent GNN-derived descriptors into explicit, physically meaningful formulas. This unified framework advances materials informatics by providing a high-performance, transparent tool that aligns with the principles of explainable AI, paving the way for more targeted and autonomous materials discovery.
[LG-66] An exact multiple-time-step variational formulation for the committor and the transition rate
Link: https://arxiv.org/abs/2509.03539
Authors: Chatipat Lorpaiboon,Jonathan Weare,Aaron R. Dinner
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 42 pages, 3 figures
Abstract:For a transition between two stable states, the committor is the probability that the dynamics leads to one stable state before the other. It can be estimated from trajectory data by minimizing an expression for the transition rate that depends on a lag time. We show that an existing such expression is minimized by the exact committor only when the lag time is a single time step, resulting in a biased estimate in practical applications. We introduce an alternative expression that is minimized by the exact committor at any lag time. Numerical tests on benchmark systems demonstrate that our committor and resulting transition rate estimates are much less sensitive to the choice of lag time. We derive an additional expression for the transition rate, relate the transition rate expression to a variational approach for kinetic statistics based on the mean-squared residual, and discuss further numerical considerations with the aid of a decomposition of the error into dynamic modes.
[LG-67] A Small Dataset May Go a Long Way: Process Duration Prediction in Clinical Settings
Link: https://arxiv.org/abs/2509.03522
Authors: Harald Störrle,Anastasia Hort
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments:
Abstract:Context: Utilization of operating theaters is a major cost driver in hospitals. Optimizing this variable through optimized surgery schedules may significantly lower cost and simultaneously improve medical outcomes. Previous studies proposed various complex models to predict the duration of procedures, the key ingredient to optimal schedules. They did so perusing large amounts of data. Goals: We aspire to create an effective and efficient model to predict operation durations based on only a small amount of data. Ideally, our model is also simpler in structure, and thus easier to use. Methods: We immerse ourselves in the application domain to leverage practitioners' expertise. This way, we make the best use of our limited supply of clinical data, and may conduct our data analysis in a theory-guided way. We do a combined factor analysis and develop regression models to predict the duration of the perioperative process. Findings: We found simple methods of central tendency to perform on a par with much more complex methods proposed in the literature. In fact, they sometimes outperform them. We conclude that combining expert knowledge with data analysis may improve both data quality and model performance, allowing for more accurate forecasts. Conclusion: We yield better results than previous researchers by integrating conventional data science methods with qualitative studies of clinical settings and process structure. Thus, we are able to leverage even small datasets.
Information Retrieval
[IR-0] Temporal Interest-Driven Multimodal Personalized Content Generation
Link: https://arxiv.org/abs/2509.04330
Authors: Tian Miao
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:With the dynamic evolution of user interests and the increasing multimodal demands in internet applications, personalized content generation strategies based on static interest preferences struggle to meet practical application requirements. The proposed TIMGen (Temporal Interest-driven Multimodal Generation) model addresses this challenge by modeling the long-term temporal evolution of users’ interests and capturing dynamic interest representations with strong temporal dependencies. This model also supports the fusion of multimodal features, such as text, images, video, and audio, and delivers customized content based on multimodal preferences. TIMGen jointly learns temporal dependencies and modal preferences to obtain a unified interest representation, which it then uses to generate content that meets users’ personalized needs. TIMGen overcomes the shortcomings of personalized content recommendation methods based on static preferences, enabling flexible and dynamic modeling of users’ multimodal interests and better capturing their preferences. It can be extended to a variety of practical application scenarios, including e-commerce, advertising, online education, and precision medicine, providing insights for future research.
[IR-1] PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music
Link: https://arxiv.org/abs/2509.04215
Authors: Hayeon Bang,Eunjin Choi,Seungheon Doh,Juhan Nam
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted for publication at the 26th International Society for Music Information Retrieval Conference (ISMIR 2025)
Abstract:Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
[IR-2] Safeguarding Patient Trust in the Age of AI: Tackling Health Misinformation with Explainable AI
Link: https://arxiv.org/abs/2509.04052
Authors: Sueun Hong,Shuojie Fu,Ovidiu Serban,Brianna Bao,James Kinross,Francesa Toni,Guy Martin,Uddhav Vaghela
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:AI-generated health misinformation poses unprecedented threats to patient safety and healthcare system trust globally. This white paper presents an explainable AI framework developed through the EPSRC INDICATE project to combat medical misinformation while enhancing evidence-based healthcare delivery. Our systematic review of 17 studies reveals the urgent need for transparent AI systems in healthcare. The proposed solution demonstrates 95% recall in clinical evidence retrieval and integrates novel trustworthiness classifiers achieving 76% F1 score in detecting biomedical misinformation. Results show that explainable AI can transform traditional 6-month expert review processes into real-time, automated evidence synthesis while maintaining clinical rigor. This approach offers a critical intervention to preserve healthcare integrity in the AI era.
[IR-3] Efficient Item ID Generation for Large-Scale LLM-based Recommendation
Link: https://arxiv.org/abs/2509.03746
Authors: Anushya Subbiah,Vikram Aggarwal,James Pine,Steffen Rendle,Krishna Sayana,Kun Su
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Integrating product catalogs and user behavior into LLMs can enhance recommendations with broad world knowledge, but the scale of real-world item catalogs, often containing millions of discrete item identifiers (Item IDs), poses a significant challenge. This contrasts with the smaller, tokenized text vocabularies typically used in LLMs. The predominant view within the LLM-based recommendation literature is that it is infeasible to treat item IDs as first-class citizens in the LLM and that some form of tokenization of an item into multiple tokens is required instead. However, this creates a key practical bottleneck in serving these models for real-time low-latency applications. Our paper challenges this predominant practice and integrates item IDs as first-class citizens into the LLM. We provide simple, yet highly effective, novel training and inference modifications that enable single-token representations of items and single-step decoding. Our method shows improvements in recommendation quality (Recall and NDCG) over existing techniques on the Amazon shopping datasets while significantly improving inference efficiency by 5x-14x. Our work offers an efficiency perspective distinct from that of other popular approaches within LLM-based recommendation, potentially inspiring further research and opening up a new direction for integrating IDs into LLMs. Our code is available at this https URL.
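To illustrate what "single-token representations of items and single-step decoding" can mean in general terms, here is a minimal sketch. It is not the paper's method: it only assumes that each item has one embedding vector and that the model's final hidden state is scored against all of them at once. The names `score_items`, `top_k_items`, and the toy `ITEMS` table are hypothetical.

```python
def score_items(hidden, item_embeddings):
    """Dot-product score of one hidden state against every item embedding."""
    return {
        item_id: sum(h * e for h, e in zip(hidden, emb))
        for item_id, emb in item_embeddings.items()
    }


def top_k_items(hidden, item_embeddings, k=2):
    """Return the k highest-scoring item IDs from a single scoring pass."""
    scores = score_items(hidden, item_embeddings)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy catalog (assumed values): one embedding, i.e. one "token", per item.
ITEMS = {
    "item_a": [1.0, 0.0],
    "item_b": [0.0, 1.0],
    "item_c": [0.7, 0.7],
}

# A hypothetical final hidden state produced by the model for one user.
recommended = top_k_items([0.9, 0.2], ITEMS, k=2)
```

Because every item is a single output unit, one scoring pass yields a full ranking, whereas a multi-token item ID would need several sequential decoding steps per recommended item.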
[IR-4] LLMs for estimating positional bias in logged interaction data RECSYS’25
Link: https://arxiv.org/abs/2509.03696
Authors: Aleksandr V. Petrov,Michael Murtagh,Karthik Nagesh
Categories: Information Retrieval (cs.IR)
Comments: Accepted at the CONSEQUENCES Workshop @ RecSys’25
Abstract:Recommender and search systems commonly rely on Learning To Rank models trained on logged user interactions to order items by predicted relevance. However, such interaction data is often subject to position bias, as users are more likely to click on items that appear higher in the ranking, regardless of their actual relevance. As a result, newly trained models may inherit and reinforce the biases of prior ranking models rather than genuinely improving relevance. A standard approach to mitigate position bias is Inverse Propensity Scoring (IPS), where the model’s loss is weighted by the inverse of a propensity function, an estimate of the probability that an item at a given position is examined. However, accurate propensity estimation is challenging, especially in interfaces with complex non-linear layouts. In this paper, we propose a novel method for estimating position bias using Large Language Models (LLMs) applied to logged user interaction data. This approach offers a cost-effective alternative to online experimentation. Our experiments show that propensities estimated with our LLM-as-a-judge approach are stable across score buckets and reveal the row-column effects of Viator’s grid layout that simpler heuristics overlook. An IPS-weighted reranker trained with these propensities matches the production model on standard NDCG@10 while improving weighted NDCG@10 by roughly 2%. We will verify these offline gains in forthcoming live-traffic experiments.
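The Inverse Propensity Scoring idea described in the abstract can be sketched as follows. This is a generic illustration, not the authors' implementation: the propensity values in `PROPENSITY`, the clipping threshold, and the choice of a pointwise log loss over clicked items are all assumptions made for the example.

```python
import math

# Hypothetical examination propensities per rank position (assumed values).
PROPENSITY = {1: 1.0, 2: 0.5, 3: 0.25}


def ips_weight(position, propensity=PROPENSITY, clip=10.0):
    """Inverse propensity weight, clipped to limit variance."""
    return min(1.0 / propensity[position], clip)


def ips_weighted_loss(logged):
    """Average IPS-weighted log loss over clicked items.

    `logged` is a list of (score, position, clicked) tuples, where `score`
    is the model's raw relevance score for the item.
    """
    total, n = 0.0, 0
    for score, position, clicked in logged:
        if not clicked:
            continue  # in this sketch only clicks contribute to the loss
        prob = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the raw score
        total += ips_weight(position) * -math.log(prob)
        n += 1
    return total / n if n else 0.0
```

A click at position 3 is weighted four times as heavily as a click at position 1 here, which compensates for lower positions being examined less often.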
[IR-5] lifeXplore at the Lifelog Search Challenge 2021
Link: https://arxiv.org/abs/2509.03692
Authors: Andreas Leibetseder,Klaus Schoeffmann
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:
Abstract:Since its first iteration in 2018, the Lifelog Search Challenge (LSC) has continued to rise in popularity as an interactive lifelog data retrieval competition, co-located with the ACM International Conference on Multimedia Retrieval (ICMR). The goal of this annual live event is to search a large corpus of lifelogging data for specifically announced memories using a purposefully developed tool within a limited amount of time. As long-standing participants, we present our improved lifeXplore - a retrieval system combining chronologic day summary browsing with interactive combinable concept filtering. Compared to previous versions, the tool is improved by incorporating temporal queries, advanced day summary features, and usability improvements.