本篇博文主要内容为 2025-06-09 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-06-09)
今日共更新576篇论文,其中:
- 自然语言处理共105篇(Computation and Language (cs.CL))
- 人工智能共177篇(Artificial Intelligence (cs.AI))
- 计算机视觉共142篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共206篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Movie Facts and Fibs (MF2): A Benchmark for Long Movie Understanding
【速读】: 该论文试图解决当前视觉-语言模型(VLMs)在全面理解长时长视频内容方面的挑战,特别是由于现有基准测试的局限性导致模型倾向于关注细节而非深层次理解的问题。解决方案的关键在于引入MF²基准,该基准通过包含超过50部全长度电影及其手动构建的主张对(一个真实主张和一个看似合理但虚假的主张),评估模型是否能够理解、整合并回忆关键叙事信息。该基准强调核心叙事元素,如角色动机、因果链和事件顺序,并采用二元主张评估协议,要求模型准确识别真伪主张,从而减少答案顺序等偏差,实现对模型推理能力的更精确评估。
链接: https://arxiv.org/abs/2506.06275
作者: Emmanouil Zaranis,António Farinhas,Saul Santos,Beatriz Canaverde,Miguel Moura Ramos,Aditya K Surikuchi,André Viveiros,Baohao Liao,Elena Bueno-Benito,Nithin Sivakumaran,Pavlo Vasylenko,Shoubin Yu,Sonal Sannigrahi,Wafaa Mohammed,Ben Peters,Danae Sánchez Villegas,Elias Stengel-Eskin,Giuseppe Attanasio,Jaehong Yoon,Stella Frank,Alessandro Suglia,Chrysoula Zerva,Desmond Elliott,Mariella Dimiccoli,Mohit Bansal,Oswald Lanz,Raffaella Bernardi,Raquel Fernández,Sandro Pezzelle,Vlad Niculae,André F. T. Martins
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Under Review
点击查看摘要
Abstract:Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, ``needle-in-a-haystack’’ details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF ^2 , a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF ^2 includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs – one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information – an ability current VLMs lack.
zh
[NLP-1] AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization
【速读】: 该论文旨在解决文本摘要任务中由于预训练数据继承的关联性偏见和框架偏见导致的不公平或不适当输出问题。其解决方案的关键在于提出了一种名为AdvSumm(对抗性摘要)的领域无关训练框架,该框架通过引入一个新颖的Perturber组件,利用梯度引导的嵌入层扰动来提升模型对输入变化的鲁棒性,从而有效减少特定类型的偏见,如姓名-国籍偏见和政治框架偏见,同时保持摘要质量。
链接: https://arxiv.org/abs/2506.06273
作者: Mukur Gupta,Nikhil Reddy Varimalla,Nicholas Deas,Melanie Subbiah,Kathleen McKeown
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved impressive performance in text summarization and are increasingly deployed in real-world applications. However, these systems often inherit associative and framing biases from pre-training data, leading to inappropriate or unfair outputs in downstream tasks. In this work, we present AdvSumm (Adversarial Summarization), a domain-agnostic training framework designed to mitigate bias in text summarization through improved generalization. Inspired by adversarial robustness, AdvSumm introduces a novel Perturber component that applies gradient-guided perturbations at the embedding level of Sequence-to-Sequence models, enhancing the model’s robustness to input variations. We empirically demonstrate that AdvSumm effectively reduces different types of bias in summarization-specifically, name-nationality bias and political framing bias-without compromising summarization quality. Compared to standard transformers and data augmentation techniques like back-translation, AdvSumm achieves stronger bias mitigation performance across benchmark datasets.
zh
[NLP-2] Cartridges: Lightweight and general-purpose long context representations via self-study
【速读】: 该论文试图解决在使用大规模语言模型处理长文本语料库时,因上下文窗口限制导致的高内存消耗和计算成本问题。传统方法通过将整个语料库放入上下文窗口并利用上下文学习(In-Context Learning, ICL)来回答查询,但这种方法在服务过程中存在较高的资源开销。论文提出的解决方案关键在于训练一个小型的键值缓存(KV Cache),称为Cartridge,在推理阶段加载该缓存以生成响应,从而降低服务成本。核心创新在于采用自学(Self-Study)训练策略,通过生成关于语料库的合成对话并使用上下文蒸馏目标进行训练,使Cartridge能够有效模拟ICL功能,并在保持性能的同时显著减少内存占用和提升吞吐量。
链接: https://arxiv.org/abs/2506.06266
作者: Sabri Eyuboglu,Ryan Ehrlich,Simran Arora,Neel Guha,Dylan Zinsley,Emily Liu,Will Tennien,Atri Rudra,James Zou,Azalia Mirhoseini,Christopher Re
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model’s effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
zh
[NLP-3] PersonaAgent : When Large Language Model Agents Meet Personalization at Test Time
【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理普遍采用“一刀切”方法,缺乏根据用户不同需求和偏好的灵活性问题。其解决方案的关键在于提出PersonaAgent框架,该框架通过集成个性化记忆模块(包含情景记忆和语义记忆机制)与个性化动作模块,实现对用户个性化的响应。其中,人格化提示(persona)作为核心组件,起到中介作用,它利用个性化记忆中的洞察来控制代理行为,同时通过行为结果反馈优化记忆内容,从而实现动态的用户偏好对齐。
链接: https://arxiv.org/abs/2506.06254
作者: Weizhi Zhang,Xinyang Zhang,Chenwei Zhang,Liangwei Yang,Jingbo Shang,Zhepei Wei,Henry Peng Zou,Zijie Huang,Zhengyang Wang,Yifan Gao,Xiaoman Pan,Lian Xiong,Jingguo Liu,Philip S. Yu,Xian Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users’ varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components - a personalized memory module that includes episodic and semantic memory mechanisms; a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulate the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.
zh
[NLP-4] Bridging External and Parametric Knowledge: Mitigating Hallucination of LLM s with Shared-Private Semantic Synergy in Dual-Stream Knowledge
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成过程中因外部知识与参数化知识冲突而导致的幻觉问题,以及传统检索增强生成(Retrieval-augmented generation, RAG)方法在性能和稳定性上的下降问题。解决方案的关键在于提出一种双流知识增强框架(Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy, DSSP-RAG),其核心是将自注意力机制改进为混合注意力机制,以区分共享语义和私有语义,实现对内部与外部知识的可控融合。此外,还引入了基于认知不确定性的无监督幻觉检测方法和基于注意力差异矩阵的能源比(Energy Quotient, EQ)来降低检索到的外部知识中的噪声。
链接: https://arxiv.org/abs/2506.06240
作者: Yi Sui,Chaozhuo Li,Chen Zhang,Dawei song,Qiuchi Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) is a cost-effective approach to mitigate the hallucination of Large Language Models (LLMs) by incorporating the retrieved external knowledge into the generation process. However, external knowledge may conflict with the parametric knowledge of LLMs. Furthermore, current LLMs lack inherent mechanisms for resolving such knowledge conflicts, making traditional RAG methods suffer from degraded performance and stability. Thus, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to the framework is a novel approach that refines self-attention into a mixed-attention, distinguishing shared and private semantics for a controlled internal-external knowledge integration. To effectively facilitate DSSP in RAG, we further introduce an unsupervised hallucination detection method based on cognitive uncertainty, ensuring the necessity of introducing knowledge, and an Energy Quotient (EQ) based on attention difference matrices to reduce noise in the retrieved external knowledge. Extensive experiments on benchmark datasets show that DSSP-RAG can effectively resolve conflicts and enhance the complementarity of dual-stream knowledge, leading to superior performance over strong baselines.
zh
[NLP-5] Explaining Matters: Leverag ing Definitions and Semantic Expansion for Sexism Detection ACL2025 ACL
【速读】: 该论文旨在解决在线内容中性别歧视(sexism)检测的两个关键问题:数据稀疏性和性别歧视语言的细微性。为应对这些问题,研究提出了两种基于提示的数据增强技术:基于定义的数据增强(Definition-based Data Augmentation, DDA),通过利用类别特定定义生成语义对齐的合成样本;以及上下文语义扩展(Contextual Semantic Expansion, CSE),通过引入任务相关的语义特征来丰富示例以纠正系统性模型错误。此外,为提升细粒度分类的可靠性,还引入了一种集成策略,通过聚合多个语言模型的互补视角来解决预测冲突。
链接: https://arxiv.org/abs/2506.06238
作者: Sahrish Khan,Arshad Jhumka,Gabriele Pergola
机构: University of Warwick (华威大学); University of Leeds (利兹大学)
类目: Computation and Language (cs.CL)
备注: Proceedings of the 2025 Annual Meeting of the Association for Computational Linguistics (ACL). ACL 2025 - Main Conference
点击查看摘要
Abstract:The detection of sexism in online content remains an open problem, as harmful language disproportionately affects women and marginalized groups. While automated systems for sexism detection have been developed, they still face two key challenges: data sparsity and the nuanced nature of sexist language. Even in large, well-curated datasets like the Explainable Detection of Online Sexism (EDOS), severe class imbalance hinders model generalization. Additionally, the overlapping and ambiguous boundaries of fine-grained categories introduce substantial annotator disagreement, reflecting the difficulty of interpreting nuanced expressions of sexism. To address these challenges, we propose two prompt-based data augmentation techniques: Definition-based Data Augmentation (DDA), which leverages category-specific definitions to generate semantically-aligned synthetic examples, and Contextual Semantic Expansion (CSE), which targets systematic model errors by enriching examples with task-specific semantic features. To further improve reliability in fine-grained classification, we introduce an ensemble strategy that resolves prediction ties by aggregating complementary perspectives from multiple language models. Our experimental evaluation on the EDOS dataset demonstrates state-of-the-art performance across all tasks, with notable improvements of macro F1 by 1.5 points for binary classification (Task A) and 4.1 points for fine-grained classification (Task C).
zh
[NLP-6] Corrector Sampling in Language Models
【速读】: 该论文试图解决自回归语言模型在生成过程中由于固定、不可逆的从左到右的token生成方式导致的误差累积问题。解决方案的关键在于提出了一种新的采样方法——Resample-Previous-Tokens (RPT),该方法通过迭代地回顾并可能替换已生成文本中的一段窗口内的token,从而缓解误差累积问题。
链接: https://arxiv.org/abs/2506.06215
作者: Itai Gat,Neta Shaul,Uriel Singer,Yaron Lipman
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. This method can be integrated into existing autoregressive models, preserving their next-token-prediction quality and speed. Fine-tuning a pretrained 8B parameter model with RPT for only 100B resulted in ~10% relative improvements on reasoning and coding benchmarks compared to the standard sampling.
zh
[NLP-7] Can Theoretical Physics Research Benefit from Language Agents ?
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在理论物理研究中的应用尚不成熟的问题,旨在通过将LLM代理与领域知识和工具箱相结合,以加速理论、计算和应用物理的研究。解决方案的关键在于提升LLM在物理直觉、约束满足和可靠推理方面的能力,并推动面向物理的专用LLM的发展,使其能够处理多模态数据、提出可验证的假设并设计实验,同时需解决物理一致性保障和验证方法的构建等基础性挑战。
链接: https://arxiv.org/abs/2506.06214
作者: Sirui Lu,Zhijing Jin,Terry Jingchen Zhang,Pavel Kos,J. Ignacio Cirac,Bernhard Schölkopf
机构: MPI of Quantum Optics(马克斯·普朗克量子光学研究所); MPI for Intelligent Systems(马克斯·普朗克智能系统研究所); MCQST(马克斯·普朗克量子科学中心); ETH Zürich(苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph); Quantum Physics (quant-ph)
备注: 9 pages
点击查看摘要
Abstract:Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolbox. We analyze current LLM capabilities for physics – from mathematical reasoning to code generation – identifying critical gaps in physical intuition, constraint satisfaction, and reliable reasoning. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments. Realizing this vision requires addressing fundamental challenges: ensuring physical consistency, and developing robust verification methods. We call for collaborative efforts between physics and AI communities to help advance scientific discovery in physics.
zh
[NLP-8] PuzzleWorld: A Benchmark for Multimodal Open-Ended Reasoning in Puzzlehunts
【速读】: 该论文试图解决当前基础模型在开放性、多步骤和创造性多模态推理任务中的表现不足问题,特别是针对类似谜题竞赛(puzzlehunt)这类缺乏明确问题定义的复杂任务。其解决方案的关键在于构建了一个大规模的基准测试集PuzzleWorld,包含667个谜题式问题,并为每个谜题提供了最终答案、详细的推理轨迹和认知技能标签,从而支持全面评估与细粒度诊断分析。通过这一基准,研究者能够更深入地理解模型在多模态推理中的局限性,并探索改进方法。
链接: https://arxiv.org/abs/2506.06211
作者: Hengzhi Li,Brendon Jiang,Alexander Naehu,Regan Song,Justin Zhang,Megan Tjandrasuwita,Chanakya Ekbote,Steven-Shine Chen,Adithya Balachandran,Wei Dai,Rebecca Chang,Paul Pu Liang
机构: Massachusetts Institute of Technology (麻省理工学院); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at this https URL to support future work on building more general, open-ended, and creative reasoning systems.
zh
[NLP-9] Building Models of Neurological Language
【速读】: 该论文旨在解决神经病学领域中语言模型的定制化与实用性问题,通过构建领域特定的语言模型来提升医学文本的理解与生成能力。其解决方案的关键在于利用检索增强生成(Retrieval-Augmented Generation, RAG)和表征模型技术,以实现安全且本地部署的医疗大语言模型,并结合神经病学特定的数据集(如病例报告、问答集及教材数据)以及多词表达提取工具和医学术语的图分析方法,提升模型的专业性和适用性。
链接: https://arxiv.org/abs/2506.06208
作者: Henry Watkins
机构: UCL(伦敦大学学院); Queen Square Institute of Neurology(女王广场神经学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 6 figures
点击查看摘要
Abstract:This report documents the development and evaluation of domain-specific language models for neurology. Initially focused on building a bespoke model, the project adapted to rapid advances in open-source and commercial medical LLMs, shifting toward leveraging retrieval-augmented generation (RAG) and representational models for secure, local deployment. Key contributions include the creation of neurology-specific datasets (case reports, QA sets, textbook-derived data), tools for multi-word expression extraction, and graph-based analyses of medical terminology. The project also produced scripts and Docker containers for local hosting. Performance metrics and graph community results are reported, with future possible work open for multimodal models using open-source architectures like phi-4.
zh
[NLP-10] Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models
【速读】: 该论文试图解决语音钓鱼(Voice Phishing, VP)检测中模型性能不足的问题,尤其是针对小型语言模型(Language Model, LM)在该任务中的表现。其解决方案的关键在于通过微调Llama3模型,并在提示中引入由专家设计的VP评估标准,同时结合链式思维(Chain-of-Thought, CoT)技术,以提升模型对VP攻击的识别能力。实验结果表明,融入人类专家知识的提示策略在小型LM中的效果优于单纯使用CoT技术。
链接: https://arxiv.org/abs/2506.06180
作者: Ju Yong Sim,Seong Hwan Kim
机构: Korea National University of Transportation(韩国国立交通大学)
类目: Computation and Language (cs.CL)
备注: 15 pages, 4 figures, 8 tables, journal submission
点击查看摘要
Abstract:We develop a voice phishing (VP) detector by fine-tuning Llama3, a representative open-source, small language model (LM). In the prompt, we provide carefully-designed VP evaluation criteria and apply the Chain-of-Thought (CoT) technique. To evaluate the robustness of LMs and highlight differences in their performance, we construct an adversarial test dataset that places the models under challenging conditions. Moreover, to address the lack of VP transcripts, we create transcripts by referencing existing or new types of VP techniques. We compare cases where evaluation criteria are included, the CoT technique is applied, or both are used together. In the experiment, our results show that the Llama3-8B model, fine-tuned with a dataset that includes a prompt with VP evaluation criteria, yields the best performance among small LMs and is comparable to that of a GPT-4-based VP detector. These findings indicate that incorporating human expert knowledge into the prompt is more effective than using the CoT technique for small LMs in VP detection.
zh
[NLP-11] Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach
【速读】: 该论文试图解决生成式 AI (Generative AI) 在将自然语言图表描述转换为可执行代码时存在的执行失败问题,尽管经过监督微调和强化学习后仍有约15%的脚本无法运行。解决方案的关键在于提出一种轻量级多智能体流水线,该流水线通过分离起草、执行、修复和判断等步骤,仅使用一个现成的GPT-4o-mini模型,在三次修复迭代内将执行错误率降低至4.5%,显著优于最强的微调基线,并且计算需求更低。
链接: https://arxiv.org/abs/2506.06175
作者: James Ford,Anthony Rios
机构: The University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Computation and Language (cs.CL)
备注: 8 pages
点击查看摘要
Abstract:Large language models can translate natural-language chart descriptions into runnable code, yet approximately 15% of the generated scripts still fail to execute, even after supervised fine-tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single-prompt design. To explore this, we propose a lightweight multi-agent pipeline that separates drafting, execution, repair, and judgment, using only an off-the-shelf GPT-4o-mini model. On the \textscText2Chart31 benchmark, our system reduces execution errors to 4.5% within three repair iterations, outperforming the strongest fine-tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the \textscChartX benchmark, with an error rate of 4.6%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM-based accessibility audit shows that only 33.3% (\textscText2Chart31) and 7.2% (\textscChartX) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.
zh
[NLP-12] semantic-features: A User-Friendly Tool for Studying Contextual Word Embeddings in Interpretable Semantic Spaces
【速读】: 该论文试图解决语言模型(Language Models, LM)中上下文化词嵌入(contextualized word embeddings)的语义解释问题,具体表现为如何通过可解释的空间来研究不同句法结构对语义理解的影响。解决方案的关键在于引入了一个名为semantic-features的扩展性好、易于使用的库,该库基于Chronis等(2023)的方法,通过将词嵌入投影到可解释空间中,从而分析句法结构(如与格结构的介词型和双宾语型)对语义解释的影响。研究通过构建450对句子的数据集,验证了三个掩码语言模型在不同句法结构下对“London”这一词的语义解读差异,结果表明模型表现出预期的敏感性,证明了该工具的有效性。
链接: https://arxiv.org/abs/2506.06169
作者: Jwalanthi Ranganathan,Rohan Jha,Kanishka Misra,Kyle Mahowald
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Toyota Technological Institute at Chicago (丰田技术学院芝加哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: SCiL 2025 Camera Ready Extended Abstract
点击查看摘要
Abstract:We introduce semantic-features, an extensible, easy-to-use library based on Chronis et al. (2023) for studying contextualized word embeddings of LMs by projecting them into interpretable spaces. We apply this tool in an experiment where we measure the contextual effect of the choice of dative construction (prepositional or double object) on the semantic interpretation of utterances (Bresnan, 2007). Specifically, we test whether “London” in “I sent London the letter.” is more likely to be interpreted as an animate referent (e.g., as the name of a person) than in “I sent the letter to London.” To this end, we devise a dataset of 450 sentence pairs, one in each dative construction, with recipients being ambiguous with respect to person-hood vs. place-hood. By applying semantic-features, we show that the contextualized word embeddings of three masked language models show the expected sensitivities. This leaves us optimistic about the usefulness of our tool.
zh
[NLP-13] he Lock-in Hypothesis: Stagnation by Algorithm ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练和部署过程中与用户之间形成的反馈回路所带来的价值观固化问题,这种现象可能导致多样性下降和错误信念的锁定。其解决方案的关键在于通过基于代理的LLM模拟和真实世界GPT使用数据进行实证分析,以验证人类-人工智能反馈循环对用户价值观和信念的影响。
链接: https://arxiv.org/abs/2506.06166
作者: Tianyi Alex Qiu,Zhonghao He,Tejasveer Chugh,Max Kleiman-Weiner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: ICML 2025, 46 pages
点击查看摘要
Abstract:The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber. We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity and potentially the lock-in of false beliefs. We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop. Code and data available at this https URL
zh
[NLP-14] Masked Language Models are Good Heterogeneous Graph Generalizers
【速读】: 该论文旨在解决异构图神经网络(Heterogeneous Graph Neural Networks, HGNNs)在跨领域和跨任务泛化能力不足的问题,以及现有方法中由于HGNN与大语言模型(Large Language Models, LLMs)嵌入空间差异导致的模型理解偏差问题。其解决方案的关键在于提出一种基于掩码语言建模(Masked Language Modeling, MLM)的方法,即MLM4HG,通过引入基于元路径(metapath)的文本序列替代传统的HG令牌,以更有效地提取异构图中的结构和语义信息,并设计定制化的文本模板将不同图任务统一到连贯的“掩码”令牌预测框架中,从而提升模型的泛化能力。
链接: https://arxiv.org/abs/2506.06157
作者: Jinyu Yang,Cheng Yang,Shanyuan Cui,Zeyuan Guo,Liangwei Yang,Muhan Zhang,Chuan Shi
机构: Beijing University of Posts and Telecommunications(北京邮电大学); University of Illinois Chicago(伊利诺伊大学芝加哥分校); Peking University(北京大学)
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Heterogeneous graph neural networks (HGNNs) excel at capturing structural and semantic information in heterogeneous graphs (HGs), while struggling to generalize across domains and tasks. Recently, some researchers have turned to integrating HGNNs with large language models (LLMs) for more generalizable heterogeneous graph learning. However, these approaches typically extract structural information via HGNNs as HG tokens, and disparities in embedding spaces between HGNNs and LLMs have been shown to bias the LLM’s comprehension of HGs. Moreover, as these HG tokens are often derived from node-level tasks, the model’s ability to generalize across tasks remains limited. To this end, we propose a simple yet effective Masked Language Modeling-based method, called MLM4HG. MLM4HG introduces metapath-based textual sequences instead of HG tokens to extract structural and semantic information inherent in HGs, and designs customized textual templates to unify different graph tasks into a coherent cloze-style “mask” token prediction paradigm. Specifically, MLM4HG first converts HGs from various domains to texts based on metapaths, and subsequently combines them with the unified task texts to form a HG-based corpus. Moreover, the corpus is fed into a pretrained LM for fine-tuning with a constrained target vocabulary, enabling the fine-tuned LM to generalize to unseen target HGs. Extensive cross-domain and multi-task experiments on four real-world datasets demonstrate the superior generalization performance of MLM4HG over state-of-the-art methods in both few-shot and zero-shot scenarios. Our code is available at this https URL.
zh
[NLP-15] CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval
【速读】: 该论文旨在解决在线视频内容检索中多模态信息处理效率低下的问题,传统检索系统将视觉、语音、环境音频和屏幕文本等模态视为独立的检索源,导致检索结果噪声大且效果不佳。其解决方案的关键在于提出一种多模态、晚期交互的检索模型CLaMR,该模型联合索引视频帧、语音转录文本、屏幕文本和元数据,并通过统一的多模态主干网络进行编码以提升上下文理解能力。此外,CLaMR通过两项创新实现动态模态选择:一是构建了大规模合成训练数据集MultiVENT 2.0++,二是引入了模态感知损失函数,结合标准对比目标与模态使用学习目标进行联合训练,从而有效提升检索性能。
链接: https://arxiv.org/abs/2506.06144
作者: David Wan,Han Wang,Elias Stengel-Eskin,Jaemin Cho,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 18 pages. Code and data: this https URL
点击查看摘要
Abstract:Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR’s downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
zh
[NLP-16] able-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models
【速读】: 该论文旨在解决小语言模型(SLM)在表格推理(Table Reasoning, TR)任务中的性能不足问题,尤其是其在数值推理方面的局限性。其关键解决方案是提出一种基于程序的表格推理方法(Program-based Table Reasoning, P-TR),通过生成可执行程序来规避文本基础推理(Text-based TR, T-TR)的限制。该方法分为两个阶段:第一阶段引入了布局转换推理(Layout Transformation Inference)的自监督学习任务,以提升表格布局的泛化能力;第二阶段采用混合范式的组相对策略优化(Group Relative Policy Optimization),增强推理一致性并支持动态回退到T-TR。
链接: https://arxiv.org/abs/2506.06137
作者: Rihui Jin,Zheyu Xin,Xing Xie,Zuoyi Li,Guilin Qi,Yongrui Chen,Xinbang Dai,Tongtong Wu,Gholamreza Haffari
机构: Southeast University (东南大学); Monash University (莫纳什大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerability to heterogeneity in table layouts, and (ii) inconsistency in reasoning due to limited code generation capability. We propose Table-r1, a two-stage P-TR method designed for SLMs. Stage 1 introduces an innovative self-supervised learning task, Layout Transformation Inference, to improve tabular layout generalization from a programmatic view. Stage 2 adopts a mix-paradigm variant of Group Relative Policy Optimization, enhancing P-TR consistency while allowing dynamic fallback to T-TR when needed. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods, achieving at least a 15% accuracy improvement over the base model (LLaMA-8B) across all datasets and reaching performance competitive with LLMs.
zh
[NLP-17] Lets CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition
【速读】: 该论文试图解决自然语言推理(Natural Language Inference, NLI)模型在处理条件句中的细粒度语用推理,特别是预设(presupposition)方面的不足。其解决方案的关键在于引入CONFER数据集,该数据集专门用于评估NLI模型在条件句中的推理能力,并通过测试多种NLI模型和大型语言模型(Large Language Models, LLMs)在零样本和少量样本提示设置下的表现,以分析它们对预设推理的处理能力。
链接: https://arxiv.org/abs/2506.06133
作者: Tara Azin,Daniel Dumitrescu,Diana Inkpen,Raj Singh
机构: Carleton University (卡尔顿大学); University of Ottawa (渥太华大学)
类目: Computation and Language (cs.CL)
备注: This paper is published in the Proceedings of the 38th Canadian Conference on Artificial Intelligence (CAIAC 2025). Please cite the conference version at this https URL
点击查看摘要
Abstract:Natural Language Inference (NLI) is the task of determining whether a sentence pair represents entailment, contradiction, or a neutral relationship. While NLI models perform well on many inference tasks, their ability to handle fine-grained pragmatic inferences, particularly presupposition in conditionals, remains underexplored. In this study, we introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Additionally, we evaluate Large Language Models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, in zero-shot and few-shot prompting settings to analyze their ability to infer presuppositions with and without prior context. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.
zh
[NLP-18] Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction INTERSPEECH’25
【速读】: 该论文旨在解决端到端(End-to-end, E2E)自动语音识别(Automatic Speech Recognition, ASR)系统在处理近期或不常出现的电影标题时识别效果不佳的问题,这是因为这些词汇在训练数据中可能缺乏足够的代表性。其解决方案的关键在于提出一种基于发音的纠错系统,该系统包含两个核心组件:一是基于ASR模型输出的发音搜索,生成E2E系统可能未考虑的发音替代词;二是重排序组件,将ASR模型的识别结果与发音替代词相结合,最终选择最优输出。
链接: https://arxiv.org/abs/2506.06117
作者: Christophe Van Gysel,Maggie Wu,Lyan Verwimp,Caglar Tirkaz,Marco Bertola,Zhihong Lei,Youssef Oualil
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: To appear at Interspeech '25
点击查看摘要
Abstract:End-to-end (E2E) Automatic Speech Recognition (ASR) models are trained using paired audio-text samples that are expensive to obtain, since high-quality ground-truth data requires human annotators. Voice search applications, such as digital media players, leverage ASR to allow users to search by voice as opposed to an on-screen keyboard. However, recent or infrequent movie titles may not be sufficiently represented in the E2E ASR system’s training data, and hence, may suffer poor recognition. In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model’s output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives, and select a final system output. We find that our approach improves word error rate between 4.4 and 7.6% relative on benchmarks of popular movie titles over a series of competitive baselines. Comments: To appear at Interspeech '25 Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) Cite as: arXiv:2506.06117 [cs.CL] (or arXiv:2506.06117v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.06117 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-19] Bridging the Gap: In-Context Learning for Modeling Human Disagreement
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理主观性任务(如仇恨言论和侮辱性语言检测)时,由于依赖聚合标签而无法准确反映标注者之间分歧的问题。其解决方案的关键在于通过上下文学习(in-context learning, ICL)探索多视角生成的可能性,并采用不同的标签建模策略(包括聚合硬标签、分解硬标签和软标签)来更全面地捕捉人类判断的多样性。研究还评估了演示样本选择方法对模型性能的影响,以提升模型对主观性内容的理解能力。
链接: https://arxiv.org/abs/2506.06113
作者: Benedetta Muscato,Yue Li,Gizem Gezici,Zhixue Zhao,Fosca Giannotti
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown strong performance on NLP classification tasks. However, they typically rely on aggregated labels-often via majority voting-which can obscure the human disagreement inherent in subjective annotations. This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection. We use in-context learning (ICL) in zero-shot and few-shot settings, evaluating four open-source LLMs across three label modeling strategies: aggregated hard labels, and disaggregated hard and soft labels. In few-shot prompting, we assess demonstration selection methods based on textual similarity (BM25, PLM-based), annotation disagreement (entropy), a combined ranking, and example ordering strategies (random vs. curriculum-based). Results show that multi-perspective generation is viable in zero-shot settings, while few-shot setups often fail to capture the full spectrum of human judgments. Prompt design and demonstration selection notably affect performance, though example ordering has limited impact. These findings highlight the challenges of modeling subjectivity with LLMs and the importance of building more perspective-aware, socially intelligent models.
zh
[NLP-20] Label-Context-Dependent Internal Language Model Estimation for CTC INTERSPEECH2025
【速读】: 该论文试图解决连接主义时间分类(Connectionist Temporal Classification, CTC)模型中隐式学习到的上下文相关内部语言模型(Internal Language Model, ILM)的建模问题,特别是如何显式地估计这种上下文依赖的ILM。解决方案的关键在于提出基于知识蒸馏(Knowledge Distillation, KD)的新型上下文相关ILM估计方法,并引入两种正则化方法以提升模型性能。实验结果表明,所提出的标签级知识蒸馏结合平滑方法在跨领域评估中显著优于传统的上下文无关先验模型,实现了超过13%的相对词错误率降低。
链接: https://arxiv.org/abs/2506.06096
作者: Zijian Yang,Minh-Nghia Phan,Ralf Schlüter,Hermann Ney
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: accepted to Interspeech 2025
点击查看摘要
Abstract:Although connectionist temporal classification (CTC) has the label context independence assumption, it can still implicitly learn a context-dependent internal language model (ILM) due to modern powerful encoders. In this work, we investigate the implicit context dependency modeled in the ILM of CTC. To this end, we propose novel context-dependent ILM estimation methods for CTC based on knowledge distillation (KD) with theoretical justifications. Furthermore, we introduce two regularization methods for KD. We conduct experiments on Librispeech and TED-LIUM Release 2 datasets for in-domain and cross-domain evaluation, respectively. Experimental results show that context-dependent ILMs outperform the context-independent priors in cross-domain evaluation, indicating that CTC learns a context-dependent ILM. The proposed label-level KD with smoothing method surpasses other ILM estimation approaches, with more than 13% relative improvement in word error rate compared to shallow fusion.
zh
[NLP-21] Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning EMNLP2025
【速读】: 该论文试图解决如何通过大型语言模型(Large Language Model, LLM)生成准确的SQL查询问题,特别是在没有监督微调的情况下提升生成代码的准确性。其解决方案的关键在于利用与数据库引擎的交互获取执行反馈,并将此反馈作为强化学习(Reinforcement Learning, RL)中的标量奖励,从而优化模型的策略。通过在Group Relative Policy Optimization (GRPO) 框架中使用这些奖励,研究者在仅依赖问题-答案对的弱监督下显著提升了模型生成SQL代码的准确性。
链接: https://arxiv.org/abs/2506.06093
作者: Atharv Kulkarni,Vivek Srikumar
机构: University of Utah (犹他大学)
类目: Computation and Language (cs.CL)
备注: Under review at EMNLP 2025
点击查看摘要
Abstract:In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: Instead of using supervised fine tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this problem as a reinforcement learning problem where the model receives execution-based feedback from the environment in the form of scalar rewards. These rewards penalize execution failures and assign positive values when a query returns a correct answer. We use the rewards within the Group Relative Policy Optimization (GRPO) framework. We use a tabular reasoning benchmark to test and evaluate our findings. We find that with only weak supervision in the form of question-answer pairs, RL-tuning improves the accuracy of model generated SQL code from 31.49 to 49.83 while reducing error percentage from 25.43% to 14.71%. This improvement allowed the model nearly match the performance performance to the larger SQLCoder-70B model. Our work demonstrates the potential of using execution-based feedback to improve symbolic reasoning capabilities of LLMs.
zh
[NLP-22] MIRIAD: Augmenting LLM s with millions of medical query-response pairs
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医疗领域应用时生成不准确内容的问题,以及现有基于检索增强生成(Retrieval-Augmented Generation, RAG)的管道依赖于噪声大、未结构化的医学文本导致LLMs难以有效利用的问题。解决方案的关键在于构建一个大规模、经过人工校准的医学问答(Medical Question-Answer, QA)语料库MIRIAD,该语料库通过半自动化流程将同行评审医学文献中的内容重新表述并锚定,以结构化查询-响应格式封装网络规模的医学知识,从而实现更精准的检索与整合。
链接: https://arxiv.org/abs/2506.06091
作者: Qinyue Zheng,Salman Abdullah,Sam Rawal,Cyril Zakka,Sophie Ostmeier,Maximilian Purk,Eduardo Reis,Eric J. Topol,Jure Leskovec,Michael Moor
机构: ETH Zurich(ETH Zurich); Stanford University(斯坦福大学); Mayo Clinic(梅奥诊所); Hugging Face(Hugging Face); University of Potsdam(波茨坦大学); Scripps Translational Science Institute(斯克里普斯转化科学研究所); BSSE(生物系统科学与工程学院)
类目: Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:LLMs are bound to transform healthcare with advanced decision support and flexible chat assistants. However, LLMs are prone to generate inaccurate medical content. To ground LLMs in high-quality medical knowledge, LLMs have been equipped with external knowledge via RAG, where unstructured medical knowledge is split into small text chunks that can be selectively retrieved and integrated into the LLMs context. Yet, existing RAG pipelines rely on raw, unstructured medical text, which can be noisy, uncurated and difficult for LLMs to effectively leverage. Systematic approaches to organize medical knowledge to best surface it to LLMs are generally lacking. To address these challenges, we introduce MIRIAD, a large-scale, curated corpus of 5,821,948 medical QA pairs, each rephrased from and grounded in a passage from peer-reviewed medical literature using a semi-automated pipeline combining LLM generation, filtering, grounding, and human annotation. Unlike prior medical corpora, which rely on unstructured text, MIRIAD encapsulates web-scale medical knowledge in an operationalized query-response format, which enables more targeted retrieval. Experiments on challenging medical QA benchmarks show that augmenting LLMs with MIRIAD improves accuracy up to 6.7% compared to unstructured RAG baselines with the same source corpus and with the same amount of retrieved text. Moreover, MIRIAD improved the ability of LLMs to detect medical hallucinations by 22.5 to 37% (increase in F1 score). We further introduce MIRIAD-Atlas, an interactive map of MIRIAD spanning 56 medical disciplines, enabling clinical users to visually explore, search, and refine medical knowledge. MIRIAD promises to unlock a wealth of down-stream applications, including medical information retrievers, enhanced RAG applications, and knowledge-grounded chat interfaces, which ultimately enables more reliable LLM applications in healthcare.
zh
[NLP-23] Zero-Shot Detection of LLM -Generated Code via Approximated Task Conditioning ECML-PKDD2025
【速读】: 该论文旨在解决生成式 AI (Generative AI) 生成代码的检测问题,这一问题在安全、知识产权和学术诚信方面具有重要影响。研究提出了一种新的零样本检测方法,其关键在于利用任务级条件概率分布(ATC)来评估代码片段的token级熵,从而区分LLM生成代码与人工编写的代码。与自然语言文本不同,代码在无条件分布下差异不明显,但通过任务条件可以揭示显著差异,这一发现成为该方法的核心依据。
链接: https://arxiv.org/abs/2506.06069
作者: Maor Ashkenazi,Ofir Brenner,Tal Furman Shohet,Eran Treister
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: To appear in the Proceedings of ECML-PKDD 2025, Springer Lecture Notes in Computer Science (LNCS)
点击查看摘要
Abstract:Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, CPP, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at this https URL, including the dataset gathering implementation, to foster further research in this area.
zh
[NLP-24] Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
【速读】: 该论文旨在解决联邦微调大型语言模型(FedLLMs)在保护数据隐私的同时可能遭受的训练数据提取攻击问题。其关键解决方案是提出一种针对FedLLMs的简单而有效的提取攻击算法,在更现实的威胁模型下,即攻击者仅能访问单个客户端的数据,但仍试图从其他客户端中提取未见过的个人身份信息(PII)。该方法通过利用攻击者持有的上下文前缀来实现跨客户端的泛化,从而有效提取敏感信息。
链接: https://arxiv.org/abs/2506.06060
作者: Yingqi Hu,Zhuo Zhang,Jingyuan Zhang,Lizhen Qu,Zenglin Xu
机构: Harbin Institute of Technology (哈尔滨工业大学); Kuaishou Technology (快手科技); Monash University (莫纳什大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Federated fine-tuning of large language models (FedLLMs) presents a promising approach for achieving strong model performance while preserving data privacy in sensitive domains. However, the inherent memorization ability of LLMs makes them vulnerable to training data extraction attacks. To investigate this risk, we introduce simple yet effective extraction attack algorithms specifically designed for FedLLMs. In contrast to prior “verbatim” extraction attacks, which assume access to fragments from all training data, our approach operates under a more realistic threat model, where the attacker only has access to a single client’s data and aims to extract previously unseen personally identifiable information (PII) from other clients. This requires leveraging contextual prefixes held by the attacker to generalize across clients. To evaluate the effectiveness of our approaches, we propose two rigorous metrics-coverage rate and efficiency-and extend a real-world legal dataset with PII annotations aligned with CPIS, GDPR, and CCPA standards, achieving 89.9% human-verified precision. Experimental results show that our method can extract up to 56.57% of victim-exclusive PII, with “Address,” “Birthday,” and “Name” being the most vulnerable categories. Our findings underscore the pressing need for robust defense strategies and contribute a new benchmark and evaluation framework for future research in privacy-preserving federated learning.
zh
[NLP-25] Hey Thats My Data! Label-Only Dataset Inference in Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练过程中使用未经授权的专有数据集可能导致的版权侵权和经济损失问题。现有方法依赖于模型内部的对数概率(log probabilities)来检测可疑训练数据,但当前主流LLMs已开始隐藏或混淆这些信号,导致检测失效。论文提出的解决方案关键在于利用灾难性遗忘(catastrophic forgetting)现象,即模型在接触新数据时会覆盖先前学习的知识。通过微调可疑数据集的一部分并观察模型输出的变化,与已知非成员验证集进行比较,从而判断可疑数据集是否可能属于模型原始训练语料库。该方法无需依赖模型内部对数概率,为保护专有数据提供了一种有效且实用的方案。
链接: https://arxiv.org/abs/2506.06057
作者: Chen Xiong,Zihao Wang,Rui Zhu,Tsung-Yi Ho,Pin-Yu Chen,Jingwei Xiong,Haixu Tang,Lucila Ohno-Machado
机构: The Chinese University of Hong Kong (香港中文大学); Indiana University Bloomington (印第安纳大学布卢明顿分校); Yale University, School of Medicine (耶鲁大学医学院); IBM Research AI (IBM研究院人工智能); University of California, Davis (加州大学戴维斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized Natural Language Processing by excelling at interpreting, reasoning about, and generating human language. However, their reliance on large-scale, often proprietary datasets poses a critical challenge: unauthorized usage of such data can lead to copyright infringement and significant financial harm. Existing dataset-inference methods typically depend on log probabilities to detect suspicious training material, yet many leading LLMs have begun withholding or obfuscating these signals. This reality underscores the pressing need for label-only approaches capable of identifying dataset membership without relying on internal model logits. We address this gap by introducing CatShift, a label-only dataset-inference framework that capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data. If a suspicious dataset was previously seen by the model, fine-tuning on a portion of it triggers a pronounced post-tuning shift in the model’s outputs; conversely, truly novel data elicits more modest changes. By comparing the model’s output shifts for a suspicious dataset against those for a known non-member validation set, we statistically determine whether the suspicious set is likely to have been part of the model’s original training corpus. Extensive experiments on both open-source and API-based LLMs validate CatShift’s effectiveness in logit-inaccessible settings, offering a robust and practical solution for safeguarding proprietary data. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.06057 [cs.CL] (or arXiv:2506.06057v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.06057 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-26] MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?
【速读】: 该论文试图解决多模态自动定理证明(Multimodal Automated Theorem Proving, MATP)领域中缺乏系统性评估基准的问题,旨在推动多模态大语言模型(Multimodal Large Language Models, MLLMs)在该领域的研究。解决方案的关键在于构建了一个名为MATP-BENCH的多模态、多层级、多语言的基准测试集,包含来自高中、大学及竞赛级别的1056个多模态定理,并提供了Lean 4、Coq和Isabelle等形式化版本,以支持多种定理证明框架。该基准要求模型整合复杂的视觉理解能力、广泛的数学知识以及严格的符号推理能力,从而生成形式化证明。
链接: https://arxiv.org/abs/2506.06034
作者: Zhitao He,Zongwei Lyu,Dazhong Chen,Dadi Guo,Yi R. Fung
机构: Hong Kong University of Science and Technology (香港科技大学); Chinese University of Hong Kong (Shenzhen) (香港中文大学(深圳))
类目: Computation and Language (cs.CL)
备注: 29 pages
点击查看摘要
Abstract:Numerous theorems, such as those in geometry, are often presented in multimodal forms (e.g., diagrams). Humans benefit from visual reasoning in such settings, using diagrams to gain intuition and guide the proof process. Modern Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in solving a wide range of mathematical problems. However, the potential of MLLMs as Automated Theorem Provers (ATPs), specifically in the multimodal domain, remains underexplored. In this paper, we introduce the Multimodal Automated Theorem Proving benchmark (MATP-BENCH), a new Multimodal, Multi-level, and Multi-language benchmark designed to evaluate MLLMs in this role as multimodal automated theorem provers. MATP-BENCH consists of 1056 multimodal theorems drawn from high school, university, and competition-level mathematics. All these multimodal problems are accompanied by formalizations in Lean 4, Coq and Isabelle, thus making the benchmark compatible with a wide range of theorem-proving frameworks. MATP-BENCH requires models to integrate sophisticated visual understanding with mastery of a broad spectrum of mathematical knowledge and rigorous symbolic reasoning to generate formal proofs. We use MATP-BENCH to evaluate a variety of advanced multimodal language models. Existing methods can only solve a limited number of the MATP-BENCH problems, indicating that this benchmark poses an open challenge for research on automated theorem proving.
zh
[NLP-27] Large Language Models are Demonstration Pre-Selectors for Themselves ICML2025
【速读】: 该论文旨在解决传统基于上下文学习(In-context learning, ICL)方法在选择少样本示例时计算成本过高的问题。现有方法依赖于相似性或多样性评分进行示例选择,导致每次查询都需要从大规模数据集中重复检索,从而增加了计算开销。解决方案的关键在于提出FEEDER(FEw yet Essential Demonstration prE-selectoR)框架,该框架通过引入“充分性”和“必要性”指标,并设计基于树的算法,预先筛选出包含最具代表性示例的子集,从而在保持ICL性能的同时显著减少训练数据量并提升效率。
链接: https://arxiv.org/abs/2506.06033
作者: Jiarui Jin,Yuwei Wu,Haoxuan Li,Xiaoting He,Weinan Zhang,Yiming Yang,Yong Yu,Jun Wang,Mengyue Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2025
点击查看摘要
Abstract:In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeatedly retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a representative subset of demonstrations containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, we introduce the “sufficiency” and “necessity” metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this representative subset can effectively replace the full training data, improving efficiency while maintaining comparable performance in ICL. Additionally, our pre-selected subset also benefits fine-tuning LLMs, where we introduce a bi-level optimization method that enhances training efficiency without sacrificing performance. Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL.
zh
[NLP-28] When to Trust Context: Self-Reflective Debates for Context Reliability
【速读】: 该论文试图解决大型语言模型在参数化知识与上下文输入之间出现冲突时导致的事实性不一致或幻觉问题。解决方案的关键在于提出一种轻量级框架Self-Reflective Debate for Contextual Reliability (SR-DCR),该框架结合了逐标记的自我置信度与非对称多智能体辩论,通过一个缺乏上下文的评论者挑战基于给定文本辩护的防御者,并由裁判模型评估辩论以确定上下文的可靠性,最终通过结合裁判结果与模型置信度选择答案。
链接: https://arxiv.org/abs/2506.06020
作者: Zeqi Zhou,Fang Wu,Shayan Talaei,Haokai Zhao,Cheng Meixin,Tinson Xu,Amin Saberi,Yejin Choi
机构: Brown University (布朗大学); Stanford University (斯坦福大学); University of New South Wales (新南威尔士大学); Xi’an University of Electronic Science and Technology (西安电子科技大学); University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models frequently encounter conflicts between their parametric knowledge and contextual input, often resulting in factual inconsistencies or hallucinations. We propose Self-Reflective Debate for Contextual Reliability (SR-DCR), a lightweight framework that integrates token-level self-confidence with an asymmetric multi-agent debate to adjudicate such conflicts. A critic, deprived of context, challenges a defender who argues from the given passage; a judge model evaluates the debate and determines the context’s reliability. The final answer is selected by combining the verdict with model confidence. Experiments on the ClashEval benchmark demonstrate that SR-DCR consistently enhances robustness to misleading context while maintaining accuracy on trustworthy inputs, outperforming both classical debate and confidence-only baselines with minimal computational overhead. The code is available at this https URL.
zh
[NLP-29] AgentS wift: Efficient LLM Agent Design via Value-guided Hierarchical Search
【速读】: 该论文旨在解决高性能智能体系统设计中的三个主要挑战:过度优化智能体工作流而未能充分利用已验证的人工设计组件(如记忆、规划和工具使用)、评估成本高以及在大规模搜索空间中的搜索效率低下。其解决方案的关键在于提出一个分层搜索空间,联合建模智能体工作流与可组合的功能组件,从而实现更丰富的智能体系统设计;引入一种预测价值模型,根据智能体系统和任务描述估计性能,实现搜索过程中的高效低成本评估;并采用基于不确定性的分层蒙特卡洛树搜索(MCTS)策略来引导搜索。
链接: https://arxiv.org/abs/2506.06017
作者: Yu Li,Lehui Li,Zhihao Wu,Qingmin Liao,Jianye Hao,Kun Shao,Fengli Xu,Yong Li
机构: Tsinghua University (清华大学); Shandong University (山东大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注: 20pages
点击查看摘要
Abstract:Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation costs, as each newly generated agent must be fully evaluated on benchmarks; and (3) inefficient search in large search space. In this work, we introduce a comprehensive framework to address these challenges. First, We propose a hierarchical search space that jointly models agentic workflow and composable functional components, enabling richer agentic system designs. Building on this structured design space, we introduce a predictive value model that estimates agent performance given agentic system and task description, allowing for efficient, low-cost evaluation during the search process. Finally, we present a hierarchical Monte Carlo Tree Search (MCTS) strategy informed by uncertainty to guide the search. Experiments on seven benchmarks, covering embodied, math, web, tool, and game, show that our method achieves an average performance gain of 8.34% over state-of-the-art baselines and exhibits faster search progress with steeper improvement trajectories. Code repo is available at this https URL.
zh
[NLP-30] Unlocking Recursive Thinking of LLM s: Alignment via Refinement ACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在递归推理能力上的局限性,尤其是在缺乏专家标注数据进行蒸馏的情况下。其解决方案的关键在于提出一种名为AvR(Alignment via Refinement)的新方法,该方法通过引入一个融合批评与改进动作的精炼过程,并利用可微学习技术优化精炼感知奖励,从而提升模型在长格式思维链(Chain of Thought, CoT)中的表现。该方法生成的多轮数据可被组织为长格式精炼思维,进一步支持测试时的扩展性。
链接: https://arxiv.org/abs/2506.06009
作者: Haoke Zhang,Xiaobo Liang,Cunxiang Wang,Juntao Li,Min Zhang
机构: Soochow University (苏州大学); Zhipu AI (智普AI); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the Findings of ACL 2025
点击查看摘要
Abstract:The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose \textbfAvR: \textbfAlignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize \textbfrefinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at Github (this https URL).
zh
[NLP-31] oken Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models ICML2025
【速读】: 该论文试图解决Chain-of-Thought (CoT)技术在不同任务中性能提升不一致的问题,以及其底层机制尚不明确的长期研究难题。解决方案的关键在于通过分析token概率分布的单调性,提出两种基于概率分布的指标来评估CoT在不同任务中的有效性,并结合实例级指标与逻辑回归模型,引入Dynamic CoT方法,实现对CoT和直接回答的动态选择。此外,通过将从开源模型中学习到的决策策略迁移至闭源模型,进一步扩展了Dynamic CoT的应用范围。
链接: https://arxiv.org/abs/2506.06008
作者: Peijie Liu,Fengli Xu,Yong Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures, 13 tables(Accept by ICML2025)
点击查看摘要
Abstract:Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across different tasks, and the underlying mechanism remains a long-standing research question. In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with logistic regression model, we introduce Dynamic CoT, a method that dynamically select between CoT and direct answer. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89.2%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment.
zh
[NLP-32] Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models
【速读】: 该论文试图解决视觉-语言基础模型在通过语言表达动作时,是否具备现实世界模型(observation × action → observation)和动力学模型(observation × observation → action)的问题。研究发现,开源基础模型在获取这两种模型方面均存在困难,但通过监督微调获取动力学模型比获取世界模型更容易。解决方案的关键在于利用动力学模型通过两种主要策略来引导世界模型的构建:一是通过合成数据进行弱监督学习,二是推理时的验证。其中,动力学模型可对未标注的视频帧对进行动作标注以扩展训练数据,并通过识别模型预测的重要性权重对观察对中的图像标记进行加权;此外,动力学模型还可为世界模型的多个样本分配奖励以指导推理时的搜索。
链接: https://arxiv.org/abs/2506.06006
作者: Yifu Qiu,Yftah Ziser,Anna Korhonen,Shay B. Cohen,Edoardo M. Ponti
机构: Institute for Language, Cognition and Computation, University of Edinburgh (爱丁堡大学语言、认知与计算研究所); Language Technology Lab, University of Cambridge (剑桥大学语言技术实验室); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:To what extent do vision-and-language foundation models possess a realistic world model (observation \times action \rightarrow observation) and a dynamics model (observation \times observation \rightarrow action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of 15% on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
zh
[NLP-33] A Culturally-Rich Romanian NLP Dataset from “Who Wants to Be a Millionaire?” Videos
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在不同语言和文化背景下表现不一致的问题,其核心挑战在于模型对特定文化内容的理解能力不足。解决方案的关键是构建一个富含文化细节的多语言数据集,该数据集来源于罗马尼亚综艺节目《谁想成为百万富翁?》(Vrei să fii Milionar?),通过结合光学字符识别(OCR)、自动文本提取与人工验证的方法,收集并标注了包含问题领域、文化相关性及难度等元数据的问答对。该数据集为评估和改进LLMs在跨文化场景下的性能提供了重要基础。
链接: https://arxiv.org/abs/2506.05991
作者: Alexandru-Gabriel Ganea,Antonia-Adelina Popovici,Adrian-Marius Dumitran
机构: University of Bucharest Faculty of Mathematics and Computer Science (布加勒斯特大学数学与计算机科学学院)
类目: Computation and Language (cs.CL)
备注: 10 pages
点击查看摘要
Abstract:Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show “Who Wants to Be a Millionaire?” (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.
zh
[NLP-34] au-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization
【速读】: 该论文旨在解决文本匿名化过程中隐私保护与信息保留之间的复杂权衡问题,以及缺乏统一基准来全面评估不同场景下匿名化技术有效性的挑战。其解决方案的关键在于提出Tau-Eval,一个开源框架,通过隐私和效用任务敏感性视角对文本匿名化方法进行基准测试,提供了Python库、代码、文档和教程以支持研究与应用。
链接: https://arxiv.org/abs/2506.05979
作者: Gabriel Loiseau,Damien Sileo,Damien Riquet,Maxime Meyer,Marc Tommasi
机构: Hornetsecurity(霍纳特安全); Inria(法国国家信息与自动化研究所); CNRS(法国国家科学研究中心); Centrale Lille(里尔中央理工学院); UMR 9189 - CRIStAL(UMR 9189 - CRIStAL)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Text anonymization is the process of removing or obfuscating information from textual data to protect the privacy of individuals. This process inherently involves a complex trade-off between privacy protection and information preservation, where stringent anonymization methods can significantly impact the text’s utility for downstream applications. Evaluating the effectiveness of text anonymization proves challenging from both privacy and utility perspectives, as there is no universal benchmark that can comprehensively assess anonymization techniques across diverse, and sometimes contradictory contexts. We present Tau-Eval, an open-source framework for benchmarking text anonymization methods through the lens of privacy and utility task sensitivity. A Python library, code, documentation and tutorials are publicly available.
zh
[NLP-35] LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles SEMEVAL2025
【速读】: 该论文旨在解决在实体框架(Entity Framing)任务中,如何从长文档中提取必要的上下文片段以供基于掩码语言模型的分类任务使用的问题。其解决方案的关键在于采用一种简单的面向实体的启发式方法进行上下文选择,该方法能够使具有有限上下文窗口的模型实现有效的文本分类,且其性能与使用更大生成语言模型的监督微调方法相当或更优。
链接: https://arxiv.org/abs/2506.05976
作者: Egil Rønningstad,Gaurav Negi
机构: University of Oslo(奥斯陆大学); University of Galway(爱尔兰国立高威大学)
类目: Computation and Language (cs.CL)
备注: Accepted for SemEval 2025; The 19th International Workshop on Semantic Evaluation
点击查看摘要
Abstract:Our contribution to the SemEval 2025 shared task 10, subtask 1 on entity framing, tackles the challenge of providing the necessary segments from longer documents as context for classification with a masked language model. We show that a simple entity-oriented heuristics for context selection can enable text classification using models with limited context window. Our context selection approach and the XLM-RoBERTa language model is on par with, or outperforms, Supervised Fine-Tuning with larger generative language models.
zh
[NLP-36] Lets Put Ourselves in Sallys Shoes: Shoes-of-Others Prefixing Improves Theory of Mind in Large Language Models
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在心智理论(Theory of Mind, ToM)方面尚未达到人类水平性能的问题,尤其是在没有世界状态变化的对话和叙事情境中。现有推理阶段的ToM方法通常针对涉及世界状态变化的情境进行设计,而该研究提出了一种新的推理阶段方法——“他者之鞋”(Shoes-of-Others, SoO)前缀,其关键在于通过在LLM输出的开头添加“Let’s put ourselves in A’s shoes.”(其中A表示目标角色的名称)来引导模型生成更符合心智理论的响应,从而在更广泛的情境下提升ToM性能。
链接: https://arxiv.org/abs/2506.05970
作者: Kazutoshi Shinoda,Nobukatsu Hojo,Kyosuke Nishida,Yoshihiro Yamazaki,Keita Suzuki,Hiroaki Sugiyama,Kuniko Saito
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14pages, 12 figures
点击查看摘要
Abstract:Recent studies have shown that Theory of Mind (ToM) in large language models (LLMs) has not reached human-level performance yet. Since fine-tuning LLMs on ToM datasets often degrades their generalization, several inference-time methods have been proposed to enhance ToM in LLMs. However, existing inference-time methods for ToM are specialized for inferring beliefs from contexts involving changes in the world state. In this study, we present a new inference-time method for ToM, Shoes-of-Others (SoO) prefixing, which makes fewer assumptions about contexts and is applicable to broader scenarios. SoO prefixing simply specifies the beginning of LLM outputs with ``Let’s put ourselves in A’s shoes.‘’, where A denotes the target character’s name. We evaluate SoO prefixing on two benchmarks that assess ToM in conversational and narrative contexts without changes in the world state and find that it consistently improves ToM across five categories of mental states. Our analysis suggests that SoO prefixing elicits faithful thoughts, thereby improving the ToM performance.
zh
[NLP-37] Elementary Math Word Problem Generation using Large Language Models
【速读】: 该论文试图解决数学应用题(Math Word Problems, MWPs)生成过程中人工创建耗时且依赖额外输入的问题。现有深度学习方法通常需要教师提供问题的初始部分或附加信息(如方程),而本文提出的解决方案基于大型语言模型(Large Language Models, LLMs),其关键在于仅需输入所需题目数量、年级和题型(如加法、减法),即可自动生成高质量的MWPs。该方案通过多种实验验证了不同LLMs、提示策略及提升题目多样性的技术,同时结合人类反馈优化模型性能,最终生成的题目在语法和拼写方面表现良好,但在符合指定年级和题型要求方面仍存在挑战。
链接: https://arxiv.org/abs/2506.05950
作者: Nimesh Ariyarathne,Harshani Bandara,Yasith Heshan,Omega Gamage,Surangika Ranathunga,Dilan Nayanajith,Yutharsan Sivapalan,Gayathri Lihinikaduarachchi,Tharoosha Vihidun,Meenambika Chandirakumar,Sanujen Premakumar,Sanjula Gathsara
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Mathematics is often perceived as a complex subject by students, leading to high failure rates in exams. To improve Mathematics skills, it is important to provide sample questions for students to practice problem-solving. Manually creating Math Word Problems (MWPs) is time consuming for tutors, because they have to type in natural language while adhering to grammar and spelling rules of the language. Existing Deep Learning techniques for MWP generation either require a tutor to provide the initial portion of the MWP, and/or additional information such as an equation. In this paper, we present an MWP generation system based on Large Language Models (LLMs) that overcome the need for additional input - the only input to our system is the number of MWPs needed, the grade and the type of question (e.g. addition, subtraction). Unlike the existing LLM-based solutions for MWP generation, we carried out an extensive set of experiments involving different LLMs, prompting strategies, techniques to improve the diversity of questions, as well as techniques that employ human feedback to improve LLM performance. Human and automated evaluations confirmed that the generated MWPs are high in quality, with minimal spelling and grammar issues. However, LLMs still struggle to generate questions that adhere to the specified grade and question type requirements.
zh
[NLP-38] NameTag 3: A Tool and a Service for Multilingual/Multitagset NER ACL2025
【速读】: 该论文旨在解决多语言、多数据集和多实体标签集的命名实体识别(NER)问题,尤其关注平铺实体和嵌套实体的识别。解决方案的关键在于开发了一个名为NameTag 3的开源工具和基于云的网络服务,该服务采用单一经过微调的355M参数模型支持17种语言的平铺NER,并使用一个126M参数模型支持捷克语的嵌套NER,从而在多个测试数据集上实现了最先进的性能。
链接: https://arxiv.org/abs/2506.05949
作者: Jana Straková,Milan Straka
机构: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (查理大学,数学与物理学院,形式与应用语言学研究所)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025
点击查看摘要
Abstract:We introduce NameTag 3, an open-source tool and cloud-based web service for multilingual, multidataset, and multitagset named entity recognition (NER), supporting both flat and nested entities. NameTag 3 achieves state-of-the-art results on 21 test datasets in 15 languages and remains competitive on the rest, even against larger models. It is available as a command-line tool and as a cloud-based service, enabling use without local installation. NameTag 3 web service currently provides flat NER for 17 languages, trained on 21 corpora and three NE tagsets, all powered by a single 355M-parameter fine-tuned model; and nested NER for Czech, powered by a 126M fine-tuned model. The source code is licensed under open-source MPL 2.0, while the models are distributed under non-commercial CC BY-NC-SA 4.0. Documentation is available at this https URL, source code at this https URL, and trained models via this https URL. The REST service and the web application can be found at this https URL. A demonstration video is available at this https URL.
zh
[NLP-39] IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems ACL2025
【速读】: 该论文旨在解决情感支持对话中由于支持者意图不明确而导致的策略不当问题,这可能使支持者无意中将自己的期望或解决方案强加给寻求帮助的人。解决方案的关键在于提出一种以意图为中心的情感支持对话框架(Intention-centered Emotional Support Conversation, IntentionESC),该框架定义了支持者在情感支持对话中的可能意图,识别了用于推断这些意图的关键情感状态方面,并将其映射到适当的支持策略。此外,引入了以意图为中心的思维链(Intention Centric Chain-of-Thought, ICECoT)机制,使大型语言模型能够通过分析情感状态、推断意图并选择合适的支持策略来模拟人类推理,从而生成更有效的情感支持响应。
链接: https://arxiv.org/abs/2506.05947
作者: Xinjie Zhang,Wenxuan Wang,Qin Jin
机构: Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL2025 findings
点击查看摘要
Abstract:In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter’s motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel in text generating, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at this https URL.
zh
[NLP-40] DynamicMind: A Tri-Mode Thinking System for Large Language Models
【速读】: 该论文试图解决现代大型语言模型(Large Language Models, LLMs)在面对不同复杂度的任务时,难以动态调整推理深度导致性能不佳或资源利用效率低的问题。解决方案的关键在于提出了一种名为DynamicMind的新型三模式思维系统,通过认知启发式的提示工程使LLMs能够自主选择快速、正常和慢速思维模式以进行零样本问答(Zero-shot Question Answering, ZSQA)任务。其核心创新包括:将经典的双进程思维框架扩展为包含正常思维模式的三模式系统,以保持LLM的内在能力;引入思维密度(Thinking Density)指标,实现计算资源分配与问题复杂度的对齐;以及构建思维模式容量(Thinking Mode Capacity, TMC)数据集和轻量级Mind Router,用于预测最优思维模式。
链接: https://arxiv.org/abs/2506.05936
作者: Wei Li,Yanbin Wei,Qiushi Huang,Jiangyue Yan,Yang Chen,James T. Kwok,Yu Zhang
机构: Southern University of Science and Technology (南方科技大学); Hong Kong University of Science and Technology (香港科技大学); University of Surrey (萨里大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Modern large language models (LLMs) often struggle to dynamically adapt their reasoning depth to varying task complexities, leading to suboptimal performance or inefficient resource utilization. To address this, we introduce DynamicMind, a novel tri-mode thinking system. DynamicMind empowers LLMs to autonomously select between Fast, Normal, and Slow thinking modes for zero-shot question answering (ZSQA) tasks through cognitive-inspired prompt engineering. Our framework’s core innovations include: (1) expanding the established dual-process framework of fast and slow thinking into a tri-mode thinking system involving a normal thinking mode to preserve the intrinsic capabilities of LLM; (2) proposing the Thinking Density metric, which aligns computational resource allocation with problem complexity; and (3) developing the Thinking Mode Capacity (TMC) dataset and a lightweight Mind Router to predict the optimal thinking mode. Extensive experiments across diverse mathematical, commonsense, and scientific QA benchmarks demonstrate that DynamicMind achieves superior ZSQA capabilities while establishing an effective trade-off between performance and computational efficiency.
zh
[NLP-41] MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models
【速读】: 该论文旨在解决参数高效微调(PEFT)方法在大型语言模型(LLM)应用中存在表示崩溃和专家负载不平衡的问题。现有方法采用同构的Mixture-of-Experts (MoE)-LoRA架构,其专家结构和容量相似或相同,导致性能受限。论文提出的解决方案关键在于引入异构的Mixture-of-Adapters (MoA)方法,通过动态整合具有不同结构的PEFT适配器专家,利用其互补的表征能力促进专家专业化,从而提升预训练知识向下游任务的有效迁移。
链接: https://arxiv.org/abs/2506.05928
作者: Jie Cao,Tianwei Lin,Hongyang He,Rolan Yan,Wenqiao Zhang,Juncheng Li,Dongping Zhang,Siliang Tang,Yueting Zhuang
机构: Zhejiang University (浙江大学); Tencent (腾讯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ \emphhomogeneous MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a \emphheterogeneous \textbfMixture-of-Adapters (MoA) approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: \textbf(i) \textitSoft MoA achieves fine-grained integration by performing a weighted fusion of all expert outputs; \textbf(ii) \textitSparse MoA activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at this https URL.
zh
[NLP-42] LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations
【速读】: 该论文试图解决法律行政文本在西班牙语中可读性与理解性不足的问题,旨在通过生成简化版本来提升文本的可访问性。解决方案的关键在于基于西班牙社会保障网站中最常用程序生成的法律行政文本,创建两种简化版本:第一种遵循arText claro的建议,第二种则进一步结合了简洁语言指南的额外建议,以探索系统在文本简化方面的潜在改进空间。
链接: https://arxiv.org/abs/2506.05927
作者: Belén Agüera-Marco,Itziar Gonzalez-Dios
机构: University of the Basque Country UPV/EHU (巴斯克自治区大学UPV/EHU); HiTZ Center - Ixa (HiTZ中心-Ixa)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: In this report, we present a part of the master thesis written by Belén Agüera Marco in order to obtain the B.S. Language Analysis and Processing at the University of the Basque Country (UPV/EHU), supervised by Itziar Gonzalez-Dios
点击查看摘要
Abstract:In this work, we present LengClaro2023, a dataset of legal-administrative texts in Spanish. Based on the most frequently used procedures from the Spanish Social Security website, we have created for each text two simplified equivalents. The first version follows the recommendations provided by arText claro. The second version incorporates additional recommendations from plain language guidelines to explore further potential improvements in the system. The linguistic resource created in this work can be used for evaluating automatic text simplification (ATS) systems in Spanish.
zh
[NLP-43] Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques IJCAI2025
【速读】: 该论文试图解决虚假信息(misinformation)传播带来的社会威胁,尤其是传统人工事实核查成本高且难以扩展的问题。其解决方案的关键在于提出一种名为MisMitiFact的高效框架,该框架通过生成基于事实的反驳内容来减轻虚假信息的影响。MisMitiFact的核心创新在于利用轻量级、细粒度的批判模型,这些模型在可获取的事实核查网站数据上进行训练,以识别并修正大型语言模型(LLM)生成内容中的关键元素错误,如数字、实体和主题,从而确保反驳内容具有事实依据。相较于依赖LLM自反馈的方法,该方案在保持反驳质量的同时显著降低了计算成本,并提升了反馈生成的吞吐量。
链接: https://arxiv.org/abs/2506.05924
作者: Xiaofei Xu,Xiuzhen Zhang,Ke Deng
机构: RMIT University (皇家墨尔本理工大学)
类目: Computation and Language (cs.CL)
备注: accepted to IJCAI 2025
点击查看摘要
Abstract:Fake news and misinformation poses a significant threat to society, making efficient mitigation essential. However, manual fact-checking is costly and lacks scalability. Large Language Models (LLMs) offer promise in automating counter-response generation to mitigate misinformation, but a critical challenge lies in their tendency to hallucinate non-factual information. Existing models mainly rely on LLM self-feedback to reduce hallucination, but this approach is computationally expensive. In this paper, we propose MisMitiFact, Misinformation Mitigation grounded in Facts, an efficient framework for generating fact-grounded counter-responses at scale. MisMitiFact generates simple critique feedback to refine LLM outputs, ensuring responses are grounded in evidence. We develop lightweight, fine-grained critique models trained on data sourced from readily available fact-checking sites to identify and correct errors in key elements such as numerals, entities, and topics in LLM generations. Experiments show that MisMitiFact generates counter-responses of comparable quality to LLMs’ self-feedback while using significantly smaller critique models. Importantly, it achieves ~5x increase in feedback generation throughput, making it highly suitable for cost-effective, large-scale misinformation mitigation. Code and LLM prompt templates are at this https URL.
zh
[NLP-44] Proactive Assistant Dialogue Generation from Streaming Egocentric Videos
【速读】: 该论文旨在解决实时感知任务引导系统开发中的挑战,特别是由于数据收集和系统评估过程成本高且耗时的问题。其解决方案的关键在于提出一个综合框架,包含三个核心贡献:首先,引入一种新颖的数据整理流程,从标注的第一人称视频中合成对话,生成大规模的合成对话数据集;其次,开发一套经过广泛人类研究验证的自动评估指标;最后,提出一个端到端模型,能够处理流式视频输入并生成上下文相关的响应,同时引入新方法处理数据不平衡和长时视频问题。
链接: https://arxiv.org/abs/2506.05904
作者: Yichi Zhang,Xin Luna Dong,Zhaojiang Lin,Andrea Madotto,Anuj Kumar,Babak Damavandi,Joyce Chai,Seungwhan Moon
机构: Meta(元); University of Michigan(密歇根大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in \dataset, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: this https URL
zh
[NLP-45] Route-and-Reason : Scaling Large Language Model Reasoning with Reinforced Model Router
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在进行多步骤推理时因token使用量激增而导致的高成本问题,同时保持或提升模型的推理性能。其解决方案的关键在于提出R2-Reasoner框架,该框架通过动态路由子任务的方式,在异构LLMs之间实现协作推理,核心组件为强化学习驱动的模型路由器(Reinforced Model Router),该路由器包含任务分解器和子任务分配器,能够根据任务复杂度估计将子任务分配给最合适的模型,从而在准确性和效率之间取得平衡。
链接: https://arxiv.org/abs/2506.05901
作者: Chenyang Shao,Xinyang Liu,Yutang Lin,Fengli Xu,Yong Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models (LLMs) by decomposing complex tasks into intermediate steps, either explicitly or implicitly. Extending the reasoning chain at test time through deeper thought processes or broader exploration, can furthur improve performance, but often incurs substantial costs due to the explosion in token usage. Yet, many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models (SLMs). This motivates hybrid approaches that allocate subtasks across models of varying capacities. However, realizing such collaboration requires accurate task decomposition and difficulty-aware subtask allocation, which is challenging. To address this, we propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs by dynamically routing sub-tasks based on estimated complexity. At the core of our framework is a Reinforced Model Router, composed of a task decomposer and a subtask allocator. The task decomposer segments complex input queries into logically ordered subtasks, while the subtask allocator assigns each subtask to the most appropriate model, ranging from lightweight SLMs to powerful LLMs, balancing accuracy and efficiency. To train this router, we introduce a staged pipeline that combines supervised fine-tuning on task-specific datasets with Group Relative Policy Optimization algorithm, enabling self-supervised refinement through iterative reinforcement learning. Extensive experiments across four challenging benchmarks demonstrate that R2-Reasoner reduces API costs by 86.85% while maintaining or surpassing baseline accuracy. Our framework paves the way for more cost-effective and adaptive LLM reasoning. The code is open-source at this https URL .
zh
[NLP-46] Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
【速读】: 该论文试图解决多语言语言模型在进行链式推理(Chain-of-Thought, CoT)时出现的跨语言崩溃(Cross-lingual Collapse)问题,即模型在不同语言的提示下仍倾向于使用预训练时占主导地位的语言进行推理。为了解决这一问题,研究者采用了一种基于组相对策略优化(Group-Relative Policy Optimization, GRPO)的微调方法,在三种不同语言(中文、韩语和乌克兰语)的GSM 8K和SimpleRL-Zoo数据集翻译版本上进行实验。解决方案的关键在于通过GRPO加速预训练语言的不平衡,同时探索语言一致性奖励对缓解跨语言崩溃的作用,尽管这会带来准确率的下降。
链接: https://arxiv.org/abs/2506.05850
作者: Cheonbok Park,Jeonghoon Kim,Joosung Lee,Sanghwan Bae,Jaegul Choo,Kangmin Yoo
机构: NAVER Cloud(NAVER云); KAIST(韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint
点击查看摘要
Abstract:We identify \textbfCross-lingual Collapse, a systematic drift in which the chain-of-thought (CoT) of a multilingual language model reverts to its dominant pre-training language even when the prompt is expressed in a different language. Recent large language models (LLMs) with reinforcement learning with verifiable reward (RLVR) have achieved strong logical reasoning performances by exposing their intermediate reasoning traces, giving rise to large reasoning models (LRMs). However, the mechanism behind multilingual reasoning in LRMs is not yet fully explored. To investigate the issue, we fine-tune multilingual LRMs with Group-Relative Policy Optimization (GRPO) on translated versions of the GSM 8 K and SimpleRL-Zoo datasets in three different languages: Chinese, Korean, and Ukrainian. During training, we monitor both task accuracy and language consistency of the reasoning chains. Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) language consistency reward mitigates this drift but does so at the expense of an almost 5 - 10 pp drop in accuracy. and (iii) the resulting language collapse is severely damaging and largely irreversible, as subsequent fine-tuning struggles to steer the model back toward its original target-language reasoning capabilities. Together, these findings point to a remarkable conclusion: \textitnot all languages are trained equally for reasoning. Furthermore, our paper sheds light on the roles of reward shaping, data difficulty, and pre-training priors in eliciting multilingual reasoning.
zh
[NLP-47] FinanceReasoning : Benchmarking Financial Numerical Reasoning More Credible Comprehensive and Challenging ACL2025
【速读】: 该论文试图解决大推理模型(Large Reasoning Models, LRMs)在金融数值推理任务中的评估与提升问题。其解决方案的关键在于构建一个名为FinanceReasoning的基准,该基准通过三个核心改进来提升评估的可信度、全面性和挑战性:首先,通过更新和标注大量金融问题并严格优化评估标准以提高可信度;其次,覆盖广泛的金融概念和公式,并提供Python格式函数以增强模型的金融推理能力;最后,设计高难度问题要求模型综合应用多种金融公式进行精确计算,从而揭示LRMs在数值精度方面的局限性,并探索结合推理器与程序员模型以提升性能的有效方法。
链接: https://arxiv.org/abs/2506.05828
作者: Zichen Tang,Haihong E,Ziyan Ma,Haoyang He,Jiacheng Liu,Zhongjun Yang,Zihua Rong,Rongjin Li,Kun Ji,Qing Huang,Xinyang Hu,Yang Liu,Qianhe Zheng
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注: Accepted by ACL 2025 Main Conference
点击查看摘要
Abstract:We introduce FinanceReasoning, a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems. Compared to existing benchmarks, our work provides three key advancements. (1) Credibility: We update 15.6% of the questions from four public datasets, annotating 908 new questions with detailed Python solutions and rigorously refining evaluation standards. This enables an accurate assessment of the reasoning improvements of LRMs. (2) Comprehensiveness: FinanceReasoning covers 67.8% of financial concepts and formulas, significantly surpassing existing datasets. Additionally, we construct 3,133 Python-formatted functions, which enhances LRMs’ financial reasoning capabilities through refined knowledge (e.g., 83.2% \rightarrow 91.6% for GPT-4o). (3) Challenge: Models are required to apply multiple financial formulas for precise numerical reasoning on 238 Hard problems. The best-performing model (i.e., OpenAI o1 with PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical precision. We demonstrate that combining Reasoner and Programmer models can effectively enhance LRMs’ performance (e.g., 83.2% \rightarrow 87.8% for DeepSeek-R1). Our work paves the way for future research on evaluating and improving LRMs in domain-specific complex reasoning tasks.
zh
[NLP-48] CodeContests: High-Quality Test Case Generation for Competitive Programming
【速读】: 该论文旨在解决竞争性编程中测试用例生成的质量问题,尤其是在大规模数据集构建过程中,由于测试用例难以获取而导致评估准确性受限的问题。解决方案的关键在于引入一个基于大语言模型(Large Language Model, LLM)的智能体系统,该系统能够生成高质量的测试用例,并将其应用于CodeContests数据集,从而提出改进版本CodeContests+。通过大量带有通过/失败标签的提交数据验证,证明了CodeContests+在测试用例准确性方面显著优于原始数据集,尤其在真正例率(True Positive Rate, TPR)上表现突出,进一步验证了测试用例质量提升对LLM强化学习任务的积极影响。
链接: https://arxiv.org/abs/2506.05817
作者: Zihan Wang,Siyao Liu,Yang Sun,Hongyan Li,Kai Shen
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 28 pages, 7 figures
点击查看摘要
Abstract:Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Therefore, test case generation is a necessary task for building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluated the quality of test cases in CodeContestsPlus. First, we used 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicated that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR). Subsequently, our experiments in LLM Reinforcement Learning (RL) further confirmed that improvements in test case quality yield considerable advantages for RL.
zh
[NLP-49] MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning
【速读】: 该论文旨在解决表格问答任务中当前大语言模型(Large Language Models, LLMs)在单次推理过程中难以实现复杂推理能力的问题。现有方法如链式思维推理和问题分解缺乏错误检测机制,并且无法保留和利用问题求解经验,与人类解决问题的方式存在显著差异。论文提出的解决方案是MAPLE(Multi-agent Adaptive Planning with Long-term mEmory),其关键在于通过专门的认知代理在反馈驱动的循环中模拟人类问题解决过程,包含四个核心组件:使用ReAct范式的求解器、用于答案验证的检查器、用于错误诊断和策略修正的反思器,以及用于经验重用和演化的长期记忆管理器。
链接: https://arxiv.org/abs/2506.05813
作者: Ye Bai,Minghan Wang,Thuy-Trang Vu
机构: Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注: 26 pages, 10 figures
点击查看摘要
Abstract:Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Adaptive Planning with Long-term mEmory), a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop. MAPLE integrates 4 key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long-term memory for experience reuse and evolution. Experiments on WiKiTQ and TabFact demonstrate significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.
zh
[NLP-50] Discrete Minds in a Continuous World: Do Language Models Know Time Passes?
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在感知实际时间流逝及据此调整决策能力方面的空白问题。其解决方案的关键在于提出并验证“Token-Time Hypothesis”,即LLMs能够将离散的token数量映射到连续的物理时间,并通过对话时长判断任务、响应长度适应性调整以及动态环境下的时间压力实验来验证这一假设,从而揭示LLMs在时间感知方面的能力及其与模型规模和推理能力的关系。
链接: https://arxiv.org/abs/2506.05790
作者: Minghan Wang,Ye Bai,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari
机构: Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can map discrete token counts to continuous wall-clock time, and validate this through a dialogue duration judgment task. Second, we demonstrate that LLMs could use this awareness to adapt their response length while maintaining accuracy when users express urgency in question answering tasks. Finally, we develop BombRush, an interactive navigation challenge that examines how LLMs modify behavior under progressive time pressure in dynamic environments. Our findings indicate that LLMs possess certain awareness of time passage, enabling them to bridge discrete linguistic tokens and continuous physical time, though this capability varies with model size and reasoning abilities. This work establishes a theoretical foundation for enhancing temporal awareness in LLMs for time-sensitive applications.
zh
[NLP-51] dots.llm 1 Technical Report
【速读】: 该论文旨在解决大规模语言模型在计算资源消耗过高和训练成本昂贵的问题,通过引入Mixture of Experts (MoE)模型结构,实现对参数的高效利用。其解决方案的关键在于仅对每个输入标记激活部分参数(dots.llm1激活了14B参数,占总参数量142B的约10%),从而在保持与当前最先进模型相当性能的同时,显著降低训练和推理成本。此外,通过高质量的数据处理流程和无需使用合成数据的预训练策略,进一步提升了模型效果与实用性。
链接: https://arxiv.org/abs/2506.05767
作者: Bi Huo,Bin Tu,Cheng Qin,Da Zheng,Debing Zhang,Dongjie Zhang,En Li,Fu Guo,Jian Yao,Jie Lou,Junfeng Tian,Li Hu,Ran Zhu,Shengdong Chen,Shuo Liu,Su Guang,Te Wo,Weijun Zhang,Xiaoming Shi,Xinxin Peng,Xing Wu,Yawen Liu,Yuqiu Ji,Ze Wen,Zhenhai Liu,Zichao Li,Zilong Liao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
zh
[NLP-52] BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions
【速读】: 该论文试图解决现有基于检索增强生成(Retrieval Augmented Generation, RAG)的大语言模型(Large Language Models, LLMs)在处理多模态领域信息时的局限性,即大多数模型仅专注于单模态信息(主要是文本)的检索,而无法有效处理如医疗等领域中涉及知识图谱、文本(临床记录)和复杂分子结构等多模态信息的查询。解决方案的关键在于提出BioMol-MQA数据集,该数据集包含两个部分:(i)用于信息检索的多模态知识图谱(Knowledge Graph, KG),涵盖文本和分子结构;(ii)设计用于测试LLMs在多模态KG上进行检索与推理能力的挑战性问题,从而推动强RAG框架的发展。
链接: https://arxiv.org/abs/2506.05766
作者: Saptarshi Sengupta,Shuhua Yang,Paul Kwong Yu,Fali Wang,Suhang Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Retrieval augmented generation (RAG) has shown great power in improving Large Language Models (LLMs). However, most existing RAG-based LLMs are dedicated to retrieving single modality information, mainly text; while for many real-world problems, such as healthcare, information relevant to queries can manifest in various modalities such as knowledge graph, text (clinical notes), and complex molecular structure. Thus, being able to retrieve relevant multi-modality domain-specific information, and reason and synthesize diverse knowledge to generate an accurate response is important. To address the gap, we present BioMol-MQA, a new question-answering (QA) dataset on polypharmacy, which is composed of two parts (i) a multimodal knowledge graph (KG) with text and molecular structure for information retrieval; and (ii) challenging questions that designed to test LLM capabilities in retrieving and reasoning over multimodal KG to answer questions. Our benchmarks indicate that existing LLMs struggle to answer these questions and do well only when given the necessary background data, signaling the necessity for strong RAG frameworks.
zh
[NLP-53] Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?
【速读】: 该论文试图解决机器视觉模型(如大型视觉语言模型,LVLMs)在面对视觉幻觉时是否能够区分实际特征与表观特征的问题,现有研究因使用非抽象图像且未明确区分实际与表观特征而导致对机器认知能力的评估存在歧义。解决方案的关键在于构建一个分类为真实幻觉和虚假幻觉的视觉问答(VQA)数据集,并配备相应的控制图像,其中真实幻觉存在实际与表观特征的差异,而虚假幻觉则在实际与表观特征一致的情况下因几何配置相似而显得具有幻觉效果。通过评估LVLMs在真实与虚假幻觉VQA任务中的表现,研究发现模型对两类问题的预测答案相同,表明其响应可能基于对幻觉的先验知识而非真正的视觉理解。
链接: https://arxiv.org/abs/2506.05765
作者: Taiga Shinozaki,Tomoki Doi,Satoshi Nishida,Hitomi Yanaka
机构: The University of Tokyo(东京大学); Keio University(庆应大学); CiNet, NICT(国家信息与通信技术研究所); Riken(理化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: To appear in the Proceedings of the 47th Annual Meeting of the Cognitive Science Society (COGSCI 2025)
点击查看摘要
Abstract:Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, studies often have used non-abstract images and have not distinguished actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have the same actual and apparent features even though they look illusory due to the similar geometric configuration. We evaluate the performance of LVLMs for genuine and fake illusion VQA tasks and investigate whether the models discern actual and apparent features. Our findings indicate that although LVLMs may appear to recognize illusions by correctly answering questions about both feature types, they predict the same answers for both Genuine Illusion and Fake Illusion VQA questions. This suggests that their responses might be based on prior knowledge of illusions rather than genuine visual understanding. The dataset is available at this https URL
zh
[NLP-54] Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning
【速读】: 该论文试图解决监督微调(Supervised Fine-Tuning, SFT)在长文本生成任务中面临的数据饱和和学习能力受限的问题。其解决方案的关键在于提出一种自适应课程强化学习框架(Writing-RL),该框架包含三个核心组件:基于边距的数据选择策略,用于优先选取具有高学习潜力的样本;成对比较奖励机制,在缺乏可验证奖励的情况下提供区分性学习信号;以及动态参考调度方法,通过根据模型性能变化自适应调整任务难度来提升训练效果。
链接: https://arxiv.org/abs/2506.05760
作者: Xuanyu Lei,Chenliang Li,Yuning Wu,Kaiming Liu,Weizhou Shen,Peng Li,Ming Yan,Ji Zhang,Fei Huang,Yang Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Work in progress
点击查看摘要
Abstract:Recent advances in Large Language Models (LLMs) have enabled strong performance in long-form writing, yet existing supervised fine-tuning (SFT) approaches suffer from limitations such as data saturation and restricted learning capacity bounded by teacher signals. In this work, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. The framework consists of three key components: Margin-aware Data Selection strategy that prioritizes samples with high learning potential, Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards, and Dynamic Reference Scheduling approach, which plays a particularly critical role by adaptively adjusting task difficulty based on evolving model performance. Experiments on 7B-scale writer models show that our RL framework largely improves long-form writing performance over strong SFT baselines. Furthermore, we observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.
zh
[NLP-55] Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective
【速读】: 该论文试图解决传统约束解码方法在生成满足硬性约束的样本时会扭曲模型分布的问题,这一问题在程序模糊测试等需要生成多样且有效程序输入的应用中尤为突出。解决方案的关键在于提出一种基于马尔可夫链蒙特卡洛(Markov Chain Monte Carlo, MCMC)的约束采样框架,该框架能够同时满足三个核心需求:约束满足(每个样本均满足约束条件)、单调收敛(采样过程收敛于真实条件分布)以及高效性(高质量样本在少数步骤内生成)。其核心思想是构建一个有效输出的提议分布,并基于语言模型的似然应用Metropolis-Hastings接受准则,从而实现对约束空间的合理且高效的探索。
链接: https://arxiv.org/abs/2506.05754
作者: Emmanuel Anaya Gonzalez,Sairam Vaidya,Kanghee Park,Ruyi Ji,Taylor Berg-Kirkpatrick,Loris D’Antoni
机构: UCSD(加州大学圣地亚哥分校); Peking University(北京大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Constrained decoding enables Language Models (LMs) to produce samples that provably satisfy hard constraints. However, existing constrained-decoding approaches often distort the underlying model distribution, a limitation that is especially problematic in applications like program fuzzing, where one wants to generate diverse and valid program inputs for testing purposes. We propose a new constrained sampling framework based on Markov Chain Monte Carlo (MCMC) that simultaneously satisfies three core desiderata: constraint satisfying (every sample satisfies the constraint), monotonically converging (the sampling process converges to the true conditional distribution), and efficient (high-quality samples emerge in few steps). Our method constructs a proposal distribution over valid outputs and applies a Metropolis-Hastings acceptance criterion based on the LM’s likelihood, ensuring principled and efficient exploration of the constrained space. Empirically, our sampler outperforms existing methods on both synthetic benchmarks and real-world program fuzzing tasks.
zh
[NLP-56] LLM -Symbolic Integration for Robust Temporal Tabular Reasoning ACL
【速读】: 该论文试图解决时间表问答(Temporal Tabular Question Answering)中大型语言模型(Large Language Models, LLMs)面临的挑战,这些问题包括对结构化数据的鲁棒推理不足、传统提示方法在记忆、表格规模敏感性和复杂查询性能方面的局限性。解决方案的关键在于引入一个名为TempTabQA-C的合成数据集,以及一种符号化的中间表示,将表格转换为数据库模式,从而允许LLMs生成和执行SQL查询,提升泛化能力和减少偏差。此外,通过结合自适应少量样本提示与上下文定制的示例,进一步增强了方法的鲁棒性、可扩展性和性能。
链接: https://arxiv.org/abs/2506.05746
作者: Atharv Kulkarni,Kushagra Dixit,Vivek Srikumar,Dan Roth,Vivek Gupta
机构: University of Utah(犹他大学); University of Pennsylvania(宾夕法尼亚大学); Arizona State University(亚利桑那州立大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL Findings 2025
点击查看摘要
Abstract:Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, which is a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
zh
[NLP-57] Do LLM s Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)中无意记忆的隐式知识依赖问题,传统方法仅关注显式移除孤立事实,而忽略了模型内部潜在的推理依赖关系及知识的非确定性特性。解决方案的关键在于提出一种基于知识图谱(Knowledge Graph)的无监督学习评估框架,通过构建带有置信度评分的相关事实上下文,并利用强大的LLM作为评判者进行推理评估,从而更准确地捕捉真实世界知识的隐式结构,提升对无监督学习效果评估的现实性和严谨性。
链接: https://arxiv.org/abs/2506.05735
作者: Rongzhe Wei,Peizhi Niu,Hans Hao-Hsun Hsu,Ruihan Wu,Haoteng Yin,Mohsen Ghassemi,Yifan Li,Vamsi K. Potluru,Eli Chien,Kamalika Chaudhuri,Olgica Milenkovic,Pan Li
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at this https URL.
zh
[NLP-58] Large Language Models are Good Relational Learners
【速读】: 该论文试图解决将大型语言模型(Large Language Models, LLMs)应用于关系深度学习(Relational Deep Learning, RDL)时存在的问题,即现有方法通过遍历数据库中的实体间关系链并将结构化数据转换为平面文本文档,导致关键关系结构被忽略、冗余信息增加且常超出LLM的标准上下文长度。解决方案的关键在于提出一种名为Rel-LLM的新架构,该架构利用图神经网络(Graph Neural Network, GNN)编码器在检索增强生成(Retrieval-Augmented Generation, RAG)框架内生成结构化的关系提示,从而保留数据库的固有关系结构,并使LLMs能够有效处理和推理复杂实体关系。
链接: https://arxiv.org/abs/2506.05725
作者: Fang Wu,Vijay Prakash Dwivedi,Jure Leskovec
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents. Still, this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)- based encoder to generate structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at this https URL.
zh
[NLP-59] RKEFino1 : A Regulation Knowledge-Enhanced Large Language Model
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在金融应用中引入的准确性与合规性挑战,特别是在数字监管报告(Digital Regulatory Reporting, DRR)领域。解决方案的关键在于提出RKEFino1,一个基于Fino1并结合XBRL、CDM和MOF领域知识进行微调的监管知识增强型金融推理模型。该模型通过构建基于知识的问答任务和数学推理任务,并引入一种新的数值命名实体识别(Numerical NER)任务,以提升在合规关键金融任务中的表现。
链接: https://arxiv.org/abs/2506.05700
作者: Yan Wang,Yueru He,Ruoyu Xiang,Jeff Zhao
机构: Yale University (耶鲁大学); Columbia University (哥伦比亚大学); New York University (纽约大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent advances in large language models (LLMs) hold great promise for financial applications but introduce critical accuracy and compliance challenges in Digital Regulatory Reporting (DRR). To address these issues, we propose RKEFino1, a regulation knowledge-enhanced financial reasoning model built upon Fino1, fine-tuned with domain knowledge from XBRL, CDM, and MOF. We formulate two QA tasks-knowledge-based and mathematical reasoning-and introduce a novel Numerical NER task covering financial entities in both sentences and tables. Experimental results demonstrate the effectiveness and generalization capacity of RKEFino1 in compliance-critical financial tasks. We have released our model on Hugging Face.
zh
[NLP-60] Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework
【速读】: 该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在压缩大语言模型(Large Language Models, LLMs)过程中存在的学生模型分布显著偏移问题,这些问题包括灾难性遗忘、模式崩溃和训练与推理不匹配。其解决方案的关键在于提出一种基于“渐进超负荷”(Progressive Overload, POCL)原理的可插拔课程学习框架,该框架通过难度度量器对训练样本进行从易到难的排序与划分,并由训练调度器在固定时间间隔内逐步引入这些子集,同时应用温度逐渐升高的损失函数,从而提升学习的稳定性和效率。
链接: https://arxiv.org/abs/2506.05695
作者: Lingyuan Liu,Mengxiang Zhang
机构: City University of Hong Kong (香港城市大学); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model’s capabilities to a smaller student model, reducing inference cost and memory usage while maintaining performance. However, existing KD methods for LLMs often fail to prevent significant shifts in the student model’s distribution during training, leading to issues such as catastrophic forgetting, mode collapse, and training-inference mismatch. To address these challenges, we propose a novel, plug-in curriculum learning framework inspired by the strength training principle of “progressive overload” (POCL), which can be seamlessly integrated into existing white-box KD approaches with minimal computational overhead. The framework comprises two core components: (1) a difficulty measurer that ranks and partitions training samples from easy to hard, and (2) a training scheduler that incrementally introduces these subsets into the distillation process at fixed intervals while applying loss functions with progressively rising temperatures. By starting with the easiest samples and progressively increasing the difficulty, the approach enhances both the stability and efficiency of learning. Extensive experiments in instruction-following settings demonstrate that POCL consistently improves the performance of distilled student models across various white-box KD methods and model families. Our findings highlight the effectiveness of sorted training samples in KD for LLMs. More generally, our work demonstrates how to structure training data within the KD process to enhance the stability and performance of distilled LLMs.
zh
[NLP-61] When to use Graphs in RAG : A Comprehensive Analysis for Graph Retrieval-Augmented Generation
【速读】: 该论文试图解决GraphRAG在实际任务中表现不佳的问题,即其是否真正有效以及在哪些场景下图结构能为检索增强生成(RAG)系统带来可衡量的提升。解决方案的关键在于提出GraphRAG-Bench,这是一个全面的基准测试平台,用于评估GraphRAG模型在层次化知识检索和深度上下文推理方面的性能,并通过系统化的数据集和评估流程深入分析GraphRAG的优势与适用条件。
链接: https://arxiv.org/abs/2506.05690
作者: Zhishang Xiang,Chuanjie Wu,Qinggang Zhang,Shengyuan Chen,Zijin Hong,Xiao Huang,Jinsong Su
机构: Xiamen University (厦门大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate this http URL its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models onboth hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, coveringfact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph constructionand knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions when GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at this https URL.
zh
[NLP-62] Voice Impression Control in Zero-Shot TTS INTERSPEECH2025
【速读】: 该论文旨在解决在零样本文本到语音(zero-shot TTS)中难以调节细微的副语言/非语言信息以控制感知到的声音特征(即印象)的问题。解决方案的关键在于利用一个低维向量来表示不同声音印象对(如暗淡-明亮)的强度,并通过大型语言模型生成该向量,从而实现基于自然语言描述的目标印象生成,无需手动优化。
链接: https://arxiv.org/abs/2506.05688
作者: Keinichi Fujita,Shota Horiguchi,Yusuke Ijima
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 5 pages,5 figures, Accepted to INTERSPEECH 2025
点击查看摘要
Abstract:Para-/non-linguistic information in speech is pivotal in shaping the listeners’ impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method’s effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.
zh
[NLP-63] A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations
【速读】: 该论文试图解决不同语法形式系统(即短语结构语法(Phrase Structure Grammar, PSG)、依存语法(Dependency Grammar, DG)和范畴语法(Categorial Grammar, CG))在语言结构表征上的不统一问题,旨在从句法和计算复杂性的角度建立一种统一的表征框架。解决方案的关键在于提出“对应原则”(correspondence principle),以实现PSG、DG和CG在表征原理上的统一,进而为连续与非连续子句的句法分析提供理论整合与计算简化的新路径。
链接: https://arxiv.org/abs/2506.05686
作者: Ratna Kandala,Prakash Mondal
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper advances a unified representation of linguistic structure for three grammar formalisms, namely, Phrase Structure Grammar (PSG), Dependency Grammar (DG) and Categorial Grammar (CG) from the perspective of syntactic and computational complexity considerations. The correspondence principle is proposed to enable a unified representation of the representational principles from PSG, DG, and CG. To that end, the paper first illustrates a series of steps in achieving a unified representation for a discontinuous subordinate clause from Turkish as an illustrative case. This affords a new way of approaching discontinuity in natural language from a theoretical point of view that unites and integrates the basic tenets of PSG, DG, and CG, with significant consequences for syntactic analysis. Then this paper demonstrates that a unified representation can simplify computational complexity with regards to the neurocognitive representation and processing of both continuous and discontinuous sentences vis-à-vis the basic principles of PSG, DG, and CG.
zh
[NLP-64] Zero-Shot Event Causality Identification via Multi-source Evidence Fuzzy Aggregation with Large Language Models
【速读】: 该论文旨在解决事件因果关系识别(Event Causality Identification, ECI)中依赖大规模标注数据的监督方法以及大型语言模型(Large Language Models, LLMs)在零样本场景下容易产生因果幻觉的问题。其解决方案的关键在于提出MEFA框架,该框架通过多源证据模糊聚合(Multi-source Evidence Fuzzy Aggregation)实现因果关系的准确判断,具体包括将因果推理分解为三个主要任务和三个辅助任务,利用精心设计的提示引导LLMs生成不确定性响应与确定性输出,并通过模糊聚合整合子任务证据以实现因果评分与判定。
链接: https://arxiv.org/abs/2506.05675
作者: Zefan Zeng,Xingchen Hu,Qing Cheng,Weiping Ding,Wentao Li,Zhong Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Event Causality Identification (ECI) aims to detect causal relationships between events in textual contexts. Existing ECI models predominantly rely on supervised methodologies, suffering from dependence on large-scale annotated data. Although Large Language Models (LLMs) enable zero-shot ECI, they are prone to causal hallucination-erroneously establishing spurious causal links. To address these challenges, we propose MEFA, a novel zero-shot framework based on Multi-source Evidence Fuzzy Aggregation. First, we decompose causality reasoning into three main tasks (temporality determination, necessity analysis, and sufficiency verification) complemented by three auxiliary tasks. Second, leveraging meticulously designed prompts, we guide LLMs to generate uncertain responses and deterministic outputs. Finally, we quantify LLM’s responses of sub-tasks and employ fuzzy aggregation to integrate these evidence for causality scoring and causality determination. Extensive experiments on three benchmarks demonstrate that MEFA outperforms second-best unsupervised baselines by 6.2% in F1-score and 9.3% in precision, while significantly reducing hallucination-induced errors. In-depth analysis verify the effectiveness of task decomposition and the superiority of fuzzy aggregation.
zh
[NLP-65] Contextually Guided Transformers via Low-Rank Adaptation
【速读】: 该论文试图解决传统基于Transformer的大型语言模型(Large Language Models, LLMs)在实现特定行为时依赖显式提示(prompt)所带来的计算开销问题。其解决方案的关键在于提出一种改进的Transformer架构——上下文引导的Transformer(Contextually Guided Transformer, CGT),通过将上下文信息编码到模型权重中,使模型能够在处理序列时动态更新权重,从而实现自我专业化,无需依赖外部提示。
链接: https://arxiv.org/abs/2506.05672
作者: Andrey Zhmoginov,Jihwan Lee,Max Vladymyrov,Mark Sandler
机构: Google DeepMind (谷歌深度思维)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates the need for explicit prompts by learning to encode context into the model’s weights. Our Contextually Guided Transformer (CGT) model maintains a contextual summary at each sequence position, allowing it to update the weights on the fly based on the preceding context. This approach enables the model to self-specialize, effectively creating a tailored model for processing information following a given prefix. We demonstrate the effectiveness of our method on synthetic in-context learning tasks and language modeling benchmarks. Furthermore, we introduce techniques for enhancing the interpretability of the learned contextual representations, drawing connections to Variational Autoencoders and promoting smoother, more consistent context encoding. This work offers a novel direction for efficient and adaptable language modeling by integrating context directly into the model’s architecture.
zh
[NLP-66] Can LLM s Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨文化情境下表达个性时存在的文化适配性不足的问题,即现有研究大多忽视了文化与个性之间的相互作用。其解决方案的关键在于引入CulturalPersonas,这是首个基于大规模人类验证的基准数据集,旨在评估LLMs在植根于文化背景的行为丰富场景中的个性表达能力。该数据集包含来自六个不同国家的3000个情景化问题,通过日常场景激发个性表达,并通过多选和开放式回答格式对三个LLMs进行评估,从而显著提升了模型与各国特定人格分布的一致性,并产生了更具文化连贯性的输出。
链接: https://arxiv.org/abs/2506.05670
作者: Priyanka Dey,Yugal Khanter,Aayush Bothra,Jieyu Zhao,Emilio Ferrara
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As LLMs become central to interactive applications, ranging from tutoring to mental health, the ability to express personality in culturally appropriate ways is increasingly important. While recent works have explored personality evaluation of LLMs, they largely overlook the interplay between culture and personality. To address this, we introduce CulturalPersonas, the first large-scale benchmark with human validation for evaluating LLMs’ personality expression in culturally grounded, behaviorally rich contexts. Our dataset spans 3,000 scenario-based questions across six diverse countries, designed to elicit personality through everyday scenarios rooted in local values. We evaluate three LLMs, using both multiple-choice and open-ended response formats. Our results show that CulturalPersonas improves alignment with country-specific human personality distributions (over a 20% reduction in Wasserstein distance across models and countries) and elicits more expressive, culturally coherent outputs compared to existing benchmarks. CulturalPersonas surfaces meaningful modulated trait outputs in response to culturally grounded prompts, offering new directions for aligning LLMs to global norms of behavior. By bridging personality expression and cultural nuance, we envision that CulturalPersonas will pave the way for more socially intelligent and globally adaptive LLMs.
zh
[NLP-67] BAQ: Efficient Bit Allocation Quantization for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练量化过程中因采用统一或启发式位宽分配而导致的非均匀权重对量化噪声敏感性未被充分考虑的问题。其解决方案的关键在于基于Hessian代理得出的敏感性度量,构建了一个新的框架来分配量化位宽。通过关键假设,将层/组件级别的损失函数表达为位宽的显式函数,从而将位宽分配问题形式化为一个凸优化任务,并得到闭合解,该解通过自适应调整权重精度以最小化层级量化损失。这一方法最终形成了BAQ(Bit Allocation Quantization)算法,实现了损失最小化与复杂度之间的良好平衡。
链接: https://arxiv.org/abs/2506.05664
作者: Chao Zhang,Li Wang,Samson Lasaulce,Merouane Debbah
机构: Central South University, China; Khalifa University, UAE; CNRS, France
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed \textbfBAQ (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56 \times lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All codes of this paper are available at this https URL.
zh
[NLP-68] Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones ICML2024
【速读】: 该论文试图解决大型基础模型(Foundation Models, FMs)在计算资源消耗大、难以在普通设备上部署,以及其广泛知识可能与特定任务无关的问题。解决方案的关键在于将大型Transformer模型的参数映射到更小的专用模型参数,通过使转换任务特定化,以捕捉执行特定任务所需的知识范围,从而提升小模型在特定任务上的性能。
链接: https://arxiv.org/abs/2506.05641
作者: Andrey Zhmoginov,Jihwan Lee,Mark Sandler
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Presented at ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models (ICML 2024)
点击查看摘要
Abstract:Modern Foundation Models (FMs) are typically trained on corpora spanning a wide range of different data modalities, topics and downstream tasks. Utilizing these models can be very computationally expensive and is out of reach for most consumer devices. Furthermore, most of the broad FM knowledge may actually be irrelevant for a specific task at hand. Here we explore a technique for mapping parameters of a large Transformer to parameters of a smaller specialized model. By making this transformation task-specific, we aim to capture a narrower scope of the knowledge needed for performing a specific task by a smaller model. We study our method on image modeling tasks, showing that performance of generated models exceeds that of universal conditional models.
zh
[NLP-69] A Fictional QA Dataset for Studying Memorization and Knowledge Acquisition
【速读】: 该论文试图解决语言模型在训练过程中如何记忆事实(fact memorization)与verbatim序列记忆(verbatim sequence memorization)的问题,尤其是后者相较于前者更不为人所知。解决方案的关键在于提出一个新数据集,该数据集由合成生成的、类似网络文本的虚构事件文档及其相关问答对组成,旨在帮助研究人员区分和研究这两种记忆机制。通过使用合成数据,论文展示了其在分离不同形式记忆方面的有效性,并探讨了构建真实感强的虚构合成数据所面临的挑战。
链接: https://arxiv.org/abs/2506.05639
作者: John Kirchenbauer,Janny Mongkolsupawan,Yuxin Wen,Tom Goldstein,Daphne Ippolito
机构: University of Maryland (马里兰大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages and 8 figures in the main body
点击查看摘要
Abstract:When language models are trained on textual data, they acquire both knowledge about the structure of language as well as knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can verbatim memorize long sequences from their training data. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be effective in teasing apart different forms of memorization. We also document the challenges in effectively building realistic, fictional synthetic data.
zh
[NLP-70] IYKYK: Using language models to decode extremist cryptolects
【速读】: 该论文试图解决当前语言技术在检测和解读极端主义群体使用的内群体语言(cryptolects)方面的不足。解决方案的关键在于通过领域适应和专门提示技术提升通用大语言模型(LLMs)的表现,从而增强对极端主义语言的识别能力。
链接: https://arxiv.org/abs/2506.05635
作者: Christine de Kock,Arij Riabi,Zeerak Talat,Michael Sejr Schlichtkrull,Pranava Madhyastha,Ed Hovy
机构: University of Melbourne (墨尔本大学); Inria (法国国家信息与自动化研究所); University of Edinburgh (爱丁堡大学); Queen Mary University of London (伦敦玛丽女王大学); City St George’s, University of London (伦敦城市圣乔治大学); University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Extremist groups develop complex in-group language, also referred to as cryptolects, to exclude or mislead outsiders. We investigate the ability of current language technologies to detect and interpret the cryptolects of two online extremist platforms. Evaluating eight models across six tasks, our results indicate that general purpose LLMs cannot consistently detect or decode extremist language. However, performance can be significantly improved by domain adaptation and specialised prompting techniques. These results provide important insights to inform the development and deployment of automated moderation technologies. We further develop and release novel labelled and unlabelled datasets, including 19.4M posts from extremist platforms and lexicons validated by human experts.
zh
[NLP-71] Leverag ing Self-Attention for Input-Dependent Soft Prompting in LLM s ACL2025
【速读】: 该论文旨在解决大型语言模型在特定领域任务中性能不足的问题,其核心挑战在于微调过程的计算成本高且技术复杂。论文提出的解决方案是采用参数高效的微调方法——软提示(soft prompting),通过学习少量参数来适应下游任务。该方法的关键在于提出了一种基于输入依赖的软提示技术(Input Dependent Soft Prompting with a self-Attention Mechanism, ID-SPAM),该技术根据输入标记生成软提示,并通过自注意力机制对不同标记赋予不同的重要性,从而在保持可训练参数数量较少的情况下提升模型性能。
链接: https://arxiv.org/abs/2506.05629
作者: Ananth Muppidi,Abhilash Nandy,Sambaran Bandyopadhyay
机构: IIIT Hyderabad (印度国际信息技术学院); IIT Kharagpur (印度理工学院克勒格布爾分校); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL)
备注: Accepted in ACL 2025 (Main) Conference
点击查看摘要
Abstract:The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and show the improved zero shot domain transfer capability.
zh
[NLP-72] Deployability-Centric Infrastructure-as-Code Generation: An LLM -based Iterative Framework
【速读】: 该论文旨在解决当前Infrastructure-as-Code (IaC)生成研究中忽视部署可行性的评估问题,即现有方法主要关注语法正确性而未充分考虑IaC模板的实际部署效果。其解决方案的关键在于提出IaCGen框架,该框架基于大型语言模型(LLM)并采用迭代反馈机制来生成具有高部署成功率的IaC模板,同时构建了DPIaC-Eval基准测试集以全面评估语法、部署、用户意图和安全性。
链接: https://arxiv.org/abs/2506.05623
作者: Tianyi Zhang,Shidong Pan,Zejun Zhang,Zhenchang Xing,Xiaoyu Sun
机构: Australian National University (澳大利亚国立大学); New York University (纽约大学); Columbia University (哥伦比亚大学); Nanyang Technological University (南洋理工大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61实验室)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions, but current evaluation focuses on syntactic correctness while ignoring deployability, the fatal measure of IaC template utility. We address this gap through two contributions: (1) IaCGen, an LLM-based deployability-centric framework that uses iterative feedback mechanism to generate IaC templates, and (2) DPIaC-Eval, a deployability-centric IaC template benchmark consists of 153 real-world scenarios that can evaluate syntax, deployment, user intent, and security. Our evaluation reveals that state-of-the-art LLMs initially performed poorly, with Claude-3.5 and Claude-3.7 achieving only 30.2% and 26.8% deployment success on the first attempt respectively. However, IaCGen transforms this performance dramatically: all evaluated models reach over 90% passItr@25, with Claude-3.5 and Claude-3.7 achieving 98% success rate. Despite these improvements, critical challenges remain in user intent alignment (25.2% accuracy) and security compliance (8.4% pass rate), highlighting areas requiring continued research. Our work provides the first comprehensive assessment of deployability-centric IaC template generation and establishes a foundation for future research.
zh
[NLP-73] Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking ACL2025
【速读】: 该论文试图解决在阿尔茨海默病(Alzheimer’s Disease, AD)检测中,由于说话者性别(gender)引起的混淆因素(confounding factors)对模型性能的影响问题。其解决方案的关键在于提出两种方法:\textit{Extended Confounding Filter} 和 \textit{Dual Filter},通过隔离和消除与性别相关的权重,实现去混淆的痴呆分类器。
链接: https://arxiv.org/abs/2506.05610
作者: Zhecheng Sheng,Xiruo Ding,Brian Hur,Changye Li,Trevor Cohen,Serguei Pakhomov
机构: University of Minnesota (明尼苏达大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 20 figures. Accepted to ACL 2025 Main Conference
点击查看摘要
Abstract:Deep transformer models have been used to detect linguistic anomalies in patient transcripts for early Alzheimer’s disease (AD) screening. While pre-trained neural language models (LMs) fine-tuned on AD transcripts perform well, little research has explored the effects of the gender of the speakers represented by these transcripts. This work addresses gender confounding in dementia detection and proposes two methods: the \textitExtended Confounding Filter and the \textitDual Filter , which isolate and ablate weights associated with gender. We evaluate these methods on dementia datasets with first-person narratives from patients with cognitive impairment and healthy controls. Our results show transformer models tend to overfit to training data distributions. Disrupting gender-related weights results in a deconfounded dementia classifier, with the trade-off of slightly reduced dementia detection performance.
zh
[NLP-74] OPeRA: A Dataset of Observation Persona Rationale and Action for Evaluating LLM s on Human Online Shopping Behavior Simulation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)是否能够准确模拟特定用户的下一步网络行为的问题。现有研究虽然表明LLMs具备生成“可信”的人类行为的能力,但缺乏高质量、公开可用的数据集来评估其模仿真实用户行为的能力。为了解决这一问题,研究者提出了OPERA数据集,该数据集通过在线购物会话中收集的真实人类参与者数据,全面涵盖了用户人格特征(Persona)、浏览器观察(Observation)、细粒度网络操作(Action)以及即时自我报告的推理过程(Rationale)。OPERA是首个公开的此类数据集,其关键在于通过在线问卷和定制浏览器插件以高保真度收集多维度用户行为数据,从而为评估LLMs预测特定用户下一步操作及推理的能力提供了基准。
链接: https://arxiv.org/abs/2506.05606
作者: Ziyi Wang,Yuxuan Lu,Wenbo Li,Amirali Amini,Bo Sun,Yakov Bart,Weimin Lyu,Jiri Gesi,Tian Wang,Jing Huang,Yu Su,Upol Ehsan,Malihe Alikhani,Toby Jia-Jun Li,Lydia Chilton,Dakuo Wang
机构: Northeastern University (东北大学); University of Southern California (南加州大学); Stony Brook University (石溪大学); Independent Researcher (独立研究员); Ohio State University (俄亥俄州立大学); University of Notre Dame (圣母大学); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable’’ human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user’s next action and rationale with a given persona and observation, action, rationale history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.
zh
[NLP-75] SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLM s ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)个性化对齐的问题,即如何根据多样化的用户偏好调整模型行为。传统方法依赖于额外的身份信息,如人口统计学数据或预定义的偏好类别,而本文提出了一种无需此类信息的解决方案。其关键在于SynthesizeMe,该方法通过用户交互生成并验证解释用户偏好的推理,进而推导出合成用户人格画像,并筛选出有信息量的先验用户交互,以构建针对特定用户的个性化提示。
链接: https://arxiv.org/abs/2506.05598
作者: Michael J Ryan,Omar Shaikh,Aditri Bhagirath,Daniel Frees,William Held,Diyi Yang
机构: Stanford University (斯坦福大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Main Conference
点击查看摘要
Abstract:Recent calls for pluralistic alignment of Large Language Models (LLMs) encourage adapting models to diverse user preferences. However, most prior work on personalized reward models heavily rely on additional identity information, such as demographic details or a predefined set of preference categories. To this end, we introduce SynthesizeMe, an approach to inducing synthetic user personas from user interactions for personalized reward modeling. SynthesizeMe first generates and verifies reasoning to explain user preferences, then induces synthetic user personas from that reasoning, and finally filters to informative prior user interactions in order to build personalized prompts for a particular user. We show that using SynthesizeMe induced prompts improves personalized LLM-as-a-judge accuracy by 4.4% on Chatbot Arena. Combining SynthesizeMe derived prompts with a reward model achieves top performance on PersonalRewardBench: a new curation of user-stratified interactions with chatbots collected from 854 users of Chatbot Arena and PRISM.
zh
[NLP-76] SoK: Are Watermarks in LLM s Ready for Deployment?
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在部署过程中面临的知识产权侵权和潜在滥用问题,特别是针对模型窃取攻击(model stealing attacks)的防御需求。其解决方案的关键在于构建一个全面的水印系统化框架,包括提出一种详细的水印分类法、设计一种新型的知识产权分类器以评估水印在不同环境下的有效性与影响、分析现有水印技术的局限性,并探讨水印在实际应用中的挑战与未来方向。通过实验表明,尽管水印技术已取得一定进展,但其在实际应用中仍受限于对模型性能和下游任务的负面影响。
链接: https://arxiv.org/abs/2506.05594
作者: Kieu Dang,Phung Lai,NhatHai Phan,Yelong Shen,Ruoming Jin,Abdallah Khreishah,My Thai
机构: SUNY-Albany (纽约州立大学阿尔巴尼分校); New Jersey Institute of Technology (新泽西理工学院); Microsoft (微软); Kent State University (肯特州立大学); University of Florida (佛罗里达大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have transformed natural language processing, demonstrating impressive capabilities across diverse tasks. However, deploying these models introduces critical risks related to intellectual property violations and potential misuse, particularly as adversaries can imitate these models to steal services or generate misleading outputs. We specifically focus on model stealing attacks, as they are highly relevant to proprietary LLMs and pose a serious threat to their security, revenue, and ethical deployment. While various watermarking techniques have emerged to mitigate these risks, it remains unclear how far the community and industry have progressed in developing and deploying watermarks in LLMs. To bridge this gap, we aim to develop a comprehensive systematization for watermarks in LLMs by 1) presenting a detailed taxonomy for watermarks in LLMs, 2) proposing a novel intellectual property classifier to explore the effectiveness and impacts of watermarks on LLMs under both attack and attack-free environments, 3) analyzing the limitations of existing watermarks in LLMs, and 4) discussing practical challenges and potential future directions for watermarks in LLMs. Through extensive experiments, we show that despite promising research outcomes and significant attention from leading companies and community to deploy watermarks, these techniques have yet to reach their full potential in real-world applications due to their unfavorable impacts on model utility of LLMs and downstream tasks. Our findings provide an insightful understanding of watermarks in LLMs, highlighting the need for practical watermarks solutions tailored to LLM deployment. Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL) Cite as: arXiv:2506.05594 [cs.CR] (or arXiv:2506.05594v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2506.05594 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-77] UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting
【速读】: 该论文试图解决如何利用电子健康记录(Electronic Health Records, EHRs)回答临床问题的挑战。其解决方案的关键在于使用大型语言模型分两步处理:首先识别EHR中与临床问题相关的句子,其次基于这些句子生成简短且有引用支持的回答。为提升句子分类的准确性,采用少量示例提示(few-shot prompting)、自洽性(self-consistency)和阈值设定(thresholding),其中自洽性结合阈值设定有助于提高决策的可靠性,而实验结果表明较小的8B模型在识别相关信息方面优于更大的70B模型。
链接: https://arxiv.org/abs/2506.05589
作者: Sara Shields-Menard,Zach Reimers,Joshua Gardner,David Perry,Anthony Rios
机构: The University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Computation and Language (cs.CL)
备注: Accepted to BioNLP 2025
点击查看摘要
Abstract:We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find sentences in the EHR relevant to a clinician’s question, and second, to generate a short, citation-supported response based on those sentences. We use few-shot prompting, self-consistency, and thresholding to improve the sentence classification step to decide which sentences are essential. We compare several models and find that a smaller 8B model performs better than a larger 70B model for identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses and that self-consistency with thresholding helps make these decisions more reliable.
zh
[NLP-78] MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
【速读】: 该论文试图解决当前对表格相关任务的全面评估不足的问题,特别是在专业用户面临的复杂表格任务上,现有的基准测试过于狭窄,主要集中在如自然语言到SQL(NL-to-SQL)和表格问答(Table-QA)等任务,而忽略了更广泛的现实场景。解决方案的关键是引入MMTU,这是一个包含超过30K个问题的大型基准数据集,覆盖25个真实世界的表格任务,旨在全面评估模型在理解、推理和操作真实表格方面的专家级能力。
链接: https://arxiv.org/abs/2506.05587
作者: Junjie Xing,Yeye He,Mengyu Zhou,Haoyu Dong,Shi Han,Lingjiao Chen,Dongmei Zhang,Surajit Chaudhuri,H. V. Jagadish
机构: University of Michigan (密歇根大学); Microsoft Corporation (微软公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models ability to understand, reason, and manipulate real tables at the expert-level. These tasks are drawn from decades’ worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU require a combination of skills – including table understanding, reasoning, and coding – that remain challenging for today’s frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at this https URL and this https URL. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG) Cite as: arXiv:2506.05587 [cs.AI] (or arXiv:2506.05587v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2506.05587 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-79] Combating Misinformation in the Arab World: Challenges Opportunities
【速读】: 该论文试图解决虚假信息(misinformation)和错误信息(disinformation)在全球范围内带来的风险,特别是在阿拉伯地区由于地缘政治不稳定、语言多样性及文化差异所带来的独特脆弱性。其解决方案的关键在于通过信息检测、追踪、缓解以及社区参与等核心方面,构建更加韧性的信息生态系统,具体包括与基层事实核查组织建立联系、理解文化规范、促进社会纠正行为以及创建强大的协作信息网络。
链接: https://arxiv.org/abs/2506.05582
作者: Azza Abouzied,Firoj Alam,Raian Ali,Paolo Papotti
机构: New York University Abu Dhabi (纽约大学阿布扎比分校); Qatar Computing Research Institute (卡塔尔计算研究研究所); College of Science and Engineering, Hamad Bin Khalifa University (哈马德本哈利法大学科学与工程学院); EURECOM (欧洲电信研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: disinformation, misinformation, factuality, harmfulness, fake news
点击查看摘要
Abstract:Misinformation and disinformation pose significant risks globally, with the Arab region facing unique vulnerabilities due to geopolitical instabilities, linguistic diversity, and cultural nuances. We explore these challenges through the key facets of combating misinformation: detection, tracking, mitigation and community-engagement. We shed light on how connecting with grass-roots fact-checking organizations, understanding cultural norms, promoting social correction, and creating strong collaborative information networks can create opportunities for a more resilient information ecosystem in the Arab world.
zh
[NLP-80] When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration
【速读】: 该论文试图解决的问题是:随着人工智能推理能力的提升,这些改进是否能够带来更好的知识迁移能力,即模型能否以人类可理解、应用和学习的方式进行推理沟通。为了解决这一问题,研究者提出了知识集成与迁移评估(Knowledge Integration and Transfer Evaluation, KITE)框架,这是一个概念性和实验性的方法,用于评估人机之间的知识迁移能力,并进行了首次大规模的人类实验(N=118)来直接测量这一能力。该解决方案的关键在于通过两阶段设置,让人类先与AI共同构思解决问题的策略,再独立实施解决方案,从而隔离模型解释对人类理解的影响,进而分析知识迁移的有效性及其影响因素。
链接: https://arxiv.org/abs/2506.05579
作者: Quan Shi,Carlos E. Jimenez,Shunyu Yao,Nick Haber,Diyi Yang,Karthik Narasimhan
机构: Princeton Language and Intelligence (普林斯顿语言与智能实验室); Stanford University (斯坦福大学); OpenAI(开放人工智能)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: For code, data, visualizer, visit: https:kite-live.vercel.app
点击查看摘要
Abstract:Recent advancements in AI reasoning have driven substantial improvements across diverse tasks. A critical open question is whether these improvements also yields better knowledge transfer: the ability of models to communicate reasoning in ways humans can understand, apply, and learn from. To investigate this, we introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities and conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations’ influence on human understanding. Our findings reveal that although model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent, featuring significant outliers, indicating that knowledge transfer requires dedicated optimization. Our analysis identifies behavioral and strategic factors mediating successful knowledge transfer. We release our code, dataset, and evaluation framework to support future work on communicatively aligned models.
zh
[NLP-81] Improving LLM s with a knowledge from databases
【速读】: 该论文试图解决如何通过可解释的机器学习方法(如增强关联规则)提升基于数据集/数据库的回答质量,同时确保安全性的问题。其解决方案的关键在于生成一个基于预定义知识模式的规则集,并通过规则到文本的转换器将规则转化为文本形式,进而作为检索增强生成(RAG)技术的一部分集成到大型语言模型(LLMs)中,从而显著提高问答性能。
链接: https://arxiv.org/abs/2506.05560
作者: Petr Máša
机构: Prague University of Economics and Business (布拉格经济大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are achieving significant progress almost every moment now. Many advanced techniques have been introduced and widely accepted, like retrieval-augmentation generation (RAG), agents, and tools. Tools can query the database to answer questions from structured data files or perform groupings or other statistics. This unlocks huge opportunities, such as it can answer any question, but also poses threats, such as safety, because there is no control over the commands that are created. We would like to discuss whether we can create a new method that improves answers based on dataset/database via some interpretable ML methods, namely enhanced association rules. The advantage would be if the method can be also used in some safe technique like RAG. Association rules have a sound history. Since the introduction of CN2 and aproiri, many enhancements have been made. In parallel, enhanced association rules have been introduced and evolved over the last 40 years. The general problem is typically that there are too many rules. There are some techniques for handling it, but when LLM emerged, it turned out to be the best use case for the RAG technique for LLMs. We proposed a method that generates a ruleset based on defined knowledge patterns, then converts rules into text form via a rule-to-text converter, and includes the result as an RAG into LLM. We compared this method with ChatGPT (even with using agents) and we have discovered a significant improvement in answering questions based on the dataset. We have also tried several strategies how much rules to generate. We found this improvement interesting. Moreover, it can also be improved in many ways as future work, like incorporating other patterns, the use of rule mining as an agent, and many others.
zh
[NLP-82] MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
【速读】: 该论文旨在解决当前多模态推理基准在三个关键维度上的不足:过度依赖静态图像、仅关注数学问题解决而忽视更广泛的推理能力,以及基准迅速饱和导致难以诊断模型失败模式或衡量持续进步。其解决方案的关键在于提出MORSE-500(Multimodal Reasoning Stress-test Environment),这是一个由500个完全脚本化的视频片段组成的视频基准,嵌入了涵盖六个互补推理类别的问题。通过使用确定性Python脚本(如Manim、Matplotlib、MoviePy)、生成式视频模型和精选真实画面进行程序化生成,该设计实现了对视觉复杂度、干扰物密度和时间动态的精细控制,从而支持难度的系统性扩展。此外,MORSE-500具备可进化性,其可控生成流程能够创建任意挑战性的新实例,适用于下一代模型的压力测试。
链接: https://arxiv.org/abs/2506.05523
作者: Zikui Cai,Andrew Wang,Anirudh Satheesh,Ankit Nakhawa,Hyunwoo Jae,Keenan Powell,Minghui Liu,Neel Jay,Sungbin Oh,Xiyao Wang,Yongyuan Liang,Tom Goldstein,Furong Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills – including abstract, physical, planning, spatial, and temporal capabilities – required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics – enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems – including various Gemini 2.5 Pro and OpenAI o3 which represent the strongest available at the time, alongside strong open-source models – reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
zh
[NLP-83] Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering
【速读】: 该论文试图解决儿童特定语言障碍(Specific Language Impairment, SLI)的早期识别与诊断问题,传统诊断方法可能无法捕捉到语言发展的细微模式。其解决方案的关键在于利用无监督机器学习技术,如主成分分析(PCA)和聚类分析,对大量语言样本进行分析,以揭示自然语言发展轨迹并区分语言特征谱型,从而为制定更精准的干预策略提供依据。
链接: https://arxiv.org/abs/2506.05498
作者: Niruthiha Selvanayagam
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 3 figures, 16 tables
点击查看摘要
Abstract:Specific Language Impairment (SLI) affects approximately 7 percent of children, presenting as isolated language deficits despite normal cognitive abilities, sensory systems, and supportive environments. Traditional diagnostic approaches often rely on standardized assessments, which may overlook subtle developmental patterns. This study aims to identify natural language development trajectories in children with and without SLI using unsupervised machine learning techniques, providing insights for early identification and targeted interventions. Narrative samples from 1,163 children aged 4-16 years across three corpora (Conti-Ramsden 4, ENNI, and Gillam) were analyzed using Principal Component Analysis (PCA) and clustering. A total of 64 linguistic features were evaluated to uncover developmental trajectories and distinguish linguistic profiles. Two primary clusters emerged: (1) high language production with low SLI prevalence, and (2) limited production but higher syntactic complexity with higher SLI prevalence. Additionally, boundary cases exhibited intermediate traits, supporting a continuum model of language abilities. Findings suggest SLI manifests primarily through reduced production capacity rather than syntactic complexity deficits. The results challenge categorical diagnostic frameworks and highlight the potential of unsupervised learning techniques for refining diagnostic criteria and intervention strategies.
zh
[NLP-84] MLLM -CL: Continual Learning for Multimodal Large Language Models
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态现实场景中持续整合新知识和技能时面临的挑战。现有基准和方法存在关键局限性,难以有效支持模型的持续学习(Continual Learning, CL)。论文提出的解决方案关键在于通过参数隔离防止灾难性遗忘,并引入基于MLLM的路由机制,从而实现领域和能力上的持续学习,有效整合特定领域知识和功能能力,同时显著减少遗忘现象。
链接: https://arxiv.org/abs/2506.05453
作者: Hongbo Zhao,Fei Zhu,Rundong Wang,Gaofeng Meng,Zhaoxiang Zhang
机构: Institute of Automation, Chinese Academy of Sciences (CASIA); Centre for Artificial Intelligence and Robotics, HKISI, CAS; University of Chinese Academy of Sciences (UCAS); University of Hong Kong
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent Multimodal Large Language Models (MLLMs) excel in vision-language understanding but face challenges in adapting to dynamic real-world scenarios that require continuous integration of new knowledge and skills. While continual learning (CL) offers a potential solution, existing benchmarks and methods suffer from critical limitations. In this paper, we introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning, where the former focuses on independently and identically distributed (IID) evaluation across evolving mainstream domains, whereas the latter evaluates on non-IID scenarios with emerging model ability. Methodologically, we propose preventing catastrophic interference through parameter isolation, along with an MLLM-based routing mechanism. Extensive experiments demonstrate that our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.
zh
[NLP-85] Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用中出现的不安全行为问题,以及如何通过解释技术来理解和缓解这些行为。其解决方案的关键在于提出一个统一的框架,该框架将面向安全的解释方法、由此指导的安全增强措施以及实现这些措施的工具进行连接,并基于LLM的工作流程阶段构建了一个新的分类体系,以总结近70篇相关工作。
链接: https://arxiv.org/abs/2506.05451
作者: Seongmin Lee,Aeree Cho,Grace C. Kim,ShengYun Peng,Mansi Phute,Duen Horng Chau
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 31 pages, 1 figure
点击查看摘要
Abstract:As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.
zh
[NLP-86] BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models
【速读】: 该论文试图解决现有视觉语言模型(Visual Language Models, VLMs)评估基准存在的问题,包括高标注成本、信息泄露风险以及无法明确区分模型失败是由于视觉感知、推理能力还是通用知识不足所致。其解决方案的关键在于借鉴眼科诊断方法,通过程序化生成合成图像,实现对视觉属性的精确控制,从而系统性地揭示VLM在视觉感知方面的缺陷。具体而言,构建了一系列内容逐渐复杂的图像集,在保持其他视觉参数不变的情况下,逐步增加感兴趣内容的难度(如计数任务中的物体数量),以实现细粒度的故障分析和针对性评估。
链接: https://arxiv.org/abs/2506.05440
作者: Ludovic Arnould,Salim Khazem,Hugues Ali Mehenni
机构: Talan(塔兰)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at this https URL.
zh
[NLP-87] LLM s Can Compensate for Deficiencies in Visual Representations
【速读】: 该论文试图解决基于CLIP的视觉编码器在多模态任务中可能存在视觉特征较弱的问题,以及如何有效利用语言解码器来弥补这一缺陷。解决方案的关键在于通过控制自注意力机制的消融实验,验证语言解码器能够在视觉表示上下文信息不足的情况下,补偿视觉特征的不足并恢复模型性能,从而揭示了视觉语言模型(VLM)中视觉与语言模块之间的动态分工。
链接: https://arxiv.org/abs/2506.05439
作者: Sho Takishita,Jay Gala,Abdelrahman Mohamed,Kentaro Inui,Yova Kementchedjhieva
机构: Fujitsu Limited(富士通有限公司); MBZUAI; Tohoku University(东北大学); RIKEN(理化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Many vision-language models (VLMs) that prove very effective at a range of multimodal task, build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.
zh
[NLP-88] Coordinated Robustness Evaluation Framework for Vision-Language Models CVPR
【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在面对小扰动时表现出的鲁棒性不足问题,尤其是在实际部署场景中。其解决方案的关键在于训练一个通用的替代模型,该模型能够同时处理图像和文本输入,并生成联合表示,进而用于生成针对文本和图像模态的对抗性扰动。这种协同攻击策略通过在视觉问答和视觉推理数据集上进行评估,验证了其在破坏多模态模型鲁棒性方面的有效性。
链接: https://arxiv.org/abs/2506.05429
作者: Ashwin Ramesh Babu,Sajad Mousavi,Vineet Gundecha,Sahand Ghorbanpour,Avisek Naug,Antonio Guillen,Ricardo Luna Gutierrez,Soumyendu Sarkar
机构: Hewlett Packard Enterprise (Hewlett Packard Labs)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2025
点击查看摘要
Abstract:Vision-language models, which integrate computer vision and natural language processing capabilities, have demonstrated significant advancements in tasks such as image captioning and visual question and answering. However, similar to traditional models, they are susceptible to small perturbations, posing a challenge to their robustness, particularly in deployment scenarios. Evaluating the robustness of these models requires perturbations in both the vision and language modalities to learn their inter-modal dependencies. In this work, we train a generic surrogate model that can take both image and text as input and generate joint representation which is further used to generate adversarial perturbations for both the text and image modalities. This coordinated attack strategy is evaluated on the visual question and answering and visual reasoning datasets using various state-of-the-art vision-language models. Our results indicate that the proposed strategy outperforms other multi-modal attacks and single-modality attacks from the recent literature. Our results demonstrate their effectiveness in compromising the robustness of several state-of-the-art pre-trained multi-modal models such as instruct-BLIP, ViLT and others.
zh
[NLP-89] Automatically Detecting Amusing Games in Wordle
【速读】: 该论文试图解决如何自动预测Reddit用户认为有趣的Wordle游戏的问题,其核心在于利用生成式AI (Generative AI) 对用户反应进行分类,并提取能够预测用户幽默感的游戏特征。解决方案的关键在于使用OpenAI的GPT-3.5通过少量示例提示(few-shot prompting)对用户反应进行分类,并验证其标签与人工标签的一致性,随后提取游戏特征以建立预测模型。
链接: https://arxiv.org/abs/2506.05415
作者: Ronaldo Luo,Gary Liang,Cindy Liu,Adam Kabbara,Minahil Bakhtawar,Kina Kim,Michael Guerzhoy
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to the Intenational Conference on Computational Creeativity (ICCC) 2025
点击查看摘要
Abstract:We explore automatically predicting which Wordle games Reddit users find amusing. We scrape approximately 80k reactions by Reddit users to Wordle games from Reddit, classify the reactions as expressing amusement or not using OpenAI’s GPT-3.5 using few-shot prompting, and verify that GPT-3.5’s labels roughly correspond to human labels. We then extract features from Wordle games that can predict user amusement. We demonstrate that the features indeed provide a (weak) signal that predicts user amusement as predicted by GPT-3.5. Our results indicate that user amusement at Wordle games can be predicted computationally to some extent. We explore which features of the game contribute to user amusement. We find that user amusement is predictable, indicating a measurable aspect of creativity infused into Wordle games through humor. Comments: Accepted to the Intenational Conference on Computational Creeativity (ICCC) 2025 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2506.05415 [cs.CL] (or arXiv:2506.05415v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.05415 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Michael Guerzhoy [view email] [v1] Wed, 4 Jun 2025 20:17:53 UTC (2,155 KB) Full-text links: Access Paper: View a PDF of the paper titled Automatically Detecting Amusing Games in Wordle, by Ronaldo Luo and 6 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CL prev | next new | recent | 2025-06 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh
[NLP-90] SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLM s
【速读】: 该论文试图解决4-bit量化在大型语言模型(Large Language Models, LLMs)中因激活值异常值(activation outliers)导致的量化精度下降问题。解决方案的关键在于提出SmoothRot技术,通过将通道级缩放与Hadamard变换相结合,有效将极端异常值转换为适合量化的激活值,从而显著提升量化精度。
链接: https://arxiv.org/abs/2506.05413
作者: Patrik Czakó,Gábor Kertész,Sándor Szénási
机构: Obuda University(布达佩斯考文纽斯大学); John von Neumann Faculty of Informatics(约翰·冯·诺伊曼信息学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 3 figures, 5 tables. Submitted to the IEEE SMC 2025 conference
点击查看摘要
Abstract:We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers, by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at this https URL.
zh
[NLP-91] Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
【速读】: 该论文试图解决视觉语言模型(Vision Language Models, VLMs)在心理理论中的一个关键能力——注视参照推理(gaze-referential inference)的缺失问题,即模型无法准确推断他人所关注的目标。研究通过控制实验评估了111个VLMs在不同难度和变异性图像下的表现,并与人类参与者(N = 65)进行对比,发现大多数VLMs的表现接近随机猜测,而人类则表现出接近满分的准确性。解决方案的关键在于识别出顶级VLMs虽无法完全掌握注视推理能力,但其表现受任务难度影响,却对感知变化具有鲁棒性,表明它们可能结合了启发式策略与随机猜测,而非单纯的随机行为。这一发现揭示了VLMs在自然人机交互中仍存在显著局限,但其潜在能力值得进一步探索。
链接: https://arxiv.org/abs/2506.05412
作者: Zory Zhang,Pinyuan Feng,Bingyang Wang,Tianwei Zhao,Suyang Yu,Qingying Gao,Hokin Deng,Ziqiao Ma,Yijiang Li,Dezhi Luo
机构: Brown University (布朗大学); Columbia University (哥伦比亚大学); Emory University (埃默里大学); Johns Hopkins University (约翰霍普金斯大学); University of Washington (华盛顿大学); University of Michigan (密歇根大学); Carnegie Mellon University (卡内基梅隆大学); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Preprint under review. Project page at this https URL
点击查看摘要
Abstract:Gaze-referential inference–the ability to infer what others are looking at–is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, comparing performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by considering them as random guessers. Instead, they likely use a combination of heuristics and guessing such that their performance is subject to the task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.
zh
[NLP-92] Homogeneous Keys Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLM s
【速读】: 该论文试图解决大型语言模型中由于注意力机制的二次复杂性导致的长上下文建模效率问题,特别是针对KV缓存压缩方法中存在的关键局限性。其解决方案的关键在于提出一种无需训练的压缩框架(AsymKV),该框架结合了基于局部同质性的键合并与数学上证明的无损值压缩,从而有效利用了键和值之间的不对称特性,显著提升了长上下文任务的性能。
链接: https://arxiv.org/abs/2506.05410
作者: Wanyun Cui,Mingwei Xu
机构: Shanghai University of Finance and Economics (上海财经大学)
类目: Computation and Language (cs.CL)
备注: 14 pages,7 figures
点击查看摘要
Abstract:Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights (local homogeneity), adjacent values demonstrate distinct heterogeneous distributions. This key-value asymmetry reveals a critical limitation in existing compression methods that treat keys and values uniformly. To address the limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H _2 O (38.89) by a large margin.
zh
[NLP-93] Auto Review: Second Stage Error Detection for Highly Accurate Information Extraction from Phone Conversations ACL
【速读】: 该论文试图解决医疗保健中自动化福利验证电话通话的准确性问题,特别是在电话转录文本存在噪声的情况下,如何高效且准确地提取关键信息。解决方案的关键在于提出一种第二阶段的后处理流水线,通过利用多个自动语音识别(ASR)替代方案和无需人工校正转录文本的伪标注方法,提升转录文本的质量,从而提高Auto Review系统的效率和准确性。
链接: https://arxiv.org/abs/2506.05400
作者: Ayesha Qamar,Arushi Raghuvanshi,Conal Sathi,Youngseo Son
机构: Infinitus Systems, Inc. (无限系统公司)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL Industry track 2025
点击查看摘要
Abstract:Automating benefit verification phone calls saves time in healthcare and helps patients receive treatment faster. It is critical to obtain highly accurate information in these phone calls, as it can affect a patient’s healthcare journey. Given the noise in phone call transcripts, we have a two-stage system that involves a post-call review phase for potentially noisy fields, where human reviewers manually verify the extracted data \unicodex2013 a labor-intensive task. To automate this stage, we introduce Auto Review, which significantly reduces manual effort while maintaining a high bar for accuracy. This system, being highly reliant on call transcripts, suffers a performance bottleneck due to automatic speech recognition (ASR) issues. This problem is further exacerbated by the use of domain-specific jargon in the calls. In this work, we propose a second-stage postprocessing pipeline for accurate information extraction. We improve accuracy by using multiple ASR alternatives and a pseudo-labeling approach that does not require manually corrected transcripts. Experiments with general-purpose large language models and feature-based model pipelines demonstrate substantial improvements in the quality of corrected call transcripts, thereby enhancing the efficiency of Auto Review.
zh
[NLP-94] Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation
【速读】: 该论文试图解决多语言环境下图像描述生成(image captioning)中的挑战,特别是关注基于注意力机制的Transformer模型在这一任务中的应用与局限性。其解决方案的关键在于通过系统性地回顾和分类基于Transformer、深度学习以及混合方法的图像描述生成模型,结合基准数据集与评估指标(如BLEU、METEOR、CIDEr和ROUGE)分析现有技术的性能,并揭示当前模型在语义一致性、非英语语言数据稀缺性及推理能力方面的不足。
链接: https://arxiv.org/abs/2506.05399
作者: Israa A. Albadarneh,Bassam H. Hammo,Omar S. Al-Kadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 31 pages, 15 figures, 6 tables
点击查看摘要
Abstract:Image captioning involves generating textual descriptions from input images, bridging the gap between computer vision and natural language processing. Recent advancements in transformer-based models have significantly improved caption generation by leveraging attention mechanisms for better scene understanding. While various surveys have explored deep learning-based approaches for image captioning, few have comprehensively analyzed attention-based transformer models across multiple languages. This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. Additionally, this paper identifies key limitations in current models, including semantic inconsistencies, data scarcity in non-English languages, and limitations in reasoning ability. Finally, we outline future research directions, such as multimodal learning, real-time applications in AI-powered assistants, healthcare, and forensic analysis. This survey serves as a comprehensive reference for researchers aiming to advance the field of attention-based image captioning.
zh
[NLP-95] Are Large Language Models Good Temporal Graph Learners?
【速读】: 该论文试图解决将大型语言模型(Large Language Models, LLMs)应用于动态图(temporal graphs)中的链接预测问题,尤其是针对现实世界中演化的网络结构,这一领域仍处于探索阶段。解决方案的关键在于提出一种名为Temporal Graph Talker (TGTalker) 的新颖动态图学习框架,该框架利用动态图中的时间邻近性(recency bias)提取相关结构信息,并将其转换为自然语言输入给LLMs,同时结合时间邻居作为额外信息进行预测。
链接: https://arxiv.org/abs/2506.05393
作者: Shenyang Huang,Ali Parviz,Emma Kondrup,Zachary Yang,Zifeng Ding,Michael Bronstein,Reihaneh Rabbany,Guillaume Rabusseau
机构: Mila - Quebec AI Institute; School of Computer Science, McGill University; University of Cambridge; DIRO, Université de Montréal; CIFAR AI Chair; New Jersey Institute of Technology; University of Oxford; AITHYRA
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 9 tables, 4 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs – real world evolving networks – remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at this https URL.
zh
[NLP-96] Understanding Gender Bias in AI-Generated Product Descriptions
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在电子商务场景中可能存在的性别偏见问题,特别是AI生成的产品描述中所体现的新型算法偏见和危害。其解决方案的关键在于构建数据驱动的性别偏见分类体系,以识别和分析产品描述生成过程中特有的性别偏见形式,如对服装尺码的假设、产品特征宣传中的刻板印象以及劝说性语言使用的差异,并通过量化分析验证这些偏见在实际应用中的普遍性。这一方法有助于深入理解当前AI危害框架中所定义的排除性规范、刻板印象和性能差异等类型问题。
链接: https://arxiv.org/abs/2506.05390
作者: Markelle Kelly,Mohammad Tahaei,Padhraic Smyth,Lauren Wilcox
机构: University of California, Irvine (加州大学欧文分校); eBay (eBay); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to FAccT 2025
点击查看摘要
Abstract:While gender bias in large language models (LLMs) has been extensively studied in many domains, uses of LLMs in e-commerce remain largely unexamined and may reveal novel forms of algorithmic bias and harm. Our work investigates this space, developing data-driven taxonomic categories of gender bias in the context of product description generation, which we situate with respect to existing general purpose harms taxonomies. We illustrate how AI-generated product descriptions can uniquely surface gender biases in ways that require specialized detection and mitigation approaches. Further, we quantitatively analyze issues corresponding to our taxonomic categories in two models used for this task – GPT-3.5 and an e-commerce-specific LLM – demonstrating that these forms of bias commonly occur in practice. Our results illuminate unique, under-explored dimensions of gender bias, such as assumptions about clothing size, stereotypical bias in which features of a product are advertised, and differences in the use of persuasive language. These insights contribute to our understanding of three types of AI harms identified by current frameworks: exclusionary norms, stereotyping, and performance disparities, particularly for the context of e-commerce.
zh
[NLP-97] az2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades ACL2025
【速读】: 该论文旨在解决德语大规模公开语料库稀缺的问题,从而促进自然语言处理(NLP)和计算社会科学(CSS)的研究,特别是在语言趋势和社会问题如性别偏见方面的研究。其解决方案的关键在于构建了taz2024full,这是目前最大的公开德语报纸文章语料库,包含来自taz的180多万篇文本,时间跨度从1980年至2024年,并通过可扩展的结构化分析流程,为研究德国新闻文本中的角色提及、情感和语言框架提供了基础。
链接: https://arxiv.org/abs/2506.05388
作者: Stefanie Urchs,Veronika Thurner,Matthias Aßenmacher,Christian Heumann,Stephanie Thiemichen
机构: Hochschule München University of Applied Sciences (慕尼黑应用科学大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML), LMU Munich (慕尼黑机器学习中心(MCML),慕尼黑路德维希-马克西米利安大学)
类目: Computation and Language (cs.CL)
备注: Accepted @ “63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)” as a findings paper. This is the author’s version of the work. The definitive version of record will be published in the proceedings
点击查看摘要
Abstract:Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus’s utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP. Comments: Accepted @ “63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)” as a findings paper. This is the author’s version of the work. The definitive version of record will be published in the proceedings Subjects: Computation and Language (cs.CL) Cite as: arXiv:2506.05388 [cs.CL] (or arXiv:2506.05388v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.05388 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-98] Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLM s
【速读】: 该论文旨在解决传统解码方法在文本生成中难以平衡流畅性、多样性和连贯性的问题。其提出的解决方案是自适应语义感知典型性采样(Adaptive Semantic-Aware Typicality Sampling, ASTS),其关键在于引入动态熵阈值、多目标评分机制以及奖励-惩罚调整,从而在保持计算效率的同时实现上下文连贯且多样化的文本生成。
链接: https://arxiv.org/abs/2506.05387
作者: Jaydip Sen,Saptarshi Sengupta. Subhasis Dasgupta
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This is the accepted but pre-reviewed version of the chapter that has been accepted for publication in the Springer volume ‘Decision-Making in Computational Intelligence-Based Systems,’ edited by Witold Pedrycz, Gilberto Rivera, Rose Ma Rodriguez, and Salvador Ibarra Martinez. The chapter is 39 pages long, and it contains 2 figures and 6 tables. This is NOT the final camera-ready version
点击查看摘要
Abstract:This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.
zh
[NLP-99] Beyond RAG : Reinforced Reasoning Augmented Generation for Clinical Notes
【速读】: 该论文旨在解决从有限患者信息中生成长格式临床摘要(如出院指导)的挑战,尤其是在临床文本生成任务中,现有基于大语言模型(Large Language Model, LLM)的方法表现不足。其解决方案的关键在于提出R2AG,这是一个基于预入院数据的强化学习检索器,通过从医学知识图谱中检索推理路径,为LLM提供显式的语义指导,从而提升生成质量。此外,论文还引入了基于组的检索优化(Group-Based Retriever Optimization, GRO),通过组相对奖励机制提高检索质量,促进LLM进行更深层次的推理。
链接: https://arxiv.org/abs/2506.05386
作者: Lo Pang-Yun Ting,Chengshuai Zhao,Yu-Hua Zeng,Yuan Jee Lim,Kun-Ta Chuang
机构: National Cheng Kung University (国立成功大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Clinical note generation aims to automatically produce free-text summaries of a patient’s condition and diagnostic process, with discharge instructions being a representative long-form example. While recent large language model (LLM)-based methods pre-trained on general clinical corpora show promise in clinical text generation, they fall short in producing long-form notes from limited patient information. In this paper, we propose R2AG, the first reinforced retriever for long-form discharge instruction generation based on pre-admission data. R2AG is trained with reinforcement learning to retrieve reasoning paths from a medical knowledge graph, providing explicit semantic guidance to the LLM. To bridge the information gap, we propose Group-Based Retriever Optimization (GRO) which improves retrieval quality with group-relative rewards, encouraging reasoning leaps for deeper inference by the LLM. Comprehensive experiments on the MIMIC-IV-Note dataset show that R2AG outperforms baselines in both clinical efficacy and natural language generation metrics. Further analysis reveals that R2AG fills semantic gaps in sparse input scenarios, and retrieved reasoning paths help LLMs avoid clinical misinterpretation by focusing on key evidence and following coherent reasoning.
zh
[NLP-100] LLM s Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models
【速读】: 该论文试图解决生成式 AI (Generative AI) 在语义角色标注 (Semantic Role Labeling, SRL) 任务中性能落后于编码器-解码器模型(如 BERT 类模型)的问题。其解决方案的关键在于为大型语言模型(LLMs)引入两种机制:(a)检索增强生成,使模型能够利用外部语言知识,如谓词和论元结构描述;(b)自我校正,使模型能够识别并修正不一致的 SRL 输出。
链接: https://arxiv.org/abs/2506.05385
作者: Xinxin Li,Huiyao Chen,Chengjun Liu,Jing Li,Meishan Zhang,Jun Yu,Min Zhang
机构: Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Computation and Language (cs.CL)
备注: 19 pages, 3 figures, 10 tables
点击查看摘要
Abstract:Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.
zh
[NLP-101] EvidenceOutcomes: a Dataset of Clinical Trial Publications with Clinically Meaningful Outcomes
【速读】: 该论文试图解决在循证医学中的证据提取与综合过程中,结果(Outcome)这一复杂元素常被忽视或过度简化的难题。解决方案的关键在于构建一个名为EvidenceOutcomes的新型、大规模、高质量标注语料库,该语料库通过与临床医生和自然语言处理专家的多次迭代讨论,制定了严谨的标注指南,并由三位独立标注者对500篇PubMed摘要及140篇来自EBM-NLP语料库的摘要进行标注,最终实现了0.76的组间一致性评分,为后续机器学习算法开发提供了可靠的基准。
链接: https://arxiv.org/abs/2506.05380
作者: Yiliang Zhou,Abigail M. Newbury,Gongbo Zhang,Betina Ross Idnay,Hao Liu,Chunhua Weng,Yifan Peng
机构: Weill Cornell Medicine(威尔康奈尔医学院); Columbia University(哥伦比亚大学); Montclair State University(蒙特克莱尔州立大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The fundamental process of evidence extraction and synthesis in evidence-based medicine involves extracting PICO (Population, Intervention, Comparison, and Outcome) elements from biomedical literature. However, Outcomes, being the most complex elements, are often neglected or oversimplified in existing benchmarks. To address this issue, we present EvidenceOutcomes, a novel, large, annotated corpus of clinically meaningful outcomes extracted from biomedical literature. We first developed a robust annotation guideline for extracting clinically meaningful outcomes from text through iteration and discussion with clinicians and Natural Language Processing experts. Then, three independent annotators annotated the Results and Conclusions sections of a randomly selected sample of 500 PubMed abstracts and 140 PubMed abstracts from the existing EBM-NLP corpus. This resulted in EvidenceOutcomes with high-quality annotations of an inter-rater agreement of 0.76. Additionally, our fine-tuned PubMedBERT model, applied to these 500 PubMed abstracts, achieved an F1-score of 0.69 at the entity level and 0.76 at the token level on the subset of 140 PubMed abstracts from the EBM-NLP corpus. EvidenceOutcomes can serve as a shared benchmark to develop and test future machine learning algorithms to extract clinically meaningful outcomes from biomedical abstracts.
zh
[NLP-102] CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition
【速读】: 该论文旨在解决语音情感识别(Speech Emotion Recognition, SER)系统中存在的偏见问题,这种偏见通常源于说话人特征与情感标签之间的虚假相关性,导致不同人口统计群体间的预测不公平。其解决方案的关键在于提出CO-VADA方法,该方法通过基于置信度的语音增强去偏技术,在不修改模型结构或依赖人口统计信息的情况下,识别训练数据中反映偏见模式的样本,并利用语音转换技术改变无关属性以生成新的样本,从而引入与数据中主导模式不同的说话人变化,引导模型更关注情感相关特征。
链接: https://arxiv.org/abs/2506.06071
作者: Yun-Shao Tsai,Yi-Cheng Lin,Huang-Cheng Chou,Hung-yi Lee
机构: National Taiwan University (国立台湾大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 8 pages
点击查看摘要
Abstract:Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
zh
[NLP-103] Audio-Aware Large Language Models as Judges for Speaking Styles
【速读】: 该论文试图解决如何有效评估语音风格(speech style)在语音语言模型(SLMs)生成的对话中的表现问题。解决方案的关键在于利用音频感知的大语言模型(ALLM)作为自动评判者,通过其对情感、音量、语速、词汇强调、音高控制及非语言元素等 speaking style 的理解能力,对 SLMs 生成的语音内容进行客观评价。研究验证了 ALLM 作为评判工具的可行性,并揭示了当前 SLMs 在语音风格控制和自然对话生成方面仍存在改进空间。
链接: https://arxiv.org/abs/2506.05984
作者: Cheng-Han Chiang,Xiaofei Wang,Chung-Ching Lin,Kevin Lin,Linjie Li,Radu Kopetz,Yao Qian,Zhendong Wang,Zhengyuan Yang,Hung-yi Lee,Lijuan Wang
机构: Microsoft(微软)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as an automatic judge to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by SLMs on two tasks: voice style instruction following and role-playing. The speaking style we consider includes emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs’ responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as a judge to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
zh
[NLP-104] Low-Resource Domain Adaptation for Speech LLM s via Text-Only Fine-Tuning
【速读】: 该论文试图解决在低资源环境下,将语音大语言模型(Speech LLM)适应到新领域时面临的挑战,尤其是在缺乏配对语音-文本数据的情况下。其解决方案的关键在于提出一种仅使用目标领域文本进行微调的策略,无需额外音频数据,并通过在微调过程中引入实时评估机制来保持语音-文本对齐,从而实现有效的领域适应同时维持源领域的性能。
链接: https://arxiv.org/abs/2506.05671
作者: Yangui Fang,Jing Peng,Xu Li,Yu Xi,Chengwei Zhang,Guohui Zhong,Kai Yu
机构: Huazhong University of Science and Technology, School of Electronic Information and Communications (华中科技大学电子信息与通信学院); MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University (教育部人工智能重点实验室,人工智能研究院,X-LANCE实验室,上海交通大学); Jiangsu Key Lab of Language Computing (江苏省语言计算重点实验室); AISpeech Co., Ltd. (AISpeech公司)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.
zh
计算机视觉
[CV-0] rraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
【速读】:该论文旨在解决地球观测(Earth Observation, EO)中深度学习模型在训练数据的规模、地理覆盖范围和光谱多样性方面的局限性,从而提升模型学习全球可迁移表征的能力。其解决方案的关键在于提出TerraFM,一个可扩展的自监督学习模型,通过利用全球分布的Sentinel-1和Sentinel-2遥感影像、大尺度空间瓦片以及土地覆盖感知采样,增强空间和语义覆盖范围;同时,通过将传感模态作为自然增强手段,结合模态特定的块嵌入和自适应跨注意力融合,统一雷达与光学输入,并采用局部-全局对比学习策略及双中心化机制来处理类别频率不平衡问题,从而实现对分类和分割任务的强泛化能力。
链接: https://arxiv.org/abs/2506.06281
作者: Muhammad Sohail Danish,Muhammad Akhtar Munir,Syed Roshaan Ali Shah,Muhammad Haris Khan,Rao Muhammad Anwer,Jorma Laaksonen,Fahad Shahbaz Khan,Salman Khan
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University College London (伦敦大学学院); Aalto University (阿尔托大学); Linköping University, Sweden (瑞典林雪平大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land this http URL achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at: this https URL .
zh
[CV-1] CoMemo: LVLMs Need Image Context with Image Memory ICML2025
【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多模态处理中的两个关键问题:首先,LVLMs在注意力分配上表现出双峰分布,导致随着上下文扩展,中间视觉内容逐渐被忽略;其次,传统的位置编码方案在处理动态高分辨率图像时无法保持重要的二维结构关系。解决方案的关键在于提出CoMemo——一种结合上下文图像路径与图像记忆路径的双路径架构,有效缓解视觉信息被忽视的问题;同时引入RoPE-DHR,一种基于缩略图的位置聚合机制,以维持二维空间感知并减轻长序列中的远程衰减问题。
链接: https://arxiv.org/abs/2506.06279
作者: Shi Liu,Weijie Su,Xizhou Zhu,Wenhai Wang,Jifeng Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025
点击查看摘要
Abstract:Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo - a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks,including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo’s superior performance compared to conventional LVLM architectures. Project page is available at this https URL.
zh
[CV-2] ExAct: A Video-Language Benchmark for Expert Action Analysis
【速读】:该论文试图解决视频-语言模型(Video-Language Models, VLMs)在理解人类专业物理活动方面存在的能力不足问题。解决方案的关键在于构建了一个名为ExAct的新型视频-语言基准数据集,该数据集包含3521对经过专家筛选的视频问答对,覆盖6个领域中的11种物理活动。ExAct要求从五个精心设计的候选答案中选择正确答案,从而需要对人类身体技能进行细致入微、专家级别的理解。这一基准数据集旨在推动VLMs在多种物理和程序性领域中实现对人类技能的精确理解。
链接: https://arxiv.org/abs/2506.06277
作者: Han Yi,Yulu Pan,Feihong He,Xinyu Liu,Benjamin Zhang,Oluwatumininu Oguntola,Gedas Bertasius
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at this https URL
zh
[CV-3] STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
【速读】:该论文旨在解决高分辨率图像生成中模型表达能力与计算效率之间的平衡问题,以及传统归一化流(Normalizing Flow)在大规模数据上的应用局限性。其解决方案的关键在于提出Transformer Autoregressive Flow (TARFlow),该方法结合了归一化流的精确概率建模能力与自回归Transformer的结构化建模优势,并通过深度-浅层设计、预训练自编码器的潜在空间建模以及新型引导算法等创新手段显著提升了模型的可扩展性和生成质量。
链接: https://arxiv.org/abs/2506.06276
作者: Jiatao Gu,Tianrong Chen,David Berthelot,Huangjie Zheng,Yuyang Wang,Ruixiang Zhang,Laurent Dinh,Miguel Angel Bautista,Josh Susskind,Shuangfei Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TLDR: We show for the first time that normalizing flows can be scaled for high-resolution and text-conditioned image synthesis
点击查看摘要
Abstract:We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.
zh
[CV-4] BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading
【速读】:该论文旨在解决高分辨率、可重新照明的头部虚拟形象(head avatar)的重建问题,特别是在交互式渲染速率下从新视角进行渲染。其关键解决方案是提出一种基于3D高斯基元(3D Gaussian primitives)的可重新照明的虚拟形象表示,并结合参数化头部模型和依赖表情的动力学模块进行动画处理;同时引入一种混合神经着色方法,结合神经漫反射BRDF与解析镜面项,从而实现从动态光场记录中解耦材料并支持全频率重新照明。
链接: https://arxiv.org/abs/2506.06271
作者: Jonathan Schmidt,Simon Giebenhain,Matthias Niessner
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: see this https URL ; YouTube Video: see this https URL
点击查看摘要
Abstract:We introduce BecomingLit, a novel method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. Therefore, we propose a new low-cost light stage capture setup, tailored specifically towards capturing faces. Using this setup, we collect a novel dataset consisting of diverse multi-view sequences of numerous subjects under varying illumination conditions and facial expressions. By leveraging our new dataset, we introduce a new relightable avatar representation based on 3D Gaussian primitives that we animate with a parametric head model and an expression-dependent dynamics module. We propose a new hybrid neural shading approach, combining a neural diffuse BRDF with an analytical specular term. Our method reconstructs disentangled materials from our dynamic light stage recordings and enables all-frequency relighting of our avatars with both point lights and environment maps. In addition, our avatars can easily be animated and controlled from monocular videos. We validate our approach in extensive experiments on our dataset, where we consistently outperform existing state-of-the-art methods in relighting and reenactment by a significant margin.
zh
[CV-5] Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
【速读】:该论文试图解决如何让机器通过融合第一人称(egocentric)和第三人称(exocentric)视角来更全面地理解动态环境的问题,从而提升视频理解的准确性和丰富性。其解决方案的关键在于系统性地探索三种研究方向:利用第一人称数据增强第三人称理解、利用第三人称数据改进第一人称分析,以及构建统一两种视角的联合学习框架。通过这些方法,论文旨在实现跨视角的协同与互补,推动视频理解技术向更接近人类认知能力的方向发展。
链接: https://arxiv.org/abs/2506.06253
作者: Yuping He,Yifei Huang,Guo Chen,Lidong Lu,Baoqi Pei,Jilan Xu,Tong Lu,Yoichi Sato
机构: State Key Laboratory for Novel Software Technology, Nanjing University(国家软件新技术重点实验室,南京大学); University of Tokyo(东京大学); Zhejiang University(浙江大学); Fudan University(复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, allowing the machines to leverage the synergistic potential of these dual perspectives has emerged as a compelling research direction in video understanding. In this survey, we provide a comprehensive review of video understanding from both exocentric and egocentric viewpoints. We begin by highlighting the practical applications of integrating egocentric and exocentric techniques, envisioning their potential collaboration across domains. We then identify key research tasks to realize these applications. Next, we systematically organize and review recent advancements into three main research directions: (1) leveraging egocentric data to enhance exocentric understanding, (2) utilizing exocentric data to improve egocentric analysis, and (3) joint learning frameworks that unify both perspectives. For each direction, we analyze a diverse set of tasks and relevant works. Additionally, we discuss benchmark datasets that support research in both perspectives, evaluating their scope, diversity, and applicability. Finally, we discuss limitations in current works and propose promising future research directions. By synthesizing insights from both perspectives, our goal is to inspire advancements in video understanding and artificial intelligence, bringing machines closer to perceiving the world in a human-like manner. A GitHub repo of related works can be found at this https URL.
zh
[CV-6] Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
【速读】:该论文试图解决当前多模态大语言模型在视觉问答任务中面临的“概念化”能力不足的问题,即模型难以在不同视觉形式下识别和推理同一概念,这是人类推理的基本能力。解决方案的关键在于引入了视觉图竞技场(Visual Graph Arena, VGA),该数据集包含六种基于图的任务,旨在评估和提升AI系统在视觉抽象方面的能力。VGA通过使用多种图布局(如Kamada-Kawai与平面布局)来测试与视觉形式无关的推理能力,从而隔离出表示不变推理的挑战,为实现更接近人类水平的概念化提供了一个框架。
链接: https://arxiv.org/abs/2506.06242
作者: Zahra Babaiee,Peyman M. Kiasari,Daniela Rus,Radu Grosu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists, `conceptualization’-the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems’ capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: \hrefthis https URLthis http URL
zh
[CV-7] Optimizing Cloud-to-GPU Throughput for Deep Learning With Earth Observation Data
【速读】:该论文旨在解决在大规模地球观测(Earth Observation, EO)数据上训练深度学习模型时,计算资源与数据存储分离所带来的性能瓶颈问题。标准的PyTorch数据加载器在从云存储直接流式传输GeoTIFF文件时无法有效保持现代GPU的利用率。论文的关键解决方案是通过系统性测试不同的加载器配置和数据参数,重点优化了基于瓦片对齐的读取方式和工作线程池,并利用贝叶斯优化方法为每种存储类型找到最佳设置,从而显著提升了远程和本地数据加载的吞吐量。
链接: https://arxiv.org/abs/2506.06235
作者: Akram Zaytar,Caleb Robinson,Girmaw Abebe Tadesse,Tammy Glazer,Gilles Hacheme,Anthony Ortiz,Rahul M Dodhia,Juan M Lavista Ferres
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage. However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage. In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters. We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find optimal settings for each storage type. Our optimized configurations increase remote data loading throughput by 20x and local throughput by 4x compared to default settings. On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets. We improve validation IoU by 6-15% and maintain 85-95% GPU utilization versus 0-30% with standard configurations. Code is publicly available at this https URL
zh
[CV-8] Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study
【速读】:该论文旨在解决传统计算机视觉模型在内窥镜领域泛化能力不足的问题,以及探索生成式视觉语言模型(Vision Language Models, VLMs)在腹腔镜手术中的适用性。其解决方案的关键在于通过大规模实验评估当前VLMs在基本感知任务和高级场景理解任务中的表现,并对比专业医疗VLMs与通用模型的性能差异,从而揭示VLMs在手术环境中的局限性及优化方向。
链接: https://arxiv.org/abs/2506.06232
作者: Leon Mayer,Tim Rädsch,Dominik Michael,Lucas Luttner,Amine Yamlahi,Evangelia Christodoulou,Patrick Godau,Marcel Knopp,Annika Reinke,Fiona Kolbinger,Lena Maier-Hein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.
zh
[CV-9] owards an Explainable Comparison and Alignment of Feature Embeddings
【速读】:该论文试图解决不同特征嵌入模型在嵌入空间中聚类差异的可解释性比较问题,而现有研究主要关注分类相关下游任务的数值性能。解决方案的关键在于提出一种名为Spectral Pairwise Embedding Comparison (SPEC)的框架,该框架通过分析两个嵌入生成的核矩阵,并利用差分核矩阵的特征分解来检测两个嵌入在样本聚类上的差异。此外,该方法还引入了一个优化问题以对齐两个嵌入,确保在一个嵌入中识别的聚类在另一个嵌入中也能被捕捉。
链接: https://arxiv.org/abs/2506.06231
作者: Mohammad Jalali,Bahar Dibaei Nia,Farzan Farnia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Spectral Theory (math.SP)
备注:
点击查看摘要
Abstract:While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the \emphSpectral Pairwise Embedding Comparison (SPEC) framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the SPEC’s application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The code is available at [this https URL](this http URL).
zh
[CV-10] GenIR: Generative Visual Feedback for Mental Image Retrieval
【速读】:该论文试图解决在现实场景中将视觉-语言模型(Vision-Language Models, VLMs)的性能有效应用于用户通过多轮交互逐步细化搜索目标图像的问题,即 Mental Image Retrieval (MIR) 任务。现有方法依赖于间接或抽象的文本反馈,难以帮助用户准确调整查询。解决方案的关键在于提出 GenIR,一种基于扩散模型生成图像的多轮检索范式,通过显式生成视觉表示来提供清晰、可解释的反馈,从而提升用户交互效率与检索效果。
链接: https://arxiv.org/abs/2506.06220
作者: Diji Yang,Minghao Liu,Chung-Hsiang Lo,Yi Zhang,James Davis
机构: University of California Santa Cruz (加州大学圣克鲁兹分校); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system’s understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.
zh
[CV-11] STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving
【速读】:该论文试图解决当前视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶场景中对时空推理能力评估不足的问题。现有基准主要针对单视角图像或视频的预训练或微调模型,侧重于语义任务如目标识别或场景理解,而缺乏对多视角视频或LiDAR数据下车辆自身行为与交通参与者复杂交互的全面评估。解决方案的关键在于提出STSBench框架,该框架通过自动挖掘预定义交通场景、提供人机验证界面及生成多选题,构建了首个基于3D感知的时空推理评估基准STSnu,从而全面评估VLMs在端到端驾驶中的表现。
链接: https://arxiv.org/abs/2506.06218
作者: Christian Fruhwirth-Reisinger,Dušan Malić,Wei Lin,David Schinagl,Samuel Schulter,Horst Possegger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dataset: this https URL , Code: this https URL
点击查看摘要
Abstract:We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models’ ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.
zh
[CV-12] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model
【速读】:该论文旨在解决机器人在复杂物体操作任务中缺乏统一且鲁棒的动作表示问题,这一问题主要源于现有机器人数据集在简单场景中记录不同动作空间的机器人动作,难以泛化到多样化的场景和机器人平台。解决方案的关键在于从人类和机器人的操作数据中学习一个3D流世界模型(3D flow world model),该模型通过预测交互物体在三维空间中的未来运动来指导操作动作规划,并结合语言指令生成3D光学流轨迹,从而实现跨身体结构的可靠适应与闭环规划能力。
链接: https://arxiv.org/abs/2506.06199
作者: Hongyan Zhi,Peihao Chen,Siyuan Zhou,Yubo Dong,Quanxi Wu,Lei Han,Mingkui Tan
机构: South China University of Technology(华南理工大学); Tencent Robotics X(腾讯机器人实验室); Hong Kong University of Science and Technology(香港科技大学); Pazhou Laboratory(琶洲实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Manipulation has long been a challenging task for robots, while humans can effortlessly perform complex interactions with objects, such as hanging a cup on the mug rack. A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot action in different action spaces within a simple scene. This hinders the robot to learn a unified and robust action representation for different robots within diverse scenes. Observing how humans understand a manipulation task, we find that understanding how the objects should move in the 3D space is a critical clue for guiding actions. This clue is embodiment-agnostic and suitable for both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. This model predicts the future movement of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large-scale 3D optical flow dataset, named ManiFlow-110k, through a moving object auto-detect pipeline. A video diffusion-based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object optical flow, we propose a flow-guided rendering mechanism, which renders the predicted final state and leverages GPT-4o to assess whether the predicted flow aligns with the task description. This equips the robot with a closed-loop planning ability. Finally, we consider the predicted 3D optical flow as constraints for an optimization policy to determine a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.
zh
[CV-13] SatelliteFormula: Multi-Modal Symbolic Regression from Remote Sensing Imagery for Physics Discovery
【速读】:该论文试图解决多光谱遥感图像中复杂环境变量建模的可解释性问题,传统经验指数或黑箱学习模型难以提供物理上可解释的表达式。其解决方案的关键在于提出SatelliteFormula框架,该框架结合基于视觉Transformer(Vision Transformer)的编码器进行空间-光谱特征提取,并引入物理引导约束以确保模型的一致性和可解释性,同时通过将Transformer表示集成到符号优化器中,平衡了模型的准确性和物理合理性。
链接: https://arxiv.org/abs/2506.06176
作者: Zhenyu Yu,Mohd. Yamani Idna Idris,Pei Wang,Yuelong Xia,Fei Ma,Rizwan Qureshi
机构: Universiti Malaya (马来亚大学); Kunming University of Science and Technology (昆明理工大学); Yunnan Normal University (云南师范大学); Guangming Laboratory (光明实验室); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose SatelliteFormula, a novel symbolic regression framework that derives physically interpretable expressions directly from multi-spectral remote sensing imagery. Unlike traditional empirical indices or black-box learning models, SatelliteFormula combines a Vision Transformer-based encoder for spatial-spectral feature extraction with physics-guided constraints to ensure consistency and interpretability. Existing symbolic regression methods struggle with the high-dimensional complexity of multi-spectral data; our method addresses this by integrating transformer representations into a symbolic optimizer that balances accuracy and physical plausibility. Extensive experiments on benchmark datasets and remote sensing tasks demonstrate superior performance, stability, and generalization compared to state-of-the-art baselines. SatelliteFormula enables interpretable modeling of complex environmental variables, bridging the gap between data-driven learning and physical understanding.
zh
[CV-14] chnical Report for Egocentric Mistake Detection for the HoloAssist Challenge
【速读】:该论文试图解决在线错误检测问题(online mistake detection),该问题在工业自动化和教育等领域至关重要,因为实时视频分析允许操作人员在错误发生时进行纠正。以往的研究主要关注涉及动作顺序的程序性错误,但实际应用中需要处理更广泛的错误类型。该论文提出了一种在线错误检测框架,能够处理程序性错误和执行错误(如动作失误或工具误用),其关键在于利用大语言模型(Large Language Model, LLM)生成解释性反馈,以帮助用户理解并纠正错误。实验结果表明,该方法在HoloAssist基准测试中表现优异,位列错误检测任务第二。
链接: https://arxiv.org/abs/2506.06174
作者: Constantin Patsch,Marsil Zakour,Yuankai Wu,Eckehard Steinbach
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this report, we address the task of online mistake detection, which is vital in domains like industrial automation and education, where real-time video analysis allows human operators to correct errors as they occur. While previous work focuses on procedural errors involving action order, broader error types must be addressed for real-world use. We introduce an online mistake detection framework that handles both procedural and execution errors (e.g., motor slips or tool misuse). Upon detecting an error, we use a large language model (LLM) to generate explanatory feedback. Experiments on the HoloAssist benchmark confirm the effectiveness of our approach, where our approach is placed second on the mistake detection task.
zh
[CV-15] A Novel Large-scale Crop Dataset and Dual-stream Transformer Method for Fine-grained Hierarchical Crop Classification from Integrated Hyperspectral EnMAP Data and Multispectral Sentinel-2 Time Series
【速读】:该论文旨在解决精细粒度农作物分类的问题,这一问题对于精准农业和粮食安全监测至关重要。当前研究面临的主要挑战包括高光谱数据获取困难以及作物类型标注成本高昂。论文提出的关键解决方案是构建一个分层的高光谱作物数据集(H2Crop),通过融合30米分辨率的EnMAP高光谱数据与Sentinel-2时序数据,实现对作物的精细化分类。此外,论文还设计了一种双流Transformer架构,分别利用光谱-空间Transformer和时间Swin Transformer处理高光谱数据与Sentinel-2时序数据,并通过层级分类头与层次化融合实现多层级分类。
链接: https://arxiv.org/abs/2506.06155
作者: Wenyuan Li,Shunlin Liang,Yuxiang Zhang,Liqin Liu,Keyan Chen,Yongzhe Chen,Han Ma,Jianglei Xu,Yichuan Ma,Shikang Guan,Zhenwei Shi
机构: Jockey Club STEM Lab of Quantitative Remote Sensing, Department of Geography, The University of Hong Kong, Hong Kong, China; Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 28 pages, 12 figures
点击查看摘要
Abstract:Fine-grained crop classification is crucial for precision agriculture and food security monitoring. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities remains scarce currently due to challenges in hyperspectral data acquisition and crop types annotation costs. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchy classification heads with hierarchical fusion then simultaneously delivers multi-level classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields a 4.2% average F1-scores improvement (peaking at 6.3%). Extensive comparisons also confirming our method’s higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Codes and dataset will be available at this https URL and this http URL Keywords: Crop type classification, precision agriculture, remote sensing, deep learning, hyperspectral data, Sentinel-2 time series, fine-grained crops Comments: 28 pages, 12 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2506.06155 [cs.CV] (or arXiv:2506.06155v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.06155 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-16] Gradient Similarity Surgery in Multi-Task Deep Learning ECML KDD2025
【速读】:该论文旨在解决多任务深度学习(MTDL)中由于多个任务产生的梯度冲突导致的收敛速度慢和稳定性差的问题。其解决方案的关键在于提出一种新颖的梯度手术方法——基于相似性的动量梯度手术(SAM-GS),该方法通过梯度幅度相似性度量来指导优化过程,采用梯度归一化和一阶动量调制,有效缓解了任务间梯度的冲突,从而提升多任务学习的训练效率与性能。
链接: https://arxiv.org/abs/2506.06130
作者: Thomas Borsani,Andrea Rosani,Giuseppe Nicosia,Giuseppe Di Fatta
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted at ECMLPKDD 2025
点击查看摘要
Abstract:The multi-task learning ( MTL ) paradigm aims to simultaneously learn multiple tasks within a single model capturing higher-level, more general hidden patterns that are shared by the tasks. In deep learning, a significant challenge in the backpropagation training process is the design of advanced optimisers to improve the convergence speed and stability of the gradient descent learning rule. In particular, in multi-task deep learning ( MTDL ) the multitude of tasks may generate potentially conflicting gradients that would hinder the concurrent convergence of the diverse loss functions. This challenge arises when the gradients of the task objectives have either different magnitudes or opposite directions, causing one or a few to dominate or to interfere with each other, thus degrading the training process. Gradient surgery methods address the problem explicitly dealing with conflicting gradients by adjusting the overall gradient trajectory. This work introduces a novel gradient surgery method, the Similarity-Aware Momentum Gradient Surgery (SAM-GS), which provides an effective and scalable approach based on a gradient magnitude similarity measure to guide the optimisation process. The SAM-GS surgery adopts gradient equalisation and modulation of the first-order momentum. A series of experimental tests have shown the effectiveness of SAM-GS on synthetic problems and MTL benchmarks. Gradient magnitude similarity plays a crucial role in regularising gradient aggregation in MTDL for the optimisation of the learning process.
zh
[CV-17] CCLSTM: Coupled Convolutional Long-Short Term Memory Network for Occupancy Flow Forecasting
【速读】:该论文旨在解决动态智能体未来状态预测中的问题,特别是在自动驾驶领域中,如何在缺乏高质量矢量输入和计算资源受限的情况下实现高效且准确的预测。其解决方案的关键在于提出一种轻量级、端到端可训练的架构——耦合卷积长短期记忆网络(Coupled Convolutional LSTM, CCLSTM),该架构仅依赖卷积操作,不依赖矢量输入或自注意力机制,通过紧凑的循环卷积结构有效捕捉时间动态和空间占用流相关性。
链接: https://arxiv.org/abs/2506.06128
作者: Peter Lengyel
机构: aiMotive(人工智能动力)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Predicting future states of dynamic agents is a fundamental task in autonomous driving. An expressive representation for this purpose is Occupancy Flow Fields, which provide a scalable and unified format for modeling motion, spatial extent, and multi-modal future distributions. While recent methods have achieved strong results using this representation, they often depend on high-quality vectorized inputs, which are unavailable or difficult to generate in practice, and the use of transformer-based architectures, which are computationally intensive and costly to deploy. To address these issues, we propose \textbfCoupled Convolutional LSTM (CCLSTM), a lightweight, end-to-end trainable architecture based solely on convolutional operations. Without relying on vectorized inputs or self-attention mechanisms, CCLSTM effectively captures temporal dynamics and spatial occupancy-flow correlations using a compact recurrent convolutional structure. Despite its simplicity, CCLSTM achieves state-of-the-art performance on occupancy flow metrics and, as of this submission, ranks (1^\textst) in all metrics on the 2024 Waymo Occupancy and Flow Prediction Challenge leaderboard.
zh
[CV-18] Bidirectional Image-Event Guided Low-Light Image Enhancement
【速读】:该论文旨在解决在极端低光条件下,传统帧基相机因动态范围和时间分辨率有限而导致的图像细节丢失和运动模糊问题。现有方法虽然引入了事件相机并提出了事件引导的低光图像增强算法,但忽略了动态光照条件引起的全局低频噪声以及稀疏事件数据中的局部结构不连续性。论文提出的创新解决方案是双向引导的低光图像增强框架(BiLIE),其关键在于在事件表示层引入基于频率高通滤波的事件特征增强(EFE)模块以抑制低频噪声,并设计双向交叉注意力融合(BCAF)机制以保留高频结构和边缘,同时抑制由稀疏事件引导带来的结构不连续性和局部噪声。
链接: https://arxiv.org/abs/2506.06120
作者: Zhanwen Liu,Huanna Song,Yang Wang,Nan Yang,Shangyu Xie,Yisheng An,Xiangmo Zhao
机构: Chang’an University (长安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Under extreme low-light conditions, traditional frame-based cameras, due to their limited dynamic range and temporal resolution, face detail loss and motion blur in captured images. To overcome this bottleneck, researchers have introduced event cameras and proposed event-guided low-light image enhancement algorithms. However, these methods neglect the influence of global low-frequency noise caused by dynamic lighting conditions and local structural discontinuities in sparse event data. To address these issues, we propose an innovative Bidirectional guided Low-light Image Enhancement framework (BiLIE). Specifically, to mitigate the significant low-frequency noise introduced by global illumination step changes, we introduce the frequency high-pass filtering-based Event Feature Enhancement (EFE) module at the event representation level to suppress the interference of low-frequency information, and preserve and highlight the high-frequency this http URL, we design a Bidirectional Cross Attention Fusion (BCAF) mechanism to acquire high-frequency structures and edges while suppressing structural discontinuities and local noise introduced by sparse event guidance, thereby generating smoother fused this http URL, considering the poor visual quality and color bias in existing datasets, we provide a new dataset (RELIE), with high-quality ground truth through a reliable enhancement scheme. Extensive experimental results demonstrate that our proposed BiLIE outperforms state-of-the-art methods by 0.96dB in PSNR and 0.03 in LPIPS.
zh
[CV-19] WoundAIssist: A Patient-Centered Mobile App for AI-Assisted Wound Care With Physicians in the Loop ALT
【速读】:该论文旨在解决慢性伤口护理中因人口老龄化导致的医疗资源紧张、患者生活质量下降及传统护理方式成本高昂的问题。其解决方案的关键在于开发了一款以患者为中心的AI驱动移动应用WoundAIssist,该应用通过患者居家拍摄伤口照片和填写问卷实现定期记录,并结合轻量级深度学习模型进行本地化伤口分割,从而支持远程监测与视频会诊,提升了医患之间的互动效率与护理质量。
链接: https://arxiv.org/abs/2506.06104
作者: Vanessa Borst,Anna Riedmann,Tassilo Dege,Konstantin Müller,Astrid Schmieder,Birgit Lugrin,Samuel Kounev
机构: Institute of Computer Science, University of Würzburg; University Hospital of Würzburg
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ACM Health (Special Issue)
点击查看摘要
Abstract:The rising prevalence of chronic wounds, especially in aging populations, presents a significant healthcare challenge due to prolonged hospitalizations, elevated costs, and reduced patient quality of life. Traditional wound care is resource-intensive, requiring frequent in-person visits that strain both patients and healthcare professionals (HCPs). Therefore, we present WoundAIssist, a patient-centered, AI-driven mobile application designed to support telemedical wound care. WoundAIssist enables patients to regularly document wounds at home via photographs and questionnaires, while physicians remain actively engaged in the care process through remote monitoring and video consultations. A distinguishing feature is an integrated lightweight deep learning model for on-device wound segmentation, which, combined with patient-reported data, enables continuous monitoring of wound healing progression. Developed through an iterative, user-centered process involving both patients and domain experts, WoundAIssist prioritizes an user-friendly design, particularly for elderly patients. A conclusive usability study with patients and dermatologists reported excellent usability, good app quality, and favorable perceptions of the AI-driven wound recognition. Our main contribution is two-fold: (I) the implementation and (II) evaluation of WoundAIssist, an easy-to-use yet comprehensive telehealth solution designed to bridge the gap between patients and HCPs. Additionally, we synthesize design insights for remote patient monitoring apps, derived from over three years of interdisciplinary research, that may inform the development of similar digital health tools across clinical domains.
zh
[CV-20] VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
【速读】:该论文旨在解决长视频理解中由于上下文过长导致的模型性能下降问题,尤其是在处理包含多个镜头(shot)的长视频时,现有基于多模态大语言模型(MLLM)的代理范式往往无法准确识别和深入理解相关镜头,从而引入冗余或噪声时间上下文。其解决方案的关键在于提出VideoChat-A1,该方法通过独特的镜头链推理范式,能够逐步选择与用户问题相关的镜头,并采用粗到细的划分方式进行深入分析,从而有效模拟人类逐步思考的过程,提升对长视频的理解能力。
链接: https://arxiv.org/abs/2506.06097
作者: Zikang Wang,Boyu Chen,Zhengrong Yue,Yi Wang,Yu Qiao,Limin Wang,Yali Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (深圳市计算机视觉与模式识别重点实验室,深圳先进技术研究院,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The recent advance in video understanding has been driven by multimodal large language models (MLLMs). But these MLLMs are good at analyzing short videos, while suffering from difficulties in understanding videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents for retrieving extra contextual knowledge in a long video. However, most existing agents ignore the key fact that a long video is composed with multiple shots, i.e., to answer the user question from a long video, it is critical to deeply understand its relevant shots like human. Without such insight, these agents often mistakenly find redundant even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Different from the previous works, our VideoChat-A1 can deeply think with long videos, via a distinct chain-of-shot reasoning paradigm. More specifically, it can progressively select the relevant shots of user question, and look into these shots in a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic step-by-step human thinking process, allowing to interactively discover preferable temporal context for thoughtful understanding in long videos. Extensive experiments show that, our VideoChat-A1 achieves the state-of-the-art performance on the mainstream long video QA benchmarks, e.g., it achieves 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., Intern2.5VL-8B and InternVideo2.5-8B), by up to 10.8% and 6.2%. Compared to leading close-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy, but with 7% input frames and 12% inference time on average.
zh
[CV-21] Feedback Guidance of Diffusion Models
【速读】:该论文旨在解决 Classifier-Free Guidance (CFG) 在条件扩散模型中可能导致多样性降低和记忆现象的问题,因为 CFG 采用固定的指导系数,无论样本是否需要修正。论文提出的解决方案是 FeedBack Guidance (FBG),其关键在于通过状态相关的系数自适应地调节指导量,依据样本的条件信号信息性进行动态调整,从而挑战了传统将指导视为固定超参数的观点。
链接: https://arxiv.org/abs/2506.06085
作者: Koulischer Felix,Handke Florian,Deleu Johannes,Demeester Thomas,Ambrogioni Luca
机构: Ghent University - imec; Donders Institute for Brain Cognition and Behaviour, Radboud University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Article currently under review. Code is available at: this https URL
点击查看摘要
Abstract:While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG’s implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive to Limited Interval Guidance (LIG) while benefitting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
zh
[CV-22] WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management
【速读】:该论文试图解决传统小麦管理决策依赖人工专家检测所带来的成本高、主观性强且难以扩展的问题,以及直接应用通用视觉-语言模型(VLM)在小麦管理任务中因缺乏领域知识而导致的量化和推理能力不足问题。解决方案的关键在于构建一个针对小麦管理任务的三层次数据集WisWheat,包括基础预训练数据集、定量数据集和指令微调数据集,通过针对性的微调提升VLM在小麦生长阶段识别与胁迫诊断等任务中的性能。
链接: https://arxiv.org/abs/2506.06084
作者: Bowen Yuan,Selena Song,Javier Fernandez,Yadan Luo,Mahsa Baktashmotlagh,Zijian Wang
机构: The University of Queensland(昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Wheat management strategies play a critical role in determining yield. Traditional management decisions often rely on labour-intensive expert inspections, which are expensive, subjective and difficult to scale. Recently, Vision-Language Models (VLMs) have emerged as a promising solution to enable scalable, data-driven management support. However, due to a lack of domain-specific knowledge, directly applying VLMs to wheat management tasks results in poor quantification and reasoning capabilities, ultimately producing vague or even misleading management recommendations. In response, we propose WisWheat, a wheat-specific dataset with a three-layered design to enhance VLM performance on wheat management tasks: (1) a foundational pretraining dataset of 47,871 image-caption pairs for coarsely adapting VLMs to wheat morphology; (2) a quantitative dataset comprising 7,263 VQA-style image-question-answer triplets for quantitative trait measuring tasks; and (3) an Instruction Fine-tuning dataset with 4,888 samples targeting biotic and abiotic stress diagnosis and management plan for different phenological stages. Extensive experimental results demonstrate that fine-tuning open-source VLMs (e.g., Qwen2.5 7B) on our dataset leads to significant performance improvements. Specifically, the Qwen2.5 VL 7B fine-tuned on our wheat instruction dataset achieves accuracy scores of 79.2% and 84.6% on wheat stress and growth stage conversation tasks respectively, surpassing even general-purpose commercial models such as GPT-4o by a margin of 11.9% and 34.6%.
zh
[CV-23] Full Conformal Adaptation of Medical Vision-Language Models
【速读】:该论文试图解决生成式 AI (Generative AI) 在医学图像分析中可靠性评估的问题,特别是在分割共轭预测(SCP)框架下的适用性问题。现有方法在利用少量样本进行迁移学习时,无法满足SCP对数据交换性的严格假设,导致其零样本性能受限。该论文的关键解决方案是提出全共轭适应(full conformal adaptation),一种联合适应和共轭化预训练基础模型的新设置,通过少量样本适应集对每个测试数据点进行归纳推理,并结合SS-Text这一无需训练的线性探针求解器,以降低计算成本,从而在保持相同覆盖保证的前提下提升集合效率。
链接: https://arxiv.org/abs/2506.06076
作者: Julio Silva-Rodríguez,Leo Fillioux,Paul-Henry Cournède,Maria Vakalopoulou,Stergios Christodoulidis,Ismail Ben Ayed,Jose Dolz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IPMI 2025. Code: this https URL
点击查看摘要
Abstract:Vision-language models (VLMs) pre-trained at large scale have shown unprecedented transferability capabilities and are being progressively integrated into medical image analysis. Although its discriminative potential has been widely explored, its reliability aspect remains overlooked. This work investigates their behavior under the increasingly popular split conformal prediction (SCP) framework, which theoretically guarantees a given error level on output sets by leveraging a labeled calibration set. However, the zero-shot performance of VLMs is inherently limited, and common practice involves few-shot transfer learning pipelines, which cannot absorb the rigid exchangeability assumptions of SCP. To alleviate this issue, we propose full conformal adaptation, a novel setting for jointly adapting and conformalizing pre-trained foundation models, which operates transductively over each test data point using a few-shot adaptation set. Moreover, we complement this framework with SS-Text, a novel training-free linear probe solver for VLMs that alleviates the computational cost of such a transductive approach. We provide comprehensive experiments using 3 different modality-specialized medical VLMs and 9 adaptation tasks. Our framework requires exactly the same data as SCP, and provides consistent relative improvements of up to 27% on set efficiency while maintaining the same coverage guarantees.
zh
[CV-24] RUST: Test-time Resource Utilization for Superior Trustworthiness
【速读】:该论文试图解决标准不确定性估计技术(如Dropout)在区分可靠预测与不可靠预测方面存在不足的问题,其根源在于分类器权重的噪声影响了细粒度统计信息的可靠性。解决方案的关键在于提出一种新的测试时优化方法,该方法考虑了噪声的影响以生成更可靠的置信度估计,该方法定义了一个单调的子集选择函数,随着低分样本的移除,总体准确率持续提升,并在标准风险指标上表现出色。
链接: https://arxiv.org/abs/2506.06048
作者: Haripriya Harikumar,Santu Rana
机构: Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia; University of Manchester; Department of Computer Science, University of Manchester, Manchester, UK
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Standard uncertainty estimation techniques, such as dropout, often struggle to clearly distinguish reliable predictions from unreliable ones. We attribute this limitation to noisy classifier weights, which, while not impairing overall class-level predictions, render finer-level statistics less informative. To address this, we propose a novel test-time optimization method that accounts for the impact of such noise to produce more reliable confidence estimates. This score defines a monotonic subset-selection function, where population accuracy consistently increases as samples with lower scores are removed, and it demonstrates superior performance in standard risk-based metrics such as AUSE and AURC. Additionally, our method effectively identifies discrepancies between training and test distributions, reliably differentiates in-distribution from out-of-distribution samples, and elucidates key differences between CNN and ViT classifiers across various vision datasets.
zh
[CV-25] SDS-Net: Shallow-Deep Synergism-detection Network for infrared small target detection
【速读】:该论文旨在解决当前基于卷积神经网络(Convolutional Neural Network, CNN)的红外小目标检测(Infrared Small Target Detection, IRSTD)方法中,浅层与深层特征之间的异质性被忽视的问题,导致浅层细粒度结构信息与深层高层语义表示之间协作效率低下。此外,不同特征层次间的依赖关系和融合机制缺乏系统建模,无法充分挖掘多层级特征的互补性,从而限制了检测性能并增加了计算成本。解决方案的关键在于提出一种浅层-深层协同检测网络(Shallow-Deep Synergistic Detection Network, SDS-Net),通过双分支架构分别建模特征的结构特性和语义属性,有效保留浅层空间细节并捕获深层语义表示,同时引入自适应特征融合模块以动态建模跨层特征相关性,提升整体特征协作与表征能力。
链接: https://arxiv.org/abs/2506.06042
作者: Taoran Yue,Xiaojin Lu,Jiaxi Cai,Yuanping Chen,Shibing Chu
机构: Jiangsu University(江苏大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages,9 figures, Submitted IEEE Transactions on Geoscience and Remote Sensing
点击查看摘要
Abstract:Current CNN-based infrared small target detection(IRSTD) methods generally overlook the heterogeneity between shallow and deep features, leading to inefficient collaboration between shallow fine grained structural information and deep high-level semantic representations. Additionally, the dependency relationships and fusion mechanisms across different feature hierarchies lack systematic modeling, which fails to fully exploit the complementarity of multilevel features. These limitations hinder IRSTD performance while incurring substantial computational costs. To address these challenges, this paper proposes a shallow-deep synergistic detection network (SDS-Net) that efficiently models multilevel feature representations to increase both the detection accuracy and computational efficiency in IRSTD tasks. SDS-Net introduces a dual-branch architecture that separately models the structural characteristics and semantic properties of features, effectively preserving shallow spatial details while capturing deep semantic representations, thereby achieving high-precision detection with significantly improved inference speed. Furthermore, the network incorporates an adaptive feature fusion module to dynamically model cross-layer feature correlations, enhancing overall feature collaboration and representation capability. Comprehensive experiments on three public datasets (NUAA-SIRST, NUDT-SIRST, and IRSTD-1K) demonstrate that SDS-Net outperforms state-of-the-art IRSTD methods while maintaining low computational complexity and high inference efficiency, showing superior detection performance and broad application prospects. Our code will be made public at this https URL.
zh
[CV-26] nsor-to-Tensor Models with Fast Iterated Sum Features
【速读】:该论文旨在解决高维数据(如图像或高阶张量)在深度学习应用中的处理效率问题,特别是针对其固有的高维度特性,需要更高效的子二次复杂度层来处理此类数据。解决方案的关键在于提出一种具有线性输入规模成本的张量到张量层,该层利用了排列计数领域中的“角树”数学工具。该方法不仅可视为状态空间模型的高阶推广,还基于迭代积分(或求和)的多参数广义签名。通过构建名为快速迭代求和(FIS)的神经网络层,实现了与其他层类型的无缝集成,并在分类和异常检测任务中验证了其有效性。
链接: https://arxiv.org/abs/2506.06041
作者: Joscha Diehl,Rasheed Ibraheem,Leonard Schmitz,Yue Wu
机构: Institute of Mathematics and Computer Science, University of Greifswald (数学与计算机科学研究所,格赖夫斯瓦尔德大学); Maxwell Institute for Mathematical Sciences, School of Mathematics, University of Edinburgh (数学科学麦克斯韦研究所,爱丁堡大学数学学院); Institute of Mathematics, Technical University Berlin (数学研究所,柏林工业大学); Department of Mathematics and Statistics, University of Strathclyde (数学与统计系,斯特拉斯克莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Data in the form of images or higher-order tensors is ubiquitous in modern deep learning applications. Owing to their inherent high dimensionality, the need for subquadratic layers processing such data is even more pressing than for sequence data. We propose a novel tensor-to-tensor layer with linear cost in the input size, utilizing the mathematical gadget of ``corner trees’’ from the field of permutation counting. In particular, for order-two tensors, we provide an image-to-image layer that can be plugged into image processing pipelines. On the one hand, our method can be seen as a higher-order generalization of state-space models. On the other hand, it is based on a multiparameter generalization of the signature of iterated integrals (or sums). The proposed tensor-to-tensor concept is used to build a neural network layer called the Fast Iterated Sums (FIS) layer which integrates seamlessly with other layer types. We demonstrate the usability of the FIS layer with both classification and anomaly detection tasks. By replacing some layers of a smaller ResNet architecture with FIS, a similar accuracy (with a difference of only 0.1%) was achieved in comparison to a larger ResNet while reducing the number of trainable parameters and multi-add operations. The FIS layer was also used to build an anomaly detection model that achieved an average AUROC of 97.3% on the texture images of the popular MVTec AD dataset. The processing and modelling codes are publicly available at this https URL.
zh
[CV-27] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
【速读】:该论文旨在解决从脑活动信号中准确重建高复杂度视觉刺激的问题,这一挑战源于视觉刺激的元素密度与多样性、复杂的空间结构以及多维度的语义信息。其解决方案的关键在于提出HAVIR框架,该框架包含两个适配器:AutoKL适配器将fMRI体素转换为潜在扩散先验以捕捉拓扑结构,CLIP适配器则将体素转换为CLIP文本和图像嵌入以保留语义信息,两者通过通用扩散模型进行融合,从而生成最终的重建图像。
链接: https://arxiv.org/abs/2506.06035
作者: Shiyi Zhang,Dong Liang,Hairong Zheng,Yihang Zhou
机构: Southern University of Science and Technology (南方科技大学); Shenzhen Institutes of Advanced Technology (深圳先进技术研究院); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures, 3 tabs
点击查看摘要
Abstract:Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Even though progress has been made in decoding images from fMRI using generative models, a challenge remains in accurately recovering highly complex visual stimuli. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR that contains two adapters: (1) The AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structures; (2) The CLIP Adapter converts the voxels to CLIP text and image embeddings, containing semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios, the CLIP Adapter is trained with text captions describing the visual stimuli and their corresponding semantic images synthesized from these captions. The experimental results demonstrate that HAVIR effectively reconstructs both structural features and semantic information of visual stimuli even in complex scenarios, outperforming existing models. Comments: 15 pages, 6 figures, 3 tabs Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) ACMclasses: I.2 Cite as: arXiv:2506.06035 [cs.CV] (or arXiv:2506.06035v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.06035 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-28] Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification
【速读】:该论文试图解决基于扩散的净化(Diffusion-based Purification, DBP)方法中噪声水平设定不合理的问题,即现有方法对所有样本使用相同的固定噪声水平 $ t^* $,而未考虑不同样本的清洁程度差异。解决方案的关键在于提出一种样本特定的得分感知噪声注入框架(Sample-specific Score-aware Noise Injection, SSNI),通过预训练的得分网络估计数据点偏离干净数据分布的程度(即得分范数),并根据得分范数的大小应用重加权函数,从而自适应地调整每个样本的噪声水平 $ t^* $,实现样本特定的噪声注入。
链接: https://arxiv.org/abs/2506.06027
作者: Yuhao Sun,Jiacheng Zhang,Zesheng Ye,Chaowei Xiao,Feng Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected to the input sample is key to the success of DBP methods, which is controlled by a constant noise level t^* for all samples in existing methods. In this paper, we discover that an optimal t^* for each sample indeed could be different. Intuitively, the cleaner a sample is, the less the noise it should be injected, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust t^* for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: this https URL.
zh
[CV-29] O-MaMa @ EgoExo4D Correspondence Challenge: Learning Object Mask Matching between Egocentric and Exocentric Views
【速读】:该论文旨在解决跨视图对象分割(cross-image segmentation)问题,即在不同视角下对特定对象进行分割。其解决方案的关键在于将跨图像分割重新定义为掩码匹配(mask matching)任务,并提出了四个核心组件:(1) 通过Mask-Context Encoder从FastSAM掩码候选中提取具有区分性的物体级表示;(2) 利用Ego ↔ Exo Cross-Attention融合多视角观测信息;(3) 采用Mask Matching对比损失函数在共享潜在空间中对齐跨视图特征;(4) 引入Hard Negative Adjacent Mining策略以增强模型对邻近物体的区分能力。
链接: https://arxiv.org/abs/2506.06026
作者: Lorenzo Mur-Labadia,Maria Santos-Villafranca,Alejandro Perez-Yus,Jesus Bermudez-Cameo,Ruben Martinez-Cantin,Jose J. Guerrero
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The goal of the correspondence task is to segment specific objects across different views. This technical report re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego \leftrightarrow Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects.
zh
[CV-30] Restereo: Diffusion stereo video generation and restoration
【速读】:该论文旨在解决从低质量单目视频生成高质量立体视频的问题,传统方法通常假设输入视频为高质量,主要关注在变形视频中修复遮挡区域并保留未遮挡区域。本文提出了一种新的流水线,通过在退化数据上微调模型以实现视频修复,并利用变形掩码进行条件控制,从而一致地增强左右视角视频。解决方案的关键在于利用单一模型同时完成立体视频生成与视频修复,并能够在小规模合成立体视频数据集上进行微调,适用于低质量的真实视频场景。
链接: https://arxiv.org/abs/2506.06023
作者: Xingchang Huang,Ashish Kumar Singh,Florian Dubost,Cristina Nader Vasconcelos,Sakar Khattar,Liang Shi,Christian Theobalt,Cengiz Oztireli,Gurprit Singh
机构: Max Planck Institute for Informatics (马克斯·普朗克信息研究所); VIA-Center Saarbücken (维亚中心萨尔布吕肯); University of Cambridge (剑桥大学); Google (谷歌); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving disoccluded areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video datasets and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.
zh
[CV-31] Enhancing Orthopox Image Classification Using Hybrid Machine Learning and Deep Learning Models
【速读】:该论文旨在解决Orthopoxvirus感染在医学影像中的准确分类问题,以实现早期诊断和疫情预防。传统诊断技术存在耗时、依赖专家解读以及数据集稀缺且存在偏差等问题,因此亟需自动化且可扩展的解决方案。论文提出的一种混合策略是关键,该策略结合了机器学习模型与预训练深度学习模型,无需数据增强即可提取深层特征表示,从而提升分类性能并降低计算成本。实验结果表明,该方法在保持训练和推理效率的同时,能够与其他先进方法结合,实现优异的分类效果,并展现出良好的泛化能力和鲁棒性,为实际临床部署提供了可扩展且可解释的解决方案。
链接: https://arxiv.org/abs/2506.06007
作者: Alejandro Puente-Castro,Enrique Fernandez-Blanco,Daniel Rivero,Andres Molares-Ulloa
机构: University of A Coruna (奥伦塞大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Orthopoxvirus infections must be accurately classified from medical pictures for an easy and early diagnosis and epidemic prevention. The necessity for automated and scalable solutions is highlighted by the fact that traditional diagnostic techniques can be time-consuming and require expert interpretation and there are few and biased data sets of the different types of Orthopox. In order to improve classification performance and lower computational costs, a hybrid strategy is put forth in this paper that uses Machine Learning models combined with pretrained Deep Learning models to extract deep feature representations without the need for augmented data. The findings show that this feature extraction method, when paired with other methods in the state-of-the-art, produces excellent classification outcomes while preserving training and inference efficiency. The proposed approach demonstrates strong generalization and robustness across multiple evaluation settings, offering a scalable and interpretable solution for real-world clinical deployment.
zh
[CV-32] MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks
【速读】:该论文试图解决当前CAPTCHA( Completely Automated Public Turing test to tell Computers and Humans Apart,全自动公共图灵测试以区分计算机和人类)方案缺乏统一、大规模、多模态基准评估的问题,从而难以全面衡量其安全鲁棒性。解决方案的关键在于提出MCA-Bench,一个综合性且可复现的基准测试套件,将异构的CAPTCHA类型整合到统一的评估协议中,并利用共享的视觉-语言模型主干网络,为每种CAPTCHA类别微调专用破解代理,实现跨模态的一致性评估。
链接: https://arxiv.org/abs/2506.05982
作者: Zonglin Wu,Yule Xue,Xin Wei,Yiren Song
机构: Southwest University (西南大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 8 figures
点击查看摘要
Abstract:As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. However, existing CAPTCHA schemes encompass a diverse range of modalities – from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions – yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision-language model backbone, we fine-tune specialized cracking agents for each CAPTCHA category, enabling consistent, cross-modal assessments. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings, and crucially offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration. Datasets and code are available online.
zh
[CV-33] Domain Adaptation in Agricultural Image Analysis: A Comprehensive Review from Shallow Models to Deep Learning
【速读】:该论文试图解决农业图像分析中由于源域与目标域之间存在显著领域偏移而导致的模型泛化能力受限问题,这种偏移主要由环境差异、作物类型和数据采集方法的不同引起。解决方案的关键在于应用领域自适应(Domain Adaptation, DA)技术,以提升模型在不同区域、季节和复杂农业环境中的跨域迁移能力。论文重点探讨了DA技术在农业视觉任务中的作用,并系统回顾了其在实际应用中的进展,特别是对抗学习驱动的DA方法在应对农业场景挑战中的潜力。
链接: https://arxiv.org/abs/2506.05972
作者: Xing Hu,Siyuan Chen,Dawei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the increasing use of computer vision in agriculture, image analysis has become crucial for tasks like crop health monitoring and pest detection. However, significant domain shifts between source and target domains-due to environmental differences, crop types, and data acquisition methods-pose challenges. These domain gaps limit the ability of models to generalize across regions, seasons, and complex agricultural environments. This paper explores how Domain Adaptation (DA) techniques can address these challenges, focusing on their role in enhancing the cross-domain transferability of agricultural image analysis. DA has gained attention in agricultural vision tasks due to its potential to mitigate domain heterogeneity. The paper systematically reviews recent advances in DA for agricultural imagery, particularly its practical applications in complex agricultural environments. We examine the key drivers for adopting DA in agriculture, such as limited labeled data, weak model transferability, and dynamic environmental conditions. We also discuss its use in crop health monitoring, pest detection, and fruit recognition, highlighting improvements in performance across regions and seasons. The paper categorizes DA methods into shallow and deep learning models, with further divisions into supervised, semi-supervised, and unsupervised approaches. A special focus is given to adversarial learning-based DA methods, which have shown great promise in challenging agricultural scenarios. Finally, we review key public datasets in agricultural imagery, analyzing their value and limitations in DA research. This review provides a comprehensive framework for researchers, offering insights into current research gaps and supporting the advancement of DA methods in agricultural image analysis.
zh
[CV-34] Dy3DGS-SLAM: Monocular 3D Gaussian Splatting SLAM for Dynamic Environments
【速读】:该论文旨在解决基于神经辐射场(NeRF)或3D高斯点云(3DGS)的同步定位与建图(SLAM)方法在动态环境中的跟踪与重建问题,这些方法在静态场景中表现优异,但在包含移动元素的真实场景中效果不佳。其关键解决方案是提出Dy3DGS-SLAM,这是首个使用单目RGB输入的3DGS SLAM方法,通过融合光流掩码和深度掩码的概率模型生成融合动态掩码,并结合新颖的运动损失约束位姿估计网络,同时利用动态像素的渲染损失、颜色和深度信息消除动态物体引起的瞬时干扰和遮挡。
链接: https://arxiv.org/abs/2506.05965
作者: Mingrui Li,Yiming Zhou,Hongxing Zhou,Xinggang Hu,Florian Roemer,Hongyu Wang,Ahmad Osman
机构: Dalian University of Technology (大连理工大学); Saarland University of Applied Sciences (萨尔兰应用科学大学); Fraunhofer Institute for Nondestructive Testing (弗劳恩霍夫无损检测研究所); Beijing University of Chemical Technology (北京化工大学); Laval University (拉瓦尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Current Simultaneous Localization and Mapping (SLAM) methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting excel in reconstructing static 3D scenes but struggle with tracking and reconstruction in dynamic environments, such as real-world scenes with moving elements. Existing NeRF-based SLAM approaches addressing dynamic challenges typically rely on RGB-D inputs, with few methods accommodating pure RGB input. To overcome these limitations, we propose Dy3DGS-SLAM, the first 3D Gaussian Splatting (3DGS) SLAM method for dynamic scenes using monocular RGB input. To address dynamic interference, we fuse optical flow masks and depth masks through a probabilistic model to obtain a fused dynamic mask. With only a single network iteration, this can constrain tracking scales and refine rendered geometry. Based on the fused dynamic mask, we designed a novel motion loss to constrain the pose estimation network for tracking. In mapping, we use the rendering loss of dynamic pixels, color, and depth to eliminate transient interference and occlusion caused by dynamic objects. Experimental results demonstrate that Dy3DGS-SLAM achieves state-of-the-art tracking and rendering in dynamic environments, outperforming or matching existing RGB-D methods.
zh
[CV-35] MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation
【速读】:该论文旨在解决基于Transformer的文本到动作生成中同时实现高保真度、流式传输能力、实时响应和可扩展性的根本性挑战。其解决方案的关键在于提出MOGO(Motion Generation with One-pass)框架,该框架包含两个核心组件:MoSA-VQ模块通过分层离散化运动序列并引入可学习缩放机制,生成紧凑且富有表现力的表示;RQHC-Transformer则通过单次前向传递生成多层运动标记,显著降低推理延迟。此外,引入文本条件对齐机制以提升在文本控制下的运动解码语义保真度。
链接: https://arxiv.org/abs/2506.05952
作者: Dongjie Fu,Tengjiao Sun,Pengcheng Fang,Xiaohao Cai,Hansung Kim
机构: Mogo AI(摩歌人工智能); University of Southampton(南安普顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, conference
点击查看摘要
Abstract:Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.
zh
[CV-36] SurGSplat: Progressive Geometry-Constrained Gaussian Splatting for Surgical Scene Reconstruction
【速读】:该论文旨在解决术中导航中由于内镜场景下特征稀疏和光照不一致等问题导致的现有基于结构从运动(Structure-from-Motion, SfM)的方法重建失败的问题。解决方案的关键在于提出SurGSplat,这是一种通过整合几何约束逐步优化三维高斯点云(3D Gaussian Splatting, 3DGS)的新范式,从而实现血管结构和其他关键特征的详细重建,提升术中视觉清晰度和定位精度。
链接: https://arxiv.org/abs/2506.05935
作者: Yuchao Zheng,Jianing Zhang,Guochen Ning,Hongen Liao
机构: Tsinghua University (清华大学); Fudan University (复旦大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Intraoperative navigation relies heavily on precise 3D reconstruction to ensure accuracy and safety during surgical procedures. However, endoscopic scenarios present unique challenges, including sparse features and inconsistent lighting, which render many existing Structure-from-Motion (SfM)-based methods inadequate and prone to reconstruction failure. To mitigate these constraints, we propose SurGSplat, a novel paradigm designed to progressively refine 3D Gaussian Splatting (3DGS) through the integration of geometric constraints. By enabling the detailed reconstruction of vascular structures and other critical features, SurGSplat provides surgeons with enhanced visual clarity, facilitating precise intraoperative decision-making. Experimental evaluations demonstrate that SurGSplat achieves superior performance in both novel view synthesis (NVS) and pose estimation accuracy, establishing it as a high-fidelity and efficient solution for surgical scene reconstruction. More information and results can be found on the page this https URL.
zh
[CV-37] FADE: Frequency-Aware Diffusion Model Factorization for Video Editing CVPR
【速读】:该论文旨在解决视频编辑中因传统图像扩散模型难以处理视频动态性而导致的挑战,尤其是在运动调整等复杂时间编辑任务中的效果不足问题。其解决方案的关键在于提出FADE方法,该方法通过频域感知的因子分解策略,充分利用预训练视频扩散模型的内在先验知识,优化各组件的专业化角色,并结合频谱引导的调制技术,以频率域线索细化采样轨迹,从而在保持基本时空结构的同时实现高效且多样的编辑效果。
链接: https://arxiv.org/abs/2506.05934
作者: Yixuan Zhu,Haolin Wang,Shilin Ma,Wenliang Zhao,Yansong Tang,Lei Chen,Jie Zhou
机构: Tsinghua University (清华大学); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025
点击查看摘要
Abstract:Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component’s specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively. Code is available at this https URL .
zh
[CV-38] Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness
【速读】:该论文试图解决半监督语义分割在实际应用中过于关注分割精度而忽视模型可靠性与鲁棒性的问题。当前的评估协议仅关注分割准确性,未能全面衡量模型在复杂环境下的稳定表现及置信度校准情况。解决方案的关键在于引入一种新的综合评估指标——可靠分割分数(Reliable Segmentation Score, RSS),该指标通过调和平均将预测准确性、校准性和不确定性质量相结合,从而提供一种更全面评估分割模型的方法。
链接: https://arxiv.org/abs/2506.05917
作者: Steven Landgraf,Markus Hillemann,Markus Ulrich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Semantic segmentation is critical for scene understanding but demands costly pixel-wise annotations, attracting increasing attention to semi-supervised approaches to leverage abundant unlabeled data. While semi-supervised segmentation is often promoted as a path toward scalable, real-world deployment, it is astonishing that current evaluation protocols exclusively focus on segmentation accuracy, entirely overlooking reliability and robustness. These qualities, which ensure consistent performance under diverse conditions (robustness) and well-calibrated model confidences as well as meaningful uncertainties (reliability), are essential for safety-critical applications like autonomous driving, where models must handle unpredictable environments and avoid sudden failures at all costs. To address this gap, we introduce the Reliable Segmentation Score (RSS), a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. RSS penalizes deficiencies in any of its components, providing an easy and intuitive way of holistically judging segmentation models. Comprehensive evaluations of UniMatchV2 against its predecessor and a supervised baseline show that semi-supervised methods often trade reliability for accuracy. While out-of-domain evaluations demonstrate UniMatchV2’s robustness, they further expose persistent reliability shortcomings. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
zh
[CV-39] QualitEye: Public and Privacy-preserving Gaze Data Quality Verification
【速读】:该论文试图解决大规模收集眼动数据时确保数据质量的挑战,以及在多方协作过程中产生的隐私问题。解决方案的关键在于提出QualitEye——一种用于验证基于图像的眼动数据质量的方法,其核心是采用新的语义表示方式,既包含验证所需的信息,又排除无关信息以提高领域适应性,并支持公共环境和隐私保护环境下的数据交换。
链接: https://arxiv.org/abs/2506.05908
作者: Mayar Elfares,Pascal Reisert,Ralf Küsters,Andreas Bulling
机构: University of Stuttgart (斯图加特大学); Institute of Information Security (信息安全研究所); Collaborative Artificial Intelligence Group (协作人工智能小组)
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Gaze-based applications are increasingly advancing with the availability of large datasets but ensuring data quality presents a substantial challenge when collecting data at scale. It further requires different parties to collaborate, therefore, privacy concerns arise. We propose QualitEye–the first method for verifying image-based gaze data quality. QualitEye employs a new semantic representation of eye images that contains the information required for verification while excluding irrelevant information for better domain adaptation. QualitEye covers a public setting where parties can freely exchange data and a privacy-preserving setting where parties cannot reveal their raw data nor derive gaze features/labels of others with adapted private set intersection protocols. We evaluate QualitEye on the MPIIFaceGaze and GazeCapture datasets and achieve a high verification performance (with a small overhead in runtime for privacy-preserving versions). Hence, QualitEye paves the way for new gaze analysis methods at the intersection of machine learning, human-computer interaction, and cryptography.
zh
[CV-40] Query Nearby: Offset-Adjusted Mask2Former enhances small-organ segmentation
【速读】:该论文旨在解决医学图像分割中获取临床可接受结果的难题,特别是在处理中型和小型器官时,传统方法如直接使用ViT进行分割往往表现不佳,DSC值低于50%,远低于临床要求的80%。其关键解决方案是采用带有可变形注意力机制的Mask2Former,并引入偏移调整策略以增强注意力权重计算过程中同一器官内采样点的聚集,从而更好地整合紧凑的前景信息。此外,利用Mask2Former的第4个特征图提供器官的粗略位置,并通过基于全卷积网络(FCN)的辅助头加速训练过程,最终在HaNSeg和SegRap2023数据集上取得了最先进的性能。
链接: https://arxiv.org/abs/2506.05897
作者: Xin Zhang,Dongdong Meng,Sheng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical segmentation plays an important role in clinical applications like radiation therapy and surgical guidance, but acquiring clinically acceptable results is difficult. In recent years, progress has been witnessed with the success of utilizing transformer-like models, such as combining the attention mechanism with CNN. In particular, transformer-based segmentation models can extract global information more effectively, compensating for the drawbacks of CNN modules that focus on local features. However, utilizing transformer architecture is not easy, because training transformer-based models can be resource-demanding. Moreover, due to the distinct characteristics in the medical field, especially when encountering mid-sized and small organs with compact regions, their results often seem unsatisfactory. For example, using ViT to segment medical images directly only gives a DSC of less than 50%, which is far lower than the clinically acceptable score of 80%. In this paper, we used Mask2Former with deformable attention to reduce computation and proposed offset adjustment strategies to encourage sampling points within the same organs during attention weights computation, thereby integrating compact foreground information better. Additionally, we utilized the 4th feature map in Mask2Former to provide a coarse location of organs, and employed an FCN-based auxiliary head to help train Mask2Former more quickly using Dice loss. We show that our model achieves SOTA (State-of-the-Art) performance on the HaNSeg and SegRap2023 datasets, especially on mid-sized and small this http URL code is available at link this https URL_Background-location_Decoder_Mask2former.
zh
[CV-41] Object Navigation with Structure-Semantic Reasoning -Based Multi-level Map and Multimodal Decision-Making LLM
【速读】:该论文旨在解决在未知开放环境中的零样本物体导航(zero-shot object navigation, ZSON)问题,特别是在面对语义新颖目标时,由于忽视高维隐式场景信息和长距离目标搜索任务而导致的性能显著下降。其解决方案的关键在于提出了一种基于环境属性图(Environmental Attributes Map, EAM)和多语言大模型分层推理模块(MLLM Hierarchical Reasoning module, MHR)的主动物体导航框架。EAM通过SBERT对观察到的环境进行推理并利用扩散模型预测未观察到的区域,从而捕捉物体与房间之间的关联及区域邻接关系;MHR则受EAM启发,用于执行前沿探索决策,避免长距离场景中的迂回路径,提升路径效率。
链接: https://arxiv.org/abs/2506.05896
作者: Chongshang Yan,Jiaxuan He,Delun Li,Yi Yang,Wenjie Song
机构: Beijing Institute of Technology (北京理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 11 figures
点击查看摘要
Abstract:The zero-shot object navigation (ZSON) in unknown open-ended environments coupled with semantically novel target often suffers from the significant decline in performance due to the neglect of high-dimensional implicit scene information and the long-range target searching task. To address this, we proposed an active object navigation framework with Environmental Attributes Map (EAM) and MLLM Hierarchical Reasoning module (MHR) to improve its success rate and efficiency. EAM is constructed by reasoning observed environments with SBERT and predicting unobserved ones with Diffusion, utilizing human space regularities that underlie object-room correlations and area adjacencies. MHR is inspired by EAM to perform frontier exploration decision-making, avoiding the circuitous trajectories in long-range scenarios to improve path efficiency. Experimental results demonstrate that the EAM module achieves 64.5% scene mapping accuracy on MP3D dataset, while the navigation task attains SPLs of 28.4% and 26.3% on HM3D and MP3D benchmarks respectively - representing absolute improvements of 21.4% and 46.0% over baseline methods.
zh
[CV-42] Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation CVPR2025
【速读】:该论文旨在解决多模态媒体篡改检测与定位(DGM4)中因缺乏对局部内容细粒度一致性的探索而导致的伪造细节感知不足和结果不可靠的问题。其解决方案的关键在于提出一种名为上下文-语义一致性学习(CSCL)的新方法,该方法通过为图像和文本模态分别构建包含上下文一致性解码器(CCD)和语义一致性解码器(SCD)的双分支结构,以捕捉模内上下文一致性和跨模态语义一致性,并基于一致性特征进行伪造感知推理或聚合,从而提升对细粒度伪造细节的识别能力。
链接: https://arxiv.org/abs/2506.05890
作者: Yiheng Li,Yang Yang,Zichang Tan,Huan Liu,Weihua Chen,Xu Zhou,Zhen Lei
机构: MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; CAIR, HKISI, Chinese Academy of Sciences; School of Computer Science and Engineering, the Faculty of Innovation Engineering, M.U.S.T; Sangfor Technologies Inc.; Beijing Jiaotong University; Alibaba Group
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation DGM4 has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained perception ability of forgery for DGM4. Two branches for image and text modalities are established, each of which contains two cascaded decoders, i.e., Contextual Consistency Decoder (CCD) and Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, the forgery-aware reasoning or aggregating is adopted to deeply seek forgery cues based on the consistency features. Extensive experiments on DGM4 datasets prove that CSCL achieves new state-of-the-art performance, especially for the results of grounding manipulated content. Codes and weights are avaliable at this https URL.
zh
[CV-43] HMVLM: Multistage Reasoning -Enhanced Vision-Language Model for Long-Tailed Driving Scenarios
【速读】:该论文旨在解决端到端自动驾驶中如何在保持低延迟的同时生成高水准驾驶意图的问题。其解决方案的关键在于提出了一种基于认知启发的快慢架构中的慢速规划模块,即HaoMo Vision-Language Model (HMVLM),该模型通过引入三种升级:(1)嵌入4秒自车运动历史的五视角选择性提示,(2)多阶段链式思维(CoT)提示以实现场景理解-驾驶决策-轨迹推断的推理流程,(3)基于样条的轨迹后处理以消除晚期抖动和急转弯,从而提升驾驶行为的平滑性和合理性。
链接: https://arxiv.org/abs/2506.05883
作者: Daming Wang,Yuhao Song,Zijian He,Kangliang Chen,Xing Pan,Lu Deng,Weihao Gu
机构: HAOMO.AI (Haomo AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: WOD Vision-based End-to-End Driving Challenge
点击查看摘要
Abstract:We present HaoMo Vision-Language Model (HMVLM), an end-to-end driving framework that implements the slow branch of a cognitively inspired fast-slow architecture. A fast controller outputs low-level steering, throttle, and brake commands, while a slow planner-a large vision-language model-generates high-level intents such as “yield to pedestrian” or “merge after the truck” without compromising latency. HMVLM introduces three upgrades: (1) selective five-view prompting with an embedded 4s history of ego kinematics, (2) multi-stage chain-of-thought (CoT) prompting that enforces a Scene Understanding - Driving Decision - Trajectory Inference reasoning flow, and (3) spline-based trajectory post-processing that removes late-stage jitter and sharp turns. Trained on the Waymo Open Dataset, these upgrades enable HMVLM to achieve a Rater Feedback Score (RFS) of 7.7367, securing 2nd place in the 2025 Waymo Vision-based End-to-End (E2E) Driving Challenge and surpassing the public baseline by 2.77%.
zh
[CV-44] Domain-RAG : Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection
【速读】:该论文旨在解决跨域小样本目标检测(Cross-Domain Few-Shot Object Detection, CD-FSOD)中由于目标域未知导致的样本稀缺问题,特别是在生成合成数据以提升检测性能时面临的视觉真实性和域对齐难题。现有方法如复制粘贴增强和文本到图像生成在保持正确物体类别或生成与目标域一致的背景方面存在局限性。该论文提出的解决方案——Domain-RAG,其关键在于通过三阶段流程实现无监督的、基于检索的组合图像生成:域感知背景检索、域引导背景生成以及前景-背景组合,从而在无需额外监督或训练的情况下生成高质量且域一致的样本。
链接: https://arxiv.org/abs/2506.05872
作者: Yu Li,Xingyu Qiu,Yuqian Fu,Jie Chen,Tianwen Qian,Xu Zheng,Danda Pani Paudel,Yanwei Fu,Xuanjing Huang,Luc Van Gool,Yu-Gang Jiang
机构: Fudan University (复旦大学); INSAIT, Sofia University “St. Kliment Ohridski” (INSAIT,索菲亚大学“圣基里尔·奥赫里德斯基”); East China Normal University (华东师范大学); HKUST(GZ) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.
zh
[CV-45] Loss Functions for Predictor-based Neural Architecture Search
【速读】:该论文旨在解决神经网络架构搜索(Neural Architecture Search, NAS)中评估过程成本高昂的问题,其核心在于通过性能预测器降低评估开销。解决方案的关键在于对损失函数的选择进行系统性研究,探索不同类型的损失函数(如回归损失、排序损失和加权损失)在性能预测中的效果,并发现特定类型损失函数的组合能够有效提升预测器在NAS中的表现。这一研究为不同任务选择合适的损失函数提供了实践指导,并为基于预测器的方法在NAS领域的进一步发展提供了理论支持。
链接: https://arxiv.org/abs/2506.05869
作者: Han Ji,Yuqi Feng,Jiahao Fan,Yanan Sun
机构: Sichuan University (四川大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Evaluation is a critical but costly procedure in neural architecture search (NAS). Performance predictors have been widely adopted to reduce evaluation costs by directly estimating architecture performance. The effectiveness of predictors is heavily influenced by the choice of loss functions. While traditional predictors employ regression loss functions to evaluate the absolute accuracy of architectures, recent approaches have explored various ranking-based loss functions, such as pairwise and listwise ranking losses, to focus on the ranking of architecture performance. Despite their success in NAS, the effectiveness and characteristics of these loss functions have not been thoroughly investigated. In this paper, we conduct the first comprehensive study on loss functions in performance predictors, categorizing them into three main types: regression, ranking, and weighted loss functions. Specifically, we assess eight loss functions using a range of NAS-relevant metrics on 13 tasks across five search spaces. Our results reveal that specific categories of loss functions can be effectively combined to enhance predictor-based NAS. Furthermore, our findings could provide practical guidance for selecting appropriate loss functions for various tasks. We hope this work provides meaningful insights to guide the development of loss functions for predictor-based methods in the NAS community.
zh
[CV-46] CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy
【速读】:该论文旨在解决从无序的冷冻电镜(cryo-EM)图像中进行快速且准确的姿态估计(pose estimation)与三维重建问题,传统方法依赖于耗时的迭代优化,受限于低信噪比(SNR)和对比度传递函数(CTF)引起的失真。其解决方案的关键在于提出CryoFastAR,这是首个能够直接从噪声cryo-EM图像中预测姿态的几何基础模型,通过整合多视角特征并在大规模模拟数据上进行训练,结合渐进式训练策略以提升模型的稳定性和鲁棒性,从而在保持重建质量的同时显著加速推理过程。
链接: https://arxiv.org/abs/2506.05864
作者: Jiakai Zhang,Shouchen Zhou,Haizhao Dai,Xinhang Liu,Peihao Wang,Zhiwen Fan,Yuan Pei,Jingyi Yu
机构: ShanghaiTech University (上海科技大学); Cellverse, Co., Ltd (细胞宇宙公司); HKUST (香港科技大学); UT Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from Cryo-EM noisy images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.
zh
[CV-47] Improved Allergy Wheal Detection for the Skin Prick Automated Test Device
【速读】:该论文旨在解决传统皮肤点刺试验(Skin Prick Test, SPT)在诊断吸入性过敏反应时因人为因素导致的结果不一致问题,其解决方案的关键在于利用Skin Prick Automated Test (SPAT)设备采集的32张不同光照条件下的图像,通过一种结合神经网络与可解释算法的自动化方法实现对过敏性风团(wheal)的精准检测与分割。该方法首先采用神经网络进行像素级的风团分割,随后通过算法化且可解释的流程完成风团的检测与边界划定,从而提升诊断的一致性和准确性。
链接: https://arxiv.org/abs/2506.05862
作者: Rembert Daems,Sven Seys,Valérie Hox,Adam Chaker,Glynnis De Greve,Winde Lemmens,Anne-Lise Poirrier,Eline Beckers,Zuzana Diamant,Carmen Dierickx,Peter W. Hellings,Caroline Huart,Claudia Jerin,Mark Jorissen,Hanne Oscé,Karolien Roux,Mark Thompson,Sophie Tombu,Saartje Uyttebroek,Andrzej Zarowski,Senne Gorris,Laura Van Gerven,Dirk Loeckx,Thomas Demeester
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This work is presented at Artificial Intelligence in Medicine 2025, this is the longer (10 pages) version
点击查看摘要
Abstract:Background: The skin prick test (SPT) is the gold standard for diagnosing sensitization to inhalant allergies. The Skin Prick Automated Test (SPAT) device was designed for increased consistency in test results, and captures 32 images to be jointly used for allergy wheal detection and delineation, which leads to a diagnosis. Materials and Methods: Using SPAT data from 868 patients with suspected inhalant allergies, we designed an automated method to detect and delineate wheals on these images. To this end, 10,416 wheals were manually annotated by drawing detailed polygons along the edges. The unique data-modality of the SPAT device, with 32 images taken under distinct lighting conditions, requires a custom-made approach. Our proposed method consists of two parts: a neural network component that segments the wheals on the pixel level, followed by an algorithmic and interpretable approach for detecting and delineating the wheals. Results: We evaluate the performance of our method on a hold-out validation set of 217 patients. As a baseline we use a single conventionally lighted image per SPT as input to our method. Conclusion: Using the 32 SPAT images under various lighting conditions offers a considerably higher accuracy than a single image in conventional, uniform light. Comments: This work is presented at Artificial Intelligence in Medicine 2025, this is the longer (10 pages) version Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2506.05862 [cs.CV] (or arXiv:2506.05862v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.05862 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rembert Daems [view email] [v1] Fri, 6 Jun 2025 08:31:22 UTC (1,154 KB)
zh
[CV-48] ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On
【速读】:该论文旨在解决视频虚拟试穿(video virtual try-on)中服装替换的时空连续性与细节保真度问题。现有方法在保持视频连贯性和再现服装细节方面仍存在不足。其解决方案的关键在于提出ChronoTailor,一个基于扩散模型的框架,通过精确的时空注意力机制引导细粒度服装特征的整合,从而实现鲁棒的虚拟试穿效果。该方法结合了区域感知的空间引导和注意力驱动的时间特征融合机制,以提升时空连续性,并通过多尺度服装特征集成与服装-姿态特征对齐来保留低级视觉细节并确保动态运动中的时间一致性。
链接: https://arxiv.org/abs/2506.05858
作者: Jinjuan Wang,Wenzhang Sun,Ming Li,Yun Zheng,Fanyao Li,Zhulin Tao,Donglin Di,Hao Li,Wei Chen,Xianglin Huang
机构: Communication University of China (中国传媒大学); Li Auto (小鹏汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video virtual try-on aims to seamlessly replace the clothing of a person in a source video with a target garment. Despite significant progress in this field, existing approaches still struggle to maintain continuity and reproduce garment details. In this paper, we introduce ChronoTailor, a diffusion-based framework that generates temporally consistent videos while preserving fine-grained garment details. By employing a precise spatio-temporal attention mechanism to guide the integration of fine-grained garment features, ChronoTailor achieves robust try-on performance. First, ChronoTailor leverages region-aware spatial guidance to steer the evolution of spatial attention and employs an attention-driven temporal feature fusion mechanism to generate more continuous temporal features. This dual approach not only enables fine-grained local editing but also effectively mitigates artifacts arising from video dynamics. Second, ChronoTailor integrates multi-scale garment features to preserve low-level visual details and incorporates a garment-pose feature alignment to ensure temporal continuity during dynamic motion. Additionally, we collect StyleDress, a new dataset featuring intricate garments, varied environments, and diverse poses, offering advantages over existing public datasets, and will be publicly available for research. Extensive experiments show that ChronoTailor maintains spatio-temporal continuity and preserves garment details during motion, significantly outperforming previous methods.
zh
[CV-49] Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025 CVPR2025 CVPR
【速读】:该论文旨在解决第一人称视角(ego view)与第三人称视角(exo view)之间的物体对应任务中的对象分割问题,即给定一个视角的对象查询,预测另一个视角下的对应物体掩码。解决方案的关键在于提出了一种多模态条件融合模块,通过结合视觉掩码和文本描述作为分割条件来增强物体定位;此外,还引入了跨视角物体对齐模块,以在不同视角间强制实现物体级别的一致性,从而提升模型对视角变化的鲁棒性。
链接: https://arxiv.org/abs/2506.05856
作者: Yuqian Fu,Runze Wang,Yanwei Fu,Danda Pani Paudel,Luc Van Gool
机构: INSAIT; Sofia University “St. Kliment Ohridski”; Fudan University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The 2nd Price Award of EgoExo4D Relations, Second Joint EgoVis Workshop with CVPR2025, technical report paper is accepted by CVPRW 25
点击查看摘要
Abstract:In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhances object localization by leveraging both visual masks and textual descriptions as segmentation conditions. Furthermore, to address the visual domain gap between ego and exo views, we introduce a cross-view object alignment module that enforces object-level consistency across perspectives, thereby improving the model’s robustness to viewpoint changes. Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark. Code will be made available at this https URL.
zh
[CV-50] FontAdapter: Instant Font Adaptation in Visual Text Generation
【速读】:该论文旨在解决在未见过的字体中实现快速、高质量的视觉文本生成问题,现有方法在适应未预设字体时计算成本高,难以实现实时定制。其解决方案的关键在于提出FontAdapter框架,该框架通过两阶段课程学习策略,首先从孤立字形中提取字体属性,再将这些风格整合到多样的自然背景中,从而实现高效的字体定制与多样化应用。
链接: https://arxiv.org/abs/2506.05843
作者: Myungkyu Koo,Subin Kim,Sangkyung Kwak,Jaehyun Nam,Seojin Kim,Jinwoo Shin
机构: KAIST(韩国科学技术院); General Robotics(通用机器人); Seoul National University(首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Text-to-image diffusion models have significantly improved the seamless integration of visual text into diverse image contexts. Recent approaches further improve control over font styles through fine-tuning with predefined font dictionaries. However, adapting unseen fonts outside the preset is computationally expensive, often requiring tens of minutes, making real-time customization impractical. In this paper, we present FontAdapter, a framework that enables visual text generation in unseen fonts within seconds, conditioned on a reference glyph image. To this end, we find that direct training on font datasets fails to capture nuanced font attributes, limiting generalization to new glyphs. To overcome this, we propose a two-stage curriculum learning approach: FontAdapter first learns to extract font attributes from isolated glyphs and then integrates these styles into diverse natural backgrounds. To support this two-stage training scheme, we construct synthetic datasets tailored to each stage, leveraging large-scale online fonts effectively. Experiments demonstrate that FontAdapter enables high-quality, robust font customization across unseen fonts without additional fine-tuning during inference. Furthermore, it supports visual text editing, font style blending, and cross-lingual font transfer, positioning FontAdapter as a versatile framework for font customization tasks.
zh
[CV-51] High Throughput Event Filtering: The Interpolation-based DIF Algorithm Hardware Architecture MICRO
【速读】:该论文旨在解决事件视觉(event vision)中由事件传感器产生的数据流中噪声干扰严重的问题,这种噪声受场景光照条件或传感器温度等因素影响较大。其解决方案的关键在于提出一种基于距离的频率加权插值(Distance-based Interpolation with Frequency Weights, DIF)滤波器的硬件架构,并在FPGA芯片上实现,以提高处理效率和降噪性能。该架构在不同分辨率下均表现出较高的吞吐量(最高达428.45 MEPS)和良好的AUROC指标(0.844至0.999),在保持与当前最优滤波方案相当性能的同时,显著提升了处理速度和对多种噪声水平的适应能力。
链接: https://arxiv.org/abs/2506.05825
作者: Marcin Kowalczyk,Tomasz Kryjak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in the Microprocessors and Microsystems journal
点击查看摘要
Abstract:In recent years, there has been rapid development in the field of event vision. It manifests itself both on the technical side, as better and better event sensors are available, and on the algorithmic side, as more and more applications of this technology are proposed and scientific papers are published. However, the data stream from these sensors typically contains a significant amount of noise, which varies depending on factors such as the degree of illumination in the observed scene or the temperature of the sensor. We propose a hardware architecture of the Distance-based Interpolation with Frequency Weights (DIF) filter and implement it on an FPGA chip. To evaluate the algorithm and compare it with other solutions, we have prepared a new high-resolution event dataset, which we are also releasing to the community. Our architecture achieved a throughput of 403.39 million events per second (MEPS) for a sensor resolution of 1280 x 720 and 428.45 MEPS for a resolution of 640 x 480. The average values of the Area Under the Receiver Operating Characteristic (AUROC) index ranged from 0.844 to 0.999, depending on the dataset, which is comparable to the state-of-the-art filtering solutions, but with much higher throughput and better operation over a wide range of noise levels.
zh
[CV-52] FuseUNet: A Multi-Scale Feature Fusion Method for U-like Networks ICML2025
【速读】:该论文旨在解决UNet及其变种中跳连(skip connection)的两个关键问题:一是不同尺度特征之间缺乏有效的交互,二是依赖简单的拼接或加法操作,限制了信息的高效融合。其解决方案的关键在于将UNet的解码过程重新构想为一个初值问题(Initial Value Problem, IVP),并将跳连视为离散节点,通过引入自适应常微分方程方法,实现多尺度特征的有效融合。该方法独立于编码器和解码器结构,具有良好的通用性。
链接: https://arxiv.org/abs/2506.05821
作者: Quansong He,Xiangde Min,Kaishen Wang,Tao He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML2025
点击查看摘要
Abstract:Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. The typical component of UNet family is the skip connection, however, their skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multistep method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of the encoder and decoder architectures, making it adaptable to various U-Net-like networks. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code is available at this https URL.
zh
[CV-53] DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image CVPR2025
【速读】:该论文旨在解决3D医学影像中准确提取和表示具有曲线结构的血管的问题,传统方法通常依赖于离散表示如掩码,但由于逐像素分类范式的固有局限性,常导致局部断裂或碎片化。其解决方案的关键在于引入DeformCL,一种基于可变形中心线(Deformable Centerlines)的连续表示,其中中心线点作为节点并通过边捕捉空间关系,该方法具备自然连通性、噪声鲁棒性和交互便利性。
链接: https://arxiv.org/abs/2506.05820
作者: Ziwei Zhao,Zhixing Zhang,Yuhang Liu,Zhao Zhang,Haojun Yu,Dong Wang,Liwei Wang
机构: Yizhun Medical AI Co., Ltd (易诊医疗人工智能有限公司); Center for Data Science, Peking University (北京大学数据科学中心); Center for Machine Learning Research, Peking University (北京大学机器学习研究中心); State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University (北京大学通用人工智能国家重点实验室,智能科学与技术学院); Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China (琶洲实验室(黄埔),广东省广州市)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:In the field of 3D medical imaging, accurately extracting and representing the blood vessels with curvilinear structures holds paramount importance for clinical diagnosis. Previous methods have commonly relied on discrete representation like mask, often resulting in local fractures or scattered fragments due to the inherent limitations of the per-pixel classification paradigm. In this work, we introduce DeformCL, a new continuous representation based on Deformable Centerlines, where centerline points act as nodes connected by edges that capture spatial relationships. Compared with previous representations, DeformCL offers three key advantages: natural connectivity, noise robustness, and interaction facility. We present a comprehensive training pipeline structured in a cascaded manner to fully exploit these favorable properties of DeformCL. Extensive experiments on four 3D vessel segmentation datasets demonstrate the effectiveness and superiority of our method. Furthermore, the visualization of curved planar reformation images validates the clinical significance of the proposed framework. We release the code in this https URL
zh
[CV-54] NTIRE 2025 Challenge on HR Depth from Images of Specular and Transparent Surfaces CVPR2025
【速读】:该论文旨在解决高分辨率深度估计以及非朗伯表面(non-Lambertian surfaces)的深度恢复问题,这是当前深度估计领域面临的两大主要挑战。为实现这一目标,该研究提出了两个赛道,分别针对立体视觉和单图像深度估计,通过竞赛形式推动相关技术的发展。解决方案的关键在于设计能够有效处理复杂表面反射特性并提升深度图分辨率的算法架构。
链接: https://arxiv.org/abs/2506.05815
作者: Pierluigi Zama Ramirez,Fabio Tosi,Luigi Di Stefano,Radu Timofte,Alex Costanzino,Matteo Poggi,Samuele Salti,Stefano Mattoccia,Zhe Zhang,Yang Yang,Wu Chen,Anlong Ming,Mingshuai Zhao,Mengying Yu,Shida Gao,Xiangfeng Wang,Feng Xue,Jun Shi,Yong Yang,Yong A,Yixiang Jin,Dingzhe Li,Aryan Shukla,Liam Frija-Altarac,Matthew Toews,Hui Geng,Tianjiao Wan,Zijian Gao,Qisheng Xu,Kele Xu,Zijian Zang,Jameer Babu Pinjari,Kuldeep Purohit,Mykola Lavreniuk,Jing Cao,Shenyi Li,Kui Jiang,Junjun Jiang,Yong Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NTIRE Workshop Challenge Report, CVPR 2025
点击查看摘要
Abstract:This paper reports on the NTIRE 2025 challenge on HR Depth From images of Specular and Transparent surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2025. This challenge aims to advance the research on depth estimation, specifically to address two of the main open issues in the field: high-resolution and non-Lambertian surfaces. The challenge proposes two tracks on stereo and single-image depth estimation, attracting about 177 registered participants. In the final testing stage, 4 and 4 participating teams submitted their models and fact sheets for the two tracks.
zh
[CV-55] LLIA – Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models
【速读】:该论文旨在解决基于扩散模型的虚拟人生成在实时交互头像应用中的计算需求高、响应速度慢和持续时间受限的问题。其关键解决方案是提出一种基于扩散模型的音频驱动肖像视频生成框架,包括鲁棒的变长视频生成技术以减少初始视频片段或状态转换的最短生成时间,以及针对音频-图像到视频的一致性模型训练策略以实现快速少步生成;同时结合模型量化和管道并行以加速推理速度,并引入新的推理策略以缓解扩散过程和模型量化带来的稳定性损失,从而在保持高质量输出的同时实现低延迟和实时性能。
链接: https://arxiv.org/abs/2506.05806
作者: Haojie Yu,Zhaonian Wang,Yihan Pan,Meng Cheng,Hao Yang,Chao Wang,Tao Xie,Xiaoming Xu,Xiaoming Wei,Xunliang Cai
机构: Meituan Inc. (美团公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion-based models have gained wide adoption in the virtual human generation due to their outstanding expressiveness. However, their substantial computational requirements have constrained their deployment in real-time interactive avatar applications, where stringent speed, latency, and duration requirements are paramount. We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges. Firstly, we propose robust variable-length video generation to reduce the minimum time required to generate the initial video clip or state transitions, which significantly enhances the user experience. Secondly, we propose a consistency model training strategy for Audio-Image-to-Video to ensure real-time performance, enabling a fast few-step generation. Model quantization and pipeline parallelism are further employed to accelerate the inference speed. To mitigate the stability loss incurred by the diffusion process and model quantization, we introduce a new inference strategy tailored for long-duration video generation. These methods ensure real-time performance and low latency while maintaining high-fidelity output. Thirdly, we incorporate class labels as a conditional input to seamlessly switch between speaking, listening, and idle states. Lastly, we design a novel mechanism for fine-grained facial expression control to exploit our model’s inherent capacity. Extensive experiments demonstrate that our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with an initial video generation latency of 140 ms and 215 ms, respectively.
zh
[CV-56] EASG-Bench: Video QA Benchmark with Egocentric Action Scene Graphs
【速读】:该论文试图解决长上下文视频理解中的研究空白,特别是在时间顺序相关问题上的性能不足。解决方案的关键在于引入EASG-Bench,这是一个基于时空定位的动态场景图构建的问答基准,能够捕捉参与者、动作和物体之间的复杂关系,并通过系统化的评估框架对语言模型和视频大语言模型进行评测。
链接: https://arxiv.org/abs/2506.05787
作者: Ivan Rodin,Tz-Ying Wu,Kyle Min,Sharath Nittur Sridhar,Antonino Furnari,Subarna Tripathi,Giovanni Maria Farinella
机构: University of Catania (卡塔尼亚大学); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: this https URL.
zh
[CV-57] GazeNLQ @ Ego4D Natural Language Queries Challenge 2025
【速读】:该论文旨在解决从第一视角视频中根据自然语言查询(Natural Language Queries, NLQ)检索相关视频片段的问题。其核心挑战在于如何有效融合视觉注意力(通过注视点估计)与自然语言理解,以提升视频定位的准确性。解决方案的关键在于提出一种基于对比学习的预训练策略,直接从视频中估计注视点,并将其作为增强视频表示的辅助信息,从而提高检索性能。
链接: https://arxiv.org/abs/2506.05782
作者: Wei-Cheng Lin,Chih-Ming Lien,Chen Lo,Chia-Hung Yeh
机构: National Taiwan Normal University (国立台湾师范大学); National Sun Yat-sen University (国立中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer’s perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offer insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for gaze estimation directly from video. The estimated gaze is used to augment video representations within proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at this https URL.
zh
[CV-58] Robust sensor fusion against on-vehicle sensor staleness CVPR2025
【速读】:该论文旨在解决自动驾驶感知系统中传感器融合面临的时序错位问题,即不同传感器数据到达时间不一致导致的对象状态估计不一致,从而严重影响轨迹预测的准确性。其解决方案的关键在于引入一种基于点级别的时间戳偏移特征(per-point timestamp offset feature),该特征能够为LiDAR和雷达相对于摄像头提供细粒度的时间感知能力,并结合一种模拟实际部署车辆中传感器时滞模式的数据增强策略,从而提升传感器融合的鲁棒性与性能。
链接: https://arxiv.org/abs/2506.05780
作者: Meng Fan,Yifan Zuo,Patrick Blaes,Harley Montgomery,Subhasis Das
机构: Zoox Inc(泽克斯公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: This paper has been accepted by CVPR 2025 Precognition Workshop
点击查看摘要
Abstract:Sensor fusion is crucial for a performant and robust Perception system in autonomous vehicles, but sensor staleness, where data from different sensors arrives with varying delays, poses significant challenges. Temporal misalignment between sensor modalities leads to inconsistent object state estimates, severely degrading the quality of trajectory predictions that are critical for safety. We present a novel and model-agnostic approach to address this problem via (1) a per-point timestamp offset feature (for LiDAR and radar both relative to camera) that enables fine-grained temporal awareness in sensor fusion, and (2) a data augmentation strategy that simulates realistic sensor staleness patterns observed in deployed vehicles. Our method is integrated into a perspective-view detection model that consumes sensor data from multiple LiDARs, radars and cameras. We demonstrate that while a conventional model shows significant regressions when one sensor modality is stale, our approach reaches consistently good performance across both synchronized and stale conditions.
zh
[CV-59] Where Is The Ball: 3D Ball Trajectory Estimation From 2D Monocular Tracking CVPR2025
【速读】:该论文试图解决从二维跟踪序列中估计三维球体轨迹的问题(3D ball trajectory estimation from 2D tracking sequence),其核心挑战在于从二维观测中恢复三维信息时存在的歧义性。解决方案的关键在于设计了一个基于长短期记忆网络(LSTM)的流程,该流程利用了一种与相机位置无关的新型规范三维表示(canonical 3D representation),以处理任意视角,并引入了一系列中间表示来促进关键的不变性和重投影一致性。
链接: https://arxiv.org/abs/2506.05763
作者: Puntawat Ponglertnapakorn,Supasorn Suwajanakorn
机构: VISTEC(视觉技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11th International Workshop on Computer Vision in Sports (CVsports) at CVPR 2025
点击查看摘要
Abstract:We present a method for 3D ball trajectory estimation from a 2D tracking sequence. To overcome the ambiguity in 3D from 2D estimation, we design an LSTM-based pipeline that utilizes a novel canonical 3D representation that is independent of the camera’s location to handle arbitrary views and a series of intermediate representations that encourage crucial invariance and reprojection consistency. We evaluated our method on four synthetic and three real datasets and conducted extensive ablation studies on our design choices. Despite training solely on simulated data, our method achieves state-of-the-art performance and can generalize to real-world scenarios with multiple trajectories, opening up a range of applications in sport analysis and virtual replay. Please visit our page: this https URL.
zh
[CV-60] Investigating the Relationship between Weighted Figure of Merit and Rosins Measure
【速读】:该论文试图解决在计算机视觉应用中,通过分段直线近似数字边界时,如何评估多边形逼近质量的问题。研究的核心在于探讨加权图示因子(weighted figure of merit)与Rosin提出的度量(Rosin’s measure)之间的关系,以确定是否可以相互替代。解决方案的关键在于通过理论分析、实验研究和统计分析证明这两种度量是相互独立的,因此不能互为替代,使用其中一种度量得出的结论不能直接应用于另一种度量。
链接: https://arxiv.org/abs/2506.05749
作者: Bimal Kumar Ray
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Many studies had been conducted to solve the problem of approximating a digital boundary by piece straight-line segments for further processing required in computer vision applications. The authors of these studies compared their schemes to determine the best one. The initial measure used to assess the goodness of a polygonal approximation was figure of merit. Later, it was pointed out that this measure was not an appropriate metric for a valid reason and this is why Rosin - through mathematical analysis - introduced a measure called merit. However, this measure involves optimal scheme of polygonal approximation and so it is time-consuming to compute it to assess the goodness of an approximation. This led many researchers to use weighted figure of merit as a substitute for Rosin’s measure to compare among sub-optimal schemes. An attempt is made in this communication to investigate whether the two measures - weighted figure of merit and Rosin’s measure - are related so that one can be used instead of the other and towards this end theoretical analysis, experimental investigation and statistical analysis are carried out. The mathematical formula for weighted figure of merit and Rosin’s measure are analyzed and through proof of theorems it is found that the two measures are independent of each other theoretically. The graphical analysis of experiments carried out using public dataset supports theoretical analysis. The statistical analysis using Pearson’s correlation coefficient also establishes that the two measures are uncorrelated. This analysis leads one to conclude that if a sub-optimal scheme is found to be better (worse) than some other sub-optimal scheme as indicated by Rosin’s measure then the same conclusion cannot be drawn using weighted figure of merit and so one cannot use weighted figure of merit instead of Rosin’s measure.
zh
[CV-61] Any-Class Presence Likelihood for Robust Multi-Label Classification with Abundant Negative Data
【速读】:该论文旨在解决多标签分类(Multi-label Classification, MLC)中由于大量未标注类别的负样本(negative data)导致的学习过程受阻问题,这些负样本会干扰正样本的准确识别与分类。其解决方案的关键在于重新设计标准的MLC损失函数,通过推导出任意类别存在的似然性,该似然性由预测类别概率的归一化加权几何平均来表述,并引入一个正则化参数以控制缺失类别概率对正样本中任意类别存在似然性的相对贡献。这一方法增强了网络对隐式正样本的感知能力,从而提升了正样本中的标签分类性能。
链接: https://arxiv.org/abs/2506.05721
作者: Dumindu Tissera,Omar Awadallah,Muhammad Umair Danish,Ayan Sadhu,Katarina Grolinger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-label Classification (MLC) assigns an instance to one or more non-exclusive classes. A challenge arises when the dataset contains a large proportion of instances with no assigned class, referred to as negative data, which can overwhelm the learning process and hinder the accurate identification and classification of positive instances. Nevertheless, it is common in MLC applications such as industrial defect detection, agricultural disease identification, and healthcare diagnosis to encounter large amounts of negative data. Assigning a separate negative class to these instances further complicates the learning objective and introduces unnecessary redundancies. To address this challenge, we redesign standard MLC loss functions by deriving a likelihood of any class being present, formulated by a normalized weighted geometric mean of the predicted class probabilities. We introduce a regularization parameter that controls the relative contribution of the absent class probabilities to the any-class presence likelihood in positive instances. The any-class presence likelihood complements the multi-label learning by encouraging the network to become more aware of implicit positive instances and improve the label classification within those positive instances. Experiments on large-scale datasets with negative data: SewerML, modified COCO, and ChestX-ray14, across various networks and base loss functions show that our loss functions consistently improve MLC performance of their standard loss counterparts, achieving gains of up to 6.01 percentage points in F1, 8.06 in F2, and 3.11 in mean average precision, all without additional parameters or computational complexity. Code available at: this https URL
zh
[CV-62] You Only Estimate Once: Unified One-stage Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping ICRA2025
【速读】:该论文旨在解决关节物体在机器人操作任务中的类别级位姿估计问题(category-level pose estimation for articulated objects)。现有方法通常采用复杂的多阶段流水线,首先对点云中的部件实例进行分割,再估计归一化部件坐标空间(Normalized Part Coordinate Space, NPCS)表示的6D位姿,但这些方法存在计算成本高且实时性差的问题。论文提出的解决方案关键在于YOEO,这是一种单阶段方法,能够以端到端的方式同时输出实例分割和NPCS表示。通过统一网络生成点级语义标签和质心偏移,使同一部件实例的点投票至同一质心,并利用聚类算法根据估计的质心距离区分点,最终分离并对齐各实例的NPCS区域以恢复最终位姿和尺寸。
链接: https://arxiv.org/abs/2506.05719
作者: Jingshun Huang,Haitao Lin,Tianyu Wang,Yanwei Fu,Yu-Gang Jiang,Xiangyang Xue
机构: Fudan University (复旦大学); Tencent Robotics X Lab (腾讯机器人实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: To appear in ICRA 2025
点击查看摘要
Abstract:This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi-stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. These approaches suffer from high computational costs and low performance in real-time robotic tasks. To address these limitations, we propose YOEO, a single-stage method that simultaneously outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We further utilize a clustering algorithm to distinguish points based on their estimated centroid distances. Finally, we first separate the NPCS region of each instance. Then, we align the separated regions with the real point cloud to recover the final pose and size. Experimental results on the GAPart dataset demonstrate the pose estimation capabilities of our proposed single-shot method. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz, enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method.
zh
[CV-63] oken Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
【速读】:该论文旨在解决视觉变换器(Vision Transformer)在计算成本上的问题,特别是通过动态压缩token来降低计算负担。现有方法主要通过token剪枝或合并来减少token数量,但这种单一的压缩方式导致信息丢失严重,必须依赖后训练来恢复性能。论文的关键在于重新思考token缩减过程,并将其统一为显式的token矩阵变换形式,所有现有方法均可视为该框架下的特殊矩阵构造形式。进一步地,提出了一种多对多的Token Transforming框架,作为现有方法的泛化形式,能够保留最多的信息,甚至实现无需训练的加速。
链接: https://arxiv.org/abs/2506.05709
作者: Fanhu Zeng,Deli Yu,Zhenglun Kong,Hao Tang
机构: Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); Harvard University (哈佛大学); School of Computer Science, Peking University (计算机学院,北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision transformers have been widely explored in various vision tasks. Due to heavy computational cost, much interest has aroused for compressing vision transformer dynamically in the aspect of tokens. Current methods mainly pay attention to token pruning or merging to reduce token numbers, in which tokens are compressed exclusively, causing great information loss and therefore post-training is inevitably required to recover the performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods are constructing special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods and reserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce 40% FLOPs and accelerate DeiT-S by \times 1.5 with marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction and inference acceleration.
zh
[CV-64] MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory
【速读】:该论文试图解决当前视觉-语言模型在跨模态语义理解中缺乏对内容道德维度的解释与推理能力的问题(moral dimensions of content),这是人类认知中的关键方面。解决方案的关键在于引入MoralCLIP,这是一种基于道德基础理论(Moral Foundations Theory, MFT)的显式道德基础嵌入表示方法,通过将视觉和文本中的道德线索整合到统一的嵌入空间中,实现跨模态的道德对齐。
链接: https://arxiv.org/abs/2506.05696
作者: Ana Carolina Condez,Diogo Tavares,João Magalhães
机构: NOVA School of Science and Technology (NOVA 学校 of Science and Technology); NOVA LINCS (NOVA LINCS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content-a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.
zh
[CV-65] Pts3D-LLM : Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models
【速读】:该论文旨在解决如何有效表示三维场景以提升多模态大语言模型(Multimodal Large Language Models, MLLMs)性能的问题。现有方法通常仅依赖二维图像特征并采用不同的分词策略,难以全面捕捉三维信息。该研究的关键解决方案是系统性地比较基于视频和基于点云的3D token结构,并提出一种新方法,通过融合来自Sonata预训练Point Transformer V3编码器的3D点云特征来丰富视觉token。实验表明,显式融合3D特征显著提升了模型性能,且在点云被巧妙采样和排序的情况下,基于点的token结构可与基于视频的结构相媲美。
链接: https://arxiv.org/abs/2506.05689
作者: Hugues Thomas,Chen Chen,Jian Zhang
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Main paper and appendix
点击查看摘要
Abstract:Effectively representing 3D scenes for Multimodal Large Language Models (MLLMs) is crucial yet challenging. Existing approaches commonly only rely on 2D image features and use varied tokenization approaches. This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations while maintaining consistent model backbones and parameters. We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata pretrained Point Transformer V3 encoder. Our experiments demonstrate that merging explicit 3D features significantly boosts performance. Furthermore, we show that point-based token structures can rival video-based ones when the points are cleverly sampled and ordered. Our best models from both structures achieve state-of-the-art results on multiple 3D understanding benchmarks. We emphasize our analysis of token structures as a key contribution, alongside transparent reporting of results averaged over multiple seeds, a practice we believe is vital for robust progress in the field.
zh
[CV-66] Integer Binary-Range Alignment Neuron for Spiking Neural Networks
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在图像分类和目标检测等任务中性能落后于人工神经网络(Artificial Neural Networks, ANNs)的问题,主要原因是SNN的表征能力有限。其解决方案的关键在于提出一种新型脉冲神经元——整数二进制范围对齐泄漏积分-发放神经元(Integer Binary-Range Alignment Leaky Integrate-and-Fire),通过整数二进制泄漏积分-发放机制和范围对齐策略,以极小的能耗增加显著扩展了脉冲神经元的信息表达能力。
链接: https://arxiv.org/abs/2506.05679
作者: Binghao Ye,Wenjuan Li,Dong Wang,Man Yao,Bing Li,Weiming Hu,Dong Liang,Kun Shang
机构: Shenzhen Institutes of Advanced Technology, CAS(中国科学院深圳先进技术研究院); State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA(中国科学院自动化研究所多模态人工智能系统重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); PeopleAI Inc. Beijing, China(北京人智科技有限公司); School of Information Science and Technology, ShanghaiTech University(上海科技大学信息科学与技术学院); University of Chinese Academy of Sciences(中国科学院大学)
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
点击查看摘要
Abstract:Spiking Neural Networks (SNNs) are noted for their brain-like computation and energy efficiency, but their performance lags behind Artificial Neural Networks (ANNs) in tasks like image classification and object detection due to the limited representational capacity. To address this, we propose a novel spiking neuron, Integer Binary-Range Alignment Leaky Integrate-and-Fire to exponentially expand the information expression capacity of spiking neurons with only a slight energy increase. This is achieved through Integer Binary Leaky Integrate-and-Fire and range alignment strategy. The Integer Binary Leaky Integrate-and-Fire allows integer value activation during training and maintains spike-driven dynamics with binary conversion expands virtual timesteps during inference. The range alignment strategy is designed to solve the spike activation limitation problem where neurons fail to activate high integer values. Experiments show our method outperforms previous SNNs, achieving 74.19% accuracy on ImageNet and 66.2% mAP@50 and 49.1% mAP@50:95 on COCO, surpassing previous bests with the same architecture by +3.45% and +1.6% and +1.8%, respectively. Notably, our SNNs match or exceed ANNs’ performance with the same architecture, and the energy efficiency is improved by 6.3 \times .
zh
[CV-67] Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds Annotated Imagery
【速读】:该论文试图解决现代人工智能模型,特别是基于扩散的模型在计算机视觉和图像生成任务中的开发方法论正在经历范式转变的问题。传统上,性能提升主要依赖于更复杂的模型架构和超参数优化,而本文提出了一种更加细致的“数据驱动”方法,强调训练数据的质量、结构和相关性作为模型性能的主要驱动力。解决方案的关键在于引入了一个名为DSD的数据集,该数据集由约10,610张高质量的人类同行评分摄影图像及丰富的多层级标注组成,旨在为商业图像数据集树立新标准,并为稳健的商业和多模态AI开发提供可扩展的基础。
链接: https://arxiv.org/abs/2506.05673
作者: Sajjad Abdoli,Freeman Lewin,Gediminas Vasiliauskas,Fabian Schonholz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 12 figures
点击查看摘要
Abstract:The development of modern Artificial Intelligence (AI) models, particularly diffusion-based models employed in computer vision and image generation tasks, is undergoing a paradigmatic shift in development methodologies. Traditionally dominated by a “Model Centric” approach, in which performance gains were primarily pursued through increasingly complex model architectures and hyperparameter optimization, the field is now recognizing a more nuanced “Data-Centric” approach. This emergent framework foregrounds the quality, structure, and relevance of training data as the principal driver of model performance. To operationalize this paradigm shift, we introduce the this http URL sample dataset (the “DSD”), initially comprised of approximately 10,610 high-quality human peer-ranked photography images accompanied by extensive multi-tier annotations. The DSD is a foundational computer vision dataset designed to usher in a new standard for commercial image datasets. Representing a small fraction of this http URL’s 100 million-plus image catalog, the DSD provides a scalable foundation necessary for robust commercial and multimodal AI development. Through this in-depth exploratory analysis, we document the quantitative improvements generated by the DSD on specific models against known benchmarks and make the code and the trained models used in our evaluation publicly available.
zh
[CV-68] DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models
【速读】:该论文旨在解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型评估基准在场景多样性、可靠的动作级标注以及与人类偏好对齐的评估协议方面的不足。其解决方案的关键在于提出DriveAction,这是首个专为VLA模型设计的动作驱动型基准,包含从2,610个驾驶场景生成的16,185对问答数据,利用生产级自动驾驶车辆用户主动收集的真实驾驶数据确保场景覆盖的广泛性和代表性,并提供直接来自用户实际驾驶操作的高层离散动作标签,同时构建以动作为核心的树状结构评估框架,明确关联视觉、语言和动作任务,支持全面和任务特定的评估。
链接: https://arxiv.org/abs/2506.05667
作者: Yuhan Hao,Zhengning Li,Lei Sun,Weilong Wang,Naixin Yi,Sheng Song,Caihong Qin,Mofan Zhou,Yifei Zhan,Peng Jia,Xianpeng Lang
机构: Li Auto Inc. (李想汽车公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Benchmark: this https URL
点击查看摘要
Abstract:Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users’ actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.
zh
[CV-69] ssUnet: Improved Extracranial Tissue and Cranium Segmentation for Children through Adulthood
【速读】:该论文旨在解决脑部磁共振成像(MRI)中颅外组织(如颅骨、皮下脂肪和肌肉)难以量化的问题,这些问题在健康状况表征和临床决策中具有重要意义,但目前缺乏广泛验证的工具,尤其是在发育中的大脑或存在潜在病理的情况下。论文提出的解决方案是TissUnet,这是一种基于深度学习的模型,能够从常规三维T1加权MRI(含或不含对比增强)中分割颅外组织。其关键在于利用155对MRI-CT扫描进行训练,并在多个涵盖广泛年龄范围及脑肿瘤患者的数据库中进行验证,从而实现了高精度、可重复的分割效果。
链接: https://arxiv.org/abs/2506.05660
作者: Markian Mandzak,Elvira Yang,Anna Zapaishchykova,Yu-Hui Chen,Lucas Heilbroner,John Zielke,Divyanshu Tak,Reza Mojahed-Yazdi,Francesca Romana Mussa,Zezhong Ye,Sridhar Vajapeyam,Viviana Benitez,Ralph Salloum,Susan N. Chi,Houman Sotoudeh,Jakob Seidlitz,Sabine Mueller,Hugo J.W.L. Aerts,Tina Y. Poussaint,Benjamin H. Kann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 44 pages, 4 tables, 6 figures, supplementary material
点击查看摘要
Abstract:Extracranial tissues visible on brain magnetic resonance imaging (MRI) may hold significant value for characterizing health conditions and clinical decision-making, yet they are rarely quantified. Current tools have not been widely validated, particularly in settings of developing brains or underlying pathology. We present TissUnet, a deep learning model that segments skull bone, subcutaneous fat, and muscle from routine three-dimensional T1-weighted MRI, with or without contrast enhancement. The model was trained on 155 paired MRI-computed tomography (CT) scans and validated across nine datasets covering a wide age range and including individuals with brain tumors. In comparison to AI-CT-derived labels from 37 MRI-CT pairs, TissUnet achieved a median Dice coefficient of 0.79 [IQR: 0.77-0.81] in a healthy adult cohort. In a second validation using expert manual annotations, median Dice was 0.83 [IQR: 0.83-0.84] in healthy individuals and 0.81 [IQR: 0.78-0.83] in tumor cases, outperforming previous state-of-the-art method. Acceptability testing resulted in an 89% acceptance rate after adjudication by a tie-breaker(N=108 MRIs), and TissUnet demonstrated excellent performance in the blinded comparative review (N=45 MRIs), including both healthy and tumor cases in pediatric populations. TissUnet enables fast, accurate, and reproducible segmentation of extracranial tissues, supporting large-scale studies on craniofacial morphology, treatment effects, and cardiometabolic risk using standard brain T1w MRI.
zh
[CV-70] Aerial Multi-View Stereo via Adaptive Depth Range Inference and Normal Cues
【速读】:该论文旨在解决从多视角航拍图像进行三维数字城市重建时,现有深度多视图立体(MVS)方法在处理航拍与近景设置差异时的不足,例如沿极线的深度范围变化以及低细节航拍图像导致的特征匹配不敏感问题。其解决方案的关键在于提出一种自适应深度范围MVS(ADR-MVS),其中核心组件是深度范围预测器,它通过交叉注意力差异学习从深度和法线估计中生成自适应范围图,从而提升多视角深度估计的准确性。
链接: https://arxiv.org/abs/2506.05655
作者: Yimei Liu,Yakun Ju,Yuan Rao,Hao Fan,Junyu Dong,Feng Gao,Qian Du
机构: Ocean University of China (中国海洋大学); University of Leicester (莱斯特大学); Mississippi State University (密西西比州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: IEEE TGRS 2025
点击查看摘要
Abstract:Three-dimensional digital urban reconstruction from multi-view aerial images is a critical application where deep multi-view stereo (MVS) methods outperform traditional techniques. However, existing methods commonly overlook the key differences between aerial and close-range settings, such as varying depth ranges along epipolar lines and insensitive feature-matching associated with low-detailed aerial images. To address these issues, we propose an Adaptive Depth Range MVS (ADR-MVS), which integrates monocular geometric cues to improve multi-view depth estimation accuracy. The key component of ADR-MVS is the depth range predictor, which generates adaptive range maps from depth and normal estimates using cross-attention discrepancy learning. In the first stage, the range map derived from monocular cues breaks through predefined depth boundaries, improving feature-matching discriminability and mitigating convergence to local optima. In later stages, the inferred range maps are progressively narrowed, ultimately aligning with the cascaded MVS framework for precise depth regression. Moreover, a normal-guided cost aggregation operation is specially devised for aerial stereo images to improve geometric awareness within the cost volume. Finally, we introduce a normal-guided depth refinement module that surpasses existing RGB-guided techniques. Experimental results demonstrate that ADR-MVS achieves state-of-the-art performance on the WHU, LuoJia-MVS, and München datasets, while exhibits superior computational complexity.
zh
[CV-71] Hallucinate Ground Repeat: A Framework for Generalized Visual Relationship Detection
【速读】:该论文试图解决视觉关系检测(Visual Relationship Detection, VRD)模型依赖固定谓词集导致的泛化能力受限问题,特别是在处理未标注但语义合理的新型交互关系时表现不佳。其解决方案的关键在于引入一种迭代的视觉接地框架,利用大语言模型(Large Language Models, LLMs)作为结构化的关系先验,通过期望-最大化(Expectation-Maximization, EM)机制交替生成场景图并训练视觉模型以对齐假设与感知证据,从而提升模型对未见谓词的泛化能力。
链接: https://arxiv.org/abs/2506.05651
作者: Shanmukha Vellamcheti,Sanjoy Kundu,Sathyanarayanan N. Aakur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 9 figures, 5 tables
点击查看摘要
Abstract:Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
zh
[CV-72] Learning to Weight Parameters for Data Attribution
【速读】:该论文试图解决生成模型中数据归属(data attribution)的问题,即确定哪些训练样本对给定输出的影响最大。现有方法通过反向传播梯度来追踪训练数据,但通常将所有网络参数视为同等重要,忽略了不同层编码不同类型信息的事实。该论文提出的解决方案的关键在于学习针对归属任务的参数重要性权重,无需依赖标注数据,从而使得归属过程能够适应模型结构,捕捉到训练样本对输出特定语义方面(如主体、风格或背景)的贡献。
链接: https://arxiv.org/abs/2506.05647
作者: Shuangqi Li,Hieu Le,Jingyi Xu,Mathieu Salzmann
机构: EPFL(瑞士联邦理工学院); Stony Brook University(纽约州立大学石溪分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We study data attribution in generative models, aiming to identify which training examples most influence a given output. Existing methods achieve this by tracing gradients back to training data. However, they typically treat all network parameters uniformly, ignoring the fact that different layers encode different types of information and may thus draw information differently from the training set. We propose a method that models this by learning parameter importance weights tailored for attribution, without requiring labeled data. This allows the attribution process to adapt to the structure of the model, capturing which training examples contribute to specific semantic aspects of an output, such as subject, style, or background. Our method improves attribution accuracy across diffusion models and enables fine-grained insights into how outputs borrow from training data.
zh
[CV-73] Controlled Data Rebalancing in Multi-Task Learning for Real-World Image Super-Resolution
【速读】:该论文旨在解决真实世界图像超分辨率(Real-SR)中由于低分辨率图像中复杂的退化模式带来的挑战,特别是如何在固定退化空间内实现SR网络对不同退化模式的最优处理平衡。其解决方案的关键在于提出一种改进的范式,将Real-SR建模为数据异质的多任务学习问题,并通过任务定义、任务不平衡量化以及自适应数据重平衡的协同改进来解决任务不平衡问题。具体而言,引入了一种新的任务定义框架,通过设定退化算子的参数特定边界来分割退化空间,有效减少任务数量同时保持任务区分性;开发了一种基于焦点损失的多任务加权机制,精确量化训练过程中的任务不平衡动态;并通过有意识地调控任务特定的训练量,将量化后的任务不平衡转化为受控的数据重平衡,以防止异常样本主导共享多任务SR模型的梯度优化。
链接: https://arxiv.org/abs/2506.05607
作者: Shuchen Lin,Mingtao Feng,Weisheng Dong,Fangfang Wu,Jianqiao Luo,Yaonan Wang,Guangming Shi
机构: Xidian University (西安电子科技大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Real-world image super-resolution (Real-SR) is a challenging problem due to the complex degradation patterns in low-resolution images. Unlike approaches that assume a broadly encompassing degradation space, we focus specifically on achieving an optimal balance in how SR networks handle different degradation patterns within a fixed degradation space. We propose an improved paradigm that frames Real-SR as a data-heterogeneous multi-task learning problem, our work addresses task imbalance in the paradigm through coordinated advancements in task definition, imbalance quantification, and adaptive data rebalancing. Specifically, we introduce a novel task definition framework that segments the degradation space by setting parameter-specific boundaries for degradation operators, effectively reducing the task quantity while maintaining task discrimination. We then develop a focal loss based multi-task weighting mechanism that precisely quantifies task imbalance dynamics during model training. Furthermore, to prevent sporadic outlier samples from dominating the gradient optimization of the shared multi-task SR model, we strategically convert the quantified task imbalance into controlled data rebalancing through deliberate regulation of task-specific training volumes. Extensive quantitative and qualitative experiments demonstrate that our method achieves consistent superiority across all degradation tasks.
zh
[CV-74] UniRes: Universal Image Restoration for Complex Degradations
【速读】:该论文旨在解决真实世界图像恢复中由于多种退化因素(如不同的捕获条件、设备和后期处理流程)导致的复杂退化问题,尤其是针对任意混合的已知退化类型。现有方法通过模拟退化并利用图像生成先验进行改进,但对真实场景数据的泛化能力仍然不足。论文提出的解决方案关键在于一种名为UniRes的简单而灵活的基于扩散的框架,该框架在扩散采样步骤中结合多个专用模型,从而将多个隔离的修复任务的知识迁移至复杂真实退化场景的恢复中,仅需针对几种退化类型的隔离训练数据即可实现有效恢复。
链接: https://arxiv.org/abs/2506.05599
作者: Mo Zhou,Keren Ye,Mauricio Delbracio,Peyman Milanfar,Vishal M. Patel,Hossein Talebi
机构: Google(谷歌); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors, however generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which is frequently seen in the wild. A simple yet flexible diffusionbased framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gain especially for images with complex degradations.
zh
[CV-75] PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
【速读】:该论文旨在解决从单张RGB图像中生成结构化、语义明确且几何上区分的多个3D网格(3D mesh)的问题,传统方法要么生成单一整体的3D形状,要么采用两阶段流程,即先分割图像再重建每个部分。其解决方案的关键在于提出了一种统一的、组合式的生成架构——PartCrafter,该架构不依赖预分割输入,而是通过联合去噪多个3D部分,实现端到端的部件感知生成。PartCrafter基于预训练的3D网格扩散变换器(DiT),引入了两个关键创新:一是组合潜在空间,其中每个3D部分由一组解耦的潜在标记表示;二是分层注意力机制,以确保在生成过程中保持局部细节的同时实现全局一致性。
链接: https://arxiv.org/abs/2506.05573
作者: Yuchen Lin,Chenguo Lin,Panwang Pan,Honglei Yan,Yiqiang Feng,Yadong Mu,Katerina Fragkiadaki
机构: Peking University (北京大学); ByteDance (字节跳动); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.
zh
[CV-76] VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction CVPR2025
【速读】:该论文旨在解决基于相机的占用预测中3D语义与场景流同时预测的问题,该任务面临遮挡和动态环境不平衡等挑战。其解决方案的关键在于提出一种名为VoxelSplat的新型正则化框架,该框架通过两种核心机制提升模型性能:一是通过2D投影增强语义监督,即从3D表示中解码稀疏语义3D高斯分布并投影到2D相机视图,以利用2D标签提升3D语义学习;二是利用预测的场景流建模高斯分布的运动,从而在相邻帧标签的自监督下学习移动物体的场景流。
链接: https://arxiv.org/abs/2506.05563
作者: Ziyue Zhu,Shenlong Wang,Jin Xie,Jiang-jiang Liu,Jingdong Wang,Jian Yang
机构: Nankai University (南开大学); UIUC (伊利诺伊大学厄巴纳-香槟分校); Nanjing University (南京大学); Baidu (百度)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 Project Page: this https URL
点击查看摘要
Abstract:Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. To address them, we propose a novel regularization framework called VoxelSplat. This framework leverages recent developments in 3D Gaussian Splatting to enhance model performance in two key ways: (i) Enhanced Semantics Supervision through 2D Projection: During training, our method decodes sparse semantic 3D Gaussians from 3D representations and projects them onto the 2D camera view. This provides additional supervision signals in the camera-visible space, allowing 2D labels to improve the learning of 3D semantics. (ii) Scene Flow Learning: Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner using the labels of adjacent frames. Our method can be seamlessly integrated into various existing occupancy models, enhancing performance without increasing inference time. Extensive experiments on benchmark datasets demonstrate the effectiveness of VoxelSplat in improving the accuracy of both semantic occupancy and scene flow estimation. The project page and codes are available at this https URL.
zh
[CV-77] On-the-fly Reconstruction for Large-Scale Novel View Synthesis from Unposed Images
【速读】:该论文旨在解决基于3D Gaussian Splatting (3DGS)的辐射场方法在从照片重建时,姿态估计与优化过程计算耗时较长的问题,尤其是在处理大规模场景和宽基线图像序列时的效率瓶颈。其解决方案的关键在于提出一种实时生成相机姿态和训练好的3DGS的方法,通过快速初始姿态估计、基于学习特征和GPU友好的小规模捆绑调整,以及直接采样高斯基元位置和形状的增量式生成策略,实现姿态与高斯基元的高效联合优化。此外,通过可扩展的辐射场构建、逐步聚类3DGS基元并存储于锚点,以及将部分基元卸载至GPU外,该方法能够有效处理大规模场景。
链接: https://arxiv.org/abs/2506.05558
作者: Andreas Meuleman,Ishaan Shah,Alexandre Lanvin,Bernhard Kerbl,George Drettakis
机构: Inria, Université Côte d’Azur (Inria,科特迪亚尔大学); TU Wien (维也纳科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Radiance field methods such as 3D Gaussian Splatting (3DGS) allow easy reconstruction from photos, enabling free-viewpoint navigation. Nonetheless, pose estimation using Structure from Motion and 3DGS optimization can still each take between minutes and hours of computation after capture is complete. SLAM methods combined with 3DGS are fast but struggle with wide camera baselines and large scenes. We present an on-the-fly method to produce camera poses and a trained 3DGS immediately after capture. Our method can handle dense and wide-baseline captures of ordered photo sequences and large-scale scenes. To do this, we first introduce fast initial pose estimation, exploiting learned features and a GPU-friendly mini bundle adjustment. We then introduce direct sampling of Gaussian primitive positions and shapes, incrementally spawning primitives where required, significantly accelerating training. These two efficient steps allow fast and robust joint optimization of poses and Gaussian primitives. Our incremental approach handles large-scale scenes by introducing scalable radiance field construction, progressively clustering 3DGS primitives, storing them in anchors, and offloading them from the GPU. Clustered primitives are progressively merged, keeping the required scale of 3DGS at any viewpoint. We evaluate our solution on a variety of datasets and show that our solution can provide on-the-fly processing of all the capture scenarios and scene sizes we target while remaining competitive with other methods that only handle specific capture styles or scene sizes in speed, image quality, or both.
zh
[CV-78] EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh
【速读】:该论文旨在解决从单目输入生成高质量、可控制相机视角的视频这一挑战性问题,尤其是在极端视角下存在的几何不一致性和边界遮挡伪影问题。其解决方案的关键在于提出了一种基于Depth Watertight Mesh(深度密闭网格)的表示方法,该方法通过显式建模可见和遮挡区域,作为稳健的几何先验,从而确保极端相机姿态下的几何一致性。此外,还引入了基于LoRA的轻量级视频扩散适配器,以生成物理一致且时间连贯的高质量视频。
链接: https://arxiv.org/abs/2506.05554
作者: Tao Hu,Haoyang Peng,Xiao Liu,Yuewen Ma
机构: Bytedance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generating high-quality camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoint. Existing methods often struggle with geometric inconsistencies and occlusion artifacts in boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency in extreme camera pose. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data only from monocular videos. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.
zh
[CV-79] When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在处理视觉模糊或非语义场景文本时产生的语义幻觉(semantic hallucination)问题,即模型生成看似合理但与实际视觉内容不符的答案。解决方案的关键在于发现具有更强注意力机制的Transformer层对场景文本区域的关注度更高,因此更不容易产生语义幻觉。基于此发现,论文提出了一种无需训练的语义幻觉缓解框架,包含两个核心组件:ZoomText和Grounded Layer Correction,分别用于粗到细地识别潜在文本区域,并利用内部表示进行解码修正,从而有效减少幻觉输出并保留有意义的语义。
链接: https://arxiv.org/abs/2506.05551
作者: Yan Shu,Hangui Lin,Yexin Liu,Yan Zhang,Gangyan Zeng,Yan Li,Yu Zhou,Ser-Nam Lim,Harry Yang,Nicu Sebe
机构: UNITN(UNITN); HKUST(HKUST); UIR(UIR); IIE, CAS(IIE, CAS); UCAS(UCAS); NJUST(NJUST); NKU(NKU); UCF(UCF)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
zh
[CV-80] Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos CVPR25
【速读】:该论文试图解决在动态场景中利用3D技术提升2D视觉分析效果的问题,尤其是在处理具有显著相机运动的自指视频时,传统3D技术因场景动态性而表现不佳。解决方案的关键在于通过将基于2D的运动分割预测融合到分层辐射场(Layered Motion Fusion)中,并结合测试时的细化过程,以降低动态长视频的数据复杂性,从而有效提升3D模型的分割性能。
链接: https://arxiv.org/abs/2506.05546
作者: Vadim Tschernezki,Diane Larlus,Andrea Vedaldi,Iro Laina
机构: University of Oxford (牛津大学); Naver Labs Europe (NAVER 实验室欧洲)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Camera-ready for CVPR25
点击查看摘要
Abstract:Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (Layered Motion Fusion). However, the high complexity of long, dynamic videos makes it challenging to capture the underlying geometric structure, and, as a result, hinders the fusion of motion cues into the (incomplete) scene geometry. We address this issue through test-time refinement, which helps the model to focus on specific frames, thereby reducing the data complexity. This results in a synergy between motion fusion and the refinement, and in turn leads to segmentation predictions of the 3D model that surpass the 2D baseline by a large margin. This demonstrates that 3D techniques can enhance 2D analysis even for dynamic phenomena in a challenging and realistic setting.
zh
[CV-81] FRAME: Pre-Training Video Feature Representations via Anticipation and Memory
【速读】:该论文旨在解决密集视频预测任务中视频编码器在时间一致性与空间稠密特征生成方面的不足。现有方法存在局限:如DINO或CLIP等图像编码器缺乏时间感知能力,而VideoMAE等视频模型在密集预测任务中的表现不如图像编码器。该论文提出的解决方案是FRAME,一种针对密集视频理解的自监督视频帧编码器。其关键在于通过从过去和当前的RGB帧中预测当前和未来DINO块特征,从而获得空间精确且时间连贯的表示,使FRAME成为首个在密集预测任务中利用基于图像的模型并超越其性能的视频编码器。
链接: https://arxiv.org/abs/2506.05543
作者: Sethuraman TV,Savya Khosla,Vignesh Srinivasakumar,Jiahui Huang,Seoung Wug Oh,Simon Jenni,Derek Hoiem,Joon-Young Lee
机构: Adobe Research(Adobe 研究院); University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP’s semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
zh
[CV-82] Personalized Interpretability – Interactive Alignment of Prototypical Parts Networks
【速读】:该论文试图解决基于概念的可解释神经网络中因概念不一致导致的解释不可理解问题,以及现有方法缺乏用户反馈机制的问题。其关键解决方案是提出YoursProtoP,一种交互式策略,通过引入用户监督,使模型能够根据用户需求个性化调整原型部分(prototypical parts),从而在保持模型准确性的同时实现概念一致性。
链接: https://arxiv.org/abs/2506.05533
作者: Tomasz Michalski,Adam Wróbel,Andrea Bontempelli,Jakub Luśtyk,Mikolaj Kniejski,Stefano Teso,Andrea Passerini,Bartosz Zieliński,Dawid Rymarczyk
机构: Jagiellonian University (雅盖隆大学); University of Trento (特伦托大学); Transmission Dynamics; University of Warsaw (华沙大学); Ardigen SA
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 20 pages, 11 figures
点击查看摘要
Abstract:Concept-based interpretable neural networks have gained significant attention due to their intuitive and easy-to-understand explanations based on case-based reasoning, such as “this bird looks like those sparrows”. However, a major limitation is that these explanations may not always be comprehensible to users due to concept inconsistency, where multiple visual features are inappropriately mixed (e.g., a bird’s head and wings treated as a single concept). This inconsistency breaks the alignment between model reasoning and human understanding. Furthermore, users have specific preferences for how concepts should look, yet current approaches provide no mechanism for incorporating their feedback. To address these issues, we introduce YoursProtoP, a novel interactive strategy that enables the personalization of prototypical parts - the visual concepts used by the model - according to user needs. By incorporating user supervision, YoursProtoP adapts and splits concepts used for both prediction and explanation to better match the user’s preferences and understanding. Through experiments on both the synthetic FunnyBirds dataset and a real-world scenario using the CUB, CARS, and PETS datasets in a comprehensive user study, we demonstrate the effectiveness of YoursProtoP in achieving concept consistency without compromising the accuracy of the model.
zh
[CV-83] FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL
【速读】:该论文试图解决现有文本到图像生成模型在细粒度语义对齐上的不足,即模型难以准确捕捉相似语法但不同细粒度语义的文本与图像之间的差异,从而无法实现对视觉令牌的精确控制。解决方案的关键在于提出FocusDiff方法,通过关注相似文本-图像对之间的细微差异来增强细粒度语义对齐,并构建了一个新的数据集以及引入一种新颖的强化学习算法以强调这些细粒度语义差异,从而提升图像生成的质量和准确性。
链接: https://arxiv.org/abs/2506.05501
作者: Kaihang Pan,Wendong Bu,Yuruo Wu,Yang Wu,Kai Shen,Yunfei Li,Hang Zhao,Juncheng Li,Siliang Tang,Yueting Zhuang
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures. Project Page: this https URL
点击查看摘要
Abstract:Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark – featuring test cases of paired prompts with similar syntax but different fine-grained semantics – reveals that existing models struggle with fine-grained text-image alignment thus failing to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, further introducing a novel reinforcement learning algorithm to emphasize such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.
zh
[CV-84] F2T2-HiT: A U-Shaped FFT Transformer and Hierarchical Transformer for Reflection Removal
【速读】:该论文旨在解决单图像反射去除(Single Image Reflection Removal, SIRR)问题,该问题在图像处理中具有重要意义,主要针对通过玻璃表面拍摄时产生的不良反射进行有效消除。现有方法难以应对真实场景中反射在强度、形状、光源、大小和覆盖区域等方面的显著差异。论文提出的解决方案关键在于引入一种基于Transformer的U-shaped Fast Fourier Transform Transformer and Hierarchical Transformer (F2T2-HiT)架构,该架构结合了快速傅里叶变换(FFT)Transformer块与分层Transformer块,在UNet框架下实现全局频域信息捕捉与多尺度特征提取,从而有效分离和处理不同特性的反射。
链接: https://arxiv.org/abs/2506.05489
作者: Jie Cai,Kangning Yang,Ling Ouyang,Lan Fu,Jiaming Ding,Huiming Sun,Chiu Man Ho,Zibo Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Single Image Reflection Removal (SIRR) technique plays a crucial role in image processing by eliminating unwanted reflections from the background. These reflections, often caused by photographs taken through glass surfaces, can significantly degrade image quality. SIRR remains a challenging problem due to the complex and varied reflections encountered in real-world scenarios. These reflections vary significantly in intensity, shapes, light sources, sizes, and coverage areas across the image, posing challenges for most existing methods to effectively handle all cases. To address these challenges, this paper introduces a U-shaped Fast Fourier Transform Transformer and Hierarchical Transformer (F2T2-HiT) architecture, an innovative Transformer-based design for SIRR. Our approach uniquely combines Fast Fourier Transform (FFT) Transformer blocks and Hierarchical Transformer blocks within a UNet framework. The FFT Transformer blocks leverage the global frequency domain information to effectively capture and separate reflection patterns, while the Hierarchical Transformer blocks utilize multi-scale feature extraction to handle reflections of varying sizes and complexities. Extensive experiments conducted on three publicly available testing datasets demonstrate state-of-the-art performance, validating the effectiveness of our approach.
zh
[CV-85] Implicit Neural Representation for Video Restoration
【速读】:该论文旨在解决视频修复(Video Restoration, VR)中现有方法在处理超出训练分布的放大比例或退化类型时缺乏灵活性的问题。传统方法通常仅针对固定的放大因子进行训练,难以适应未知的超分辨率尺度或噪声情况。解决方案的关键在于提出VR-INR,一种基于隐式神经表示(Implicit Neural Representations, INRs)的新颖视频修复方法,该方法仅在单个放大因子(×4)下进行训练,但在测试时能够有效泛化到任意未见过的超分辨率尺度,并且无需训练数据中包含噪声即可实现零样本去噪。其核心技术包括分层时空纹理编码框架与多分辨率隐式哈希编码,从而实现从低分辨率输入中自适应解码出高分辨率且噪声抑制的帧。
链接: https://arxiv.org/abs/2506.05488
作者: Mary Aiyetigbo,Wanqi Yuan,Feng Luo,Nianyi Li
机构: Clemson University, School of Computing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:High-resolution (HR) videos play a crucial role in many computer vision applications. Although existing video restoration (VR) methods can significantly enhance video quality by exploiting temporal information across video frames, they are typically trained for fixed upscaling factors and lack the flexibility to handle scales or degradations beyond their training distribution. In this paper, we introduce VR-INR, a novel video restoration approach based on Implicit Neural Representations (INRs) that is trained only on a single upscaling factor ( \times 4 ) but generalizes effectively to arbitrary, unseen super-resolution scales at test time. Notably, VR-INR also performs zero-shot denoising on noisy input, despite never having seen noisy data during training. Our method employs a hierarchical spatial-temporal-texture encoding framework coupled with multi-resolution implicit hash encoding, enabling adaptive decoding of high-resolution and noise-suppressed frames from low-resolution inputs at any desired magnification. Experimental results show that VR-INR consistently maintains high-quality reconstructions at unseen scales and noise during training, significantly outperforming state-of-the-art approaches in sharpness, detail preservation, and denoising efficacy.
zh
[CV-86] A Neural Network Model of Spatial and Feature-Based Attention
【速读】:该论文试图解决如何在计算机视觉中模拟人类视觉注意机制的问题,以更好地理解人类认知过程。其解决方案的关键在于设计一个受人类视觉注意机制启发的神经网络模型,该模型包含两个网络:一个作为基础处理器执行简单任务,另一个处理上下文信息并通过注意力机制引导第一个网络,从而适应更复杂的任务。这种结构使得模型能够学习到与人类视觉注意相似的空间和特征注意力模式。
链接: https://arxiv.org/abs/2506.05487
作者: Ruoyang Hu,Robert A. Jacobs
机构: University of Rochester (罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注: 6 pages, 9 figures
点击查看摘要
Abstract:Visual attention is a mechanism closely intertwined with vision and memory. Top-down information influences visual processing through attention. We designed a neural network model inspired by aspects of human visual attention. This model consists of two networks: one serves as a basic processor performing a simple task, while the other processes contextual information and guides the first network through attention to adapt to more complex tasks. After training the model and visualizing the learned attention response, we discovered that the model’s emergent attention patterns corresponded to spatial and feature-based attention. This similarity between human visual attention and attention in computer vision suggests a promising direction for studying human cognition using neural network models.
zh
[CV-87] OpenRR-5k: A Large-Scale Benchmark for Reflection Removal in the Wild
【速读】:该论文旨在解决单图像反射去除(Single Image Reflection Removal, SIRR)问题,该问题是计算机视觉中的关键任务,广泛应用于摄影和图像增强领域。现有方法受限于缺乏大规模、高质量且多样化的数据集,为此,本文提出了一种新的基准数据集,包含5,300对像素级对齐的高质量图像对,其中5,000对用于训练,300对用于验证,并包含100张无真实标签的现实测试图像以评估实际性能。数据集覆盖了多种光照条件、物体类型和反射模式,确保了广泛的适用性。解决方案的关键在于构建一个大规模、精确对齐且涵盖多种场景的数据集,从而为SIRR方法提供可靠的基础。
链接: https://arxiv.org/abs/2506.05482
作者: Jie Cai,Kangning Yang,Ling Ouyang,Lan Fu,Jiaming Ding,Jinglin Shen,Zibo Meng
机构: OPPO AI Center (OPPO人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Removing reflections is a crucial task in computer vision, with significant applications in photography and image enhancement. Nevertheless, existing methods are constrained by the absence of large-scale, high-quality, and diverse datasets. In this paper, we present a novel benchmark for Single Image Reflection Removal (SIRR). We have developed a large-scale dataset containing 5,300 high-quality, pixel-aligned image pairs, each consisting of a reflection image and its corresponding clean version. Specifically, the dataset is divided into two parts: 5,000 images are used for training, and 300 images are used for validation. Additionally, we have included 100 real-world testing images without ground truth (GT) to further evaluate the practical performance of reflection removal methods. All image pairs are precisely aligned at the pixel level to guarantee accurate supervision. The dataset encompasses a broad spectrum of real-world scenarios, featuring various lighting conditions, object types, and reflection patterns, and is segmented into training, validation, and test sets to facilitate thorough evaluation. To validate the usefulness of our dataset, we train a U-Net-based model and evaluate it using five widely-used metrics, including PSNR, SSIM, LPIPS, DISTS, and NIQE. We will release both the dataset and the code on this https URL to facilitate future research in this field.
zh
[CV-88] ODE-GS: Latent ODEs for Dynamic Scene Extrapolation with 3D Gaussian Splatting
【速读】:该论文试图解决动态三维场景在训练时间范围之外的长期预测问题,现有神经渲染系统(如基于NeRF或3DGS的方法)由于将时间直接嵌入变形网络,导致其在插值任务中表现优异,但在未来预测任务中因时间戳严格超出分布而失效。解决方案的关键在于将时间条件下的高保真变形模型冻结,并通过Transformer编码器将过去的高斯轨迹总结为一个由神经微分方程(neural ODE)驱动的潜在状态,从而实现连续时间演化,使得预测结果在物理上合理且可实时渲染。
链接: https://arxiv.org/abs/2506.05480
作者: Daniel Wang,Patrick Rim,Tian Tian,Alex Wong,Ganesh Sundaramoorthi
机构: Yale University (耶鲁大学); RISD (罗德岛设计学院); RTX (RTX)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present ODE-GS, a novel method that unifies 3D Gaussian Splatting with latent neural ordinary differential equations (ODEs) to forecast dynamic 3D scenes far beyond the time span seen during training. Existing neural rendering systems - whether NeRF- or 3DGS-based - embed time directly in a deformation network and therefore excel at interpolation but collapse when asked to predict the future, where timestamps are strictly out-of-distribution. ODE-GS eliminates this dependency: after learning a high-fidelity, time-conditioned deformation model for the training window, we freeze it and train a Transformer encoder that summarizes past Gaussian trajectories into a latent state whose continuous evolution is governed by a neural ODE. Numerical integration of this latent flow yields smooth, physically plausible Gaussian trajectories that can be queried at any future instant and rendered in real time. Coupled with a variational objective and a lightweight second-derivative regularizer, ODE-GS attains state-of-the-art extrapolation on D-NeRF and NVFI benchmarks, improving PSNR by up to 10 dB and halving perceptual error (LPIPS) relative to the strongest baselines. Our results demonstrate that continuous-time latent dynamics are a powerful, practical route to photorealistic prediction of complex 3D scenes.
zh
[CV-89] S2GO: Streaming Sparse Gaussian Occupancy Prediction
【速读】:该论文旨在解决3D场景占用预测中传统密集表示方法效率低下且难以捕捉动态变化的问题。现有方法多依赖体素或密集高斯表示,但这些方法在计算上较为缓慢,并且在处理驾驶场景的时间动态性方面缺乏灵活性。论文提出的解决方案的关键在于将场景总结为一组紧凑的3D查询(3D queries),这些查询以在线流式方式在时间上进行传播,并在每个时间步解码为语义高斯分布。通过结合去噪渲染目标,该框架能够有效引导查询及其构成的高斯分布捕捉场景几何信息,从而实现了高效的表示和优越的性能。
链接: https://arxiv.org/abs/2506.05473
作者: Jinhyung Park,Yihan Hu,Chensheng Peng,Wenzhao Zheng,Kris Kitani,Wei Zhan
机构: Applied Intuition(应用直觉); Carnegie Mellon University(卡内基梅隆大学); University of California, Berkeley(加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite the demonstrated efficiency and performance of sparse query-based representations for perception, state-of-the-art 3D occupancy prediction methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the scene into a compact set of 3D queries which are propagated through time in an online, streaming fashion. These queries are then decoded into semantic Gaussians at each timestep. We couple our framework with a denoising rendering objective to guide the queries and their constituent Gaussians in effectively capturing scene geometry. Owing to its efficient, query-based representation, S2GO achieves state-of-the-art performance on the nuScenes and KITTI occupancy benchmarks, outperforming prior art (e.g., GaussianWorld) by 1.5 IoU with 5.9x faster inference.
zh
[CV-90] owards Reliable Identification of Diffusion-based Image Manipulations
【速读】:该论文试图解决由扩散模型带来的图像篡改检测问题,特别是针对通过扩散模型进行的图像修复区域(inpainting)的可靠识别。解决方案的关键在于提出了一种名为RADAR的新方法,该方法基于现有的基础模型,并结合不同图像模态的特征,同时引入辅助对比损失以隔离被篡改的图像块,从而显著提升了检测精度和对多种扩散模型的泛化能力。
链接: https://arxiv.org/abs/2506.05466
作者: Alex Costanzino,Woody Bayliss,Juil Sock,Marc Gorriz Blanch,Danijela Horak,Ivan Laptev,Philip Torr,Fabio Pizzati
机构: University of Bologna (博洛尼亚大学); BBC AI (BBC人工智能); MBZUAI (MBZUAI); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Changing facial expressions, gestures, or background details may dramatically alter the meaning conveyed by an image. Notably, recent advances in diffusion models greatly improve the quality of image manipulation while also opening the door to misuse. Identifying changes made to authentic images, thus, becomes an important task, constantly challenged by new diffusion-based editing tools. To this end, we propose a novel approach for ReliAble iDentification of inpainted AReas (RADAR). RADAR builds on existing foundation models and combines features from different image modalities. It also incorporates an auxiliary contrastive loss that helps to isolate manipulated image patches. We demonstrate these techniques to significantly improve both the accuracy of our method and its generalisation to a large number of diffusion models. To support realistic evaluation, we further introduce BBC-PAIR, a new comprehensive benchmark, with images tampered by 28 diffusion models. Our experiments show that RADAR achieves excellent results, outperforming the state-of-the-art in detecting and localising image edits made by both seen and unseen diffusion models. Our code, data and models will be publicly available at this http URL.
zh
[CV-91] Degradation-Aware Image Enhancement via Vision-Language Classification
【速读】:该论文旨在解决图像退化问题,这一问题在多种实际应用中普遍存在,影响视觉质量和下游处理任务。其解决方案的关键在于利用生成式 AI (Generative AI) 中的 Vision-Language Model (VLM) 自动将退化图像分类为预定义的四类:超分辨率退化、反射伪影、运动模糊以及无可见退化。随后,针对分类结果中的特定退化类型,采用专门设计的修复模型进行针对性恢复,从而提升图像质量。该方法结合了 VLM 的分类能力与先进的图像修复技术,提供了一种可扩展且自动化的图像增强解决方案。
链接: https://arxiv.org/abs/2506.05450
作者: Jie Cai,Kangning Yang,Jiaming Ding,Lan Fu,Ling Ouyang,Jiang Li,Jinglin Shen,Zibo Meng
机构: OPPO AI Center (OPPO人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image degradation is a prevalent issue in various real-world applications, affecting visual quality and downstream processing tasks. In this study, we propose a novel framework that employs a Vision-Language Model (VLM) to automatically classify degraded images into predefined categories. The VLM categorizes an input image into one of four degradation types: (A) super-resolution degradation (including noise, blur, and JPEG compression), (B) reflection artifacts, © motion blur, or (D) no visible degradation (high-quality image). Once classified, images assigned to categories A, B, or C undergo targeted restoration using dedicated models tailored for each specific degradation type. The final output is a restored image with improved visual quality. Experimental results demonstrate the effectiveness of our approach in accurately classifying image degradations and enhancing image quality through specialized restoration models. Our method presents a scalable and automated solution for real-world image enhancement tasks, leveraging the capabilities of VLMs in conjunction with state-of-the-art restoration techniques.
zh
[CV-92] AI-powered Contextual 3D Environment Generation: A Systematic Review
【速读】:该论文试图解决高精度3D环境生成的资源密集问题,尤其是在游戏、虚拟现实和电影等行业中依赖人工流程所带来的效率瓶颈。其解决方案的关键在于系统性地分析现有的生成式AI(Generative AI)技术,探索其在3D场景生成中的特性、优势与局限性,并提出改进方向。研究强调了先进生成架构在计算成本较高的前提下实现高质量3D内容生成的能力,以及多模态整合技术如交叉注意力和潜在空间对齐在文本到3D任务中的有效性,同时指出训练数据的质量与多样性及全面评估指标对于实现可扩展、稳健的3D场景生成至关重要。
链接: https://arxiv.org/abs/2506.05449
作者: Miguel Silva,Alexandre Valle de Carvalho
机构: Faculty of Engineering, University of Porto (工程学院,波尔图大学); INESC TEC (INESC TEC)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The generation of high-quality 3D environments is crucial for industries such as gaming, virtual reality, and cinema, yet remains resource-intensive due to the reliance on manual processes. This study performs a systematic review of existing generative AI techniques for 3D scene generation, analyzing their characteristics, strengths, limitations, and potential for improvement. By examining state-of-the-art approaches, it presents key challenges such as scene authenticity and the influence of textual inputs. Special attention is given to how AI can blend different stylistic domains while maintaining coherence, the impact of training data on output quality, and the limitations of current models. In addition, this review surveys existing evaluation metrics for assessing realism and explores how industry professionals incorporate AI into their workflows. The findings of this study aim to provide a comprehensive understanding of the current landscape and serve as a foundation for future research on AI-driven 3D content generation. Key findings include that advanced generative architectures enable high-quality 3D content creation at a high computational cost, effective multi-modal integration techniques like cross-attention and latent space alignment facilitate text-to-3D tasks, and the quality and diversity of training data combined with comprehensive evaluation metrics are critical to achieving scalable, robust 3D scene generation.
zh
[CV-93] U-NetMN and SegNetMN: Modified U-Net and SegNet models for bimodal SAR image segmentation
【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像分割中深度学习模型收敛速度慢和稳定性差的问题,尤其是在水体检测等遥感应用中的表现。其解决方案的关键在于引入模式归一化(mode normalization),通过该方法在不牺牲基准模型性能的前提下,有效缩短模型的收敛时间,并提升模型在不同区域的稳定性。实验结果表明,模式归一化在提高计算效率和泛化能力方面具有显著效果。
链接: https://arxiv.org/abs/2506.05444
作者: Marwane Kzadri,Franco Alberto Cardillo,Nanée Chahinian,Carole Delenne,Renaud Hostache,Jamal Riffi
机构: LISAC(信息科学与技术联合实验室); Sidi Mohamed Ben Abdellah Univ(西迪·穆罕默德·本·阿卜杜拉大学); Institute for Computational Linguistics(计算语言学研究所); National Research Council(国家研究委员会); HSM(健康与社会医学研究所); IRD(法国国家科研中心); Univ Montpellier(蒙彼利埃大学); CNRS(法国国家科学研究中心); IUSTI(工业与系统工程研究所); Aix Marseille Univ(艾克斯-马赛大学); Espace Dev(空间发展研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Segmenting Synthetic Aperture Radar (SAR) images is crucial for many remote sensing applications, particularly water body detection. However, deep learning-based segmentation models often face challenges related to convergence speed and stability, mainly due to the complex statistical distribution of this type of data. In this study, we evaluate the impact of mode normalization on two widely used semantic segmentation models, U-Net and SegNet. Specifically, we integrate mode normalization, to reduce convergence time while maintaining the performance of the baseline models. Experimental results demonstrate that mode normalization significantly accelerates convergence. Furthermore, cross-validation results indicate that normalized models exhibit increased stability in different zones. These findings highlight the effectiveness of normalization in improving computational efficiency and generalization in SAR image segmentation.
zh
[CV-94] Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在端到端自动驾驶应用中的局限性,特别是现有数据集的非结构化语言描述不适用于机器处理、计算成本高以及模型规模大导致推理速度慢和实际部署困难的问题。其解决方案的关键在于引入一个结构化且简洁的基准数据集 NuScenes-S,并提出一个参数量为0.9B的轻量级生成式 AI 基线模型 FastDrive,该模型能够高效理解结构化描述并生成机器友好的驾驶决策,从而在保持较高准确率的同时显著提升推理速度。
链接: https://arxiv.org/abs/2506.05442
作者: Hao Jiang,Chuan Hu,Yukang Shi,Yuan He,Ke Wang,Xi Zhang,Zhipeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); KargoBot (KargoBot)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remains between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, high computational cost and massive scale of VLMs hinder the inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing(e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing massive parameter baseline in inference speed with over 10x speedup. Additionally, ablation studies further focus on the impact of scene annotations (e.g., weather, time of day) on decision-making tasks, demonstrating their importance on decision-making tasks in autonomous driving.
zh
[CV-95] Robustness Evaluation for Video Models with Reinforcement Learning CVPR
【速读】:该论文旨在解决视频分类模型鲁棒性评估的问题,尤其是在与基于图像的模型相比时,视频模型由于增加了时间维度而面临更高的复杂性和计算成本。其关键解决方案是提出一种多智能体强化学习方法(空间和时间),该方法协同学习以识别视频中的敏感空间和时间区域,并在生成微小扰动时考虑时间一致性,从而实现更有效且视觉上难以察觉的攻击。该方法在Lp度量和平均查询次数上优于现有最佳方案,并支持自定义失真类型,使鲁棒性评估更贴近实际应用场景。
链接: https://arxiv.org/abs/2506.05431
作者: Ashwin Ramesh Babu,Sajad Mousavi,Vineet Gundecha,Sahand Ghorbanpour,Avisek Naug,Antonio Guillen,Ricardo Luna Gutierrez,Soumyendu Sarkar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2025
点击查看摘要
Abstract:Evaluating the robustness of Video classification models is very challenging, specifically when compared to image-based models. With their increased temporal dimension, there is a significant increase in complexity and computational cost. One of the key challenges is to keep the perturbations to a minimum to induce misclassification. In this work, we propose a multi-agent reinforcement learning approach (spatial and temporal) that cooperatively learns to identify the given video’s sensitive spatial and temporal regions. The agents consider temporal coherence in generating fine perturbations, leading to a more effective and visually imperceptible attack. Our method outperforms the state-of-the-art solutions on the Lp metric and the average queries. Our method enables custom distortion types, making the robustness evaluation more relevant to the use case. We extensively evaluate 4 popular models for video action recognition on two popular datasets, HMDB-51 and UCF-101.
zh
[CV-96] SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
【速读】:该论文旨在解决人工智能在理解复杂社会交互方面的不足,特别是针对多模态线索、不可观测的关系与心理状态以及动态行为的处理能力。其解决方案的关键在于引入SIV-Bench,一个用于严格评估多模态大语言模型(Multimodal Large Language Models, MLLMs)在社会场景理解(Social Scene Understanding, SSU)、社会状态推理(Social State Reasoning, SSR)和社会动态预测(Social Dynamics Prediction, SDP)方面能力的新视频基准。SIV-Bench包含2,792段视频片段和8,792个精心生成的问答对,并通过人类-大语言模型协作流程进行数据构建,以全面评估模型在不同文本线索下的表现,从而揭示当前MLLMs在社会智能方面的优势与局限。
链接: https://arxiv.org/abs/2506.05425
作者: Fanqi Kong,Weiqin Zu,Xinyu Chen,Yaodong Yang,Song-Chun Zhu,Xue Feng
机构: Peking University (北京大学); Tsinghua University (清华大学); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The rich and multifaceted nature of human social interaction, encompassing multimodal cues, unobservable relations and mental states, and dynamical behavior, presents a formidable challenge for artificial intelligence. To advance research in this area, we introduce SIV-Bench, a novel video benchmark for rigorously evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It is originally collected from TikTok and YouTube, covering a wide range of video genres, presentation styles, and linguistic and cultural backgrounds. It also includes a dedicated setup for analyzing the impact of different textual cues-original on-screen text, added dialogue, or no text. Our comprehensive experiments on leading MLLMs reveal that while models adeptly handle SSU, they significantly struggle with SSR and SDP, where Relation Inference (RI) is an acute bottleneck, as further examined in our analysis. Our study also confirms the critical role of transcribed dialogue in aiding comprehension of complex social interactions. By systematically identifying current MLLMs’ strengths and limitations, SIV-Bench offers crucial insights to steer the development of more socially intelligent AI. The dataset and code are available at this https URL.
zh
[CV-97] Self-supervised One-Stage Learning for RF-based Multi-Person Pose Estimation CIKM2024
【速读】:该论文旨在解决基于射频(RF)信号的多人姿态估计(MPPE)中计算复杂度高和精度不足的问题。现有方法要么通过复杂的预处理将RF信号转换为热图图像,导致计算开销大,要么直接对原始RF信号应用深度嵌入网络,但精度和泛化能力较差。论文提出的解决方案是一种高效轻量的一阶段MPPE模型,其关键在于通过对RF信号进行子组划分,并使用共享的单层卷积神经网络(CNN)结合多头注意力机制进行嵌入,从而提升模型效率与性能;此外,还引入了一种新的自监督学习(SSL)方法,通过利用未遮挡子组和剩余遮挡子组的输入来预测遮挡数据的潜在表示,显著提升了模型在新场景或障碍物存在时的性能表现。
链接: https://arxiv.org/abs/2506.05420
作者: Seunghwan Shin,Yusung Kim
机构: Vueron Technology(维伦科技); Sungkyunkwan University(成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CIKM 2024
点击查看摘要
Abstract:In the field of Multi-Person Pose Estimation (MPPE), Radio Frequency (RF)-based methods can operate effectively regardless of lighting conditions and obscured line-of-sight situations. Existing RF-based MPPE methods typically involve either 1) converting RF signals into heatmap images through complex preprocessing, or 2) applying a deep embedding network directly to raw RF signals. The first approach, while delivering decent performance, is computationally intensive and time-consuming. The second method, though simpler in preprocessing, results in lower MPPE accuracy and generalization performance. This paper proposes an efficient and lightweight one-stage MPPE model based on raw RF signals. By sub-grouping RF signals and embedding them using a shared single-layer CNN followed by multi-head attention, this model outperforms previous methods that embed all signals at once through a large and deep CNN. Additionally, we propose a new self-supervised learning (SSL) method that takes inputs from both one unmasked subgroup and the remaining masked subgroups to predict the latent representations of the masked data. Empirical results demonstrate that our model improves MPPE accuracy by up to 15 in PCKh@0.5 compared to previous methods using raw RF signals. Especially, the proposed SSL method has shown to significantly enhance performance improvements when placed in new locations or in front of obstacles at RF antennas, contributing to greater performance gains as the number of people increases. Our code and dataset is open at Github. this https URL .
zh
[CV-98] Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions AAAI2023
【速读】:该论文旨在解决模型基础强化学习(MBRL)在面对视觉干扰时泛化能力不足的问题,尤其是在高维图像观测中,任务无关的干扰因素(如云、阴影和光线)会显著影响算法性能。其解决方案的关键在于提出了一种新的自监督方法——Dream to Generalize (Dr. G),该方法通过双对比学习训练编码器和世界模型,以高效捕捉多视角数据增强中的任务相关特征,并引入循环状态逆动力学模型来提升世界模型对时间结构的理解,从而增强世界模型对视觉干扰的鲁棒性。
链接: https://arxiv.org/abs/2506.05419
作者: Jeongsoo Ha,Kyungsoo Kim,Yusung Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI 2023
点击查看摘要
Abstract:Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks in highdimensional image observations. Although recent MBRL algorithms perform well in trained observations, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at this https URL
zh
[CV-99] Self-Predictive Dynamics for Generalization of Vision-based Reinforcement Learning IJCAI2022
【速读】:该论文旨在解决视觉强化学习中图像观测表示效率与鲁棒性不足的问题,特别是在图像包含干扰性(任务无关)元素(如阴影、云层和光照)的情况下,且这些干扰在训练过程中未被暴露时,问题尤为突出。解决方案的关键在于提出一种自预测动力学(Self-Predictive Dynamics, SPD)方法,通过并行使用弱增强和强增强,利用逆向和正向转移预测来学习任务相关特征,从而在未见过的观测中仍能高效提取关键特征。
链接: https://arxiv.org/abs/2506.05418
作者: Kyungsoo Kim,Jeongsoo Ha,Yusung Kim
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: IJCAI 2022
点击查看摘要
Abstract:Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, even in unseen observations after training. SPD uses weak and strong augmentations in parallel, and learns representations by predicting inverse and forward transitions across the two-way augmented versions. In a set of MuJoCo visual control tasks and an autonomous driving task (CARLA), SPD outperforms previous studies in complex observations, and significantly improves the generalization performance for unseen observations. Our code is available at this https URL.
zh
[CV-100] Better STEP a format and dataset for boundary representation
【速读】:该论文旨在解决从计算机辅助设计(CAD)生成的边界表示(B-rep)数据在工业中的应用受限问题,这些数据通常以STEP格式存储,需要依赖CAD内核进行读取和处理,从而限制了其在大规模学习流程中的使用。论文提出的解决方案关键在于采用开放、跨平台的HDF5格式作为替代,并配套相应的数据集及开源库,以实现对STEP文件的高效查询与处理,同时提供标准功能如采样、法线和曲率计算,便于集成到现有工作流中。
链接: https://arxiv.org/abs/2506.05417
作者: Nafiseh Izadyar,Sai Chandra Madduri,Teseo Schneider
机构: University of Victoria(维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Boundary representation (B-rep) generated from computer-aided design (CAD) is widely used in industry, with several large datasets available. However, the data in these datasets is represented in STEP format, requiring a CAD kernel to read and process it. This dramatically limits their scope and usage in large learning pipelines, as it constrains the possibility of deploying them on computing clusters due to the high cost of per-node licenses. This paper introduces an alternative format based on the open, cross-platform format HDF5 and a corresponding dataset for STEP files, paired with an open-source library to query and process them. Our Python package also provides standard functionalities such as sampling, normals, and curvature to ease integration in existing pipelines. To demonstrate the effectiveness of our format, we converted the Fusion 360 dataset and the ABC dataset. We developed four standard use cases (normal estimation, denoising, surface reconstruction, and segmentation) to assess the integrity of the data and its compliance with the original STEP files. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2506.05417 [cs.CV] (or arXiv:2506.05417v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.05417 Focus to learn more arXiv-issued DOI via DataCite
zh
[CV-101] SAVVY: Spatial Awareness via Audio-Visual LLM s through Seeing and Hearing
【速读】:该论文旨在解决现有音频-视觉大语言模型(AV-LLMs)在动态、多模态环境中进行三维空间推理能力不足的问题,当前的模型和基准测试主要关注静态或二维场景。解决方案的关键在于提出SAVVY,一种无需训练的推理流程,其核心包括两个阶段:第一阶段为自我中心空间轨迹估计,利用AV-LLMs及其他音视频方法,结合视觉与空间音频线索追踪与查询相关的关键物体轨迹;第二阶段为动态全局地图构建,将多模态查询物体轨迹聚合并转换为统一的动态全局地图,最终通过坐标变换将全局地图与查询视角对齐以生成答案。
链接: https://arxiv.org/abs/2506.05414
作者: Mingfei Chen,Zijun Cui,Xiulong Liu,Jinlin Xiang,Caleb Zheng,Jingyuan Li,Eli Shlizerman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Project website with demo videos: this https URL
点击查看摘要
Abstract:3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench is comprised of thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.
zh
[CV-102] QA-HFL: Quality-Aware Hierarchical Federated Learning for Resource-Constrained Mobile Devices with Heterogeneous Image Quality
【速读】:该论文试图解决在资源受限的移动设备上,由于图像质量异构性导致的联邦学习(Federated Learning)模型性能下降问题。其解决方案的关键在于提出一种质量感知的分层联邦学习框架(QA-HFL),通过为不同图像质量等级训练专用的本地模型,并采用质量加权融合机制聚合特征,同时结合差分隐私保护,以提升模型的准确性和隐私安全性。此外,该方法还通过设备特定的正则化、自适应加权、智能客户端选择和服务器端知识蒸馏等策略,有效缓解了低端设备带来的性能影响。
链接: https://arxiv.org/abs/2506.05411
作者: Sajid Hussain,Muhammad Sohail,Nauman Ali Khan
机构: National University of Sciences and Technology (国家科技大学); Military College of Signals (军事信号学院)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper introduces QA-HFL, a quality-aware hierarchical federated learning framework that efficiently handles heterogeneous image quality across resource-constrained mobile devices. Our approach trains specialized local models for different image quality levels and aggregates their features using a quality-weighted fusion mechanism, while incorporating differential privacy protection. Experiments on MNIST demonstrate that QA-HFL achieves 92.31% accuracy after just three federation rounds, significantly outperforming state-of-the-art methods like FedRolex (86.42%). Under strict privacy constraints, our approach maintains 30.77% accuracy with formal differential privacy guarantees. Counter-intuitively, low-end devices contributed most significantly (63.5%) to the final model despite using 100 fewer parameters than high-end counterparts. Our quality-aware approach addresses accuracy decline through device-specific regularization, adaptive weighting, intelligent client selection, and server-side knowledge distillation, while maintaining efficient communication with a 4.71% compression ratio. Statistical analysis confirms that our approach significantly outperforms baseline methods (p 0.01) under both standard and privacy-constrained conditions.
zh
[CV-103] Object-level Self-Distillation for Vision Pretraining
【速读】:该论文试图解决当前视觉预训练方法在处理多物体图像和场景级数据集时的局限性,这些问题源于现有方法假设每张图像仅包含一个主体对象,而这一假设并不总成立,且限制了模型在更复杂现实场景中的扩展性。其解决方案的关键在于引入基于对象级别的自蒸馏(Object-level Self-DIStillation, ODIS),通过对象感知裁剪和掩码注意力机制,将自蒸馏的粒度从整张图像转移到单个对象,从而提取更具语义意义的对象特定区域,简化任务并提升视觉表征能力。
链接: https://arxiv.org/abs/2506.05409
作者: Çağlar Hızlı,Çağatay Yıldız,Pekka Marttinen
机构: Aalto University (阿尔托大学); University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:State-of-the-art vision pretraining methods rely on image-level self-distillation from object-centric datasets such as ImageNet, implicitly assuming each image contains a single object. This assumption does not always hold: many ImageNet images already contain multiple objects. Further, it limits scalability to scene-centric datasets that better mirror real-world complexity. We address these challenges by introducing Object-level Self-DIStillation (ODIS), a pretraining approach that shifts the self-distillation granularity from whole images to individual objects. Using object-aware cropping and masked attention, ODIS isolates object-specific regions, guiding the transformer toward semantically meaningful content and transforming a noisy, scene-level task into simpler object-level sub-tasks. We show that this approach improves visual representations both at the image and patch levels. Using masks at inference time, our method achieves an impressive 82.6% k -NN accuracy on ImageNet1k with ViT-Large.
zh
[CV-104] A VLM-based Method for Visual Anomaly Detection in Robotic Scientific Laboratories
【速读】:该论文旨在解决机器人科学实验室中视觉异常检测的问题,以实现对潜在故障或偏差的及时识别与处理,从而保障实验过程的稳定性和安全性。其解决方案的关键在于提出一种基于视觉语言模型(VLM)的视觉推理方法,该方法通过四种逐步提供更多信息的提示配置支持不同层次的监督,从而提升异常检测的准确性。
链接: https://arxiv.org/abs/2506.05405
作者: Shiwei Lin,Chenxu Wang,Xiaozhen Ding,Yi Wang,Boyuan Du,Lei Song,Chenggang Wang,Huaping Liu
机构: Tsinghua University (清华大学); Yantai University (烟台大学); Fuzhou University (福州大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In robot scientific laboratories, visual anomaly detection is important for the timely identification and resolution of potential faults or deviations. It has become a key factor in ensuring the stability and safety of experimental processes. To address this challenge, this paper proposes a VLM-based visual reasoning approach that supports different levels of supervision through four progressively informative prompt configurations. To systematically evaluate its effectiveness, we construct a visual benchmark tailored for process anomaly detection in scientific workflows. Experiments on two representative vision-language models show that detection accuracy improves as more contextual information is provided, confirming the effectiveness and adaptability of the proposed reasoning approach for process anomaly detection in scientific workflows. Furthermore, real-world validations at selected experimental steps confirm that first-person visual observation can effectively identify process-level anomalies. This work provides both a data-driven foundation and an evaluation framework for vision anomaly detection in scientific experiment workflows.
zh
[CV-105] AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶中实时应用时面临的高延迟和计算开销问题,特别是由于模型在获得足够置信度预测后仍继续处理不必要的层所导致的过推理(over-inference)现象。解决方案的关键在于提出AD-EE框架,该框架结合了自动驾驶领域的特性,并利用因果推断识别最优的提前退出层,从而有效降低计算负载并提升推理效率。
链接: https://arxiv.org/abs/2506.05404
作者: Lianming Huang,Haibo Hu,Yufei Cui,Jiacheng Zuo,Shangyu Wu,Nan Guan,Chun Jason Xue
机构: City University of Hong Kong (香港城市大学); MILA, McGill University (MILA,麦吉尔大学); MBZUAI (MBZUAI); Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages
点击查看摘要
Abstract:With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.
zh
[CV-106] Robust Anti-Backdoor Instruction Tuning in LVLMs
【速读】:该论文旨在解决大型视觉语言模型(LVLMs)在微调过程中因污染数据而易受到隐蔽后门攻击的问题。现有防御方法通常依赖于对完整参数的调整或训练过程中的监督知识,但在实际场景中,防御者无法修改冻结的视觉编码器或核心大语言模型(LLM)参数,也缺乏对未知触发模式和目标响应的先验知识。论文提出的解决方案关键在于设计一种无需访问核心权重或攻击先验的轻量级、与认证无关的防御框架——鲁棒指令微调(Robust Instruction Tuning),该方法仅在指令微调下微调适配器模块和文本嵌入层,并通过输入多样性正则化和异常激活正则化两种互补机制,防止模型记忆表面的触发-响应映射,引导其学习语义基础表示。
链接: https://arxiv.org/abs/2506.05401
作者: Yuan Xun,Siyuan Liang,Xiaojun Jia,Xinwei Liu,Xiaochun Cao
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Nanyang Technological University (南洋理工大学); Sun Yat-sen University-Shenzhen (中山大学深圳)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large visual language models (LVLMs) have demonstrated excellent instruction-following capabilities, yet remain vulnerable to stealthy backdoor attacks when finetuned using contaminated data. Existing backdoor defense techniques are usually developed for single-modal visual or language models under fully parameter-adjustable settings or rely on supervisory knowledge during training. However, in real-world scenarios, defenders cannot modify frozen visual encoders or core LLM parameters, nor possess prior knowledge of unknown trigger patterns or target responses. Motivated by the empirical finding that LVLMs readily overfit to fixed, unknown triggers, which can embed malicious associations during adapter-level tuning, we aim to design a defense that operates without access to core weights or attack priors. To this end, we introduce a lightweight, certified-agnostic defense framework, Robust Instruction Tuning, that finetunes only adapter modules and text embedding layers under instruction tuning. Our method integrates two complementary regularizations: (1) Input Diversity Regularization, which perturbs trigger components across training samples to disrupt consistent spurious cues; and (2) Anomalous Activation Regularization, which dynamically sparses adapter weights exhibiting abnormally sharp activations linked to backdoor patterns. These mechanisms jointly guide the model toward learning semantically grounded representations rather than memorizing superficial trigger-response mappings. Extensive experiments against seven attacks on Flickr30k and MSCOCO demonstrate that ours reduces their attack success rate to nearly zero, with an increase in training cost of less than 15%. Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2506.05401 [cs.CR] (or arXiv:2506.05401v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2506.05401 Focus to learn more arXiv-issued DOI via DataCite
zh
[CV-107] alk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation
【速读】:该论文试图解决当前分割模型(如Segment Anything Model (SAM)及其高质量变体SAM-HQ)在处理复杂形状物体(如电线、自行车或结构网格)时,对细长结构和精细边界分割效果不佳的问题。解决方案的关键在于提出Talk2SAM,该方法通过整合用户提供的文本提示生成CLIP基础嵌入,识别相关语义区域,并将其投影到DINO特征空间,作为SAM-HQ的额外提示,从而提升其对目标物体的关注能力,实现更精确的分割。
链接: https://arxiv.org/abs/2506.05396
作者: Luka Vetoshkin,Dmitry Yudin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 14 pages, 7 figures, Submitted to the conference
点击查看摘要
Abstract:Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based on textual input. We evaluate our approach on three benchmarks: BIG, ThinObject5K, and DIS5K. Talk2SAM consistently outperforms SAM-HQ, achieving up to +5.9% IoU and +8.3% boundary IoU improvements. Our results demonstrate that incorporating natural language guidance provides a flexible and effective means for precise object segmentation, particularly in cases where traditional prompt-based methods fail. The source code is available on GitHub: this https URL
zh
[CV-108] riPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual Structural and Semantic Representations
【速读】:该论文旨在解决视频摘要和检索中关键帧提取的效率与完整性问题,即如何有效捕捉视频内容的丰富性。其解决方案的关键在于提出TriPSS框架,该框架通过融合CIELAB空间中的颜色特征、ResNet-50生成的深度结构嵌入以及Llama-3.2-11B-Vision-Instruct生成的帧级描述的语义信息,利用主成分分析进行多模态融合,构建鲁棒的多模态嵌入,并通过HDBSCAN聚类实现自适应视频内容分割,最终通过质量评估和重复过滤优化关键帧集。
链接: https://arxiv.org/abs/2506.05395
作者: Mert Can Cakmak,Nitin Agarwal,Diwash Poudel
机构: COSMOS Research Center, University of Arkansas Little Rock(宇宙学研究中心,阿肯色大学小石城分校); ICSI, University of California, Berkeley(计算机科学研究所,加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Efficient keyframe extraction is critical for effective video summarization and retrieval, yet capturing the complete richness of video content remains challenging. In this work, we present TriPSS, a novel tri-modal framework that effectively integrates perceptual cues from color features in the CIELAB space, deep structural embeddings derived from ResNet-50, and semantic context from frame-level captions generated by Llama-3.2-11B-Vision-Instruct. By fusing these diverse modalities using principal component analysis, TriPSS constructs robust multi-modal embeddings that enable adaptive segmentation of video content via HDBSCAN clustering. A subsequent refinement stage incorporating quality assessment and duplicate filtering ensures that the final keyframe set is both concise and semantically rich. Comprehensive evaluations on benchmark datasets TVSum20 and SumMe demonstrate that TriPSS achieves state-of-the-art performance, substantially outperforming traditional unimodal and previous multi-modal methods. These results underscore TriPSS’s ability to capture nuanced visual and semantic information, thereby setting a new benchmark for video content understanding in large-scale retrieval scenarios.
zh
[CV-109] Q-Ponder: A Unified Training Pipeline for Reasoning -based Visual Quality Assessment
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉质量评估中存在的一致性与准确性之间的权衡问题,即现有的方法通常将质量评分和推理描述视为独立任务,导致模型在精确评分回归与可解释性推理之间难以兼顾。其解决方案的关键在于提出一种统一的两阶段训练框架,第一阶段通过教师模型的知识蒸馏和交叉熵损失监督初始化模型的推理能力,第二阶段引入基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的新奖励机制,联合优化评分准确性和推理一致性,从而实现质量评分与解释性描述的协同提升。
链接: https://arxiv.org/abs/2506.05384
作者: Zhuoxuan Cai,Jian Zhang,Xinbin Yuan,Pengtao Jiang,Wenxiang Chen,Bowen Tang,Lujian Yao,Qiyuan Wang,Jinwen Chen,Bo Li
机构: Fudan University (复旦大学); vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Recent studies demonstrate that multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. However, existing approaches typically treat quality scoring and reasoning descriptions as separate tasks with disjoint optimization objectives, leading to a trade-off: models adept at quality reasoning descriptions struggle with precise score regression, while score-focused models lack interpretability. This limitation hinders the full potential of MLLMs in visual quality assessment, where accuracy and interpretability should be mutually reinforcing. To address this, we propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. Specifically, in the first stage, we distill high-quality data from a teacher model through expert-designed prompts, initializing reasoning capabilities via cross-entropy loss supervision. In the second stage, we introduce a novel reward with Group Relative Policy Optimization (GRPO) to jointly optimize scoring accuracy and reasoning consistency. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder. Extensive experiments show that Q-Ponder achieves state-of-the-art (SOTA) performance on quality score regression benchmarks, delivering up to 6.5% higher SRCC on cross-domain datasets. Furthermore, Q-Ponder significantly outperforms description-based SOTA models, including its teacher model Qwen-2.5-VL-72B, particularly in description accuracy and reasonableness, demonstrating the generalization potential over diverse tasks.
zh
[CV-110] Can Vision Transformers with ResNets Global Features Fairly Authenticate Demographic Faces? ICPR2024
【速读】:该论文旨在解决生物特征人脸认证中公平性和跨人口群体泛化能力不足的问题。其解决方案的关键在于利用预训练的全局特征(如Vision Transformer和ResNet),并通过融合ViT与ResNet的特征、设计少样本原型网络以及构建新的支持集和查询集数据,以最小化对局部特征的依赖,从而实现更公平的人脸认证性能。
链接: https://arxiv.org/abs/2506.05383
作者: Abu Sufian,Marco Leo,Cosimo Distante,Anirudha Ghosh,Debaditya Barman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 Figures, ICPR 2024 Workshop FAIRBIO
点击查看摘要
Abstract:Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a big challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate different demographic faces while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. Then, we designed a novel few-shot prototype network with backbone features embedding. We also developed new demographic face image support and query datasets for this empirical study. The network’s testing was conducted on this dataset in one-shot, three-shot, and five-shot scenarios to assess how performance improves as the size of the support set increases. We observed results across datasets with varying races/ethnicities, genders, and age groups. The Microsoft Swin Transformer backbone performed better among the three SOTA ViT for this task. The code and data are available at: this https URL.
zh
[CV-111] A Compendium of Autonomous Navigation using Object Detection and Tracking in Unmanned Aerial Vehicles
【速读】:该论文试图解决无人机(UAV)在自主导航过程中面临的挑战,包括信号质量与范围、实时处理、人工操作技能、硬件鲁棒性以及数据安全等问题。其解决方案的关键在于通过计算机视觉(Computer Vision)算法实现无人机的自主化,特别是利用目标检测与跟踪技术,以适应不同硬件平台并实现实时处理,从而提高无人机在各种应用场景(如灾害管理、密集区域勘探和交通车辆监控等)中的效能。
链接: https://arxiv.org/abs/2506.05378
作者: Mohit Arora,Pratyush Shukla,Shivali Chopra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Unmanned Aerial Vehicles (UAVs) are one of the most revolutionary inventions of 21st century. At the core of a UAV lies the central processing system that uses wireless signals to control their movement. The most popular UAVs are quadcopters that use a set of four motors, arranged as two on either side with opposite spin. An autonomous UAV is called a drone. Drones have been in service in the US army since the 90’s for covert missions critical to national security. It would not be wrong to claim that drones make up an integral part of the national security and provide the most valuable service during surveillance operations. While UAVs are controlled using wireless signals, there reside some challenges that disrupt the operation of such vehicles such as signal quality and range, real time processing, human expertise, robust hardware and data security. These challenges can be solved by programming UAVs to be autonomous, using object detection and tracking, through Computer Vision algorithms. Computer Vision is an interdisciplinary field that seeks the use of deep learning to gain a high-level understanding of digital images and videos for the purpose of automating the task of human visual system. Using computer vision, algorithms for detecting and tracking various objects can be developed suitable to the hardware so as to allow real time processing for immediate judgement. This paper attempts to review the various approaches several authors have proposed for the purpose of autonomous navigation of UAVs by through various algorithms of object detection and tracking in real time, for the purpose of applications in various fields such as disaster management, dense area exploration, traffic vehicle surveillance etc.
zh
[CV-112] An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos ALT
【速读】:该论文试图解决在线平台上虚假图像和视频迅速传播的问题,这些问题通常通过使用易于获取的编辑软件对图像进行添加、删除、克隆或修改而生成,从而隐藏犯罪证据并传播错误信息。解决方案的关键在于利用生成式对抗网络(Generative Adversarial Networks, GAN)生成的图像或视频的检测,具体通过构建一个基于InceptionResNetV2的卷积神经网络作为独立的判别网络,以识别GAN生成的内容。此外,论文还提出了一种平台,用于检测伪造的图像和视频,旨在为司法鉴定领域提供辅助工具,以识别犯罪活动。
链接: https://arxiv.org/abs/2506.05377
作者: Shayantani Kar,B. Shresth Bhimrajka,Aditya Kumar,Sahil Gupta,Sourav Ghosh,Subhamita Mukherjee,Shauvik Paul
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This research was conducted by student and professor co-authors from Techno Main Salt Lake, with co-author Sourav Ghosh serving as an alumni mentor in an invited capacity – distinct from his primary affiliation and pre-approved by his employer. This preprint presents research originally completed in early 2023 and published in IETE Journal of Research in 2025
点击查看摘要
Abstract:Rapid spread of false images and videos on online platforms is an emerging problem. Anyone may add, delete, clone or modify people and entities from an image using various editing software which are readily available. This generates false and misleading proof to hide the crime. Now-a-days, these false and counterfeit images and videos are flooding on the internet. These spread false information. Many methods are available in literature for detecting those counterfeit contents but new methods of counterfeiting are also evolving. Generative Adversarial Networks (GAN) are observed to be one effective method as it modifies the context and definition of images producing plausible results via image-to-image translation. This work uses an independent discriminant network that can identify GAN generated image or video. A discriminant network has been created using a convolutional neural network based on InceptionResNetV2. The article also proposes a platform where users can detect forged images and videos. This proposed work has the potential to help the forensics domain to detect counterfeit videos and hidden criminal evidence towards the identification of criminal activities.
zh
[CV-113] State Estimation and Control of Dynamic Systems from High-Dimensional Image Data
【速读】:该论文试图解决动态系统中准确状态估计的问题,这一问题在缺乏直接获取真实系统状态的情况下,会显著影响最优策略设计和强化学习中的策略学习过程。解决方案的关键在于提出一种融合卷积神经网络(CNN)的空间特征提取与门控循环单元(GRU)的时间建模的新型神经架构,从而从图像序列及对应动作中学习有效的状态表示,并利用这些表示训练深度Q网络(DQN)实现精确的状态估计与控制。
链接: https://arxiv.org/abs/2506.05375
作者: Ashik E Rasul,Hyung-Jin Yoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate state estimation is critical for optimal policy design in dynamic systems. However, obtaining true system states is often impractical or infeasible, complicating the policy learning process. This paper introduces a novel neural architecture that integrates spatial feature extraction using convolutional neural networks (CNNs) and temporal modeling through gated recurrent units (GRUs), enabling effective state representation from sequences of images and corresponding actions. These learned state representations are used to train a reinforcement learning agent with a Deep Q-Network (DQN). Experimental results demonstrate that our proposed approach enables real-time, accurate estimation and control without direct access to ground-truth states. Additionally, we provide a quantitative evaluation methodology for assessing the accuracy of the learned states, highlighting their impact on policy performance and control stability.
zh
[CV-114] DVD: A Comprehensive Dataset for Advancing Violence Detection in Real-World Scenarios
【速读】:该论文试图解决现有暴力检测(Violence Detection, VD)数据库在多样性、标注粒度、规模及元数据方面的不足,这些问题限制了模型的泛化能力。解决方案的关键在于引入DVD,这是一个大规模(500个视频,270万帧)、具有帧级标注的VD数据库,涵盖多样环境、不同光照条件、多源摄像头、复杂社会互动以及丰富的元数据,旨在捕捉现实世界中暴力事件的复杂性。
链接: https://arxiv.org/abs/2506.05372
作者: Dimitrios Kollias,Damith C. Senadeera,Jianian Zheng,Kaushal K. K. Yadav,Greg Slabaugh,Muhammad Awais,Xiaoyun Yang
机构: Queen Mary University of London (伦敦玛丽女王大学); Queen Mary’s Digital Environment Research Institute (女王大学数字环境研究中心); Remark AI UK Ltd (Remark AI 英国有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Violence Detection (VD) has become an increasingly vital area of research. Existing automated VD efforts are hindered by the limited availability of diverse, well-annotated databases. Existing databases suffer from coarse video-level annotations, limited scale and diversity, and lack of metadata, restricting the generalization of models. To address these challenges, we introduce DVD, a large-scale (500 videos, 2.7M frames), frame-level annotated VD database with diverse environments, varying lighting conditions, multiple camera sources, complex social interactions, and rich metadata. DVD is designed to capture the complexities of real-world violent events.
zh
[CV-115] MR.NAVI: Mixed-Reality Navigation Assistant for the Visually Impaired
【速读】:该论文试图解决视觉障碍者在陌生环境中导航所面临的重大挑战,旨在提升其空间感知能力。解决方案的关键在于构建一个混合现实系统,该系统通过实时场景理解和直观的音频反馈来增强用户的空间意识,其核心技术包括基于计算机视觉的对象检测与深度估计、自然语言处理生成情境化场景描述、以及利用MobileNet进行对象检测和RANSAC结合DBSCAN聚类实现障碍物避让的分布式架构。
链接: https://arxiv.org/abs/2506.05369
作者: Nicolas Pfitzer,Yifan Zhou,Marco Poggensee,Defne Kurtulus,Bessie Dominguez-Dager,Mihai Dusmanu,Marc Pollefeys,Zuria Bauer
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Over 43 million people worldwide live with severe visual impairment, facing significant challenges in navigating unfamiliar environments. We present this http URL, a mixed reality system that enhances spatial awareness for visually impaired users through real-time scene understanding and intuitive audio feedback. Our system combines computer vision algorithms for object detection and depth estimation with natural language processing to provide contextual scene descriptions, proactive collision avoidance, and navigation instructions. The distributed architecture processes sensor data through MobileNet for object detection and employs RANSAC-based floor detection with DBSCAN clustering for obstacle avoidance. Integration with public transit APIs enables navigation with public transportation directions. Through our experiments with user studies, we evaluated both scene description and navigation functionalities in unfamiliar environments, showing promising usability and effectiveness.
zh
[CV-116] Speaking images. A novel framework for the automated self-description of artworks
【速读】:该论文试图解决如何通过生成式 AI(Generative AI)技术提升数字艺术与文化遗产的可访问性及内容阐释问题,特别是在面对原始历史对象时,探索数字图像的可塑性与当代解读方式。其解决方案的关键在于基于“自主图像”概念,构建一个整合开源大语言模型、人脸检测、文本到语音以及音频到动画模型的框架,以实现从数字化艺术品自动生成包含主要角色解释内容的短视频,从而推动文化 artifact 的自我解释性生产。
链接: https://arxiv.org/abs/2506.05368
作者: Valentine Bernasconi,Gustavo Marfia
机构: University of Bologna(博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease the access and highlight the content of digital collections. Such innovations develop into creative explorations of the digital image in relation to its malleability and contemporary interpretation, in confrontation to the original historical object. Based on the concept of the autonomous image, we propose a new framework towards the production of self-explaining cultural artifacts using open-source large-language, face detection, text-to-speech and audio-to-animation models. The goal is to start from a digitized artwork and to automatically assemble a short video of the latter where the main character animates to explain its content. The whole process questions cultural biases encapsulated in large-language models, the potential of digital images and deepfakes of artworks for educational purposes, along with concerns of the field of art history regarding such creative diversions.
zh
[CV-117] xt2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards
【速读】:该论文旨在解决如何根据文本提示生成高质量立体图像的问题,尤其是在缺乏大规模基线立体图像数据集的情况下,直接训练扩散模型不可行。其解决方案的关键在于利用Stable Diffusion模型已学习的强先验知识,并在其上进行微调以适应立体图像生成任务,同时通过提示对齐和提出的立体一致性奖励函数进一步提升模型的立体一致性和文本到图像的对齐能力。
链接: https://arxiv.org/abs/2506.05367
作者: Aakash Garg,Libing Zeng,Andrii Tsarov,Nima Khademi Kalantari
机构: Texas A&M University (德克萨斯A&M大学); Leia Inc. (莱亚公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.
zh
[CV-118] Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion
【速读】:该论文试图解决传统可扩展图像编码方法需要传输额外信息以实现可扩展性的问题,以及现有基于扩散的方法因使用单一随机种子可能导致图像质量不佳的问题。其解决方案的关键在于提出一种种子选择方法,通过从多个候选种子中识别出最优种子来提升图像质量,同时不增加比特率;为降低计算成本,该选择过程基于逆向扩散过程早期步骤的中间输出进行。
链接: https://arxiv.org/abs/2506.05363
作者: Yui Tatsumi,Ziyue Zeng,Hiroshi Watanabe
机构: Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Conventional methods for scalable image coding for humans and machines require the transmission of additional information to achieve scalability. A recent diffusion-based method avoids this by generating human-oriented images from machine-oriented images without extra bitrate. This method, however, uses a single random seed, which may lead to suboptimal image quality. In this paper, we propose a seed selection method that identifies the optimal seed from multiple candidates to improve image quality without increasing the bitrate. To reduce computational cost, the selection is performed based on intermediate outputs obtained from early steps of the reverse diffusion process. Experimental results demonstrate that our method outperforms the baseline across multiple metrics.
zh
[CV-119] Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching ICML2025
【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)技术在高通量应用和计算资源限制方面的挑战,特别是如何有效建模细胞间相互作用以及处理大规模空间数据的内存瓶颈。其解决方案的关键在于提出STFlow,这是一种基于流匹配(flow matching)的生成模型,通过建模整个切片的基因表达联合分布来显式捕捉细胞间相互作用,并采用具有局部空间注意力机制的高效切片级编码器,实现了全切片处理而无需过度消耗内存资源。
链接: https://arxiv.org/abs/2506.05361
作者: Tinglin Huang,Tianyu Liu,Mehrtash Babadi,Wengong Jin,Rex Ying
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN)
备注: Accepted at ICML 2025
点击查看摘要
Abstract:Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
zh
[CV-120] CarboNeXT and CarboFormer: Dual Semantic Segmentation Architectures for Detecting and Quantifying Carbon Dioxide Emissions Using Optical Gas Imaging
【速读】:该论文旨在解决二氧化碳(CO₂)排放的检测与量化问题,特别是在环境监测和畜牧管理中的应用。其关键解决方案是提出CarboNeXT框架,该框架结合了多尺度上下文聚合网络与UPerHead及辅助FCN组件,以有效建模气体羽流图像中的局部细节与全局关系,从而提升分割精度与实时性。此外,还提出了轻量级的CarboFormer变体,进一步优化了计算效率,适用于资源受限的平台。
链接: https://arxiv.org/abs/2506.05360
作者: Taminul Islam,Toqi Tahamid Sarker,Mohamed G Embaby,Khaled R Ahmed,Amer AbuGhazaleh
机构: Southern Illinois University, Carbondale (南伊利诺伊大学,卡本代尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Carbon dioxide (CO _2 ) emissions are critical indicators of both environmental impact and various industrial processes, including livestock management. We introduce CarboNeXT, a semantic segmentation framework for Optical Gas Imaging (OGI), designed to detect and quantify CO _2 emissions across diverse applications. Our approach integrates a multi-scale context aggregation network with UPerHead and auxiliary FCN components to effectively model both local details and global relationships in gas plume imagery. We contribute two novel datasets: (1) the Controlled Carbon Dioxide Release (CCR) dataset, which simulates gas leaks with systematically varied flow rates (10-100 SCCM), and (2) the Real Time Ankom (RTA) dataset, focusing on emissions from dairy cow rumen fluid in vitro experiments. Extensive evaluations demonstrate that CarboNeXT outperforms state-of-the-art methods, achieving 88.46% mIoU on CCR and 92.95% mIoU on RTA, with particular effectiveness in challenging low-flow scenarios. The model operates at 60.95 FPS, enabling real-time monitoring applications. Additionally, we propose CarboFormer, a lightweight variant with only 5.07M parameters that achieves 84.68 FPS, with competitive performance of 84.88% mIoU on CCR and 92.98% on RTA, making it suitable for resource-constrained platforms such as programmable drones. Our work advances both environmental sensing and precision livestock management by providing robust tools for CO _2 emission analysis, with a specific focus on livestock applications.
zh
[CV-121] Can ChatGPT Perform Image Splicing Detection? A Preliminary Study
【速读】:该论文试图解决图像伪造检测问题,特别是图像拼接篡改的检测。其解决方案的关键在于利用GPT-4V这一多模态大语言模型(Multimodal Large Language Models, MLLMs)在无需任务特定微调的情况下,通过不同的提示策略(Zero-Shot、Few-Shot和Chain-of-Thought)进行图像取证分析,展现出在零样本设置下具有竞争力的检测性能,并能够结合低级视觉特征与现实世界的上下文知识进行推理。
链接: https://arxiv.org/abs/2506.05358
作者: Souradip Nath
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities, showing promise in a variety of complex vision-language tasks. In this preliminary study, we investigate the out-of-the-box capabilities of GPT-4V in the domain of image forensics, specifically, in detecting image splicing manipulations. Without any task-specific fine-tuning, we evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT), applied over a curated subset of the CASIA v2.0 splicing dataset. Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy), with CoT prompting yielding the most balanced trade-off across authentic and spliced images. Qualitative analysis further reveals that the model not only detects low-level visual artifacts but also draws upon real-world contextual knowledge such as object scale, semantic consistency, and architectural facts, to identify implausible composites. While GPT-4V lags behind specialized state-of-the-art splicing detection models, its generalizability, interpretability, and encyclopedic reasoning highlight its potential as a flexible tool in image forensics. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2506.05358 [cs.CV] (or arXiv:2506.05358v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.05358 Focus to learn more arXiv-issued DOI via DataCite
zh
[CV-122] Category Query Learning for Human-Object Interaction Classification CVPR2023
【速读】:该论文试图解决人体-物体交互(Human-Object Interaction, HOI)分类任务中特征学习的局限性,传统方法主要关注提升人体与物体的特征表示,而本文提出了一种新颖且互补的方法——类别查询学习(category query learning)。该方案的关键在于将查询显式关联到交互类别,通过Transformer解码器将其转换为图像特定的类别表示,并通过辅助的图像级分类任务进行学习。这一思路受到早期多标签图像分类方法的启发,但首次应用于具有挑战性的HOI分类任务中,从而实现了简单、通用且有效的性能提升。
链接: https://arxiv.org/abs/2303.14005
作者: Chi Xie,Fangao Zeng,Yue Hu,Shuang Liang,Yichen Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2023
点击查看摘要
Abstract:Unlike most previous HOI methods that focus on learning better human-object features, we propose a novel and complementary approach called category query learning. Such queries are explicitly associated to interaction categories, converted to image specific category representation via a transformer decoder, and learnt via an auxiliary image-level classification task. This idea is motivated by an earlier multi-label image classification method, but is for the first time applied for the challenging human-object interaction classification task. Our method is simple, general and effective. It is validated on three representative HOI baselines and achieves new state-of-the-art results on two benchmarks.
zh
[CV-123] DermaCon-IN: A Multi-concept Annotated Dermatological Image Dataset of Indian Skin Disorders for Clinical AI Research
【速读】:该论文旨在解决当前皮肤病学中人工智能模型在泛化性和公平性方面的不足,特别是在真实世界临床实践中数据集未能充分反映疾病分布、皮肤色调多样性以及非西方人群门诊场景的问题。其解决方案的关键在于构建一个前瞻性采集的皮肤病学数据集——DermaCon-IN,该数据集包含来自南印度门诊诊所约3,000名患者的5,450张临床图像,并由认证皮肤科医生进行超过240种不同诊断的标注,采用基于病因的分层分类体系,以捕捉印度门诊护理中常见的皮肤病种类和色调变化。
链接: https://arxiv.org/abs/2506.06099
作者: Shanawaj S Madarkar,Mahajabeen Madarkar,Madhumitha V,Teli Prakash,Konda Reddy Mopuri,Vinaykumar MV,KVL Sathwika,Adarsh Kasturi,Gandla Dilip Raj,PVN Supranitha,Harsh Udai
机构: Indian Institute of Technology Hyderabad(印度理工学院海得拉巴分校); S R Patil Medical College(印度S R Patil医学院); S Nijalingappa Medical College(印度S Nijalingappa医学院); Sri Chamundeshwari Medical College, Hospital & Research(印度Sri Chamundeshwari医学学院、医院与研究机构)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Artificial intelligence is poised to augment dermatological care by enabling scalable image-based diagnostics. Yet, the development of robust and equitable models remains hindered by datasets that fail to capture the clinical and demographic complexity of real-world practice. This complexity stems from region-specific disease distributions, wide variation in skin tones, and the underrepresentation of outpatient scenarios from non-Western populations. We introduce DermaCon-IN, a prospectively curated dermatology dataset comprising over 5,450 clinical images from approximately 3,000 patients across outpatient clinics in South India. Each image is annotated by board-certified dermatologists with over 240 distinct diagnoses, structured under a hierarchical, etiology-based taxonomy adapted from Rook’s classification. The dataset captures a wide spectrum of dermatologic conditions and tonal variation commonly seen in Indian outpatient care. We benchmark a range of architectures including convolutional models (ResNet, DenseNet, EfficientNet), transformer-based models (ViT, MaxViT, Swin), and Concept Bottleneck Models to establish baseline performance and explore how anatomical and concept-level cues may be integrated. These results are intended to guide future efforts toward interpretable and clinically realistic models. DermaCon-IN provides a scalable and representative foundation for advancing dermatology AI in real-world settings.
zh
[CV-124] LinGuinE: Longitudinal Guidance Estimation for Volumetric Lung Tumour Segmentation
【速读】:该论文旨在解决肺部大体肿瘤体积在放射治疗、手术干预及化疗反应评估中的纵向分割问题,当前缺乏有效的自动化或半自动化解决方案。其关键解决方案是LinGuinE,该方法通过放射科医生提供的任意时间点的初始肿瘤位置输入,利用刚性配准将该时间点内的采样点传播至其他时间点,并通过点击有效性分类器筛选仍位于肿瘤内的点,从而在新时间点自动生成分割结果。
链接: https://arxiv.org/abs/2506.06092
作者: Nadine Garibli,Mayank Patwari,Bence Csiba,Yi Wei,Kostas Sidiropoulos
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 3 figures
点击查看摘要
Abstract:Segmentation of lung gross tumour volumes is an important first step in radiotherapy and surgical intervention, and is starting to play a role in assessing chemotherapy response. Response to a drug is measured by tracking the tumour volumes over a series of CT scans over a time period i.e. a longitudinal study. However, there currently exist few solutions for automated or semi-automated longitudinal tumour segmentation. This paper introduces LinGuinE, an automated method to segment a longitudinal series of lung tumours. A radiologist must provide an initial input, indicating the location of the tumour in a CT scan at an arbitrary time point. LinGuinE samples points inside this tumour and propagates them to another time point using rigid registration. A click validity classifier selects points which still fall within the tumour; these are used to automatically create a segmentation in the new time point. We test LinGuinE on a dataset acquired from a phase 3 clinical trial for lung tumours and the publicly available 4-D lung CBCT dataset. We find that LinGuinE improves the Dice on both test sets by over 20% (p 0.05) across 63 longitudinal studies. We show that any time point can be used as a starting point, conduct ablation experiments, and find that our LinGuinE setup yields the best results on both test datasets.
zh
[CV-125] FPDANet: A Multi-Section Classification Model for Intelligent Screening of Fetal Ultrasound
【速读】:该论文旨在解决胎儿超声图像分类任务中因图像对比度低、相似性高和噪声大而导致的分类效果不佳问题。其解决方案的关键在于提出一种基于双边多尺度信息融合网络的FPDANet模型,该模型通过设计位置注意力机制(DAN)模块来建立不同空间位置特征之间的依赖关系,增强特征表示,并引入双边多尺度(FPAN)信息融合模块以捕捉不同尺度下的上下文和全局特征依赖关系,从而提升模型的表达能力和分类性能。
链接: https://arxiv.org/abs/2506.06054
作者: Minglang Chen,Jie He,Caixu Xu,Bocheng Liang,Shengli Li,Guannan He,Xiongjie Tao
机构: Wuzhou University (梧州大学); Hunan University (湖南大学); Shenzhen Maternal and Child Healthcare Hospital (深圳市妇幼保健医院); Sichuan Provincial Maternity and Child Healthcare Hospital (四川省妇幼保健院); Macau University of Science and Technology (澳门科技大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:ResNet has been widely used in image classification tasks due to its ability to model the residual dependence of constant mappings for linear computation. However, the ResNet method adopts a unidirectional transfer of features and lacks an effective method to correlate contextual information, which is not effective in classifying fetal ultrasound images in the classification task, and fetal ultrasound images have problems such as low contrast, high similarity, and high noise. Therefore, we propose a bilateral multi-scale information fusion network-based FPDANet to address the above challenges. Specifically, we design the positional attention mechanism (DAN) module, which utilizes the similarity of features to establish the dependency of different spatial positional features and enhance the feature representation. In addition, we design a bilateral multi-scale (FPAN) information fusion module to capture contextual and global feature dependencies at different feature scales, thereby further improving the model representation. FPDANet classification results obtained 91.05% and 100% in Top-1 and Top-5 metrics, respectively, and the experimental results proved the effectiveness and robustness of FPDANet.
zh
[CV-126] Noninvasive precision modulation of high-level neural population activity via natural vision perturbations
【速读】:该论文试图解决神经科学中精确控制神经活动的问题,即在不干扰周围神经元的情况下,调节大脑深处的目标神经元活动。传统上,这一问题通常通过侵入性技术来实现,而本文提出了一种非侵入性的解决方案,其关键在于利用对自然视觉输入的微小扰动,实现对高级灵长类动物腹侧视觉流(ventral stream)中特定神经元群体的精确调控。实验结果表明,基于当前机器可执行的模型可以设计出非侵入性、视觉传递且可能不可察觉的神经干预,其精度达到单个神经元级别。
链接: https://arxiv.org/abs/2506.05633
作者: Guy Gaziv,Sarah Goulding,Ani Ayvazian-Hancock,Yoon Bai,James J. DiCarlo
机构: Massachusetts Institute of Technology (麻省理工学院); Center for Brains, Minds, and Machines (脑、心智与机器中心); MIT Quest for Intelligence (MIT智能探索计划)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
点击查看摘要
Abstract:Precise control of neural activity – modulating target neurons deep in the brain while leaving nearby neurons unaffected – is an outstanding challenge in neuroscience, generally achieved through invasive techniques. This study investigates the possibility of precisely and noninvasively modulating neural activity in the high-level primate ventral visual stream via perturbations on one’s natural visual feed. When tested on macaque inferior temporal (IT) neural populations, we found quantitative agreement between the model-predicted and biologically realized effect: strong modulation concentrated on targeted neural sites. We extended this to demonstrate accurate injection of experimenter-chosen neural population patterns via subtle perturbations applied on the background of typical natural visual feeds. These results highlight that current machine-executable models of the ventral stream can now design noninvasive, visually-delivered, possibly imperceptible neural interventions at the resolution of individual neurons.
zh
[CV-127] Deep histological synthesis from mass spectrometry imaging for multimodal registration
【速读】:该论文试图解决组织学图像与质谱成像(Mass Spectrometry Imaging, MSI)之间的配准问题,这一问题由于两种模态在图像形成过程和维度上的根本差异而持续存在。解决方案的关键在于利用pix2pix模型从MSI数据中合成组织学图像,从而实现单模态配准的有效性。初步结果表明,该方法生成的合成组织学图像具有有限的伪影,并在互信息(Mutual Information, MI)和结构相似性指数度量(Structural Similarity Index Measures, SSIM)上分别提升了+0.924和+0.419。
链接: https://arxiv.org/abs/2506.05441
作者: Kimberley M. Bird,Xujiong Ye,Alan M. Race,James M. Brown
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Medical Image Understanding and Analysis (MIUA) 2025 Extended Abstract Submission
点击查看摘要
Abstract:Registration of histological and mass spectrometry imaging (MSI) allows for more precise identification of structural changes and chemical interactions in tissue. With histology and MSI having entirely different image formation processes and dimensionalities, registration of the two modalities remains an ongoing challenge. This work proposes a solution that synthesises histological images from MSI, using a pix2pix model, to effectively enable unimodal registration. Preliminary results show promising synthetic histology images with limited artifacts, achieving increases in mutual information (MI) and structural similarity index measures (SSIM) of +0.924 and +0.419, respectively, compared to a baseline U-Net model. Our source code is available on GitHub: this https URL.
zh
[CV-128] Enhancing Neural Autoregressive Distribution Estimators for Image Reconstruction
【速读】:该论文试图解决在仅观察到图像部分像素(称为像素块)的情况下,如何准确预测未观测到的图像部分的问题。其解决方案的关键在于提出一种改进的卷积神经自回归分布估计器(ConvNADE)模型,该模型具有广义性和计算效率,并适用于实数值和彩色图像。此外,研究还探讨了基于拟蒙特卡洛理论设计的低差异像素块与随机像素块在图像重建质量上的差异,结果表明采用类似低差异序列的像素选择策略能够降低测试损失并生成更逼真的重构图像。
链接: https://arxiv.org/abs/2506.05391
作者: Ambrose Emmett-Iwaniw,Nathan Kirk
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
备注: Accepted for publication in conference proceedings, MCQMC 2024
点击查看摘要
Abstract:Autoregressive models are often employed to learn distributions of image data by decomposing the D -dimensional density function into a product of one-dimensional conditional distributions. Each conditional depends on preceding variables (pixels, in the case of image data), making the order in which variables are processed fundamental to the model performance. In this paper, we study the problem of observing a small subset of image pixels (referred to as a pixel patch) to predict the unobserved parts of the image. As our prediction mechanism, we propose a generalized and computationally efficient version of the convolutional neural autoregressive distribution estimator (ConvNADE) model adapted for real-valued and color images. Moreover, we investigate the quality of image reconstruction when observing both random pixel patches and low-discrepancy pixel patches inspired by quasi-Monte Carlo theory. Experiments on benchmark datasets demonstrate that choosing the pixels akin to a low-discrepancy sequence reduces test loss and produces more realistic reconstructed images.
zh
人工智能
[AI-0] Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias ICML2025
【速读】:该论文试图解决在通过权重矩阵的特征值谱进行深度神经网络(DNN)诊断时,权重矩阵的长宽比对重尾性度量估计的影响问题。这种影响会导致模型诊断不准确以及层级训练超参数分配的偏差。解决方案的关键在于提出FARMS(Fixed-Aspect-Ratio Matrix Subsampling)方法,该方法通过以固定长宽比对权重矩阵进行子采样,并计算这些子矩阵的平均经验谱密度(ESD)来估计重尾性,从而有效缓解长宽比带来的偏差。
链接: https://arxiv.org/abs/2506.06280
作者: Yuanzhe Hu,Kinshuk Goel,Vlad Killiakov,Yaoqing Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 14 figures, published to ICML 2025
点击查看摘要
Abstract:Diagnosing deep neural networks (DNNs) through the eigenspectrum of weight matrices has been an active area of research in recent years. At a high level, eigenspectrum analysis of DNNs involves measuring the heavytailness of the empirical spectral densities (ESD) of weight matrices. It provides insight into how well a model is trained and can guide decisions on assigning better layer-wise training hyperparameters. In this paper, we address a challenge associated with such eigenspectrum methods: the impact of the aspect ratio of weight matrices on estimated heavytailness metrics. We demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in estimating heavytailness metrics, leading to inaccurate model diagnosis and layer-wise hyperparameter assignment. To overcome this challenge, we propose FARMS (Fixed-Aspect-Ratio Matrix Subsampling), a method that normalizes the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavytailness of the original ESD, we measure the average ESD of these subsampled submatrices. We show that measuring the heavytailness of these submatrices with the fixed aspect ratio can effectively mitigate the aspect ratio bias. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, FARMS uniformly improves the accuracy of eigenspectrum analysis while enabling more effective layer-wise hyperparameter assignment in these application domains. In one of the LLM pruning experiments, FARMS reduces the perplexity of the LLaMA-7B model by 17.3% when compared with the state-of-the-art method.
zh
[AI-1] Distillation Robustifies Unlearning
【速读】:该论文试图解决当前大语言模型(Large Language Model, LLM)的去训练(unlearning)方法不够稳健的问题,即这些方法在经过少量微调(finetuning)后容易被逆转。解决方案的关键在于利用知识蒸馏(distillation)来增强去训练的鲁棒性。论文提出了一种名为UNDO(Unlearn-Noise-Distill-on-Outputs)的方法,通过将一个已去训练的模型蒸馏到其部分噪声化的副本中,在计算成本与鲁棒性之间引入可调节的权衡,从而在合成语言和算术任务上建立了新的帕累托前沿。
链接: https://arxiv.org/abs/2506.06278
作者: Bruce W. Lee,Addie Foote,Alex Infanger,Leni Shor,Harish Kamath,Jacob Goldman-Wetzler,Bryce Woodworth,Alex Cloud,Alexander Matt Turner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized student to imitate an unlearned model transfers desired behaviors while leaving undesired capabilities behind. In other words, distillation robustifies unlearning. Building on this insight, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a partially noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.
zh
[AI-2] Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, RL)在数据有限情况下面临的高认识不确定性(epistemic uncertainty)问题,以及现有方法依赖固定保守策略导致的适应性和泛化能力受限的问题。其解决方案的关键在于提出了一种新颖的双贝叶斯离线模型基础(Model-Based, MB)规划方法——Reflect-then-Plan (RefPlan),通过将规划过程重新建模为贝叶叶后验估计,统一了不确定性建模与MB规划,并在部署时利用实时观测更新环境动态的信念,通过边缘化将不确定性纳入MB规划中。
链接: https://arxiv.org/abs/2506.06261
作者: Jihwan Jeong,Xiaoyu Wang,Jingmin Wang,Scott Sanner,Pascal Poupart
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Offline reinforcement learning (RL) is crucial when online exploration is costly or unsafe but often struggles with high epistemic uncertainty due to limited data. Existing methods rely on fixed conservative policies, restricting adaptivity and generalization. To address this, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach. RefPlan unifies uncertainty modeling and MB planning by recasting planning as Bayesian posterior estimation. At deployment, it updates a belief over environment dynamics using real-time observations, incorporating uncertainty into MB planning via marginalization. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.
zh
[AI-3] DesignBench: A Comprehensive Benchmark for MLLM -based Front-end Code Generation
【速读】:该论文旨在解决现有前端UI代码生成基准在框架支持、任务多样性及评估维度方面的不足。其解决方案的关键在于提出DesignBench,这是一个多框架、多任务的评估基准,支持React、Vue、Angular及原生HTML/CSS,并涵盖生成、编辑和修复三种核心前端任务,能够全面评估多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动化前端工程中的能力。
链接: https://arxiv.org/abs/2506.06251
作者: Jingyu Xiao,Ming Wang,Man Ho Lam,Yuxuan Wan,Junliang Liu,Yintong Huo,Michael R. Lyu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development becomes predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks. (2) Existing evaluations focus solely on the UI code generation task, whereas practical UI development involves several iterations, including refining editing, and repairing issues. (3) Current benchmarks employ unidimensional evaluation, lacking investigation into influencing factors like task difficulty, input context variations, and in-depth code-level analysis. To bridge these gaps, we introduce DesignBench, a multi-framework, multi-task evaluation benchmark for assessing MLLMs’ capabilities in automated front-end engineering. DesignBench encompasses three widely-used UI frameworks (React, Vue, and Angular) alongside vanilla HTML/CSS, and evaluates on three essential front-end tasks (generation, edit, and repair) in real-world development workflows. DesignBench contains 900 webpage samples spanning over 11 topics, 9 edit types, and 6 issue categories, enabling detailed analysis of MLLM performance across multiple dimensions. Our systematic evaluation reveals critical insights into MLLMs’ framework-specific limitations, task-related bottlenecks, and performance variations under different conditions, providing guidance for future research in automated front-end development. Our code and data are available at this https URL.
zh
[AI-4] “We need to avail ourselves of GenAI to enhance knowledge distribution”: Empowering Older Adults through GenAI Literacy
【速读】:该论文旨在解决如何有效提升老年人群体对生成式 AI(Generative AI)的认知与理解,特别是在其面临技术采纳障碍和需要定制化教育支持的背景下。研究的核心解决方案是通过一个名为 Litti 的聊天机器人提供生成式 AI 素养教育,评估其在提高老年人 AI 知识、安全意识及伦理使用方面的效果。
链接: https://arxiv.org/abs/2506.06225
作者: Eunhye Grace Ko,Shaini Nanayakkara,Earl W. Huff Jr
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As generative AI (GenAI) becomes increasingly widespread, it is crucial to equip users, particularly vulnerable populations such as older adults (65 and older), with the knowledge to understand its benefits and potential risks. Older adults often exhibit greater reservations about adopting emerging technologies and require tailored literacy support. Using a mixed methods approach, this study examines strategies for delivering GenAI literacy to older adults through a chatbot named Litti, evaluating its impact on their AI literacy (knowledge, safety, and ethical use). The quantitative data indicated a trend toward improved AI literacy, though the results were not statistically significant. However, qualitative interviews revealed diverse levels of familiarity with generative AI and a strong desire to learn more. Findings also show that while Litti provided a positive learning experience, it did not significantly enhance participants’ trust or sense of safety regarding GenAI. This exploratory case study highlights the challenges and opportunities in designing AI literacy education for the rapidly growing older adult population.
zh
[AI-5] Integer Linear Programming Preprocessing for Maximum Satisfiability
【速读】:该论文旨在解决最大满足性问题(MaxSAT)求解中的优化效率问题,重点研究整数线性规划(ILP)预处理技术对MaxSAT求解的影响。其解决方案的关键在于通过引入ILP预处理技术,提升求解器如WMaxCDCL-OpenWbo1200在未加权赛道中的性能,使其能够解决更多实例,同时减少在求解过程中调用ILP求解器的频率。
链接: https://arxiv.org/abs/2506.06216
作者: Jialu Zhang,Chu-Min Li,Sami Cherif,Shuolin Li,Zhifei Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The Maximum Satisfiability problem (MaxSAT) is a major optimization challenge with numerous practical applications. In recent MaxSAT evaluations, most MaxSAT solvers have adopted an ILP solver as part of their portfolios. This paper investigates the impact of Integer Linear Programming (ILP) preprocessing techniques on MaxSAT solving. Experimental results show that ILP preprocessing techniques help WMaxCDCL-OpenWbo1200, the winner of the MaxSAT evaluation 2024 in the unweighted track, solve 15 additional instances. Moreover, current state-of-the-art MaxSAT solvers heavily use an ILP solver in their portfolios, while our proposed approach reduces the need to call an ILP solver in a portfolio including WMaxCDCL or MaxCDCL.
zh
[AI-6] Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
【速读】:该论文旨在解决现代机器人导航系统在多样且复杂的室内环境中适应性不足的问题。传统方法依赖于多个模块化的小模型或基于规则的系统,难以适应新环境。其解决方案的关键在于提出Astra,一种综合的双模型架构,包括Astra-Global和Astra-Local。Astra-Global作为一种多模态大语言模型(Large Language Model),利用混合拓扑语义图作为全局地图,处理视觉和语言输入以实现自我和目标定位;Astra-Local则是一个多任务网络,通过4D时空编码器生成鲁棒的4D特征,并结合流匹配和新型掩码ESDF损失函数优化局部路径规划,同时通过Transformer编码器融合多传感器输入进行里程计估计。
链接: https://arxiv.org/abs/2506.06205
作者: Sheng Chen,Peiyu He,Jiaxin Hu,Ziyang Liu,Yansheng Wang,Tao Xu,Chi Zhang,Chongchong Zhang,Chao An,Shiyu Cai,Duo Cao,Kangping Chen,Shuai Chu,Tianwei Chu,Mingdi Dan,Min Du,Weiwei Fang,Pengyou Fu,Junkai Hu,Xiaowei Jiang,Zhaodi Jiang,Fuxuan Li,Jun Li,Minghui Li,Mingyao Li,Yanchang Li,Zhibin Li,Guangming Liu,Kairui Liu,Lihao Liu,Weizhi Liu,Xiaoshun Liu,Yufei Liu,Yunfei Liu,Qiang Lu,Yuanfei Luo,Xiang Lv,Hongying Ma,Sai Ma,Lingxian Mi,Sha Sa,Hongxiang Shu,Lei Tian,Chengzhi Wang,Jiayu Wang,Kaijie Wang,Qingyi Wang,Renwen Wang,Tao Wang,Wei Wang,Xirui Wang,Chao Wei,Xuguang Wei,Zijun Xia,Zhaohao Xiao,Tingshuai Yan,Liyan Yang,Yifan Yang,Zhikai Yang,Zhong Yin,Li Yuan,Liuchun Yuan,Chi Zhang,Jinyang Zhang,Junhui Zhang,Linge Zhang,Zhenyi Zhang,Zheyu Zhang,Dongjie Zhu,Hang Li,Yangang Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Astra Technical Report
点击查看摘要
Abstract:Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal LLM, processes vision and language inputs to perform self and goal localization using a hybrid topological-semantic graph as the global map, and outperforms traditional visual place recognition methods. Astra-Local, a multitask network, handles local path planning and odometry estimation. Its 4D spatial-temporal encoder, trained through self-supervised learning, generates robust 4D features for downstream tasks. The planning head utilizes flow matching and a novel masked ESDF loss to minimize collision risks for generating local trajectories, and the odometry head integrates multi-sensor inputs via a transformer encoder to predict the relative pose of the robot. Deployed on real in-house mobile robots, Astra achieves high end-to-end mission success rate across diverse indoor environments.
zh
[AI-7] MLOps with Microservices: A Case Study on the Maritime Domain
【速读】:该论文试图解决在海洋领域中实现异常检测的挑战,具体是通过构建一个机器学习增强系统(Machine Learning-Enabled System, MLES)来提升检测能力。解决方案的关键在于采用基于合同的设计方法(contract-based design)与MLOps相结合,利用代码、模型和数据合同在系统服务之间建立明确的规范和协作机制,从而支持多团队并行开发和系统的高效集成。
链接: https://arxiv.org/abs/2506.06202
作者: Renato Cordeiro Ferreira(1,2,3),Rowanne Trapmann(1,2,3),Willem-Jan van den Heuvel(1,2,3) ((1) Jheronimus Academy of Data Science, (2) Technical University of Eindhoven, (3) Tilburg University)
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 3 figures, to be published in SummerSOC 2025
点击查看摘要
Abstract:This case study describes challenges and lessons learned on building Ocean Guard: a Machine Learning-Enabled System (MLES) for anomaly detection in the maritime domain. First, the paper presents the system’s specification, and architecture. Ocean Guard was designed with a microservices’ architecture to enable multiple teams to work on the project in parallel. Then, the paper discusses how the developers adapted contract-based design to MLOps for achieving that goal. As a MLES, Ocean Guard employs code, model, and data contracts to establish guidelines between its services. This case study hopes to inspire software engineers, machine learning engineers, and data scientists to leverage similar approaches for their systems.
zh
[AI-8] (AI peers) are people learning from the same standpoint: Perception of AI characters in a Collaborative Science Investigation
【速读】:该论文试图解决21世纪教育需求复杂化背景下,课堂教学活动与个性化学习或评估实践之间的持续差距。其解决方案的关键在于引入生成式AI (Generative AI) 生成的角色,特别是在基于情境的评估(Scenario-Based Assessment, SBA)中,通过模拟代理提供真实的社会互动情境,以评估能力导向的构念并减少现实互动中的不可预测性。研究重点考察了学习者对AI角色作为导师和团队成员的感知,包括信任、社会存在感和有效性,并分析了这些因素如何影响学习者采用AI角色的意愿。
链接: https://arxiv.org/abs/2506.06165
作者: Eunhye Grace Ko,Soo Hyoung Joo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 14 pages
点击查看摘要
Abstract:While the complexity of 21st-century demands has promoted pedagogical approaches to foster complex competencies, a persistent gap remains between in-class learning activities and individualized learning or assessment practices. To address this, studies have explored the use of AI-generated characters in learning and assessment. One attempt is scenario-based assessment (SBA), a technique that not only measures but also fosters the development of competencies throughout the assessment process. SBA introduces simulated agents to provide an authentic social-interactional context, allowing for the assessment of competency-based constructs while mitigating the unpredictability of real-life interactions. Recent advancements in multimodal AI, such as text-to-video technology, allow these agents to be enhanced into AI-generated characters. This mixed-method study investigates how learners perceive AI characters taking the role of mentor and teammates in an SBA mirroring the context of a collaborative science investigation. Specifically, we examined the Likert scale responses of 56 high schoolers regarding trust, social presence, and effectiveness. We analyzed the relationships between these factors and their impact on the intention to adopt AI characters through PLS-SEM. Our findings indicated that learners’ trust shaped their sense of social presence with the AI characters, enhancing perceived effectiveness. Qualitative analysis further highlighted factors that foster trust, such as material credibility and alignment with learning goals, as well as the pivotal role of social presence in creating a collaborative context. This paper was accepted as an full paper for AIED 2025. Comments: 14 pages Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.06165 [cs.HC] (or arXiv:2506.06165v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2506.06165 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-9] Recommender systems stigmergy and the tyranny of popularity
【速读】:该论文试图解决当前科学推荐系统(Scientific Recommender Systems)过度依赖流行度(popularity)导致的学术同质化和结构性不平等问题,这些问题抑制了创新和多样化观点的发展。论文提出的解决方案关键在于对搜索平台进行重构,引入用户特定的校准机制,使研究人员能够手动调整如流行度、新颖性和相关性等因子的权重,并建议平台开发者利用词嵌入(word embeddings)和大语言模型(LLMs)来增强用户的自主性。
链接: https://arxiv.org/abs/2506.06162
作者: Zackary Okun Dunivin,Paul E. Smaldino
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Scientific recommender systems, such as Google Scholar and Web of Science, are essential tools for discovery. Search algorithms that power work through stigmergy, a collective intelligence mechanism that surfaces useful paths through repeated engagement. While generally effective, this ``rich-get-richer’’ dynamic results in a small number of high-profile papers that dominate visibility. This essay argues argue that these algorithm over-reliance on popularity fosters intellectual homogeneity and exacerbates structural inequities, stifling innovative and diverse perspectives critical for scientific progress. We propose an overhaul of search platforms to incorporate user-specific calibration, allowing researchers to manually adjust the weights of factors like popularity, recency, and relevance. We also advise platform developers on how word embeddings and LLMs could be implemented in ways that increase user autonomy. While our suggestions are particularly pertinent to aligning recommender systems with scientific values, these ideas are broadly applicable to information access systems in general. Designing platforms that increase user autonomy is an important step toward more robust and dynamic information
zh
[AI-10] Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems
【速读】:该论文旨在解决Retrieval-Augmented Generation (RAG)系统在面对外部语料库污染攻击时的脆弱性问题,此类攻击通过注入恶意文档来操纵生成结果。现有攻击策略通常将检索和生成阶段视为独立过程,限制了攻击效果。论文提出的解决方案是Joint-GCG框架,其关键在于通过三项创新实现对检索器和生成器模型的联合梯度攻击:跨词汇投影用于对齐嵌入空间,梯度标记对齐用于同步标记级梯度信号,自适应加权融合用于动态平衡攻击目标。
链接: https://arxiv.org/abs/2506.06151
作者: Haowei Wang,Rupeng Zhang,Junjie Wang,Mingyang Li,Yuekai Huang,Dandan Wang,Qing Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by retrieving relevant documents from external corpora before generating responses. This approach significantly expands LLM capabilities by leveraging vast, up-to-date external knowledge. However, this reliance on external knowledge makes RAG systems vulnerable to corpus poisoning attacks that manipulate generated outputs via poisoned document injection. Existing poisoning attack strategies typically treat the retrieval and generation stages as disjointed, limiting their effectiveness. We propose Joint-GCG, the first framework to unify gradient-based attacks across both retriever and generator models through three innovations: (1) Cross-Vocabulary Projection for aligning embedding spaces, (2) Gradient Tokenization Alignment for synchronizing token-level gradient signals, and (3) Adaptive Weighted Fusion for dynamically balancing attacking objectives. Evaluations demonstrate that Joint-GCG achieves at most 25% and an average of 5% higher attack success rate than previous methods across multiple retrievers and generators. While optimized under a white-box assumption, the generated poisons show unprecedented transferability to unseen models. Joint-GCG’s innovative unification of gradient-based attacks across retrieval and generation stages fundamentally reshapes our understanding of vulnerabilities within RAG systems. Our code is available at this https URL.
zh
[AI-11] Decomposability-Guaranteed Cooperative Coevolution for Large-Scale Itinerary Planning
【速读】:该论文旨在解决大规模行程规划问题(Large-scale itinerary planning),该问题属于旅行商问题(Traveling Salesman Problem)的一种变体,目标是在满足旅行时长约束的前提下,最大化所收集的兴趣点(Points of Interest, POIs)得分,同时最小化旅行时间和成本。论文的核心贡献在于提出了一个基于弱可分解性定义的多目标协同进化算法,其关键在于通过动态分解策略、优化潜力定义以及计算资源分配策略,有效应对组件不平衡和交互问题,从而提升大规模场景下的求解性能。
链接: https://arxiv.org/abs/2506.06121
作者: Ziyu Zhang,Peilan Xu,Yuetong Sun,Yuhui Shi,Wenjian Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large-scale itinerary planning is a variant of the traveling salesman problem, aiming to determine an optimal path that maximizes the collected points of interest (POIs) scores while minimizing travel time and cost, subject to travel duration constraints. This paper analyzes the decomposability of large-scale itinerary planning, proving that strict decomposability is difficult to satisfy, and introduces a weak decomposability definition based on a necessary condition, deriving the corresponding graph structures that fulfill this property. With decomposability guaranteed, we propose a novel multi-objective cooperative coevolutionary algorithm for large-scale itinerary planning, addressing the challenges of component imbalance and interactions. Specifically, we design a dynamic decomposition strategy based on the normalized fitness within each component, define optimization potential considering component scale and contribution, and develop a computational resource allocation strategy. Finally, we evaluate the proposed algorithm on a set of real-world datasets. Comparative experiments with state-of-the-art multi-objective itinerary planning algorithms demonstrate the superiority of our approach, with performance advantages increasing as the problem scale grows.
zh
[AI-12] owards Lifecycle Unlearning Commitment Management: Measuring Sample-level Unlearning Completeness USENIX-SECURITY
【速读】:该论文试图解决机器遗忘(machine unlearning)中评估其效果的难题,特别是在数据隐私和安全日益受到关注的背景下,如何有效衡量模型移除特定数据影响的程度。现有方法在评估过程中存在两个关键问题:一是最大化Membership Inference Attacks (MIAs)效果需要高昂的计算资源,二是MIAs难以捕捉近似遗忘中的细微变化。该论文提出的解决方案是Interpolated Approximate Measurement (IAM),其关键在于通过插值模型在查询样本上的泛化-拟合行为差距来量化样本级别的遗忘完整性,从而在二元包含测试中实现强性能,并在近似遗忘中表现出高相关性,且仅需一个预训练的影子模型即可扩展至大语言模型(LLMs)。
链接: https://arxiv.org/abs/2506.06112
作者: Cheng-Long Wang,Qi Li,Zihang Xiang,Yinzhi Cao,Di Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: To appear in the Proceedings of USENIX Security Symposium, 2025
点击查看摘要
Abstract:Growing concerns over data privacy and security highlight the importance of machine unlearning–removing specific data influences from trained models without full retraining. Techniques like Membership Inference Attacks (MIAs) are widely used to externally assess successful unlearning. However, existing methods face two key limitations: (1) maximizing MIA effectiveness (e.g., via online attacks) requires prohibitive computational resources, often exceeding retraining costs; (2) MIAs, designed for binary inclusion tests, struggle to capture granular changes in approximate unlearning. To address these challenges, we propose the Interpolated Approximate Measurement (IAM), a framework natively designed for unlearning inference. IAM quantifies sample-level unlearning completeness by interpolating the model’s generalization-fitting behavior gap on queried samples. IAM achieves strong performance in binary inclusion tests for exact unlearning and high correlation for approximate unlearning–scalable to LLMs using just one pre-trained shadow model. We theoretically analyze how IAM’s scoring mechanism maintains performance efficiently. We then apply IAM to recent approximate unlearning algorithms, revealing general risks of both over-unlearning and under-unlearning, underscoring the need for stronger safeguards in approximate unlearning systems. The code is available at this https URL.
zh
[AI-13] xt-to-LoRA: Instant Transformer Adaption ICML2025
【速读】:该论文试图解决基础模型在特定任务上适应过程中需要大量数据集标注和重复微调所带来的高成本与低效率问题。解决方案的关键在于提出Text-to-LoRA (T2L),这是一种能够仅基于目标任务的自然语言描述实时生成低秩适配器(LoRA)的超网络,通过一次低成本的前向传播即可完成适配器构建,从而实现高效、灵活的任务适应。
链接: https://arxiv.org/abs/2506.06105
作者: Rujikorn Charakorn,Edoardo Cetin,Yujin Tang,Robert Tjarko Lange
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2025
点击查看摘要
Abstract:While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyper-parameter choices. To overcome these limitations, we introduce Text-to-LoRA (T2L), a model capable of adapting Large Language Models on the fly solely based on a natural language description of the target task. T2L is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements. Our code is available at this https URL
zh
[AI-14] Microgrids Coalitions for Energy Market Balancing
【速读】:该论文旨在解决在可再生能源接入电力配电网后,能源市场可能出现的不平衡问题,这些问题可能导致停电、财务损失或电网不稳定。解决方案的关键在于通过构建最优的微电网联盟(microgrid coalition),使其能够在能源过剩时从市场吸收能量,在能源短缺时向市场供应能量,从而提升配电网的管理效率。该方法将最优联盟识别问题建模为优化问题,并结合受合作博弈论启发的策略与遗传算法进行求解,通过个体的重组与变异实现种群进化,其适应度函数基于联盟产生的总价值与交易超出市场供需时的惩罚之间的差异,最终依据Shapley值公平分配联盟收益。
链接: https://arxiv.org/abs/2506.06058
作者: Viorica Chifu,Cristina Bianca Pop,Tudor Cioara,Ionut Anghel
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:With the integration of renewable sources in electricity distribution networks, the need to develop intelligent mechanisms for balancing the energy market has arisen. In the absence of such mechanisms, the energy market may face imbalances that can lead to power outages, financial losses or instability at the grid level. In this context, the grouping of microgrids into optimal coalitions that can absorb energy from the market during periods of surplus or supply energy to the market during periods of is a key aspect in the efficient management of distribution networks. In this article, we propose a method that identify an optimal microgrids coalition capable of addressing the dynamics of the energy market. The proposed method models the problem of identifying the optimal coalition as an optimization problem that it solves by combining a strategy inspired by cooperative game theory with a memetic algorithm. An individual is represented as a coalition of microgrids and the evolution of population of individuals over generations is assured by recombination and mutation. The fitness function is defined as the difference between the total value generated by the coalition and a penalty applied to the coalition when the energy traded by coalition exceeds the energy available/demanded on/by the energy market. The value generated by the coalition is calculated based on the profit obtained by the collation if it sells energy on the market during periods of deficit or the savings obtained by the coalition if it buys energy on the market during periods of surplus and the costs associated with the trading process. This value is divided equitably among the coalition members, according to the Shapley value, which considers the contribution of each one to the formation of collective value.
zh
[AI-15] CP-Bench: Evaluating Large Language Models for Constraint Modelling
【速读】:该论文试图解决约束建模(constraint modelling)在Constraint Programming (CP) 领域中的瓶颈问题,即如何提高大规模、多样化组合问题的建模效率与准确性。其关键解决方案是引入CP-Bench基准数据集,并评估大型语言模型(LLMs)在不同抽象层次和语法结构的约束建模系统中的表现,包括高阶的MiniZinc语言、基于Python的CPMpy库以及低阶的OR-Tools CP-SAT Python接口。通过系统地测试基于提示(prompt-based)和推理时计算方法的优化策略,研究证明了Python框架在建模便利性上的优势,以及结合丰富文档的系统提示、重复采样和自验证机制可显著提升LLMs生成有效约束模型的性能。
链接: https://arxiv.org/abs/2506.06052
作者: Kostis Michailidis,Dimos Tsouros,Tias Guns
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Combinatorial problems are present in a wide range of industries. Constraint Programming (CP) is a well-suited problem-solving paradigm, but its core process, namely constraint modelling, is a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) as modelling assistants, transforming combinatorial problem descriptions to executable constraint models, similar to coding assistants. However, the existing evaluation datasets for constraint modelling are often limited to small, homogeneous, or domain-specific instances, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing CP-Bench, a novel benchmark dataset that includes a diverse set of well-known combinatorial problem classes sourced from the CP community, structured explicitly for evaluating LLM-driven CP modelling. With this dataset, and given the variety of constraint modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax: the high-level MiniZinc language and Python-based CPMpy library, and the lower-level Python interface of the OR-Tools CP-SAT solver. In order to enhance the ability of LLMs to produce valid constraint models, we systematically evaluate the use of prompt-based and inference-time compute methods adapted from existing LLM-based code generation research. Our results underscore the modelling convenience provided by Python-based frameworks, as well as the effectiveness of documentation-rich system prompts, which, augmented with repeated sampling and self-verification, achieve further improvements, reaching up to 70% accuracy on this new, highly challenging benchmark.
zh
[AI-16] End-to-End Framework for Robot Lawnmower Coverag e Path Planning using Cellular Decomposition ICRA2025
【速读】:该论文旨在解决自主机器人割草机在复杂和不规则形状草坪中实现高效覆盖路径规划(Coverage Path Planning, CPP)的问题。解决方案的关键在于提出了一种名为AdaptiveDecompositionCPP的新算法,该算法结合了单元分解与自适应合并策略,以减少非割草行驶距离,从而提高作业效率。
链接: https://arxiv.org/abs/2506.06028
作者: Nikunj Shah,Utsav Dey,Kenji Nishimiya
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, ICRA 2025, Workshop on Field Robotics
点击查看摘要
Abstract:Efficient Coverage Path Planning (CPP) is necessary for autonomous robotic lawnmowers to effectively navigate and maintain lawns with diverse and irregular shapes. This paper introduces a comprehensive end-to-end pipeline for CPP, designed to convert user-defined boundaries on an aerial map into optimized coverage paths seamlessly. The pipeline includes user input extraction, coordinate transformation, area decomposition and path generation using our novel AdaptiveDecompositionCPP algorithm, preview and customization through an interactive coverage path visualizer, and conversion to actionable GPS waypoints. The AdaptiveDecompositionCPP algorithm combines cellular decomposition with an adaptive merging strategy to reduce non-mowing travel thereby enhancing operational efficiency. Experimental evaluations, encompassing both simulations and real-world lawnmower tests, demonstrate the effectiveness of the framework in coverage completeness and mowing efficiency.
zh
[AI-17] Optimization-Free Universal Watermark Forgery with Regenerative Diffusion Models
【速读】:该论文试图解决生成式图像中水印(watermark)被伪造的风险问题,特别是针对基于扩散模型的合成图像。其解决方案的关键在于提出一种无需优化的通用水印伪造方法——PnP(Plug-and-Plant),该方法通过利用现有的再生扩散模型,直接从目标图像中提取并整合水印,而无需额外的优化过程。这一方法独立于目标图像的来源或使用的水印方案,显著扩展了伪造攻击的范围,并对当前扩散模型水印技术的安全性构成了更大挑战。
链接: https://arxiv.org/abs/2506.06018
作者: Chaoyi Zhu,Zaitang Li,Renyi Yang,Robert Birke,Pin-Yu Chen,Tsung-Yi Ho,Lydia Y. Chen
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Watermarking becomes one of the pivotal solutions to trace and verify the origin of synthetic images generated by artificial intelligence models, but it is not free of risks. Recent studies demonstrate the capability to forge watermarks from a target image onto cover images via adversarial optimization without knowledge of the target generative model and watermark schemes. In this paper, we uncover a greater risk of an optimization-free and universal watermark forgery that harnesses existing regenerative diffusion models. Our proposed forgery attack, PnP (Plug-and-Plant), seamlessly extracts and integrates the target watermark via regenerating the image, without needing any additional optimization routine. It allows for universal watermark forgery that works independently of the target image’s origin or the watermarking model used. We explore the watermarked latent extracted from the target image and visual-textual context of cover images as priors to guide sampling of the regenerative process. Extensive evaluation on 24 scenarios of model-data-watermark combinations demonstrates that PnP can successfully forge the watermark (up to 100% detectability and user attribution), and maintain the best visual perception. By bypassing model retraining and enabling adaptability to any image, our approach significantly broadens the scope of forgery attacks, presenting a greater challenge to the security of current watermarking techniques for diffusion models and the authority of watermarking schemes in synthetic data generation and governance.
zh
[AI-18] Leverag ing Generative AI for Enhancing Automated Assessment in Programming Education Contests
【速读】:该论文旨在解决在编程竞赛中生成全面测试用例以有效评估编程解决方案的资源密集且具有挑战性的问题。其解决方案的关键在于利用自然语言处理(NLP)驱动的方法,结合生成式 AI(Generative AI)技术,自动化创建高质量的测试用例,从而提升评估效果并发现传统方法未能检测到的错误。
链接: https://arxiv.org/abs/2506.05990
作者: Stefan Dascalescu,Adrian Marius Dumitran,Mihai Alexandru Vasiluta
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 chart pies, 1 figure Pre-print version Accepted at BEA 2025
点击查看摘要
Abstract:Competitive programming contests play a crucial role in cultivating computational thinking and algorithmic skills among learners. However, generating comprehensive test cases to effectively assess programming solutions remains resource-intensive and challenging for educators. This paper introduces an innovative NLP-driven method leveraging generative AI (large language models) to automate the creation of high-quality test cases for competitive programming assessments. We extensively evaluated our approach on diverse datasets, including 25 years of Romanian Informatics Olympiad (OJI) data for 5th graders, recent competitions hosted on the this http URL platform, and the International Informatics Olympiad in Teams (IIOT). Our results demonstrate that AI-generated test cases substantially enhanced assessments, notably identifying previously undetected errors in 67% of the OJI 5th grade programming problems. These improvements underscore the complementary educational value of our technique in formative assessment contexts. By openly sharing our prompts, translated datasets, and methodologies, we offer practical NLP-based tools that educators and contest organizers can readily integrate to enhance assessment quality, reduce workload, and deepen insights into learner performance.
zh
[AI-19] CrimeMind: Simulating Urban Crime with Multi-Modal LLM Agents
【速读】:该论文旨在解决城市犯罪建模中预测准确性与可解释性之间的矛盾问题,以及传统方法在应对环境变化时缺乏认知灵活性的局限性。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的新型代理基础模型(Agent-Based Modeling, ABM)框架——CrimeMind,该框架将日常活动理论(Routine Activity Theory, RAT)整合到代理的工作流程中,使模型能够处理多模态城市特征并推理犯罪行为,从而提升犯罪热点预测和空间分布的准确性。
链接: https://arxiv.org/abs/2506.05981
作者: Qingbin Zeng,Ruotong Zhao,Jinzhu Mao,Haoyang Li,Fengli Xu,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Modeling urban crime is an important yet challenging task that requires understanding the subtle visual, social, and cultural cues embedded in urban environments. Previous work has predominantly focused on rule-based agent-based modeling (ABM) and deep learning methods. ABMs offer interpretability of internal mechanisms but exhibit limited predictive this http URL contrast, deep learning methods are often effective in prediction but are less interpretable and require extensive training data. Moreover, both lines of work lack the cognitive flexibility to adapt to changing environments. Leveraging the capabilities of large language models (LLMs), we propose CrimeMind, a novel LLM-driven ABM framework for simulating urban crime within a multi-modal urban context.A key innovation of our design is the integration of the Routine Activity Theory (RAT) into the agentic workflow of CrimeMind, enabling it to process rich multi-modal urban features and reason about criminal this http URL, RAT requires LLM agents to infer subtle cues in evaluating environmental safety as part of assessing guardianship, which can be challenging for LLMs. To address this, we collect a small-scale human-annotated dataset and align CrimeMind’s perception with human judgment via a training-free textual gradient this http URL across four major U.S. cities demonstrate that CrimeMind outperforms both traditional ABMs and deep learning baselines in crime hotspot prediction and spatial distribution accuracy, achieving up to a 24% improvement over the strongest this http URL, we conduct counterfactual simulations of external incidents and policy interventions and it successfully captures the expected changes in crime patterns, demonstrating its ability to reflect counterfactual this http URL, CrimeMind enables fine-grained modeling of individual behaviors and facilitates evaluation of real-world interventions.
zh
[AI-20] AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification
【速读】:该论文旨在解决技能基础强化学习(Skill-based Reinforcement Learning, SBRL)中探索与技能多样性之间难以同时优化的冲突问题。现有方法在处理这两个相互矛盾的目标时面临挑战,导致性能受限。其解决方案的关键在于提出一种名为自适应多目标投影(Adaptive Multi-objective Projection for balancing Exploration and skill Diversification, AMPED)的新方法,该方法通过引入梯度手术技术在预训练阶段平衡探索与技能多样性,并在微调阶段利用技能选择模块动态选择适合下游任务的技能,从而有效提升技能学习的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2506.05980
作者: Geonwoo Cho,Jaemoon Lee,Jaegyun Im,Subi Lee,Jihwan Lee,Sundong Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both exploration and skill diversification. We begin by conducting extensive ablation studies to identify and define a set of objectives that effectively capture the aspects of exploration and skill diversity, respectively. During the skill pretraining phase, AMPED introduces a gradient surgery technique to balance the objectives of exploration and skill diversity, mitigating conflicts and reducing reliance on heuristic tuning. In the subsequent fine-tuning phase, AMPED incorporates a skill selector module that dynamically selects suitable skills for downstream tasks, based on task-specific performance signals. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: this https URL
zh
[AI-21] On Measuring Long-Range Interactions in Graph Neural Networks ICML2025
【速读】:该论文试图解决图神经网络研究中的长距离图任务(long-range graph tasks)问题,这类任务依赖于远距离节点之间的交互,目前缺乏理论支撑和稳健的评估方法。论文的关键解决方案是形式化图任务中的长距离交互,引入一种用于图上操作符的范围度量(range measure),并通过合成实验验证其有效性,从而为评估新数据集和架构提供更系统的依据。
链接: https://arxiv.org/abs/2506.05971
作者: Jacob Bamberger,Benjamin Gutteridge,Scott le Roux,Michael M. Bronstein,Xiaowen Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2025
点击查看摘要
Abstract:Long-range graph tasks – those dependent on interactions between distant nodes – are an open problem in graph neural network research. Real-world benchmark tasks, especially the Long Range Graph Benchmark, have become popular for validating the long-range capability of proposed architectures. However, this is an empirical approach that lacks both robustness and theoretical underpinning; a more principled characterization of the long-range problem is required. To bridge this gap, we formalize long-range interactions in graph tasks, introduce a range measure for operators on graphs, and validate it with synthetic experiments. We then leverage our measure to examine commonly used tasks and architectures, and discuss to what extent they are, in fact, long-range. We believe our work advances efforts to define and address the long-range problem on graphs, and that our range measure will aid evaluation of new datasets and architectures.
zh
[AI-22] Gradual Transition from Bellm an Optimality Operator to Bellm an Operator in Online Reinforcement Learning ICML2025
【速读】:该论文旨在解决连续动作空间中策略梯度方法在在线强化学习(Reinforcement Learning, RL)中样本效率低的问题。传统针对连续动作的RL算法通常仅依赖策略更新来改进性能,这导致学习效率较低。该研究的关键解决方案是将Bellman最优算子引入到actor-critic框架中,通过一种退火机制逐步从Bellman最优算子过渡到Bellman算子,从而在加速学习的同时减少过估计偏差。该方法与TD3和SAC结合后,在多种运动和操作任务中表现出更高的性能和对超参数的鲁棒性。
链接: https://arxiv.org/abs/2506.05968
作者: Motoki Omura,Kazuki Ota,Takayuki Osa,Yusuke Mukuta,Tatsuya Harada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted at ICML 2025. Source code: this https URL
点击查看摘要
Abstract:For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
zh
[AI-23] Preference Learning for AI Alignment: a Causal Perspective
【速读】:该论文试图解决大规模语言模型(Large Language Models, LLMs)与人类价值观对齐过程中,从偏好数据中进行奖励建模(Reward Modelling)时面临的泛化能力不足问题。其解决方案的关键在于将该问题置于因果范式(Causal Paradigm)下进行建模,利用因果推断的工具来识别并应对持续性挑战,如因果误识别、偏好异质性以及用户特定因素引起的混杂效应。通过明确可靠泛化所需的关键假设,并与常见的数据收集实践进行对比,该研究揭示了传统奖励模型的局限性,并展示了基于因果启发的方法如何提升模型的鲁棒性。
链接: https://arxiv.org/abs/2506.05967
作者: Katarzyna Kobalczyk,Mihaela van der Schaar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal paradigm, providing the rich toolbox of causality to identify the persistent challenges, such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors. Inheriting from the literature of causal inference, we identify key assumptions necessary for reliable generalisation and contrast them with common data collection practices. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness. Finally, we outline desiderata for future research and practices, advocating targeted interventions to address inherent limitations of observational data.
zh
[AI-24] Comparative Analysis of Modern Machine Learning Models for Retail Sales Forecasting
【速读】:该论文旨在解决零售环境中准确销售预测的问题,以避免因预测过高或过低而导致的库存成本增加、销售损失及声誉影响。其解决方案的关键在于通过对比基于树的集成模型(如XGBoost和LightGBM)与先进的神经网络架构(如N-BEATS、NHITS和Temporal Fusion Transformer),评估不同模型在处理零售数据中的间歇性需求、缺失值和频繁产品更替等挑战时的表现。研究发现,基于树的模型在使用未插补数据对特定群体进行局部建模时,能够提供更高的预测精度和计算效率,而神经网络模型则需依赖先进的插补方法才能发挥潜力,但仍难以有效处理实体零售数据中的不规则性。
链接: https://arxiv.org/abs/2506.05941
作者: Luka Hobor,Mario Brcic,Lidija Polutnik,Ante Kapetanovic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 total pages, 10 pages article, 10 pages appendix, 3 figures, 24 tables
点击查看摘要
Abstract:Accurate forecasting is key for all business planning. When estimated sales are too high, brick-and-mortar retailers may incur higher costs due to unsold inventories, higher labor and storage space costs, etc. On the other hand, when forecasts underestimate the level of sales, firms experience lost sales, shortages, and impact on the reputation of the retailer in their relevant market. Accurate forecasting presents a competitive advantage for companies. It facilitates the achievement of revenue and profit goals and execution of pricing strategy and tactics. In this study, we provide an exhaustive assessment of the forecasting models applied to a high-resolution brick-and-mortar retail dataset. Our forecasting framework addresses the problems found in retail environments, including intermittent demand, missing values, and frequent product turnover. We compare tree-based ensembles (such as XGBoost and LightGBM) and state-of-the-art neural network architectures (including N-BEATS, NHITS, and the Temporal Fusion Transformer) across various experimental settings. Our results show that localized modeling strategies especially those using tree-based models on individual groups with non-imputed data, consistently deliver superior forecasting accuracy and computational efficiency. In contrast, neural models benefit from advanced imputation methods, yet still fall short in handling the irregularities typical of physical retail data. These results further practical understanding for model selection in retail environment and highlight the significance of data preprocessing to improve forecast performance.
zh
[AI-25] Quantifying Adversarial Uncertainty in Evidential Deep Learning using Conflict Resolution
【速读】:该论文旨在解决深度学习模型在高风险应用中面对分布外(out-of-distribution, OOD)或对抗性输入时的可靠性问题,这些问题可能导致有害结果。其解决方案的关键在于提出一种轻量级的后处理不确定性量化方法——冲突感知证据深度学习(Conflict-aware Evidential Deep Learning, C-EDL),通过为每个输入生成多样化的任务保留变换并量化表征分歧来校准不确定性估计,从而提升对OOD和对抗性输入的检测能力,同时保持较高的分布内准确率和较低的计算开销。
链接: https://arxiv.org/abs/2506.05937
作者: Charmaine Barker,Daniel Bethell,Simos Gerasimou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reliability of deep learning models is critical for deployment in high-stakes applications, where out-of-distribution or adversarial inputs may lead to detrimental outcomes. Evidential Deep Learning, an efficient paradigm for uncertainty quantification, models predictions as Dirichlet distributions of a single forward pass. However, EDL is particularly vulnerable to adversarially perturbed inputs, making overconfident errors. Conflict-aware Evidential Deep Learning (C-EDL) is a lightweight post-hoc uncertainty quantification approach that mitigates these issues, enhancing adversarial and OOD robustness without retraining. C-EDL generates diverse, task-preserving transformations per input and quantifies representational disagreement to calibrate uncertainty estimates when needed. C-EDL’s conflict-aware prediction adjustment improves detection of OOD and adversarial inputs, maintaining high in-distribution accuracy and low computational overhead. Our experimental evaluation shows that C-EDL significantly outperforms state-of-the-art EDL variants and competitive baselines, achieving substantial reductions in coverage for OOD data (up to 55%) and adversarial data (up to 90%), across a range of datasets, attack types, and uncertainty metrics.
zh
[AI-26] Small Models Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在教育领域中主要作为学生辅助工具,而对教育工作者的支持仍显不足的问题,特别是针对需要成本效益高、可本地部署和定制化的开源解决方案的需求。其关键解决方案是引入一个端到端的开源框架,利用小型(3B-7B参数)本地部署的LLM进行定制化教学材料生成与评估,该框架包含一个用于有效小模型优化的交互循环以及一个辅助的LLM验证器以降低越狱风险,同时结合检索增强生成(Retrieval and Context Augmented Generation, RAG/CAG)技术,确保生成内容的准确性与教学适用性。
链接: https://arxiv.org/abs/2506.05925
作者: Zarreen Reza,Alexander Mazur,Michael T. Dugdale,Robin Ray-Chaudhuri
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) are increasingly utilized as student-facing educational aids, their potential to directly support educators, particularly through locally deployable and customizable open-source solutions, remains significantly underexplored. Many existing educational solutions rely on cloud-based infrastructure or proprietary tools, which are costly and may raise privacy concerns. Regulated industries with limited budgets require affordable, self-hosted solutions. We introduce an end-to-end, open-source framework leveraging small (3B-7B parameters), locally deployed LLMs for customized teaching material generation and assessment. Our system uniquely incorporates an interactive loop crucial for effective small-model refinement, and an auxiliary LLM verifier to mitigate jailbreaking risks, enhancing output reliability and safety. Utilizing Retrieval and Context Augmented Generation (RAG/CAG), it produces factually accurate, customized pedagogically-styled content. Deployed on-premises for data privacy and validated through an evaluation pipeline and a college physics pilot, our findings show that carefully engineered small LLM systems can offer robust, affordable, practical, and safe educator support, achieving utility comparable to larger models for targeted tasks.
zh
[AI-27] WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction
【速读】:该论文旨在解决文本到音乐系统中客观评估整体音乐质量和文本提示对齐度的问题,即通过预测Mean Opinion Score (MOS) 来衡量生成音乐的质量与文本输入的匹配程度。其解决方案的关键在于提出了一种多模态架构WhisQ,该架构通过序列级协同注意力机制和最优传输正则化实现跨模态对齐。WhisQ利用预训练的Whisper Base模型进行音频时序编码,并结合Qwen 3小型语言模型进行文本编码,保持序列结构以支持细粒度的跨模态建模,同时通过Sinkhorn最优传输损失强化嵌入空间中的语义对齐。
链接: https://arxiv.org/abs/2506.05899
作者: Jakaria Islam Emon,Kazi Tamanna Alam,Md. Abu Salek
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 3 pages
点击查看摘要
Abstract:Mean Opinion Score (MOS) prediction for text to music systems requires evaluating both overall musical quality and text prompt alignment. This paper introduces WhisQ, a multimodal architecture that addresses this dual-assessment challenge through sequence level co-attention and optimal transport regularization. WhisQ employs the Whisper Base pretrained model for temporal audio encoding and Qwen 3, a 0.6B Small Language Model (SLM), for text encoding, with both maintaining sequence structure for fine grained cross-modal modeling. The architecture features specialized prediction pathways: OMQ is predicted from pooled audio embeddings, while TA leverages bidirectional sequence co-attention between audio and text. Sinkhorn optimal transport loss further enforce semantic alignment in the shared embedding space. On the MusicEval Track-1 dataset, WhisQ achieves substantial improvements over the baseline: 7% improvement in Spearman correlation for OMQ and 14% for TA. Ablation studies reveal that optimal transport regularization provides the largest performance gain (10% SRCC improvement), demonstrating the importance of explicit cross-modal alignment for text-to-music evaluation.
zh
[AI-28] Explainability in Context: A Multilevel Framework Aligning AI Explanations with Stakeholder with LLM s
【速读】:该论文试图解决人工智能在敏感领域应用中,如何提升系统可信度的问题,特别是针对不同利益相关者(如开发者、领域专家、终端用户和社会)对解释的多样化需求。其解决方案的关键在于提出一个多层次框架,该框架包括算法与领域基础层、以人类为中心层以及社会可解释性层,旨在将解释与不同群体的认知、情境和伦理期望相匹配。其中,大型语言模型(Large Language Models, LLMs)在增强社会层方面发挥着关键作用,通过生成易于理解的自然语言解释来促进技术准确性、用户参与度和社会责任。
链接: https://arxiv.org/abs/2506.05887
作者: Marilyn Bello,Rafael Bello,Maria-Matilde García,Ann Nowé,Iván Sevillano-García,Francisco Herrera
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures
点击查看摘要
Abstract:The growing application of artificial intelligence in sensitive domains has intensified the demand for systems that are not only accurate but also explainable and trustworthy. Although explainable AI (XAI) methods have proliferated, many do not consider the diverse audiences that interact with AI systems: from developers and domain experts to end-users and society. This paper addresses how trust in AI is influenced by the design and delivery of explanations and proposes a multilevel framework that aligns explanations with the epistemic, contextual, and ethical expectations of different stakeholders. The framework consists of three layers: algorithmic and domain-based, human-centered, and social explainability. We highlight the emerging role of Large Language Models (LLMs) in enhancing the social layer by generating accessible, natural language explanations. Through illustrative case studies, we demonstrate how this approach facilitates technical fidelity, user engagement, and societal accountability, reframing XAI as a dynamic, trust-building process.
zh
[AI-29] Bayesian Persuasion as a Bargaining Game
【速读】:该论文试图解决贝叶斯劝导(Bayesian persuasion)在长期互动中面临的计算复杂性问题,即当接收者可能根据历史结果和未来预期采用动态策略时,传统方法的NP难性质导致难以有效求解。其解决方案的关键在于引入讨价还价(bargaining)视角,构建一个统一框架,该框架具备公平性和帕累托效率等优良特性,并明确区分了发送者的信息优势与先发提案者优势。通过这一视角,论文重新诠释了经典单边劝导问题,将其转化为一种平衡的信息讨价还价框架,从而为长期劝导提供了结构化且可计算的解决方案。
链接: https://arxiv.org/abs/2506.05876
作者: Yue Lin,Shuhui Zhu,William A Cunningham,Wenhao Li,Pascal Poupart,Hongyuan Zha,Baoxiang Wang
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Bayesian persuasion, an extension of cheap-talk communication, involves an informed sender committing to a signaling scheme to influence a receiver’s actions. Compared to cheap talk, this sender’s commitment enables the receiver to verify the incentive compatibility of signals beforehand, facilitating cooperation. While effective in one-shot scenarios, Bayesian persuasion faces computational complexity (NP-hardness) when extended to long-term interactions, where the receiver may adopt dynamic strategies conditional on past outcomes and future expectations. To address this complexity, we introduce the bargaining perspective, which allows: (1) a unified framework and well-structured solution concept for long-term persuasion, with desirable properties such as fairness and Pareto efficiency; (2) a clear distinction between two previously conflated advantages: the sender’s informational advantage and first-proposer advantage. With only modest modifications to the standard setting, this perspective makes explicit the common knowledge of the game structure and grants the receiver comparable commitment capabilities, thereby reinterpreting classic one-sided persuasion as a balanced information bargaining framework. The framework is validated through a two-stage validation-and-inference paradigm: We first demonstrate that GPT-o3 and DeepSeek-R1, out of publicly available LLMs, reliably handle standard tasks; We then apply them to persuasion scenarios to test that the outcomes align with what our information-bargaining framework suggests. All code, results, and terminal logs are publicly available at this http URL.
zh
[AI-30] Research on Personalized Financial Product Recommendation by Integrating Large Language Models and Graph Neural Networks
【速读】:该论文试图解决传统金融产品推荐方法(如协同过滤或基于内容的模型)在捕捉用户潜在偏好和复杂关系方面的不足。其解决方案的关键在于提出一种融合大型语言模型(Large Language Models, LLMs)与图神经网络(Graph Neural Networks, GNNs)的混合框架,通过预训练LLM将文本数据编码为丰富的特征向量,并利用异构用户-产品图建模交互与社交关系,结合定制化的信息传递机制实现文本与图信息的融合,从而联合优化嵌入表示。
链接: https://arxiv.org/abs/2506.05873
作者: Yushang Zhao,Yike Peng,Dannier Li,Yuxin Yang,Chengrui Zhou,Jing Dong
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:With the rapid growth of fintech, personalized financial product recommendations have become increasingly important. Traditional methods like collaborative filtering or content-based models often fail to capture users’ latent preferences and complex relationships. We propose a hybrid framework integrating large language models (LLMs) and graph neural networks (GNNs). A pre-trained LLM encodes text data (e.g., user reviews) into rich feature vectors, while a heterogeneous user-product graph models interactions and social ties. Through a tailored message-passing mechanism, text and graph information are fused within the GNN to jointly optimize embeddings. Experiments on public and real-world financial datasets show our model outperforms standalone LLM or GNN in accuracy, recall, and NDCG, with strong interpretability. This work offers new insights for personalized financial recommendations and cross-modal fusion in broader recommendation tasks.
zh
[AI-31] DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection
【速读】:该论文旨在解决音频-视频深度伪造(DeepFake)检测中的基准测试问题,重点关注数据集、检测方法和评估协议三个核心基准支柱。其关键解决方案是提出一种简洁的多模态基线方法SImple Multimodal BAseline (SIMBA),并首次针对DeepSpeak v1数据集设计评估协议,同时深入分析音频捷径问题并提出有效的缓解策略。
链接: https://arxiv.org/abs/2506.05851
作者: Marcel Klemt,Carlotta Segna,Anna Rohrbach
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Generative AI advances rapidly, allowing the creation of very realistic manipulated video and audio. This progress presents a significant security and ethical threat, as malicious users can exploit DeepFake techniques to spread misinformation. Recent DeepFake detection approaches explore the multimodal (audio-video) threat scenario. In particular, there is a lack of reproducibility and critical issues with existing datasets - such as the recently uncovered silence shortcut in the widely used FakeAVCeleb dataset. Considering the importance of this topic, we aim to gain a deeper understanding of the key issues affecting benchmarking in audio-video DeepFake detection. We examine these challenges through the lens of the three core benchmarking pillars: datasets, detection methods, and evaluation protocols. To address these issues, we spotlight the recent DeepSpeak v1 dataset and are the first to propose an evaluation protocol and benchmark it using SOTA models. We introduce SImple Multimodal BAseline (SIMBA), a competitive yet minimalistic approach that enables the exploration of diverse design choices. We also deepen insights into the issue of audio shortcuts and present a promising mitigation strategy. Finally, we analyze and enhance the evaluation scheme on the widely used FakeAVCeleb dataset. Our findings offer a way forward in the complex area of audio-video DeepFake detection.
zh
[AI-32] Regional Lattice and Logical Representations of Neural Networks
【速读】:该论文试图解决神经网络可解释性问题,具体是通过将神经网络近似表示为分段线性函数的区域格式,从而实现对网络行为的局部解释。解决方案的关键在于提出一种算法,将具有ReLU激活函数的隐藏层和截断恒等激活函数的输出层的前馈神经网络转换为区域表示,该表示将输入区域与计算网络输出的线性函数相关联。此外,论文还研究了所生成区域表示的复杂性及其满足特定性质的程度,以支持进一步转化为格(lattice)和逻辑表示。
链接: https://arxiv.org/abs/2506.05834
作者: Sandro Preto(Federal University of ABC, Brazil),Marcelo Finger(University of Sao Paulo, Brazil)
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: In Proceedings LSFA 2024, arXiv:2506.05219
点击查看摘要
Abstract:A possible path to the interpretability of neural networks is to (approximately) represent them in the regional format of piecewise linear functions, where regions of inputs are associated to linear functions computing the network outputs. We present an algorithm for the translation of feedforward neural networks with ReLU activation functions in hidden layers and truncated identity activation functions in the output layer. We also empirically investigate the complexity of regional representations outputted by our method for neural networks with varying sizes. Lattice and logical representations of neural networks are straightforward from regional representations as long as they satisfy a specific property. So we empirically investigate to what extent the translations by our algorithm satisfy such property.
zh
[AI-33] Fuzzy Lattice-based Description Logic
【速读】:该论文试图解决在模糊形式上下文和模糊形式概念的框架下,如何有效表示和推理知识的问题。其解决方案的关键在于引入一种模糊扩展的描述逻辑——LE-FALC,它是许多值LE-逻辑的描述逻辑对应物,并提供了用于检查LE-FALC ABoxes一致性的完整且可靠的多项式时间决策过程,通过展开方法还获得了针对具有无环TBoxes的LE-FALC的一致性检查的指数时间决策过程。
链接: https://arxiv.org/abs/2506.05833
作者: Yiwen Ding,Krishna Manoorkar
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: In Proceedings LSFA 2024, arXiv:2506.05219
点击查看摘要
Abstract:Recently, description logic LE-ALC was introduced for reasoning in the semantic environment of enriched formal contexts, and a polynomial-time tableaux algorithm was developed to check the consistency of knowledge bases with acyclic TBoxes. In this work, we introduce a fuzzy generalization of LE-ALC called LE-FALC which provides a description logic counterpart of many-valued normal non-distributive logic a.k.a. many-valued LE-logic. This description logic can be used to represent and reason about knowledge in the formal framework of fuzzy formal contexts and fuzzy formal concepts. We provide a tableaux algorithm that provides a complete and sound polynomial-time decision procedure to check the consistency of LE-FALC ABoxes. As a result, we also obtain an exponential-time decision procedure for checking the consistency of LE-FALC with acyclic TBoxes by unraveling.
zh
[AI-34] Heartcare Suite: Multi-dimensional Understanding of ECG with Raw Multi-lead Signal Modeling
【速读】:该论文旨在解决心电图(Electrocardiogram, ECG)的细粒度多模态理解问题,通过构建一个全面的框架来提升ECG分析的准确性与智能化水平。其解决方案的关键在于提出Heartcare Suite,该框架包含三个核心组件:Heartcare-220K数据集、Heartcare-Bench基准测试平台以及HeartcareGPT模型。其中,HeartcareGPT采用定制化的双向心电抽象分词(Beat)机制,通过双层级向量量化和查询引导的双向扩散机制,将原始多导联信号压缩为语义丰富的离散标记,从而实现对ECG信号的高效建模与理解。
链接: https://arxiv.org/abs/2506.05831
作者: Yihan Xie,Sijing Li,Tianwei Lin,Zhuonan Wang,Chenglin Yang,Yu Zhong,Wenqiao Zhang,Haoyuan Li,Hao Jiang,Fengda Zhang,Qishan Chen,Jun Xiao,Yueting Zhuang,Beng Chin Ooi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We present Heartcare Suite, a multimodal comprehensive framework for finegrained electrocardiogram (ECG) understanding. It comprises three key components: (i) Heartcare-220K, a high-quality, structured, and comprehensive multimodal ECG dataset covering essential tasks such as disease diagnosis, waveform morphology analysis, and rhythm interpretation. (ii) Heartcare-Bench, a systematic and multi-dimensional benchmark designed to evaluate diagnostic intelligence and guide the optimization of Medical Multimodal Large Language Models (Med-MLLMs) in ECG scenarios. and (iii) HeartcareGPT with a tailored tokenizer Bidirectional ECG Abstract Tokenization (Beat), which compresses raw multi-lead signals into semantically rich discrete tokens via duallevel vector quantization and query-guided bidirectional diffusion mechanism. Built upon Heartcare-220K, HeartcareGPT achieves strong generalization and SoTA performance across multiple clinically meaningful tasks. Extensive experiments demonstrate that Heartcare Suite is highly effective in advancing ECGspecific multimodal understanding and evaluation. Our project is available at this https URL .
zh
[AI-35] Positional Encoding meets Persistent Homology on Graphs ICML2025
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在利用关键结构信息(如连通性和环路)方面的局限性,这一问题源于消息传递机制的局部归纳偏置。为应对这一挑战,研究者提出了位置编码(Positional Encoding, PE)和持久同调(Persistent Homology, PH)两种方法,分别通过赋予GNN位置感知特征和多分辨率拓扑特征来增强其表达能力。然而,这两种方法的相对优劣尚缺乏理论上的系统分析。本文的关键在于通过理论分析证明PE与PH在表达能力上并无绝对优劣,并提出一种新的可学习方法PiPE(Persistence-informed Positional Encoding),该方法在理论上比PE和PH更具表达能力,从而在多种任务中展现出优异性能。
链接: https://arxiv.org/abs/2506.05814
作者: Yogesh Verma,Amauri H. Souza,Vikas Garg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Social and Information Networks (cs.SI)
备注: Accepted at ICML 2025
点击查看摘要
Abstract:The local inductive bias of message-passing graph neural networks (GNNs) hampers their ability to exploit key structural information (e.g., connectivity and cycles). Positional encoding (PE) and Persistent Homology (PH) have emerged as two promising approaches to mitigate this issue. PE schemes endow GNNs with location-aware features, while PH methods enhance GNNs with multiresolution topological features. However, a rigorous theoretical characterization of the relative merits and shortcomings of PE and PH has remained elusive. We bridge this gap by establishing that neither paradigm is more expressive than the other, providing novel constructions where one approach fails but the other succeeds. Our insights inform the design of a novel learnable method, PiPE (Persistence-informed Positional Encoding), which is provably more expressive than both PH and PE. PiPE demonstrates strong performance across a variety of tasks (e.g., molecule property prediction, graph classification, and out-of-distribution generalization), thereby advancing the frontiers of graph representation learning. Code is available at this https URL.
zh
[AI-36] rajectory Entropy: Modeling Game State Stability from Multimodality Trajectory Prediction
【速读】:该论文试图解决在真实世界场景中,自主驾驶系统面对多智能体复杂交互时所遇到的计算冗余和错误问题。现有基于层级k博弈框架的方法忽略了智能体间驾驶复杂性的差异以及智能体状态在不同博弈层级中的动态变化,导致计算效率低下。其解决方案的关键在于提出一种名为轨迹熵(Trajectory Entropy)的度量方法,该方法通过分析智能体轨迹预测结果中的多模态统计信号,结合信噪比量化智能体的博弈状态,并通过简单的门控机制优化层级k博弈框架,从而提升整体精度并降低计算成本。
链接: https://arxiv.org/abs/2506.05810
作者: Yesheng Zhang,Wenjian Sun,Yuheng Chen,Qingwei Liu,Qi Lin,Rui Zhang,Xu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages
点击查看摘要
Abstract:Complex interactions among agents present a significant challenge for autonomous driving in real-world scenarios. Recently, a promising approach has emerged, which formulates the interactions of agents as a level-k game framework. It effectively decouples agent policies by hierarchical game levels. However, this framework ignores both the varying driving complexities among agents and the dynamic changes in agent states across game levels, instead treating them uniformly. Consequently, redundant and error-prone computations are introduced into this framework. To tackle the issue, this paper proposes a metric, termed as Trajectory Entropy, to reveal the game status of agents within the level-k game framework. The key insight stems from recognizing the inherit relationship between agent policy uncertainty and the associated driving complexity. Specifically, Trajectory Entropy extracts statistical signals representing uncertainty from the multimodality trajectory prediction results of agents in the game. Then, the signal-to-noise ratio of this signal is utilized to quantify the game status of agents. Based on the proposed Trajectory Entropy, we refine the current level-k game framework through a simple gating mechanism, significantly improving overall accuracy while reducing computational costs. Our method is evaluated on the Waymo and nuPlan datasets, in terms of trajectory prediction, open-loop and closed-loop planning tasks. The results demonstrate the state-of-the-art performance of our method, with precision improved by up to 19.89% for prediction and up to 16.48% for planning.
zh
[AI-37] FlowOE: Imitation Learning with Flow Policy from Ensemble RL Experts for Optimal Execution under Heston Volatility and Concave Market Impacts
【速读】:该论文旨在解决金融市场上大规模资产交易的最优执行问题,即在平衡市场影响成本与时间或波动性风险的前提下,实现最佳交易结果。传统最优执行策略(如静态Almgren-Chriss模型)在动态金融市场中表现欠佳。论文提出的解决方案是flowOE,这是一种基于流匹配模型的模仿学习框架。其关键创新在于在模仿过程中引入了精炼损失函数,使flowOE不仅能够模仿专家行为,还能对其进行改进,从而适应不同的市场条件并显著提升交易绩效。
链接: https://arxiv.org/abs/2506.05755
作者: Yang Li,Zhi Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
备注: 3 figures, 3 algorithms, 7 tables
点击查看摘要
Abstract:Optimal execution in financial markets refers to the process of strategically transacting a large volume of assets over a period to achieve the best possible outcome by balancing the trade-off between market impact costs and timing or volatility risks. Traditional optimal execution strategies, such as static Almgren-Chriss models, often prove suboptimal in dynamic financial markets. This paper propose flowOE, a novel imitation learning framework based on flow matching models, to address these limitations. FlowOE learns from a diverse set of expert traditional strategies and adaptively selects the most suitable expert behavior for prevailing market conditions. A key innovation is the incorporation of a refining loss function during the imitation process, enabling flowOE not only to mimic but also to improve upon the learned expert actions. To the best of our knowledge, this work is the first to apply flow matching models in a stochastic optimal execution problem. Empirical evaluations across various market conditions demonstrate that flowOE significantly outperforms both the specifically calibrated expert models and other traditional benchmarks, achieving higher profits with reduced risk. These results underscore the practical applicability and potential of flowOE to enhance adaptive optimal execution.
zh
[AI-38] Integrating Spatiotemporal Features in LSTM for Spatially Informed COVID-19 Hospitalization Forecasting
【速读】:该论文旨在解决新冠疫情背景下准确、及时预测医院住院人数的需求,以支持有效的医疗资源规划。传统预测模型在变异株激增期间表现不佳,难以满足实际需求。该研究提出了一种基于长短期记忆网络(LSTM)的框架,通过引入时空特征——社交接近性至住院(Social Proximity to Hospitalizations, SPH),利用Facebook的社交连通性指数来反映州际人口互动,从而捕捉传播动态。其关键在于采用并行LSTM架构以捕获短期和长期时间依赖性,并结合多时域集成策略平衡预测一致性和误差,显著提升了预测精度。
链接: https://arxiv.org/abs/2506.05752
作者: Zhongying Wang,Thoai D. Ngo,Hamidreza Zoraghein,Benjamin Lucas,Morteza Karimzadeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 36 pages, 12 figures. This is the accepted version of the article published in International Journal of Geographical Information Science. DOI will be added upon publication
点击查看摘要
Abstract:The COVID-19 pandemic’s severe impact highlighted the need for accurate, timely hospitalization forecasting to support effective healthcare planning. However, most forecasting models struggled, especially during variant surges, when they were needed most. This study introduces a novel Long Short-Term Memory (LSTM) framework for forecasting daily state-level incident hospitalizations in the United States. We present a spatiotemporal feature, Social Proximity to Hospitalizations (SPH), derived from Facebook’s Social Connectedness Index to improve forecasts. SPH serves as a proxy for interstate population interaction, capturing transmission dynamics across space and time. Our parallel LSTM architecture captures both short- and long-term temporal dependencies, and our multi-horizon ensembling strategy balances consistency and forecasting error. Evaluation against COVID-19 Forecast Hub ensemble models during the Delta and Omicron surges reveals superiority of our model. On average, our model surpasses the ensemble by 27, 42, 54, and 69 hospitalizations per state on the 7^th , 14^th , 21^st , and 28^th forecast days, respectively, during the Omicron surge. Data-ablation experiments confirm SPH’s predictive power, highlighting its effectiveness in enhancing forecasting models. This research not only advances hospitalization forecasting but also underscores the significance of spatiotemporal features, such as SPH, in refining predictive performance in modeling the complex dynamics of infectious disease spread.
zh
[AI-39] An Ontology for Representing Curriculum and Learning Material
【速读】:该论文试图解决教育、学习和培训材料在互联网上普遍存在但彼此之间缺乏连接、陷入平台孤岛等问题,其核心挑战在于如何实现教育资源的整合与跨主题的交叉链接。解决方案的关键是提出一种名为Curriculum KG Ontology的本体框架,该框架基于组织性和广泛的教育原则,用于实现教育资源的密集互连,并通过为原型开放知识网络用例提供一个实体图谱进行验证。
链接: https://arxiv.org/abs/2506.05751
作者: Antrea Christou,Chris Davis Jaldi,Joseph Zalewski,Hande Küçük McGinty,Pascal Hitzler,Cogan Shimizu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Educational, learning, and training materials have become extremely commonplace across the Internet. Yet, they frequently remain disconnected from each other, fall into platform silos, and so on. One way to overcome this is to provide a mechanism to integrate the material and provide cross-links across topics. In this paper, we present the Curriculum KG Ontology, which we use as a framework for the dense interlinking of educational materials, by first starting with organizational and broad pedagogical principles. We provide a materialized graph for the Prototype Open Knowledge Network use-case, and validate it using competency questions sourced from domain experts and educators. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.05751 [cs.CY] (or arXiv:2506.05751v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2506.05751 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-40] Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance
【速读】:该论文旨在解决现代强化学习人类反馈(Reinforcement Learning from Human Feedback, RLHF)流程中奖励模型训练的成本瓶颈问题,该过程通常需要数十亿参数并经历离线偏好调优阶段。解决方案的关键在于使用一个冻结的、指令微调的7B大语言模型(Large Language Model),通过仅添加一行JSON规则和一个秩为16的低秩适配器(LoRA),使其能够完全替代之前使用的重型评估模型,从而显著降低计算成本。该方法在RewardBench上实现了96.2%的准确率,并在GSM-8K任务中通过在线PPO实现了92%的精确匹配准确率,表现出色。
链接: https://arxiv.org/abs/2506.05748
作者: Rudransh Agnihotri,Ananya Pandey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reward-model training is the cost bottleneck in modern Reinforcement Learning Human Feedback (RLHF) pipelines, often requiring tens of billions of parameters and an offline preference-tuning phase. In the proposed method, a frozen, instruction-tuned 7B LLM is augmented with only a one line JSON rubric and a rank-16 LoRA adapter (affecting just 0.8% of the model’s parameters), enabling it to serve as a complete substitute for the previously used heavyweight evaluation models. The plug-and-play judge achieves 96.2% accuracy on RewardBench, outperforming specialized reward networks ranging from 27B to 70B parameters. Additionally, it allows a 7B actor to outperform the top 70B DPO baseline, which scores 61.8%, by achieving 92% exact match accuracy on GSM-8K utilizing online PPO. Thorough ablations indicate that (i) six in context demonstrations deliver the majority of the zero-to-few-shot improvements (+2pp), and (ii) the LoRA effectively addresses the remaining disparity, particularly in the safety and adversarial Chat-Hard segments. The proposed model introduces HH-Rationales, a subset of 10,000 pairs from Anthropic HH-RLHF, to examine interpretability, accompanied by human generated justifications. GPT-4 scoring indicates that our LoRA judge attains approximately = 9/10 in similarity to human explanations, while zero-shot judges score around =5/10. These results indicate that the combination of prompt engineering and tiny LoRA produces a cost effective, transparent, and easily adjustable reward function, removing the offline phase while achieving new state-of-the-art outcomes for both static evaluation and online RLHF.
zh
[AI-41] SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在处理复杂推理任务时生成冗长的序列思维链导致推理时间过长的问题。其解决方案的关键在于提出一种名为SPRINT的后训练和推理阶段框架,该框架通过动态识别并利用推理过程中的并行化机会,提升模型的推理效率。SPRINT引入了一种创新的数据整理流程,将自然语言推理轨迹重新组织为结构化的长周期规划和并行执行阶段,并通过对少量整理数据进行微调,使模型能够动态识别扩展推理过程中的独立子任务并有效并行执行。
链接: https://arxiv.org/abs/2506.05745
作者: Emil Biju,Shayan Talaei,Zhemin Huang,Mohammadreza Pourreza,Azalia Mirhoseini,Amin Saberi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Emil Biju, Shayan Talaei, and Zhemin Huang contributed equally to this work
点击查看摘要
Abstract:Large reasoning models (LRMs) excel at complex reasoning tasks but typically generate lengthy sequential chains-of-thought, resulting in long inference times before arriving at the final answer. To address this challenge, we introduce SPRINT, a novel post-training and inference-time framework designed to enable LRMs to dynamically identify and exploit opportunities for parallelization during their reasoning process. SPRINT incorporates an innovative data curation pipeline that reorganizes natural language reasoning trajectories into structured rounds of long-horizon planning and parallel execution. By fine-tuning LRMs on a small amount of such curated data, the models learn to dynamically identify independent subtasks within extended reasoning processes and effectively execute them in parallel. Through extensive evaluations, we show that the models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics while generating up to ~39% fewer sequential tokens on problems requiring more than 8000 output tokens. Finally, we observe consistent results transferred to two out-of-distribution tasks of GPQA and Countdown with up to 45% and 65% reduction in average sequential tokens for longer reasoning trajectories, while achieving the performance of the fine-tuned reasoning model.
zh
[AI-42] opology of Reasoning : Understanding Large Reasoning Reasoning Models through Reasoning Graph Properties
【速读】:该论文试图解决大型推理模型在数学基准测试中取得优异性能但其内部机制尚不明确的问题,其解决方案的关键在于引入了推理图(reasoning graph)的概念,通过聚类每个推理步骤的隐藏状态表示来提取该图,并系统分析其图论特性,如循环性、直径和小世界指数。研究发现,蒸馏后的推理模型在这些结构特性上表现出显著优势,且这些优势随着任务难度和模型规模的增长而增强,从而为提升模型推理能力提供了理论依据与实践指导。
链接: https://arxiv.org/abs/2506.05744
作者: Gouki Minegishi,Hiroki Furuta,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In this work, we introduce the notion of a reasoning graph, extracted by clustering hidden-state representations at each reasoning step, and systematically analyze three key graph-theoretic properties: cyclicity, diameter, and small-world index, across multiple tasks (GSM8K, MATH500, AIME 2024). Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about 6x) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities. By bridging theoretical insights into reasoning graph structures with practical recommendations for data construction, our work advances both the interpretability and the efficacy of large reasoning models.
zh
[AI-43] When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning CCS’25
【速读】:该论文试图解决预训练编码器模型在应用过程中可能引发的训练数据隐私泄露问题,特别是针对对比学习框架中的成员推理攻击(Membership Inference Attacks, MIAs)所带来的隐私威胁。其解决方案的关键在于提出一种基于特征向量p-范数的新型成员推理攻击方法,称为嵌入Lp-范数似然攻击(Embedding Lp-Norm Likelihood Attack, LpLA),该方法通过利用特征向量p-范数的统计分布特性来推断数据样本是否属于训练集,实验结果表明该方法在攻击性能和鲁棒性方面优于现有方法,尤其在攻击知识和查询次数受限的情况下表现更为突出。
链接: https://arxiv.org/abs/2506.05743
作者: Ruining Sun,Hongsheng Hu,Wei Luo,Zhaoxi Zhang,Yanjun Zhang,Haizhuan Yuan,Leo Yu Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted In ACM ASIA Conference on Computer and Communications Security (ASIA CCS '25), August 25-29, 2025, Ha Noi, Vietnam. For Code, see this https URL
点击查看摘要
Abstract:With the rapid advancement of deep learning technology, pre-trained encoder models have demonstrated exceptional feature extraction capabilities, playing a pivotal role in the research and application of deep learning. However, their widespread use has raised significant concerns about the risk of training data privacy leakage. This paper systematically investigates the privacy threats posed by membership inference attacks (MIAs) targeting encoder models, focusing on contrastive learning frameworks. Through experimental analysis, we reveal the significant impact of model architecture complexity on membership privacy leakage: As more advanced encoder frameworks improve feature-extraction performance, they simultaneously exacerbate privacy-leakage risks. Furthermore, this paper proposes a novel membership inference attack method based on the p-norm of feature vectors, termed the Embedding Lp-Norm Likelihood Attack (LpLA). This method infers membership status, by leveraging the statistical distribution characteristics of the p-norm of feature vectors. Experimental results across multiple datasets and model architectures demonstrate that LpLA outperforms existing methods in attack performance and robustness, particularly under limited attack knowledge and query volumes. This study not only uncovers the potential risks of privacy leakage in contrastive learning frameworks, but also provides a practical basis for privacy protection research in encoder models. We hope that this work will draw greater attention to the privacy risks associated with self-supervised learning models and shed light on the importance of a balance between model utility and training data privacy. Our code is publicly available at: this https URL.
zh
[AI-44] o Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt DSN2025
【速读】:该论文试图解决大型语言模型(Large Language Model, LLM)代理在应用中面临的提示注入攻击(prompt injection attack)问题,此类攻击通过恶意输入操纵模型行为,而传统防御方法如输入净化、守卫模型和防护措施存在繁琐或无效的缺陷。论文提出的解决方案关键在于一种轻量级防御机制——多态提示组装(Polymorphic Prompt Assembling, PPA),其核心思想是通过动态改变系统提示的结构,使攻击者无法预测提示结构,从而有效防止提示注入攻击,同时保持模型性能不受影响。
链接: https://arxiv.org/abs/2506.05739
作者: Zhilong Wang,Neha Nagaraja,Lan Zhang,Hayretdin Bahsi,Pawan Patil,Peng Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: To appear in the Industry Track of the 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2025)
点击查看摘要
Abstract:LLM agents are widely used as agents for customer support, content generation, and code assistance. However, they are vulnerable to prompt injection attacks, where adversarial inputs manipulate the model’s behavior. Traditional defenses like input sanitization, guard models, and guardrails are either cumbersome or ineffective. In this paper, we propose a novel, lightweight defense mechanism called Polymorphic Prompt Assembling (PPA), which protects against prompt injection with near-zero overhead. The approach is based on the insight that prompt injection requires guessing and breaking the structure of the system prompt. By dynamically varying the structure of system prompts, PPA prevents attackers from predicting the prompt structure, thereby enhancing security without compromising performance. We conducted experiments to evaluate the effectiveness of PPA against existing attacks and compared it with other defense methods.
zh
[AI-45] Generalized Incremental Learning under Concept Drift across Evolving Data Streams
【速读】:该论文试图解决现实数据流中由于概念漂移(concept drift)导致的分布和标签空间协同演化问题,现有方法在有限监督和持续不确定性下未能有效处理这一挑战。其解决方案的关键在于提出一种名为Calibrated Source-Free Adaptation (CSFA) 的框架,该框架包含两个核心机制:一是无需训练的原型校准机制,用于动态融合新出现的原型与基础表示,实现稳定的新类别识别;二是基于可靠替代间隙尖锐性感知(RSGS)的无源适应算法,通过结合尖锐性感知扰动损失优化与替代间隙最小化,并利用熵基不确定性过滤剔除不可靠样本,从而确保分布对齐并缓解泛化退化问题。
链接: https://arxiv.org/abs/2506.05736
作者: En Yu,Jie Lu,Guangquan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Real-world data streams exhibit inherent non-stationarity characterized by concept drift, posing significant challenges for adaptive learning systems. While existing methods address isolated distribution shifts, they overlook the critical co-evolution of label spaces and distributions under limited supervision and persistent uncertainty. To address this, we formalize Generalized Incremental Learning under Concept Drift (GILCD), characterizing the joint evolution of distributions and label spaces in open-environment streaming contexts, and propose a novel framework called Calibrated Source-Free Adaptation (CSFA). First, CSFA introduces a training-free prototype calibration mechanism that dynamically fuses emerging prototypes with base representations, enabling stable new-class identification without optimization overhead. Second, we design a novel source-free adaptation algorithm, i.e., Reliable Surrogate Gap Sharpness-aware (RSGS) minimization. It integrates sharpness-aware perturbation loss optimization with surrogate gap minimization, while employing entropy-based uncertainty filtering to discard unreliable samples. This mechanism ensures robust distribution alignment and mitigates generalization degradation caused by uncertainties. Therefore, CSFA establishes a unified framework for stable adaptation to evolving semantics and distributions in open-world streaming scenarios. Extensive experiments validate the superior performance and effectiveness of CSFA compared to state-of-the-art approaches.
zh
[AI-46] Grokking Beyond the Euclidean Norm of Model Parameters ICML
【速读】:该论文试图解决在使用基于梯度的方法优化人工神经网络时,模型在过拟合后出现延迟泛化现象(即Grokking)的机制问题。其解决方案的关键在于证明Grokking可以通过正则化(显式或隐式)诱导实现,具体而言,当存在一个具有特定性质P(如稀疏性或低秩权重)的模型能够在目标问题上泛化时,采用小但非零的P相关正则化(如ℓ₁或核范数正则化)的梯度下降方法会导致Grokking现象的发生。此外,研究还表明通过增加网络深度的过参数化可以实现Grokking或消除Grokking,而无需显式使用正则化。
链接: https://arxiv.org/abs/2506.05718
作者: Pascal Jr Tikeng Notsawo,Guillaume Dumas,Guillaume Rabusseau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 67 pages, 35 figures. Forty-second International Conference on Machine Learning (ICML), 2025
点击查看摘要
Abstract:Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property P (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of P (e.g., \ell_1 or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the \ell_2 norm is not a reliable proxy for generalization when the model is regularized toward a different property P , as the \ell_2 norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified solely through data selection, with any other hyperparameter fixed.
zh
[AI-47] Ensemble Elastic DQN: A novel multi-step ensemble approach to address overestimation in deep value-based reinforcement learning
【速读】:该论文试图解决深度强化学习中的两个主要挑战:过估计偏差(overestimation bias)和样本效率(sample efficiency)。其解决方案的关键在于引入一种新的算法——集成弹性步长DQN(Ensemble Elastic Step DQN, EEDQN),该算法将集成方法与弹性步长更新机制相结合,以稳定算法性能。通过系统地整合集成和多步方法,EEDQN在MinAtar基准测试中表现出一致的鲁棒性,并在大多数环境中实现了优于基线DQN方法和当前先进集成DQN的最终回报。
链接: https://arxiv.org/abs/2506.05716
作者: Adrian Ly,Richard Dazeley,Peter Vamplew,Francisco Cruz,Sunil Aryal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While many algorithmic extensions to Deep Q-Networks (DQN) have been proposed, there remains limited understanding of how different improvements interact. In particular, multi-step and ensemble style extensions have shown promise in reducing overestimation bias, thereby improving sample efficiency and algorithmic stability. In this paper, we introduce a novel algorithm called Ensemble Elastic Step DQN (EEDQN), which unifies ensembles with elastic step updates to stabilise algorithmic performance. EEDQN is designed to address two major challenges in deep reinforcement learning: overestimation bias and sample efficiency. We evaluated EEDQN against standard and ensemble DQN variants across the MinAtar benchmark, a set of environments that emphasise behavioral learning while reducing representational complexity. Our results show that EEDQN achieves consistently robust performance across all tested environments, outperforming baseline DQN methods and matching or exceeding state-of-the-art ensemble DQNs in final returns on most of the MinAtar environments. These findings highlight the potential of systematically combining algorithmic improvements and provide evidence that ensemble and multi-step methods, when carefully integrated, can yield substantial gains.
zh
[AI-48] Action-Adaptive Continual Learning: Enabling Policy Generalization under Dynamic Action Spaces
【速读】:该论文试图解决持续学习(Continual Learning, CL)中一个新且现实的问题:具有动态能力的持续学习(Continual Learning with Dynamic Capabilities, CL-DC),其核心挑战在于如何实现跨不同动作空间的策略泛化。解决方案的关键在于提出一种基于动作自适应的持续学习框架(Action-Adaptive Continual Learning, AACL),该框架通过构建动作表示空间,将智能体的策略与具体动作空间解耦,并通过对动作表示的编码器-解码器进行自适应微调,以在稳定性和可塑性之间保持平衡。
链接: https://arxiv.org/abs/2506.05702
作者: Chaofan Pan,Jiafen Liu,Yanhua Li,Linbo Xiong,Fan Min,Wei Wei,Xin Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Continual Learning (CL) is a powerful tool that enables agents to learn a sequence of tasks, accumulating knowledge learned in the past and using it for problem-solving or future task learning. However, existing CL methods often assume that the agent’s capabilities remain static within dynamic environments, which doesn’t reflect real-world scenarios where capabilities dynamically change. This paper introduces a new and realistic problem: Continual Learning with Dynamic Capabilities (CL-DC), posing a significant challenge for CL agents: How can policy generalization across different action spaces be achieved? Inspired by the cortical functions, we propose an Action-Adaptive Continual Learning framework (AACL) to address this challenge. Our framework decouples the agent’s policy from the specific action space by building an action representation space. For a new action space, the encoder-decoder of action representations is adaptively fine-tuned to maintain a balance between stability and plasticity. Furthermore, we release a benchmark based on three environments to validate the effectiveness of methods for CL-DC. Experimental results demonstrate that our framework outperforms popular methods by generalizing the policy across action spaces.
zh
[AI-49] Evaluating AI-Powered Learning Assistants in Engineering Higher Education: Student Engagement Ethical Challenges and Policy Implications
【速读】:该论文试图解决生成式 AI(Generative AI)在高等教育中的应用问题,特别是学生与 AI 工具的互动方式及其对学习体验的影响。研究通过评估 Educational AI Hub 在土木与环境工程课程中的使用情况,探讨学生对 AI 的信任度、伦理担忧、可用性及学习成效的感知。解决方案的关键在于提升 AI 工具的可用性、明确机构政策以及加强教师指导,以促进学生对 AI 的有效和负责任的使用。
链接: https://arxiv.org/abs/2506.05699
作者: Ramteja Sajja,Yusuf Sermet,Brian Fodale,Ibrahim Demir
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 26 pages, 10 Figures, 6 Tables
点击查看摘要
Abstract:As generative AI tools become increasingly integrated into higher education, understanding how students interact with and perceive these technologies is essential for responsible and effective adoption. This study evaluates the use of the Educational AI Hub, an AI-powered learning framework, in undergraduate civil and environmental engineering courses at a large R1 public university. Using a mixed-methods approach that combines pre- and post-surveys, system usage logs, and qualitative analysis of the open-ended prompts and questions students posed to the AI chatbot, the research explores students’ perceptions of trust, ethical concerns, usability, and learning outcomes. Findings reveal that students appreciated the AI assistant for its convenience and comfort, with nearly half reporting greater ease in using the AI tool compared to seeking help from instructors or teaching assistants. The tool was seen as most helpful for completing homework and understanding course concepts, though perceptions of its instructional quality were mixed. Ethical concerns emerged as a key barrier to full engagement: while most students viewed AI use as ethically acceptable, many expressed uncertainties about institutional policies and apprehension about potential academic misconduct. This study contributes to the growing body of research on AI in education by highlighting the importance of usability, policy clarity, and faculty guidance in fostering meaningful AI engagement. The findings suggest that while students are ready to embrace AI as a supplement to human instruction, thoughtful integration and transparent institutional frameworks are critical for ensuring student confidence, trust, and learning effectiveness.
zh
[AI-50] SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM -Generated Code
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)生成代码时存在的安全风险问题,特别是现有研究对生成代码中固有安全漏洞的忽视。其解决方案的关键在于引入了一个名为\benchmark的基准测试集,用于评估LLM生成代码的安全性,并构建了一个自动评估框架,该框架结合了静态应用安全测试(Static Application Security Testing, SAST)和基于LLM的判断方法,以检测模型生成代码中的安全漏洞。
链接: https://arxiv.org/abs/2506.05692
作者: Xinghang Li,Jingzhe Ding,Chao Peng,Bing Zhao,Xiang Gao,Hongwan Gao,Xinchen Gu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The code generation capabilities of large language models(LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code. In this work, we introduce \benchmark, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of common software development scenarios and vulnerability types. Building upon this benchmark, we develop an automatic evaluation framework that leverages both static application security testing(SAST) and LLM-based judging to assess the presence of security vulnerabilities in model-generated code. Through the empirical evaluation of state-of-the-art LLMs on \benchmark, we reveal notable deficiencies in their ability to produce vulnerability-free code. Our findings highlight pressing challenges and offer actionable insights for future advancements in the secure code generation performance of LLMs. The data and code will be released soon.
zh
[AI-51] Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR
【速读】:该论文旨在解决扩展现实(Extended Reality, XR)系统在实现多模态多任务联邦基础模型(Multi-Modal Multi-Task Federated Foundation Models, M3T FedFMs)过程中面临的隐私保护与计算资源约束问题。其核心挑战在于如何在保障用户数据隐私的前提下,有效整合多模态感知、异构硬件、动态交互及环境变化等XR系统特有的复杂性。解决方案的关键在于将多模态多任务基础模型(M3T Foundation Models, FMs)的表征能力与联邦学习(Federated Learning, FL)的隐私保护机制相结合,构建一种模块化架构,以支持不同协调范式的模型训练与聚合,从而实现资源感知的上下文敏感隐私保护智能。
链接: https://arxiv.org/abs/2506.05683
作者: Fardis Nadimi,Payam Abdisarabshali,Kasra Borazjani,Jacob Chakareski,Seyyedali Hosseinalipour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM)
备注: 16 pages, 4 Figures, 8 Tables
点击查看摘要
Abstract:Extended reality (XR) systems, which consist of virtual reality (VR), augmented reality (AR), and mixed reality (XR), offer a transformative interface for immersive, multi-modal, and embodied human-computer interaction. In this paper, we envision that multi-modal multi-task (M3T) federated foundation models (FedFMs) can offer transformative capabilities for XR systems through integrating the representational strength of M3T foundation models (FMs) with the privacy-preserving model training principles of federated learning (FL). We present a modular architecture for FedFMs, which entails different coordination paradigms for model training and aggregations. Central to our vision is the codification of XR challenges that affect the implementation of FedFMs under the SHIFT dimensions: (1) Sensor and modality diversity, (2) Hardware heterogeneity and system-level constraints, (3) Interactivity and embodied personalization, (4) Functional/task variability, and (5) Temporality and environmental variability. We illustrate the manifestation of these dimensions across a set of emerging and anticipated applications of XR systems. Finally, we propose evaluation metrics, dataset requirements, and design tradeoffs necessary for the development of resource-aware FedFMs in XR. This perspective aims to chart the technical and conceptual foundations for context-aware privacy-preserving intelligence in the next generation of XR systems.
zh
[AI-52] Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization
【速读】:该论文旨在解决复杂系统优化问题,例如药物发现和高性能材料设计,这些问题在科学和工程领域中由于底层规则未知且评估成本高昂而成为根本性挑战。传统方法在训练数据之外可能失效,导致预测分数不准确和生成设计质量低下。该论文提出的解决方案是ManGO,其关键在于基于扩散的框架,能够学习设计-分数流形,全面捕捉设计与分数之间的相互依赖关系。与现有方法不同,ManGO统一了前向预测和后向生成,实现了超越训练数据的泛化能力,其核心在于无导数的条件生成引导以及自适应的推理时间缩放,以动态优化去噪路径。
链接: https://arxiv.org/abs/2506.05680
作者: Tailin Zhou,Zhilin Chen,Wenlong Lyu,Zhitang Chen,Danny H.K. Tsang,Jun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This manuscript is submitted and under review
点击查看摘要
Abstract:Optimizing complex systems, from discovering therapeutic drugs to designing high-performance materials, remains a fundamental challenge across science and engineering, as the underlying rules are often unknown and costly to evaluate. Offline optimization aims to optimize designs for target scores using pre-collected datasets without system interaction. However, conventional approaches may fail beyond training data, predicting inaccurate scores and generating inferior designs. This paper introduces ManGO, a diffusion-based framework that learns the design-score manifold, capturing the design-score interdependencies holistically. Unlike existing methods that treat design and score spaces in isolation, ManGO unifies forward prediction and backward generation, attaining generalization beyond training data. Key to this is its derivative-free guidance for conditional generation, coupled with adaptive inference-time scaling that dynamically optimizes denoising paths. Extensive evaluations demonstrate that ManGO outperforms 24 single- and 10 multi-objective optimization methods across diverse domains, including synthetic tasks, robot control, material design, DNA sequence, and real-world engineering optimization.
zh
[AI-53] Bayesian Inference for Correlated Human Experts and Classifiers ICML2025
【速读】:该论文试图解决在机器学习应用中,如何以最少的人类专家查询次数获取准确的类别标签预测问题,同时利用预训练分类器的类别概率估计。解决方案的关键在于提出了一种通用的贝叶斯框架,通过联合潜在表示建模专家相关性,从而实现对额外专家查询效用的模拟推断,并推断未观测专家标签的后验分布。
链接: https://arxiv.org/abs/2506.05636
作者: Markelle Kelly,Alex Boyd,Sam Showalter,Mark Steyvers,Padhraic Smyth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted to ICML 2025
点击查看摘要
Abstract:Applications of machine learning often involve making predictions based on both model outputs and the opinions of human experts. In this context, we investigate the problem of querying experts for class label predictions, using as few human queries as possible, and leveraging the class probability estimates of pre-trained classifiers. We develop a general Bayesian framework for this problem, modeling expert correlation via a joint latent representation, enabling simulation-based inference about the utility of additional expert queries, as well as inference of posterior distributions over unobserved expert labels. We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions relative to baselines in the cost of querying human experts while maintaining high prediction accuracy.
zh
[AI-54] AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization
【速读】:该论文试图解决传统Quality-Diversity (QD)算法依赖人工设计的行为描述符,从而限制了探索空间的多样性问题。其解决方案的关键在于通过将策略与占用测度(occupancy measures)进行等价映射,利用随机傅里叶特征近似策略占用测度之间的最大均值差异(MMD),自动生成行为描述符。该方法通过低维投影提取最具行为区分性的维度,作为QD算法的输入,实现了无需预定义行为描述符的多样化策略发现。
链接: https://arxiv.org/abs/2506.05634
作者: Saeed Hedayatian,Stefanos Nikolaidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 22 pages, 5 figures
点击查看摘要
Abstract:Quality-Diversity (QD) algorithms have shown remarkable success in discovering diverse, high-performing solutions, but rely heavily on hand-crafted behavioral descriptors that constrain exploration to predefined notions of diversity. Leveraging the equivalence between policies and occupancy measures, we present a theoretically grounded approach to automatically generate behavioral descriptors by embedding the occupancy measures of policies in Markov Decision Processes. Our method, AutoQD, leverages random Fourier features to approximate the Maximum Mean Discrepancy (MMD) between policy occupancy measures, creating embeddings whose distances reflect meaningful behavioral differences. A low-dimensional projection of these embeddings that captures the most behaviorally significant dimensions is then used as behavioral descriptors for off-the-shelf QD methods. We prove that our embeddings converge to true MMD distances between occupancy measures as the number of sampled trajectories and embedding dimensions increase. Through experiments in multiple continuous control tasks we demonstrate AutoQD’s ability in discovering diverse policies without predefined behavioral descriptors, presenting a well-motivated alternative to prior methods in unsupervised Reinforcement Learning and QD optimization. Our approach opens new possibilities for open-ended learning and automated behavior discovery in sequential decision making settings without requiring domain-specific knowledge.
zh
[AI-55] GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance
【速读】:该论文旨在解决在分子设计过程中保持与目标分子和/或性质相似性的关键问题,这对于药物发现、化学设计和生物学应用具有重要意义。其解决方案的关键在于提出一种无需训练的高效方法,利用生成式化学语言模型(Generative Chemical Language Model, CLM)的上下文表示来估计分子相似性,并据此调整CLM的自回归采样策略,从而在解码过程中动态优化生成分子的相似性。该方法被实现为GP-MoLFormer-Sim,并进一步整合到遗传算法中以提升性能。
链接: https://arxiv.org/abs/2506.05628
作者: Jiri Navratil,Jarret Ross,Payel Das,Youssef Mroueh,Samuel C Hoffman,Vijil Chenthamarakshan,Brian Belgodere
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages main article, 21 pages total
点击查看摘要
Abstract:The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM), while using the molecular similarity to the target as a guide. Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM. At each step of the decoding process, the method tracks the distance of the current generations from the target and updates the logits to encourage the preservation of similarity in generations. We implement the method using a recently proposed \sim 47M parameter SMILES-based CLM, GP-MoLFormer, and therefore refer to the method as GP-MoLFormer-Sim, which enables a test-time update of the deep generative policy to reflect the contextual similarity to a set of guide molecules. The method is further integrated into a genetic algorithm (GA) and tested on a set of standard molecular optimization benchmarks involving property optimization, molecular rediscovery, and structure-based drug design. Results show that, GP-MoLFormer-Sim, combined with GA (GP-MoLFormer-Sim+GA) outperforms existing training-free baseline methods, when the oracle remains black-box. The findings in this work are a step forward in understanding and guiding the generative mechanisms of CLMs.
zh
[AI-56] Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach
【速读】:该论文试图解决传统偏好学习方法在聚合多评估者偏好时可能产生的偏差问题,即过于侧重广泛持有的观点,导致政策偏向某些类型的偏好或群体。其解决方案的关键在于提出一种新的偏好学习框架,该框架能够使聚合的偏好和政策与评估者偏好的真实人口分布成比例。该方法直接从成对比较数据中推断出评估者人口分布的可行集,并构建满足社会选择理论基本公理(如单调性和帕累托效率)以及新引入的人口比例代表性与人口边界鲁棒性公理的策略。
链接: https://arxiv.org/abs/2506.05619
作者: Kihyun Kim,Jiawei Zhang,Asuman Ozdaglar,Pablo A. Parrilo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups. The objective of this paper is to develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly-introduced axioms of population-proportional representation and population-bounded robustness. We propose a soft-max relaxation method that smoothly trade-offs population-proportional representation with the selection of the Condorcet winner (which beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large-scale language model alignment.
zh
[AI-57] LFA applied to CNNs: Efficient Singular Value Decomposition of Convolutional Mappings by Local Fourier Analysis
【速读】:该论文试图解决卷积映射的奇异值计算资源消耗过大的问题(singular values computation of convolutional mappings),这一问题限制了其在提升卷积神经网络泛化能力、鲁棒性以及模型压缩等方面的广泛应用。现有方法虽然利用快速傅里叶变换(FFT)将卷积映射转换到频域以降低计算复杂度,但其复杂度仍为O(N log N)。本文提出的解决方案关键在于基于局部傅里叶分析(local Fourier analysis)实现O(N)复杂度的算法,并进一步利用卷积算子的平移不变性(shift invariance),从而高效计算高维卷积映射的全部奇异值及其对应的奇异向量。
链接: https://arxiv.org/abs/2506.05617
作者: Antonia van Betteray,Matthias Rottmann,Karsten Kahl
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The singular values of convolutional mappings encode interesting spectral properties, which can be used, e.g., to improve generalization and robustness of convolutional neural networks as well as to facilitate model compression. However, the computation of singular values is typically very resource-intensive. The naive approach involves unrolling the convolutional mapping along the input and channel dimensions into a large and sparse two-dimensional matrix, making the exact calculation of all singular values infeasible due to hardware limitations. In particular, this is true for matrices that represent convolutional mappings with large inputs and a high number of channels. Existing efficient methods leverage the Fast Fourier transformation (FFT) to transform convolutional mappings into the frequency domain, enabling the computation of singular values for matrices representing convolutions with larger input and channel dimensions. For a constant number of channels in a given convolution, an FFT can compute N singular values in O(N log N) complexity. In this work, we propose an approach of complexity O(N) based on local Fourier analysis, which additionally exploits the shift invariance of convolutional operators. We provide a theoretical analysis of our algorithm’s runtime and validate its efficiency through numerical experiments. Our results demonstrate that our proposed method is scalable and offers a practical solution to calculate the entire set of singular values - along with the corresponding singular vectors if needed - for high-dimensional convolutional mappings.
zh
[AI-58] oward Greater Autonomy in Materials Discovery Agents : Unifying Planning Physics and Scientists
【速读】:该论文旨在解决晶体材料发现中语言代理自主性不足的问题,传统方法通常将代理限制在预定义工作流中的特定任务,而本文致力于在给定高层次目标和科学家直觉的情况下自动化工作流规划。解决方案的关键在于提出了一种名为MAPPS(Materials Agent unifying Planning, Physics, and Scientists)的框架,该框架整合了工作流规划器、工具代码生成器和科学中介器,通过大型语言模型生成结构化多步骤工作流,合成可执行的Python代码,并协调沟通与反馈,从而实现更灵活、可靠的材料发现。
链接: https://arxiv.org/abs/2506.05616
作者: Lianhao Zhou,Hongyi Ling,Keqiang Yan,Kaiji Zhao,Xiaoning Qian,Raymundo Arróyave,Xiaofeng Qian,Shuiwang Ji
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
备注:
点击查看摘要
Abstract:We aim at designing language agents with greater autonomy for crystal materials discovery. While most of existing studies restrict the agents to perform specific tasks within predefined workflows, we aim to automate workflow planning given high-level goals and scientist intuition. To this end, we propose Materials Agent unifying Planning, Physics, and Scientists, known as MAPPS. MAPPS consists of a Workflow Planner, a Tool Code Generator, and a Scientific Mediator. The Workflow Planner uses large language models (LLMs) to generate structured and multi-step workflows. The Tool Code Generator synthesizes executable Python code for various tasks, including invoking a force field foundation model that encodes physics. The Scientific Mediator coordinates communications, facilitates scientist feedback, and ensures robustness through error reflection and recovery. By unifying planning, physics, and scientists, MAPPS enables flexible and reliable materials discovery with greater autonomy, achieving a five-fold improvement in stability, uniqueness, and novelty rates compared with prior generative models when evaluated on the MP-20 data. We provide extensive experiments across diverse tasks to show that MAPPS is a promising framework for autonomous materials discovery.
zh
[AI-59] When Maximum Entropy Misleads Policy Optimization
【速读】:该论文试图解决最大熵强化学习(MaxEnt RL)在复杂控制任务中因鲁棒性与最优性之间的权衡而导致的性能问题,特别是在需要精确、低熵策略的任务中,MaxEnt 方法可能无法有效学习。解决方案的关键在于分析熵最大化对策略优化的潜在误导作用,并提出如何在挑战性控制问题中更好地平衡奖励设计与熵最大化之间的关系。
链接: https://arxiv.org/abs/2506.05615
作者: Ruipeng Zhang,Ya-Chien Chang,Sicun Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The Maximum Entropy Reinforcement Learning (MaxEnt RL) framework is a leading approach for achieving efficient learning and robust performance across many RL tasks. However, MaxEnt methods have also been shown to struggle with performance-critical control problems in practice, where non-MaxEnt algorithms can successfully learn. In this work, we analyze how the trade-off between robustness and optimality affects the performance of MaxEnt algorithms in complex control tasks: while entropy maximization enhances exploration and robustness, it can also mislead policy optimization, leading to failure in tasks that require precise, low-entropy policies. Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to better understanding of how to balance reward design and entropy maximization in challenging control problems.
zh
[AI-60] Scenarios in Computing Research: A Systematic Review of the Use of Scenario Methods for Exploring the Future of Computing Technologies in Society
【速读】:该论文试图解决计算机科学领域中场景构建方法的应用现状及其在促进研究包容性方面的潜力未被充分探索的问题。其解决方案的关键在于通过系统文献综述(n=59),分析场景构建在计算文献中的使用方式,特别是其应用的动机,并深入探讨现有场景构建研究中参与性元素的潜在价值,以期为该领域提供更全面的视角和改进方向。
链接: https://arxiv.org/abs/2506.05605
作者: Julia Barnett,Kimon Kieslich,Jasmine Sinchai,Nicholas Diakopoulos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures. Currently under review
点击查看摘要
Abstract:Scenario building is an established method to anticipate the future of emerging technologies. Its primary goal is to use narratives to map future trajectories of technology development and sociotechnical adoption. Following this process, risks and benefits can be identified early on, and strategies can be developed that strive for desirable futures. In recent years, computer science has adopted this method and applied it to various technologies, including Artificial Intelligence (AI). Because computing technologies play such an important role in shaping modern societies, it is worth exploring how scenarios are being used as an anticipatory tool in the field – and what possible traditional uses of scenarios are not yet covered but have the potential to enrich the field. We address this gap by conducting a systematic literature review on the use of scenario building methods in computer science over the last decade (n = 59). We guide the review along two main questions. First, we aim to uncover how scenarios are used in computing literature, focusing especially on the rationale for why scenarios are used. Second, in following the potential of scenario building to enhance inclusivity in research, we dive deeper into the participatory element of the existing scenario building literature in computer science.
zh
[AI-61] Zero-shot protein stability prediction by inverse folding models: a free energy interpretation
【速读】:该论文试图解决逆折叠模型中氨基酸偏好与热力学稳定性所依赖的自由能关系之间理解不足的问题(inverse folding models and the free-energy considerations underlying thermodynamic stability)。其解决方案的关键在于通过理论推导揭示似然比作为简化近似的局限性,并提出改进相对稳定性估计的多种路径,实验结果表明这些方法能够显著提升零样本稳定性预测性能。
链接: https://arxiv.org/abs/2506.05596
作者: Jes Frellsen,Maher M. Kassem,Tone Bengtsen,Lars Olsen,Kresten Lindorff-Larsen,Jesper Ferkinghoff-Borg,Wouter Boomsma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Inverse folding models have proven to be highly effective zero-shot predictors of protein stability. Despite this success, the link between the amino acid preferences of an inverse folding model and the free-energy considerations underlying thermodynamic stability remains incompletely understood. A better understanding would be of interest not only from a theoretical perspective, but also potentially provide the basis for stronger zero-shot stability prediction. In this paper, we take steps to clarify the free-energy foundations of inverse folding models. Our derivation reveals the standard practice of likelihood ratios as a simplistic approximation and suggests several paths towards better estimates of the relative stability. We empirically assess these approaches and demonstrate that considerable gains in zero-shot performance can be achieved with fairly simple means.
zh
[AI-62] Improving Neural Diarization through Speaker Attribute Attractors and Local Dependency Modeling ICASSP2024 ICASSP
【速读】:该论文旨在解决说话人分割与识别(speaker diarization)问题,即在多说话人录音中对说话人进行分段和识别。其解决方案的关键在于扩展了吸引子范式,通过多阶段的中间表示来建模更详细的“说话人属性”,而非直接进行说话人建模,并引入了基于卷积增强的Transformer结构——Conformer,以更好地捕捉局部依赖关系,从而提升语音分割性能。
链接: https://arxiv.org/abs/2506.05593
作者: David Palzer,Matthew Maciejewski,Eric Fosler-Lussier
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 11911-11915
点击查看摘要
Abstract:In recent years, end-to-end approaches have made notable progress in addressing the challenge of speaker diarization, which involves segmenting and identifying speakers in multi-talker recordings. One such approach, Encoder-Decoder Attractors (EDA), has been proposed to handle variable speaker counts as well as better guide the network during training. In this study, we extend the attractor paradigm by moving beyond direct speaker modeling and instead focus on representing more detailed `speaker attributes’ through a multi-stage process of intermediate representations. Additionally, we enhance the architecture by replacing transformers with conformers, a convolution-augmented transformer, to model local dependencies. Experiments demonstrate improved diarization performance on the CALLHOME dataset.
zh
[AI-63] CoFrNets: Interpretable Neural Architecture Inspired by Continued Fractions
【速读】:该论文试图解决神经网络缺乏可解释性的问题,尤其是在构建可解释的神经架构方面研究较少。其解决方案的关键在于提出一种新的神经架构——CoFrNet,该架构受连分数(continued fractions)的启发,具有高效的训练和解释能力,并且通过不同于传统无限宽度(或深度)策略的证明方法,证明了其作为通用逼近器的性质,从而在保持模型性能的同时提升了可解释性。
链接: https://arxiv.org/abs/2506.05586
作者: Isha Puri,Amit Dhurandhar,Tejaswini Pedapati,Kartikeyan Shanmugam,Dennis Wei,Kush R. Varshney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In recent years there has been a considerable amount of research on local post hoc explanations for neural networks. However, work on building interpretable neural architectures has been relatively sparse. In this paper, we present a novel neural architecture, CoFrNet, inspired by the form of continued fractions which are known to have many attractive properties in number theory, such as fast convergence of approximations to real numbers. We show that CoFrNets can be efficiently trained as well as interpreted leveraging their particular functional form. Moreover, we prove that such architectures are universal approximators based on a proof strategy that is different than the typical strategy used to prove universal approximation results for neural networks based on infinite width (or depth), which is likely to be of independent interest. We experiment on nonlinear synthetic functions and are able to accurately model as well as estimate feature attributions and even higher order terms in some cases, which is a testament to the representational power as well as interpretability of such architectures. To further showcase the power of CoFrNets, we experiment on seven real datasets spanning tabular, text and image modalities, and show that they are either comparable or significantly better than other interpretable models and multilayer perceptrons, sometimes approaching the accuracies of state-of-the-art models.
zh
[AI-64] Conformal Prediction Adaptive to Unknown Subpopulation Shifts NEURIPS2025
【速读】:该论文试图解决在存在分布偏移(distribution shifts)的情况下,传统置信预测(conformal prediction)方法的覆盖率保证失效的问题,特别是针对子群体偏移(subpopulation shifts)场景,即测试环境中的子群体混合与校准数据不同。解决方案的关键在于提出新的方法,能够在不依赖子群体结构显式知识的情况下,证明性地适应此类偏移,从而确保有效的覆盖率,并且在高维设置中具有可扩展性,同时在视觉和语言任务中表现出色。
链接: https://arxiv.org/abs/2506.05583
作者: Nien-Shao Wang,Duygu Nur Yaldiz,Yavuz Faruk Bakman,Sai Praneeth Karimireddy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 20 pages, 6 figures, 5 tables, submitted to NeurIPS 2025
点击查看摘要
Abstract:Conformal prediction is widely used to equip black-box machine learning models with uncertainty quantification enjoying formal coverage guarantees. However, these guarantees typically break down in the presence of distribution shifts, where the data distribution at test time differs from the training (or calibration-time) distribution. In this work, we address subpopulation shifts, where the test environment exhibits an unknown and differing mixture of subpopulations compared to the calibration data. We propose new methods that provably adapt conformal prediction to such shifts, ensuring valid coverage without requiring explicit knowledge of subpopulation structure. Our algorithms scale to high-dimensional settings and perform effectively in realistic machine learning tasks. Extensive experiments on vision (with vision transformers) and language (with large language models) benchmarks demonstrate that our methods reliably maintain coverage and controls risk in scenarios where standard conformal prediction fails.
zh
[AI-65] Collaborative Learning in Agent ic Systems: A Collective AI is Greater Than the Sum of Its Parts
【速读】:该论文旨在解决多智能体系统中如何实现高效、自主和协作的学习问题,特别是在去中心化环境中,面对任务多样性、数据分布复杂性以及通信和计算资源受限等挑战。其解决方案的关键在于提出一种名为MOSAIC的算法,该算法通过模块化策略组合、基于Wasserstein嵌入的余弦相似度估计以及异步通信与策略整合三个核心机制,使多个智能体能够在无协调、同步或集中控制的情况下,独立完成不同任务并共享、复用有价值的机器学习知识,从而提升整体学习效率与任务求解能力。
链接: https://arxiv.org/abs/2506.05577
作者: Saptarshi Nath,Christos Peridis,Eseoghene Benjamin,Xinran Liu,Soheil Kolouri,Peter Kinnell,Zexin Li,Cong Liu,Shirin Dora,Andrea Soltoggio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 36 pages, 21 figures, 6 tables. Preprint
点击查看摘要
Abstract:Agentic AI has gained significant interest as a research paradigm focused on autonomy, self-directed learning, and long-term reliability of decision making. Real-world agentic systems operate in decentralized settings on a large set of tasks or data distributions with constraints such as limited bandwidth, asynchronous execution, and the absence of a centralized model or even common objectives. We posit that exploiting previously learned skills, task similarities, and communication capabilities in a collective of agentic AI are challenging but essential elements to enabling scalability, open-endedness, and beneficial collaborative learning dynamics. In this paper, we introduce Modular Sharing and Composition in Collective Learning (MOSAIC), an agentic algorithm that allows multiple agents to independently solve different tasks while also identifying, sharing, and reusing useful machine-learned knowledge, without coordination, synchronization, or centralized control. MOSAIC combines three mechanisms: (1) modular policy composition via neural network masks, (2) cosine similarity estimation using Wasserstein embeddings for knowledge selection, and (3) asynchronous communication and policy integration. Results on a set of RL benchmarks show that MOSAIC has a greater sample efficiency than isolated learners, i.e., it learns significantly faster, and in some cases, finds solutions to tasks that cannot be solved by isolated learners. The collaborative learning and sharing dynamics are also observed to result in the emergence of ideal curricula of tasks, from easy to hard. These findings support the case for collaborative learning in agentic systems to achieve better and continuously evolving performance both at the individual and collective levels.
zh
[AI-66] Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning
【速读】:该论文旨在解决在联邦学习(Federated Learning, FL)环境下,大型语言模型(Large Language Models, LLMs)难以有效利用边缘设备数据的问题,特别是现有基于低秩适应(Low-Rank Adaptation, LoRA)的方法在数据和计算异构性条件下存在精度下降的问题。其解决方案的关键在于提出一种自适应多头LoRA方法——\textscRavan,通过将权重更新重新参数化为多个LoRA头的和,仅训练核心矩阵及其轻量级缩放因子,从而在保持参数效率的同时提升模型表达能力,实现更高的测试精度。
链接: https://arxiv.org/abs/2506.05568
作者: Arian Raje,Baris Askin,Divyansh Jhunjhunwala,Gauri Joshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have not yet effectively leveraged the vast amounts of edge-device data, and federated learning (FL) offers a promising paradigm to collaboratively fine-tune LLMs without transferring private edge data to the cloud. To operate within the computation and communication constraints of edge devices, recent literature on federated fine-tuning of LLMs proposes the use of low-rank adaptation (LoRA) and similar parameter-efficient methods. However, LoRA-based methods suffer from accuracy degradation in FL settings, primarily because of data and computational heterogeneity across clients. We propose \textscRavan, an adaptive multi-head LoRA method that balances parameter efficiency and model expressivity by reparameterizing the weight updates as the sum of multiple LoRA heads s_i\textbfB_i\textbfH_i\textbfA_i in which only the core matrices \textbfH_i and their lightweight scaling factors s_i are trained. These trainable scaling factors let the optimization focus on the most useful heads, recovering a higher-rank approximation of the full update without increasing the number of communicated parameters since clients upload s_i\textbfH_i directly. Experiments on vision and language benchmarks show that \textscRavan improves test accuracy by 2-8% over prior parameter-efficient baselines, making it a robust and scalable solution for federated fine-tuning of LLMs.
zh
[AI-67] ScaleRTL: Scaling LLM s with Reasoning Data and Test-Time Compute for Accurate RTL Code Generation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在寄存器传输级(RTL)代码生成中的有效性受限问题,主要由于高质量训练数据的稀缺性。现有方法虽通过微调大型语言模型(LLMs)进行RTL任务,但未能从根本上突破数据瓶颈,且因缺乏推理能力而无法支持测试时的扩展。解决方案的关键在于提出ScaleRTL,这是首个具备推理能力的RTL编码大型语言模型,其核心在于通过构建大规模的长链式思维推理轨迹数据集(3.5B tokens),并采用一种新颖的测试时扩展策略,通过迭代反思和自我修正提升推理过程,从而实现深度RTL推理能力。
链接: https://arxiv.org/abs/2506.05566
作者: Chenhui Deng,Yun-Da Tsai,Guan-Ting Liu,Zhongzhi Yu,Haoxing Ren
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent advances in large language models (LLMs) have enabled near-human performance on software coding benchmarks, but their effectiveness in RTL code generation remains limited due to the scarcity of high-quality training data. While prior efforts have fine-tuned LLMs for RTL tasks, they do not fundamentally overcome the data bottleneck and lack support for test-time scaling due to their non-reasoning nature. In this work, we introduce ScaleRTL, the first reasoning LLM for RTL coding that scales up both high-quality reasoning data and test-time compute. Specifically, we curate a diverse set of long chain-of-thought reasoning traces averaging 56K tokens each, resulting in a dataset of 3.5B tokens that captures rich RTL knowledge. Fine-tuning a general-purpose reasoning model on this corpus yields ScaleRTL that is capable of deep RTL reasoning. Subsequently, we further enhance the performance of ScaleRTL through a novel test-time scaling strategy that extends the reasoning process via iteratively reflecting on and self-correcting previous reasoning steps. Experimental results show that ScaleRTL achieves state-of-the-art performance on VerilogEval and RTLLM, outperforming 18 competitive baselines by up to 18.4% on VerilogEval and 12.7% on RTLLM.
zh
[AI-68] Applying Informer for Option Pricing: A Transformer-Based Approach
【速读】:该论文试图解决金融市场上期权定价的准确性问题,这一问题由于市场波动性和传统模型(如Black-Scholes模型)的局限性而难以应对。论文提出的解决方案关键在于应用Informer神经网络,利用其捕捉长期依赖关系和动态适应市场波动的能力,从而提升预测精度并构建更具适应性和鲁棒性的框架。
链接: https://arxiv.org/abs/2506.05565
作者: Feliks Bańka,Jarosław A. Chudziak
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
备注: 8 pages, 3 tables, 7 figures. Accepted at the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025). Final version published in Proceedings of ICAART 2025 (Vol. 3), pages 1270-1277
点击查看摘要
Abstract:Accurate option pricing is essential for effective trading and risk management in financial markets, yet it remains challenging due to market volatility and the limitations of traditional models like Black-Scholes. In this paper, we investigate the application of the Informer neural network for option pricing, leveraging its ability to capture long-term dependencies and dynamically adjust to market fluctuations. This research contributes to the field of financial forecasting by introducing Informer’s efficient architecture to enhance prediction accuracy and provide a more adaptable and resilient framework compared to existing methods. Our results demonstrate that Informer outperforms traditional approaches in option pricing, advancing the capabilities of data-driven financial forecasting in this domain.
zh
[AI-69] Avoiding Death through Fear Intrinsic Conditioning
【速读】:该论文试图解决在复杂环境中评估强化学习方法时面临的挑战,特别是由于环境状态中存在高负奖励但无反馈的情况,如“死亡”这一刺激。解决方案的关键在于引入一种受早期杏仁核发育启发的内在奖励函数,并通过一种新颖的记忆增强神经网络(MANN)架构生成该奖励。该内在动机能够阻止智能体探索终止状态,从而产生类似动物中观察到的恐惧条件反射的回避行为,同时通过调整恐惧反应的阈值,可模拟广泛性焦虑障碍(GAD)相关的多种行为模式。
链接: https://arxiv.org/abs/2506.05529
作者: Rodney Sanchez,Ferat Sahin,Alexander Ororbia,Jamison Heard
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Biological and psychological concepts have inspired reinforcement learning algorithms to create new complex behaviors that expand agents’ capacity. These behaviors can be seen in the rise of techniques like goal decomposition, curriculum, and intrinsic rewards, which have paved the way for these complex behaviors. One limitation in evaluating these methods is the requirement for engineered extrinsic for realistic environments. A central challenge in engineering the necessary reward function(s) comes from these environments containing states that carry high negative rewards, but provide no feedback to the agent. Death is one such stimuli that fails to provide direct feedback to the agent. In this work, we introduce an intrinsic reward function inspired by early amygdala development and produce this intrinsic reward through a novel memory-augmented neural network (MANN) architecture. We show how this intrinsic motivation serves to deter exploration of terminal states and results in avoidance behavior similar to fear conditioning observed in animals. Furthermore, we demonstrate how modifying a threshold where the fear response is active produces a range of behaviors that are described under the paradigm of general anxiety disorders (GADs). We demonstrate this behavior in the Miniworld Sidewalk environment, which provides a partially observable Markov decision process (POMDP) and a sparse reward with a non-descriptive terminal condition, i.e., death. In effect, this study results in a biologically-inspired neural architecture and framework for fear conditioning paradigms; we empirically demonstrate avoidance behavior in a constructed agent that is able to solve environments with non-descriptive terminal conditions.
zh
[AI-70] owards Data Systems That Are Business Semantic-Centric and AI Agents -Assisted
【速读】:该论文试图解决现有数据平台在动态商业环境中因过于侧重技术工具而忽视业务需求所导致的效率低下和响应迟缓问题。其解决方案的关键在于提出一种以业务语义为中心、AI代理辅助的数据系统(Business Semantics Centric, AI Agents Assisted Data System, BSDS),该系统通过整合架构、工作流和团队组织,确保数据系统与业务优先级对齐,而非受制于技术约束。BSDS的核心创新在于将数据系统重新定义为动态推动业务成功的力量,并通过AI代理提升数据访问与系统管理的效率,同时结合业务语义与技术能力,实现跨职能协作与可扩展性。
链接: https://arxiv.org/abs/2506.05520
作者: Cecil Pang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Being peer reviewed by a journal
点击查看摘要
Abstract:Contemporary businesses operate in dynamic environments requiring rapid adaptation to achieve goals and maintain competitiveness. Existing data platforms often fall short by emphasizing tools over alignment with business needs, resulting in inefficiencies and delays. To address this gap, I propose the Business Semantics Centric, AI Agents Assisted Data System (BSDS), a holistic system that integrates architecture, workflows, and team organization to ensure data systems are tailored to business priorities rather than dictated by technical constraints. BSDS redefines data systems as dynamic enablers of business success, transforming them from passive tools into active drivers of organizational growth. BSDS has a modular architecture that comprises curated data linked to business entities, a knowledge base for context-aware AI agents, and efficient data pipelines. AI agents play a pivotal role in assisting with data access and system management, reducing human effort, and improving scalability. Complementing this architecture, BSDS incorporates workflows optimized for both exploratory data analysis and production requirements, balancing speed of delivery with quality assurance. A key innovation of BSDS is its incorporation of the human factor. By aligning data team expertise with business semantics, BSDS bridges the gap between technical capabilities and business needs. Validated through real-world implementation, BSDS accelerates time-to-market for data-driven initiatives, enhances cross-functional collaboration, and provides a scalable blueprint for businesses of all sizes. Future research can build on BSDS to explore optimization strategies using complex systems and adaptive network theories, as well as developing autonomous data systems leveraging AI agents.
zh
[AI-71] Learning to Recover: Dynamic Reward Shaping with Wheel-Leg Coordination for Fallen Robots
【速读】:该论文旨在解决轮腿式机器人在跌倒后自适应恢复的问题,这一能力对于其实际部署至关重要。传统方法依赖于预设的恢复动作、简化的动力学模型或稀疏奖励,难以生成鲁棒的恢复策略。论文提出的解决方案关键在于结合基于情节的动态奖励塑造和课程学习的强化学习框架,通过动态平衡多样化恢复动作的探索与姿态精炼,同时采用非对称的Actor-Critic架构加速训练,并通过注入噪声的观测提高对不确定性的鲁棒性。此外,轮腿协同控制有效降低了关节扭矩消耗并提升了稳定性。
链接: https://arxiv.org/abs/2506.05516
作者: Boyuan Deng,Luca Rossini,Jin Wang,Weijie Wang,Nikolaos Tsagarakis
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Adaptive recovery from fall incidents are essential skills for the practical deployment of wheeled-legged robots, which uniquely combine the agility of legs with the speed of wheels for rapid recovery. However, traditional methods relying on preplanned recovery motions, simplified dynamics or sparse rewards often fail to produce robust recovery policies. This paper presents a learning-based framework integrating Episode-based Dynamic Reward Shaping and curriculum learning, which dynamically balances exploration of diverse recovery maneuvers with precise posture refinement. An asymmetric actor-critic architecture accelerates training by leveraging privileged information in simulation, while noise-injected observations enhance robustness against uncertainties. We further demonstrate that synergistic wheel-leg coordination reduces joint torque consumption by 15.8% and 26.2% and improves stabilization through energy transfer mechanisms. Extensive evaluations on two distinct quadruped platforms achieve recovery success rates up to 99.1% and 97.8% without platform-specific tuning. The supplementary material is available at this https URL
zh
[AI-72] Winner-takes-all for Multivariate Probabilistic Time Series Forecasting ICML2025
【速读】:该论文试图解决时间序列预测中生成多个合理未来轨迹的问题,特别是在面对不确定性和模糊性任务时的挑战。解决方案的关键在于引入TimeMCL方法,该方法基于Multiple Choice Learning(MCL)范式,通过使用具有多个输出头的神经网络和Winner-Takes-All(WTA)损失函数,促进预测结果的多样性,从而实现对多样化未来的高效预测。
链接: https://arxiv.org/abs/2506.05515
作者: Adrien Cortés,Rémi Rehm,Victor Letzelter
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ICML 2025
点击查看摘要
Abstract:We introduce TimeMCL, a method leveraging the Multiple Choice Learning (MCL) paradigm to forecast multiple plausible time series futures. Our approach employs a neural network with multiple heads and utilizes the Winner-Takes-All (WTA) loss to promote diversity among predictions. MCL has recently gained attention due to its simplicity and ability to address ill-posed and ambiguous tasks. We propose an adaptation of this framework for time-series forecasting, presenting it as an efficient method to predict diverse futures, which we relate to its implicit quantization objective. We provide insights into our approach using synthetic data and evaluate it on real-world time series, demonstrating its promising performance at a light computational cost.
zh
[AI-73] Beyond the Buzz: A Prag matic Take on Inference Disaggregation
【速读】:该论文旨在解决在多节点部署中,如何通过拆分推理过程以提升吞吐量与交互性之间的帕累托前沿问题。其解决方案的关键在于动态速率匹配和弹性扩展,这些机制在大规模实验中被证明对于实现帕累托最优性能至关重要。
链接: https://arxiv.org/abs/2506.05508
作者: Tiyasa Mitra,Ritika Borkar,Nidhi Bhatia,Ramon Matas,Shivam Raj,Dheevatsa Mudigere,Ritchie Zhao,Maximilian Golub,Arpan Dutta,Sailaja Madduri,Dharmesh Jani,Brian Pharris,Bita Darvish Rouhani
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity.
zh
[AI-74] StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)生成文本的溯源问题,现有方法要么影响原始文本分布,要么仅能嵌入零比特信息,仅支持水印检测而无法实现身份识别。其解决方案的关键在于提出StealthInk,一种隐蔽的多比特水印方案,能够在不改变原始文本分布的前提下,嵌入如用户ID、时间戳和模型ID等来源信息,从而实现快速溯源,且无需访问语言模型的API或提示。
链接: https://arxiv.org/abs/2506.05502
作者: Ya Jiang,Chuxiong Wu,Massieh Kordi Boroujeny,Brian Mark,Kai Zeng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: camera-ready version
点击查看摘要
Abstract:Watermarking for large language models (LLMs) offers a promising approach to identifying AI-generated text. Existing approaches, however, either compromise the distribution of original generated text by LLMs or are limited to embedding zero-bit information that only allows for watermark detection but ignores identification. We present StealthInk, a stealthy multi-bit watermarking scheme that preserves the original text distribution while enabling the embedding of provenance data, such as userID, TimeStamp, and modelID, within LLM-generated text. This enhances fast traceability without requiring access to the language model’s API or prompts. We derive a lower bound on the number of tokens necessary for watermark detection at a fixed equal error rate, which provides insights on how to enhance the capacity. Comprehensive empirical evaluations across diverse tasks highlight the stealthiness, detectability, and resilience of StealthInk, establishing it as an effective solution for LLM watermarking applications.
zh
[AI-75] Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models
【速读】:该论文旨在解决生成式 AI 模型(如大语言模型)在高风险应用中进行不确定性量化(Uncertainty Quantification, UQ)的问题,传统方法依赖于结构化输出的几何距离或 softmax 分数,而无法直接应用于黑盒生成模型。其解决方案的关键在于提出一种名为 Conformal Prediction with Query Oracle (CPQ) 的框架,该框架通过有限查询构建预测集,并在覆盖率、测试时查询预算和信息性之间建立新的权衡。CPQ 的核心原理基于统计学中的缺失质量问题(Missing Mass Problem),其中最优查询策略依赖于缺失质量的衰减率,而最优映射则依赖于缺失质量本身,两者均通过新颖的估计方法实现。
链接: https://arxiv.org/abs/2506.05497
作者: Sima Noorani,Shayan Kiyani,George Pappas,Hamed Hassani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Uncertainty quantification (UQ) is essential for safe deployment of generative AI models such as large language models (LLMs), especially in high stakes applications. Conformal prediction (CP) offers a principled uncertainty quantification framework, but classical methods focus on regression and classification, relying on geometric distances or softmax scores: tools that presuppose structured outputs. We depart from this paradigm by studying CP in a query only setting, where prediction sets must be constructed solely from finite queries to a black box generative model, introducing a new trade off between coverage, test time query budget, and informativeness. We introduce Conformal Prediction with Query Oracle (CPQ), a framework characterizing the optimal interplay between these objectives. Our finite sample algorithm is built on two core principles: one governs the optimal query policy, and the other defines the optimal mapping from queried samples to prediction sets. Remarkably, both are rooted in the classical missing mass problem in statistics. Specifically, the optimal query policy depends on the rate of decay, or the derivative, of the missing mass, for which we develop a novel estimator. Meanwhile, the optimal mapping hinges on the missing mass itself, which we estimate using Good Turing estimators. We then turn our focus to implementing our method for language models, where outputs are vast, variable, and often under specified. Fine grained experiments on three real world open ended tasks and two LLMs, show CPQ applicability to any black box LLM and highlight: (1) individual contribution of each principle to CPQ performance, and (2) CPQ ability to yield significantly more informative prediction sets than existing conformal methods for language uncertainty quantification.
zh
[AI-76] Sentiment Analysis in Learning Management Systems Understanding Student Feedback at Scale
【速读】:该论文试图解决在线学习环境中由于非语言交流缺失导致的师生沟通障碍问题(communication gap),该问题使得教育体验的有效性降低。解决方案的关键在于将情感分析(sentiment analysis)集成到学习管理系统(LMS)中,通过深度神经网络模型(包括词嵌入、LSTM和注意力机制)来超越口头反馈的语境,解析学生反馈的情感背景,从而提升在线教育的质量。
链接: https://arxiv.org/abs/2506.05490
作者: Mohammed Almutairi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 10 figures
点击查看摘要
Abstract:During the wake of the Covid-19 pandemic, the educational paradigm has experienced a major change from in person learning traditional to online platforms. The change of learning convention has impacted the teacher-student especially in non-verbal communication. The absent of non-verbal communication has led to a reliance on verbal feedback which diminished the efficacy of the educational experience. This paper explores the integration of sentiment analysis into learning management systems (LMS) to bridge the student-teacher’s gap by offering an alternative approach to interpreting student feedback beyond its verbal context. The research involves data preparation, feature selection, and the development of a deep neural network model encompassing word embedding, LSTM, and attention mechanisms. This model is compared against a logistic regression baseline to evaluate its efficacy in understanding student feedback. The study aims to bridge the communication gap between instructors and students in online learning environments, offering insights into the emotional context of student feedback and ultimately improving the quality of online education.
zh
[AI-77] Zeroth-Order Optimization Finds Flat Minima
【速读】:该论文试图解决零阶优化方法在缺乏梯度信息或计算成本过高的场景下,如何隐式地引导模型收敛到特定解的问题,特别是关注于隐式正则化对最终解的精细刻画。其解决方案的关键在于证明标准两点估计器的零阶优化方法倾向于选择海森矩阵迹较小的解,而这一特性被广泛用于区分尖锐和平坦的极小值点,从而揭示了零阶优化在凸函数和充分平滑函数下收敛至近似平坦极小值的理论依据。
链接: https://arxiv.org/abs/2506.05454
作者: Liang Zhang,Bingcong Li,Kiran Koshy Thekumparampil,Sewoong Oh,Michael Muehlebach,Niao He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known on the implicit regularization that provides a fine-grained characterization on which particular solutions are finally reached. We show that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, which is widely used in previous work to distinguish between sharp and flat minima. We further provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, where flat minima are defined as the minimizers that achieve the smallest trace of Hessian among all optimal solutions. Experiments on binary classification tasks with convex losses and language model fine-tuning support our theoretical findings.
zh
[AI-78] raining Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning ACL2025
【速读】:该论文试图解决语言模型在训练过程中由于规模扩大而出现的损失衰减(loss deceleration)问题,即在训练早期损失下降速度突然减缓,导致损失曲线在对数-对数空间中呈现分段线性行为。其解决方案的关键在于揭示了这种现象源于一种称为零和学习(zero-sum learning, ZSL)的退化训练动态,其中每个样本的梯度系统性地相互抵消,导致损失在不同样本子集间的改进产生破坏性干扰。通过扩大模型规模,可以缓解这一过渡过程,具体表现为降低衰减发生的损失值以及提升衰减后的损失改进速率。
链接: https://arxiv.org/abs/2506.05447
作者: Andrei Mircea,Supriyo Chakraborty,Nima Chitsazan,Irina Rish,Ekaterina Lobacheva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ACL 2025
点击查看摘要
Abstract:This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: this https URL
zh
[AI-79] Sentinel: SOTA model to protect against prompt injections
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对提示注入攻击(prompt injection attacks)时的脆弱性问题,此类攻击通过恶意输入使模型偏离其预期指令。论文提出的解决方案是Sentinel,一种基于ModernBERT-large架构的新型检测模型。其关键在于利用ModernBERT的先进特性,并在包含多种攻击类型和良性指令的广泛且多样的数据集上进行微调,从而实现卓越的检测性能。
链接: https://arxiv.org/abs/2506.05446
作者: Dror Ivry,Oran Nahum
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 tables
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly powerful but remain vulnerable to prompt injection attacks, where malicious inputs cause the model to deviate from its intended instructions. This paper introduces Sentinel, a novel detection model, qualifire/prompt-injection-sentinel, based on the \answerdotai/ModernBERT-large architecture. By leveraging ModernBERT’s advanced features and fine-tuning on an extensive and diverse dataset comprising a few open-source and private collections, Sentinel achieves state-of-the-art performance. This dataset amalgamates varied attack types, from role-playing and instruction hijacking to attempts to generate biased content, alongside a broad spectrum of benign instructions, with private datasets specifically targeting nuanced error correction and real-world misclassifications. On a comprehensive, unseen internal test set, Sentinel demonstrates an average accuracy of 0.987 and an F1-score of 0.980. Furthermore, when evaluated on public benchmarks, it consistently outperforms strong baselines like protectai/deberta-v3-base-prompt-injection-v2. This work details Sentinel’s architecture, its meticulous dataset curation, its training methodology, and a thorough evaluation, highlighting its superior detection capabilities.
zh
[AI-80] Causal Policy Learning in Reinforcement Learning: Backdoor-Adjusted Soft Actor-Critic
【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)中由于隐藏混杂因素(hidden confounders)同时影响状态和动作而导致策略学习偏差的问题,这种偏差会导致策略表现不佳或泛化能力差。解决方案的关键在于提出DoSAC(Do-Calculus Soft Actor-Critic with Backdoor Adjustment),该方法通过因果干预估计修正隐藏混杂因素的影响。其核心创新是引入一个可学习的后门重构器(Backdoor Reconstructor),从当前状态推断出伪过去变量(即之前的状体和动作),从而在不依赖真实混杂因素或因果标签的情况下实现后门调整,进而计算干预性策略及其熵。
链接: https://arxiv.org/abs/2506.05445
作者: Thanh Vinh Vo,Young Lee,Haozhe Ma,Chien Lu,Tze-Yun Leong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
点击查看摘要
Abstract:Hidden confounders that influence both states and actions can bias policy learning in reinforcement learning (RL), leading to suboptimal or non-generalizable behavior. Most RL algorithms ignore this issue, learning policies from observational trajectories based solely on statistical associations rather than causal effects. We propose DoSAC (Do-Calculus Soft Actor-Critic with Backdoor Adjustment), a principled extension of the SAC algorithm that corrects for hidden confounding via causal intervention estimation. DoSAC estimates the interventional policy \pi(a | \mathrmdo(s)) using the backdoor criterion, without requiring access to true confounders or causal labels. To achieve this, we introduce a learnable Backdoor Reconstructor that infers pseudo-past variables (previous state and action) from the current state to enable backdoor adjustment from observational data. This module is integrated into a soft actor-critic framework to compute both the interventional policy and its entropy. Empirical results on continuous control benchmarks show that DoSAC outperforms baselines under confounded settings, with improved robustness, generalization, and policy reliability.
zh
[AI-81] UniPTMs: The First Unified Multi-type PTM Site Prediction Model via Master-Slave Architecture-Based Multi-Stage Fusion Strategy and Hierarchical Contrastive Loss
【速读】:该论文旨在解决现有深度学习模型在跨模态特征融合、领域泛化能力和架构优化方面的局限性,从而更准确地预测蛋白质翻译后修饰(PTMs)。其解决方案的关键在于提出UniPTMs框架,该框架采用创新的“主-从”双路径协同架构:主路径通过双向门控交叉注意力(BGCA)模块动态整合蛋白质序列、结构和进化信息的高维表示,而从路径则利用低维融合网络(LDFN)优化结构与传统特征之间的差异和重校准。此外,框架还结合多尺度自适应卷积金字塔(MACP)和双向分层门控融合网络(BHGFN),并通过分层动态加权融合(HDWF)机制实现多模态特征的智能聚合,同时引入分层对比损失函数以优化特征一致性。
链接: https://arxiv.org/abs/2506.05443
作者: Yiyu Lin,Yan Wang,You Zhou,Xinye Ni,Jiahui Wu,Sen Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:
点击查看摘要
Abstract:As a core mechanism of epigenetic regulation in eukaryotes, protein post-translational modifications (PTMs) require precise prediction to decipher dynamic life activity networks. To address the limitations of existing deep learning models in cross-modal feature fusion, domain generalization, and architectural optimization, this study proposes UniPTMs: the first unified framework for multi-type PTM prediction. The framework innovatively establishes a “Master-Slave” dual-path collaborative architecture: The master path dynamically integrates high-dimensional representations of protein sequences, structures, and evolutionary information through a Bidirectional Gated Cross-Attention (BGCA) module, while the slave path optimizes feature discrepancies and recalibration between structural and traditional features using a Low-Dimensional Fusion Network (LDFN). Complemented by a Multi-scale Adaptive convolutional Pyramid (MACP) for capturing local feature patterns and a Bidirectional Hierarchical Gated Fusion Network (BHGFN) enabling multi-level feature integration across paths, the framework employs a Hierarchical Dynamic Weighting Fusion (HDWF) mechanism to intelligently aggregate multimodal features. Enhanced by a novel Hierarchical Contrastive loss function for feature consistency optimization, UniPTMs demonstrates significant performance improvements (3.2%-11.4% MCC and 4.2%-14.3% AP increases) over state-of-the-art models across five modification types and transcends the Single-Type Prediction Paradigm. To strike a balance between model complexity and performance, we have also developed a lightweight variant named UniPTMs-mini.
zh
[AI-82] An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics
【速读】:该论文旨在解决传统健康指标(HI)构建方法依赖专家知识进行特征提取,忽视退化过程中序列数据中的动态信息,从而限制了HI在退化趋势表征和预测方面的性能问题。其解决方案的关键在于提出一种基于无监督框架的动态HI构造方法,通过包含跳跃连接的自编码器模块将原始信号映射到退化特征空间(DFS),实现无需专家知识的退化特征自动提取;随后,在DFS中引入嵌入内部HI预测块的HI生成模块,显式建模历史与当前HI状态之间的时序依赖关系,从而捕捉退化过程的内在动态特性,提升HI在退化趋势建模和未来退化预测中的有效性。
链接: https://arxiv.org/abs/2506.05438
作者: Tongda Sun,Chen Yin,Huailiang Zheng,Yining Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Health indicator (HI) plays a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and overlook capturing dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI for degradation trend representation and prognostics. To address these concerns, a novel dynamic HI that considers HI-level temporal dependence is constructed through an unsupervised framework. Specifically, a degradation feature learning module composed of a skip-connection-based autoencoder first maps raw signals to a representative degradation feature space (DFS) to automatically extract essential degradation features without the need for expert knowledge. Subsequently, in this DFS, a new HI-generating module embedded with an inner HI-prediction block is proposed for dynamic HI construction, where the temporal dependence between past and current HI states is guaranteed and modeled explicitly. On this basis, the dynamic HI captures the inherent dynamic contents of the degradation process, ensuring its effectiveness for degradation tendency modeling and future degradation prognostics. The experiment results on two bearing lifecycle datasets demonstrate that the proposed HI construction method outperforms comparison methods, and the constructed dynamic HI is superior for prognostic tasks.
zh
[AI-83] A MARL-based Approach for Easing MAS Organization Engineering
【速读】:该论文试图解决在复杂且可读性低的部署环境中,设计应用特定的多智能体系统(Multi-Agent Systems, MAS)时存在的高成本和安全风险问题。解决方案的关键在于提出一种辅助的MAS组织工程方法(Assisted MAS Organization Engineering Approach, AOMEA),该方法通过将多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)过程与组织模型相结合,以提供相关的组织规范,从而优化MAS的工程设计。
链接: https://arxiv.org/abs/2506.05437
作者: Julien Soulé,Jean-Paul Jamont,Michel Occello,Louis-Marie Traonouez,Paul Théron
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multi-Agent Systems (MAS) have been successfully applied in industry for their ability to address complex, distributed problems, especially in IoT-based systems. Their efficiency in achieving given objectives and meeting design requirements is strongly dependent on the MAS organization during the engineering process of an application-specific MAS. To design a MAS that can achieve given goals, available methods rely on the designer’s knowledge of the deployment environment. However, high complexity and low readability in some deployment environments make the application of these methods to be costly or raise safety concerns. In order to ease the MAS organization design regarding those concerns, we introduce an original Assisted MAS Organization Engineering Approach (AOMEA). AOMEA relies on combining a Multi-Agent Reinforcement Learning (MARL) process with an organizational model to suggest relevant organizational specifications to help in MAS engineering.
zh
[AI-84] Event Classification of Accelerometer Data for Industrial Package Monitoring with Embedded Deep Learning
【速读】:该论文旨在解决工业应用中可重复使用包装的状态监测问题,以提高运营效率和生态可持续性。其解决方案的关键在于设计一个具有数年使用寿命的嵌入式系统,通过最小化唤醒时间来延长设备寿命,并采用深度学习模型对来自嵌入式传感器的不平衡多类时间序列数据进行分类。该方法利用一维卷积神经网络(Convolutional Neural Network)对加速度计数据进行分类,并通过数据增强技术和模型压缩技术优化模型性能与部署可行性。
链接: https://arxiv.org/abs/2506.05435
作者: Manon Renault(IMT Atlantique),Hamoud Younes,Hugo Tessier,Ronan Le Roy,Bastien Pasdeloup(IMT Atlantique),Mathieu Léonardon(IMT Atlantique)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Package monitoring is an important topic in industrial applications, with significant implications for operational efficiency and ecological sustainability. In this study, we propose an approach that employs an embedded system, placed on reusable packages, to detect their state (on a Forklift, in a Truck, or in an undetermined location). We aim to design a system with a lifespan of several years, corresponding to the lifespan of reusable packages. Our analysis demonstrates that maximizing device lifespan requires minimizing wake time. We propose a pipeline that includes data processing, training, and evaluation of the deep learning model designed for imbalanced, multiclass time series data collected from an embedded sensor. The method uses a one-dimensional Convolutional Neural Network architecture to classify accelerometer data from the IoT device. Before training, two data augmentation techniques are tested to solve the imbalance problem of the dataset: the Synthetic Minority Oversampling TEchnique and the ADAptive SYNthetic sampling approach. After training, compression techniques are implemented to have a small model size. On the considered twoclass problem, the methodology yields a precision of 94.54% for the first class and 95.83% for the second class, while compression techniques reduce the model size by a factor of four. The trained model is deployed on the IoT device, where it operates with a power consumption of 316 mW during inference.
zh
[AI-85] Efficient Robust Conformal Prediction via Lipschitz-Bounded Networks
【速读】:该论文试图解决在对抗攻击下传统置信预测(Conformal Prediction, CP)方法无法保持其保证的问题,即如何在保证预测集可靠性的同时,提高其在大规模场景下的效率与实用性。解决方案的关键在于提出一种基于Lipschitz有界网络的新型方法(lip-rcp),通过利用网络的Lipschitz性质,精确且高效地估计具有鲁棒性保证的CP集,从而在ImageNet等大规模任务中实现更小的预测集和更高的计算效率。
链接: https://arxiv.org/abs/2506.05434
作者: Thomas Massena(IRIT, DTIPG - SNCF, UT3),Léo andéol(IMT, DTIPG - SNCF, UT3),Thibaut Boissin(IRIT, UT3),Franck Mamalet,Corentin Friedrich,Mathieu Serrurier(IRIT, UT3),Sébastien Gerchinovitz(IMT)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Conformal Prediction (CP) has proven to be an effective post-hoc method for improving the trustworthiness of neural networks by providing prediction sets with finite-sample guarantees. However, under adversarial attacks, classical conformal guarantees do not hold anymore: this problem is addressed in the field of Robust Conformal Prediction. Several methods have been proposed to provide robust CP sets with guarantees under adversarial perturbations, but, for large scale problems, these sets are either too large or the methods are too computationally demanding to be deployed in real life scenarios. In this work, we propose a new method that leverages Lipschitz-bounded networks to precisely and efficiently estimate robust CP sets. When combined with a 1-Lipschitz robust network, we demonstrate that our lip-rcp method outperforms state-of-the-art results in both the size of the robust CP sets and computational efficiency in medium and large-scale scenarios such as ImageNet. Taking a different angle, we also study vanilla CP under attack, and derive new worst-case coverage bounds of vanilla CP sets, which are valid simultaneously for all adversarial attack levels. Our lip-rcp method makes this second approach as efficient as vanilla CP while also allowing robustness guarantees.
zh
[AI-86] Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
【速读】:该论文旨在解决Group Relative Policy Optimization (GRPO)在处理长共享前缀时计算开销过大的问题,该问题源于每个组成员需对相同前缀进行冗余编码,导致可扩展性瓶颈。解决方案的关键在于提出Prefix Grouper算法,通过引入Shared-Prefix Forward策略,将自注意力机制重构为两部分,从而仅对共享前缀进行一次编码,同时保持完全可微分和端到端训练的兼容性。
链接: https://arxiv.org/abs/2506.05433
作者: Zikang Liu,Tongtian Yue,Yepeng Tang,Longteng Guo,Junxian Cai,Qingbin Liu,Xi Chen,Jing Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, technical report
点击查看摘要
Abstract:Group Relative Policy Optimization (GRPO) enhances policy learning by computing gradients from relative comparisons among candidate outputs that share a common input prefix. Despite its effectiveness, GRPO introduces substantial computational overhead when processing long shared prefixes, which must be redundantly encoded for each group member. This inefficiency becomes a major scalability bottleneck in long-context learning scenarios. We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy. In particular, by restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once, while preserving full differentiability and compatibility with end-to-end training. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO: it yields identical forward outputs and backward gradients, ensuring that the optimization dynamics and final policy performance remain unchanged. Empirically, our experiments confirm that Prefix Grouper achieves consistent results while significantly reducing the computational cost of training, particularly in long-prefix scenarios. The proposed method is fully plug-and-play: it is compatible with existing GRPO-based architectures and can be seamlessly integrated into current training pipelines as a drop-in replacement, requiring no structural modifications and only minimal changes to input construction and attention computation. Prefix Grouper enables the use of larger group sizes under the same computational budget, thereby improving the scalability of GRPO to more complex tasks and larger models. Code is now available at this https URL
zh
[AI-87] PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在边缘部署中因参数规模过大而面临的挑战,提出了一种名为极坐标解耦向量量化(Polar Coordinate Decoupled Vector Quantization, PCDVQ)的高效量化框架。其解决方案的关键在于通过极坐标解耦(Polar Coordinate Decoupling, PCD)将向量转换为极坐标表示,并对方向和模长分别进行独立量化,同时通过分布对齐码本构建(Distribution Aligned Codebook Construction, DACC)优化方向和模长码本以适应源数据分布,从而有效降低量化误差并提升模型性能。
链接: https://arxiv.org/abs/2506.05432
作者: Yuxuan Yue,Zukang Xu,Zhihang Yuan,Dawei Yang,Jianglong Wu,Liqiang Nie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) face significant challenges in edge deployment due to their massive parameter scale. Vector Quantization (VQ), a clustering-based quantization method, serves as a prevalent solution to this issue for its extremely low-bit (even at 2-bit) and considerable accuracy. Since a vector is a quantity in mathematics and physics that has both direction and magnitude, existing VQ works typically quantize them in a coupled manner. However, we find that direction exhibits significantly greater sensitivity to quantization compared to the magnitude. For instance, when separately clustering the directions and magnitudes of weight vectors in LLaMA-2-7B, the accuracy drop of zero-shot tasks are 46.5% and 2.3%, respectively. This gap even increases with the reduction of clustering centers. Further, Euclidean distance, a common metric to access vector similarities in current VQ works, places greater emphasis on reducing the magnitude error. This property is contrary to the above finding, unavoidably leading to larger quantization errors. To these ends, this paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework consisting of two key modules: 1) Polar Coordinate Decoupling (PCD), which transforms vectors into their polar coordinate representations and perform independent quantization of the direction and magnitude parameters.2) Distribution Aligned Codebook Construction (DACC), which optimizes the direction and magnitude codebooks in accordance with the source distribution. Experimental results show that PCDVQ outperforms baseline methods at 2-bit level by at least 1.5% zero-shot accuracy, establishing a novel paradigm for accurate and highly compressed LLMs.
zh
[AI-88] Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models
【速读】:该论文旨在解决二进制代码相似性检测(Binary Code Similarity Detection, BCSD)模型在面对对抗攻击时的鲁棒性不足问题,特别是针对目标攻击(targeted attacks)场景下,如何高效生成语义保持不变的对抗样本以误导模型预测。其解决方案的关键在于利用黑盒、模型无关的解释器(model-agnostic explainers)对模型决策边界进行解析,从而精准定位需要应用语义保持扰动的关键代码片段,提升了攻击的成功率、效率及迁移性。
链接: https://arxiv.org/abs/2506.05430
作者: Mingjie Chen(Zhejiang University),Tiancheng Zhu(Huazhong University of Science and Technology),Mingxue Zhang(The State Key Laboratory of Blockchain and Data Security, Zhejiang University amp; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security),Yiling He(University College London),Minghao Lin(University of Southern California),Penghui Li(Columbia University),Kui Ren(The State Key Laboratory of Blockchain and Data Security, Zhejiang University)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures
点击查看摘要
Abstract:Binary code similarity detection (BCSD) serves as a fundamental technique for various software engineering tasks, e.g., vulnerability detection and classification. Attacks against such models have therefore drawn extensive attention, aiming at misleading the models to generate erroneous predictions. Prior works have explored various approaches to generating semantic-preserving variants, i.e., adversarial samples, to evaluate the robustness of the models against adversarial attacks. However, they have mainly relied on heuristic criteria or iterative greedy algorithms to locate salient code influencing the model output, failing to operate on a solid theoretical basis. Moreover, when processing programs with high complexities, such attacks tend to be time-consuming. In this work, we propose a novel optimization for adversarial attacks against BCSD models. In particular, we aim to improve the attacks in a challenging scenario, where the attack goal is to limit the model predictions to a specific range, i.e., the targeted attacks. Our attack leverages the superior capability of black-box, model-agnostic explainers in interpreting the model decision boundaries, thereby pinpointing the critical code snippet to apply semantic-preserving perturbations. The evaluation results demonstrate that compared with the state-of-the-art attacks, the proposed attacks achieve higher attack success rate in almost all scenarios, while also improving the efficiency and transferability. Our real-world case studies on vulnerability detection and classification further demonstrate the security implications of our attacks, highlighting the urgent need to further enhance the robustness of existing BCSD models. Comments: 12 pages, 3 figures Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.05430 [cs.CR] (or arXiv:2506.05430v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2506.05430 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-89] Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction
【速读】:该论文旨在解决轻度认知障碍(MCI)转化的早期预测问题,其核心挑战在于在即时性与准确性之间的权衡——即从单一基线结构磁共振成像(sMRI)快速预测与利用纵向扫描捕捉疾病进展之间的矛盾。该研究提出的解决方案是MCI-Diff,其关键在于采用基于扩散的框架,直接从基线数据合成临床上合理的未来sMRI表示,从而实现实时风险评估和高预测性能。该方法通过多任务序列重建策略训练共享去噪网络,并引入由大语言模型(LLM)驱动的“语言指南”以确保生成结果的临床合理性。
链接: https://arxiv.org/abs/2506.05428
作者: Zhihao Tang,Chaozhuo Li,Litian Zhang,Xi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Early prediction of Mild Cognitive Impairment (MCI) conversion is hampered by a trade-off between immediacy–making fast predictions from a single baseline sMRI–and accuracy–leveraging longitudinal scans to capture disease progression. We propose MCI-Diff, a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data, achieving both real-time risk assessment and high predictive performance. First, a multi-task sequence reconstruction strategy trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling and learn robust latent trajectories. Second, an LLM-driven “linguistic compass” is introduced for clinical plausibility sampling: generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers, guiding autoregressive generation toward realistic disease patterns. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines, improving early conversion accuracy by 5-12%.
zh
[AI-90] MTPNet: Multi-Grained Target Perception for Unified Activity Cliff Prediction
【速读】:该论文旨在解决药物发现和材料设计中活性悬崖(activity cliff)预测的问题,现有计算方法通常仅能处理单一结合靶点,限制了预测模型的适用性。其解决方案的关键在于提出多粒度靶点感知网络(MTPNet),该网络通过宏观层面的靶点语义(MTS)引导和微观层面的口袋语义(MPS)引导,将分子与靶点蛋白相互作用的先验知识融入模型,从而动态优化分子表示。此方法首次将受体蛋白作为引导信息,有效捕捉关键相互作用细节,显著提升了预测性能。
链接: https://arxiv.org/abs/2506.05427
作者: Zishan Shu,Yufan Deng,Hongyu Zhang,Zhiwei Nie,Jie Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
点击查看摘要
Abstract:Activity cliff prediction is a critical task in drug discovery and material design. Existing computational methods are limited to handling single binding targets, which restricts the applicability of these prediction models. In this paper, we present the Multi-Grained Target Perception network (MTPNet) to incorporate the prior knowledge of interactions between the molecules and their target proteins. Specifically, MTPNet is a unified framework for activity cliff prediction, which consists of two components: Macro-level Target Semantic (MTS) guidance and Micro-level Pocket Semantic (MPS) guidance. By this way, MTPNet dynamically optimizes molecular representations through multi-grained protein semantic conditions. To our knowledge, it is the first time to employ the receptor proteins as guiding information to effectively capture critical interaction details. Extensive experiments on 30 representative activity cliff datasets demonstrate that MTPNet significantly outperforms previous approaches, achieving an average RMSE improvement of 18.95% on top of several mainstream GNN architectures. Overall, MTPNet internalizes interaction patterns through conditional deep learning to achieve unified predictions of activity cliffs, helping to accelerate compound optimization and design. Codes are available at: this https URL.
zh
[AI-91] Mixture-of-Experts Meets In-Context Reinforcement Learning
【速读】:该论文试图解决在强化学习(Reinforcement Learning, RL)领域中,如何有效利用上下文学习(In-context Learning, ICRL)所面临的两个主要挑战:状态-动作-奖励数据的内在多模态性以及决策任务的多样性和异质性。解决方案的关键在于提出一种名为T2MIR(Token- and Task-wise MoE for In-context RL)的框架,该框架将混合专家(Mixture-of-Experts, MoE)架构引入基于Transformer的决策模型中,通过引入两个并行的MoE层——即针对输入token语义的token-wise MoE和针对任务路由的task-wise MoE,以提升模型对多模态数据和多样化任务的适应能力。此外,为增强任务路由效果,还引入了对比学习方法以最大化任务与路由表示之间的互信息。
链接: https://arxiv.org/abs/2506.05426
作者: Wenhao Wu,Fuhong Liu,Haoru Li,Zican Hu,Daoyi Dong,Chunlin Chen,Zhi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 13 figures
点击查看摘要
Abstract:In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose \textbfT2MIR (\textbfToken- and \textbfTask-wise \textbfMoE for \textbfIn-context \textbfRL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at this https URL.
zh
[AI-92] Constructive Symbolic Reinforcement Learning via Intuitionistic Logic and Goal-Chaining Inference
【速读】:该论文试图解决传统强化学习中因依赖概率性试错而导致的安全性不足、行为不可解释以及无效状态转移的问题。其解决方案的关键在于引入一种基于构造性逻辑推理的学习与规划框架,将动作、状态转移和目标表示为逻辑命题,并通过直觉主义逻辑构建可验证的证明来实现决策过程,从而确保所有状态转移和策略仅在满足可验证前提条件下被接受,避免了概率性探索带来的风险。
链接: https://arxiv.org/abs/2506.05422
作者: Andrei T. Patrascu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We introduce a novel learning and planning framework that replaces traditional reward-based optimisation with constructive logical inference. In our model, actions, transitions, and goals are represented as logical propositions, and decision-making proceeds by building constructive proofs under intuitionistic logic. This method ensures that state transitions and policies are accepted only when supported by verifiable preconditions – eschewing probabilistic trial-and-error in favour of guaranteed logical validity. We implement a symbolic agent operating in a structured gridworld, where reaching a goal requires satisfying a chain of intermediate subgoals (e.g., collecting keys to open doors), each governed by logical constraints. Unlike conventional reinforcement learning agents, which require extensive exploration and suffer from unsafe or invalid transitions, our constructive agent builds a provably correct plan through goal chaining, condition tracking, and knowledge accumulation. Empirical comparison with Q-learning demonstrates that our method achieves perfect safety, interpretable behaviour, and efficient convergence with no invalid actions, highlighting its potential for safe planning, symbolic cognition, and trustworthy AI. This work presents a new direction for reinforcement learning grounded not in numeric optimisation, but in constructive logic and proof theory.
zh
[AI-93] FERRET: Private Deep Learning Faster And Better Than DPSGD
【速读】:该论文旨在解决深度学习中的隐私保护与训练效率之间的权衡问题,特别是在大规模语言模型(LLM)训练中实现高效且私密的梯度压缩。其解决方案的关键在于提出FERRET(Fast and Effective Restricted Release for Ethical Training),一种基于符号梯度压缩的互信息差分隐私(MI-DP)方法,通过伯努利掩码仅传输每个参数组的一个符号位,从而在不引入额外噪声的情况下实现严格的隐私保障。理论分析表明,FERRET在隐私预算ε∈[0.1, 2]范围内能够有效控制隐私泄露,实验结果验证了其在保持高模型实用性和训练效率方面的优越性。
链接: https://arxiv.org/abs/2506.05416
作者: David Zagardo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 6 figures
点击查看摘要
Abstract:We revisit 1-bit gradient compression through the lens of mutual-information differential privacy (MI-DP). Building on signSGD, we propose FERRET–Fast and Effective Restricted Release for Ethical Training–which transmits at most one sign bit per parameter group with Bernoulli masking. Theory: We prove each fired group leaks at most ln 2 nats; after subsampling with rate s, the total privacy loss of G groups trained for T steps with firing probability p is epsilon = G * T * s * p * ln 2. Thus FERRET achieves MI-DP for epsilon in [0.1, 2] without additive noise. Practice: We evaluate three granularities–FERRET-MAX (finest), FERRET-EIGHTH (medium), and FERRET-2 (coarsest)–on five LLMs (137M-1.8B parameters) against DPSGD and Non-DP baselines. All methods trained for 1, 3, and 5 epochs. Utility: Across all settings, FERRET-MAX/EIGHTH beat DPSGD’s perplexity. At epsilon=0.5, 5 epochs: FERRET-EIGHTH achieves 3.98 perplexity vs DPSGD’s 11.61 (2.9x better), within 23% of Non-DP (3.25). Privacy: MI-AUC stays at chance for FERRET-MAX/EIGHTH (~0.51), matching DPSGD vs Non-DP’s 0.76-0.99. FERRET-2 shows higher leakage (~0.55) due to lower headroom. Efficiency: Stricter budgets fire fewer signs, so FERRET uses 19-33% of DPSGD’s training time and only 34-36% of Non-DP training time. Take-away: Sign-based MI-DP gets closer to achieving all three qualities of the privacy, utility, performance trilemma: FERRET trains up to 5x faster, achieves 3x lower perplexity compared to DPSGD and 1.2x greater than Non-DP, all while providing formal, mathematically provable privacy guarantees using zero additive noise. The results also show that, in certain instances, masked 1-bit updates can match non-private training utility while safeguarding data. Comments: 28 pages, 6 figures Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2506.05416 [cs.CR] (or arXiv:2506.05416v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2506.05416 Focus to learn more arXiv-issued DOI via DataCite Submission history From: David Zagardo [view email] [v1] Wed, 4 Jun 2025 21:18:45 UTC (1,314 KB) Full-text links: Access Paper: View a PDF of the paper titled FERRET: Private Deep Learning Faster And Better Than DPSGD, by David ZagardoView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CR prev | next new | recent | 2025-06 Change to browse by: cs cs.AI cs.LG References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh
[AI-94] Gen4D: Synthesizing Humans and Scenes in the Wild CVPR
【速读】:该论文试图解决由于真实世界中野生活动(in-the-wild activities)输入数据不足而导致的计算机视觉任务性能低下问题,特别是在体育等非典型人类中心领域,真实数据采集复杂且不切实际。解决方案的关键在于提出Gen4D,一个完全自动化的生成多样化且逼真4D人体动画的流水线,其核心包括专家驱动的动作编码、基于扩散的高斯点云引导的人像生成以及面向人体的背景合成,从而生成高度多样且逼真的行人序列。
链接: https://arxiv.org/abs/2506.05397
作者: Jerrin Bright,Zhibo Wang,Yuhao Chen,Sirisha Rambhatla,John Zelek,David Clausi
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
点击查看摘要
Abstract:Lack of input data for in-the-wild activities often results in low performance across various computer vision tasks. This challenge is particularly pronounced in uncommon human-centric domains like sports, where real-world data collection is complex and impractical. While synthetic datasets offer a promising alternative, existing approaches typically suffer from limited diversity in human appearance, motion, and scene composition due to their reliance on rigid asset libraries and hand-crafted rendering pipelines. To address this, we introduce Gen4D, a fully automated pipeline for generating diverse and photorealistic 4D human animations. Gen4D integrates expert-driven motion encoding, prompt-guided avatar generation using diffusion-based Gaussian splatting, and human-aware background synthesis to produce highly varied and lifelike human sequences. Based on Gen4D, we present SportPAL, a large-scale synthetic dataset spanning three sports: baseball, icehockey, and soccer. Together, Gen4D and SportPAL provide a scalable foundation for constructing synthetic datasets tailored to in-the-wild human-centric vision tasks, with no need for manual 3D modeling or scene design.
zh
[AI-95] How stealthy is stealthy? Studying the Efficacy of Black-Box Adversarial Attacks in the Real World
【速读】:该论文试图解决深度学习系统在计算机视觉领域中面对的黑盒对抗攻击问题,特别是在攻击者仅能通过查询获取目标模型输出的情况下。现有方法在应对鲁棒性、隐蔽性(针对自动检测和人工检查)这三个关键属性时存在权衡,难以同时兼顾。论文提出的解决方案ECLIPSE(Enhanced Compressed Learning for Imperceptible and Scalable Evasion attacks)的关键在于结合采样梯度上的高斯模糊与局部代理模型,从而有效平衡三个属性之间的权衡,提升攻击的可行性与隐蔽性。
链接: https://arxiv.org/abs/2506.05382
作者: Francesco Panebianco,Mario D’Onghia,Stefano Zanero aand Michele Carminati
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Deep learning systems, critical in domains like autonomous vehicles, are vulnerable to adversarial examples (crafted inputs designed to mislead classifiers). This study investigates black-box adversarial attacks in computer vision. This is a realistic scenario, where attackers have query-only access to the target model. Three properties are introduced to evaluate attack feasibility: robustness to compression, stealthiness to automatic detection, and stealthiness to human inspection. State-of-the-Art methods tend to prioritize one criterion at the expense of others. We propose ECLIPSE, a novel attack method employing Gaussian blurring on sampled gradients and a local surrogate model. Comprehensive experiments on a public dataset highlight ECLIPSE’s advantages, demonstrating its contribution to the trade-off between the three properties.
zh
[AI-96] Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models
【速读】:该论文试图解决在训练大规模语言模型(Large Language Models, LLMs)过程中,数据获取面临法律、隐私和战略限制的问题,以及现有数据采购方法依赖不可验证的信任或忽视异构提供者成本的不足。解决方案的关键在于设计一种机制,实现数据共享的诚实性与信任最小化,确保主导策略激励相容(Dominant-Strategy Incentive Compatibility, DSIC)、个体理性与弱预算平衡,并根据数据质量和学习效用进行奖励。论文提出了一种基于虚拟成本度量排序供应商并采用Myerson风格支付的Quality-Weighted Marginal-Incentive Auction(Q-MIA)机制,同时引入Marginal Utility Token(MUT)以支持有限流动性或长期激励场景,并通过Mixed-MIA统一两者,实现前期支付与延迟奖励的平衡。
链接: https://arxiv.org/abs/2506.05379
作者: Seyed Moein Ayyoubzadeh,Kourosh Shahnazari,Mohammmadali Keshtparvar,MohammadAmin Fazli
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. Existing data procurement methods often rely on unverifiable trust or ignore heterogeneous provider costs. We introduce a mechanism-design framework for truthful, trust-minimized data sharing that ensures dominant-strategy incentive compatibility (DSIC), individual rationality, and weak budget balance, while rewarding data based on both quality and learning utility. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data’s contribution to model performance. Based on this, we propose the Quality-Weighted Marginal-Incentive Auction (Q-MIA), which ranks providers using a virtual cost metric and uses Myerson-style payments to ensure DSIC and budget feasibility. To support settings with limited liquidity or long-term incentives, we introduce the Marginal Utility Token (MUT), which allocates future rights based on marginal contributions. We unify these in Mixed-MIA, a hybrid mechanism balancing upfront payments and deferred rewards. All mechanisms support verifiable, privacy-preserving implementation. Theoretically and empirically, they outperform volume-based and trust-based baselines, eliciting higher-quality data under budget constraints while remaining robust to misreporting and collusion. This establishes a principled foundation for sustainable and fair data markets for future LLMs.
zh
[AI-97] A Red Teaming Roadmap Towards System-Level Safety
【速读】:该论文试图解决当前大型语言模型(Large Language Model, LLM)安全防护机制在应对实际滥用行为时存在的不足,特别是针对红队测试(red teaming)研究中优先级错位的问题。作者认为,现有研究过于关注抽象的社会偏见或伦理原则,而忽视了符合产品安全规范的实际测试;同时,缺乏对现实威胁模型的重视以及系统级安全措施的构建。解决方案的关键在于调整研究优先级,首先聚焦于符合实际产品安全标准的测试,其次采用更贴近真实攻击者行为的威胁模型,并将系统级安全作为推动红队测试研究前进的必要步骤。
链接: https://arxiv.org/abs/2506.05376
作者: Zifan Wang,Christina Q. Knight,Jeremy Kritz,Willow E. Primack,Julian Michael
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Model (LLM) safeguards, which implement request refusals, have become a widely adopted mitigation strategy against misuse. At the intersection of adversarial machine learning and AI safety, safeguard red teaming has effectively identified critical vulnerabilities in state-of-the-art refusal-trained LLMs. However, in our view the many conference submissions on LLM red teaming do not, in aggregate, prioritize the right research problems. First, testing against clear product safety specifications should take a higher priority than abstract social biases or ethical principles. Second, red teaming should prioritize realistic threat models that represent the expanding risk landscape and what real attackers might do. Finally, we contend that system-level safety is a necessary step to move red teaming research forward, as AI models present new threats as well as affordances for threat mitigation (e.g., detection and banning of malicious users) once placed in a deployment context. Adopting these priorities will be necessary in order for red teaming research to adequately address the slate of new threats that rapid AI advances present today and will present in the very near future.
zh
[AI-98] Contextual Memory Intelligence – A Foundational Paradigm for Human-AI Collaboration and Reflective Generative AI Systems
【速读】:该论文试图解决生成式 AI(Generative AI)系统在组织环境中快速部署时所面临的记忆局限性问题,即当前系统在决策过程中很少存储或反思完整的上下文,导致重复错误和缺乏清晰性。解决方案的关键在于提出一种新的基础范式——情境记忆智能(Contextual Memory Intelligence, CMI),将记忆重新定位为支持长期连贯性、可解释性和负责任决策的适应性基础设施,而非被动数据。CMI通过结构化捕获、推理和再生上下文,将其作为系统的核心能力,并引入洞察层(Insight Layer)以实现这一愿景,该模块化架构利用人机协同反思、偏差检测和理由保留机制,将情境记忆整合到系统中。
链接: https://arxiv.org/abs/2506.05370
作者: Kristy Wedel
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 32 pages, 9 tables, 1 figure
点击查看摘要
Abstract:A critical challenge remains unresolved as generative AI systems are quickly implemented in various organizational settings. Despite significant advances in memory components such as RAG, vector stores, and LLM agents, these systems still have substantial memory limitations. Gen AI workflows rarely store or reflect on the full context in which decisions are made. This leads to repeated errors and a general lack of clarity. This paper introduces Contextual Memory Intelligence (CMI) as a new foundational paradigm for building intelligent systems. It repositions memory as an adaptive infrastructure necessary for longitudinal coherence, explainability, and responsible decision-making rather than passive data. Drawing on cognitive science, organizational theory, human-computer interaction, and AI governance, CMI formalizes the structured capture, inference, and regeneration of context as a fundamental system capability. The Insight Layer is presented in this paper to operationalize this vision. This modular architecture uses human-in-the-loop reflection, drift detection, and rationale preservation to incorporate contextual memory into systems. The paper argues that CMI allows systems to reason with data, history, judgment, and changing context, thereby addressing a foundational blind spot in current AI architectures and governance efforts. A framework for creating intelligent systems that are effective, reflective, auditable, and socially responsible is presented through CMI. This enhances human-AI collaboration, generative AI design, and the resilience of the institutions.
zh
[AI-99] A Path to Loving
【速读】:该论文试图解决爱(love)在哲学与科学层面的复杂性问题,旨在为其提供一个严谨的本体论(ontological)表征,以增强其在心理学、社会学以及人工智能等领域的应用。解决方案的关键在于将爱理解为被动感觉(如情感唤醒)与主动评价判断(如将所爱之人视为有价值)的结合,以此平衡爱的非自主性特征与其理性责任。论文通过引入基础形式本体论(Basic Formal Ontology, BFO)及其他应用本体方法,构建了结构化的本体论框架,并论证了情感与认知成分之间的因果关联,从而为爱的精确且可扩展的本体论描述奠定基础。
链接: https://arxiv.org/abs/2506.05352
作者: John Beverley,Regina Hurley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This work lays the foundations for a rigorous ontological characterization of love, addressing its philosophical complexity and scientific relevance, with particular emphasis on psychology and sociology, as well as highlighting ways in which such characterization enhances relevant AI based applications. The position defended here is that love is best understood as a concatenation of passive sensations (e.g., emotional arousal) and active evaluative judgments (e.g., perceiving the beloved as valuable), in the interest of balancing the involuntary aspects of love with its rational accountability. To provide a structured foundation, the paper draws on Basic Formal Ontology (BFO) and other applied ontological methods to differentiate various senses of love. This work engages with objections to the understanding of love as concatenation, particularly concerning the relationship between sensation and judgment. A causal correlation model is defended, ensuring that the affective and cognitive components are linked. By offering a precise and scalable ontological account, this work lays the foundation for future interdisciplinary applications, making love a subject of formal inquiry in ontology engineering, artificial intelligence, and the sciences.
zh
[AI-100] Infinite Time Turing Machines and their Applications
【速读】:该论文试图解决深度学习系统在可扩展性、效率和可解释性方面的根本性局限问题。其解决方案的关键在于引入无限时间图灵机(Infinite Time Turing Machine, ITTM),通过这一理论工具重新诠释现代架构如Transformer,并据此提出通用状态机(Universal State Machine, USM)。USM作为一种基于第一原理设计的计算范式,采用动态、可查询的计算图结构,在实时演化中实现模块化、可解释且资源高效的计算,从而克服现有模型的低效与僵化问题。
链接: https://arxiv.org/abs/2506.05351
作者: Rukmal Weerawarana,Maxwell Braun
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: Published by Ren XYZ Inc
点击查看摘要
Abstract:This work establishes a rigorous theoretical foundation for analyzing deep learning systems by leveraging Infinite Time Turing Machines (ITTMs), which extend classical computation into transfinite ordinal steps. Using ITTMs, we reinterpret modern architectures like Transformers, revealing fundamental limitations in scalability, efficiency, and interpretability. Building on these insights, we propose the Universal State Machine (USM), a novel computational paradigm designed from first principles. The USM employs a dynamic, queryable computation graph that evolves in real time, enabling modular, interpretable, and resource-efficient computation. This framework not only overcomes the inefficiencies and rigidity of current models but also lays the groundwork for scalable, generalizable artificial intelligence systems.
zh
[AI-101] owards provable probabilistic safety for scalable embodied AI systems
【速读】:该论文试图解决嵌入式人工智能(Embodied AI)系统在复杂运行环境中确保安全性的难题,尤其是在系统故障罕见的情况下,传统可证明确定性安全方法因角落案例的稀有性和复杂性而难以实际应用。解决方案的关键在于引入可证明的概率安全性(provable probabilistic safety),通过建立概率安全边界,利用统计方法在保证整体系统性能的基础上,将剩余风险控制在预定义阈值以内,从而实现大规模部署的安全性和可扩展性。
链接: https://arxiv.org/abs/2506.05171
作者: Linxuan He,Qing-Shan Jia,Ang Li,Hongyan Sang,Ling Wang,Jiwen Lu,Tao Zhang,Jie Zhou,Yi Zhang,Yisen Wang,Peng Wei,Zhongyuan Wang,Henry X. Liu,Shuo Feng
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving provable deterministic safety–verifying system safety across all possible scenarios–remains theoretically ideal, the rarity and complexity of corner cases make this approach impractical for scalable embodied AI systems. To address this challenge, we introduce provable probabilistic safety, which aims to ensure that the residual risk of large-scale deployment remains below a predefined threshold. Instead of attempting exhaustive safety proof across all corner cases, this paradigm establishes a probabilistic safety boundary on overall system performance, leveraging statistical methods to enhance feasibility and scalability. A well-defined probabilistic safety boundary enables embodied AI systems to be deployed at scale while allowing for continuous refinement of safety guarantees. Our work focuses on three core questions: what is provable probabilistic safety, how to prove the probabilistic safety, and how to achieve the provable probabilistic safety. By bridging the gap between theoretical safety assurance and practical deployment, our work offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.
zh
[AI-102] HGOT: Self-supervised Heterogeneous Graph Neural Network with Optimal Transport ICML2025
【速读】:该论文旨在解决在无标签情况下,基于对比自监督学习的异构图神经网络(Heterogeneous Graph Neural Networks, HGNNs)需要依赖精心设计的图增强策略以及正负样本选择的问题。传统方法在确定样本对之间的精确相似性水平上存在困难,从而影响了模型性能。该论文提出的解决方案关键在于引入一种基于最优传输(Optimal Transport, OT)的自监督异构图神经网络(HGOT)方法,通过最优传输机制减少正负样本采样的繁琐过程,利用聚合视图(central view)整合由不同元路径(branch views)表示的语义信息,并通过最优传输计划建立分支视图与聚合视图之间的语义关系,从而提升节点表示的质量和图空间的一致性。
链接: https://arxiv.org/abs/2506.02619
作者: Yanbei Liu,Chongxu Wang,Zhitao Xiao,Lei Geng,Yanwei Pang,Xiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: The paper has 9 pages of text and 13 pages in total (including acknowledgments, impact statement, references, and appendix), with 6 figures and 2 tables. This paper has been accepted by ICML 2025 conference and this is a final version of the manuscript submitted to the conference
点击查看摘要
Abstract:Heterogeneous Graph Neural Networks (HGNNs), have demonstrated excellent capabilities in processing heterogeneous information networks. Self-supervised learning on heterogeneous graphs, especially contrastive self-supervised strategy, shows great potential when there are no labels. However, this approach requires the use of carefully designed graph augmentation strategies and the selection of positive and negative samples. Determining the exact level of similarity between sample pairs is this http URL solve this problem, we propose a novel self-supervised Heterogeneous graph neural network with Optimal Transport (HGOT) method which is designed to facilitate self-supervised learning for heterogeneous graphs without graph augmentation strategies. Different from traditional contrastive self-supervised learning, HGOT employs the optimal transport mechanism to relieve the laborious sampling process of positive and negative samples. Specifically, we design an aggregating view (central view) to integrate the semantic information contained in the views represented by different meta-paths (branch views). Then, we introduce an optimal transport plan to identify the transport relationship between the semantics contained in the branch view and the central view. This allows the optimal transport plan between graphs to align with the representations, forcing the encoder to learn node representations that are more similar to the graph space and of higher quality. Extensive experiments on four real-world datasets demonstrate that our proposed HGOT model can achieve state-of-the-art performance on various downstream tasks. In particular, in the node classification task, HGOT achieves an average of more than 6% improvement in accuracy compared with state-of-the-art methods.
zh
[AI-103] Revealing hidden correlations from complex spatial distributions: Adjacent Correlation Analysis
【速读】:该论文试图解决复杂系统中模式和结构出现机制难以理解的问题,特别是在面对高复杂性和空间非均匀性时,直接搜索变量间的相关性往往不可行。解决方案的关键在于在局部区域内寻找相关性,并开发了一种新的方法——邻近相关性分析(adjacent correlation analysis),以提取这些相关性并在相空间中进行表示。该方法通过分析局部相关性向量,能够在概率密度函数(PDF)图上形成邻近相关性图,从而揭示出可能指向新规律的有序模式。
链接: https://arxiv.org/abs/2506.05759
作者: Guang-Xing Li
机构: 未知
类目: Computational Physics (physics.comp-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
备注: Code avaliable at this https URL
点击查看摘要
Abstract:Physics has been transforming our view of nature for centuries. While combining physical knowledge with computational approaches has enabled detailed modeling of physical systems’ evolution, understanding the emergence of patterns and structures remains limited. Correlations between quantities are the most reliable approach to describe relationships between different variables. However, for complex patterns, directly searching for correlations is often impractical, as complexity and spatial inhomogeneity can obscure correlations. We discovered that the key is to search for correlations in local regions and developed a new method, adjacent correlation analysis, to extract such correlations and represent them in phase space. When multiple observations are available, a useful way to study a system is to analyze distributions in phase space using the Probability Density Function (PDF). Adjacent correlation analysis evaluates vectors representing local correlations, which can be overlaid on the PDF plot to form the adjacent correlation plot. These correlation vectors often exhibit remarkably regular patterns and may lead to the discovery of new laws. The vectors we derive are equivalent to the vector field in dynamical systems on the attracting manifold. By efficiently representing spatial patterns as correlation vectors in phase space, our approach opens avenues for classification, prediction, parameter fitting, and forecasting.
zh
机器学习
[LG-0] Lagrangian-based Equilibrium Propagation: generalisation to arbitrary boundary conditions equivalence with Hamiltonian Echo Learning
链接: https://arxiv.org/abs/2506.06248
作者: Guillaume Pourcel,Debabrota Basu,Maxence Ernoult,Aditya Gilra
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Equilibrium Propagation (EP) is a learning algorithm for training Energy-based Models (EBMs) on static inputs which leverages the variational description of their fixed points. Extending EP to time-varying inputs is a challenging problem, as the variational description must apply to the entire system trajectory rather than just fixed points, and careful consideration of boundary conditions becomes essential. In this work, we present Generalized Lagrangian Equilibrium Propagation (GLEP), which extends the variational formulation of EP to time-varying inputs. We demonstrate that GLEP yields different learning algorithms depending on the boundary conditions of the system, many of which are impractical for implementation. We then show that Hamiltonian Echo Learning (HEL) – which includes the recently proposed Recurrent HEL (RHEL) and the earlier known Hamiltonian Echo Backpropagation (HEB) algorithms – can be derived as a special case of GLEP. Notably, HEL is the only instance of GLEP we found that inherits the properties that make EP a desirable alternative to backpropagation for hardware implementations: it operates in a “forward-only” manner (i.e. using the same system for both inference and learning), it scales efficiently (requiring only two or more passes through the system regardless of model size), and enables local learning.
[LG-1] Neural Responses to Affective Sentences Reveal Signatures of Depression
链接: https://arxiv.org/abs/2506.06244
作者: Aditya Kommineni,Woojae Jeong,Kleanthis Avramidis,Colin McDaniel,Myzelle Hughes,Thomas McGee,Elsi Kaiser,Kristina Lerman,Idan A. Blank,Dani Byrd,Assal Habibi,B. Rael Cahn,Sudarsana Kadiri,Takfarinas Medani,Richard M. Leahy,Shrikanth Narayanan
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Major Depressive Disorder (MDD) is a highly prevalent mental health condition, and a deeper understanding of its neurocognitive foundations is essential for identifying how core functions such as emotional and self-referential processing are affected. We investigate how depression alters the temporal dynamics of emotional processing by measuring neural responses to self-referential affective sentences using surface electroencephalography (EEG) in healthy and depressed individuals. Our results reveal significant group-level differences in neural activity during sentence viewing, suggesting disrupted integration of emotional and self-referential information in depression. Deep learning model trained on these responses achieves an area under the receiver operating curve (AUC) of 0.707 in distinguishing healthy from depressed participants, and 0.624 in differentiating depressed subgroups with and without suicidal ideation. Spatial ablations highlight anterior electrodes associated with semantic and affective processing as key contributors. These findings suggest stable, stimulus-driven neural signatures of depression that may inform future diagnostic tools.
[LG-2] BiAssemble: Learning Collaborative Affordance for Bimanual Geometric Assembly ICML2025
链接: https://arxiv.org/abs/2506.06221
作者: Yan Shen,Ruihai Wu,Yubin Ke,Xinyuan Song,Zeyi Li,Xiaoqi Li,Hongwei Fan,Haoran Lu,Hao dong
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: ICML 2025
点击查看摘要
Abstract:Shape assembly, the process of combining parts into a complete whole, is a crucial robotic skill with broad real-world applications. Among various assembly tasks, geometric assembly–where broken parts are reassembled into their original form (e.g., reconstructing a shattered bowl)–is particularly challenging. This requires the robot to recognize geometric cues for grasping, assembly, and subsequent bimanual collaborative manipulation on varied fragments. In this paper, we exploit the geometric generalization of point-level affordance, learning affordance aware of bimanual collaboration in geometric assembly with long-horizon action sequences. To address the evaluation ambiguity caused by geometry diversity of broken parts, we introduce a real-world benchmark featuring geometric variety and global reproducibility. Extensive experiments demonstrate the superiority of our approach over both previous affordance-based and imitation-based methods. Project page: this https URL.
[LG-3] Model-Driven Graph Contrastive Learning
链接: https://arxiv.org/abs/2506.06212
作者: Ali Azizpour,Nicolas Zilberstein,Santiago Segarra
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose \textbfMGCL , a model-driven graph contrastive learning (GCL) framework that leverages graphons (probabilistic generative models for graphs) to guide contrastive learning by accounting for the data’s underlying generative process. GCL has emerged as a powerful self-supervised framework for learning expressive node or graph representations without relying on annotated labels, which are often scarce in real-world data. By contrasting augmented views of graph data, GCL has demonstrated strong performance across various downstream tasks, such as node and graph classification. However, existing methods typically rely on manually designed or heuristic augmentation strategies that are not tailored to the underlying data distribution and operate at the individual graph level, ignoring similarities among graphs generated from the same model. Conversely, in our proposed approach, MGCL first estimates the graphon associated with the observed data and then defines a graphon-informed augmentation process, enabling data-adaptive and principled augmentations. Additionally, for graph-level tasks, MGCL clusters the dataset and estimates a graphon per group, enabling contrastive pairs to reflect shared semantics and structure. Extensive experiments on benchmark datasets demonstrate that MGCL achieves state-of-the-art performance, highlighting the advantages of incorporating generative models into GCL.
[LG-4] How to craft a deep reinforcement learning policy for wind farm flow control
链接: https://arxiv.org/abs/2506.06204
作者: Elie Kadoche,Pascal Bianchi,Florence Carton,Philippe Ciblat,Damien Ernst
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Within wind farms, wake effects between turbines can significantly reduce overall energy production. Wind farm flow control encompasses methods designed to mitigate these effects through coordinated turbine control. Wake steering, for example, consists in intentionally misaligning certain turbines with the wind to optimize airflow and increase power output. However, designing a robust wake steering controller remains challenging, and existing machine learning approaches are limited to quasi-static wind conditions or small wind farms. This work presents a new deep reinforcement learning methodology to develop a wake steering policy that overcomes these limitations. Our approach introduces a novel architecture that combines graph attention networks and multi-head self-attention blocks, alongside a novel reward function and training strategy. The resulting model computes the yaw angles of each turbine, optimizing energy production in time-varying wind conditions. An empirical study conducted on steady-state, low-fidelity simulation, shows that our model requires approximately 10 times fewer training steps than a fully connected neural network and achieves more robust performance compared to a strong optimization baseline, increasing energy production by up to 14 %. To the best of our knowledge, this is the first deep reinforcement learning-based wake steering controller to generalize effectively across any time-varying wind conditions in a low-fidelity, steady-state numerical simulation setting.
[LG-5] ransformative or Conservative? Conservation laws for ResNets and Transformers ICML2025
链接: https://arxiv.org/abs/2506.06194
作者: Sibylle Marcotte,Rémi Gribonval,Gabriel Peyré
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025
点击查看摘要
Abstract:While conservation laws in gradient flow training dynamics are well understood for (mostly shallow) ReLU and linear networks, their study remains largely unexplored for more practical architectures. This paper bridges this gap by deriving and analyzing conservation laws for modern architectures, with a focus on convolutional ResNets and Transformer networks. For this, we first show that basic building blocks such as ReLU (or linear) shallow networks, with or without convolution, have easily expressed conservation laws, and no more than the known ones. In the case of a single attention layer, we also completely describe all conservation laws, and we show that residual blocks have the same conservation laws as the same block without a skip connection. We then introduce the notion of conservation laws that depend only on a subset of parameters (corresponding e.g. to a pair of consecutive layers, to a residual block, or to an attention layer). We demonstrate that the characterization of such laws can be reduced to the analysis of the corresponding building block in isolation. Finally, we examine how these newly discovered conservation principles, initially established in the continuous gradient flow regime, persist under discrete optimization dynamics, particularly in the context of Stochastic Gradient Descent (SGD).
[LG-6] ICU-TSB: A Benchmark for Temporal Patient Representation Learning for Unsupervised Stratification into Patient Cohorts
链接: https://arxiv.org/abs/2506.06192
作者: Dimitrios Proios,Alban Bornet,Anthony Yazdani,Jose F Rodrigues Jr,Douglas Teodoro
类目: Machine Learning (cs.LG)
*备注: 6 pages 1 table 6 figures
点击查看摘要
Abstract:Patient stratification identifying clinically meaningful subgroups is essential for advancing personalized medicine through improved diagnostics and treatment strategies. Electronic health records (EHRs), particularly those from intensive care units (ICUs), contain rich temporal clinical data that can be leveraged for this purpose. In this work, we introduce ICU-TSB (Temporal Stratification Benchmark), the first comprehensive benchmark for evaluating patient stratification based on temporal patient representation learning using three publicly available ICU EHR datasets. A key contribution of our benchmark is a novel hierarchical evaluation framework utilizing disease taxonomies to measure the alignment of discovered clusters with clinically validated disease groupings. In our experiments with ICU-TSB, we compared statistical methods and several recurrent neural networks, including LSTM and GRU, for their ability to generate effective patient representations for subsequent clustering of patient trajectories. Our results demonstrate that temporal representation learning can rediscover clinically meaningful patient cohorts; nevertheless, it remains a challenging task, with v-measuring varying from up to 0.46 at the top level of the taxonomy to up to 0.40 at the lowest level. To further enhance the practical utility of our findings, we also evaluate multiple strategies for assigning interpretable labels to the identified clusters. The experiments and benchmark are fully reproducible and available at this https URL.
[LG-7] Physics-Informed Neural Networks for Control of Single-Phase Flow Systems Governed by Partial Differential Equations
链接: https://arxiv.org/abs/2506.06188
作者: Luis Kin Miyatake,Eduardo Camponogara,Eric Aislan Antonelo,Alexey Pavlov
类目: Machine Learning (cs.LG)
*备注: 62 pages, 14 figures
点击查看摘要
Abstract:The modeling and control of single-phase flow systems governed by Partial Differential Equations (PDEs) present challenges, especially under transient conditions. In this work, we extend the Physics-Informed Neural Nets for Control (PINC) framework, originally proposed to modeling and control of Ordinary Differential Equations (ODE) without the need of any labeled data, to the PDE case, particularly to single-phase incompressible and compressible flows, integrating neural networks with physical conservation laws. The PINC model for PDEs is structured into two stages: a steady-state network, which learns equilibrium solutions for a wide range of control inputs, and a transient network, which captures dynamic responses under time-varying boundary conditions. We propose a simplifying assumption that reduces the dimensionality of the spatial coordinate regarding the initial condition, allowing the efficient training of the PINC network. This simplification enables the derivation of optimal control policies using Model Predictive Control (MPC). We validate our approach through numerical experiments, demonstrating that the PINC model, which is trained exclusively using physical laws, i.e., without labeled data, accurately represents flow dynamics and enables real-time control applications. The results highlight the PINC’s capability to efficiently approximate PDE solutions without requiring iterative solvers, making it a promising alternative for fluid flow monitoring and optimization in engineering applications.
[LG-8] Antithetic Noise in Diffusion Models
链接: https://arxiv.org/abs/2506.06185
作者: Jing Jia,Sifan Liu,Bowen Song,Wei Yuan,Liyue Shen,Guanyang Wang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 43 pages, 20 figures, 9 tables
点击查看摘要
Abstract:We initiate a systematic study of antithetic initial noise in diffusion models. Across unconditional models trained on diverse datasets, text-conditioned latent-diffusion models, and diffusion-posterior samplers, we find that pairing each initial noise with its negation consistently yields strongly negatively correlated samples. To explain this phenomenon, we combine experiments and theoretical analysis, leading to a symmetry conjecture that the learned score function is approximately affine antisymmetric (odd symmetry up to a constant shift), and provide evidence supporting it. Leveraging this negative correlation, we enable two applications: (1) enhancing image diversity in models like Stable Diffusion without quality loss, and (2) sharpening uncertainty quantification (e.g., up to 90% narrower confidence intervals) when estimating downstream statistics. Building on these gains, we extend the two-point pairing to a randomized quasi-Monte Carlo estimator, which further improves estimation accuracy. Our framework is training-free, model-agnostic, and adds no runtime overhead.
[LG-9] A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation Training Generalization ICML2025
链接: https://arxiv.org/abs/2506.06179
作者: Muhammed Ustaomeroglu,Guannan Qu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICML 2025
点击查看摘要
Abstract:Self-attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self-attention through the lens of interacting entities, ranging from agents in multi-agent reinforcement learning to alleles in genetic sequences, and show that a single layer linear self-attention can efficiently represent, learn, and generalize functions capturing pairwise interactions, including out-of-distribution scenarios. Our analysis reveals that self-attention acts as a mutual interaction learner under minimal assumptions on the diversity of interaction patterns observed during training, thereby encompassing a wide variety of real-world domains. In addition, we validate our theoretical insights through experiments demonstrating that self-attention learns interaction functions and generalizes across both population distributions and out-of-distribution scenarios. Building on our theories, we introduce HyperFeatureAttention, a novel neural network module designed to learn couplings of different feature-level interactions between entities. Furthermore, we propose HyperAttention, a new module that extends beyond pairwise interactions to capture multi-entity dependencies, such as three-way, four-way, or general n-way interactions.
[LG-10] Reusing Trajectories in Policy Gradients Enables Fast Convergence
链接: https://arxiv.org/abs/2506.06178
作者: Alessandro Montenegro,Federico Mansutti,Marco Mussi,Matteo Papini,Alberto Maria Metelli
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. These methods learn the parameters of parametric policies via stochastic gradient ascent, typically using on-policy trajectory data to estimate the policy gradient. However, such reliance on fresh data makes them sample-inefficient. Indeed, vanilla PG methods require O(\epsilon^-2) trajectories to reach an \epsilon -approximate stationary point. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as previous gradients or trajectories. While gradient reuse has received substantial theoretical attention, leading to improved rates of O(\epsilon^-3/2) , the reuse of past trajectories remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that extensive reuse of past off-policy trajectories can significantly accelerate convergence in PG methods. We introduce a power mean correction to the multiple importance weighting estimator and propose RPG (Retrospective Policy Gradient), a PG algorithm that combines old and new trajectories for policy updates. Through a novel analysis, we show that, under established assumptions, RPG achieves a sample complexity of \widetildeO(\epsilon^-1) , the best known rate in the literature. We further validate empirically our approach against PG methods with state-of-the-art rates.
[LG-11] ENMA: Tokenwise Autoregression for Generative Neural PDE Operators
链接: https://arxiv.org/abs/2506.06158
作者: Armand Kassaï Koupaï,Lise Le Boudec,Louis Serrano,Patrick Gallinari
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete-as is often the case-a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.
[LG-12] carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks
链接: https://arxiv.org/abs/2506.06143
作者: Carolin Benjamins,Helena Graf,Sarah Segel,Difan Deng,Tim Ruhkopf,Leona Hennig,Soham Basu,Neeratyoy Mallik,Edward Bergman,Deyao Chen,François Clément,Matthias Feurer,Katharina Eggensperger,Frank Hutter,Carola Doerr,Marius Lindauer
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Hyperparameter Optimization (HPO) is crucial to develop well-performing machine learning models. In order to ease prototyping and benchmarking of HPO methods, we propose carps, a benchmark framework for Comprehensive Automated Research Performance Studies allowing to evaluate N optimizers on M benchmark tasks. In this first release of carps, we focus on the four most important types of HPO task types: blackbox, multi-fidelity, multi-objective and multi-fidelity-multi-objective. With 3 336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the biggest go-to library to date to evaluate and compare HPO methods. The carps framework relies on a purpose-built, lightweight interface, gluing together optimizers and benchmark tasks. It also features an analysis pipeline, facilitating the evaluation of optimizers on benchmarks. However, navigating a huge number of tasks while developing and comparing methods can be computationally infeasible. To address this, we obtain a subset of representative tasks by minimizing the star discrepancy of the subset, in the space spanned by the full set. As a result, we propose an initial subset of 10 to 30 diverse tasks for each task type, and include functionality to re-compute subsets as more benchmarks become available, enabling efficient evaluations. We also establish a first set of baseline results on these tasks as a measure for future comparisons. With carps (this https URL), we make an important step in the standardization of HPO evaluation.
[LG-13] Flow-Attentional Graph Neural Networks
链接: https://arxiv.org/abs/2506.06127
作者: Pascal Plettenberg,Dominik Köhler,Bernhard Sick,Josephine M. Thomas
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have become essential for learning from graph-structured data. However, existing GNNs do not consider the conservation law inherent in graphs associated with a flow of physical resources, such as electrical current in power grids or traffic in transportation networks, which can lead to reduced model performance. To address this, we propose flow attention, which adapts existing graph attention mechanisms to satisfy Kirchhoffś first law. Furthermore, we discuss how this modification influences the expressivity and identify sets of non-isomorphic graphs that can be discriminated by flow attention but not by standard attention. Through extensive experiments on two flow graph datasets (electronic circuits and power grids), we demonstrate that flow attention enhances the performance of attention-based GNNs on both graph-level classification and regression tasks.
[LG-14] Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library
链接: https://arxiv.org/abs/2506.06122
作者: Weixun Wang,Shaopan Xiong,Gengru Chen,Wei Gao,Sheng Guo,Yancheng He,Ju Huang,Jiaheng Liu,Zhendong Li,Xiaoyang Li,Zichen Liu,Haizhou Zhao,Dakai An,Lunxi Cao,Qiyang Cao,Wanxi Deng,Feilei Du,Yiliang Gu,Jiahe Li,Xiang Li,Mingjie Liu,Yijia Luo,Zihe Liu,Yadao Wang,Pei Wang,Tianyuan Wu,Yanan Wu,Yuheng Zhao,Shuaibing Zhao,Jin Yang,Siran Yang,Yingshui Tan,Huimin Yi,Yuchi Xu,Yujin Yuan,Xingyao Zhang,Lin Qu,Wenbo Su,Wei Wang,Jiamang Wang,Bo Zheng
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 16 pages
点击查看摘要
Abstract:We introduce ROLL, an efficient, scalable, and user-friendly library designed for Reinforcement Learning Optimization for Large-scale Learning. ROLL caters to three primary user groups: tech pioneers aiming for cost-effective, fault-tolerant large-scale training, developers requiring flexible control over training workflows, and researchers seeking agile experimentation. ROLL is built upon several key modules to serve these user groups effectively. First, a single-controller architecture combined with an abstraction of the parallel worker simplifies the development of the training pipeline. Second, the parallel strategy and data transfer modules enable efficient and scalable training. Third, the rollout scheduler offers fine-grained management of each sample’s lifecycle during the rollout stage. Fourth, the environment worker and reward worker support rapid and flexible experimentation with agentic RL algorithms and reward designs. Finally, AutoDeviceMapping allows users to assign resources to different models flexibly across various stages.
[LG-15] Scalable unsupervised feature selection via weight stability
链接: https://arxiv.org/abs/2506.06114
作者: Xudong Zhang,Renato Cordeiro de Amorim
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Unsupervised feature selection is critical for improving clustering performance in high-dimensional data, where irrelevant features can obscure meaningful structure. In this work, we introduce the Minkowski weighted k -means++, a novel initialisation strategy for the Minkowski Weighted k -means. Our initialisation selects centroids probabilistically using feature relevance estimates derived from the data itself. Building on this, we propose two new feature selection algorithms, FS-MWK++, which aggregates feature weights across a range of Minkowski exponents to identify stable and informative features, and SFS-MWK++, a scalable variant based on subsampling. We support our approach with a theoretical guarantee under mild assumptions and extensive experiments showing that our methods consistently outperform existing alternatives.
[LG-16] Synthetic Tabular Data: Methods Attacks and Defenses KDD2025
链接: https://arxiv.org/abs/2506.06108
作者: Graham Cormode,Samuel Maddock,Enayat Ullah,Shripad Gade
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Survey paper for accepted lecture-style tutorial at ACM KDD 2025
点击查看摘要
Abstract:Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this survey, we cover the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data, by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.
[LG-17] Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU
链接: https://arxiv.org/abs/2506.06095
作者: Wenhao Dai,Haodong Deng,Mengfei Rong,Xinyu Yang,Hongyu Liu,Fangxin Liu,Hailong Yang,Weifeng Liu,Qingxiao Sun
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. Moreover, rule-based mechanisms ignore the fusion opportunities of mixed-type operators and fail to adapt to various sequence lengths. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer via flexible masking and operator fusion on GPU. We firstly unify the storage format and kernel implementation for the multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter setting through a two-stage search engine. The experimental results show that compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.
[LG-18] On-board Mission Replanning for Adaptive Cooperative Multi-Robot Systems
链接: https://arxiv.org/abs/2506.06094
作者: Elim Kwan,Rehman Qureshi,Liam Fletcher,Colin Laganier,Victoria Nockles,Richard Walters
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures, 1 table
点击查看摘要
Abstract:Cooperative autonomous robotic systems have significant potential for executing complex multi-task missions across space, air, ground, and maritime domains. But they commonly operate in remote, dynamic and hazardous environments, requiring rapid in-mission adaptation without reliance on fragile or slow communication links to centralised compute. Fast, on-board replanning algorithms are therefore needed to enhance resilience. Reinforcement Learning shows strong promise for efficiently solving mission planning tasks when formulated as Travelling Salesperson Problems (TSPs), but existing methods: 1) are unsuitable for replanning, where agents do not start at a single location; 2) do not allow cooperation between agents; 3) are unable to model tasks with variable durations; or 4) lack practical considerations for on-board deployment. Here we define the Cooperative Mission Replanning Problem as a novel variant of multiple TSP with adaptations to overcome these issues, and develop a new encoder/decoder-based model using Graph Attention Networks and Attention Models to solve it effectively and efficiently. Using a simple example of cooperative drones, we show our replanner consistently (90% of the time) maintains performance within 10% of the state-of-the-art LKH3 heuristic solver, whilst running 85-370 times faster on a Raspberry Pi. This work paves the way for increased resilience in autonomous multi-agent systems.
[LG-19] A Novel Human-in-the-Loop Computational Grounded Theory Framework for Big Social Data
链接: https://arxiv.org/abs/2506.06083
作者: Lama Alqazlan,Zheng Fang,Michael Castelle,Rob Procter
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 24 pages, 2 figures, 15 tables
点击查看摘要
Abstract:The availability of big data has significantly influenced the possibilities and methodological choices for conducting large-scale behavioural and social science research. In the context of qualitative data analysis, a major challenge is that conventional methods require intensive manual labour and are often impractical to apply to large datasets. One effective way to address this issue is by integrating emerging computational methods to overcome scalability limitations. However, a critical concern for researchers is the trustworthiness of results when Machine Learning (ML) and Natural Language Processing (NLP) tools are used to analyse such data. We argue that confidence in the credibility and robustness of results depends on adopting a ‘human-in-the-loop’ methodology that is able to provide researchers with control over the analytical process, while retaining the benefits of using ML and NLP. With this in mind, we propose a novel methodological framework for Computational Grounded Theory (CGT) that supports the analysis of large qualitative datasets, while maintaining the rigour of established Grounded Theory (GT) methodologies. To illustrate the framework’s value, we present the results of testing it on a dataset collected from Reddit in a study aimed at understanding tutors’ experiences in the gig economy.
[LG-20] System-Aware Unlearning Algorithms: Use Lesser Forget Faster ICML2025
链接: https://arxiv.org/abs/2506.06073
作者: Linda Lu,Ayush Sekhari,Karthik Sridharan
类目: Machine Learning (cs.LG)
*备注: ICML 2025
点击查看摘要
Abstract:Machine unlearning addresses the problem of updating a machine learning model/system trained on a dataset S so that the influence of a set of deletion requests U \subseteq S on the unlearned model is minimized. The gold standard definition of unlearning demands that the updated model, after deletion, be nearly identical to the model obtained by retraining. This definition is designed for a worst-case attacker (one who can recover not only the unlearned model but also the remaining data samples, i.e., S \setminus U ). Such a stringent definition has made developing efficient unlearning algorithms challenging. However, such strong attackers are also unrealistic. In this work, we propose a new definition, system-aware unlearning, which aims to provide unlearning guarantees against an attacker that can at best only gain access to the data stored in the system for learning/unlearning requests and not all of S\setminus U . With this new definition, we use the simple intuition that if a system can store less to make its learning/unlearning updates, it can be more secure and update more efficiently against a system-aware attacker. Towards that end, we present an exact system-aware unlearning algorithm for linear classification using a selective sampling-based approach, and we generalize the method for classification with general function classes. We theoretically analyze the tradeoffs between deletion capacity, accuracy, memory, and computation time.
[LG-21] BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning
链接: https://arxiv.org/abs/2506.06072
作者: Hongyi Zhou,Weiran Liao,Xi Huang,Yucheng Tang,Fabian Otto,Xiaogang Jia,Xinkai Jiang,Simon Hilber,Ge Li,Qian Wang,Ömer Erdinç Yağmurlu,Nils Blank,Moritz Reuss,Rudolf Lioutikov
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present the B-spline Encoded Action Sequence Tokenizer (BEAST), a novel action tokenizer that encodes action sequences into compact discrete or continuous tokens using B-splines. In contrast to existing action tokenizers based on vector quantization or byte pair encoding, BEAST requires no separate tokenizer training and consistently produces tokens of uniform length, enabling fast action sequence generation via parallel decoding. Leveraging our B-spline formulation, BEAST inherently ensures generating smooth trajectories without discontinuities between adjacent segments. We extensively evaluate BEAST by integrating it with three distinct model architectures: a Variational Autoencoder (VAE) with continuous tokens, a decoder-only Transformer with discrete tokens, and Florence-2, a pretrained Vision-Language Model with an encoder-decoder architecture, demonstrating BEAST’s compatibility and scalability with large pretrained models. We evaluate BEAST across three established benchmarks consisting of 166 simulated tasks and on three distinct robot settings with a total of 8 real-world tasks. Experimental results demonstrate that BEAST (i) significantly reduces both training and inference computational costs, and (ii) consistently generates smooth, high-frequency control signals suitable for continuous control tasks while (iii) reliably achieves competitive task success rates compared to state-of-the-art methods.
[LG-22] Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics
链接: https://arxiv.org/abs/2506.06045
作者: Tobias Würth,Niklas Freymuth,Gerhard Neumann,Luise Kärger
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
点击查看摘要
Abstract:Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion, a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.
[LG-23] Do-PFN: In-Context Learning for Causal Effect Estimation
链接: https://arxiv.org/abs/2506.06039
作者: Jake Robertson,Arik Reuter,Siyuan Guo,Noah Hollmann,Frank Hutter,Bernhard Schölkopf
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Estimation of causal effects is critical to a range of scientific disciplines. Existing methods for this task either require interventional data, knowledge about the ground truth causal graph, or rely on assumptions such as unconfoundedness, restricting their applicability in real-world settings. In the domain of tabular machine learning, Prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the harder problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments on synthetic case studies, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. We also perform ablation studies that elucidate Do-PFN’s scalability and robustness across datasets with a variety of causal characteristics.
[LG-24] On Inverse Problems Parameter Estimation and Domain Generalization
链接: https://arxiv.org/abs/2506.06024
作者: Deborah Pereg
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Signal restoration and inverse problems are key elements in most real-world data science applications. In the past decades, with the emergence of machine learning methods, inversion of measurements has become a popular step in almost all physical applications, which is normally executed prior to downstream tasks that often involve parameter estimation. In this work, we analyze the general problem of parameter estimation in an inverse problem setting. First, we address the domain-shift problem by re-formulating it in direct relation with the discrete parameter estimation analysis. We analyze a significant vulnerability in current attempts to enforce domain generalization, which we dubbed the Double Meaning Theorem. Our theoretical findings are experimentally illustrated for domain shift examples in image deblurring and speckle suppression in medical imaging. We then proceed to a theoretical analysis of parameter estimation given observed measurements before and after data processing involving an inversion of the observations. We compare this setting for invertible and non-invertible (degradation) processes. We distinguish between continuous and discrete parameter estimation, corresponding with regression and classification problems, respectively. Our theoretical findings align with the well-known information-theoretic data processing inequality, and to a certain degree question the common misconception that data-processing for inversion, based on modern generative models that may often produce outstanding perceptual quality, will necessarily improve the following parameter estimation objective. It is our hope that this paper will provide practitioners with deeper insights that may be leveraged in the future for the development of more efficient and informed strategic system planning, critical in safety-sensitive applications.
[LG-25] Unisoma: A Unified Transformer-based Solver for Multi-Solid Systems
链接: https://arxiv.org/abs/2506.06021
作者: Shilong Tao,Zhe Feng,Haonan Sun,Zhanxing Zhu,Yunhuai Liu
类目: Machine Learning (cs.LG)
*备注: Proceedings of the 42nd International Conference on Machine Learning
点击查看摘要
Abstract:Multi-solid systems are foundational to a wide range of real-world applications, yet modeling their complex interactions remains challenging. Existing deep learning methods predominantly rely on implicit modeling, where the factors influencing solid deformation are not explicitly represented but are instead indirectly learned. However, as the number of solids increases, these methods struggle to accurately capture intricate physical interactions. In this paper, we introduce a novel explicit modeling paradigm that incorporates factors influencing solid deformation through structured modules. Specifically, we present Unisoma, a unified and flexible Transformer-based model capable of handling variable numbers of solids. Unisoma directly captures physical interactions using contact modules and adaptive interaction allocation mechanism, and learns the deformation through a triplet relationship. Compared to implicit modeling techniques, explicit modeling is more well-suited for multi-solid systems with diverse coupling patterns, as it enables detailed treatment of each solid while preventing information blending and confusion. Experimentally, Unisoma achieves consistent state-of-the-art performance across seven well-established datasets and two complex multi-solid tasks. Code is avaiable at \hrefthis linkthis https URL.
[LG-26] LightGTS: A Lightweight General Time Series Forecasting Model ICML2025
链接: https://arxiv.org/abs/2506.06005
作者: Yihang Wang,Yuying Qiu,Peng Chen,Yang Shu,Zhongwen Rao,Lujia Pan,Bin Yang,Chenjuan Guo
类目: Machine Learning (cs.LG)
*备注: Accepted by the 42th International Conference on Machine Learning (ICML 2025)
点击查看摘要
Abstract:Existing works on general time series forecasting build foundation models with heavy model parameters through large-scale multi-source pre-training. These models achieve superior generalization ability across various datasets at the cost of significant computational burdens and limitations in resource-constrained scenarios. This paper introduces LightGTS, a lightweight general time series forecasting model designed from the perspective of consistent periodical modeling. To handle diverse scales and intrinsic periods in multi-source pre-training, we introduce Periodical Tokenization, which extracts consistent periodic patterns across different datasets with varying scales. To better utilize the periodicity in the decoding process, we further introduce Periodical Parallel Decoding, which leverages historical tokens to improve forecasting. Based on the two techniques above which fully leverage the inductive bias of periods inherent in time series, LightGTS uses a lightweight model to achieve outstanding performance on general time series forecasting. It achieves state-of-the-art forecasting performance on 9 real-world benchmarks in both zero-shot and full-shot settings with much better efficiency compared with existing time series foundation models.
[LG-27] What Really is a Member? Discrediting Membership Inference via Poisoning
链接: https://arxiv.org/abs/2506.06003
作者: Neal Mangaokar,Ashish Hooda,Zhuohang Li,Bradley A. Malin,Kassem Fawaz,Somesh Jha,Atul Prakash,Amrita Roy Chowdhury
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Membership inference tests aim to determine whether a particular data point was included in a language model’s training set. However, recent works have shown that such tests often fail under the strict definition of membership based on exact matching, and have suggested relaxing this definition to include semantic neighbors as members as well. In this work, we show that membership inference tests are still unreliable under this relaxation - it is possible to poison the training dataset in a way that causes the test to produce incorrect predictions for a target point. We theoretically reveal a trade-off between a test’s accuracy and its robustness to poisoning. We also present a concrete instantiation of this poisoning attack and empirically validate its effectiveness. Our results show that it can degrade the performance of existing tests to well below random.
[LG-28] LaDEEP: A Deep Learning-based Surrogate Model for Large Deformation of Elastic-Plastic Solids KDD KDD’25
链接: https://arxiv.org/abs/2506.06001
作者: Shilong Tao,Zhe Feng,Haonan Sun,Zhanxing Zhu,Yunhuai Liu
类目: Machine Learning (cs.LG)
*备注: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '25)
点击查看摘要
Abstract:Scientific computing for large deformation of elastic-plastic solids is critical for numerous real-world applications. Classical numerical solvers rely primarily on local discrete linear approximation and are constrained by an inherent trade-off between accuracy and efficiency. Recently, deep learning models have achieved impressive progress in solving the continuum mechanism. While previous models have explored various architectures and constructed coefficient-solution mappings, they are designed for general instances without considering specific problem properties and hard to accurately handle with complex elastic-plastic solids involving contact, loading and unloading. In this work, we take stretch bending, a popular metal fabrication technique, as our case study and introduce LaDEEP, a deep learning-based surrogate model for \textbfLarge \textbfDeformation of \textbfElastic-\textbfPlastic Solids. We encode the partitioned regions of the involved slender solids into a token sequence to maintain their essential order property. To characterize the physical process of the solid deformation, a two-stage Transformer-based module is designed to predict the deformation with the sequence of tokens as input. Empirically, LaDEEP achieves five magnitudes faster speed than finite element methods with a comparable accuracy, and gains 20.47% relative improvement on average compared to other deep learning baselines. We have also deployed our model into a real-world industrial production system, and it has shown remarkable performance in both accuracy and efficiency.
[LG-29] Machine learning for in-situ composition mapping in a self-driving magnetron sputtering system
链接: https://arxiv.org/abs/2506.05999
作者: Sanna Jarl,Jens Sjölund,Robert J. W. Frost,Anders Holst,Jonathan J. S. Scragg
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 24 pages, 10 figures. Submitted to the journal npj computational materials
点击查看摘要
Abstract:Self-driving labs (SDLs), employing automation and machine learning (ML) to accelerate experimental procedures, have enormous potential in the discovery of new materials. However, in thin film science, SDLs are mainly restricted to solution-based synthetic methods which are easier to automate but cannot access the broad chemical space of inorganic materials. This work presents an SDL based on magnetron co-sputtering. We are using combinatorial frameworks, obtaining accurate composition maps on multi-element, compositionally graded thin films. This normally requires time-consuming ex-situ analysis prone to systematic errors. We present a rapid and calibration-free in-situ, ML driven approach to produce composition maps for arbitrary source combinations and sputtering conditions. We develop a method to predict the composition distribution in a multi-element combinatorial thin film, using in-situ measurements from quartz-crystal microbalance sensors placed in a sputter chamber. For a given source, the sensor readings are learned as a function of the sputtering pressure and magnetron power, through active learning using Gaussian processes (GPs). The final GPs are combined with a geometric model of the deposition flux distribution in the chamber, which allows interpolation of the deposition rates from each source, at any position across the sample. We investigate several acquisition functions for the ML procedure. A fully Bayesian GP - BALM (Bayesian active learning MacKay) - achieved the best performance, learning the deposition rates for a single source in 10 experiments. Prediction accuracy for co-sputtering composition distributions was verified experimentally. Our framework dramatically increases throughput by avoiding the need for extensive characterisation or calibration, thus demonstrating the potential of ML-guided SDLs to accelerate materials exploration.
[LG-30] RETENTION: Resource-Efficient Tree-Based Ensemble Model Acceleration with Content-Addressable Memory
链接: https://arxiv.org/abs/2506.05994
作者: Yi-Chun Liao,Chieh-Lin Tsai,Yuan-Hao Chang,Camélia Slimani,Jalil Boukhobza,Tei-Wei Kuo
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
*备注:
点击查看摘要
Abstract:Although deep learning has demonstrated remarkable capabilities in learning from unstructured data, modern tree-based ensemble models remain superior in extracting relevant information and learning from structured datasets. While several efforts have been made to accelerate tree-based models, the inherent characteristics of the models pose significant challenges for conventional accelerators. Recent research leveraging content-addressable memory (CAM) offers a promising solution for accelerating tree-based models, yet existing designs suffer from excessive memory consumption and low utilization. This work addresses these challenges by introducing RETENTION, an end-to-end framework that significantly reduces CAM capacity requirement for tree-based model inference. We propose an iterative pruning algorithm with a novel pruning criterion tailored for bagging-based models (e.g., Random Forest), which minimizes model complexity while ensuring controlled accuracy degradation. Additionally, we present a tree mapping scheme that incorporates two innovative data placement strategies to alleviate the memory redundancy caused by the widespread use of don’t care states in CAM. Experimental results show that implementing the tree mapping scheme alone achieves 1.46\times to 21.30 \times better space efficiency, while the full RETENTION framework yields 4.35\times to 207.12\times improvement with less than 3% accuracy loss. These results demonstrate that RETENTION is highly effective in reducing CAM capacity requirement, providing a resource-efficient direction for tree-based model acceleration.
[LG-31] Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning
链接: https://arxiv.org/abs/2506.05985
作者: Yuheng Lei,Sitong Mao,Shunbo Zhou,Hongyuan Zhang,Xuelong Li,Ping Luo
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively learn a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, facilitating flexible behavior during lifelong adaptation. Moreover, by leveraging the modular structure of the fine-tuned parameters, we introduce coefficient replay to guide the router in accurately retrieving frozen experts for previously encountered tasks, thereby mitigating catastrophic forgetting. This method is significantly more storage- and computationally-efficient than applying demonstration replay to the entire policy. Extensive experiments on the lifelong manipulation benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates across continual adaptation, while utilizing minimal trainable parameters and storage.
[LG-32] Mitigating Catastrophic Forgetting with Adaptive Transformer Block Expansion in Federated Fine-Tuning
链接: https://arxiv.org/abs/2506.05977
作者: Yujia Huo,Jianchun Liu,Hongli Xu,Zhenguo Ma,Shilong Wang,Liusheng Huang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Federated fine-tuning (FedFT) of large language models (LLMs) has emerged as a promising solution for adapting models to distributed data environments while ensuring data privacy. Existing FedFT methods predominantly utilize parameter-efficient fine-tuning (PEFT) techniques to reduce communication and computation overhead. However, they often fail to adequately address the catastrophic forgetting, a critical challenge arising from continual adaptation in distributed environments. The traditional centralized fine-tuning methods, which are not designed for the heterogeneous and privacy-constrained nature of federated environments, struggle to mitigate this issue effectively. Moreover, the challenge is further exacerbated by significant variation in data distributions and device capabilities across clients, which leads to intensified forgetting and degraded model generalization. To tackle these issues, we propose FedBE, a novel FedFT framework that integrates an adaptive transformer block expansion mechanism with a dynamic trainable-block allocation strategy. Specifically, FedBE expands trainable blocks within the model architecture, structurally separating newly learned task-specific knowledge from the original pre-trained representations. Additionally, FedBE dynamically assigns these trainable blocks to clients based on their data distributions and computational capabilities. This enables the framework to better accommodate heterogeneous federated environments and enhances the generalization ability of the this http URL experiments show that compared with existing federated fine-tuning methods, FedBE achieves 12-74% higher accuracy retention on general tasks after fine-tuning and a model convergence acceleration ratio of 1.9-3.1x without degrading the accuracy of downstream tasks. Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2506.05977 [cs.LG] (or arXiv:2506.05977v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.05977 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-33] AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models
链接: https://arxiv.org/abs/2506.05960
作者: Adil Hasan,Thomas Peyrin
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Significant investments have been made towards the commodification of diffusion models for generation of diverse media. Their mass-market adoption is however still hobbled by the intense hardware resource requirements of diffusion model inference. Model quantization strategies tailored specifically towards diffusion models have been useful in easing this burden, yet have generally explored the Uniform Scalar Quantization (USQ) family of quantization methods. In contrast, Vector Quantization (VQ) methods, which operate on groups of multiple related weights as the basic unit of compression, have seen substantial success in Large Language Model (LLM) quantization. In this work, we apply codebook-based additive vector quantization to the problem of diffusion model compression. Our resulting approach achieves a new Pareto frontier for the extremely low-bit weight quantization on the standard class-conditional benchmark of LDM-4 on ImageNet at 20 inference time steps. Notably, we report sFID 1.92 points lower than the full-precision model at W4A8 and the best-reported results for FID, sFID and ISC at W2A8. We are also able to demonstrate FLOPs savings on arbitrary hardware via an efficient inference kernel, as opposed to savings resulting from small integer operations which may lack broad hardware support.
[LG-34] Applying XAI based unsupervised knowledge discovering for Operation modes in a WWTP. A real case: AQUAVALL WWTP
链接: https://arxiv.org/abs/2506.05958
作者: Alicia Beneyto-Rodriguez,Gregorio I. Sainz-Palmero,Marta Galende-Hernández,María J. Fuente,José M. Cuenca
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Water reuse is a key point when fresh water is a commodity in ever greater demand, but which is also becoming ever more available. Furthermore, the return of clean water to its natural environment is also mandatory. Therefore, wastewater treatment plants (WWTPs) are essential in any policy focused on these serious challenges. WWTPs are complex facilities which need to operate at their best to achieve their goals. Nowadays, they are largely monitored, generating large databases of historical data concerning their functioning over time. All this implies a large amount of embedded information which is not usually easy for plant managers to assimilate, correlate and understand; in other words, for them to know the global operation of the plant at any given time. At this point, the intelligent and Machine Learning (ML) approaches can give support for that need, managing all the data and translating them into manageable, interpretable and explainable knowledge about how the WWTP plant is operating at a glance. Here, an eXplainable Artificial Intelligence (XAI) based methodology is proposed and tested for a real WWTP, in order to extract explainable service knowledge concerning the operation modes of the WWTP managed by AQUAVALL, which is the public service in charge of the integral water cycle in the City Council of Valladolid (Castilla y León, Spain). By applying well-known approaches of XAI and ML focused on the challenge of WWTP, it has been possible to summarize a large number of historical databases through a few explained operation modes of the plant in a low-dimensional data space, showing the variables and facility units involved in each case. Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG) Cite as: arXiv:2506.05958 [eess.SY] (or arXiv:2506.05958v1 [eess.SY] for this version) https://doi.org/10.48550/arXiv.2506.05958 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-35] Pruning Spurious Subgraphs for Graph Out-of-Distribtuion Generalization ICML2025
链接: https://arxiv.org/abs/2506.05957
作者: Tianjun Yao,Haoxuan Li,Yongqiang Chen,Tongliang Liu,Le Song,Eric Xing,Zhiqiang Shen
类目: Machine Learning (cs.LG)
*备注: Submission of ICML2025, with score 4/4/3/3
点击查看摘要
Abstract:Graph Neural Networks (GNNs) often encounter significant performance degradation under distribution shifts between training and test data, hindering their applicability in real-world scenarios. Recent studies have proposed various methods to address the out-of-distribution generalization challenge, with many methods in the graph domain focusing on directly identifying an invariant subgraph that is predictive of the target label. However, we argue that identifying the edges from the invariant subgraph directly is challenging and error-prone, especially when some spurious edges exhibit strong correlations with the targets. In this paper, we propose PrunE, the first pruning-based graph OOD method that eliminates spurious edges to improve OOD generalizability. By pruning spurious edges, \mine retains the invariant subgraph more comprehensively, which is critical for OOD generalization. Specifically, PrunE employs two regularization terms to prune spurious edges: 1) graph size constraint to exclude uninformative spurious edges, and 2) \epsilon -probability alignment to further suppress the occurrence of spurious edges. Through theoretical analysis and extensive experiments, we show that PrunE achieves superior OOD performance and outperforms previous state-of-the-art methods significantly. Codes are available at: \hrefthis https URLthis https URL.
[LG-36] Learning Deterministic Policies with Policy Gradients in Constrained Markov Decision Processes
链接: https://arxiv.org/abs/2506.05953
作者: Alessandro Montenegro,Leonardo Cesani,Marco Mussi,Matteo Papini,Alberto Maria Metelli
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2407.10775
点击查看摘要
Abstract:Constrained Reinforcement Learning (CRL) addresses sequential decision-making problems where agents are required to achieve goals by maximizing the expected return while meeting domain-specific constraints. In this setting, policy-based methods are widely used thanks to their advantages when dealing with continuous-control problems. These methods search in the policy space with an action-based or a parameter-based exploration strategy, depending on whether they learn the parameters of a stochastic policy or those of a stochastic hyperpolicy. We introduce an exploration-agnostic algorithm, called C-PG, which enjoys global last-iterate convergence guarantees under gradient domination assumptions. Furthermore, under specific noise models where the (hyper)policy is expressed as a stochastic perturbation of the actions or of the parameters of an underlying deterministic policy, we additionally establish global last-iterate convergence guarantees of C-PG to the optimal deterministic policy. This holds when learning a stochastic (hyper)policy and subsequently switching off the stochasticity at the end of training, thereby deploying a deterministic policy. Finally, we empirically validate both the action-based (C-PGAE) and parameter-based (C-PGPE) variants of C-PG on constrained control tasks, and compare them against state-of-the-art baselines, demonstrating their effectiveness, in particular when deploying deterministic policies after training.
[LG-37] Additive decomposition of one-dimensional signals using Transformers
链接: https://arxiv.org/abs/2506.05942
作者: Samuele Salti,Andrea Pinto,Alessandro Lanza,Serena Morigi
类目: Machine Learning (cs.LG)
*备注: Under consideration at Pattern Recognition Letters
点击查看摘要
Abstract:One-dimensional signal decomposition is a well-established and widely used technique across various scientific fields. It serves as a highly valuable pre-processing step for data analysis. While traditional decomposition techniques often rely on mathematical models, recent research suggests that applying the latest deep learning models to this problem presents an exciting, unexplored area with promising potential. This work presents a novel method for the additive decomposition of one-dimensional signals. We leverage the Transformer architecture to decompose signals into their constituent components: piece-wise constant, smooth (low-frequency oscillatory), textured (high-frequency oscillatory), and a noise component. Our model, trained on synthetic data, achieves excellent accuracy in modeling and decomposing input signals from the same distribution, as demonstrated by the experimental results.
[LG-38] Exponential Family Variational Flow Matching for Tabular Data Generation
链接: https://arxiv.org/abs/2506.05940
作者: Andrés Guzmán-Cordero,Floor Eijkelboom,Jan-Willem van de Meent
类目: Machine Learning (cs.LG)
*备注: 14 pages, 1 figure, and 9 tables; To be published in the Proceedings of the Forty-Second International Conference on Machine Learning
点击查看摘要
Abstract:While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We hereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.
[LG-39] Machine Learning Predictions for Traffic Equilibria in Road Renovation Scheduling CCL2025
链接: https://arxiv.org/abs/2506.05933
作者: Robbert Bosch,Wouter van Heeswijk,Patricia Rogetzer,Martijn Mes
类目: Machine Learning (cs.LG)
*备注: 15 pages, 2 figures, submitted as conference paper to ICCL 2025
点击查看摘要
Abstract:Accurately estimating the impact of road maintenance schedules on traffic conditions is important because maintenance operations can substantially worsen congestion if not carefully planned. Reliable estimates allow planners to avoid excessive delays during periods of roadwork. Since the exact increase in congestion is difficult to predict analytically, traffic simulations are commonly used to assess the redistribution of the flow of traffic. However, when applied to long-term maintenance planning involving many overlapping projects and scheduling alternatives, these simulations must be run thousands of times, resulting in a significant computational burden. This paper investigates the use of machine learning-based surrogate models to predict network-wide congestion caused by simultaneous road renovations. We frame the problem as a supervised learning task, using one-hot encodings, engineered traffic features, and heuristic approximations. A range of linear, ensemble-based, probabilistic, and neural regression models is evaluated under an online learning framework in which data progressively becomes available. The experimental results show that the Costliest Subset Heuristic provides a reasonable approximation when limited training data is available, and that most regression models fail to outperform it, with the exception of XGBoost, which achieves substantially better accuracy. In overall performance, XGBoost significantly outperforms alternatives in a range of metrics, most strikingly Mean Absolute Percentage Error (MAPE) and Pinball loss, where it achieves a MAPE of 11% and outperforms the next-best model by 20% and 38% respectively. This modeling approach has the potential to reduce the computational burden of large-scale traffic assignment problems in maintenance planning.
[LG-40] Over-PINNs: Enhancing Physics-Informed Neural Networks via Higher-Order Partial Derivative Overdetermination of PDEs
链接: https://arxiv.org/abs/2506.05918
作者: Wenxuan Huo,Qiang He,Gang Zhu,Weifeng Huang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Partial differential equations (PDEs) serve as the cornerstone of mathematical physics. In recent years, Physics-Informed Neural Networks (PINNs) have significantly reduced the dependence on large datasets by embedding physical laws directly into the training of neural networks. However, when dealing with complex problems, the accuracy of PINNs still has room for improvement. To address this issue, we introduce the Over-PINNs framework, which leverages automatic differentiation (AD) to generate higher-order auxiliary equations that impose additional physical constraints. These equations are incorporated as extra loss terms in the training process, effectively enhancing the model’s ability to capture physical information through an “overdetermined” approach. Numerical results illustrate that this method exhibits strong versatility in solving various types of PDEs. It achieves a significant improvement in solution accuracy without incurring substantial additional computational costs.
[LG-41] DeviceScope: An Interactive App to Detect and Localize Appliance Patterns in Electricity Consumption Time Series ICDE2025
链接: https://arxiv.org/abs/2506.05912
作者: Adrien Petralia,Paul Boniol,Philippe Charpentier,Themis Palpanas
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 4 pages, 5 figures. This paper appeared in ICDE 2025
点击查看摘要
Abstract:In recent years, electricity suppliers have installed millions of smart meters worldwide to improve the management of the smart grid system. These meters collect a large amount of electrical consumption data to produce valuable information to help consumers reduce their electricity footprint. However, having non-expert users (e.g., consumers or sales advisors) understand these data and derive usage patterns for different appliances has become a significant challenge for electricity suppliers because these data record the aggregated behavior of all appliances. At the same time, ground-truth labels (which could train appliance detection and localization models) are expensive to collect and extremely scarce in practice. This paper introduces DeviceScope, an interactive tool designed to facilitate understanding smart meter data by detecting and localizing individual appliance patterns within a given time period. Our system is based on CamAL (Class Activation Map-based Appliance Localization), a novel weakly supervised approach for appliance localization that only requires the knowledge of the existence of an appliance in a household to be trained. This paper appeared in ICDE 2025.
[LG-42] A Driving Regime-Embedded Deep Learning Framework for Modeling Intra-Driver Heterogeneity in Multi-Scale Car-Following Dynamics
链接: https://arxiv.org/abs/2506.05902
作者: Shirui Zhou,Jiying Yan,Junfang Tian,Tao Wang,Yongfu Li,Shiquan Zhong
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注:
点击查看摘要
Abstract:A fundamental challenge in car-following modeling lies in accurately representing the multi-scale complexity of driving behaviors, particularly the intra-driver heterogeneity where a single driver’s actions fluctuate dynamically under varying conditions. While existing models, both conventional and data-driven, address behavioral heterogeneity to some extent, they often emphasize inter-driver heterogeneity or rely on simplified assumptions, limiting their ability to capture the dynamic heterogeneity of a single driver under different driving conditions. To address this gap, we propose a novel data-driven car-following framework that systematically embeds discrete driving regimes (e.g., steady-state following, acceleration, cruising) into vehicular motion predictions. Leveraging high-resolution traffic trajectory datasets, the proposed hybrid deep learning architecture combines Gated Recurrent Units for discrete driving regime classification with Long Short-Term Memory networks for continuous kinematic prediction, unifying discrete decision-making processes and continuous vehicular dynamics to comprehensively represent inter- and intra-driver heterogeneity. Driving regimes are identified using a bottom-up segmentation algorithm and Dynamic Time Warping, ensuring robust characterization of behavioral states across diverse traffic scenarios. Comparative analyses demonstrate that the framework significantly reduces prediction errors for acceleration (maximum MSE improvement reached 58.47%), speed, and spacing metrics while reproducing critical traffic phenomena, such as stop-and-go wave propagation and oscillatory dynamics.
[LG-43] Few Labels are all you need: A Weakly Supervised Framework for Appliance Localization in Smart-Meter Series ICDE2025
链接: https://arxiv.org/abs/2506.05895
作者: Adrien Petralia,Paul Boniol,Philippe Charpentier,Themis Palpanas
类目: Machine Learning (cs.LG)
*备注: 12 pages, 10 figures. This paper appeared in IEEE ICDE 2025
点击查看摘要
Abstract:Improving smart grid system management is crucial in the fight against climate change, and enabling consumers to play an active role in this effort is a significant challenge for electricity suppliers. In this regard, millions of smart meters have been deployed worldwide in the last decade, recording the main electricity power consumed in individual households. This data produces valuable information that can help them reduce their electricity footprint; nevertheless, the collected signal aggregates the consumption of the different appliances running simultaneously in the house, making it difficult to apprehend. Non-Intrusive Load Monitoring (NILM) refers to the challenge of estimating the power consumption, pattern, or on/off state activation of individual appliances using the main smart meter signal. Recent methods proposed to tackle this task are based on a fully supervised deep-learning approach that requires both the aggregate signal and the ground truth of individual appliance power. However, such labels are expensive to collect and extremely scarce in practice, as they require conducting intrusive surveys in households to monitor each appliance. In this paper, we introduce CamAL, a weakly supervised approach for appliance pattern localization that only requires information on the presence of an appliance in a household to be trained. CamAL merges an ensemble of deep-learning classifiers combined with an explainable classification method to be able to localize appliance patterns. Our experimental evaluation, conducted on 4 real-world datasets, demonstrates that CamAL significantly outperforms existing weakly supervised baselines and that current SotA fully supervised NILM approaches require significantly more labels to reach CamAL performances. The source of our experiments is available at: this https URL. This paper appeared in ICDE 2025.
[LG-44] NILMFormer: Non-Intrusive Load Monitoring that Accounts for Non-Stationarity KDD2025
链接: https://arxiv.org/abs/2506.05880
作者: Adrien Petralia,Philippe Charpentier,Youssef Kadhi,Themis Palpanas
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 12 pages, 8 figures. This paper appeared in ACM SIGKDD 2025
点击查看摘要
Abstract:Millions of smart meters have been deployed worldwide, collecting the total power consumed by individual households. Based on these data, electricity suppliers offer their clients energy monitoring solutions to provide feedback on the consumption of their individual appliances. Historically, such estimates have relied on statistical methods that use coarse-grained total monthly consumption and static customer data, such as appliance ownership. Non-Intrusive Load Monitoring (NILM) is the problem of disaggregating a household’s collected total power consumption to retrieve the consumed power for individual appliances. Current state-of-the-art (SotA) solutions for NILM are based on deep-learning (DL) and operate on subsequences of an entire household consumption reading. However, the non-stationary nature of real-world smart meter data leads to a drift in the data distribution within each segmented window, which significantly affects model performance. This paper introduces NILMFormer, a Transformer-based architecture that incorporates a new subsequence stationarization/de-stationarization scheme to mitigate the distribution drift and that uses a novel positional encoding that relies only on the subsequence’s timestamp information. Experiments with 4 real-world datasets show that NILMFormer significantly outperforms the SotA approaches. Our solution has been deployed as the backbone algorithm for EDF’s (Electricité De France) consumption monitoring service, delivering detailed insights to millions of customers about their individual appliances’ power consumption. This paper appeared in KDD 2025.
[LG-45] A projection-based framework for gradient-free and parallel learning
链接: https://arxiv.org/abs/2506.05878
作者: Andreas Bergmeister,Manish Krishan Lal,Stefanie Jegelka,Suvrit Sra
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is as a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.
[LG-46] Interpretable Clustering Ensemble
链接: https://arxiv.org/abs/2506.05877
作者: Hang Lv,Lianyu Hu,Mudi Jiang,Xinying Liu,Zengyou He
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Clustering ensemble has emerged as an important research topic in the field of machine learning. Although numerous methods have been proposed to improve clustering quality, most existing approaches overlook the need for interpretability in high-stakes applications. In domains such as medical diagnosis and financial risk assessment, algorithms must not only be accurate but also interpretable to ensure transparent and trustworthy decision-making. Therefore, to fill the gap of lack of interpretable algorithms in the field of clustering ensemble, we propose the first interpretable clustering ensemble algorithm in the literature. By treating base partitions as categorical variables, our method constructs a decision tree in the original feature space and use the statistical association test to guide the tree building process. Experimental results demonstrate that our algorithm achieves comparable performance to state-of-the-art (SOTA) clustering ensemble methods while maintaining an additional feature of interpretability. To the best of our knowledge, this is the first interpretable algorithm specifically designed for clustering ensemble, offering a new perspective for future research in interpretable clustering.
[LG-47] BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures
链接: https://arxiv.org/abs/2506.05871
作者: Xiannan Hu,Tianyou Zeng,Xiaoming Yuan,Liwei Song,Guangyuan Zhang,Bangzheng He
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注:
点击查看摘要
Abstract:Serving large language models (LLMs) to millions of users requires efficient resource allocation and parallelism strategies. It is a labor intensive trial-and-error process to find such a strategy. We present BestServe, a novel framework for ranking serving strategies by estimating goodput under various operating scenarios. Supporting both collocated and disaggregated architectures, BestServe leverages an inference simulator built on an adapted roofline model and CPU-GPU dispatch dynamics. Our framework determines the optimal strategy in minutes on a single standard CPU, eliminating the need for costly benchmarking, while achieving predictions within a 20% error margin. It appeals to be practical for rapid deployment planning because of its lightweight design and strong extensibility.
[LG-48] Stealix: Model Stealing via Prompt Evolution ICML2025
链接: https://arxiv.org/abs/2506.05867
作者: Zhixiong Zhuang,Hui-Po Wang,Maria-Irina Nicolae,Mario Fritz
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at ICML 2025. The project page is at this https URL
点击查看摘要
Abstract:Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model’s data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized.
[LG-49] Wavelet-based Disentangled Adaptive Normalization for Non-stationary Times Series Forecasting
链接: https://arxiv.org/abs/2506.05857
作者: Junpeng Lin,Tian Lan,Bo Zhang,Ke Lin,Dandan Miao,Huiru He,Jiantao Ye,Chen Zhang,Yan-fu Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Forecasting non-stationary time series is a challenging task because their statistical properties often change over time, making it hard for deep models to generalize well. Instance-level normalization techniques can help address shifts in temporal distribution. However, most existing methods overlook the multi-component nature of time series, where different components exhibit distinct non-stationary behaviors. In this paper, we propose Wavelet-based Disentangled Adaptive Normalization (WDAN), a model-agnostic framework designed to address non-stationarity in time series forecasting. WDAN uses discrete wavelet transforms to break down the input into low-frequency trends and high-frequency fluctuations. It then applies tailored normalization strategies to each part. For trend components that exhibit strong non-stationarity, we apply first-order differencing to extract stable features used for predicting normalization parameters. Extensive experiments on multiple benchmarks demonstrate that WDAN consistently improves forecasting accuracy across various backbone model. Code is available at this repository: this https URL.
[LG-50] raining-Free Query Optimization via LLM -Based Plan Similarity
链接: https://arxiv.org/abs/2506.05853
作者: Nikita Vasilenko,Alexander Demin,Vladimir Boorlakov
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: 18 pages, 5 figures
点击查看摘要
Abstract:Large language model (LLM) embeddings offer a promising new avenue for database query optimization. In this paper, we explore how pre-trained execution plan embeddings can guide SQL query execution without the need for additional model training. We introduce LLM-PM (LLM-based Plan Mapping), a framework that embeds the default execution plan of a query, finds its k nearest neighbors among previously executed plans, and recommends database hintsets based on neighborhood voting. A lightweight consistency check validates the selected hint, while a fallback mechanism searches the full hint space when needed. Evaluated on the JOB-CEB benchmark using OpenGauss, LLM-PM achieves an average speed-up of 21% query latency reduction. This work highlights the potential of LLM-powered embeddings to deliver practical improvements in query performance and opens new directions for training-free, embedding-based optimizer guidance systems.
[LG-51] Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning
链接: https://arxiv.org/abs/2506.05826
作者: Ngoc Bui,Menglin Yang,Runjin Chen,Leonardo Neves,Mingxuan Ju,Rex Ying,Neil Shah,Tong Zhao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Backward compatible representation learning enables updated models to integrate seamlessly with existing ones, avoiding to reprocess stored data. Despite recent advances, existing compatibility approaches in Euclidean space neglect the uncertainty in the old embedding model and force the new model to reconstruct outdated representations regardless of their quality, thereby hindering the learning process of the new model. In this paper, we propose to switch perspectives to hyperbolic geometry, where we treat time as a natural axis for capturing a model’s confidence and evolution. By lifting embeddings into hyperbolic space and constraining updated embeddings to lie within the entailment cone of the old ones, we maintain generational consistency across models while accounting for uncertainties in the representations. To further enhance compatibility, we introduce a robust contrastive alignment loss that dynamically adjusts alignment weights based on the uncertainty of the old embeddings. Experiments validate the superiority of the proposed method in achieving compatibility, paving the way for more resilient and adaptable machine learning systems.
[LG-52] Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model
链接: https://arxiv.org/abs/2506.05801
作者: Chuang Ma,Tomoyuki Obuchi,Toshiyuki Tanaka
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:A phenomenon known as ‘‘Neural Collapse (NC)’’ in deep classification tasks, in which the penultimate-layer features and the final classifiers exhibit an extremely simple geometric structure, has recently attracted considerable attention, with the expectation that it can deepen our understanding of how deep neural networks behave. The Unconstrained Feature Model (UFM) has been proposed to explain NC theoretically, and there emerges a growing body of work that extends NC to tasks other than classification and leverages it for practical applications. In this study, we investigate whether a similar phenomenon arises in deep Ordinal Regression (OR) tasks, via combining the cumulative link model for OR and UFM. We show that a phenomenon we call Ordinal Neural Collapse (ONC) indeed emerges and is characterized by the following three properties: (ONC1) all optimal features in the same class collapse to their within-class mean when regularization is applied; (ONC2) these class means align with the classifier, meaning that they collapse onto a one-dimensional subspace; (ONC3) the optimal latent variables (corresponding to logits or preactivations in classification tasks) are aligned according to the class order, and in particular, in the zero-regularization limit, a highly local and simple geometric relationship emerges between the latent variables and the threshold values. We prove these properties analytically within the UFM framework with fixed threshold values and corroborate them empirically across a variety of datasets. We also discuss how these insights can be leveraged in OR, highlighting the use of fixed thresholds.
[LG-53] Option Pricing Using Ensemble Learning
链接: https://arxiv.org/abs/2506.05799
作者: Zeyuan Li,Qingdao Huang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ensemble learning is characterized by flexibility, high precision, and refined structure. As a critical component within computational finance, option pricing with machine learning requires both high predictive accuracy and reduced structural complexity-features that align well with the inherent advantages of ensemble learning. This paper investigates the application of ensemble learning to option pricing, and conducts a comparative analysis with classical machine learning models to assess their performance in terms of accuracy, local feature extraction, and robustness to noise. A novel experimental strategy is introduced, leveraging parameter transfer across experiments to improve robustness and realism in financial this http URL upon this strategy, an evaluation mechanism is developed that incorporates a scoring strategy and a weighted evaluation strategy explicitly emphasizing the foundational role of financial theory. This mechanism embodies an orderly integration of theoretical finance and computational methods. In addition, the study examines the interaction between sliding window technique and noise, revealing nuanced patterns that suggest a potential connection relevant to ongoing research in machine learning and data science.
[LG-54] EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator
链接: https://arxiv.org/abs/2506.05797
作者: Qianyi Chen,Tianrun Gao,Chenbo Jiang,Tailin Wu
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce EqCollide, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results show that EqCollide achieves accurate, stable, and scalable simulations across diverse object configurations, and our model achieves 24.34% to 35.82% lower rollout MSE even compared with the best-performing baseline model. Furthermore, our model could generalize to more colliding objects and extended temporal horizons, and stay robust to input transformed with group action.
[LG-55] Exploiting Similarity for Computation and Communication-Efficient Decentralized Optimization ICML2025
链接: https://arxiv.org/abs/2506.05791
作者: Yuki Takezawa,Xiaowen Jiang,Anton Rodomanov,Sebastian U. Stich
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: ICML 2025
点击查看摘要
Abstract:Reducing communication complexity is critical for efficient decentralized optimization. The proximal decentralized optimization (PDO) framework is particularly appealing, as methods within this framework can exploit functional similarity among nodes to reduce communication rounds. Specifically, when local functions at different nodes are similar, these methods achieve faster convergence with fewer communication steps. However, existing PDO methods often require highly accurate solutions to subproblems associated with the proximal operator, resulting in significant computational overhead. In this work, we propose the Stabilized Proximal Decentralized Optimization (SPDO) method, which achieves state-of-the-art communication and computational complexities within the PDO framework. Additionally, we refine the analysis of existing PDO methods by relaxing subproblem accuracy requirements and leveraging average functional similarity. Experimental results demonstrate that SPDO significantly outperforms existing methods.
[LG-56] Pegasus: A Universal Framework for Scalable Deep Learning Inference on the Dataplane
链接: https://arxiv.org/abs/2506.05779
作者: Yinchao Zhang,Su Yao,Yong Feng,Kang Chen,Tong Li,Zhuotao Liu,Yi Zhao,Lexuan Zhang,Xiangyu Gao,Feng Xiong,Qi Li,Ke Xu
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: to be published in Sigcomm 2025
点击查看摘要
Abstract:The paradigm of Intelligent DataPlane (IDP) embeds deep learning (DL) models on the network dataplane to enable intelligent traffic analysis at line-speed. However, the current use of the match-action table (MAT) abstraction on the dataplane is misaligned with DL inference, leading to several key limitations, including accuracy degradation, limited scale, and lack of generality. This paper proposes Pegasus to address these limitations. Pegasus translates DL operations into three dataplane-oriented primitives to achieve generality: Partition, Map, and SumReduce. Specifically, Partition “divides” high-dimensional features into multiple low-dimensional vectors, making them more suitable for the dataplane; Map “conquers” computations on the low-dimensional vectors in parallel with the technique of fuzzy matching, while SumReduce “combines” the computation results. Additionally, Pegasus employs Primitive Fusion to merge computations, improving scalability. Finally, Pegasus adopts full precision weights with fixed-point activations to improve accuracy. Our implementation on a P4 switch demonstrates that Pegasus can effectively support various types of DL models, including Multi-Layer Perceptron (MLP), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and AutoEncoder models on the dataplane. Meanwhile, Pegasus outperforms state-of-the-art approaches with an average accuracy improvement of up to 22.8%, along with up to 248x larger model size and 212x larger input scale.
[LG-57] Evaluating Neuron Explanations: A Unified Framework with Sanity Checks ICML2025
链接: https://arxiv.org/abs/2506.05774
作者: Tuomas Oikarinen,Ge Yan,Tsui-Wei Weng
类目: Machine Learning (cs.LG)
*备注: Published at ICML 2025
点击查看摘要
Abstract:Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare existing evaluation metrics, understand the evaluation pipeline with increased clarity and apply existing statistical methods on the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests and do not change their score after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
[LG-58] AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation
链接: https://arxiv.org/abs/2506.05768
作者: Wenyu Zhu,Jianhui Wang,Bowen Gao,Yinjun Jia,Haichuan Tan,Ya-Qin Zhang,Wei-Ying Ma,Yanyan Lan
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
点击查看摘要
Abstract:Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods–whether physics-based or deep learning-based–are developed around holo protein structures with known ligand-bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real-world early-stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment-and-aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri-modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross-attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state-of-the-art methods in blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first-in-class drug discovery, particularly in scenarios lacking experimentally resolved protein-ligand complexes.
[LG-59] Exploring Microstructural Dynamics in Cryptocurrency Limit Order Books: Better Inputs Matter More Than Stacking Another Hidden Layer
链接: https://arxiv.org/abs/2506.05764
作者: Haochuan(Kevin)Wang
类目: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
*备注:
点击查看摘要
Abstract:Cryptocurrency price dynamics are driven largely by microstructural supply demand imbalances in the limit order book (LOB), yet the highly noisy nature of LOB data complicates the signal extraction process. Prior research has demonstrated that deep-learning architectures can yield promising predictive performance on pre-processed equity and futures LOB data, but they often treat model complexity as an unqualified virtue. In this paper, we aim to examine whether adding extra hidden layers or parameters to “blackbox ish” neural networks genuinely enhances short term price forecasting, or if gains are primarily attributable to data preprocessing and feature engineering. We benchmark a spectrum of models from interpretable baselines, logistic regression, XGBoost to deep architectures (DeepLOB, Conv1D+LSTM) on BTC/USDT LOB snapshots sampled at 100 ms to multi second intervals using publicly available Bybit data. We introduce two data filtering pipelines (Kalman, Savitzky Golay) and evaluate both binary (up/down) and ternary (up/flat/down) labeling schemes. Our analysis compares models on out of sample accuracy, latency, and robustness to noise. Results reveal that, with data preprocessing and hyperparameter tuning, simpler models can match and even exceed the performance of more complex networks, offering faster inference and greater interpretability.
[LG-60] BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
链接: https://arxiv.org/abs/2506.05762
作者: Yunpeng Qing,Shuo Chen,Yixiao Chi,Shunyu Liu,Sixu Lin,Changqing Zou
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history this http URL can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
[LG-61] Come Together But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation ICML2025
链接: https://arxiv.org/abs/2506.05713
作者: Zhan Zhuang,Xiequn Wang,Wei Li,Yulong Zhang,Qiushi Huang,Shuhao Chen,Xuehao Wang,Yanbin Wei,Yuhe Nie,Kede Ma,Yu Zhang,Ying Wei
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2025. Code link: this https URL
点击查看摘要
Abstract:Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapters’ activation probability over the course of fine-tuning. By stochastically deactivating adapters, CoTo encourages more balanced optimization and broader exploration of the loss landscape. We provide a theoretical analysis showing that CoTo promotes layer-wise dropout stability and linear mode connectivity, and we adopt a cooperative-game approach to quantify each adapter’s marginal contribution. Extensive experiments demonstrate that CoTo consistently boosts single-task performance, enhances multi-task merging accuracy, improves pruning robustness, and reduces training overhead, all while remaining compatible with diverse LoRA variants. Code is available at this https URL.
[LG-62] Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application
链接: https://arxiv.org/abs/2506.05710
作者: Xiucheng Wang,Honggang Jia,Nan Cheng,Dusit Niyato
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:In this paper, a novel semantic communication framework empowered by generative artificial intelligence (GAI) is proposed, specifically leveraging the capabilities of diffusion models (DMs). A rigorous theoretical foundation is established based on stochastic differential equations (SDEs), which elucidates the denoising properties of DMs in mitigating additive white Gaussian noise (AWGN) in latent semantic representations. Crucially, a closed-form analytical relationship between the signal-to-noise ratio (SNR) and the denoising timestep is derived, enabling the optimal selection of diffusion parameters for any given channel condition. To address the distribution mismatch between the received signal and the DM’s training data, a mathematically principled scaling mechanism is introduced, ensuring robust performance across a wide range of SNRs without requiring model fine-tuning. Built upon this theoretical insight, we develop a latent diffusion model (LDM)-based semantic transceiver, wherein a variational autoencoder (VAE) is employed for efficient semantic compression, and a pretrained DM serves as a universal denoiser. Notably, the proposed architecture is fully training-free at inference time, offering high modularity and compatibility with large-scale pretrained LDMs. This design inherently supports zero-shot generalization and mitigates the challenges posed by out-of-distribution inputs. Extensive experimental evaluations demonstrate that the proposed framework significantly outperforms conventional neural-network-based semantic communication baselines, particularly under low SNR conditions and distributional shifts, thereby establishing a promising direction for GAI-driven robust semantic transmission in future 6G systems.
[LG-63] Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health
链接: https://arxiv.org/abs/2506.05701
作者: Pavel Dolin,Weizhi Li,Gautam Dasarathy,Visar Berisha
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This position paper argues that post-deployment monitoring in clinical AI is underdeveloped and proposes statistically valid and label-efficient testing frameworks as a principled foundation for ensuring reliability and safety in real-world deployment. A recent review found that only 9% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan. Existing monitoring approaches are often manual, sporadic, and reactive, making them ill-suited for the dynamic environments in which clinical models operate. We contend that post-deployment monitoring should be grounded in label-efficient and statistically valid testing frameworks, offering a principled alternative to current practices. We use the term “statistically valid” to refer to methods that provide explicit guarantees on error rates (e.g., Type I/II error), enable formal inference under pre-defined assumptions, and support reproducibility–features that align with regulatory requirements. Specifically, we propose that the detection of changes in the data and model performance degradation should be framed as distinct statistical hypothesis testing problems. Grounding monitoring in statistical rigor ensures a reproducible and scientifically sound basis for maintaining the reliability of clinical AI systems. Importantly, it also opens new research directions for the technical community–spanning theory, methods, and tools for statistically principled detection, attribution, and mitigation of post-deployment model failures in real-world settings.
[LG-64] Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions
链接: https://arxiv.org/abs/2506.05678
作者: Haotian Jiang,Zeyu Bao,Shida Wang,Qianxiao Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The evolution of sequence modeling architectures, from recurrent neural networks and convolutional models to Transformers and structured state-space models, reflects ongoing efforts to address the diverse temporal dependencies inherent in sequential data. Despite this progress, systematically characterizing the strengths and limitations of these architectures remains a fundamental this http URL this work, we propose a synthetic benchmarking framework to evaluate how effectively different sequence models capture distinct temporal structures. The core of this approach is to generate synthetic targets, each characterized by a memory function and a parameter that determines the strength of temporal dependence. This setup allows us to produce a continuum of tasks that vary in temporal complexity, enabling fine-grained analysis of model behavior concerning specific memory properties. We focus on four representative memory functions, each corresponding to a distinct class of temporal this http URL on several sequence modeling architectures confirm existing theoretical insights and reveal new this http URL results demonstrate the effectiveness of the proposed method in advancing theoretical understandingand highlight the importance of using controllable targets with clearly defined structures for evaluating sequence modeling architectures.
[LG-65] opology-aware Neural Flux Prediction Guided by Physics
链接: https://arxiv.org/abs/2506.05676
作者: Haoyang Jiang,Jindong Wang,Xingquan Zhu,Yi He
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) often struggle in preserving high-frequency components of nodal signals when dealing with directed graphs. Such components are crucial for modeling flow dynamics, without which a traditional GNN tends to treat a graph with forward and reverse topologies this http URL make GNNs sensitive to those high-frequency components thereby being capable to capture detailed topological differences, this paper proposes a novel framework that combines 1) explicit difference matrices that model directional gradients and 2) implicit physical constraints that enforce messages passing within GNNs to be consistent with natural laws. Evaluations on two real-world directed graph data, namely, water flux network and urban traffic flow network, demonstrate the effectiveness of our proposal.
[LG-66] RNE: a plug-and-play framework for diffusion density estimation and inference-time control
链接: https://arxiv.org/abs/2506.05668
作者: Jiajun He,José Miguel Hernández-Lobato,Yuanqi Du,Francisco Vargas
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 39 pages; 10 figures
点击查看摘要
Abstract:In this paper, we introduce the Radon-Nikodym Estimator (RNE), a flexible, plug-and-play framework for diffusion inference-time density estimation and control, based on the concept of the density ratio between path distributions. RNE connects and unifies a variety of existing density estimation and inference-time control methods under a single and intuitive perspective, stemming from basic variational inference and probabilistic principles therefore offering both theoretical clarity and practical versatility. Experiments demonstrate that RNE achieves promising performances in diffusion density estimation and inference-time control tasks, including annealing, composition of diffusion models, and reward-tilting.
[LG-67] List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression NEURIPS2025
链接: https://arxiv.org/abs/2506.05632
作者: Joseph Rowan,Buu Phan,Ashish Khisti
类目: Machine Learning (cs.LG)
*备注: Submitted to NeurIPS 2025
点击查看摘要
Abstract:We study a relaxation of the problem of coupling probability distributions – a list of samples is generated from one distribution and an accept is declared if any one of these samples is identical to the sample generated from the other distribution. We propose a novel method for generating samples, which extends the Gumbel-max sampling suggested in Daliri et al. (arXiv:2408.07978) for coupling probability distributions. We also establish a corresponding lower bound on the acceptance probability, which we call the list matching lemma. We next discuss two applications of our setup. First, we develop a new mechanism for multi-draft speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr and SpecInfer across a range of language tasks. Our method also guarantees a certain degree of drafter invariance with respect to the output tokens which is not supported by existing schemes. We also provide a theoretical lower bound on the token level acceptance probability. As our second application, we consider distributed lossy compression with side information in a setting where a source sample is compressed and available to multiple decoders, each with independent side information. We propose a compression technique that is based on our generalization of Gumbel-max sampling and show that it provides significant gains in experiments involving synthetic Gaussian sources and the MNIST image dataset.
[LG-68] wo-dimensional Taxonomy for N-ary Knowledge Representation Learning Methods
链接: https://arxiv.org/abs/2506.05626
作者: Xiaohua Lu,Liubov Tupikina,Mehwish Alam
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Real-world knowledge can take various forms, including structured, semi-structured, and unstructured data. Among these, knowledge graphs are a form of structured human knowledge that integrate heterogeneous data sources into structured representations but typically reduce complex n-ary relations to simple triples, thereby losing higher-order relational details. In contrast, hypergraphs naturally represent n-ary relations with hyperedges, which directly connect multiple entities together. Yet hypergraph representation learning often overlooks entity roles in hyperedges, limiting the fine-grained semantic modelling. To address these issues, knowledge hypergraphs and hyper-relational knowledge graphs combine the advantages of knowledge graphs and hypergraphs to better capture the complex structures and role-specific semantics of real-world knowledge. This survey provides a comprehensive review of methods handling n-ary relational data, covering both knowledge hypergraphs and hyper-relational knowledge graphs literatures. We propose a two-dimensional taxonomy: the first dimension categorises models based on their methodology, i.e., translation-based models, tensor factorisation-based models, deep neural network-based models, logic rules-based models, and hyperedge expansion-based models. The second dimension classifies models according to their awareness of entity roles and positions in n-ary relations, dividing them into aware-less, position-aware, and role-aware approaches. Finally, we discuss existing datasets, negative sampling strategies, and outline open challenges to inspire future research.
[LG-69] Heterogeneous Sequel-Aware Graph Neural Networks for Sequential Learning
链接: https://arxiv.org/abs/2506.05625
作者: Anushka Tiwari,Haimonti Dutta,Shahrzad Khanizadeh
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph-based recommendation systems use higher-order user and item embeddings for next-item predictions. Dynamically adding collaborative signals from neighbors helps to use similar users’ preferences during learning. While item-item correlations and their impact on recommendations have been studied, the efficacy of temporal item sequences for recommendations is much less explored. In this paper, we examine temporal item sequence (sequel-aware) embeddings along with higher-order user embeddings and show that sequel-aware Graph Neural Networks have better (or comparable) recommendation performance than graph-based recommendation systems that do not consider sequel information. Extensive empirical results comparing Heterogeneous Sequel-aware Graph Neural Networks (HSAL-GNNs) to other algorithms for sequential learning (such as transformers, graph neural networks, auto-encoders) are presented on three synthetic and three real-world datasets. Our results indicate that the incorporation of sequence information from items greatly enhances recommendations.
[LG-70] FaCTR: Factorized Channel-Temporal Representation Transformers for Efficient Time Series Forecasting
链接: https://arxiv.org/abs/2506.05597
作者: Yash Vijay,Harini Subramanyan
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While Transformers excel in language and vision-where inputs are semantically rich and exhibit univariate dependency structures-their architectural complexity leads to diminishing returns in time series forecasting. Time series data is characterized by low per-timestep information density and complex dependencies across channels and covariates, requiring conditioning on structured variable interactions. To address this mismatch and overparameterization, we propose FaCTR, a lightweight spatiotemporal Transformer with an explicitly structural design. FaCTR injects dynamic, symmetric cross-channel interactions-modeled via a low-rank Factorization Machine into temporally contextualized patch embeddings through a learnable gating mechanism. It further encodes static and dynamic covariates for multivariate conditioning. Despite its compact design, FaCTR achieves state-of-the-art performance on eleven public forecasting benchmarks spanning both short-term and long-term horizons, with its largest variant using close to only 400K parameters-on average 50x smaller than competitive spatiotemporal transformer baselines. In addition, its structured design enables interpretability through cross-channel influence scores-an essential requirement for real-world decision-making. Finally, FaCTR supports self-supervised pretraining, positioning it as a compact yet versatile foundation for downstream time series tasks.
[LG-71] abFlex: Scaling Tabular Learning to Millions with Linear Attention ICML2025
链接: https://arxiv.org/abs/2506.05584
作者: Yuchen Zeng,Tuan Dinh,Wonjun Kang,Andreas C Mueller
类目: Machine Learning (cs.LG)
*备注: 30 pages, ICML 2025
点击查看摘要
Abstract:Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel in small-scale tabular datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to complexity-quadratic self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2x speedup compared to TabPFN and a 1.5x speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.
[LG-72] When can in-context learning generalize out of task distribution?
链接: https://arxiv.org/abs/2506.05574
作者: Chase Goddard,Lindsay M. Smith,Vudtiwat Ngampruetikorn,David J. Schwab
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emphout-of-distribution. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.
[LG-73] Agent omics-ML: Autonomous Machine Learning Experimentation Agent for Genomic and Transcriptomic Data
链接: https://arxiv.org/abs/2506.05542
作者: Vlastimil Martinek,Andrea Gariboldi,Dimosthenis Tzimotoudis,Aitor Alberdi Escudero,Edward Blake,David Cechak,Luke Cassar,Alessandro Balestrucci,Panagiotis Alexiou
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:The adoption of machine learning (ML) and deep learning methods has revolutionized molecular medicine by driving breakthroughs in genomics, transcriptomics, drug discovery, and biological systems modeling. The increasing quantity, multimodality, and heterogeneity of biological datasets demand automated methods that can produce generalizable predictive models. Recent developments in large language model-based agents have shown promise for automating end-to-end ML experimentation on structured benchmarks. However, when applied to heterogeneous computational biology datasets, these methods struggle with generalization and success rates. Here, we introduce Agentomics-ML, a fully autonomous agent-based system designed to produce a classification model and the necessary files for reproducible training and inference. Our method follows predefined steps of an ML experimentation process, repeatedly interacting with the file system through Bash to complete individual steps. Once an ML model is produced, training and validation metrics provide scalar feedback to a reflection step to identify issues such as overfitting. This step then creates verbal feedback for future iterations, suggesting adjustments to steps such as data representation, model architecture, and hyperparameter choices. We have evaluated Agentomics-ML on several established genomic and transcriptomic benchmark datasets and show that it outperforms existing state-of-the-art agent-based methods in both generalization and success rates. While state-of-the-art models built by domain experts still lead in absolute performance on the majority of the computational biology datasets used in this work, Agentomics-ML narrows the gap for fully autonomous systems and achieves state-of-the-art performance on one of the used benchmark datasets. The code is available at this https URL.
[LG-74] SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms
链接: https://arxiv.org/abs/2506.05538
作者: Arnesh Batra,Anushk Kumar,Jashn Khemani,Arush Gumber,Arhan Jain,Somil Gupta
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:The rapid advancement of deep generative models has significantly improved the realism of synthetic media, presenting both opportunities and security challenges. While deepfake technology has valuable applications in entertainment and accessibility, it has emerged as a potent vector for misinformation campaigns, particularly on social media. Existing detection frameworks struggle to distinguish between benign and adversarially generated deepfakes engineered to manipulate public perception. To address this challenge, we introduce SocialDF, a curated dataset reflecting real-world deepfake challenges on social media platforms. This dataset encompasses high-fidelity deepfakes sourced from various online ecosystems, ensuring broad coverage of manipulative techniques. We propose a novel LLM-based multi-factor detection approach that combines facial recognition, automated speech transcription, and a multi-agent LLM pipeline to cross-verify audio-visual cues. Our methodology emphasizes robust, multi-modal verification techniques that incorporate linguistic, behavioral, and contextual analysis to effectively discern synthetic media from authentic content.
[LG-75] Spectral Graph Neural Networks are Incomplete on Graphs with a Simple Spectrum
链接: https://arxiv.org/abs/2506.05530
作者: Snir Hordan,Maya Bechler-Speicher,Gur Lifshitz,Nadav Dym
类目: Machine Learning (cs.LG)
*备注: 9 pages main text
点击查看摘要
Abstract:Spectral features are widely incorporated within Graph Neural Networks (GNNs) to improve their expressive power, or their ability to distinguish among non-isomorphic graphs. One popular example is the usage of graph Laplacian eigenvectors for positional encoding in MPNNs and Graph Transformers. The expressive power of such Spectrally-enhanced GNNs (SGNNs) is usually evaluated via the k-WL graph isomorphism test hierarchy and homomorphism counting. Yet, these frameworks align poorly with the graph spectra, yielding limited insight into SGNNs’ expressive power. We leverage a well-studied paradigm of classifying graphs by their largest eigenvalue multiplicity to introduce an expressivity hierarchy for SGNNs. We then prove that many SGNNs are incomplete even on graphs with distinct eigenvalues. To mitigate this deficiency, we adapt rotation equivariant neural networks to the graph spectra setting to propose a method to provably improve SGNNs’ expressivity on simple spectrum graphs. We empirically verify our theoretical claims via an image classification experiment on the MNIST Superpixel dataset and eigenvector canonicalization on graphs from ZINC.
[LG-76] On Fitting Flow Models with Large Sinkhorn Couplings
链接: https://arxiv.org/abs/2506.05526
作者: Michal Klein,Alireza Mousavi-Hosseini,Stephen Zhang,Marco Cuturi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 14 figures
点击查看摘要
Abstract:Flow models transform data gradually from one modality (e.g. noise) onto another (e.g. images). Such models are parameterized by a time-dependent velocity field, trained to fit segments connecting pairs of source and target points. When the pairing between source and target points is given, training flow models boils down to a supervised regression problem. When no such pairing exists, as is the case when generating data from noise, training flows is much harder. A popular approach lies in picking source and target points independently. This can, however, lead to velocity fields that are slow to train, but also costly to integrate at inference time. In theory, one would greatly benefit from training flow models by sampling pairs from an optimal transport (OT) measure coupling source and target, since this would lead to a highly efficient flow solving the Benamou and Brenier dynamical OT problem. In practice, recent works have proposed to sample mini-batches of n source and n target points and reorder them using an OT solver to form better pairs. These works have advocated using batches of size n\approx 256 , and considered OT solvers that return couplings that are either sharp (using e.g. the Hungarian algorithm) or blurred (using e.g. entropic regularization, a.k.a. Sinkhorn). We follow in the footsteps of these works by exploring the benefits of increasing n by three to four orders of magnitude, and look more carefully on the effect of the entropic regularization \varepsilon used in the Sinkhorn algorithm. Our analysis is facilitated by new scale invariant quantities to report the sharpness of a coupling, while our sharded computations across multiple GPU or GPU nodes allow scaling up n . We show that in both synthetic and image generation tasks, flow models greatly benefit when fitted with large Sinkhorn couplings, with a low entropic regularization \varepsilon .
[LG-77] Geometric and Physical Constraints Synergistically Enhance Neural PDE Surrogates
链接: https://arxiv.org/abs/2506.05513
作者: Yunfei Huang,David S. Greenberg
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
点击查看摘要
Abstract:Neural PDE surrogates can improve the cost-accuracy tradeoff of classical solvers, but often generalize poorly to new initial conditions and accumulate errors over time. Physical and symmetry constraints have shown promise in closing this performance gap, but existing techniques for imposing these inductive biases are incompatible with the staggered grids commonly used in computational fluid dynamics. Here we introduce novel input and output layers that respect physical laws and symmetries on the staggered grids, and for the first time systematically investigate how these constraints, individually and in combination, affect the accuracy of PDE surrogates. We focus on two challenging problems: shallow water equations with closed boundaries and decaying incompressible turbulence. Compared to strong baselines, symmetries and physical constraints consistently improve performance across tasks, architectures, autoregressive prediction steps, accuracy measures, and network sizes. Symmetries are more effective than physical constraints, but surrogates with both performed best, even compared to baselines with data augmentation or pushforward training, while themselves benefiting from the pushforward trick. Doubly-constrained surrogates also generalize better to initial conditions and durations beyond the range of the training data, and more accurately predict real-world ocean currents.
[LG-78] he Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models
链接: https://arxiv.org/abs/2506.05500
作者: Alex Damian,Jason D. Lee,Joan Bruna
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) d -dimensional inputs through their projection onto a low-dimensional r = O_d(1) subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the \emphgenerative leap exponent k^\star , a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of n=\Theta(d^1 \vee \k/2) is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with r -dimensional first hidden layer).
[LG-79] Learning-Augmented Hierarchical Clustering ICML2025
链接: https://arxiv.org/abs/2506.05495
作者: Vladimir Braverman,Jon C. Ergun,Chen Wang,Samson Zhou
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: ICML 2025; abstract shortened for arxiv requirements
点击查看摘要
Abstract:Hierarchical clustering (HC) is an important data analysis technique in which the goal is to recursively partition a dataset into a tree-like structure while grouping together similar data points at each level of granularity. Unfortunately, for many of the proposed HC objectives, there exist strong barriers to approximation algorithms with the hardness of approximation. Thus, we consider the problem of hierarchical clustering given auxiliary information from natural oracles. Specifically, we focus on a splitting oracle which, when provided with a triplet of vertices (u,v,w) , answers (possibly erroneously) the pairs of vertices whose lowest common ancestor includes all three vertices in an optimal tree, i.e., identifying which vertex ``splits away’’ from the others. Using such an oracle, we obtain the following results: - A polynomial-time algorithm that outputs a hierarchical clustering tree with O(1) -approximation to the Dasgupta objective (Dasgupta [STOC’16]). - A near-linear time algorithm that outputs a hierarchical clustering tree with (1-o(1)) -approximation to the Moseley-Wang objective (Moseley and Wang [NeurIPS’17]). Under the plausible Small Set Expansion Hypothesis, no polynomial-time algorithm can achieve any constant approximation for Dasgupta’s objective or (1-C) -approximation for the Moseley-Wang objective for some constant C0 . As such, our results demonstrate that the splitting oracle enables algorithms to outperform standard HC approaches and overcome hardness constraints. Furthermore, our approaches extend to sublinear settings, in which we show new streaming and PRAM algorithms for HC with improved guarantees. Comments: ICML 2025; abstract shortened for arxiv requirements Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2506.05495 [cs.DS] (or arXiv:2506.05495v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2506.05495 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Chen Wang [view email] [v1] Thu, 5 Jun 2025 18:22:40 UTC (149 KB)
[LG-80] Initial Model Incorporation for Deep Learning FWI: Pretraining or Denormalization?
链接: https://arxiv.org/abs/2506.05484
作者: Ruihua Chen,Bangyu Wu,Meng Li,Kai Yang
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
点击查看摘要
Abstract:Subsurface property neural network reparameterized full waveform inversion (FWI) has emerged as an effective unsupervised learning framework, which can invert stably with an inaccurate starting model. It updates the trainable neural network parameters instead of fine-tuning on the subsurface model directly. There are primarily two ways to embed the prior knowledge of the initial model into neural networks, that is, pretraining and denormalization. Pretraining first regulates the neural networks’ parameters by fitting the initial velocity model; Denormalization directly adds the outputs of the network into the initial models without pretraining. In this letter, we systematically investigate the influence of the two ways of initial model incorporation for the neural network reparameterized FWI. We demonstrate that pretraining requires inverting the model perturbation based on a constant velocity value (mean) with a two-stage implementation. It leads to a complex workflow and inconsistency of objective functions in the two-stage process, causing the network parameters to become inactive and lose plasticity. Experimental results demonstrate that denormalization can simplify workflows, accelerate convergence, and enhance inversion accuracy compared with pretraining.
[LG-81] Learning-Augmented Algorithms for MTS with Bandit Access to Multiple Predictors ICML2025
链接: https://arxiv.org/abs/2506.05479
作者: Matei Gabriel Coşa,Marek Eliáš
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: Accepted to ICML 2025
点击查看摘要
Abstract:We consider the following problem: We are given \ell heuristics for Metrical Task Systems (MTS), where each might be tailored to a different type of input instances. While processing an input instance received online, we are allowed to query the action of only one of the heuristics at each time step. Our goal is to achieve performance comparable to the best of the given heuristics. The main difficulty of our setting comes from the fact that the cost paid by a heuristic at time t cannot be estimated unless the same heuristic was also queried at time t-1 . This is related to Bandit Learning against memory bounded adversaries (Arora et al., 2012). We show how to achieve regret of O(\textOPT^2/3) and prove a tight lower bound based on the construction of Dekel et al. (2013).
[LG-82] Differentially Private Federated k-Means Clustering with Server-Side Data
链接: https://arxiv.org/abs/2506.05408
作者: Jonathan Scott,Christoph H. Lampert,David Saulpic
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Clustering is a cornerstone of data analysis that is particularly suited to identifying coherent subgroups or substructures in unlabeled data, as are generated continuously in large amounts these days. However, in many cases traditional clustering methods are not applicable, because data are increasingly being produced and stored in a distributed way, e.g. on edge devices, and privacy concerns prevent it from being transferred to a central server. To address this challenge, we present \acronym, a new algorithm for k -means clustering that is fully-federated as well as differentially private. Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyds algorithm we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. We also provide a theoretical analysis of our method that provides bounds on the convergence speed and cluster identification success.
[LG-83] Sylva: Tailoring Personalized Adversarial Defense in Pre-trained Models via Collaborative Fine-tuning CCS
链接: https://arxiv.org/abs/2506.05402
作者: Tianyu Qi,Lei Xue,Yufeng Zhan,Xiaobo Ma
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by the ACM Conference on Computer and Communications Security (CCS) 2025
点击查看摘要
Abstract:The growing adoption of large pre-trained models in edge computing has made deploying model inference on mobile clients both practical and popular. These devices are inherently vulnerable to direct adversarial attacks, which pose a substantial threat to the robustness and security of deployed models. Federated adversarial training (FAT) has emerged as an effective solution to enhance model robustness while preserving client privacy. However, FAT frequently produces a generalized global model, which struggles to address the diverse and heterogeneous data distributions across clients, resulting in insufficiently personalized performance, while also encountering substantial communication challenges during the training process. In this paper, we propose \textitSylva, a personalized collaborative adversarial training framework designed to deliver customized defense models for each client through a two-phase process. In Phase 1, \textitSylva employs LoRA for local adversarial fine-tuning, enabling clients to personalize model robustness while drastically reducing communication costs by uploading only LoRA parameters during federated aggregation. In Phase 2, a game-based layer selection strategy is introduced to enhance accuracy on benign data, further refining the personalized model. This approach ensures that each client receives a tailored defense model that balances robustness and accuracy effectively. Extensive experiments on benchmark datasets demonstrate that \textitSylva can achieve up to 50 \times improvements in communication efficiency compared to state-of-the-art algorithms, while achieving up to 29.5% and 50.4% enhancements in adversarial robustness and benign accuracy, respectively.
[LG-84] Attacking Attention of Foundation Models Disrupts Downstream Tasks CVPR2025
链接: https://arxiv.org/abs/2506.05394
作者: Hondamunige Prasanna Silva,Federico Becattini,Lorenzo Seidenari
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Paper published at CVPR 2025 Workshop Advml
点击查看摘要
Abstract:Foundation models represent the most prominent and recent paradigm shift in artificial this http URL models are large models, trained on broad data that deliver high accuracy in many downstream tasks, often without fine-tuning. For this reason, models such as CLIP , DINO or Vision Transfomers (ViT), are becoming the bedrock of many industrial AI-powered applications. However, the reliance on pre-trained foundation models also introduces significant security concerns, as these models are vulnerable to adversarial attacks. Such attacks involve deliberately crafted inputs designed to deceive AI systems, jeopardizing their this http URL paper studies the vulnerabilities of vision foundation models, focusing specifically on CLIP and ViTs, and explores the transferability of adversarial attacks to downstream tasks. We introduce a novel attack, targeting the structure of transformer-based architectures in a task-agnostic this http URL demonstrate the effectiveness of our attack on several downstream tasks: classification, captioning, image/text retrieval, segmentation and depth estimation.
[LG-85] Learning Safe Strategies for Value Maximizing Buyers in Uniform Price Auctions ICML2025
链接: https://arxiv.org/abs/2406.03674
作者: Negin Golrezaei,Sourav Sahoo
类目: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 61 pages, 4 figures. To appear at ICML 2025
点击查看摘要
Abstract:We study the bidding problem in repeated uniform price multi-unit auctions from the perspective of a single value-maximizing buyer who aims to maximize their cumulative value over T rounds while adhering to return-on-investment (RoI) constraints in each round. Buyers adopt m -uniform bidding format, where they submit m bid-quantity pairs (b_i, q_i) to demand q_i units at bid b_i . We introduce safe bidding strategies as those that satisfy RoI constraints in every auction, regardless of competing bids. We show that these strategies depend only on the valuation curve of the bidder, and the bidder can focus on a finite subset of this class without loss of generality. While the number of strategies in this subset is exponential in m , we develop a polynomial-time algorithm to learn the optimal safe strategy that achieves sublinear regret in the online setting, where regret is measured against a clairvoyant benchmark that knows the competing bids a priori and selects a fixed hindsight optimal safe strategy. We then evaluate the performance of safe strategies against a clairvoyant that selects the optimal strategy from a richer class of strategies in the online setting. In this scenario, we compute the richness ratio, \alpha\in(0, 1] for the class of strategies chosen by the clairvoyant and show that our algorithm, designed to learn safe strategies, achieves \alpha -approximate sublinear regret against these stronger benchmarks. Experiments on semi-synthetic data from real-world auctions show that safe strategies substantially outperform the derived theoretical bounds, making them quite appealing in practice.
[LG-86] fairmetrics: An R package for group fairness evaluation
链接: https://arxiv.org/abs/2506.06243
作者: Benjamin Smith,Jianhui Gao,Jessica Gronsbell
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 6 pages, 1 figure, 1 table
点击查看摘要
Abstract:Fairness is a growing area of machine learning (ML) that focuses on ensuring models do not produce systematically biased outcomes for specific groups, particularly those defined by protected attributes such as race, gender, or age. Evaluating fairness is a critical aspect of ML model development, as biased models can perpetuate structural inequalities. The fairmetrics R package offers a user-friendly framework for rigorously evaluating numerous group-based fairness criteria, including metrics based on independence (e.g., statistical parity), separation (e.g., equalized odds), and sufficiency (e.g., predictive parity). Group-based fairness criteria assess whether a model is equally accurate or well-calibrated across a set of predefined groups so that appropriate bias mitigation strategies can be implemented. fairmetrics provides both point and interval estimates for multiple metrics through a convenient wrapper function and includes an example dataset derived from the Medical Information Mart for Intensive Care, version II (MIMIC-II) database (Goldberger et al., 2000; Raffa, 2016).
[LG-87] Similarity Matching Networks: Hebbian Learning and Convergence Over Multiple Time Scales
链接: https://arxiv.org/abs/2506.06134
作者: Veronica Centorrino,Francesco Bullo,Giovanni Russo
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 28 pages, 9 figures
点击查看摘要
Abstract:A recent breakthrough in biologically-plausible normative frameworks for dimensionality reduction is based upon the similarity matching cost function and the low-rank matrix approximation problem. Despite clear biological interpretation, successful application in several domains, and experimental validation, a formal complete convergence analysis remains elusive. Building on this framework, we consider and analyze a continuous-time neural network, the \emphsimilarity matching network, for principal subspace projection. Derived from a min-max-min objective, this biologically-plausible network consists of three coupled dynamics evolving at different time scales: neural dynamics, lateral synaptic dynamics, and feedforward synaptic dynamics at the fast, intermediate, and slow time scales, respectively. The feedforward and lateral synaptic dynamics consist of Hebbian and anti-Hebbian learning rules, respectively. By leveraging a multilevel optimization framework, we prove convergence of the dynamics in the offline setting. Specifically, at the first level (fast time scale), we show strong convexity of the cost function and global exponential convergence of the corresponding gradient-flow dynamics. At the second level (intermediate time scale), we prove strong concavity of the cost function and exponential convergence of the corresponding gradient-flow dynamics within the space of positive definite matrices. At the third and final level (slow time scale), we study a non-convex and non-smooth cost function, provide explicit expressions for its global minima, and prove almost sure convergence of the corresponding gradient-flow dynamics to the global minima. These results rely on two empirically motivated conjectures that are supported by thorough numerical experiments. Finally, we validate the effectiveness of our approach via a numerical example.
[LG-88] Convergence of linear programming hierarchies for Gibbs states of spin systems
链接: https://arxiv.org/abs/2506.06125
作者: Hamza Fawzi,Omar Fawzi
类目: Optimization and Control (math.OC); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)
*备注: 11 pages
点击查看摘要
Abstract:We consider the problem of computing expectation values of local functions under the Gibbs distribution of a spin system. In particular, we study two families of linear programming hierarchies for this problem. The first hierarchy imposes local spin flip equalities and has been considered in the bootstrap literature in high energy physics. For this hierarchy, we prove fast convergence under a spatial mixing (decay of correlations) condition. This condition is satisfied for example above the critical temperature for Ising models on a d -dimensional grid. The second hierarchy is based on a Markov chain having the Gibbs state as a fixed point and has been studied in the optimization literature and more recently in the bootstrap literature. For this hierarchy, we prove fast convergence provided the Markov chain mixes rapidly. Both hierarchies lead to an \varepsilon -approximation for local expectation values using a linear program of size quasi-polynomial in n/\varepsilon , where n is the total number of sites, provided the interactions can be embedded in a d -dimensional grid with constant d . Compared to standard Monte Carlo methods, an advantage of this approach is that it always (i.e., for any system) outputs rigorous upper and lower bounds on the expectation value of interest, without needing an a priori analysis of the convergence speed.
[LG-89] Multilevel neural simulation-based inference
链接: https://arxiv.org/abs/2506.06087
作者: Yuga Hikida,Ayush Bharti,Niall Jeffrey,François-Xavier Briol
类目: Machine Learning (stat.ML); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Computation (stat.CO)
*备注:
点击查看摘要
Abstract:Neural simulation-based inference (SBI) is a popular set of methods for Bayesian inference when models are only available in the form of a simulator. These methods are widely used in the sciences and engineering, where writing down a likelihood can be significantly more challenging than constructing a simulator. However, the performance of neural SBI can suffer when simulators are computationally expensive, thereby limiting the number of simulations that can be performed. In this paper, we propose a novel approach to neural SBI which leverages multilevel Monte Carlo techniques for settings where several simulators of varying cost and fidelity are available. We demonstrate through both theoretical analysis and extensive experiments that our method can significantly enhance the accuracy of SBI methods given a fixed computational budget.
[LG-90] Policy Optimization for Continuous-time Linear-Quadratic Graphon Mean Field Games
链接: https://arxiv.org/abs/2506.05894
作者: Philipp Plank,Yufei Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:Multi-agent reinforcement learning, despite its popularity and empirical success, faces significant scalability challenges in large-population dynamic games. Graphon mean field games (GMFGs) offer a principled framework for approximating such games while capturing heterogeneity among players. In this paper, we propose and analyze a policy optimization framework for continuous-time, finite-horizon linear-quadratic GMFGs. Exploiting the structural properties of GMFGs, we design an efficient policy parameterization in which each player’s policy is represented as an affine function of their private state, with a shared slope function and player-specific intercepts. We develop a bilevel optimization algorithm that alternates between policy gradient updates for best-response computation under a fixed population distribution, and distribution updates using the resulting policies. We prove linear convergence of the policy gradient steps to best-response policies and establish global convergence of the overall algorithm to the Nash equilibrium. The analysis relies on novel landscape characterizations over infinite-dimensional policy spaces. Numerical experiments demonstrate the convergence and robustness of the proposed algorithm under varying graphon structures, noise levels, and action frequencies.
[LG-91] Variational Inference for Quantum HyperNetworks IJCNN2025
链接: https://arxiv.org/abs/2506.05888
作者: Luca Nepote,Alix Lhéritier,Nicolas Bondoux,Marios Kountouris,Maurizio Filippone
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This work has been accepted for publication in 2025 International Joint Conference on Neural Networks (IJCNN 2025) and will be published on IEEE Xplore
点击查看摘要
Abstract:Binary Neural Networks (BiNNs), which employ single-bit precision weights, have emerged as a promising solution to reduce memory usage and power consumption while maintaining competitive performance in large-scale systems. However, training BiNNs remains a significant challenge due to the limitations of conventional training algorithms. Quantum HyperNetworks offer a novel paradigm for enhancing the optimization of BiNN by leveraging quantum computing. Specifically, a Variational Quantum Algorithm is employed to generate binary weights through quantum circuit measurements, while key quantum phenomena such as superposition and entanglement facilitate the exploration of a broader solution space. In this work, we establish a connection between this approach and Bayesian inference by deriving the Evidence Lower Bound (ELBO), when direct access to the output distribution is available (i.e., in simulations), and introducing a surrogate ELBO based on the Maximum Mean Discrepancy (MMD) metric for scenarios involving implicit distributions, as commonly encountered in practice. Our experimental results demonstrate that the proposed methods outperform standard Maximum Likelihood Estimation (MLE), improving trainability and generalization.
[LG-92] Mapping correlations and coherence: adjacency-based approach to data visualization and regularity discovery
链接: https://arxiv.org/abs/2506.05758
作者: Guang-Xing Li
类目: Computational Physics (physics.comp-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: Code is avaliable at this https URL
点击查看摘要
Abstract:The development of science has been transforming man’s view towards nature for centuries. Observing structures and patterns in an effective approach to discover regularities from data is a key step toward theory-building. With increasingly complex data being obtained, revealing regularities systematically has become a challenge. Correlation is a most commonly-used and effective approach to describe regularities in data, yet for complex patterns, spatial inhomogeneity and complexity can often undermine the correlations. We present an algorithm to derive maps representing the type and degree of correlations, by taking the two-fold symmetry of the correlation vector into full account using the Stokes parameter. The method allows for a spatially resolved view of the nature and strength of correlations between physical quantities. In the correlation view, a region can often be separated into different subregions with different types of correlations. Subregions correspond to physical regimes for physical systems, or climate zones for climate maps. The simplicity of the method makes it widely applicable to a variety of data, where the correlation-based approach makes the map particularly useful in revealing regularities in physical systems and alike. As a new and efficient approach to represent data, the method should facilitate the development of new computational approaches to regularity discovery.
[LG-93] Emulating compact binary population synthesis simulations with robust uncertainty quantification and model comparison: Bayesian normalizing flows
链接: https://arxiv.org/abs/2506.05657
作者: Anarya Ray
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: 16 pages, 4 figures
点击查看摘要
Abstract:Population synthesis simulations of compact binary coalescences~(CBCs) play a crucial role in extracting astrophysical insights from an ensemble of gravitational wave~(GW) observations. However, realistic simulations are costly to implement for a dense grid of initial conditions. Normalizing flows can emulate the distribution functions of a simulated population of binary parameters and thereby enable empirical constraints on the astrophysical initial conditions and branching fractions of various formation channels given data from a catalog of GW observations. They can also be used for data amplification in sparse regions of the CBC parameter space to guide the development of phenomenological population models for rarely synthesizable systems with components in theorized mass gaps, without having to simulate a prohibitively large number of binaries. But flow predictions are wrought with uncertainties, especially for sparse training sets. In this work I develop a method for quantifying and marginalizing uncertainties in the emulators by introducing the Bayesian Normalizing flow, a conditional density estimator constructed from Bayesian neural networks. Using the exact likelihood function associated with density estimators I sample the posterior distribution of flow parameters with suitably chosen priors to quantify and marginalize over flow uncertainties. I demonstrate the accuracy, calibration, and data-amplification impacts of the estimated uncertainties for simulations of binary black hole populations formed through common envelope evolution. I outline applications of the methodology in simulation-based inference from growing GW catalogs and sketch other uses for general simulation-based approaches in GW astronomy.
[LG-94] he TESS Ten Thousand Catalog: 10001 uniformly-vetted and -validated Eclipsing Binary Stars detected in Full-Frame Image data by machine learning and analyzed by citizen scientists
链接: https://arxiv.org/abs/2506.05631
作者: Veselin B. Kostov,Brian P. Powell,Aline U. Fornear,Marco Z. Di Fraia,Robert Gagliano,Thomas L. Jacobs,Julien S. de Lambilly,Hugo A. Durantini Luca,Steven R. Majewski,Mark Omohundro,Jerome Orosz,Saul A. Rappaport,Ryan Salik,Donald Short,William Welsh,Svetoslav Alexandrov,Cledison Marcos da Silva,Erika Dunning,Gerd Guhne,Marc Huten,Michiharu Hyogo,Davide Iannone,Sam Lee,Christian Magliano,Manya Sharma,Allan Tarr,John Yablonsky,Sovan Acharya,Fred Adams,Thomas Barclay,Benjamin T. Montet,Susan Mullally,Greg Olmschenk,Andrej Prsa,Elisa Quintana,Robert Wilson,Hasret Balcioglu,Ethan Kruse, theEclipsing Binary Patrol Collaboration
类目: olar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 40 pages, 39 figures, 4 tables
点击查看摘要
Abstract:The Transiting Exoplanet Survey Satellite (TESS) has surveyed nearly the entire sky in Full-Frame Image mode with a time resolution of 200 seconds to 30 minutes and a temporal baseline of at least 27 days. In addition to the primary goal of discovering new exoplanets, TESS is exceptionally capable at detecting variable stars, and in particular short-period eclipsing binaries which are relatively common, making up a few percent of all stars, and represent powerful astrophysical laboratories for deep investigations of stellar formation and evolution. We combed Sectors 1-82 of TESS Full-Frame Image data searching for eclipsing binary stars using a neural network that identified ~1.2 million stars with eclipse-like features. Of these, we have performed an in-depth analysis on ~60,000 targets using automated methods and manual inspection by citizen scientists. Here we present a catalog of 10001 uniformly-vetted and -validated eclipsing binary stars that passed all our ephemeris and photocenter tests, as well as complementary visual inspection. Of these, 7936 are new eclipsing binaries while the remaining 2065 are known systems for which we update the published ephemerides. We outline the detection and analysis of the targets, discuss the properties of the sample, and highlight potentially interesting systems. Finally, we also provide a list of ~900,000 unvetted and unvalidated targets for which the neural network found eclipse-like features with a score higher than 0.9, and for which there are no known eclipsing binaries within a sky-projected separation of a TESS pixel (~21 arcsec).
[LG-95] Nonlinear Causal Discovery through a Sequential Edge Orientation Approach
链接: https://arxiv.org/abs/2506.05590
作者: Stella Huang,Qing Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 Pages, 13 figures, 3 tables
点击查看摘要
Abstract:Recent advances have established the identifiability of a directed acyclic graph (DAG) under additive noise models (ANMs), spurring the development of various causal discovery methods. However, most existing methods make restrictive model assumptions, rely heavily on general independence tests, or require substantial computational time. To address these limitations, we propose a sequential procedure to orient undirected edges in a completed partial DAG (CPDAG), representing an equivalence class of DAGs, by leveraging the pairwise additive noise model (PANM) to identify their causal directions. We prove that this procedure can recover the true causal DAG assuming a restricted ANM. Building on this result, we develop a novel constraint-based algorithm for learning causal DAGs under nonlinear ANMs. Given an estimated CPDAG, we develop a ranking procedure that sorts undirected edges by their adherence to the PANM, which defines an evaluation order of the edges. To determine the edge direction, we devise a statistical test that compares the log-likelihood values, evaluated with respect to the competing directions, of a sub-graph comprising just the candidate nodes and their identified parents in the partial DAG. We further establish the structural learning consistency of our algorithm in the large-sample limit. Extensive experiments on synthetic and real-world datasets demonstrate that our method is computationally efficient, robust to model misspecification, and consistently outperforms many existing nonlinear DAG learning methods.
[LG-96] Partially-Supervised Neural Network Model For Quadratic Multiparametric Programming
链接: https://arxiv.org/abs/2506.05567
作者: Fuat Can Beylunioglu,Mehrdad Pirnia,P. Robert Duimering
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 36 pages including references and appendix
点击查看摘要
Abstract:Neural Networks (NN) with ReLU activation functions are used to model multiparametric quadratic optimization problems (mp-QP) in diverse engineering applications. Researchers have suggested leveraging the piecewise affine property of deep NN models to solve mp-QP with linear constraints, which also exhibit piecewise affine behaviour. However, traditional deep NN applications to mp-QP fall short of providing optimal and feasible predictions, even when trained on large datasets. This study proposes a partially-supervised NN (PSNN) architecture that directly represents the mathematical structure of the global solution function. In contrast to generic NN training approaches, the proposed PSNN method derives a large proportion of model weights directly from the mathematical properties of the optimization problem, producing more accurate solutions despite significantly smaller training data sets. Many energy management problems are formulated as QP, so we apply the proposed approach to energy systems (specifically DC optimal power flow) to demonstrate proof of concept. Model performance in terms of solution accuracy and speed of predictions was compared against a commercial solver and a generic Deep NN model based on classical training. Results show KKT sufficient conditions for PSNN consistently outperform generic NN architectures with classical training using far less data, including when tested on extreme, out-of-training distribution test data. Given its speed advantages over traditional solvers, the PSNN model can quickly produce optimal and feasible solutions within a second for millions of input parameters sampled from a distribution of stochastic demands and renewable generator dispatches, which can be used for simulations and long term planning.
[LG-97] DART-Vetter: A Deep LeARning Tool for automatic triage of exoplanet candidates
链接: https://arxiv.org/abs/2506.05556
作者: Stefano Fiscale(1 and 2 and 3),Laura Inno(2 and 3),Alessandra Rotundi(1 and 2),Angelo Ciaramella(2),Alessio Ferone(2),Christian Magliano(3 and 4),Luca Cacciapuoti(5),Veselin Kostov(6 and 7),Elisa Quintana(6),Giovanni Covone(3 and 4 and 8),Maria Teresa Muscari Tomajoli(1 and 2),Vito Saggese(4),Luca Tonietti(1 and 2 and 3 and 9),Antonio Vanzanella(10),Vincenzo Della Corte(3) ((1) UNESCO Chair “Environment, Resources and Sustainable Development”, Department of Science and Technology, Parthenope University of Naples, Italy, (2) Department of Science and Technology, Parthenope University of Naples, Centro Direzionale di Napoli, Naples, I-80143, Italy, (3) INAF, Osservatorio Astronomico di Capodimonte, Salita Moiariello, 16, Naples, I-80131, Italy, (4) Department of Physics “Ettore Pancini”, University of Naples Federico II, Naples, Italy, (5) European Southern Observatory, Karl-Schwarzschild-Strasse 2 D-85748 Garching bei Munchen, Germany, (6) NASA Goddard Space Flight Center, 8800 Greenbelt Road, Greenbelt, MD 20771, USA, (7) Citizen Scientist, Planet Patrol Collaboration, Greenbelt, MD, 20771, USA, (8) INFN section of Naples, Via Cinthia 6, 80126, Napoli, Italy, (9) Department of Biology, Federico II University of Naples, Naples, Italy, (10) National centre for Nuclear Research, Pasteura 7, 02-093, Warsaw, Poland)
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Number of pages: 24, Number of figures: 8, Article accepted for publication in The Astronomical Journal on 2025-05-30
点击查看摘要
Abstract:In the identification of new planetary candidates in transit surveys, the employment of Deep Learning models proved to be essential to efficiently analyse a continuously growing volume of photometric observations. To further improve the robustness of these models, it is necessary to exploit the complementarity of data collected from different transit surveys such as NASA’s Kepler, Transiting Exoplanet Survey Satellite (TESS), and, in the near future, the ESA PLAnetary Transits and Oscillation of stars (PLATO) mission. In this work, we present a Deep Learning model, named DART-Vetter, able to distinguish planetary candidates (PC) from false positives signals (NPC) detected by any potential transiting survey. DART-Vetter is a Convolutional Neural Network that processes only the light curves folded on the period of the relative signal, featuring a simpler and more compact architecture with respect to other triaging and/or vetting models available in the literature. We trained and tested DART-Vetter on several dataset of publicly available and homogeneously labelled TESS and Kepler light curves in order to prove the effectiveness of our model. Despite its simplicity, DART-Vetter achieves highly competitive triaging performance, with a recall rate of 91% on an ensemble of TESS and Kepler data, when compared to Exominer and Astronet-Triage. Its compact, open source and easy to replicate architecture makes DART-Vetter a particularly useful tool for automatizing triaging procedures or assisting human vetters, showing a discrete generalization on TCEs with Multiple Event Statistic (MES) 20 and orbital period 50 days.
[LG-98] Online Conformal Model Selection for Nonstationary Time Series
链接: https://arxiv.org/abs/2506.05544
作者: Shibo Li,Yao Zheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces the MPS (Model Prediction Set), a novel framework for online model selection for nonstationary time series. Classical model selection methods, such as information criteria and cross-validation, rely heavily on the stationarity assumption and often fail in dynamic environments which undergo gradual or abrupt changes over time. Yet real-world data are rarely stationary, and model selection under nonstationarity remains a largely open problem. To tackle this challenge, we combine conformal inference with model confidence sets to develop a procedure that adaptively selects models best suited to the evolving dynamics at any given time. Concretely, the MPS updates in real time a confidence set of candidate models that covers the best model for the next time period with a specified long-run probability, while adapting to nonstationarity of unknown forms. Through simulations and real-world data analysis, we demonstrate that MPS reliably and efficiently identifies optimal models under nonstationarity, an essential capability lacking in offline methods. Moreover, MPS frequently produces high-quality sets with small cardinality, whose evolution offers deeper insights into changing dynamics. As a generic framework, MPS accommodates any data-generating process, data structure, model class, training method, and evaluation metric, making it broadly applicable across diverse problem settings.
[LG-99] Adaptive stable distribution and Hurst exponent by method of moments moving estimator for nonstationary time series
链接: https://arxiv.org/abs/2506.05354
作者: Jarek Duda
类目: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注: 5 pages, 7 figures. arXiv admin note: text overlap with arXiv:2304.03069
点击查看摘要
Abstract:Nonstationarity of real-life time series requires model adaptation. In classical approaches like ARMA-ARCH there is assumed some arbitrarily chosen dependence type. To avoid their bias, we will focus on novel more agnostic approach: moving estimator, which estimates parameters separately for every time t : optimizing F_t=\sum_\taut (1-\eta)^t-\tau \ln(\rho_\theta (x_\tau)) local log-likelihood with exponentially weakening weights of the old values. In practice such moving estimates can be found by EMA (exponential moving average) of some parameters, like m_p=E[|x-\mu|^p] absolute central moments, updated by m_p,t+1 = m_p,t + \eta (|x_t-\mu_t|^p-m_p,t) . We will focus here on its applications for alpha-Stable distribution, which also influences Hurst exponent, hence can be used for its adaptive estimation. Its application will be shown on financial data as DJIA time series - beside standard estimation of evolution of center \mu and scale parameter \sigma , there is also estimated evolution of \alpha parameter allowing to continuously evaluate market stability - tails having \rho(x) \sim 1/|x|^\alpha+1 behavior, controlling probability of potentially dangerous extreme events.
信息检索
[IR-0] RecGPT : A Foundation Model for Sequential Recommendation
链接: https://arxiv.org/abs/2506.06270
作者: Yangqin Jiang,Xubin Ren,Lianghao Xia,Da Luo,Kangyi Lin,Chao Huang
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:This work addresses a fundamental barrier in recommender systems: the inability to generalize across domains without extensive retraining. Traditional ID-based approaches fail entirely in cold-start and cross-domain scenarios where new users or items lack sufficient interaction history. Inspired by foundation models’ cross-domain success, we develop a foundation model for sequential recommendation that achieves genuine zero-shot generalization capabilities. Our approach fundamentally departs from existing ID-based methods by deriving item representations exclusively from textual features. This enables immediate embedding of any new item without model retraining. We introduce unified item tokenization with Finite Scalar Quantization that transforms heterogeneous textual descriptions into standardized discrete tokens. This eliminates domain barriers that plague existing systems. Additionally, the framework features hybrid bidirectional-causal attention that captures both intra-item token coherence and inter-item sequential dependencies. An efficient catalog-aware beam search decoder enables real-time token-to-item mapping. Unlike conventional approaches confined to their training domains, RecGPT naturally bridges diverse recommendation contexts through its domain-invariant tokenization mechanism. Comprehensive evaluations across six datasets and industrial scenarios demonstrate consistent performance advantages.
[IR-1] Optimizing Recall or Relevance? A Multi-Task Multi-Head Approach for Item-to-Item Retrieval in Recommendation
链接: https://arxiv.org/abs/2506.06239
作者: Jiang Zhang,Sumit Kumar,Wei Chang,Yubo Wang,Feng Zhang,Weize Mao,Hanchao Yu,Aashu Singh,Min Li,Qifan Wang
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:The task of item-to-item (I2I) retrieval is to identify a set of relevant and highly engaging items based on a given trigger item. It is a crucial component in modern recommendation systems, where users’ previously engaged items serve as trigger items to retrieve relevant content for future engagement. However, existing I2I retrieval models in industry are primarily built on co-engagement data and optimized using the recall measure, which overly emphasizes co-engagement patterns while failing to capture semantic relevance. This often leads to overfitting short-term co-engagement trends at the expense of long-term benefits such as discovering novel interests and promoting content diversity. To address this challenge, we propose MTMH, a Multi-Task and Multi-Head I2I retrieval model that achieves both high recall and semantic relevance. Our model consists of two key components: 1) a multi-task learning loss for formally optimizing the trade-off between recall and semantic relevance, and 2) a multi-head I2I retrieval architecture for retrieving both highly co-engaged and semantically relevant items. We evaluate MTMH using proprietary data from a commercial platform serving billions of users and demonstrate that it can improve recall by up to 14.4% and semantic relevance by up to 56.6% compared with prior state-of-the-art models. We also conduct live experiments to verify that MTMH can enhance both short-term consumption metrics and long-term user-experience-related metrics. Our work provides a principled approach for jointly optimizing I2I recall and semantic relevance, which has significant implications for improving the overall performance of recommendation systems.
[IR-2] On the Merits of LLM -Based Corpus Enrichment
链接: https://arxiv.org/abs/2506.06015
作者: Gal Zur,Tommy Mordo,Moshe Tennenholtz,Oren Kurland
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Generative AI (genAI) technologies – specifically, large language models (LLMs) – and search have evolving relations. We argue for a novel perspective: using genAI to enrich a document corpus so as to improve query-based retrieval effectiveness. The enrichment is based on modifying existing documents or generating new ones. As an empirical proof of concept, we use LLMs to generate documents relevant to a topic which are more retrievable than existing ones. In addition, we demonstrate the potential merits of using corpus enrichment for retrieval augmented generation (RAG) and answer attribution in question answering.
[IR-3] Respecting Temporal-Causal Consistency: Entity-Event Knowledge Graphs for Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2506.05939
作者: Ze Yu Zhang,Zitao Li,Yaliang Li,Bolin Ding,Bryan Kian Hsiang Low
类目: Information Retrieval (cs.IR)
*备注: 24 pages, 4 figures
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) based on large language models often falters on narrative documents with inherent temporal structures. Standard unstructured RAG methods rely solely on embedding-similarity matching and lack any general mechanism to encode or exploit chronological information, while knowledge graph RAG (KG-RAG) frameworks collapse every mention of an entity into a single node, erasing the evolving context that drives many queries. To formalize this challenge and draw the community’s attention, we construct ChronoQA, a robust and discriminative QA benchmark that measures temporal, causal, and character consistency understanding in narrative documents (e.g., novels) under the RAG setting. We then introduce Entity-Event RAG (E^2RAG), a dual-graph framework that keeps separate entity and event subgraphs linked by a bipartite mapping, thereby preserving the temporal and causal facets needed for fine-grained reasoning. Across ChronoQA, our approach outperforms state-of-the-art unstructured and KG-based RAG baselines, with notable gains on causal and character consistency queries. E^2RAG therefore offers a practical path to more context-aware retrieval for tasks that require precise answers grounded in chronological information.
[IR-4] he NetMob25 Dataset: A High-resolution Multi-layered View of Individual Mobility in Greater Paris Region
链接: https://arxiv.org/abs/2506.05903
作者: Alexandre Chasse,Anne J. Kouam,Aline C. Viana,Razvan Stanica,Wellington V. Lobato,Geymerson Ramos,Geoffrey Deperle,Abdelmounaim Bouroudi,Suzanne Bussod,Fernando Molano
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:High-quality mobility data remains scarce despite growing interest from researchers and urban stakeholders in understanding individual-level movement patterns. The Netmob25 Data Challenge addresses this gap by releasing a unique GPS-based mobility dataset derived from the EMG 2023 GNSS-based mobility survey conducted in the Ile-de-France region (Greater Paris area), France. This dataset captures detailed daily mobility over a full week for 3,337 volunteer residents aged 16 to 80, collected between October 2022 and May 2023. Each participant was equipped with a dedicated GPS tracking device configured to record location points every 2-3 seconds and was asked to maintain a digital or paper logbook of their trips. All inferred mobility traces were algorithmically processed and validated through follow-up phone interviews. The dataset includes three components: (i) an Individuals database describing demographic, socioeconomic, and household characteristics; (ii) a Trips database with over 80,000 annotated displacements including timestamps, transport modes, and trip purposes; and (iii) a Raw GPS Traces database comprising about 500 million high-frequency points. A statistical weighting mechanism is provided to support population-level estimates. An extensive anonymization pipeline was applied to the GPS traces to ensure GDPR compliance while preserving analytical value. Access to the dataset requires acceptance of the challenge’s Terms and Conditions and signing a Non-Disclosure Agreement. This paper describes the survey design, collection protocol, processing methodology, and characteristics of the released dataset. Subjects: Computers and Society (cs.CY); Information Retrieval (cs.IR) Cite as: arXiv:2506.05903 [cs.CY] (or arXiv:2506.05903v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2506.05903 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-5] Generating Long Semantic IDs in Parallel for Recommendation KDD2025
链接: https://arxiv.org/abs/2506.05781
作者: Yupeng Hou,Jiacheng Li,Ashley Shin,Jinsung Jeon,Abhishek Santhanam,Wei Shao,Kaveh Hassani,Ning Yao,Julian McAuley
类目: Information Retrieval (cs.IR)
*备注: KDD 2025
点击查看摘要
Abstract:Semantic ID-based recommendation models tokenize each item into a small number of discrete tokens that preserve specific semantics, leading to better performance, scalability, and memory efficiency. While recent models adopt a generative approach, they often suffer from inefficient inference due to the reliance on resource-intensive beam search and multiple forward passes through the neural sequence model. As a result, the length of semantic IDs is typically restricted (e.g. to just 4 tokens), limiting their expressiveness. To address these challenges, we propose RPG, a lightweight framework for semantic ID-based recommendation. The key idea is to produce unordered, long semantic IDs, allowing the model to predict all tokens in parallel. We train the model to predict each token independently using a multi-token prediction loss, directly integrating semantics into the learning objective. During inference, we construct a graph connecting similar semantic IDs and guide decoding to avoid generating invalid IDs. Experiments show that scaling up semantic ID length to 64 enables RPG to outperform generative baselines by an average of 12.6% on the NDCG@10, while also improving inference efficiency. Code is available at: this https URL.
[IR-6] NGA: Non-autoregressive Generative Auction with Global Externalities for Advertising Systems
链接: https://arxiv.org/abs/2506.05685
作者: Zuowu Zheng,Ze Wang,Fan Yang,Wenqing Ye,Weihua Huang,Wenqiang He,Teng Zhang,Xingxing Wang
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Online advertising auctions are fundamental to internet commerce, demanding solutions that not only maximize revenue but also ensure incentive compatibility, high-quality user experience, and real-time efficiency. While recent learning-based auction frameworks have improved context modeling by capturing intra-list dependencies among ads, they remain limited in addressing global externalities and often suffer from inefficiencies caused by sequential processing. In this work, we introduce the Non-autoregressive Generative Auction with global externalities (NGA), a novel end-to-end framework designed for industrial online advertising. NGA explicitly models global externalities by jointly capturing the relationships among ads as well as the effects of adjacent organic content. To further enhance efficiency, NGA utilizes a non-autoregressive, constraint-based decoding strategy and a parallel multi-tower evaluator for unified list-wise reward and payment computation. Extensive offline experiments and large-scale online A/B testing on commercial advertising platforms demonstrate that NGA consistently outperforms existing methods in both effectiveness and efficiency.
附件下载
点击下载今日全部论文列表